GHC is built on a raft of primitive data types and operations. While you really can use this stuff to write fast code, we generally find it a lot less painful, and more satisfying in the long run, to use higher-level language features and libraries. With any luck, the code you write will be optimised to the efficient unboxed version in any case. And if it isn't, we'd like to know about it.
We do not currently have good, up-to-date documentation about the primitives, perhaps because they are mainly intended for internal use. There used to be a long section about them here in the User Guide, but it became out of date, and wrong information is worse than none.
The Real Truth about what primitive types there are, and what operations
work over those types, is held in the file
fptools/ghc/compiler/prelude/primops.txt.pp
.
This file is used directly to generate GHC's primitive-operation definitions, so
it is always correct! It is also intended for processing into text.
Indeed,
the result of such processing is part of the description of the
External
Core language.
So that document is a good place to look for a type-set version.
We would be very happy if someone wanted to volunteer to produce an SGML
back end to the program that processes primops.txt
so that
we could include the results here in the User Guide.
What follows here is a brief summary of some main points.
Most types in GHC are boxed, which means
that values of that type are represented by a pointer to a heap
object. The representation of a Haskell Int
, for
example, is a two-word heap object. An unboxed
type, however, is represented by the value itself, no pointers or heap
allocation are involved.
Unboxed types correspond to the “raw machine” types you
would use in C: Int#
(long int),
Double#
(double), Addr#
(void *), etc. The primitive operations
(PrimOps) on these types are what you might expect; e.g.,
(+#)
is addition on
Int#
s, and is the machine-addition that we all
know and love—usually one instruction.
Primitive (unboxed) types cannot be defined in Haskell, and are
therefore built into the language and compiler. Primitive types are
always unlifted; that is, a value of a primitive type cannot be
bottom. We use the convention that primitive types, values, and
operations have a #
suffix.
Primitive values are often represented by a simple bit-pattern, such
as Int#
, Float#
,
Double#
. But this is not necessarily the case:
a primitive value might be represented by a pointer to a
heap-allocated object. Examples include
Array#
, the type of primitive arrays. A
primitive array is heap-allocated because it is too big a value to fit
in a register, and would be too expensive to copy around; in a sense,
it is accidental that it is represented by a pointer. If a pointer
represents a primitive value, then it really does point to that value:
no unevaluated thunks, no indirections…nothing can be at the
other end of the pointer than the primitive value.
A numerically-intensive program using unboxed types can
go a lot faster than its “standard”
counterpart—we saw a threefold speedup on one example.
There are some restrictions on the use of primitive types:
The main restriction
is that you can't pass a primitive value to a polymorphic
function or store one in a polymorphic data type. This rules out
things like [Int#]
(i.e. lists of primitive
integers). The reason for this restriction is that polymorphic
arguments and constructor fields are assumed to be pointers: if an
unboxed integer is stored in one of these, the garbage collector would
attempt to follow it, leading to unpredictable space leaks. Or a
seq
operation on the polymorphic component may
attempt to dereference the pointer, with disastrous results. Even
worse, the unboxed value might be larger than a pointer
(Double#
for instance).
You cannot define a newtype whose representation type (the argument type of the data constructor) is an unboxed type. Thus, this is illegal:
newtype A = MkA Int#
You cannot bind a variable with an unboxed type in a top-level binding.
You cannot bind a variable with an unboxed type in a recursive binding.
You may bind unboxed variables in a (non-recursive, non-top-level) pattern binding, but any such variable causes the entire pattern-match to become strict. For example:
data Foo = Foo Int Int# f x = let (Foo a b, w) = ..rhs.. in ..body..
Since b
has type Int#
, the entire pattern
match
is strict, and the program behaves as if you had written
data Foo = Foo Int Int# f x = case ..rhs.. of { (Foo a b, w) -> ..body.. }
Unboxed tuples aren't really exported by GHC.Exts
,
they're available by default with -fglasgow-exts
. An
unboxed tuple looks like this:
(# e_1, ..., e_n #)
where e_1..e_n
are expressions of any
type (primitive or non-primitive). The type of an unboxed tuple looks
the same.
Unboxed tuples are used for functions that need to return multiple
values, but they avoid the heap allocation normally associated with
using fully-fledged tuples. When an unboxed tuple is returned, the
components are put directly into registers or on the stack; the
unboxed tuple itself does not have a composite representation. Many
of the primitive operations listed in primops.txt.pp
return unboxed
tuples.
In particular, the IO
and ST
monads use unboxed
tuples to avoid unnecessary allocation during sequences of operations.
There are some pretty stringent restrictions on the use of unboxed tuples:
Values of unboxed tuple types are subject to the same restrictions as other unboxed types; i.e. they may not be stored in polymorphic data structures or passed to polymorphic functions.
No variable can have an unboxed tuple type, nor may a constructor or function argument have an unboxed tuple type. The following are all illegal:
data Foo = Foo (# Int, Int #) f :: (# Int, Int #) -> (# Int, Int #) f x = x g :: (# Int, Int #) -> Int g (# a,b #) = a h x = let y = (# x,x #) in ...
The typical use of unboxed tuples is simply to return multiple values,
binding those multiple results with a case
expression, thus:
f x y = (# x+1, y-1 #) g x = case f x x of { (# a, b #) -> a + b }
You can have an unboxed tuple in a pattern binding, thus
f x = let (# p,q #) = h x in ..body..
If the types of p
and q
are not unboxed,
the resulting binding is lazy like any other Haskell pattern binding. The
above example desugars like this:
f x = let t = case h x o f{ (# p,q #) -> (p,q) p = fst t q = snd t in ..body..
Indeed, the bindings can even be recursive.