7.2. Unboxed types and primitive operations

7.2. Unboxed types and primitive operations
Prev	Chapter 7. GHC Language Features	Next

GHC is built on a raft of primitive data types and operations. While you really can use this stuff to write fast code, we generally find it a lot less painful, and more satisfying in the long run, to use higher-level language features and libraries. With any luck, the code you write will be optimised to the efficient unboxed version in any case. And if it isn't, we'd like to know about it.

We do not currently have good, up-to-date documentation about the primitives, perhaps because they are mainly intended for internal use. There used to be a long section about them here in the User Guide, but it became out of date, and wrong information is worse than none.

The Real Truth about what primitive types there are, and what operations work over those types, is held in the file fptools/ghc/compiler/prelude/primops.txt.pp. This file is used directly to generate GHC's primitive-operation definitions, so it is always correct! It is also intended for processing into text.

Indeed, the result of such processing is part of the description of the External Core language. So that document is a good place to look for a type-set version. We would be very happy if someone wanted to volunteer to produce an SGML back end to the program that processes primops.txt so that we could include the results here in the User Guide.

What follows here is a brief summary of some main points.

7.2.1. Unboxed types

Most types in GHC are boxed, which means that values of that type are represented by a pointer to a heap object. The representation of a Haskell Int, for example, is a two-word heap object. An unboxed type, however, is represented by the value itself, no pointers or heap allocation are involved.

Unboxed types correspond to the “raw machine” types you would use in C: Int# (long int), Double# (double), Addr# (void *), etc. The primitive operations (PrimOps) on these types are what you might expect; e.g., (+#) is addition on Int#s, and is the machine-addition that we all know and love—usually one instruction.

Primitive (unboxed) types cannot be defined in Haskell, and are therefore built into the language and compiler. Primitive types are always unlifted; that is, a value of a primitive type cannot be bottom. We use the convention that primitive types, values, and operations have a # suffix.

Primitive values are often represented by a simple bit-pattern, such as Int#, Float#, Double#. But this is not necessarily the case: a primitive value might be represented by a pointer to a heap-allocated object. Examples include Array#, the type of primitive arrays. A primitive array is heap-allocated because it is too big a value to fit in a register, and would be too expensive to copy around; in a sense, it is accidental that it is represented by a pointer. If a pointer represents a primitive value, then it really does point to that value: no unevaluated thunks, no indirections…nothing can be at the other end of the pointer than the primitive value. A numerically-intensive program using unboxed types can go a lot faster than its “standard” counterpart—we saw a threefold speedup on one example.

There are some restrictions on the use of primitive types:

The main restriction is that you can't pass a primitive value to a polymorphic function or store one in a polymorphic data type. This rules out things like [Int#] (i.e. lists of primitive integers). The reason for this restriction is that polymorphic arguments and constructor fields are assumed to be pointers: if an unboxed integer is stored in one of these, the garbage collector would attempt to follow it, leading to unpredictable space leaks. Or a seq operation on the polymorphic component may attempt to dereference the pointer, with disastrous results. Even worse, the unboxed value might be larger than a pointer (Double# for instance).
You cannot bind a variable with an unboxed type in a top-level binding.
You cannot bind a variable with an unboxed type in a recursive binding.
You may bind unboxed variables in a (non-recursive, non-top-level) pattern binding, but any such variable causes the entire pattern-match to become strict. For example:
```
  data Foo = Foo Int Int#

  f x = let (Foo a b, w) = ..rhs.. in ..body..
```
Since b has type Int#, the entire pattern match is strict, and the program behaves as if you had written
```
  data Foo = Foo Int Int#

  f x = case ..rhs.. of { (Foo a b, w) -> ..body.. }
```

7.2.2. Unboxed Tuples

Unboxed tuples aren't really exported by GHC.Exts, they're available by default with -fglasgow-exts. An unboxed tuple looks like this:

(# e_1, ..., e_n #)

where e_1..e_n are expressions of any type (primitive or non-primitive). The type of an unboxed tuple looks the same.

Unboxed tuples are used for functions that need to return multiple values, but they avoid the heap allocation normally associated with using fully-fledged tuples. When an unboxed tuple is returned, the components are put directly into registers or on the stack; the unboxed tuple itself does not have a composite representation. Many of the primitive operations listed in primops.txt.pp return unboxed tuples. In particular, the IO and ST monads use unboxed tuples to avoid unnecessary allocation during sequences of operations.

There are some pretty stringent restrictions on the use of unboxed tuples:

Values of unboxed tuple types are subject to the same restrictions as other unboxed types; i.e. they may not be stored in polymorphic data structures or passed to polymorphic functions.

No variable can have an unboxed tuple type, nor may a constructor or function argument have an unboxed tuple type. The following are all illegal:

  data Foo = Foo (# Int, Int #)

  f :: (# Int, Int #) -> (# Int, Int #)
  f x = x

  g :: (# Int, Int #) -> Int
  g (# a,b #) = a

  h x = let y = (# x,x #) in ...

The typical use of unboxed tuples is simply to return multiple values, binding those multiple results with a case expression, thus:

  f x y = (# x+1, y-1 #)
  g x = case f x x of { (# a, b #) -> a + b }

You can have an unboxed tuple in a pattern binding, thus

  f x = let (# p,q #) = h x in ..body..

If the types of p and q are not unboxed, the resulting binding is lazy like any other Haskell pattern binding. The above example desugars like this:

  f x = let t = case h x o f{ (# p,q #) -> (p,q)
            p = fst t
            q = snd t
        in ..body..

Indeed, the bindings can even be recursive.