At the term level, Core resembles a explicitly-typed polymorphic
lambda calculus (Fω), with the addition of
local let
bindings, algebraic type definitions, constructors, and
case
expressions, and primitive types, literals and operators. Its
type system is richer than that of System F, supporting explicit
type equality coercions and type functions.[system-fc]
In this section we concentrate on the less obvious points about Core.
Core programs are organized into modules, corresponding directly
to source-level Haskell modules. Each module has a identifying
name mident. A module identifier consists of a package name
followed by a module name, which may be hierarchical: for
example, base:GHC.Base
is the module identifier for GHC’s Base
module. Its name is Base
, and it lives in the GHC hierarchy
within the base
package. Section 5.8 of the GHC users’ guide
explains package names [ghc-user-guide]. In particular, note that a Core
program may contain multiple modules with the same (possibly
hierarchical) module name that differ in their package names. In
some of the code examples that follow, we will omit package
names and possibly full hierarchical module names from
identifiers for brevity, but be aware that they are always
required.[18]
Each module may contain the following kinds of top-level declarations:
newtype
declarations, each defining a type constructor and a
coercion name; and
No type constructor, data constructor, or top-level value may be declared more than once within a given module. All the type declarations are (potentially) mutually recursive. Value declarations must be in dependency order, with explicit grouping of potentially mutually recursive declarations.
Identifiers defined in top-level declarations may be external or
internal. External identifiers can be referenced from any other
module in the program, using conventional dot notation (e.g.,
base:GHC.Base.Bool
, base:GHC.Base.True
). Internal identifiers
are visible only within the defining module. All type and data
constructors are external, and are always defined and referenced
using fully qualified names (with dots).
A top-level value is external if it is defined and referenced
using a fully qualified name with a dot (e.g., main:MyModule.foo = ...
);
otherwise, it is internal (e.g., bar = ...
). Note that
Core’s notion of an external identifier does not necessarily
coincide with that of “exported” identifier in a Haskell source
module. An identifier can be an external identifier in Core, but
not be exported by the original Haskell source
module.[19]
However, if an identifier was exported by the Haskell source
module, it will appear as an external name in Core.
Core modules have no explicit import or export lists. Modules may be mutually recursive. Note that because of the latter fact, GHC currently prints out the top-level bindings for every module as a single recursive group, in order to avoid keeping track of dependencies between top-level values within a module. An external Core tool could reconstruct dependencies later, of course.
There is also an implicitly-defined module ghc-prim:GHC.Prim
,
which exports the “built-in” types and values that must be
provided by any implementation of Core (including GHC). Details
of this module are in the Primitive Module section.
A Core program is a collection of distinctly-named modules that
includes a module called main:Main having an exported value
called main:ZCMain.main
of type base:GHC.IOBase.IO a
(for some
type a
). (Note that the strangely named wrapper for main
is the
one exception to the rule that qualified names defined within a
module m
must have module name m
.)
Many Core programs will contain library modules, such as
base:GHC.Base
, which implement parts of the Haskell standard
library. In principle, these modules are ordinary Haskell
modules, with no special status. In practice, the requirement on
the type of main:Main.main
implies that every program will
contain a large subset of the standard library modules.
There are five distinct namespaces:
mident
),tycon
),tyvar
),dcon
),var
).
Spaces (1), (2+3), and (4+5) can be distinguished from each other by context. To distinguish (2) from (3) and (4) from (5), we require that data and type constructors begin with an upper-case character, and that term and type variables begin with a lower-case character.
Primitive types and operators are not syntactically distinguished.
Primitive coercion operators, of which there are six, are syntactically distinguished in the grammar. This is because these coercions must be fully applied, and because distinguishing their applications in the syntax makes typechecking easier.
A given variable (type or term) may have multiple definitions
within a module. However, definitions of term variables never
“shadow” one another: the scope of the definition of a given
variable never contains a redefinition of the same variable.
Type variables may be shadowed. Thus, if a term variable has
multiple definitions within a module, all those definitions must
be local (let-bound). The only exception to this rule is that
(necessarily closed) types labelling %external
expressions may
contain tyvar
bindings that shadow outer bindings.
Core generated by GHC makes heavy use of encoded names, in which
the characters Z
and z
are used to introduce escape sequences
for non-alphabetic characters such as dollar sign $
(zd
), hash #
(zh
), plus +
(zp
), etc. This is the same encoding used in .hi
files and in the back-end of GHC itself, except that we
sometimes change an initial z
to Z
, or vice-versa, in order to
maintain case distinctions.
Finally, note that hierarchical module names are z-encoded in
Core: for example, base:GHC.Base.foo
is rendered as
base:GHCziBase.foo
. A parser may reconstruct the module
hierarchy, or regard GHCziBase
as a flat name.
In Core, all type abstractions and applications are explicit. This make it easy to typecheck any (closed) fragment of Core code. An full executable typechecker is available separately.
Types are described by type expressions, which are built from
named type constructors and type variables using type
application and universal quantification. Each type
constructor has a fixed arity ≥ 0. Because it is so widely
used, there is special infix syntax for the fully-applied
function type constructor (->
). (The prefix identifier for
this constructor is ghc-prim:GHC.Prim.ZLzmzgZR
; this should
only appear in unapplied or partially applied form.)
There are also a number of other primitive type constructors
(e.g., Intzh
) that are predefined in the GHC.Prim
module, but
have no special syntax. %data
and %newtype
declarations
introduce additional type constructors, as described below.
Type constructors are distinguished solely by name.
A type may also be built using one of the primitive coercion operators, as described in the Namespaces section. For details on the meanings of these operators, see the System FC paper [system-fc]. Also see the Newtypes section for examples of how GHC uses coercions in Core code.
As described in the Haskell definition, it is necessary to distinguish well-formed type-expressions by classifying them into different kinds [haskell98, p. 41]. In particular, Core explicitly records the kind of every bound type variable.
In addition, Core’s kind system includes equality kinds, as in
System FC [system-fc]. An application of a built-in coercion, or of a
user-defined coercion as introduced by a newtype
declaration,
has an equality kind.
Semantically, a type is lifted if and only if it has bottom as an element. We need to distinguish them because operationally, terms with lifted types may be represented by closures; terms with unlifted types must not be represented by closures, which implies that any unboxed value is necessarily unlifted. We distinguish between lifted and unlifted types by ascribing them different kinds.
Currently, all the primitive types are unlifted (including a
few boxed primitive types such as ByteArrayzh
). Peyton-Jones
and Launchbury [pj:unboxed] described the ideas behind unboxed and
unlifted types.
Every type constructor has a kind, depending on its arity and whether it or its arguments are lifted.
Term variables can only be assigned types that have base
kinds: the base kinds are *
, #
, and ?
. The three base kinds
distinguish the liftedness of the types they classify: *
represents lifted types; #
represents unlifted types; and ?
is
the “open” kind, representing a type that may be either lifted
or unlifted. Of these, only *
ever appears in Core type
declarations generated from user code; the other two are
needed to describe certain types in primitive (or otherwise
specially-generated) code (which, after optimization, could
potentially appear anywhere).
In particular, no top-level identifier (except in
ghc-prim:GHC.Prim
) has a type of kind #
or ?
.
Nullary type constructors have base kinds: for example, the
type Int
has kind *
, and Int#
has kind #
.
Non-nullary type constructors have higher kinds: kinds that
have the form
k1->
k2, where
k1 and k2 are
kinds. For example, the function type constructor ->
has
kind * -> (* -> *)
. Since Haskell allows abstracting
over type constructors, type variables may have higher kinds;
however, much more commonly they have kind *
, so that is the
default if a type binder omits a kind.
There is no mechanism for defining type synonyms
(corresponding to Haskell type
declarations).
Type equivalence is just syntactic equivalence on type expressions (of base kinds) modulo:
%forall
types;->
b ≡ ghc-prim:GHC.Prim.ZLzmzgZR
a bEach data declaration introduces a new type constructor and a
set of one or more data constructors, normally corresponding
directly to a source Haskell data
declaration. For example, the
source declaration
data Bintree a = Fork (Bintree a) (Bintree a) | Leaf a
might induce the following Core declaration
%data Bintree a = { Fork (Bintree a) (Bintree a); Leaf a)}
which introduces the unary type constructor Bintree of kind
*->*
and two data constructors with types
Fork :: %forall a . Bintree a -> Bintree a -> Bintree a Leaf :: %forall a . a -> Bintree a
We define the arity of each data constructor to be the number of
value arguments it takes; e.g. Fork
has arity 2 and Leaf
has
arity 1.
For a less conventional example illustrating the possibility of higher-order kinds, the Haskell source declaration
data A f a = MkA (f a)
might induce the Core declaration
%data A (f::*->*) a = { MkA (f a) }
which introduces the constructor
MkA :: %forall (f::*->*) a . (f a) -> (A f) a
GHC (like some other Haskell implementations) supports an extension to Haskell98 for existential types such as
data T = forall a . MkT a (a -> Bool)
This is represented by the Core declaration
%data T = {MkT @a a (a -> Bool)}
which introduces the nullary type constructor T and the data constructor
MkT :: %forall a . a -> (a -> Bool) -> T
In general, existentially quantified variables appear as extra
universally quantified variables in the data contructor types. An
example of how to construct and deconstruct values of type T
is
shown in the Expression Forms section.
Each Core %newtype
declaration introduces a new type constructor
and an associated representation type, corresponding to a source
Haskell newtype
declaration. However, unlike in source Haskell,
a %newtype
declaration does not introduce any data constructors.
Each %newtype
declaration also introduces a new coercion
(syntactically, just another type constructor) that implies an
axiom equating the type constructor, applied to any type
variables bound by the %newtype
, to the representation type.
For example, the Haskell fragment
newtype U = MkU Bool u = MkU True v = case u of MkU b -> not b
might induce the Core fragment
%newtype U ZCCoU = Bool; u :: U = %cast (True) ((%sym ZCCoU)); v :: Bool = not (%cast (u) ZCCoU);
The newtype
declaration implies that the types U
and Bool
have
equivalent representations, and the coercion axiom ZCCoU
provides evidence that U
is equivalent to Bool
. Notice that in
the body of u
, the boolean value True
is cast to type U
using
the primitive symmetry rule applied to ZCCoU
: that is, using a
coercion of kind Bool :=: U
. And in the body of v
, u
is cast
back to type Bool
using the axiom ZCCoU
.
Notice that the case
in the Haskell source code above translates
to a cast
in the corresponding Core code. That is because
operationally, a case
on a value whose type is declared by a
newtype
declaration is a no-op. Unlike a case
on any other
value, such a case
does no evaluation: its only function is to
coerce its scrutinee’s type.
Also notice that unlike in a previous draft version of External Core, there is no need to handle recursive newtypes specially.
Variables and data constructors are straightforward.
Literal (lit) expressions consist of a literal value, in one of four different formats, and a (primitive) type annotation. Only certain combinations of format and type are permitted; see the Primitive Module section. The character and string formats can describe only 8-bit ASCII characters.
Moreover, because the operational semantics for Core interprets strings as C-style null-terminated strings, strings should not contain embedded nulls.
In Core, value applications, type applications, value
abstractions, and type abstractions are all explicit. To tell
them apart, type arguments in applications and formal type
arguments in abstractions are preceded by an @ symbol
. (In
abstractions, the @
plays essentially the same role as the more
usual Λ symbol.) For example, the Haskell source declaration
f x = Leaf (Leaf x)
might induce the Core declaration
f :: %forall a . a -> BinTree (BinTree a) = \ @a (x::a) -> Leaf @(Bintree a) (Leaf @a x)
Value applications may be of user-defined functions, data constructors, or primitives. None of these sorts of applications are necessarily saturated.
Note that the arguments of type applications are not always of
kind *
. For example, given our previous definition of type A
:
data A f a = MkA (f a)
the source code
MkA (Leaf True)
becomes
(MkA @Bintree @Bool) (Leaf @Bool True)
Local bindings, of a single variable or of a set of mutually
recursive variables, are represented by %let
expressions in the
usual way.
By far the most complicated expression form is %case
. %case
expressions are permitted over values of any type, although they
will normally be algebraic or primitive types (with literal
values). Evaluating a %case
forces the evaluation of the
expression being tested (the “scrutinee”). The value of the
scrutinee is bound to the variable following the %of
keyword,
which is in scope in all alternatives; this is useful when the
scrutinee is a non-atomic expression (see next example). The
scrutinee is preceded by the type of the entire %case
expression: that is, the result type that all of the %case
alternatives have (this is intended to make type reconstruction
easier in the presence of type equality coercions).
In an algebraic %case
, all the case alternatives must be labeled
with distinct data constructors from the algebraic type,
followed by any existential type variable bindings (see below),
and typed term variable bindings corresponding to the data
constructor’s arguments. The number of variables must match the
data constructor’s arity.
For example, the following Haskell source expression
case g x of Fork l r -> Fork r l t@(Leaf v) -> Fork t t
might induce the Core expression
%case ((Bintree a)) g x %of (t::Bintree a) Fork (l::Bintree a) (r::Bintree a) -> Fork @a r l Leaf (v::a) -> Fork @a t t
When performing a %case
over a value of an
existentially-quantified algebraic type, the alternative must
include extra local type bindings for the
existentially-quantified variables. For example, given
data T = forall a . MkT a (a -> Bool)
the source
case x of MkT w g -> g w
becomes
%case x %of (x’::T) MkT @b (w::b) (g::b->Bool) -> g w
In a %case
over literal alternatives, all the case alternatives
must be distinct literals of the same primitive type.
The list of alternatives may begin with a default alternative
labeled with an underscore (%_
), whose right-hand side will be
evaluated if none of the other alternatives match. The default
is optional except for in a case over a primitive type, or when
there are no other alternatives. If the case is over neither an
algebraic type nor a primitive type, then the list of
alternatives must contain a default alternative and nothing
else. For algebraic cases, the set of alternatives need not be
exhaustive, even if no default is given; if alternatives are
missing, this implies that GHC has deduced that they cannot
occur.
%cast
is used to manipulate newtypes, as described in
the Newtype section. The %cast
expression
takes an expression and a coercion: syntactically, the coercion
is an arbitrary type, but it must have an equality kind. In an
expression (cast e co)
, if e :: T
and co
has kind T :=: U
, then
the overall expression has type U
[ghc-fc-commentary]. Here, co
must be a
coercion whose left-hand side is T
.
Note that unlike the %coerce
expression that existed in previous
versions of Core, this means that %cast
is (almost) type-safe:
the coercion argument provides evidence that can be verified by
a typechecker. There are still unsafe %cast
s, corresponding to
the unsafe %coerce
construct that existed in old versions of
Core, because there is a primitive unsafe coercion type that can
be used to cast arbitrary types to each other. GHC uses this for
such purposes as coercing the return type of a function (such as
error) which is guaranteed to never return:
case (error "") of True -> 1 False -> 2
becomes:
%cast (error @ Bool (ZMZN @ Char)) (%unsafe Bool Integer);
%cast
has no operational meaning and is only used in
typechecking.
A %note
expression carries arbitrary internal information that
GHC finds interesting. The information is encoded as a string.
Expression notes currently generated by GHC include the inlining
pragma (InlineMe
) and cost-center labels for profiling.
A %external
expression denotes an external identifier, which has
the indicated type (always expressed in terms of Haskell
primitive types). External Core supports two kinds of external
calls: %external
and %dynexternal
. Only the former is supported
by the current set of stand-alone Core tools. In addition, there
is a %label
construct which GHC may generate but which the Core
tools do not support.
The present syntax for externals is sufficient for describing C functions and labels. Interfacing to other languages may require additional information or a different interpretation of the name string.
The dynamic semantics of Core are defined on the type-erasure of the program: for example, we ignore all type abstractions and applications. The denotational semantics of the resulting type-free program are just the conventional ones for a call-by-name language, in which expressions are only evaluated on demand. But Core is intended to be a call-by-need language, in which expressions are only evaluated once. To express the sharing behavior of call-by-need, we give an operational model in the style of Launchbury [launchbury93natural].
This section describes the model informally; a more formal semantics is separately available as an executable interpreter.
To simplify the semantics, we consider only “well-behaved” Core
programs in which constructor and primitive applications are
fully saturated, and in which non-trivial expresssions of
unlifted kind (#
) appear only as scrutinees in %case
expressions. Any program can easily be put into this form; a
separately available preprocessor illustrates how. In the
remainder of this section, we use “Core” to mean “well-behaved”
Core.
Evaluating a Core expression means reducing it to weak-head normal form (WHNF),
i.e., a primitive value, lambda abstraction,
or fully-applied data constructor. Evaluating a program means
evaluating the expression main:ZCMain.main
.
To make sure that expression evaluation is shared, we make use of a heap, which contains heap entries. A heap entry can be:
Heap pointers point to heap entries: at different times, the same heap pointer can point to either a thunk or a WHNF, because the run-time system overwrites thunks with WHNFs as computation proceeds.
The suspended computation that a thunk represents might represent evaluating one of three different kinds of expression. The run-time system allocates a different kind of thunk depending on what kind of expression it is:
let
-bound (local)).
As computation proceeds, copies of the heap pointer for a given thunk propagate through the executing program. When another computation demands the result of that thunk, the thunk is forced: the run-time system computes the thunk’s result, yielding a WHNF, and overwrites the heap entry for the thunk with the WHNF. Now, all copies of the heap pointer point to the new heap entry: a WHNF. Forcing occurs only in the context of
case
expression; or
When no pointers to a heap entry (whether it is a thunk or WHNF) remain, the garbage collector can reclaim the space it uses. We assume this happens implicitly.
With the exception of functions, arrays, and mutable variables, we intend that values of all primitive types should be held unboxed: they should not be heap-allocated. This does not violate call-by-need semantics: all primitive types are unlifted, which means that values of those types must be evaluated strictly. Unboxed tuple types are not heap-allocated either.
Certain primitives and %external
functions cause side-effects to
state threads or to the real world. Where the ordering of these
side-effects matters, Core already forces this order with data
dependencies on the pseudo-values representing the threads.
An implementation must specially support the raisezh
and
handlezh
primitives: for example, by using a handler stack.
Again, real-world threading guarantees that they will execute in
the correct order.