GHC comes with a time and space profiling system, so that you can answer questions like "why is my program so slow?", or "why is my program using so much memory?".
Profiling a program is a three-step process:
Re-compile your program for profiling with the -prof option, and probably one of the options for adding automatic annotations: -fprof-auto is the most common[10].
If you are using external packages with cabal, you may need to reinstall these packages with profiling support; typically this is done with cabal install -p package --reinstall.
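With newer versions of cabal-install, the usual way to get profiled builds of a project and its dependencies is the --enable-profiling flag, or the corresponding field in cabal.project.local (flag and field names are taken from current cabal documentation; older versions differ):

```shell
# One-off: build this package and its dependencies with profiling
cabal build --enable-profiling

# Or persistently, by adding this line to cabal.project.local:
#   profiling: True
```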
Having compiled the program for profiling, you now need to run it to generate the profile. For example, a simple time profile can be generated by running the program with +RTS -p, which generates a file named prog.prof, where prog is the name of your program (without the .exe extension, if you are on Windows).
There are many different kinds of profile that can be generated, selected by different RTS options. We will be describing the various kinds of profile throughout the rest of this chapter. Some profiles require further processing using additional tools after running the program.
Examine the generated profiling information, use the information to optimise your program, and repeat as necessary.
GHC's profiling system assigns costs to cost centres. A cost is simply the time or space (memory) required to evaluate an expression. Cost centres are program annotations around expressions; all costs incurred by the annotated expression are assigned to the enclosing cost centre. Furthermore, GHC will remember the stack of enclosing cost centres for any given expression at run-time and generate a call-tree of cost attributions.
Let's take a look at an example:
main = print (fib 30)

fib n = if n < 2 then 1 else fib (n-1) + fib (n-2)
Compile and run this program as follows:
$ ghc -prof -fprof-auto -rtsopts Main.hs
$ ./Main +RTS -p
121393
$
When a GHC-compiled program is run with the -p RTS option, it generates a file called prog.prof. In this case, the file will contain something like this:
        Wed Oct 12 16:14 2011 Time and Allocation Profiling Report  (Final)

           Main +RTS -p -RTS

        total time  =        0.68 secs   (34 ticks @ 20 ms)
        total alloc = 204,677,844 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

fib         Main    100.0  100.0

                                                         individual     inherited
COST CENTRE MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                    102           0    0.0    0.0   100.0  100.0
 CAF        GHC.IO.Handle.FD        128           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv   120           0    0.0    0.0     0.0    0.0
 CAF        GHC.Conc.Signal         110           0    0.0    0.0     0.0    0.0
 CAF        Main                    108           0    0.0    0.0   100.0  100.0
  main      Main                    204           1    0.0    0.0   100.0  100.0
   fib      Main                    205     2692537  100.0  100.0   100.0  100.0
The first part of the file gives the program name and options, and the total time and total memory allocation measured during the run of the program (note that the total memory allocation figure isn't the same as the amount of live memory needed by the program at any one time; the latter can be determined using heap profiling, which we will describe later in Section 5.4, “Profiling memory usage”).
The second part of the file is a break-down by cost centre of the most costly functions in the program. In this case, there was only one significant function in the program, namely fib, and it was responsible for 100% of both the time and allocation costs of the program.
The third and final section of the file gives a profile break-down by cost-centre stack. This is roughly a call-tree profile of the program. In the example above, it is clear that the costly call to fib came from main.
The time and allocation incurred by a given part of the program is displayed in two ways: “individual”, which are the costs incurred by the code covered by this cost centre stack alone, and “inherited”, which includes the costs incurred by all the children of this node.
The usefulness of cost-centre stacks is better demonstrated by modifying the example slightly:
main = print (f 30 + g 30)
  where
    f n = fib n
    g n = fib (n `div` 2)

fib n = if n < 2 then 1 else fib (n-1) + fib (n-2)
Compile and run this program as before, and take a look at the new profiling results:
                                                         individual     inherited
COST CENTRE MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                    102           0    0.0    0.0   100.0  100.0
 CAF        GHC.IO.Handle.FD        128           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv   120           0    0.0    0.0     0.0    0.0
 CAF        GHC.Conc.Signal         110           0    0.0    0.0     0.0    0.0
 CAF        Main                    108           0    0.0    0.0   100.0  100.0
  main      Main                    204           1    0.0    0.0   100.0  100.0
   main.g   Main                    207           1    0.0    0.0     0.0    0.1
    fib     Main                    208        1973    0.0    0.1     0.0    0.1
   main.f   Main                    205           1    0.0    0.0   100.0   99.9
    fib     Main                    206     2692537  100.0   99.9   100.0   99.9
Now although we had two calls to fib in the program, it is immediately clear that it was the call from f which took all the time. The functions f and g which are defined in the where clause in main are given their own cost centres, main.f and main.g respectively.
The actual meaning of the various columns in the output is:

entries
    The number of times this particular point in the call tree was entered.

individual %time
    The percentage of the total run time of the program spent at this point in the call tree.

individual %alloc
    The percentage of the total memory allocations (excluding profiling overheads) of the program made by this call.

inherited %time
    The percentage of the total run time of the program spent below this point in the call tree.

inherited %alloc
    The percentage of the total memory allocations (excluding profiling overheads) of the program made by this call and all of its sub-calls.
In addition you can use the -P RTS option to get the following additional information:

ticks
    The raw number of time "ticks" which were attributed to this cost-centre; from this, we get the %time figure mentioned above.

bytes
    Number of bytes allocated in the heap while in this cost-centre; again, this is the raw number from which we get the %alloc figure mentioned above.
What about recursive functions, and mutually recursive groups of functions? Where are the costs attributed? Well, although GHC does keep information about which groups of functions called each other recursively, this information isn't displayed in the basic time and allocation profile. Instead, the call-graph is flattened into a tree as follows: a call to a function that occurs elsewhere on the current stack does not push another entry on the stack; instead, the costs for this call are aggregated into the caller[11].
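To see this flattening concretely, here is a small sketch (the functions are my own example, not from the text above). Compiled with -prof -fprof-auto, the mutually recursive pair below produces a profile in which the recursive re-entries do not deepen the cost-centre stack; the costs of the whole group are aggregated at the point where it was first entered.

```haskell
-- Mutually recursive group: under -fprof-auto each function gets its
-- own cost centre, but a recursive call back into a function that is
-- already on the current stack does not push a new entry; its costs
-- are folded into the existing one.
isEven :: Int -> Bool
isEven 0 = True
isEven n = isOdd (n - 1)

isOdd :: Int -> Bool
isOdd 0 = False
isOdd n = isEven (n - 1)

main :: IO ()
main = print (isEven 100000)
```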
Cost centres are just program annotations. When you say -fprof-auto to the compiler, it automatically inserts a cost centre annotation around every binding not marked INLINE in your program, but you are entirely free to add cost centre annotations yourself.
The syntax of a cost centre annotation is

{-# SCC "name" #-} <expression>

where "name" is an arbitrary string that will become the name of your cost centre as it appears in the profiling output, and <expression> is any Haskell expression. An SCC annotation extends as far to the right as possible when parsing. (SCC stands for "Set Cost Centre".) The double quotes can be omitted if name is a Haskell identifier, for example:

{-# SCC my_function #-} <expression>
Here is an example of a program with a couple of SCCs:
main :: IO ()
main = do
  let xs = [1..1000000]
  let ys = [1..2000000]
  print $ {-# SCC last_xs #-} last xs
  print $ {-# SCC last_init_xs #-} last $ init xs
  print $ {-# SCC last_ys #-} last ys
  print $ {-# SCC last_init_ys #-} last $ init ys
which gives this profile when run:
                                                             individual     inherited
COST CENTRE     MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN            MAIN                    102           0    0.0    0.0   100.0  100.0
 CAF            GHC.IO.Handle.FD        130           0    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Encoding.Iconv   122           0    0.0    0.0     0.0    0.0
 CAF            GHC.Conc.Signal         111           0    0.0    0.0     0.0    0.0
 CAF            Main                    108           0    0.0    0.0   100.0  100.0
  main          Main                    204           1    0.0    0.0   100.0  100.0
   last_init_ys Main                    210           1   25.0   27.4    25.0   27.4
   main.ys      Main                    209           1   25.0   39.2    25.0   39.2
   last_ys      Main                    208           1   12.5    0.0    12.5    0.0
   last_init_xs Main                    207           1   12.5   13.7    12.5   13.7
   main.xs      Main                    206           1   18.8   19.6    18.8   19.6
   last_xs      Main                    205           1    6.2    0.0     6.2    0.0
While running a program with profiling turned on, GHC maintains a cost-centre stack behind the scenes, and attributes any costs (memory allocation and time) to whatever the current cost-centre stack is at the time the cost is incurred.
The mechanism is simple: whenever the program evaluates an expression with an SCC annotation, {-# SCC c #-} E, the cost centre c is pushed on the current stack, and the entry count for this stack is incremented by one. The stack also sometimes has to be saved and restored; in particular, when the program creates a thunk (a lazy suspension), the current cost-centre stack is stored in the thunk, and restored when the thunk is evaluated. In this way, the cost-centre stack is independent of the actual evaluation order used by GHC at runtime.
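Here is a small sketch of that save-and-restore behaviour (the cost-centre names build and consume are my own, chosen for illustration): a thunk is built under one annotation but forced under another, and because the stack is captured when the thunk is created, the work is not charged to the place where evaluation happens to occur.

```haskell
main :: IO ()
main = do
  -- The thunk for 'xs' stores the cost-centre stack current at its
  -- creation point; the SCC "build" is part of the thunk's body.
  let xs = {-# SCC build #-} map (* 2) [1 .. 100000 :: Int]
  -- Forcing 'xs' here restores the saved stack first, so the cost of
  -- the map is attributed under "build", not under "consume", even
  -- though evaluation actually happens inside this expression.
  print ({-# SCC consume #-} sum xs)
```

(SCC pragmas are accepted and simply ignored when the program is compiled without -prof, so the same source works for both normal and profiled builds.)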
At a function call, GHC takes the stack stored in the function being called (which for a top-level function will be empty), and appends it to the current stack, ignoring any prefix that is identical to a prefix of the current stack.
We mentioned earlier that lazy computations, i.e. thunks, capture the current stack when they are created, and restore this stack when they are evaluated. What about top-level thunks? They are "created" when the program is compiled, so what stack should we give them? The technical name for a top-level thunk is a CAF ("Constant Applicative Form"). GHC assigns every CAF in a module a stack consisting of the single cost centre M.CAF, where M is the name of the module. It is also possible to give each CAF a different stack, using the option -fprof-cafs.
This is especially useful when compiling with -ffull-laziness (as is default with -O and higher), as constants in function bodies will be lifted to the top level and become CAFs. You will probably need to consult the Core (-ddump-simpl) in order to determine what these CAFs correspond to.
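As a sketch of that effect (my own example, not from the text above): the list below does not depend on n, so with -O full laziness may float it to the top level, where it becomes a CAF; with plain -prof its allocation then appears under Main's CAF cost centre rather than under f, and with -fprof-cafs it gets a cost centre of its own.

```haskell
-- 'f' looks as if it allocates the list on every call, but with -O
-- (and hence -ffull-laziness) the n-independent [1 .. 1000000] may be
-- floated out as a CAF, shared between calls, and its allocation
-- attributed to a CAF cost centre rather than to 'f'.
f :: Int -> Int
f n = n + sum [1 .. 1000000]

main :: IO ()
main = print (f 1 + f 2)
```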