4.14. Using SMP parallelism

GHC supports running Haskell programs in parallel on an SMP (symmetric multiprocessor).

There's a fine distinction between concurrency and parallelism: parallelism is all about making your program run faster by making use of multiple processors simultaneously. Concurrency, on the other hand, is a means of abstraction: it is a convenient way to structure a program that must respond to multiple asynchronous events.

However, the two terms are certainly related. By making use of multiple CPUs it is possible to run concurrent threads in parallel, and this is exactly what GHC's SMP parallelism support does. But it is also possible to obtain performance improvements with parallelism on programs that do not use concurrency. This section describes how to use GHC to compile and run parallel programs; the language features that affect parallelism are described in Section 7.19, “Concurrent and Parallel Haskell”.

4.14.1. Compile-time options for SMP parallelism

In order to make use of multiple CPUs, your program must be linked with the -threaded option (see Section 4.11.6, “Options affecting linking”). Additionally, the following compiler options affect parallelism:

-feager-blackholing

Blackholing is the act of marking a thunk (lazy computation) as being under evaluation. It is useful for three reasons: firstly, it lets us detect certain kinds of infinite loop (the NonTermination exception); secondly, it avoids certain kinds of space leak; and thirdly, it avoids repeating a computation in a parallel program, because we can tell when a computation is already in progress.

The option -feager-blackholing causes each thunk to be blackholed as soon as evaluation begins. The default is "lazy blackholing", whereby thunks are only marked as being under evaluation when a thread is paused for some reason. Lazy blackholing is typically more efficient (by 1-2% or so), because most thunks don't need to be blackholed. However, eager blackholing can avoid more repeated computation in a parallel program, and this often turns out to be important for parallelism.

We recommend compiling any code that is intended to be run in parallel with the -feager-blackholing flag.
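As an illustration of the options above, here is a minimal sketch of a program with work that can be evaluated in parallel. The file name ParFib.hs, the functions fib and parSum, and the argument values are hypothetical and chosen only for this example; par and pseq are taken from GHC.Conc so that nothing outside the base package is needed. The commands in the comments are assumptions about a typical invocation, not a prescribed build procedure.

```haskell
-- ParFib.hs (hypothetical file name)
-- Assumed build command:
--   ghc -O2 -threaded -feager-blackholing ParFib.hs
-- Assumed run command on a dual-core machine:
--   ./ParFib +RTS -N2 -RTS
-- Note: GHC releases later than the one described here may also need
-- -rtsopts at link time before the program accepts -N.
module Main (main) where

import GHC.Conc (par, pseq)

-- Naive Fibonacci, used here only as a source of CPU-bound work.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Create a spark for x, evaluate y in the current thread,
-- and only then combine the two results.
parSum :: Int -> Int -> Integer
parSum a b = x `par` (y `pseq` (x + y))
  where
    x = fib a
    y = fib b

main :: IO ()
main = print (parSum 27 28)
```

The result is the same whether or not the spark is actually run on another CPU; par only hints that x may be evaluated in parallel.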

4.14.2. RTS options for SMP parallelism

To run a program on multiple CPUs, use the RTS -N option:

-N[x]

Use x simultaneous threads when running the program. Normally x should be chosen to match the number of CPU cores on the machine[9]. For example, on a dual-core machine we would probably use +RTS -N2 -RTS.

Omitting x, i.e. +RTS -N -RTS, lets the runtime choose the value of x itself based on how many processors are in your machine.

Be careful when using all the processors in your machine: if some of your processors are in use by other programs, this can actually harm performance rather than improve it.

Setting -N also has the effect of enabling the parallel garbage collector (see Section 4.16.3, “RTS options to control the garbage collector”).

There is no means (currently) by which this value may vary after the program has started.

The current value of the -N option is available to the Haskell program via GHC.Conc.numCapabilities.
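A minimal sketch of reading this value from within a program follows. The helper describe and its wording are hypothetical; only numCapabilities comes from GHC.Conc.

```haskell
-- Assumed run command: ./prog +RTS -N2 -RTS   (program name is hypothetical)
module Main (main) where

import GHC.Conc (numCapabilities)

-- Turn the -N setting into a short message. A value of 1 means
-- Haskell threads are not being run in parallel.
describe :: Int -> String
describe n
  | n <= 1    = "running on a single capability"
  | otherwise = "running on " ++ show n ++ " capabilities"

main :: IO ()
main = putStrLn (describe numCapabilities)
```

A program might use this value to decide how many worker threads to start or how finely to divide its work.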

The following options affect the way the runtime schedules threads on CPUs:

-qa

Use the OS's affinity facilities to try to pin OS threads to CPU cores. This is an experimental feature, and may or may not be useful. Please let us know whether it helps for you!

-qm

Disable automatic migration for load balancing. Normally the runtime will automatically try to schedule threads across the available CPUs to make use of idle CPUs; this option disables that behaviour. Note that migration only applies to threads; sparks created by par are load-balanced separately by work-stealing.

This option is probably only of use for concurrent programs that explicitly schedule threads onto CPUs with GHC.Conc.forkOnIO.
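The following sketch shows what explicitly scheduling threads onto CPUs can look like. It is written against a later version of the base library, where GHC.Conc.forkOnIO is available under the name forkOn in Control.Concurrent and threadCapability can report where a thread is running; on the GHC release described here you would use forkOnIO instead, and threadCapability is an assumption that may not be available. The function pinnedWorkers is hypothetical.

```haskell
-- Assumed run command: ./prog +RTS -N2 -qm -RTS   (program name is hypothetical)
module Main (main) where

import Control.Concurrent (forkOn, myThreadId, threadCapability)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)
import GHC.Conc (numCapabilities)

-- Start n workers, pinning worker i to capability i, and collect
-- (requested capability, capability actually reported) pairs.
pinnedWorkers :: Int -> IO [(Int, Int)]
pinnedWorkers n = do
  boxes <- forM [0 .. n - 1] $ \cap -> do
    box <- newEmptyMVar
    _ <- forkOn cap $ do
      tid <- myThreadId
      (actual, _isPinned) <- threadCapability tid
      putMVar box (cap, actual)
    return box
  -- Wait for every worker to report back, in order.
  mapM takeMVar boxes

main :: IO ()
main = pinnedWorkers numCapabilities >>= mapM_ print
```

Each worker here only reports its placement; a real program would do its per-CPU work in the forked action.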

4.14.3. Hints for using SMP parallelism

Add the -s RTS option when running the program to see timing stats, which will help to tell you whether your program got faster by using more CPUs or not. If the user time is greater than the elapsed time, then the program used more than one CPU. You should also run the program without -N for comparison.
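The same comparison of CPU time against elapsed time can be approximated from inside a program, as in the sketch below. This is not how -s gathers its figures; it is only an illustration of the rule of thumb. It assumes a later base library that provides GHC.Clock.getMonotonicTime, and the names fib and cpuRatio, the file name, and the workload sizes are hypothetical.

```haskell
-- Assumed build: ghc -O2 -threaded CpuRatio.hs   (hypothetical file name)
-- Assumed runs:  ./CpuRatio   and   ./CpuRatio +RTS -N2 -RTS
module Main (main) where

import Control.Exception (evaluate)
import GHC.Clock (getMonotonicTime)   -- assumption: base 4.11 or later
import GHC.Conc (par, pseq)
import System.CPUTime (getCPUTime)

-- Naive Fibonacci, used only as CPU-bound work.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- CPU seconds divided by elapsed seconds. A ratio noticeably above 1
-- suggests that more than one CPU was busy.
cpuRatio :: Double -> Double -> Double
cpuRatio cpuSecs wallSecs = cpuSecs / wallSecs

main :: IO ()
main = do
  wall0 <- getMonotonicTime
  cpu0  <- getCPUTime
  let x = fib 29
      y = fib 30
  -- Force the parallel computation before reading the clocks again.
  total <- evaluate (x `par` (y `pseq` (x + y)))
  cpu1  <- getCPUTime
  wall1 <- getMonotonicTime
  -- getCPUTime reports picoseconds; convert to seconds.
  let cpuSecs  = fromIntegral (cpu1 - cpu0) / 1.0e12 :: Double
      wallSecs = wall1 - wall0
  putStrLn ("result:      " ++ show total)
  putStrLn ("cpu/elapsed: " ++ show (cpuRatio cpuSecs wallSecs))
```

Timings from a run this short are noisy, so treat the ratio as a rough indication and prefer the +RTS -s output for real measurements.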

The output of +RTS -s tells you how many “sparks” were created and executed during the run of the program (see Section 4.16.3, “RTS options to control the garbage collector”), which will give you an idea of how well your par annotations are working.

GHC's parallelism support has improved in 6.12.1 as a result of much experimentation and tuning in the runtime system. We'd still be interested to hear how well it works for you, and we're also interested in collecting parallel programs to add to our benchmarking suite.



[9] Whether hyperthreading cores should be counted or not is an open question; please feel free to experiment and let us know what results you find.