4.12. Using SMP parallelism

GHC supports running Haskell programs in parallel on an SMP (symmetric multiprocessor).

There's a fine distinction between concurrency and parallelism: parallelism is all about making your program run faster by making use of multiple processors simultaneously. Concurrency, on the other hand, is a means of abstraction: it is a convenient way to structure a program that must respond to multiple asynchronous events.

However, the two terms are certainly related. By making use of multiple CPUs it is possible to run concurrent threads in parallel, and this is exactly what GHC's SMP parallelism support does. But it is also possible to obtain performance improvements with parallelism on programs that do not use concurrency. This section describes how to use GHC to compile and run parallel programs; Section 7.15, “Parallel Haskell” describes the language features that affect parallelism.

4.12.1. Options to enable SMP parallelism

In order to make use of multiple CPUs, your program must be linked with the -threaded option (see Section 4.10.7, “Options affecting linking”). Then, to run a program on multiple CPUs, use the RTS -N option:

-Nx

Use x simultaneous threads when running the program. Normally x should be chosen to match the number of CPU cores on the machine. Currently there is no way to change this value after the program has started.

For example, on a dual-core machine we would probably use +RTS -N2 -RTS.

Whether hyperthreading cores should be counted or not is an open question; please feel free to experiment and let us know what results you find.
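
As a minimal sketch of a program that can benefit from -N, the following forks two CPU-bound threads and waits for both results. The sumTo and spawn helpers and the workload size are illustrative assumptions, not part of GHC. The commands in the comments assume the file is saved as Main.hs; on recent GHC releases, -rtsopts may also be needed at link time before -N is accepted on the command line.

```haskell
-- Build:  ghc -threaded Main.hs
--         (newer GHC releases may also need -rtsopts here)
-- Run:    ./Main +RTS -N2 -RTS
module Main (main) where

import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- | Hypothetical CPU-bound workload: sum the integers from 1 to n.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
      | i > n     = acc
      | otherwise = let acc' = acc + i
                    in acc' `seq` go acc' (i + 1)

-- | Fork a thread that computes sumTo n and delivers the result in an MVar.
spawn :: Int -> IO (MVar Int)
spawn n = do
  box <- newEmptyMVar
  -- ($!) forces the sum inside the forked thread, not in the thread that reads it.
  _ <- forkIO (putMVar box $! sumTo n)
  return box

main :: IO ()
main = do
  -- Report how many simultaneous threads the RTS was started with (the -N value).
  caps <- getNumCapabilities
  putStrLn ("Running on " ++ show caps ++ " capabilities")
  boxes <- mapM spawn [50000000, 50000000]
  results <- mapM takeMVar boxes
  print results
```

With -N2 on a dual-core machine the two threads can run on separate cores; without -N they share a single core and run one after the other.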

4.12.2. Hints for using SMP parallelism

Add the -sstderr RTS option when running the program to see timing stats, which will help to tell you whether your program got faster by using more CPUs or not. If the user time is greater than the elapsed time, then the program used more than one CPU. You should also run the program without -N for comparison.
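
The comparison described above can be sketched as a small calculation. The Timing type, its field names, and the figures in main are hypothetical and chosen only for illustration; real values would be read by hand from the statistics that -sstderr prints for a run with -N and a run without it.

```haskell
module Main (main) where

-- | Times read by hand from the statistics printed by -sstderr, in seconds.
data Timing = Timing
  { userSeconds    :: Double
  , elapsedSeconds :: Double
  }

-- | User time exceeding elapsed time means more than one CPU was busy.
usedMoreThanOneCpu :: Timing -> Bool
usedMoreThanOneCpu t = userSeconds t > elapsedSeconds t

-- | Ratio of elapsed times: above 1 means the second run finished sooner.
speedup :: Timing -> Timing -> Double
speedup baseline parallel = elapsedSeconds baseline / elapsedSeconds parallel

main :: IO ()
main = do
  -- Made-up figures for illustration only, not measurements.
  let withoutN = Timing { userSeconds = 4.0, elapsedSeconds = 4.0 }
      withN2   = Timing { userSeconds = 4.5, elapsedSeconds = 2.5 }
  putStrLn ("used more than one CPU: " ++ show (usedMoreThanOneCpu withN2))
  putStrLn ("speedup over the run without -N: " ++ show (speedup withoutN withN2))
```

A speedup below 1 would mean the run with -N was slower than the run without it, which is possible.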

GHC's parallelism support is new and experimental. It may make your program go faster, or it might slow it down; either way, we'd be interested to hear from you.

One significant limitation with the current implementation is that the garbage collector is still single-threaded, and all execution must stop when GC takes place. This can be a significant bottleneck in a parallel program, especially if your program does a lot of GC. If this happens to you, then try reducing the cost of GC by tweaking the GC settings (Section 4.14.3, “RTS options to control the garbage collector”): enlarging the heap or the allocation area size is a good start.
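
To try these settings, one possible experiment is to run an allocation-heavy program with different -A (allocation area size) and -H (suggested heap size) values and compare the GC time reported by -sstderr. The digitCount workload, the spawn helper, and the sizes 64m and 512m below are assumptions chosen for illustration, not recommended values. The commands assume the file is saved as Main.hs.

```haskell
-- Build:  ghc -threaded Main.hs
-- Default GC settings:     ./Main +RTS -N2 -sstderr -RTS
-- Larger allocation area:  ./Main +RTS -N2 -A64m -sstderr -RTS
-- Larger suggested heap:   ./Main +RTS -N2 -H512m -sstderr -RTS
module Main (main) where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Data.List (foldl')

-- | Hypothetical allocation-heavy workload: render every number from 1 to n
-- as a String and add up the lengths, creating many short-lived objects.
digitCount :: Int -> Int
digitCount n = foldl' (\acc x -> acc + length (show x)) 0 [1 .. n]

-- | Fork a thread that runs the workload and delivers the result in an MVar.
spawn :: Int -> IO (MVar Int)
spawn n = do
  box <- newEmptyMVar
  -- ($!) forces the count inside the forked thread.
  _ <- forkIO (putMVar box $! digitCount n)
  return box

main :: IO ()
main = do
  boxes <- mapM spawn [2000000, 2000000]
  results <- mapM takeMVar boxes
  print results
```

If the GC time shown by -sstderr drops noticeably with the larger settings, garbage collection was probably limiting the parallel run.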