4.14. Using Parallel Haskell

[You won't be able to execute parallel Haskell programs unless PVM3 (Parallel Virtual Machine, version 3) is installed at your site.]

To compile a Haskell program for parallel execution under PVM, use the -parallel option, both when compiling and linking. You will probably want to import Parallel into your Haskell modules.

To run your parallel program, once PVM is going, just invoke it “as normal”. The main extra RTS option is -qp<n>, which says how many PVM “processors” your program is to run on. (For more details of all relevant RTS options, please see Section 4.14.4.)
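
As a rough sketch of the steps just described, the commands below compile, link and run a program on four PVM processors. The file name Main.hs, the program name ./Main and the processor count are placeholders chosen for illustration, and the run step assumes PVM is already up. The snippet only prints the commands, so it can be read (or tried) without a parallel-enabled GHC installed.

```shell
# Hypothetical file and program names; adjust to your own project.
SRC=Main.hs
PROG=./Main

# -parallel must be given both when compiling and when linking.
COMPILE="ghc -parallel -c $SRC"
LINK="ghc -parallel -o $PROG Main.o"

# Run "as normal", asking the RTS for 4 PVM processors.
RUN="$PROG +RTS -qp4"

# Print the three commands, one per line.
printf '%s\n' "$COMPILE" "$LINK" "$RUN"
```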

In truth, running Parallel Haskell programs and getting information out of them (e.g., parallelism profiles) is a battle with the vagaries of PVM, detailed in the following sections.

4.14.1. Dummy's guide to using PVM

Before you can run a parallel program under PVM, you must set the required environment variables (PVM's idea, not ours); something like, probably in your .cshrc or equivalent:
setenv PVM_ROOT /wherever/you/put/it
setenv PVM_ARCH `$PVM_ROOT/lib/pvmgetarch`
setenv PVM_DPATH $PVM_ROOT/lib/pvmd
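
If you use a Bourne-style shell (sh, bash) rather than csh, the same settings might look roughly like the following, for example in your .profile. This is only a sketch: the PVM_ROOT path is the same placeholder as above, and the guard around pvmgetarch is an addition so that the snippet does not fail on a machine where PVM is not yet installed.

```shell
# Bourne-shell equivalent of the csh settings above.
# The path is a placeholder; point it at your PVM installation.
PVM_ROOT=/wherever/you/put/it
export PVM_ROOT

# pvmgetarch only exists once PVM is installed, so guard the call.
if [ -x "$PVM_ROOT/lib/pvmgetarch" ]; then
    PVM_ARCH=$("$PVM_ROOT/lib/pvmgetarch")
    export PVM_ARCH
fi

PVM_DPATH=$PVM_ROOT/lib/pvmd
export PVM_DPATH
```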

Creating and/or controlling your “parallel machine” is a purely-PVM business; nothing specific to Parallel Haskell. The following paragraphs describe how to configure your parallel machine interactively.

If you use parallel Haskell regularly on the same machine configuration it is a good idea to maintain a file with all machine names and to make the environment variable PVM_HOST_FILE point to this file. Then you can avoid the interactive operations described below by just saying

pvm $PVM_HOST_FILE
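
A minimal sketch of setting up such a host file is shown below. The host names alpha, beta and gamma and the file location are made up for illustration; the layout (one host name per line, with lines starting with '#' treated as comments) follows PVM's host file format, but check the PVM documentation for the options your version supports.

```shell
# Illustrative location only; a file under your home directory is more usual.
PVM_HOST_FILE=${TMPDIR:-/tmp}/pvm_hosts.example
export PVM_HOST_FILE

# Hypothetical host names; replace them with the machines at your site.
cat > "$PVM_HOST_FILE" <<'EOF'
# one host name per line
alpha
beta
gamma
EOF

# With PVM installed, the whole machine could then be started with:
#   pvm $PVM_HOST_FILE
```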

You use the pvm command to start PVM on your machine. You can then do various things to control or monitor your “parallel machine”; the most useful commands are:

Control-D       exit pvm, leaving it running
halt            kill off this “parallel machine” & exit
add <host>      add <host> as a processor
delete <host>   delete <host>
reset           kill what's going, but leave PVM up
conf            list the current configuration
ps              report processes' status
pstat <pid>     status of a particular process

The PVM documentation can tell you much, much more about pvm!

4.14.2. Parallelism profiles

With Parallel Haskell programs, we usually don't care about the results, only about “how parallel” the run was! We want pretty pictures.

Parallelism profiles (à la hbcpp) can be generated with the -qP RTS option. The per-processor profiling info is dumped into files named <full-path><program>.gr. These are then munged into a PostScript picture, which you can then display. For example, to run your program a.out on 8 processors, then view the parallelism profile, do:

$ ./a.out +RTS -qP -qp8
$ grs2gr *.???.gr > temp.gr     # combine the 8 .gr files into one
$ gr2ps -O temp.gr              # convert to .ps; output in temp.ps
$ ghostview -seascape temp.ps   # look at it!

The scripts for processing the parallelism profiles are distributed in ghc/utils/parallel/.

4.14.3. Other useful info about running parallel programs

The “garbage-collection statistics” RTS options can be useful for seeing what parallel programs are doing. If you use either +RTS -Sstderr or +RTS -sstderr, you'll get mutator, garbage-collection, etc., times on standard error. The standard error of all PEs other than the “main thread” appears in /tmp/pvml.nnn, courtesy of PVM.

Whether doing +RTS -Sstderr or not, a handy way to watch what's happening overall is: tail -f /tmp/pvml.nnn.

4.14.4. RTS options for Concurrent/Parallel Haskell

Besides the usual runtime system (RTS) options (Section 4.16), there are a few options particularly for concurrent/parallel execution.

-qp<N>:

(PARALLEL ONLY) Use <N> PVM processors to run this program; the default is 2.

-C[<s>]:

Sets the context switch interval to <s> seconds. A context switch will occur at the next heap block allocation after the timer expires (a heap block allocation occurs every 4k of allocation). With -C0 or -C, context switches will occur as often as possible (at every heap block allocation). By default, context switches occur every 20 milliseconds. Note that GHC's internal timer ticks every 20ms, and the context switch timer is always a multiple of this timer, so 20ms is the finest granularity available for timed context switches.

-q[v]:

(PARALLEL ONLY) Produce a quasi-parallel profile of thread activity, in the file <program>.qp. In the style of hbcpp, this profile records the movement of threads between the green (runnable) and red (blocked) queues. If you specify the verbose suboption (-qv), the green queue is split into green (for the currently running thread only) and amber (for other runnable threads). We do not recommend that you use the verbose suboption if you are planning to use the hbcpp profiling tools or if you are context switching at every heap check (with -C).

-qt<num>:

(PARALLEL ONLY) Limit the thread pool size, i.e. the number of concurrent threads per processor, to <num>. The default is 32. Each thread requires slightly over 1K words in the heap for thread state and stack objects. (For 32-bit machines, this translates to 4K bytes, and for 64-bit machines, 8K bytes.)

-qe<num>:

(PARALLEL ONLY) Limit the spark pool size, i.e. the number of pending sparks per processor, to <num>. The default is 100. A larger number may be appropriate if your program generates large amounts of parallelism initially.

-qQ<num>:

(PARALLEL ONLY) Set the size of packets transmitted between processors to <num>. The default is 1024 words. A larger number may be appropriate if your machine has a high communication cost relative to computation speed.

-qh<num>:

(PARALLEL ONLY) Select a packing scheme. Set the number of non-root thunks to pack in one packet to <num>-1 (0 means infinity). By default GUM uses full-subgraph packing, i.e. the entire subgraph with the requested closure as root is transmitted (provided it fits into one packet). Choosing a smaller value reduces the amount of pre-fetching of work done in GUM. This can be advantageous for improving data locality but it can also worsen the balance of the load in the system.

-qg<num>:

(PARALLEL ONLY) Select a globalisation scheme. This option affects the generation of global addresses when transferring data. Global addresses are globally unique identifiers required to maintain sharing in the distributed graph structure. Currently this is a binary option. With <num>=0 full globalisation is used (default). This means a global address is generated for every closure that is transmitted. With <num>=1 a thunk-only globalisation scheme is used, which generates global addresses only for thunks. The latter case may lose sharing of data but has a reduced overhead in packing graph structures and maintaining internal tables of global addresses.
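
To get a feel for what a given -qt setting costs, the per-thread figure quoted above can be turned into a rough estimate. The helper below is only a sketch: thread_pool_bytes is a made-up name, and it treats “slightly over 1K words” as exactly 1024 words, so the real requirement will be somewhat higher. The option values in the closing comment are arbitrary examples, not recommendations.

```shell
# Approximate heap bytes needed per processor for thread state,
# taking "slightly over 1K words per thread" as exactly 1024 words.
thread_pool_bytes() {
    threads=$1      # the <num> given to -qt
    word_bytes=$2   # 4 on 32-bit machines, 8 on 64-bit machines
    echo $(( $threads * 1024 * $word_bytes ))
}

thread_pool_bytes 32 4    # default pool on a 32-bit machine: 131072
thread_pool_bytes 64 8    # -qt64 on a 64-bit machine: 524288

# An invocation combining several of the options above might look like:
#   ./a.out +RTS -qp4 -qt64 -qe200 -qg1
```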