[You won't be able to execute parallel Haskell programs unless PVM3 (Parallel Virtual Machine, version 3) is installed at your site.]
To compile a Haskell program for parallel execution under PVM, use the -parallel option, both when compiling and linking. You will probably want to import parallel into your Haskell modules.
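As a minimal sketch of the kind of program this section is about (the module name and the availability of the par and seq combinators are assumed here; check your GHC version's documentation for the exact import), the classic nfib example sparks one branch of the recursion for possible evaluation on another processor:

```haskell
-- Sketch only: assumes a module exporting `par` and `seq`.
-- `x `par` e` sparks x for possible parallel evaluation, then returns e;
-- `y `seq` e` evaluates y before returning e.
import Parallel  -- assumed module name; may differ between GHC versions

nfib :: Int -> Int
nfib n
  | n <= 1    = 1
  | otherwise = x `par` (y `seq` x + y + 1)
  where
    x = nfib (n - 1)  -- sparked: may run on another PVM "processor"
    y = nfib (n - 2)  -- evaluated locally first

main :: IO ()
main = print (nfib 25)
```

Such a program would then be built with ghc -parallel (remember: at both the compile and link steps).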
To run your parallel program, once PVM is going, just invoke it “as normal”. The main extra RTS option is -qp<n>, to say how many PVM “processors” your program should run on. (For more details of all relevant RTS options, please see Section 4.12.4, “RTS options for Concurrent/parallel Haskell”.)
In truth, running parallel Haskell programs and getting information out of them (e.g., parallelism profiles) is a battle with the vagaries of PVM, detailed in the following sections.
Before you can run a parallel program under PVM, you must set the required environment variables (PVM's idea, not ours); something like the following, probably in your .cshrc or equivalent:
setenv PVM_ROOT /wherever/you/put/it
setenv PVM_ARCH `$PVM_ROOT/lib/pvmgetarch`
setenv PVM_DPATH $PVM_ROOT/lib/pvmd
Creating and/or controlling your “parallel machine” is a purely-PVM business; nothing specific to parallel Haskell. The following paragraphs describe how to configure your parallel machine interactively.
If you use parallel Haskell regularly on the same machine configuration, it is a good idea to maintain a file with all machine names and to make the environment variable PVM_HOST_FILE point to this file. Then you can avoid the interactive operations described below by just saying
pvm $PVM_HOST_FILE
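For illustration, such a host file might look like the following (the host names are hypothetical; PVM host files take one machine per line and accept per-host options such as ep=, which sets the search path for executables):

```
# Hypothetical $PVM_HOST_FILE: one host per line.
fangorn
lorien
moria ep=$HOME/pvm3/bin/$PVM_ARCH
```

See the PVM documentation for the full host-file syntax.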
You use the pvm command to start PVM on your machine. You can then do various things to control/monitor your “parallel machine”; the most useful being:
Control-D | exit pvm, leaving it running |
halt | kill off this “parallel machine” & exit |
add <host> | add <host> as a processor |
delete <host> | delete <host> |
reset | kill what's going, but leave PVM up |
conf | list the current configuration |
ps | report processes' status |
pstat <pid> | status of a particular process |
The PVM documentation can tell you much, much more about pvm!
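Putting the commands above together, a typical session (host names hypothetical) might look like this:

```
$ pvm                  # start PVM and enter its console
pvm> add lorien        # add lorien as a processor
pvm> conf              # list the current configuration
pvm> ps                # report processes' status
Control-D              # exit the console; PVM keeps running
$ ./a.out +RTS -qp2    # now run your parallel program "as normal"
```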
With parallel Haskell programs, we usually don't care about the results, only about “how parallel” it was! We want pretty pictures.
Parallelism profiles (à la hbcpp) can be generated with the -qP RTS option. The per-processor profiling info is dumped into files named <full-path><program>.gr. These are then munged into a PostScript picture, which you can then display. For example, to run your program a.out on 8 processors, then view the parallelism profile, do:
$ ./a.out +RTS -qP -qp8
$ grs2gr *.???.gr > temp.gr     # combine the 8 .gr files into one
$ gr2ps -O temp.gr              # cvt to .ps; output in temp.ps
$ ghostview -seascape temp.ps   # look at it!
The scripts for processing the parallelism profiles are distributed in ghc/utils/parallel/.
The “garbage-collection statistics” RTS options can be useful for seeing what parallel programs are doing. If you do either +RTS -Sstderr or +RTS -sstderr, then you'll get mutator, garbage-collection, etc., times on standard error. The standard error of all PEs other than the “main thread” appears in /tmp/pvml.nnn, courtesy of PVM.
Whether doing +RTS -Sstderr or not, a handy way to watch what's happening overall is: tail -f /tmp/pvml.nnn.
Besides the usual runtime system (RTS) options (Section 4.14, “Running a compiled program”), there are a few options particularly for concurrent/parallel execution.
-qp<N>:
(PARALLEL ONLY) Use <N> PVM processors to run this program; the default is 2.
-C[<s>]:
Sets the context switch interval to <s> seconds. A context switch will occur at the next heap block allocation after the timer expires (a heap block allocation occurs every 4k of allocation). With -C0 or -C, context switches will occur as often as possible (at every heap block allocation). By default, context switches occur every 20 milliseconds. Note that GHC's internal timer ticks every 20ms, and the context switch timer is always a multiple of this timer, so 20ms is the maximum granularity available for timed context switches.
-q[v]:
(PARALLEL ONLY) Produce a quasi-parallel profile of thread activity, in the file <program>.qp. In the style of hbcpp, this profile records the movement of threads between the green (runnable) and red (blocked) queues. If you specify the verbose suboption (-qv), the green queue is split into green (for the currently running thread only) and amber (for other runnable threads). We do not recommend that you use the verbose suboption if you are planning to use the hbcpp profiling tools or if you are context switching at every heap check (with -C).
-qt<num>:
(PARALLEL ONLY) Limit the thread pool size, i.e. the number of concurrent threads per processor, to <num>. The default is 32. Each thread requires slightly over 1K words in the heap for thread state and stack objects. (For 32-bit machines, this translates to 4K bytes, and for 64-bit machines, 8K bytes.)
-qe<num>:
(PARALLEL ONLY) Limit the spark pool size, i.e. the number of pending sparks per processor, to <num>. The default is 100. A larger number may be appropriate if your program generates large amounts of parallelism initially.
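To see why the default might be too small, consider a program that sparks one task per list element; with a long list, the per-processor spark pool fills immediately and later sparks are discarded. This is a sketch only, again assuming a module exporting par and seq:

```haskell
-- Sketch: one spark per element floods the spark pool (default 100),
-- so a larger limit via +RTS -qe<num> may be appropriate.
import Parallel  -- assumed module name; may differ between GHC versions

-- Spark every element of a list for possible parallel evaluation.
parList :: [Int] -> ()
parList []     = ()
parList (x:xs) = x `par` parList xs

main :: IO ()
main =
  let xs = map (\n -> n * n) [1 .. 5000]   -- placeholder work
  in parList xs `seq` print (sum xs)
```

A run such as ./a.out +RTS -qp8 -qe5000 would then keep all 5000 sparks pending rather than dropping most of them.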
-qQ<num>:
(PARALLEL ONLY) Set the size of packets transmitted between processors to <num>. The default is 1024 words. A larger number may be appropriate if your machine has a high communication cost relative to computation speed.
-qh<num>:
(PARALLEL ONLY) Select a packing scheme. Set the number of non-root thunks to pack in one packet to <num>-1 (0 means infinity). By default GUM uses full-subgraph packing, i.e. the entire subgraph with the requested closure as root is transmitted (provided it fits into one packet). Choosing a smaller value reduces the amount of pre-fetching of work done in GUM. This can be advantageous for improving data locality, but it can also worsen the load balance in the system.
-qg<num>:
(PARALLEL ONLY) Select a globalisation scheme. This option affects the generation of global addresses when transferring data. Global addresses are globally unique identifiers required to maintain sharing in the distributed graph structure. Currently this is a binary option. With <num>=0, full globalisation is used (the default): a global address is generated for every closure that is transmitted. With <num>=1, a thunk-only globalisation scheme is used, which generates global addresses only for thunks. The latter may lose sharing of data but has a reduced overhead in packing graph structures and maintaining internal tables of global addresses.
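Putting several of these options together, a hypothetical invocation of the a.out program from the profiling example above might be:

```
$ ./a.out +RTS -qp8 -qP -qt64 -qe500 -qQ2048
```

That is: 8 PVM processors, a parallelism profile, up to 64 threads and 500 pending sparks per processor, and 2048-word packets between processors.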
:(paraLLEL ONLY) Select a globalisation scheme. This option affects the generation of global addresses when transferring data. Global addresses are globally unique identifiers required to maintain sharing in the distributed graph structure. Currently this is a binary option. With <num>=0 full globalisation is used (default). This means a global address is generated for every closure that is transmitted. With <num>=1 a thunk-only globalisation scheme is used, which generated global address only for thunks. The latter case may lose sharing of data but has a reduced overhead in packing graph structures and maintaining internal tables of global addresses.