5.6. Profiling Parallel and Concurrent Programs

Combining -threaded and -prof is perfectly fine, and indeed it is possible to profile a program running on multiple processors with the +RTS -N option.[12]

Some caveats apply, however. In the current implementation, a profiled program is likely to scale much less well than the unprofiled program, because the profiling implementation uses some shared data structures which require locking in the runtime system. Furthermore, the memory allocation statistics collected by the profiled program are stored in shared memory but not locked (for speed), which means that these figures might be inaccurate for parallel programs.

We strongly recommend that you use -fno-prof-count-entries when compiling a program to be profiled on multiple cores, because the entry counts are also stored in shared memory, and continuously updating them on multiple cores is extremely slow.

We also recommend using ThreadScope for profiling parallel programs; it offers a GUI for visualising parallel execution, and is complementary to the time and space profiling features provided with GHC.