Combining -threaded and -prof is perfectly fine, and indeed it is
possible to profile a program running on multiple processors
with the +RTS -N option.
Some caveats apply, however. In the current implementation, a profiled program is likely to scale much less well than the unprofiled program, because the profiling implementation uses some shared data structures which require locking in the runtime system. Furthermore, the memory allocation statistics collected by the profiled program are stored in shared memory but not locked (for speed), which means that these figures might be inaccurate for parallel programs.
We strongly recommend that you use
-fno-prof-count-entries when compiling a
program to be profiled on multiple cores, because the entry
counts are also stored in shared memory, and continuously
updating them on multiple cores is extremely slow.
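As a rough sketch of how these options fit together, the commands below build and run a profiled program on several cores. The file name Main.hs and the core count of 4 are placeholders chosen for illustration. The -fprof-auto and -rtsopts flags are not discussed above and are assumed here only to obtain cost centres automatically and to allow RTS options on the command line.

```shell
# Build with the threaded runtime and profiling enabled, and turn off
# entry counting as recommended for multi-core profiling runs.
# Main.hs is a placeholder; -fprof-auto and -rtsopts are assumptions
# added for this sketch (automatic cost centres, RTS options allowed).
ghc -threaded -prof -fprof-auto -fno-prof-count-entries -rtsopts Main.hs

# Run on 4 cores (an arbitrary choice) and write a time profile
# to Main.prof via the -p RTS option.
./Main +RTS -N4 -p -RTS
```

With entry counting turned off, the entry counts in the resulting profile carry no information. The time and allocation columns are still reported, subject to the accuracy caveat above.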
We also recommend using ThreadScope for profiling parallel programs; it offers a GUI for visualising parallel execution, and is complementary to the time and space profiling features provided with GHC.