Using OpenMP threads, with or without MPI

If you are using a program such as WRF compiled for MPI+threading (dmpar + smpar), you might find that it runs slower than it could because the processes overload the compute nodes they are assigned to.  The symptom is that llstatus reports a very high load for the nodes you're running on, up to 10 times what the load should be (32 for a fully-loaded Power 755 compute node).

The issue is that you are not controlling the number of threads that each WRF process runs, and the default is for each process to spawn 32 threads (32 being the number of cores in our Power 755 example).  To fix it, add these lines to your LoadLeveler script:

OMP_NUM_THREADS=4
export OMP_NUM_THREADS

... then each WRF process will spawn 4 threads.  You can experiment with the LoadLeveler tasks_per_node keyword, which sets the number of processes on each node, and the OMP_NUM_THREADS variable, which sets the number of threads per process, to get the optimum loading of each node (as noted above, the product of the two should equal 32 in this example; see the example job file below).  In any case, be sure to add this line to your wrf.ll script:

# @ node_usage = not_shared

... otherwise your job could be slowed down by other users running tasks on the node you're using, since LoadLeveler does not take the extra threads into account when it places other work on the node.
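
Putting these pieces together, a wrf.ll job file might look something like the minimal sketch below. The node count, tasks_per_node value and output file names are only illustrative, the class, network and wall-clock keywords your site requires are omitted, and the poe launch line is an assumption that depends on how WRF is started at your site.

# @ job_type       = parallel
# @ node           = 2
# @ tasks_per_node = 8
# @ node_usage     = not_shared
# @ output         = wrf.$(jobid).out
# @ error          = wrf.$(jobid).err
# @ queue

# 8 MPI tasks per node x 4 OpenMP threads each = 32, one per core
OMP_NUM_THREADS=4
export OMP_NUM_THREADS

poe ./wrf.exe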

Example of OpenMP on AIX (no MPI)

1. Compile the program using "xlc_r -qsmp" (the _r compiler invocation links the thread-safe runtime, and -qsmp enables the SMP/OpenMP directives).

If you compile it without "-qsmp", you get a single-threaded version.
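
The source of the program isn't reproduced on this page; a minimal OpenMP "hello" program in C that produces the output shown below might look like this sketch (not necessarily the exact code that was used):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each thread prints its own id and the total number of threads. */
    #pragma omp parallel
    {
        printf("Hello from thread %d out of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}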

2. Run the multi-threaded program interactively:

l3n01-c:hello% export OMP_NUM_THREADS=4
l3n01-c:hello% ./a.out
Hello from thread 0 out of 4
Hello from thread 3 out of 4
Hello from thread 2 out of 4
Hello from thread 1 out of 4

3. Run the program from LoadLeveler:

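The job file itself isn't shown here; a minimal serial LoadLeveler job file for this test might look like the sketch below (the output and error file names are placeholders):

# @ job_type   = serial
# @ node_usage = not_shared
# @ output     = hello.$(jobid).out
# @ error      = hello.$(jobid).err
# @ queue

# No OMP_NUM_THREADS set, so the OpenMP default applies
./a.out
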
NB: without setting OMP_NUM_THREADS, I got the default, which is to use all available CPUs as the number of threads (8 in this case):

Hello from thread 7 out of 8
Hello from thread 0 out of 8
Hello from thread 5 out of 8
Hello from thread 4 out of 8
Hello from thread 1 out of 8
Hello from thread 2 out of 8
Hello from thread 3 out of 8
Hello from thread 6 out of 8

4. I then added "export OMP_NUM_THREADS=4" to the LoadLeveler job file and got 4 threads:

Hello from thread 0 out of 4
Hello from thread 3 out of 4
Hello from thread 2 out of 4
Hello from thread 1 out of 4
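
For reference, the only change for step 4 is the extra export in the script part of the job file; the # @ keyword lines are unchanged, and the export just has to appear before a.out is run:

export OMP_NUM_THREADS=4
./a.out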
