Initial profiling

For the initial profiling we are using a POWER7 node with 32 cores (up to 128 hardware threads with SMT4). We have installed and compiled OpenSees on that node. As a profiling tool, we have chosen to instrument our code using TAU (http://www.cs.uoregon.edu/research/tau/home.php).
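For reference, TAU offers both automatic compile-time instrumentation and a manual API. The following is a minimal, illustrative sketch of the manual API (the function name expensiveRegion is made up, and the file must be built with TAU's compiler wrapper so the macros expand to actual timers); it is not how the full OpenSees build was instrumented, but it can be useful for spot-checking individual regions:

  // Illustrative only: manual TAU instrumentation of a single region.
  // Assumes TAU is installed and the file is compiled via TAU's wrapper
  // (e.g. so that the profiling macros are enabled).
  #include <TAU.h>

  void expensiveRegion() {
    // Scoped timer: starts here, stops when the function returns.
    TAU_PROFILE("expensiveRegion", "void ()", TAU_DEFAULT);
    // ... work to be measured ...
  }

  int main(int argc, char **argv) {
    TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(0);  // for a single-process run; MPI sets this itself
    expensiveRegion();
    return 0;
  }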

The Tcl test script used for the profiling was provided by Seokho. It is a reduced version of the Heathcote Valley simulation; we specifically asked for an input problem that can run in a reasonable time (~15 minutes) on several cores. This example was executed on 8, 16, and 32 cores.

Execution times

The execution times for the code are summarized in the following table:

Number of cores    Execution time (s)
8                  6219
16                 3785
32                 2805

Note: because the code has been instrumented, absolute performance is significantly affected; these times are mainly useful for comparing scaling behavior.

We note that, already for this problem, scaling from 16 to 32 cores is poor: doubling the core count reduces the runtime by only about 26%.
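Concretely, the speedups implied by the table relative to the 8-core run (where ideal scaling would give 2x at 16 cores and 4x at 32) are:

  S(16) = 6219 / 3785 ≈ 1.64   (parallel efficiency ≈ 82%)
  S(32) = 6219 / 2805 ≈ 2.22   (parallel efficiency ≈ 55%)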

Profiling results

The functions that consume the most time for the given example are shown next for 8, 16, and 32 cores respectively.

At this point, it would be useful to have someone with more insight into the numerical method and the OpenSees code explain the most-used functions, so that we can better understand what is happening in the code. We note that MPI communication takes a large fraction of the total time in the 32-core case.

The following communication matrix shows that most communication happens between process 0 and all other processes, and vice versa. Other inter-process communication is negligible relative to the total communication time.

By inspecting the code, we have found a number of blocking MPI communications. This blocking communication is also performed in a point-to-point manner between process 0 and every other process, instead of using a better-performing collective operation (see the sketch after the excerpt below).

SRC/system_of_eqn/linearSOE/diagonal/DistributedDiagonalSolver.cpp
  ...
  // use P0 to gather & send back out
  //
  if (numShared != 0) {
    if (processID != 0) {
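      // Non-root processes: blocking send of the local shared values to
      // P0, then a blocking receive of the accumulated result.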
      Channel *theChannel = theChannels[0];
      theChannel->sendVector(0, 0, *vectShared);
      theChannel->recvVector(0, 0, *vectShared);
    } 
    else {
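      // P0: receive each process's contribution in turn (serialized,
      // blocking), accumulate into vectShared, then send the summed
      // vector back to every process one at a time.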
      static Vector otherShared(1);
      otherShared.resize(2*numShared);
      for (int i=0; i<numChannels; i++) {
        Channel *theChannel = theChannels[i];
        theChannel->recvVector(0, 0, otherShared);
        *vectShared += otherShared;
      }
      for (int i=0; i<numChannels; i++) {
        Channel *theChannel = theChannels[i];
        theChannel->sendVector(0, 0, *vectShared);
      }
    }
  }
  ...
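The gather-sum-broadcast pattern above is exactly what MPI's allreduce collective provides. The following standalone sketch (not OpenSees code; the array name shared, its size, and its contents are made up for illustration, and MPI_COMM_WORLD stands in for the channel objects) shows how the same result could be obtained with a single collective call:

  // Minimal sketch: the per-channel recv/accumulate/send loop above,
  // expressed as one collective operation.
  #include <mpi.h>
  #include <vector>
  #include <cstdio>

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Hypothetical shared vector: each rank contributes its own values.
    std::vector<double> shared(4, static_cast<double>(rank + 1));

    // Entry i is summed across all ranks and the result is left on every
    // rank -- without serializing all traffic through process 0.
    MPI_Allreduce(MPI_IN_PLACE, shared.data(),
                  static_cast<int>(shared.size()),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
      std::printf("shared[0] after allreduce = %g\n", shared[0]);

    MPI_Finalize();
    return 0;
  }

Note that an actual change in OpenSees would have to go through its Channel abstraction rather than raw MPI calls, so this snippet only illustrates the communication pattern, not a drop-in fix.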

To illustrate how this may affect performance, we compare how processes 0, 2, and 31 spend their time.
