Using GPUs can speed up code in a very impressive way. This is one of the directions we want to pursue for OpenSees.

Simple setup

We have seen from the profiling with Allinea that the function in which OpenSees spends most of its time is dgemm (the matrix-matrix multiplication routine provided by BLAS). It seems reasonable to assume that using a GPU implementation of BLAS may improve performance.

Based on previous experience, it is very easy to replace the standard BLAS implementation with the one provided by NVIDIA at runtime.
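
As a sketch of how such a runtime replacement can work (assuming NVIDIA's NVBLAS drop-in library, which intercepts Level-3 BLAS calls such as dgemm and routes them to cuBLAS; this may not be the exact mechanism used in the test below), a dynamically linked program can be pointed at the GPU simply by preloading the library. A minimal C++ test calling the standard Fortran BLAS symbol:

    // Minimal dgemm test. Built against the reference BLAS:
    //   g++ test_dgemm.cpp -o test_dgemm -lblas
    // Hypothetical GPU redirection via NVBLAS (assumed setup, not
    // necessarily the one used for the measurements below):
    //   NVBLAS_CONFIG_FILE=nvblas.conf LD_PRELOAD=libnvblas.so ./test_dgemm
    #include <vector>
    #include <cstdio>

    // Fortran BLAS symbol; a preloaded NVBLAS intercepts this same symbol.
    extern "C" void dgemm_(const char* transa, const char* transb,
                           const int* m, const int* n, const int* k,
                           const double* alpha, const double* a, const int* lda,
                           const double* b, const int* ldb,
                           const double* beta, double* c, const int* ldc);

    int main() {
        const int n = 1024;
        std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);
        const double alpha = 1.0, beta = 0.0;
        // C = alpha*A*B + beta*C, column-major as in Fortran BLAS.
        dgemm_("N", "N", &n, &n, &n, &alpha, a.data(), &n, b.data(), &n,
               &beta, c.data(), &n);
        std::printf("c[0] = %f (expect %f)\n", c[0], 2.0 * n);
        return 0;
    }

Because the interception happens at the dynamic-linker level, OpenSees itself does not need to be recompiled.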

The 50K+ DOF Lamb's problem has been tested on Sung's machine, as it is equipped with a fairly powerful GPU (a GTX 970). Here I compare the runtime for this problem on 1, 2 and 4 cores, with and without the GPU.

Cores   Execution time (s)   Execution time with GPU (s)
1       308                  305
2       275                  269
4       512                  509

We notice that there is practically no difference between the execution times with and without the GPU.

I would tend to think that this is due to the memory access pattern seen in the profiling with Allinea MAP. This is a well-known limitation of GPU offloading: copying the data from main memory to the GPU, and the results back, is very slow relative to the computation itself.
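
To make that limitation concrete, here is a small hedged sketch (CUDA C++ with cuBLAS; not taken from the test setup above) that times the host-to-device copies separately from the multiplication itself. For matrices of the size a BLAS interception layer typically sees, the PCIe transfers can take time comparable to the dgemm they feed, which would explain why offloading brings no net gain:

    // Sketch: compare PCIe transfer time with GPU dgemm time.
    // Build: nvcc transfer_vs_compute.cu -lcublas -o transfer_vs_compute
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int n = 2048;
        const size_t bytes = (size_t)n * n * sizeof(double);
        std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0), hC(n * n);

        double *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

        cublasHandle_t handle;
        cublasCreate(&handle);

        cudaEvent_t t0, t1, t2;
        cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

        cudaEventRecord(t0);
        // Host-to-device copies: the traffic an interception layer
        // pays on every offloaded call.
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);

        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cudaEventRecord(t2);
        cudaEventSynchronize(t2);

        float msCopy = 0, msGemm = 0;
        cudaEventElapsedTime(&msCopy, t0, t1);
        cudaEventElapsedTime(&msGemm, t1, t2);
        std::printf("copy: %.1f ms, dgemm: %.1f ms\n", msCopy, msGemm);

        cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }

Note that the timings are rough (the first cuBLAS call also carries some one-off initialization cost), but the relative magnitudes are what matter here.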

We will do some more investigation on this matter at a later stage.

 


1 Comment

  1. Dear Daniel,

    I am Fanjie Luo, a civil engineering student interested in HPC and numerical simulation.

    I think the problem is the double-precision performance of the GPU. The GTX 970 is a gaming card with high floating-point performance: 3.92 TFLOPS, but that is its single-precision (FP32) figure. Its double-precision (FP64) performance, which is what most scientific computation, including OpenSees, relies on, is only 1/32 of the single-precision figure, i.e. about 122.5 GFLOPS. (https://www.techpowerup.com/gpu-specs/geforce-gtx-970.c2620)

    The maximum theoretical CPU performance (I assumed an Intel i7-3770) is 3.4 GHz/core * 4 cores * 8 FP64 operations/cycle = 108.8 GFLOPS (FP64). In practice I usually get 80-100 GFLOPS, which is only slightly below the gaming GPU's double-precision performance. If you used a suitable compiler such as the Intel C++ compiler, or linked against the Intel Math Kernel Library, all the floating-point units of the CPU would be used automatically even with only one core specified. If too many threads are used, performance drops because of message passing between cores, but I don't think this is the main reason the performance dropped so sharply when four cores were used.

    If a compute-oriented GPU, such as an NVIDIA Tesla or an AMD Radeon compute card, were used, the double-precision performance would be 1/4 to 1/2 of the single-precision performance, and I think the computation time would drop sharply. Sorry, I haven't had a chance to prove it.

    Please tell me if I got anything wrong.

    Thank you very much.


    Yours Sincerely,

    Fanjie