HF CUDA conversion is a project to convert the high frequency (HF) simulation code (Fortran) to utilise the GPU.

 Initial Partial Comparisons

What is implemented?

Everything up to and including the stoc_f function, which itself takes ~40% of the CPU time in the CPU version. This includes random number pre-processing, the forward FFT, ray variation/amplification? (gf_amp_tt), applying factors to the FFT, and a1, a2, a3, frank.

What is missing?

Site amplification, the reverse FFT, radv/radfreq, highcor_f, and storing results back into the timeseries.

Results

These figures are very rough, as the code is currently a hybrid of CPU and GPU and parts are commented out for testing.

  • CPU: ~17.5 s
  • GPU: ~3 s
  • Common (pre-difference) time: ~1 s
  • GPU speedup for the code that is on the GPU: ~8x ((17.5 - 1) / (3 - 1))

An 8x speedup over the optimised CPU code means roughly 24x over the code we have been using (the optimised CPU code is itself about 3x faster than the old code).

Notes

Findings:

  • Parallelising over subfaults requires too much memory (e.g. it multiplies the running memory usage by 4070, the number of subfaults).
  • On-chip memory is also too limiting to work on larger datasets per thread.
  • Some parts can be run over all subfaults; others should be run over FFT elements (see the sketch after this list).
  • CUDA forces better architecture (fewer variables carried through, smaller functions that each work on a single specific problem).
  • Some of the GPU code doesn't affect speed; removing the FFT wouldn't make a noticeable difference, so there appear to be a few other bottlenecks.
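As a rough illustration of the per-FFT-element parallelism, below is a minimal CUDA sketch of a kernel that applies per-bin factors to one subfault's spectrum. The names (apply_factors, spec, fac, nfft) are hypothetical and do not correspond to the actual HF variables.

```cuda
// Hypothetical kernel parallelised over FFT elements for one subfault.
__global__ void apply_factors(float2 *spec, const float *fac, int nfft)
{
    // Grid-stride loop: each thread handles several FFT bins if needed.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < nfft;
         i += gridDim.x * blockDim.x) {
        spec[i].x *= fac[i];
        spec[i].y *= fac[i];
    }
}

// Launch for a single subfault's spectrum:
//   apply_factors<<<(nfft + 255) / 256, 256>>>(d_spec, d_fac, nfft);
```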

Code can be optimised much better:

  • atomicAdd is currently used, which serialises the summation. Using a shared-memory reduction within each block would cut the serialised global accesses by up to the block size (1024); see the sketch after this list.
  • Reduce duplicate calculation between threads.
  • Investigate the effect of the number of kernels and of global memory speed.
  • Optimise memory access patterns.
  • The code will probably be slightly faster overall even if all that is done is moving the remaining code to the GPU.
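For the first point, here is a minimal sketch of the standard shared-memory block reduction, assuming a block size of 256 (with 1024-thread blocks the reduction in serial access would be the full 1024x mentioned above). Names are illustrative, not the actual HF kernels.

```cuda
// Block-level shared-memory reduction replacing per-thread atomicAdd with
// one atomicAdd per block.
__global__ void sum_reduce(const float *in, float *out, int n)
{
    __shared__ float cache[256];          // one slot per thread; assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    cache[tid] = (i < n) ? in[i] : 0.0f;  // load, zero-padding past the end
    __syncthreads();

    // Tree reduction within the block: log2(256) = 8 steps instead of
    // 256 serialised atomic additions per block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    // Only thread 0 of each block touches global memory atomically.
    if (tid == 0)
        atomicAdd(out, cache[0]);
}

// Launch (out must be zero-initialised on the device first):
//   sum_reduce<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```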

Summary

The GPU version is already faster without much optimisation or prior CUDA experience.

 Running to the End and Start of Verification

The GPU now runs the computation to completion and returns only the final timeseries, which the CPU stores to disk.
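A sketch of the host-side pattern this implies: all intermediate arrays stay resident on the device, and only the final timeseries is copied back for output. The names (d_timeseries, nt, ncomp, write_timeseries) are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Copy only the final timeseries back to the host for the CPU to write to
// disk; everything else stays on the device.
void fetch_and_store(const float *d_timeseries, int nt, int ncomp)
{
    size_t bytes = (size_t)nt * ncomp * sizeof(float);
    float *h_timeseries = (float *)malloc(bytes);

    // Single device-to-host transfer at the very end of the computation.
    cudaMemcpy(h_timeseries, d_timeseries, bytes, cudaMemcpyDeviceToHost);

    // CPU-side output (placeholder for the existing writer routine):
    // write_timeseries(h_timeseries, nt, ncomp);

    free(h_timeseries);
}
```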

Sample Output

[Figure: CPU (blue) vs GPU (orange) waveforms, showing the x, y, z components sequentially for each timestep; pre-verification plot above, plot after initial verification below.]

GOOD: the start time and the waveform shapes match.

NOT SO GOOD: the amplitude is low in the middle and high at the end.

These are expected to improve with continued verification.

Verification

Variables verified:

  • a1 + 40%: pass (cc, smt, omg, dfr, zz, vsh(ksrc), dlm); fail (omgc, rvfx, based on the rvsig1 random)
  • frank: pass (bigC, fr2); fail (fc2, rvfx, rvsig1 random)
  • a2: pass
  • w: pass
  • fft * w: within an order of magnitude (random)
  • site_amp factors: pass (2->nfold, nfold)
  • qfexp: pass
  • dy -> dw: fixed
  • ddx, ddy: fixed
  • dst, zet: fixed
  • qbar: fixed
  • rpath: fixed
  • post-FFT taper arg: pass
  • rdx: fail for the first 1/2; pass (pa, cmp); fail (rdsva/thaa ray 1/2); pass (thsu); fail (th conditions); fixed

After fixing these issues, the waveform improved (shown above).

Verification is not complete.

Timing

The runtime went from 3 s to 4.5 s once a bug with passing a non-existent variable was fixed, but this is now the complete code, which the improved CPU version takes about 37 s to run (so the GPU is still about 8x faster than the improved CPU code). No optimisation effort has been put in beyond creating an initial GPU version that runs and fits the CUDA architecture.

 Verification Complete, Optimisations

The FFT was added for all components and rays. A bug was found that had been giving much smaller amplitudes on the GPU.
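One way to run the FFTs for all components and rays in a single call is a batched cuFFT plan. The sketch below assumes single-precision complex-to-complex transforms stored contiguously, which may not match the actual HF data layout; nfft and ncomp_rays are illustrative parameters.

```cuda
#include <cufft.h>

// Create one plan that batches the forward FFTs for all components/rays of
// a subfault, with batch k starting at offset k * nfft.
cufftHandle make_batched_plan(int nfft, int ncomp_rays)
{
    cufftHandle plan;
    int n[1] = { nfft };                 // 1-D transform length

    cufftPlanMany(&plan, 1, n,
                  NULL, 1, nfft,         // input:  packed layout
                  NULL, 1, nfft,         // output: same layout (in place)
                  CUFFT_C2C, ncomp_rays);
    return plan;
}

// Usage: one call transforms every batch at once, in place.
//   cufftExecC2C(plan, d_spec, d_spec, CUFFT_FORWARD);
//   cufftDestroy(plan);
```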

<old screenshots lost, look just about superimposed except for random changes>

Notes

Different GPUs produce the same random number sequences.

Computation is a small part of the GPU running time; memory allocation and kernel launch overheads take a large portion. The next step would likely be reducing the current ~12 kernel launches per subfault (see the sketch below).
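One possible way to cut the per-subfault launch count is to put the subfault index on the second grid dimension, so that a single launch covers a chunk of subfaults (memory permitting). The sketch below reuses the hypothetical apply_factors example; the names do not correspond to the real HF arrays.

```cuda
// Hypothetical kernel covering a chunk of subfaults in one launch: the x
// dimension indexes FFT bins, the y dimension indexes subfaults.
__global__ void apply_factors_all(float2 *spec, const float *fac,
                                  int nfft, int nsub)
{
    int i   = blockIdx.x * blockDim.x + threadIdx.x;   // FFT bin
    int sub = blockIdx.y;                              // subfault index

    if (i < nfft && sub < nsub) {
        size_t idx = (size_t)sub * nfft + i;
        spec[idx].x *= fac[idx];
        spec[idx].y *= fac[idx];
    }
}

// One launch per chunk of subfaults (memory permitting):
//   dim3 block(256);
//   dim3 grid((nfft + 255) / 256, nsub_chunk);
//   apply_factors_all<<<grid, block>>>(d_spec, d_fac, nfft, nsub_chunk);
```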

Timing

Overall speed was also increased by another 3x. The GPU version is now roughly 100-120x faster than the original CPU code (though the hypocentre CPUs are slower and the GPU used is faster, which skews the comparison).

 Stochastic Verification

An automated platform for running HF with different seeds and comparing the results was created. Here are the results for 5.4.5 vs the CUDA equivalent, using 2000 seeds for each version.

 Using the Nsight profiler

Above is a close-up of the FFT-based tasks (the bulk of the work) for one subfault.

  • The red regions have been removed. Arrays defined as array(:) have been converted to array(*), i.e. plain pointers rather than arrays with descriptors.
  • The subfault calculations were kept separate because of memory usage, yet each one is trivial (in this case there are 4070 subfaults).
  • Working on more subfaults at once could reduce the gaps (inefficiencies) between kernels, currently about 25%; see the streams sketch after this list.
  • The time scale is important here.
  • cudaDeviceSynchronize() doesn't seem to be causing a slowdown (kernels are at ~100% for most of the runtime).
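One hedged sketch of the "work on more at once" idea: enqueue each subfault's kernel sequence on one of a small pool of CUDA streams so that independent subfaults can overlap and fill each other's gaps. The kernel names are placeholders, and this assumes the per-subfault work items are independent.

```cuda
#include <cuda_runtime.h>

#define NSTREAMS 4   // small pool of streams; value is illustrative

// Distribute per-subfault kernel sequences across a few streams.
void run_subfaults(int nsub)
{
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    for (int sub = 0; sub < nsub; ++sub) {
        cudaStream_t st = streams[sub % NSTREAMS];
        // The real per-subfault kernels would be launched here, e.g.:
        //   kernel_a<<<grid, block, 0, st>>>(...);
        //   kernel_b<<<grid, block, 0, st>>>(...);
        (void)st;   // placeholder so the sketch compiles on its own
    }

    cudaDeviceSynchronize();              // wait for all streams to finish
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamDestroy(streams[s]);
}
```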

Another profiler, Nsight Compute, can look inside individual kernels.

