...

Make sure you are in the test_calculate_ims folder, then run:  pytest -v -s test_calculate_ims.py

 

CHECKPOINTING & SPLITTING A BIG SLURM

Checkpointing

Checkpointing is needed for IM_calculation because jobs are large and running time on Kupe is limited. We therefore implemented checkpointing to track the progress of an im_calculation job and to resume from where the job was interrupted by the slurm time limit on Kupe.
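The idea can be illustrated with a minimal sketch. It is not the code in the checkpoint branch: the function name pending_jobs, the marker file name and the one-folder-per-simulation output layout are all assumptions made for illustration. A job is treated as finished once its output folder contains a marker file, so a resubmitted job only processes what is left.

```python
import os


def pending_jobs(sim_dirs, output_root, marker="done.marker"):
    """Return the simulation dirs that still need an IM calculation.

    Hypothetical convention: a job is finished when
    <output_root>/<sim name>/<marker> exists.
    """
    pending = []
    for sim_dir in sim_dirs:
        # normpath drops a trailing slash so basename gives the sim name
        name = os.path.basename(os.path.normpath(sim_dir))
        if not os.path.isfile(os.path.join(output_root, name, marker)):
            pending.append(sim_dir)
    return pending
```

Calling pending_jobs again after an interrupted run returns only the simulations without a marker, which is what lets the job carry on rather than start over.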

To run checkpointing, first git clone the IM_calculation repo, then check out the checkpoint branch:

Code Block
$ git clone https://github.com/ucgmsim/IM_calculation.git
$ cd IM_calculation
$ git checkout checkpoint

Now open im_calc_sl.template and change the IMPATH variable (line 22) to the location where you cloned the git repository:

Code Block
# open template
~/IM_calculation-[checkpoint]$ vim im_calc_sl.template
# modify the $IMPATH 
export IMPATH=/home/melody.zhu/IM_calculation

Note: the checkpointing code relies on the input/output directory structure specified in im_calc_sl.template in the checkpoint branch. Failing to match this directory structure will result in a runtime error. A quick fix is to modify the template to suit your own directory structure.

Example:

Input/output structure defined in im_calc_sl.template

Image Added

Actual input data structure:

Image Added

The input binary is under:

Code Block
/nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs/BlueMtn/BB/Cant1D_v3-midQ_OneRay_hfnp2mm+_rvf0p8_sd50_k0p045/BlueMtn_HYP28-31_S1514/Acc/BB.bin
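Before generating a slurm script, it can help to confirm that the input binaries sit where the template expects them. The sketch below is only a sanity check, not part of the repo. It assumes the layout <Runs>/<fault>/BB/<config>/<realisation>/Acc/BB.bin, inferred from the single example path above, and find_bb_bins is a hypothetical helper name.

```python
import glob
import os


def find_bb_bins(runs_dir):
    """List BB.bin files under runs_dir.

    Assumed layout (inferred from one example path, may not hold in general):
    <runs_dir>/<fault>/BB/<config>/<realisation>/Acc/BB.bin
    """
    pattern = os.path.join(runs_dir, "*", "BB", "*", "*", "Acc", "BB.bin")
    return sorted(glob.glob(pattern))
```

An empty result for a Runs folder that should contain simulations suggests the directory structure does not match the one assumed here.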

The output folder is under:

Image Added

Sample command to run checkpointing:

Code Block
$ python generate_sl2.py -s /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs -srf /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/Runs/test_srfs/ -ll /scale_akl_nobackup/filesets/transit/nesi00213/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.ll -np 80 -o . /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/test_check_point/ > test.sl

To submit the slurm script:

Code Block
$ cp test.sl /nesi/nobackup/nesi00213/tmp/auto_preproc
$ cd /nesi/nobackup/nesi00213/tmp/auto_preproc
$ sbatch test.sl

test.sl has to be submitted from '/nesi/nobackup/nesi00213/tmp/auto_preproc' because otherwise slurm cannot find the machine.env specified by the test.sl script.

Image Added

Splitting a big slurm

Splitting a big slurm script into several smaller ones is needed because of the maximum number of lines allowed in a slurm script on Kupe.

Still in the checkpoint branch, run the generate_sl3 script, which uses both checkpointing and splitting. The -ml argument specifies the maximum number of lines of Python calls to calculate_ims.py per slurm script. Header and footer lines such as #SBATCH --time=15:30:00, date, etc. are not counted.

For example, if the maximum number of lines allowed in a slurm script is 1000 and your header and footer together take 30 lines, then the number you pass to -ml should be at most 970 (1000 - 30), e.g. -ml 970.

Example:

We have 320 simulation dirs to run. By specifying 100 python calls per slurm script, we expect 4 slurm scripts to be output:

1-100, 101-200, 201-300, 301-320
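The split in this example can be reproduced with a short sketch. It is an illustration of the arithmetic only, not the logic in generate_sl3.py, and split_ranges is a hypothetical helper. It returns 1-based, inclusive, non-overlapping ranges of simulation dirs, one range per slurm script.

```python
def split_ranges(n_sims, calls_per_script):
    """Split n_sims python calls into 1-based inclusive ranges,
    with at most calls_per_script calls in each range."""
    if calls_per_script < 1:
        raise ValueError("calls_per_script must be at least 1")
    return [
        (start + 1, min(start + calls_per_script, n_sims))
        for start in range(0, n_sims, calls_per_script)
    ]


ranges = split_ranges(320, 100)
print(len(ranges), ranges)
```

For 320 simulation dirs and 100 calls per script this gives 4 ranges, with the last one holding the remaining 20 calls.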

Command:

Code Block
python generate_sl3.py -s /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs -srf /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/Runs/test_srfs/ -ll /scale_akl_nobackup/filesets/transit/nesi00213/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.ll -np 80 -o . /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/test_check_point/ -ml 100

Output:
Image Added

Todo: tidy up the check_point.py, generate_sl[*|1-2].py

TODO

  • Creation of semi-automatic slurm generation that will have all the calls to produce the results as needed.
  • Progress printing statements
  • Sim ASCII calculation - currently assumes ASCII file is in g for acceleration but this is not the case for sim
  • Rrup calculation on a smaller station list - currently when generating the slurm script it does the full grid even for stations outside the domain

...