...

CHECKPOINTING & SPLITTING A BIG SLURM

Checkpointing

IM_calculation jobs are large and the running time on Kupe is limited, so a job can be interrupted by slurm before it finishes. We therefore implemented checkpointing, which tracks the current progress of an IM_calculation job and lets a resubmitted job carry on from where it was interrupted.
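As a rough illustration of the idea (not the code in the checkpoint branch), a checkpointed run can skip every item whose output already exists and process only the rest. The function name pending_runs, the marker file name done.txt and the output layout below are all assumptions made for this sketch.

```python
import os


def pending_runs(run_names, output_root, done_marker="done.txt"):
    """Return the runs that still need processing.

    A run counts as finished when output_root/<run>/<done_marker> exists.
    The marker file name and the layout are hypothetical, chosen only
    to illustrate the skip-if-already-done pattern.
    """
    todo = []
    for name in run_names:
        marker = os.path.join(output_root, name, done_marker)
        if not os.path.isfile(marker):
            todo.append(name)
    return todo
```

Resubmitting the same job after an interruption then only touches the runs returned by pending_runs, so finished work is never repeated.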

...

Code Block
$ git clone https://github.com/ucgmsim/IM_calculation.git
$ cd IM_calculation
$ git checkout checkpoint

Open im_calc_sl.template and change the IMPATH variable (line 22) to the path where you cloned the git repository.

...

Note: the checkpointing code relies on the input/output directory structure specified in im_calc_sl.template in the checkpoint branch. If your directories do not match this structure, the job will fail with a runtime error. A quick fix is to modify the template to suit your own directory structure.

Example:

(1) Simulation

Input/output structure defined in im_calc_sl.template

...

Actual input data structure:

The input binary file is under:

Code Block
/nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs/BlueMtn/BB/Cant1D_v3-midQ_OneRay_hfnp2mm+_rvf0p8_sd50_k0p045/BlueMtn_HYP28-31_S1514/Acc/BB.bin
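Before generating the slurm script, it can help to check that your inputs match this layout. The sketch below assumes the pattern <fault>/BB/<model>/<realisation>/Acc/BB.bin under the Runs directory, which is inferred from the single example path above; find_bb_bins is a hypothetical helper and not part of IM_calculation.

```python
from pathlib import Path


def find_bb_bins(runs_dir):
    """List BB.bin files laid out as <fault>/BB/<model>/<realisation>/Acc/BB.bin.

    The pattern is an assumption based on one example path; adjust it
    if your directory structure differs.
    """
    return sorted(Path(runs_dir).glob("*/BB/*/*/Acc/BB.bin"))
```

An empty result for a Runs directory that should contain simulations suggests the layout does not match what the template expects.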

The output IM_calc folder is under:


(2) Observed

Input/output structure defined in im_calc_sl.template

[Image]

Actual input data structure:

[Image]

The output IM_calc folder is under:

[Image]


Sample command to run checkpointing (uses generate_sl2.py):

Code Block
$ python generate_sl2.py -s /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs -srf /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/Runs/test_srfs/ -ll /scale_akl_nobackup/filesets/transit/nesi00213/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.ll -np 80 -o ~/test_obs/IMCalcExample /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/test_check_point/ > test.sl

...

test.sl has to be run from '/nesi/nobackup/nesi00213/tmp/auto_preproc'; otherwise slurm cannot find the machine.env file specified by the test.sl script.

Splitting a big slurm

A big slurm script needs to be split into several smaller slurm scripts because of the maximum number of lines allowed in a slurm script on Kupe.

Still on the checkpoint branch, run the generate_sl3.py script, which uses both checkpointing and splitting. The -ml argument specifies the maximum number of lines of python calls to calculate_ims.py/calculate_rrups.py per slurm script. Header and footer lines such as #SBATCH --time=15:30:00, date, etc. are not counted.

For example, if the maximum number of lines allowed in a slurm script is 1000 and your header and footer together take 30 lines, then the number n that you pass to -ml should satisfy 0 < n <= 970 (1000 - 30), e.g. -ml 970.
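The upper bound for -ml is the script line limit minus the header and footer lines; any smaller positive value also fits. A minimal sketch of that arithmetic follows. max_ml is a hypothetical helper, not part of generate_sl3.py.

```python
def max_ml(max_script_lines, header_footer_lines):
    """Largest -ml value that keeps a slurm script within the line limit."""
    n = max_script_lines - header_footer_lines
    if n <= 0:
        raise ValueError("header and footer already fill the whole script")
    return n


print(max_ml(1000, 30))  # 970
```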

Example:

We have 250 simulation dirs to run. By specifying -ml 100 (100 python calls to calculate_ims.py per slurm script), we expect 3 sim slurm scripts to be output (1-100, 101-200, 201-250).

We have 3 observed dirs to run. By specifying -ml 100 (100 python calls to calculate_ims.py per slurm script), we expect 1 observed slurm script to be output.

We have 61 rrup files to run. By specifying -ml 100 (100 python calls to calculate_rrups.py per slurm script), we expect 1 rrup slurm script to be output.
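The expected script counts follow from cutting the list of python calls into chunks of at most -ml lines. The sketch below shows that chunking under the assumption that each call occupies one line; split_calls is a hypothetical name, and generate_sl3.py may implement the split differently.

```python
def split_calls(calls, max_lines):
    """Split a list of python call lines into chunks of at most max_lines."""
    if max_lines <= 0:
        raise ValueError("max_lines must be positive")
    return [calls[i:i + max_lines] for i in range(0, len(calls), max_lines)]


# 250 placeholder call lines standing in for the simulation dirs
sim_calls = ["call %d" % i for i in range(1, 251)]
chunks = split_calls(sim_calls, 100)
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [100, 100, 50]
```

The same function gives a single chunk for the 3 observed dirs and for the 61 rrup files, since both are below 100.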

Command:

Code Block
$ python generate_sl3.py -s /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs -srf /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/Runs/test_srfs/ -ll /scale_akl_nobackup/filesets/transit/nesi00213/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.ll -np 80 -o ~/test_obs/IMCalcExample/ /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/test_check_point/ -ml 100

Output:
[Image]

Todo: tidy up the check_point.py, generate_sl[*|1-2].py

...