...
CHECKPOINTING & SPLITTING A BIG SLURM
Checkpointing
Checkpointing is needed for IM_calculation because jobs are large and running time on Kupe is limited. We therefore implemented checkpointing to track the progress of an im_calculation job and to resume from the point where slurm interrupted it.
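The idea can be illustrated with a minimal sketch. This is not the code in check_point.py: the completion marker (a non-empty `<run_name>.csv` file in the run's output directory), the directory layout and the function names are all assumptions made for illustration only.

```python
import os


def is_finished(output_dir, run_name):
    """Return True if a non-empty summary CSV already exists for this run.

    Hypothetical completion marker: <output_dir>/<run_name>.csv
    """
    summary = os.path.join(output_dir, run_name + ".csv")
    return os.path.isfile(summary) and os.path.getsize(summary) > 0


def remaining_runs(runs, output_root):
    """Keep only the runs that still need to be calculated.

    Each run is assumed to write to <output_root>/<run_name>/
    (hypothetical layout, not taken from the template).
    """
    return [
        name for name in runs
        if not is_finished(os.path.join(output_root, name), name)
    ]
```

When a job is resubmitted after an interruption, only the runs returned by `remaining_runs` would need to be written into the new slurm script.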
...
Code Block |
---|
$ git clone https://github.com/ucgmsim/IM_calculation.git |
$ cd IM_calculation |
$ git checkout checkpoint |
Open im_calc_sl.template and change the IMPATH variable (line 22) to the path where you cloned the git repository.
...
Note: the checkpointing code relies on the input/output directory structure specified in the im_calc_al.template in the checkpoint branch. A directory structure that does not match will cause a runtime error. A quick fix is to modify the template to suit your own directory structure.
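Before submitting a job it can help to check that the input files are where the code expects them. The sketch below assumes the simulation layout `<runs_dir>/<fault>/BB/<config>/<realisation>/Acc/BB.bin`, which is inferred from a single example path on this page rather than read from the template; the function names are hypothetical.

```python
import glob
import os


def find_bb_bins(runs_dir):
    """List BB.bin files that match the assumed layout under runs_dir.

    Assumed layout (inferred from one example path, not from the template):
    <runs_dir>/<fault>/BB/<config>/<realisation>/Acc/BB.bin
    """
    pattern = os.path.join(runs_dir, "*", "BB", "*", "*", "Acc", "BB.bin")
    return sorted(glob.glob(pattern))


def faults_without_input(runs_dir):
    """Return fault directories under runs_dir that have no matching BB.bin."""
    found = {
        os.path.relpath(path, runs_dir).split(os.sep)[0]
        for path in find_bb_bins(runs_dir)
    }
    faults = [
        d for d in sorted(os.listdir(runs_dir))
        if os.path.isdir(os.path.join(runs_dir, d))
    ]
    return [f for f in faults if f not in found]
```

An empty list from `faults_without_input` means every fault directory has at least one BB.bin in the assumed location. If your structure differs, adjust the pattern or the template accordingly.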
Example:
(1) Simulation
Input/output structure defined in im_calc_al.template
...
Actual input data structure:
The input binary file is under:
Code Block |
---|
/nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs/BlueMtn/BB/Cant1D_v3-midQ_OneRay_hfnp2mm+_rvf0p8_sd50_k0p045/BlueMtn_HYP28-31_S1514/Acc/BB.bin |
The output IM_calc folder is under:
(2) Observed:
Input/output structure defined in im_calc_al.template
Actual input data structure:
The output IM_calc folder is under:
Sample command to run checkpointing (use generate_sl2.py):
Code Block |
---|
$ python generate_sl2.py -s /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs -srf /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/Runs/test_srfs/ -ll /scale_akl_nobackup/filesets/transit/nesi00213/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.ll -np 80 -o .~/test_obs/IMCalcExample /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/test_check_point/ > test.sl |
...
test.sl must be run from '/nesi/nobackup/nesi00213/tmp/auto_preproc'; otherwise slurm cannot find the machine.env file specified by the test.sl script:
Splitting a big slurm
Splitting a big slurm script into several smaller slurm scripts is needed because of the maximum number of lines allowed in a slurm script on Kupe.
Still in the checkpoint branch, run the generate_sl3.py script, which uses both checkpointing and splitting. The -ml argument specifies the maximum number of lines of python calls to calculate_ims.py/calculate_rrups.py per slurm script. Header and footer lines, such as #SBATCH --time=15:30:00 and date, are not counted.
For example, if the maximum number of lines allowed in a slurm script is 1000 and your header and footer together take 30 lines, then the number n that you pass to -ml should satisfy 0 < n <= 970, e.g. -ml 967.
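The bound is simply the line limit minus the header and footer lines. A small sketch of that arithmetic follows; the function name and the numbers in the usage line are illustrative and are not part of generate_sl3.py.

```python
def max_ml(max_script_lines, header_footer_lines):
    """Largest value that can be passed to -ml without exceeding the limit."""
    limit = max_script_lines - header_footer_lines
    if limit <= 0:
        raise ValueError("header and footer already fill the whole script")
    return limit


# Illustrative numbers: a 500-line limit with 20 header/footer lines.
print(max_ml(500, 20))  # 480
```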
Example:
We have 250 simulation dirs to run. By specifying -ml 100 (100 python calls to calculate_ims.py per slurm script), we expect 3 sim slurm scripts to be output (1-100, 101-200, 201-250).
We have 3 observed dirs to run. By specifying -ml 100 (100 python calls to calculate_ims.py per slurm script), we expect 1 observed slurm script to be output.
We have 61 rrup files to run. By specifying -ml 100 (100 python calls to calculate_rrups.py per slurm script), we expect 1 rrup slurm script to be output.
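The expected script counts follow from chunking the list of python calls into groups of at most -ml lines. The sketch below shows that chunking under stated assumptions: the function names and the placeholder command lines are hypothetical, and this is not the implementation used by generate_sl3.py.

```python
def split_calls(calls, ml):
    """Split a list of command lines into chunks of at most ml lines each."""
    if ml <= 0:
        raise ValueError("ml must be a positive integer")
    return [calls[i:i + ml] for i in range(0, len(calls), ml)]


def num_scripts(n_calls, ml):
    """Number of slurm scripts needed for n_calls lines (ceiling division)."""
    return -(-n_calls // ml)


# Placeholder command lines standing in for the real calculate_ims.py calls.
calls = ["python calculate_ims.py run_{}".format(i) for i in range(1, 251)]
chunks = split_calls(calls, 100)
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [100, 100, 50]
print(num_scripts(61, 100))      # 1
```

Each chunk would then be written between the header and the footer of its own slurm script.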
Command:
Code Block |
---|
python generate_sl3.py -s /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs -srf /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/Runs/test_srfs/ -ll /scale_akl_nobackup/filesets/transit/nesi00213/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.ll -np 80 -o .~/test_obs/IMCalcExample/ /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/test_check_point/ -ml 100 |
Output:
Todo: tidy up the check_point.py, generate_sl[*|1-2].py
...