...

CHECKPOINTING & SPLITTING A BIG SLURM

Responsible scripts

  1. slurm header template: https://github.com/ucgmsim/slurm_gm_workflow/blob/master/templates/slurm_header.cfg
  2. im_calc_slurm template: https://github.com/ucgmsim/slurm_gm_workflow/blob/master/templates/im_calc_sl.template
  3. submit_hf.py that generates the slurm files: https://github.com/ucgmsim/slurm_gm_workflow/blob/master/scripts/submit_hf.py
  4. checkpointing functions: https://github.com/ucgmsim/slurm_gm_workflow/blob/master/scripts/checkpoint.py

Checkpointing

Checkpointing is needed for IM_calculation because the jobs are large and running time on Kupe is limited. We therefore implemented checkpointing to track the current progress of an im_calculation job and carry on from where the job was interrupted by slurm.

Note: the checkpointing code relies on the input/output directory structure specified in im_calc_sl.template in the checkpoint branch. Failure to match the directory structure will result in a runtime error. A quick fix is to modify the template to suit your own directory structure.
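
The idea of resuming from the point of interruption can be sketched as follows. This is only a minimal illustration, not the code in checkpoint.py: the function names are hypothetical, and the layout it checks (one <run_name>/<run_name>.csv result file per run inside the output directory) is an assumption made for the example, so adjust it to the directory structure your template actually produces.

```python
from pathlib import Path


def is_completed(output_dir, run_name):
    """Treat a run as finished when its result file exists and is non-empty.

    The <output_dir>/<run_name>/<run_name>.csv layout is an assumption
    for this sketch, not necessarily what the real workflow writes.
    """
    result = Path(output_dir) / run_name / f"{run_name}.csv"
    return result.is_file() and result.stat().st_size > 0


def remaining_runs(output_dir, run_names):
    """Return the runs that still need calculating, in their original order."""
    return [name for name in run_names if not is_completed(output_dir, name)]
```

When a job is resubmitted after being interrupted, only the runs returned by remaining_runs need to be written into the new slurm script, so finished work is not repeated.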

...

A big slurm script has to be split into several smaller slurm scripts because Kupe limits the number of lines allowed in a single slurm script.

Inside submit_imcalc.py, the -ml argument specifies the maximum number of lines of Python calls to calculate_ims.py/calculate_rrups.py in each slurm script. Header and footer lines such as '#SBATCH --time=15:30:00' and 'date' are NOT included in this count.
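
The splitting rule for -ml can be sketched as below. This is a hedged illustration, not the workflow's actual implementation: split_commands and build_scripts are hypothetical names, and the header and footer are passed in as plain strings. The point it shows is that the limit applies to the command lines only, with the header and footer added to every chunk afterwards.

```python
def split_commands(commands, max_lines):
    """Split a list of command lines into chunks of at most max_lines each."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    return [commands[i:i + max_lines] for i in range(0, len(commands), max_lines)]


def build_scripts(header, commands, footer, max_lines):
    """Wrap every chunk of commands with the same header and footer.

    Header and footer lines are not counted against max_lines.
    """
    return [
        "\n".join([header, *chunk, footer])
        for chunk in split_commands(commands, max_lines)
    ]
```

For example, 2500 command lines with a limit of 1000 would give three scripts holding 1000, 1000 and 500 commands, each with its own copy of the header and footer.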

...

Command to run checkpointing and splitting:

 python submit_imcalc.py -obs ~/test_obs/IMCalcExample/ -sim runs/Runs -srf /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_exclude_1k_batch_6/Data/Sources -ll /scale_akl_nobackup/filesets/transit/nesi00213/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.ll -o ~/rrup_out -ml 1001000 -e -s -i OtaraWest02_HYP01-21_S1244 Pahiatua_HYP01-26_S1244 -t 24:00:00

Output:

To submit the slurm script:

...