...
Make sure you are in the test_calculate_ims folder, then run: pytest -v -s test_calculate_ims.py
CHECKPOINTING & SPLITTING A BIG SLURM
Checkpointing
Checkpointing is needed for IM_calculation because the jobs are large and the running time on Kupe is limited. We therefore implemented checkpointing to track the progress of an im_calculation job, so that it can carry on from where it was interrupted by the slurm time limit on Kupe.
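The idea of resuming an interrupted job can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the checkpoint branch: the marker file name `.im_calc_done` and the helper names `is_done`, `mark_done` and `pending` are hypothetical. Each output directory that finishes gets a marker, and a restarted job only processes directories without one.

```python
from pathlib import Path

# Hypothetical marker file name; the checkpoint branch may track progress differently.
DONE_MARKER = ".im_calc_done"


def is_done(output_dir):
    """Return True if a previous run already finished this output directory."""
    return (Path(output_dir) / DONE_MARKER).is_file()


def mark_done(output_dir):
    """Record that the calculation for this output directory completed."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / DONE_MARKER).touch()


def pending(output_dirs):
    """Keep only the directories that still need to be processed."""
    return [d for d in output_dirs if not is_done(d)]
```

A job that is killed by the time limit leaves markers only for the directories it completed, so calling `pending` on the same list after a restart returns just the remaining work.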
To run checkpointing, first git clone the IM_calculation repo, then check out the checkpoint branch:
Code Block |
---|
$ git clone https://github.com/ucgmsim/IM_calculation.git
$ cd IM_calculation
$ git checkout checkpoint |
Now open im_calc_sl.template and change the IMPATH variable (line 22) to the location where you cloned the git repository:
Code Block |
---|
# open template
~/IM_calculation-[checkpoint]$ vim im_calc_sl.template
# modify the $IMPATH
export IMPATH=/home/melody.zhu/IM_calculation |
Note: the checkpointing code relies on the input/output directory structure specified in im_calc_sl.template in the checkpoint branch. If your directories do not match this structure, the job fails with a runtime error. A quick fix is to modify the template to suit your own directory structure.
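A pre-flight check can catch a mismatched layout before the job is submitted. The sketch below assumes the layout `<Runs>/<fault>/BB/<config>/<realisation>/Acc/BB.bin`, which is inferred from the example input path in this section; the authoritative structure is the one in the template. The names `BB_PATTERN`, `find_bb_binaries` and `check_layout` are hypothetical.

```python
from pathlib import Path

# Assumed layout, inferred from the example path; adjust it to match the template.
BB_PATTERN = "*/BB/*/*/Acc/BB.bin"


def find_bb_binaries(runs_root):
    """List every BB.bin under runs_root that matches the assumed layout."""
    return sorted(Path(runs_root).glob(BB_PATTERN))


def check_layout(runs_root):
    """Fail early, with a readable message, if no input binaries are found."""
    binaries = find_bb_binaries(runs_root)
    if not binaries:
        raise FileNotFoundError(
            f"no BB.bin found under {runs_root} matching {BB_PATTERN}"
        )
    return binaries
```

Running `check_layout` on the directory passed to `-s` before generating the slurm script gives an immediate error instead of a failure partway through the job.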
Example:
Input/output structure defined in im_calc_sl.template
Actual input data structure:
The input binary is under:
Code Block |
---|
/nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs/BlueMtn/BB/Cant1D_v3-midQ_OneRay_hfnp2mm+_rvf0p8_sd50_k0p045/BlueMtn_HYP28-31_S1514/Acc/BB.bin |
The output folder is under:
Sample command to run checkpointing:
Code Block |
---|
$ python generate_sl2.py -s /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs -srf /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/Runs/test_srfs/ -ll /scale_akl_nobackup/filesets/transit/nesi00213/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.ll -np 80 -o . /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/test_check_point/ > test.sl |
To submit the slurm script:
Code Block |
---|
$ cp test.sl /nesi/nobackup/nesi00213/tmp/auto_preproc
$ cd /nesi/nobackup/nesi00213/tmp/auto_preproc
$ sbatch test.sl |
test.sl has to be submitted from '/nesi/nobackup/nesi00213/tmp/auto_preproc' because otherwise slurm cannot find the machine.env file specified by the test.sl script.
Splitting a big slurm
Splitting a big slurm script into several smaller slurm scripts is needed because of the maximum number of lines allowed in a slurm script on Kupe.
Still on the checkpoint branch, run the generate_sl3.py script, which uses both checkpointing and splitting. The -ml argument specifies the maximum number of lines of python calls to calculate_ims.py per slurm script. Header and footer lines, such as #SBATCH --time=15:30:00 and date, are not counted.
For example, if the maximum number of lines allowed in a slurm script is 1000 and your header and footer together take 30 lines, the number passed to -ml should be at most 970 (1000 - 30), e.g. -ml 970.
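The line budget is a simple subtraction, sketched below. The function name `max_python_calls` is hypothetical, and the sketch assumes that -ml counts only the calculate_ims.py call lines, as described above.

```python
def max_python_calls(max_script_lines, overhead_lines):
    """Return the largest value that can safely be passed to -ml.

    max_script_lines: line limit for one slurm script.
    overhead_lines: header plus footer lines (#SBATCH directives, date, ...).
    """
    budget = max_script_lines - overhead_lines
    if budget < 1:
        raise ValueError("header and footer leave no room for python calls")
    return budget


print(max_python_calls(1000, 30))  # prints 970
```

Any value up to this budget keeps the generated script within the limit; a smaller value simply produces more, shorter scripts.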
Example:
We have 320 simulation dirs to run. By specifying 100 python calls per slurm script, we expect 4 slurm scripts to be output, covering simulations:
1-100, 101-200, 201-300, 301-320
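The split in this example can be reproduced with a short sketch. This illustrates the batching arithmetic only and is not taken from generate_sl3.py; the function name `split_into_batches` and the placeholder directory names are hypothetical.

```python
def split_into_batches(items, max_per_script):
    """Split items into consecutive batches of at most max_per_script entries."""
    if max_per_script < 1:
        raise ValueError("max_per_script must be at least 1")
    return [
        items[start:start + max_per_script]
        for start in range(0, len(items), max_per_script)
    ]


# 320 placeholder simulation dirs, 100 python calls per slurm script.
sim_dirs = [f"sim_{i:03d}" for i in range(1, 321)]
batches = split_into_batches(sim_dirs, 100)
print([len(batch) for batch in batches])  # prints [100, 100, 100, 20]
```

Each batch corresponds to one slurm script, and the last batch holds the remainder.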
Command:
Code Block |
---|
python generate_sl3.py -s /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6_batched/v18p6_1k_under2p0G_ab/Runs -srf /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/Runs/test_srfs/ -ll /scale_akl_nobackup/filesets/transit/nesi00213/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.ll -np 80 -o . /nesi/nobackup/nesi00213/RunFolder/Cybershake/v18p6/test_check_point/ -ml 100 |
Output:
Todo: tidy up check_point.py and generate_sl[*|1-2].py
TODO
- Creation of semi-automatic slurm generation that will have all the calls to produce the results as needed.
- Progress printing statements
- Sim ASCII calculation - currently assumes the ASCII file gives acceleration in units of g, but this is not the case for sim output
- Rrup calculation on a smaller station list - currently, generating the slurm script uses the full grid, even for stations outside the domain
...