This document describes the functionality and outputs of the scripts that manage the Slurm job management database.
create_mgmt_db.py is called automatically as part of install.sh. To create a db manually, run it as follows:
python create_mgmt_db.py <path_to_run_folder> [list of realisations]
e.g.
python create_mgmt_db.py ~/Documents/scratch/test_18p5/ test123 test_realiastion1
update_mgmt_db.py takes the same path that was used to create the db, rather than the absolute path of the db itself.
It can only progress the status, i.e. the status must move forward in a linear fashion and cannot return to an earlier one. If a step fails, its status should advance to failed and a new entry should be created.
usage: update_mgmt_db.py [-h] [-r RUN_NAME] [-j JOB] [-e ERROR]
                         run_folder
                         {EMOD3D,post_EMOD3D,HF,BB,IM_calculation}
                         {created,in-queue,running,completed,failed}

positional arguments:
  run_folder            folder to the collection of runs on Kupe
  {EMOD3D,post_EMOD3D,HF,BB,IM_calculation}
  {created,in-queue,running,completed,failed}

optional arguments:
  -h, --help            show this help message and exit
  -r RUN_NAME, --run_name RUN_NAME
                        name of run to be updated
  -j JOB, --job JOB     Job number on supercomputer
  -e ERROR, --error ERROR
                        text notes about why the run failed

e.g.
python update_mgmt_db.py ~/Documents/scratch/test_18p5/ HF in-queue --j 3 --run_name test123
python update_mgmt_db.py ~/Documents/scratch/test_18p5/ HF running --j 3
python update_mgmt_db.py ~/Documents/scratch/test_18p5/ HF failed --j 3 --error 'Hit wall clock limit 5000'
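The rule that a status can only move forward can be sketched as a small check. This is an illustration only, not the code in update_mgmt_db.py: the names STATUS_RANK and can_progress are hypothetical, the ordering is inferred from the order of the status choices in the usage text, and treating completed and failed as equally final is an assumption.

```python
# Hypothetical ranking, inferred from the order of the status choices above.
# Assumption: "completed" and "failed" are both final, so they share a rank.
STATUS_RANK = {
    "created": 0,
    "in-queue": 1,
    "running": 2,
    "completed": 3,
    "failed": 3,
}


def can_progress(current, new):
    """Return True only if `new` is strictly later than `current`."""
    return STATUS_RANK[new] > STATUS_RANK[current]


print(can_progress("in-queue", "running"))  # forward move
print(can_progress("running", "in-queue"))  # backward move
```

Under these assumptions a step may jump straight to failed from any unfinished status, but a finished step cannot be reopened, which matches the advice to create a new entry after a failure.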
query_mgmt_db.py prints the status of the collection of runs, or of a single run when run_name is given.
query_mgmt_db.py [-h] [--error] run_folder [run_name]

positional arguments:
  run_folder   folder to the collection of runs on Kupe
  run_name     name of run to be queried

optional arguments:
  -h, --help   show this help message and exit
  --error, -e  Optionally add an error string to the database

e.g.
slurm_gm_workflow/scripts/management$ python query_mgmt_db.py ~/Documents/scratch/test_18p5/
run_name          | process        | status    | last_modified
_______________________________________________________________________________
test123           | BB             | in-queue  | 2018-05-16 03:53:55
test123           | IM_calculation | in-queue  | 2018-05-16 03:53:55
test123           | post_EMOD3D    | running   | 2018-05-16 04:30:01
test123           | EMOD3D         | completed | 2018-05-16 03:58:15
test123           | HF             | failed    | 2018-05-16 22:56:41
test_realiastion1 | EMOD3D         | created   | 2018-05-16 03:34:26
test_realiastion1 | post_EMOD3D    | created   | 2018-05-16 03:34:26
test_realiastion1 | HF             | created   | 2018-05-16 03:34:26
test_realiastion1 | BB             | created   | 2018-05-16 03:34:26
test_realiastion1 | IM_calculation | created   | 2018-05-16 03:34:26
slurm_gm_workflow/scripts/management$ python query_mgmt_db.py ~/Documents/scratch/test_18p5/ test123
run_name | process        | status    | last_modified
_______________________________________________________________________________
test123  | BB             | in-queue  | 2018-05-16 03:53:55
test123  | IM_calculation | in-queue  | 2018-05-16 03:53:55
test123  | post_EMOD3D    | running   | 2018-05-16 04:30:01
test123  | EMOD3D         | completed | 2018-05-16 03:58:15
test123  | HF             | failed    | 2018-05-16 22:56:41
slurm_gm_workflow/scripts/management$ python query_mgmt_db.py ~/Documents/scratch/test_18p5/ --error
Run_name: test123
Process: EMOD3D
Status: completed
Last_Modified: 2018-05-16 03:58:15
Error: Demo error
Run_name: test123
Process: HF
Status: failed
Last_Modified: 2018-05-16 22:56:41
Error: hit wall clock limit 5000
Run_name: Kelly_HYP02-03_S1264
Process: EMOD3D
Status: failed
Last_Modified: 2018-05-18 02:30:03
Error: Task removed from squeue without completion
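If the --error report needs to be processed by another script, its "Key: value" lines can be grouped into one record per entry. The sketch below is not part of the management scripts; parse_error_report and FIELDS are hypothetical names, and it assumes the output keeps the five field labels shown above, with each record starting at Run_name.

```python
# Field labels as they appear in the sample --error output above.
FIELDS = ("Run_name", "Process", "Status", "Last_Modified", "Error")


def parse_error_report(text):
    """Group 'Key: value' lines into one dict per run/process record."""
    records = []
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in FIELDS:
            continue  # skip the shell prompt line and anything unrecognised
        if key == "Run_name":
            records.append({})  # assumption: Run_name opens a new record
        if records:
            records[-1][key] = value.strip()
    return records


sample = """\
Run_name: test123
Process: HF
Status: failed
Last_Modified: 2018-05-16 22:56:41
Error: hit wall clock limit 5000
Run_name: Kelly_HYP02-03_S1264
Process: EMOD3D
Status: failed
Last_Modified: 2018-05-18 02:30:03
Error: Task removed from squeue without completion
"""
records = parse_error_report(sample)
for record in records:
    print(record["Run_name"], record["Process"], record["Error"])
```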
insert_mgmt_db.py inserts a new entry into the database, with the status created, for the given run_name and process.
python insert_mgmt_db.py ~/Documents/scratch/test_18p5/ run_name {EMOD3D,post_EMOD3D,HF,BB,IM_calculation}
e.g.
python insert_mgmt_db.py ~/Documents/scratch/test_18p5/ test123 EMOD3D
slurm_query_status.py checks squeue to see the progress of each task and updates its status in the database where it has changed.
python slurm_query_status.py run_folder [poll-interval]
e.g.
python slurm_query_status.py ~/Documents/scratch/test_18p5/
not updating status (running) of 'post_EMOD3D' on 'test123'
not updating status (in-queue) of 'BB' on 'test123'
updating 'IM_calculation' on 'test123' to the status of 'running' from 'in-queue'
Task 'EMOD3D' on 'test_realiastion1' not found on squeue; changing status to 'failed'

python slurm_query_status.py ~/Documents/scratch/test_18p5/
not updating status (running) of 'post_EMOD3D' on 'test123' (2183326)
not updating status (in-queue) of 'BB' on 'test123' (2183255)
not updating status (running) of 'IM_calculation' on 'test123' (2183303)
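The decisions visible in the output above (leave the status alone, advance it, or mark a missing task as failed) can be sketched as a single function. This is a guess at the logic, not the code in slurm_query_status.py: next_status, SQUEUE_TO_DB and RANK are hypothetical names, and the mapping from Slurm's compact state codes PD (pending) and R (running) to database statuses is an assumption.

```python
# Assumed mapping from squeue compact state codes to database statuses.
SQUEUE_TO_DB = {"PD": "in-queue", "R": "running"}

# Only unfinished statuses are tracked; completed/failed are left untouched.
RANK = {"created": 0, "in-queue": 1, "running": 2}


def next_status(db_status, squeue_state):
    """Return the status to write to the database, or None to leave it alone.

    `squeue_state` is None when the job does not appear in squeue.
    """
    if db_status not in RANK:
        return None  # assumption: finished tasks are never re-checked
    if squeue_state is None:
        return "failed"  # task not found on squeue
    observed = SQUEUE_TO_DB.get(squeue_state)
    if observed is None or RANK[observed] <= RANK[db_status]:
        return None  # unknown state, unchanged, or a backward move
    return observed


print(next_status("in-queue", "R"))   # advance to running
print(next_status("running", "R"))    # no update needed
print(next_status("created", None))   # not found on squeue
```

Returning None for anything that is not a forward move keeps the sketch consistent with the rule that a status may only progress.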