This document describes the functionality and outputs of the Slurm job management database.
Creation of database
The database is created automatically as part of install.sh. To create a database manually, use the command below.
python create_mgmt_db.py <path_to_run_folder> [list of realisations]
e.g.
python create_mgmt_db.py ~/Documents/scratch/test_18p5/ test123 test_realiastion1
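The effect of creation can be illustrated with a minimal sketch. This is not the actual create_mgmt_db.py: it assumes an SQLite backend and a hypothetical table named state whose four columns mirror the query output shown later in this document. The real storage format and schema are not described here. The sketch adds one row with the status created for every realisation and process.

```python
import sqlite3
from datetime import datetime

# Process names taken from the update_mgmt_db.py usage text.
PROCESSES = ["EMOD3D", "post_EMOD3D", "HF", "BB", "IM_calculation"]


def create_sketch_db(conn, realisations):
    """Create a hypothetical 'state' table with one 'created' row per realisation and process."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state ("
        "run_name TEXT, process TEXT, status TEXT, last_modified TEXT)"
    )
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    rows = [(name, proc, "created", now) for name in realisations for proc in PROCESSES]
    conn.executemany("INSERT INTO state VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
create_sketch_db(conn, ["test123", "test_realiastion1"])
row_count = conn.execute("SELECT COUNT(*) FROM state").fetchone()[0]
print(row_count)  # 2 realisations x 5 processes
```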
Updating entries in database
Pass the same run folder path that was used to create the db, rather than the absolute path of the db itself.
A status can only progress forward, i.e. it must move through the states in a linear fashion. If a step fails, its status should be advanced to failed and a new entry created.
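The forward-only rule can be sketched as a small check. This is an illustration only, not the code in update_mgmt_db.py. The function name can_advance is hypothetical, and the sketch assumes that completed and failed are terminal and that skipping forward (for example created to running) is allowed. The real script may be stricter.

```python
# Status order taken from the update_mgmt_db.py usage text.
ORDER = ["created", "in-queue", "running", "completed"]


def can_advance(current, new):
    """Return True if moving from `current` to `new` is a forward move."""
    if current in ("completed", "failed"):
        return False  # assumed terminal: a retry needs a new entry
    if new == "failed":
        return True  # any unfinished step may fail
    return ORDER.index(new) > ORDER.index(current)


print(can_advance("in-queue", "running"))
print(can_advance("running", "in-queue"))
```

Under these assumptions a failed step cannot be moved back to created, which is why a new entry is inserted instead.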
usage: update_mgmt_db.py [-h] [-r RUN_NAME] [-j JOB] [-e ERROR]
                         run_folder
                         {EMOD3D,post_EMOD3D,HF,BB,IM_calculation}
                         {created,in-queue,running,completed,failed}

positional arguments:
  run_folder            folder to the collection of runs on Kupe
  {EMOD3D,post_EMOD3D,HF,BB,IM_calculation}
  {created,in-queue,running,completed,failed}

optional arguments:
  -h, --help            show this help message and exit
  -r RUN_NAME, --run_name RUN_NAME
                        name of run to be updated
  -j JOB, --job JOB     Job number on supercomputer
  -e ERROR, --error ERROR
                        text notes about why the run failed

e.g.
python update_mgmt_db.py ~/Documents/scratch/test_18p5/ HF in-queue --j 3 --run_name test123
python update_mgmt_db.py ~/Documents/scratch/test_18p5/ HF running --j 3
python update_mgmt_db.py ~/Documents/scratch/test_18p5/ HF failed --j 3 --error 'Hit wall clock limit 5000'
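The usage text above has the shape of argparse help output, so the interface can be approximated with the sketch below. It is a reconstruction from the usage text, not the script's source: the use of argparse is an assumption, and the destination names process and status are hypothetical. It also shows why the examples can write --j 3: by default argparse accepts any unambiguous prefix of a long option, so --j resolves to --job.

```python
import argparse

PROCESSES = ["EMOD3D", "post_EMOD3D", "HF", "BB", "IM_calculation"]
STATES = ["created", "in-queue", "running", "completed", "failed"]


def build_parser():
    """Build a parser that mirrors the documented usage (names partly hypothetical)."""
    parser = argparse.ArgumentParser(prog="update_mgmt_db.py")
    parser.add_argument("run_folder", help="folder to the collection of runs on Kupe")
    parser.add_argument("process", choices=PROCESSES)  # hypothetical dest name
    parser.add_argument("status", choices=STATES)  # hypothetical dest name
    parser.add_argument("-r", "--run_name", help="name of run to be updated")
    parser.add_argument("-j", "--job", help="Job number on supercomputer")
    parser.add_argument("-e", "--error", help="text notes about why the run failed")
    return parser


# Same arguments as the third example above, passed as a list instead of via the shell.
args = build_parser().parse_args(
    ["~/Documents/scratch/test_18p5/", "HF", "failed", "--j", "3",
     "--error", "Hit wall clock limit 5000"]
)
print(args.process, args.status, args.job, args.error)
```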
Querying status of database
Prints the status of the collection of runs.
usage: query_mgmt_db.py [-h] [--error] run_folder [run_name]

positional arguments:
  run_folder   folder to the collection of runs on Kupe
  run_name     name of run to be queried

optional arguments:
  -h, --help   show this help message and exit
  --error, -e  Optionally add an error string to the database

e.g.
slurm_gm_workflow/scripts/management$ python query_mgmt_db.py ~/Documents/scratch/test_18p5/
run_name          | process        | status    | last_modified
_______________________________________________________________________________
test123           | BB             | in-queue  | 2018-05-16 03:53:55
test123           | IM_calculation | in-queue  | 2018-05-16 03:53:55
test123           | post_EMOD3D    | running   | 2018-05-16 04:30:01
test123           | EMOD3D         | completed | 2018-05-16 03:58:15
test123           | HF             | failed    | 2018-05-16 22:56:41
test_realiastion1 | EMOD3D         | created   | 2018-05-16 03:34:26
test_realiastion1 | post_EMOD3D    | created   | 2018-05-16 03:34:26
test_realiastion1 | HF             | created   | 2018-05-16 03:34:26
test_realiastion1 | BB             | created   | 2018-05-16 03:34:26
test_realiastion1 | IM_calculation | created   | 2018-05-16 03:34:26

slurm_gm_workflow/scripts/management$ python query_mgmt_db.py ~/Documents/scratch/test_18p5/ test123
run_name | process        | status    | last_modified
_______________________________________________________________________________
test123  | BB             | in-queue  | 2018-05-16 03:53:55
test123  | IM_calculation | in-queue  | 2018-05-16 03:53:55
test123  | post_EMOD3D    | running   | 2018-05-16 04:30:01
test123  | EMOD3D         | completed | 2018-05-16 03:58:15
test123  | HF             | failed    | 2018-05-16 22:56:41
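
If the table output needs to be consumed by another script, it can be parsed by splitting on the pipe character. The helper below is a hypothetical sketch, not part of the management scripts. It assumes one record per line and the header row shown above.

```python
def parse_status_table(text):
    """Parse the pipe-delimited table printed by query_mgmt_db.py into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip the underscore rule, prompt lines and blanks
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first pipe-delimited line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


sample = """\
run_name | process | status | last_modified
_______________________________________________________________________________
test123 | EMOD3D | completed | 2018-05-16 03:58:15
test123 | HF | failed | 2018-05-16 22:56:41
"""
records = parse_status_table(sample)
failed = [r["process"] for r in records if r["status"] == "failed"]
print(failed)
```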

slurm_gm_workflow/scripts/management$ python query_mgmt_db.py ~/Documents/scratch/test_18p5/ --error
Run_name: test123
Process: EMOD3D
Status: completed
Last_Modified: 2018-05-16 03:58:15
Error: Demo error

Run_name: test123
Process: HF
Status: failed
Last_Modified: 2018-05-16 22:56:41
Error: hit wall clock limit 5000

Run_name: Kelly_HYP02-03_S1264
Process: EMOD3D
Status: failed
Last_Modified: 2018-05-18 02:30:03
Error: Task removed from squeue without completion
Inserting new tasks into database
Inserts a new entry into the database, with the status created, for the given run_name and process.
python insert_mgmt_db.py ~/Documents/scratch/test_18p5/ run_name {EMOD3D,post_EMOD3D,HF,BB,IM_calculation}
e.g.
python insert_mgmt_db.py ~/Documents/scratch/test_18p5/ test123 EMOD3D
Querying Slurm
Checks squeue for the progress of each task and updates the status in the database where it has changed.
python slurm_query_status.py run_folder [poll-interval]
e.g.
python slurm_query_status.py ~/Documents/scratch/test_18p5/
not updating status (running) of 'post_EMOD3D' on 'test123'
not updating status (in-queue) of 'BB' on 'test123'
updating 'IM_calculation' on 'test123' to the status of 'running' from 'in-queue'
Task 'EMOD3D' on 'test_realiastion1' not found on squeue; changing status to 'failed'

python slurm_query_status.py ~/Documents/scratch/test_18p5/
not updating status (running) of 'post_EMOD3D' on 'test123' (2183326)
not updating status (in-queue) of 'BB' on 'test123' (2183255)
not updating status (running) of 'IM_calculation' on 'test123' (2183303)
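
The decisions visible in the log lines above can be sketched as a small pure function. This is an illustration under assumptions, not the logic of slurm_query_status.py: the function name reconcile and the mapping from squeue's compact state codes (PD for pending, R for running) to database statuses are hypothetical, and the sketch only ever moves a status forward.

```python
# Assumed mapping from squeue compact state codes to database statuses.
SQUEUE_TO_STATUS = {"PD": "in-queue", "R": "running"}


def reconcile(db_status, squeue_state):
    """Return the new database status, or None if no update is needed.

    squeue_state is None when the job no longer appears in squeue.
    """
    if squeue_state is None:
        return "failed"  # task vanished from squeue without completing
    observed = SQUEUE_TO_STATUS.get(squeue_state)
    if db_status == "in-queue" and observed == "running":
        return "running"  # the only forward move squeue can reveal here
    return None


# (run_name, process) -> (status in database, state seen in squeue)
tasks = {
    ("test123", "post_EMOD3D"): ("running", "R"),
    ("test123", "BB"): ("in-queue", "PD"),
    ("test123", "IM_calculation"): ("in-queue", "R"),
    ("test_realiastion1", "EMOD3D"): ("in-queue", None),
}
for (run_name, process), (status, state) in tasks.items():
    print(run_name, process, reconcile(status, state))
```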