This document describes the functionality and outputs of the Slurm job management database.

Creation of database

The database is created automatically as part of install.sh. To create one manually, use:

python create_mgmt_db.py <path_to_run_folder> [list of realisations]
e.g.
python create_mgmt_db.py ~/Documents/scratch/test_18p5/ test123 test_realiastion1

 

Updating entries in database

Pass the same run folder path that was used to create the db, rather than the absolute path of the db.

A status can only progress, i.e. it must move forward in a linear fashion. If a step fails, its entry should be advanced to failed and a new entry created.
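
The progression rule above can be sketched as a small check. This is only an illustration: the function name can_advance is hypothetical, and it assumes that statuses may skip forward (e.g. created to running), that any non-terminal status may move to failed, and that completed and failed are terminal. The real update_mgmt_db.py may enforce the rule differently.

```python
# Hypothetical sketch of the "status can only progress" rule.
# Assumption: this ordering mirrors the status choices listed in the usage text.
STATUS_ORDER = ["created", "in-queue", "running", "completed"]
TERMINAL = ("completed", "failed")


def can_advance(current, new):
    """Return True if moving from `current` to `new` only progresses the status."""
    if current in TERMINAL:
        # Assumed terminal: a failed step gets a new entry instead of an update.
        return False
    if new == "failed":
        # Assumed: any non-terminal status may move to failed.
        return True
    if current not in STATUS_ORDER or new not in STATUS_ORDER:
        raise ValueError("unknown status: %r -> %r" % (current, new))
    return STATUS_ORDER.index(new) > STATUS_ORDER.index(current)


print(can_advance("in-queue", "running"))  # forward move
print(can_advance("running", "in-queue"))  # backward move
```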

usage: update_mgmt_db.py [-h] [-r RUN_NAME] [-j JOB] [-e ERROR]
                         run_folder {EMOD3D,post_EMOD3D,HF,BB,IM_calculation}
                         {created,in-queue,running,completed,failed}


positional arguments:
  run_folder            folder to the collection of runs on Kupe
  {EMOD3D,post_EMOD3D,HF,BB,IM_calculation}
  {created,in-queue,running,completed,failed}


optional arguments:
  -h, --help            show this help message and exit
  -r RUN_NAME, --run_name RUN_NAME
                        name of run to be updated
  -j JOB, --job JOB     Job number on supercomputer
  -e ERROR, --error ERROR
                        text notes about why the run failed
e.g.
python update_mgmt_db.py ~/Documents/scratch/test_18p5/ HF in-queue -j 3 --run_name test123
python update_mgmt_db.py ~/Documents/scratch/test_18p5/ HF running -j 3
python update_mgmt_db.py ~/Documents/scratch/test_18p5/ HF failed -j 3 --error 'Hit wall clock limit 5000'

Querying status of database

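Prints the status of every run in the collection, or of a single run when run_name is given. In the examples below, --error prints the stored error text alongside each entry.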

 

query_mgmt_db.py [-h] [--error] run_folder [run_name]
positional arguments:
  run_folder   folder to the collection of runs on Kupe
  run_name     name of run to be queried
optional arguments:
  -h, --help   show this help message and exit
  --error, -e  Optionally add an error string to the database

e.g.

slurm_gm_workflow/scripts/management$ python query_mgmt_db.py ~/Documents/scratch/test_18p5/
                 run_name |         process |     status |        last_modified
_______________________________________________________________________________
                  test123 |              BB |   in-queue |  2018-05-16 03:53:55
                  test123 |  IM_calculation |   in-queue |  2018-05-16 03:53:55
                  test123 |     post_EMOD3D |    running |  2018-05-16 04:30:01
                  test123 |          EMOD3D |  completed |  2018-05-16 03:58:15
                  test123 |              HF |     failed |  2018-05-16 22:56:41
        test_realiastion1 |          EMOD3D |    created |  2018-05-16 03:34:26
        test_realiastion1 |     post_EMOD3D |    created |  2018-05-16 03:34:26
        test_realiastion1 |              HF |    created |  2018-05-16 03:34:26
        test_realiastion1 |              BB |    created |  2018-05-16 03:34:26
        test_realiastion1 |  IM_calculation |    created |  2018-05-16 03:34:26
slurm_gm_workflow/scripts/management$ python query_mgmt_db.py ~/Documents/scratch/test_18p5/ test123
                 run_name |         process |     status |        last_modified
_______________________________________________________________________________
                  test123 |              BB |   in-queue |  2018-05-16 03:53:55
                  test123 |  IM_calculation |   in-queue |  2018-05-16 03:53:55
                  test123 |     post_EMOD3D |    running |  2018-05-16 04:30:01
                  test123 |          EMOD3D |  completed |  2018-05-16 03:58:15
                  test123 |              HF |     failed |  2018-05-16 22:56:41
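
If the table output above needs to be consumed by another script, it could be parsed along the following lines. This is a minimal sketch that assumes the pipe-delimited layout shown above stays stable; parse_status_table is a hypothetical helper and not part of the management scripts.

```python
from typing import Dict, List


def parse_status_table(text: str) -> List[Dict[str, str]]:
    """Parse the pipe-delimited table printed by query_mgmt_db.py into dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the underscore rule, shell prompts and blank lines
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first delimited line holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


sample = """\
                 run_name |         process |     status |        last_modified
_______________________________________________________________________________
                  test123 |              BB |   in-queue |  2018-05-16 03:53:55
                  test123 |              HF |     failed |  2018-05-16 22:56:41
"""

failed = [row["process"] for row in parse_status_table(sample) if row["status"] == "failed"]
print(failed)  # ['HF']
```

The sketch handles one table at a time; if several tables are concatenated, each later header line would be returned as an ordinary row.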

 

slurm_gm_workflow/scripts/management$ python query_mgmt_db.py ~/Documents/scratch/test_18p5/ --error

Run_name: test123
Process: EMOD3D
Status: completed
Last_Modified: 2018-05-16 03:58:15
Error: Demo error

Run_name: test123
Process: HF
Status: failed
Last_Modified: 2018-05-16 22:56:41
Error: hit wall clock limit 5000

Run_name: Kelly_HYP02-03_S1264
Process: EMOD3D
Status: failed
Last_Modified: 2018-05-18 02:30:03
Error: Task removed from squeue without completion

 

Inserting new tasks into database

Inserts a new entry into the database, with the status created, for the given run_name and process.

python insert_mgmt_db.py ~/Documents/scratch/test_18p5/ run_name {EMOD3D,post_EMOD3D,HF,BB,IM_calculation}
e.g.
python insert_mgmt_db.py ~/Documents/scratch/test_18p5/ test123 EMOD3D
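
Together with the update script, this gives the retry pattern described earlier: mark the failed entry as failed, then insert a fresh entry. The sketch below only builds the two command lines from the arguments documented on this page; resubmit_commands is a hypothetical helper, and whether the scripts should be driven this way is an assumption.

```python
import os
import shlex

# Process names as listed in the usage text of update_mgmt_db.py.
PROCESSES = ("EMOD3D", "post_EMOD3D", "HF", "BB", "IM_calculation")


def resubmit_commands(run_folder, run_name, process, job_id, error):
    """Return two argument lists: mark the entry failed, then insert a new one."""
    if process not in PROCESSES:
        raise ValueError("unknown process: %r" % process)
    mark_failed = [
        "python", "update_mgmt_db.py", run_folder, process, "failed",
        "-j", str(job_id), "-r", run_name, "-e", error,
    ]
    insert_new = ["python", "insert_mgmt_db.py", run_folder, run_name, process]
    return [mark_failed, insert_new]


# Expand "~" here because no shell is involved when the lists are passed on.
folder = os.path.expanduser("~/Documents/scratch/test_18p5/")
for cmd in resubmit_commands(folder, "test123", "HF", 3, "Hit wall clock limit 5000"):
    print(shlex.join(cmd))  # each list could instead be passed to subprocess.run
```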

Querying Slurm

Checks squeue for the progress of each task and updates its status in the database where it has changed.

python slurm_query_status.py run_folder [poll-interval]
e.g.

python slurm_query_status.py ~/Documents/scratch/test_18p5/
not updating status (running) of 'post_EMOD3D' on 'test123'
not updating status (in-queue) of 'BB' on 'test123'
updating 'IM_calculation' on 'test123' to the status of 'running' from 'in-queue'
Task 'EMOD3D' on 'test_realiastion1' not found on squeue; changing status to 'failed'

python slurm_query_status.py ~/Documents/scratch/test_18p5/
not updating status (running) of 'post_EMOD3D' on 'test123' (2183326)
not updating status (in-queue) of 'BB' on 'test123' (2183255)
not updating status (running) of 'IM_calculation' on 'test123' (2183303)
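
The decisions visible in the output above can be summarised in a short sketch. It is an illustration only and not the logic of slurm_query_status.py: the function reconcile is hypothetical, it assumes that only tasks recorded as in-queue or running are compared against squeue, and it uses the squeue state code R (running) for the comparison.

```python
from typing import Optional


def reconcile(db_status: str, squeue_state: Optional[str]) -> Optional[str]:
    """Return the new database status, or None when no update is needed.

    `squeue_state` is the squeue state code for the job, or None when the
    job no longer appears on squeue.
    """
    if db_status not in ("in-queue", "running"):
        return None  # assumption: only submitted, unfinished tasks are tracked
    if squeue_state is None:
        return "failed"  # task left squeue without being marked completed
    if db_status == "in-queue" and squeue_state == "R":
        return "running"  # job has started since the last check
    return None  # status unchanged, e.g. still pending or still running


print(reconcile("in-queue", "R"))
print(reconcile("running", None))
```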