Step 6: Submit simulation jobs to HPC

1. Prepare Data

To run the install script, the models must be placed in the following folder structure:

Cybershake
└── version
    ├── Data
    │   ├── Sources
    │   │   └── FaultName
    │   │       └── Realisation
    │   │           ├── Srf
    │   │           │   ├── FaultName_HYP01-S1244.srf
    │   │           │   └── FaultName_HYP01-S1244.info
    │   │           └── Stoch
    │   │               └── FaultName_HYP01-S1244.stoch
    │   └── VMs
    │       └── FaultName
    │           ├── vs3dfile.s
    │           ├── vp3dfile.p
    │           ├── rho3dfile.d
    │           ├── params_vel.json
    │           ├── model_coords_rt01-h0.400
    │           └── model_params_rt01-h0.400
    └── Runs
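
Before running the install script it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the workflow itself: the function name find_layout_problems is hypothetical, and it assumes that the four fixed file names shown under VMs are expected in every VM folder (the model_coords and model_params names are skipped, since their suffix may differ between models).

```python
from pathlib import Path

# File names taken from the VMs folder in the tree above.
VM_FILES = ("vs3dfile.s", "vp3dfile.p", "rho3dfile.d", "params_vel.json")


def find_layout_problems(version_root):
    """Return a list of problems found under version_root (empty if none)."""
    root = Path(version_root)
    problems = []

    # Top-level folders shown in the tree.
    for rel in ("Data/Sources", "Data/VMs", "Runs"):
        if not (root / rel).is_dir():
            problems.append(f"missing folder: {rel}")

    # Assumption: every VM folder holds the four fixed-name files.
    vms = root / "Data" / "VMs"
    if vms.is_dir():
        for fault_dir in sorted(p for p in vms.iterdir() if p.is_dir()):
            for name in VM_FILES:
                if not (fault_dir / name).is_file():
                    problems.append(f"missing VM file: {fault_dir.name}/{name}")

    return problems
```

Calling find_layout_problems on the version folder returns an empty list when nothing is missing, so the result can be printed or checked before continuing to the install step.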

2. Install

After the files are in place, run the install script.

The script needs three arguments. The first is the root folder, which contains the Data and Runs folders. The second is a file that contains a list of VMs. The third is the version to be run.

python $gmsim/workflow/scripts/cybershake/install_cybershake.py $gmsim/RunFolder/cybershake/v18p5 $gmsim/RunFolder/cybershake/v18p5/list_all 16.1

Keep in mind that the second argument must be a file that lists each fault together with its number of realisations, written as a number followed by an r.

Something like this:

Opotiki02 10r
Opotiki03 12r
OpouaweUruti 5r
Orakeikorako 6r
Orakonui 10r
Oruakukuru 10r

3. Create a screen socket

Running scripts in a screen socket avoids the need to keep the terminal open constantly, which means you can disconnect from Maui while the script keeps running on it.

screen -S your_preferred_name_for_socket

To detach from a socket, press Ctrl + a, then d.

To terminate a socket, press Ctrl + d.

To show all sockets created earlier, use -list:

screen -list
There is a screen on:
    289787.cybershake_v18p6    (Detached)
1 Socket in /var/run/uscreens/S-ykh22.

To resume a specific socket, use -r:

screen -r 289787.cybershake_v18p6
or
screen -r 289787

To create two panes in the screen, press Ctrl + a (lower case) followed by | (normally Shift + \). Then press Ctrl + a, Tab to change to the new pane, and finally press Ctrl + a, c to create a new terminal in it.

When you are done with screen, use the bash 'exit' command to close the terminal in the current pane, press Ctrl + a, Tab to change to the other pane, and use the 'exit' command again. When all terminals have been closed, screen will close.

Summary of useful screen commands:

Action	Command
Split screen into two panes	Ctrl + a, |
Change pane	Ctrl + a, Tab
Create new terminal in the current pane	Ctrl + a, c
Close terminal in current pane	exit
Detach screen	Ctrl + a, d
Terminate screen	Ctrl + d

4. Run the simulations automatically

To run the simulation automatically two scripts must be run. These can be run in different ssh sessions, or one ssh session with the use of screen.

The first script (auto_submit.py) checks the queues and the database to see what can be run. The second script (queue_monitor.py) keeps the database up to date with the current status.

4.1 auto_submit.py

The first script takes two arguments: the path to the sim_root folder (the same one you passed to the install script) and your user name on the HPC.

python $gmsim/workflow/scripts/cybershake/auto_submit.py $gmsim/RunFolder/cybershake/v18p5 jpa198

This script terminates automatically when an update cycle finds nothing to do: nothing with valid dependencies is waiting to run, nothing is found in squeue, and nothing in the database has an active state (queued or running). By default, each cycle starts 5s after the previous one has ended.
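
The stopping rule can be pictured as a polling loop. The code below is a sketch of that idea only and is not taken from auto_submit.py: should_stop, run_cycles and the poll callback are hypothetical names, and poll is assumed to return the three counts described above.

```python
import time


def should_stop(n_ready_tasks, n_in_squeue, n_active_in_db):
    """True when a cycle found nothing to run, nothing queued and nothing active."""
    return n_ready_tasks == 0 and n_in_squeue == 0 and n_active_in_db == 0


def run_cycles(poll, sleep_time=5, sleep=time.sleep):
    """Call poll() once per cycle until a cycle reports no activity.

    poll is a hypothetical callback returning the tuple
    (ready tasks, jobs in squeue, active database entries).
    Returns the number of cycles that were executed.
    """
    cycles = 0
    while True:
        cycles += 1
        ready, queued, active = poll()
        if should_stop(ready, queued, active):
            return cycles
        sleep(sleep_time)  # wait before starting the next cycle
```

All three counts must be zero in the same cycle for the loop to end, so a job that is still listed in squeue keeps the loop alive even when nothing new can be submitted.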

The additional optional arguments for auto_submit are as below:

Argument	Options	Example usage	Example results
n_runs	The number of processes that can run on each HPC. Must be given as a single integer, or as a series of integers with one for each HPC in the order (maui, mahuika). Defaults to 12 each.	--n_runs 10 or --n_runs 10 20	With --n_runs 10, both HPCs can have 10 tasks each. With --n_runs 10 20, Maui can have 10 tasks at a time and Mahuika can have 20 tasks at a time
sleep_time	The amount of time to sleep between update cycles. Defaults to 5.	--sleep_time 10	The script will sleep for 10s per cycle
n_max_retries	The maximum number of times a task can be run before it is assumed to have a fatal error and require user intervention to fix. Defaults to 2.	--n_max_retries 3	Tasks will be run 3 times before being abandoned
log_file	The location of the log file to be used where all debug and user displayed information is saved. Defaults to a file in the current directory with the current date and time in the file name.	--log_file ./auto_submit_log.txt	The log messages will be written to auto_submit_log.txt in the current directory
tasks_to_run	The types of tasks to be run. Dependencies for these tasks are added automatically, so only the final task of each workflow needs to be listed. Defaults to clean_up.	--tasks_to_run merge_ts, IM_calc	The tasks merge_ts, IM_calc and their dependencies will be run
rels_to_run	The realisations to be run this time. Must be formatted as an SQL-style pattern, which uses % as the wildcard symbol. Defaults to all.	--rels_to_run AlpineF2K%	Only realisations that begin with AlpineF2K will run
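
To illustrate how the % wildcard in rels_to_run behaves, here is a small sketch using an in-memory SQLite database. The table name runs, its single column and the realisation names are made up for the example, and it is only an assumption that the workflow applies the pattern with an SQL LIKE comparison.

```python
import sqlite3

# Hypothetical table and realisation names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_name TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?)",
    [
        ("AlpineF2K_HYP01-S1244",),
        ("AlpineF2K_HYP02-S1254",),
        ("Opotiki02_HYP01-S1244",),
    ],
)


def matching_runs(pattern):
    """Return the run names that match an SQL LIKE pattern, sorted by name."""
    rows = conn.execute(
        "SELECT run_name FROM runs WHERE run_name LIKE ? ORDER BY run_name",
        (pattern,),
    )
    return [row[0] for row in rows]


print(matching_runs("AlpineF2K%"))
# prints ['AlpineF2K_HYP01-S1244', 'AlpineF2K_HYP02-S1254']
```

In SQL LIKE patterns, % matches any run of characters and _ matches exactly one character, so an underscore in a pattern is not matched literally unless it is escaped.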

The dependencies for the workflow are available here.

The output displayed to the user is intended to give an overview of the progress made during the run, while the log file contains far more information, useful during debugging.
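
One common way to get a short console overview next to a detailed log file is to attach two handlers with different levels to the same logger. The sketch below shows that pattern with Python's standard logging module. It is an assumption about how such a split could be set up, not the code used by auto_submit.py, and make_logger and the logger name are hypothetical.

```python
import logging
import sys


def make_logger(log_file, name="auto_submit_sketch"):
    """Create a logger that shows INFO on screen and writes DEBUG to log_file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    # Console: overview messages only.
    console = logging.StreamHandler(sys.stdout)
    console.setLevel(logging.INFO)
    logger.addHandler(console)

    # File: everything, with a timestamp for debugging.
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(file_handler)
    return logger
```

With this setup a call to logger.debug only reaches the file, while logger.info reaches both the screen and the file.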

4.2 queue_monitor.py

The second script takes one argument: the path to the sim_root folder (the same one you passed to the install script).

python $gmsim/workflow/scripts/cybershake/queue_monitor.py $gmsim/RunFolder/cybershake/v18p5

Note: this script keeps running in a loop until it is killed with Ctrl-C, or until the screen socket is terminated (if you followed step 3).

If you are running the script in a 'screen' socket, press Ctrl + a, then d to detach from it, so that you can continue with the next step in the same terminal without worrying about being disconnected.

Argument	Options	Example usage	Example results
sleep_time	The amount of time to sleep between update cycles. Defaults to 5.	--sleep_time 10	The script will sleep for 10s per cycle
log_file	The location of the log file to be used where all debug and user displayed information is saved. Defaults to a file in the current directory with the current date and time in the file name.	--log_file ./queue_monitor_log.txt	The log messages will be written to queue_monitor_log.txt in the current directory
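
As an alternative to attaching to a screen socket by hand, screen can start a command directly in a detached session with its -dmS option. The sketch below only builds the two command lines for the scripts from sections 4.1 and 4.2. The helper names and the session names auto_submit and queue_monitor are hypothetical, and it assumes that python and screen resolve to the intended programs in the new session.

```python
import os


def screen_command(session_name, script, *script_args):
    """Build an argv list that runs `python script args...` in a detached screen."""
    return ["screen", "-dmS", session_name, "python", script, *script_args]


def workflow_commands(gmsim, sim_root, user):
    """Return the commands for auto_submit.py and queue_monitor.py."""
    scripts = os.path.join(gmsim, "workflow", "scripts", "cybershake")
    return [
        screen_command(
            "auto_submit", os.path.join(scripts, "auto_submit.py"), sim_root, user
        ),
        screen_command(
            "queue_monitor", os.path.join(scripts, "queue_monitor.py"), sim_root
        ),
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True). The sessions then show up in screen -list and can be resumed with screen -r.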

5. Monitor Simulation Status

Monitor the status of each simulation by running the query script.

python $gmsim/workflow/scripts/management/query_mgmt_db.py $gmsim/RunFolder/cybershake/v18p5

It should show something like this:

                 run_name |         process |     status |   job-id |        last_modified
__________________________________________________________________________________________
              2012p075555 |        merge_ts |   in-queue |  2198889 |  2018-05-29 04:34:39
              2012p075555 |              BB |    created |     None |  2018-05-29 04:34:39
              2012p075555 |  IM_calculation |    created |     None |  2018-05-29 04:34:39
              2012p075555 |              HF |  completed |  2198881 |  2018-05-29 21:29:21
              2012p075555 |          EMOD3D |     failed |  2198858 |  2018-05-29 04:43:40
              2012p713691 |        merge_ts |    created |     None |  2018-05-29 04:34:40
              2012p713691 |              BB |    created |     None |  2018-05-29 04:34:40
              2012p713691 |  IM_calculation |    created |     None |  2018-05-29 04:34:40
              2012p713691 |              HF |  completed |  2198882 |  2018-05-29 21:29:21
              2012p713691 |          EMOD3D |     failed |  2198860 |  2018-05-29 04:44:49
              2012p764736 |        merge_ts |    created |     None |  2018-05-29 04:34:40
              2012p764736 |              HF |    created |     None |  2018-05-29 04:34:40
              2012p764736 |              BB |    created |     None |  2018-05-29 04:34:40
              2012p764736 |  IM_calculation |    created |     None |  2018-05-29 04:34:40
              2012p764736 |          EMOD3D |     failed |  2198862 |  2018-05-29 04:44:49
              2012p781523 |        merge_ts |    created |     None |  2018-05-29 04:34:40
              2012p781523 |              BB |    created |     None |  2018-05-29 04:34:40
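
When the table gets long, it can be easier to summarise it with a few lines of code. The sketch below parses text in the layout shown above. parse_status_table is a hypothetical helper, and it assumes that every data row has exactly five fields separated by | characters.

```python
FIELDS = ("run_name", "process", "status", "job_id", "last_modified")


def parse_status_table(text):
    """Turn the printed status table into a list of dicts, one per row."""
    rows = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the header, the underscore separator and blank lines.
        if len(cells) != len(FIELDS) or cells[0] == "run_name":
            continue
        rows.append(dict(zip(FIELDS, cells)))
    return rows


sample = """\
                 run_name |         process |     status |   job-id |        last_modified
__________________________________________________________________________________________
              2012p075555 |        merge_ts |   in-queue |  2198889 |  2018-05-29 04:34:39
              2012p075555 |              HF |  completed |  2198881 |  2018-05-29 21:29:21
              2012p075555 |          EMOD3D |     failed |  2198858 |  2018-05-29 04:43:40
"""
failed = [(r["run_name"], r["process"]) for r in parse_status_table(sample) if r["status"] == "failed"]
print(failed)
# prints [('2012p075555', 'EMOD3D')]
```

The same list of dicts can be filtered on any other column, for example to find every task that is still in-queue.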

Use -e to show only the failed runs (with their errors):

python $gmsim/workflow/scripts/management/query_mgmt_db.py /nesi/nobackup/nesi00213/test_auto_submit -e

 Run_name: 2012p075555
 Process: EMOD3D
 Status: failed
 Job-ID: 2198858
 Last_Modified: 2018-05-29 04:43:40
 Error: Task removed from squeue without completion 

 Run_name: 2012p713691
 Process: EMOD3D
 Status: failed
 Job-ID: 2198860
 Last_Modified: 2018-05-29 04:44:49
 Error: Task removed from squeue without completion
