1. Prepare Data
Before running the install script, place the models in the following folder structure:
Cybershake
└── version
    ├── Data
    │   ├── Sources
    │   │   └── FaultName
    │   │       └── Realisation
    │   │           ├── Srf
    │   │           │   ├── FaultName_HYP01-S1244.srf
    │   │           │   └── FaultName_HYP01-S1244.info
    │   │           └── Stoch
    │   │               └── FaultName_HYP01-S1244.stoch
    │   └── VMs
    │       └── FaultName
    │           ├── vs3dfile.s
    │           ├── vp3dfile.p
    │           ├── rho3dfile.d
    │           ├── params_vel.json
    │           ├── model_coords_rt01-h0.400
    │           └── model_params_rt01-h0.400
    └── Runs
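As a quick sanity check before installing, the top-level folders of this layout can be verified with a few lines of Python. This is only a sketch: the helper name `missing_layout_dirs` is hypothetical and not part of the workflow, and it checks only the Data/Sources, Data/VMs and Runs folders, not the individual model files.

```python
from pathlib import Path


def missing_layout_dirs(root):
    """Return the expected sub-folders that are absent under `root`.

    `root` is the version folder, i.e. the one that should contain
    the Data and Runs folders. Hypothetical helper, for illustration only.
    """
    root = Path(root)
    expected = [
        root / "Data" / "Sources",
        root / "Data" / "VMs",
        root / "Runs",
    ]
    # Report paths relative to root, with forward slashes on every platform.
    return [p.relative_to(root).as_posix() for p in expected if not p.is_dir()]


# Example usage (the path is a placeholder):
# print(missing_layout_dirs("/path/to/cybershake/v18p5"))
```

An empty list means all three folders were found.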
2. Install
After the files are in place, run the install script.
It takes three arguments: the first is the root folder, which contains the Data and Runs folders; the second is a file that contains a list of VMs; the third is the version to be run.
python $gmsim/workflow/scripts/cybershake/install_cybershake.py $gmsim/RunFolder/cybershake/v18p5 $gmsim/RunFolder/cybershake/v18p5/list_all 16.1
Keep in mind that the 2nd argument must be a file that lists each fault and its number of realisations, with the number followed by an r.
Something like this:
Opotiki02 10r
Opotiki03 12r
OpouaweUruti 5r
Orakeikorako 6r
Orakonui 10r
Oruakukuru 10r
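To make the expected format concrete, here is a minimal Python sketch that reads such a list into a mapping from fault name to realisation count. The function `parse_fault_list` is hypothetical and is not how the install script itself reads the file; it only assumes whitespace-separated pairs of a fault name and a count ending in r.

```python
def parse_fault_list(text):
    """Map each fault name to its number of realisations.

    Hypothetical helper: expects whitespace-separated pairs such as
    'Opotiki02 10r', where the count is an integer followed by 'r'.
    """
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected pairs of <fault name> <count>r")
    counts = {}
    for name, count in zip(tokens[0::2], tokens[1::2]):
        if not count.endswith("r") or not count[:-1].isdigit():
            raise ValueError(f"expected a count like '10r', got {count!r}")
        counts[name] = int(count[:-1])
    return counts


example = """Opotiki02 10r
Opotiki03 12r
OpouaweUruti 5r
"""
print(parse_fault_list(example))
```

Running a check like this before installing catches a missing r or a stray token early.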
3. Create a screen socket
Running scripts in a screen socket avoids the need to keep the terminal open constantly, which means you can disconnect from Maui while the script keeps running on it.
screen -S your_prefered_name_for_socket
To detach a socket, use Ctrl + a, d
To terminate a socket, use Ctrl + d
To show all sockets created before, use -list
screen -list
There is a screen on:
289787.cybershake_v18p6 (Detached)
1 Socket in /var/run/uscreens/S-ykh22.
To resume a specific socket, use -r
screen -r 289787.cybershake_v18p6
or
screen -r 289787
To create two panes in the screen, press Ctrl + a (lower case) followed by | (normally Shift + \). Then press Ctrl + a, Tab to move to the new pane, and finally Ctrl + a, c to create a terminal in it.
When you are done with screen, use the bash 'exit' command to close the terminal in the current pane, press Ctrl + a, Tab to change to the other pane, and run 'exit' again. When all terminals have been closed, screen will close.
Summary of useful screen commands:
Action | Command |
---|---|
Split screen into two panes | Ctrl + a, \| |
Change pane | Ctrl + a, Tab |
Create new terminal in the current pane | Ctrl + a, c |
Close terminal in current pane | exit |
Detach screen | Ctrl + a, d |
Terminate screen | Ctrl + d |
4. Run the simulations automatically
To run the simulation automatically two scripts must be run. These can be run in different ssh sessions, or one ssh session with the use of screen.
The first script (auto_submit.py) checks the queues and the database to see what can be run. The second script (queue_monitor.py) keeps the database up to date with the current status of the tasks.
4.1 auto_submit.py
The first script takes two arguments: the first is the path to the sim_root folder (the same one you passed to the install script), and the second is your user name on the HPC.
python $gmsim/workflow/scripts/cybershake/auto_submit.py $gmsim/RunFolder/cybershake/v18p5 jpa198
This script terminates automatically when nothing has happened for an update cycle (nothing with valid dependencies is waiting to run, nothing is found in squeue, and nothing has an active state (queued or running) in the database). Each cycle starts 5s after the previous one has ended.
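The stopping rule can be pictured as a simple polling loop. The sketch below is not the code of auto_submit.py: the function `run_cycles` and the `poll` callback are hypothetical stand-ins for whatever the script does to count runnable tasks, squeue entries and active database entries.

```python
import time


def run_cycles(poll, sleep_time=5):
    """Poll until one cycle finds nothing to do; return the number of cycles.

    `poll` is a hypothetical callable returning three counts:
    (tasks ready to run, jobs found in squeue, active tasks in the database).
    """
    cycles = 0
    while True:
        cycles += 1
        n_ready, n_in_squeue, n_active_in_db = poll()
        # Stop only when all three sources report nothing.
        if n_ready == 0 and n_in_squeue == 0 and n_active_in_db == 0:
            return cycles
        time.sleep(sleep_time)  # wait before starting the next cycle


# Example: two busy cycles followed by an idle one.
states = iter([(1, 0, 0), (0, 2, 1), (0, 0, 0)])
print(run_cycles(lambda: next(states), sleep_time=0))
```

The practical consequence is that a single idle cycle ends the script, so it has to be restarted if more work is added later.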
The optional arguments for auto_submit.py are listed below:
Argument | Options | Example usage | Example results |
---|---|---|---|
n_runs | The number of processes that can run on each HPC. Must be given as a single integer, or as a series of integers with one for each HPC in the order (maui, mahuika). Defaults to 12 each. | --n_runs 10; --n_runs 10 20 | Both HPCs can have 10 tasks each; Maui can have 10 tasks at a time and Mahuika can have 20 tasks at a time |
sleep_time | The amount of time to sleep between update cycles, in seconds. Defaults to 5. | --sleep_time 10 | The script will sleep for 10s per cycle |
n_max_retries | The maximum number of times a task can be run before it is assumed to have a fatal error and require user intervention to fix. Defaults to 2. | --n_max_retries 3 | Tasks will be run 3 times before being abandoned |
log_file | The location of the log file to be used where all debug and user displayed information is saved. Defaults to a file in the current directory with the current date and time in the file name. | --log_file ./auto_submit_log.txt | The log messages will be written to auto_submit_log.txt in the current directory |
tasks_to_run | The types of tasks to be run. Dependencies for these tasks are added automatically, so only the final task of each workflow needs to be given. Defaults to clean_up. | --tasks_to_run merge_ts, IM_calc | The tasks merge_ts, IM_calc and their dependencies will be run |
rels_to_run | The realisations to be run this time. Must be formatted as an SQL-style pattern, which uses % as the wildcard symbol. Defaults to all. | --rels_to_run AlpineF2K% | Only realisations that begin with AlpineF2K will run |
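To see how the % wildcard in rels_to_run behaves, the sketch below evaluates an SQL LIKE pattern against a few names using Python's built-in sqlite3 module. Several things here are assumptions: that the pattern is applied with LIKE, that SQLite's matching rules are representative of the workflow's database, and the realisation names themselves, which are made up to follow the FaultName_HYP01-S1244 naming shown in step 1. The helper `matching_realisations` is hypothetical.

```python
import sqlite3


def matching_realisations(names, pattern):
    """Return the names that match an SQL LIKE pattern, in input order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE rels (name TEXT)")
        conn.executemany("INSERT INTO rels VALUES (?)", [(n,) for n in names])
        rows = conn.execute(
            "SELECT name FROM rels WHERE name LIKE ? ORDER BY rowid",
            (pattern,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


# Made-up realisation names, for illustration only.
names = [
    "AlpineF2K_HYP01-S1244",
    "AlpineK2T_HYP01-S1244",
    "Opotiki02_HYP01-S1244",
]
print(matching_realisations(names, "AlpineF2K%"))
```

Note that in SQL LIKE patterns the underscore is also a wildcard, matching exactly one character, so a pattern containing _ may match more names than expected.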
The dependencies for workflow are available here.
The output displayed to the user is intended to give an overview of the progress made during the run, while the log file contains far more information, useful during debugging.
4.2 queue_monitor.py
The second script takes one argument: the path to the sim_root folder (the same one you passed to the install script).
python $gmsim/workflow/scripts/cybershake/queue_monitor.py $gmsim/RunFolder/cybershake/v18p5
Note: this script keeps running in a loop until it is killed with Ctrl-C, or until the screen socket is terminated (if you followed step 3).
If you are running the script in a 'screen' socket, press Ctrl + a, d to detach it, so you can continue with the next step in the same terminal without worrying about being disconnected.
Argument | Options | Example usage | Example results |
---|---|---|---|
sleep_time | The amount of time to sleep between update cycles, in seconds. Defaults to 5. | --sleep_time 10 | The script will sleep for 10s per cycle |
log_file | The location of the log file to be used where all debug and user displayed information is saved. Defaults to a file in the current directory with the current date and time in the file name. | --log_file ./queue_monitor_log.txt | The log messages will be written to queue_monitor_log.txt in the current directory |
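If you prefer to start both scripts from one ssh session without attaching to each socket by hand, screen can launch a command in an already detached session with its -d -m and -S options. The sketch below only builds and prints the two command lines. The session names are arbitrary choices and the helper `screen_command` is hypothetical; the script paths, sim_root and user name are the example values used above.

```python
import os
import shlex


def screen_command(session_name, script, *script_args):
    """Build the argv that runs `python script args...` in a detached screen session."""
    return ["screen", "-dmS", session_name, "python", script, *script_args]


# $gmsim is expanded from the environment; if it is unset the text is left as-is.
scripts_dir = os.path.expandvars("$gmsim/workflow/scripts/cybershake")
sim_root = os.path.expandvars("$gmsim/RunFolder/cybershake/v18p5")

commands = [
    screen_command("auto_submit", f"{scripts_dir}/auto_submit.py", sim_root, "jpa198"),
    screen_command("queue_monitor", f"{scripts_dir}/queue_monitor.py", sim_root),
]
for argv in commands:
    # Printed for review; pass argv to subprocess.run(argv, check=True) to launch.
    print(shlex.join(argv))
```

Each session can then be inspected with screen -r followed by the session name.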
5. Monitor Simulation Status
Monitor the status of each simulation by running the query script.
python $gmsim/workflow/scripts/management/query_mgmt_db.py $gmsim/RunFolder/cybershake/v18p5
It should show you something like this:
run_name | process | status | job-id | last_modified
__________________________________________________________________________________________
2012p075555 | merge_ts | in-queue | 2198889 | 2018-05-29 04:34:39
2012p075555 | BB | created | None | 2018-05-29 04:34:39
2012p075555 | IM_calculation | created | None | 2018-05-29 04:34:39
2012p075555 | HF | completed | 2198881 | 2018-05-29 21:29:21
2012p075555 | EMOD3D | failed | 2198858 | 2018-05-29 04:43:40
2012p713691 | merge_ts | created | None | 2018-05-29 04:34:40
2012p713691 | BB | created | None | 2018-05-29 04:34:40
2012p713691 | IM_calculation | created | None | 2018-05-29 04:34:40
2012p713691 | HF | completed | 2198882 | 2018-05-29 21:29:21
2012p713691 | EMOD3D | failed | 2198860 | 2018-05-29 04:44:49
2012p764736 | merge_ts | created | None | 2018-05-29 04:34:40
2012p764736 | HF | created | None | 2018-05-29 04:34:40
2012p764736 | BB | created | None | 2018-05-29 04:34:40
2012p764736 | IM_calculation | created | None | 2018-05-29 04:34:40
2012p764736 | EMOD3D | failed | 2198862 | 2018-05-29 04:44:49
2012p781523 | merge_ts | created | None | 2018-05-29 04:34:40
2012p781523 | BB | created | None | 2018-05-29 04:34:40
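For a large run it can help to condense this listing into a count per status. The following Python sketch does that for text captured from the query script. It assumes the output has one row per line with the five pipe-separated columns shown above; the function `parse_status_rows` is hypothetical and not part of the workflow.

```python
from collections import Counter

COLUMNS = ("run_name", "process", "status", "job_id", "last_modified")


def parse_status_rows(text):
    """Turn the pipe-separated listing into a list of dicts, one per task."""
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip the header, the underscore rule and anything without 5 columns.
        if len(fields) != len(COLUMNS) or fields[0] == "run_name":
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


sample = """\
run_name | process | status | job-id | last_modified
__________________________________________
2012p075555 | merge_ts | in-queue | 2198889 | 2018-05-29 04:34:39
2012p075555 | HF | completed | 2198881 | 2018-05-29 21:29:21
2012p075555 | EMOD3D | failed | 2198858 | 2018-05-29 04:43:40
"""
status_counts = Counter(row["status"] for row in parse_status_rows(sample))
print(status_counts)
```

The same rows can be filtered on the status field, for example to list only the tasks that are still in the queue.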
Use -e to show only the failed runs (with their errors):
python $gmsim/workflow/scripts/management/query_mgmt_db.py /nesi/nobackup/nesi00213/test_auto_submit -e
Run_name: 2012p075555
Process: EMOD3D
Status: failed
Job-ID: 2198858
Last_Modified: 2018-05-29 04:43:40
Error: Task removed from squeue without completion

Run_name: 2012p713691
Process: EMOD3D
Status: failed
Job-ID: 2198860
Last_Modified: 2018-05-29 04:44:49
Error: Task removed from squeue without completion