Unlike NeSI and TACC, which use SLURM, KISTI Nurion uses PBS (Portable Batch System). This means the commands for job submission and for queue (and job) monitoring differ.
A job can only be submitted from /scratch.
A PBS script usually has the extension .sh and must contain all of the options below.
| Option | Description |
| --- | --- |
| #PBS -V | Keep the environment variables |
| #PBS -N | Set the name of the job |
| #PBS -q | Set the queue for the job |
| #PBS -l | Set the compute resources, e.g. `select=4:ncpus=32:mpiprocs=32:ompthreads=1`, where `select` is the number of nodes, `ncpus` is the number of processes × threads per node, and `mpiprocs` is the number of MPI processes per node. NOTE: for the Python multiprocessing module, try a setting like `select=1:ncpus=64:mpiprocs=4:ompthreads=16`; this was tested with VM generation running 4 Python multiprocessing processes, each deploying 16 OpenMP threads. |
| #PBS -A | Add info about the job (for statistical purposes). For QuakeCoRE this will be "inhouse". |
```bash
#!/bin/sh
#PBS -N IntelMPI_job
#PBS -V
#PBS -q normal
#PBS -A inhouse
#PBS -l select=4:ncpus=32:mpiprocs=32:ompthreads=1
#PBS -l walltime=04:00:00  # normal queue maximum is 48h

cd $PBS_O_WORKDIR

module purge
module load craype-mic-knl intel/18.0.3 impi/18.0.3 python/3.7.0

mpirun ./test_mpi
```
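For example, assuming the script above is saved as `intel_mpi_job.sh` (the filename and run directory here are only illustrative), it would be submitted from a directory under /scratch and then monitored roughly like this:

```bash
cd /scratch/$USER/my_run_dir   # illustrative path; jobs must be submitted from /scratch
qsub intel_mpi_job.sh
qstat -u $USER                 # check the job status (see the commands below)
```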
Environment variables
| Variable | Description |
| --- | --- |
| PBS_JOBID | Job ID |
| PBS_JOBNAME | Job name assigned by the user |
| PBS_NODEFILE | Contains a list of compute nodes allocated to the job |
| PBS_O_PATH | Value of PATH from the submission environment |
| PBS_O_WORKDIR | Absolute path of the directory where `qsub` was executed |
| TMPDIR | The job-specific temporary directory for this job. Defaults to /tmp/pbs.job_id on the vnodes. |
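As a minimal sketch (not from the original page; the job name, output file name and walltime are illustrative), these variables can be used inside a job script, e.g. to run from the submission directory and record which nodes were allocated:

```bash
#!/bin/bash
#PBS -N env_demo
#PBS -V
#PBS -q normal
#PBS -A inhouse
#PBS -l select=1:ncpus=1:mpiprocs=1:ompthreads=1
#PBS -l walltime=00:01:00

# Run from the directory where qsub was executed
cd $PBS_O_WORKDIR

# Record which job produced this output and which nodes were allocated
echo "Job $PBS_JOBID ($PBS_JOBNAME) ran on:" > $PBS_O_WORKDIR/nodes_$PBS_JOBID.txt
cat $PBS_NODEFILE >> $PBS_O_WORKDIR/nodes_$PBS_JOBID.txt
```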
Useful Commands
qsub hello.sh : Submit hello.sh
qdel <jobid> : Cancel the job
qsig -s <suspend/resume> <job id> : Suspend or resume the job
showq : Show queue
pbs_status : Show idle resource per queue
pbs_queue_check : Show list of queues that can be used with the current account
| Option | Description |
| --- | --- |
| qstat -u <username> | Show only the given user's jobs |
| qstat -T | Show the remaining time of jobs in the queue |
| qstat -i | Show only jobs in Q/H state |
| qstat -f | Show job details |
| qstat -x | Show completed jobs |
```
(python3_nurion) [x1746a08@login04 Hossack_HYP01-10_S1244]$ qstat -u x1746a08

pbs:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3811446.pbs     x1746a08 normal   run_emod3d  54332   3 204    --  00:33 R 00:00
```

To stop all of the user's jobs:

```
(python3_nurion) [hpc11a02@login03 v20p4p90]$ qselect -u $USER | xargs qdel
```
Supplying arguments to a PBS script
qsub -v arg1="$var1/path",arg2='$foo',arg3=3 -otherflags script.sh
Then you can access the values of arg1, arg2 and arg3 inside the script as $arg1, $arg2 and $arg3 (a sketch of the receiving script is shown after the quoting notes below).
Quoting behaves as in bash:
" " will expand the variables inside into their actual values.
' ' will be taken literally.
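As a minimal sketch of the receiving side (the resource options and echoed strings are illustrative; the variable names follow the qsub example above), script.sh can simply read the passed values as environment variables:

```bash
#!/bin/bash
#PBS -N arg_demo
#PBS -V
#PBS -q normal
#PBS -A inhouse
#PBS -l select=1:ncpus=1:mpiprocs=1:ompthreads=1
#PBS -l walltime=00:01:00

cd $PBS_O_WORKDIR

# Values supplied via qsub -v arrive as ordinary environment variables
echo "arg1=$arg1"   # expanded at submission time because of the double quotes
echo "arg2=$arg2"   # the literal string $foo because of the single quotes
echo "arg3=$arg3"   # 3
```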
Parallel Loop: Embarrassingly parallel execution
Bash For
This works, but it has not been verified that it actually launches 4 separate processes (a way to check this is sketched after the script). Note the `&` at the end of the command and the `wait` after the for loop.
```bash
#!/bin/bash
# script version: pbs
#PBS -N par_loop
#PBS -V
#PBS -q normal
#PBS -A inhouse
#PBS -l select=1:ncpus=4:mpiprocs=4:ompthreads=1
#PBS -l walltime=00:00:05
#PBS -W sandbox=PRIVATE

for i in `seq 4`; do
    python $PBS_O_WORKDIR/hello.py $i > $PBS_O_WORKDIR/outfile$i &
done
wait
```
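To check whether four separate processes are really launched (a sketch, not part of the original script), the PID of each backgrounded command can be logged via `$!`; four distinct PIDs in the job's output indicate four separate processes:

```bash
# Same loop as above, with the PID of each backgrounded command echoed to the job's stdout
for i in `seq 4`; do
    python $PBS_O_WORKDIR/hello.py $i > $PBS_O_WORKDIR/outfile$i &
    echo "iteration $i started as PID $!"
done
wait
```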
Job Arrays
qsub -J 1-8 my_job.sh
This runs my_job.sh 8 times with 8 different IDs. Inside the script my_job.sh, this ID is available as $PBS_ARRAY_INDEX.
A range with a step can also be used, e.g. `1-8:2` - the jobs submitted will include { 1 3 5 7 }.
```bash
#!/bin/bash
# script version: pbs
#PBS -N job_array
#PBS -V
#PBS -q normal
#PBS -A inhouse
#PBS -l select=1:ncpus=1:mpiprocs=1:ompthreads=1
#PBS -l walltime=00:00:05
#PBS -W sandbox=PRIVATE

python $PBS_O_WORKDIR/hello.py ${PBS_ARRAY_INDEX} > $PBS_O_WORKDIR/outfile$PBS_ARRAY_INDEX
```
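PBS Pro gives a job array an ID of the form `<number>[].<server>`. As a sketch (the job ID below is illustrative), the array and its subjobs can be listed with `qstat -t`, and the whole array cancelled with `qdel`:

```bash
# Submit the array job
qsub -J 1-8 my_job.sh
# -> e.g. 3811500[].pbs

# List the array and its individual subjobs
qstat -t "3811500[]"

# Cancel the entire array
qdel "3811500[]"
```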
Example LF/HF/BB PBS scripts
```bash
#!/bin/bash
#PBS -N run_emod3d.Hossack_HYP01-10_S1244
#PBS -V
#PBS -q normal
#PBS -A inhouse
#PBS -l select=3:ncpus=64:mpiprocs=64:ompthreads=1
#PBS -l walltime=00:33:00
#PBS -W sandbox=PRIVATE

module purge
module add craype-network-opa intel/18.0.3 craype-mic-knl impi/18.0.3 python/3.7

export gmsim_root=/home01/x1746a08/gmsim
source $gmsim_root/Environments/virt_envs/python3_nurion/bin/activate

SUCCESS_CODE=0

export outfile=$PBS_O_WORKDIR/result_lf
touch $outfile
rm $outfile

export runtime_fmt="%Y%m%d_%H%M%S"
echo `date +$runtime_fmt` >>$outfile

mpirun $gmsim_root/tools/emod3d-mpi_v3.0.4 -args "par=$PBS_O_WORKDIR/LF/e3d.par"

end_time=`date +$runtime_fmt`
echo $end_time >>$outfile

#run test script and update mgmt_db
#test before update
ln -s $PBS_O_WORKDIR/LF/e3d.par $PBS_O_WORKDIR/LF/OutBin/e3d.par
timestamp=`date +$runtime_fmt`
test_cmd="$gmsim/workflow/scripts/test_emod3d.sh $PBS_O_WORKDIR Hossack_HYP01-10_S1244"
res=`$test_cmd`
success=$?

# Below is to work-around the cacheing issue on Maui.
#if [[ $success == $SUCCESS_CODE ]]; then
#    sleep 2
#    echo "Success 1" >> $outfile
#    res=`$test_cmd`
#    success=$?
#fi

if [[ $success == $SUCCESS_CODE ]]; then
    #passed
    echo "Success:" $res >> $outfile
else
    echo "Fail" $res >> $outfile
fi
```
```bash
#!/bin/bash
#PBS -N sim_hf.Hossack_HYP01-10_S1244
#PBS -V
#PBS -q normal
#PBS -A inhouse
#PBS -l select=4:ncpus=64:mpiprocs=64:ompthreads=1
#PBS -l walltime=00:30:00
#PBS -W sandbox=PRIVATE

module purge
module add craype-network-opa intel/18.0.3 craype-mic-knl impi/18.0.3 python/3.7

export gmsim_root=/home01/x1746a08/gmsim
source $gmsim_root/Environments/virt_envs/python3_nurion/bin/activate

export outfile=$PBS_O_WORKDIR/result_hf
touch $outfile
rm $outfile

runtime_fmt="%Y-%m-%d_%H:%M:%S"
start_time=`date +$runtime_fmt`
echo $start_time >> $outfile

mkdir -p $PBS_O_WORKDIR/HF/Acc
mpirun python $gmsim/workflow/scripts/hf_sim.py $PBS_O_WORKDIR/../fd_rt01-h0.400.ll $PBS_O_WORKDIR/HF/Acc/HF.bin -m $gmsim_root/VelocityModel/Mod-1D/Cant1D_v3-midQ_OneRay.1d --duration 36.42 --dt 0.005 --sim_bin $gmsim_root/tools/hb_high_binmod_v5.4.5 --version 5.4.5 --dt 0.005 --rvfac 0.8 --sdrop 50 --path_dur 1 --kappa 0.045 --seed 34580 --slip $PBS_O_WORKDIR/../../../Data/Sources/Hossack/Stoch/Hossack_HYP01-10_S1244.stoch

end_time=`date +$runtime_fmt`
echo $end_time >> $outfile

timestamp=`date +%Y%m%d_%H%M%S`

#test before update
test_cmd="$gmsim/workflow/scripts/test_hf.sh $PBS_O_WORKDIR"
echo $test_cmd >> $outfile
res=`$test_cmd`

if [[ $? == 0 ]]; then
    #passed
    echo "Success:" $res >> $outfile
else
    echo "Fail" $res >> $outfile
fi
```
```bash
#!/bin/bash
# BB calculation
#PBS -N sim_bb.Hossack_HYP01-10_S1244
#PBS -V
#PBS -q normal
#PBS -A inhouse
#PBS -l select=4:ncpus=64:mpiprocs=64:ompthreads=1
#PBS -l walltime=00:30:00
#PBS -W sandbox=PRIVATE

module purge
module add craype-network-opa intel/18.0.3 craype-mic-knl impi/18.0.3 python/3.7

export gmsim_root=/home01/x1746a08/gmsim
source $gmsim_root/Environments/virt_envs/python3_nurion/bin/activate

export outfile=$PBS_O_WORKDIR/result_bb
touch $outfile
rm $outfile

runtime_fmt="%Y-%m-%d_%H:%M:%S"
start_time=`date +$runtime_fmt`
echo $start_time >> $outfile

mkdir -p $PBS_O_WORKDIR/HF/Acc

start_time=`date +$runtime_fmt`
echo $start_time >> $outfile
echo "Computing BB"

mkdir -p $PBS_O_WORKDIR/BB/Acc
mpirun python $gmsim/workflow/scripts/bb_sim.py $PBS_O_WORKDIR/LF/OutBin $PBS_O_WORKDIR/../../../Data/VMs/Hossack $PBS_O_WORKDIR/HF/Acc/HF.bin $gmsim_root/StationInfo/non_uniform_whole_nz_with_real_stations-hh400_v18p6.vs30 $PBS_O_WORKDIR/BB/Acc/BB.bin --flo 0.25 --version 3.0.4 --site_specific False --fmin 0.2 --fmidbot 0.5 --lfvsref 500.0

end_time=`date +$runtime_fmt`
echo $end_time

timestamp=`date +%Y%m%d_%H%M%S`

#test before update
test_cmd="$gmsim/workflow/scripts/test_bb.sh $PBS_O_WORKDIR"
echo $test_cmd >> $outfile
res=`$test_cmd`

if [[ $? == 0 ]]; then
    #passed
    echo "Success:" $res >> $outfile
else
    echo "Fail" $res >> $outfile
fi
```