...

  1. Make sure the submitted job has already finished by using llq.
    (The new ll script appends the rup_model name to the job name, so a specific command can test whether a specific job is still in the LoadLeveler queue.)

    To show the job names of all jobs belonging to user 'ykh22':
    Code Block
    llq -l -u ykh22 | grep 'Job Name:'

    Pipe it through a second grep to determine whether a specific job is completed.
    Let's say we are looking for AlpineF2K_HYP06-21_S1404:

    Code Block
    llq -l -u ykh22 | grep 'Job Name: postprocess' | grep 'AlpineF2K_HYP06-21_S1404'

    The output will be empty if the job is not in the queue; otherwise the job name is shown on screen:

    Code Block
    Job Name: postprocess_AlpineF2K_HYP06-21_S1404
  2. Check the count of completed Vel files by using `ls` and `wc`.

    Code Block
    ls LF/AlpineK2T_HYP10-10_S1514/Vel/ | wc
          6657    6657   98076  

    Then compare it with the station count within the domain:

    Code Block
    cat fd_rt01-h0.400.ll | wc
        8550   25650  271714

    In this example we have 8550 stations and only 2219 stations finished (6657 / 3).
    Since 8550 / 2219 is about 3.9, it is safe to assume that with more than 4 to 4.5 times the current WCT (wall clock time), the job should finish on the next submission.

  3. Change the WCT in "the templates" to the multiplied value (so that all jobs submitted afterwards will use the new WCT).

    Code Block: original post_emod3d_mpi.ll.template
    # @ wall_clock_limit     = 0:20:00

    to

    Code Block
    # @ wall_clock_limit     = 1:30:00
  4. Re-submit the jobs for all srf in that simulation.

    Code Block
    echo "1" | ./submit_post_emod3d.sh
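
The estimate in steps 2 and 3 above can be sketched as a few shell helpers. This is a minimal sketch, not part of the workflow scripts: the function names `finished_stations`, `total_stations` and `scale_wct` are hypothetical, and it assumes three output files per finished station and one station per line in the station file, as the example above suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the workflow scripts).

# Finished stations, assuming three output files per station.
# usage: finished_stations VEL_DIR
finished_stations() {
    local files
    files=$(ls "$1" | wc -l)
    echo $(( files / 3 ))
}

# Total stations, assuming one station per line in the station file.
# usage: total_stations STATION_FILE
total_stations() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines ))
}

# Scale an H:MM:SS wall clock limit by TOTAL/FINISHED, rounded up,
# with an extra safety margin given in percent.
# usage: scale_wct H:MM:SS TOTAL FINISHED MARGIN_PERCENT
scale_wct() {
    local h m s secs new
    IFS=: read -r h m s <<< "$1"
    if [ "$3" -le 0 ]; then
        echo "no finished stations, cannot estimate" >&2
        return 1
    fi
    # 10# forces base 10 so that values such as "08" are not read as octal.
    secs=$(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
    # Ceiling of secs * total * (100 + margin) / (finished * 100).
    new=$(( (secs * $2 * (100 + $4) + $3 * 100 - 1) / ($3 * 100) ))
    printf '%d:%02d:%02d\n' $(( new / 3600 )) $(( new % 3600 / 60 )) $(( new % 60 ))
}
```

With the numbers from the example, `scale_wct 0:20:00 8550 2219 0` prints 1:17:04, so the 1:30:00 limit used in step 3 leaves some margin on top of the bare ratio.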

HF:

...

  1. Make sure the submitted job has already finished by looking at llq.
    Let's say we are looking for AlpineF2K_HYP06-21_S1404:

    Code Block
     llq -l -u ykh22 | grep 'Job Name: run_hf_mpi' | grep 'AlpineF2K_HYP06-21_S1404'

    The output will be empty if the job is not in the queue; otherwise the job name is shown on screen:

    Code Block
    Job Name: run_hf_mpi_Cant1D_v2-midQ_leer_hfnp2mm+_rvf0p8_sd50_k0p045__AlpineF2K_HYP06-21_S1404
  2. Check the count of completed files in VelAcc by using `ls` and `wc`.

    Code Block
    ls LFHF/Cant1D_v2-midQ_leer_hfnp2mm+_rvf0p8_sd50_k0p045/AlpineK2T_HYP10-10_S1514/VelAcc/ | wc
          6657    6657   98076

    Then compare it with the station count within the domain:

    Code Block
    cat fd_rt01-h0.400.ll | wc
        8550   25650  271714

    In this example we have 8550 stations and only 2219 stations finished (6657 / 3).

    Since 8550 / 2219 is about 3.9, it is safe to assume that with more than 4 to 4.5 times the current WCT, the job should finish on the next submission.

  3. Change the WCT in "the templates" to the multiplied value (so that all jobs submitted afterwards will use the new WCT).

    Code Block
    # @ wall_clock_limit     = 1:00:00 

    to

    Code Block
    # @ wall_clock_limit     = 4:30:00
  4. Re-submit the jobs for all srf in that simulation.

    Code Block
    echo "1" | ./submit_hf.sh
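
The template change in step 3 above can also be scripted rather than edited by hand. The following is a minimal sketch under stated assumptions: `set_wct` is a hypothetical helper (not part of the workflow), and it assumes the template contains a line of the form `# @ wall_clock_limit = H:MM:SS` as shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the wall_clock_limit line of a job template.
# Only lines of the form "# @ wall_clock_limit = ..." are changed.
# usage: set_wct TEMPLATE_FILE H:MM:SS
set_wct() {
    local template=$1 wct=$2 tmp
    tmp=$(mktemp) || return 1
    # Keep everything up to and including "=", then substitute the new limit.
    sed "s/^\(#[[:space:]]*@[[:space:]]*wall_clock_limit[[:space:]]*=\).*/\1 $wct/" \
        "$template" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Keep a backup of the original template before overwriting it.
    cp "$template" "$template.bak" && cat "$tmp" > "$template"
    local status=$?
    rm -f "$tmp"
    return $status
}
```

A call would look like `set_wct path/to/template 4:30:00`, where the template path is a placeholder. The original file is kept next to it with a `.bak` suffix, so the change can be reverted.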

...

  1. IMPORTANT: before running the batch BB submission, make sure all LF and HF jobs for all runs in the list_vm are done.

    Code Block
    /nesi/projects/nesi00213/RunFolder/Cybershake/workflow/devel/cybershake/submit_cybershake_bb.sh /nesi/projects/nesi00213/RunFolder/Cybershake/v17p9/Runs /nesi/projects/nesi00213/RunFolder/Cybershake/v17p9/Data/list_vma


    1.1 If only a specific run's LF and HF are finished and you prefer to run BB for that run only, cd to the simulation folder and run ./submit_bb.sh.

  2. Run test_cybershake_bb.sh to test which runs have finished.

    The script takes 2 args: 1. the path to the Runs folder, 2. the list of vms (so it will not run for all the unnecessary runs).

    Code Block
    /nesi/projects/nesi00213/RunFolder/Cybershake/workflow/devel/cybershake/test_cybershake_bb.sh /nesi/projects/nesi00213/RunFolder/Cybershake/v17p9/Runs /nesi/projects/nesi00213/RunFolder/Cybershake/v17p9/Data/list_vma 2>&1 | tee /nesi/projects/nesi00213/RunFolder/Cybershake/v17p9/test_bb_vma.log

    This prints the test results on screen and also writes them to a log file, namely "/nesi/projects/nesi00213/RunFolder/Cybershake/v17p9/test_bb_vma.log".
    (`2>&1` merges the error output into the standard output, and `| tee` then writes both to the screen and to the file.)
    Note: change the file name and location to suit your own requirements.
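
To see what the `2>&1 | tee` part of the command does in isolation, here is a minimal sketch. Both function names are hypothetical; `fake_test_script` merely stands in for the real test script by writing one line to stdout and one to stderr.

```shell
#!/usr/bin/env bash
# Stand-in for the real test script (hypothetical): one line goes to
# stdout, one to stderr.
fake_test_script() {
    echo "run_A: completed"
    echo "run_B: missing files" >&2
}

# Run a command, show its stdout and stderr on screen, and save both to a log.
# usage: run_and_log LOG_FILE COMMAND [ARGS...]
run_and_log() {
    local log=$1
    shift
    # 2>&1 merges stderr into stdout; tee copies the merged stream to the log.
    "$@" 2>&1 | tee "$log"
}
```

Running `run_and_log /tmp/demo.log fake_test_script` shows both lines on screen and leaves both in the log. Without `2>&1`, the stderr line would still appear on screen but would be missing from the log file.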

     

...

Resuming BB

  1. Make sure the submitted job has already finished by looking at llq.

    Code Block
    llq -l -u ykh22 | grep 'Job Name: run_bb_mpi' | grep 'AlpineF2K_HYP06-21_S1404'
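
    The output will be empty if the job is not in the queue; otherwise the job name is shown on screen:

    Code Block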
    Job Name: run_bb_mpi_Cant1D_v2-midQ_leer_hfnp2mm+_rvf0p8_sd50_k0p045__AlpineF2K_HYP06-21_S1404
  2. Check the count of completed Vel files by using `ls` and `wc`.

    Code Block
    ls HF/Cant1D_v2-midQ_leer_hfnp2mm+_rvf0p8_sd50_k0p045/AlpineK2T_HYP10-10_S1514/Vel/ | wc
          6657    6657   98076

    Then compare it with the station count within the domain:

    Code Block
    cat fd_rt01-h0.400.ll | wc
        8550   25650  271714

    In this example we have 8550 stations and only 2219 stations finished (6657 / 3).

    Since 8550 / 2219 is about 3.9, it is safe to assume that with more than 4 to 4.5 times the current WCT, the job should finish on the next submission.

     

  3. Change the WCT in "the templates" (so that all jobs submitted afterwards will use the new WCT).

    Code Block
    # @ wall_clock_limit     = 1:00:00

    to

    Code Block
    # @ wall_clock_limit     = 4:30:00
  4. Re-submit the jobs for all srf in that simulation.

    Code Block
    echo "1" | ./submit_bb.sh
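
The queue check in step 1 above can be wrapped in a function so that its exit status can be used in other scripts. This is a minimal sketch: `filter_job_names` and `job_in_queue` are hypothetical names, and the only `llq` invocation used is the `llq -l -u USER` form shown above. `grep -F` is used so that characters such as `+` in the job name are matched literally.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the workflow scripts).

# Read an `llq -l` listing on stdin and print the "Job Name:" lines that
# start with the given prefix and contain the given run name.
# usage: llq -l -u USER | filter_job_names PREFIX RUN_NAME
filter_job_names() {
    grep -F "Job Name: $1" | grep -F -- "$2"
}

# Succeeds (exit status 0) if a matching job is still in the queue.
# usage: job_in_queue USER PREFIX RUN_NAME
job_in_queue() {
    llq -l -u "$1" | filter_job_names "$2" "$3" > /dev/null
}
```

For example, `job_in_queue ykh22 run_bb_mpi AlpineF2K_HYP06-21_S1404 || echo "not in queue"` prints the message only once the job has left the queue.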


TODO:

  • add script to auto test all simulations and submit the next step
    • (currently need to run the test script and submit the next step manually)
  • A script to adjust WCT for HF
    • currently HF has a hard-coded/static WCT
    • (multiple re-submission of HF is needed if the boundary is large, more than 7 times for Alpine simulations)
  • A script to check if a job is still running (or in the queue)
    • currently the user needs to check that manually
    • a script to bulk check may help with automation