The dashboard website on hypocentre is fully working.
The dashboard database is located on hypocentre at
/nesi/project/nesi00213/dashboard.db
1. Git clone slurm_gm_workflow (https://github.com/ucgmsim/slurm_gm_workflow) and
cd slurm_gm_workflow/dashboard
2. Make sure you have a .ssh/config file and a .ssh/sockets directory on Maui/Mahuika, as the dashboard needs to ssh between the two machines.
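The prerequisite in step 2 can be checked before starting the dashboard. The following is a minimal sketch only; the function name missing_ssh_prerequisites is hypothetical and not part of slurm_gm_workflow, and it only checks that the two paths exist, not that the ssh config contents are correct.

```python
from pathlib import Path
from typing import List


def missing_ssh_prerequisites(home: Path) -> List[str]:
    """Return the step 2 prerequisite paths that are missing under `home`.

    Hypothetical helper: it only tests for existence of the file and the
    directory, it does not validate what is inside .ssh/config.
    """
    missing = []
    if not (home / ".ssh" / "config").is_file():
        missing.append(".ssh/config")
    if not (home / ".ssh" / "sockets").is_dir():
        missing.append(".ssh/sockets")
    return missing


# Example use on the machine being checked:
#     print(missing_ssh_prerequisites(Path.home()))
```

An empty list means both paths were found.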
Running dashboard:
# Sample command to run data collection into db (substitute melody.zhu with your HPC login name)
yzh231@hypocentre: python dashboard/run_data_collection.py melody.zhu /nesi/project/nesi00213/dashboard.db
# Sample command to run website:
yzh231@hypocentre: python dashboard/run_dashboard_app.py /nesi/project/nesi00213/dashboard.db
To collect data from a previous date until today:
# Sample command to collect old data from 365 days ago
# Runs faster if you use the --hpc option to collect from the 2 HPCs separately
python run_old_data_collection.py /nesi/project/nesi00213/dashboard/dashboard.db --hpc maui --days_shift 365
What to do when a red alert error is shown on the dashboard website:
Possible errors:
1. Connection to HPC timed out
2. Data shown on website is incorrect
3. TypeError: Unexpected keyword argument `sorting_type`
Solution
1. Back up the database at /nesi/project/nesi00213/dashboard.db
2. Kill the current dashboard processes
3. For error 1
a. Log in to the HPC again
b. If the error still persists after 10 minutes, contact NeSI
For error 2
a. Restart the dashboard collection and app (to rule out the dashboard running on cached data)
b. If the error still persists, kill the dashboard process, log in to the HPC again and restart the dashboard collection and app
c. If the error still persists, examine the database and the run_data_collection and run_dashboard_app code
d. If the error still persists, contact NeSI; they might have an issue with the sreport command
4. Recollect any missing data using the run_old_data_collection.py script.
5. For error 3, it might be due to deprecated attributes. Refer to https://dash.plot.ly/datatable/reference to see the latest attribute names.
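Step 1 of the solution above (backing up the database) could be scripted along these lines. This is a sketch under assumptions: the function name backup_database and the timestamped file naming are hypothetical, and it is a plain file copy, so it assumes nothing is writing to the database while the copy runs.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_database(db_path, backup_dir=None, now=None):
    """Copy `db_path` to a timestamped .bak file and return the new path.

    Hypothetical helper: the naming scheme <stem>_<YYYYmmdd_HHMMSS><suffix>.bak
    is an illustration, not an existing convention of the dashboard.
    """
    db_path = Path(db_path)
    # Default to keeping the backup next to the original database.
    backup_dir = Path(backup_dir) if backup_dir else db_path.parent
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    target = backup_dir / f"{db_path.stem}_{stamp}{db_path.suffix}.bak"
    shutil.copy2(db_path, target)  # copy2 also preserves file timestamps
    return target


# Example use:
#     backup_database("/nesi/project/nesi00213/dashboard.db")
```

Keeping the timestamp in the file name means repeated backups do not overwrite each other.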
The aim is to see a snapshot of our current HPC usage and to see how it changes over time.
Realtime
Current time
Core Hour Usage Maui/Mahuika - nn_corehour_usage
Disk utilisation for nesi00213 - nn_check_quota
Running Jobs for nesi00213 Maui/Mahuika - squeue
Capacity Used Maui (how many nodes are being used as a percentage of how many are available) – https://support.nesi.org.nz/hc/en-gb/articles/360000204116-M%C4%81ui-Slurm-Partitions
# Mahuika may be difficult as it has multiple partitions and partially filled nodes
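The Capacity Used figure described above is a simple ratio. A minimal sketch follows, assuming the two node counts have already been obtained from Slurm by some other means; the function name capacity_used_percent is hypothetical and the numbers in the example are made up.

```python
def capacity_used_percent(nodes_in_use, nodes_available):
    """Return nodes in use as a percentage of the nodes available.

    Hypothetical helper: both counts are assumed to be whole nodes, which is
    why this fits Maui better than Mahuika with its partially filled nodes.
    """
    if nodes_available <= 0:
        raise ValueError("nodes_available must be positive")
    if nodes_in_use < 0:
        raise ValueError("nodes_in_use must not be negative")
    return 100.0 * nodes_in_use / nodes_available


# Example with made-up counts: 30 of 120 nodes busy.
#     capacity_used_percent(30, 120)  # 25.0
```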
Historic - (start date, end date, time step)
For (end - start) / step periods
Display Core Hour usage per person Maui/Mahuika – ch_report.sh (there is an updated version in the master branch)
### Investigate whether ch_report.sh needs adjusting for jobs on partitions that charge more than 1 core hour per core hour used.
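The historic view splits the range between the start date and end date into periods of one time step each. The sketch below shows one way to generate those periods; the function name split_into_periods is hypothetical, and shortening the final period when the range is not an exact multiple of the step is an assumption, not something the plan specifies.

```python
from datetime import date, timedelta


def split_into_periods(start, end, step):
    """Return (period_start, period_end) pairs covering start up to end.

    Hypothetical helper: each period is `step` long, except that the last
    one is cut short at `end` if the range does not divide evenly.
    """
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    periods = []
    current = start
    while current < end:
        period_end = min(current + step, end)
        periods.append((current, period_end))
        current = period_end
    return periods


# Example: weekly periods over three weeks gives three (start, end) pairs.
#     split_into_periods(date(2020, 1, 1), date(2020, 1, 22), timedelta(days=7))
```

Each pair could then be passed as the start and end of one core hour report.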
Progress
Tasks 1-3 are done; HPC core_hour usage etc. is now refreshed every 60s on the Jira screen
Todo
1. Plot core hour usage
2. Create a UI, e.g. a website, to display/download plots/data