Some of the information here is outdated. For the latest information, see Dashboard update.


Current Status

The dashboard website on hypocentre is fully working.

The dashboard database is located on hypocentre at

/nesi/project/nesi00213/dashboard.db

Setup

1. Git clone slurm_gm_workflow from https://github.com/ucgmsim/slurm_gm_workflow and

cd slurm_gm_workflow/dashboard


2. Make sure you have a .ssh/config file and a .ssh/sockets directory on Maui/Mahuika, as we need to ssh between the two machines (see the sketch below).
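
A minimal sketch of the ssh setup, assuming the host alias, hostname and username below are placeholders you replace with your own; ControlMaster/ControlPath is one common way to get reusable sockets:

# Example ~/.ssh/config entry on the machine that initiates the connection
# (Host alias, HostName and User are placeholders)
Host maui
    HostName login.maui.nesi.org.nz
    User your_hpc_username
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist yes

# Create the sockets directory referenced above
mkdir -p ~/.ssh/sockets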


Running dashboard:

  1. Keep sockets open (see the sketch after the sample commands below):
    1. hypocentre → maui
    2. hypocentre → mahuika
    3. maui → mahuika
  2. Open 2 terminals and run the following commands. The dashboard directory must be in your PYTHONPATH, and dash must be installed (`pip install dash`).
# Sample command to run data collection into the db (substitute melody.zhu with your HPC login name)

yzh231@hypocentre: python dashboard/run_data_collection.py melody.zhu /nesi/project/nesi00213/dashboard.db

# Sample command to run website:
yzh231@hypocentre: python dashboard/run_dashboard_app.py /nesi/project/nesi00213/dashboard.db
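
A hedged sketch of the socket and environment setup referenced above; the host aliases (maui, mahuika) and the repository path are placeholders, and ControlMaster/ControlPersist is just one way to keep the connections open:

# On hypocentre: open persistent background connections to both HPCs
ssh -Nf -o ControlMaster=auto -o ControlPersist=yes -o ControlPath=~/.ssh/sockets/%r@%h-%p maui
ssh -Nf -o ControlMaster=auto -o ControlPersist=yes -o ControlPath=~/.ssh/sockets/%r@%h-%p mahuika

# On maui: open a persistent connection to mahuika the same way
ssh -Nf -o ControlMaster=auto -o ControlPersist=yes -o ControlPath=~/.ssh/sockets/%r@%h-%p mahuika

# Before running the scripts: make the repository importable and install dash
export PYTHONPATH=/path/to/slurm_gm_workflow:$PYTHONPATH
pip install dash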


To collect data from a previous date until today:

# Sample command to collect old data from 365 days ago
# Runs faster if the --hpc option is used to collect from the 2 HPCs separately
python run_old_data_collection.py /nesi/project/nesi00213/dashboard/dashboard.db --hpc maui --days_shift 365
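
For example, to collect from both HPCs separately you could run the two commands below in separate terminals (this assumes mahuika is also an accepted value of --hpc; check the script's help if unsure):

# Collect old Maui data from the last 365 days
python run_old_data_collection.py /nesi/project/nesi00213/dashboard/dashboard.db --hpc maui --days_shift 365

# Collect old Mahuika data over the same period
python run_old_data_collection.py /nesi/project/nesi00213/dashboard/dashboard.db --hpc mahuika --days_shift 365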


What to do when a red alert error is shown on the dashboard website:

Possible errors:

1. Connection to HPC timed out

2. Data shown on website is incorrect

3. TypeError: Unexpected keyword argument `sorting_type`

Solution

1. Back up the database at /nesi/project/nesi00213/dashboard.db (see the sketch after this list)

2. Kill the current dashboard processes

3. For error 1

            a. Re-login to the HPC

            b. If the error still persists after 10 minutes, contact NeSI

    For error 2

            a. Restart the dashboard collection and app (to rule out the dashboard serving cached data)

            b. If the error still persists, kill the dashboard processes, re-login to the HPC and restart the dashboard collection and app

            c. If the error still persists, examine the database and the run_data_collection and run_dashboard_app code

            d. If the error still persists, contact NeSI; they might have an issue with the sreport command

4. Recollect any missing data using the run_old_data_collection.py script.

5. For error 3, it might be due to deprecated attributes. Refer to https://dash.plot.ly/datatable/reference for the latest attribute names.
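
A minimal sketch of the backup/kill/restart steps above, assuming the scripts were started as in the sample commands earlier (the backup filename and login name are placeholders):

# 1. Back up the database with a timestamped copy
cp /nesi/project/nesi00213/dashboard.db /nesi/project/nesi00213/dashboard.db.bak_$(date +%Y%m%d)

# 2. Find and kill the running dashboard processes
pgrep -af "run_data_collection.py|run_dashboard_app.py"
pkill -f "run_data_collection.py|run_dashboard_app.py"

# 3. Restart the collection and the website (in two terminals, as in the Setup section)
python dashboard/run_data_collection.py your.name /nesi/project/nesi00213/dashboard.db
python dashboard/run_dashboard_app.py /nesi/project/nesi00213/dashboard.db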


Background

The goal is to be able to see a snapshot of our current HPC usage and how it changes over time.

Notes

Realtime

Current time

Core Hour Usage Maui/Mahuika - nn_corehour_usage

Disk utilisation for nesi00213 - nn_check_quota

Running Jobs for nesi00213 Maui/Mahuika - squeue

Capacity Used Maui (how many nodes are being used as a percentage of how many are available) – https://support.nesi.org.nz/hc/en-gb/articles/360000204116-M%C4%81ui-Slurm-Partitions # Mahuika may be difficult as it has multiple partitions and partially filled nodes
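
For reference, the running-jobs snapshot boils down to a squeue call along these lines (standard Slurm options; the exact flags used by the dashboard may differ):

# Running jobs for the nesi00213 account on both clusters
squeue --account=nesi00213 --clusters=maui,mahuika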

Historic - (start date, end date, time step)

For (end - start) / step periods

Display Core Hour usage per person Maui/Mahuika – ch_report.sh (there is an updated version in the master branch)

TODO: Investigate whether ch_report.sh needs adjusting for jobs on partitions that charge more than 1 core hour per core hour of use.
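
For the historic per-person usage, the underlying Slurm query is roughly of this form (ch_report.sh may use different options; the dates are placeholders):

# Core hours per user for the nesi00213 account over a date range
sreport -t Hours cluster AccountUtilizationByUser Accounts=nesi00213 Start=2019-01-01 End=2019-12-31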

Tasks

  1. From a local (UC) computer, poll the above scripts to get the data
  2. Print (and log) the data in a useful/readable format
  3. Create a cronjob to poll the data (and display it on the laptop Jira screen); see the sketch after this list
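
A hedged crontab sketch for task 3, assuming a hypothetical wrapper script that calls the polling scripts and a roughly one-minute refresh (script name and paths are placeholders):

# Poll the usage scripts every minute and log the output for the Jira screen
* * * * * /path/to/poll_hpc_usage.sh >> /path/to/hpc_usage.log 2>&1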


Progress

Tasks 1-3 are done; HPC core hour usage etc. is now refreshed every 60 s on the Jira screen.


Todo

1. Plot core hour usage

2. Create a UI, e.g. a website, to display/download plots/data
