Some of the information here is outdated. For the latest information, see Dashboard update.


Current Status

The dashboard website on hypocentre is fully working.

The dashboard database is located on hypocentre at

/nesi/project/nesi00213/dashboard.db

Setup

1. Git clone slurm_gm_workflow from https://github.com/ucgmsim/slurm_gm_workflow and

cd slurm_gm_workflow/dashboard


2. Make sure you have a .ssh/config file and a .ssh/sockets directory on Maui/Mahuika, as we need to ssh between the two machines (see the sketch below).
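
A minimal sketch of the ssh setup, assuming the host alias, hostname and username below are placeholders you replace with your own; ControlMaster/ControlPath is one common way to get reusable sockets:

# Example ~/.ssh/config entry on the machine that initiates the connection
# (Host alias, HostName and User are placeholders)
Host maui
    HostName login.maui.nesi.org.nz
    User your_hpc_username
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist yes

# Create the sockets directory referenced above
mkdir -p ~/.ssh/sockets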


Running dashboard:

  1. Keep sockets open (see the sketch after the sample commands below):
    1. hypocentre → maui
    2. hypocentre → mahuika
    3. maui → mahuika
  2. Open 2 terminals and run the following commands. The dashboard directory must be in your PYTHONPATH, and dash must be installed (`pip install dash`).
# Sample command to run data collection into the db (substitute melody.zhu with your HPC login name)

yzh231@hypocentre: python dashboard/run_data_collection.py melody.zhu /nesi/project/nesi00213/dashboard.db

# Sample command to run website:
yzh231@hypocentre: python dashboard/run_dashboard_app.py /nesi/project/nesi00213/dashboard.db
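
A hedged sketch of the socket and environment setup referenced above; the host aliases (maui, mahuika) and the repository path are placeholders, and ControlMaster/ControlPersist is just one way to keep the connections open:

# On hypocentre: open persistent background connections to both HPCs
ssh -Nf -o ControlMaster=auto -o ControlPersist=yes -o ControlPath=~/.ssh/sockets/%r@%h-%p maui
ssh -Nf -o ControlMaster=auto -o ControlPersist=yes -o ControlPath=~/.ssh/sockets/%r@%h-%p mahuika

# On maui: open a persistent connection to mahuika the same way
ssh -Nf -o ControlMaster=auto -o ControlPersist=yes -o ControlPath=~/.ssh/sockets/%r@%h-%p mahuika

# Before running the scripts: make the repository importable and install dash
export PYTHONPATH=/path/to/slurm_gm_workflow:$PYTHONPATH
pip install dash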


To collect data from a previous date until today:

# Sample command to collect old data from 365 days ago
# Runs faster if the --hpc option is used to collect from the 2 HPCs separately
python run_old_data_collection.py /nesi/project/nesi00213/dashboard/dashboard.db --hpc maui --days_shift 365
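
For example, to collect from both HPCs separately you could run the two commands below in separate terminals (this assumes mahuika is also an accepted value of --hpc; check the script's help if unsure):

# Collect old Maui data from the last 365 days
python run_old_data_collection.py /nesi/project/nesi00213/dashboard/dashboard.db --hpc maui --days_shift 365

# Collect old Mahuika data over the same period
python run_old_data_collection.py /nesi/project/nesi00213/dashboard/dashboard.db --hpc mahuika --days_shift 365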


What to do when a red alert error is shown on the dashboard website:

Possible errors:

1. Connection to HPC timed out

2. Data shown on website is incorrect

3. TypeError: Unexpected keyword argument `sorting_type`

Solution

1. Back up the database at /nesi/project/nesi00213/dashboard.db (see the sketch after this list)

2. Kill the current dashboard processes

3. For error 1

            a. Re-login to the HPC

            b. If the error still persists after 10 minutes, contact NeSI

    For error 2

            a. Restart the dashboard collection and app (to rule out the dashboard serving cached data)

            b. If the error still persists, kill the dashboard processes, re-login to the HPC and restart the dashboard collection and app

            c. If the error still persists, examine the database and the run_data_collection and run_dashboard_app code

            d. If the error still persists, contact NeSI; they might have an issue with the sreport command

4. Recollect any missing data using the run_old_data_collection.py script.

5. For error 3, it might be due to deprecated attributes. Refer to https://dash.plot.ly/datatable/reference for the latest attribute names.
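
A minimal sketch of the backup/kill/restart steps above, assuming the scripts were started as in the sample commands earlier (the backup filename and login name are placeholders):

# 1. Back up the database with a timestamped copy
cp /nesi/project/nesi00213/dashboard.db /nesi/project/nesi00213/dashboard.db.bak_$(date +%Y%m%d)

# 2. Find and kill the running dashboard processes
pgrep -af "run_data_collection.py|run_dashboard_app.py"
pkill -f "run_data_collection.py|run_dashboard_app.py"

# 3. Restart the collection and the website (in two terminals, as in the Setup section)
python dashboard/run_data_collection.py your.name /nesi/project/nesi00213/dashboard.db
python dashboard/run_dashboard_app.py /nesi/project/nesi00213/dashboard.db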


Background

The goal is to be able to see a snapshot of our current HPC usage and how it changes over time.

Notes

Realtime

Current time

Core Hour Usage Maui/Mahuika - nn_corehour_usage

Disk utilisation for nesi00213 - nn_check_quota

Running Jobs for nesi00213 Maui/Mahuika - squeue

Capacity Used Maui (how many nodes are being used as a percentage of how many are available) – https://support.nesi.org.nz/hc/en-gb/articles/360000204116-M%C4%81ui-Slurm-Partitions # Mahuika may be difficult as it has multiple partitions and partially filled nodes
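
For reference, the running-jobs snapshot boils down to a squeue call along these lines (standard Slurm options; the exact flags used by the dashboard may differ):

# Running jobs for the nesi00213 account on both clusters
squeue --account=nesi00213 --clusters=maui,mahuika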

Historic - (start date, end date, time step)

For (end - start) / step periods

Display Core Hour usage per person Maui/Mahuika – ch_report.sh (there is an updated version in the master branch)

TODO: Investigate whether ch_report.sh needs adjusting for jobs on partitions that charge more than 1 core hour per core hour of use.
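
For the historic per-person usage, the underlying Slurm query is roughly of this form (ch_report.sh may use different options; the dates are placeholders):

# Core hours per user for the nesi00213 account over a date range
sreport -t Hours cluster AccountUtilizationByUser Accounts=nesi00213 Start=2019-01-01 End=2019-12-31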

Tasks

  1. From a local (UC) computer, poll the above scripts to get the data
  2. Print (and log) the data in a useful/readable format
  3. Create a cronjob to poll the data (and display it on the laptop Jira screen); see the sketch after this list
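
A hedged crontab sketch for task 3, assuming a hypothetical wrapper script that calls the polling scripts and a roughly one-minute refresh (script name and paths are placeholders):

# Poll the usage scripts every minute and log the output for the Jira screen
* * * * * /path/to/poll_hpc_usage.sh >> /path/to/hpc_usage.log 2>&1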


Progress

Tasks 1-3 are done; HPC core hour usage etc. is now refreshed every 60 s on the Jira screen.


Todo

1. Plot core hour usage

2. Create a UI, e.g. a website, to display/download plots/data
