Repository Maturity Assessment : slurm_gm_workflow

GitHub URL:

What is this repo about?

Repo status

README present	Yes
Is Public?	Yes
Number of commits	1271
Last time Updated	04-04-2019

Functionalities

Description: State how this function is used or interacts with other sw components.
Status: (1: not working, 2: unstable, 3: works under specific condition, 4: works with known issues, 5: perfect)
Tests: (1: none, 2: broken/outdated, 3: with limited coverage, 4: works with known issues, 5: perfect)
Doc: (1: none, 2: outdated, 3: with limited coverage, 4: mostly ok, 5: perfect) Give a link
Frequency of use: Daily, Weekly, Monthly, Yearly, Never
Frequency of code/req. change: Daily, Weekly, Monthly, Yearly, Never
Bus Factor: Number of people that are familiar with the code (1-7)

Functionality	Description	Status	Known issues	Tests?	Doc?	Frequency of use	Frequency of code/req. change	Bus Factor
submit slurm scripts	Python scripts that generate Slurm scripts to run simulations.	4	Legacy code that has not been used since mid-2018 is still in the repo. Many Bash scripts that were created to remove manual work no longer work due to structural changes to simulations.	4	2	Daily	Daily	3
Dashboard	Web-based scripts to show the usage of HPC (IO/Corehours)	4	Some Functions are still under development.	3	3 Comments + Docstrings	Daily	Weekly	2
e2e_tests	Testing scripts that run a full install-to-simulation test to check whether the workflow is broken	4	workflow_config.json will not be properly created after branching off master (default value used instead of deployed version). The correct workflow_config.json is created upon creation of the environment; from then on it is the environment owner's responsibility to keep it up to date if values require adding/updating.	1	4 ReadMe + Comments + Docstrings	Daily	Weekly	2
estimation WCT	Used to estimate the runtime (wall-clock time, WCT) of a specific simulation step.	4	When data is out of range the estimate may be far too low (band-aid fix: multiplying WCT by the retry count). This should be fixed by the addition of an SVR model, which can handle out-of-bounds data and in general overestimates (DONE).	1	4 Readme + Comments + Docstrings	Daily	Monthly	1
automated simulation workflow	A wrapper that can bulk install simulations and auto-submit jobs.	4	Excessive access to the management DB will cause it to be locked on Maui. Legacy parameters have to be removed from the 'example' cybershake_config.json.	4	3 CyberShake Install and Auto-submission	Daily	Daily	3
verification	Scripts that will be used to auto-verify whether a simulation is valid or something is obviously wrong.	1	Under development and not yet integrated into the automated workflow	1	1	Monthly	Monthly	2
Automated Testing		4	Does not cover the whole repo yet. Some Bash scripts that are not in the 'main workflow' are not tested, e.g. scripts that move files around and rely heavily on folder/file structures.	1 Test script for test script?	1	Daily	Weekly	3
Templates	Templates used for simulation install or job submission	5		1	1	Daily	Weekly	4
HPC Environments	Scripts for creating/activating HPC Environments	4		1	3 Readme + Comments	Daily	Monthly	1
Deploy workflow	Scripts for deploying workflow	3	Not overly stable, not frequently used and often requires work for it to work as intended. Get rid of it and just use environments?	1	2	Weekly	Monthly	1
Metadata	Scripts for logging and aggregating metadata	5		5?	3 Comments + Docstrings	Daily	Monthly	1
Shared workflow	Shared functionality for the workflow	3	Could probably use some refactoring, most likely contains some unused functions	3	2 Limited Comments + the odd Docstring	Daily	Monthly	7
Scripts + Cybershake scripts	Scripts	2	Massive number of scripts, most of them unused; requires significant tidy-up	3	2	Most are outdated?	?	2
Management DB	Lots of scripts for creation and updating of MgmtDB	4	Works but messy code and prone to failure IMO, currently being tidied up as part of https://quakecore.atlassian.net/browse/QSW-1057	3	2	Daily	Monthly	7
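
The retry-count workaround noted for WCT estimation above can be sketched roughly as follows. This is only an illustration, not the repo's code: the function name `scaled_wct`, the hour units, the convention that the count starts at 1 for the first submission, and the `max_hours` cap are all assumptions made for the example.

```python
def scaled_wct(estimated_hours: float, attempt: int, max_hours: float = 24.0) -> float:
    """Scale a wall-clock-time estimate by the submission attempt number.

    attempt is assumed to be 1 for the first submission, 2 for the first
    retry, and so on. max_hours is a hypothetical queue limit.
    """
    if estimated_hours <= 0:
        raise ValueError("estimated_hours must be positive")
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Each retry asks for proportionally more time, capped at the limit.
    return min(estimated_hours * attempt, max_hours)
```

With these assumptions, a step that was under-estimated on the first attempt gets twice the requested time on its second attempt, which is why the workaround masks out-of-range estimates at the cost of extra resubmissions.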

Suggested Improvements / New Features

Description: State how/why this will be useful
Timeline: Estimate of how many sprints will it take to develop

Functionality	Description	Timeline
Update DB script revamp	To attempt to address the Lock issue caused by excessive access	3~4 Days
Integrate Pre-processing into automated workflow	One step closer to fully automated workflow. (Including option to only do pre-processing)	1 Sprint
Implement automated verification	First guard/test for running huge simulations (e.g. Cybershake)	1~2 Sprints (depends on how complicated the method will be)
Get rid of default HPC deployed workflow and purely use environments	Remove the default workflow and create a default (i.e. stable) HPC environment instead. Makes everything consistent, removes requirement to maintain deploy code?	1 day
Logging	Add decent logging for the workflow. Advantages: makes debugging significantly easier; gives an idea of what is slow in the automated workflow; a single log file with a decent format, which can then be easily searched/filtered with a third-party or custom log viewer	1-2 Sprints depending on extent
Large quantity of dead scripts	Remove unused/outdated scripts	2 days
Optimize estimation performance	Prevent constant reloading of models, this should make estimation super fast (i.e. not noticeable). Currently the models are loaded from the files for every estimation, which is obviously slow	1-2 days
Visualisation Automation	Add options for enabling plotting/visualisations for All/First Realisation	1-2 Sprints
Error handling	Identify more cases which can be automatically checked / corrected rather than requiring manual intervention for runs.
Update Cybershake related code to fit recent yaml changes	This will deprecate unnecessary parameters and/or the use of redundant config files.	1 Sprint
Change of realisation names	AlpineF2K_HYP01-47_S1244 to AlpineF2K_REL01	2 Sprints+
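
The "Optimize estimation performance" item above amounts to loading each model once and reusing it. A minimal sketch of that idea, assuming the models are stored as pickle files (the storage format and the name `load_model` are hypothetical, not taken from the repo):

```python
import pickle
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_model(model_path: str):
    """Load a pickled model from disk once per path.

    Later calls with the same path return the cached object instead of
    re-reading the file. The pickle format is an assumption.
    """
    with Path(model_path).open("rb") as f:
        return pickle.load(f)
```

Every estimation call would then go through `load_model(path)` rather than opening the file itself. The cache lives for the lifetime of the process, so this helps when many estimations run in one process; `load_model.cache_clear()` forces a reload if a model file is replaced.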
