Story

given a station, conditioning IM / value, and miscellaneous parameters with defaults where applicable: get an output similar to what the Matlab / first version of Python code produces.

Tasks

[4h] function that reads across all IM csvs and for a given station returns IMs for different simulations. Consideration in design taken into account for how the data is used below.
[6h] adjust Python code in gm_selection repo to accept data in new format, remove dependencies on matlab data files.
[2h] make a function that can complete all steps required in one go so all you have to give is station name, conditioning im name/value... where possible, default values are provided.
[3h] given the 2 main step process, allow going from one to the other without an intermediate text file (can be left as optional for now).
[3h] repeat above for final results, store results in easier to parse format or as variables, original text format left as optional.

total 18h ~3days

IM Data

Has 4 dimensions, stored across CSV files, have to manually combine the dimensions by:

adding multiple CSV files to make another dimention
splitting a dimension by a column value (component)

The above needs to be fixed.

Results

All tasks complete.

Hurdles/Changes/Summary

xarray which was going to be used for storage was not fit for purpose as many to many relationship resulted in wasted space across multiple dimensions.
sqlite database had to be implemented to combat above issue, cs18p6 resulted in about 24GiB of space (1.5hrs) which takes ~1min to retrieve all IMs/GMs for quite a common station.
no intermediate file (mainly repeated inputs from first step anyway), just pass a few arrays to the second step.
text file at the end kept for now but simpler/minimal/more elegant as end users may not be comfortable using alternatives.

Considerations

DB retrieval for station slow (~1min, cs18p6), should internally loop im_levels/values to prevent re-querying when plugging into seisfinder. Alternatively retrieve more data at once.
Want to use DB for other things? Multiprocessing to load CSV files (bottleneck), store more data (station rrup, lon/lat, realisations vs fault etc.).

Replacing IM_agg CSV files

IMDB already contains the data in CSV files so instead of reading CSV files, we get the data from IMDB making the IM_agg data obsolete.

Timing results (Mahuika)

Using IM_agg CSV files to retrieve 1 IM for 1 station: ~80 seconds
Using IMDB to cache all IMs (22) for 1 station: ~60 seconds
Using IMDB to cache 1 IM for 1 station: ~50 seconds
Loading IMDB cache: < 1 second

Child pages

GM Selection