Meeting held at 9:30 NZDT

Proposed Agenda

  1. Introduction
  2. Background: QuakeCoRE SeisFinder
  3. SCEC experience
    • Overview
    • Size
    • Number of visitors and who they are expected to be (general public?)
    • HW, SW, network?
    • Issues around using SQL for this particular type of project. Advantages?
    • Have they considered any alternatives that do not use SQL? For example, Hadoop (large storage, map-reduce operations): https://github.com/Esri/gis-tools-for-hadoop
    • Maintenance: how many people need to be working on the project in order for it to remain usable.
  4. QuakeCoRE requirement
  5. Others


Notes after the meeting

  • The SCEC database (DB) holds a large amount of seismic information, on the order of 22 billion entries. They have started seeing increased query times. They use MySQL as the backend. The DB runs on a grunty machine with 128 GB of RAM and 24 cores under Fedora 24. They also have another, smaller-spec machine that is not public facing. Other points here:
    • Query times have increased as the DB has grown (~4 TB)
    • They will try to reduce the size of the most problematic tables to improve performance.
    • They want to split the current huge DB in two: a production DB (containing only the latest analyses) and a read-only SQLite-based DB holding less-used data.
  • Their database is not public at this point; it is mostly used by researchers. The usage pattern is bursty rather than constant, depending mainly on which researchers need the data.
  • Besides the performance issues they have noticed with MySQL, other issues worth noting are:
    • Backup time and storage will grow as the DB size increases
    • Updating MySQL is a complex operation. Nevertheless, it is always a good idea to update for bug fixes/performance improvements/new features.
    • Note that they have dedicated staff to administer the machine and the DB
  • Moving to a solution that does not use MySQL is not feasible for them, as they have built a stack on top of it that relies on the DB backend to provide certain features.
    • We should be cautious in analyzing our requirements so that we choose a suitable DB for our needs as well.
  • When discussing Hadoop, Scott's take is that it does not suit a query-serving problem like the one they face. On top of that, it requires a custom filesystem. Basically, they have not found an alternative appealing enough to justify moving away from the current setup.
  • They seem interested in using HDF5 as a standard for some of the output produced by the codes they use, as currently they have several custom binary formats.
  • Their workflow uses Pegasus (https://pegasus.isi.edu/)
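The split into a production DB plus a read-only SQLite archive (mentioned above) could look roughly like this minimal sketch. The `events` table, its columns, and the sample rows are hypothetical stand-ins; in practice the rows would be streamed out of the production MySQL DB rather than hard-coded:

```python
import os
import sqlite3
import tempfile

# Rows that would in practice be streamed out of the production MySQL DB;
# the `events` table and its columns are hypothetical stand-ins.
rows = [
    (1, 6.2, "2016-11-13T11:02:56Z"),
    (2, 5.7, "2016-11-14T00:34:22Z"),
]

path = os.path.join(tempfile.mkdtemp(), "archive.db")

# Build the standalone archive file.
con = sqlite3.connect(path)
con.execute(
    "CREATE TABLE events "
    "(event_id INTEGER PRIMARY KEY, magnitude REAL, origin_time TEXT)"
)
with con:  # single transaction keeps the bulk insert fast
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
con.close()

# Researchers open the archive read-only via an SQLite URI,
# so the archived data cannot be modified accidentally.
ro = sqlite3.connect("file:" + path + "?mode=ro", uri=True)
count = ro.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2
```

Because the archive is a single self-contained file, it could simply be copied to wherever the less-used data needs to be served.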
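On the HDF5 point: replacing a custom binary waveform format with HDF5 could look like this sketch using the widely used `h5py` library. The dataset path, array shape, and attribute names are all hypothetical, purely to illustrate HDF5's self-describing metadata:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "waveforms.h5")

# Write one synthetic three-component velocity record, with its metadata
# stored as self-describing HDF5 attributes (all names hypothetical).
with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "station_ABC/velocity",
        data=np.zeros((3, 1000), dtype="f4"),
        compression="gzip",
    )
    dset.attrs["dt"] = 0.005     # sample spacing in seconds
    dset.attrs["units"] = "m/s"

# Any HDF5-aware tool can read the data and metadata back without
# needing a bespoke parser for each code's output format.
with h5py.File(path, "r") as f:
    dset = f["station_ABC/velocity"]
    shape = dset.shape
    dt = float(dset.attrs["dt"])
print(shape, dt)  # (3, 1000) 0.005
```

The built-in compression and hierarchical layout are a large part of the appeal over per-code binary formats.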
