Cactus to Clouds: Processing the SCEDC Open Data Set on AWS

Tim Clements & Marine A. Denolle

Published August 13, 2019, SCEC Contribution #9476, 2019 SCEC Annual Meeting Poster #302

Data from the Southern California Earthquake Data Center (SCEDC) are now in the cloud! Amazon Web Services (AWS) is currently hosting one year of data (July 2016 to July 2017, 552 stations, ~8 TB) from the Southern California Seismic Network as an Open Data Set. Here, we share the promises and pitfalls of leveraging cloud computing for seismic research, based on our initial work with SCEDC data on AWS. We present an AWS-based workflow for our use case: ambient noise cross-correlation of all stations in Southern California for groundwater monitoring. Ambient noise cross-correlation is both I/O- and compute-heavy.
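To illustrate why cross-correlating all stations is compute-heavy, the sketch below counts the station pairs for 552 stations and cross-correlates two synthetic traces to recover a known delay. It is a minimal pure-Python illustration, not the workflow used in this work: the function names, the synthetic wavelet, and the assumption of one channel per station with no autocorrelations are all hypothetical, and a production workflow would typically use FFT-based correlation.

```python
import math
from itertools import combinations


def count_station_pairs(n_stations):
    """Number of unique station pairs (no autocorrelations), n * (n - 1) / 2."""
    return sum(1 for _ in combinations(range(n_stations), 2))


def cross_correlate(a, b, max_lag):
    """Time-domain cross-correlation: C[lag] = sum_t a[t] * b[t + lag].

    Returns a dict mapping each lag in [-max_lag, max_lag] to C[lag].
    A positive peak lag means trace b is delayed relative to trace a.
    """
    n = min(len(a), len(b))
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for t in range(n):
            u = t + lag
            if 0 <= u < n:
                total += a[t] * b[u]
        result[lag] = total
    return result


# Pair count for the 552 stations in the data set (assumes one channel each).
n_pairs = count_station_pairs(552)

# Synthetic example: trace_b is trace_a delayed by 3 samples.
trace_a = [math.sin(0.3 * t) * math.exp(-((t - 20) / 5.0) ** 2) for t in range(64)]
trace_b = [0.0] * 3 + trace_a[:-3]

correlation = cross_correlate(trace_a, trace_b, max_lag=10)
best_lag = max(correlation, key=correlation.get)

print(n_pairs)   # number of station pairs
print(best_lag)  # lag (in samples) of the correlation peak
```

Under these assumptions, every added station increases the pair count roughly linearly, so the total work grows quadratically with network size.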

Analyzing seismic data in the cloud substantially reduces download times. Our Julia and Python APIs for transferring miniSEED data files from Amazon Simple Storage Service (S3) to Amazon Elastic Compute Cloud (EC2) achieve transfer speeds of up to 150 MB/s. We use AWS ParallelCluster to deploy Simple Linux Utility for Resource Management (SLURM)-based clusters on EC2, allowing us to spin up clusters of up to thousands of cores in minutes. Processing one year (8 TB) of data at 20 Hz costs less than $100. Our initial work with AWS suggests that cloud computing will decrease time-to-science for I/O- and compute-heavy seismic workloads.
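As a rough back-of-the-envelope check on the figures above, the following sketch estimates how long it would take to move the full ~8 TB from S3 to EC2 at the quoted peak rate of 150 MB/s. The helper name, the use of decimal units (1 TB = 10^6 MB), and the assumption that throughput scales linearly with the number of parallel streams are illustrative assumptions, not measurements from this work.

```python
def transfer_hours(size_tb, rate_mb_per_s, streams=1):
    """Estimated transfer time in hours.

    Assumes decimal units (1 TB = 1,000,000 MB), a sustained per-stream
    rate, and (hypothetically) linear scaling across parallel streams.
    """
    size_mb = size_tb * 1_000_000
    seconds = size_mb / (rate_mb_per_s * streams)
    return seconds / 3600.0


single_stream = transfer_hours(8, 150)          # one stream at the peak rate
sixteen_streams = transfer_hours(8, 150, 16)    # hypothetical 16 parallel streams

print(round(single_stream, 1))    # roughly 14.8 hours
print(round(sixteen_streams, 2))  # roughly 0.93 hours
```

Under these assumptions, a single stream would need a little under 15 hours for the whole data set, which suggests why spreading transfers across many EC2 instances matters for I/O-heavy workloads.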

Key Words
Cloud Computing; Ambient Noise; Big Data; AWS

Citation
Clements, T., & Denolle, M. A. (2019, 08). Cactus to Clouds: Processing the SCEDC Open Data Set on AWS. Poster Presentation at 2019 SCEC Annual Meeting.


Related Projects & Working Groups
Computational Science (CS)