Reducing Time-to-Solution Using Distributed High-Throughput Mega-Workflows -- Experiences from SCEC CyberShake

Scott Callaghan, Philip J. Maechling, Ewa Deelman, Karan Vahi, Gaurang Mehta, Gideon Juve, Kevin R. Milner, Robert W. Graves, Edward H. Field, David A. Okaya, Keith S. Beattie, & Thomas H. Jordan

Published 2008, SCEC Contribution #1237

Researchers at the Southern California Earthquake Center (SCEC) use large-scale grid-based scientific workflows to perform seismic hazard research as part of SCEC's program of earthquake system science research. The scientific goal of the SCEC CyberShake project is to calculate probabilistic seismic hazard curves for sites in Southern California. For each site of interest, the CyberShake computation includes two large-scale MPI calculations and approximately 840,000 embarrassingly parallel post-processing jobs. In this paper, we describe the computational requirements of CyberShake and detail how we meet these requirements using grid-based, high-throughput, scientific workflow tools. We describe the specific challenges we encountered and discuss the workflow throughput optimizations we developed, which reduced our time to solution by a factor of three. We also present runtime statistics and propose further optimizations.
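
The per-site workflow shape described in the abstract -- two large MPI stages feeding a fan-out of hundreds of thousands of independent post-processing tasks that are then aggregated into a hazard curve -- can be pictured as a simple dependency graph. The sketch below is illustrative only, not the CyberShake or Pegasus implementation; job names such as sgt_x, post_N, and hazard_curve are hypothetical labels, and the job count is reduced for demonstration.

```python
# Minimal sketch (assumed structure, not the CyberShake implementation):
# one site's workflow as a parent -> children dependency graph, with two
# large MPI stages fanning out to many embarrassingly parallel post-processing
# jobs, all of which feed a final hazard-curve aggregation job.

from collections import defaultdict


def build_site_workflow(site: str, num_post_jobs: int) -> dict:
    """Return a parent -> children adjacency list for one site's workflow."""
    dag = defaultdict(list)
    mpi_stages = [f"{site}_sgt_x", f"{site}_sgt_y"]        # two large MPI calculations
    post_jobs = [f"{site}_post_{i}" for i in range(num_post_jobs)]
    aggregate = f"{site}_hazard_curve"                     # final hazard-curve assembly

    for post in post_jobs:
        for mpi in mpi_stages:
            dag[mpi].append(post)                          # each post job depends on both MPI stages
        dag[post].append(aggregate)                        # every post job feeds the final curve
    return dag


if __name__ == "__main__":
    # CyberShake uses roughly 840,000 post-processing jobs per site;
    # a much smaller count is used here so the example runs instantly.
    dag = build_site_workflow("SITE1", num_post_jobs=1000)
    num_jobs = len(dag) + 1                                # parents plus the sink aggregation job
    num_edges = sum(len(children) for children in dag.values())
    print(f"jobs: {num_jobs}, dependency edges: {num_edges}")
```

At full scale, the same structure yields on the order of 1.7 million dependency edges per site, which is what motivates the high-throughput workflow management discussed in the paper.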

Citation
Callaghan, S., Maechling, P. J., Deelman, E., Vahi, K., Mehta, G., Juve, G., Milner, K. R., Graves, R. W., Field, E. H., Okaya, D. A., Beattie, K. S., & Jordan, T. H. (2008). Reducing Time-to-Solution Using Distributed High-Throughput Mega-Workflows -- Experiences from SCEC CyberShake. Oral Presentation at the 4th IEEE International Conference on eScience.