Exploring new statistical metrics to evaluate the magnitude distribution of earthquake forecasting models

Francesco Serafini, Marta Han, Leila Mizrahi, Kirsty Bayliss, José A. Bayona, Pablo C. Iturrieta, Mark Naylor, & Maximilian J. Werner

Submitted November 4, 2025, SCEC Contribution #15001

Evaluating earthquake forecasts is a crucial step in understanding and improving the capabilities of forecasting models. Metrics that assess the consistency between forecasts and data for one particular aspect of the process help identify which aspects of seismicity a model fails to describe and, consequently, highlight where new versions of the model should improve. This can be done effectively only with metrics that are unaffected by inconsistencies in other aspects of the process. The Collaboratory for the Study of Earthquake Predictability (CSEP), which organises earthquake forecasting experiments around the globe, has developed different tests targeting different aspects of the process, such as the number N-test and the magnitude M-test, which assess the (in)consistency between the observed and forecasted number of events and magnitude distribution, respectively. We find that the results of the recently proposed M-test for catalog-based forecasts (composed of a collection of synthetic catalogs generated by the model) depend on the N-test, i.e. the two tests do not appropriately isolate the aspects they target, rendering the test results uninformative. Here, we address this problem using simulated data and provide a solution based on resampling of the simulated catalogs. We implement this new M-test with resampling in the pyCSEP software toolkit, conduct two analyses using actual earthquake forecasts for Europe (1990-2015) and Switzerland (1933-1962, 1962-1992, 1993-2022) whose forecasted earthquake counts are inconsistent with observations, and analyse how the test results change under the proposed test. Lastly, we investigate alternative metrics, namely an unnormalised M-test, two Chi-square formulations, the Hellinger distance, the Brier score, and a novel Multinomial Log-Likelihood (MLL) score, and compare their ability to detect (in)consistency between data and forecast in various synthetic scenarios. We find that there are scenarios in which the alternative metrics outperform the resampled M-test; in particular, the MLL test outperforms the M-test in all scenarios considered. We also study how the ability to detect inconsistency changes with the number of observations and the cutoff magnitude, comparing one of the Chi-square metrics, the MLL, and the M-test against classical statistical methods such as the Kolmogorov-Smirnov (KS) test, the Wilcoxon test, and the Anderson-Darling (AD) test. The MLL and the AD test provide the highest probability of detecting inconsistencies across all combinations of the number of observations and cutoff magnitude. This study demonstrates how realistic synthetic examples can be used to compare the ability of different metrics to detect inconsistencies between data and forecasts, and shows that the MLL provides the best tradeoff between interpretability and the probability of detecting inconsistencies and should therefore be used.
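To make the resampling idea and the MLL-style scoring concrete, the sketch below illustrates one possible workflow on synthetic data. It is not the paper's implementation or the pyCSEP API: the Gutenberg-Richter sampler, the bin edges, the pooled-probability MLL statistic, and the quantile score are illustrative assumptions. The key step is resampling each simulated catalog to the observed number of events before comparing magnitude distributions, so that the magnitude test no longer inherits inconsistencies in the forecasted event count (the aspect already covered by the N-test).

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical Gutenberg-Richter magnitude sampler (illustrative only).
def sample_gr_magnitudes(n, b=1.0, m_min=4.0):
    return m_min + rng.exponential(scale=1.0 / (b * np.log(10)), size=n)

# A catalog-based forecast: a collection of synthetic catalogs (magnitudes only),
# with event counts drawn from a Poisson law. All parameters are assumptions.
n_sim = 1000
simulated_catalogs = [sample_gr_magnitudes(rng.poisson(80)) for _ in range(n_sim)]
observed_magnitudes = sample_gr_magnitudes(50, b=1.1)  # hypothetical observations

# Resampling step: draw, with replacement, exactly as many magnitudes from each
# simulated catalog as there are observed events, so the magnitude comparison
# is insensitive to a misforecast event count.
n_obs = len(observed_magnitudes)
resampled = [rng.choice(cat, size=n_obs, replace=True) for cat in simulated_catalogs]

# Binned multinomial log-likelihood statistic: pool all forecast magnitudes to
# estimate bin probabilities, then score each catalog's binned counts. This is
# an illustrative stand-in for the MLL score, not the paper's exact formulation.
bins = np.arange(4.0, 8.1, 0.1)
pooled = np.concatenate(simulated_catalogs)
probs = np.histogram(pooled, bins=bins)[0].astype(float) + 1e-12
probs /= probs.sum()

def mll_score(mags):
    counts = np.histogram(mags, bins=bins)[0]
    return float(np.sum(counts * np.log(probs)))

observed_score = mll_score(observed_magnitudes)
simulated_scores = np.array([mll_score(cat) for cat in resampled])

# Quantile score: the fraction of resampled catalogs scoring at or below the
# observation; values close to 0 flag inconsistency between data and forecast.
gamma = float(np.mean(simulated_scores <= observed_score))
print(f"observed MLL = {observed_score:.2f}, quantile score = {gamma:.3f}")

For the classical comparisons mentioned above, two-sample implementations available in scipy.stats (e.g. ks_2samp and anderson_ksamp) could be applied in the same setup to the pooled forecast magnitudes and the observed magnitudes.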

Citation
Serafini, F., Han, M., Mizrahi, L., Bayliss, K., Bayona, J. A., Iturrieta, P. C., Naylor, M., & Werner, M. J. (2025). Exploring new statistical metrics to evaluate the magnitude distribution of earthquake forecasting models. Natural Hazards and Earth System Sciences, (submitted).


Related Projects & Working Groups
Earthquake Forecasting and Predictability (EFP), CSEP