The digital archivist: Automating legacy macroseismic data processing using large language models

Aarnav Agrawal, Susan E. Hough, S. Mostafa Mousavi, Khant N. Hlaing, Clara E. Yoon, & Salvador Blanco

Submitted September 7, 2025, SCEC Contribution #14630, 2025 SCEC Annual Meeting Poster #TBD

Macroseismic data are a key resource for investigating shaking and damage from the pre-instrumental and pre-digital eras. However, these data are often stored as inconsistently formatted reports describing observed shaking and damage, which makes manually parsing and interpreting the accounts labor-intensive (Hough et al., 2025). This study introduces a novel workflow that uses Google’s Gemini 2.5 Pro large language model (LLM) to automate macroseismic data extraction and interpretation from summary reports (Hough et al., 2025). We apply this workflow to the March 22, 1957, M5.3 Daly City, California, earthquake as a case study. We used Gemini with zero-shot prompting (no worked examples in the prompt) to extract addresses, Modified Mercalli Intensity (MMI) values, and descriptions from each report. For reports without MMI values, Gemini inferred an intensity from the damage descriptions. To address coordinate precision limits, addresses were geocoded via Google’s Geocoding API. This workflow yielded over 2,500 geocoded intensity reports for the Daly City earthquake. To assess the accuracy of the workflow, we first evaluated the full dataset by generating a ShakeMap and comparing its interpolated intensities with instrumental data from five strong-motion stations in the San Francisco Bay Area. Using ground-motion-to-intensity conversion equations, we converted the known peak ground velocity and acceleration values to MMI. The interpolated intensities closely matched the instrumental values, with a mean absolute error of ~0.5 MMI units. When Gemini assigned intensities in 0.5-unit increments instead of whole numbers, the error decreased to ~0.35, demonstrating the importance of prompt engineering for optimizing results. We next separated the dataset into two subsets: reports where MMIs were explicitly provided (“extracted”), and reports where Gemini assigned MMIs from descriptions (“AI-generated”). Using the extracted subset as ground truth, we conducted a pixel-by-pixel comparison between the interpolated ShakeMaps from the two subsets and found a mean absolute error within 0.5 MMI units, indicating that Gemini’s inferred ratings aligned well with the explicitly provided ones. We also present preliminary results for the 1971 Sylmar earthquake. Our results demonstrate the potential of LLMs for reliably extracting and analyzing large, unstructured macroseismic datasets. LLMs could offer a scalable solution for rapidly digitizing macroseismic archives, enabling broader use in modern seismic hazard analysis to constrain ground motion models and improve understanding of site effects in urban areas.
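The pixel-by-pixel comparison described above can be illustrated with a minimal sketch. It assumes the two interpolated intensity maps have already been resampled onto the same grid and that cells with no data are marked as NaN. The function name `grid_mae` and the sample values are hypothetical and are not taken from the study.

```python
import math


def grid_mae(reference, candidate):
    """Mean absolute difference between two equally shaped MMI grids.

    Cells where either grid holds NaN (no interpolated value) are skipped.
    Assumption: both grids cover the same area at the same resolution.
    """
    if len(reference) != len(candidate):
        raise ValueError("grids must have the same number of rows")
    total = 0.0
    count = 0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            raise ValueError("grid rows must have the same length")
        for ref, cand in zip(ref_row, cand_row):
            if math.isnan(ref) or math.isnan(cand):
                continue  # no data in at least one map for this cell
            total += abs(ref - cand)
            count += 1
    if count == 0:
        raise ValueError("grids share no cells with data")
    return total / count


# Illustrative values only: "extracted" plays the role of ground truth.
extracted = [[5.0, 5.5], [6.0, math.nan]]
ai_generated = [[5.5, 5.5], [5.5, 6.0]]
mae = grid_mae(extracted, ai_generated)
```

In this toy case three cells are compared, with absolute differences of 0.5, 0.0, and 0.5, so the mean absolute error is one third of an MMI unit.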

Key Words
Large Language Models, Macroseismic Data

Citation
Agrawal, A., Hough, S. E., Mousavi, S. M., Hlaing, K. N., Yoon, C. E., & Blanco, S. (2025, September). The digital archivist: Automating legacy macroseismic data processing using large language models. Poster Presentation at 2025 SCEC Annual Meeting.


Related Projects & Working Groups
Ground Motions (GM)