Specialized Scientific Language Models for Research and Education: A Pilot Study

Majd M. Alahmed, Ahmed E. Elbanna, Napat Tainpakdipat, & Harsh PV

Submitted September 8, 2024, SCEC Contribution #14016, 2024 SCEC Annual Meeting Poster #058

AI chatbots and language models (LMs) like ChatGPT have rapidly advanced, becoming increasingly prevalent. However, these models still struggle with reliability, particularly in higher education, where accuracy in science-related topics is crucial. This is mainly because they are not specifically designed or trained for scientific purposes. As a result, when queried on scientific topics, they often reference internet sources without effectively determining which are most reliable. Developing AI models trained on specialized scientific data could significantly improve educational processes, helping students practice solving engineering problems and enabling researchers to generate literature reviews or summaries of research papers more efficiently.

To explore this concept, we adapted an existing large language model (LLM) to answer seismology-related questions using publications, journals, and research papers uploaded to the model's configuration. We selected GPT-4.0 for an initial trial, configuring it with a sample dataset to address earthquake-related queries, ranging from basic to complex, including those requiring precise calculations. By integrating equation-based computations, we aimed for the model to support advanced exploration of seismology and seismic research. However, challenges arose because GPT is a closed-source LLM. Despite being given accurate resources, the model sometimes drew on other sources or gave incorrect answers because of the material's complexity.

Additionally, consistent responses were difficult to maintain across sessions, because what the model learned in conversation did not persist beyond a single chat session. In response to these challenges, we developed a new LM system integrating multiple LMs via APIs. This system employs a smaller language model with an extended context window of 128k tokens, integrated into a Retrieval-Augmented Generation (RAG) system. The RAG system uses a vector database to store and retrieve document embeddings. When a query is posed, it is vectorized, and a similarity search retrieves the most relevant passages, which are then added to the model's input to augment its responses. This approach blends external data with the model's internal knowledge through attention mechanisms, providing comprehensive answers. The system can also access the internet for additional information. LLM routers further optimize efficiency by directing queries to the appropriate processing pipelines, ensuring each query is handled with the most suitable resources.
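The retrieval and routing steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `VectorStore` stands in for a vector database, and the keyword-based `route` function is a hypothetical placeholder for an LLM router. All names and sample passages are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: term counts (assumed stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class VectorStore:
    """In-memory stand-in for a vector database of document embeddings."""

    def __init__(self):
        self._items = []  # list of (passage text, embedding) pairs

    def add(self, text):
        self._items.append((text, embed(text)))

    def search(self, query, k=2):
        """Vectorize the query and return the k most similar passages."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(query, passages):
    """Augment the query with retrieved passages before sending it to a model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


def route(query):
    """Hypothetical router: numeric or 'calculate' queries go to a calculation pipeline."""
    if re.search(r"\d", query) or re.search(r"\b(calculate|compute)\b", query.lower()):
        return "calculation"
    return "retrieval"


if __name__ == "__main__":
    store = VectorStore()
    store.add("Moment magnitude is computed from the seismic moment of an earthquake.")
    store.add("P waves travel faster than S waves through the crust.")
    store.add("Liquefaction occurs when saturated soil loses strength during shaking.")

    question = "Which waves travel faster, P or S?"
    if route(question) == "retrieval":
        print(build_prompt(question, store.search(question, k=1)))
```

In a fuller system, the prompt returned by `build_prompt` would be passed to the language model, and the router would more likely be a learned classifier or a model call than a keyword rule.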

Citation
Alahmed, M. M., Elbanna, A. E., Tainpakdipat, N., & PV, H. (2024, 09). Specialized Scientific Language Models for Research and Education: A Pilot Study. Poster Presentation at 2024 SCEC Annual Meeting.


Related Projects & Working Groups
Seismology