P02-06

Exploring Structured Biological Pathways in Context with Retrieval-Augmented Generation

Rintaro YASHIRO *1, Nobuaki YASUO2, Masakazu SEKIJIMA1

1Department of Computer Science, Institute of Science Tokyo
2Academy for Convergence of Materials and Informatics (TAC-MI), Institute of Science Tokyo


[Purpose]
Drug discovery and development is a high-risk venture that can take more than a decade and cost more than $1 billion. Improving the accuracy of early target discovery and validation is critical to avoiding costly failures in later stages [1]. In target discovery, pathway databases such as KEGG PATHWAY are important for understanding disease mechanisms from life science data, but their size and complexity hinder comprehensive information retrieval [2]. Large language models (LLMs), meanwhile, have potential applications in many fields because of their ability to understand and generate language, but they face the critical challenge of "hallucination": the generation of information that is not grounded in fact. This problem is particularly acute in the biomedical field, where scientific rigor is required [3]. This study aims to suppress hallucination by retrieving information from KEGG PATHWAY and supplying it to the LLM as reliable external knowledge. The goal is to build an information search environment that supports interactive, fact-based responses, thereby improving the accuracy and efficiency of target search in drug discovery.

[Methods]
In this study, a knowledge graph constructed from KEGG PATHWAY is combined with Retrieval-Augmented Generation (RAG), which augments the LLM's generation capability with external knowledge, to build a reliable natural language search system for specialized life science data.
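
The combination of graph retrieval and grounded generation can be pictured with a minimal sketch. This is not the authors' implementation: the triples are a simplified toy example rather than data extracted from KEGG PATHWAY, the helper names (`build_graph`, `retrieve_facts`, `build_prompt`) are hypothetical, and retrieval is plain keyword matching. The sketch stops at assembling the prompt and leaves out the LLM call.

```python
from collections import defaultdict

# Toy (subject, relation, object) triples. Simplified illustration only,
# NOT extracted from KEGG PATHWAY.
TRIPLES = [
    ("RAS", "activates", "RAF"),
    ("RAF", "phosphorylates", "MEK"),
    ("MEK", "phosphorylates", "ERK"),
    ("PI3K", "activates", "AKT"),
]


def build_graph(triples):
    """Store the triples as an adjacency list keyed by subject node."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph


def retrieve_facts(graph, query):
    """Return triples whose subject or object is named in the query.

    Keyword matching is an assumption made for brevity; a real system
    would need entity linking or embedding-based retrieval.
    """
    tokens = {tok.strip("?.,").upper() for tok in query.split()}
    facts = []
    for subj, edges in graph.items():
        for rel, obj in edges:
            if subj in tokens or obj in tokens:
                facts.append((subj, rel, obj))
    return facts


def build_prompt(query, facts):
    """Assemble a prompt that restricts the LLM to the retrieved facts."""
    context = "\n".join(f"- {s} {r} {o}" for s, r, o in facts)
    return (
        "Answer using only the facts below. If they are insufficient, say so.\n"
        f"Facts:\n{context}\n"
        f"Question: {query}"
    )


graph = build_graph(TRIPLES)
question = "What does MEK act on?"
facts = retrieve_facts(graph, question)
prompt = build_prompt(question, facts)
print(prompt)
```

Only the two triples that mention MEK reach the prompt, so the model is asked to answer from retrieved graph facts rather than from its parametric memory.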

[Results and Discussion]
We evaluated search performance after converting the structured data into documents and compared it with that of existing RAG methods. For queries spanning multiple nodes, retrieval performance declined at intermediate nodes. Capturing the data in a structured manner is expected to reduce this loss of information at intermediate nodes.
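
To illustrate why structure helps with queries spanning multiple nodes, the following sketch traverses a toy graph and returns the full path between two entities, so the intermediate nodes are part of the retrieved result by construction. The edge list and the `find_path` helper are hypothetical illustrations, not the system evaluated here, and breadth-first search is only one possible traversal strategy.

```python
from collections import deque

# Toy directed edges. Simplified illustration only,
# NOT extracted from KEGG PATHWAY.
EDGES = {
    "RAS": ["RAF"],
    "RAF": ["MEK"],
    "MEK": ["ERK"],
    "ERK": [],
}


def find_path(edges, start, goal):
    """Breadth-first search returning the shortest node path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in edges.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


path = find_path(EDGES, "RAS", "ERK")
# Everything between the endpoints is an intermediate node.
intermediates = path[1:-1] if path else []
print(path, intermediates)
```

A query about how RAS relates to ERK names only the two endpoints. A retriever that ranks documents by similarity to the query may miss the entries for RAF and MEK, whereas a graph traversal returns them as part of the path.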

[Conclusions]
This study proposes a RAG-based system for structured information retrieval from a KEGG PATHWAY knowledge graph. By integrating the LLM with a structured database, the system offers an approach to mitigating LLM hallucination and a promising direction for reliable AI in the life sciences. Future work will focus on improving search performance and conducting experiments with large-scale data.

[References]
[1]: James P. Hughes, et al. Principles of early drug discovery. British Journal of Pharmacology, 162(6): 1239-1249, 2011.
[2]: Yishu Wang, Juan Qi, and Dongmei Ai. DPADM: a novel algorithm for detecting drug-pathway associations based on high-throughput transcriptional response to compounds. Briefings in Bioinformatics, 24(1): bbac517, 2023.
[3]: Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, and Vicente Grau. Medical Graph RAG: Towards safe medical large language model via graph retrieval-augmented generation. arXiv preprint arXiv:2408.04187, 2024.