LLM-Based Workflow for Protein–Protein Interaction Data Curation in Drug Discovery
Kazuyoshi IKEDA *1, 2, Tatsuki AKABANE2, Yixuan SUI2, Tsuyoshi ESAKI3
1Center for Computational Science, RIKEN
2Faculty of Pharmacy, Keio University
3Faculty for Data Science, Shiga University
[Purpose]
Reliable structural and activity data are essential for lead discovery in the drug development process. However, much of this information remains hidden as unstructured data within scientific literature. This study aims to establish a pipeline using large language models (LLMs) that can automatically extract, annotate, and evaluate interaction data, thereby enabling its seamless integration into drug discovery databases.
[Methods]
We used LLMs to build a workflow that automatically extracts, annotates, and organizes activity and compound data related to protein–protein interactions (PPIs) from scientific literature and supplementary information (SI). The workflow comprises document preparation, prompt-based data extraction, activity labeling, and generation of formatted output. The model is instructed in the prompt to classify compounds as active or inactive using a threshold of 10 μM. The annotated results are then exported as CSV files for subsequent integration and analysis. To evaluate consistency, multiple LLMs (o3, o3-pro, and GPT-4o) were applied and compared, and prompt engineering was used to improve extraction and labeling accuracy. The workflow output was validated against a benchmark based on an expert-curated dataset.
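As an illustration of the labeling and export steps, the following minimal Python sketch applies the 10 μM threshold deterministically and writes CSV text. In the workflow itself the labeling is performed by the LLM as instructed in the prompt, so this is only a rule-based equivalent that could serve as a cross-check. The field names, the unit table, the example records, and the choice to treat a value of exactly 10 μM as active are all assumptions made for illustration.

```python
import csv
import io

ACTIVE_THRESHOLD_UM = 10.0  # threshold stated in the abstract

# Conversion factors to micromolar (assumed set of units).
UNIT_TO_UM = {"nM": 1e-3, "uM": 1.0, "μM": 1.0, "mM": 1e3}

# Hypothetical column layout; the abstract does not specify one.
FIELDS = ["target", "smiles", "value", "unit", "label"]


def label_activity(value, unit, threshold_um=ACTIVE_THRESHOLD_UM):
    """Return 'active' if the value is at or below the threshold, else 'inactive'.

    Assumption: a value exactly equal to the threshold counts as active.
    """
    value_um = value * UNIT_TO_UM[unit]
    return "active" if value_um <= threshold_um else "inactive"


def records_to_csv(records):
    """Label each extracted record and return the result as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        row = dict(rec)
        row["label"] = label_activity(rec["value"], rec["unit"])
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical records with placeholder SMILES, not data from the study.
example_records = [
    {"target": "target_A/target_B", "smiles": "CCO", "value": 500, "unit": "nM"},
    {"target": "target_A/target_B", "smiles": "CCN", "value": 25, "unit": "uM"},
]

csv_text = records_to_csv(example_records)
```

Normalizing units before applying the threshold avoids mislabeling values reported in nM or mM, a step that would otherwise depend on the model interpreting units correctly.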
[Results and Discussion]
In this study, we first established a workflow capable of automatically extracting protein targets, ligands (in SMILES format), and activity values from the PDF and SI files of research articles. The extracted data were annotated using GPT models, enabling classification of compounds as active or inactive based on a 10 μM threshold. Among the tested models, o3-pro achieved the best performance, providing consistent and reliable activity classifications. Differences in performance were attributable to misdefinitions of specific technical terms. Interactive AI systems, such as GPT-4o and Claude Sonnet 4, also facilitated natural language access to drug discovery databases, including ChEMBL and PDB, allowing efficient retrieval of structural and clinical information, although additional instructions were occasionally required for ambiguous queries.
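One simple way to compare model output with an expert-curated benchmark is label agreement per compound. The sketch below is a minimal example under stated assumptions, not the evaluation procedure used in the study: the compound identifiers, model names, and labels are hypothetical, and a compound missing from a model's output is counted as a disagreement.

```python
def agreement(predicted, reference):
    """Fraction of reference compounds whose predicted label matches.

    Assumption: compounds absent from `predicted` count as mismatches.
    """
    if not reference:
        return 0.0
    matches = sum(
        1 for compound, label in reference.items()
        if predicted.get(compound) == label
    )
    return matches / len(reference)


# Hypothetical expert labels and model outputs, for illustration only.
expert_labels = {
    "cmpd1": "active",
    "cmpd2": "inactive",
    "cmpd3": "active",
    "cmpd4": "inactive",
}
model_outputs = {
    "model_a": {"cmpd1": "active", "cmpd2": "inactive",
                "cmpd3": "inactive", "cmpd4": "inactive"},
    "model_b": {"cmpd1": "active", "cmpd2": "inactive",
                "cmpd3": "active", "cmpd4": "inactive"},
}

scores = {name: agreement(labels, expert_labels)
          for name, labels in model_outputs.items()}
best_model = max(scores, key=scores.get)
```

Counting missing compounds as mismatches penalizes incomplete extraction as well as wrong labels; a stricter evaluation could report the two separately.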
[Conclusions]
Integrating LLMs into the data curation workflow for drug discovery enables the efficient handling of unstructured PPI data. This approach is expected to make the development of drug discovery databases more efficient.