O05-02

Molecular Structure Generation from Pharmaceutical Text Data Using a Diffusion Language Model

Yuki NAKAMURA *, Yoshihiro YAMANISHI

Department of Complex Systems Science, Graduate School of Informatics, Nagoya University


1. Purpose
In recent years, the application of artificial intelligence (AI) to drug design has attracted considerable attention. However, many existing models rely heavily on known drug or protein structures, which limits their ability to generate promising molecules with novel scaffolds. Meanwhile, pharmaceutical text data often contain abundant information about drug properties (e.g., therapeutic effects and mechanisms of action), yet this information has rarely been used for molecule generation. This study aims to develop an AI model that efficiently generates novel structures of drug candidate molecules by leveraging these underutilized textual data on desired drug properties.

2. Methods
We constructed pharmaceutical text data from public databases containing descriptions of drugs and their properties, such as indications, mechanisms of action, and target names. These pieces of information were concatenated into a single description for each drug. We then constructed a text-guided diffusion language model for molecular structure generation. Unlike conventional autoregressive models, which generate molecules sequentially, the model employs a diffusion process that refines the entire molecular structure simultaneously. Chemical structures are represented as continuous vectors and progressively denoised, ultimately yielding Simplified Molecular Input Line Entry System (SMILES) strings that reflect the input text. We trained the model on pairs of drug property descriptions and drug chemical structures.
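
The data preparation step described above, in which property fields are concatenated into one description per drug and paired with a structure, might look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' pipeline: the field names, the label format, the `TextStructurePair` type, and the example record are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical field names. The abstract only states that indications,
# mechanisms of action, and target names were concatenated per drug.
FIELDS = ("indication", "mechanism_of_action", "targets")


@dataclass(frozen=True)
class TextStructurePair:
    description: str  # concatenated property text (conditioning input)
    smiles: str       # reference structure (training target)


def build_description(record: dict) -> str:
    """Concatenate the available property fields into one description."""
    parts = []
    for field in FIELDS:
        value = record.get(field)
        if not value:
            continue  # skip fields that are missing or empty
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)
        label = field.replace("_", " ").capitalize()
        parts.append(f"{label}: {value.strip().rstrip('.')}.")
    return " ".join(parts)


def build_pairs(records: list) -> list:
    """Keep only records that have both a description and a SMILES string."""
    pairs = []
    for record in records:
        text = build_description(record)
        smiles = record.get("smiles")
        if text and smiles:
            pairs.append(TextStructurePair(text, smiles))
    return pairs


# Illustrative records only; they are not taken from the study's dataset.
records = [
    {
        "indication": "Pain, fever, and inflammation",
        "mechanism_of_action": "Irreversible inhibition of cyclooxygenase",
        "targets": ["PTGS1", "PTGS2"],
        "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
    },
    {"indication": "Hypertension", "smiles": ""},  # dropped: no structure
]

pairs = build_pairs(records)
for pair in pairs:
    print(pair.description, "->", pair.smiles)
```

Labeling each field inside the concatenated text is a design choice made here so that a description remains readable when some fields are missing; the abstract does not say how the fields were joined.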

3. Results and Discussion
We applied the proposed model to generate new chemical structures from pharmaceutical text describing desired drug properties, and examined the structural correlation between the newly generated molecules and the corresponding reference drug molecules. The generated molecules were often similar to known drug molecules, although the pharmaceutical text did not always provide sufficient information to determine a molecular structure uniquely. These results suggest that the model captured, to some extent, the correspondence between pharmaceutical text and molecular structures.
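
One common way to quantify how close a generated molecule is to a reference molecule is the Tanimoto coefficient over structural feature sets. The abstract does not state which measure was used, so the following is only an illustrative sketch. It uses character bigrams of the SMILES string as a crude, dependency-free stand-in for real structural fingerprints; the function names and the example molecules are assumptions made for this example.

```python
def smiles_bigrams(smiles: str) -> set:
    """Crude feature set: all adjacent character pairs in a SMILES string.

    This is only a stand-in for a proper structural fingerprint and depends
    on how the SMILES string happens to be written.
    """
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient |A & B| / |A | B| of two feature sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty feature sets
    return len(a & b) / len(union)


# Illustrative molecules: aspirin as the reference and salicylic acid as a
# hypothetical generated structure.
reference = "CC(=O)Oc1ccccc1C(=O)O"
generated = "OC(=O)c1ccccc1O"

score = tanimoto(smiles_bigrams(reference), smiles_bigrams(generated))
print(f"Tanimoto similarity (character bigrams): {score:.2f}")
```

In practice, the feature sets would normally come from fingerprints computed on the parsed molecular graph with a cheminformatics toolkit such as RDKit, which makes the score independent of how the SMILES string is written.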

4. Conclusions
This study demonstrates the potential of a novel approach to molecular structure generation guided by textual data on drug properties. With further development, such an approach may contribute to improving the efficiency of the drug discovery process.