O04-03

The impact of SMILES notation inconsistencies on chemical language model property prediction

Yosuke KIKUCHI *1, Yasuhiro YOSHIKAI1, Shumpei NEMOTO1, Ayako FURUHAMA2, Takashi YAMADA2, Tadahaya MIZUNO1, Hiroyuki KUSUHARA1

1Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo
2Research Organization of Information and Systems, The Institute of Statistical Mathematics


[Purpose]
In recent years, deep learning has been extensively applied to a wide range of chemoinformatics tasks, including molecular generation and property prediction. A prominent approach involves leveraging natural language processing techniques in chemistry, giving rise to chemical language models. Among the molecular representations used in these models, SMILES (Simplified Molecular Input Line Entry System) is particularly common. To represent each molecule as a single, unique string, a standardized form known as Canonical SMILES has been established.
However, our examination of various chemical databases revealed that, in practice, even Canonical SMILES are not always unique. In this study, we refer to such discrepancies as SMILES Inconsistencies. In machine learning applications, it is crucial to ensure that the data under analysis do not deviate substantially from the training data, a concept known as the applicability domain. This raises a critical question: how do inconsistencies in SMILES notation influence a chemical language model's ability to interpret and generalize molecular structures? Here, we systematically investigate the impact of these inconsistencies across multiple tasks.
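As a minimal illustration (our own toy example, not drawn from the study's data), the three strings below all encode benzene, yet naive exact-string deduplication treats them as three distinct molecules. This is the character-level mismatch that motivates careful preprocessing:

```python
def naive_dedup(smiles_list):
    """Deduplicate by exact string match -- misses notation variants."""
    return sorted(set(smiles_list))

# All three strings encode benzene, but differ at the character level:
# aromatic lowercase form, kekulized form, and a different ring-closure digit.
benzene_variants = ["c1ccccc1", "C1=CC=CC=C1", "c2ccccc2"]

unique = naive_dedup(benzene_variants)
print(len(unique))  # 3 -- one molecule counted three times
```

In practice, one would re-canonicalize every string with a single toolkit and version (e.g. RDKit's `Chem.MolToSmiles`) before any comparison; the point here is only that string identity is not molecular identity.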

[Methods]
We employed a Transformer architecture with a variational autoencoder (VAE) trained on a molecular translation task (Randomized SMILES to Canonical SMILES) as our chemical language model (CLM). For downstream evaluation, we used datasets from MoleculeNet and the Therapeutics Data Commons (TDC). MoleculeNet is a well-established benchmark for molecular machine learning and is widely adopted in CLM research, while TDC is a versatile benchmark designed for a broad range of therapeutic prediction tasks. From both benchmarks, we specifically evaluated ADMET-related tasks.
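The translation setup pairs several randomized SMILES of a molecule with its canonical form as source and target. The sketch below builds such pairs with a deliberately simplified stand-in for SMILES randomization (valid only for unbranched chains of single-letter atoms, where string reversal yields an equivalent SMILES); a real pipeline would enumerate randomized SMILES with a cheminformatics toolkit such as RDKit (`Chem.MolToSmiles` with `doRandom=True`):

```python
import random

def toy_randomize(smiles, rng):
    """Toy stand-in for randomized-SMILES enumeration.

    Valid only for unbranched chains of single-letter atoms joined by
    single bonds (e.g. "CCO"), where reversing the string yields an
    equivalent SMILES for the same molecule.
    """
    assert smiles.isalpha() and smiles.isupper()
    return smiles[::-1] if rng.random() < 0.5 else smiles

def make_translation_pairs(canonical_smiles, n_per_molecule, seed=0):
    """Build (randomized source, canonical target) pairs for training."""
    rng = random.Random(seed)
    return [(toy_randomize(s, rng), s)
            for s in canonical_smiles for _ in range(n_per_molecule)]

pairs = make_translation_pairs(["CCO", "CCN"], n_per_molecule=2)
# Every target is the canonical string; sources vary per draw.
```

The function names and the reversal trick here are illustrative assumptions, not the study's implementation; only the source/target structure of the task follows the abstract.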

[Results and Discussion]
For reconstruction tasks, correcting SMILES Inconsistencies led to an overall improvement in performance. In contrast, for downstream tasks, performance did not improve, and in some datasets prediction accuracy dropped significantly. A detailed investigation revealed that, in certain datasets, grammatical inconsistencies occurred only in positive samples, producing a spurious correlation between the inconsistencies and the positive label.
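This kind of label leakage can be screened for by tabulating, per class, how often surface notation features appear. The sketch below uses hypothetical flag definitions and toy data (not the study's datasets or its actual audit):

```python
def notation_flags(smiles):
    """Surface-level notation features that may vary between data sources.
    Purely illustrative; a real audit would define its own checks."""
    return {
        "aromatic_lowercase": any(ch in "bcnops" for ch in smiles),
        "explicit_stereo": "@" in smiles,
        "bracket_atom": "[" in smiles,
    }

def leakage_report(dataset):
    """For each flag, the fraction of flagged samples per class label.

    A flag concentrated in one class (e.g. present only in positives)
    is a warning sign of spurious correlation with the label.
    """
    report = {}
    for flag in notation_flags(""):
        by_label = {}
        for smiles, label in dataset:
            hits, total = by_label.get(label, (0, 0))
            by_label[label] = (hits + notation_flags(smiles)[flag], total + 1)
        report[flag] = {lab: hits / total for lab, (hits, total) in by_label.items()}
    return report

# Toy dataset where aromatic notation appears only in the positive class.
data = [("c1ccccc1O", 1), ("c1ccccc1N", 1), ("C1=CC=CC=C1O", 0), ("CCO", 0)]
print(leakage_report(data)["aromatic_lowercase"])  # {1: 1.0, 0: 0.0}
```

A model fed such a dataset can score well by keying on the notation itself rather than on chemistry, which is exactly the failure mode described above.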
Furthermore, for datasets in which downstream task performance was unaffected, we extracted the features used in each prediction and compared them before and after correcting the inconsistencies; the differences tended to be smaller than those observed across the feature set as a whole. This suggests that, in these downstream tasks, predictions were based on features that were less influenced by the inconsistencies.

[Conclusion]
From these findings, we conclude that applying CLMs to downstream tasks without appropriate SMILES preprocessing can lead to misleading conclusions. Although the extent of the impact is task- and dataset-dependent, these inconsistencies pose a fundamental challenge to the reliability and interpretability of such models.