Development of a new drug-likeness index based on pharmaceutical patents
Yugo SHIMIZU*1, Hitomi YUKI2, Masateru OHTA1, Teruki HONMA2, Kazuyoshi IKEDA1
1Center for Computational Science, RIKEN
2Center for Integrative Medical Sciences, RIKEN
[Purpose] Drug development requires searching a vast chemical space for potential pharmaceutical compounds. Many conventional drug-likeness indices, such as QED, were developed from information on approved drugs and are therefore poorly suited to the early stages of drug discovery. In this study, we aimed to develop a new drug-likeness index by constructing an AI-based model that uses information from pharmaceutical patents.
[Methods] Approved or investigational drugs were obtained from the ChEMBL 33 database. Active compounds included in patents granted by 30 high-revenue pharmaceutical companies were collected from the Excelra GOSTAR database. Compounds deemed obviously unsuitable as drugs were excluded, and the remaining molecules were standardized and deduplicated, yielding 1,024,041 positive compounds. An equal number of negative compounds were randomly sampled from ZINC15, ensuring no overlap with positives. Basic and interpretable molecular descriptors, such as molecular weight and atom/bond/ring counts, were calculated using RDKit and Mordred and used as training features. A gradient boosting classifier (XGBoost) was employed to construct a two-class model predicting desirability as a patented pharmaceutical compound. The dataset was divided into 80% training and 20% validation sets using stratified sampling. Early stopping with patience 20 was applied to prevent overfitting. Three hyperparameters (max_depth, colsample_bytree, eta) were optimized using Optuna across 200 trials, maximizing the mean Cohen’s Kappa from 5-fold cross-validation. For efficiency, optimization was conducted on a reduced dataset of 50,000 positive and 50,000 negative compounds.
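The dataset assembly and the optimization target described above can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the function names (sample_negatives, stratified_split, cohen_kappa), the toy identifiers, and the seeds are all hypothetical, and the real workflow used RDKit, Mordred, XGBoost, and Optuna on about one million compounds per class. The sketch only shows three ideas from the text: drawing an equal number of negatives with no overlap with the positives, an 80%/20% stratified split, and Cohen's Kappa as the score to be maximized.

```python
import random


def sample_negatives(positives, pool, seed=0):
    """Draw as many negatives as there are positives, excluding any positive.

    Assumes compounds are represented by hashable, already standardized
    identifiers (for example canonical SMILES strings).
    """
    positive_set = set(positives)
    # Sort so that the draw is reproducible for a given seed.
    candidates = sorted(set(pool) - positive_set)
    if len(candidates) < len(positive_set):
        raise ValueError("pool is too small to draw a balanced negative set")
    return random.Random(seed).sample(candidates, len(positive_set))


def stratified_split(labels, valid_fraction=0.2, seed=0):
    """Return (train_idx, valid_idx), keeping class ratios equal in both parts."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_valid = round(len(idx) * valid_fraction)
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between labels and predictions beyond chance."""
    y_true, y_pred = list(y_true), list(y_pred)
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n)
        for c in set(y_true) | set(y_pred)
    )
    if expected == 1.0:
        # Degenerate case (a single class everywhere); Kappa is undefined,
        # so this sketch returns 0.0 by convention.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Toy usage with made-up identifiers (not data from the study).
positives = [f"P{i}" for i in range(10)]
pool = positives[:3] + [f"Z{i}" for i in range(50)]
negatives = sample_negatives(positives, pool)
labels = [1] * len(positives) + [0] * len(negatives)
train_idx, valid_idx = stratified_split(labels, valid_fraction=0.2)
print(len(train_idx), len(valid_idx))
```

In a hyperparameter search of the kind described, a function like cohen_kappa would be evaluated on each cross-validation fold and the mean over the folds returned as the value to maximize.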
[Results and Discussion] In 5-fold cross-validation, the model achieved high performance across all evaluation metrics: accuracy, precision, recall, specificity, F1 score, Matthews correlation coefficient, Cohen’s Kappa, ROC AUC, and precision–recall AUC. Time-split tests based on patent year also showed that the model performs well in predicting compounds patented more than one year later. These results indicate that the constructed model is appropriate and robust. For comparison, QED was evaluated on the same dataset and yielded a ROC AUC below 0.5, indicating that it fails to distinguish patented pharmaceutical compounds from the ZINC15-derived negatives. These findings support the view that the new AI model is more suitable for assessing drug-likeness in the early stages of drug discovery.
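To make two points of this evaluation concrete, the following standard-library sketch may help. ROC AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, so a value below 0.5 means the score tends to rank negatives above positives. The time_split helper is a hypothetical reading of the time-split test: the cutoff year, the one-year gap parameter, and the record layout are assumptions for illustration, not details taken from the study, and the scores are toy values.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in
    the number of compounds and is meant only for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute ROC AUC")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def time_split(records, cutoff_year, gap_years=1):
    """Split (compound_id, patent_year) records by time.

    Training uses compounds patented up to cutoff_year; testing uses those
    patented more than gap_years after the cutoff. Records falling inside
    the gap are left out. (Hypothetical helper; assumed interpretation.)
    """
    train = [cid for cid, year in records if year <= cutoff_year]
    test = [cid for cid, year in records if year > cutoff_year + gap_years]
    return train, test


# Toy scores (not results from the study): the second score ranks most
# negatives above the positives, giving an AUC below 0.5.
labels = [1, 1, 0, 0]
print(roc_auc(labels, [0.9, 0.8, 0.2, 0.1]))  # 1.0
print(roc_auc(labels, [0.3, 0.6, 0.5, 0.7]))  # 0.25

records = [("a", 2015), ("b", 2016), ("c", 2017), ("d", 2018)]
print(time_split(records, cutoff_year=2015))  # (['a'], ['c', 'd'])
```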
[Conclusions] We developed an AI-based drug-likeness index using patented pharmaceutical compounds as positive data, yielding a robust and accurate predictive model. Unlike conventional indices such as QED, our model successfully captures features relevant to early-stage drug discovery. This model is expected to facilitate compound prioritization at earlier phases of drug development, and we plan to release it publicly for broad research use.