Improving Protein-Ligand Binding Prediction via Vision Transformer and Image-Based Learning
Akira TAKE *, Masakazu SEKIJIMA
Institute of Science Tokyo
Purpose
Improving the efficiency of compound discovery is an essential challenge for reducing the cost and time of new drug research and development. In structure-based drug discovery [1], docking simulations are often run before wet experiments, but their prediction accuracy remains low and needs improvement. AI is expected to contribute both to improving this accuracy and to reducing the subjectivity of compound selection. This study aims to replace the experience and intuition of medicinal chemists by using a Vision Transformer trained on a large number of docking simulation results, with the goal of improving the accuracy of docking simulations.
Method
In this study, we represented the three-dimensional structures of protein-ligand complexes as images and used a Vision Transformer (ViT) model that takes these images as input to predict binding. Structural data were obtained from public databases such as the PDB; pocket regions around each ligand were extracted, and the three-dimensional structures were rendered as two-dimensional images from multiple viewpoints. The images were color-coded to encode atom types and physicochemical properties of residues, and output at 224×224 pixels to match the ViT input size. For training, we used a ViT model pre-trained on ImageNet-1k and compared shallow fine-tuning, which updates only the classification layer, with deep fine-tuning, which also updates the Transformer encoder layers. Weighted cross-entropy was used as the loss function, and Adam as the optimizer. The training data were split by stratified sampling on the bound/unbound labels, and best-model checkpointing was used to mitigate overfitting. Evaluation metrics such as AUC were used to verify the effectiveness of ViT against a docking-score baseline.
Results and Discussion
The results showed that the model fine-tuned by updating both the classification layer and the downstream Transformer encoder layers after ImageNet-1k pre-training achieved high accuracy. This suggests that ViT requires sufficient data to learn the spatial relationships in images, and that with appropriate fine-tuning it can adapt to domains far from natural photographs.
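For reference, the AUC metric used in the evaluation is equivalent to the rank-based (Mann-Whitney) statistic: the probability that a randomly chosen bound example is scored above a randomly chosen unbound one. A minimal, library-free sketch:

```python
def auc(labels, scores):
    """AUC as the Mann-Whitney statistic: fraction of (bound, unbound)
    pairs where the bound example (label 1) outscores the unbound one
    (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a perfectly ranked score list gives 1.0; random scoring tends to 0.5
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # → 1.0
```

Because it depends only on the ranking of scores, the same function applies unchanged to both the ViT output probabilities and the docking-score baseline, making the comparison direct.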
Conclusion
The ViT-based method proposed in this study demonstrated the potential to improve binding prediction through the application of pre-training and transfer learning. Future challenges include developing domain-specific pre-training data and model architectures that more effectively capture three-dimensional structures.
References
1. Evanthia Lionta, George Spyrou, Demetrios Vassilatis, and Zoe Cournia. Structure-based virtual screening for drug discovery: Principles, applications and recent advances. Curr. Top. Med. Chem., 14(16):1923–1938, October 2014.