HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction
Seungeon LEE*, Takuto KOYAMA, Itsuki MAEDA, Shigeyuki MATSUMOTO, Yasushi OKUNO
Department of Biomedical Data Intelligence, Graduate School of Medicine, Kyoto University, Japan
1. Purpose
Medium-sized peptides and their derivatives, including cyclic peptides and peptide mimetics, are promising next-generation therapeutic modalities due to their high target specificity and stability. Machine learning (ML) models that predict the molecular properties essential for drug development, such as membrane permeability, are being actively developed. However, data scarcity and the inadequacy of the traditional Simplified Molecular Input Line Entry System (SMILES) representation for chemically complex, branched, and cyclic structures limit model performance and applicability [1]. To address these challenges, we employed the Hierarchical Editing Language for Macromolecules (HELM), a molecular representation that faithfully captures complex peptide architectures, including branched and cyclic motifs [2], to develop ML models for medium-sized peptide property prediction.
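To make the contrast concrete, the minimal sketch below (an illustrative Python example using a hypothetical head-to-tail cyclic triglycine, not a sequence from our dataset) shows how SMILES flattens a macrocycle into atom-level ring-closure digits, whereas HELM keeps the monomer sequence and states the cyclizing bond explicitly in its connection section.

    # Illustrative only: cyclo(Gly-Gly-Gly) written in both notations.
    # SMILES encodes individual atoms; the macrocycle appears only as ring-closure digits (1 ... 1).
    smiles_cyclic_triglycine = "C1C(=O)NCC(=O)NCC(=O)N1"
    # HELM lists monomers (G.G.G) and expresses the cyclizing amide bond between
    # residue 1's N-terminus (R1) and residue 3's C-terminus (R2) in the connection section.
    helm_cyclic_triglycine = "PEPTIDE1{G.G.G}$PEPTIDE1,PEPTIDE1,1:R1-3:R2$$$"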
2. Methods
HELM sequences were curated from public databases, yielding approximately 39,000 unique sequences. We developed a Bidirectional Encoder Representations from Transformers (BERT)-based model, HELM-BERT, designed to process HELM tokens directly. The model was pretrained with a masked language modeling (MLM) objective and subsequently fine-tuned on representative downstream tasks essential for drug development.
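As a rough illustration of this setup, the sketch below (assuming a simple monomer-level tokenizer, the Hugging Face transformers library, and toy hyperparameters; the actual HELM-BERT tokenizer, vocabulary, and configuration are not specified here) masks HELM tokens and trains a small BERT encoder with the MLM objective.

    # Minimal MLM pretraining sketch for HELM strings (illustrative assumptions, not the HELM-BERT code).
    import re
    import torch
    from transformers import BertConfig, BertForMaskedLM

    def tokenize_helm(helm):
        # Hypothetical monomer-level tokenization: bracketed monomers, alphanumeric IDs, HELM delimiters.
        return re.findall(r"\[[^\]]+\]|[A-Za-z0-9]+|[{}.$,:\-]", helm)

    tokens = tokenize_helm("PEPTIDE1{G.G.G}$PEPTIDE1,PEPTIDE1,1:R1-3:R2$$$")

    # Toy vocabulary from one sequence; in practice it would be built over the ~39,000-sequence corpus.
    vocab = {tok: i + 2 for i, tok in enumerate(dict.fromkeys(tokens))}
    vocab["[PAD]"], vocab["[MASK]"] = 0, 1
    input_ids = torch.tensor([[vocab[t] for t in tokens]])

    # Standard MLM objective: mask ~15% of tokens and predict them; unmasked positions are ignored (-100).
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < 0.15
    mask[0, 0] = True  # ensure at least one masked position in this tiny example
    input_ids[mask] = vocab["[MASK]"]
    labels[~mask] = -100

    config = BertConfig(vocab_size=len(vocab), hidden_size=128,
                        num_hidden_layers=2, num_attention_heads=2)
    model = BertForMaskedLM(config)
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()

Fine-tuning on a downstream property such as permeability would then reuse the pretrained encoder with a task-specific head (e.g., BertForSequenceClassification), following the usual pretrain-then-fine-tune recipe described above.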
3. Results and Discussion
We demonstrated that HELM sequences can be effectively learned by Transformer architectures. During pretraining, the MLM loss decreased steadily, indicating successful learning of HELM grammar and peptide structural patterns. On downstream tasks such as cell membrane permeability prediction, HELM-BERT outperformed SMILES-based baselines. These results indicate that choosing a representation aligned with the target modality is important for strong predictive performance.
4. Conclusions
We developed HELM-BERT and observed competitive performance on tasks associated with medium-sized peptides and their derivatives. This work establishes HELM as a powerful notation for ML-based modeling of structurally complex molecules. The sample efficiency achieved with only approximately 39,000 pretraining sequences highlights the advantage of our approach and provides a foundation for accelerating next-generation peptide therapeutic development in data-scarce domains.
[1] S. Balaji, R. Magar, Y. Jadhav, and A. B. Farimani, “GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction,” arXiv preprint arXiv:2310.03030, Oct. 2023, doi: 10.48550/arXiv.2310.03030.
[2] T. Zhang, H. Li, H. Xi, R. V. Stanton, and S. H. Rotstein, “HELM: A Hierarchical Notation Language for Complex Biomolecule Structure Representation,” J. Chem. Inf. Model., vol. 52, no. 10, pp. 2796–2806, Oct. 2012, doi: 10.1021/ci3001925.