P10-22

Empowering Federated Learning for Robust Compound-Protein Interaction Prediction across Heterogeneous Cross-Pharma Domains

Takuto KOYAMA *1, Hiroaki IWATA2, Ryosuke KOJIMA1, 3, Takao OTSUKA1, Aki HASEGAWA1, Teruki HONMA3, Shigeyuki MATSUMOTO1, Yasushi OKUNO1, 4

1Graduate School of Medicine, Kyoto University
2Faculty of Medicine, Tottori University
3RIKEN Center for Biosystems Dynamics Research
4RIKEN Center for Computational Science


Purpose
Federated learning (FL) is a privacy-preserving approach to compound-protein interaction (CPI) prediction that uses multi-institutional data. However, its effectiveness remains unclear when the data are heterogeneously distributed across different chemical and protein domains. This study evaluates FL under data heterogeneity and proposes a framework that achieves robust performance for both in-domain and out-of-domain predictions.

Methods
The FL model was built on a multimodal framework that uses a graph neural network for compounds and the protein language model ESM-2 for proteins. We used the kMoL [1] library for the FL implementation. Using public data (ChEMBL) and proprietary data from multiple companies, we evaluated FL under homogeneous and heterogeneous data distributions. To improve upon standard FL, we developed two strategies: 1) fine-tuning the global model on each client's local data, and 2) a Similarity-Guided Ensemble (SGE) that combines predictions from the global and fine-tuned models based on data similarity.
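As an illustration of the SGE idea only, the following minimal sketch blends a global-model prediction with a fine-tuned-model prediction according to how similar a query compound is to a client's local training compounds. The abstract does not specify the similarity measure or the weighting rule, so everything here is an assumption: the Tanimoto similarity on fingerprint bit sets, the linear weighting, and the names `tanimoto`, `max_similarity`, and `sge_predict` are hypothetical and are not part of kMoL or the authors' implementation.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def max_similarity(query: Set[int], local_train: Iterable[Set[int]]) -> float:
    """Similarity of a query compound to its nearest local training compound."""
    return max((tanimoto(query, fp) for fp in local_train), default=0.0)


def sge_predict(p_global: float, p_finetuned: float, similarity: float) -> float:
    """Blend two predictions (assumed rule, not the authors' formula).

    High similarity to local data favors the fine-tuned model;
    low similarity favors the global FL model.
    """
    w = min(max(similarity, 0.0), 1.0)  # clamp the weight to [0, 1]
    return w * p_finetuned + (1.0 - w) * p_global


# Toy fingerprints (sets of on-bit indices), purely illustrative.
local_train = [{1, 2, 3, 4}, {2, 3, 5}]
in_domain_query = {1, 2, 3, 4}   # identical to a local training compound
out_domain_query = {7, 8, 9}     # shares no bits with local data

sim_in = max_similarity(in_domain_query, local_train)
sim_out = max_similarity(out_domain_query, local_train)

pred_in = sge_predict(p_global=0.2, p_finetuned=0.9, similarity=sim_in)
pred_out = sge_predict(p_global=0.2, p_finetuned=0.9, similarity=sim_out)
```

Under this assumed rule, a query that matches the local data is scored almost entirely by the fine-tuned model, while a query far from the local data falls back to the global model. A comparable similarity term could in principle also be computed on the protein side.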

Results and Discussion
With heterogeneous data, a critical trade-off was observed: standard FL models excelled on out-of-domain data but underperformed locally trained models on in-domain data. Fine-tuning the FL model on local data improved its in-domain performance while maintaining strong out-of-domain performance. The proposed SGE method gave the most robust results, outperforming all other models on both in-domain and out-of-domain tasks. These findings were validated with real-world industry data from multiple pharmaceutical companies.

Conclusion
Standard FL improves out-of-domain CPI prediction but faces a trade-off with in-domain accuracy in heterogeneous settings. This challenge can be overcome by combining FL with local fine-tuning and our SGE approach, which together provide a robust framework for both specialized and exploratory tasks. Our findings offer a practical workflow to implement FL and accelerate collaborative drug discovery.

[1] Cozac, R., Hasic, H., Choong, J.J. et al. kMoL: an open-source machine and federated learning library for drug discovery. J Cheminform 17, 22 (2025). https://doi.org/10.1186/s13321-025-00967-9