Computational Identification of Antigen-Specific Sequences from BCR Repertoires Using an Antibody Language Model
Genki MASUDA*1, Shunsuke IIZUMI1, Yohei FUNAKOSHI2, Kimikazu YAKUSHIJIN2, Goh OHJI2, Masahito OHUE1
1School of Computing, Institute of Science Tokyo
2Kobe University Hospital and Graduate School of Medicine
Comprehensive analysis of B-cell receptor (BCR) repertoires is crucial for understanding how the immune system responds to infection and vaccination. However, an individual carries millions to tens of millions of diverse BCR sequences, and efficiently identifying the small fraction produced in response to a specific antigen remains a significant challenge. This study develops a computational method that uses an antibody language model to identify antigen-specific antibody sequences from longitudinal BCR repertoire data. The model is trained on the biological and physicochemical properties of antibody sequences. The proposed method combines two approaches. First, we cluster the sequence embeddings generated by the language model and detect clusters that expand significantly in size at the peak of the immune response. Second, we explore a probabilistic approach that uses perplexity, calculated by the language model, as an indicator of sequences presumed to have a high probability of binding the antigen. To validate the method, we applied it to in-house BCR repertoire data collected from COVID-19 patients and vaccinated individuals at Kobe University Hospital. The antibody sequences identified by our method overlapped significantly more with known SARS-CoV-2-specific antibodies in the CoV-AbDab database than randomly selected sequences did. This finding suggests that the method efficiently narrows down antigen-specific sequences from vast repertoire data and provides new insights for computational immunology. The antibody language model-based approach can serve as a powerful tool for computationally identifying antigen-specific BCRs without the need for immediate experimental validation, and it is expected to accelerate the elucidation of complex immune response mechanisms.
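
The two signals described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study. The function names (`perplexity`, `expanded_clusters`), the fold-change threshold, and the pseudocount are all hypothetical choices made for illustration. The sketch assumes that per-residue natural-log probabilities have already been obtained from some antibody language model, and that each sequence at the baseline and peak time points has already been assigned a cluster label from its embedding. A plain fold-change cutoff stands in for whatever statistical test of expansion the study actually applies.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity of one sequence: exp of the negative mean log-probability.

    `token_log_probs` holds natural-log probabilities, one per residue,
    as produced by a language model (assumed to be computed elsewhere).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def expanded_clusters(baseline_labels, peak_labels, min_fold=2.0, pseudocount=1.0):
    """Return {cluster_id: fold_change} for clusters that grow at the peak.

    Each argument is a list of cluster labels, one per sequence, for a
    single time point. Fold change compares the cluster's smoothed share
    of the repertoire at the peak with its share at baseline.
    NOTE: the threshold and pseudocount are illustrative assumptions.
    """
    base, peak = Counter(baseline_labels), Counter(peak_labels)
    clusters = set(base) | set(peak)
    k = len(clusters)
    n_base = len(baseline_labels) + pseudocount * k
    n_peak = len(peak_labels) + pseudocount * k

    result = {}
    for cluster in clusters:
        # Additive smoothing keeps clusters absent at one time point finite.
        frac_base = (base[cluster] + pseudocount) / n_base
        frac_peak = (peak[cluster] + pseudocount) / n_peak
        fold = frac_peak / frac_base
        if fold >= min_fold:
            result[cluster] = fold
    return result
```

In practice, the cluster labels would come from clustering the language-model embeddings, and the candidate sequences from the two signals could then be compared against a reference set such as CoV-AbDab.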