Heavy2Light: Teaching AI to decode antibody pairing

How we trained transformers to better understand the hidden grammar of antibody pairing.

Heavy2Light Logo
What if we could teach AI to understand antibody pairing preferences?

Your immune system generates billions of different antibodies, each consisting of two heavy chains and two light chains that must pair correctly to function. But despite years of research, it is still not fully understood which heavy chains prefer to pair with which light chains.

To explore this mystery, we developed Heavy2Light, a deep learning framework that learns to generate light chain sequences for a given heavy chain.

The challenge: lost pairing information

Modern sequencing allows us to get millions of antibody sequences from a single blood sample, but most of these methods lose track of which heavy chains naturally pair with which light chains[1][2][3]. This results in many millions of unpaired individual heavy and light antibody sequences, but significantly fewer paired sequences.

This lost pairing information matters: random pairing often produces non-functional or self-reactive antibodies[4], which limits what we can do with all of this sequencing data.

Why heavy-light pairing matters

  • Therapeutic antibody design: Ensuring stability and reducing immunogenicity
  • Bispecific antibody engineering: Achieving correct heavy-light assembly
  • Repertoire analysis: Decoding and understanding immune responses
  • Basic immunology: Understanding B cell selection and maturation

Our approach: two-stage deep learning

Since we had much more unpaired data than paired data available, we chose to leverage both of these data sources with a two-stage approach:

Stage 1: pre-training

We built two specialized language models, HeavyBERTa (trained on 99M+ heavy chains) and LightGPT (trained on 22M+ light chains), to learn the "grammar" of antibody sequences.
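The two pre-training objectives differ in how the model sees the sequence: HeavyBERTa learns by filling in masked residues, LightGPT by predicting the next residue. The sketch below illustrates both objectives on a toy amino-acid vocabulary; the token IDs, special tokens, and masking rate are illustrative assumptions, not the actual Heavy2Light training code.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
# Toy vocabulary: 20 amino acids plus special tokens (illustrative only).
vocab = {tok: i for i, tok in enumerate(["<pad>", "<mask>", "<bos>", "<eos>"] + list(AA))}

def encode(seq):
    """Map an amino-acid string to token IDs with start/end markers."""
    return [vocab["<bos>"]] + [vocab[a] for a in seq] + [vocab["<eos>"]]

def mlm_example(seq, mask_prob=0.15, seed=0):
    """HeavyBERTa-style objective: mask random residues, predict the originals."""
    rng = random.Random(seed)
    ids = encode(seq)
    labels = [-100] * len(ids)           # -100 = position ignored by the loss
    for i in range(1, len(ids) - 1):     # never mask <bos>/<eos>
        if rng.random() < mask_prob:
            labels[i] = ids[i]
            ids[i] = vocab["<mask>"]
    return ids, labels

def causal_example(seq):
    """LightGPT-style objective: predict the next amino acid at every position."""
    ids = encode(seq)
    return ids[:-1], ids[1:]             # inputs, targets shifted by one

inputs, labels = mlm_example("EVQLVESGGGLVQ")
x, y = causal_example("DIQMTQSPSSLSA")
```

The masked objective gives the encoder bidirectional context over the whole heavy chain, while the next-token objective makes the decoder a natural sequence generator.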

Stage 2: fine-tuning

We combined these models into Heavy2Light, an encoder-decoder architecture that learned to translate heavy chains into plausible light chains using ~588,000 paired sequences.
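The efficiency of this fine-tuning step comes from the bottleneck adapters[7]: small trainable modules inserted between frozen pre-trained layers. A minimal NumPy sketch of one such adapter is below; the dimensions and initialization are illustrative assumptions, not the actual Heavy2Light configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 64, 8   # illustrative sizes, not the paper's

# A bottleneck adapter (Pfeiffer et al. style): project down, apply a
# nonlinearity, project up, then add the result back to the frozen
# layer's output as a residual.
W_down = rng.normal(0, 0.02, (d_model, d_bottleneck))
W_up = np.zeros((d_bottleneck, d_model))  # zero-init: adapter starts as an identity

def adapter(h):
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.normal(size=(5, d_model))   # activations from a frozen layer
out = adapter(h)
```

Because only `W_down` and `W_up` are trained, the number of updated parameters is a small fraction of the full encoder-decoder, which is what makes fine-tuning on ~588K pairs tractable while keeping the pre-trained knowledge intact.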


Figure 1. How we trained Heavy2Light. (A) Data processing: We started by collecting millions of unpaired antibody sequences from OAS[5] and PLAbDab[6] databases. To reduce redundancy, we clustered similar sequences together and split them into separate datasets for heavy chains (99M+) and light chains (22M+). We also created a smaller dataset of paired sequences (~588K) where we know which heavy and light chains naturally go together. (B) Pre-training phase: First, we trained two specialized models separately. HeavyBERTa learned to understand heavy chain sequences by predicting randomly masked parts. LightGPT learned to generate light chain sequences by predicting the next amino acid in the sequence. (C) Fine-tuning phase: We combined the two pre-trained models into Heavy2Light using an encoder-decoder setup. The encoder (HeavyBERTa) reads the heavy chain, and the decoder (LightGPT) generates a matching light chain. We kept most of the pre-trained weights frozen and only trained small adapter modules[7] to connect them efficiently.

99M+ Heavy Chains
22M+ Light Chains
588K Paired Sequences
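The redundancy-reduction step in Figure 1A (clustering similar sequences before splitting into datasets) can be sketched with a toy greedy clusterer. Real pipelines use scalable tools such as MMseqs2 or CD-HIT; the function below, including its ungapped identity measure and threshold, is a simplified illustration of the idea only.

```python
def greedy_cluster(seqs, threshold=0.95):
    """Toy greedy clustering: assign each sequence to the first existing
    representative it matches at >= threshold identity, else start a new
    cluster. Illustrative only; production tools work on millions of
    sequences with proper alignment."""
    def identity(a, b):
        n = min(len(a), len(b))
        return sum(a[i] == b[i] for i in range(n)) / max(len(a), len(b))

    reps, assignment = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                assignment.append(i)
                break
        else:
            reps.append(s)
            assignment.append(len(reps) - 1)
    return reps, assignment

reps, assignment = greedy_cluster(["AAAA", "AAAA", "AAAT", "CCCC"], threshold=0.75)
```

Deduplicating like this prevents near-identical sequences from dominating the training distribution and from leaking between training and evaluation splits.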

What did we find?

1. Maturation-dependent pairing constraints

Memory B cells exhibit restricted V gene pairing preferences.

When we analyzed the light chains generated from memory B cell heavy chains, we found that 16.1% of memory-derived heavy chains consistently generated light chains using the same V gene family (≥80% consensus). In contrast, only 0.8% of naive B cell heavy chains showed this restriction.

Permutation testing confirmed this pattern significantly exceeds chance expectations (p < 0.001), demonstrating that the model learned genuine maturation-dependent pairing rules rather than simply memorizing training data correlations.
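The two measurements above (the ≥80% consensus criterion and the permutation test) can be sketched in a few lines. This is a toy reimplementation of the idea, not the paper's actual analysis code; the V gene family labels and the co-occurrence statistic are illustrative assumptions.

```python
import random
from collections import Counter

def consensus_restricted(vgene_calls, threshold=0.8):
    """True if >= threshold of the generated light chains use one V gene family."""
    top_count = Counter(vgene_calls).most_common(1)[0][1]
    return top_count / len(vgene_calls) >= threshold

def permutation_pvalue(memory_flags, restricted_flags, n_perm=10_000, seed=0):
    """Does restriction co-occur with memory status more often than chance?
    Shuffles the restriction labels and counts how often the shuffled
    co-occurrence reaches the observed one."""
    rng = random.Random(seed)
    observed = sum(m and r for m, r in zip(memory_flags, restricted_flags))
    flags = list(restricted_flags)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flags)
        if sum(m and r for m, r in zip(memory_flags, flags)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0

# 9 of 10 generated light chains use IGKV1 -> restricted at the 80% cutoff
restricted = consensus_restricted(["IGKV1"] * 9 + ["IGKV3"])
```

The add-one correction in the p-value is standard practice for permutation tests, since the observed labeling is itself one valid permutation.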

This finding aligns with recent experimental observations that memory B cells show pronounced heavy-light interdependence[8].


Figure 2. Memory B cells show stronger V gene pairing preferences. Each bar shows the percentage of heavy chains that consistently generate light chains from the same V gene family (≥80% consensus across 10 generated sequences). The colored bars show what Heavy2Light actually generated, while gray bars show random pairings (*p < 0.05, **p < 0.01, ***p < 0.001).

2. The trimodal kappa distribution

When we examined the sequence similarity between generated and true κ light chains, we found three distinct peaks instead of a single broad distribution.


Figure 3. Three distinct pairing modes in kappa light chains. Sequence identity distribution between generated and true light chains, separated by B cell type (naive in dark blue, memory in light blue) and light chain isotype (κ or λ). The three peaks in the κ distribution suggest that antibodies use different pairing strategies, from promiscuous public light chains that work with many heavy chains (~30-40% similarity) to highly specific co-evolved pairs (~70-80% similarity). Lambda (λ) light chains show a simpler pattern with just one peak.

Our hypothesis: These three modes could represent different pairing strategies, from promiscuous public light chains (~30-40% similarity) that pair with many heavy chains, to semi-specialized pairings (~50-60%), to highly specific co-evolved pairs (~70-80%) selected during affinity maturation.
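The similarity metric and the three hypothesized modes can be made concrete with a small sketch. The identity function below compares sequences position by position; a real analysis would align the sequences first (e.g. after antibody numbering), and the mode thresholds are loose illustrative cutoffs taken from the peak positions described above, not values from the paper.

```python
def percent_identity(a, b):
    """Ungapped position-wise identity between two sequences.
    Simplified: real pipelines align (and number) the sequences first."""
    n = min(len(a), len(b))
    matches = sum(a[i] == b[i] for i in range(n))
    return 100.0 * matches / max(len(a), len(b))

def pairing_mode(identity):
    """Assign a generated/true pair to one of the three hypothesized modes.
    Thresholds are illustrative, read loosely off the peaks in Figure 3."""
    if identity < 45:
        return "promiscuous/public"
    if identity < 65:
        return "semi-specialized"
    return "co-evolved"

score = percent_identity("ABCD", "ABCE")   # 3 of 4 positions match -> 75.0
```

Binning every generated/true pair this way is what produces the histogram in Figure 3.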

3. Structural conservation despite sequence divergence

Generated light chains with as little as 31% sequence identity to their true counterparts still folded into correct 3D structures (structures were predicted with Chai-1[9]).


Figure 4. 3D structures of generated light chains (purple) aligned with their native counterparts (teal), showing three examples with different levels of sequence similarity. Even when the generated sequence differs substantially from the true sequence (as low as 31% identity), the overall 3D fold remains similar.

4. Co-evolutionary relationships

Generated light chains showed correlated germline identities with their input heavy chains (Pearson R = 0.440), approaching native pairs (R = 0.593) and far exceeding random pairs (R = -0.002).


Figure 5. Heavy2Light learns natural heavy-light chain co-evolution patterns. Each plot shows how "germline-like" the heavy and light chains are (i.e., how similar they are to the original gene templates before mutations). (A) Natural antibody pairs show a clear correlation (R = 0.593): when a heavy chain is highly mutated, its light chain partner tends to be mutated too. (B) Randomly shuffled pairs show no correlation (R = -0.002). (C) Heavy2Light-generated light chains maintain a meaningful correlation with their input heavy chains (R = 0.440), showing the model learned these natural co-evolution patterns without being explicitly trained on them.
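The native-vs-shuffled comparison in Figure 5 boils down to computing a Pearson correlation between heavy and light germline identities before and after shuffling the pairing. The sketch below reproduces that logic on synthetic data; the numbers generated here are made up for illustration and are not the paper's values.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)
# Synthetic germline identities (%) for 500 hypothetical pairs:
heavy = [rng.uniform(85, 100) for _ in range(500)]
# Native-like partners: light identity tracks heavy identity plus noise.
light_native = [0.6 * h + 38 + rng.gauss(0, 2) for h in heavy]
# Shuffling the pairing destroys the relationship.
light_shuffled = light_native[:]
rng.shuffle(light_shuffled)

r_native = pearson(heavy, light_native)      # clearly positive
r_shuffled = pearson(heavy, light_shuffled)  # near zero
```

The gap between `r_native` and `r_shuffled` is the signal: a generative model that lands between the two, as Heavy2Light does (R = 0.440 vs. 0.593 native), has captured part of the co-evolutionary coupling.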

What this means

Heavy2Light demonstrates that deep learning can uncover biologically meaningful pairing constraints without explicit supervision. The model learned maturation-dependent preferences, germline co-evolution patterns, and V gene restrictions, all from sequence data alone.

This work demonstrates a shift from classification toward generative biology, where models propose novel, biologically plausible antibody sequences. By incorporating maturation state and other biological knowledge, these approaches could enable more sophisticated antibody design and help decode immune repertoire dynamics in vaccination, infection, and disease.

References

  1. Dudzic P, Chomicz D, Bielska W, Jaszczyszyn I, Zieliński M, Janusz B, Wróbel S, Le Pannérer MM, Philips A, Ponraj P, Kumar S, Krawczyk K (2025). Conserved heavy/light contacts and germline preferences revealed by a large-scale analysis of natively paired human antibody sequences and structural data. Communications Biology, 8(1), 1110. https://doi.org/10.1038/s42003-025-08388-y
  2. DeKosky, B.J., Kojima, T., Rodin, A., et al. (2015). In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire. Nature Medicine, 21, 86-91. https://doi.org/10.1038/nm.3743
  3. Kovaltsuk, A., Krawczyk, K., Galson, J.D., Kelly, D.F., Deane, C.M., & Trück, J. (2017). How B-Cell Receptor Repertoire Sequencing Can Be Enriched with Structural Antibody Data. Frontiers in Immunology, 8, 1753. https://doi.org/10.3389/fimmu.2017.01753
  4. Novobrantseva, T. (2005). Stochastic pairing of Ig heavy and light chains frequently generates B cell antigen receptors that are subject to editing in vivo. International Immunology, 17, 343-350. https://doi.org/10.1093/intimm/dxh214
  5. Olsen, T.H., Boyles, F., & Deane, C.M. (2022). Observed Antibody Space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science, 31(1), 141-146. https://pubmed.ncbi.nlm.nih.gov/34655133/
  6. Abanades, B., Olsen, T.H., Raybould, M.I.J., Aguilar-Sanjuan, B., Wong, W.K., Georges, G., Bujotzek, A., & Deane, C.M. (2024). The Patent and Literature Antibody Database (PLAbDab): an evolving reference set of functionally diverse, literature-annotated antibody sequences and structures. Nucleic Acids Research, 52(D1), D545-D551. https://doi.org/10.1093/nar/gkad1056
  7. Pfeiffer, J., Rücklé, A., Poth, C., Kamath, A., Vulić, I., Ruder, S., Cho, K., & Gurevych, I. (2020). AdapterHub: A Framework for Adapting Transformers. Proceedings of EMNLP 2020: System Demonstrations, 46-54. https://doi.org/10.18653/v1/2020.emnlp-demos.7
  8. Jaffe, D.B., Shahi, P., Adams, B.A., et al. (2022). Functional antibodies exhibit light chain coherence. Nature, 611, 352-357. https://doi.org/10.1038/s41586-022-05371-z
  9. Chai Discovery, Boitreaud, J., Dent, J., et al. (2024). Chai-1: Decoding the molecular interactions of life. bioRxiv. https://doi.org/10.1101/2024.10.10.615955

Get in touch

Questions about Heavy2Light? You can reach out via email (lea.broennimann@unibe.ch) or open an issue on GitHub.