How we trained transformers to better understand the hidden grammar of antibody pairing.
Your immune system generates billions of different antibodies, each consisting of two heavy chains and two light chains that must pair correctly to function. But despite years of research, it is still not fully understood which heavy chains prefer to pair with which light chains.
To explore this mystery, we developed Heavy2Light, a deep learning framework that has learned to generate light chain sequences from a given heavy chain.
Modern sequencing allows us to get millions of antibody sequences from a single blood sample, but most of these methods lose track of which heavy chains naturally pair with which light chains[1][2][3]. This results in many millions of unpaired individual heavy and light antibody sequences, but significantly fewer paired sequences.
This lost pairing information matters: random pairing often produces non-functional or self-reactive antibodies[4], which limits what we can do with all this sequencing data.
Since we had much more unpaired data than paired data available, we chose to leverage both of these data sources with a two-stage approach:
We built two specialized language models, HeavyBERTa (trained on 99M+ heavy chains) and LightGPT (trained on 22M+ light chains), to learn the "grammar" of antibody sequences.
We combined these models into Heavy2Light, an encoder-decoder architecture that learned to translate heavy chains into plausible light chains using ~588,000 paired sequences.
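The two pretraining objectives behind HeavyBERTa and LightGPT can be sketched in a few lines of Python. This is a minimal illustration of the objectives only, not our training code: the `[MASK]` token, the masking rate, the seed, and the toy sequence fragments are all illustrative assumptions.

```python
import random

def mlm_example(seq, mask_rate=0.15, seed=1):
    """HeavyBERTa-style masked-LM objective: hide a fraction of residues
    and ask the model to predict them from context on both sides."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = aa          # positions the loss is computed on
        else:
            masked.append(aa)
    return masked, targets

def causal_lm_example(seq):
    """LightGPT-style next-token objective: at every position the model
    predicts the following residue from the left context only."""
    inputs = list(seq[:-1])
    targets = list(seq[1:])
    return inputs, targets

heavy = "EVQLVESGGGLVQPGGSLRLSCAAS"          # toy heavy-chain fragment
masked, mlm_targets = mlm_example(heavy)
inputs, next_tokens = causal_lm_example("DIQMTQSPSSLSASVGDRVTITC")
```

Fine-tuning then couples the two: the masked-LM encoder reads the heavy chain, and the causal decoder generates the light chain one residue at a time, conditioned on the encoder output through the trained adapter modules.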
Figure 1. How we trained Heavy2Light. (A) Data processing: We started by collecting millions of unpaired antibody sequences from OAS[5] and PLAbDab[6] databases. To reduce redundancy, we clustered similar sequences together and split them into separate datasets for heavy chains (99M+) and light chains (22M+). We also created a smaller dataset of paired sequences (~588K) where we know which heavy and light chains naturally go together. (B) Pre-training phase: First, we trained two specialized models separately. HeavyBERTa learned to understand heavy chain sequences by predicting randomly masked parts. LightGPT learned to generate light chain sequences by predicting the next amino acid in the sequence. (C) Fine-tuning phase: We combined the two pre-trained models into Heavy2Light using an encoder-decoder setup. The encoder (HeavyBERTa) reads the heavy chain, and the decoder (LightGPT) generates a matching light chain. We kept most of the pre-trained weights frozen and only trained small adapter modules[7] to connect them efficiently.
Memory B cells exhibit restricted V gene pairing preferences.
When we analyzed the light chains generated from memory B cell heavy chains, we found that 16.1% of memory-derived heavy chains consistently generated light chains using the same V gene family (≥80% consensus). In contrast, only 0.8% of naive B cell heavy chains showed this restriction.
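The ≥80% consensus criterion boils down to a simple majority count over the V gene family calls of the generated light chains. A minimal sketch (the gene family names below are hypothetical examples, not data from our analysis):

```python
from collections import Counter

def is_restricted(v_families, threshold=0.8):
    """A heavy chain counts as 'restricted' if at least `threshold` of its
    generated light chains use the same V gene family."""
    counts = Counter(v_families)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(v_families) >= threshold

# Hypothetical V gene family calls for 10 generated light chains:
memory_like = ["IGKV1"] * 9 + ["IGKV3"]              # 90% consensus
naive_like = ["IGKV1", "IGKV2", "IGKV3", "IGKV1", "IGKV4",
              "IGKV2", "IGKV1", "IGKV3", "IGKV2", "IGKV4"]  # no consensus
```

The 16.1% and 0.8% figures above are the fractions of memory-derived and naive-derived heavy chains, respectively, for which this predicate holds.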
Permutation testing confirmed this pattern significantly exceeds chance expectations (p < 0.001), demonstrating that the model learned genuine maturation-dependent pairing rules rather than simply memorizing training data correlations.
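A permutation test of this kind shuffles the memory/naive labels many times and asks how often the shuffled data shows a restriction rate at least as extreme as the observed one. The sketch below is a generic one-sided version with toy data, assumed for illustration rather than taken from our analysis:

```python
import random

def permutation_p_value(labels, outcomes, n_perm=10_000, seed=0):
    """One-sided permutation p-value: is the restriction rate among
    'memory' cells higher than expected under random labeling?"""
    rng = random.Random(seed)

    def memory_rate(ls):
        hits = [o for l, o in zip(ls, outcomes) if l == "memory"]
        return sum(hits) / len(hits)

    observed = memory_rate(labels)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if memory_rate(shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)   # add-one correction avoids p = 0

# Toy data: 50 memory cells (80% restricted) vs 50 naive cells (0%):
labels = ["memory"] * 50 + ["naive"] * 50
outcomes = [1] * 40 + [0] * 10 + [0] * 50
p = permutation_p_value(labels, outcomes)
```

With a strong observed difference like this, essentially no permutation reaches the observed rate, so the p-value is bounded only by the number of permutations run.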
This finding aligns with recent experimental observations that memory B cells show pronounced heavy-light interdependence[8].
Figure 2. Memory B cells show stronger V gene pairing preferences. Each bar shows the percentage of heavy chains that consistently generate light chains from the same V gene family (≥80% consensus across 10 generated sequences). The colored bars show what Heavy2Light actually generated, while gray bars show random pairings (*p < 0.05, **p < 0.01, ***p < 0.001).
When we examined the sequence similarity between generated and true κ light chains, we found three distinct peaks instead of a single broad distribution.
Figure 3. Three distinct pairing modes in kappa light chains. Sequence identity distribution between generated and true light chains, separated by B cell type (naive in dark blue, memory in light blue) and light chain isotype (κ or λ). The three peaks in the κ distribution suggest that antibodies use different pairing strategies, from promiscuous public light chains that work with many heavy chains (~30-40% similarity) to highly specific co-evolved pairs (~70-80% similarity). Lambda (λ) light chains show a simpler pattern with just one peak.
Our hypothesis: These three modes could represent different pairing strategies, from promiscuous public light chains (~30-40% similarity) that pair with many heavy chains, to semi-specialized pairings (~50-60%), to highly specific co-evolved pairs (~70-80%) selected during affinity maturation.
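The similarity metric underlying these distributions is fractional sequence identity between a generated and a true light chain. A minimal sketch, assuming the two sequences have already been aligned to equal length (real analyses align first and handle gaps; the fragments below are toy examples):

```python
def sequence_identity(a, b):
    """Fraction of identical residues between two aligned sequences of
    equal length (gap and alignment handling omitted for brevity)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

# Toy generated vs. true light-chain fragments, 8 of 10 residues match:
identity = sequence_identity("DIQMTQSPSS", "DIQLTQSPSF")
```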
Generated light chains with as little as 31% sequence identity to their true counterparts still folded into correct 3D structures (folding was performed with Chai-1[9]).
Figure 4. 3D structures of generated light chains (purple) aligned with their native counterparts (teal), showing three examples with different levels of sequence similarity. Even when the generated sequence differs substantially from the true sequence (as low as 31% identity), the overall 3D fold remains similar.
Generated light chains showed correlated germline identities with their input heavy chains (Pearson R = 0.440), approaching native pairs (R = 0.593) and far exceeding random pairs (R = -0.002).
Figure 5. Heavy2Light learns natural heavy-light chain co-evolution patterns. Each plot shows how "germline-like" the heavy and light chains are (i.e., how similar they are to the original gene templates before mutations). (A) Natural antibody pairs show a clear correlation (R = 0.593): when a heavy chain is highly mutated, its light chain partner tends to be mutated too. (B) Randomly shuffled pairs show no correlation (R = -0.002). (C) Heavy2Light-generated light chains maintain a meaningful correlation with their input heavy chains (R = 0.440), showing the model learned these natural co-evolution patterns without being explicitly trained on them.
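The correlations reported in Figure 5 are ordinary Pearson coefficients between heavy-chain and light-chain germline identities. A self-contained sketch with made-up germline identity values (the numbers below are illustrative, not our data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy germline identities (%) for five heavy/light pairs that
# mutated together, as co-evolved pairs tend to do:
heavy_id = [99.0, 96.5, 93.0, 90.0, 85.5]
light_id = [98.5, 97.0, 94.5, 91.0, 88.0]
r = pearson_r(heavy_id, light_id)
```

For co-mutated pairs like these, `r` lands close to 1; shuffling one list relative to the other pushes it toward 0, which is exactly the contrast between panels (A) and (B).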
Heavy2Light demonstrates that deep learning can uncover biologically meaningful pairing constraints without explicit supervision. The model learned maturation-dependent preferences, germline co-evolution patterns, and V gene restrictions, all from sequence data alone.
This work demonstrates a shift from classification toward generative biology, where models propose novel, biologically plausible antibody sequences. By incorporating maturation state and other biological knowledge, these approaches could enable more sophisticated antibody design and help decode immune repertoire dynamics in vaccination, infection, and disease.
Questions about Heavy2Light? You can reach out via email (lea.broennimann@unibe.ch) or open an issue on GitHub.