How we trained transformers to better understand the hidden grammar of antibody pairing.
Your immune system generates billions of different antibodies, each consisting of two heavy chains and two light chains that must pair correctly to function. But despite years of research, it is still not fully understood which heavy chains prefer to pair with which light chains.
To explore this mystery, we developed Heavy2Light, a deep learning framework that has learned to generate light chain sequences from a given heavy chain.
Modern sequencing allows us to get millions of antibody sequences from a single blood sample, but most of these methods lose track of which heavy chains naturally pair with which light chains[1][2][3]. This results in many millions of unpaired individual heavy and light antibody sequences, but significantly fewer paired sequences.
This lost pairing information matters: random pairing often produces non-functional or self-reactive antibodies[4], which limits what we can do with all this sequencing data.
Since we had much more unpaired data than paired data available, we chose to leverage both of these data sources with a two-stage approach:
We built two specialized language models, HeavyBERTa (trained on 99M+ heavy chains) and LightGPT (trained on 22M+ light chains), to learn the "grammar" of antibody sequences.
We combined these models into Heavy2Light, an encoder-decoder architecture that learned to translate heavy chains into plausible light chains using ~588,000 paired sequences.
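The two pretraining objectives behind HeavyBERTa and LightGPT can be sketched in a few lines of Python. This is a minimal illustration of the objectives only, not our training code: the `[MASK]` token, the masking rate, the seed, and the toy sequence fragments are all illustrative assumptions.

```python
import random

def mlm_example(seq, mask_rate=0.15, seed=1):
    """HeavyBERTa-style masked-LM objective: hide a fraction of residues
    and ask the model to predict them from context on both sides."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = aa          # positions the loss is computed on
        else:
            masked.append(aa)
    return masked, targets

def causal_lm_example(seq):
    """LightGPT-style next-token objective: at every position the model
    predicts the following residue from the left context only."""
    inputs = list(seq[:-1])
    targets = list(seq[1:])
    return inputs, targets

heavy = "EVQLVESGGGLVQPGGSLRLSCAAS"          # toy heavy-chain fragment
masked, mlm_targets = mlm_example(heavy)
inputs, next_tokens = causal_lm_example("DIQMTQSPSSLSASVGDRVTITC")
```

Fine-tuning then couples the two: the masked-LM encoder reads the heavy chain, and the causal decoder generates the light chain one residue at a time, conditioned on the encoder output through the trained adapter modules.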
Figure 1. How we trained Heavy2Light. (A) Data processing: We started by collecting millions of unpaired antibody sequences from OAS[5] and PLAbDab[6] databases. To reduce redundancy, we clustered similar sequences together and split them into separate datasets for heavy chains (99M+) and light chains (22M+). We also created a smaller dataset of paired sequences (~588K) where we know which heavy and light chains naturally go together. (B) Pre-training phase: First, we trained two specialized models separately. HeavyBERTa learned to understand heavy chain sequences by predicting randomly masked parts. LightGPT learned to generate light chain sequences by predicting the next amino acid in the sequence. (C) Fine-tuning phase: We combined the two pre-trained models into Heavy2Light using an encoder-decoder setup. The encoder (HeavyBERTa) reads the heavy chain, and the decoder (LightGPT) generates a matching light chain. We kept most of the pre-trained weights frozen and only trained small adapter modules[7] to connect them efficiently.
Memory B cells exhibit restricted V gene pairing preferences.
When we analyzed the light chains generated from memory B cell heavy chains, we found that 16.1% of memory-derived heavy chains consistently generated light chains using the same V gene family (≥80% consensus). In contrast, only 0.8% of naive B cell heavy chains showed this restriction.
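The ≥80% consensus criterion boils down to a simple majority count over the V gene family calls of the generated light chains. A minimal sketch (the gene family names below are hypothetical examples, not data from our analysis):

```python
from collections import Counter

def is_restricted(v_families, threshold=0.8):
    """A heavy chain counts as 'restricted' if at least `threshold` of its
    generated light chains use the same V gene family."""
    counts = Counter(v_families)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(v_families) >= threshold

# Hypothetical V gene family calls for 10 generated light chains:
memory_like = ["IGKV1"] * 9 + ["IGKV3"]              # 90% consensus
naive_like = ["IGKV1", "IGKV2", "IGKV3", "IGKV1", "IGKV4",
              "IGKV2", "IGKV1", "IGKV3", "IGKV2", "IGKV4"]  # no consensus
```

The 16.1% and 0.8% figures above are the fractions of memory-derived and naive-derived heavy chains, respectively, for which this predicate holds.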
Permutation testing confirmed this pattern significantly exceeds chance expectations (p < 0.001), demonstrating that the model learned genuine maturation-dependent pairing rules rather than simply memorizing training data correlations.
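A permutation test of this kind shuffles the memory/naive labels many times and asks how often the shuffled data shows a restriction rate at least as extreme as the observed one. The sketch below is a generic one-sided version with toy data, assumed for illustration rather than taken from our analysis:

```python
import random

def permutation_p_value(labels, outcomes, n_perm=10_000, seed=0):
    """One-sided permutation p-value: is the restriction rate among
    'memory' cells higher than expected under random labeling?"""
    rng = random.Random(seed)

    def memory_rate(ls):
        hits = [o for l, o in zip(ls, outcomes) if l == "memory"]
        return sum(hits) / len(hits)

    observed = memory_rate(labels)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if memory_rate(shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)   # add-one correction avoids p = 0

# Toy data: 50 memory cells (80% restricted) vs 50 naive cells (0%):
labels = ["memory"] * 50 + ["naive"] * 50
outcomes = [1] * 40 + [0] * 10 + [0] * 50
p = permutation_p_value(labels, outcomes)
```

With a strong observed difference like this, essentially no permutation reaches the observed rate, so the p-value is bounded only by the number of permutations run.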
This finding aligns with recent experimental observations that memory B cells show pronounced heavy-light interdependence[8].
Figure 2. Memory B cells show stronger V gene pairing preferences. Each bar shows the percentage of heavy chains that consistently generate light chains from the same V gene family (≥80% consensus across 10 generated sequences). The colored bars show what Heavy2Light actually generated, while gray bars show random pairings (*p < 0.05, **p < 0.01, ***p < 0.001).
When we examined the sequence similarity between generated and true κ light chains, we found three distinct peaks instead of a single broad distribution.
Figure 3. Three distinct pairing modes in kappa light chains. Sequence identity distribution between generated and true light chains, separated by B cell type (naive in dark blue, memory in light blue) and light chain isotype (κ or λ). The three peaks in the κ distribution suggest that antibodies use different pairing strategies, from promiscuous public light chains that work with many heavy chains (~30-40% similarity) to highly specific co-evolved pairs (~70-80% similarity). Lambda (λ) light chains show a simpler pattern with just one peak.
Our hypothesis: These three modes could represent different pairing strategies, from promiscuous public light chains (~30-40% similarity) that pair with many heavy chains, to semi-specialized pairings (~50-60%), to highly specific co-evolved pairs (~70-80%) selected during affinity maturation.
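The similarity metric underlying these distributions is fractional sequence identity between a generated and a true light chain. A minimal sketch, assuming the two sequences have already been aligned to equal length (real analyses align first and handle gaps; the fragments below are toy examples):

```python
def sequence_identity(a, b):
    """Fraction of identical residues between two aligned sequences of
    equal length (gap and alignment handling omitted for brevity)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

# Toy generated vs. true light-chain fragments, 8 of 10 residues match:
identity = sequence_identity("DIQMTQSPSS", "DIQLTQSPSF")
```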
Generated light chains with as little as 31% sequence identity to their true counterparts still folded into correct 3D structures (folding was performed with Chai-1[9]).
Figure 4. 3D structures of generated light chains (purple) aligned with their native counterparts (teal), showing three examples with different levels of sequence similarity. Even when the generated sequence differs substantially from the true sequence (as low as 31% identity), the overall 3D fold remains similar.
Generated light chains showed correlated germline identities with their input heavy chains (Pearson R = 0.440), approaching native pairs (R = 0.593) and far exceeding random pairs (R = -0.002).
Figure 5. Heavy2Light learns natural heavy-light chain co-evolution patterns. Each plot shows how "germline-like" the heavy and light chains are (i.e., how similar they are to the original gene templates before mutations). (A) Natural antibody pairs show a clear correlation (R = 0.593): when a heavy chain is highly mutated, its light chain partner tends to be mutated too. (B) Randomly shuffled pairs show no correlation (R = -0.002). (C) Heavy2Light-generated light chains maintain a meaningful correlation with their input heavy chains (R = 0.440), showing the model learned these natural co-evolution patterns without being explicitly trained on them.
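The correlations reported in Figure 5 are ordinary Pearson coefficients between heavy-chain and light-chain germline identities. A self-contained sketch with made-up germline identity values (the numbers below are illustrative, not our data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy germline identities (%) for five heavy/light pairs that
# mutated together, as co-evolved pairs tend to do:
heavy_id = [99.0, 96.5, 93.0, 90.0, 85.5]
light_id = [98.5, 97.0, 94.5, 91.0, 88.0]
r = pearson_r(heavy_id, light_id)
```

For co-mutated pairs like these, `r` lands close to 1; shuffling one list relative to the other pushes it toward 0, which is exactly the contrast between panels (A) and (B).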
Heavy2Light demonstrates that deep learning can uncover biologically meaningful pairing constraints without explicit supervision. The model learned maturation-dependent preferences, germline co-evolution patterns, and V gene restrictions, all from sequence data alone.
This work demonstrates a shift from classification toward generative biology, where models propose novel, biologically plausible antibody sequences. By incorporating maturation state and other biological knowledge, these approaches could enable more sophisticated antibody design and help decode immune repertoire dynamics in vaccination, infection, and disease.
Questions about Heavy2Light? You can reach out via email (lea.broennimann@unibe.ch) or open an issue on GitHub.