WO2022223451A1 - Engineering of antigen-binding proteins - Google Patents

Engineering of antigen-binding proteins

Info

Publication number
WO2022223451A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
chain
sequences
heavy
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2022/060073
Other languages
English (en)
French (fr)
Inventor
Jinwoo LEEM
Jacob GALSON
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alchemab Therapeutics Ltd
Original Assignee
Alchemab Therapeutics Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alchemab Therapeutics Ltd
Priority to CA3215778A (CA3215778A1)
Priority to EP22717415.8A (EP4327328A1)
Priority to CN202280036343.6A (CN117337470A)
Priority to US18/287,352 (US20240203523A1)
Priority to JP2023564230A (JP2024514691A)
Priority to KR1020237039964A (KR20240011144A)
Priority to AU2022260043A (AU2022260043A1)
Priority to IL307832A (IL307832A)
Publication of WO2022223451A1
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional [2D] or three-dimensional [3D] molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10 Design of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the present invention relates to methods for engineering antigen-binding proteins such as B cell receptors, antibodies and T cell receptors by identifying variable chain pairings from single input variable chains, such as a heavy-light chain pair from an input heavy or light chain, or an α-β chain pair from an input α or β chain.
  • the present invention also relates to methods of providing an antigen-binding protein, such as a therapeutic antibody, derived from an input variable chain, for example a B cell receptor / antibody heavy or light chain.
  • BCRs are composed of two pairs of protein chains: two heavy chains and two light chains. Each B cell expresses a (likely unique) pair of heavy and light chains to form its BCR, which is expressed on its surface, or secreted as an antibody. Over 600 million different human heavy chain sequences and approximately 70 million light chain sequences are currently catalogued in the Observed Antibody Space [Kovaltsuk et al., 2018].
  • Single B cell sequencing is more commonly employed for antibody discovery applications, as it preserves the pairing information between the heavy and the light chains.
  • single-cell sequencing has a limited throughput, and different platforms and protocols vary in their coverage of the BCR repertoire present within a single sample. Even the most advanced microfluidic systems can typically only recover the sequences for ~10^4 B cells per sample [King et al., 2021; Eccles et al., 2020; Setliff et al., 2019].
  • Single-cell sequencing has very specific sample requirements (for example, the cells typically have to remain viable until processed, thus requiring fresh samples processed on the day of collection, or frozen according to a specific protocol), very high costs per sample compared to bulk sequencing (single-cell sequencing being at least an order of magnitude more expensive than bulk sequencing) and requires dedicated laboratory equipment.
  • TCR: T cell receptor
  • the total size of the TCR repertoire in humans is estimated to comprise up to ~10^15 unique αβ T cell receptor (TCR) pairs [Carter et al., 2019].
  • although experimental approaches for paired αβ TCR sequencing have been developed (including single cell approaches [Zheng et al., 2017] and multi-cell deconvolution based approaches [Howie et al., 2015]), these remain specialised and limited in throughput.
  • the majority of the TCR repertoire knowledge available is based on bulk sequencing of single chain repertoires, mostly the β chain repertoire. This is inherently limited, especially as it has been shown that both the α and β TCR chains are involved in alloreactivity and antigen specificity [Carter et al., 2019].
  • the present inventors further identified that for generalised application to antibody discovery, it is desirable to be able to generate a viable light chain for any given heavy chain. It is further desirable to be able to generate this using only heavy chain information as BCR repertoire bulk sequencing efforts often focus limited resources on sequencing the heavy chain, which is believed to play a more important functional role than the light chain.
  • NLP: natural language processing
  • the inventors further identified that the same approach could be used to solve the problem of αβ TCR chain pairing.
  • a method of identifying an antigen-binding protein comprising a pair of chains comprising: providing a query sequence comprising a first chain sequence, and identifying a corresponding chain sequence by: providing the query sequence to a deep learning model configured to take as input a query first chain sequence and to produce as output at least one corresponding chain sequence, thereby identifying a corresponding chain sequence for the query sequence, wherein the deep learning model has been trained using training first and corresponding chain sequences from known chain pairs.
  • the method may have one or more of the following features.
  • The pair of chains may be referred to as “variable chains”.
  • the term “known chain pairs” refers to pairs of variable chain sequences that are known to be present in antigen-binding proteins showing a desired antigen binding function, or in antigen-binding proteins that form part of at least one subject’s B cell or T cell repertoire. The latter may also be referred to as “native” chain pairs.
  • the antigen-binding protein may comprise a heavy-light chain pair, wherein the first chain sequence is a heavy chain sequence or a light chain sequence, and the corresponding chain sequence is a light chain sequence or a heavy chain sequence.
  • the first chain sequence may be a heavy chain sequence and the corresponding sequence may be a light chain sequence.
  • the antigen-binding protein may be a B cell receptor or antibody, or a protein derived therefrom.
  • the antigen-binding protein may comprise a heavy-light chain pair.
  • the query sequence may comprise a heavy chain sequence or a light chain sequence.
  • the corresponding chain sequence may be a light chain sequence or a heavy chain sequence.
  • the antigen-binding protein may comprise an αβ chain pair, wherein the first chain sequence is a β chain sequence or an α chain sequence, and the corresponding chain sequence is an α chain sequence or a β chain sequence.
  • the first chain sequence may be a β chain sequence and the corresponding sequence may be an α chain sequence.
  • the antigen-binding protein may comprise a γδ chain pair, wherein the first chain sequence is a δ chain sequence or a γ chain sequence, and the corresponding chain sequence is a γ chain sequence or a δ chain sequence.
  • the first chain sequence may be a δ chain sequence and the corresponding sequence may be a γ chain sequence.
  • the antigen-binding protein may be a T cell receptor or a protein derived therefrom.
  • the antigen-binding protein may comprise an αβ chain pair or a γδ chain pair.
  • the query sequence may comprise a β or δ chain sequence or an α or γ chain sequence.
  • the corresponding chain sequence may be an α or γ chain sequence or a β or δ chain sequence.
  • the deep learning model may be a sequence-to-sequence model.
  • the deep learning model may comprise a recurrent neural network or a transformer.
  • the deep learning model may be a sequence-to-sequence transformer-based model.
  • the recurrent neural network may be a gated recurrent unit (GRU) -based model or a long short-term memory (LSTM) model.
  • the GRU-based model may comprise a GRU-based encoder and a GRU-based decoder.
  • the encoder may be a 4-layer bi-directional GRU, for example with a hidden dimension of 1024.
  • the decoder may be a 4-layer, forward-only GRU, for example with a hidden dimension of 1024.
  • a transformer is a deep learning model that uses the mechanism of attention.
  • the transformer-based model may be a transformer model with an architecture using self-attention and point-wise, fully connected layers for both the encoder and the decoder.
  • the encoder and/or the decoder may be composed of a stack of 4 identical layers.
  • Each layer of the encoder may have two sublayers: a multi-head self-attention layer and a position-wise fully connected feed forward network layer.
  • Each layer of the decoder may have three sublayers: a self-attention sublayer, a layer that performs multi-head attention over the output of the encoder stack, and a feedforward network layer.
  • the model may have a feed-forward dimension of 1024.
  • the deep learning model may be configured to produce as output one or more corresponding chain sequences. Each corresponding chain sequence may be associated with a confidence metric such as a probability. A corresponding chain sequence for the query sequence may be identified as a sequence of the one or more corresponding chain sequences that is associated with the highest confidence metric amongst the one or more corresponding chain sequences.
  • the deep learning model may be configured to produce as output a single corresponding chain sequence.
  • the deep learning model may be configured to predict each chain in a sequential manner. In other words, the deep learning model may be configured to provide predictions in a greedy manner.
  • the deep learning model may be configured to predict chains using a beam search approach or a related approach such as beam stack search [Zhou & Hansen, 2005] and depth-first beam search [Furcy & Koenig, 2005].
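The difference between greedy prediction and beam search described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_step` is a hypothetical stand-in for a trained decoder that returns next-token probabilities, the tokens "A" to "D" are placeholders rather than real chain tokens, and `EOS` is an assumed end-of-sequence marker. Greedy decoding is expressed here as beam search with a beam width of 1.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

EOS = "<eos>"  # assumed end-of-sequence token

Tokens = Tuple[str, ...]
StepFn = Callable[[Tokens], Dict[str, float]]  # prefix -> next-token probabilities
Hypothesis = Tuple[Tokens, float]              # (tokens, cumulative log-probability)


def beam_search(step: StepFn, beam_width: int, max_len: int) -> List[Hypothesis]:
    """Return up to beam_width finished sequences, best first."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for token, prob in step(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + (token,), score + math.log(prob)))
        # Keep only the beam_width most probable extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens[:-1], score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]


def greedy_decode(step: StepFn, max_len: int) -> Optional[Hypothesis]:
    """Greedy decoding: always follow the single most probable next token."""
    results = beam_search(step, beam_width=1, max_len=max_len)
    return results[0] if results else None


def toy_step(tokens: Tokens) -> Dict[str, float]:
    """Hypothetical decoder with hand-written probabilities (not a trained model)."""
    if tokens == ():
        return {"A": 0.6, "B": 0.4}
    if tokens == ("A",):
        return {"C": 0.55, "D": 0.45}
    if tokens == ("B",):
        return {"C": 0.9, "D": 0.1}
    return {EOS: 1.0}


greedy_result = greedy_decode(toy_step, max_len=5)
beam_results = beam_search(toy_step, beam_width=2, max_len=5)
```

In this toy setting the greedy path commits to the locally best first token, whereas a beam width of 2 keeps an alternative prefix alive and can return a sequence with a higher overall probability, together with a ranked list of alternatives.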
  • the deep learning model may be configured to produce as output a plurality of corresponding chain sequences.
  • Each of the plurality of corresponding chain sequences may be associated with a confidence metric such as a probability.
  • the single corresponding chain sequence associated with the highest confidence metric may be reported.
  • the method may comprise identifying a corresponding chain sequence as one of a plurality of corresponding chain sequences that is associated with the highest confidence metric amongst the one or more corresponding chain sequences.
  • all corresponding chain sequences that have been predicted may be reported, advantageously together with an associated confidence metric.
  • all corresponding chain sequences that have been predicted and that satisfy one or more further criteria may be reported. For example, any corresponding chain sequences that have been predicted and that have an associated confidence metric within a predetermined range from the corresponding chain sequence that is associated with the highest confidence metric may be reported.
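The reporting options above (the single best prediction, or every prediction whose confidence lies within a predetermined range of the best one) can be sketched as follows. This is only an illustration: the function names, the `max_gap` parameter and the candidate labels are hypothetical, and the confidence values are made up rather than produced by any model.

```python
from typing import List, Tuple

# (predicted corresponding chain sequence, confidence metric such as a probability)
Candidate = Tuple[str, float]


def best_candidate(candidates: List[Candidate]) -> Candidate:
    """Return the prediction associated with the highest confidence metric."""
    return max(candidates, key=lambda c: c[1])


def candidates_within_range(candidates: List[Candidate], max_gap: float) -> List[Candidate]:
    """Return predictions whose confidence is within max_gap of the best one, best first."""
    top_confidence = best_candidate(candidates)[1]
    kept = [c for c in candidates if top_confidence - c[1] <= max_gap]
    return sorted(kept, key=lambda c: c[1], reverse=True)


# Placeholder labels and made-up confidences, for illustration only.
predictions = [("LIGHT_B", 0.55), ("LIGHT_A", 0.62), ("LIGHT_C", 0.20)]
top = best_candidate(predictions)
shortlist = candidates_within_range(predictions, max_gap=0.1)
```

Reporting the shortlist together with its confidence values leaves the final choice to downstream criteria, while `best_candidate` corresponds to reporting a single sequence.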
  • the training first and corresponding chain sequences from known chain pairs may comprise paired training heavy and light chain sequences from single B cell sequencing data.
  • the training data may comprise one or more datasets each previously obtained by single B cell sequencing of samples obtained from subjects or by sequencing of libraries derived therefrom.
  • the training data may further comprise paired training heavy and light chain sequences from known antibodies / B cell receptors.
  • the training data may comprise paired training heavy and light chain sequences from one or more antibody/BCR databases, from one or more known therapeutic antibodies/BCRs, and/or from one or more antibodies/BCRs that are known to have a desired binding function.
  • the training data may comprise paired training heavy and light chain sequences from naive B cell receptor libraries.
  • the training data may comprise paired training heavy and light chain sequences from antigen-experienced B cell receptor libraries.
  • the training data may comprise paired training heavy and light chain sequences obtained from subjects that have been exposed to one or more specific antigens.
  • the training first and corresponding chain sequences from known chain pairs may comprise paired training α and β chain sequences from single T cell sequencing data.
  • the training data may comprise one or more datasets each previously obtained by single T cell sequencing of samples obtained from subjects or by sequencing of libraries derived therefrom.
  • the training data may further comprise paired training first and corresponding chain sequences from known T cell receptors.
  • the training data may comprise paired training α and β chain sequences from one or more T cell receptor databases, from one or more known therapeutic TCRs, and/or from one or more TCRs that are known to have a desired binding function.
  • the training data may comprise paired training α and β (or δ and γ) chain sequences from naive T cell receptor libraries.
  • the training data may comprise paired training α and β (or δ and γ) chain sequences from antigen-experienced T cell receptor libraries.
  • the training data may comprise paired training α and β (or δ and γ) chain sequences obtained from subjects that have been exposed to one or more specific antigens.
  • the training first and corresponding chain sequences from known chain pairs may comprise paired training chain sequences wherein each pair comprises a chain sequence that comprises or consists of: a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence, and optionally a D-gene sequence or identifier.
  • the training first and corresponding chain sequences from known chain pairs may comprise paired training chain sequences wherein each pair comprises a chain sequence that comprises or consists of: a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence.
  • the training data may comprise at least 80,000, at least 100,000, at least 120,000 or at least 150,000 pairs of training sequences, for example training heavy and light chain sequences.
  • the training data may comprise at least 150,000 pairs of training heavy and light chain sequences.
  • the training data may comprise mammalian (e.g. human) pairs of chain sequences.
  • the training data may comprise mammalian heavy and/or light chain sequences.
  • the training data may comprise human heavy and/or light chain sequences.
  • the training data may comprise training pairs of sequences from the same species as the query sequence.
  • the training data may comprise at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, or at least 95% sequences from the same species as the query sequence.
  • the query sequence may be a sequence that is not present in the training data.
  • the query sequence may be a sequence that has been obtained from a sample from a subject that has a desired characteristic, such as a desired phenotype.
  • the subject may have a particular clinical characteristic.
  • the training data may further comprise unpaired training first and/or corresponding sequences. This will be described further below.
  • the unpaired training first and/or corresponding chain sequences may have any of the features of sequences described in relation to the paired sequences.
  • the unpaired chain sequences may be the same type of sequences as the paired sequences (e.g. the unpaired training first/corresponding chain sequences may comprise unpaired heavy and/or light chains), may comprise sequences from the same organisms (e.g. may comprise mammalian and/or human sequences, may comprise sequences from one or more organisms, may comprise sequences from naive libraries and/or antigen exposed libraries, etc.), and may comprise the same information (such as e.g. gene segment identifiers, sequences and combinations thereof).
  • the unpaired training sequences may comprise some or all of the first and/or corresponding sequences that are present in the paired training sequences.
  • the unpaired training sequences may comprise more first chain sequences and/or more corresponding chain sequences than the paired training chain sequences.
  • the deep learning model may be a transformer-based model comprising an encoder that has been pre-trained using unpaired training first and/or corresponding chain sequences and a decoder that has been pre-trained using unpaired training corresponding and/or first chain sequences.
  • the query chain sequence may comprise or consist of: a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence, and optionally a D-gene sequence or identifier.
  • the corresponding chain sequence may comprise or consist of: a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence.
  • the query chain sequence may comprise or consist of: a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence.
  • the corresponding chain sequence comprises or consists of: a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence, and optionally a D-gene sequence or identifier.
  • the format of the query and corresponding chain sequences is related to the format of training chain sequences.
  • a deep learning model that has been trained using training chain sequences comprising or consisting of: a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence, and optionally a D-gene sequence or identifier, may accept as input or produce as output a chain sequence comprising or consisting of these components.
  • a deep learning model that has been trained using training chain sequences comprising or consisting of: a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence may accept as input or produce as output a chain sequence comprising or consisting of these components.
  • the query sequence may comprise or consist of one or more first chain CDR sequence(s).
  • the corresponding sequence may comprise or consist of one or more corresponding chain CDR sequence(s), optionally wherein the query/corresponding sequence comprises or consists of a CDR3 sequence.
  • All sequences may be amino acid sequences.
  • Providing the query sequence to the deep learning model may comprise encoding the query sequence using an encoding scheme wherein each gene sequence identifier corresponds to an individual token.
  • Providing the query sequence to the deep learning model may comprise encoding the query sequence using an encoding scheme wherein each amino acid corresponds to an individual token.
  • Providing the query sequence to the deep learning model may comprise encoding the query sequence using an encoding scheme wherein sequences (i.e. sequences that are available as full sequences rather than gene identifiers) are encoded using tokens that each correspond to an individual k-mer or using byte-pair encoding.
  • Each sequence may be encoded using overlapping k-mers.
  • the k-mers may be of length 1 to 5.
  • the encoding scheme may have been previously defined based on the content of the training chain sequences.
  • the encoding scheme may have been defined based on the content of the training chain sequences, wherein tokens are excluded from the vocabulary constructed based on the content of the training chain sequences if they are used a number of times below a predetermined threshold (e.g. 2) in the training data (separately or jointly for the first and corresponding chain sequences in the paired training data).
  • the encoding scheme may have been previously defined based on the content of the training chain sequences by constructing a vocabulary separately for the training first chains and for the training corresponding chains in the training data.
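The encoding scheme described above (gene identifiers as single tokens, sequences split into overlapping k-mers, and a vocabulary from which rarely used tokens are excluded) can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the `<unk>` placeholder for excluded tokens, and the ordering of tokens within an encoded chain are all hypothetical, and the junction strings are toy examples rather than training data.

```python
from collections import Counter
from typing import Iterable, List, Set

UNK = "<unk>"  # assumed placeholder for tokens excluded from the vocabulary


def overlapping_kmers(sequence: str, k: int) -> List[str]:
    """Split an amino acid sequence into overlapping k-mers (stride of 1)."""
    if len(sequence) < k:
        return [sequence] if sequence else []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocabulary(sequences: Iterable[str], k: int, min_count: int) -> Set[str]:
    """Keep only k-mers used at least min_count times in the training sequences."""
    counts = Counter(kmer for seq in sequences for kmer in overlapping_kmers(seq, k))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def encode_chain(v_gene: str, j_gene: str, junction: str, k: int, vocabulary: Set[str]) -> List[str]:
    """Encode a chain as one token per gene identifier plus junction k-mer tokens."""
    junction_tokens = [
        kmer if kmer in vocabulary else UNK for kmer in overlapping_kmers(junction, k)
    ]
    return [v_gene, j_gene] + junction_tokens


# Toy junction strings, for illustration only.
training_junctions = ["CARDY", "CARGW"]
vocab = build_vocabulary(training_junctions, k=3, min_count=2)
tokens = encode_chain("IGHV3-23", "IGHJ4", "CARDY", k=3, vocabulary=vocab)
```

Building one vocabulary per chain type, as the text allows, would amount to calling `build_vocabulary` separately on the first chain sequences and on the corresponding chain sequences.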
  • the training data may have been filtered to exclude any pairs comprising a junction sequence (in the first and/or corresponding chain) that is outside of a predetermined range of lengths.
  • the training data may not comprise any pairs comprising a first (e.g. heavy) chain junction that is outside of a predetermined range of lengths and/or a corresponding (e.g. light) chain junction that is outside of a predetermined range of lengths.
  • pairs comprising a heavy chain junction sequence below a predetermined length such as e.g. 3, 4, 5, 6, 7, 8, 9 or 10 amino acids, may have been excluded.
  • pairs comprising a heavy chain junction sequence above a predetermined length such as e.g.
  • pairs comprising a light chain junction sequence below a predetermined length such as e.g. 3, 4, 5, 6, 7, 8, 9 or 10 amino acids, may have been excluded.
  • pairs comprising a light chain junction sequence above a predetermined length such as e.g. 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 amino acids, may have been excluded.
  • the predetermined length may be the same or different for the junction sequences in the corresponding (e.g. light) chain and in the first (e.g. heavy) chain of a pair.
  • pairs comprising a heavy chain junction sequence of fewer than 7 amino acids may have been excluded and/or pairs comprising a heavy chain junction sequence of more than 30 amino acids may have been excluded.
  • pairs comprising a light chain junction sequence of fewer than 7 amino acids may have been excluded and/or pairs comprising a light chain junction sequence of more than 20 amino acids may have been excluded.
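The junction-length filtering described above can be sketched as a simple predicate over training pairs. The example uses the ranges given in the text (heavy chain junctions of 7 to 30 amino acids, light chain junctions of 7 to 20 amino acids); the `ChainPair` type and the function name are hypothetical, and the junction strings are synthetic.

```python
from typing import Iterable, List, NamedTuple, Tuple


class ChainPair(NamedTuple):
    heavy_junction: str  # junction amino acid sequence of the first (e.g. heavy) chain
    light_junction: str  # junction amino acid sequence of the corresponding (e.g. light) chain


def filter_pairs(
    pairs: Iterable[ChainPair],
    heavy_range: Tuple[int, int] = (7, 30),
    light_range: Tuple[int, int] = (7, 20),
) -> List[ChainPair]:
    """Keep only pairs whose junction lengths both fall inside the inclusive ranges."""
    kept = []
    for pair in pairs:
        heavy_ok = heavy_range[0] <= len(pair.heavy_junction) <= heavy_range[1]
        light_ok = light_range[0] <= len(pair.light_junction) <= light_range[1]
        if heavy_ok and light_ok:
            kept.append(pair)
    return kept


# Synthetic junctions of chosen lengths, for illustration only.
pairs = [
    ChainPair("A" * 12, "A" * 9),   # both within range: kept
    ChainPair("A" * 6, "A" * 9),    # heavy junction too short: excluded
    ChainPair("A" * 12, "A" * 21),  # light junction too long: excluded
    ChainPair("A" * 30, "A" * 7),   # both on the inclusive bounds: kept
]
filtered = filter_pairs(pairs)
```

A pair is dropped as soon as either junction falls outside its range, which matches filtering on the first and/or corresponding chain; different bounds can be passed for each chain.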
  • the query sequence and/or the corresponding sequence may comprise one or more gene sequence identifiers and the method may further comprise replacing the one or more gene sequence identifiers by the corresponding germline sequence.
  • the deep learning model may be a transformer-based model comprising an encoder that has been pre-trained using unpaired training first and/or corresponding chain sequences and a decoder that has been pre-trained using unpaired training corresponding and/or first chain sequences.
  • the encoder and/or the decoder may comprise a BERT model or a variant thereof, such as e.g. BERT, RoBERTa, or DistilBERT, and/or an autoregressive transformer model, such as GPT-2.
  • the encoder and/or the decoder may comprise a RoBERTa model, a BERT model and/or a GPT-2 model.
  • the encoder and decoder may comprise the same model.
  • the encoder and decoder may both comprise models trained using unpaired training corresponding and first chain sequences.
  • the encoder and decoder may both comprise models trained using random pairs each comprising a first and corresponding chain sequence.
  • the encoder may comprise a model trained using training first (e.g. heavy or light) chain sequences.
  • the decoder may comprise a model trained using corresponding (e.g. light or heavy) chain sequences.
  • the encoder and decoder may both comprise the same pre-trained model.
  • the encoder and decoder may be initialised using pre-trained models that have the same architecture with the same parameters.
  • the unpaired training chain sequences may comprise full length sequences for the variable region of the corresponding chain.
  • the unpaired training chain sequences may comprise full sequences for the variable region of the first chain.
  • the transformer-based model may have been trained using paired first and corresponding (e.g. heavy and light) chain sequences from known chain pairs, wherein said sequences do not comprise full length sequences for the variable region of the corresponding chain and/or the first chain.
  • the transformer-based model may have been trained by obtaining paired training sequences that comprise full length sequences for the variable region of the corresponding chain and/or the first chain by imputing missing sequence information.
  • Imputing missing sequence information may comprise replacing a gene identifier by the corresponding germline sequence.
  • Imputing missing sequence information may comprise using the pre-trained encoder and/or the pre-trained decoder (for example depending on the chain for which missing sequence information is imputed) to predict a full-length sequence from each of the paired training first (e.g. heavy) and/or corresponding (e.g. light) chain sequences.
  • the unpaired training corresponding (e.g. light) chain sequences and/or the unpaired training first (e.g. heavy) chain sequences may have been converted to a format that matches the format of the respective paired training sequences prior to training the encoder and/or the decoder.
  • Providing a query sequence may comprise obtaining the query sequence from a user through a user interface, from a computing device, from a sequence acquisition means or a computing device associated with a sequence acquisition means, from a database or other computer readable medium.
  • Providing a query sequence may comprise sequencing a sample comprising genetic material encoding for an antigen-binding molecule comprising the query sequence.
  • Obtaining the query sequence may comprise performing B cell bulk sequencing of a sample comprising B cells, T cell bulk sequencing of a sample comprising T cells, or bulk sequencing of a sample comprising any other cells expressing an antigen-binding molecule comprising the query sequence, or genetic material derived therefrom, such as a B cell receptor library or a T cell receptor library.
  • Providing a query sequence may comprise obtaining a sample comprising B cells, T cells or other cells expressing an antigen-binding molecule comprising the query sequence, or genetic material derived therefrom, such as a B cell receptor library or T cell receptor library.
  • Providing a query sequence may comprise sequencing a sample comprising genetic material encoding for an antigen-binding molecule comprising the query sequence, for example by performing B cell bulk sequencing of a sample comprising B cells (or any other cells expressing an antigen-binding molecule comprising the query sequence, or genetic material derived therefrom, such as a B cell receptor library).
  • Providing a query sequence may comprise obtaining a sample comprising B cells, or other cells expressing an antigen-binding molecule comprising the query sequence, or genetic material derived therefrom, such as a B cell receptor library.
  • the method may further comprise providing the identified corresponding sequence, a part thereof or information derived therefrom, to a user through a user interface.
  • a method of providing antigen-binding protein chain pairings for a plurality of query sequences comprising a first chain sequence comprising: performing the method of any embodiment of the first aspect for each of the query sequences.
  • the plurality of query sequences may be heavy or light chain sequences obtained by bulk B cell repertoire sequencing.
  • the plurality of query sequences may comprise at least 100, at least 1000, at least 10,000, or at least 100,000 sequences.
  • the plurality of query sequences may have been obtained by bulk B cell sequencing of the heavy or light chain repertoire in a sample, such as a sample from a subject.
  • the plurality of sequences may be a subset of a set of sequences obtained by bulk B cell sequencing of the heavy or light chain repertoire in a sample.
  • the method according to the present aspect may have any of the features described in relation to the first aspect.
  • a method of providing an antigen-binding protein having a desired property comprising: providing one or more query sequences comprising a first chain sequence, wherein at least one of the one or more query sequences is likely to have the desired property, and identifying a corresponding chain sequence for each of the one or more query sequences using the method of any embodiment of the first aspect.
  • the method may have any one or more of the following features.
  • the method may further comprise obtaining one or more candidate antigen-binding proteins each comprising one of the query sequences and the corresponding sequence or sequences derived therefrom.
  • the method may further comprise testing the one or more candidate antigen-binding proteins for the desired property.
  • the method of the present aspect may have any of the features described in relation to the first or second aspects.
  • the one or more candidate antigen-binding proteins may be antibodies or fragments thereof. Sequences derived from an identified chain pairing may include: sequences that comprise the same CDRs but with different framework regions, sequences that contain one or more mutations compared to the identified chain pairing, and sequences that contain one or more fragments of the identified chain pairing.
  • Obtaining a candidate antigen-binding protein may comprise identifying a coding sequence for the candidate antigen-binding protein and expressing the sequence in a suitable expression system (such as e.g. in a suitable host cell).
  • the desired property may be a desired binding property (such as e.g. the ability to bind one or more targets, the ability to bind one or more targets with an affinity above one or more respective thresholds, etc.), a desired expression property (such as e.g. an increased expression level compared to a standard in one or more expression systems, an expression level above a predetermined level in one or more expression systems, a yield above a predetermined level in one or more expression systems, etc.), a desired stability property (such as e.g. a stability above a certain threshold in one or more conditions), or a combination thereof.
  • the desired property may include the ability to bind a predetermined target.
  • Testing the one or more candidate antigen-binding proteins for the desired property may comprise identifying one or more antigens that the one or more candidate antigen-binding proteins bind(s) to, for example by testing for binding to one or more candidate antigens.
  • Testing the one or more candidate antigen-binding proteins for the desired property may comprise identifying one or more antigens that the one or more candidate antigen-binding proteins is/are likely to bind to, for example by comparison with one or more antibodies with known targets.
  • the antigen-binding protein may be a therapeutic antibody, and the desired property may comprise binding of a therapeutic target.
  • An antigen-binding protein may also be referred to herein as “immune protein”.
  • the method may further comprise optimising the sequence of at least one of the one or more candidate antigen-binding proteins.
  • Optimising the sequence of a candidate antigen-binding protein may be performed for example using any antibody optimisation technique known in the art.
  • Optimising the sequence of a candidate antigen-binding protein may be performed using information from the sequence data from which the chain pairing was identified, for example by analysing sequences similar to the input sequence from which the chain pairing was identified.
  • Methods for optimising antigen-binding proteins are known in the art and include the methods described in Mason et al. [2021], Seeliger et al., [2015], Warszawski et al. [2019], Hsiao et al. [2019] and Richardson et al. [2021], amongst others. Any of these methods could be used within the context of the present invention.
  • the query sequence may comprise the heavy chain sequence (or part of the heavy chain sequence) of a known antibody.
  • the first chain may be a heavy chain sequence or a part of a heavy chain sequence of a known antibody.
  • the query sequence may have been obtained by bulk BCR sequencing of the heavy chain repertoire in one or more samples.
  • the method may comprise the step of obtaining the query sequence by bulk BCR sequencing of the heavy chain repertoire in one or more samples.
  • the one or more samples may be from one or more subjects.
  • the one or more subjects may have been identified as having a desired characteristic, such as e.g. a particular clinical phenotype or clinically relevant characteristic such as a biomarker profile.
  • the one or more subjects may be resilient to a particular disease or condition.
  • the disease or condition may be selected from a cancer (such as e.g. breast cancer), a neurodegenerative disease (such as e.g. amyotrophic lateral sclerosis), and an infectious disease (such as e.g. COVID-19).
  • the method may comprise identifying a chain pairing (e.g. a heavy-light pairing) for a plurality of query chain sequences (e.g. heavy chain sequences) selected from the first (e.g. heavy) chain sequences identified in the one or more samples, thereby obtaining a set of chain pairings (e.g. heavy-light chain pairings).
  • the method may further comprise identifying one or more targets by screening antibodies from the same source(s) as the one or more samples against a plurality of candidate peptides.
  • the plurality of candidate peptides may be selected based on the species from which the one or more samples originate.
  • the source of the one or more samples may be one or more human subjects and the antibody repertoire(s) from the same source(s) as the one or more samples may be screened against a set of candidate peptides representative of the human peptidome to select a plurality of candidate peptides.
  • Identifying an antigen that the one or more candidate antigen-binding proteins bind(s) to may comprise using one or more targets identified by screening antibodies from the same source(s) as the one or more samples against a plurality of candidate peptides.
  • the method may further comprise filtering the set of identified chain pairings based on one or more criteria.
  • the one or more criteria may apply to the identity of an antigen or set of antigens that a candidate antigen-binding protein bind(s) to or is predicted to bind to.
  • Providing one or more query sequences may comprise providing a first query (e.g. heavy) chain sequence and a second query (e.g. heavy) chain sequence, and identifying a corresponding (e.g. light) chain sequence for each of the one or more query sequences may comprise identifying one or more first corresponding (e.g. light) chain sequence(s) and one or more second corresponding (e.g. light) chain sequence(s).
  • the method may further comprise comparing the first corresponding chain sequence(s) and the second corresponding chain sequence(s) to identify one or more light chains that may be suitable for use as the common corresponding (e.g. light) chain of a bispecific antibody that includes both of the first (e.g. heavy) chains.
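  • By way of illustration only, the comparison of the first and second corresponding chain sequences may be sketched as follows. The representation of a light chain as a (V-gene identifier, junction sequence) tuple, the use of exact matching, and the function name `common_light_chains` are assumptions made for this sketch; the junction sequences shown are illustrative values only.

```python
def common_light_chains(first_candidates, second_candidates):
    """Return light chains proposed for both heavy chains.

    Each candidate is a hashable description of a light chain, here a
    (v_gene, junction) tuple. Exact matching is an assumption of this
    sketch; a similarity threshold could be used instead.
    The order of the first candidate list is preserved.
    """
    second = set(second_candidates)
    seen = set()
    shared = []
    for light in first_candidates:
        if light in second and light not in seen:
            shared.append(light)
            seen.add(light)
    return shared


# Illustrative candidates for two different query heavy chains.
first = [("IGKV1-39", "CQQSYSTPLTF"), ("IGKV3-20", "CQQYGSSPWTF")]
second = [("IGKV3-20", "CQQYGSSPWTF"), ("IGLV2-14", "CSSYTSSSTLVF")]
shared = common_light_chains(first, second)
```

  • Any light chain returned in this way would be a candidate common light chain and would still need to be tested experimentally.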
  • a method of providing a tool for identifying an antigen-binding protein comprising a pair of chains comprising: providing training data comprising training first and corresponding sequences from known first and corresponding chain pairs, and training a deep learning model to take as input a query first chain sequence and to produce as output at least one corresponding chain sequence, using the training data.
  • the method of the present aspect may have any of the features described in relation to the first aspect.
  • the method may have any one or more of the following features.
  • the method may further comprise obtaining a vocabulary for encoding of the training first chain sequences and a vocabulary for encoding of the training corresponding chain sequences.
  • the vocabulary may be obtained using an encoding scheme wherein any gene sequence identifier corresponds to an individual token.
  • the vocabulary may be obtained using an encoding scheme wherein any sequence is encoded at least in part using tokens that each correspond to an individual amino acid.
  • the vocabulary may be obtained using an encoding scheme wherein any sequence is encoded at least in part using tokens that each correspond to an individual k-mer or is encoded using byte-pair encoding.
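  • By way of illustration only, an encoding scheme of the kind described above may be sketched as follows, with each gene identifier mapped to a single token and the junction sequence split into amino acid (k=1) or k-mer tokens. The token order (V gene, junction, J gene), the non-overlapping k-mers, the special tokens and all function names are assumptions made for this sketch. Byte-pair encoding is not shown.

```python
def tokenise_chain(v_gene, junction, j_gene, k=1):
    """Encode a chain as [V-gene token, junction k-mer tokens..., J-gene token].

    k=1 gives one token per amino acid; k>1 gives non-overlapping k-mers
    (the final k-mer may be shorter than k).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    kmers = [junction[i:i + k] for i in range(0, len(junction), k)]
    return [v_gene] + kmers + [j_gene]


def build_vocabulary(token_lists, specials=("<pad>", "<s>", "</s>", "<unk>")):
    """Assign an integer identifier to every distinct token, specials first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to identifiers; unseen tokens map to the <unk> identifier."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]


heavy_tokens = tokenise_chain("IGHV3-23", "CAKDY", "IGHJ4")
heavy_vocab = build_vocabulary([heavy_tokens])
```

  • In this sketch a separate vocabulary would be built for the first chain sequences and for the corresponding chain sequences by calling `build_vocabulary` once on each set.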
  • Providing training data may comprise providing unpaired training first and corresponding chain sequences.
  • the unpaired training first and corresponding chain sequences may be referred to as pre-training data.
  • the method may further comprise training a first transformer based model using unpaired training first and/or corresponding (e.g. heavy and/or light) chain sequences and training a second transformer-based model using unpaired training corresponding and/or first (e.g. light and/or heavy) chain sequences, and using the pretrained first and second transformer models to initialise the encoder and the decoder, respectively, of the deep learning model.
  • the first and second transformer based models may each comprise a BERT model or a variant thereof, such as e.g. BERT, RoBERTa, or DistilBERT, or an autoregressive transformer model, such as GPT-2.
  • the first and second transformer based models may each comprise a RoBERTa model, a BERT model or a GPT-2 model.
  • the method may further comprise providing the trained deep learning model to a user.
  • a system comprising: a processor; and a computer readable medium comprising instructions that, when executed by the processor, cause the processor to perform the steps of the method of any embodiment of any preceding aspect.
  • the instructions may cause the processor to perform the steps of the method of any embodiment of the first and/or fourth aspects.
  • one or more computer readable media comprising instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method of any embodiment of any preceding aspect.
  • the instructions may cause the one or more processors to perform the steps of the method of any embodiment of the first and/or fourth aspects.
  • Figure 1 is a flowchart illustrating schematically a method of identifying a chain pair according to the disclosure.
  • Figure 2 shows an embodiment of a system for identifying a chain pair according to the disclosure.
  • Figure 3 illustrates schematically a plurality of methods for light chain prediction using only the heavy chain as input.
  • (A) Transformer architecture with heavy chain tokenisation and light chain token conversion to sequence.
  • (B) GRU model with heavy chain input and light chain output.
  • (C) Database search method workflow.
  • (D) Frequency searching matches the ranked distributions of heavy and light chain read counts, then pairs similarly ranked chains.
  • Random search is a variation of the database search method and is not illustrated.
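  • By way of illustration only, the frequency searching approach of Figure 3D may be sketched as follows. The input format (dictionaries mapping a chain to its read count), the alphabetical tie-breaking and the function name `frequency_pair` are assumptions made for this sketch.

```python
def frequency_pair(heavy_counts, light_counts):
    """Pair heavy and light chains that occupy the same rank by read count.

    heavy_counts / light_counts map a chain (any sortable, hashable
    description, e.g. a junction sequence) to its read count.
    Ties are broken alphabetically so the result is deterministic.
    If the repertoires differ in size, the surplus chains are left unpaired.
    """
    heavy_ranked = sorted(heavy_counts, key=lambda s: (-heavy_counts[s], s))
    light_ranked = sorted(light_counts, key=lambda s: (-light_counts[s], s))
    return list(zip(heavy_ranked, light_ranked))


# Illustrative read counts from bulk heavy and light chain sequencing.
heavy = {"H_a": 50, "H_b": 10, "H_c": 200}
light = {"L_x": 5, "L_y": 180, "L_z": 60}
pairs = frequency_pair(heavy, light)
```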
  • Figure 4 illustrates schematically the architecture of a heavy chain and a light chain, as well as the configuration of data used herein in relation to this architecture.
  • (A) Architecture of an Ig heavy chain. The approximate boundaries of the V, D, and J genes are marked, along with the boundaries of the junction. A segment of the V-gene toward the N-terminus is in a dotted boundary, as the read lengths from many NGS methods are too short to cover this region and/or some primers used for NGS are slightly inset within the V region. However, it is still possible to infer the V-gene using the sequence within the solid boundaries.
  • (B) Same as (A), but for the light chain.
  • (C) A re-creation of the design of the paired read architecture from DeKosky et al. (2015).
  • Figure 5 shows attention heatmaps of heavy chain input and the light chain prediction from a transformer model as described herein. Each column corresponds to an input heavy chain token, and each row represents an output light chain token. Four out of 8 attention heads are shown, showing how each head focuses on different tokens of the heavy chain.
  • Figure 6 shows the prediction performance on the held-out test set, and single-cell blind tests, for the methods illustrated on Figure 3.
  • (A) Proportion of predictions with the correct light chain V-gene.
  • (B) Levenshtein distance distribution of predicted light chain junction sequences on the King et al. [2021] dataset. The Levenshtein distance is a metric that quantifies how different the predicted light chain junction amino acid sequence is with respect to the original. Higher distances mean poorer predictions.
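  • By way of illustration only, the two metrics of Figure 6 may be computed as sketched below. The Levenshtein distance is implemented with the standard dynamic programming recurrence; the function names and the example junction sequences are assumptions made for this sketch.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def v_gene_accuracy(predicted, actual):
    """Proportion of predictions whose V-gene identifier matches the truth."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# A predicted junction missing one residue relative to the true junction.
distance = levenshtein("CQQY", "CQQSY")
accuracy = v_gene_accuracy(["IGKV1-39", "IGKV3-20"], ["IGKV1-39", "IGLV2-14"])
```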
  • Figure 7 shows a schematic representation of the contacts between the native pertuzumab light chain (yellow) and heavy chain (white); PDB: 1s78.
  • the alanine is shown in ball-and-stick; an asparagine with the larger side chain, as Matchmaker predicted, may have caused clashes, leading to no expression.
  • Figure 8 shows the ELISA trace for binding of atezolizumab.
  • the filled points represent binding curves for the different antibodies against the target, while unfilled points represent binding against an irrelevant antigen.
  • Figure 9 shows the training procedure for a tandem transformer model as described herein, using two linked AntiBERTa models.
  • (A) Creation of the training, validation, and test sets for the masked language model task.
  • (B) The set-up of the pre-training procedure, and how the “warmed up” model feeds into the subsequent step.
  • (C) Outline of how a warmed-up model can be used as part of a Seq2Seq model for NMT.
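  • By way of illustration only, the preparation of data for a masked language model task may be sketched as follows. The 80/10/10 split, the 15% masking rate (a value commonly used for BERT-style models), the `<mask>` token and the function names are assumptions made for this sketch and are not taken from the present disclosure.

```python
import random


def split_sequences(sequences, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sequences and split them into training, validation and test sets."""
    items = list(sequences)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * fractions[0])
    n_val = int(len(items) * fractions[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", rng=None):
    """Replace a random subset of tokens with mask_token.

    Returns (masked, labels) where labels holds the original token at each
    masked position and None elsewhere, so that a model can be trained to
    recover the hidden tokens.
    """
    if not tokens:
        return [], []
    rng = rng or random.Random()
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


tokens = list("CAKDYGSSWYFDYWGQGTLV")  # 20 illustrative amino acid tokens
masked, labels = mask_tokens(tokens, rng=random.Random(0))
train, val, test = split_sequences(range(10))
```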
  • a B cell receptor is a transmembrane protein expressed on the surface of B cells.
  • a B cell receptor comprises a binding moiety (also referred to as “antigen-binding subunit” or “membrane immunoglobulin”, “mIg”) comprising a membrane bound immunoglobulin molecule (also referred to as antibody) that recognises a cognate antigen, and a signal transduction moiety.
  • the membrane-bound immunoglobulin molecule comprises two immunoglobulin light chains and two immunoglobulin heavy chains, and is identical to a corresponding secreted antibody with the exception of an integral membrane domain.
  • the signal transduction moiety is a heterodimer called Ig-α/Ig-β (CD79), bound together and to the immunoglobulin by disulfide bridges.
  • An antibody (Ab) or immunoglobulin (Ig) is an immune protein, comprising an antigen binding site and a constant region belonging to one of a limited set of isotypes (IgA, IgD, IgE, IgG, or IgM) and mediating interactions with other components of the immune system. In humans and most mammals, antibodies comprise four polypeptide chains: two identical heavy chains and two identical light chains connected by disulfide bonds.
  • light chains typically consist of one variable domain V L and one constant domain C L
  • heavy chains typically contain one variable domain V H and three to four constant domains C H 1, C H 2,...
  • the variable domains form the antigen binding region and can also be referred to as the Fv region.
  • Each variable domain contains three hypervariable regions referred to as the complementarity-determining regions (CDRs), which together form an antigen binding site.
  • the variable region of each immunoglobulin heavy or light chain is encoded in several pieces — known as gene segments (subgenes): Ig heavy chains comprise variable (V), diversity (D) and joining (J) segments, and Ig light chains comprise V and J segments.
  • V, D and J gene segments are present in the genome and developing B cells assemble an Ig variable region by (nearly) randomly selecting and combining one V, one D and one J gene segment (or one V and one J segment in the light chain), in a process called V(D)J recombination.
  • V(D)J recombination involves the formation of double-strand breaks between the required segments, which form hairpin loops that are then joined together.
  • the joining process is inaccurate, resulting in the variable addition or subtraction of nucleotides between the V and J (light chain) or V and DJ and D and J (heavy chain) segments, producing a large diversity in the sequences at the junction between segments (referred to as “junction sequences”).
  • each Ig heavy chain variable region comprises: a V segment, a D segment and a J segment, with a junction sequence that spans the join between these segments (as illustrated on Figure 4A).
  • each Ig light chain variable region comprises: a V segment and a J segment, with a junction sequence that spans the join between these segments (as illustrated on Figure 4B).
  • CDR1 and CDR2 are found in the V segment, and CDR3 includes some of the V, all of D (in the heavy chain) and some of the J segment.
  • a T cell receptor is a membrane anchored protein expressed on the surface of T cells.
  • a T cell receptor comprises a pair of protein chains that together form a binding moiety that recognises a cognate antigen. These are expressed in a complex with constant T cell coreceptor chains CD3, comprising a CD3γ chain, a CD3δ chain, and two CD3ε chains in mammals.
  • the constant chains associate with the T cell receptor and the constant ζ-chain to form the TCR complex, which together is able to generate a signal upon antigen binding to the T cell receptor.
  • the TCR is a heterodimeric protein, comprising two highly variable chains, the α and β chains (in the majority of T cells), or the alternative γ and δ chains (in a minority of T cells).
  • Each chain comprises two extracellular domains: a variable region (or variable domain) and a constant region (or constant domain, proximal to the cell membrane), a transmembrane region and a short cytoplasmic tail.
  • the variable regions together bind to a peptide (antigen), within the context of a MHC (major histocompatibility complex) molecule in the case of αβ TCRs.
  • Each variable domain contains three hypervariable regions referred to as the complementarity-determining regions (CDRs, respectively referred to as CDR1, CDR2 and CDR3).
  • the TCR is a member of the immunoglobulin superfamily, which comprises BCRs and antibodies.
  • the variable region of each TCR chain is encoded in several pieces, known as gene segments (subgenes): β and δ chains comprise variable (V), diversity (D) and joining (J) segments, and α and γ chains comprise V and J segments.
  • V, D and J gene segments are present in the genome and developing T cells assemble a TCR chain variable region by (nearly) randomly selecting and combining one V, one D and one J gene segment (or one V and one J segment in the α / γ chain), in a process called V(D)J recombination.
  • the process involves the formation of double-strand breaks between the required segments, which form hairpin loops that are then joined together.
  • the joining process is inaccurate, resulting in the variable addition or subtraction of nucleotides between the V and J (α / γ chain) or V and DJ and D and J (β / δ chain) segments, producing a large diversity in the sequences at the junction between segments (referred to as “junction sequences”).
  • each β / δ chain variable region comprises: a V segment, a D segment and a J segment, with a junction sequence that spans the join between these segments.
  • each α / γ chain variable region comprises: a V segment and a J segment, with a junction sequence that spans the join between these segments.
  • CDR1 and CDR2 are found in the V segment, and CDR3 includes some of the V, all of D (in the β / δ chain) and some of the J segment.
  • a “variable chain” (also referred to herein simply as “chain”) of an antigen-binding protein refers to a chain of an antigen-binding protein that is involved in antigen recognition, or a part thereof that contains at least part of a variable region of the chain.
  • Variable chains comprise variable regions that are responsible for the diverse repertoire of antigen recognition properties within antigen-binding proteins.
  • a variable chain may be a BCR heavy or light chain, an antibody heavy or light chain, a TCR α or β chain, a TCR γ or δ chain, or any part of such chains that contains at least a part of one or more variable regions within these chains.
  • the B cell receptor repertoire (or corresponding antibody repertoire) present in a sample can be investigated using sequencing approaches.
  • two main sequencing approaches are used: single B cell sequencing, and sequencing of bulk B cell populations.
  • As the BCR signalling moiety and the transmembrane domain of the antigen-binding moiety are not variable, these techniques focus on the parts that are common between the B cell repertoire and the corresponding antibody repertoire.
  • references to a BCR sequence, BCR repertoire, BCR heavy chain sequence, BCR light chain sequence, and any parts thereof are used interchangeably with the corresponding antibody sequence, antibody repertoire, antibody heavy chain sequence, antibody light chain sequence, and corresponding parts thereof.
  • the term “antigen-binding protein” is used herein to refer to a BCR protein, a TCR protein, an antigen-binding moiety of a BCR protein, an antibody, or any parts thereof that maintain the antigen-binding property of the original BCR protein, TCR protein or antibody.
  • the repertoire of antibodies circulating in the blood of an individual may not match the B cell receptor repertoire present in the sample at the same time point. This is because antibodies that have been produced by B cells that are no longer present in the individual (e.g. because they have died) may be present in the sample.
  • the term “corresponding antibody repertoire” refers to the repertoire of antibodies that would be expressed by the B cells present in a sample, not the repertoire of antibodies (proteins) that are actually present in the sample.
  • Single B cell sequencing can maintain the correspondence between heavy and light chain sequences.
  • Two main approaches can be used to do this.
  • the first approach is physical linkage of VH and VL [DeKosky et al., 2016].
  • the second approach is cell barcoding (such as e.g. provided by 10x Genomics) [King et al., 2021].
  • the physical linkage approach has a higher throughput than the cell barcoding approach, but it is more difficult to recover the full sequence.
  • cell barcoding has a lower throughput but allows easier recovery of the full sequence.
  • single B cell sequencing is limited in terms of throughput (to various extents), as explained above. Some single B cell sequencing technologies are additionally limited in terms of the length of the sequences recovered.
  • BCR/antibody sequences identified using some single B cell sequencing methods may be limited to investigating a single CDR region, for example CDR3 (in other words, although the flanking V and J segments may be identified, they may not be fully sequenced to obtain the sequence of the CDR1 and CDR2 in the V segment), in both the heavy and light chain.
  • datasets from single B cell sequencing methods may vary in the extent to which the sequence of the heavy and light chain is identified. Within the regions sequenced, it may also not be practical to sequence (or record) every single base of the V(D)J segments and as such sequencing efforts may focus on obtaining the junction sequence and enough information to identify the V, D and J genes.
  • such methods may provide information comprising: the identity of the V, D and J segments (e.g. in the form of a V-/D-/J-gene segment identifier) for the heavy chain, the sequence of the junction segment in the heavy chain, the identity of the V and J segments (e.g. in the form of a V-/J-gene identifier) for the light chain, and the sequence of the junction segment in the light chain.
  • the identity of the respective segments can be used to recover the corresponding germline sequence from a database.
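  • By way of illustration only, the sparse chain information described above, and the recovery of germline sequences from gene segment identifiers, may be represented as sketched below. The `ChainRecord` class, the `germline_segments` function and the dictionary standing in for a germline database are hypothetical, and the sequence values are placeholders rather than real germline sequences.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ChainRecord:
    """Sparse description of one chain: gene identifiers plus junction."""
    v_gene: str
    j_gene: str
    junction: str
    d_gene: Optional[str] = None  # present for heavy chains only


def germline_segments(record: ChainRecord, germline_db: Dict[str, str]) -> Dict[str, str]:
    """Look up the germline sequence of each gene segment named in the record.

    germline_db stands in for a germline reference database and maps a gene
    identifier to its sequence. A KeyError is raised for unknown identifiers.
    Somatic mutations are not recoverable from the identifiers alone.
    """
    identifiers = [record.v_gene, record.d_gene, record.j_gene]
    return {g: germline_db[g] for g in identifiers if g is not None}


# Placeholder database: the values are NOT real germline sequences.
toy_db = {
    "IGHV3-23": "<V germline placeholder>",
    "IGHD3-10": "<D germline placeholder>",
    "IGHJ4": "<J germline placeholder>",
    "IGKV1-39": "<V germline placeholder>",
    "IGKJ1": "<J germline placeholder>",
}
heavy = ChainRecord(v_gene="IGHV3-23", d_gene="IGHD3-10", j_gene="IGHJ4", junction="CAKDYW")
light = ChainRecord(v_gene="IGKV1-39", j_gene="IGKJ1", junction="CQQSYSTPLTF")
heavy_germline = germline_segments(heavy, toy_db)
light_germline = germline_segments(light, toy_db)
```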
  • any mutation that may be present in a particular chain e.g. somatic mutations
  • sequencing of bulk B cell populations does not maintain the pairing between heavy and light chain sequences, but is less limited in terms of sequencing capabilities (in particular depth of sequencing of the BCR repertoire) within the heavy chain and light chain repertoires, respectively.
  • Sequencing of bulk B cell populations may comprise sequencing of the heavy chain repertoire, the light chain repertoire, or both, of a B cell population.
  • Such sequencing may produce information as sparse as that obtained with single B cell sequencing, or more detailed information including e.g.
  • the term “variable chain sequence” encompasses the terms “heavy chain sequence”, “light chain sequence”, “α chain sequence”, “β chain sequence”, “γ chain sequence” and “δ chain sequence”, and refers to any information that can be obtained from B cell sequencing or T cell sequencing technologies, ranging from a combination of one or more gene segment identifiers and/or junction sequences at one end, to full chain sequences at the other end.
  • the terms “heavy chain sequence” and “light chain sequence” refer to any information that can be obtained from B cell sequencing technologies, ranging from a combination of one or more gene segment identifiers and/or junction sequences at one end, to full chain sequences at the other end.
  • variable chain sequence refers interchangeably to the amino acid sequence or the corresponding nucleic acid coding sequence.
  • a variable chain pairing or pair refers to a combination of a heavy chain sequence and a light chain sequence, an α chain sequence and a β chain sequence, or a γ chain sequence and a δ chain sequence as defined herein, each ranging from a combination of one or more gene segment identifiers and/or junction sequences at one end, to full chain sequences at the other end.
  • antibody includes monoclonal antibodies, polyclonal antibodies, multispecific antibodies (e.g., bispecific antibodies), and antibody fragments that exhibit the desired biological activity and that comprise a heavy-light chain pairing identified as described herein or a heavy-light chain pairing derived from a heavy-light chain pairing identified as described herein (for example by further optimisation, affinity maturation, etc.).
  • sample may be a cell or tissue sample, a biological fluid, an extract (e.g. a DNA or RNA extract obtained from the subject), from which B cell genomic material (e.g. RNA or DNA) can be obtained for genomic analysis, such as by sequencing (e.g. whole genome sequencing, whole exome sequencing, targeted / capture sequencing, RNA-seq, etc.).
  • the sample may be a cell, tissue or biological fluid sample obtained from a subject (e.g. a biopsy). Such samples may be referred to as “subject samples”.
  • the sample may be a blood sample, a lymph node sample, a spleen sample, or a tumour sample, or a sample derived therefrom (such as e.g.
  • the terms “genomic material”, “genomic sequencing” and the like encompass both the material / sequence present in the genome and that present in the transcriptome of a sample, unless context indicates otherwise.
  • the sample may be one which has been freshly obtained from a subject or may be one which has been processed and/or stored prior to genomic analysis (e.g. frozen, fixed or subjected to one or more purification, enrichment or extraction steps).
  • the sample may be a cell or tissue culture sample.
  • a sample as described herein may refer to any type of sample comprising B cells or genomic material derived therefrom, whether from a biological sample obtained from a subject, or from a sample obtained from e.g.
  • the sample is preferably from a mammal (such as e.g. a mammalian cell sample or a sample from a mammalian subject, such as a cat, dog, horse, donkey, sheep, pig, goat, cow, mouse, rat, rabbit or guinea pig), preferably from a human (such as e.g. a human cell sample or a sample from a human subject).
  • the sample may be transported and/or stored, and collection may take place at a location remote from the sequence data acquisition (e.g. sequencing) location, and/or any computer-implemented method steps described herein may
  • sequence data refers to information that is indicative of the presence of genomic material (DNA or RNA) or proteomic material in a sample that has a particular sequence.
  • sequence data may comprise one or more nucleotide sequences and/or one or more amino acid sequences.
  • Such information may be obtained using sequencing technologies, such as e.g. next generation sequencing (NGS), for example whole exome sequencing (WES), whole genome sequencing (WGS), whole transcriptome sequencing (RNAseq) or sequencing of captured genomic loci (targeted or panel sequencing).
  • sequence data may comprise a count of the number of sequencing reads that have a particular sequence.
  • Sequence data may be mapped to a reference sequence, for example a reference genome, using methods known in the art (such as e.g. Bowtie (Langmead et al., 2009)).
  • counts of sequencing reads or equivalent non-digital signals may be associated with a particular location or locus (where the “location” refers to a location in the reference genome or transcriptome to which the sequence data was mapped).
  • a location may contain a mutation, in which case counts of sequencing reads or equivalent non-digital signals may be associated with each of the possible variants (also referred to as “alleles”) at the particular location.
  • The process of identifying the presence of a mutation at a particular location in a sample is referred to as “variant calling” and can be performed using methods known in the art (such as e.g. general purpose NGS variant callers such as the GATK HaplotypeCaller, or tools specifically designed for immune sequences such as IgBLAST, https://www.ncbi.nlm.nih.gov/igblast/, [Ye et al., 2013]).
  • Genomic sequence data may be converted to amino acid sequences by translating coding regions in silico (directly from an mRNA sequence or from identified coding regions in a genomic sequence), as known in the art.
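  • By way of illustration only, in silico translation of a coding sequence may be sketched as follows using the standard genetic code. The function name `translate`, the use of “X” for unrecognised codons, and the choice to read from the first base in a single frame are assumptions made for this sketch; identification of the coding region and reading frame is not shown.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(nucleotides, to_stop=False):
    """Translate a DNA or mRNA coding sequence, starting at the first base.

    Stop codons are written as '*'. Codons containing ambiguous or unknown
    bases are written as 'X'. Trailing bases that do not form a full codon
    are ignored. With to_stop=True, translation ends before the first stop.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


peptide = translate("TGTGCGAAAGATTAC")
```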
  • treatment refers to reducing, alleviating or eliminating one or more symptoms of the disease which is being treated, relative to the symptoms prior to treatment.
  • prevention refers to delaying or preventing the onset of the symptoms of the disease. Prevention may be absolute (such that no disease occurs) or may be effective only in some individuals or for a limited amount of time.
  • a composition as described herein may be a pharmaceutical composition which additionally comprises a pharmaceutically acceptable carrier, diluent or excipient.
  • the pharmaceutical composition may optionally comprise one or more further pharmaceutically active polypeptides and/or compounds.
  • Such a formulation may, for example, be in a form suitable for intravenous infusion.
  • a computer system includes the hardware, software and data storage devices for embodying a system or carrying out a method according to the above described embodiments.
  • a computer system may comprise a central processing unit (CPU), graphical processing unit (GPU), input means, output means and data storage, which may be embodied as one or more connected computing devices.
  • the computer system has a display or comprises a computing device that has a display to provide a visual output display (for example in the design of the business process).
  • the data storage may comprise RAM, disk drives or other computer readable media.
  • the computer system may include a plurality of computing devices connected by a network and able to communicate with each other over that network. It is explicitly envisaged that the computer system may consist of or comprise a cloud computer.
  • the term “processor” encompasses any processing unit or combination of processing units, including in particular CPUs and GPUs.
  • computer readable media includes, without limitation, any non-transitory medium or media which can be read and accessed directly by a computer or computer system.
  • the media can include, but are not limited to, magnetic storage media such as floppy discs, hard disc storage media and magnetic tape; optical storage media such as optical discs or CD-ROMs; electrical storage media such as memory, including RAM, ROM and flash memory; and hybrids and combinations of the above such as magnetic/optical storage media.
  • Figure 1 illustrates an embodiment in which a heavy or light chain sequence of a B cell receptor or antibody is used to identify heavy-light chain pairs.
  • Figure 1 illustrates an embodiment in which the variable chain sequences are BCR/antibody heavy and light chains from BCR.
  • the method described by reference to Figure 1 is applicable to embodiments in which a TCR α, β, γ, or δ chain sequence is used to identify αβ (if the input chain is an α or β chain) or γδ (if the input chain is a γ or δ chain) chain pairs.
  • a sample comprising B cell genomic material (typically in the form of RNA, where the RNA encoding for the BCR expressed by the cells from which the B cell genomic material originated can be extracted and sequenced) may be obtained from a subject.
  • a sample comprising T cell genomic material may be used in embodiments where TCR chain pairs are identified.
  • the BCR repertoire in the sample may be sequenced using bulk BCR sequencing. This may comprise sequencing the heavy chain BCR repertoire in the sample.
  • the TCR repertoire in the sample may be sequenced using bulk TCR sequencing. This may comprise sequencing the β chain repertoire in the sample.
  • a query chain sequence is provided. In the illustrated embodiment, the query sequence is a heavy chain sequence.
  • the query chain sequence may be a light chain sequence.
  • Providing a query sequence may comprise selecting at step 14A a query sequence as one of the heavy chain sequences sequenced at step 12.
  • Providing a query sequence may comprise providing at step 14B a sequence that comprises a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence.
  • step 14B may comprise extracting, from a bulk BCR sequencing data set, for a selected sequence, a V-gene sequence or identifier, a J-gene sequence or identifier, and a junction sequence. Similar steps may be performed in the context of TCR pairing, for example using a query β chain sequence.
  • a deep learning model is provided, wherein the deep learning model is configured to take as input a query variable chain sequence and to produce as output at least one corresponding variable chain sequence.
  • the query sequence is a heavy chain sequence and thus the deep learning model is configured to take as input a query heavy chain sequence and to produce as output at least one corresponding light chain sequence.
  • the query chain sequence may be a light chain sequence and thus the deep learning model may be configured to take as input a query light chain sequence and to produce as output at least one corresponding heavy chain sequence.
  • the query chain sequence may be a β chain sequence (or an α, δ or γ chain sequence) and thus the deep learning model may be configured to take as input a query β chain sequence (or an α, δ or γ chain sequence) and to produce as output at least one corresponding α chain sequence (or at least one corresponding β, γ or δ chain sequence).
  • the deep learning model may have been previously trained using training variable chain sequences from known variable chain pairs, such as training heavy and light chain sequences from known heavy-light chain pairs in the illustrated embodiment.
  • providing a deep learning model may simply comprise retrieving a trained deep learning model from a computer-readable medium such as a memory associated with a processor executing the method, or otherwise receiving the trained deep learning model.
  • the training of the deep learning model is explained in more detail below.
  • the deep learning model may be trained as part of the present method, using training variable chain sequences from known variable chain pairs, such as training heavy and light chain sequences from known heavy-light chain pairs in the illustrated embodiment.
  • the query chain sequence is provided to the deep learning model.
  • Step 18 may comprise optional step 18A of encoding the query sequence using a predetermined encoding scheme.
  • Step 18 may comprise optional step 18B of decoding each of the corresponding sequences output by the deep learning model using a predetermined encoding scheme.
  • the encoding scheme(s) used for the encoding and decoding steps may have been previously defined based on the content of the training variable chain sequences (e.g. heavy and light chain sequences) used to train the deep learning model.
  • Step 18 may comprise optional step 18C of selecting a sequence of the one or more corresponding variable chain sequences (a light chain sequence in the illustrated embodiment) that is associated with the highest confidence metric amongst the one or more corresponding variable chain sequences, where the deep learning model is configured to produce as output one or more corresponding chain sequences, each associated with a confidence metric such as a probability.
  • Step 18C may be performed before or after step 18B.
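  • Optional step 18C can be sketched as follows, assuming (for illustration only) that the model output is available as a list of (sequence, confidence) tuples. The function name `select_best` and the data layout are hypothetical.

```python
def select_best(candidates):
    """Return the (sequence, confidence) tuple with the highest confidence metric."""
    if not candidates:
        raise ValueError("at least one candidate sequence is required")
    return max(candidates, key=lambda candidate: candidate[1])
```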
  • one or more gene sequence identifiers in the at least one corresponding chain sequence (which is a light chain sequence in the illustrated embodiment) may be replaced with a corresponding germline sequence.
  • the results of any of the preceding steps (and in particular steps 18 and/or 20) may be provided to a user, for example through a user interface. These results may be used for example to provide a therapeutic antibody, as will be described further below.
  • the method may be repeated for a plurality of query sequences. This may comprise repeating steps 14 to 18.
  • training data comprising training variable chain sequences from known variable chain pairs.
  • the training data comprises heavy and light chain sequences from known heavy-light chain pairs.
  • the training data may comprise at least 20,000 training chain pairs, at least 30,000, at least 40,000, at least 50,000, at least 60,000, at least 70,000, at least 80,000, at least 90,000, at least 100,000, at least 120,000 or at least 150,000 training chain pairs.
  • the training data may comprise at least 80,000, at least 100,000, at least 120,000 or at least 150,000 pairs of training heavy and light chain sequences.
  • the training data may further comprise unpaired training sequences, which are heavy and light chain sequences in the illustrated embodiment.
  • the unpaired training chain sequences may be referred to as “pre-training data”.
  • the training data may comprise training data per se (comprising paired chain sequences, in particular paired heavy and light chain sequences in the illustrated embodiment), and pre-training data (comprising unpaired chain sequences, in particular unpaired heavy and light chain sequences in the illustrated embodiment).
  • the training data may comprise at least 100,000, at least 200,000, at least 300,000, at least 400,000, at least 500,000, at least 600,000, at least 700,000, at least 800,000, at least 900,000, at least 1 million (or at least 5, 10, 15, 20, 25, 30, 35 or 40 million) unpaired training chain sequences of the first type and/or of the corresponding type.
  • the training data may comprise at least 1 million (or at least 5, 10, 15, 20, 25, 30, 35 or 40 million) unpaired training heavy chain sequences and at least 1 million (or at least 5, 10 or 15 million) unpaired training light chain sequences.
  • the amount of training and/or pretraining data may be limited by the amount of suitable data available, and may change as more data becomes available.
  • the amount of data available may depend on the particular use case, such as e.g. the identity of the first and corresponding chain sequences (e.g. more data may be available for αβ TCRs than for γδ TCRs, which are rarer), the criteria used when filtering the data (see step 12’) etc.
  • the numbers provided may apply to the data prior to and/or after any filtering is applied.
  • the training data is filtered.
  • the training data may be filtered to exclude any pairs comprising a junction sequence (e.g. in the heavy and/or light chain) that is outside of a predetermined range of lengths.
  • the training data may be filtered based on any feature of the data, including for example the cell type that the data was derived from, the organism, whether the data is from a naive library, whether the data is from subjects that have been immunised with a particular antigen, etc.
  • the training data may be filtered to ensure that the training data only contains (inclusion filter) or does not contain (exclusion filter) data with one or more features of interest.
  • one or more encoding schemes are defined for the training data by obtaining a vocabulary for encoding of the training chain sequences, in particular for encoding of the training heavy chain sequences and a vocabulary for encoding of the training light chain sequences in the illustrated embodiment.
  • Defining an encoding scheme may comprise excluding from the vocabulary constructed based on the content of the training chain sequences any token that is used a number of times below a predetermined threshold (e.g. 2) in the training data.
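  • A minimal sketch of defining a vocabulary with a minimum-occurrence threshold is given below. The `<UNK>` token for out-of-vocabulary tokens, the ordering of the special tokens and the function names are assumptions introduced for this example.

```python
from collections import Counter

# <UNK> is an assumed placeholder for tokens excluded from the vocabulary.
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]


def build_vocabulary(tokenised_sentences, min_count=2):
    """Map each token seen at least `min_count` times to an integer index."""
    counts = Counter(
        token
        for sentence in tokenised_sentences
        for token in sentence
        if token not in SPECIAL_TOKENS
    )
    kept = sorted(token for token, n in counts.items() if n >= min_count)
    return {token: index for index, token in enumerate(SPECIAL_TOKENS + kept)}


def encode(sentence, vocabulary):
    """Convert a tokenised sentence to indices, using <UNK> for unknown tokens."""
    unknown = vocabulary["<UNK>"]
    return [vocabulary.get(token, unknown) for token in sentence]
```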
  • the training data is used to train a deep learning model to take as input a query heavy chain sequence (in the illustrated embodiment) and to produce as output at least one corresponding light chain sequence (in the illustrated embodiment).
  • Training the deep learning model may comprise first training a transformer-based model using the unpaired training chain sequences, and using the pre-trained transformer model to initialise the encoder and the decoder of the deep learning model.
  • training the deep learning model may comprise training a first and second transformer-based model using the unpaired training chain sequences of the first type and of the second type, respectively (the first type being the heavy chain and the second type being the light chain, in the illustrated embodiment), and using the pre-trained transformer model to initialise the encoder and the decoder, respectively, of the deep learning model.
  • Training the deep learning model may comprise obtaining training sequences that comprise full length sequences for the variable region of the second type of chain and/or the first type of chain (e.g. the light chain and/or the heavy chain, in the illustrated embodiment) by imputing missing sequence information, if the chain sequences from known chain pairs do not comprise full length sequences for said variable regions.
  • FIG. 2 shows an embodiment of a system for identifying a variable chain pair from an input variable chain, according to the present disclosure.
  • the system comprises a computing device 1, which comprises a processor 101 and computer readable memory 102.
  • the computing device 1 also comprises a user interface 103, which is illustrated as a screen but may include any other means of conveying information to a user such as e.g. through audible or visual signals.
  • the computing device 1 is communicably connected, such as e.g. through a network 6, to sequence data acquisition means 3, such as a sequencing machine, and/or to one or more databases 2 storing sequence data.
  • the one or more databases may additionally store other types of information that may be used by the computing device 1, such as e.g. reference sequences, parameters, etc.
  • the computing device may be a smartphone, server, tablet, personal computer or other computing device.
  • the computing device is configured to implement a method for identifying a variable chain pair from an input variable chain (suitably a heavy chain or a β chain, preferably a heavy chain), as described herein.
  • the computing device 1 is configured to communicate with a remote computing device (not shown), which is itself configured to implement a method of identifying a variable chain pair from an input variable chain, as described herein.
  • the remote computing device may also be configured to send the result of the method to the computing device. Communication between the computing device 1 and the remote computing device may be through a wired or wireless connection, and may occur over a local or public network such as e.g. over the public internet or over WiFi.
  • the sequence data acquisition means 3 may be in wired connection with the computing device 1, or may be able to communicate through a wireless connection, such as e.g. through a network 6, as illustrated.
  • the connection between the computing device 1 and the sequence data acquisition means 3 may be direct or indirect (such as e.g. through a remote computer).
  • the sequence data acquisition means 3 are configured to acquire sequence data from nucleic acid samples, for example genomic DNA samples or RNA samples extracted from B cells or T cells purified from fluid and/or tissue samples (such as e.g. peripheral blood, spleen, lymph node, tumour tissue, or any other type of sample comprising B cells or T cells).
  • the sample may have been subject to one or more preprocessing steps such as DNA/RNA purification, fragmentation, library preparation, target sequence capture (such as e.g. exon capture and/or panel sequence capture). Any sample preparation process that is suitable for use in the determination of a B cell receptor sequence or repertoire may be used within the context of the present invention.
  • the sequence data acquisition means is preferably a next generation sequencer.
  • the sequence data acquisition means 3 may be in direct or indirect connection with one or more databases 2, on which sequence data (raw or partially processed) may be stored.
  • the above methods find applications in any context where it is desirable to identify an antibody or BCR that is likely to bind its target from information that is limited to the heavy chain, the light chain or parts thereof (such as e.g. the V-gene, J-gene and junction sequences). This is frequently the case in the context of the discovery process of antibody therapeutics.
  • Antibody therapeutics have been shown to be successful approaches for a wide range of diseases from neurodegenerative diseases to cancer. Thus, the approaches described herein find use in the context of providing therapeutics in each of these clinical contexts. Further, the methods described herein can be used to identify a potentially functional antibody or BCR from any input heavy/light chain or part thereof, whether the input information is newly generated for a particular purpose (e.g. from patients or samples identified as having a desired phenotype) or from existing / historical data sets (for example to mine or re-mine existing datasets to discover new therapies or identify immune proteins that could explain why certain clinical phenotypes persist).
  • the invention also provides a method of providing an antibody therapeutic, the method comprising identifying a heavy-light chain pairing using any of the methods described herein, or that is derived from a heavy-light chain pairing that has been identified using any of the methods described herein (such as e.g. by further optimisation, mutation, etc).
  • the heavy-light chain pairing may be obtained for an input heavy chain sequence that has been obtained by bulk BCR sequencing of the heavy chain repertoire in one or more samples.
  • the one or more samples may be from one or more subjects.
  • the one or more subjects may have been identified as having a desired characteristic, such as e.g. a particular clinical phenotype or clinically relevant characteristic such as a biomarker profile.
  • the one or more subjects may be resilient to a particular disease or condition.
  • the disease or condition may be selected from a cancer (such as e.g. breast cancer), a neurodegenerative disease (such as e.g. amyotrophic lateral sclerosis), or an infectious disease (such as e.g. COVID-19).
  • the method may comprise identifying a heavy-light chain pairing for a plurality of input heavy chain sequences selected from the heavy chain sequences identified in the one or more samples, thereby obtaining a set of heavy-light chain pairings.
  • the method may further comprise identifying the target (or a putative target or sets of targets) of the heavy-light chain pairing or each heavy-light chain pairing in the set of heavy-light chain pairings.
  • the method may further comprise identifying one or more targets by screening antibodies from the same source(s) as the one or more samples against a plurality of candidate peptides.
  • the plurality of candidate peptides may be selected based on the species from which the one or more samples originate.
  • the source of the one or more samples may be one or more human subjects and the antibody repertoire(s) from the same source(s) as the one or more samples may be screened against a set of candidate peptides representative of the human peptidome.
  • Identifying the target (or a putative target or sets of targets) of the heavy-light chain pairing or each heavy-light chain pairing in the set of heavy-light chain pairings may comprise using one or more targets identified by screening antibodies from the same source(s) as the one or more samples against a plurality of candidate peptides.
  • the method may further comprise filtering the set of heavy-light chain pairings based on one or more criteria.
  • the one or more criteria may apply to the identity of the putative targets or sets of targets identified for a heavy-light chain pairing.
  • the method may further comprise obtaining an antibody or fragment thereof which comprises an identified heavy-light chain pairing or a heavy-light chain pairing derived from an identified heavy-light chain pairing.
  • Obtaining an antibody or fragment thereof may comprise identifying a coding sequence for the antibody or fragment thereof and expressing the sequence in a suitable expression system (such as e.g. in a suitable host cell).
  • the method may further comprise identifying one or more antigens that the antibody or fragment thereof binds to, for example by testing for binding to one or more candidate antigens.
  • the method may further comprise optimising the sequence of the antibody or fragment thereof.
  • Optimising the sequence of the antibody or fragment thereof may be performed using any antibody optimisation technique known in the art.
  • Optimising the sequence of the antibody or fragment thereof may be performed using information from the sequence data from which the heavy-light chain pairing was identified, for example by analysing sequences similar to the input sequence from which the heavy-light chain pairing was identified.
  • the invention also provides a method for providing an immunotherapeutic composition, the method comprising identifying a heavy-light chain pairing as described herein and producing an immunotherapeutic composition that comprises an antibody comprising the heavy-light chain pairing or an antibody that has been derived from the heavy-light chain pairing (such as e.g. by further optimisation, mutation, etc).
  • the methods described herein may also find uses in the context of providing bispecific antibodies.
  • the methods described herein may be used to identify a light chain that would be suitable for pairing with two different heavy chains of interest.
  • the invention also provides a method of providing a bispecific antibody, the method comprising identifying a common light chain pairing for each of two heavy chains using any of the methods described herein, or a combination of a common light chain and two heavy chains that is derived from a heavy-light chain pairing that has been identified using any of the methods described herein (such as e.g. by further optimisation, mutation, etc).
  • the deep learning model may be used to predict a first plurality of corresponding light chain sequences for a first heavy chain sequence and to predict a second plurality of corresponding light chain sequences for a second heavy chain sequence.
  • the first and second plurality of light chain sequences predicted may then be compared to identify one or more light chains that may be suitable for use as the common light chain of a bispecific antibody that includes both of the heavy chains.
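  • The comparison of the two sets of predicted light chains can be sketched as a rank-preserving intersection. This is only one possible comparison, chosen here for illustration; the function name `common_light_chains` and the representation of each light chain as a hashable value (e.g. a string) are assumptions.

```python
def common_light_chains(first_predictions, second_predictions):
    """Return light chains predicted for both heavy chains, in first-list order."""
    second = set(second_predictions)
    seen = set()
    shared = []
    for light_chain in first_predictions:
        if light_chain in second and light_chain not in seen:
            shared.append(light_chain)
            seen.add(light_chain)
    return shared
```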
  • the methods described herein may also find uses in the context of antibody optimisation.
  • the methods described herein may be used to identify a light chain that would be suitable for pairing with a heavy chain, where the pairing has one or more advantageous properties (such as e.g. improved functional or developability properties) compared to an original pairing for the heavy chain.
  • the invention also provides a method of providing an improved antibody, the method comprising identifying a heavy-light chain pairing using any of the methods described herein from an input heavy chain of an original antibody, or a heavy-light chain pairing that is derived from a heavy-light chain pairing that has been identified using any of the methods described herein (such as e.g. by further optimisation, mutation, etc).
  • the methods described herein also find applications in any context where it is desirable to identify a TCR that is likely to bind its target from information that is limited to the β chain, the α chain (or, in less common cases, the γ or δ chain) or parts thereof (such as e.g. the V-gene, J-gene and junction sequences). This is frequently the case in the context of the discovery process of cell therapeutics such as engineered T cells.
  • the invention also provides a method of providing a TCR-based therapeutic, such as an engineered T cell expressing a particular TCR, the method comprising identifying an αβ or γδ chain pairing using any of the methods described herein, or that is derived from an αβ or γδ chain pairing that has been identified using any of the methods described herein (such as e.g. by further optimisation, mutation, etc).
  • the methods described herein may also find uses in the context of T cell receptor optimisation, in a similar way as described above for antibodies.
  • Training, validation and test sets. Paired heavy-light chain sequences were combined from three donors in DeKosky et al. [2015], and three naive BCR libraries from DeKosky et al. [2016]. These datasets contain entries each comprising: the heavy chain V gene identifier, heavy chain junction sequence (nucleotide and amino acid), heavy chain J gene identifier, light chain V gene identifier, light chain junction sequence (nucleotide and amino acid), and light chain J gene identifier.
  • This training set was picked primarily due to public availability and size. Note that the data entries also comprised the heavy chain D gene identifier but this information was not used. This is because annotation of D genes is believed to be less accurate than annotation of V and J genes.
  • Sequences were filtered to those with heavy chain junctions of 7 to 30 amino acids in length and light chain junctions of 7 to 20 amino acids in length. This is because sequences with junctions outside of these boundaries are believed to be rare, and allowing longer sequences comes at a cost in computing power that would be unlikely to be balanced by the gain in information, given how rare such sequences are.
  • This filter removed 84 pairs that had a heavy chain junction with a size outside of the boundaries, and 168 pairs that had a light chain junction with a size outside of the boundaries.
  • the data was also filtered for sequences where the heavy chain V-gene and light chain V-gene were observed in at least two sequences in order to keep a more concise vocabulary.
  • any duplicate heavy-light chain pairs were removed, which is functionally equivalent to a 99% redundancy cut-off (in other words, any heavy-light chain pair across the different sets would be at most 99% identical, confirming that the pairs in the training data set are indeed unique).
  • the length and number of entries filters together removed a total of 253 pairs (84+168+1) out of the 190,240 starting pairs that passed the pseudogene and redundancy filters, leading to a remaining set of 189,987 pairs (see Table 1).
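  • The filters described above (duplicate removal, junction length, and a minimum number of observations per V-gene) can be sketched as follows. The dictionary keys, the function name `filter_pairs` and the exact order of the filters are assumptions made for this example; the pseudogene filter is not shown.

```python
from collections import Counter

# Assumed field names for one heavy-light chain pair record.
FIELDS = ("h_v", "h_junction", "h_j", "l_v", "l_junction", "l_j")


def filter_pairs(pairs, heavy_range=(7, 30), light_range=(7, 20), min_gene_count=2):
    """Apply duplicate, junction-length and V-gene frequency filters in turn."""
    # Remove exact duplicate pairs while preserving first-seen order.
    unique = list({tuple(p[f] for f in FIELDS): p for p in pairs}.values())
    # Keep pairs whose junction lengths fall inside the inclusive ranges.
    in_range = [
        p for p in unique
        if heavy_range[0] <= len(p["h_junction"]) <= heavy_range[1]
        and light_range[0] <= len(p["l_junction"]) <= light_range[1]
    ]
    # Keep pairs whose heavy and light V-genes are each seen often enough.
    heavy_counts = Counter(p["h_v"] for p in in_range)
    light_counts = Counter(p["l_v"] for p in in_range)
    return [
        p for p in in_range
        if heavy_counts[p["h_v"]] >= min_gene_count
        and light_counts[p["l_v"]] >= min_gene_count
    ]
```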
  • the 189,987 sequences were split into training, validation, and test sets of 153,889, 17,099, and 18,999 sequences (corresponding to an ~80% training / 10% validation / 10% test split). While the same heavy chain sequence can be present across the three sets, none of these heavy chains have an identical light chain partner.
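  • The reported set sizes are consistent with a two-stage split in which 10% of the data is first held out as the test set and 10% of the remainder is then held out as the validation set, with counts rounded up. The sketch below reproduces the sizes under that assumption; the actual splitting procedure, the random seed and the function name `two_stage_split` are not stated in the source and are hypothetical.

```python
import math
import random


def two_stage_split(items, test_fraction=0.1, validation_fraction=0.1, seed=0):
    """Hold out a test set, then a validation set from what remains."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    test, remainder = shuffled[:n_test], shuffled[n_test:]
    n_validation = math.ceil(len(remainder) * validation_fraction)
    validation, training = remainder[:n_validation], remainder[n_validation:]
    return training, validation, test
```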
  • the training set data did not contain full-length heavy chain and light chain sequences, because the single-cell sequencing method used to generate this data was not able to recover the full amino acid sequence of the light/heavy chain. Instead, the training data set contained: (i) for each heavy chain: the V gene identifier, the junction sequence, the J gene identifier, and the D gene identifier (although the latter was not used), and (ii) for each light chain: the V gene identifier, the junction sequence, and the J gene identifier.
  • each entry comprised a combination of gene identifiers and sequences such as e.g. IGHV3-23/CAR...DYW/IGHJ6 - IGKV3-20/CQQ.../IGKJ2.
  • the models were trained to take as input a tokenised heavy chain sequence corresponding to a V gene identifier, a junction sequence and a J gene identifier, and to produce as output a tokenised light chain sequence corresponding to a V gene identifier, a junction sequence and a J gene identifier.
  • a custom encoding method was designed for their tokenisation.
  • Each V-gene constituted a single token
  • each J-gene constituted a single token
  • the junction amino acid sequence was tokenised as overlapping 3-mers.
  • the junction sequence is the most diverse region of the sequence, and is believed to mediate most of the binding functionality, hence the increased granularity in tokenisation of this sequence. Tokens were used if there was a minimum of 2 occurrences in the training set (as already explained above in relation to filtering of the data).
  • Alternative approaches to tokenisation of the junction amino acid sequences include, for example, byte-pair encoding, or one token for each amino acid.
  • Schemes such as byte-pair encoding may be particularly useful if more full-length sequence data were available, such as e.g. if single-cell data in the order of hundreds of thousands, or even millions of sequences, were used for training the model.
  • a “sentence” is a tokenised representation of a heavy or light chain sequence.
  • Each sentence starts with the special token <SOS>, followed by a token representing the heavy or light chain’s V-gene, the overlapping 3-mer tokens, the J-gene token, and then the special token <EOS>.
  • the sequence was padded with the special <PAD> token.
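  • The tokenisation scheme described above can be sketched as follows. The function name `tokenise_chain`, the `max_len` parameter and the example junction `CARGDYW` are hypothetical and chosen only for illustration.

```python
def tokenise_chain(v_gene, junction, j_gene, max_len=None):
    """Build a sentence: <SOS>, V-gene, overlapping 3-mers, J-gene, <EOS>, padding."""
    # Overlapping 3-mers: a junction of length n yields n - 2 tokens.
    kmers = [junction[i:i + 3] for i in range(len(junction) - 2)]
    tokens = ["<SOS>", v_gene, *kmers, j_gene, "<EOS>"]
    if max_len is not None:
        # Pad on the right so that every sentence has the same length.
        tokens += ["<PAD>"] * (max_len - len(tokens))
    return tokens


# Hypothetical heavy chain entry, padded to 12 tokens.
example = tokenise_chain("IGHV3-23", "CARGDYW", "IGHJ6", max_len=12)
```

  • In this example the 7-residue junction yields five 3-mer tokens (CAR, ARG, RGD, GDY, DYW), so the unpadded sentence has nine tokens and three <PAD> tokens are appended.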
  • a GRU (gated recurrent unit) model with heavy chain input using the same vocabulary as for the transformer, and also providing a light chain output (Figure 3B);
  • GRUs and more broadly, other recurrent neural network (RNN) architectures such as LSTMs (long short term memory networks) with attention were the previous “state of the art” for neural machine translation before transformers became common for this task;
  • a variation of the database search method termed “random search”.
  • Matchmaker architecture and inference. Matchmaker was built with PyTorch (version 1.6.0). Matchmaker’s hyperparameters and optimisation procedure are similar to those of the sequence-to-sequence (Seq2Seq) transformer from Vaswani et al. [2017]. Deviations from the original Transformer of Vaswani et al. [2017] are described below. In summary, the model was made slightly smaller (fewer layers) to constrain it better, and was trained using a different optimisation technique to improve training. Matchmaker has 4 encoder layers and 4 decoder layers, with a feed-forward dimension of 1024, and a dropout of 0.2. Layer normalisation was applied within the residual block [Child et al., 2018; Xiong et al., 2020].
  • the model was optimised using AdamW, with a weight decay of 0.1. Gradient clipping was implemented with an L2 norm of 1.0. Matchmaker has a total of 31.7M learnable parameters. Training was stopped if the validation loss did not improve for 3 epochs, and the model with the best validation loss was used.
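  • The stopping criterion can be sketched independently of any deep learning framework. The function below takes a sequence of per-epoch validation losses and returns the epoch whose model would be kept and the epoch at which training would stop; the function name and return format are assumptions for this example.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return (best_epoch, stop_epoch) for the given per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Stop here; the model saved at best_epoch is the one used.
                return best_epoch, epoch
    return best_epoch, len(validation_losses) - 1
```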
  • Gated recurrent unit (GRU) neural network model. An alternate deep learning Seq2Seq model with an attention mechanism [Bahdanau et al., 2015] was trained using two GRU networks [Cho et al., 2014].
  • the encoder is a 4-layer bi-directional GRU with a hidden dimension of 1024
  • the decoder is a 4-layer, forward-only GRU with a hidden dimension of 1024.
  • Other hyperparameters of the model, such as the dimension of the embedding layers, were matched as closely as possible to those of Matchmaker. In total, this model has 131.8M learnable parameters.
  • the encoder-decoder GRU model was trained in an identical manner to Matchmaker. For simplicity, this architecture is referred to as the “GRU model”.
  • GRU networks (and by extension, recurrent neural networks) have an entirely different mechanism of how they process sequences compared to Transformers. Briefly, transformers use a series of “self-attention” mechanisms that make them not only faster, but more accurate, while recurrent nets do not have this at all.
  • Both the Matchmaker model and the GRU model as used in this study provided a single prediction for each input chain.
  • the models predicted each chain in a sequential manner, by: at each subsequent position, outputting a probability for all possible tokens for the current position, and selecting the token with the highest probability for the current position before moving on to the next position.
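  • This sequential (greedy) prediction can be sketched as follows. The callable `next_token_probs`, which stands in for the trained model and returns a token-to-probability mapping for the next position, is hypothetical, as are the probabilities in the toy model.

```python
def greedy_decode(next_token_probs, max_len=30):
    """Pick the most probable token at each position until <EOS> or max_len."""
    tokens = ["<SOS>"]
    while len(tokens) < max_len:
        probabilities = next_token_probs(tokens)
        best = max(probabilities, key=probabilities.get)
        tokens.append(best)
        if best == "<EOS>":
            break
    return tokens


# Toy stand-in for a trained model: probabilities depend only on the last token.
TOY_MODEL = {
    "<SOS>": {"IGKV3-20": 0.7, "IGKV1-39": 0.3},
    "IGKV3-20": {"CQQ": 0.9, "<EOS>": 0.1},
    "IGKV1-39": {"CQQ": 1.0},
    "CQQ": {"IGKJ2": 0.8, "<EOS>": 0.2},
    "IGKJ2": {"<EOS>": 1.0},
}


def toy_next_token_probs(tokens):
    return TOY_MODEL[tokens[-1]]
```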
  • Other implementations are possible and envisaged.
  • a plurality of tokens could be considered simultaneously at each position and a combination of tokens across a plurality of positions (such as e.g. the whole sequence, i.e. all positions) that optimise a global probability across the plurality of positions can be selected.
  • Beam search is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. After having reached a predetermined maximum depth, a solution with maximum probability may be output.
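  • A minimal beam search sketch is given below, assuming the same kind of hypothetical `next_token_probs` callable as a stand-in for the model. The beam width, the use of summed log-probabilities as the global score and the toy probabilities are assumptions for this example.

```python
import math


def beam_search(next_token_probs, beam_width=2, max_len=30):
    """Keep the `beam_width` best partial sequences at each step."""
    beams = [(0.0, ["<SOS>"])]  # (summed log-probability, tokens)
    finished = []
    for _ in range(max_len - 1):
        candidates = []
        for score, tokens in beams:
            for token, probability in next_token_probs(tokens).items():
                if probability <= 0:
                    continue
                extended = (score + math.log(probability), tokens + [token])
                # Completed sequences leave the beam and are scored at the end.
                (finished if token == "<EOS>" else candidates).append(extended)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


# Toy model in which the greedy first choice ("A") leads to a less probable sequence.
TOY_BEAM_MODEL = {
    "<SOS>": {"A": 0.6, "B": 0.4},
    "A": {"X": 0.5, "Y": 0.5},
    "B": {"Z": 1.0},
    "X": {"<EOS>": 1.0},
    "Y": {"<EOS>": 1.0},
    "Z": {"<EOS>": 1.0},
}


def toy_beam_probs(tokens):
    return TOY_BEAM_MODEL[tokens[-1]]
```

  • In the toy model, selecting the most probable token at each position gives a sequence with overall probability 0.3 (via "A"), whereas a beam of width 2 recovers the sequence via "B" with overall probability 0.4.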
  • Database search method. Heavy chains were paired by sequence homology to known heavy-light chain pairs in Matchmaker's training set. For a query heavy chain sequence, only heavy-light chain pairs with a matching heavy chain V-gene to the query are selected. From this subset, two pairs are selected.
  • the first is the “closest” light chain, which is from the pair with the closest heavy chain junction amino acid sequence to the query.
  • the second is the “top” light chain, which is from the pair with the light chain V-gene that is most often associated with the query heavy chain’s V-gene.
  • the closest light chain is used if the identity to the query junction amino acid sequence is ≥65%, or the germline V-gene identities between the closest and top light chains are ≥75%. Otherwise, the top light chain is used.
  • the rationale to distinguish these two cases is that if there is a sufficiently similar junction sequence in the search, then the VL sequence from this could be used. If there is not one sufficiently similar, then a more coarse approach is taken where the most common VL for that V gene is used. Different cutoffs or even no cutoff (i.e. using the closest light chain in all cases) could be used. For either case, this strategy is referred to as the “database search” method.
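  • The selection rule of the database search method can be sketched as follows. This is an illustrative reading of the description, not the implementation used in the study: the record layout, the `simple_identity` measure, the choice of the first matching pair as the “top” pair, and the short germline strings are all assumptions made for the example (the germline strings are toy placeholders, not real germline sequences).

```python
from collections import Counter

def simple_identity(a, b):
    """Fraction of matching positions over the longer length (illustrative stand-in)."""
    if not a and not b:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def database_search(query_v, query_junction, pairs, v_germline,
                    junction_cutoff=0.65, v_cutoff=0.75):
    """Pick a light chain for a heavy chain from known heavy-light pairs."""
    # Only pairs whose heavy chain V-gene matches the query are considered.
    subset = [p for p in pairs if p["heavy_v"] == query_v]
    if not subset:
        return None
    # "Closest": pair with the most similar heavy chain junction.
    closest = max(subset, key=lambda p: simple_identity(p["heavy_junction"], query_junction))
    # "Top": pair carrying the light chain V-gene most often seen with this heavy V-gene
    # (taking the first such pair is an assumption).
    top_v = Counter(p["light_v"] for p in subset).most_common(1)[0][0]
    top = next(p for p in subset if p["light_v"] == top_v)
    junction_ok = simple_identity(closest["heavy_junction"], query_junction) >= junction_cutoff
    v_ok = simple_identity(v_germline[closest["light_v"]], v_germline[top["light_v"]]) >= v_cutoff
    return closest if (junction_ok or v_ok) else top

# Toy data; sequences are placeholders for illustration only.
v_germline = {"IGKV1-39": "DIQMTQSPSS", "IGKV3-20": "EIVLTQSPGT"}
pairs = [
    {"heavy_v": "IGHV3-23", "heavy_junction": "CARDYW", "light_v": "IGKV1-39"},
    {"heavy_v": "IGHV3-23", "heavy_junction": "CAKGGW", "light_v": "IGKV3-20"},
    {"heavy_v": "IGHV3-23", "heavy_junction": "CAKGAW", "light_v": "IGKV3-20"},
]
near = database_search("IGHV3-23", "CARDYW", pairs, v_germline)
far = database_search("IGHV3-23", "CTTTTTTTW", pairs, v_germline)
print(near["light_v"], far["light_v"])
```

  • A query with a near-identical junction receives the closest light chain, whereas a query with no similar junction falls back to the most common light chain V-gene for its heavy chain V-gene.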
  • Random search method with this approach, one light chain is chosen randomly from the database without any regard to features of the heavy chain.
  • Frequency-based search method As a baseline, a search for light chain pairs was performed using read counts. Since all the datasets used in this study are pre-paired, a situation with two bulk sequencing libraries was emulated by first disassembling the paired sequences. The number of reads was then aggregated onto the heavy chain sequence, or the light chain sequence. For example, suppose there are sequences HeavyA:LightA with 4 reads and HeavyB:LightA with 5 reads; splitting and aggregation results in HeavyA with 4 reads, HeavyB with 5 reads, and LightA with 9 reads. Heavy chains and light chains are then ranked on the basis of their total read count (see Figure 3D). Heavy chains are then paired with light chains with matching ranks.
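  • The splitting, aggregation and rank matching described above can be sketched as follows, using the HeavyA/HeavyB/LightA example from the text. The function name and the tuple layout are assumptions for illustration; ties are broken by order of first appearance, which the description does not specify.

```python
from collections import Counter

def frequency_pairing(paired_reads):
    """paired_reads: iterable of (heavy, light, read_count) tuples."""
    heavy_counts, light_counts = Counter(), Counter()
    # Disassemble the pairs and aggregate reads onto each chain separately.
    for heavy, light, reads in paired_reads:
        heavy_counts[heavy] += reads
        light_counts[light] += reads
    # Rank each chain type by total read count, highest first.
    heavy_ranked = [h for h, _ in heavy_counts.most_common()]
    light_ranked = [l for l, _ in light_counts.most_common()]
    # Pair chains with matching ranks; chains without a counterpart stay unpaired.
    return heavy_counts, light_counts, list(zip(heavy_ranked, light_ranked))

heavy_counts, light_counts, pairing = frequency_pairing(
    [("HeavyA", "LightA", 4), ("HeavyB", "LightA", 5)]
)
print(dict(heavy_counts), dict(light_counts), pairing)
```

  • In this small example only one light chain exists, so the top-ranked heavy chain (HeavyB) is paired with LightA and HeavyA is left without a partner.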
  • the output is a light chain V-gene identifier, junction amino acid sequence, and the light chain J-gene identifier (Figure 3). Since the training set does not have full-length sequences, the light chain V-gene and J-genes were replaced with their germline amino acid sequences.
  • Thermostability of the monoclonal antibodies was measured in a thermal denaturation assay. In triplicate, each antibody was heated from 25°C to 95°C in the presence of SYPRO™ Orange, and the fluorescence measured. The melt curve derivatives were then plotted as the average of the three replicates. The temperature at which the fluorescence was most rapidly increasing was noted for each antibody.
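  • The read-out of such an assay, i.e. the temperature at which the replicate-averaged derivative of the melt curve is greatest, can be computed as in the following sketch. The curves here are synthetic sigmoids generated purely for illustration (the 70.5°C midpoint, the 1°C sampling step and the scale factors are assumptions), and a simple finite difference stands in for whatever derivative the instrument software computes.

```python
import math

def melting_temperature(temps, replicates):
    """Temperature at which the mean fluorescence derivative is greatest."""
    # Finite-difference derivative dF/dT for each replicate curve.
    derivatives = [
        [(curve[i + 1] - curve[i]) / (temps[i + 1] - temps[i]) for i in range(len(temps) - 1)]
        for curve in replicates
    ]
    # Average the derivative curves across replicates.
    mean_derivative = [sum(values) / len(values) for values in zip(*derivatives)]
    # Each derivative value is assigned to the midpoint of its temperature interval.
    midpoints = [(temps[i] + temps[i + 1]) / 2 for i in range(len(temps) - 1)]
    steepest = max(range(len(mean_derivative)), key=lambda i: mean_derivative[i])
    return midpoints[steepest]

temps = list(range(25, 96))  # 25°C to 95°C in 1°C steps

def synthetic_curve(scale, tm=70.5, width=2.0):
    """Synthetic sigmoidal melt curve (illustrative data only)."""
    return [scale / (1 + math.exp(-(t - tm) / width)) for t in temps]

replicates = [synthetic_curve(s) for s in (0.98, 1.0, 1.02)]
tm = melting_temperature(temps, replicates)
print(tm)
```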
  • the AntiBERTa model was trained in a similar style to RoBERTa-base (Liu et al., 2019), but with a smaller batch size of 768, a peak learning rate of 10⁻⁴, and over 225000 pre-training steps and 10000 warm-up steps.
  • beam search (Sutskever et al., 2014) was used with a beam width of 3.
  • the two AntiBERTa models were joined as a sequence-to-sequence model, where the encoder and decoder were each initialised as a copy of the AntiBERTa model using the Huggingface transformers library [Wolf et al., 2019].
  • the AntiBERTa-AntiBERTa model was then fine-tuned with a slightly larger dataset of paired sequences than was used for Matchmaker and the GRU model as described above.
  • the training data was expanded by introducing more sequences from antigen-experienced libraries (from the same sources as explained above, i.e. DeKosky et al. 2016 and DeKosky et al., 2015). In total, there were 171984 paired sequences.
  • the dataset here does not contain the full heavy chain and full light chain sequence.
  • full sequences were inferred using their germline V and J gene annotations.
  • the model was trained over 20 epochs, with a peak learning rate of 3 × 10⁻⁵ and a 5% warmup. Parameters were shared between the encoder AntiBERTa and the decoder AntiBERTa.
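  • A learning rate schedule with a peak value and a warmup fraction can be sketched as below. The text states only the peak learning rate and the 5% warmup; the linear ramp and the linear decay to zero after the peak are assumptions made for this illustration, as are the function name and the step count used in the example.

```python
def learning_rate(step, total_steps, peak_lr=3e-5, warmup_fraction=0.05):
    """Linear warmup to peak_lr, then linear decay to zero (decay shape assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up from 0 to the peak during the warmup phase.
        return peak_lr * step / warmup_steps
    # Decay linearly from the peak to 0 over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Example with a hypothetical run of 1000 optimisation steps.
schedule = [learning_rate(s, 1000) for s in (0, 25, 50, 1000)]
print(schedule)
```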
  • an equivalent paired data set that includes full-length sequences is generated using one of two alternative approaches.
  • a full-length sequence is obtained by replacing the V and J gene identifiers with their corresponding germline sequences.
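  • A minimal sketch of this germline-replacement step is given below. The text does not specify how the junction is spliced into the germline segments; trimming the V segment at its last cysteine and the J segment after the tryptophan or phenylalanine of a `[WF]G.G` motif is an assumption made here so that the conserved anchor residues are not duplicated. The sequences are toy placeholders, not real germline sequences.

```python
import re

def assemble_full_length(v_germline, junction, j_germline):
    """Join V germline + junction + J germline without duplicating anchor residues."""
    # The junction is assumed to start at the conserved Cys near the end of the V segment.
    v_end = v_germline.rfind("C")
    # The junction is assumed to end at the W/F of the [WF]G.G motif in the J segment.
    motif = re.search(r"[WF]G.G", j_germline)
    if v_end == -1 or motif is None:
        raise ValueError("conserved anchor residue not found")
    return v_germline[:v_end] + junction + j_germline[motif.start() + 1:]

# Toy placeholder sequences for illustration only.
full = assemble_full_length("EVQLVESGGAVYYCAR", "CAKDRGYFDYW", "YFDYWGQGTLVTVSS")
print(full)
```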
  • the pre-trained AntiBERTa models or any other such “checkpoint” model such as e.g. a GPT-2 model
  • the pre-trained AntiBERTa models are used to predict the full-length sequence of the training set, independently for the heavy and light chains (using the respective models) based on the known parts of the chains.
  • the prediction from the “checkpoint” models may be obtained using some or all of the known parts of the chain (e.g. gene segment identifiers, partial sequences etc.), optionally in combination with some information obtained from the germline sequences of any segment for which a full sequence is not available (such as e.g. the identity of some of the amino acids of the segment, for example the first k amino acids of the segment, where k can for example be 1, 2, 3, 5, 10, etc).
  • the AntiBERTa model trained on the unpaired full-length heavy and light chain data may be used to predict the full-length sequence of the heavy chains in the training data from the V gene identifier, J gene identifier and junction sequence provided in the data.
  • the AntiBERTa model trained on the unpaired full-length heavy and light chain data may be used to predict the full-length sequence of the light chains in the training data from the V gene identifier, J gene identifier and junction sequence provided in the data.
  • the same two approaches could be used to map any limited paired training data into data in a more extended format that may have been available to train the “checkpoint” models.
  • the data used to train the “checkpoint” models may be converted to a limited format that matches the format of the paired training data. This may still benefit from the potential additional information gathered by the pre-trained models from the vast number of unpaired sequences available. However, it may not take full advantage of the extent of information available in such unpaired sequence data.
  • NLP-inspired models generate light chains using only heavy chains as input.
  • the problem of heavy-light chain pairing was framed as an NMT task, where a light chain sequence is predicted given only the heavy chain sequence as input.
  • a Seq2Seq Transformer similar to [Vaswani et al., 2017] was implemented.
  • the model was trained on a tokenised representation of heavy chain sequences as input and returns a tokenised representation of light chain sequences as output, as described in the Methods (Figure 3A).
  • the model encoder layers compute self-attention between the V-gene, overlapping junction k-mers, and the J-gene.
  • the encoder’s self-attention scores on the heavy chain sequence are then used by the decoder to autoregressively predict the light chain sequence.
  • An example of the decoder attention is shown in Figure 5, where each attention head focuses on different subsets of heavy chain tokens to determine the output light chain tokens.
  • Figure 6 shows the prediction performance on the held-out test set and single-cell blind tests for all methods, in terms of light chain V-gene prediction.
  • the V-gene results are investigated separately from the full prediction results because the V-gene forms the largest part of the chain portion determining antigen binding and thus should have a large influence on stability and, to some extent binding (the latter being also strongly influenced by the junction sequence).
  • the light chain V-gene prediction results are further discussed below.
  • the number of correct full light chains, consisting of the V-gene, junction amino acid sequence, and the J-gene, was much lower than the number of correct V-genes across all methods.
  • the full set of prediction results are in Tables 3 and 4.
  • the database search method was the most accurate, with 4940 correct V-genes (26.3%; Figure 6A). However, as can be seen in Figure 6A, the database search method did not perform as well in the single-cell blind tests as it did in the test set. In fact, the transformer-based model outperformed the database search method on all single-cell blind tests. The higher performance of the database search method in the test set is believed to be due largely to the presence of clonal relatives in the training and test sets, which is not the case in the blind sets. Indeed, members of a single B cell clone were partitioned across the training and test sets (see Table 5). For example, out of 18803 heavy-light pairs for which prediction was made, 897 had an identical heavy chain sequence to the training set.
  • the evaluation on the test set gives a skewed view of the performance of the database search method which would only be realistic if the amount of paired sequence data available to perform such searches was truly representative of the expected diversity of the BCR repertoire (which is far from being the case in reality). In other words, the evaluation on the blind tests gives a much more realistic view of the comparative performance of the methods.
  • Matchmaker was the top performer, with up to 9.8% of heavy chain sequences being predicted with the correct light chain V-gene (Figure 6A).
  • Matchmaker was also able to predict the correct light chain V-gene, junction sequence, and J-gene for 105, 7, and 23 heavy chains in the King, Eccles, and Setliff datasets, respectively (Table 4).
  • the database search method had 59, 4, and 6 correct predictions. While the GRU model was not as accurate as Matchmaker, it still outperformed the database search method on two single-cell datasets.
  • Figure 6B shows the Levenshtein distance distribution of predicted light chain junction sequences on the King et al. [2021] dataset. This shows that the transformer-based method (Matchmaker) has more predictions with lower distances than any other method, indicating that even if the predicted light chain junction amino acid sequence is not entirely correct, it tends to be closer than with any other method.
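  • For reference, the Levenshtein distance used in this comparison counts the minimum number of single-residue insertions, deletions and substitutions needed to turn one sequence into another. A standard dynamic-programming sketch is shown below; the junction strings in the example are illustrative and are not taken from the datasets.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if residues match)
            ))
        prev = curr
    return prev[-1]

# Illustrative junction strings differing at a single position.
print(levenshtein("CQQSYSTPLTF", "CQQSYSTPWTF"))
```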
  • the data demonstrates that the Matchmaker method is able to generate a significant proportion of pairs that show sufficient binding affinity to form a good basis for further engineering, a step that was until now a significant bottleneck in the process of providing functional antibodies from bulk sequencing of BCR repertoires.
  • the performance of the method could be further enhanced using additional training data reflecting how mutated and/or engineered antibody sequences, such as the therapeutic antibodies used here, are paired.
  • expanding the training data to enable the model to learn from more data and/or from data comprising sequences that have been optimised is expected to even further improve the ability of the model to predict functional pairings for native as well as engineered antibodies.
  • the Matchmaker model described above was used to identify heavy-light chain pairings for heavy chain sequences identified in COVID-19 patients as having a high likelihood of being involved in the immune response, and binding to the coronavirus spike protein.
  • 18 heavy chain sequences identified from COVID-19 patients as having these properties selected from the data in Galson et al. (2020) were provided as inputs to the method for pairing with light chains.
  • the pairings were expressed and tested for binding to Wuhan strain spike antigens using Homogeneous Time Resolved Fluorescence (HTRF).
  • This example describes a machine-learning, NLP-inspired approach to the problem of BCR heavy-light chain pairing.
  • two architectures that consider the problem of BCR heavy-light chain pairing as an NMT task are described: Matchmaker, a Seq2Seq Transformer model, and a Seq2Seq GRU-based model.
  • Matchmaker is the first application of a Transformer model for this purpose.
  • the deep learning-based approach described herein provides the benefit of covering the BCR repertoire as deeply as possible, while also eliminating the need for bulk light chain sequencing.
  • the approach is capable of learning general features of pairings from a set of training data and of using this learning to predict pairings for previously unseen chains.
  • an approach such as a database search approach is likely to quickly break down when looking at query chains that are not present in, or are more distant from, chains that are in known pairings. The ability to generalise to unseen chains is likely to be advantageous in many cases considering the extreme diversity of the BCR repertoire, but particularly so in the context of applications such as identifying specific antibodies that may underlie a desired phenotype in an individual, or other rare antibodies.
  • the method is able to predict binders for a proportion of heavy chains, and represents a significant improvement over the prior art at least because (a) it only requires the heavy chain sequence as input, (b) it is not limited in terms of type of sequences and datasets that it can use as input (i.e. it is expected to be able to provide useful predictions for any type of sequence and not only for highly abundant sequences within clonally dominated samples), and (c) it has a higher binder prediction hit rate than other methods (even a method newly described herein that was shown to have a better prediction performance than the state of the art). Any such binder that is successfully obtained thus represents an improvement over the prior art and can be used as a promising starting point for further affinity improvement.
  • Matchmaker predicts tokens in a greedy fashion (i.e. one token at a time, in other words the most likely token for each individual position is predicted).
  • Using strategies that consider multiple candidate tokens at each position (for example, 3), such as beam search [Sutskever et al., 2014], should help to increase prediction accuracy because the model explores more solutions and is thus less likely to get “stuck” in a suboptimal solution. This is the approach that was used for the tandem transformer model described above.
  • the Matchmaker and GRU models exemplified above use a combination of overlapping fixed size k-mers and gene identifiers for the encoding of BCR sequences.
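  • Such an encoding can be sketched as follows. The k-mer size of 3, the function name and the example gene identifiers and junction are assumptions for illustration; the actual k-mer size and vocabulary are those given in the Methods.

```python
def tokenise_chain(v_gene, junction, j_gene, k=3):
    """Encode a chain as [V-gene id, overlapping junction k-mers..., J-gene id]."""
    if len(junction) < k:
        # Junction shorter than k: keep it as a single token (an assumption).
        kmers = [junction]
    else:
        # Slide a window of size k along the junction, one residue at a time.
        kmers = [junction[i:i + k] for i in range(len(junction) - k + 1)]
    return [v_gene] + kmers + [j_gene]

tokens = tokenise_chain("IGHV3-23", "CARDYW", "IGHJ4")
print(tokens)
```

  • Because consecutive k-mers overlap by k − 1 residues, the original junction can be recovered from the token sequence by taking the first k-mer and appending the last residue of each subsequent k-mer.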
  • a limitation of the training set is the lack of full-length heavy and light chain sequences. Indeed, such data is currently only available in limited amounts and thus larger datasets comprising only the J-gene identifier, V-gene identifier and junction sequences were used.
  • the issue of light chain pairing remains pertinent.
  • the deep learning-based approach described herein is unique in its sole dependence on the heavy chain as input.
  • the Matchmaker (transformer-based) model had the highest in silico accuracy based on multiple metrics and was validated in vitro to generate functional antibodies. This puts Matchmaker in a unique position to predict light chains where paired light chain information is not available. This approach thus has the potential to fill gaps in light chain pairing information, thus enabling therapeutic antibody discovery and a better understanding of the immune system.
  • Galson et al. 2020. Deep Sequencing of B Cell Receptor Repertoires From COVID-19 Patients Reveals Strong Convergent Immune Signatures. Front. Immunol., 15 December 2020. doi.org/10.3389/fimmu.2020.605170.
  • Ye et al. 2013. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013 Jul;41(Web Server issue):W34-40.

PCT/EP2022/060073 2021-04-22 2022-04-14 Engineering of antigen-binding proteins Ceased WO2022223451A1 (en)

Priority Applications (8)

Application Number Priority Date Filing Date Title
CA3215778A CA3215778A1 (en) 2021-04-22 2022-04-14 Engineering of antigen-binding proteins
EP22717415.8A EP4327328A1 (en) 2021-04-22 2022-04-14 Engineering of antigen-binding proteins
CN202280036343.6A CN117337470A (zh) 2021-04-22 2022-04-14 抗原结合蛋白的工程化
US18/287,352 US20240203523A1 (en) 2021-04-22 2022-04-14 Engineering of antigen-binding proteins
JP2023564230A JP2024514691A (ja) 2021-04-22 2022-04-14 抗原結合タンパク質の遺伝子操作
KR1020237039964A KR20240011144A (ko) 2021-04-22 2022-04-14 항원-결합 단백질의 조작
AU2022260043A AU2022260043A1 (en) 2021-04-22 2022-04-14 Engineering of antigen-binding proteins
IL307832A IL307832A (en) 2021-04-22 2022-04-14 Engineering of antigen-binding proteins

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB2105776.5A GB202105776D0 (en) 2021-04-22 2021-04-22 Engineering of antigen-binding proteins
GB2105776.5 2021-04-22

Publications (1)

Publication Number Publication Date
WO2022223451A1 (en) 2022-10-27

Family

ID=76193388

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/060073 Ceased WO2022223451A1 (en) 2021-04-22 2022-04-14 Engineering of antigen-binding proteins

Country Status (10)

Country Link
US (1) US20240203523A1
EP (1) EP4327328A1
JP (1) JP2024514691A
KR (1) KR20240011144A
CN (1) CN117337470A
AU (1) AU2022260043A1
CA (1) CA3215778A1
GB (1) GB202105776D0
IL (1) IL307832A
WO (1) WO2022223451A1

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024149766A1 (en) 2023-01-09 2024-07-18 Alchemab Therapeutics Ltd Anti-unc5c antibodies
WO2024208904A2 (en) 2023-04-03 2024-10-10 Alchemab Therapeutics Ltd Anti-cd33 antibodies
WO2025022002A1 (en) 2023-07-26 2025-01-30 Alchemab Therapeutics Ltd Analysis of antigen-binding proteins
EP4517776A1 (en) * 2023-09-01 2025-03-05 Siemens Healthcare Diagnostics Inc. Clinical decision support using transformer-based networks by imputing biomarkers
WO2025125898A1 (en) 2023-12-15 2025-06-19 Alchemab Therapeutics Ltd Anti-unc5c antibodies and uses thereof

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392929B (zh) * 2021-07-01 2024-05-14 中国科学院深圳先进技术研究院 一种基于词嵌入与自编码器融合的生物序列特征提取方法
US12367329B1 (en) * 2024-06-06 2025-07-22 EvolutionaryScale, PBC Protein binder search

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020236839A2 (en) * 2019-05-19 2020-11-26 Just Biotherapeutics, Inc. Generation of protein sequences using machine learning techniques

Non-Patent Citations (50)

* Cited by examiner, † Cited by third party
Title
BAHDANAU: "Neural machine translation by jointly learning to align and translate", ARXIV:1409.0473, 2015
BASHFORD-ROGERS ET AL.: "Analysis of the B cell receptor repertoire in six immune-mediated diseases", NATURE, vol. 574, 2019, pages 122 - 126, XP037070580, DOI: 10.1038/s41586-019-1595-3
CARTER JASON A.PREALL JONATHAN B.GRIGAITYTE KRISTINAGOLDFLESS STEPHEN J.JEFFERY ERICBRIGGS ADRIAN W.VIGNEAULT FRANCOISATWAL GURIND: "Single T Cell Sequencing Demonstrates the Functional Role of αβ TCR Pairing in Cell Lineage and Antigen Specificity", FRONTIERS IN IMMUNOLOGY, vol. 10, 2019, pages 1516
CHILD ET AL.: "Generating Long Sequences with Sparse Transformers", ARXIV:1904.10509, 2019
CHO ET AL.: "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling", ARXIV:1412.3555, 2014
DEKOSKY ET AL.: "High-throughput sequencing of the paired human immunoglobulin heavy and light chain repertoire", NATURE BIOTECHNOLOGY, vol. 31, 2013, pages 166 - 169, XP055545134, DOI: 10.1038/nbt.2492
DEKOSKY ET AL.: "In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire", NATURE MEDICINE, vol. 21, 2015, pages 86 - 91, XP037115845, DOI: 10.1038/nm.3743
DEKOSKY ET AL.: "Large-scale sequence and structural comparisons of human naive and antigen-experienced antibody repertoires", PNAS, vol. 113, no. 19, 10 May 2016 (2016-05-10), pages E2636 - E2645, XP055611478, DOI: 10.1073/pnas.1525510113
DEVLIN ET AL.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", ARXIV: 1810.04805, 2019
DUNBARDEANE: "ANARCI: antigen receptor numbering and receptor classification", BIOINFORMATICS, vol. 32, no. 2, 15 January 2016 (2016-01-15), pages 298 - 300
ECCLES ET AL.: "T-bet+ Memory B Cells Link to Local Cross-Reactive IgG upon Human Rhinovirus Infection", CELL REPORTS, vol. 30, 14 January 2020 (2020-01-14), pages 351 - 366
FURCY, DAVIDKOENIG, SVEN: "Limited Discrepancy Beam Search", IJCAI'05: PROCEEDINGS OF THE 19TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, July 2005 (2005-07-01), pages 125 - 131
GALSON ET AL.: "Deep Sequencing of B Cell Receptor Repertoires From COVID-19 Patients Reveals Strong Convergent Immune Signatures", FRONT. IMMUNOL., 15 December 2020 (2020-12-15)
GALSON JACOB D.SCHAETZLE SEBASTIANBASHFORD-ROGERS RACHAEL J. M.RAYBOULD MATTHEW I. J.KOVALTSUK ALEKSANDRKILPATRICK GAVIN J.MINTER : "Deep Sequencing of B Cell Receptor Repertoires From COVID-19 Patients Reveals Strong Convergent Immune Signatures", FRONTIERS IN IMMUNOLOGY, vol. 11, 2020
GLANVILLE ET AL.: "Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire", PNAS, vol. 106, no. 48, 1 December 2009 (2009-12-01), pages 20216 - 20221, XP055062648, DOI: 10.1073/pnas.0909775106
HOWIE BSHERWOOD AMBERKEBILE ADBERKA JEMERSON ROWILLIAMSON DW ET AL.: "High-throughput pairing of T cell receptor α and β sequences", SCI TRANSL MED., vol. 7, 2015, pages 301 - 131, XP055318674, DOI: 10.1126/scitranslmed.aac5624
JAYARAM ET AL.: "Germline VH/VL pairing in antibodies", PROTEIN ENGINEERING, DESIGN AND SELECTION, vol. 25, 10 October 2012 (2012-10-10), pages 523 - 530, XP055216142, DOI: 10.1093/protein/gzs043
JI YANRONG ET AL: "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome", BIORXIV, 19 September 2020 (2020-09-19), XP055877228, Retrieved from the Internet <URL:https://www.biorxiv.org/content/10.1101/2020.09.17.301879v1.full.pdf> [retrieved on 20220110], DOI: 10.1101/2020.09.17.301879 *
KING ET AL.: "Single-cell analysis of human B cell maturation predicts how antibody class switching shapes selection dynamics", SCIENCE IMMUNOLOGY, vol. 6, 12 February 2021 (2021-02-12), pages eabe6291
KOVALTSUK ET AL.: "Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires", J IMMUNOL, vol. 201, no. 8, 15 October 2018 (2018-10-15), pages 2502 - 2509
KRAWCZYK ET AL.: "Looking for therapeutic antibodies in next-generation sequencing repositories", MABS, vol. 11, 2019, pages 1197 - 1205
LEEM ET AL.: "ABodyBuilder: Automated antibody structure prediction with data-driven accuracy estimation", MABS, vol. 8, no. 7, October 2016 (2016-10-01), pages 1259 - 1268, XP055416058, DOI: 10.1080/19420862.2016.1205773
LING ET AL.: "Effect of VH-VL Families in Pertuzumab and Trastuzumab Recombinant Production, Her2 and Fcγ Binding", FRONT. IMMUNOL., 12 March 2018 (2018-03-12)
LIU: "RoBERTa: A Robustly Optimized BERT Pretraining Approach", ARXIV: 1907.11692, 2019
MASON, D.M.FRIEDENSOHN, S.WEBER, C.R. ET AL.: "Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning", NAT BIOMED ENG, 2021
MORAWALCZAK: "How many different clonotypes do immune repertoires contain?", CURRENT OPINION IN SYSTEMS BIOLOGY, vol. 18, December 2019 (2019-12-01), pages 104 - 110
NIELSEN ET AL.: "Human B Cell Clonal Expansion and Convergent Antibody Responses to SARS-CoV-2", BIORXIV, 9 July 2020 (2020-07-09)
RADFORD ET AL., LANGUAGE MODELS ARE UNSUPERVISED MULTITASK LEARNERS, 2019, Retrieved from the Internet <URL:https://openai.com/blog/better-language-models>
RAKOCEVIC ET AL.: "The landscape of high-affinity human antibodies against intratumoral antigens", BIORXIV, 8 February 2021 (2021-02-08)
RAYBOULD ET AL.: "Public Baseline and shared response structures support the theory of antibody repertoire functional commonality", PLOS COMPUT BIOL, vol. 17, no. 3, 2021, pages e1008781
REDDY ET AL.: "Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells", NATURE BIOTECHNOLOGY, vol. 28, 2010, pages 965 - 969, XP055617221, DOI: 10.1038/nbt.1673
REES: "Understanding the human antibody repertoire", MABS, vol. 12, no. 1, January 2020 (2020-01-01), pages 1729683
ROTHE: "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks", ARXIV:1907.12461, 2020
SAKA KOICHIRO ET AL: "Antibody design using LSTM based deep generative model from phage display library for affinity maturation", vol. 11, no. 1, 12 March 2021 (2021-03-12), XP055876990, Retrieved from the Internet <URL:https://www.nature.com/articles/s41598-021-85274-7.pdf> DOI: 10.1038/s41598-021-85274-7 *
SEELIGER DSCHULZ PLITZENBURGER TSPITZ JHOERER SBLECH MENENKEL BSTUDTS JMGARIDEL PKAROW AR: "Boosting antibody developability through rational sequence optimization", MABS, vol. 7, no. 3, 2015, pages 505 - 15
SETLIFF ET AL.: "High-Throughput Mapping of B Cell Receptor Sequences to Antigen Specificity", CELL, vol. 179, 12 December 2019 (2019-12-12), pages 1636 - 1646
SIMONICH ET AL.: "Kappa chain maturation helps drive rapid development of an infant HIV-1 broadly neutralizing antibody lineage", NATURE COMMUNICATIONS, vol. 10, 2019
SUTSKEVER: "Sequence to Sequence Learning with Neural Networks", ARXIV: 1409.3215, 2014
TEPLYAKOV ET AL.: "Structural diversity in a human antibody germline library", MABS, vol. 8, no. 6, August 2016 (2016-08-01), pages 1045 - 63, XP055697646, DOI: 10.1080/19420862.2016.1190060
TILLER ET AL.: "A fully synthetic human Fab antibody library based on fixed VH/VL framework pairings with favorable biophysical properties", MABS, vol. 5, no. 3, 1 May 2013 (2013-05-01), pages 445 - 470, XP055377037, DOI: 10.4161/mabs.24218
VANDER HEIDEN ET AL.: "Dysregulation of B Cell Repertoire Formation in Myasthenia Gravis Patients Revealed through Deep Sequencing", J IMMUNOL., vol. 198, no. 4, 15 February 2017 (2017-02-15), pages 1460 - 1473, XP055636836, DOI: 10.4049/jimmunol.1601415
VASWANI ET AL.: "Attention Is All You Need", ARXIV: 1706.03762, 2017
WARSZAWSKI S, BORENSTEIN KATZ A, LIPSH R, KHMELNITSKY L, BEN NISSAN G, JAVITT G: "Optimizing antibody affinity and stability by the automated design of the variable light-heavy chain interfaces", PLOS COMPUT BIOL, vol. 15, no. 8, 2019, pages e1007207, XP055680871, DOI: 10.1371/journal.pcbi.1007207
WOLF ET AL.: "HuggingFace's Transformers: State-of-the-art Natural Language Processing", ARXIV:1910.03771, 2019
XIONG: "On Layer Normalization in the Transformer Architecture", ARXIV:2002.04745, 2020
YE ET AL.: "IgBLAST: an immunoglobulin variable domain sequence analysis tool", NUCLEIC ACIDS RES., vol. 41, July 2013 (2013-07-01), pages W34 - 40
YI-CHUN HSIAOYONGLEI SHANGDANIELLE M. DICARAANGIE YEEJOYCE LAISI HYUN KIMDIEGO ELLERMANRACQUEL CORPUZYONGMEI CHENSHARMILA RAJAN: "Immune repertoire mining for rapid affinity optimization of mouse monoclonal antibodies", MABS, vol. 11, no. 4, 2019, pages 735 - 746, XP055797539, DOI: 10.1080/19420862.2019.1584517
ZHENG GXYTERRY JMBELGRADER PRYVKIN PBENT ZWWILSON R ET AL.: "Massively parallel digital transcriptional profiling of single cells", NAT COMMUN., vol. 8, 2017, pages 14049, XP055503732, DOI: 10.1038/ncomms14049
ZHOU, RONGHANSEN, ERIC: "Beam-Stack Search: Integrating Backtracking with Beam Search", CONFERENCE: PROCEEDINGS OF THE FIFTEENTH INTERNATIONAL CONFERENCE ON AUTOMATED PLANNING AND SCHEDULING (ICAPS 2005, 5 June 2005 (2005-06-05)
ZHU ET AL.: "Mining the antibodyome for HIV-1-neutralizing antibodies with next-generation sequencing and phylogenetic pairing of heavy/light chains", PNAS, vol. 110, no. 16, 16 April 2013 (2013-04-16), pages 6470 - 5, XP055234665, DOI: 10.1073/pnas.1219320110

Also Published As

Publication number Publication date
JP2024514691A (ja) 2024-04-02
CN117337470A (zh) 2024-01-02
CA3215778A1 (en) 2022-10-27
US20240203523A1 (en) 2024-06-20
EP4327328A1 (en) 2024-02-28
GB202105776D0 (en) 2021-06-09
IL307832A (en) 2023-12-01
KR20240011144A (ko) 2024-01-25
AU2022260043A1 (en) 2023-11-16

Similar Documents

Publication Publication Date Title
US20240203523A1 (en) Engineering of antigen-binding proteins
Kim et al. Germinal centre-driven maturation of B cell response to mRNA vaccination
Steichen et al. A generalized HIV vaccine design strategy for priming of broadly neutralizing antibody responses
Wang et al. A large-scale systematic survey reveals recurring molecular features of public antibody responses to SARS-CoV-2
Setliff et al. Multi-donor longitudinal antibody repertoire sequencing reveals the existence of public antibody clonotypes in HIV-1 infection
Huang et al. AbAgIntPre: A deep learning method for predicting antibody-antigen interactions based on sequence information
Friedensohn et al. Convergent selection in antibody repertoires is revealed by deep learning
Richardson et al. A computational method for immune repertoire mining that identifies novel binders from different clonotypes, demonstrated by identifying anti-pertussis toxoid antibodies
WO2025022002A1 (en) Analysis of antigen-binding proteins
Könitzer et al. Generation of a highly diverse panel of antagonistic chicken monoclonal antibodies against the GIP receptor
Wang et al. Potent and broad HIV-1 neutralization in fusion peptide-primed SHIV-infected macaques
Swanson et al. Rapid selection of HIV envelopes that bind to neutralizing antibody B cell lineage members with functional improbable mutations
Xu et al. Functional clustering of B cell receptors using sequence and structural features
Xu et al. Advances in antibody discovery from human BCR repertoires
Yan et al. Deep immunoglobulin repertoire sequencing depicts a comprehensive atlas of spike-specific antibody lineages shared among COVID-19 convalescents
Wasdin et al. Generation of antigen-specific paired-chain antibodies using large language models
Reys et al. Integrative modeling in the age of machine learning: A summary of HADDOCK strategies in CAPRI rounds 47–55
EP4602607A1 (en) Engineering of antigen-binding proteins
Campbell et al. Combining random mutagenesis, structure-guided design and next-generation sequencing to mitigate polyreactivity of an anti-IL-21R antibody
Wagner et al. High-throughput specificity profiling of antibody libraries using ribosome display and microfluidics
Ramon et al. AbNatiV: VQ-VAE-based assessment of antibody and nanobody nativeness for hit selection, humanisation, and engineering
Thomas et al. High affinity mAb infusion can enhance maximum affinity maturation during HIV Env immunization
Zou et al. Antibody humanization via protein language model and neighbor retrieval
Zhao et al. Quantitative characterization of the B cell receptor repertoires of human immunized with commercial rabies virus vaccine
Le Bihan et al. de Novo Sequencing of Antibodies for Identification of Neutralizing Antibodies in Human Plasma Post SARS-CoV-2 Vaccination

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 22717415; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase
    Ref document number: 3215778; Country of ref document: CA
WWE WIPO information: entry into national phase
    Ref document number: 2023564230; Country of ref document: JP
    Ref document number: 307832; Country of ref document: IL
WWE WIPO information: entry into national phase
    Ref document number: 2022260043; Country of ref document: AU
    Ref document number: 805154; Country of ref document: NZ
    Ref document number: AU2022260043; Country of ref document: AU
ENP Entry into the national phase
    Ref document number: 2022260043; Country of ref document: AU; Date of ref document: 20220414; Kind code of ref document: A
WWE WIPO information: entry into national phase
    Ref document number: 202280036343.6; Country of ref document: CN
WWE WIPO information: entry into national phase
    Ref document number: 1020237039964; Country of ref document: KR
WWE WIPO information: entry into national phase
    Ref document number: 2022717415; Country of ref document: EP
NENP Non-entry into the national phase
    Ref country code: DE
WWE WIPO information: entry into national phase
    Ref document number: 11202307692V; Country of ref document: SG
ENP Entry into the national phase
    Ref document number: 2022717415; Country of ref document: EP; Effective date: 20231122