WO2023139031A1 - Method and system for predicting tcr (t cell receptor)-peptide interactions - Google Patents

Method and system for predicting tcr (t cell receptor)-peptide interactions

Info

Publication number
WO2023139031A1
Authority
WO
WIPO (PCT)
Prior art keywords
multimodal
molecules
peptides
tcr
binding
Prior art date
Application number
PCT/EP2023/050900
Other languages
French (fr)
Inventor
Filippo Grazioli
Anja Moesch
Pierre MACHART
Martin Renqiang MIN
Kai Li
Original Assignee
NEC Laboratories Europe GmbH
Nec Laboratories America, Inc
Priority date
Filing date
Publication date
Application filed by NEC Laboratories Europe GmbH, Nec Laboratories America, Inc filed Critical NEC Laboratories Europe GmbH
Publication of WO2023139031A1 publication Critical patent/WO2023139031A1/en


Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30Drug targeting using structural data; Docking or binding prediction
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30Unsupervised data analysis

Definitions

  • the present invention relates to a computer system and a computer-implemented method of predicting an interaction or binding between T cell receptors (TCRs), and peptides presented on the surface of cells, or peptides presented on the surface of cells in complex with major histocompatibility complex (MHC) molecules.
  • T cells monitor the health status of cells by identifying foreign peptides on their surface. Peptides are presented on the surface of cells in complex with major histocompatibility complex (MHC) molecules. Depending on the cell type, two different MHC molecules are found: class I and class II.
  • the binding of TCRs with peptide-MHC (pMHC) complexes - also known as TCR recognition - constitutes a necessary step for immune response. Only if TCR recognition takes place can cytokines be released, leading to the death of a target cell. Understanding the rules that govern TCR recognition represents a fundamental step towards the development of personalized and more effective cancer treatments and vaccines.
  • the aforementioned object is accomplished by a computer-implemented method of predicting an interaction or binding between T cell receptors, TCRs, and peptides presented on the surface of cells or between peptides presented on the surface of cells and major histocompatibility complex, MHC, molecules, the method comprising a training stage, including: a) providing a training dataset of samples, wherein each sample comprises a multimodal tuple of molecules, and wherein each sample has assigned a ground truth label; b) inferring, for each sample of the training dataset using parametric encoders, a unimodal posterior distribution over latent encodings conditioned by the respective input molecules, and combining the parameters of the inferred unimodal posteriors in a single matrix; c) implicitly learning dependencies among the inferred unimodal posteriors by applying a trainable parametric function leveraging multi-head self-attention onto the matrix and, based thereupon, approximating a multimodal joint posterior distribution over latent encodings
  • the present disclosure provides a system and a method for predicting the interaction between T cell receptors (TCRs) and peptides presented on the surface of cells.
  • the system is configured to approximate unimodal posterior distributions over latent encodings conditioned by the input molecules.
  • the system is configured to aggregate the predicted parameters of the posteriors in a single matrix.
  • the system is configured to approximate a multimodal posterior distribution over latent encodings, which is then used to estimate the probability of binding between TCR and peptides. Due to the multimodal nature of the system, the invention can account for an arbitrarily large set of heterogeneous input modalities.
  • M unimodal posterior distributions over latent encodings conditioned by the respective input molecules may be determined.
  • the parameters of the posteriors may be approximated and stored in a vector.
  • the M vectors of the parameters of the M unimodal posteriors may be combined (e.g. stacked) in a matrix which contains them all.
  • the matrix may serve as input to a trainable parametric function which leverages multi-head self-attention.
  • This function allows the various unimodal posteriors to attend to each other, estimates their relative importance and implicitly learns inter-dependencies among them.
  • this mechanism is sometimes referred to as AoD (Attention-of-Distributions).
  • a multimodal posterior distribution over latent encodings is eventually approximated and output by the system.
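The stacking-and-attention pipeline described in the preceding steps can be sketched as follows. This is a minimal single-head NumPy illustration with random, untrained weights; the actual embodiment uses trainable multi-head attention, and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_of_distributions(P, Wq, Wk, Wv):
    """Let the M unimodal posteriors attend to each other (single head for brevity)."""
    Q, K, V = P @ Wq, P @ Wk, P @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # M x M relative importance of modalities
    return weights @ V                                 # attended parameter matrix, M x 2*d_z

# Three modalities (e.g. peptide, CDR3-alpha, CDR3-beta), bottleneck size d_z = 4.
M, d_z = 3, 4
# Each unimodal encoder yields a parameter vector (mean, log-variance) of size 2*d_z;
# the M vectors are stacked into a single matrix, as described above.
P = np.stack([rng.normal(size=2 * d_z) for _ in range(M)])
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(2 * d_z, 2 * d_z)) for _ in range(3))
attended = attention_of_distributions(P, Wq, Wk, Wv)
# Pool the attended rows into the parameters of one multimodal joint posterior.
mu_joint, log_var_joint = np.split(attended.mean(axis=0), 2)
```

The mean-pooling at the end is one simple way to reduce the attended matrix to a single posterior; the embodiment may use a different reduction.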
  • the method may include a training stage, in which the system is trained, for instance by performing the following steps:
  • Each sample may consist of a multimodal tuple of molecules, e.g. (peptide, CDR3α, CDR3β) for TCR-peptide interaction prediction, or (peptide, MHC) for peptide-MHC binding prediction.
  • a ground truth label is provided for each sample.
  • the method may include an inference stage in which inference is performed with the pre-trained system. Inference may be performed by executing steps 2-4 above.
  • the methods and systems disclosed herein can be compared with state-of-the-art approaches in two different ways.
  • First, the invention can be compared with general- purpose methods for approximating multimodal joint posteriors in deep variational inference (i.e. without focus on the specific biomedical problems disclosed herein).
  • Second, the invention can be compared with state-of-the-art methods for multimodal TCR-peptide binding prediction.
  • a fundamental step in deep multimodal variational inference consists in approximating a distribution of a latent variable Z (encoding), conditioned by all observed input modalities.
  • Grazioli et al.: “Microbiome-based disease prediction with multimodal variational information bottlenecks", PLOS Computational Biology, 2022, operates a so-called product of experts (PoE), which approximates the variational multimodal joint posterior as a product of M unimodal posterior distributions: q(z | x_1, …, x_M) ∝ p(z) ∏_{m=1}^{M} q(z | x_m).
  • This approximation is only valid under the assumption of conditional independence between the modalities, which is not always true in real-world settings.
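For diagonal Gaussian experts, the PoE combination used by MVAE and MVIB has a closed form in which precisions add; a standard-normal prior expert is included, following Wu and Goodman. The sketch below is illustrative:

```python
import numpy as np

def product_of_experts(mus, variances):
    """Closed-form product of M diagonal Gaussian experts and a N(0, I) prior expert."""
    mus = np.vstack([np.zeros_like(mus[0])] + list(mus))
    variances = np.vstack([np.ones_like(variances[0])] + list(variances))
    precisions = 1.0 / variances
    joint_var = 1.0 / precisions.sum(axis=0)            # precisions of the experts add up
    joint_mu = joint_var * (precisions * mus).sum(axis=0)
    return joint_mu, joint_var

# Two agreeing unimodal posteriors N(1, 1): the joint mean is pulled towards the prior at 0.
mu, var = product_of_experts([np.ones(2), np.ones(2)], [np.ones(2), np.ones(2)])
```

Note that the combination weights are fixed entirely by the experts' variances: no inter-modality dependencies can be learned, which is exactly the limitation the AoD mechanism addresses.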
  • Embodiments of the present invention based on the attention of distributions (AoD) make it possible to learn relationships of conditional dependence among the various data modalities in a data-driven fashion.
  • An AoD module can, in fact, be trained to weight how much a given modality shall attend to the other ones. In presence of sufficient data, this allows for a more flexible approximation of the multimodal joint posterior.
  • A. Montemurro et al.: “NetTCR-2.0 enables accurate prediction of tcr-peptide binding by using paired tcr alpha and beta sequence data", in Communications Biology, 4(1):1-13, 2021, proposes NetTCR-2.0, a deep learning model which predicts TCR-peptide binding by jointly analyzing peptide, CDR3α and CDR3β chains.
  • NetTCR-2.0 has the drawback of not learning a latent probability distribution, but just a mapping from inputs to a predicted binding score. This means there is no way to assess uncertainty.
  • the approach according to the present disclosure makes it possible to observe the contribution of each modality to the predicted score, as a unimodal posterior is learned for each input modality.
  • Fig. 1 is a diagram schematically illustrating the general concept of an attention of distribution (AoD) mechanism for Gaussian posteriors
  • Fig. 2 is a schematic view illustrating an architecture of stochastic encoders deployed in accordance with embodiments of the present disclosure
  • Fig. 3 is a schematic view illustrating a trimodal attentive variational information bottleneck (AVIB) architecture in accordance with embodiments of the present disclosure
  • Fig. 4 is a diagram illustrating TCR-peptide distributions for datasets used in accordance with embodiments of the present disclosure
  • Fig. 5 is a diagram illustrating a length distribution of the amino acid chains for datasets used in accordance with embodiments of the present disclosure
  • Fig. 6 is a diagram illustrating a class distribution of the α+β and β datasets used in accordance with embodiments of the present disclosure
  • Fig. 7 is a table showing TCR-peptide interaction prediction results of experiments performed in accordance with embodiments of the present disclosure
  • Fig. 8 is a table showing multimodal posterior approximation results of experiments performed in accordance with embodiments of the present disclosure.
  • the TCR is a heterodimeric protein, which consists of an α- and a β-chain.
  • the structure of the α- and β-chains determines the interaction with the pMHC complex.
  • Each chain consists of three loops, referred to as complementarity determining regions 1, 2 and 3 (CDR1-3).
  • the genomic recombination of variable, diversity, and joining (V, D, J) TCR-genes determines the diversity of the CDR3s.
  • the β-chain (CDR3β) is the result of a recombination of the V, D and J genes, while the α-chain (CDR3α) derives from the recombination of the V and J genes. This implies that the CDR3βs present higher variability.
  • TCRdist (cf. P. Dash et al.: “Quantifiable predictive features define epitope-specific t cell receptor repertoires”, in Nature, 547(7661 ):89-93, 2017) computes CDR similarity-weighted distances.
  • SETE (cf. Y. Tong et al.: “Sequence-based ensemble learning approach for tcr epitope binding prediction", in Computational Biology and Chemistry, 87:107281, 2020) adopts k-mer feature spaces in combination with principal component analysis (PCA) and decision trees.
  • Various methods adopt Random Forest to operate classification (cf., e.g., N.
  • ERGO (cf. I. Springer et al.: “Prediction of specific tcr-peptide binding from large dictionaries of tcr-peptide pairs", in Frontiers in Immunology, 11:1803, 2020) is a deep learning approach that adopts long short-term memory (LSTM) networks and autoencoders to compute representations of the peptide and CDR3β.
  • the present disclosure provides a system and a method for predicting the interaction between T cell receptors (TCRs) and peptides presented on the surface of cells.
  • the system may first approximate unimodal posterior distributions over latent encodings conditioned by the input molecules. Second, the system may aggregate the predicted parameters of the posteriors in a single matrix.
  • the system may approximate a multimodal posterior distribution over latent encodings, which may then be used to estimate the probability of binding between TCR and peptides. Due to the multimodal nature of the system, embodiments of the present disclosure can account for an arbitrarily large set of heterogeneous input modalities, as will be described in detail for some use cases below.
  • Embodiments of the present disclosure use a variant of the Variational Information Bottleneck (VIB) approach, as disclosed in Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K.: “Deep variational information bottleneck”, in arXiv preprint arXiv: 1612.00410, 2016, which is hereby incorporated by reference herein.
  • VIB leverages variational inference to construct a lower bound on the Information Bottleneck (IB) objective (as disclosed in Tishby, N., Pereira, F. C., and Bialek, W.: “The information bottleneck method”, in arXiv preprint physics/0004057, 2000, which is hereby incorporated by reference herein).
  • X, Y,Z are random variables
  • x,y,z are multidimensional instances of random variables
  • S represents a set.
  • Y be a random variable representing a ground truth label associated with an input random variable X.
  • Z be a stochastic encoding of X coming from an intermediate layer of a deep neural network and defined by a parametric encoder p(z | x; θ) representing the upstream part of such a neural model.
  • the goal according to the embodiments of the present disclosure consists in learning an encoding Z which is (a) maximally informative about Y and (b) maximally compressive about X.
  • objective (a) implies maximizing the mutual information I(Z, Y; θ) between the encoding Z and the target Y, where I(Z, Y; θ) = ∫ p(z, y; θ) log [ p(z, y; θ) / (p(z; θ) p(y)) ] dz dy (Equation 1), and objectives (a) and (b) are combined in the IB objective R_IB(θ) = I(Z, Y; θ) − β I(Z, X; θ) (Equation 2).
  • the first term on the right-hand side of Equation 2 causes Z to be predictive of Y, while the second term constrains Z to be a minimal sufficient statistic of X. β controls the trade-off between (a) and (b), and I(Z, Y; θ) is the mutual information between Z and Y parameterized by θ.
  • Equation 2 can be rewritten as the VIB objective J_VIB = (1/N) Σ_{n=1}^{N} E_{ε ∼ p(ε)} [ −log q(y_n | f(x_n, ε)) ] + β KL( p(Z | x_n) ‖ q(Z) ) (Equation 3), where ε is an auxiliary Gaussian noise variable, KL(·‖·) is the Kullback-Leibler divergence and f is a vector-valued parametric deterministic encoding function (e.g., in the context of the present disclosure, a neural network).
  • the introduction of ε constitutes the reparameterization trick (cf. Kingma, D. P. and Welling, M.: “Auto-encoding variational bayes", arXiv preprint arXiv:1312.6114, 2013), which allows writing p(z | x) dz = p(ε) dε, where z = f(x, ε) is now treated as a deterministic variable.
  • This formulation allows the noise variable to be independent of the model parameters.
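The reparameterization trick can be illustrated with a short NumPy sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_z(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): the noise is drawn independently of
    the model parameters, so gradients can flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu, log_var = np.array([0.0, 2.0]), np.zeros(2)   # sigma = 1 in both dimensions
samples = np.stack([sample_z(mu, log_var, rng) for _ in range(20000)])
```

Averaging many samples recovers mu, confirming that the stochasticity lives entirely in the auxiliary noise variable.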
  • the VIB objective of Equation 3 can be generalized by representing X as a collection of multimodal random input variables (or multiple input sequences)
  • the posterior of Equation 3 is actually the joint posterior p(z | x_1, …, x_M), conditioned by the M jointly available sequences.
  • the M different sequences cannot be simply treated as M different modalities.
  • the Multimodal Variational Autoencoder MVAE (cf. Wu, M. and Goodman, N.: “Multimodal generative models for scalable weakly-supervised learning”, arXiv preprint arXiv: 1802.05335, 2018) and MVIB approximate the joint posterior assuming that the M modalities are conditionally independent, given the common latent variable Z.
  • Fig. 1 is a diagram schematically illustrating the general concept of an Attention of Distribution (AoD) mechanism 100 (herein sometimes also referred to as AoE, Attention of Experts) for Gaussian posteriors according to embodiments of the present disclosure.
  • q(z | x_m; θ_m) is the stochastic Gaussian encoder 101 of the m-th modality.
  • the unimodal posteriors are modelled as Gaussian distributions with diagonal structure: e.g., following the proposals provided in Wu, M. and Goodman, N.: “Multimodal generative models for scalable weakly-supervised learning”, in arXiv preprint arXiv:1802.05335, 2018 or in Grazioli, F., Siarheyeu, R., Pileggi, G., and Meiser, A.: “Microbiome-based disease prediction with multimodal variational information bottlenecks”, in PLOS Computational Biology, 2022, which are both hereby incorporated by reference herein.
  • the dependencies between the M single-sequence posteriors and the multi-sequence joint posterior are implicitly learned by means of multi-head self-attention (as shown at 104 in Fig. 1, including the AoD module), leveraging its power of capturing multiple complex interactions in X and allowing for possibly missing sequences, in particular in accordance with the approach disclosed in A. Vaswani et al.: “Attention Is All You Need", NeurIPS 2017: AoD(P) = MultiHead(P, P, P), where P is the matrix obtained by stacking the parameters of the M unimodal posteriors.
  • MultiHead is the standard multi-head attention block defined in Vaswani et al.
  • the above equation is referred to as Attention of Distributions (AoD) in the context of the present disclosure.
  • a multi-sequence VIB which adopts AoD for modelling the multi-sequence joint posterior is referred to herein as the Attentive Variational Information Bottleneck (AVIB).
  • the AVIB objective is the objective of Equation 3 generalized to multiple input sequences, where the multi-sequence posterior p(Z | x_1, …, x_M) is modelled by means of AoD.
  • Embodiments of the present disclosure provide computer-implemented methods and systems that are configured to apply the Attentive Variational Information Bottleneck (AVIB) for tackling a fundamental problem of immuno-oncology: predicting TCR-peptide interactions. It can be demonstrated that AVIB significantly outperforms existing baseline methods.
  • AVIB may be implemented as a multimodal generalization of the Variational Information Bottleneck (VIB) capable of learning a joint representation of multiple input data modalities through self-attention.
  • the input modalities may include the peptide, the CDR3α, and the CDR3β.
  • the model may learn to predict whether the binding between peptide and the TCR takes place or not.
  • the present disclosure provides methods and systems that utilize an approach for combining unimodal posteriors in a joint multimodal posterior using self-attention, termed Attention of Distribution (AoD) herein.
  • the present disclosure provides methods and systems that use a simple, hyperparameter-free approach for the detection of out-of-distribution (OOD) amino acid chains, which leverages the multimodal posterior distribution over latent encodings learned by AVIB.
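As one concrete possibility for such a hyperparameter-free score (an assumption for illustration, not necessarily the exact embodiment), the divergence of the inferred posterior from the latent prior N(0, I) can serve as an OOD score, since it has a closed form for diagonal Gaussians:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ); larger values suggest the encoding
    lies far from the prior, i.e. a potentially out-of-distribution sample."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

in_dist = kl_to_standard_normal(np.zeros(4), np.zeros(4))      # posterior equals the prior
far_out = kl_to_standard_normal(np.full(4, 3.0), np.zeros(4))  # mean far from the prior
```

Because the score is a closed-form function of the posterior parameters the model already produces, it introduces no additional hyperparameters.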
  • for each input modality, the dedicated stochastic encoder 101 may have the form q(z | x; θ) = N(z; μ(x), σ²(x)), where μ(x) and σ²(x) are the two output branches 105 of the neural stochastic encoder 101, as schematically illustrated in Fig. 2.
  • the neural stochastic encoder 101 of Fig. 2 presents an architecture which is inspired by NetTCR-2.0 (cf. A. Montemurro et al.: “NetTCR-2.0 enables accurate prediction of tcr-peptide binding by using paired tcrα and β sequence data", in Communications Biology, 4(1):1-13, 2021). It encodes the peptides by using BLOSUM50 encoding 110 and operates 1D convolutions 114 of the encoded peptides 112 with kernel 116 sizes 1, 3, 5, 7 and 9. After the convolutions 114, 1D max pooling 118 is operated, followed by ReLU activation functions 120. The obtained vectors are then concatenated, as shown at 122.
  • two parallel fully-connected layers 105 output two vectors of size d_z for μ and σ², where d_z is the size of the bottleneck, i.e. the dimension of Z. d_z can be set via hyperparameter optimization. For a more stable computation, one may let the second branch model the logarithm of the variance, log σ².
  • the corresponding decoder consists of a simple neural network with fully-connected layers and ReLU activations. This implements the binary classification.
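The stochastic encoder described above (BLOSUM-encoded input, parallel 1D convolutions with kernel sizes 1, 3, 5, 7 and 9, max pooling, ReLU, concatenation, and two parallel heads for μ and log σ²) can be sketched with NumPy. Single-output-channel convolutions and random, untrained weights are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d_valid(x, kernel):
    """Valid 1D convolution of an encoded sequence x (C x N) with one C x k kernel."""
    n, k = x.shape[1], kernel.shape[1]
    return np.array([(x[:, i:i + k] * kernel).sum() for i in range(n - k + 1)])

def stochastic_encoder(x, d_z, rng):
    feats = []
    for k in (1, 3, 5, 7, 9):                    # parallel convolution branches
        kernel = rng.normal(scale=0.1, size=(x.shape[0], k))
        pooled = conv1d_valid(x, kernel).max()   # 1D max pooling
        feats.append(max(0.0, pooled))           # ReLU activation
    h = np.array(feats)                          # concatenation of the branch outputs
    W_mu = rng.normal(size=(d_z, h.size))
    W_lv = rng.normal(size=(d_z, h.size))
    return W_mu @ h, W_lv @ h                    # two parallel heads: mu and log-variance

x = rng.normal(size=(20, 12))                    # a BLOSUM-encoded chain of 12 residues
mu, log_var = stochastic_encoder(x, d_z=4, rng=rng)
```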
  • Fig. 3 is a schematic view illustrating a trimodal Attentive Variational Information Bottleneck (AVIB) system in accordance with embodiments of the present disclosure.
  • the system comprises parametric encoders 101 , the AoD (or AoE) mechanism 104 described in detail above, and parametric decoders 124.
  • the latent prior may be treated as a d_z-dimensional spherical Gaussian distribution.
  • in an embodiment, the trade-off parameter β may be set via hyperparameter optimization.
  • the networks may be trained using the Adam or Stochastic Gradient Descent optimizers.
  • the optimizer may implement a learning rate of 10⁻³ and L2 weight decay. The batch size may be set to 4096. A drop-out rate of 0.3 may be used at training time.
  • a cosine annealing learning rate scheduler with a period of 10 epochs may be adopted (cf. I. Loshchilov et al.: “SGDR: Stochastic gradient descent with warm restarts", arXiv preprint arXiv:1608.03983, 2016).
  • the training may be performed for 200 epochs and, in order to avoid overfitting, the best model may be selected by saving the weights corresponding to the epoch where the AUROC is maximum on the validation set.
  • the validation set may be obtained via 80/20 stratified random split of the training set. Training and test sets may be obtained via a non-stratified 80/20 split of the whole dataset. Experiments may be repeated 5 times with different training/test splits to ensure unbiased performance evaluation.
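The cosine annealing schedule just described, with a 10-epoch period and base learning rate 10⁻³, can be computed as follows. This is a pure-Python sketch of the warm-restart formula; the cited work additionally supports decaying restarts, which are omitted here:

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-3, period=10):
    """Learning rate restarts to base_lr every `period` epochs and decays
    along a half cosine in between (warm restarts)."""
    t = epoch % period
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / period))
```

For example, the rate starts at 10⁻³, reaches half that value midway through each period, and restarts at every multiple of 10 epochs.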
  • MIRA set is publicly available in the NetTCR-2.0 repository: https://github.com/mnielLab/NetTCR-2.0/tree/main/data.
  • 271,366 human TCR-peptide samples were available. The data were organized by creating the following datasets: α+β set: 117,753 samples out of 271,366 present peptide information, together with both CDR3α and CDR3β chains.
  • this subset is referred to as the α+β set.
  • the ground truth label is a binary variable which represents whether binding between peptide and TCR chains takes place.
  • Human TCR set: The totality of the human TCR-peptide data (i.e. α+β set ∪ β set) is referred to as the Human TCR set.
  • Non-human TCR set: In addition to the human TCR data, 5036 non-human TCR samples were extracted from the VDJdb database, which were used as OOD (out-of-distribution) samples. These samples come from mice and macaques and present peptide and CDR3β information. These samples are referred to as the Non-human TCR set.
  • Human MHC set: A second set of OOD samples was created, composed of 463,684 peptide-MHC pairs.
  • the peptide sequences are taken from the Human TCR set, i.e. the peptide information is shared between the in-distribution and OOD sets.
  • the MHC sequences are amino acid chains corresponding to human MHC alleles (The MHC sequences were extracted from the PUFFIN repository: https://github.com/gifford-lab/PUFFIN). These samples are referred to as Human MHC set.
  • Fig. 4 depicts the distributions of the human TCR data, i.e. peptide, CDR3α and CDR3β distributions, for both the α+β set and the β set.
  • a point on the x- axis represents one unique chain of amino acids.
  • the y-axis represents how many samples present that specific chain. Samples are sorted by count considering the α+β set. It is possible to observe that the two datasets have similar peptide distributions, but present different CDR3β sequences.
  • Fig. 5 depicts a length distribution of the datasets used in the context of the present disclosure.
  • the length is the number of amino acids which constitute the peptides, CDR3α, CDR3β and MHC molecules. All these types of molecules are sequences of amino acids.
  • Fig. 6 depicts the class distributions of the β set and the α+β set.
  • the class distributions of the Non-human TCR set and Human MHC set are not reported, as they were only used for OOD (out-of-distribution) detection experiments, and not for TCR-peptide interaction prediction.
  • TCR data presents many more non-binding samples (0), compared to binding ones (1).
  • This class imbalance may be handled by means of balanced batch sampling, i.e. when a batch is sampled at training time, it may be ensured that the numbers of binding and non-binding samples are essentially equal.
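Balanced batch sampling can be sketched with the standard library as follows; the sample pools below are hypothetical stand-ins for the actual datasets:

```python
import random

def balanced_batch(binding, non_binding, batch_size, rng):
    """Draw a batch with essentially equal numbers of binding (1) and non-binding (0)
    samples, regardless of how imbalanced the underlying pools are."""
    half = batch_size // 2
    batch = rng.choices(binding, k=half) + rng.choices(non_binding, k=batch_size - half)
    rng.shuffle(batch)
    return batch

# Heavily imbalanced pools of (sample_id, label) pairs, as in the TCR data.
binding = [(i, 1) for i in range(10)]
non_binding = [(i, 0) for i in range(1000)]
batch = balanced_batch(binding, non_binding, batch_size=8, rng=random.Random(0))
```

Sampling with replacement from the minority class is one simple design choice; per-class sampling without replacement across an epoch is a common alternative.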
  • Peptides, CDR3α and CDR3β chains are sequences of amino acids.
  • the 20 amino acids translated by the genetic code are in general represented as English alphabet letters.
  • the amino acid sequences are pre-processed using BLOSUM50 encodings (cf. Henikoff, S. and Henikoff, J. G.: “Amino acid substitution matrices from protein blocks", in Proceedings of the National Academy of Sciences, 89(22):10915-10919, 1992).
  • This allows representing a sequence of N amino acids as a 20 × N matrix.
  • the elements of the matrix are integers that represent similarity scores among amino acids.
  • feature standardization may be operated by removing the mean and scaling to unit variance.
  • zero-padding may be operated after the BLOSUM50 encoding. This ensures that all matrices have shape 20 × N_max, where N_max is the length of the longest chain.
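The encoding-plus-padding step can be sketched as follows. The three-letter score table below is a hypothetical stand-in for the full 20×20 BLOSUM50 matrix, so the numbers are illustrative only:

```python
import numpy as np

# Hypothetical 3-letter stand-in for the 20x20 BLOSUM50 substitution matrix.
ALPHABET = ["A", "R", "N"]
SCORES = {
    "A": {"A": 5, "R": -2, "N": -1},
    "R": {"A": -2, "R": 7, "N": -1},
    "N": {"A": -1, "R": -1, "N": 7},
}

def encode_chain(seq, n_max):
    """One substitution-score column per residue, zero-padded to the longest chain length."""
    cols = np.array([[SCORES[a][s] for a in ALPHABET] for s in seq], dtype=float).T
    padded = np.zeros((len(ALPHABET), n_max))
    padded[:, :cols.shape[1]] = cols
    return padded

enc = encode_chain("ARN", n_max=5)   # shape (3, 5): 3 padding-free columns, 2 zero columns
```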
  • AVIB is benchmarked against ERGO II (cf. Springer et al.: “Prediction of specific tcr-peptide binding from large dictionaries of tcr-peptide pairs", in Frontiers in Immunology, 11:1803, 2020) and NetTCR-2.0 (cf. Montemurro et al.: “NetTCR-2.0 enables accurate prediction of tcr-peptide binding by using paired tcrα and β sequence data", in Communications Biology, 4(1):1-13, 2021). Additionally, AVIB is benchmarked against the LUPI-SVM (cf.
  • the method according to embodiments disclosed herein obtains ~4% higher AUROC and ~8% higher AUPR compared to the best baseline, ERGO II.
  • AVIB outperforms ERGO II by achieving ~3% higher AUROC and ~4% higher AUPR.
  • AVIB compares with ERGO II.
  • in the trimodal setting, when also considering the α-chain, AVIB obtains ~1% higher AUROC, ~4% higher AUPR and ~6% higher F1 score.
  • This section provides a comparison of two techniques for approximating Gaussian joint posteriors: AoD as described herein, which is employed by AVIB, and PoE, which is employed by MVIB (cf. Grazioli et al.: “Microbiome-based disease prediction with multimodal variational information bottlenecks", PLOS Computational Biology, 2022).
  • Experiments and benchmark were performed in the bimodal and trimodal settings.
  • Table 2 shown in Fig. 8 presents the TCR-peptide interaction prediction results on the α+β set.
  • AoD achieves best results in both the bimodal and trimodal settings.
  • the AUPR score, in particular, improves on PoE by up to 2%.
  • Tumors harbor mutations, some of which can be recognized by the patients' TCRs.
  • Therapeutic cancer vaccines consist of patient- and therefore tumor-specific mutations meant to stimulate the patient’s immune response against their own tumor.
  • A key question is whether the patient's T cell receptor repertoire is able to recognize the immunogenic mutations (neoantigens), i.e. whether there are TCRs present that are able to bind to the associated peptides. Being able to reliably predict whether TCRs bind to neoantigens can help to improve the selection process for neoantigen candidates that will be used in a therapeutic cancer vaccine.
  • Tumor antigens shared by patients are a highly interesting target for cancer immunotherapy, especially adoptive T cell therapy, where a tumor-recognizing TCR is introduced in the patient’s T cells.
  • These TCRs need to fulfil demanding requirements regarding their safety and efficacy, as shared antigens are usually not strictly tumor-specific. Therefore, prediction of TCR-peptide interaction allows the modelling and engineering of suitable TCRs, which can be evaluated for their safety and efficacy before any wet lab experiments are necessary.
  • TCRs can be used to target tumors or other types of cells depending on the MHC presented peptides.
  • TCRs usually do not bind to one single peptide but to multiple peptides.
  • the system may also be used to predict peptide-MHC binding and presentation.
  • the set of MHC molecules may be taken from a specific patient, and the set of peptides may be based on mutations present in the cancerous cells of the patient. For all (peptide, MHC) combinations, predictions are made.
  • the first modality is the peptide
  • the second modality is the MHC molecule.
  • the predicted scores can represent either the likelihood of having binding between the two molecules, or the likelihood that the peptide-MHC complex is presented on the surface of the cell.
  • the peptides with the highest likelihood of being presented are good candidates to be included in a personalized cancer vaccine for that patient.
  • the predicted candidates are the output of the peptide-MHC interaction prediction system. In modern cancer vaccine design systems, these candidates may then be processed by further downstream modules, e.g. to predict TCR recognition, or to filter out peptides which could trigger an autoimmune response.
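The selection step over all (peptide, MHC) combinations can be sketched as follows, with a hypothetical `predict_presentation` score standing in for the trained model's output:

```python
from itertools import product

def rank_candidates(peptides, mhcs, predict_presentation, top_k=3):
    """Score every (peptide, MHC) combination and keep the pairs most likely presented."""
    scored = [(p, m, predict_presentation(p, m)) for p, m in product(peptides, mhcs)]
    return sorted(scored, key=lambda t: t[2], reverse=True)[:top_k]

# Toy stand-in score: letter overlap between peptide and an MHC pseudo-sequence.
def toy_score(peptide, mhc):
    return len(set(peptide) & set(mhc)) / len(set(peptide) | set(mhc))

candidates = rank_candidates(["SIINFEKL", "GILGFVFTL"], ["HLA-A0201", "HLA-B0702"], toy_score)
```

The top-ranked pairs would then be handed to the downstream modules, e.g. TCR-recognition prediction or autoimmunity filtering.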
  • AoE can be used as a building block of a broad spectrum of multimodal learning systems, beyond the TCR-specific setting, and also with non-molecular data. Approximating a multimodal posterior from heterogeneous unimodal posteriors can in fact be particularly helpful in sensor fusion for combining various types of sensor data. RGB, infrared and depth images, as well as radar and point cloud information can be combined using AoE to detect, segment and/or classify objects in an environment with higher confidence.
  • the output of the sensor fusion module is provided to, e.g., the planning algorithms, or to additional scene understanding algorithms.
  • the system proposed in the present disclosure can be used to predict if neoantigen vaccine candidates are in fact recognized by a patient’s T cells.
  • most existing immune profiling approaches only leverage a so-called distance-from-self score to predict whether a presented neoantigen is in fact recognized by specific T cells. This is based on the optimistic assumption that if a neoantigen is sufficiently different from the healthy genome of the patient, the corresponding neoepitopes will be recognized by some T cell. This is however not always true in practice.
  • predicting the binding between peptides and TCRs in accordance with the methods disclosed herein can be a potential key factor in the development of more effective therapies.
  • embodiments of the present disclosure provide a multimodal generalization of the Variational Information Bottleneck (VIB), which is sometimes briefly denoted AVIB herein and which leverages multi-head self-attention to implicitly approximate the posterior distribution over latent encodings conditioned by multiple input modalities.
  • AVIB may be applied to the TCR-peptide interaction prediction problem, an important challenge in immuno-oncology. It has been shown that the methods disclosed herein significantly improve on the baselines ERGO II and NetTCR-2.0.


Abstract

The invention presents a system and a method for predicting the interaction between T cell receptors (TCRs) and peptides presented on the surface of cells. First, the system is configured to approximate unimodal posterior distributions over latent encodings conditioned by the input molecules. Second, the system is configured to aggregate the predicted parameters of the posteriors in a single matrix. Third, using a mechanism based on multi-head self-attention, the system is configured to approximate a multimodal posterior distribution over latent encodings, which is then used to estimate the probability of binding between TCR and peptides. Due to the multimodal nature of the system, the invention can account for an arbitrarily large set of heterogeneous input modalities.

Description

METHOD AND SYSTEM FOR PREDICTING TCR (T CELL RECEPTOR)-PEPTIDE INTERACTIONS
The present invention relates to a computer system and a computer-implemented method of predicting an interaction or binding between T cell receptors (TCRs), and peptides presented on the surface of cells, or peptides presented on the surface of cells in complex with major histocompatibility complex (MHC) molecules.
In the human immune system, T cells monitor the health status of cells by identifying foreign peptides on their surface. Peptides are presented on the surface of cells in complex with major histocompatibility complex (MHC) molecules. Depending on the cell type, two different MHC molecules are found: class I and class II. The T cell receptors (TCRs) allow the recognition of the anomalous presented peptides. The binding of TCRs with peptide-MHC (pMHC) complexes - also known as TCR recognition - constitutes a necessary step for immune response. Only if TCR recognition takes place can cytokines be released, leading to the death of a target cell. Understanding the rules that govern TCR recognition represents a fundamental step towards the development of personalized and more effective cancer treatments and vaccines.
It is an object of the present invention to improve and further develop a method and system of the initially described type in such a way that precise and effective predictions of binding between T cell receptors (TCRs) and peptides presented on the surface of cells are possible. This problem, in the computational biology literature, is often referred to as TCR-peptide interaction (or binding) prediction. If the major histocompatibility complex (MHC) is also considered together with the peptide (pMHC), the present disclosure refers to TCR-pMHC binding prediction.
In accordance with the invention, the aforementioned object is accomplished by a computer-implemented method of predicting an interaction or binding between T cell receptors, TCRs, and peptides presented on the surface of cells or between peptides presented on the surface of cells and major histocompatibility complex, MHC, molecules, the method comprising a training stage, including: a) providing a training dataset of samples, wherein each sample comprises a multimodal tuple of molecules, and wherein each sample has assigned a ground truth label; b) inferring, for each sample of the training dataset using parametric encoders, a unimodal posterior distribution over latent encodings conditioned by the respective input molecules, and combining the parameters of the inferred unimodal posteriors in a single matrix; c) implicitly learning dependencies among the inferred unimodal posteriors by applying a trainable parametric function leveraging multi-head self-attention onto the matrix and, based thereupon, approximating a multimodal joint posterior distribution over latent encodings; d) sampling the approximated multimodal joint posterior distribution and using parametric decoders to predict ground truth labels for the sampled inputs; and e) updating the parameters of the encoders, of the function leveraging multi-head self-attention and of the decoders by minimizing an objective function that accounts for the error between the ground truth labels and the predictions obtained by step d).
The present disclosure provides a system and a method for predicting the interaction between T cell receptors (TCRs) and peptides presented on the surface of cells. First, the system is configured to approximate unimodal posterior distributions over latent encodings conditioned by the input molecules. Second, the system is configured to aggregate the predicted parameters of the posteriors in a single matrix. Third, using a mechanism based on multi-head self-attention, the system is configured to approximate a multimodal posterior distribution over latent encodings, which is then used to estimate the probability of binding between TCR and peptides. Due to the multimodal nature of the system, the invention can account for an arbitrarily large set of heterogeneous input modalities.
According to embodiments, for each input tuple of M molecules, e.g. (peptide, CDR3β, CDR3α, MHC), by means of M derivable parametric encoders (i.e. each data modality has a dedicated encoder), M unimodal posterior distributions over latent encodings conditioned by the respective input molecules may be determined. For each input molecule, the parameters of the posteriors may be approximated and stored in a vector. The M vectors of the parameters of the M unimodal posteriors may be combined (e.g. stacked) in a matrix which contains them all. According to embodiments, the matrix may serve as input to a trainable parametric function which leverages multi-head self-attention. This function allows the various unimodal posteriors to attend to each other, estimates their relative importance and implicitly learns inter-dependencies among them. In the present disclosure, this mechanism is sometimes referred to as AoD (Attention-of-Distributions). A multimodal posterior distribution over latent encodings is eventually approximated and output by the system.
According to embodiments, the method may include a training stage, in which the system is trained, for instance by performing the following steps:
1. Collecting a dataset of samples. Each sample may consist in a multimodal tuple of molecules, e.g. (peptide, CDR3β, CDR3α) for TCR-peptide interaction prediction, or (peptide, MHC) for peptide-MHC binding prediction. For each sample, a ground truth label is provided.
2. For each sample in the dataset, inferring a unimodal posterior distribution over latent encodings, conditioned by the input molecule. The parameters of the inferred posterior may be combined in a single matrix.
3. Using AoD to approximate a multimodal posterior.
4. Sampling from the multimodal posterior and inferring the ground truth label associated to the input by means of a parametric decoder.
5. Updating the parameters of the encoders, AoD and decoder by minimizing an objective function which accounts for the error between ground truth label and predictions of step 4.
6. Repeating steps 2-5 until convergence of a validation metric.
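The forward pass of steps 2-4 can be sketched as follows. This is a minimal NumPy illustration only: simple linear maps stand in for the parametric encoders, and a plain mean stands in for the attention-based AoD aggregation of step 3; all function names and shapes are illustrative, not the disclosure's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Step 2: a toy linear stand-in for a parametric encoder, producing the
    parameters of a unimodal diagonal-Gaussian posterior (the disclosure
    uses CNN encoders; see Fig. 2)."""
    return x @ W_mu, x @ W_logvar

def aggregate(params):
    """Step 3 stand-in: a plain mean over the unimodal posterior parameters.
    The disclosure uses the attention-based AoD module here instead."""
    mus, logvars = zip(*params)
    return np.mean(mus, axis=0), np.mean(logvars, axis=0)

def predict_binding(xs, enc_weights, w_dec):
    """Steps 2-4: encode each modality, aggregate, sample the multimodal
    posterior via the reparameterization trick, decode to a probability."""
    mu, logvar = aggregate([encode(x, *w) for x, w in zip(xs, enc_weights)])
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    return 1.0 / (1.0 + np.exp(-(z @ w_dec)))  # sigmoid "decoder"
```

In step 5, the gradients of the objective would then be backpropagated through this forward pass to update the encoder, aggregation and decoder parameters.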
According to embodiments, the method may include an inference stage in which inference is performed with the pre-trained system. Inference may be performed by executing steps 2-4 above.
The methods and systems disclosed herein can be compared with state-of-the-art approaches in two different ways. First, the invention can be compared with general-purpose methods for approximating multimodal joint posteriors in deep variational inference (i.e. without focus on the specific biomedical problems disclosed herein). Second, the invention can be compared with state-of-the-art methods for multimodal TCR-peptide binding prediction.
Deep Multimodal Variational Inference
A fundamental step in deep multimodal variational inference consists in approximating a distribution of a latent variable Z (encoding), conditioned by all observed input modalities.
Grazioli et al.: “Microbiome-based disease prediction with multimodal variational information bottlenecks”, PLOS Computational Biology, 2022, operates a so-called Product of Experts (PoE), which consists in approximating the variational multimodal joint posterior as a product of a prior and M unimodal posterior distributions:

$$p(z \mid x_1, \ldots, x_M) \propto p(z) \prod_{m=1}^{M} \tilde{q}(z \mid x_m).$$

This approximation is only valid under the assumption of conditional independence between the modalities, which is not always true in real-world settings. Embodiments of the present invention based on the Attention of Distributions (AoD) allow to learn relationships of conditional dependence among the various data modalities in a data-driven fashion. An AoD module can, in fact, be trained to weight how much a given modality shall attend to the other ones. In presence of sufficient data, this allows for a more flexible approximation of the multimodal joint posterior.
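For diagonal Gaussian experts, the PoE combination has a well-known closed form (precision-weighted averaging). The following NumPy sketch, with illustrative function names, combines M unimodal posteriors and a standard-Gaussian prior:

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Combine M unimodal diagonal-Gaussian posteriors, each given by a row
    of `mus`/`logvars`, plus a standard-Gaussian prior N(0, I), into one
    joint Gaussian via a product of experts."""
    # prepend the standard-Gaussian prior N(0, I)
    mus = np.vstack([np.zeros_like(mus[0]), mus])
    logvars = np.vstack([np.zeros_like(logvars[0]), logvars])
    precisions = np.exp(-logvars)             # 1 / sigma^2 per expert
    joint_var = 1.0 / precisions.sum(axis=0)  # product of Gaussians
    joint_mu = joint_var * (mus * precisions).sum(axis=0)
    return joint_mu, joint_var
```

Note that the joint precision is the sum of the unimodal precisions, so every expert can only sharpen the joint posterior, never broaden it; the conditional-independence assumption is baked into this functional form.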
Shi et al.: “Variational mixture-of-experts autoencoders for multi-modal deep generative models”, NeurIPS, 2019 proposes a so-called Mixture of Experts (MoE), i.e. the multimodal joint posterior is approximated as a uniform mixture of unimodal Gaussian distributions:

$$p(z \mid x_1, \ldots, x_M) = \frac{1}{M} \sum_{m=1}^{M} q(z \mid x_m).$$

A possible drawback of this approach compared to AoD according to embodiments of the present disclosure is that the multimodal posterior has a different and more complex distribution compared to the unimodal posteriors (a mixture of Gaussians is not itself Gaussian). This makes the inference step less intuitive.
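Sampling the MoE joint posterior makes the contrast with PoE concrete: one picks a single unimodal expert at random and samples from it, rather than combining all experts into one Gaussian. A minimal NumPy sketch (illustrative names, uniform mixture weights):

```python
import numpy as np

def mixture_of_experts_sample(mus, logvars, rng):
    """Sample the MoE joint posterior: pick one unimodal Gaussian expert
    (one row of `mus`/`logvars`) uniformly at random, then sample from it
    with the reparameterization z = mu + sigma * eps."""
    m = rng.integers(len(mus))
    eps = rng.standard_normal(mus[m].shape)
    return mus[m] + np.exp(0.5 * logvars[m]) * eps
```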
TCR-peptide binding prediction
Montemurro et al.: “Nettcr-2.0 enables accurate prediction of tcr-peptide binding by using paired tcr alpha and beta sequence data”, in Communications Biology, 2021 proposes NetTCR-2.0, a deep learning model which can predict TCR-peptide binding by jointly analyzing the peptide, CDR3α and CDR3β chains. Compared to the methods disclosed herein, NetTCR-2.0 has the drawback of not learning a latent probability distribution, but just a mapping from inputs to a predicted binding score. This means there is no way to assess uncertainty. Additionally, the approach according to the present disclosure allows to observe the contribution of each modality to the predicted score, as a unimodal posterior is learned for each input modality.
There are several ways how to design and further develop the teaching of the present invention in an advantageous way. To this end, it is to be referred to the dependent claims on the one hand and to the following explanation of preferred embodiments of the invention by way of example, illustrated by the figure on the other hand. In connection with the explanation of the preferred embodiments of the invention by the aid of the figure, generally preferred embodiments and further developments of the teaching will be explained. In the drawing
Fig. 1 is a diagram schematically illustrating the general concept of an attention of distribution (AoD) mechanism for Gaussian posteriors,
Fig. 2 is a schematic view illustrating an architecture of stochastic encoders deployed in accordance with embodiments of the present disclosure,
Fig. 3 is a schematic view illustrating a trimodal attentive variational information bottleneck (AVIB) architecture in accordance with embodiments of the present disclosure,
Fig. 4 is a diagram illustrating TCR-peptide distributions for datasets used in accordance with embodiments of the present disclosure,
Fig. 5 is a diagram illustrating a length distribution of the amino acid chains for datasets used in accordance with embodiments of the present disclosure,
Fig. 6 is a diagram illustrating a class distribution of the α + β and β datasets used in accordance with embodiments of the present disclosure,

Fig. 7 is a table showing TCR-peptide interaction prediction results of experiments performed in accordance with embodiments of the present disclosure, and
Fig. 8 is a table showing multimodal posterior approximation results of experiments performed in accordance with embodiments of the present disclosure.
The TCR is a heterodimeric protein, which consists of an α- and a β-chain. The structure of the α- and β-chains determines the interaction with the pMHC complex. Each chain consists of three loops, referred to as complementarity determining regions 1, 2 and 3 (CDR1-3). According to recent studies (for reference, see N. L. La Gruta et al.: “Understanding the drivers of mhc restriction of t cell receptors”, in Nature Reviews Immunology, 18(7):467-478, 2018), it is believed that the CDR3 loops primarily interact with the peptide of a given pMHC complex, while the CDR1 and CDR2 loops interact with the MHC molecule. Hence, the CDR3 loops are primarily responsible for the peptide specificity.
The genomic recombination of variable, diversity, and joining (V, D, J) TCR genes determines the diversity of the CDR3s. The β-chain (CDR3β) is the result of a recombination of the V, D and J genes, while the α-chain (CDR3α) derives from the recombination of the V and J genes. This implies that the CDR3βs present higher variability.
A broad spectrum of recent studies investigates the prediction of TCR-pMHC interactions. Most leverage data from the Immune Epitope Database (IEDB), VDJdb, and McPAS-TCR. These databases mainly contain CDR3β data and lack information on CDR3α.
Various recent studies have demonstrated that both the α- and β-chains carry information on the specificity of the TCR toward its cognate pMHC target. Single-cell (SC) technology is required to investigate the pMHC specificity on paired α-/β-chains. SC is expensive. Hence, the amount of publicly available data with both α- and β-chains is scarce. Several recent works have investigated TCR-pMHC and TCR-peptide interaction prediction. Various proposed approaches operate simple CDR3β alignment (cf. E. Wong et al.: “Trav1-2 cd8 t-cells including oligoclonal expansions of mait cells are enriched in the airways in human tuberculosis”, in Commun Biol 2:203, 2019). TCRdist (cf. P. Dash et al.: “Quantifiable predictive features define epitope-specific t cell receptor repertoires”, in Nature, 547(7661):89-93, 2017) computes CDR similarity-weighted distances. SETE (cf. Y. Tong et al.: “Sequence-based ensemble learning approach for tcr epitope binding prediction”, in Computational Biology and Chemistry, 87:107281, 2020) adopts k-mer feature spaces in combination with principal component analysis (PCA) and decision trees. Various methods adopt Random Forests to operate classification (cf., e.g., N. De Neuter et al.: “On the feasibility of mining cd8+ t cell receptor patterns underlying immunogenic peptide recognition”, in Immunogenetics, 70(3):159-168, 2018). ImRex (cf. P. Moris et al.: “Treating biomolecular interaction as an image classification problem - a case study on t-cell receptor-epitope recognition prediction”, in bioRxiv, 2019) tackles the problem with a method based on convolutional neural networks (CNNs). TCRGP (cf. E. Jokinen et al.: “Determining epitope specificity of t cell receptors with tcrgp”, in bioRxiv, pp. 542332, 2019) is a classification method which leverages a Gaussian process. ERGO (cf. I. Springer et al.: “Prediction of specific tcr-peptide binding from large dictionaries of tcr-peptide pairs”, in Frontiers in Immunology, 11:1803, 2020) is a deep learning approach that adopts long short-term memory (LSTM) networks and autoencoders to compute representations of the peptide and CDR3β. ERGO II (cf. I. Springer et al.: “Contribution of t cell receptor alpha and beta cdr3, mhc typing, v and j genes to peptide binding prediction”, in Frontiers in Immunology, 12, 2021) is an updated version of ERGO which considers additional input modalities, i.e. the CDR3α sequence, V and J genes, MHC and T cell type. NetTCR-1.0 (cf. V. I. Jurtz et al.: “Nettcr: sequence-based prediction of tcr binding to peptide-mhc complexes using convolutional neural networks”, in bioRxiv, pp. 433706, 2018) and NetTCR-2.0 (cf. A. Montemurro et al.: “Nettcr-2.0 enables accurate prediction of tcr-peptide binding by using paired tcr alpha and beta sequence data”, in Communications Biology, 4(1):1-13, 2021) propose a simple 1D CNN-based model, integrating peptide and CDR3 sequence information for the prediction of TCR-peptide specificity.

The present disclosure provides a system and a method for predicting the interaction between T cell receptors (TCRs) and peptides presented on the surface of cells. According to an embodiment, the system may first approximate unimodal posterior distributions over latent encodings conditioned by the input molecules. Second, the system may aggregate the predicted parameters of the posteriors in a single matrix. Third, using a mechanism based on multi-head self-attention (e.g., following the mechanism described in A. Vaswani et al.: “Attention Is All You Need”, NeurIPS 2017, which is hereby incorporated by reference herein), the system may approximate a multimodal posterior distribution over latent encodings, which may then be used to estimate the probability of binding between TCRs and peptides.
Due to the multimodal nature of the system, embodiments of the present disclosure can account for an arbitrarily large set of heterogeneous input modalities, as will be described in detail for some use cases below.
Embodiments of the present disclosure use a variant of the Variational Information Bottleneck (VIB) approach, as disclosed in Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K.: “Deep variational information bottleneck”, in arXiv preprint arXiv:1612.00410, 2016, which is hereby incorporated by reference herein. VIB leverages variational inference to construct a lower bound on the Information Bottleneck (IB) objective (as disclosed in Tishby, N., Pereira, F. C., and Bialek, W.: “The information bottleneck method”, in arXiv preprint physics/0004057, 2000, which is hereby incorporated by reference herein). Although those skilled in the art are assumed to be sufficiently familiar with these concepts, the basic principles are described in the following for the ease of understanding. In the present disclosure, the following notation is adopted: $X, Y, Z$ are random variables; $x, y, z$ are multidimensional instances of random variables; $f_\theta$ and $g_\phi$ are functions and $p_\theta$ and $q_\phi$ are probability distributions parametrized by vectors of parameters $\theta$ and $\phi$, respectively; $\mathcal{S}$ represents a set.
Let Y be a random variable representing a ground truth label associated with an input random variable X. Let Z be a stochastic encoding of X coming from an intermediate layer of a deep neural network and defined by a parametric encoder $p_\theta(z \mid x)$ representing the upstream part of such neural model. Following the general Information Bottleneck (IB) approach, the goal according to the embodiments of the present disclosure consists in learning an encoding Z which is (a) maximally informative about Y and (b) maximally compressive about X. Following an information theoretic approach, objective (a) implies maximizing the mutual information $I(Z, Y; \theta)$ between the encoding Z and the target Y, where:

$$I(Z, Y; \theta) = \int dz \, dy \, p_\theta(z, y) \log \frac{p_\theta(z, y)}{p_\theta(z) \, p_\theta(y)}. \quad (1)$$

A trivial solution for maximizing Equation 1 above would be the identity Z = X. This would ensure a maximally informative representation, but (b) places a constraint on Z. In fact, due to (b), one wants to “forget” as much information as possible about X. This leads to the IB objective

$$\max_\theta \; I(Z, Y; \theta) - \beta \, I(Z, X; \theta), \quad (2)$$

where $\beta \geq 0$ is a Lagrange multiplier. The first term on the right hand side of Equation 2 causes Z to be predictive of Y, while the second term constrains Z to be a minimal sufficient statistic of X. $\beta$ controls the trade-off between (a) and (b), and $I(Z, Y; \theta)$ is the mutual information between Z and Y parameterized by $\theta$.
As derived for the VIB, assuming $q_\phi(y \mid z)$ and $r(z)$ are variational approximations of the true $p(y \mid z)$ and $p(z)$, respectively, Equation 2 can be rewritten as:

$$\mathcal{L}_{VIB} = \mathbb{E}_{\epsilon \sim p(\epsilon)} \left[ -\log q_\phi\big(y \mid f_\theta(x, \epsilon)\big) \right] + \beta \, \mathrm{KL}\big( p_\theta(z \mid x) \,\|\, r(z) \big), \quad (3)$$

where $\epsilon \sim \mathcal{N}(0, I)$ is an auxiliary Gaussian noise variable, KL is the Kullback-Leibler divergence and $f_\theta$ is a vector-valued parametric deterministic encoding function (e.g., in the context of the present disclosure, a neural network). The introduction of $\epsilon$ consists in the reparameterization trick (cf. Kingma, D. P. and Welling, M.: “Auto-encoding variational bayes”, arXiv preprint arXiv:1312.6114, 2013), which allows to write $p_\theta(z \mid x)\,dz = p(\epsilon)\,d\epsilon$, where $z = f_\theta(x, \epsilon)$ is now treated as a deterministic variable. This formulation allows the noise variable to be independent of the model parameters. This way, it is easy to compute gradients of the objective in Equation 3 and optimize via backpropagation. In the present disclosure, embodiments of the proposed method let the variational approximate posteriors be multivariate Gaussian distributions with a diagonal covariance structure, $z \sim p_\theta(z \mid x) = \mathcal{N}\big(z; \mu, \mathrm{diag}(\sigma^2)\big)$; a valid reparameterization is $z = \mu + \sigma \odot \epsilon$.

With the variational distribution $r(z)$ set to a standard multivariate Gaussian distribution, $r(z) = \mathcal{N}(0, I)$, as done in practice, VIB can be viewed as a variational encoder-decoder analogous to the Variational Auto-Encoder, VAE (cf. again Kingma, D. P. and Welling, M.: “Auto-encoding variational bayes”, arXiv preprint arXiv:1312.6114, 2013), in which the latent encoding distribution $p_\theta(z \mid x)$ can be viewed as a latent posterior, and the variational decoding distribution $q_\phi(y \mid z)$ can be viewed as a decoder.
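The two ingredients used above — the reparameterized sample and the KL term against the standard Gaussian prior — both have simple closed forms for diagonal Gaussians. A NumPy sketch (function names illustrative):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): the noise is independent of
    the encoder parameters, so gradients can flow through mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
```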
As derived for the MVIB (cf. Grazioli, F., Siarheyeu, R., Pileggi, G., and Meiser, A.: “Microbiome-based disease prediction with multimodal variational information bottlenecks”, PLOS Computational Biology, 2022), the VIB objective of Equation 3 can be generalized by representing X as a collection of multimodal random input variables (or multiple input sequences) $\{X_m\}_{m=1}^{M}$. In light of this, in the language of a variational encoder-decoder, the posterior $p_\theta(z \mid x)$ of Equation 3 consists actually in the joint posterior $p_\theta(z \mid x_1, \ldots, x_M)$ conditioned by the joint M available sequences. However, it should be noted that for predicting the interaction label Y from X, the M different sequences cannot be simply treated as M different modalities.

The Multimodal Variational Autoencoder, MVAE (cf. Wu, M. and Goodman, N.: “Multimodal generative models for scalable weakly-supervised learning”, arXiv preprint arXiv:1802.05335, 2018) and MVIB approximate the joint posterior $p(z \mid x_{1:M})$ assuming that the M modalities are conditionally independent, given the common latent variable Z. This allows to express the joint posterior as a product of unimodal approximate posteriors $\tilde{q}(z \mid x_m)$ and a prior $p(z)$, referred to as Product of Experts (PoE):

$$p(z \mid x_{1:M}) \propto p(z) \prod_{m=1}^{M} \tilde{q}(z \mid x_m).$$

The Mixture-of-Experts Multimodal Variational Autoencoder, MMVAE (cf. Shi, Y., Siddharth, N., Paige, B., and Torr, P. H.: “Variational mixture-of-experts autoencoders for multi-modal deep generative models”, arXiv preprint arXiv:1911.03393, 2019) factorizes the joint multimodal posterior as a mixture of Gaussian unimodal posteriors, referred to as Mixture of Experts (MoE):

$$p(z \mid x_{1:M}) = \frac{1}{M} \sum_{m=1}^{M} q(z \mid x_m).$$
Fig. 1 is a diagram schematically illustrating the general concept of an Attention of Distributions (AoD) mechanism 100 (herein sometimes also referred to as AoE, Attention of Experts) for Gaussian posteriors according to embodiments of the present disclosure. In Fig. 1, $f_{\theta_m}$ is the stochastic Gaussian encoder 101 of the $m$-th modality. In accordance with embodiments of the present invention, the unimodal posteriors are modelled as Gaussian distributions with diagonal covariance structure, $q(z \mid x_m) = \mathcal{N}\big(z; \mu_m, \mathrm{diag}(\sigma_m^2)\big)$, e.g., following the proposals provided in Wu, M. and Goodman, N.: “Multimodal generative models for scalable weakly-supervised learning”, in arXiv preprint arXiv:1802.05335, 2018 or in Grazioli, F., Siarheyeu, R., Pileggi, G., and Meiser, A.: “Microbiome-based disease prediction with multimodal variational information bottlenecks”, in PLOS Computational Biology, 2022, which are both hereby incorporated by reference herein. As shown at 102 in Fig. 1, by stacking the parameters (represented as column-vectors) $\mu_m$ and $\sigma_m^2$ for all available sequences $m = 1, \ldots, M$, one can define the following two matrices $M_\mu \in \mathbb{R}^{(M+1) \times d_z}$ and $M_{\sigma^2} \in \mathbb{R}^{(M+1) \times d_z}$, where $d_z$ is the dimensionality of the latent single-sequence posteriors:

$$M_\mu = [\mu_0, \mu_1, \ldots, \mu_M]^T, \qquad M_{\sigma^2} = [\sigma_0^2, \sigma_1^2, \ldots, \sigma_M^2]^T, \quad (4)$$

where $\mu_0$ and $\sigma_0^2$ parametrize the latent prior.
According to embodiments of the present disclosure, it may be provided that the dependencies between the M single-sequence posteriors and the multi-sequence joint posterior are implicitly learned by means of multi-head self-attention (as shown at 104 in Fig. 1, including the AoD module), leveraging its power of capturing multiple complex interactions in X and allowing possible missing sequences, in particular in accordance with the approach disclosed in A. Vaswani et al.: “Attention Is All You Need”, NeurIPS 2017:

$$\mu = \mathrm{Pool}\big(\mathrm{MultiHead}(M_\mu, M_\mu, M_\mu)\big), \qquad \sigma^2 = \mathrm{Pool}\big(\mathrm{MultiHead}(M_{\sigma^2}, M_{\sigma^2}, M_{\sigma^2})\big), \quad (5)$$

where $\mathrm{Pool}: \mathbb{R}^{(M+1) \times d_z} \to \mathbb{R}^{d_z}$ is a 1D max pooling function, as shown at 106 in Fig. 1, and MultiHead is the standard multi-head attention block defined in Vaswani et al. The above equation is referred to as Attention of Distributions (AoD) in the context of the present disclosure.
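The AoD aggregation can be sketched in NumPy as follows. This is a simplified single-head variant (the disclosure uses the standard multi-head block of Vaswani et al.); the projection matrices `Wq`, `Wk`, `Wv` and all function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the stacked posterior
    parameters X of shape (M+1, d_z): each row attends to all rows."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V

def aod(M_mu, M_logvar, Wq, Wk, Wv):
    """Attention of Distributions: attend over the unimodal posterior
    parameter matrices, then 1D max-pool across the modality axis."""
    mu = self_attention(M_mu, Wq, Wk, Wv).max(axis=0)
    logvar = self_attention(M_logvar, Wq, Wk, Wv).max(axis=0)
    return mu, logvar
```

The pooling collapses the (M+1) attended rows into a single $d_z$-dimensional parameter vector, so the joint posterior keeps the same Gaussian form as the unimodal posteriors.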
Furthermore, in the context of the present disclosure, a multi-sequence VIB (Variational Information Bottleneck) which adopts AoD for modelling the multi-sequence joint posterior is referred to as Attentive Variational Information Bottleneck (AVIB). The AVIB objective is:

$$\mathcal{L}_{AVIB} = \mathbb{E}_{\epsilon \sim p(\epsilon)} \left[ -\log q_\phi\big(y \mid f_\theta(x_{1:M}, \epsilon)\big) \right] + \beta \, \mathrm{KL}\big( q_\phi(z \mid x_{1:M}) \,\|\, r(z) \big), \quad (6)$$

where the multi-sequence posterior is modelled as $q_\phi(z \mid x_{1:M}) = \mathcal{N}\big(z; \mu, \mathrm{diag}(\sigma^2)\big)$, with $\mu$ and $\sigma^2$ computed by the AoD mechanism.
Embodiments of the present disclosure provide computer-implemented methods and systems that are configured to apply the Attentive Variational Information Bottleneck (AVIB) for tackling a fundamental problem of immuno-oncology: predicting TCR-peptide interactions. It can be demonstrated that AVIB significantly outperforms existing baseline methods. As described above, AVIB may be implemented as a multimodal generalization of the Variational Information Bottleneck (VIB) capable of learning a joint representation of multiple input data modalities through self-attention. In the context of the present disclosure, the input modalities may include the peptide, the CDR3α, and the CDR3β. The model may learn to predict whether the binding between the peptide and the TCR takes place or not.
According to an embodiment, the present disclosure provides methods and systems that utilize an approach for combining unimodal posteriors in a joint multimodal posterior using self-attention, termed Attention of Distribution (AoD) herein.
According to an embodiment, the present disclosure provides methods and systems that use a simple, hyperparameter-free approach for the detection of out-of- distribution (OOD) amino acid chains, which leverages the multimodal posterior distribution over latent encodings learned by AVIB.
According to embodiments of the present disclosure, there are three data modalities: $X_{\mathrm{peptide}}$, $X_{\mathrm{CDR3\alpha}}$ and $X_{\mathrm{CDR3\beta}}$. For each modality, the dedicated stochastic encoder 101 may have the form $f_\theta(x) = \big(\mu(x), \sigma^2(x)\big)$, where $\mu(\cdot)$ and $\sigma^2(\cdot)$ are the two output branches 105 of the neural stochastic encoder 101, as schematically illustrated in Fig. 2.

The neural stochastic encoder 101 of Fig. 2 presents an architecture which is inspired by NetTCR-2.0 (cf. A. Montemurro et al.: “Nettcr-2.0 enables accurate prediction of tcr-peptide binding by using paired tcr alpha and beta sequence data”, in Communications Biology, 4(1):1-13, 2021). It encodes the peptides by using BLOSUM50 encoding 110 and operates 1D convolutions 114 of the encoded peptides 112 with kernel 116 sizes 1, 3, 5, 7 and 9. After the convolutions 114, 1D max pooling 118 is operated, followed by ReLU activation functions 120. The obtained vectors are then concatenated, as shown at 122. Eventually, two parallel fully-connected layers 105 output two vectors of size $d_z$ for $\mu$ and $\sigma^2$, where $d_z$ is the size of the bottleneck, i.e. the dimension of Z. $d_z$ can be set via hyperparameter optimization. For a more stable computation, one may let the encoder model the logarithm of the variance, $\log \sigma^2$.
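The conv-pool-concatenate trunk described above can be sketched in NumPy. This is a toy illustration only: a mean filter stands in for the learned convolution kernels, the input `x` is a BLOSUM-style encoded sequence of shape (L, 20), and all names are illustrative.

```python
import numpy as np

def conv1d_valid(x, kernel_size):
    """Toy 1D 'convolution' (a mean filter) over an encoded sequence of
    shape (L, 20); stands in for learned kernels of sizes 1, 3, 5, 7, 9."""
    L = x.shape[0] - kernel_size + 1
    return np.stack([x[i:i + kernel_size].mean(axis=0) for i in range(L)])

def encoder_features(x, kernel_sizes=(1, 3, 5, 7, 9)):
    """Convolve with each kernel size, 1D max-pool over positions, apply
    ReLU, then concatenate -- mirroring the encoder trunk of Fig. 2."""
    feats = []
    for k in kernel_sizes:
        h = conv1d_valid(x, k).max(axis=0)  # 1D max pooling over length
        feats.append(np.maximum(h, 0.0))    # ReLU
    return np.concatenate(feats)
```

In the full encoder, two parallel fully-connected heads would then map this concatenated feature vector to the $d_z$-dimensional $\mu$ and $\log \sigma^2$ outputs.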
According to embodiments of the present disclosure, it may be provided that the corresponding decoder consists in a simple neural network with fully-connected layers and ReLU activations. This implements the binary classification.
Fig. 3 is a schematic view illustrating a trimodal Attentive Variational Information Bottleneck (AVIB) system in accordance with embodiments of the present disclosure. As will be appreciated by those skilled in the art, higher or lower modalities can be implemented in a corresponding fashion. The system comprises parametric encoders 101 , the AoD (or AoE) mechanism 104 described in detail above, and parametric decoders 124.
According to embodiments, in the loss function (see Equation (6) above), the multi-sequence posterior $q_\phi(z \mid x_{1:M})$ and the latent prior $r(z)$ may be treated as $d_z$-dimensional Gaussian distributions, with $r(z) = \mathcal{N}(0, I)$ spherical. An embodiment may set $\beta$ to a fixed value; $\beta$ can also be set via hyperparameter optimization.
According to embodiments, the networks may be trained using the Adam or Stochastic Gradient Descent optimizers. For example, the optimizer may implement a learning rate of $10^{-3}$ and an L2 weight decay. The batch size may be set to 4096. A drop-out rate of 0.3 may be used at training time. A cosine annealing learning rate scheduler with a period of 10 epochs may be adopted (cf. I. Loshchilov et al.: “Sgdr: Stochastic gradient descent with warm restarts”, arXiv preprint arXiv:1608.03983, 2016). The training may be performed for 200 epochs and, in order to avoid overfitting, the best model may be selected by saving the weights corresponding to the epoch where the AUROC is maximum on the validation set. The validation set may be obtained via an 80/20 stratified random split of the training set. Training and test sets may be obtained via a non-stratified 80/20 split of the whole dataset. Experiments may be repeated 5 times with different training/test splits to ensure unbiased performance evaluation.
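The cosine annealing schedule referred to above can be written down directly. The sketch below is a simplified form of the warm-restart schedule of Loshchilov et al. (fixed period, no period doubling, minimum learning rate of zero); the function name is illustrative.

```python
import math

def cosine_annealing_lr(base_lr, epoch, period):
    """Cosine annealing with warm restarts (simplified): the learning rate
    decays from base_lr to 0 over `period` epochs, then restarts."""
    t = epoch % period
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / period))
```

With `base_lr = 1e-3` and `period = 10`, the rate starts at $10^{-3}$, reaches roughly half that value mid-cycle, approaches zero at the end of each cycle, and jumps back to $10^{-3}$ at every restart.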
All experiments may be performed, e.g., on a CentOS Linux 8 machine with NVIDIA GeForce RTX 2080 Ti GPUs and CUDA 10.2 installed. Algorithms may be implemented, e.g., in Python 3.6 using PyTorch version 1.10.
In what follows, the datasets that may be used in connection with the present disclosure for the TCR-peptide interaction prediction problem are described. In particular, human TCR-peptide data extracted from IEDB (cf. Vita et al.: “The immune epitope database (iedb): 2018 update”, in Nucleic Acids Research, 47(D1):D339-D343, 2019), VDJdb (cf. Bagaev et al.: “Vdjdb in 2019: database extension, new analysis infrastructure and a t-cell receptor motif compendium”, in Nucleic Acids Research, 48(D1):D1057-D1062, 2020), and McPAS-TCR (cf. Tickotsky et al.: “Mcpas-tcr: a manually curated catalogue of pathology-associated t-cell receptor sequences”, in Bioinformatics, 33(18):2924-2929, 2017) may be combined. Additionally, the dataset proposed by Klinger et al.: “Multiplex identification of antigen-specific t cell receptors using a combination of immune assays and immune receptor sequencing”, in PLoS One, 10(10):e0141561, 2015, referred to as the MIRA set, is considered (the MIRA set is publicly available in the NetTCR-2.0 repository: https://github.com/mnielLab/NetTCR-2.0/tree/main/data). Overall, 271,366 human TCR-peptide samples were available. The data were organized creating the following datasets:

α + β set. 117,753 samples out of 271,366 present peptide information, together with both CDR3α and CDR3β chains. In the present disclosure, this subset is referred to as the α + β set. The ground truth label is a binary variable which represents whether binding between the peptide and the TCR chains takes place.

β set. 153,613 samples out of 271,366 present peptide and CDR3β information. For these samples, the CDR3α chain is missing. This subset is referred to as the β set.
Human TCR set. The totality of the human TCR-peptide data (i.e. the α + β set ∪ β set) is referred to as the Human TCR set. Non-human TCR set. In addition to the human TCR data, 5,036 non-human TCR samples were extracted from the VDJdb database and used as OOD (out-of-distribution) samples. These samples come from mice and macaques and present peptide and CDR3β information. These samples are referred to as the Non-human TCR set.
Human MHC set. A second set of OOD samples was created, composed of 463,684 peptide-MHC pairs. The peptide sequences are taken from the Human TCR set, i.e. the peptide information is shared among the ID (in-distribution) and OOD sets. The MHC sequences are amino acid chains corresponding to human MHC alleles (the MHC sequences were extracted from the PUFFIN repository: https://github.com/gifford-lab/PUFFIN). These samples are referred to as the Human MHC set.
Fig. 4 depicts the distributions of the human TCR data, i.e. the peptide, CDR3α and CDR3β distributions, for both the α + β set and the β set. In Fig. 4, a point on the x-axis represents one unique chain of amino acids. The y-axis represents how many samples present that specific chain. Samples are sorted by count with respect to the α + β set. It is possible to observe that the two datasets have similar peptide distributions, but present different CDR3β sequences.
Fig. 5 depicts the length distribution of the datasets used in the context of the present disclosure. The length corresponds to the number of amino acids which constitute the peptides, CDR3α, CDR3β and MHC molecules. All of these types of molecules are sequences of amino acids.
Fig. 6 depicts the class distributions of the β set and the α + β set. The class distributions of the Non-human TCR set and the Human MHC set are not reported, as they were only used for OOD (out-of-distribution) detection experiments, and not for TCR-peptide interaction prediction.
It is possible to observe that the TCR data present many more non-binding samples (label 0) than binding ones (label 1). This class imbalance may be handled by means of balanced batch sampling, i.e. when a batch is sampled at training time, it may be ensured that the numbers of binding and non-binding samples are essentially equal.
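A minimal sketch of such a balanced batch sampler is given below. The oversampling-with-replacement strategy is one possible way to equalize the classes per batch, assumed here for illustration rather than taken from the disclosure:

```python
import random

def balanced_batches(samples, labels, batch_size, seed=0):
    """Yield batches with (essentially) equal numbers of binding (1) and
    non-binding (0) samples, drawing each half with replacement so the
    minority class is oversampled.  Illustrative sketch only."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    half = batch_size // 2
    n_batches = max(len(pos), len(neg)) // half
    for _ in range(n_batches):
        idx = rng.choices(pos, k=half) + rng.choices(neg, k=half)
        rng.shuffle(idx)
        yield [samples[i] for i in idx], [labels[i] for i in idx]
```

Every yielded batch then contains exactly `batch_size // 2` binding and `batch_size // 2` non-binding samples, regardless of the global class imbalance.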
Pre-processing
Peptides, CDR3α and CDR3β chains are sequences of amino acids. The 20 amino acids translated by the genetic code are generally represented as letters of the English alphabet. In accordance with embodiments of the present disclosure, the amino acid sequences are pre-processed using BLOSUM50 encodings (cf. Henikoff, S. and Henikoff, J. G.: “Amino acid substitution matrices from protein blocks”, in Proceedings of the National Academy of Sciences, 89(22):10915-10919, 1992). This allows representing a sequence of N amino acids as a 20 x N matrix. The elements of the matrix are integers that represent substitution scores among amino acids. After performing the BLOSUM50 encoding, feature standardization may be performed by removing the mean and scaling to unit variance. As the length of the amino acid chains is not constant (see Fig. 5), zero-padding may be applied after the BLOSUM50 encoding. This ensures that all matrices have shape 20 x Nmax, where Nmax is the length of the longest chain.
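The encoding and padding steps can be sketched as follows. The toy substitution matrix is illustrative only — real BLOSUM50 columns can be obtained, e.g., from Biopython's `substitution_matrices` module — and the standardization step (mean removal, unit-variance scaling over the training set) is omitted for brevity:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_and_pad(sequences, blosum, n_max=None):
    """Encode each sequence as a 20 x N matrix of substitution-score columns,
    then zero-pad on the right so every matrix has shape 20 x N_max.
    `blosum` maps an amino acid to its 20-dimensional score column."""
    n_max = n_max or max(len(s) for s in sequences)
    out = []
    for seq in sequences:
        cols = [blosum[a] for a in seq]             # N columns of length 20
        cols += [[0.0] * 20] * (n_max - len(cols))  # zero-padding to N_max
        # transpose: rows = 20 amino-acid scores, columns = sequence positions
        out.append([[col[i] for col in cols] for i in range(20)])
    return out

# toy "BLOSUM": identity-like scores, for demonstration only
toy = {a: [5.0 if i == j else -1.0 for i in range(20)]
       for j, a in enumerate(AMINO_ACIDS)}
mats = encode_and_pad(["ACD", "AC"], toy)
```

Here the shorter sequence "AC" is padded with one all-zero column so that both matrices share the shape 20 x 3.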
TCR-peptide Interaction Prediction
In order to evaluate the predictive capabilities of AVIB on the TCR-peptide interaction prediction task, experiments were performed on three datasets: the α + β set, the β set and the Human TCR set. For the β set and the Human TCR set, experiments were performed in the bimodal setting, i.e. samples are (xPeptide, xCDR3β) 2-tuples. For the α + β set, bimodal experiments were performed, as well as trimodal experiments considering (xPeptide, xCDR3α, xCDR3β) 3-tuples. The approach proposed herein (briefly referred to as AVIB) is benchmarked against two state-of-the-art deep learning methods for TCR-peptide interaction prediction:
ERGO II (cf. Springer et al.: “Prediction of specific tcr-peptide binding from large dictionaries of tcr-peptide pairs”, in Frontiers in Immunology, 11:1803, 2020) and NetTCR-2.0 (cf. Montemurro et al.: “Nettcr-2.0 enables accurate prediction of tcr-peptide binding by using paired tcrα and β sequence data”, in Communications Biology, 4(1):1-13, 2021). Additionally, AVIB is benchmarked against the LUPI-SVM (cf. Abbasi et al.: “Learning protein binding affinity using privileged information”, in BMC Bioinformatics, 19(1):1-12, 2018), leveraging the α-chain at training time as privileged information. For all benchmark methods, the original publicly available implementations were adopted.
The experimental results are summarized in Table 1 shown in Fig. 7. For evaluation, the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR) and the F1 score (F1) were computed on the test sets. 5 repeated experiments with different independent 80/20 training/test random splits were performed for unbiased performance evaluation.
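The AUROC metric, for instance, can be computed directly from its rank-statistic formulation (equivalent to the normalized Mann-Whitney U statistic); the sketch below is a generic illustration, not the evaluation code used for Table 1:

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank formulation: the probability
    that a randomly chosen positive sample is scored above a randomly
    chosen negative one, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice, library implementations such as scikit-learn's `roc_auc_score` avoid the quadratic pairwise loop.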
As can be seen, on the β set, the method according to embodiments disclosed herein obtains ~4% higher AUROC and ~8% higher AUPR compared to the best baseline, ERGO II. On the β set ∪ α + β set, AVIB outperforms ERGO II by achieving ~3% higher AUROC and ~4% higher AUPR. On the α + β set, in the bimodal setting, AVIB is on par with ERGO II. In the trimodal setting, when also considering the α-chain, AVIB obtains ~1% higher AUROC, ~4% higher AUPR and ~6% higher F1 score.
Multimodal Posterior Approximation
This section provides a comparison of two techniques for approximating Gaussian joint posteriors: AoE as described herein and PoE (Product of Experts). AVIB, which employs AoE, is benchmarked against MVIB (cf. Grazioli et al.: “Microbiome-based disease prediction with multimodal variational information bottlenecks”, in PLOS Computational Biology, 2022), which employs PoE. Experiments and benchmarks were performed in the bimodal and trimodal settings. Table 2 shown in Fig. 8 presents the TCR-peptide interaction prediction results on the α + β set. AoE achieves the best results in both the bimodal and trimodal settings. The AUPR score, in particular, improves on PoE by up to 2%.
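For reference, the PoE baseline admits a closed-form Gaussian combination — the joint precision is the sum of the unimodal precisions — whereas the attention-based combination proposed herein is learned. A sketch of the PoE formula in the one-dimensional case, for illustration:

```python
def poe(mus, sigmas):
    """Product-of-Experts combination of unimodal Gaussian posteriors
    N(mu_i, sigma_i^2): the joint is again Gaussian, with precision equal
    to the sum of the experts' precisions.  Closed form, no learned
    parameters, in contrast to the attention-based combination."""
    precisions = [1.0 / s ** 2 for s in sigmas]
    var = 1.0 / sum(precisions)
    mu = var * sum(m * p for m, p in zip(mus, precisions))
    return mu, var ** 0.5
```

Two equally confident experts thus average their means while halving the joint variance, which makes PoE rigid: it cannot learn to trust one modality more than its stated variance suggests.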
In the following, some specific use cases of the methods and systems described herein in accordance with embodiments of the present disclosure are described in detail. It is expressly noted that the described use cases are only exemplary and that further applications are straightforward and readily practicable, as will be appreciated by those skilled in the art.
TCR-peptide binding prediction for personalized cancer vaccines
Tumors harbor mutations, some of which can be recognized by the patients’ TCRs. Therapeutic cancer vaccines consist of patient-specific, and therefore tumor-specific, mutations meant to stimulate the patient’s immune response against their own tumor. In order to select those mutations that have the highest likelihood of inducing an immune response, it is important to know whether the patient’s T cell receptor repertoire is able to recognize the immunogenic mutations (neoantigens), i.e. whether there are TCRs present that are able to bind to the associated peptides. Being able to reliably predict whether TCRs bind to neoantigens can help to improve the selection process for neoantigen candidates that will be used in a therapeutic cancer vaccine.
T cell engineering
Tumor antigens shared by patients are a highly interesting target for cancer immunotherapy, especially adoptive T cell therapy, where a tumor-recognizing TCR is introduced in the patient’s T cells. These TCRs need to fulfil demanding requirements regarding their safety and efficacy, as shared antigens are usually not strictly tumor-specific. Therefore, prediction of TCR-peptide interaction allows the modelling and engineering of suitable TCRs, which can be evaluated for their safety and efficacy before any wet lab experiments are necessary.
TCR cross-reactivity assessment
Therapeutic TCRs can be used to target tumors or other types of cells depending on the MHC-presented peptides. However, TCRs usually do not bind to one single peptide but to multiple peptides. In order to identify suitable TCR candidates for therapeutic purposes, it is important to test for potential cross-reactivity, i.e. to verify that the TCR does not recognize healthy cells. Being able to predict the cross-reactivities of a TCR before experimental tests can significantly speed up and lower the costs of the process of candidate identification.

Peptide-MHC binding and interaction prediction for personalized cancer vaccines
Although the present disclosure focuses on TCR-peptide interaction prediction, in an embodiment the system may also be used to predict peptide-MHC binding and presentation. In this embodiment, the set of MHC molecules may be taken from a specific patient, and the set of peptides is based on mutations present in the cancerous cells of the patient. For all (peptide, MHC) combinations, predictions are made. In this setting, the first modality is the peptide, while the second modality is the MHC molecule. Depending on how the model is trained, the predicted scores can represent either the likelihood of binding between the two molecules, or the likelihood that the peptide-MHC complex is presented on the surface of the cell. The peptides with the highest likelihood of being presented are good candidates to be included in a personalized cancer vaccine for that patient. The predicted candidates are the output of the peptide-MHC interaction prediction system. In modern cancer vaccine design systems, these candidates may then be processed by further downstream modules, e.g. to predict TCR recognition, or to filter out peptides which could trigger an autoimmune response.
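The exhaustive scoring of all (peptide, MHC) combinations described above can be sketched as follows; `predict` is a hypothetical stand-in for a trained bimodal peptide-MHC model, not a component named in the disclosure:

```python
from itertools import product

def rank_candidates(peptides, mhc_alleles, predict):
    """Score every (peptide, MHC) combination with a trained bimodal model
    `predict` (a callable returning a binding/presentation likelihood) and
    return the candidates sorted by decreasing score."""
    scored = [(pep, mhc, predict(pep, mhc))
              for pep, mhc in product(peptides, mhc_alleles)]
    return sorted(scored, key=lambda t: t[2], reverse=True)
```

The top-ranked tuples would then feed the downstream modules mentioned above (TCR recognition prediction, autoimmunity filtering).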
AoE for sensor fusion (Computer vision applications)
AoE can be used as a building block of a broad spectrum of multimodal learning systems, beyond the TCR-specific setting, and also with non-molecular data. Approximating a multimodal posterior from heterogeneous unimodal posteriors can in fact be particularly helpful in sensor fusion, i.e. for combining various types of sensor data. RGB, infrared and depth images, as well as radar and point cloud information, can be combined using AoE to detect, segment and/or classify objects in an environment with higher confidence. In robotics, the output of the sensor fusion module is provided to, e.g., planning algorithms, or to additional scene understanding algorithms.
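As a minimal illustration of attention-based fusion over a stack of modality embeddings, a single attention head with identity query/key/value projections can be sketched as follows; the full method uses multiple heads with learned projections over the stacked unimodal posterior parameters, so this is a simplification for exposition only:

```python
import math

def self_attention(rows):
    """Single-head self-attention over M modality embeddings (lists of
    equal length), with identity projections for brevity: each output row
    is a softmax-weighted mixture of all input rows."""
    d = len(rows[0])
    scores = [[sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d)
               for k in rows] for q in rows]
    out = []
    for row_scores in scores:
        m = max(row_scores)                      # stabilized softmax
        exps = [math.exp(s - m) for s in row_scores]
        z = sum(exps)
        w = [e / z for e in exps]
        out.append([sum(w_j * rows[j][i] for j, w_j in enumerate(w))
                    for i in range(d)])
    return out
```

Each fused row thus attends to every modality, which is how dependencies among heterogeneous sensor streams (or unimodal posteriors) can be captured without a hand-designed combination rule.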
Considering the above, methods in accordance with embodiments of the present disclosure can be suitably applied to improve and reinforce existing software frameworks for vaccine design. In particular, the system proposed in the present disclosure can be used to predict whether neoantigen vaccine candidates are in fact recognized by a patient’s T cells. This makes the present invention particularly suited to be combined with immune profiling technology. As of now, most existing immune profiling approaches only leverage a so-called distance-from-self score to predict whether a presented neoantigen is in fact recognized by specific T cells. This is based on the optimistic assumption that if a neoantigen is sufficiently different from the healthy genome of the patient, the corresponding neoepitopes will be recognized by some T cell. This is, however, not always true in practice. Hence, predicting the binding between peptides and TCRs in accordance with the methods disclosed herein can be a potential key factor in the development of more effective therapies.
To summarize, embodiments of the present disclosure provide a multimodal generalization of the Variational Information Bottleneck (VIB), which is sometimes briefly denoted AVIB herein and which leverages multi-head self-attention to implicitly approximate the posterior distribution over latent encodings conditioned by multiple input modalities. In accordance with embodiments of the present disclosure, AVIB may be applied to the TCR-peptide interaction prediction problem, an important challenge in immuno-oncology. It has been shown that the methods disclosed herein significantly improve on the baselines ERGO II and NetTCR-2.0.
Many modifications and other embodiments of the invention set forth herein will come to mind to the one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. A computer-implemented method of predicting an interaction or binding between T cell receptors, TCRs, and peptides presented on the surface of cells or between peptides presented on the surface of cells and major histocompatibility complex, MHC, molecules, the method comprising a training stage, including: a) providing a training dataset of samples, wherein each sample comprises a multimodal tuple of molecules, and wherein each sample has assigned a ground truth label; b) inferring, for each sample of the training dataset using parametric encoders, a unimodal posterior distribution over latent encodings conditioned by the respective input molecules, and combining the parameters of the inferred unimodal posteriors in a single matrix; c) implicitly learning dependencies among the inferred unimodal posteriors by applying a trainable parametric function leveraging multi-head self-attention onto the matrix and, based thereupon, approximating a multimodal joint posterior distribution over latent encodings; d) sampling the approximated multimodal joint posterior distribution and using parametric decoders (124) to predict ground truth labels for the sampled inputs; and e) updating the parameters of the encoders, of the function leveraging multi-head self-attention and of the decoders by minimizing an objective function that accounts for the error between the ground truth labels and the predictions obtained in step d).
2. The method according to claim 1, further comprising: repeating steps b)-e) until convergence of a predefined validation metric.
3. The method according to claim 1 or 2, the method further comprising an inference stage, including: providing a yet unseen and unlabeled sample comprising a multimodal tuple of molecules; and using the approximated multimodal posterior distribution obtained in the training stage for estimating a probability of binding between TCRs and peptides and/or between peptides and MHC molecules.
4. The method according to any of claims 1 to 3, wherein the parametric encoders (101) operate 1D convolutions (114) of encoded molecules (112), followed by 1D max pooling (118) and Rectified Linear Unit, ReLU, activation functions (120).
5. The method according to any of claims 1 to 4, wherein the parametric decoders (124) include a neural network with a number of fully connected layers and ReLU activation functions.
6. The method according to any of claims 1 to 5, wherein the multimodal tuple of molecules of the samples includes at least a specific TCR and a specific peptide.
7. The method according to any of claims 1 to 6, wherein the multimodal tuple of molecules of the samples includes a peptide, a CDR3β molecule and a CDR3α molecule.
8. The method according to any of claims 1 to 7, wherein the estimated probabilities of binding between TCRs and peptides are used for selecting patient-specific neoantigen candidates.
9. The method according to any of claims 1 to 8, wherein the estimated probabilities of binding between TCRs and peptides are used for modelling and/or engineering tumor-recognizing TCRs.
10. The method according to any of claims 1 to 9, wherein the estimated probabilities of binding between TCRs and peptides are used for predicting cross-reactivities of a TCR.
11. The method according to any of claims 1 to 10, wherein the estimated probabilities of binding and interaction between peptides and MHCs are used for developing personalized cancer vaccines.
12. A system for predicting an interaction or binding between T cell receptors, TCRs, and peptides presented on the surface of cells or between peptides presented on the surface of cells and major histocompatibility complex, MHC, molecules, in particular for execution of a method according to any of claims 1 to 11 , the system comprising one or more processors which, alone or in combination, are configured to provide for execution of the following steps: a) providing a training dataset of samples, wherein each sample comprises a multimodal tuple of molecules, and wherein each sample has assigned a ground truth label; b) inferring, for each sample of the training dataset using parametric encoders, a unimodal posterior distribution over latent encodings conditioned by the respective input molecules, and combining the parameters of the inferred unimodal posteriors in a single matrix; c) implicitly learning dependencies among the inferred unimodal posteriors by applying a trainable parametric function leveraging multi-head self-attention onto the matrix and, based thereupon, approximating a multimodal joint posterior distribution over latent encodings; d) sampling the approximated multimodal joint posterior distribution and using parametric decoders (124) to predict ground truth labels for the sampled inputs; and e) updating the parameters of the encoders (101), of the function leveraging multi-head self-attention and of the decoders (124) by minimizing an objective function that accounts for the error between the ground truth labels and the predictions obtained by step d).
13. The system according to claim 12, wherein the parametric encoders (101) are configured to operate 1D convolutions (114) of encoded molecules (112), followed by 1D max pooling (118) and Rectified Linear Unit, ReLU, activation functions (120).
14. The system according to claim 12 or 13, wherein the parametric decoders (124) include a neural network with a number of fully connected layers and ReLU activation functions.
15. A tangible, non-transitory computer-readable medium containing instructions which, upon execution by one or more processors with access to memory, provide for execution of a method according to any of claims 1 to 11.
PCT/EP2023/050900 2022-01-18 2023-01-16 Method and system for predicting tcr (t cell receptor)-peptide interactions WO2023139031A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP22152006.7 2022-01-18
EP22152006 2022-01-18

Publications (1)

Publication Number Publication Date
WO2023139031A1 true WO2023139031A1 (en) 2023-07-27

Family

ID=80445811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/050900 WO2023139031A1 (en) 2022-01-18 2023-01-16 Method and system for predicting tcr (t cell receptor)-peptide interactions

Country Status (1)

Country Link
WO (1) WO2023139031A1 (en)

Non-Patent Citations (29)

* Cited by examiner, † Cited by third party
Title
A. MONTEMURRO ET AL.: "Nettcr-2.0 enables accurate prediction of tcr-peptide binding by using paired tcra and β sequence data", COMMUNICATIONS BIOLOGY, vol. 4, no. 1, 2021, pages 1 - 13, XP055923333, DOI: 10.1038/s42003-021-02610-3
A. VASWANI ET AL.: "Attention Is All You Need", NEURIPS, 2017
ABBASI ET AL.: "Learning protein binding affinity using privileged information", BMC BIOINFORMATICS, vol. 19, no. 1, 2018, pages 1 - 12, XP021262536, DOI: 10.1186/s12859-018-2448-z
ALEMI, A. A., FISCHER, I., DILLON, J. V., MURPHY, K.: "Deep variational information bottleneck", ARXIV:1612.00410, 2016
BAGAEV ET AL.: "Vdjdb in 2019: database extension, new analysis infrastructure and a t-cell receptor motif compendium", NUCLEIC ACIDS RESEARCH, vol. 48, no. D1, 2020, pages D1057 - D1062
CAI MICHAEL ET AL: "TCR-epitope binding affinity prediction using multi-head self attention model", ICML WORKSHOP ON COMPUTATIONAL BIOLOGY, 25 July 2021 (2021-07-25), https://icml-compbio.github.io/icml-website-2021/#papers, pages 1 - 5, XP055928149, Retrieved from the Internet <URL:https://icml-compbio.github.io/2021/papers/WCBICML2021_paper_59.pdf> [retrieved on 20220606] *
E. JOKINEN ET AL.: "Determining epitope specificity of t cell receptors with tcrgp", BIORXIV, 2019, pages 542332
E. WONG ET AL.: "Trav1-2 cd8 t-cells including oligoconal expansions of mait cells are enriched in the airways in human tuberculosis", COMMUN BIOL, vol. 2, 2019, pages 203
GRAZIOLI, F., SIARHEYEU, R., PILEGGI, G., MEISER, A.: "Microbiome-based disease prediction with multimodal variational information bottlenecks", PLOS COMPUTATIONAL BIOLOGY, 2022
HENIKOFF, S.HENIKOFF, J. G.: "Amino acid substitution matrices from protein blocks", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, vol. 89, no. 22, 1992, pages 10915 - 10919, XP002599751, DOI: 10.1073/pnas.89.22.10915
I. LOSHCHILOV ET AL.: "Stochastic gradient descent with warm restarts", ARXIV:1608.03983, 2016
I. SPRINGER ET AL.: "Contribution of t cell receptor alpha and beta cdr3, mhc typing, v and j genes to peptide binding prediction", FRONTIERS IN IMMUNOLOGY, vol. 12, 2021
I. SPRINGER ET AL.: "Prediction of specific tcr-peptide binding from large dictionaries of tcr-peptide pairs", FRONTIERS IN IMMUNOLOGY, vol. 11, 2020, pages 1803
KLINGER ET AL.: "Multiplex identification of antigen-specific t cell receptors using a combination of immune assays and immune receptor sequencing", PLOS ONE, vol. 10, no. 10, 2015, pages e0141561, XP055389430, DOI: 10.1371/journal.pone.0141561
N. DE NEUTER ET AL.: "On the feasibility of mining cd8+ t cell receptor patterns underlying immunogenic peptide recognition", IMMUNOGENETICS, vol. 70, no. 3, 2018, pages 159 - 168, XP036432332, DOI: 10.1007/s00251-017-1023-5
N.L. LA GRUTA ET AL.: "Understanding the drivers of mhc restriction of t cell receptors", NATURE REVIEWS IMMUNOLOGY, vol. 18, no. 7, 2018, pages 467 - 478, XP036533499, DOI: 10.1038/s41577-018-0007-5
P. DASH ET AL.: "Quantifiable predictive features define epitope-specific t cell receptor repertoires", NATURE, vol. 547, no. 7661, 2017, pages 89 - 93, XP093003452, DOI: 10.1038/nature22383
P. MORIS ET AL.: "Treating biomolecular interaction as an image classification problem - a case study on t-cell receptor-epitope recognition prediction", BIORXIV, 2019
SHI ET AL.: "Variational mixture-of-experts autoencoders for multi-modal deep generative models", NEURIPS, 2019
SIDHOM JOHN-WILLIAM ET AL: "DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires", NATURE COMMUNICATIONS, vol. 12, no. 1, 11 March 2021 (2021-03-11), XP093038902, Retrieved from the Internet <URL:https://www.nature.com/articles/s41467-021-21879-w> [retrieved on 20230412], DOI: 10.1038/s41467-021-21879-w *
TICKOTSKY ET AL.: "Mcpas-tcr: a manually curated catalogue of pathology-associated t-cell receptor sequences", BIOINFORMATICS, vol. 33, no. 18, 2017, pages 2924 - 2929, XP093009557, DOI: 10.1093/bioinformatics/btx286
TISHBY, N., PEREIRA, F. C., BIALEK, W.: "The information bottleneck method", ARXIV, 2000
V.I. JURTZ ET AL.: "Nettcr: sequence-based prediction of tcr binding to peptide-mhc complexes using convolutional neural networks", BIORXIV, 2018, pages 433706
VITA ET AL.: "The immune epitope database (iedb): 2018 update", NUCLEIC ACIDS RESEARCH, vol. 47, no. D1, 2019, pages D339 - D343
WU, M., GOODMAN, N.: "Multimodal generative models for scalable weakly-supervised learning", ARXIV:1802.05335, 2018
Y. TONG ET AL.: "Sequence-based ensemble learning approach for tcr epitope binding prediction", COMPUTATIONAL BIOLOGY AND CHEMISTRY, vol. 87, 2020, pages 107281, XP086225294, DOI: 10.1016/j.compbiolchem.2020.107281


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23702397

Country of ref document: EP

Kind code of ref document: A1