CN114649054A - Antigen affinity prediction method and system based on deep learning - Google Patents
- Publication number
- CN114649054A (application CN202011506001.5A)
- Authority
- CN
- China
- Prior art keywords
- oligomer
- vector
- affinity
- immune molecule
- immune
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G16B15/30—Drug targeting using structural data; Docking or binding prediction (G16B—BIOINFORMATICS; G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures)
- G06N3/045—Combinations of networks (G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention belongs to the field of artificial intelligence and discloses a deep learning-based antigen affinity prediction method and system. The method comprises the following steps: (1) obtaining a training data set of oligomer-immune molecule binding structures, including their binding affinities; (2) for each oligomer-immune molecule pair, representing the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule as high-dimensional vectors, and fusing the three high-dimensional vectors into a single vector for the oligomer-immune molecule binding structure; (3) training a deep neural network with the binding-structure vectors and affinity data in the training data set, thereby establishing an oligomer-immune molecule affinity prediction model; (4) representing the binding structure of an oligomer and an immune molecule to be tested as a vector, inputting it into the antigen affinity prediction model, and predicting their binding affinity with the trained deep neural network.
Description
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to an antigen affinity prediction method and system based on deep learning.
Background
An epitope, also known as an antigenic determinant (AD), is the specific chemical group in an antigen molecule that determines the antigen's specificity. An antigen binds, through its epitopes, to the corresponding antigen receptors on the surface of lymphocytes, thereby activating the lymphocytes and triggering an immune response. The nature, number, and spatial configuration of the epitopes determine the specificity of the antigen. The size of an epitope is compatible with the antigen-binding site of the corresponding antibody. Generally, a polypeptide epitope contains 5-6 amino acid residues; a polysaccharide epitope contains 5-7 monosaccharides; a nucleic acid hapten epitope contains 6-8 nucleotides. The specificity of an epitope is determined by all of the residues that compose it, but some residues contribute more to antibody binding than others; these are referred to as immunodominant groups.
T cell epitopes are immunogenic polypeptide fragments that must be processed by antigen-presenting cells into small peptides and bound to Major Histocompatibility Complex (MHC) molecules before they can be recognized by T cells. T cell antigen receptors (TCRs) recognize only small polypeptides of roughly 10-20 amino acids. Such epitopes consist of sequentially linked amino acids and occur mainly in the hydrophobic regions of the antigen molecule; they are called linear or sequential epitopes. Because T cells recognize only processed epitopes, they generally do not recognize the conformational epitopes of native antigens.
Methods for epitope analysis are varied, including chemical cleavage, enzymatic digestion, nuclear magnetic resonance (NMR) spectroscopy, surface plasmon resonance (SPR), hybrid peptide cleavage, polypeptide library construction, and theoretical prediction. With the development of computer technology, and in particular the spread of artificial intelligence, screening and predicting epitopes directly from biological big data has become the epitope analysis method with the highest throughput, lowest cost, and shortest turnaround.
For T cell epitopes, the conventional theoretical-prediction scheme learns from data on MHC molecule-antigen peptide binding, extracting key features such as the antigen peptide sequence, length, expression level, and thermostability to build a machine learning model, which, combined with high-throughput sequencing technology, predicts potential antigen peptides of unknown proteins or genes. Several companies and institutions at home and abroad have released antigen prediction software, such as netMHCpan from the Technical University of Denmark, EDGE from Gritstone Oncology, and EPIP from Genoimmune Biotechnology (Wuhan).
With the continuous development and mutual promotion of modern bioinformatics, molecular biology and molecular immunology, epitope research and application thereof have made great progress and show great application potential. The application of the antigen epitope is mainly embodied in three aspects, namely disease diagnosis, vaccine development and disease treatment.
In disease diagnosis, the efficiency of epitope-based diagnostic methods hinges on sensitivity and specificity. The epitope is the basic unit that stimulates an immune reaction: a single epitope stimulates a specific single immune reaction, whereas multiple epitopes often stimulate mixed immune reactions, producing nonspecific mixed antibodies, sensitized lymphocytes, or effectors. Research on epitope peptides in disease diagnosis therefore focuses on selecting specific epitope peptides to achieve better diagnostic efficiency.
In vaccine development, each conventional vaccine contains a large number of epitopes, including protective, inhibitory, and null epitopes. A vaccine achieves the desired protective effect only when the immune response induced by the protective epitopes dominates. Accordingly, in antiviral vaccine research, a key technical difficulty is obtaining protein epitopes that are strongly immunogenic, highly sequence-conserved, and critical to viral invasion.
In disease treatment, the immune response induced by epitopes is highly specific and targeted, and can be used in the immunotherapy of tumors, infectious diseases, and autoimmune diseases. Immunotherapy, which activates a cytotoxic response against an antigen by mobilizing the patient's own immune system, has proven an effective strategy in recent years. This strategy exploits the many antigens on the cell surface formed from mutant proteins cleaved by the intracellular proteasome. These polypeptides bind to HLA molecules, forming polypeptide-HLA complexes that are presented to T cell receptors (TCRs). If a TCR recognizes the polypeptide-HLA complex, cytotoxic T lymphocytes (CTLs) can be activated. CTLs are a subset of leukocytes: specialized T cells that secrete various cytokines to participate in immunity, kill certain viruses, tumor cells, and other antigenic material, and, together with natural killer cells, form an important line of the body's antiviral and antitumor immune defense. The first step in cytotoxic T lymphocyte therapy is to predict the binding affinity of an antigen for an HLA molecule. Neoantigen therapy, now developing rapidly in the tumor therapy field, is a good example of epitopes applied to disease treatment; companies such as BioNTech, Neon Therapeutics, and Gritstone Oncology have taken neoantigen-based treatment of malignant tumors into clinical trial stages.
At present, there are four classes of methods for predicting the affinity of an antigen for an HLA molecule: structure-based methods, machine learning-based methods, position weight matrix (PSSM)-based methods, and combined methods. Machine learning-based approaches learn a high-dimensional classification surface from known binding and non-binding peptides in order to predict polypeptide binding affinity. Machine learning methods can accurately predict polypeptide affinity for specific HLA alleles, such as HLA-A*0201, HLA-A*0101, and HLA-B*0702 [1,2], and are therefore used in many studies [3-5]. The most prevalent of these algorithms is the pan-specific affinity algorithm, which takes both the amino acid sequence of the HLA molecule and that of the polypeptide as input and produces an affinity prediction. The most common such algorithm in industry is NetMHCPan [6]. It characterizes the HLA molecule by a 34-residue pseudo sequence (Pseudo Sequence), preprocesses the pseudo sequence and the short polypeptide sequence (Peptide Sequence) in parallel, and feeds the result as input features to a back-propagation (BP) neural network that outputs the polypeptide-HLA affinity prediction. This approach models each polypeptide-HLA pair as a unique input sequence whose mapping to the affinity value can be learned by a single model, and is called a pan-specific training strategy (Pan Model). The NetMHCPan work shows that a single BP model does not perform well; to obtain the best results, NetMHCPan typically aggregates and learns a large number of models simultaneously.
In recent years, several deep learning-based affinity prediction models have appeared in the field. A representative algorithm is DeepSeqPan [7], which learns features of the polypeptide and the HLA molecule through two independent convolutional networks and predicts affinity through a neural network after combining the features. MHCSeqNet [8] follows a similar idea but uses gated recurrent units (GRUs) instead of convolutional networks in order to learn variable-length polypeptide data, while AI-MHC [9] handles polypeptides of different lengths by padding, enabling variable-length input with more efficient convolutional networks. ACME [10] is similar but, unlike DeepSeqPan, concatenates the polypeptide characterization vector at each HLA computation layer. Beyond such natural language processing techniques, ConvMHC [11] models affinity with an image-processing idea: the positional physicochemical data of the polypeptide and the HLA molecule are assembled into several 2-dimensional data matrices, and a 2-dimensional convolutional neural network predicts the affinity.
The main shortcomings of the prior art are: 1) the lack of an effective way to model the expressed polypeptide-HLA complex. In the above methods, the sequence features of the polypeptide and the HLA molecule are learned by independent neural networks and then directly concatenated; the structure of the complex formed once the polypeptide is presented by the HLA type is not considered in the model design. 2) Polypeptide sequences are short, so directly applying a deep network performs poorly, and related work mostly uses shallow networks: [6] aggregates a large number of traditional neural networks to compensate for the weak learning capacity of a single model, and [7,9,10] all learn polypeptide features with shallow convolutional networks. 3) Some methods do not account for polypeptide length diversity: although [11] can apply a deep network to polypeptide learning, its design handles only fixed lengths and cannot accommodate variable lengths; the same problem occurs in [6].
Accordingly, an improved deep learning-based affinity prediction model is needed.
Disclosure of Invention
Based on the problems in the prior art, the invention aims to provide an improved affinity prediction model based on deep learning and a construction method thereof.
Accordingly, in a first aspect, the present invention provides a deep learning-based antigen affinity prediction method, the method comprising:
(1) obtaining a training data set of oligomer-immune molecule binding structures, the training data set comprising the affinity of oligomer-immune molecule binding;
(2) for each oligomer-immune molecule binding structure in the training data set, representing the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule as high-dimensional vectors respectively, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule binding structure;
(3) training a deep neural network with the vectors of the oligomer-immune molecule binding structures in the training data set and the affinity data, thereby establishing an oligomer-immune molecule affinity prediction model;
(4) representing the binding structure of an oligomer and an immune molecule to be tested as a vector, inputting it into the antigen affinity prediction model, and predicting the binding affinity of the oligomer and the immune molecule with the trained deep neural network.
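The four steps above can be sketched end to end as follows. This is a toy illustration under stated assumptions: the data, the 34-residue placeholder sequence, and the fusion by element-wise product are hypothetical, and a random numpy table stands in for the learned embedding and network.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues
EMB_DIM = 8                             # illustrative embedding dimension m
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(AMINO_ACIDS), EMB_DIM))  # would be learned

def encode(seq):
    # map each amino-acid character to its embedding row -> (len, m)
    return np.stack([emb[AMINO_ACIDS.index(a)] for a in seq])

def fuse(peptide, immune_seq):
    # one simple fusion: element-wise product of mean-pooled embeddings
    return encode(peptide).mean(axis=0) * encode(immune_seq).mean(axis=0)

# (1) toy training set: (oligomer, immune-molecule sequence, affinity)
train = [("SIINFEKL", "Y" * 34, 0.85), ("GILGFVFTL", "Y" * 34, 0.60)]
# (2) fused binding-structure vectors and targets; a deep network would
# then be trained on (X, y) in step (3) and queried in step (4)
X = np.stack([fuse(p, h) for p, h, _ in train])
y = np.array([a for _, _, a in train])
```

The per-component vectorization and the actual network are detailed in the embodiments below.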
In one embodiment, the oligomer is selected from the group consisting of: polypeptides, polysaccharides, and nucleic acid haptens.
In one embodiment, the oligomers are sequences of variable length, represented by variable-length vectors.
In one embodiment, the immune molecule is a major histocompatibility complex molecule, such as a human leukocyte antigen, including but not limited to class I and class II human leukocyte antigens.
In one embodiment, the affinity of oligomer-immune molecule binding is expressed as the IC50 value of the oligomer-immune molecule binding.
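A common convention in the MHC-affinity literature (used, for example, by the NetMHC family of tools) maps IC50 values in nM onto a 0-1 training score via 1 - log(IC50)/log(50000). This is offered as an assumption, not as the patent's own formula:

```python
import math

def ic50_to_score(ic50_nm, cap=50000.0):
    """Map an IC50 value in nM to a 0-1 score; smaller IC50 means
    stronger binding, so the score rises as IC50 falls."""
    s = 1.0 - math.log(ic50_nm) / math.log(cap)
    return max(0.0, min(1.0, s))
```

Under this convention an IC50 of 1 nM scores 1.0 and the 50,000 nM cap scores 0.0.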
In one embodiment, in (2), a monomer vector representation that reflects the relationship between any two monomers is obtained by word embedding.
In one embodiment, the input oligomer sequence in (2) is (x_1, x_2, ..., x_n), where each x_i is an amino acid character, x_i ∈ {A, C, D, ..., Y}; the mapping converts each x_i into an m-dimensional vector z_i ∈ R^m, R being the real number space, so the sequence (x_1, x_2, ..., x_n) is converted into (z_1, z_2, ..., z_n).
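A minimal sketch of this character-to-vector mapping, with a random numpy table standing in for the embedding (which would be learned during training):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
m = 16                                   # embedding dimension
rng = np.random.default_rng(42)
embedding = rng.normal(size=(len(AMINO_ACIDS), m))  # learned in practice

def embed_sequence(seq):
    # (x_1, ..., x_n) -> (z_1, ..., z_n), each z_i in R^m
    return embedding[[AMINO_ACIDS.index(a) for a in seq]]

Z = embed_sequence("SIINFEKL")           # n = 8 residues -> (8, m)
```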
In one embodiment, the monomer-position characterization vector in (2) is obtained by mapping the label of each amino acid position of the polypeptide into a high-dimensional continuous numerical space: the input is unified into the vector (1, 2, ..., n), and each position i is mapped by a fully connected neural network into a vector p_i ∈ R^m, R being the real number space; that is, the input (1, 2, ..., n) yields the output (p_1, p_2, ..., p_n).
In one embodiment, the high-dimensional vectors of the oligomer sequence and of the oligomer's monomer positions in (2) are added to form the characterization vector of the oligomer.
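The additive fusion of sequence vectors (z_1..z_n) and position vectors (p_1..p_n) can be sketched as follows, with random arrays standing in for the learned mappings:

```python
import numpy as np

n, m = 9, 16
rng = np.random.default_rng(0)
seq_vecs = rng.normal(size=(n, m))   # z_1..z_n from the sequence mapping
pos_vecs = rng.normal(size=(n, m))   # p_1..p_n from the position mapping

# element-wise sum: the oligomer characterization is still an n x m array
oligo_repr = seq_vecs + pos_vecs
```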
In one embodiment, the input for the immune molecule vector in (2) is (y_1, y_2, ..., y_k), y_i ∈ {A, C, D, ..., Y}, and the output is (z'_1, z'_2, ..., z'_n), z'_i ∈ R^m; that is, the k-length original amino acid sequence is mapped to an n × m vector in the same format as the oligomer representation.
In one embodiment, the characterization vector of the oligomer and the vector of the immune molecule in (2) are fused by tensor multiplication, tensor addition, or attention operations.
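The three fusion options can be sketched as follows. The attention variant shown is a generic scaled dot-product form, an assumption for illustration rather than the patent's exact operation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 9, 16
oligo = rng.normal(size=(n, m))    # oligomer characterization vectors
immune = rng.normal(size=(n, m))   # immune-molecule vectors, same n x m

fused_mul = oligo * immune         # tensor (element-wise) multiplication
fused_add = oligo + immune         # tensor addition

# scaled dot-product attention: the oligomer attends to the immune molecule
scores = oligo @ immune.T / np.sqrt(m)               # (n, n) similarity
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
fused_attn = weights @ immune                        # (n, m) fused output
```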
In one embodiment, in (3), the deep neural network comprises deep convolutional layers and multiple fully connected layers: the deep convolutional layers extract feature vectors of the oligomer-immune molecule pairs in the training data set, these feature vectors are input into the fully connected layers, the fully connected layers map them to affinity values, and the network parameters are obtained by the back-propagation algorithm.
In one embodiment, for each deep convolutional layer, the input is copied three times: two copies are learned by different convolutional networks, with a Sigmoid normalization applied to one of the convolution outputs, yielding two vectors A and B; these are multiplied element-wise to form the residual A·B, and the third copy is added to the residual A·B.
In one embodiment, a global pooling layer θ: R^(n×p) → R^p is added after the last convolutional layer, where p is the output dimension of the last layer, to obtain the feature vector.
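A sketch of the gated residual block and the global pooling layer. For brevity, kernel-size-1 linear maps stand in for the two convolutional networks and mean pooling stands in for θ; both are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_block(x, Wa, Wb):
    """x: (n, p). A = conv_a(x), B = sigmoid(conv_b(x)); the residual is
    the element-wise product A*B, and the third copy of the input is
    added to it: output = x + A*B."""
    A = x @ Wa
    B = sigmoid(x @ Wb)
    return x + A * B

def global_pool(x):
    # theta: R^{n x p} -> R^p (mean over sequence positions here)
    return x.mean(axis=0)

rng = np.random.default_rng(2)
n, p = 9, 16
x = rng.normal(size=(n, p))
h = gated_residual_block(x, rng.normal(size=(p, p)), rng.normal(size=(p, p)))
feat = global_pool(h)                 # feature vector in R^p
```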
In one embodiment, in the fully connected layers, the feature vector obtained from the deep convolutional layers is mapped to an affinity value as follows:
1) a linear transformation is applied to the output feature vector x of the deep convolutional layers to give the fully connected layer's input vector y = Wx + b;
2) a nonlinear transformation is applied to the linear result using a rectified linear unit (ReLU) function.
Steps 1) and 2) constitute one mapping layer; the affinity prediction result is output after several such mapping layers.
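The mapping-layer stack can be sketched as follows, with random weights standing in for trained parameters and a linear final layer so the output is an unconstrained affinity value:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def mlp_affinity(x, layers):
    """layers: list of (W, b) pairs. Each hidden mapping layer computes
    ReLU(W x + b); the last layer is kept linear for the affinity value."""
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]
    return float(W @ x + b)

rng = np.random.default_rng(3)
p = 16                                   # convolutional feature dimension
layers = [(rng.normal(size=(32, p)), np.zeros(32)),
          (rng.normal(size=(1, 32)), np.zeros(1))]
score = mlp_affinity(rng.normal(size=p), layers)
```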
In one embodiment, in (4), the trained deep neural network comprises deep convolutional layers and multiple fully connected layers: the deep convolutional layers extract the feature vector of the oligomer-immune molecule pair to be tested, which is input into the fully connected layers to predict the binding affinity of the oligomer and immune molecule to be tested.
In a second aspect, the present invention provides a method for building a deep learning-based antigen affinity prediction model, the method comprising building an oligomer-immune molecule affinity prediction model through steps (1) to (3) of the method of the first aspect of the invention.
In a third aspect, the present invention provides an antigen affinity prediction system established using the method of the second aspect of the invention, the system comprising a data acquisition module, an input vector establishing module, and a model building module, wherein:
the data acquisition module is used for acquiring a data set of oligomer-immune molecule binding structures, the data set comprising the affinity of oligomer-immune molecule binding;
the input vector establishing module is used for mapping the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule into high-dimensional vectors respectively, using the data set of oligomer-immune molecule binding structures, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule binding structure;
the model building module is used for training a deep neural network with the vectors of the oligomer-immune molecule binding structures and the affinity data to establish the antigen affinity prediction model.
In one embodiment, in the model building module, the deep neural network comprises deep convolutional layers and multiple fully connected layers: the deep convolutional layers extract feature vectors of the oligomer-immune molecule pairs in the training data set, these feature vectors are input into the fully connected layers, the fully connected layers map them to affinity values, and the network parameters are obtained by the back-propagation algorithm.
In one embodiment, for each deep convolutional layer, the input is copied three times: two copies are learned by different convolutional networks, with a Sigmoid normalization applied to one of the convolution outputs, yielding two vectors A and B; these are multiplied element-wise to form the residual A·B, and the third copy is added to the residual A·B.
In one embodiment, a global pooling layer θ: R^(n×p) → R^p, where p is the output dimension of the last layer, is added after the last convolutional layer to obtain the feature vector.
In one embodiment, in the multi-layer fully-connected layers, the feature vectors obtained from the deep convolutional layers are mapped into affinity values by the fully-connected layers as follows:
1) a linear transformation is applied to the output feature vector x of the deep convolutional layers, giving the fully-connected layer input y = Wx + b;
2) a nonlinear transformation is applied to the result of the linear transformation using a linear rectification function: z = max(0, y);
steps 1) and 2) form one layer of the mapping network, and the affinity prediction result is output after passing through multiple layers of the mapping network.
According to the invention, the oligomer-immune molecule complex structure is modeled with deep learning technology, enhancing the expressiveness of the deep learning model; a deep model for short oligomer sequences is realized by improving the deep learning technology, increasing the complexity and learning capacity of a single neural network.
Drawings
Fig. 1 shows the overall structure of an antigen affinity prediction model according to one embodiment of the present invention.
Fig. 2 shows an input network structure of an antigen affinity prediction model according to an embodiment of the present invention.
FIG. 3 shows a computational network structure of an antigen affinity prediction model according to one embodiment of the invention.
Detailed Description
The two most important design steps in deep learning technology are the design of the input feature expression layer (Embedding Layer) and the design of the deep network structure of the calculation layer. The invention addresses the expression-layer design of the complex and realizes a deep design of the polypeptide calculation layer by effective means, improving the learning capacity of the model so that it requires less model-fusion learning than the netMHCPan scheme commonly used in the industry, thereby improving process efficiency; the invention is not limited to fixed-length sequences and therefore has wider application scenarios.
In the present invention, the deep residual network model is applicable to the correlation prediction of affinity, antigen presentation and immunogenicity models of all polypeptide sequences; the method is also suitable for other short gene sequence scenes, including but not limited to MHC molecule sequence analysis, DNA high-throughput sequencing fragment analysis and the like. The length of the polypeptide sequence is typically 9 or 10 amino acids, but other lengths are also possible.
In the present invention, tensor addition and attention operations can also be used to integrate the vector of polypeptides and the vector of immune molecules into one vector. The attention operation changes the tensor product of the polypeptide and the typing into the inner product operation, then the nonlinear normalization is carried out, and the product operation of the attention weight matrix and the polypeptide/typing tensor is calculated. Tensor addition, however, is less effective than tensor multiplication; attention operations are common in natural language processing models, such as the BERT [12] of Google. The polypeptide and HLA molecule sequence can also be stereoscopically combined, or the polypeptide and the HLA molecule are spliced.
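The attention operation described above (inner product of the polypeptide and the typing, nonlinear normalization, then multiplication of the attention weight matrix with the typing tensor) can be illustrated with a minimal numpy sketch. This assumes standard dot-product attention with softmax normalization; the function name and dimensions are illustrative, not the patented operation.

```python
import numpy as np

def attention_fuse(peptide, hla):
    """Fuse peptide (n x m) and HLA (k x m) embeddings via dot-product attention.

    Replaces the tensor product with an inner product, applies a softmax
    normalization, then multiplies the attention weights back onto the
    HLA representation (illustrative sketch only).
    """
    scores = peptide @ hla.T                       # (n, k) inner products
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over HLA positions
    return weights @ hla                           # (n, m) fused representation

peptide = np.random.rand(9, 16)   # 9-mer peptide, 16-dim embeddings (toy sizes)
hla = np.random.rand(34, 16)      # 34-position HLA pseudo-sequence
fused = attention_fuse(peptide, hla)
print(fused.shape)                # (9, 16)
```

Each peptide position ends up as a convex combination of HLA position embeddings, which is how attention mixes the two sequences into one vector.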
The method for establishing the antigen affinity prediction model based on deep learning comprises the following steps: (1) obtaining a training data set of the oligomer-immune molecule binding structure, the training data set comprising the affinity of oligomer-immune molecule binding; (2) for each group in the training data set of the oligomer-immune molecule binding structure, representing the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule as high-dimensional vectors respectively, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule binding structure; (3) training a deep neural network with the vectors of the oligomer-immune molecule binding structures in the training data set and the affinity data to establish an oligomer-immune molecule affinity prediction model.
Preferably, the deep neural network includes a deep convolutional layer and a multi-layer fully-connected layer, the deep convolutional layer is used for extracting feature vectors of oligomer-immune molecules in a training data set, the feature vectors are input into the multi-layer fully-connected layer, the multi-layer fully-connected layer maps the feature vectors into affinity values, and network parameters are obtained through a back propagation algorithm.
In one example, in the multi-layer convolutional layers, each convolutional neural network learns the residual of the previous layer: denoting the i-th layer input as X and the i-th layer convolutional network as conv_i, the output of the i-th layer is X + conv_i(X). The input is copied into three parts: two copies are learned through different convolutional networks, and a Sigmoid normalization operation is applied to one of the convolution outputs, yielding two vectors A and B that are multiplied bit-wise to form the residual (A×B); the third copy is added to the residual A×B.
In one example, in the multi-layer fully-connected layers, the feature vectors learned by the computing network are mapped into affinity values by the fully-connected network as follows:
1) a linear transformation is applied to the input of the mapping network (i.e., the output of the computing network), converting the mapping-layer input vector x into y = Wx + b;
2) a nonlinear transformation is applied to the result of the linear transformation using a linear rectification function (ReLU): z = max(0, y);
steps 1) and 2) form one layer of the mapping network, and the affinity prediction result is output after multiple layers.
The invention is exemplified below by the modeling and application of polypeptide-HLA molecules. The polypeptide-HLA molecule dataset used in the present invention is derived from IEDB published data.
In the invention, a Pan-specific Model (Pan Model) must map the polypeptide and HLA molecule sequence text into a unique vector so that the neural network model can establish a functional mapping between the vector and the prediction index; in addition, the mapping process must model the binding structure between the polypeptide and the HLA molecule.
The deep neural network model of the invention consists of three parts, referring to FIG. 1: the input network, the computing network, and the mapping network together form the complete neural network. During model training, the network parameters are first obtained through a training process: vector data of oligomer-immune molecule binding structures with known affinity are used to learn the network parameters via the back propagation algorithm. During testing/prediction, the vector of the oligomer-immune molecule binding structure to be predicted is fed into the input network, and the predicted affinity value is output through the computing network and the mapping network. The specific functions are described below.
Input network:
the input network maps the polypeptide and the HLA molecule into high-dimensional vectors through a neural network, models the polypeptide-HLA molecule binding structure, and realizes pan-specific input for the model. Different polypeptide + HLA molecule pairs are mapped into different vectors, and the vectors must be high-dimensional to fully capture the subtle differences between input sequences. The inventors input the HLA molecule pseudo-sequence, which is more accurate.
The specific design is shown in FIG. 2. Each amino acid of the polypeptide sequence and the HLA molecule sequence is mapped into a higher-dimensional vector A by a fully-connected network; the polypeptide positions are labeled 1 to the polypeptide length, and the position sequence is mapped to the same dimension through another network; the vectors are then expanded into the tensors shown in the figure, and tensor operations produce the final input tensor mixing the polypeptide and HLA molecule information. FIG. 2 illustrates a polypeptide length of 15 and a mapping output dimension of 128. The 4 upper and 3 lower fully-connected circles in FIG. 2 represent the mapping of each amino acid from the character space A-Z into a high-dimensional continuous numerical space, a commonly used fully-connected layer representation, with the upper circles denoting the fully-connected input dimension (the amino acid character space) and the lower circles the output dimension (the higher-dimensional vector). Taking a length-15 polypeptide as an example, each position is mapped from an amino acid character into a continuous high-dimensional vector, again denoted m here. The 15 × 128 × 1 shape results from expanding the vectors into tensors so that the tensor addition and tensor multiplication in the figure can be performed. The principle of the tensor operations is that the last two dimensions are treated as a matrix and the first dimension as a Batch Size: adding two 15 × 128 × 1 tensors still yields 15 × 128 × 1, and multiplying by a 15 × 1 × 128 tensor yields 15 × 128 × 128.
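The tensor shape rules described above can be checked directly with numpy, whose broadcasting and batched matrix multiplication follow the same convention (last two dimensions act as a matrix, the first as a batch). A minimal sketch:

```python
import numpy as np

# Last two dims act as a matrix; the first dim (15) behaves like a batch size.
a = np.ones((15, 128, 1))
b = np.ones((15, 128, 1))
c = np.ones((15, 1, 128))

added = a + b        # element-wise tensor addition: still (15, 128, 1)
multiplied = a @ c   # batched matrix product: (15, 128, 128)

print(added.shape)       # (15, 128, 1)
print(multiplied.shape)  # (15, 128, 128)
```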
The word embedding (Word Embedding) techniques of language models [12-15] can be applied to the text sequences here, mapping the original input data into three high-dimensional vectors through neural networks. In the following, the polypeptide length is n, the HLA molecule length is k, and the output is mapped to m dimensions as an example:
1) Polypeptide characterization vector: the polypeptide is treated as a variable-length vector by a fully-connected neural network, and each amino acid is mapped from the A-Z character space into a high-dimensional continuous numerical space. The input polypeptide sequence is (x_1, x_2, ..., x_n), where x_i is an amino acid character, x_i ∈ {A, C, D, ..., Y}; the mapping converts each x_i into an m-dimensional vector z_i ∈ R^m, where m is a hyperparameter to be tuned by cross-validation and R is the real number space, so the sequence (x_1, x_2, ..., x_n) is converted into (z_1, z_2, ..., z_n);
2) Polypeptide position characterization vector: the label of each amino acid position of the polypeptide is mapped into a high-dimensional (e.g., m-dimensional) continuous numerical space by a fully-connected neural network: the input is the unified vector (1, 2, ..., n), and each position i is mapped by the fully-connected neural network into a vector p_i ∈ R^m, where R is the real number space; i.e., the input (1, 2, ..., n) yields the output (p_1, p_2, ..., p_n);
3) HLA molecule characterization vector: the HLA molecule corresponding to each polypeptide must also be mapped into a high-dimensional continuous numerical space. Since each position of both the HLA molecule and the polypeptide is an amino acid with the same meaning, the neural network used for this mapping is the same as that for the polypeptide characterization vector. The input HLA molecule vector is (y_1, y_2, ..., y_k), y_i ∈ {A, C, D, ..., Y}, and the output is (z′_1, z′_2, ..., z′_n), z′_i ∈ R^m; the k-dimensional original amino acid vector is mapped into an n × m-dimensional vector with the same format as the polypeptide. The process is as follows:
a) through the neural network of 1), (y_1, y_2, ..., y_k) is mapped to (y′_1, y′_2, ..., y′_k), y′_j ∈ R^m;
Two mechanisms can be used to fuse the three vectors. One is to add them, as in traditional language models. However, unlike natural language, genomic sequences, polypeptides and HLA molecules have a spatial binding structure that must be captured by a more complex mechanism, so the following mechanism can be used instead:
1) compute the polypeptide characterization vector A = (z_1 + p_1, z_2 + p_2, ..., z_n + p_n);
2) for each element a_i of A, perform a tensor expansion (Kronecker product): R^m → R^(m×1);
3) denote the HLA molecule characterization vector as B; for each element b_i of B, perform a tensor expansion (Kronecker product): R^m → R^(1×m);
5) a flattening (Flatten) operation is performed: for each of the n positions, the result of formula (1) is converted into an m²-dimensional vector (a_i1 b_i1, ..., a_i1 b_im, ..., a_im b_i1, ..., a_im b_im). This vector is the final input to the computational network.
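The fusion steps above can be sketched in numpy as follows; the variable names mirror the text (A = Z + P, then a per-position outer/Kronecker product with B, flattened to m² dimensions). This is an illustrative sketch under toy dimensions, not the patented implementation:

```python
import numpy as np

def fuse_input(z, p, b):
    """Fuse peptide embeddings z (n x m), position embeddings p (n x m) and
    HLA embeddings b (n x m) into the n x m^2 computational-network input."""
    a = z + p                              # step 1): characterization vector A
    # steps 2)-5): per-position outer (Kronecker) product a_i b_i^T, flattened
    outer = a[:, :, None] * b[:, None, :]  # (n, m, m)
    return outer.reshape(a.shape[0], -1)   # (n, m*m)

n, m = 9, 8  # toy sizes for illustration
z = np.random.rand(n, m)
p = np.random.rand(n, m)
b = np.random.rand(n, m)
x = fuse_input(z, p, b)
print(x.shape)  # (9, 64)
```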
The polypeptide and the HLA molecule are mapped to the same high-dimensional space through a neural network, and the stereo structure mapping of the sequences of the polypeptide and the HLA molecule is generated through tensor product, so that the input is provided for a high-efficiency pan-specific affinity model.
For training data, the learning process must be completed with the computing network and the mapping network combined as a whole. The vectors of the oligomer-immune molecule binding structures and the affinity values are input into the computing network and the mapping network for training, through the deep convolutional neural network of the computing network and the fully-connected network of the mapping network, thereby learning the parameters of both networks.
Computing network:
the computing network comprises multi-layer convolutional neural networks; its function is to effectively extract short-sequence features through a deep convolutional neural network to obtain feature vectors. Because the input polypeptides and HLA molecules are very short (polypeptides corresponding to HLA class I are typically under 15 positions; HLA class II, typically under 26), the low-resolution data can cause a deep convolutional neural network to overfit, making it less effective. To increase the complexity of the deep convolutional neural network and thereby enhance learning, residual neural networks are used in combination with Gated Linear Unit activation (see [15]). This significantly enhances the generalization ability of deep convolutional neural networks (for example, at least 5 layers, preferably 10 or more, e.g., more than 15 or even more than 20 layers), increases complexity, and strengthens the algorithm's learning capacity; the specific design is shown in FIG. 3. The input is copied into three parts: two copies are learned through different convolutional networks, and a Sigmoid normalization operation is applied to one of the convolution outputs, yielding two vectors A and B that are multiplied bit-wise to form the residual (A×B in the figure); the third copy is added to the residual A×B. With this residual network design (FIG. 3), each convolutional layer learns the residual of the previous layer: denoting the i-th layer input as X and the i-th layer convolutional network as conv_i, the output of the i-th layer is X + conv_i(X).
By learning residuals rather than learning the input directly, the overfitting caused by adding layers to a convolutional neural network is effectively reduced. The final computing network comprises 10 convolutional layers and uses gated linear units as the activation mechanism. The overall process is as follows:
1) the input X = (X_1, X_2, ..., X_n) is fed into two convolutional neural networks conv_A and conv_B, whose output dimensions match the input X;
2) the two convolutional outputs conv_A(X) and conv_B(X) are computed; a bit-wise Sigmoid mapping σ: R → [0, 1] is applied to the output of one of the networks, e.g., conv_B;
3) conv_A(X)_i and σ(conv_B(X)_i) are multiplied bit-wise to obtain (conv_A(X)_1 × σ(conv_B(X)_1), conv_A(X)_2 × σ(conv_B(X)_2), ..., conv_A(X)_n × σ(conv_B(X)_n)), which is added as a vector to X = (X_1, X_2, ..., X_n) to give the output of the computing layer; the output passes through a Dropout layer before being fed to the next layer;
considering that in practice the polypeptide length is variable, and therefore the output length after all convolutional layers is also variable, a global pooling layer θ: R^(n×p) → R^p, where p is the output dimension of the last layer, is added after the last convolutional layer to obtain the feature vector.
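The gated-linear-unit residual layer and the global pooling step can be sketched as follows. This is a minimal numpy illustration: the convolution is a plain 1-D "same" convolution, mean pooling is assumed for θ (the text does not fix the pooling operator), Dropout is omitted, and all sizes are toy values.

```python
import numpy as np

def conv1d_same(x, w):
    """1-D convolution with 'same' padding.
    x: (n, p) sequence of n positions with p channels; w: (k, p, p) kernel."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([sum(xp[i + j] @ w[j] for j in range(k))
                     for i in range(x.shape[0])])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gated_residual_block(x, wa, wb):
    """X + conv_A(X) * sigmoid(conv_B(X)): the gated-linear-unit residual
    layer described in steps 1)-3) above (Dropout omitted in this sketch)."""
    return x + conv1d_same(x, wa) * sigmoid(conv1d_same(x, wb))

def global_pool(x):
    """theta: R^(n x p) -> R^p; mean pooling over positions is an assumption."""
    return x.mean(axis=0)

rng = np.random.default_rng(0)
n, p, k = 9, 16, 3
x = rng.standard_normal((n, p))
wa = rng.standard_normal((k, p, p)) * 0.1
wb = rng.standard_normal((k, p, p)) * 0.1
h = gated_residual_block(x, wa, wb)   # same shape as x: length-preserving
feat = global_pool(h)                 # fixed-size feature vector
print(h.shape, feat.shape)            # (9, 16) (16,)
```

Because the block preserves the (n, p) shape, layers can be stacked to any depth, and the final pooling removes the dependence on the variable polypeptide length n.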
In this deep convolutional neural network model for amino acid sequences, learning the residual of each convolutional layer's output, rather than the output itself, improves the generalization of the deep convolutional residual network structure, so that a single model can achieve good accuracy and the overall process efficiency is markedly improved.
Mapping network:
the mapping network comprises multiple layers; its function is to map the feature vectors learned by the computing network into affinity values through a fully-connected network. The overall process is as follows:
1) a linear transformation is applied to the input of the mapping network (i.e., the feature vector output by the computing network), converting the mapping-network input vector x into y = Wx + b;
2) a nonlinear transformation is applied to the result of the linear transformation using a linear rectification function (ReLU): z = max(0, y);
steps 1) and 2) form one layer of the mapping network, and the affinity prediction result is output after passing through multiple layers of the mapping network.
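The mapping network described above can be sketched as a stack of linear + ReLU layers followed by a final linear output producing the scalar affinity. The final-layer form and the layer sizes are assumptions for illustration; the text only specifies the repeated linear + ReLU structure.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def mapping_network(x, layers):
    """Apply y = relu(W x + b) per hidden layer, then a final linear layer
    producing the scalar affinity (the final-layer form is an assumption)."""
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = relu(w @ x + b)
    return (w_out @ x + b_out).item()

rng = np.random.default_rng(1)
dims = [16, 8, 4, 1]   # feature vector -> two hidden layers -> affinity (toy sizes)
layers = [(rng.standard_normal((o, i)), rng.standard_normal(o))
          for i, o in zip(dims[:-1], dims[1:])]
affinity = mapping_network(rng.standard_normal(16), layers)
print(type(affinity))  # <class 'float'>
```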
For test/prediction data, the polypeptide + HLA molecule to be tested/predicted is input into the deep convolutional neural network of the trained computing network, and the feature vector of the oligomer-immune molecule is extracted according to the network parameters. The feature vector is then input into the mapping network, whose multi-layer fully-connected layers map it into an affinity value according to the mapping network's parameters. In one example, the affinity value can be the IC50.
The effect of the antigen affinity prediction model of the present invention is verified by examples below.
Examples
the algorithm uses the following software environment, which must be configured even when no graphics card resources are available:
Python:3.6.5
Anaconda:4.5.4
Tensorflow:1.3.0
Cuda:8.0
Cudnn:6.0
In use, the inventors execute the prediction script provided with the software, passing the path of the file to be predicted, which generates a prediction result for each experimental record. The file to be predicted must be arranged in the "mhc,peptide" format. The inventors have placed an example file in the code, test_batch.csv under the data folder, as follows:
the test data is the pMHC (9, 10mer) quantitative affinity data set validated by the MHC ligand assay method of the IEDB database (http://www.iedb.org/); the typing covers 113 HLA types such as A0101, A0201 and A2402. Running the script generates a prediction output file, which records, for each sample, the polypeptide, the typing, and the software-predicted IC50 value. The results generated in the IEDB data test are as follows:
the inventors compared the algorithm results of the present invention with the results of NetMHCPan on this data. IC (integrated circuit)50Converted into negative and positive polypeptides of more concern in actual development, so that the evaluation indexes used are Area Under ROC Curve (AUC) and f1 scores commonly used in machine learning classification problems, the model only uses one neural network model, and NetMHCPan uses 200 model fusion learning [ netpan3(2)]. The comparative results are as follows:
the result shows that the accuracy of using 200 models by NetMHCPan is achieved by using a single model, the execution efficiency of the model is high, on the NVIDIA p6000 video card, the time of training 11 thousands of data is about 30 minutes, and the time of predicting 2 thousands of 4 thousands of data of test files test _ batch.csv is about 1 minute; the polypeptide length distribution of the data set is 8-11 bits, and the HLA distribution of the training set and the test set is inconsistent.
This example shows that the method of the present invention has the following four advantages: 1) by designing a short peptide-HLA molecule complex characterization layer and a short-peptide deep residual network, the method improves the affinity prediction accuracy of a single neural network model; 2) by improving single-model accuracy, the method also reduces the execution time overhead of the model: on the inventors' machine (an NVIDIA P6000 graphics card), training on 116,559 records takes only half an hour; 3) the method allows variable-length input polypeptides; 4) because the invention develops a pan-specific algorithm, it can also give predictions for new typings, i.e., rare typings never seen in the training set.
Reference documents:
[1]Nielsen M,Lundegaard C,Worning P,et al.Reliable prediction of T-cell epitopes using neural networks with novel sequence representations.Protein Sci 2003;12:1007-17.
[2]Zhang GL,Ansari HR,Bradley P,et al.Machine learning competition in immunology-prediction of HLA class I binding peptides.J Immunol Meth 2011;374:1-4.
[3]Carreno BM,Magrini V,Becker-Hapak M,et al.Cancer immunotherapy.A dendritic cell vaccine increases the breadth and diversity of melanoma neoantigen-specific T cells.Science 2015;348:803-8.
[4]Walter S,Weinschenk T,Stenzl A,et al.Multipeptide immune response to cancer vaccine IMA901 after single-dose cyclophosphamide associates with longer patient survival.Nat Med 2012;18:1254-61.
[5]Yadav M,Jhunjhunwala S,Phung QT,et al.Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing.Nature 2014;515:572-76.
[6]Nielsen,M.,Andreatta,M.NetMHCpan-3.0;improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets.Genome Med 8,33(2016)doi:10.1186/s13073-016-0288-x.
[7]Liu,Z.,Cui,Y.,Xiong,Z.et al.DeepSeqPan,a novel deep convolutional neural network model for pan-specific class I HLA-peptide binding affinity prediction.Sci Rep 9,794(2019)doi:10.1038/s41598-018-37214-1.
[8]PhloyPhisut,P.,Pomputtapong,N.,Sriswasdi,S.et al.MHCSeqNet:a deep neural network model for universal MHC binding prediction.BMC Bioinformatics 20,270(2019)doi:10.1186/s12859-019-2892-4.
[9]Sidhom J-W,Pardoll D,Baras A.AI-MHC:an allele-integrated deep learning framework for improving Class I&Class II HLA-binding predictions.bioRxiv,p.318881(2018).
[10]Yan Hu,Ziqiang Wang,Hailin Hu,FangPing Wan,Lin Chen,Yuanpeng Xiong,Xiaoxia Wang,Dan Zhao,Weiren Huang,Jianyang Zeng,ACME:pan-specific peptide-MHC class I binding prediction through attention-based deep neural networks,Bioinformatics.
[11]Han,Youngmahn&Kim,Dongsup.(2017).Deep convolutional neural networks for pan-specific peptide-MHC class I binding prediction.BMC Bioinformatics.18.585.10.1186/s12859-017-1997-x.
[12]Devlin J,Chang M W,Lee K,et al.Bert:Pre-training of deep bidirectional transformers for language understanding[J].arXiv preprint arXiv:1810.04805,2018.
[13]Vaswani A,Shazeer N,Parmar N,et al.Attention is all you need[C]//Advances in neural information processing systems.2017:5998-6008.
[14]Gehring J,Auli M,Grangier D,et al.Convolutional sequence to sequence learning[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70.JMLR.org,2017:1243-1252.
[15]Dauphin Y N,Fan A,Auli M,et al.Language modeling with gated convolutional networks[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70.JMLR.org,2017:933-941.
Claims (16)
1. a method of predicting affinity of an oligomer for an immune molecule based on deep learning, the method comprising:
(1) obtaining a training data set of the oligomer-immune molecule binding structure, the training data set comprising the affinity of oligomer-immune molecule binding;
(2) for the oligomer to be detected, the immune molecule, and each group in the training data set of the oligomer-immune molecule binding structure, representing the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule as high-dimensional vectors respectively, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule binding structure;
(3) training a deep neural network by using the vector of the oligomer-immune molecule combination structure in the training data set and affinity data, and establishing an oligomer-immune molecule affinity prediction model;
(4) inputting the vector of the binding structure of the oligomer to be detected and the immune molecule to be detected into the antigen affinity prediction model, and predicting the binding affinity of the oligomer to be detected and the immune molecule using the trained deep neural network.
2. The method of claim 1, the oligomer being selected from the group consisting of: polypeptides, polysaccharides and nucleic acid haptens.
3. The method of claim 1 or 2, wherein the immune molecule is a major histocompatibility complex molecule, such as a human leukocyte antigen, including but not limited to class I and class II human leukocyte antigens.
4. The method of any one of claims 1 to 3, wherein in (2) a representation of a monomer vector reflecting the relationship between any two monomers is obtained by word embedding.
5. The method of any one of claims 1-4, wherein in (2) the input oligomer sequence is (x_1, x_2, ..., x_n), where x_i is an amino acid character, x_i ∈ {A, C, D, ..., Y}; each x_i is converted into an m-dimensional vector z_i ∈ R^m, where R is the real number space, and the sequence (x_1, x_2, ..., x_n) is converted into (z_1, z_2, ..., z_n).
6. The method of any one of claims 1-5, wherein in (2) the monomer position characterization vector is obtained by mapping the label of each amino acid position of the polypeptide into a high-dimensional continuous numerical space: the input is the unified vector (1, 2, ..., n), and each position i is mapped by the fully-connected neural network into a vector p_i ∈ R^m, where R is the real number space; i.e., the input (1, 2, ..., n) yields the output (p_1, p_2, ..., p_n).
7. The method of any one of claims 1-6, wherein in (2) the sequence of oligomers and the high dimensional vector of monomer positions of the oligomers are added to form a characterization vector for the oligomers.
8. The method of any one of claims 1-7, wherein in (2) the input of the vector of immune molecules is (y_1, y_2, ..., y_k), y_i ∈ {A, C, D, ..., Y}, and the output is (z′_1, z′_2, ..., z′_n), z′_i ∈ R^m, the k-dimensional original amino acid vector being mapped into an n × m-dimensional vector with the same format as the oligomer.
9. The method of claim 8, wherein the characterization vector of the oligomer and the vector of the immune molecule are subjected to tensor multiplication, tensor addition, or attention operations in (2).
10. The method according to any one of claims 1 to 9, wherein in (3), the deep neural network comprises a deep convolutional layer and a multi-layered fully-connected layer, the deep convolutional layer is used for extracting feature vectors of oligomer-immune molecules in the training data set, the feature vectors are input into the multi-layered fully-connected layer, the multi-layered fully-connected layer maps the feature vectors into affinity values, and network parameters are obtained through a back propagation algorithm.
11. The method of claim 10, wherein for each deep convolutional layer the input is copied into three parts: two copies are learned through different convolutional networks, and a Sigmoid normalization is applied to one of the convolution outputs, yielding two vectors A and B that are multiplied bit-wise to form the residual A×B; the third copy is added to the residual A×B.
12. The method of claim 11, wherein a global pooling layer θ: R^(n×p) → R^p, where p is the output dimension of the last layer, is added after the last convolutional layer to obtain the feature vector.
13. The method according to any one of claims 10-12, wherein in the multi-layer fully-connected layers the feature vectors obtained from the deep convolutional layers are mapped into affinity values by the fully-connected layers as follows:
1) a linear transformation is applied to the output feature vector x of the deep convolutional layers, giving the fully-connected layer input y = Wx + b;
2) a nonlinear transformation is applied to the result of the linear transformation using a linear rectification function: z = max(0, y);
steps 1) and 2) form one layer of the mapping network, and the affinity prediction result is output after passing through multiple layers of the mapping network.
14. The method according to any one of claims 10-13, wherein in (4), the trained deep neural network comprises a deep convolutional layer and a plurality of fully-connected layers, wherein the deep convolutional layer is used for extracting the feature vector of the oligomer-immune molecule to be detected, and the feature vector is input into the plurality of fully-connected layers to predict the binding affinity of the oligomer-immune molecule to be detected.
15. A method of building a model for predicting the affinity of an oligomer to an immune molecule based on deep learning, the method comprising building a model for predicting the affinity of an oligomer to an immune molecule by (1) - (3) of the method of any one of claims 1-14.
16. An antigen affinity prediction model established by the method of claim 15, the antigen affinity prediction model comprising: a data acquisition module, an input vector establishing module and a model establishing module,
the data acquisition module is used for acquiring a data set of the oligomer-immune molecule combination structure, wherein the data set comprises the affinity of oligomer-immune molecule combination;
the input vector establishing module is used for mapping the sequence of the oligomer, the monomer position of the oligomer and the immune molecule into high-dimensional vectors respectively by using a data set of the oligomer-immune molecule combination structure, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule combination structure;
the model building module is used for training a deep neural network by using the vector of the oligomer-immune molecule combination structure and affinity data to build an antigen affinity prediction model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011506001.5A CN114649054A (en) | 2020-12-18 | 2020-12-18 | Antigen affinity prediction method and system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114649054A true CN114649054A (en) | 2022-06-21 |
Family
ID=81991026
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011506001.5A Pending CN114649054A (en) | 2020-12-18 | 2020-12-18 | Antigen affinity prediction method and system based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114649054A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109671469A (en) * | 2018-12-11 | 2019-04-23 | 浙江大学 | Method for predicting the binding relationship and binding affinity between polypeptides and HLA class I molecules based on a recurrent neural network |
CN110689965A (en) * | 2019-10-10 | 2020-01-14 | 电子科技大学 | Drug target affinity prediction method based on deep learning |
WO2020046587A2 (en) * | 2018-08-20 | 2020-03-05 | Nantomics, Llc | Methods and systems for improved major histocompatibility complex (mhc)-peptide binding prediction of neoepitopes using a recurrent neural network encoder and attention weighting |
CN111951887A (en) * | 2020-07-27 | 2020-11-17 | 深圳市新合生物医疗科技有限公司 | Leukocyte antigen and polypeptide binding affinity prediction method based on deep learning |
Non-Patent Citations (2)
Title |
---|
WANG Yuanqiang; DING Yuan; XU Donghai; LIU Yuehui; ZHANG Ya; HAN Yingzi; LUO Xingyan; LIN Zhihua: "Quantitative structure-activity relationship modeling of MHC class I antigen epitopes based on amino acid structural information", Journal of Immunology (免疫学杂志), vol. 27, no. 10, 1 October 2011 (2011-10-01), pages 829 - 832 * |
GAO Jingpeng et al.: "Deep Learning: Convolutional Neural Network Technology and Practice" (《深度学习 卷积神经网络技术与实践》), 31 July 2020, China Machine Press (机械工业出版社), pages 72 - 73 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115588462A (en) * | 2022-09-15 | 2023-01-10 | 哈尔滨工业大学 | Polypeptide and major histocompatibility complex protein molecule combination prediction method based on transfer learning |
CN116836231A (en) * | 2023-06-30 | 2023-10-03 | Shenzhen University General Hospital | Neoantigen peptide of t(8;21) AML and application thereof |
CN116836231B (en) * | 2023-06-30 | 2024-02-13 | Shenzhen University General Hospital | Neoantigen peptide of t(8;21) AML and application thereof |
CN116994644A (en) * | 2023-07-28 | 2023-11-03 | 天津大学 | Medicine target affinity prediction method based on pre-training model |
CN116994644B (en) * | 2023-07-28 | 2024-02-02 | 天津大学 | Medicine target affinity prediction method based on pre-training model |
CN117095825A (en) * | 2023-10-20 | 2023-11-21 | 鲁东大学 | Human immune state prediction method based on multi-instance learning |
CN117095825B (en) * | 2023-10-20 | 2024-01-05 | 鲁东大学 | Human immune state prediction method based on multi-instance learning |
CN118016158A (en) * | 2024-02-05 | 2024-05-10 | 常州大学 | TCR-epitope combination prediction method and system based on transfer learning |
CN118248218A (en) * | 2024-05-30 | 2024-06-25 | 北京航空航天大学杭州创新研究院 | Method for developing high-affinity capture antibody EpCAM based on AI algorithm |
CN118248218B (en) * | 2024-05-30 | 2024-07-30 | 北京航空航天大学杭州创新研究院 | Method for developing high-affinity capture antibody EpCAM based on AI algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114649054A (en) | Antigen affinity prediction method and system based on deep learning | |
CN109671469B (en) | Method for predicting binding relationship and binding affinity between polypeptides and HLA class I molecules based on a recurrent neural network | |
KR102607567B1 (en) | GAN-CNN for MHC peptide binding prediction | |
Tomar et al. | Immunoinformatics: an integrated scenario | |
Wu et al. | TCR-BERT: learning the grammar of T-cell receptors for flexible antigen-binding analyses | |
Dasari et al. | Explainable deep neural networks for novel viral genome prediction | |
Sunny et al. | Protein–protein docking: Past, present, and future | |
CN113762417B (en) | Method for enhancing HLA antigen presentation prediction system based on deep migration | |
Fu et al. | An overview of bioinformatics tools and resources in allergy | |
Doneva et al. | Predicting immunogenicity risk in biopharmaceuticals | |
Guo et al. | A deep convolutional neural network to improve the prediction of protein secondary structure | |
Zheng et al. | B-Cell Epitope Predictions Using Computational Methods | |
Attique et al. | DeepBCE: evaluation of deep learning models for identification of immunogenic B-cell epitopes | |
Zhang et al. | PiTE: TCR-epitope binding affinity prediction pipeline using Transformer-based sequence encoder | |
Huang et al. | Prediction of linear B-cell epitopes of hepatitis C virus for vaccine development | |
CN116130005B (en) | Tandem design method and device for multi-epitope vaccine, equipment and storage medium | |
Zhang et al. | EACVP: An ESM-2 LM Framework Combined CNN and CBAM Attention to Predict Anti-coronavirus Peptides | |
Saxena et al. | OnionMHC: A deep learning model for peptide-HLA-A*02:01 binding predictions using both structure and sequence feature sets |
Bi et al. | An attention based bidirectional LSTM method to predict the binding of TCR and epitope | |
Marzella et al. | Improving generalizability for MHC-I binding peptide predictions through geometric deep learning | |
Zhang et al. | TNFIPs-Net: A deep learning model based on multi-feature fusion for prediction of TNF-α inducing epitopes | |
Xue et al. | FeatureDock: Protein-Ligand Docking Guided by Physicochemical Feature-Based Local Environment Learning using Transformer | |
Shang et al. | Pretraining Transformers for TCR-pMHC Binding Prediction | |
Gupta et al. | Comparative analysis of epitope predictions: proposed library of putative vaccine candidates for HIV | |
Xie et al. | MHC2NNZ: A novel peptide binding prediction approach for HLA DQ molecules |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||