CN114649054A - Antigen affinity prediction method and system based on deep learning - Google Patents
- Publication number
- CN114649054A (application CN202011506001.5A)
- Authority
- CN
- China
- Prior art keywords
- oligomer
- vector
- affinity
- immune molecule
- immune
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G16B15/30—Drug targeting using structural data; Docking or binding prediction (G16B—BIOINFORMATICS; G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures)
- G06N3/045—Combinations of networks (G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention belongs to the field of artificial intelligence and discloses a deep learning-based antigen affinity prediction method and system. The method comprises the following steps: (1) obtaining a training data set of oligomer-immune molecule binding structures, including their binding affinities; (2) for each oligomer-immune molecule pair, representing the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule as high-dimensional vectors, and fusing the three high-dimensional vectors into a single vector for the oligomer-immune molecule binding structure; (3) training a deep neural network with the binding-structure vectors and affinity data in the training data set, thereby establishing an oligomer-immune molecule affinity prediction model; (4) representing the binding structure of an oligomer and an immune molecule to be tested as a vector, inputting it into the antigen affinity prediction model, and predicting their binding affinity with the trained deep neural network.
Description
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to an antigen affinity prediction method and system based on deep learning.
Background
An epitope, also known as an antigenic determinant (AD), is the specific chemical group in an antigen molecule that determines the antigen's specificity. An antigen binds, through its epitopes, to the corresponding antigen receptors on the surface of lymphocytes, thereby activating the lymphocytes and triggering an immune response. The nature, number, and spatial configuration of the epitopes determine the specificity of the antigen. The size of an epitope is compatible with the antigen-binding site of the corresponding antibody. Generally, a polypeptide epitope contains 5-6 amino acid residues; a polysaccharide epitope contains 5-7 monosaccharides; a nucleic acid hapten epitope contains 6-8 nucleotides. The specificity of an epitope is determined by all of the residues that compose it, but some residues contribute more to antibody binding than others; these are referred to as immunodominant groups.
T cell epitopes are immunogenic polypeptide fragments that must be processed by antigen-presenting cells into small peptides and bound to Major Histocompatibility Complex (MHC) molecules before they can be recognized by T cells. T cell antigen receptors (TCRs) recognize only small polypeptides of roughly 10-20 amino acids. Such epitopes consist of sequentially linked amino acids and occur mainly in the hydrophobic regions of the antigen molecule; they are called linear or sequential epitopes. Because T cells recognize only processed epitopes, they generally do not recognize the conformational epitopes of native antigens.
Methods for epitope analysis are varied, including chemical cleavage, enzymatic digestion, nuclear magnetic resonance (NMR) spectroscopy, surface plasmon resonance (SPR), hybrid peptide cleavage, polypeptide library construction, and theoretical prediction. With the development of computer technology, and in particular the spread of artificial intelligence, screening and predicting epitopes directly from biological big data has become the epitope analysis method with the highest throughput, lowest cost, and shortest turnaround.
For T cell epitopes, the conventional theoretical-prediction scheme learns from data on MHC molecule-antigen peptide binding, extracting key features such as the antigen peptide sequence, length, expression level, and thermostability to build a machine learning model, which, combined with high-throughput sequencing technology, predicts potential antigen peptides of unknown proteins or genes. Several companies and institutions at home and abroad have released antigen prediction software, such as netMHCpan from the Technical University of Denmark, EDGE from Gritstone Oncology, and EPIP from Genoimmune Biotechnology (Wuhan).
With the continuous development and mutual promotion of modern bioinformatics, molecular biology and molecular immunology, epitope research and application thereof have made great progress and show great application potential. The application of the antigen epitope is mainly embodied in three aspects, namely disease diagnosis, vaccine development and disease treatment.
In disease diagnosis, the efficiency of epitope-based diagnostic methods hinges on sensitivity and specificity. The epitope is the basic unit that stimulates an immune reaction: a single epitope stimulates a specific single immune reaction, whereas multiple epitopes often stimulate mixed immune reactions, producing nonspecific mixed antibodies, sensitized lymphocytes, or effectors. Research on epitope peptides in disease diagnosis therefore focuses on selecting specific epitope peptides to achieve better diagnostic efficiency.
In vaccine development, each conventional vaccine contains a large number of epitopes, including protective, inhibitory, and null epitopes. A vaccine achieves the desired protective effect only when the immune response induced by the protective epitopes dominates. Accordingly, in antiviral vaccine research, a key technical difficulty is obtaining protein epitopes that are strongly immunogenic, highly sequence-conserved, and critical to viral invasion.
In disease treatment, the immune response induced by epitopes is highly specific and targeted, and can be used in the immunotherapy of tumors, infectious diseases, and autoimmune diseases. Immunotherapy, which activates a cytotoxic response against an antigen by mobilizing the patient's own immune system, has proven an effective strategy in recent years. This strategy exploits the many antigens on the cell surface formed from mutant proteins cleaved by the intracellular proteasome. These polypeptides bind to HLA molecules, forming polypeptide-HLA complexes that are presented to T cell receptors (TCRs). If a TCR recognizes the polypeptide-HLA complex, cytotoxic T lymphocytes (CTLs) can be activated. CTLs are a subset of leukocytes: specialized T cells that secrete various cytokines to participate in immunity, kill certain viruses, tumor cells, and other antigenic material, and, together with natural killer cells, form an important line of the body's antiviral and antitumor immune defense. The first step in cytotoxic T lymphocyte therapy is to predict the binding affinity of an antigen for an HLA molecule. Neoantigen therapy, now developing rapidly in the tumor therapy field, is a good example of epitopes applied to disease treatment; companies such as BioNTech, Neon Therapeutics, and Gritstone Oncology have taken neoantigen-based treatment of malignant tumors into clinical trial stages.
At present, there are four classes of methods for predicting the affinity of an antigen for an HLA molecule: structure-based methods, machine learning-based methods, position weight matrix (PSSM)-based methods, and combined methods. Machine learning-based approaches learn a high-dimensional classification surface from known binding and non-binding peptides in order to predict polypeptide binding affinity. Machine learning methods can accurately predict polypeptide affinity for specific HLA alleles, such as HLA-A*0201, HLA-A*0101, and HLA-B*0702 [1,2], and are therefore used in many studies [3-5]. The most prevalent of these algorithms is the pan-specific affinity algorithm, which takes both the amino acid sequence of the HLA molecule and that of the polypeptide as input and produces an affinity prediction. The most common such algorithm in industry is NetMHCPan [6]. It characterizes the HLA molecule by a 34-residue pseudo sequence (Pseudo Sequence), preprocesses the pseudo sequence and the short polypeptide sequence (Peptide Sequence) in parallel, and feeds the result as input features to a back-propagation (BP) neural network that outputs the polypeptide-HLA affinity prediction. This approach models each polypeptide-HLA pair as a unique input sequence whose mapping to the affinity value can be learned by a single model, and is called a pan-specific training strategy (Pan Model). The NetMHCPan work shows that a single BP model does not perform well; to obtain the best results, NetMHCPan typically aggregates and learns a large number of models simultaneously.
In recent years, several deep learning-based affinity prediction models have appeared in the field. A representative algorithm is DeepSeqPan [7], which learns features of the polypeptide and the HLA molecule through two independent convolutional networks and predicts affinity through a neural network after combining the features. MHCSeqNet [8] follows a similar idea but uses gated recurrent units (GRUs) instead of convolutional networks in order to learn variable-length polypeptide data, while AI-MHC [9] handles polypeptides of different lengths by padding, enabling variable-length input with more efficient convolutional networks. ACME [10] is similar but, unlike DeepSeqPan, concatenates the polypeptide characterization vector at each HLA computation layer. Beyond such natural language processing techniques, ConvMHC [11] models affinity with an image-processing idea: the positional physicochemical data of the polypeptide and the HLA molecule are assembled into several 2-dimensional data matrices, and a 2-dimensional convolutional neural network predicts the affinity.
The main shortcomings of the prior art are: 1) the lack of an effective way to model the expressed polypeptide-HLA complex. In the above methods, the sequence features of the polypeptide and the HLA molecule are learned by independent neural networks and then directly concatenated; the structure of the complex formed once the polypeptide is presented by the HLA type is not considered in the model design. 2) Polypeptide sequences are short, so directly applying a deep network performs poorly, and related work mostly uses shallow networks: [6] aggregates a large number of traditional neural networks to compensate for the weak learning capacity of a single model, and [7,9,10] all learn polypeptide features with shallow convolutional networks. 3) Some methods do not account for polypeptide length diversity: although [11] can apply a deep network to polypeptide learning, its design handles only fixed lengths and cannot accommodate variable lengths; the same problem occurs in [6].
Accordingly, an improved deep learning-based affinity prediction model is needed.
Disclosure of Invention
Based on the problems in the prior art, the invention aims to provide an improved affinity prediction model based on deep learning and a construction method thereof.
Accordingly, in a first aspect, the present invention provides a deep learning-based antigen affinity prediction method, the method comprising:
(1) obtaining a training data set of oligomer-immune molecule binding structures, the training data set comprising the affinity of oligomer-immune molecule binding;
(2) for each oligomer-immune molecule binding structure in the training data set, representing the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule as high-dimensional vectors respectively, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule binding structure;
(3) training a deep neural network with the vectors of the oligomer-immune molecule binding structures in the training data set and the affinity data, thereby establishing an oligomer-immune molecule affinity prediction model;
(4) representing the binding structure of an oligomer and an immune molecule to be tested as a vector, inputting it into the antigen affinity prediction model, and predicting the binding affinity of the oligomer and the immune molecule with the trained deep neural network.
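The four steps above can be sketched end to end as follows. This is a toy illustration under stated assumptions: the data, the 34-residue placeholder sequence, and the fusion by element-wise product are hypothetical, and a random numpy table stands in for the learned embedding and network.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues
EMB_DIM = 8                             # illustrative embedding dimension m
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(AMINO_ACIDS), EMB_DIM))  # would be learned

def encode(seq):
    # map each amino-acid character to its embedding row -> (len, m)
    return np.stack([emb[AMINO_ACIDS.index(a)] for a in seq])

def fuse(peptide, immune_seq):
    # one simple fusion: element-wise product of mean-pooled embeddings
    return encode(peptide).mean(axis=0) * encode(immune_seq).mean(axis=0)

# (1) toy training set: (oligomer, immune-molecule sequence, affinity)
train = [("SIINFEKL", "Y" * 34, 0.85), ("GILGFVFTL", "Y" * 34, 0.60)]
# (2) fused binding-structure vectors and targets; a deep network would
# then be trained on (X, y) in step (3) and queried in step (4)
X = np.stack([fuse(p, h) for p, h, _ in train])
y = np.array([a for _, _, a in train])
```

The per-component vectorization and the actual network are detailed in the embodiments below.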
In one embodiment, the oligomer is selected from the group consisting of: polypeptides, polysaccharides, and nucleic acid haptens.
In one embodiment, the oligomers are sequences of variable length, represented by variable-length vectors.
In one embodiment, the immune molecule is a major histocompatibility complex molecule, such as a human leukocyte antigen, including but not limited to class I and class II human leukocyte antigens.
In one embodiment, the affinity of oligomer-immune molecule binding is expressed as the IC50 value of the oligomer-immune molecule binding.
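A common convention in the MHC-affinity literature (used, for example, by the NetMHC family of tools) maps IC50 values in nM onto a 0-1 training score via 1 - log(IC50)/log(50000). This is offered as an assumption, not as the patent's own formula:

```python
import math

def ic50_to_score(ic50_nm, cap=50000.0):
    """Map an IC50 value in nM to a 0-1 score; smaller IC50 means
    stronger binding, so the score rises as IC50 falls."""
    s = 1.0 - math.log(ic50_nm) / math.log(cap)
    return max(0.0, min(1.0, s))
```

Under this convention an IC50 of 1 nM scores 1.0 and the 50,000 nM cap scores 0.0.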
In one embodiment, in (2), a monomer vector representation that reflects the relationship between any two monomers is obtained by word embedding.
In one embodiment, the input oligomer sequence in (2) is (x_1, x_2, ..., x_n), where each x_i is an amino acid character, x_i ∈ {A, C, D, ..., Y}; the mapping converts each x_i into an m-dimensional vector z_i ∈ R^m, R being the real number space, so the sequence (x_1, x_2, ..., x_n) is converted into (z_1, z_2, ..., z_n).
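A minimal sketch of this character-to-vector mapping, with a random numpy table standing in for the embedding (which would be learned during training):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
m = 16                                   # embedding dimension
rng = np.random.default_rng(42)
embedding = rng.normal(size=(len(AMINO_ACIDS), m))  # learned in practice

def embed_sequence(seq):
    # (x_1, ..., x_n) -> (z_1, ..., z_n), each z_i in R^m
    return embedding[[AMINO_ACIDS.index(a) for a in seq]]

Z = embed_sequence("SIINFEKL")           # n = 8 residues -> (8, m)
```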
In one embodiment, the monomer-position characterization vector in (2) is obtained by mapping the label of each amino acid position of the polypeptide into a high-dimensional continuous numerical space: the input is unified into the vector (1, 2, ..., n), and each position i is mapped by a fully connected neural network into a vector p_i ∈ R^m, R being the real number space; that is, the input (1, 2, ..., n) yields the output (p_1, p_2, ..., p_n).
In one embodiment, the high-dimensional vectors of the oligomer sequence and of the oligomer's monomer positions in (2) are added to form the characterization vector of the oligomer.
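The additive fusion of sequence vectors (z_1..z_n) and position vectors (p_1..p_n) can be sketched as follows, with random arrays standing in for the learned mappings:

```python
import numpy as np

n, m = 9, 16
rng = np.random.default_rng(0)
seq_vecs = rng.normal(size=(n, m))   # z_1..z_n from the sequence mapping
pos_vecs = rng.normal(size=(n, m))   # p_1..p_n from the position mapping

# element-wise sum: the oligomer characterization is still an n x m array
oligo_repr = seq_vecs + pos_vecs
```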
In one embodiment, the input for the immune molecule vector in (2) is (y_1, y_2, ..., y_k), y_i ∈ {A, C, D, ..., Y}, and the output is (z'_1, z'_2, ..., z'_n), z'_i ∈ R^m; that is, the k-length original amino acid sequence is mapped to an n × m vector in the same format as the oligomer representation.
In one embodiment, the characterization vector of the oligomer and the vector of the immune molecule in (2) are fused by tensor multiplication, tensor addition, or attention operations.
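The three fusion options can be sketched as follows. The attention variant shown is a generic scaled dot-product form, an assumption for illustration rather than the patent's exact operation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 9, 16
oligo = rng.normal(size=(n, m))    # oligomer characterization vectors
immune = rng.normal(size=(n, m))   # immune-molecule vectors, same n x m

fused_mul = oligo * immune         # tensor (element-wise) multiplication
fused_add = oligo + immune         # tensor addition

# scaled dot-product attention: the oligomer attends to the immune molecule
scores = oligo @ immune.T / np.sqrt(m)               # (n, n) similarity
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
fused_attn = weights @ immune                        # (n, m) fused output
```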
In one embodiment, in (3), the deep neural network comprises deep convolutional layers and multiple fully connected layers: the deep convolutional layers extract feature vectors of the oligomer-immune molecule pairs in the training data set, these feature vectors are input into the fully connected layers, the fully connected layers map them to affinity values, and the network parameters are obtained by the back-propagation algorithm.
In one embodiment, for each deep convolutional layer, the input is copied three times: two copies are learned by different convolutional networks, with a Sigmoid normalization applied to one of the convolution outputs, yielding two vectors A and B; these are multiplied element-wise to form the residual A·B, and the third copy is added to the residual A·B.
In one embodiment, a global pooling layer θ: R^(n×p) → R^p is added after the last convolutional layer, where p is the output dimension of the last layer, to obtain the feature vector.
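A sketch of the gated residual block and the global pooling layer. For brevity, kernel-size-1 linear maps stand in for the two convolutional networks and mean pooling stands in for θ; both are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_block(x, Wa, Wb):
    """x: (n, p). A = conv_a(x), B = sigmoid(conv_b(x)); the residual is
    the element-wise product A*B, and the third copy of the input is
    added to it: output = x + A*B."""
    A = x @ Wa
    B = sigmoid(x @ Wb)
    return x + A * B

def global_pool(x):
    # theta: R^{n x p} -> R^p (mean over sequence positions here)
    return x.mean(axis=0)

rng = np.random.default_rng(2)
n, p = 9, 16
x = rng.normal(size=(n, p))
h = gated_residual_block(x, rng.normal(size=(p, p)), rng.normal(size=(p, p)))
feat = global_pool(h)                 # feature vector in R^p
```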
In one embodiment, in the fully connected layers, the feature vector obtained from the deep convolutional layers is mapped to an affinity value as follows:
1) a linear transformation is applied to the output feature vector x of the deep convolutional layers to give the fully connected layer's input vector y = Wx + b;
2) a nonlinear transformation is applied to the linear result using a rectified linear unit (ReLU) function.
Steps 1) and 2) constitute one mapping layer; the affinity prediction result is output after several such mapping layers.
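The mapping-layer stack can be sketched as follows, with random weights standing in for trained parameters and a linear final layer so the output is an unconstrained affinity value:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def mlp_affinity(x, layers):
    """layers: list of (W, b) pairs. Each hidden mapping layer computes
    ReLU(W x + b); the last layer is kept linear for the affinity value."""
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]
    return float(W @ x + b)

rng = np.random.default_rng(3)
p = 16                                   # convolutional feature dimension
layers = [(rng.normal(size=(32, p)), np.zeros(32)),
          (rng.normal(size=(1, 32)), np.zeros(1))]
score = mlp_affinity(rng.normal(size=p), layers)
```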
In one embodiment, in (4), the trained deep neural network comprises deep convolutional layers and multiple fully connected layers: the deep convolutional layers extract the feature vector of the oligomer-immune molecule pair to be tested, which is input into the fully connected layers to predict the binding affinity of the oligomer and immune molecule to be tested.
In a second aspect, the present invention provides a method for building a deep learning-based antigen affinity prediction model, the method comprising building an oligomer-immune molecule affinity prediction model through steps (1) to (3) of the method of the first aspect of the invention.
In a third aspect, the present invention provides an antigen affinity prediction system established using the method of the second aspect of the invention, the system comprising a data acquisition module, an input vector establishing module, and a model building module, wherein:
the data acquisition module is used for acquiring a data set of oligomer-immune molecule binding structures, the data set comprising the affinity of oligomer-immune molecule binding;
the input vector establishing module is used for mapping the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule into high-dimensional vectors respectively, using the data set of oligomer-immune molecule binding structures, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule binding structure;
the model building module is used for training a deep neural network with the vectors of the oligomer-immune molecule binding structures and the affinity data to establish the antigen affinity prediction model.
In one embodiment, in the model building module, the deep neural network comprises deep convolutional layers and multiple fully connected layers: the deep convolutional layers extract feature vectors of the oligomer-immune molecule pairs in the training data set, these feature vectors are input into the fully connected layers, the fully connected layers map them to affinity values, and the network parameters are obtained by the back-propagation algorithm.
In one embodiment, for each deep convolutional layer, the input is copied three times: two copies are learned by different convolutional networks, with a Sigmoid normalization applied to one of the convolution outputs, yielding two vectors A and B; these are multiplied element-wise to form the residual A·B, and the third copy is added to the residual A·B.
In one embodiment, a global pooling layer θ: R^(n×p) → R^p, where p is the output dimension of the last layer, is added after the last convolutional layer to obtain the feature vector.
In one embodiment, in the multi-layer fully-connected layers, the feature vectors obtained from the deep convolutional layers are mapped into affinity values by the fully-connected layers as follows:
1) a linear transformation is applied to the output feature vector x of the deep convolutional layers, giving the fully-connected layer input y = Wx + b;
2) a nonlinear transformation is applied to the result of the linear transformation using a linear rectification function: z = max(0, y);
steps 1) and 2) form one layer of the mapping network, and the affinity prediction result is output after passing through multiple layers of the mapping network.
According to the invention, the oligomer-immune molecule complex structure is modeled with deep learning technology, enhancing the expressiveness of the deep learning model; a deep model for short oligomer sequences is realized by improving the deep learning technology, increasing the complexity and learning capacity of a single neural network.
Drawings
Fig. 1 shows the overall structure of an antigen affinity prediction model according to one embodiment of the present invention.
Fig. 2 shows an input network structure of an antigen affinity prediction model according to an embodiment of the present invention.
FIG. 3 shows a computational network structure of an antigen affinity prediction model according to one embodiment of the invention.
Detailed Description
The two most important design steps in deep learning technology are the design of the input feature expression layer (Embedding Layer) and the design of the deep network structure of the calculation layer. The invention addresses the expression-layer design of the complex and realizes a deep design of the polypeptide calculation layer by effective means, improving the learning capacity of the model so that it requires less model-fusion learning than the netMHCPan scheme commonly used in the industry, thereby improving process efficiency; the invention is not limited to fixed-length sequences and therefore has wider application scenarios.
In the present invention, the deep residual network model is applicable to the correlation prediction of affinity, antigen presentation and immunogenicity models of all polypeptide sequences; the method is also suitable for other short gene sequence scenes, including but not limited to MHC molecule sequence analysis, DNA high-throughput sequencing fragment analysis and the like. The length of the polypeptide sequence is typically 9 or 10 amino acids, but other lengths are also possible.
In the present invention, tensor addition and attention operations can also be used to integrate the vector of polypeptides and the vector of immune molecules into one vector. The attention operation changes the tensor product of the polypeptide and the typing into the inner product operation, then the nonlinear normalization is carried out, and the product operation of the attention weight matrix and the polypeptide/typing tensor is calculated. Tensor addition, however, is less effective than tensor multiplication; attention operations are common in natural language processing models, such as the BERT [12] of Google. The polypeptide and HLA molecule sequence can also be stereoscopically combined, or the polypeptide and the HLA molecule are spliced.
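The attention operation described above (inner product of the polypeptide and the typing, nonlinear normalization, then multiplication of the attention weight matrix with the typing tensor) can be illustrated with a minimal numpy sketch. This assumes standard dot-product attention with softmax normalization; the function name and dimensions are illustrative, not the patented operation.

```python
import numpy as np

def attention_fuse(peptide, hla):
    """Fuse peptide (n x m) and HLA (k x m) embeddings via dot-product attention.

    Replaces the tensor product with an inner product, applies a softmax
    normalization, then multiplies the attention weights back onto the
    HLA representation (illustrative sketch only).
    """
    scores = peptide @ hla.T                       # (n, k) inner products
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over HLA positions
    return weights @ hla                           # (n, m) fused representation

peptide = np.random.rand(9, 16)   # 9-mer peptide, 16-dim embeddings (toy sizes)
hla = np.random.rand(34, 16)      # 34-position HLA pseudo-sequence
fused = attention_fuse(peptide, hla)
print(fused.shape)                # (9, 16)
```

Each peptide position ends up as a convex combination of HLA position embeddings, which is how attention mixes the two sequences into one vector.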
The method for establishing the antigen affinity prediction model based on deep learning comprises the following steps: (1) obtaining a training data set of the oligomer-immune molecule binding structure, the training data set comprising the affinity of oligomer-immune molecule binding; (2) for each group in the training data set of the oligomer-immune molecule binding structure, representing the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule as high-dimensional vectors respectively, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule binding structure; (3) training a deep neural network with the vectors of the oligomer-immune molecule binding structures in the training data set and the affinity data to establish an oligomer-immune molecule affinity prediction model.
Preferably, the deep neural network includes a deep convolutional layer and a multi-layer fully-connected layer, the deep convolutional layer is used for extracting feature vectors of oligomer-immune molecules in a training data set, the feature vectors are input into the multi-layer fully-connected layer, the multi-layer fully-connected layer maps the feature vectors into affinity values, and network parameters are obtained through a back propagation algorithm.
In one example, in the multi-layer convolutional layers, each convolutional neural network learns the residual of the previous layer: denoting the i-th layer input as X and the i-th layer convolutional network as conv_i, the output of the i-th layer is X + conv_i(X). The input is copied into three parts: two copies are learned through different convolutional networks, and a Sigmoid normalization operation is applied to one of the convolution outputs, yielding two vectors A and B that are multiplied bit-wise to form the residual (A×B); the third copy is added to the residual A×B.
In one example, in the multi-layer fully-connected layers, the feature vectors learned by the computing network are mapped into affinity values by the fully-connected network as follows:
1) a linear transformation is applied to the input of the mapping network (i.e., the output of the computing network), converting the mapping-layer input vector x into y = Wx + b;
2) a nonlinear transformation is applied to the result of the linear transformation using a linear rectification function (ReLU): z = max(0, y);
steps 1) and 2) form one layer of the mapping network, and the affinity prediction result is output after multiple layers.
The invention is exemplified below by the modeling and application of polypeptide-HLA molecules. The polypeptide-HLA molecule dataset used in the present invention is derived from IEDB published data.
In the invention, a Pan-specific Model (Pan Model) must map the polypeptide and HLA molecule sequence text into a unique vector so that the neural network model can establish a functional mapping between the vector and the prediction index; in addition, the mapping process must model the binding structure between the polypeptide and the HLA molecule.
The deep neural network model of the invention consists of three parts, referring to FIG. 1: the input network, the computing network, and the mapping network together form the complete neural network. During model training, the network parameters are first obtained through a training process: vector data of oligomer-immune molecule binding structures with known affinity are used to learn the network parameters via the back propagation algorithm. During testing/prediction, the vector of the oligomer-immune molecule binding structure to be predicted is fed into the input network, and the predicted affinity value is output through the computing network and the mapping network. The specific functions are described below.
Input network:
the input network maps the polypeptide and the HLA molecule into high-dimensional vectors through a neural network, models the polypeptide-HLA molecule binding structure, and realizes pan-specific input for the model. Different polypeptide + HLA molecule pairs are mapped into different vectors, and the vectors must be high-dimensional to fully capture the subtle differences between input sequences. The inventors input the HLA molecule pseudo-sequence, which is more accurate.
The specific design is shown in FIG. 2. Each amino acid of the polypeptide sequence and the HLA molecule sequence is mapped into a higher-dimensional vector A by a fully-connected network; the polypeptide positions are labeled 1 to the polypeptide length, and the position sequence is mapped to the same dimension through another network; the vectors are then expanded into the tensors shown in the figure, and tensor operations produce the final input tensor mixing the polypeptide and HLA molecule information. FIG. 2 illustrates a polypeptide length of 15 and a mapping output dimension of 128. The 4 upper and 3 lower fully-connected circles in FIG. 2 represent the mapping of each amino acid from the character space A-Z into a high-dimensional continuous numerical space, a commonly used fully-connected layer representation, with the upper circles denoting the fully-connected input dimension (the amino acid character space) and the lower circles the output dimension (the higher-dimensional vector). Taking a length-15 polypeptide as an example, each position is mapped from an amino acid character into a continuous high-dimensional vector, again denoted m here. The 15 × 128 × 1 shape results from expanding the vectors into tensors so that the tensor addition and tensor multiplication in the figure can be performed. The principle of the tensor operations is that the last two dimensions are treated as a matrix and the first dimension as a Batch Size: adding two 15 × 128 × 1 tensors still yields 15 × 128 × 1, and multiplying by a 15 × 1 × 128 tensor yields 15 × 128 × 128.
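The tensor shape rules described above can be checked directly with numpy, whose broadcasting and batched matrix multiplication follow the same convention (last two dimensions act as a matrix, the first as a batch). A minimal sketch:

```python
import numpy as np

# Last two dims act as a matrix; the first dim (15) behaves like a batch size.
a = np.ones((15, 128, 1))
b = np.ones((15, 128, 1))
c = np.ones((15, 1, 128))

added = a + b        # element-wise tensor addition: still (15, 128, 1)
multiplied = a @ c   # batched matrix product: (15, 128, 128)

print(added.shape)       # (15, 128, 1)
print(multiplied.shape)  # (15, 128, 128)
```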
The word embedding (Word Embedding) techniques of language models [12-15] can be applied to the text sequences here, mapping the original input data into three high-dimensional vectors through neural networks. In the following, the polypeptide length is n, the HLA molecule length is k, and the output is mapped to m dimensions as an example:
1) Polypeptide characterization vector: the polypeptide is treated as a variable-length vector by a fully-connected neural network, and each amino acid is mapped from the A-Z character space into a high-dimensional continuous numerical space. The input polypeptide sequence is (x_1, x_2, ..., x_n), where x_i is an amino acid character, x_i ∈ {A, C, D, ..., Y}; the mapping converts each x_i into an m-dimensional vector z_i ∈ R^m, where m is a hyperparameter to be tuned by cross-validation and R is the real number space, so the sequence (x_1, x_2, ..., x_n) is converted into (z_1, z_2, ..., z_n);
2) Polypeptide position characterization vector: the label of each amino acid position of the polypeptide is mapped into a high-dimensional (e.g., m-dimensional) continuous numerical space by a fully-connected neural network: the input is the unified vector (1, 2, ..., n), and each position i is mapped by the fully-connected neural network into a vector p_i ∈ R^m, where R is the real number space; i.e., the input (1, 2, ..., n) yields the output (p_1, p_2, ..., p_n);
3) HLA molecule characterization vector: the HLA molecule corresponding to each polypeptide must also be mapped into a high-dimensional continuous numerical space. Since each position of both the HLA molecule and the polypeptide is an amino acid with the same meaning, the neural network used for this mapping is the same as that for the polypeptide characterization vector. The input HLA molecule vector is (y_1, y_2, ..., y_k), y_i ∈ {A, C, D, ..., Y}, and the output is (z′_1, z′_2, ..., z′_n), z′_i ∈ R^m; the k-dimensional original amino acid vector is mapped into an n × m-dimensional vector with the same format as the polypeptide. The process is as follows:
a) through the neural network of 1), (y_1, y_2, ..., y_k) is mapped to (y′_1, y′_2, ..., y′_k), y′_j ∈ R^m;
Two mechanisms can be used to fuse the three vectors. One is to add them, as in traditional language models. However, unlike natural language, genomic sequences, polypeptides and HLA molecules have a spatial binding structure that must be captured by a more complex mechanism, so the following mechanism can be used instead:
1) compute the polypeptide characterization vector A = (z_1 + p_1, z_2 + p_2, ..., z_n + p_n);
2) for each element a_i of A, perform a tensor expansion (Kronecker product): R^m → R^(m×1);
3) denote the HLA molecule characterization vector as B; for each element b_i of B, perform a tensor expansion (Kronecker product): R^m → R^(1×m);
5) a flattening (Flatten) operation is performed: for each of the n positions, the result of formula (1) is converted into an m²-dimensional vector (a_i1 b_i1, ..., a_i1 b_im, ..., a_im b_i1, ..., a_im b_im). This vector is the final input to the computational network.
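The fusion steps above can be sketched in numpy as follows; the variable names mirror the text (A = Z + P, then a per-position outer/Kronecker product with B, flattened to m² dimensions). This is an illustrative sketch under toy dimensions, not the patented implementation:

```python
import numpy as np

def fuse_input(z, p, b):
    """Fuse peptide embeddings z (n x m), position embeddings p (n x m) and
    HLA embeddings b (n x m) into the n x m^2 computational-network input."""
    a = z + p                              # step 1): characterization vector A
    # steps 2)-5): per-position outer (Kronecker) product a_i b_i^T, flattened
    outer = a[:, :, None] * b[:, None, :]  # (n, m, m)
    return outer.reshape(a.shape[0], -1)   # (n, m*m)

n, m = 9, 8  # toy sizes for illustration
z = np.random.rand(n, m)
p = np.random.rand(n, m)
b = np.random.rand(n, m)
x = fuse_input(z, p, b)
print(x.shape)  # (9, 64)
```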
The polypeptide and the HLA molecule are mapped to the same high-dimensional space through a neural network, and the stereo structure mapping of the sequences of the polypeptide and the HLA molecule is generated through tensor product, so that the input is provided for a high-efficiency pan-specific affinity model.
For training data, the learning process must be completed with the computing network and the mapping network combined as a whole. The vectors of the oligomer-immune molecule binding structures and the affinity values are input into the computing network and the mapping network for training, through the deep convolutional neural network of the computing network and the fully-connected network of the mapping network, thereby learning the parameters of both networks.
Computing network:
the computing network comprises multi-layer convolutional neural networks; its function is to effectively extract short-sequence features through a deep convolutional neural network to obtain feature vectors. Because the input polypeptides and HLA molecules are very short (polypeptides corresponding to HLA class I are typically under 15 positions; HLA class II, typically under 26), the low-resolution data can cause a deep convolutional neural network to overfit, making it less effective. To increase the complexity of the deep convolutional neural network and thereby enhance learning, residual neural networks are used in combination with Gated Linear Unit activation (see [15]). This significantly enhances the generalization ability of deep convolutional neural networks (for example, at least 5 layers, preferably 10 or more, e.g., more than 15 or even more than 20 layers), increases complexity, and strengthens the algorithm's learning capacity; the specific design is shown in FIG. 3. The input is copied into three parts: two copies are learned through different convolutional networks, and a Sigmoid normalization operation is applied to one of the convolution outputs, yielding two vectors A and B that are multiplied bit-wise to form the residual (A×B in the figure); the third copy is added to the residual A×B. With this residual network design (FIG. 3), each convolutional layer learns the residual of the previous layer: denoting the i-th layer input as X and the i-th layer convolutional network as conv_i, the output of the i-th layer is X + conv_i(X).
By learning residuals rather than learning the input directly, the overfitting caused by adding layers to a convolutional neural network is effectively reduced. The final computing network comprises 10 convolutional layers and uses gated linear units as the activation mechanism. The overall process is as follows:
1) the input X = (X_1, X_2, ..., X_n) is fed into two convolutional neural networks conv_A and conv_B, whose output dimensions match the input X;
2) the two convolutional outputs conv_A(X) and conv_B(X) are computed; a bit-wise Sigmoid mapping σ: R → [0, 1] is applied to the output of one of the networks, e.g., conv_B;
3) conv_A(X)_i and σ(conv_B(X)_i) are multiplied bit-wise to obtain (conv_A(X)_1 × σ(conv_B(X)_1), conv_A(X)_2 × σ(conv_B(X)_2), ..., conv_A(X)_n × σ(conv_B(X)_n)), which is added as a vector to X = (X_1, X_2, ..., X_n) to give the output of the computing layer; the output passes through a Dropout layer before being fed to the next layer;
considering that in practice the polypeptide length is variable, and therefore the output length after all convolutional layers is also variable, a global pooling layer θ: R^(n×p) → R^p, where p is the output dimension of the last layer, is added after the last convolutional layer to obtain the feature vector.
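The gated-linear-unit residual layer and the global pooling step can be sketched as follows. This is a minimal numpy illustration: the convolution is a plain 1-D "same" convolution, mean pooling is assumed for θ (the text does not fix the pooling operator), Dropout is omitted, and all sizes are toy values.

```python
import numpy as np

def conv1d_same(x, w):
    """1-D convolution with 'same' padding.
    x: (n, p) sequence of n positions with p channels; w: (k, p, p) kernel."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([sum(xp[i + j] @ w[j] for j in range(k))
                     for i in range(x.shape[0])])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gated_residual_block(x, wa, wb):
    """X + conv_A(X) * sigmoid(conv_B(X)): the gated-linear-unit residual
    layer described in steps 1)-3) above (Dropout omitted in this sketch)."""
    return x + conv1d_same(x, wa) * sigmoid(conv1d_same(x, wb))

def global_pool(x):
    """theta: R^(n x p) -> R^p; mean pooling over positions is an assumption."""
    return x.mean(axis=0)

rng = np.random.default_rng(0)
n, p, k = 9, 16, 3
x = rng.standard_normal((n, p))
wa = rng.standard_normal((k, p, p)) * 0.1
wb = rng.standard_normal((k, p, p)) * 0.1
h = gated_residual_block(x, wa, wb)   # same shape as x: length-preserving
feat = global_pool(h)                 # fixed-size feature vector
print(h.shape, feat.shape)            # (9, 16) (16,)
```

Because the block preserves the (n, p) shape, layers can be stacked to any depth, and the final pooling removes the dependence on the variable polypeptide length n.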
In this deep convolutional neural network model for amino acid sequences, learning the residual of each convolutional layer's output, rather than the output itself, improves the generalization of the deep convolutional residual network structure, so that a single model can achieve good accuracy and the overall process efficiency is markedly improved.
Mapping network:
the mapping network comprises multiple layers; its function is to map the feature vectors learned by the computing network into affinity values through a fully-connected network. The overall process is as follows:
1) a linear transformation is applied to the input of the mapping network (i.e., the feature vector output by the computing network), converting the mapping-network input vector x into y = Wx + b;
2) a nonlinear transformation is applied to the result of the linear transformation using a linear rectification function (ReLU): z = max(0, y);
steps 1) and 2) form one layer of the mapping network, and the affinity prediction result is output after passing through multiple layers of the mapping network.
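The mapping network described above can be sketched as a stack of linear + ReLU layers followed by a final linear output producing the scalar affinity. The final-layer form and the layer sizes are assumptions for illustration; the text only specifies the repeated linear + ReLU structure.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def mapping_network(x, layers):
    """Apply y = relu(W x + b) per hidden layer, then a final linear layer
    producing the scalar affinity (the final-layer form is an assumption)."""
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = relu(w @ x + b)
    return (w_out @ x + b_out).item()

rng = np.random.default_rng(1)
dims = [16, 8, 4, 1]   # feature vector -> two hidden layers -> affinity (toy sizes)
layers = [(rng.standard_normal((o, i)), rng.standard_normal(o))
          for i, o in zip(dims[:-1], dims[1:])]
affinity = mapping_network(rng.standard_normal(16), layers)
print(type(affinity))  # <class 'float'>
```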
For test/prediction data, the polypeptide + HLA molecule to be tested/predicted is input into the deep convolutional neural network of the trained computing network, and the feature vector of the oligomer-immune molecule is extracted according to the network parameters. The feature vector is then input into the mapping network, whose multi-layer fully-connected layers map it into an affinity value according to the mapping network's parameters. In one example, the affinity value can be the IC50.
The effect of the antigen affinity prediction model of the present invention is verified by examples below.
Examples
the algorithm uses the following software environment, which must be configured even when no graphics card resources are available:
Python:3.6.5
Anaconda:4.5.4
Tensorflow:1.3.0
Cuda:8.0
Cudnn:6.0
In use, the inventors execute the prediction script provided with the software, passing the path of the file to be predicted, which generates a prediction result for each experimental record. The file to be predicted must be arranged in the "mhc,peptide" format. The inventors have placed an example file in the code, test_batch.csv under the data folder, as follows:
the test data is the pMHC (9, 10mer) quantitative affinity data set validated by the MHC ligand assay method of the IEDB database (http://www.iedb.org/); the typing covers 113 HLA types such as A0101, A0201 and A2402. Running the script generates a prediction output file, which records, for each sample, the polypeptide, the typing, and the software-predicted IC50 value. The results generated in the IEDB data test are as follows:
the inventors compared the algorithm results of the present invention with the results of NetMHCPan on this data. IC (integrated circuit)50Converted into negative and positive polypeptides of more concern in actual development, so that the evaluation indexes used are Area Under ROC Curve (AUC) and f1 scores commonly used in machine learning classification problems, the model only uses one neural network model, and NetMHCPan uses 200 model fusion learning [ netpan3(2)]. The comparative results are as follows:
the result shows that the accuracy of using 200 models by NetMHCPan is achieved by using a single model, the execution efficiency of the model is high, on the NVIDIA p6000 video card, the time of training 11 thousands of data is about 30 minutes, and the time of predicting 2 thousands of 4 thousands of data of test files test _ batch.csv is about 1 minute; the polypeptide length distribution of the data set is 8-11 bits, and the HLA distribution of the training set and the test set is inconsistent.
This example shows that the method of the present invention has the following four advantages: 1) by designing a short peptide-HLA molecule complex characterization layer and a short-peptide deep residual network, the method improves the affinity prediction accuracy of a single neural network model; 2) by improving single-model accuracy, the method also reduces the execution time overhead of the model: on the inventors' machine (an NVIDIA P6000 graphics card), training on 116,559 records takes only half an hour; 3) the method allows variable-length input polypeptides; 4) because the invention develops a pan-specific algorithm, it can also give predictions for new typings, i.e., rare typings never seen in the training set.
Reference documents:
[1]Nielsen M,Lundegaard C,Worning P,et al.Reliable prediction of T-cell epitopes using neural networks with novel sequence representations.Protein Sci 2003;12:1007-17.
[2]Zhang GL,Ansari HR,Bradley P,et al.Machine learning competition in immunology-prediction of HLA class I binding peptides.J Immunol Meth 2011;374:1-4.
[3]Carreno BM,Magrini V,Becker-Hapak M,et al.Cancer immunotherapy.A dendritic cell vaccine increases the breadth and diversity of melanoma neoantigen-specific T cells.Science 2015;348:803-8.
[4]Walter S,Weinschenk T,Stenzl A,et al.Multipeptide immune response to cancer vaccine IMA901 after single-dose cyclophosphamide associates with longer patient survival.Nat Med 2012;18:1254-61.
[5]Yadav M,Jhunjhunwala S,Phung QT,et al.Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing.Nature 2014;515:572-76.
[6]Nielsen,M.,Andreatta,M.NetMHCpan-3.0;improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets.Genome Med 8,33(2016)doi:10.1186/s13073-016-0288-x.
[7]Liu,Z.,Cui,Y.,Xiong,Z.et al.DeepSeqPan,a novel deep convolutional neural network model for pan-specific class I HLA-peptide binding affinity prediction.Sci Rep 9,794(2019)doi:10.1038/s41598-018-37214-1.
[8]PhloyPhisut,P.,Pomputtapong,N.,Sriswasdi,S.et al.MHCSeqNet:a deep neural network model for universal MHC binding prediction.BMC Bioinformatics 20,270(2019)doi:10.1186/s12859-019-2892-4.
[9]Sidhom J-W,Pardoll D,Baras A.AI-MHC:an allele-integrated deep learning framework for improving Class I&Class II HLA-binding predictions.bioRxiv,p.318881(2018).
[10]Yan Hu,Ziqiang Wang,Hailin Hu,FangPing Wan,Lin Chen,Yuanpeng Xiong,Xiaoxia Wang,Dan Zhao,Weiren Huang,Jianyang Zeng,ACME:pan-specific peptide-MHC class I binding prediction through attention-based deep neural networks,Bioinformatics.
[11]Han,Youngmahn&Kim,Dongsup.(2017).Deep convolutional neural networks for pan-specific peptide-MHC class I binding prediction.BMC Bioinformatics.18.585.10.1186/s12859-017-1997-x.
[12]Devlin J,Chang M W,Lee K,et al.Bert:Pre-training of deep bidirectional transformers for language understanding[J].arXiv preprint arXiv:1810.04805,2018.
[13]Vaswani A,Shazeer N,Parmar N,et al.Attention is all you need[C]//Advances in neural information processing systems.2017:5998-6008.
[14]Gehring J,Auli M,Grangier D,et al.Convolutional sequence to sequence learning[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70.JMLR.org,2017:1243-1252.
[15]Dauphin Y N,Fan A,Auli M,et al.Language modeling with gated convolutional networks[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70.JMLR.org,2017:933-941.
Claims (16)
1. a method of predicting affinity of an oligomer for an immune molecule based on deep learning, the method comprising:
(1) obtaining a training data set of the oligomer-immune molecule binding structure, the training data set comprising the affinity of oligomer-immune molecule binding;
(2) for the oligomer to be detected, the immune molecule, and each group in the training data set of the oligomer-immune molecule binding structure, representing the sequence of the oligomer, the monomer positions of the oligomer, and the immune molecule as high-dimensional vectors respectively, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule binding structure;
(3) training a deep neural network by using the vector of the oligomer-immune molecule combination structure in the training data set and affinity data, and establishing an oligomer-immune molecule affinity prediction model;
(4) inputting the vector of the binding structure of the oligomer to be detected and the immune molecule to be detected into the antigen affinity prediction model, and predicting the binding affinity of the oligomer to be detected and the immune molecule using the trained deep neural network.
2. The method of claim 1, the oligomer being selected from the group consisting of: polypeptides, polysaccharides and nucleic acid haptens.
3. The method of claim 1 or 2, wherein the immune molecule is a major histocompatibility complex molecule, such as a human leukocyte antigen, including but not limited to class I and class II human leukocyte antigens.
4. The method of any one of claims 1 to 3, wherein in (2) a representation of a monomer vector reflecting the relationship between any two monomers is obtained by word embedding.
5. The method of any one of claims 1-4, wherein in (2) the input oligomer sequence is (x_1, x_2, ..., x_n), where x_i is an amino acid character, x_i ∈ {A, C, D, ..., Y}; each x_i is converted into an m-dimensional vector z_i ∈ R^m, where R is the real number space, and the sequence (x_1, x_2, ..., x_n) is converted into (z_1, z_2, ..., z_n).
6. The method of any one of claims 1-5, wherein in (2) the monomer position characterization vector is obtained by mapping the label of each amino acid position of the polypeptide into a high-dimensional continuous numerical space: the input is the unified vector (1, 2, ..., n), and each position i is mapped by the fully-connected neural network into a vector p_i ∈ R^m, where R is the real number space; i.e., the input (1, 2, ..., n) yields the output (p_1, p_2, ..., p_n).
7. The method of any one of claims 1-6, wherein in (2) the sequence of oligomers and the high dimensional vector of monomer positions of the oligomers are added to form a characterization vector for the oligomers.
8. The method of any one of claims 1-7, wherein in (2) the input of the vector of immune molecules is (y_1, y_2, ..., y_k), y_i ∈ {A, C, D, ..., Y}, and the output is (z′_1, z′_2, ..., z′_n), z′_i ∈ R^m, the k-dimensional original amino acid vector being mapped into an n × m-dimensional vector with the same format as the oligomer.
9. The method of claim 8, wherein the characterization vector of the oligomer and the vector of the immune molecule are subjected to tensor multiplication, tensor addition, or attention operations in (2).
10. The method according to any one of claims 1 to 9, wherein in (3), the deep neural network comprises a deep convolutional layer and a multi-layered fully-connected layer, the deep convolutional layer is used for extracting feature vectors of oligomer-immune molecules in the training data set, the feature vectors are input into the multi-layered fully-connected layer, the multi-layered fully-connected layer maps the feature vectors into affinity values, and network parameters are obtained through a back propagation algorithm.
11. The method of claim 10, wherein for each deep convolutional layer the input is copied into three parts: two copies are learned through different convolutional networks, and a Sigmoid normalization is applied to one of the convolution outputs, yielding two vectors A and B that are multiplied bit-wise to form the residual A×B; the third copy is added to the residual A×B.
12. The method of claim 11, wherein a global pooling layer θ: R^(n×p) → R^p, where p is the output dimension of the last layer, is added after the last convolutional layer to obtain the feature vector.
13. The method according to any one of claims 10-12, wherein in the multi-layer fully-connected layers the feature vectors obtained from the deep convolutional layers are mapped into affinity values by the fully-connected layers as follows:
1) a linear transformation is applied to the output feature vector x of the deep convolutional layers, giving the fully-connected layer input y = Wx + b;
2) a nonlinear transformation is applied to the result of the linear transformation using a linear rectification function: z = max(0, y);
steps 1) and 2) form one layer of the mapping network, and the affinity prediction result is output after passing through multiple layers of the mapping network.
14. The method according to any one of claims 10-13, wherein in (4), the trained deep neural network comprises a deep convolutional layer and a plurality of fully-connected layers, wherein the deep convolutional layer is used for extracting the feature vector of the oligomer-immune molecule to be detected, and the feature vector is input into the plurality of fully-connected layers to predict the binding affinity of the oligomer-immune molecule to be detected.
15. A method of building a model for predicting the affinity of an oligomer to an immune molecule based on deep learning, the method comprising building a model for predicting the affinity of an oligomer to an immune molecule by (1) - (3) of the method of any one of claims 1-14.
16. An antigen affinity prediction model established by the method of claim 15, the antigen affinity prediction model comprising: a data acquisition module, an input vector establishing module and a model establishing module,
the data acquisition module is used for acquiring a data set of the oligomer-immune molecule combination structure, wherein the data set comprises the affinity of oligomer-immune molecule combination;
the input vector establishing module is used for mapping the sequence of the oligomer, the monomer position of the oligomer and the immune molecule into high-dimensional vectors respectively by using a data set of the oligomer-immune molecule combination structure, and fusing the three high-dimensional vectors into a vector of the oligomer-immune molecule combination structure;
the model building module is used for training a deep neural network by using the vector of the oligomer-immune molecule combination structure and affinity data to build an antigen affinity prediction model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011506001.5A CN114649054A (en) | 2020-12-18 | 2020-12-18 | Antigen affinity prediction method and system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114649054A true CN114649054A (en) | 2022-06-21 |
Family
ID=81991026
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011506001.5A Pending CN114649054A (en) | 2020-12-18 | 2020-12-18 | Antigen affinity prediction method and system based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114649054A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109671469A (en) * | 2018-12-11 | 2019-04-23 | 浙江大学 | Method for predicting the binding relationship and binding affinity between polypeptides and HLA class I molecules based on a recurrent neural network |
CN110689965A (en) * | 2019-10-10 | 2020-01-14 | 电子科技大学 | Drug target affinity prediction method based on deep learning |
WO2020046587A2 (en) * | 2018-08-20 | 2020-03-05 | Nantomics, Llc | Methods and systems for improved major histocompatibility complex (mhc)-peptide binding prediction of neoepitopes using a recurrent neural network encoder and attention weighting |
CN111951887A (en) * | 2020-07-27 | 2020-11-17 | 深圳市新合生物医疗科技有限公司 | Leukocyte antigen and polypeptide binding affinity prediction method based on deep learning |
Non-Patent Citations (2)
Title |
---|
WANG Yuanqiang; DING Yuan; XU Donghai; LIU Yuehui; ZHANG Ya; HAN Yingzi; LUO Xingyan; LIN Zhihua: "Quantitative structure-activity relationship modeling of MHC class I antigen epitopes based on amino acid structural information", Journal of Immunology (免疫学杂志), vol. 27, no. 10, 1 October 2011 (2011-10-01), pages 829 - 832 * |
GAO Jingpeng et al.: "Deep Learning: Convolutional Neural Network Technology and Practice" (《深度学习 卷积神经网络技术与实践》), 31 July 2020, China Machine Press (机械工业出版社), pages 72 - 73 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115588462A (en) * | 2022-09-15 | 2023-01-10 | 哈尔滨工业大学 | Polypeptide and major histocompatibility complex protein molecule combination prediction method based on transfer learning |
CN116836231A (en) * | 2023-06-30 | 2023-10-03 | Shenzhen University General Hospital | Neoantigen peptide of t(8;21) AML and application thereof |
CN116836231B (en) * | 2023-06-30 | 2024-02-13 | Shenzhen University General Hospital | Neoantigen peptide of t(8;21) AML and application thereof |
CN116994644A (en) * | 2023-07-28 | 2023-11-03 | 天津大学 | Medicine target affinity prediction method based on pre-training model |
CN116994644B (en) * | 2023-07-28 | 2024-02-02 | 天津大学 | Medicine target affinity prediction method based on pre-training model |
CN117095825A (en) * | 2023-10-20 | 2023-11-21 | 鲁东大学 | Human immune state prediction method based on multi-instance learning |
CN117095825B (en) * | 2023-10-20 | 2024-01-05 | 鲁东大学 | Human immune state prediction method based on multi-instance learning |
CN118016158A (en) * | 2024-02-05 | 2024-05-10 | 常州大学 | TCR-epitope combination prediction method and system based on transfer learning |
CN118248218A (en) * | 2024-05-30 | 2024-06-25 | 北京航空航天大学杭州创新研究院 | Method for developing high-affinity capture antibody EpCAM based on AI algorithm |
CN118248218B (en) * | 2024-05-30 | 2024-07-30 | 北京航空航天大学杭州创新研究院 | Method for developing high-affinity capture antibody EpCAM based on AI algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114649054A (en) | Antigen affinity prediction method and system based on deep learning | |
CN109671469B (en) | Method for predicting binding relationship and binding affinity between polypeptides and HLA class I molecules based on a recurrent neural network | |
KR102607567B1 (en) | GAN-CNN for MHC peptide binding prediction | |
Tomar et al. | Immunoinformatics: an integrated scenario | |
Wu et al. | TCR-BERT: learning the grammar of T-cell receptors for flexible antigen-binding analyses | |
Dasari et al. | Explainable deep neural networks for novel viral genome prediction | |
Sunny et al. | Protein–protein docking: Past, present, and future | |
CN113762417B (en) | Method for enhancing HLA antigen presentation prediction system based on deep migration | |
Fu et al. | An overview of bioinformatics tools and resources in allergy | |
Doneva et al. | Predicting immunogenicity risk in biopharmaceuticals | |
Guo et al. | A deep convolutional neural network to improve the prediction of protein secondary structure | |
Zheng et al. | B-Cell Epitope Predictions Using Computational Methods | |
Attique et al. | DeepBCE: evaluation of deep learning models for identification of immunogenic B-cell epitopes | |
Zhang et al. | PiTE: TCR-epitope binding affinity prediction pipeline using Transformer-based sequence encoder | |
Huang et al. | Prediction of linear B-cell epitopes of hepatitis C virus for vaccine development | |
CN116130005B (en) | Tandem design method and device for multi-epitope vaccine, equipment and storage medium | |
Zhang et al. | EACVP: An ESM-2 LM Framework Combined CNN and CBAM Attention to Predict Anti-coronavirus Peptides | |
Saxena et al. | OnionMHC: A deep learning model for peptide-HLA-A*02:01 binding predictions using both structure and sequence feature sets |
Bi et al. | An attention based bidirectional LSTM method to predict the binding of TCR and epitope | |
Marzella et al. | Improving generalizability for MHC-I binding peptide predictions through geometric deep learning | |
Zhang et al. | TNFIPs-Net: A deep learning model based on multi-feature fusion for prediction of TNF-α inducing epitopes | |
Xue et al. | FeatureDock: Protein-Ligand Docking Guided by Physicochemical Feature-Based Local Environment Learning using Transformer | |
Shang et al. | Pretraining Transformers for TCR-pMHC Binding Prediction | |
Gupta et al. | Comparative analysis of epitope predictions: proposed library of putative vaccine candidates for HIV | |
Xie et al. | MHC2NNZ: A novel peptide binding prediction approach for HLA DQ molecules |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||