CN113129998B - Method for constructing prediction model of clinical individualized tumor neoantigen - Google Patents
- Publication number
- CN113129998B (application CN202110439857.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/20—Protein or domain folding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention relates to prediction technology and addresses the high false-positive rate and low accuracy of existing clinical individualized tumor neoantigen prediction algorithms by providing a method for constructing a prediction model of clinical individualized tumor neoantigens. The technical scheme can be summarized as follows: first, training data are selected and then cleaned; all peptide fragments are vectorized; the training data are divided into 20 subsets by HLA allele, and each subset is randomly split into a training set, a verification set and a test set; a deep learning model is then built from a convolutional neural network, whose output indicates whether a peptide can be bound and presented by HLA molecules; finally, the model is trained until prediction performance on the test set is best. The constructed prediction model has better specificity and accuracy, and the method is suitable for constructing prediction models of clinical individualized tumor neoantigens.
Description
Technical Field
The invention relates to prediction technology, in particular to technology for predicting clinical individualized tumor neoantigens.
Background
Tumor immunotherapy is a treatment that fights tumors by activating the host's immune system, and it has brought significant improvements in survival and quality of life to patients with various malignant tumors. Compared with conventional treatments (such as chemotherapy, radiotherapy and surgery), it has the decisive advantages of high specificity and low side effects, and can achieve precise tumor killing by pharmacologically enhancing the body's pre-existing immune response or by inducing a novel immune response against tumor growth and metastasis. The molecular basis by which the host immune system distinguishes cancer cells from normal cells is a class of tumor-specific antigens expressed only on tumor cells, called neoantigens.
Cancer is a disease driven by the accumulation of somatic mutations that lead to abnormal cell proliferation. When such tumor-specific somatic mutations occur in protein-coding regions, they produce mutant peptides. These mutant peptides are presented on the tumor cell surface by major histocompatibility complex (MHC) molecules via the endogenous antigen processing pathway; when recognized by the T cell receptor (TCR), they trigger T cell-mediated specific killing of cancer cells. Such mutant peptides are neoantigens.
Current personalized immunotherapies based on neoantigens mainly include neoantigen vaccines (cancer vaccines) and adoptive T cell therapy. In preclinical studies in humans, neoantigen vaccines have been shown to induce neoantigen-specific T cells in melanoma and glioblastoma, protecting against melanoma recurrence and metastasis; adoptive T cell therapy, in which neoantigen-specific T cells are isolated from the body, expanded in vitro and reinfused, has already shown anti-tumor effects and induced tumor regression in a variety of malignancies.
Currently, a patient's somatic mutations can be accurately identified with next-generation sequencing technology and related bioinformatics tools. However, accurately, efficiently and cost-effectively predicting which somatic mutations generate immunogenic neoantigens remains difficult, and precise identification of neoantigens is limited by the low specificity of current prediction algorithms. This is mainly because most of these algorithms, such as MHCflurry, SMM, ANN, PickPocket and NetMHCpan (BA), are trained on binding affinity data between antigen peptides and specific Human Leukocyte Antigen (HLA) alleles, whereas whether a neoantigen can be recognized by the immune system, i.e., its immunogenicity, depends on a series of complex events, including mutation expression, peptide processing, transport, and binding to and presentation by HLA molecules. The affinity data are derived from in vitro experiments and consider only the single factor of antigen peptide binding to HLA molecules, neglecting other biological characteristics and therefore yielding a large number of false-positive results. In addition, although current neoantigen prediction methods use machine learning models, the neural network structures are simple, with few hidden layers, and cannot capture the spatial structure of amino acid positions within the antigen peptide. In view of the above, there is an urgent need for a novel, highly accurate neoantigen prediction tool that covers the wide diversity of HLA alleles.
Disclosure of Invention
The invention aims to overcome the high false-positive rate and low accuracy of conventional tumor neoantigen prediction algorithms and provides a method for constructing a clinical individualized tumor neoantigen prediction model.
The invention solves the technical problem and adopts the technical scheme that the method for constructing the prediction model of the clinical individualized tumor neoantigen comprises the following steps:
step 1, selecting training data, wherein antigen peptides eluted from co-immunoprecipitated HLA molecules and identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS) are used as positive peptides, and peptide fragments matched in length to the positive peptides but not detected by mass spectrometry are randomly extracted from a reference proteome (SwissProt) as negative peptides;
step 2, cleaning the training data, at least including removing peptide fragments that contain unknown or indistinguishable amino acids;
step 3, letting α be the maximum amino acid length among the peptide fragments in the training data, expressing all peptide fragments in the training data as vectors of length α, padding any peptide fragment shorter than α amino acids so that it can be expressed as a vector of length α, and vectorizing the amino acid sequence of each peptide fragment using a one-hot encoding scheme;
step 4, dividing the training data into 20 subsets according to HLA allele, and randomly splitting each subset into a training set, a verification set and a test set, keeping the distribution of positive and negative peptides approximately the same and ensuring that any given peptide appears in only one of the training set, the verification set and the test set;
step 5, establishing a deep learning model based on a convolutional neural network, wherein the output of the deep learning model indicates whether an input peptide can be bound and presented by HLA molecules;
and step 6, for each HLA allele, inputting the subset corresponding to that HLA allele into the deep learning model for training, stopping training when prediction performance on the test set is best, thereby completing construction of the neoantigen prediction model for that HLA allele.
Specifically, for convenience of subsequent computer processing, in step 1 the label of the positive peptides is set to 1 and the label of the negative peptides is set to 0.
Further, to specify how peptide fragments containing unknown or indistinguishable amino acids are removed: in step 2, removing such peptide fragments from the training data refers to eliminating peptide fragments containing "X" and/or "B".
Specifically, to facilitate subsequent computer recognition, in step 2 peptide fragments containing lower-case letters are also converted to upper case.
Still further, since the peptide length of a neoantigen is only 8-15 amino acids, and 95% of neoantigen peptides are 8-11 amino acids long, in order to reduce the data volume, in step 2 peptide fragments shorter than 8 or longer than 11 amino acids are also removed when the training data are cleaned;
accordingly, in step 3, α is 11.
Specifically, to illustrate how peptide fragments shorter than α amino acids are padded: in step 3, padding a peptide fragment shorter than α amino acids so that it is expressed as a vector of length α means selecting a uniform padding character, which may be any letter that does not represent an amino acid, and inserting the padding character into the middle of the peptide fragment until its length reaches α. The letters that do not represent amino acids include "O", "J", "U" and "Z".
Further, to illustrate a specific method for vectorizing the amino acid sequence of each peptide fragment with a one-hot encoding scheme, in step 3 the method is as follows:
step 3A, assigning a unique integer to each capital letter in the 21-letter amino acid alphabet (the 20 amino acids plus the padding character) as that letter's index in the alphabet;
step 3B, building, for each amino acid and the padding character, a one-hot vector of 21 elements consisting of 0s and 1s according to the integer assigned to the corresponding letter, with a 1 at the index position and 0 elsewhere;
and step 3C, for each peptide fragment, vertically stacking the one-hot vectors of all amino acids in its sequence into a one-hot matrix, thereby converting every peptide fragment in the training data into a fixed computer-readable matrix of 11 rows and 21 columns, completing vectorization.
Specifically, in order to eliminate the class imbalance in the training data, thereby saving subsequent model training time and improving training efficiency, the method further includes the following step between step 3 and step 4:
step 7, oversampling and undersampling the training data to adjust the balance between the positive peptides and the negative peptides.
Further, to illustrate oversampling and undersampling: oversampling refers to repeating positive-peptide data, and undersampling refers to randomly deleting negative-peptide data.
Specifically, to explain the deep learning model: in step 5, the deep learning model consists of three convolution modules connected in parallel, each containing 8 two-dimensional convolution layers; each convolution module uses filters of different numbers and sizes and different strides. The outputs of the three convolution modules are flattened and concatenated, then fed into a fully connected layer of 100 nodes, and finally into an output layer of two nodes corresponding to the two classification results: can be bound and presented by HLA molecules, and cannot be bound and presented by HLA molecules.
Further, to explain the functions and parameters used in the deep learning model: the convolution modules and the fully connected layer use a Leaky Rectified Linear Unit (Leaky ReLU) activation function; the output layer uses a Softmax activation function; the cost function is the Softmax cross-entropy loss; an optimizer built with the Adam optimization algorithm performs the optimization, taking an adaptive learning rate as input; mini-batch gradient descent is used, with the batch size set to 64 and the maximum number of iterations (epochs) set to 1000.
Specifically, in order to prevent model overfitting, an early stopping strategy and a dropout strategy are introduced into the deep learning model. The early stopping strategy means that, during model training, if the accuracy or the loss function does not improve after a preset number of iterations, training stops early even before reaching the maximum number of iterations specified before training. The dropout strategy is applied at the fully connected layer with a dropout rate of 50%.
Further, to obtain more stable results, three different verification sets are used in the early stopping strategy.
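The early stopping strategy described above can be sketched as a simple counter in Python; the class name, the `patience` value (the "preset number of iterations") and the choice of monitoring the validation loss are illustrative assumptions, not part of the claims:

```python
class EarlyStopping:
    """Stop training early if the monitored validation loss fails to
    improve for `patience` consecutive epochs (the preset iteration
    number); the patience value here is an assumed example."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when
        training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In training, `step` would be called once per epoch, and the loop would break as soon as it returns True, even if the maximum of 1000 epochs has not been reached.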
The beneficial effects of the invention are as follows: by constructing the clinical individualized tumor neoantigen prediction model with antigen peptides identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS) as positive peptides, the method analyzes peptides naturally presented on the cell surface and integrates the influence of multiple intracellular steps such as antigen processing and transport. Verified with independent mass spectrometry data (a separate test set for each HLA allele), the constructed prediction model has better specificity and accuracy.
Detailed Description
The technical solution of the present invention will be described in detail with reference to the following examples.
The invention relates to a method for constructing a prediction model of a clinical individualized tumor neoantigen, which comprises the following steps:
step 1, selecting training data, wherein antigen peptides eluted from co-immunoprecipitated HLA molecules and identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS) are used as positive peptides, and peptide fragments matched in length to the positive peptides but not detected by mass spectrometry are randomly extracted from a reference proteome (SwissProt) as negative peptides.
For the convenience of subsequent computer identification, the label of the positive peptide can be set to 1 and the label of the negative peptide can be set to 0 in this step.
Specific examples are as follows:
step 1A, 48376 HLA-binding peptides identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS), i.e., antigenic peptides, were collected as positive peptides from 16 genetically engineered HLA-A, HLA-B cell lines stably expressing a single HLA allele and B lymphocytes or cancer cell lines expressing multiple HLA-complex alleles.
Step 1B, randomly extracting from a reference proteome (SwissProt) peptide fragments that match the positive peptides in length but were not detected by mass spectrometry, as negative peptides, and combining the negative and positive peptides into more than 1.9 million training records;
step 1C, the label of the positive peptide is set to 1, and the label of the negative peptide is set to 0.
And step 2, cleaning the training data, at least removing peptide fragments in the training data that contain unknown or indistinguishable amino acids.
To specify how peptide fragments containing unknown or indistinguishable amino acids are removed: in this step, such removal means eliminating peptide fragments containing "X" and/or "B". To facilitate subsequent computer recognition, peptide fragments containing lower-case letters are also converted to upper case in this step. Because the peptide length of a neoantigen is only 8-15 amino acids, and 95% of neoantigen peptides are 8-11 amino acids long, in order to reduce the data volume, peptide fragments shorter than 8 or longer than 11 amino acids are also removed when the training data are cleaned in this step.
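The three cleaning rules of step 2 can be sketched in Python as follows; the function name and its list-based interface are illustrative, not part of the invention:

```python
def clean_training_data(peptides):
    """Clean training peptides per step 2: upper-case the sequences,
    drop fragments with unknown/indistinguishable residues ('X'/'B'),
    and keep only lengths of 8-11 amino acids."""
    cleaned = []
    for pep in peptides:
        pep = pep.upper()
        if "X" in pep or "B" in pep:
            continue  # unknown or indistinguishable amino acid
        if not 8 <= len(pep) <= 11:
            continue  # outside the 8-11 residue range
        cleaned.append(pep)
    return cleaned
```

For example, `clean_training_data(["acdefghi", "AXDEFGHI", "ACDEFGH"])` keeps only the first peptide (upper-cased), since the second contains "X" and the third is shorter than 8 residues.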
And step 3, letting α be the maximum amino acid length among the peptide fragments in the training data, expressing all peptide fragments as vectors of length α, padding any peptide fragment shorter than α amino acids so that it can be expressed as a vector of length α, and vectorizing the amino acid sequence of each peptide fragment using a one-hot encoding scheme.
If peptide fragments shorter than 8 or longer than 11 amino acids were removed from the training data in step 2, then α = 11 in this step.
To illustrate how peptide fragments shorter than α amino acids are padded: in this step, padding a peptide fragment shorter than α amino acids so that it is expressed as a vector of length α may mean selecting a uniform padding character, which is any letter that does not represent an amino acid, and inserting the padding character into the middle of the peptide fragment until its length reaches α. The letters that do not represent amino acids include "O", "J", "U" and "Z".
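The middle-padding rule can be sketched in Python as follows; the patent does not specify exactly where "the middle" falls for odd-length peptides, so the split point used here (left half keeps the extra residue) is an assumption:

```python
def pad_middle(pep, alpha=11, pad_char="Z"):
    """Pad a peptide shorter than `alpha` residues by inserting the
    padding character into the middle of the sequence. The exact split
    point for odd lengths is an assumption, not stated in the patent."""
    n_pad = alpha - len(pep)
    if n_pad <= 0:
        return pep  # already at the target length
    mid = (len(pep) + 1) // 2
    return pep[:mid] + pad_char * n_pad + pep[mid:]
```

For example, the 9-residue peptide "ARHSLLQTL" becomes the 11-character string "ARHSLZZLQTL" after two "Z" characters are inserted in the middle.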
To illustrate a specific method for vectorizing the amino acid sequence of each peptide fragment with a one-hot encoding scheme (here the padding character is "Z" as an example), in this step the method may be:
step 3A, assigning a unique integer to each capital letter in the 21-letter amino acid alphabet (the 20 amino acids plus the padding character) as that letter's index in the alphabet; when the padding character is "Z", the amino acid alphabet is "ACDEFGHIKLMNPQRSTVWYZ", and alanine "A" corresponds to index 1;
step 3B, building, for each amino acid and the padding character, a one-hot vector of 21 elements consisting of 0s and 1s according to the integer assigned to the corresponding letter, with a 1 at the index position and 0 elsewhere; for example, the one-hot vector of alanine "A" is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
and step 3C, for each peptide fragment, vertically stacking the one-hot vectors of all amino acids in its sequence into a one-hot matrix, thereby converting every peptide fragment in the training data into a fixed computer-readable matrix of 11 rows and 21 columns, completing vectorization. Taking the peptide fragment "ARHSZZLLQTLQ" as an example, the fixed matrix is shown in Table 1.
TABLE 1 Fixed one-hot matrix for the peptide fragment "ARHSZZLLQTLQ"
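Steps 3A-3C can be sketched with NumPy as follows. Note that the sketch uses 0-based indexing (so "A" is at position 0) rather than the 1-based index in the text; the function name and the example peptide are illustrative:

```python
import numpy as np

# 21-letter alphabet: the 20 amino acids plus the padding character "Z"
ALPHABET = "ACDEFGHIKLMNPQRSTVWYZ"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}  # step 3A (0-based)

def one_hot(pep):
    """Step 3B/3C: convert a padded 11-residue peptide into an
    11x21 one-hot matrix, one row per residue."""
    mat = np.zeros((len(pep), len(ALPHABET)), dtype=np.float32)
    for row, aa in enumerate(pep):
        mat[row, INDEX[aa]] = 1.0
    return mat
```

Each row of the resulting matrix contains exactly one 1, so an 11-residue padded peptide always maps to a fixed 11x21 matrix with 11 non-zero entries.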
Here, since the number of negative peptides is much higher than that of positive peptides, in order to eliminate the class imbalance in the training data, thereby saving subsequent model training time and improving training efficiency, the method further includes the following step between step 3 and step 4:
and 7, oversampling and undersampling the training data to adjust the balance between the positive peptides and the negative peptides.
To illustrate oversampling and undersampling, oversampling refers to repeating data for positive peptides and undersampling refers to randomly deleting data for negative peptides.
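The over- and undersampling of step 7 can be sketched in Python as follows; the target class size (the average of the two class sizes) and the function interface are assumptions, since the patent only describes repeating positive data and randomly deleting negative data:

```python
import random

def balance(positives, negatives, seed=0):
    """Balance the classes: oversample positives by repetition and
    undersample negatives by random deletion. The target size (the
    mean of the two class sizes) is an assumed choice."""
    rng = random.Random(seed)
    target = (len(positives) + len(negatives)) // 2
    # Oversample: repeat the positive data, then top up with a random sample.
    over = positives * (target // len(positives)) \
        + rng.sample(positives, target % len(positives))
    # Undersample: randomly keep `target` negative records.
    under = rng.sample(negatives, target)
    return over, under
```

Because the negatives far outnumber the positives, `target` is always at most the negative-class size, so the random sample of negatives is well defined.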
And step 4, dividing the training data into 20 subsets according to HLA allele, and randomly splitting each subset into a training set, a verification set and a test set, keeping the distribution of positive and negative peptides approximately the same and ensuring that any given peptide appears in only one of the training set, the verification set and the test set.
In this step, the verification set is used only for early stopping, the training set is used for the feed-forward and back-propagation passes, and the test set is used to evaluate performance; the main indicators are sensitivity, specificity and AUC (Area Under the ROC Curve).
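Step 4's per-allele, class-stratified split can be sketched as follows. The 80/10/10 ratios are an assumption (the patent gives no split proportions), and the record format is invented for the example; stratifying each label separately keeps the positive/negative distribution approximately equal across partitions, and shuffling once assigns each unique peptide to exactly one partition.

```python
import random
from collections import defaultdict

def split_by_allele(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """records: (allele, peptide, label) tuples, one per unique peptide.
    Returns {allele: (train, val, test)}, stratified by label."""
    rng = random.Random(seed)
    by_class = defaultdict(lambda: defaultdict(list))
    for allele, peptide, label in records:
        by_class[allele][label].append((peptide, label))
    splits = {}
    for allele, classes in by_class.items():
        train, val, test = [], [], []
        for items in classes.values():      # stratify: split each class separately
            rng.shuffle(items)
            a = int(len(items) * ratios[0])
            b = int(len(items) * (ratios[0] + ratios[1]))
            train += items[:a]; val += items[a:b]; test += items[b:]
        splits[allele] = (train, val, test)
    return splits

records = [("HLA-A*02:01", f"PEP{i:03d}", i % 2) for i in range(20)]
train, val, test = split_by_allele(records)["HLA-A*02:01"]
print(len(train), len(val), len(test))  # 16 2 2
```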
Step 5, establishing a deep learning model based on a convolutional neural network, the output of which is whether a peptide can be bound by HLA molecules and presented.
In this step, the deep learning model preferably consists of three convolution modules connected in parallel, each containing 8 two-dimensional convolution layers; the convolution modules use filters of different numbers and sizes and different strides (the stride being how far the filter advances at each step). The outputs of the three convolution modules are flattened and concatenated, then fed into a fully connected layer of 100 nodes, and finally into an output layer of two nodes corresponding to the two classification results: can be bound by HLA molecules and presented, and cannot be bound by HLA molecules and presented.
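As an illustration only, the parallel-module layout can be traced with a NumPy shape-bookkeeping sketch. The filter counts, kernel heights and stride below are invented for the example (the patent does not disclose them), and each branch is collapsed to a single convolution instead of eight layers for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels, stride=1):
    """Valid 2-D convolution: x (H, W), kernels (n, kh, kw) -> (n, H', W')."""
    n, kh, kw = kernels.shape
    H, W = x.shape
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    out = np.empty((n, oh, ow))
    for k in range(n):
        for i in range(oh):
            for j in range(ow):
                patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[k, i, j] = np.sum(patch * kernels[k])
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.integers(0, 2, size=(11, 21)).astype(float)  # stand-in 11 x 21 one-hot matrix

# Three parallel branches with different kernel heights (illustrative: 2, 3, 4)
branches = [rng.normal(size=(8, kh, 21)) for kh in (2, 3, 4)]
feats = np.concatenate([conv2d(x, k).ravel() for k in branches])  # flatten + concat

W1 = rng.normal(size=(100, feats.size)); b1 = np.zeros(100)
z1 = W1 @ feats + b1
h = np.maximum(0.2 * z1, z1)               # Leaky ReLU, a = 0.2 (see below)
W2 = rng.normal(size=(2, 100)); b2 = np.zeros(2)
p = softmax(W2 @ h + b2)                   # two nodes: presented vs. not presented
print(p.shape)  # (2,)
```

With these assumed kernels the concatenated feature vector has 8×10 + 8×9 + 8×8 = 216 elements, showing how branches of different filter sizes yield different output heights before flattening.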
Regarding the functions and parameters adopted in the deep learning model: in the convolution modules and the fully connected layer, activation uses the Leaky Rectified Linear Unit (Leaky ReLU) activation function (a = 0.2); the output layer uses the Softmax activation function; the cost function is the Softmax cross-entropy loss function; an optimizer built with the Adam optimization algorithm optimizes the cost function, with an adaptive learning rate as the optimizer's input (the learning rate can start at 0.003 and decrease as the number of iterations increases, down to a minimum of 0.0001); a mini-batch gradient descent algorithm is used with a batch size of 64, and the maximum number of iterations (epochs) is set to 1000.
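The named functions can be written out as follows. The Leaky ReLU slope (a = 0.2) and the learning-rate endpoints (0.003 down to 0.0001) come from the text; the exponential form of the decay schedule is an assumption, since the patent states only that the rate decreases with iterations.

```python
import numpy as np

def leaky_relu(z, a=0.2):
    """Leaky ReLU with negative slope a = 0.2, as used in the conv/dense layers."""
    return np.where(z > 0, z, a * z)

def softmax_cross_entropy(logits, label):
    """Softmax cross-entropy loss for one example; label is the true class index."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def learning_rate(epoch, lr0=0.003, lr_min=0.0001, decay=0.99):
    """Assumed exponential decay from 0.003, floored at 0.0001."""
    return max(lr_min, lr0 * decay ** epoch)

print(leaky_relu(np.array([-1.0, 2.0])))  # [-0.2  2. ]
print(round(softmax_cross_entropy(np.array([2.0, 2.0]), 0), 4))  # 0.6931 (= ln 2)
print(learning_rate(0), learning_rate(10_000))  # 0.003 0.0001
```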
To prevent the model from overfitting, an early stopping strategy and a random dropout strategy can be introduced into the deep learning model. The early stopping strategy appears in model training as follows: if the accuracy or the loss function does not improve after a preset number of iterations, the model stops training early even though it has not reached the maximum number of iterations specified before training. The dropout strategy is introduced at the fully connected layer with a dropout rate of 50%, i.e. half of the neurons in that layer are randomly deactivated during training. These measures effectively keep the model from over-relying on a few local features while also saving training time.
To obtain more stable results, three different verification sets are preferably used in the early stopping strategy. The reason is this: typical neoantigen-prediction neural network models use a single verification set, whereas this example uses 3 different verification data sets (1000 samples each) and defines the stopping rule on the joint improvement of accuracy across all three; if their combined performance does not improve after 300 iterations, training is terminated. Because the verification sets are drawn at random from the training data, early stopping does not depend on any single verification set, yielding more stable results.
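The three-validation-set stopping rule above can be sketched as follows. Interpreting "combined performance" as the mean accuracy over the three sets is an assumption; the 300-iteration patience comes from the text, while the demo uses a small patience so the toy trace triggers quickly.

```python
def early_stop_epoch(val_acc_histories, patience=300):
    """val_acc_histories: three per-epoch accuracy lists, one per validation set.
    Stop once the mean accuracy over the three sets has not improved for
    `patience` consecutive epochs; return the epoch at which training stops
    (or the last epoch if stopping never triggers)."""
    n_epochs = len(val_acc_histories[0])
    best, best_epoch = float("-inf"), 0
    for epoch in range(n_epochs):
        combined = sum(h[epoch] for h in val_acc_histories) / len(val_acc_histories)
        if combined > best:
            best, best_epoch = combined, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return n_epochs - 1

# Toy traces: accuracy improves until epoch 4, then plateaus
hist = [[0.5, 0.6, 0.7, 0.8, 0.81] + [0.81] * 10 for _ in range(3)]
print(early_stop_epoch(hist, patience=5))  # 9: five non-improving epochs after epoch 4
```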
The whole convolutional neural network deep learning model can be implemented with TensorFlow v1.14.0 and Python 3.7.
Step 6, for each HLA allele, inputting the subset corresponding to that allele into the deep learning model for training, stopping when the prediction performance on the test set is best, thereby completing construction of a neoantigen prediction model for that HLA allele.
Verification on a mass spectrometry benchmark test set (i.e. the test set) shows that the prediction model constructed by this method for constructing a prediction model of clinically individualized tumor neoantigens is superior, in positive predictive value and specificity, to the prediction algorithm recommended by IEDB (NetMHCpan4EL); the positive predictive value is improved by nearly 80%.
MS-identified HLA class I binding peptides were collected from other studies that used cell lines engineered to express a single HLA allele. For each mass-spectrometry-characterized binding peptide, length-matched peptides not observed by mass spectrometry were extracted from the same protein in the UniProt human reference proteome (UP000005640_9606) as negative peptides. 99 decoy peptides were randomly extracted from it, in equal numbers at each length (8, 9, 10, 11). After removing peptides present in the prediction model's training data and duplicates sampled from different proteins, the benchmark data set was obtained.
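The decoy construction described above can be sketched like this. The function name, the toy protein sequence and the per-length count are illustrative (the text draws 99 decoys across four lengths from the real proteome):

```python
import random

def sample_decoys(protein, observed, lengths=(8, 9, 10, 11), per_length=2, seed=0):
    """Draw length-matched substrings of `protein` that were not observed
    by mass spectrometry, without duplicates."""
    rng = random.Random(seed)
    decoys = []
    for L in lengths:
        candidates = {protein[i:i + L] for i in range(len(protein) - L + 1)}
        candidates -= set(observed) | set(decoys)   # drop observed peptides and repeats
        decoys += rng.sample(sorted(candidates), per_length)
    return decoys

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"  # toy 56-mer
observed = {"MKTAYIAK"}
d = sample_decoys(protein, observed)
print(len(d))  # 8: two decoys at each of the four lengths
```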
On this benchmark data set, the positive predictive value (PPV) of the prediction model described above was 40%, while that of NetMHCpan4EL was only 22%. Across HLA-specific alleles, only A*02:01 is slightly lower than NetMHCpan; for the remaining subtypes, such as A*02:03, A*29:02, A*32:01 and B*40:01, the PPV of APPM is much higher than that of NetMHCpan.
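For reference, positive predictive value is simply the fraction of predicted binders that are true binders. The counts below are invented to reproduce the reported percentages; the actual tallies behind the 40% and 22% figures are not given in the text.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)

# Hypothetical counts per 100 predicted binders matching the reported rates
print(ppv(40, 60), ppv(22, 78))  # 0.4 0.22
```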
In practical use, the tumor tissue and normal tissue of a patient are first sequenced; somatic mutation information is then derived from the sequencing data; and the corresponding mutant peptides are input into the prediction model to evaluate their likelihood of being potential neoantigens for that patient, achieving the goal of predicting clinically individualized tumor neoantigens.
Claims (10)
1. A method for constructing a prediction model of clinically individualized tumor neoantigens, characterized by comprising the following steps:
step 1, selecting training data, the training data comprising, as positive peptides, antigen peptides eluted from co-immunoprecipitated HLA molecules and identified by liquid chromatography-tandem mass spectrometry, and, as negative peptides, peptide fragments matched in length to the positive peptides and not detected by mass spectrometry, randomly extracted from a reference proteome;
step 2, cleaning the training data, at least including removing peptide fragments in the training data that contain unknown or indistinguishable amino acids;
step 3, where α is the maximum length in amino acids among the peptide fragments in the training data, expressing all peptide fragments in the training data as vectors of length α, wherein peptide fragments shorter than α amino acids are padded so as to be expressed as vectors of length α, and then vectorizing the amino acid sequence of each peptide fragment using a one-hot encoding scheme;
step 4, dividing the training data into 20 subsets according to HLA allele, and randomly splitting each subset into a training set, a verification set and a test set, ensuring that the distribution of positive and negative peptides is approximately the same and that any given peptide exists in only one of the training, verification and test sets;
step 5, establishing a deep learning model based on a convolutional neural network, the output of which is whether a peptide can be bound by HLA molecules and presented;
and step 6, for each HLA allele, inputting the subset corresponding to that allele into the deep learning model for training, stopping when the prediction performance on the test set is best, thereby completing construction of a neoantigen prediction model for that HLA allele.
2. The method for constructing a prediction model of clinically individualized tumor neoantigens according to claim 1, wherein in step 2, when the training data is cleaned, peptide fragments shorter than 8 or longer than 11 amino acids are also removed from the training data;
and accordingly, in step 3, α is 11.
3. The method according to claim 1, wherein padding the peptide fragments shorter than α amino acids in the training data so as to express them as vectors of length α in step 3 comprises: selecting a uniform padding character, the padding character being any letter that does not represent an amino acid, and padding each peptide fragment shorter than α amino acids from its middle with the padding character so as to express it as a vector of length α; the letters not representing amino acids include "O", "J", "U" and "Z".
4. The method for constructing a prediction model of clinically individualized tumor neoantigens according to claim 3, wherein in step 3, vectorizing the amino acid sequences of the peptide fragments using the one-hot encoding scheme comprises:
step 3A, assigning a unique integer to each capital letter in a 21-letter amino acid alphabet containing the padding character, as the index of that letter in the alphabet;
step 3B, establishing for each amino acid and the padding character a one-hot vector of 21 elements consisting of 0s and 1s according to the integer assigned to the corresponding letter, in which only the position given by the index is 1 and all other elements are 0;
and step 3C, for any peptide fragment, vertically stacking the one-hot vectors of all amino acids in its amino acid sequence into a one-hot matrix, i.e. converting all peptide fragments in the training data into fixed 11-row by 21-column matrices recognizable by a computer, completing the vectorization.
5. The method for constructing a prediction model of clinically individualized tumor neoantigens according to claim 1, wherein between step 3 and step 4, the method further comprises:
step 7, oversampling and undersampling the training data to adjust the balance between positive and negative peptides.
6. The method according to claim 5, wherein the oversampling comprises duplicating positive-peptide data and the undersampling comprises randomly deleting negative-peptide data.
7. The method according to any one of claims 1 to 6, wherein in step 5, the deep learning model comprises three convolution modules connected in parallel, each comprising 8 two-dimensional convolution layers, the convolution modules employing filters of different numbers and sizes and different strides; the outputs of the three convolution modules are flattened and concatenated, then fed into a fully connected layer of 100 nodes, and finally into an output layer of two nodes corresponding respectively to the two classification results, i.e. can be bound by HLA molecules and presented, and cannot be bound by HLA molecules and presented.
8. The method according to claim 7, wherein activation in the convolution modules and the fully connected layer uses a Leaky Rectified Linear Unit activation function, the output layer uses a Softmax activation function, the cost function is a Softmax cross-entropy loss function, an optimizer built with the Adam optimization algorithm optimizes the cost function, an adaptive learning rate is used as the optimizer's input, a mini-batch gradient descent algorithm is used, the batch size is set to 64, and the maximum number of iterations is set to 1000.
9. The method according to claim 8, wherein an early stopping strategy and a random dropout strategy are introduced into the deep learning model, the early stopping strategy appearing in model training as follows: if the accuracy or the loss function does not improve after a preset number of iterations, the model stops training early even though it has not reached the maximum number of iterations specified before training; the random dropout strategy is introduced at the fully connected layer with a dropout rate of 50%.
10. The method according to claim 9, wherein three different verification sets are used in the early stopping strategy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110439857.3A CN113129998B (en) | 2021-04-23 | 2021-04-23 | Method for constructing prediction model of clinical individualized tumor neoantigen |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113129998A CN113129998A (en) | 2021-07-16 |
CN113129998B true CN113129998B (en) | 2022-06-21 |
Family
ID=76779583
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110439857.3A Expired - Fee Related CN113129998B (en) | 2021-04-23 | 2021-04-23 | Method for constructing prediction model of clinical individualized tumor neoantigen |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113129998B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110996990A (en) * | 2017-06-02 | 2020-04-10 | 亚利桑那州立大学董事会 | Universal cancer vaccines and methods of making and using the same |
WO2020132235A1 (en) * | 2018-12-20 | 2020-06-25 | Merck Sharp & Dohme Corp. | Methods and systems for the precise identification of immunogenic tumor neoantigens |
CN111798919A (en) * | 2020-06-24 | 2020-10-20 | 上海交通大学 | Tumor neoantigen prediction method, prediction device and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI765875B (en) * | 2015-12-16 | 2022-06-01 | 美商磨石生物公司 | Neoantigen identification, manufacture, and use |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20221128 Address after: 610000 Chengdu Economic and Technological Development Zone (Longquanyi District), Chengdu, Sichuan Province Patentee after: Sichuan Yunshixin Medical Laboratory Co.,Ltd. Address before: No.24, 15th floor, building 8, No.88, Shengbang street, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan 610000 Patentee before: Yunce Intelligent Technology Co.,Ltd. |
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20220621 |