CN113129998B - Method for constructing prediction model of clinical individualized tumor neoantigen - Google Patents
- Publication number
- CN113129998B (application CN202110439857.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/20—Protein or domain folding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention relates to prediction technology and addresses the high false-positive rate and low accuracy of existing clinical individualized tumor neoantigen prediction algorithms by providing a method for constructing a prediction model of clinical individualized tumor neoantigens. The technical scheme can be summarized as follows: first, training data are selected and then cleaned; all peptide fragments are vectorized; the training data are divided into 20 subsets by HLA allele, and each subset is randomly split into a training set, a verification set and a test set; a deep learning model is then built from a convolutional neural network, whose output indicates whether a peptide can be bound and presented by HLA molecules; finally, the model is trained until prediction performance on the test set is best. The constructed prediction model has better specificity and accuracy, and the method is suitable for constructing prediction models of clinical individualized tumor neoantigens.
Description
Technical Field
The invention relates to prediction technology, in particular to technology for predicting clinical individualized tumor neoantigens.
Background
Tumor immunotherapy is a treatment that fights tumors by activating the host's immune system, and it has brought significant improvements in survival and quality of life to patients with various malignant tumors. Compared with conventional treatments (such as chemotherapy, radiotherapy and surgery), it has the decisive advantages of high specificity and low side effects, and can achieve precise tumor killing by pharmacologically enhancing the body's pre-existing immune response or by inducing a novel immune response against tumor growth and metastasis. The molecular basis by which the host immune system distinguishes cancer cells from normal cells is a class of tumor-specific antigens expressed only on tumor cells, called neoantigens.
Cancer is a disease driven by the accumulation of somatic mutations that lead to abnormal cell proliferation. When such tumor-specific somatic mutations occur in protein-coding regions, they produce mutant peptides. These mutant peptides are presented on the tumor cell surface by major histocompatibility complex (MHC) molecules via the endogenous antigen processing pathway; when recognized by the T cell receptor (TCR), they trigger T cell-mediated specific killing of cancer cells. Such mutant peptides are neoantigens.
Current personalized immunotherapies based on neoantigens mainly include neoantigen vaccines (cancer vaccines) and adoptive T cell therapy. In preclinical studies in humans, neoantigen vaccines have been shown to induce neoantigen-specific T cells in melanoma and glioblastoma, protecting against melanoma recurrence and metastasis; adoptive T cell therapy, in which neoantigen-specific T cells are isolated from the body, expanded in vitro and reinfused, has already shown anti-tumor effects and induced tumor regression in a variety of malignancies.
Currently, a patient's somatic mutations can be accurately identified with next-generation sequencing technology and related bioinformatics tools. However, accurately, efficiently and cost-effectively predicting which somatic mutations generate immunogenic neoantigens remains difficult, and precise identification of neoantigens is limited by the low specificity of current prediction algorithms. This is mainly because most of these algorithms, such as MHCflurry, SMM, ANN, PickPocket and NetMHCpan (BA), are trained on binding affinity data between antigen peptides and specific Human Leukocyte Antigen (HLA) alleles, whereas whether a neoantigen can be recognized by the immune system, i.e., its immunogenicity, depends on a series of complex events, including mutation expression, peptide processing, transport, and binding to and presentation by HLA molecules. The affinity data are derived from in vitro experiments and consider only the single factor of antigen peptide binding to HLA molecules, neglecting other biological characteristics and therefore yielding a large number of false-positive results. In addition, although current neoantigen prediction methods use machine learning models, the neural network structures are simple, with few hidden layers, and cannot capture the spatial structure of amino acid positions within the antigen peptide. In view of the above, there is an urgent need for a novel, highly accurate neoantigen prediction tool that covers the wide diversity of HLA alleles.
Disclosure of Invention
The invention aims to overcome the high false-positive rate and low accuracy of conventional tumor neoantigen prediction algorithms and provides a method for constructing a clinical individualized tumor neoantigen prediction model.
The invention solves the technical problem and adopts the technical scheme that the method for constructing the prediction model of the clinical individualized tumor neoantigen comprises the following steps:
step 1, selecting training data, wherein antigen peptides eluted from co-immunoprecipitated HLA molecules and identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS) are used as positive peptides, and peptide fragments matched in length to the positive peptides but not detected by mass spectrometry are randomly extracted from a reference proteome (SwissProt) as negative peptides;
step 2, cleaning the training data, at least including removing peptide fragments that contain unknown or indistinguishable amino acids;
step 3, letting α be the maximum amino acid length among the peptide fragments in the training data, expressing all peptide fragments in the training data as vectors of length α, padding any peptide fragment shorter than α amino acids so that it can be expressed as a vector of length α, and vectorizing the amino acid sequence of each peptide fragment using a one-hot encoding scheme;
step 4, dividing the training data into 20 subsets according to HLA allele, and randomly splitting each subset into a training set, a verification set and a test set, keeping the distribution of positive and negative peptides approximately the same and ensuring that any given peptide appears in only one of the training set, the verification set and the test set;
step 5, establishing a deep learning model based on a convolutional neural network, wherein the output of the deep learning model indicates whether an input peptide can be bound and presented by HLA molecules;
and step 6, for each HLA allele, inputting the subset corresponding to that HLA allele into the deep learning model for training, stopping training when prediction performance on the test set is best, thereby completing construction of the neoantigen prediction model for that HLA allele.
Specifically, for convenience of subsequent computer processing, in step 1 the label of the positive peptides is set to 1 and the label of the negative peptides is set to 0.
Further, to specify how peptide fragments containing unknown or indistinguishable amino acids are removed: in step 2, removing such peptide fragments from the training data refers to eliminating peptide fragments containing "X" and/or "B".
Specifically, to facilitate subsequent computer recognition, in step 2 peptide fragments containing lower-case letters are also converted to upper case.
Still further, since the peptide length of a neoantigen is only 8-15 amino acids, and 95% of neoantigen peptides are 8-11 amino acids long, in order to reduce the data volume, in step 2 peptide fragments shorter than 8 or longer than 11 amino acids are also removed when the training data are cleaned;
accordingly, in step 3, α is 11.
Specifically, to illustrate how peptide fragments shorter than α amino acids are padded: in step 3, padding a peptide fragment shorter than α amino acids so that it is expressed as a vector of length α means selecting a uniform padding character, which may be any letter that does not represent an amino acid, and inserting the padding character into the middle of the peptide fragment until its length reaches α. The letters that do not represent amino acids include "O", "J", "U" and "Z".
Further, to illustrate a specific method for vectorizing the amino acid sequence of each peptide fragment with a one-hot encoding scheme, in step 3 the method is as follows:
step 3A, assigning a unique integer to each capital letter in the 21-letter amino acid alphabet (the 20 amino acids plus the padding character) as that letter's index in the alphabet;
step 3B, building, for each amino acid and the padding character, a one-hot vector of 21 elements consisting of 0s and 1s according to the integer assigned to the corresponding letter, with a 1 at the index position and 0 elsewhere;
and step 3C, for each peptide fragment, vertically stacking the one-hot vectors of all amino acids in its sequence into a one-hot matrix, thereby converting every peptide fragment in the training data into a fixed computer-readable matrix of 11 rows and 21 columns, completing vectorization.
Specifically, in order to eliminate the class imbalance in the training data, thereby saving subsequent model training time and improving training efficiency, the method further includes the following step between step 3 and step 4:
step 7, oversampling and undersampling the training data to adjust the balance between the positive peptides and the negative peptides.
Further, to illustrate oversampling and undersampling: oversampling refers to repeating positive-peptide data, and undersampling refers to randomly deleting negative-peptide data.
Specifically, to explain the deep learning model: in step 5, the deep learning model consists of three convolution modules connected in parallel, each containing 8 two-dimensional convolution layers; each convolution module uses filters of different numbers and sizes and different strides. The outputs of the three convolution modules are flattened and concatenated, then fed into a fully connected layer of 100 nodes, and finally into an output layer of two nodes corresponding to the two classification results: can be bound and presented by HLA molecules, and cannot be bound and presented by HLA molecules.
Further, to explain the functions and parameters used in the deep learning model: the convolution modules and the fully connected layer use a Leaky Rectified Linear Unit (Leaky ReLU) activation function; the output layer uses a Softmax activation function; the cost function is the Softmax cross-entropy loss; an optimizer built with the Adam optimization algorithm performs the optimization, taking an adaptive learning rate as input; mini-batch gradient descent is used, with the batch size set to 64 and the maximum number of iterations (epochs) set to 1000.
Specifically, in order to prevent model overfitting, an early stopping strategy and a dropout strategy are introduced into the deep learning model. The early stopping strategy means that, during model training, if the accuracy or the loss function does not improve after a preset number of iterations, training stops early even before reaching the maximum number of iterations specified before training. The dropout strategy is applied at the fully connected layer with a dropout rate of 50%.
Further, to obtain more stable results, three different verification sets are used in the early stopping strategy.
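The early stopping strategy described above can be sketched as a simple counter in Python; the class name, the `patience` value (the "preset number of iterations") and the choice of monitoring the validation loss are illustrative assumptions, not part of the claims:

```python
class EarlyStopping:
    """Stop training early if the monitored validation loss fails to
    improve for `patience` consecutive epochs (the preset iteration
    number); the patience value here is an assumed example."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when
        training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In training, `step` would be called once per epoch, and the loop would break as soon as it returns True, even if the maximum of 1000 epochs has not been reached.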
The beneficial effects of the invention are as follows: by constructing the clinical individualized tumor neoantigen prediction model with antigen peptides identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS) as positive peptides, the method analyzes peptides naturally presented on the cell surface and integrates the influence of multiple intracellular steps such as antigen processing and transport. Verified with independent mass spectrometry data (a separate test set for each HLA allele), the constructed prediction model has better specificity and accuracy.
Detailed Description
The technical solution of the present invention will be described in detail with reference to the following examples.
The invention relates to a method for constructing a prediction model of a clinical individualized tumor neoantigen, which comprises the following steps:
step 1, selecting training data, wherein antigen peptides eluted from co-immunoprecipitated HLA molecules and identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS) are used as positive peptides, and peptide fragments matched in length to the positive peptides but not detected by mass spectrometry are randomly extracted from a reference proteome (SwissProt) as negative peptides.
For the convenience of subsequent computer identification, the label of the positive peptide can be set to 1 and the label of the negative peptide can be set to 0 in this step.
Specific examples are as follows:
step 1A, 48376 HLA-binding peptides identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS), i.e., antigenic peptides, were collected as positive peptides from 16 genetically engineered HLA-A, HLA-B cell lines stably expressing a single HLA allele and B lymphocytes or cancer cell lines expressing multiple HLA-complex alleles.
Step 1B, randomly extracting from a reference proteome (SwissProt) peptide fragments that match the positive peptides in length but were not detected by mass spectrometry, as negative peptides, and combining the negative and positive peptides into more than 1.9 million training records;
step 1C, the label of the positive peptide is set to 1, and the label of the negative peptide is set to 0.
And step 2, cleaning the training data, at least removing peptide fragments in the training data that contain unknown or indistinguishable amino acids.
To specify how peptide fragments containing unknown or indistinguishable amino acids are removed: in this step, such removal means eliminating peptide fragments containing "X" and/or "B". To facilitate subsequent computer recognition, peptide fragments containing lower-case letters are also converted to upper case in this step. Because the peptide length of a neoantigen is only 8-15 amino acids, and 95% of neoantigen peptides are 8-11 amino acids long, in order to reduce the data volume, peptide fragments shorter than 8 or longer than 11 amino acids are also removed when the training data are cleaned in this step.
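The three cleaning rules of step 2 can be sketched in Python as follows; the function name and its list-based interface are illustrative, not part of the invention:

```python
def clean_training_data(peptides):
    """Clean training peptides per step 2: upper-case the sequences,
    drop fragments with unknown/indistinguishable residues ('X'/'B'),
    and keep only lengths of 8-11 amino acids."""
    cleaned = []
    for pep in peptides:
        pep = pep.upper()
        if "X" in pep or "B" in pep:
            continue  # unknown or indistinguishable amino acid
        if not 8 <= len(pep) <= 11:
            continue  # outside the 8-11 residue range
        cleaned.append(pep)
    return cleaned
```

For example, `clean_training_data(["acdefghi", "AXDEFGHI", "ACDEFGH"])` keeps only the first peptide (upper-cased), since the second contains "X" and the third is shorter than 8 residues.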
And step 3, letting α be the maximum amino acid length among the peptide fragments in the training data, expressing all peptide fragments as vectors of length α, padding any peptide fragment shorter than α amino acids so that it can be expressed as a vector of length α, and vectorizing the amino acid sequence of each peptide fragment using a one-hot encoding scheme.
If peptide fragments shorter than 8 or longer than 11 amino acids were removed from the training data in step 2, then α = 11 in this step.
To illustrate how peptide fragments shorter than α amino acids are padded: in this step, padding a peptide fragment shorter than α amino acids so that it is expressed as a vector of length α may mean selecting a uniform padding character, which is any letter that does not represent an amino acid, and inserting the padding character into the middle of the peptide fragment until its length reaches α. The letters that do not represent amino acids include "O", "J", "U" and "Z".
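The middle-padding rule can be sketched in Python as follows; the patent does not specify exactly where "the middle" falls for odd-length peptides, so the split point used here (left half keeps the extra residue) is an assumption:

```python
def pad_middle(pep, alpha=11, pad_char="Z"):
    """Pad a peptide shorter than `alpha` residues by inserting the
    padding character into the middle of the sequence. The exact split
    point for odd lengths is an assumption, not stated in the patent."""
    n_pad = alpha - len(pep)
    if n_pad <= 0:
        return pep  # already at the target length
    mid = (len(pep) + 1) // 2
    return pep[:mid] + pad_char * n_pad + pep[mid:]
```

For example, the 9-residue peptide "ARHSLLQTL" becomes the 11-character string "ARHSLZZLQTL" after two "Z" characters are inserted in the middle.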
To illustrate a specific method for vectorizing the amino acid sequence of each peptide fragment with a one-hot encoding scheme (here the padding character is "Z" as an example), in this step the method may be:
step 3A, assigning a unique integer to each capital letter in the 21-letter amino acid alphabet (the 20 amino acids plus the padding character) as that letter's index in the alphabet; when the padding character is "Z", the amino acid alphabet is "ACDEFGHIKLMNPQRSTVWYZ", and alanine "A" corresponds to index 1;
step 3B, building, for each amino acid and the padding character, a one-hot vector of 21 elements consisting of 0s and 1s according to the integer assigned to the corresponding letter, with a 1 at the index position and 0 elsewhere; for example, the one-hot vector of alanine "A" is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
and step 3C, for each peptide fragment, vertically stacking the one-hot vectors of all amino acids in its sequence into a one-hot matrix, thereby converting every peptide fragment in the training data into a fixed computer-readable matrix of 11 rows and 21 columns, completing vectorization. Taking the peptide fragment "ARHSZZLLQTLQ" as an example, the fixed matrix is shown in Table 1.
TABLE 1 Fixed one-hot matrix for the peptide fragment "ARHSZZLLQTLQ"
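Steps 3A-3C can be sketched with NumPy as follows. Note that the sketch uses 0-based indexing (so "A" is at position 0) rather than the 1-based index in the text; the function name and the example peptide are illustrative:

```python
import numpy as np

# 21-letter alphabet: the 20 amino acids plus the padding character "Z"
ALPHABET = "ACDEFGHIKLMNPQRSTVWYZ"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}  # step 3A (0-based)

def one_hot(pep):
    """Step 3B/3C: convert a padded 11-residue peptide into an
    11x21 one-hot matrix, one row per residue."""
    mat = np.zeros((len(pep), len(ALPHABET)), dtype=np.float32)
    for row, aa in enumerate(pep):
        mat[row, INDEX[aa]] = 1.0
    return mat
```

Each row of the resulting matrix contains exactly one 1, so an 11-residue padded peptide always maps to a fixed 11x21 matrix with 11 non-zero entries.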
Here, since the number of negative peptides is much higher than that of positive peptides, in order to eliminate the class imbalance in the training data, thereby saving subsequent model training time and improving training efficiency, the method further includes the following step between step 3 and step 4:
and 7, oversampling and undersampling the training data to adjust the balance between the positive peptides and the negative peptides.
To illustrate oversampling and undersampling, oversampling refers to repeating data for positive peptides and undersampling refers to randomly deleting data for negative peptides.
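The over- and undersampling of step 7 can be sketched in Python as follows; the target class size (the average of the two class sizes) and the function interface are assumptions, since the patent only describes repeating positive data and randomly deleting negative data:

```python
import random

def balance(positives, negatives, seed=0):
    """Balance the classes: oversample positives by repetition and
    undersample negatives by random deletion. The target size (the
    mean of the two class sizes) is an assumed choice."""
    rng = random.Random(seed)
    target = (len(positives) + len(negatives)) // 2
    # Oversample: repeat the positive data, then top up with a random sample.
    over = positives * (target // len(positives)) \
        + rng.sample(positives, target % len(positives))
    # Undersample: randomly keep `target` negative records.
    under = rng.sample(negatives, target)
    return over, under
```

Because the negatives far outnumber the positives, `target` is always at most the negative-class size, so the random sample of negatives is well defined.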
And step 4, dividing the training data into 20 subsets according to HLA allele, and randomly splitting each subset into a training set, a verification set and a test set, keeping the distribution of positive and negative peptides approximately the same and ensuring that any given peptide appears in only one of the training set, the verification set and the test set.
In this step, the verification set is used only for early stopping, the training set is used for the feed-forward and back-propagation passes, and the test set is used to evaluate performance; the main indicators are sensitivity, specificity and AUC (Area Under the ROC Curve).
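Step 4's per-allele, class-stratified split can be sketched as follows. The 80/10/10 ratios are an assumption (the patent gives no split proportions), and the record format is invented for the example; stratifying each label separately keeps the positive/negative distribution approximately equal across partitions, and shuffling once assigns each unique peptide to exactly one partition.

```python
import random
from collections import defaultdict

def split_by_allele(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """records: (allele, peptide, label) tuples, one per unique peptide.
    Returns {allele: (train, val, test)}, stratified by label."""
    rng = random.Random(seed)
    by_class = defaultdict(lambda: defaultdict(list))
    for allele, peptide, label in records:
        by_class[allele][label].append((peptide, label))
    splits = {}
    for allele, classes in by_class.items():
        train, val, test = [], [], []
        for items in classes.values():      # stratify: split each class separately
            rng.shuffle(items)
            a = int(len(items) * ratios[0])
            b = int(len(items) * (ratios[0] + ratios[1]))
            train += items[:a]; val += items[a:b]; test += items[b:]
        splits[allele] = (train, val, test)
    return splits

records = [("HLA-A*02:01", f"PEP{i:03d}", i % 2) for i in range(20)]
train, val, test = split_by_allele(records)["HLA-A*02:01"]
print(len(train), len(val), len(test))  # 16 2 2
```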
Step 5, establishing a deep learning model based on a convolutional neural network, the output of which is whether a peptide can be bound by HLA molecules and presented.
In this step, the deep learning model preferably consists of three convolution modules connected in parallel, each containing 8 two-dimensional convolution layers; the convolution modules use filters of different numbers and sizes and different strides (the stride being how far the filter advances at each step). The outputs of the three convolution modules are flattened and concatenated, then fed into a fully connected layer of 100 nodes, and finally into an output layer of two nodes corresponding to the two classification results: can be bound by HLA molecules and presented, and cannot be bound by HLA molecules and presented.
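As an illustration only, the parallel-module layout can be traced with a NumPy shape-bookkeeping sketch. The filter counts, kernel heights and stride below are invented for the example (the patent does not disclose them), and each branch is collapsed to a single convolution instead of eight layers for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels, stride=1):
    """Valid 2-D convolution: x (H, W), kernels (n, kh, kw) -> (n, H', W')."""
    n, kh, kw = kernels.shape
    H, W = x.shape
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    out = np.empty((n, oh, ow))
    for k in range(n):
        for i in range(oh):
            for j in range(ow):
                patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[k, i, j] = np.sum(patch * kernels[k])
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.integers(0, 2, size=(11, 21)).astype(float)  # stand-in 11 x 21 one-hot matrix

# Three parallel branches with different kernel heights (illustrative: 2, 3, 4)
branches = [rng.normal(size=(8, kh, 21)) for kh in (2, 3, 4)]
feats = np.concatenate([conv2d(x, k).ravel() for k in branches])  # flatten + concat

W1 = rng.normal(size=(100, feats.size)); b1 = np.zeros(100)
z1 = W1 @ feats + b1
h = np.maximum(0.2 * z1, z1)               # Leaky ReLU, a = 0.2 (see below)
W2 = rng.normal(size=(2, 100)); b2 = np.zeros(2)
p = softmax(W2 @ h + b2)                   # two nodes: presented vs. not presented
print(p.shape)  # (2,)
```

With these assumed kernels the concatenated feature vector has 8×10 + 8×9 + 8×8 = 216 elements, showing how branches of different filter sizes yield different output heights before flattening.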
Regarding the functions and parameters adopted in the deep learning model: in the convolution modules and the fully connected layer, activation uses the Leaky Rectified Linear Unit (Leaky ReLU) activation function (a = 0.2); the output layer uses the Softmax activation function; the cost function is the Softmax cross-entropy loss function; an optimizer built with the Adam optimization algorithm optimizes the cost function, with an adaptive learning rate as the optimizer's input (the learning rate can start at 0.003 and decrease as the number of iterations increases, down to a minimum of 0.0001); a mini-batch gradient descent algorithm is used with a batch size of 64, and the maximum number of iterations (epochs) is set to 1000.
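The named functions can be written out as follows. The Leaky ReLU slope (a = 0.2) and the learning-rate endpoints (0.003 down to 0.0001) come from the text; the exponential form of the decay schedule is an assumption, since the patent states only that the rate decreases with iterations.

```python
import numpy as np

def leaky_relu(z, a=0.2):
    """Leaky ReLU with negative slope a = 0.2, as used in the conv/dense layers."""
    return np.where(z > 0, z, a * z)

def softmax_cross_entropy(logits, label):
    """Softmax cross-entropy loss for one example; label is the true class index."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def learning_rate(epoch, lr0=0.003, lr_min=0.0001, decay=0.99):
    """Assumed exponential decay from 0.003, floored at 0.0001."""
    return max(lr_min, lr0 * decay ** epoch)

print(leaky_relu(np.array([-1.0, 2.0])))  # [-0.2  2. ]
print(round(softmax_cross_entropy(np.array([2.0, 2.0]), 0), 4))  # 0.6931 (= ln 2)
print(learning_rate(0), learning_rate(10_000))  # 0.003 0.0001
```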
To prevent the model from overfitting, an early stopping strategy and a random dropout strategy can be introduced into the deep learning model. The early stopping strategy appears in model training as follows: if the accuracy or the loss function does not improve after a preset number of iterations, the model stops training early even though it has not reached the maximum number of iterations specified before training. The dropout strategy is introduced at the fully connected layer with a dropout rate of 50%, i.e. half of the neurons in that layer are randomly deactivated during training. These measures effectively keep the model from over-relying on a few local features while also saving training time.
To obtain more stable results, three different verification sets are preferably used in the early stopping strategy. The reason is this: typical neoantigen-prediction neural network models use a single verification set, whereas this example uses 3 different verification data sets (1000 samples each) and defines the stopping rule on the joint improvement of accuracy across all three; if their combined performance does not improve after 300 iterations, training is terminated. Because the verification sets are drawn at random from the training data, early stopping does not depend on any single verification set, yielding more stable results.
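The three-validation-set stopping rule above can be sketched as follows. Interpreting "combined performance" as the mean accuracy over the three sets is an assumption; the 300-iteration patience comes from the text, while the demo uses a small patience so the toy trace triggers quickly.

```python
def early_stop_epoch(val_acc_histories, patience=300):
    """val_acc_histories: three per-epoch accuracy lists, one per validation set.
    Stop once the mean accuracy over the three sets has not improved for
    `patience` consecutive epochs; return the epoch at which training stops
    (or the last epoch if stopping never triggers)."""
    n_epochs = len(val_acc_histories[0])
    best, best_epoch = float("-inf"), 0
    for epoch in range(n_epochs):
        combined = sum(h[epoch] for h in val_acc_histories) / len(val_acc_histories)
        if combined > best:
            best, best_epoch = combined, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return n_epochs - 1

# Toy traces: accuracy improves until epoch 4, then plateaus
hist = [[0.5, 0.6, 0.7, 0.8, 0.81] + [0.81] * 10 for _ in range(3)]
print(early_stop_epoch(hist, patience=5))  # 9: five non-improving epochs after epoch 4
```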
The whole convolutional neural network deep learning model can be implemented with TensorFlow v1.14.0 and Python 3.7.
Step 6, for each HLA allele, inputting the subset corresponding to that allele into the deep learning model for training, stopping when the prediction performance on the test set is best, thereby completing construction of a neoantigen prediction model for that HLA allele.
Verification on a mass spectrometry benchmark test set (i.e. the test set) shows that the prediction model constructed by this method for constructing a prediction model of clinically individualized tumor neoantigens is superior, in positive predictive value and specificity, to the prediction algorithm recommended by IEDB (NetMHCpan4EL); the positive predictive value is improved by nearly 80%.
MS-identified HLA class I binding peptides were collected from other studies that used cell lines engineered to express a single HLA allele. For each mass-spectrometry-characterized binding peptide, length-matched peptides not observed by mass spectrometry were extracted from the same protein in the UniProt human reference proteome (UP000005640_9606) as negative peptides. 99 decoy peptides were randomly extracted from it, in equal numbers at each length (8, 9, 10, 11). After removing peptides present in the prediction model's training data and duplicates sampled from different proteins, the benchmark data set was obtained.
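The decoy construction described above can be sketched like this. The function name, the toy protein sequence and the per-length count are illustrative (the text draws 99 decoys across four lengths from the real proteome):

```python
import random

def sample_decoys(protein, observed, lengths=(8, 9, 10, 11), per_length=2, seed=0):
    """Draw length-matched substrings of `protein` that were not observed
    by mass spectrometry, without duplicates."""
    rng = random.Random(seed)
    decoys = []
    for L in lengths:
        candidates = {protein[i:i + L] for i in range(len(protein) - L + 1)}
        candidates -= set(observed) | set(decoys)   # drop observed peptides and repeats
        decoys += rng.sample(sorted(candidates), per_length)
    return decoys

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"  # toy 56-mer
observed = {"MKTAYIAK"}
d = sample_decoys(protein, observed)
print(len(d))  # 8: two decoys at each of the four lengths
```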
On this benchmark data set, the positive predictive value (PPV) of the prediction model described above was 40%, while that of NetMHCpan4EL was only 22%. Across HLA-specific alleles, only A*02:01 is slightly lower than NetMHCpan; for the remaining subtypes, such as A*02:03, A*29:02, A*32:01 and B*40:01, the PPV of APPM is much higher than that of NetMHCpan.
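For reference, positive predictive value is simply the fraction of predicted binders that are true binders. The counts below are invented to reproduce the reported percentages; the actual tallies behind the 40% and 22% figures are not given in the text.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)

# Hypothetical counts per 100 predicted binders matching the reported rates
print(ppv(40, 60), ppv(22, 78))  # 0.4 0.22
```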
In practical use, the tumor tissue and normal tissue of a patient are first sequenced; somatic mutation information is then derived from the sequencing data; and the corresponding mutant peptides are input into the prediction model to evaluate their likelihood of being potential neoantigens for that patient, achieving the goal of predicting clinically individualized tumor neoantigens.
Claims (10)
1. A method for constructing a prediction model of clinically individualized tumor neoantigens, characterized by comprising the following steps:
step 1, selecting training data, the training data comprising, as positive peptides, antigen peptides eluted from co-immunoprecipitated HLA molecules and identified by liquid chromatography-tandem mass spectrometry, and, as negative peptides, peptide fragments matched in length to the positive peptides and not detected by mass spectrometry, randomly extracted from a reference proteome;
step 2, cleaning the training data, at least including removing peptide fragments in the training data that contain unknown or indistinguishable amino acids;
step 3, where α is the maximum length in amino acids among the peptide fragments in the training data, expressing all peptide fragments in the training data as vectors of length α, wherein peptide fragments shorter than α amino acids are padded so as to be expressed as vectors of length α, and then vectorizing the amino acid sequence of each peptide fragment using a one-hot encoding scheme;
step 4, dividing the training data into 20 subsets according to HLA allele, and randomly splitting each subset into a training set, a verification set and a test set, ensuring that the distribution of positive and negative peptides is approximately the same and that any given peptide exists in only one of the training, verification and test sets;
step 5, establishing a deep learning model based on a convolutional neural network, the output of which is whether a peptide can be bound by HLA molecules and presented;
and step 6, for each HLA allele, inputting the subset corresponding to that allele into the deep learning model for training, stopping when the prediction performance on the test set is best, thereby completing construction of a neoantigen prediction model for that HLA allele.
2. The method for constructing a prediction model of clinically individualized tumor neoantigens according to claim 1, wherein in step 2, when the training data is cleaned, peptide fragments shorter than 8 or longer than 11 amino acids are also removed from the training data;
and accordingly, in step 3, α is 11.
3. The method according to claim 1, wherein padding the peptide fragments shorter than α amino acids in the training data so as to express them as vectors of length α in step 3 comprises: selecting a uniform padding character, the padding character being any letter that does not represent an amino acid, and padding each peptide fragment shorter than α amino acids from its middle with the padding character so as to express it as a vector of length α; the letters not representing amino acids include "O", "J", "U" and "Z".
4. The method for constructing a prediction model of clinically individualized tumor neoantigens according to claim 3, wherein in step 3, vectorizing the amino acid sequences of the peptide fragments using the one-hot encoding scheme comprises:
step 3A, assigning a unique integer to each capital letter in a 21-letter amino acid alphabet containing the padding character, as the index of that letter in the alphabet;
step 3B, establishing for each amino acid and the padding character a one-hot vector of 21 elements consisting of 0s and 1s according to the integer assigned to the corresponding letter, in which only the position given by the index is 1 and all other elements are 0;
and step 3C, for any peptide fragment, vertically stacking the one-hot vectors of all amino acids in its amino acid sequence into a one-hot matrix, i.e. converting all peptide fragments in the training data into fixed 11-row by 21-column matrices recognizable by a computer, completing the vectorization.
5. The method for constructing a prediction model of clinically individualized tumor neoantigens according to claim 1, wherein between step 3 and step 4, the method further comprises:
step 7, oversampling and undersampling the training data to adjust the balance between positive and negative peptides.
6. The method according to claim 5, wherein the oversampling comprises duplicating positive-peptide data and the undersampling comprises randomly deleting negative-peptide data.
7. The method according to any one of claims 1 to 6, wherein in step 5, the deep learning model comprises three convolution modules connected in parallel, each comprising 8 two-dimensional convolution layers, the convolution modules employing filters of different numbers and sizes and different strides; the outputs of the three convolution modules are flattened and concatenated, then fed into a fully connected layer of 100 nodes, and finally into an output layer of two nodes corresponding respectively to the two classification results, i.e. can be bound by HLA molecules and presented, and cannot be bound by HLA molecules and presented.
8. The method according to claim 7, wherein activation in the convolution modules and the fully connected layer uses a Leaky Rectified Linear Unit activation function, the output layer uses a Softmax activation function, the cost function is a Softmax cross-entropy loss function, an optimizer built with the Adam optimization algorithm optimizes the cost function, an adaptive learning rate is used as the optimizer's input, a mini-batch gradient descent algorithm is used, the batch size is set to 64, and the maximum number of iterations is set to 1000.
9. The method according to claim 8, wherein an early stopping strategy and a random dropout strategy are introduced into the deep learning model, the early stopping strategy appearing in model training as follows: if the accuracy or the loss function does not improve after a preset number of iterations, the model stops training early even though it has not reached the maximum number of iterations specified before training; the random dropout strategy is introduced at the fully connected layer with a dropout rate of 50%.
10. The method according to claim 9, wherein three different verification sets are used in the early stopping strategy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110439857.3A CN113129998B (en) | 2021-04-23 | 2021-04-23 | Method for constructing prediction model of clinical individualized tumor neoantigen |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113129998A CN113129998A (en) | 2021-07-16 |
CN113129998B true CN113129998B (en) | 2022-06-21 |
Family
ID=76779583
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110439857.3A Expired - Fee Related CN113129998B (en) | 2021-04-23 | 2021-04-23 | Method for constructing prediction model of clinical individualized tumor neoantigen |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113129998B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110996990A (en) * | 2017-06-02 | 2020-04-10 | 亚利桑那州立大学董事会 | Universal cancer vaccines and methods of making and using the same |
WO2020132235A1 (en) * | 2018-12-20 | 2020-06-25 | Merck Sharp & Dohme Corp. | Methods and systems for the precise identification of immunogenic tumor neoantigens |
CN111798919A (en) * | 2020-06-24 | 2020-10-20 | 上海交通大学 | Tumor neoantigen prediction method, prediction device and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI765875B (en) * | 2015-12-16 | 2022-06-01 | 美商磨石生物公司 | Neoantigen identification, manufacture, and use |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20221128 Address after: 610000 Chengdu Economic and Technological Development Zone (Longquanyi District), Chengdu, Sichuan Province Patentee after: Sichuan Yunshixin Medical Laboratory Co.,Ltd. Address before: No.24, 15th floor, building 8, No.88, Shengbang street, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan 610000 Patentee before: Yunce Intelligent Technology Co.,Ltd. |
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20220621 |