CN115497613A

CN115497613A - Model for predicting immunogenicity and application

Info

Publication number: CN115497613A
Application number: CN202211060598.4A
Authority: CN
Inventors: 李明; 陈玉辉; 单宝珍; 辛磊
Original assignee: Baizhen Biotechnology Wuhan Co ltd
Current assignee: Baizhen Biotechnology Wuhan Co ltd
Priority date: 2022-08-31
Filing date: 2022-08-31
Publication date: 2022-12-20

Abstract

The invention provides a model for predicting immunogenicity and an application thereof. HLA-I self-peptides of a patient are obtained by MS-based immunopeptidomics and serve as the negative selection data for the patient's CD8+ T cells; all epitopes reported in positive T cell assays that match the patient's HLA-I alleles are collected from the IEDB and serve as the positive selection data for the patient's CD8+ T cells. A binary classification model is trained for the patient on these personalized negative and positive selection data sets to predict the immunogenicity of candidate neoantigens for CD8+ T cells in cancer patients. The invention circumvents the current difficulty of sequencing the complete TCR sequence of each patient, establishes a method for building a model from MS immunopeptidome data of the patient's self-peptides, and provides a novel personalized model for predicting the immunogenicity of tumor neoantigens.

Description

Model for predicting immunogenicity and application

Technical Field

The invention relates to the technical field of antigen prediction, in particular to a model for predicting immunogenicity and application.

Background

Tumor-specific Human Leukocyte Antigen (HLA) peptides are expressed only on the surface of cancer cells and are ideal targets for the immune system to distinguish cancer cells from normal cells. However, due to the high specificity of T cells, only a small portion of these polypeptides can be recognized by T cells to trigger an immune response; such antigenic HLA peptides are referred to as tumor neoantigens. Although neoantigens work well in cancer immunotherapy, the probability of finding them is very low: usually fewer than 6 neoantigens are found among the thousands of mutations carried by each patient. The commonly used method of neoantigen discovery (e.g. CN 111415707A) is to perform high-throughput DNA sequencing of the tumor genome of each individual patient to find all tumor-specific mutations; antigenicity is then predicted for the proteins and polypeptides expressed from these mutations, the potential neoantigen peptides are synthesized in vitro, and in vitro validation is carried out. These techniques are time-consuming and very expensive.

Immunogenicity prediction uses bioinformatic tools to predict the binding affinity of peptide-HLA (pHLA) complexes for HLA-I and HLA-II, or to model the entire pHLA presentation pathway. Since cancer and immunotherapy in each patient are associated with a large number of individual genetic mutations, and since HLA complexes, T cell populations, and the tumor mutations themselves vary significantly between patients, personalized approaches to cancer immunotherapy should be considered. One potential approach (e.g. CN 113160887A) is to sequence the patient's TCRs and predict TCR-pHLA recognition based on their sequence or structure. However, despite continuous and intensive TCR sequencing efforts, it remains difficult to sequence the complete TCR sequence of each patient. Therefore, establishing a personalized model that predicts TCR-pHLA recognition without sequencing the TCR is an urgent problem.

Disclosure of Invention

In view of the above, the present invention provides a model for predicting immunogenicity and an application thereof, in which the immunogenicity of a patient's candidate neoantigens is predicted by simulating the central tolerance of that individual's CD8+ T cells.

The technical scheme of the invention is realized as follows: in one aspect, the present invention provides a model for predicting immunogenicity, the construction of the model comprising the steps of:

S1: extracting the HLA-I polypeptides of a patient, acquiring MS data, searching the acquired MS data against the Swiss-Prot human protein database, and checking the identified HLA-I polypeptides for the characteristic length distribution; peptides shorter than 8 or longer than 14 amino acids are removed, and the remaining peptides constitute the negative selection data model of the patient;

S2: downloading the IEDB T cell epitope database, selecting the positive epitopes that match the HLA-I alleles of the patient, and removing peptides shorter than 8 or longer than 14 amino acids to obtain the positive selection data model of the patient;

S3: converting amino acid letters into integer indexes using TensorFlow, training a Keras Sequential model with the negative selection data model of step S1 and the positive selection data model of step S2 for 100 epochs, and keeping only the model with the best validation loss;

S4: dividing the data of each patient into a training set, a validation set and a test set, training 100 models for each patient as an ensemble, ranking the models by their performance on the validation set, and taking the average of the top 10 models as the final prediction model of the patient;

S5: scoring each input peptide with the final prediction model, wherein a higher score indicates that the input peptide is more likely to be recognized by CD8+ T cells.

Based on the above technical solution, preferably, in step S1, the MS data are searched against the Swiss-Prot human protein database using PEAKS Xpro.

Based on the above technical solution, preferably, in step S2, when downloading the IEDB T cell epitope database, the following filtering conditions are set: linear epitopes, HLA class I, host is human or mouse.

On the basis of the above technical scheme, preferably, the positive epitopes in step S2 include positive, positive-high, positive-medium and positive-low epitopes.

Based on the above technical solution, preferably, the Keras Sequential model trained in TensorFlow in step S3 comprises: an embedding layer with 8 units, a bidirectional LSTM layer with 8 units, a fully connected layer with L2 regularization, and a sigmoid activation layer.

On the basis of the above technical solution, preferably, in step S3, the 100 training epochs use the Adam optimizer and a binary cross-entropy loss function.

On the basis of the above technical solution, preferably, in step S4, the ratio of the training set, the validation set, and the test set is 8.

Based on the above technical solution, preferably, in step S4, since the number of negative polypeptides is generally higher than the number of positive polypeptides, the negative peptides are down-sampled so that the data set contains positive and negative peptides in equal proportion.

In another aspect, the invention provides the use of the model for predicting immunogenicity in predicting the immunogenicity of candidate neoantigens in a cancer patient.

Compared with the prior art, the model for predicting immunogenicity by simulating the central tolerance of an individual's CD8+ T cells, and its application, have the following beneficial effects: the present invention uses MS immunopeptidome data of the patient's HLA-I self-peptides as negative training data and allele-matched positive T cell epitopes from the IEDB as positive training data. Multiple experiments show that the prediction accuracy of the model for neoantigens in individual patients (mean ROC-AUC) reaches 79%, which is superior to existing immunogenicity prediction tools. More importantly, candidate peptides can be ranked by predicted immunogenicity, and the immunogenic candidate peptides are ranked within the top 2%, which greatly reduces the experimental cost and time of further in vitro validation.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a summary of HLA alleles, training and evaluation data for 3 cancer patients and 1 mouse cancer cell line in a particular example;

FIG. 2 shows the area under the ROC curve of each prediction tool on experimentally validated neoantigens in a particular implementation;

FIG. 3 is a predictive ranking of immunogenic neo-antigens in candidate polypeptides determined by de novo sequencing of MS data from patient Mel-15 in a particular implementation.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

The technical scheme adopted by the invention is to predict the immunogenicity of candidate neoantigens for the CD8+ T cells of cancer patients by establishing a CD8 positive selection model and a negative selection model, namely a model of the central tolerance of T cells. The construction of the model comprises the following steps:

S1, extracting the HLA-I polypeptides of a patient, and then acquiring MS data.

S2, searching the MS data against the standard Swiss-Prot human protein database using PEAKS Xpro, with no enzyme specificity and a false discovery rate (FDR) of 1%.

S3, checking the identified polypeptides for the characteristic HLA-I length distribution; polypeptides with a length of <8 or >14 amino acids are removed. The remaining polypeptides are regarded as the patient's HLA-I self-polypeptides, which are not recognized by the patient's T cells (i.e., they are non-immunogenic); the peptides screened in this way constitute the negative selection data model of the patient.
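The length filter of step S3 can be illustrated with a minimal Python sketch. The function names, the removal of duplicate peptides and the toy input format (a plain list of peptide strings) are assumptions made for illustration and are not specified in the source.

```python
from collections import Counter


def filter_by_length(peptides, min_len=8, max_len=14):
    """Keep unique peptides whose length lies within [min_len, max_len].

    De-duplication is an assumption of this sketch, not a step of the source.
    """
    seen = set()
    kept = []
    for pep in peptides:
        pep = pep.strip().upper()
        if min_len <= len(pep) <= max_len and pep not in seen:
            seen.add(pep)
            kept.append(pep)
    return kept


def length_distribution(peptides):
    """Count peptides per length, e.g. to inspect the HLA-I length profile."""
    return dict(sorted(Counter(len(p) for p in peptides).items()))
```

The same filter would apply unchanged to the positive epitopes of step S5, since both steps use the 8 to 14 amino acid window.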

S4, downloading an IEDB T cell epitope database, and setting the following filtering conditions: linear epitopes, HLA class I, host human or mouse.

S5, selecting the positive epitopes that match the HLA-I alleles of the patient, wherein the positive epitopes comprise: positive, positive-high, positive-medium and positive-low; epitopes with a length of <8 or >14 amino acids are removed. The remaining polypeptides are assumed to be recognized by the patient's T cells (i.e., they are immunogenic); the peptides screened in this way constitute the positive selection data model of the patient.
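As a rough illustration of step S5, the sketch below selects allele-matched positive epitopes from records that have already been exported from the IEDB. The record keys `peptide`, `allele` and `outcome` are hypothetical names chosen for this example and do not correspond to the column names of the actual IEDB export; the four outcome labels are taken from the text above.

```python
# Outcome labels counted as positive, as listed in step S5.
POSITIVE_LABELS = {"positive", "positive-high", "positive-medium", "positive-low"}


def select_positive_epitopes(records, patient_alleles, min_len=8, max_len=14):
    """Return sorted unique epitopes that match the patient's HLA-I alleles.

    `records` is an iterable of dicts with the hypothetical keys
    'peptide', 'allele' and 'outcome'.
    """
    alleles = set(patient_alleles)
    selected = set()
    for rec in records:
        if rec["allele"] not in alleles:
            continue  # allele not carried by this patient
        if rec["outcome"].strip().lower() not in POSITIVE_LABELS:
            continue  # negative T cell assay
        pep = rec["peptide"].strip().upper()
        if min_len <= len(pep) <= max_len:
            selected.add(pep)
    return sorted(selected)
```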

S6, converting amino acid letters into integer indexes in TensorFlow, representing peptide sequences by 0-15, and training a binary classification model with the negative data model of step S3 and the positive data model of step S5. The Keras Sequential model trained in TensorFlow comprises: an embedding layer with 8 units, a bidirectional LSTM layer with 8 units, a fully connected layer with L2 regularization, and a sigmoid activation layer. Training runs for 100 epochs with the Adam optimizer and binary cross-entropy loss, and only the model with the best validation loss is kept.
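The conversion of amino acid letters into integer indexes can be sketched in plain Python as follows. The index scheme used here (0 for padding, 1 to 20 for the standard amino acids in alphabetical order of their one-letter codes) and the padding to a fixed length of 14 are assumptions of this sketch; the source does not spell out its exact mapping, and this sketch does not attempt to reproduce it.

```python
# Assumed vocabulary: the 20 standard amino acids; index 0 is reserved for padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 14  # longest peptide kept by the length filter


def encode_peptide(peptide, max_len=MAX_LEN):
    """Map a peptide to a fixed-length list of integer indexes, zero-padded."""
    peptide = peptide.strip().upper()
    if len(peptide) > max_len:
        raise ValueError(f"peptide longer than {max_len}: {peptide}")
    try:
        indexes = [AA_TO_INDEX[aa] for aa in peptide]
    except KeyError as exc:
        raise ValueError(f"non-standard residue in {peptide}") from exc
    return indexes + [0] * (max_len - len(indexes))


print(encode_peptide("KLILWRGLK"))
```

Rows encoded this way have a fixed width, which is the usual input form for an embedding layer followed by a recurrent layer.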

S7, dividing the data of each patient into three groups: a training set, a validation set and a test set, the ratio of training set : validation set : test set being 8. Since the number of negative polypeptides is usually several times higher than the number of positive polypeptides, the negative peptides are down-sampled so that the data set contains positive and negative peptides in equal proportion, and 100 models are trained for each patient as an ensemble. The models are ranked according to their performance on the validation set, and the average of the top 10 models is taken as the final prediction model for that patient.
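Two parts of step S7, the down-sampling of negatives and the averaging of the best models, can be sketched as below. This is an illustration under stated assumptions: models are represented as plain callables that map a peptide to a score, validation loss is used as the ranking criterion, and the function names and the fixed random seed are hypothetical. The data split itself is not shown.

```python
import random


def downsample_negatives(positives, negatives, seed=0):
    """Randomly keep as many negatives as there are positives."""
    negatives = list(negatives)
    if len(negatives) <= len(positives):
        return negatives
    return random.Random(seed).sample(negatives, len(positives))


def top_k_ensemble(models, val_losses, top_k=10):
    """Build a predictor that averages the top_k models with the lowest
    validation loss (assumed ranking criterion)."""
    if not models or len(models) != len(val_losses):
        raise ValueError("need one validation loss per model")
    order = sorted(range(len(models)), key=lambda i: val_losses[i])
    best = [models[i] for i in order[:top_k]]

    def predict(peptide):
        # Mean of the individual model scores.
        return sum(model(peptide) for model in best) / len(best)

    return predict
```

With 100 trained models and `top_k=10`, `predict` returns the mean score of the ten best models, which matches the averaging described above.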

S8, the immunogenicity prediction model is named DeepImmun. Each input peptide is given a score from 0 to 1 by the final prediction model, with higher scores indicating that the input peptide is more likely to be recognized by CD8+ T cells in the particular cancer patient.

In a specific embodiment, HLA MS data were modeled separately for 3 cancer patients and 1 mouse cancer cell line (Mel-15, Mel-0D5P, Mel-51, and Mouse-EL4); the number of HLA-I self-polypeptides ranged from 746 to 35548 per patient and the number of immunogenic epitopes ranged from 304 to 2417 per patient, as shown in FIG. 1.

For performance evaluation, DeepImmun was compared with three leading tools: PRIME, NetMHCpan, and the IEDB immunogenicity predictor. NetMHCpan is one of the earliest and most widely used tools for HLA-I binding prediction. The IEDB immunogenicity predictor is among the earliest tools for HLA-I immunogenicity prediction; it showed associations between immunogenicity and amino acid positions 4-6, as well as other properties of amino acid residues, such as hydrophobicity, polarity or large aromatic side chains. PRIME is a recent immunogenicity prediction tool that models both HLA-I binding and TCR recognition. Notably, NetMHCpan is designed to predict HLA-I binding, not immunogenicity, but it is widely used in many neoantigen prediction workflows. The present invention evaluates these four prediction tools against two criteria.

The first criterion is the area under the receiver operating characteristic curve (ROC-AUC) of each tool's predictions for the candidate neoantigens. DeepImmun performed better than the other tools in patients Mel-15 and Mel-0D5P. In patient Mel-51, DeepImmun, PRIME and NetMHCpan performed comparably. For Mouse-EL4, the IEDB predictor reached the highest AUC of 90%, followed by DeepImmun and NetMHCpan (PRIME does not support mouse data), see FIG. 2. Overall, the mean AUC of DeepImmun across the data sets was 79%. The relative performance of the four prediction tools also reflects the characteristics of their underlying models: DeepImmun is a personalized model; PRIME is an allele-specific model; the IEDB predictor is a general model trained on a limited data set; NetMHCpan is a predictive model of HLA binding.
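For reference, the ROC-AUC used as the first criterion equals the probability that a randomly chosen positive peptide receives a higher score than a randomly chosen negative one, with ties counted as one half. The minimal sketch below computes it by direct pairwise comparison; the function name and the toy labels and scores are illustrative and are not data from the source.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 1 (immunogenic) or 0 (non-immunogenic)
    scores: predicted scores in the same order
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The pairwise form is quadratic in the number of peptides; for large candidate lists a rank-based implementation from a statistics library would be the more practical choice.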

The second evaluation criterion is the ability to rank potential neoantigens among the mutant polypeptides identified from the patient. These mutant polypeptides, including neoantigens, cannot be found in the standard Swiss-Prot human protein database. Thus, a de novo sequencing workflow is first applied to the MS data of a patient to identify new peptides that are not present in the database. These candidate peptides are then ranked with the four prediction tools (DeepImmun, PRIME, NetMHCpan and IEDB). The closer the neoantigens are to the top of the candidate peptide list, the more accurate the model. The MS-based method of the invention considers only candidate peptides identified from MS data, whereas genomic methods generally consider candidate peptides translated from genomic sequences.
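The ranking criterion can be expressed as the percentile position of each known neoantigen in the score-sorted candidate list. The sketch below is one simple way to compute it; the function name, the dictionary input and the placeholder peptide names are assumptions for illustration.

```python
def rank_percentiles(scores, targets):
    """Return, for each target peptide, its rank as a percentage of the list.

    scores:  dict mapping candidate peptide -> predicted score
    targets: peptides of interest (must be keys of `scores`)
    A smaller percentage means the peptide sits closer to the top.
    """
    ranked = sorted(scores, key=lambda pep: scores[pep], reverse=True)
    position = {pep: i + 1 for i, pep in enumerate(ranked)}  # 1-based rank
    total = len(ranked)
    return {t: 100.0 * position[t] / total for t in targets}


# Toy example with placeholder peptide names and made-up scores.
toy_scores = {"PEP_A": 0.9, "PEP_B": 0.1, "PEP_C": 0.5, "PEP_D": 0.7}
print(rank_percentiles(toy_scores, ["PEP_A", "PEP_C"]))
```

Under this definition, a neoantigen reported as "within the first 1%" of 3638 candidates would occupy one of roughly the first 36 positions.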

The MS data from the tumor tissue of patient Mel-15 were sequenced de novo, and 5 neoantigens were determined among 3638 candidate peptides; the results are shown in FIG. 3. Of the 5 neoantigens, the first two, KLILWRGLK and RLFLGLAIK, were ranked within the first 1% by DeepImmun, while the next two, GRIAFFLKY and RTYSLSSALR, were ranked within the first 1.5% and 15%, respectively. The last neoantigen, SQIILRQH, was ranked very low by all four prediction tools, although it has been identified and tested by Willelm et al. Overall, DeepImmun outperformed the other tools in this ranking evaluation.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A model for predicting immunogenicity, characterized in that the construction of the model comprises the following steps:

S4: dividing the data of each patient into a training set, a validation set and a test set, training 100 models for each patient as an ensemble, ranking the models by their performance on the validation set, and taking the average of the top 10 models as the final prediction model of the patient;

2. The model for predicting immunogenicity according to claim 1, wherein: in step S1, MS data were searched in the Swiss-Prot human protein database using PEAKS Xpro.

3. The model for predicting immunogenicity according to claim 1, wherein: in step S2, when downloading the IEDB T cell epitope database, the following filtering conditions are set: linear epitopes, HLA class I, host is human or mouse.

4. The model for predicting immunogenicity according to claim 1, wherein: the positive epitopes in step S2 include positive, positive-high, positive-medium and positive-low epitopes.

5. The model for predicting immunogenicity according to claim 4, wherein: in step S3, the Keras Sequential model trained in TensorFlow comprises: an embedding layer with 8 units, a bidirectional LSTM layer with 8 units, a fully connected layer with L2 regularization, and a sigmoid activation layer.

6. The model for predicting immunogenicity according to claim 5, wherein: in step S3, the 100 training epochs use the Adam optimizer and a binary cross-entropy loss function.

7. The model for predicting immunogenicity according to claim 1, wherein: in step S4, the ratio of the training set, the validation set, and the test set is 8.

8. The model for predicting immunogenicity according to claim 1, wherein: in step S4, since the number of negative polypeptides is generally higher than the number of positive polypeptides, the negative peptides are down-sampled so that the data set contains positive and negative peptides in equal proportion.

9. Use of the model for predicting immunogenicity according to any one of claims 1 to 8 in predicting the immunogenicity of candidate neoantigens in cancer patients.