CN115497613A - Model for predicting immunogenicity and application - Google Patents

Model for predicting immunogenicity and application

Info

Publication number
CN115497613A
Authority
CN
China
Prior art keywords
model
patient
positive
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211060598.4A
Other languages
Chinese (zh)
Inventor
李明
陈玉辉
单宝珍
辛磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baizhen Biotechnology Wuhan Co ltd
Original Assignee
Baizhen Biotechnology Wuhan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baizhen Biotechnology Wuhan Co ltd
Priority to CN202211060598.4A
Publication of CN115497613A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Primary Health Care (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The invention provides a model for predicting immunogenicity and its application. HLA-I self-peptides obtained by MS-based immunopeptidomics serve as the negative selection of a patient's CD8+ T cells; all epitopes reported in positive T cell assays that match the patient's HLA-I alleles are collected from the IEDB as the positive selection of the patient's CD8+ T cells. A binary classification model is trained for the patient on these personalized negative and positive selection data sets to predict the immunogenicity of candidate neoantigens for CD8+ T cells in cancer patients. The invention overcomes the current difficulty of sequencing the complete TCR sequences of each patient, establishes a method for building a model from MS immunopeptidome data of the patient's self-peptides, and provides a novel model for predicting the immunogenicity of personalized tumor neoantigens.

Description

Model for predicting immunogenicity and application
Technical Field
The invention relates to the technical field of antigen prediction, and in particular to a model for predicting immunogenicity and its application.
Background
Tumor-specific human leukocyte antigen (HLA) peptides are expressed only on the surface of cancer cells and are ideal targets for the immune system to distinguish cancer cells from normal cells. However, because T cells are highly specific, only a small portion of these polypeptides can be recognized by T cells to trigger an immune response; such antigenic HLA peptides are referred to as tumor neoantigens. Although neoantigens work well in cancer immunotherapy, the probability of finding them is very low, usually fewer than 6 among the thousands of somatic mutations in each patient. The commonly used neoantigen discovery method (e.g. CN 111415707A) performs high-throughput DNA sequencing of the tumor genome of each individual patient to find all tumor-specific mutations, predicts the antigenicity of the proteins and polypeptides expressed from those mutations, synthesizes the potential neoantigen peptides in vitro, and then validates them in vitro. These techniques are time-consuming and very expensive.
Immunogenicity prediction uses bioinformatics tools to predict the binding affinity of peptide-HLA (pHLA) class I and class II complexes, or the entire pHLA presentation pathway. Since cancer and immunotherapy in each patient are associated with a large number of individual genetic mutations, and HLA complexes, T cell populations, and the tumor mutations themselves vary significantly between patients, personalized approaches to cancer immunotherapy should be considered. One potential approach (e.g. CN 113160887A) is to sequence the patient's TCRs and predict TCR-pHLA recognition from their sequence or structure. However, despite continuous and intensive TCR sequencing efforts, it remains difficult to sequence the complete TCR sequences of each patient. Establishing a personalized model that predicts TCR-pHLA recognition without sequencing TCRs is therefore an urgent problem.
Disclosure of Invention
In view of the above, the present invention provides a model for predicting immunogenicity and an application thereof, and the immunogenicity of candidate neoantigens of patients is predicted by simulating central tolerance of individual CD8+ T cells.
The technical scheme of the invention is realized as follows: in one aspect, the present invention provides a model for predicting immunogenicity, the construction of the model comprising the steps of:
S1: extracting the HLA-I polypeptides of a patient, acquiring MS data, searching the acquired MS data against the Swiss-Prot human protein database, and checking the characteristic length distribution of the identified HLA-I polypeptides; peptide fragments with a length of less than 8 or more than 14 are removed, and the remaining peptide fragments form the negative selection data model of the patient;
S2: downloading the IEDB T cell epitope database, selecting the positive epitopes that match the HLA-I alleles of the patient, and removing peptide fragments with a length of less than 8 or more than 14 to obtain the positive selection data model of the patient;
S3: converting amino acid letters into integer indices using TensorFlow, training a Keras sequential model on the negative selection data model of step S1 and the positive selection data model of step S2 for 100 rounds, and finally keeping only the model with the best validation loss;
S4: dividing the data of each patient into a training set, a validation set and a test set, training 100 ensemble models for each patient, ranking the ensemble models by their performance on the validation set, and using the average of the top 10 models as the final prediction model of the patient;
S5: scoring peptide fragments with the final prediction model, where a higher score indicates that the input peptide is more likely to be recognized by CD8+ T cells.
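The data preparation in steps S1 to S3 (length filtering and conversion of amino acid letters to integer indices) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the residue ordering, the use of index 0 for padding, the de-duplication of peptides, and the function names `filter_by_length` and `encode_peptide` are all assumptions that do not come from the text.

```python
# Hypothetical ordering of the 20 standard residues (assumption, not from the text).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding (assumption); residues map to 1..20.
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
# Length limits stated in steps S1 and S2.
MIN_LEN, MAX_LEN = 8, 14


def filter_by_length(peptides, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep unique peptides whose length lies in [min_len, max_len], preserving order."""
    seen = set()
    kept = []
    for peptide in peptides:
        peptide = peptide.strip().upper()
        if min_len <= len(peptide) <= max_len and peptide not in seen:
            seen.add(peptide)
            kept.append(peptide)
    return kept


def encode_peptide(peptide, max_len=MAX_LEN):
    """Convert a peptide into a zero-padded list of integer indices of length max_len."""
    if len(peptide) > max_len:
        raise ValueError(f"peptide longer than {max_len} residues: {peptide}")
    # A KeyError here signals a non-standard residue letter.
    indices = [AA_TO_INDEX[aa] for aa in peptide]
    return indices + [0] * (max_len - len(indices))
```

For example, `filter_by_length(["KLILWRGLK", "SHORT", "klilwrglk"])` keeps only one copy of the 9-residue peptide, and `encode_peptide` pads it to 14 positions so that all inputs share one shape.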
Based on the above technical solution, preferably, in step S1, MS data is searched in the Swiss-Prot human protein database using PEAKS Xpro.
Based on the above technical solution, preferably, in step S2, when downloading the IEDB T cell epitope database, the following filtering conditions are set: linear epitopes, HLA class I, host is human or mouse.
On the basis of the above technical scheme, preferably, the positive epitopes in step S2 include positive, positive-high, positive-medium and positive-low epitopes.
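The selection of allele-matched positive epitopes can be sketched as a simple filter over exported records. This is only an illustration under stated assumptions: the record keys (`peptide`, `allele`, `qualitative_measure`) are hypothetical and do not reflect the actual column names of an IEDB export, and the label strings are taken from the wording of this text and may differ from the strings used in a real export.

```python
# Labels as worded in the text, compared case-insensitively (exact export strings may differ).
POSITIVE_LABELS = frozenset({"positive", "positive-high", "positive-medium", "positive-low"})


def select_positive_epitopes(records, patient_alleles, labels=POSITIVE_LABELS,
                             min_len=8, max_len=14):
    """Return sorted unique peptides that match a patient allele and carry a positive label.

    `records` is an iterable of dicts with the hypothetical keys
    'peptide', 'allele' and 'qualitative_measure'.
    """
    alleles = set(patient_alleles)
    selected = set()
    for record in records:
        peptide = record["peptide"].strip().upper()
        if (record["allele"] in alleles
                and record["qualitative_measure"].strip().lower() in labels
                and min_len <= len(peptide) <= max_len):
            selected.add(peptide)
    return sorted(selected)
```

Passing the label set as a parameter keeps the filter usable if the export uses a different spelling for the intermediate category.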
Based on the above technical solution, preferably, the Keras sequential model trained in TensorFlow in step S3 comprises: an embedding layer with 8 units, a bidirectional LSTM layer with 8 units, a fully connected layer with L2 regularization, and a sigmoid activation layer.
On the basis of the above technical solution, preferably, in step S3, an Adam optimizer and a binary cross entropy function are used for 100 rounds of training.
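A minimal Keras sketch of a network with the layers described above might look like the following. Several details are assumptions because the text does not state them: the vocabulary size of 21 (20 residues plus a padding index), the single output unit, the L2 weight of 1e-3, the checkpoint file name, and the helper names `build_model`, `train` and `predict_scores`.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 21  # 20 amino acids plus padding index 0 (assumption)


def build_model(l2_weight=1e-3):
    """Embedding(8) -> BiLSTM(8) -> Dense with L2 regularization -> sigmoid."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=8),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8)),
        # One output unit for binary classification (assumption).
        tf.keras.layers.Dense(1, kernel_regularizer=tf.keras.regularizers.L2(l2_weight)),
        tf.keras.layers.Activation("sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model


def train(model, x_train, y_train, x_val, y_val, checkpoint_path="best_model.keras"):
    """Train for 100 epochs and keep only the checkpoint with the lowest validation loss."""
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        checkpoint_path, monitor="val_loss", save_best_only=True)
    return model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     epochs=100, callbacks=[checkpoint], verbose=0)


def predict_scores(model, encoded_peptides):
    """Return one score in [0, 1] per integer-encoded peptide."""
    return model.predict(np.asarray(encoded_peptides), verbose=0)[:, 0]
```

Saving only the best checkpoint is one common way to realize "keep the model with the best validation loss"; the text does not say which mechanism was actually used.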
On the basis of the above technical solution, preferably, in step S4, the ratio of the training set, the validation set, and the test set is 8.
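The split into training, validation and test sets can be sketched as below. The ratio in this text appears incomplete, so the default of 8:1:1 used here is purely an illustrative assumption, as are the shuffling, the fixed seed and the function name `split_dataset`.

```python
import random


def split_dataset(examples, ratios=(8, 1, 1), seed=0):
    """Shuffle examples and split them into train, validation and test lists.

    The default ratios are an illustrative assumption, not a value confirmed by the text.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Using a different seed for each of the 100 models would give each one a different split, which is one plausible way to obtain model diversity; the text does not specify this.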
Based on the above technical solution, preferably, in step S4, since the number of negative polypeptides is generally higher than the number of positive polypeptides, the negative peptides are down-sampled to ensure that the ratio of positive peptides to negative peptides in the data set is the same.
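Down-sampling of the negative peptides can be sketched as follows. The sketch reads "the same ratio" as equal numbers of positive and negative peptides, which is an interpretation; the random sampling without replacement, the seed and the function name `downsample_negatives` are likewise assumptions.

```python
import random


def downsample_negatives(positives, negatives, seed=0):
    """Randomly sample negatives so that their count equals the number of positives."""
    negatives = list(negatives)
    if len(negatives) <= len(positives):
        return negatives  # nothing to down-sample
    return random.Random(seed).sample(negatives, len(positives))
```

Drawing a fresh sample per model (by varying `seed`) lets an ensemble see different subsets of the much larger negative set.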
In another aspect, the invention provides the use of the model for predicting immunogenicity in predicting the immunogenicity of candidate neoantigens in cancer patients.
Compared with the prior art, the model for predicting immunogenicity by simulating the central tolerance of individual CD8+ T cells, and its application, have the following beneficial effects: the invention uses MS immunopeptidome data of the patient's HLA-I self-peptides as negative training data and allele-matched positive T cell epitopes from the IEDB as positive training data. Multiple experiments show that the prediction model reaches an accuracy of 79% in predicting neoantigens in individual patients, outperforming existing immunogenicity prediction tools. More importantly, candidate peptides can be ranked by immunogenicity, with the immunogenic candidate peptides ranked within the top 2%, which greatly reduces the experimental and time cost of further in vitro validation.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a summary of the HLA alleles, training data and evaluation data for 3 cancer patients and 1 mouse cancer cell line in a specific embodiment;
FIG. 2 shows the area under the ROC curve of each prediction tool on experimentally validated neoantigens in a specific embodiment;
FIG. 3 shows the predicted ranking of immunogenic neoantigens among the candidate polypeptides identified by de novo sequencing of MS data from patient Mel-15 in a specific embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The technical scheme adopted by the invention predicts the immunogenicity of candidate neoantigens for the CD8+ T cells of cancer patients by establishing a CD8+ positive selection model and a negative selection model, i.e. a model of the central tolerance of T cells. The construction of the model comprises the following steps:
S1, extracting the HLA-I polypeptides of a patient, and then acquiring MS data.
S2, searching the MS data against the standard Swiss-Prot human protein database using PEAKS Xpro, with no enzyme specificity and a false discovery rate (FDR) of 1%.
S3, checking the HLA-I characteristic length distribution of the identified polypeptides; polypeptides with a length of <8 or >14 amino acids are removed. The remaining polypeptides are regarded as the patient's HLA-I self-polypeptides, which are not recognized by the patient's T cells (i.e. they are non-immunogenic), and the peptide fragments screened in this way form the negative selection data model of the patient.
S4, downloading the IEDB T cell epitope database with the following filtering conditions: linear epitopes, HLA class I, host human or mouse.
S5, selecting the positive epitopes that match the HLA-I alleles of the patient, where the positive epitopes include: positive, positive-high, positive-medium and positive-low. Epitopes with a length of <8 or >14 amino acids are removed. The remaining polypeptides are assumed to be recognized by the patient's T cells (i.e. they are immunogenic), and the peptide fragments screened in this way form the positive selection data model of the patient.
S6, converting amino acid letters into integer indices in TensorFlow, representing peptide sequences by 0-15, and training a binary classification model on the negative selection data model of step S3 and the positive selection data model of step S5. The Keras sequential model trained in TensorFlow comprises: an embedding layer with 8 units, a bidirectional LSTM layer with 8 units, a fully connected layer with L2 regularization, and a sigmoid activation layer. Training runs for 100 rounds with the Adam optimizer and binary cross-entropy loss, and finally only the model with the best validation loss is kept.
S7, dividing the data of each patient into three groups: a training set, a validation set and a test set, with a training set : validation set : test set ratio of 8. Since the number of negative polypeptides is usually several times higher than the number of positive polypeptides, the negative peptides are down-sampled so that positive and negative peptides are present in equal proportion in the data set, and 100 ensemble models are trained for each patient. The ensemble models are ranked by their performance on the validation set, and the average of the top 10 models is used as the final prediction model for that patient.
S8, the immunogenicity prediction model is named DeepImmun. The final prediction model outputs a score from 0 to 1 for each peptide fragment; a higher score indicates that the input peptide is more likely to be recognized by the CD8+ T cells of the specific cancer patient.
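The selection and averaging of the top 10 of 100 trained models can be sketched as follows. Here each trained model is represented as a pair of its validation loss and a scoring function; using validation loss as the ranking criterion is an assumption, since the text only says the models are ranked by their performance on the validation set. The name `ensemble_score` is hypothetical.

```python
def ensemble_score(models, peptide, top_k=10):
    """Average the scores of the top_k models with the lowest validation loss.

    `models` is a list of (validation_loss, predict_fn) pairs, where predict_fn
    maps a peptide to a score in [0, 1].
    """
    if not models:
        raise ValueError("at least one model is required")
    best = sorted(models, key=lambda pair: pair[0])[:top_k]
    return sum(predict(peptide) for _, predict in best) / len(best)
```

Because each member outputs a value between 0 and 1, the averaged score stays in the same range and can be used directly to rank candidate peptides.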
In a specific embodiment, HLA MS data were modeled for 3 cancer patients and 1 mouse cancer cell line (Mel-15, Mel-0D5P, Mel-51, and Mouse-EL4), respectively. The number of HLA-I self-polypeptides ranged from 746 to 35548 per patient and the number of immunogenic epitopes ranged from 304 to 2417 per patient, as shown in FIG. 1.
For performance evaluation, DeepImmun was compared with three leading tools: PRIME, NetMHCpan, and the IEDB immunogenicity predictor. NetMHCpan is one of the earliest and most common tools for HLA-I prediction. The IEDB immunogenicity predictor is among the earliest tools for HLA-I immunogenicity prediction; it shows associations between immunogenicity and amino acid positions 4-6, as well as other properties of amino acid residues such as hydrophobicity, polarity or large aromatic side chains. PRIME is a recent immunogenicity prediction tool that models both HLA-I binding and TCR recognition. Notably, NetMHCpan is designed to predict HLA-I binding rather than immunogenicity, but it is widely used in many neoantigen prediction workflows. The present invention evaluates these four prediction tools on two criteria.
The first criterion is the area under the receiver operating characteristic curve (ROC-AUC) of their predictions on the candidate neoantigens. DeepImmun performed better than the other tools in patients Mel-15 and Mel-0D5P. In patient Mel-51, DeepImmun, PRIME and NetMHCpan performed comparably. For Mouse-EL4, the IEDB predictor reached the highest AUC of 90%, followed by DeepImmun and NetMHCpan (PRIME does not support mouse data), see FIG. 2. Overall, the mean AUC of DeepImmun across the data sets was 79%. The relative performance of the four prediction tools also reflects the characteristics of their underlying models: DeepImmun is a personalized model; PRIME is an allele-specific model; the IEDB predictor is a general model trained on a limited data set; NetMHCpan is a predictive model of HLA binding.
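For reference, ROC-AUC can be computed directly from scores and binary labels as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The sketch below uses this pairwise definition in plain Python; it is a generic illustration of the metric, not the evaluation code used for the reported results, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both positive and negative examples are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for small candidate lists; a rank-based formulation gives the same value more efficiently on large sets.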
The second evaluation criterion is the ability to rank potential neoantigens among the mutant polypeptides identified from the patient. These mutant polypeptides, including neoantigens, cannot be found in the standard Swiss-Prot human protein database. Therefore, a de novo sequencing workflow is first applied to the MS data of a patient to identify new peptide fragments that are not present in the database. These candidate peptides are then ranked using the four prediction tools (DeepImmun, PRIME, NetMHCpan and IEDB). The higher the neoantigens rank in the candidate peptide list, the more accurate the model. The MS-based method here only considers candidate peptides identified from MS data, while genomic methods generally consider candidate peptides translated from genomic sequences.
De novo sequencing of the MS data from the tumor tissue of patient Mel-15 yielded 3638 candidate peptides containing 5 neoantigens; the results are shown in FIG. 3. The first two of the 5 neoantigens, KLILWRGLK and RLFLGLAIK, were ranked within the top 1% by DeepImmun, while the next two, GRIAFFLKY and RTYSLSSALR, were ranked within the top 1.5% and 15%, respectively. The last neoantigen, SQIILRQH, was ranked very low by all four prediction tools, but it has been identified and tested by Willelm et al. Overall, DeepImmun outperformed the other tools in this ranking evaluation.
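The "top x%" figures in this kind of ranking evaluation can be derived from a candidate's position in the score-sorted list. The sketch below shows one way to compute that percentile; the function name `rank_percentile` and the dictionary input are hypothetical. As a point of scale, in a list of 3638 candidates the top 1% corresponds to roughly the first 36 positions.

```python
def rank_percentile(candidate_scores, peptide):
    """Return the rank of `peptide` as a percentage of the candidate list (smaller is better).

    `candidate_scores` maps each candidate peptide to its predicted score.
    """
    ranked = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
    position = ranked.index(peptide) + 1  # 1-based rank, highest score first
    return 100.0 * position / len(ranked)
```

A validated neoantigen with a small percentile means fewer candidates would need to be synthesized and tested before reaching it.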
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A model for predicting immunogenicity, characterized in that the construction of the model comprises the following steps:
S1: extracting the HLA-I polypeptides of a patient, acquiring MS data, searching the acquired MS data against the Swiss-Prot human protein database, and checking the characteristic length distribution of the identified HLA-I polypeptides; peptide fragments with a length of less than 8 or more than 14 are removed, and the remaining peptide fragments form the negative selection data model of the patient;
S2: downloading the IEDB T cell epitope database, selecting the positive epitopes that match the HLA-I alleles of the patient, and removing peptide fragments with a length of less than 8 or more than 14 to obtain the positive selection data model of the patient;
S3: converting amino acid letters into integer indices using TensorFlow, training a Keras sequential model on the negative selection data model of step S1 and the positive selection data model of step S2 for 100 rounds, and finally keeping only the model with the best validation loss;
S4: dividing the data of each patient into a training set, a validation set and a test set, training 100 ensemble models for each patient, ranking the ensemble models by their performance on the validation set, and using the average of the top 10 models as the final prediction model of the patient;
S5: scoring peptide fragments with the final prediction model, where a higher score indicates that the input peptide is more likely to be recognized by CD8+ T cells.
2. The model for predicting immunogenicity according to claim 1, wherein: in step S1, MS data were searched in the Swiss-Prot human protein database using PEAKS Xpro.
3. The model for predicting immunogenicity according to claim 1, wherein: in step S2, when downloading the IEDB T cell epitope database, the following filtering conditions are set: linear epitopes, HLA class I, host is human or mouse.
4. The model for predicting immunogenicity according to claim 1, wherein: the positive epitopes in step S2 include positive, positive-high, positive-medium and positive-low epitopes.
5. The model for predicting immunogenicity according to claim 4, wherein: in step S3, the Keras sequential model trained in TensorFlow comprises: an embedding layer with 8 units, a bidirectional LSTM layer with 8 units, a fully connected layer with L2 regularization, and a sigmoid activation layer.
6. The model for predicting immunogenicity according to claim 5, wherein: in step S3, 100 rounds of training are performed using an Adam optimizer and a binary cross entropy function.
7. The model for predicting immunogenicity according to claim 1, wherein: in step S4, the ratio of the training set, the validation set, and the test set is 8.
8. The model for predicting immunogenicity according to claim 1, wherein: in step S4, since the number of negative polypeptides is generally higher than the number of positive polypeptides, negative peptides are down-sampled to ensure that the positive and negative peptides are in the same ratio in the data set.
9. Use of a model for predicting immunogenicity according to any one of claims 1 to 8 in predicting the immunogenicity of candidate neoantigens in cancer patients.
CN202211060598.4A 2022-08-31 2022-08-31 Model for predicting immunogenicity and application Pending CN115497613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211060598.4A CN115497613A (en) 2022-08-31 2022-08-31 Model for predicting immunogenicity and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211060598.4A CN115497613A (en) 2022-08-31 2022-08-31 Model for predicting immunogenicity and application

Publications (1)

Publication Number Publication Date
CN115497613A true CN115497613A (en) 2022-12-20

Family

ID=84468838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211060598.4A Pending CN115497613A (en) 2022-08-31 2022-08-31 Model for predicting immunogenicity and application

Country Status (1)

Country Link
CN (1) CN115497613A (en)

Similar Documents

Publication Publication Date Title
CN113160887B (en) Screening method of tumor neoantigen fused with single cell TCR sequencing data
US20200243164A1 (en) Systems and methods for patient-specific identification of neoantigens by de novo peptide sequencing for personalized immunotherapy
US20040072246A1 (en) System and method for identifying t cell and other epitopes and the like
EP4229640A1 (en) Method, system and computer program product for determining peptide immunogenicity
TWI672503B (en) Ranking system for immunogenic cancer-specific epitopes
CN112771214A (en) Methods for selecting neoepitopes
CN114446389B (en) Tumor neoantigen feature analysis and immunogenicity prediction tool and application thereof
KR20210092723A (en) Cancer mutation selection to create personalized cancer vaccines
CN110706742A (en) Pan-cancer tumor neoantigen high-throughput prediction method and application thereof
CN114929899A (en) Method and system for screening new antigen and application thereof
Tang et al. TruNeo: an integrated pipeline improves personalized true tumor neoantigen identification
CN115747327A (en) Novel antigen prediction methods involving frameshift mutations
Falta et al. Beryllium-specific CD4+ T cells induced by chemokine neoantigens perpetuate inflammation
CN112210596B (en) Tumor neoantigen prediction method based on gene fusion event and application thereof
CN115497613A (en) Model for predicting immunogenicity and application
Jurtz et al. Computational methods for identification of T cell neoepitopes in tumors
WO2023089203A1 (en) Methods for predicting immunogenicity of mutations or neoantigenic peptides in tumors
Kato et al. Hidden Markov model-based approach as the first screening of binding peptides that interact with MHC class II molecules
US20220238188A1 (en) Method for determining responsiveness to an epitope
CN110349627B (en) Design method of polypeptide vaccine sequence and its automatic design product
Liu et al. A Deep Learning Approach for NeoAG-Specific Prediction Considering Both HLA-Peptide Binding and Immunogenicity: Finding Neoantigens to Making T-Cell Products More Personal
Sun et al. B-cell epitope prediction method based on deep ensemble architecture and sequences
Margalit et al. Insights from MHC‐bound peptides
Lian et al. Prediction of MHC Class II Binding Peptides Using a Multi-Objective evolutionary Algorithm
Tran et al. Predicting immunogenicity by modeling the central tolerance of CD8+ T cells in individual patients

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination