CN111415707B - Prediction method of clinical individuation tumor neoantigen - Google Patents

Prediction method of clinical individuation tumor neoantigen Download PDF

Info

Publication number
CN111415707B
CN111415707B CN202010162494.9A CN202010162494A CN111415707B CN 111415707 B CN111415707 B CN 111415707B CN 202010162494 A CN202010162494 A CN 202010162494A CN 111415707 B CN111415707 B CN 111415707B
Authority
CN
China
Prior art keywords
tumor
software
tumor cells
affinity
somatic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010162494.9A
Other languages
Chinese (zh)
Other versions
CN111415707A (en
Inventor
许恒
舒洋
杨莉
丁振宇
魏于全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202010162494.9A priority Critical patent/CN111415707B/en
Publication of CN111415707A publication Critical patent/CN111415707A/en
Application granted granted Critical
Publication of CN111415707B publication Critical patent/CN111415707B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis

Abstract

The invention relates to a prediction technology, solves the defects that the traditional tumor neoantigen analysis cannot adopt a one-step method to obtain a final result and the accuracy of prediction is improved according to the feedback of clinical results, and provides a prediction method of clinical individuation tumor neoantigen, wherein the technical scheme can be summarized as follows: comparing the sequencing data of the tumor cells of the patient with corresponding reference genome to obtain DNA and RNA comparison results, preprocessing, analyzing somatic variation, clone type, tumor purity, gene expression quantity in the tumor cells and expression abundance of somatic variation alleles in the tumor cells to obtain analysis results, analyzing affinity of potential neoantigens and calculating polypeptide presentation efficacy, scoring and sequencing the corresponding neoantigens according to the analysis results, the affinity of the potential neoantigens and the polypeptide presentation efficacy, and presenting the neoantigens. The method has the advantages that the standardized process can improve accuracy, has more convenient operability and repeatability, and is suitable for predicting clinical individuation tumor new antigens.

Description

Prediction method of clinical individuation tumor neoantigen
Technical Field
The invention relates to a prediction technology, in particular to a prediction technology of an individual tumor neoantigen.
Background
Immunotherapy of tumors is considered to be a new generation of tumor treatment following surgery, radiotherapy, chemotherapy and small molecule targeted therapies, which differs from the previous traditional treatments in that: the immunotherapy changes the idea of directly killing tumor cells into activating immune cells, treats tumor by enhancing the anti-tumor immunity of a patient, and has the advantages of accurate killing, small side effect, durable curative effect, high individuation degree and the like compared with the traditional therapy. In addition, the immune system of the organism has the characteristic of immune memory, so that the immune therapy can help a patient to form memory type immunity, and has remarkable advantages in preventing tumor recurrence and metastasis.
Currently immunotherapy is largely divided into two categories: immune therapy based on immune checkpoint inhibitors and cellular immune therapy. Immune checkpoint inhibitors have been marketed in a variety of drugs (including PD-1 inhibitors from the company schnobiletin and merck, etc.), and have achieved significant success in the treatment of advanced tumors, particularly in many patients who are resistant to conventional therapies. The therapy has been used as a first line treatment in the treatment of partial tumors. However, the overall benefit of this therapy is about 10-30% of the total patient population, representing that there are still significant patients who cannot benefit.
Cell immunotherapy is also called as cell adoptive immunotherapy, and mainly carries out external modification or stimulation on self T cells of a patient, so that common T cells can specifically identify tumor cells, and attack and kill the tumor cells. As a representative of cellular immunotherapy, CAR-T has achieved great success in hematological tumors and in some solid tumors. However, for solid tumors, immune cell therapy still faces significant challenges, and new and more effective immunotherapy critical techniques need to be developed. The mechanism by which T cells specifically recognize tumor cells is: there are genetic mutations in tumor cells that are different from normal cells, and these mutations are translated into proteins that are different from normal proteins, wherein some polypeptides formed by ubiquitination modification and degradation of mutant proteins can be recognized by the immune system, called antigenicity of the polypeptides, and these antigenic polypeptides are presented to the cell surface by the Major Histocompatibility Complex (MHC) of the tumor cells themselves, and then recognized by the immune system as foreign, so that the tumor cells are attacked and killed by immune cells. Such antigenic muteins are also known as tumor neoantigens (neoantigens). Cell adoptive immunotherapy is largely classified into the following categories according to its course of development: autologous lymphokine activated killer cells (LAK), autologous tumor infiltrating lymphocytes (Tumor Infiltrating Lymphocytes, TILs), dendritic cell-binding cytokine-induced killer cells (DC-CIK), and genetically modified T cells (CAR-T). CAR-T therapy is designed to be specific to a certain antigen through modification, has better effect on treating B lymphocyte leukemia and lymphoma at present, but has poorer effect on solid tumors mainly because of fewer available tumor antigens and single target point. Other cell therapies, except CAR-T, do not have a specifically designed antigen, and immune cells are required to recognize the specific antigen of the tumor through a natural response, so these therapies do not benefit much (about 10%) mainly because the antigen of the tumor itself is not sufficient to activate immune cells. Recent clinical studies have found that co-culturing tumor cell lysates with immune cells can enhance therapeutic effects, while detecting specific T cells specifically recognizing tumor antigens in patients, demonstrating that the immune system attacks tumor cells by recognizing tumor antigens. Similarly, the main task of the surgical division of the national cancer institute, steven Rosenberg doctor, extracts tumor-infiltrating T lymphocytes (which are derived from tumors and are known immune cells capable of recognizing and attacking cancers) from resected tumor or biopsy samples, picks out the cells with the strongest anti-tumor activity, proliferates in vitro and conveys the cells back to the patient, and the treatment method achieves good curative effect, and even enables complete tumor regression in a few patients. However, these methods are passive, because many patients have insufficient tumor cells to activate the autoimmune system, or the number of lymphoinfiltrating T cells is too small to meet the final therapeutic requirements, one strategy to effectively improve the therapeutic efficacy of cell therapies should be to actively search for tumor neoantigens, stimulate immune cells with high intensity by in vitro culture, and then return cells that specifically recognize tumor neoantigens to the patient for attack of tumor cells with tumor neoantigens. In clinical practice, researchers typically withdraw blood cells from tumor patients to culture mature DC cells, and then stimulate the DC cells with "mutant antigens" to prepare DC vaccines. Studies at the university of Washington in the United states indicate that the use of such DC vaccines can trigger an effective immune response in the body of patients with advanced melanoma, resulting in T cells with high killing activity that accurately recognize cancer cells.
The international large tumor genome project (TCGA) has found somatic mutations for a large number of different tumors and mapped them to mutation patterns, and studies have been conducted to analyze and verify new antigens in mutation patterns. However, it is worth noting that, due to the difference between eastern and western ethnicity and the characteristic of great heterogeneity of tumor itself, the tumor mutation spectrum obtained based on the European and American patients may not be mutated or have a lower mutation frequency in the patients in China, but may not be reported in the western population at the sites belonging to high-frequency mutation in China; in addition, the critical determinants of immunogenicity in determining whether mutations vary greatly from race to race, suggesting that we need to conduct race-specific studies and analyses on the chinese population.
To actively find a tumor neoantigen, two conditions must be met: the first is the ability to perform high throughput DNA sequence detection on the tumor genome of each patient individual to find all tumor-specific mutations; on the other hand, the proteins and polypeptides expressed by the mutations are subjected to antigenicity prediction, potential new antigen peptide fragments are synthesized in vitro (the peptide fragments are also called tumor vaccines), and finally, high-concentration and high-strength tumor vaccines are used for activating immune cells. Currently, there is a good theoretical and practical basis for neoantigen immunotherapy against tumors, mainly represented by the following aspects:
1. With the development of second generation sequencing technology and bioinformatics, short-time large-scale sequencing and analysis are possible, and at present, the whole scanning of human genome only needs one week, and particularly, the whole exon sequencing can be performed only on a gene coding region, so that the difficulty of sequencing quantity and analysis is greatly reduced, and the process of identifying all mutations in tumors is simpler and simpler. On the other hand, the software for tumor somatic cell detection is already very mature, and can efficiently and rapidly identify somatic mutations specific to tumor tissues. In addition, successful development and application of tumor neoantigen prediction programs and software also provides a powerful tool for us to find neoantigens. The mature bioinformatics analysis method and software can complete the work of somatic mutation detection, HLA genotype prediction, RNA abundance calculation of mutation, tumor cloning state analysis, new antigen affinity prediction and the like of patients in one day, and precious time is saved for clinical treatment.
2. Scientists have long been on the basis of basic research in immunology, where studies on T cell antigen recognition have made significant progress. It has been found that polypeptides (also called epitopes) formed by degradation of certain variant proteins in tumors can be specifically recognized by MHC molecules and form MHC-antigen peptide complexes, which are then presented on the surface of antigen presenting cells and recognized by T cells as non-hexose components, such that T cells are activated to effector T cells for attack and clearance of tumor cells, the most critical step being the binding of MHC molecules to polypeptide molecules. MHC molecules in humans, also known as Human Leukocyte Antigens (HLA), are largely divided into MHC class I molecules, which are expressed by most cells and are largely involved in the presentation of endogenous antigens, and MHC class II molecules; the latter is mainly expressed by antigen presenting cells, mainly involved in the presentation of foreign antigens. Wherein three genes encoding MHC I (HLA-A, HLA-B and HLA-C) and three MHC II (HLA-DR, HLA-DQ, HLA-DP) exist in the genome of a human, the genes have very large variation, and tens of thousands of different genotypes have been found, and the different genotypes generally have different affinities for the same polypeptide, namely, the higher the affinity, the higher the recognition capability.
3. Up to now, scientists in the relevant field have constructed a plurality of antigen epitope databases and developed a plurality of antigen epitope prediction related software.
The IEDB database was developed by Peter et al to provide predictive and relevant annotated services for users, including epitope binding prediction, prediction of antigen epitopes based on peptides present in cells, prediction of the relative ability of polypeptide/MHC complexes to elicit immune responses, and the like. Currently, the main methods of antigen prediction are all used for predicting antigen peptides combined with MHC molecules, and the prediction methods can be classified into three types according to the characteristics used in the prediction: a prediction method based on sequence information, a prediction method based on structure information, and a prediction method combining sequence information and structure information. The structure of MHC class I and class II molecules binding antigen peptide is located in the antigen binding groove at the far membrane end of the molecule. Wherein, the groove of the MHC class I molecule is closed, the length of the antigen peptide combined with the groove is 8-10 amino acid residues, the groove of the MHC class II molecule is open, and the length of the antigen combined with the groove is 13-17 amino acid residues longer than that of the MHC class I molecule. Thus, the antigen peptide bound to the MHC class I molecule has a specific anchor group than the antigen peptide bound to the MHC class II molecule. Since there is no great difference in the current methods for predicting MHC class I and MHC class II binding peptides, the results of predicting MHC class I binding peptides will be more accurate. The prediction method based on the sequence information is because similar sequence offices have similar structures and the same functions. The earliest prediction method is to predict an epitope through statistical sequence information and construct a sequence motif through the occurrence frequency of amino acids of an anchoring group. It was later found that amino acids at other positions than the amino acid of the anchor group have an effect on the antigen peptide bound to the MHC molecule. Thus, a matrix model of L X20 was constructed taking into account the frequency of 20 amino acids occurring at all positions of the binding peptide. Research has found that the physicochemical properties of amino acids and the correlation coefficients of adjacent amino acids can provide effective features for prediction, and in addition, with the development of computational science, prediction methods such as machine learning (ANN, HMM) and linear programming (SMM) are introduced into antigen prediction.
4. The four independent researches up to date show that the new antigen predicted by the combination sequence information and the structure information prediction method can be used as a tumor vaccine for treating patients and has good effect.
These successful cases suggest that immunotherapy with neoantigens has a very high application prospect. However, the individualization of the method is very strong, and the steps of complicated and complex analysis procedures, especially the participation of genetic and bioinformatics experts, are required to be carried out on each individual from sequencing to final new antigen selection. In the reported clinical study, such an operation was acceptable due to the small number of patients involved. However, if such successful treatment regimens are once subjected to routine clinical treatment, most hospitals lack the expertise responsible for selection of the neoantigen, not only because selection of the neoantigen delays the treatment time, but also because standardization is difficult in different hospitals, which may cause great differences in the final treatment effect. Thus, the large scale popularization of neoantigen-based immunotherapy requires a standardized procedure such as software and analysis procedures using a one-step method. On the other hand, most of the clinical researches are based on European and American population, and due to the existence of the race difference, a race-specific optimization process is also required for the new antigen prediction process and analysis.
Disclosure of Invention
The invention aims to overcome the defect that the current tumor neoantigen analysis cannot obtain a final result by adopting a one-step method and improve the accuracy of prediction according to the feedback of clinical results, and provides a prediction method of clinical individuation tumor neoantigens.
The invention solves the technical problems, adopts the technical proposal that the prediction method of clinical individuation tumor neoantigen comprises the following steps:
step 1, comparing sequencing data of tumor cells of a patient with a preset corresponding reference genome to obtain a DNA comparison result and an RNA comparison result;
step 2, preprocessing DNA comparison results and RNA comparison results;
step 3, analyzing somatic variation, clone type, tumor purity, gene expression quantity in tumor cells and expression abundance of somatic variation alleles in tumor cells by using DNA comparison results and RNA comparison results to obtain analysis results;
step 4, predicting the genotype of the MHC class I molecules of the tumor cells of the patient, analyzing the affinity of the potential neoantigens and calculating the polypeptide presentation efficacy according to the DNA analysis result and the genotype of the MHC class I molecules of the tumor cells of the patient;
and 5, scoring and sequencing the corresponding neoantigens according to the analysis result, the affinity of the potential neoantigens and the polypeptide presentation efficacy, and presenting the neoantigens.
Specifically, in step 1, the sequencing data of tumor cells of a patient are compared with a preset corresponding reference genome, and a Burrow-Wheeler transformation algorithm is adopted to obtain a DNA comparison result; and (5) obtaining an RNA comparison result by adopting a STAR algorithm.
Further, before the step 1, the method further comprises the following steps:
and step 0, preprocessing the original sequencing data of the tumor cells of the patient to obtain the sequencing data of the tumor cells of the patient.
Specifically, in step 0, the pretreatment includes trimming sequencing linker sequences, trimming sequencing tag sequences, and removing sequences with poor sequencing quality, which can be performed using trimmatic software.
Still further, in step 2, the preprocessing includes deduplication, addition of group information, and alkali matrix amount recalculation, which may be performed using GATK software.
Specifically, the step 3 includes the following steps:
step 3A, analyzing somatic mutation sites in tumor cells by using DNA comparison results;
step 3B, analyzing and obtaining the expression abundance of the variant alleles of the somatic cells and the gene expression quantity in tumor cells according to the comparison result of the mutation sites of the somatic cells and the binding RNA;
step 3C, filtering false positive sites in somatic mutation sites through the expression abundance of somatic mutation alleles to obtain actual somatic mutation sites;
Step 3D, annotating and further filtering actual somatic mutation sites to obtain somatic mutation;
step 3E, calculating copy number variation files of the tumor;
step 3F, calculating the tumor purity and the clone structure according to the somatic cell variation and the copy number variation file of the tumor;
and step 3G, analyzing the gene expression quantity in the tumor cells according to the RNA comparison result.
Still further, in step 3D, the somatic mutation refers to: further filtering was performed on the annotated actual somatic mutation sites, leaving only non-synonymous somatic mutation sites in the exon region.
Specifically, step 3A uses mutet 2 software for analysis; step 3B adopts the HaplotyperCaller software analysis of GATK; step 3C, adopting FilterMutectCalls software to analyze; step 3D, annotating with ANNOVAR software; step 3E, adopting cnvkit software to calculate; in the step 3F, the calculated tumor purity is calculated by adopting ABSOULTE software, and the cloning structure is calculated by adopting pyClone software; the cloning structure refers to the cloning ratio of mutant genes, i.e., the cloning ratio of each individual cell mutation site.
Still further, in step 3G, analysis of gene expression level refers to analysis by using featureCounts software to obtain count values, and converting the count values into the number of transcript sequenced fragments per million sequenced fragments (TPM values).
Specifically, the conversion formula of the obtained counts into the number of transcript sequencing fragments in each million sequencing fragments is as follows:
Figure GDA0002524955080000051
wherein R is the count number of the genes, L is the length of the genes,
Figure GDA0002524955080000052
R k for the count number of gene k, L k For the length of gene k, n is the total number of genes.
Still further, step 4 includes the steps of:
step 4A, predicting the genotype of MHC class I molecules of tumor cells of a patient;
step 4B, acquiring non-synonymous mutant protein data according to the annotated actual somatic mutation sites;
step 4C, predicting the ability of MHC class I molecules to bind wild type polypeptides and variant polypeptides according to the non-synonymous mutant protein data to obtain affinity prediction;
step 4D, analyzing the polypeptide sequence to obtain the polypeptide dissociation efficiency, the transport efficiency and the MHC binding efficiency, and calculating the polypeptide presentation efficiency according to the obtained polypeptide dissociation efficiency, transport efficiency and MHC binding efficiency;
step 4E, obtaining the affinity of the potential new antigen according to the affinity prediction and the non-synonymous mutant protein data.
Specifically, step 4A uses pollever software for analysis; step 4C uses IEDB provided MHC class I epitope predictors software for analysis; and step 4D, analyzing and calculating by adopting netCTLpan software.
Still further, step 4B specifically includes: downloading fasta files of human genome protein sequences provided by UCSC websites, extracting sequences of 15 amino acids before and after mutated amino acids according to annotated actual somatic mutation sites, respectively obtaining flanking amino acid sequences of wild type proteins and mutant proteins, and combining all the sequences into fasta files of polypeptide sequences, namely non-synonymous mutant protein data.
Specifically, in step 5, the method for scoring each corresponding neoantigen according to the analysis result, the affinity of the potential neoantigen and the polypeptide presenting efficacy is as follows:
score=abundance*dissimilarity*clonality
wherein, the liquid crystal display device comprises a liquid crystal display device,
abundance=L(IC m )*Freq*tanh(TPM);
dissimilarity=(1–L(IC w /50000))/2;
clonality=NCTL*MC;
L(x)=1/(1+e x ) Tan () is calculated as hyperbolic tangent function, score is the fraction obtained by calculation of neoantigen, abundance is the abundance of mutant polypeptide, TPM is gene expression abundance, dissimilarity is Dissimilarity (Dissimilarity) is measured as the difference of affinity of mutant peptide fragment to corresponding normal peptide fragment, clonality (clonality) is measured as the efficiency of neoantigen from mutation to short peptide NCTL and the proportion MC of neoantigen distributed in all tumor cells, freq is the expression abundance ratio of somatic variant allele, which is the ratio of expression abundance of somatic variant allele to expression of gene in tumor cells, IC m For affinity prediction of mutant polypeptides, IC w For the binding affinity of the wild-type polypeptide obtained in affinity prediction, NCTL was the presentation potency and MC was the mutant gene cloning ratio.
Specifically, the method further comprises the following steps:
and 6, optimizing the analysis method of the affinity of the potential neoantigen in the step 4 by using a machine learning method based on experimental results.
The beneficial effects of the invention are as follows: in the scheme of the invention, the prediction method of the clinical individuation tumor neoantigen is adopted, and an automatic and standardized neoantigen prediction platform is established by using the existing mature high-throughput analysis software. In the method, a plurality of steps from original data comparison to potential neoantigen sequencing are organically integrated, all potential neoantigen information of a patient can be obtained through analysis by a simple computer command, and polypeptide sequence information which can be clinically used for neoantigen immunotherapy is provided; although the data processing and analysis are carried out based on the existing open source software method, the data processing and analysis method is relatively independent relative to the software package, the software package needs to be integrated after each operation, and complex dependency relationship exists between the software.
The update speed of the affinity prediction database which is released at present is slower, and no relevant data record exists for certain specific HLA genotypes, and the affinity prediction database is expanded according to the data which are continuously accumulated clinically, so that more accurate prediction results can be obtained, which is also an important breakthrough of the method of the invention.
The method of the invention obtains a standardized analysis flow by integrating a plurality of high-throughput sequencing data analysis software and antigen affinity analysis software. Through the running test of various conditions, the prediction process has better code robustness.
The specificity of the polypeptide selected by the invention is detected by ELISPOT after clinical application, the result has consistency with the prediction result of the new antigen analysis flow, the prediction accuracy reaches more than 50% (the detected positive rates of two patients are 10/13 and 8/12, wherein the tumor specific strong antigen positive rate is 6/13 and 5/12), which is obviously higher than other related reports, and the prediction specificity and accuracy can be further improved and improved by the algorithm based on clinical feedback.
Detailed Description
The technical scheme of the present invention is described in detail below with reference to examples.
The invention relates to a prediction method for clinical individuation tumor neoantigen sequencing, which comprises the following steps:
step 1, comparing sequencing data of tumor cells of a patient with a preset corresponding reference genome to obtain a DNA comparison result and an RNA comparison result;
step 2, preprocessing DNA comparison results and RNA comparison results;
step 3, analyzing somatic variation, clone type, tumor purity, gene expression quantity in tumor cells and expression abundance of somatic variation alleles in tumor cells by using DNA comparison results and RNA comparison results to obtain analysis results;
step 4, predicting the genotype of the MHC class I molecules of the tumor cells of the patient, analyzing the affinity of the potential neoantigens and calculating the polypeptide presentation efficacy according to the DNA analysis result and the genotype of the MHC class I molecules of the tumor cells of the patient;
and 5, scoring and sequencing the corresponding neoantigens according to the analysis result, the affinity of the potential neoantigens and the polypeptide presentation efficacy, and presenting the neoantigens.
In the step 1, the sequencing data of the tumor cells of the patient are compared with a preset corresponding reference genome, and a Burrow-Wheeler transformation algorithm is adopted to obtain a DNA comparison result; and (5) obtaining an RNA comparison result by adopting a STAR algorithm.
Before step 1, the method may further comprise the following steps:
and step 0, preprocessing the original sequencing data of the tumor cells of the patient to obtain the sequencing data of the tumor cells of the patient.
In step 0, the pretreatment includes trimming sequencing adaptor sequences, trimming sequencing tag sequences, and removing sequences with poor sequencing quality, which can be performed using trimmatic software.
In step 2, the pretreatment may include deduplication, addition of group information, and alkali matrix weight calculation, which may be performed using GATK software.
Step 3 may comprise the steps of:
step 3A, analyzing somatic mutation sites in tumor cells by using DNA comparison results;
step 3B, analyzing and obtaining the expression abundance of the variant alleles of the somatic cells and the gene expression quantity in tumor cells according to the comparison result of the mutation sites of the somatic cells and the binding RNA;
step 3C, filtering false positive sites in somatic mutation sites through the expression abundance of somatic mutation alleles to obtain actual somatic mutation sites;
step 3D, annotating and further filtering actual somatic mutation sites to obtain somatic mutation;
step 3E, calculating copy number variation files of the tumor;
step 3F, calculating the tumor purity and the clone structure according to the somatic cell variation and the copy number variation file of the tumor;
And step 3G, analyzing the gene expression quantity in the tumor cells according to the RNA comparison result.
In step 3D, the somatic variation may be: further filtering was performed on the annotated actual somatic mutation sites, leaving only non-synonymous somatic mutation sites in the exon region.
Step 3A is preferably analyzed using mutec 2 software; step 3B preferably employs a HaplotyperCaller software analysis of GATK; step 3C is preferably performed using FilterMutectCalls software; step 3D is preferably annotated with ANNOVAR software; step 3E is preferably calculated by adopting cnvkit software; in step 3F, the calculated tumor purity is preferably calculated using ABSOULTE software and the cloned structure is preferably calculated using pyClone software; the cloning construct is generally referred to as the cloning ratio of the mutated gene, i.e., the cloning ratio of the mutation site per cell.
In step 3G, the analysis of the gene expression level may be performed by using the featureContents software to obtain counts, and converting the counts into the number of transcript sequenced fragments per million sequenced fragments (TPM values).
The conversion formula of the obtained counts into the number of transcript sequencing fragments per million sequencing fragments can be:
Figure GDA0002524955080000081
Wherein R is the count number of the genes, L is the length of the genes,
Figure GDA0002524955080000082
R k for the count number of gene k, L k For the length of gene k, n is the total number of genes.
Step 4 may comprise the steps of:
step 4A, predicting the genotype of MHC class I molecules of tumor cells of a patient;
step 4B, acquiring non-synonymous mutant protein data according to the annotated actual somatic mutation sites;
step 4C, predicting the ability of MHC class I molecules to bind wild type polypeptides and variant polypeptides according to the non-synonymous mutant protein data to obtain affinity prediction;
step 4D, analyzing the polypeptide sequence to obtain the polypeptide dissociation efficiency, the transport efficiency and the MHC binding efficiency, and calculating the polypeptide presentation efficiency according to the obtained polypeptide dissociation efficiency, transport efficiency and MHC binding efficiency;
step 4E, obtaining the affinity of the potential new antigen according to the affinity prediction and the non-synonymous mutant protein data.
Step 4A, adopting a pollever software to analyze; step 4C uses IEDB provided MHC class I epitopepredictors software for analysis; and step 4D, analyzing and calculating by adopting netCTLpan software.
The step 4B may specifically be: downloading fasta files of human genome protein sequences provided by UCSC websites, extracting sequences of 15 amino acids before and after mutated amino acids according to annotated actual somatic mutation sites, respectively obtaining flanking amino acid sequences of wild type proteins and mutant proteins, and combining all the sequences into fasta files of polypeptide sequences, namely non-synonymous mutant protein data.
In step 5, the method for scoring each corresponding neoantigen according to the analysis result, the affinity of the potential neoantigen and the polypeptide presenting efficacy is preferably as follows:
score=abundance*dissimilarity*clonality
wherein, the liquid crystal display device comprises a liquid crystal display device,
abundance=L(IC m )*Freq*tanh(TPM);
dissimilarity=(1–L(IC w /50000))/2;
clonality=NCTL*MC;
L(x)=1/(1+e x ) Tan () is calculated as hyperbolic tangent function, score is the fraction obtained by calculation of neoantigen, abundance is the abundance of mutant polypeptide, TPM is gene expression abundance, dissimilarity is Dissimilarity (Dissimilarity) is measured as the difference of affinity of mutant peptide fragment and corresponding normal peptide fragment, clonality (clonality) is measured as the ratio MC of neoantigen to short peptide efficiency NCTL and neoantigen distribution in all tumor cells, freq is the ratio of somatic variant allele expression abundance which is the ratio of the somatic variant allele expression abundance to the gene expression amount in tumor cells, IC m For affinity prediction of mutant polypeptides, IC w For the binding affinity of the wild-type polypeptide obtained in affinity prediction, NCTL was the presentation potency and MC was the mutant gene cloning ratio.
The method can further comprise the following steps:
and 6, optimizing the analysis method of the affinity of the potential neoantigen in the step 4 by using a machine learning method based on experimental results.
Examples
In this example, the specific prediction method for clinical personalized tumor neoantigen sequencing is as follows:
and step 0, preprocessing the original sequencing data of the tumor cells of the patient to obtain the sequencing data of the tumor cells of the patient.
The method comprises the following steps: after obtaining the high throughput sequencing file (fastq file) of the original tumor cell DNA/RNA of the patient and the DNA material of the blood of the patient, trimming the sequencing linker sequence, trimming the sequencing tag sequence and removing the sequencing quality of the sequencing fastq file by using the Trimmomatic software.
Wherein the parameters are selected as follows: 1) Selecting a PE mode, namely a pair-end mode, for data analysis; 2) Setting the ILLUMINACIP parameter as TruSeq3-PE.fa:2:30:10, namely removing the linker by using linker sequence information in a TruSeq3-PE library construction scheme of ILLUMINA company, wherein the information exists in a TruSeq3-PE.fa file, a target sequence and a source sequence cannot be mismatched by more than 2 bases, setting a palindromic sequence alignment threshold value as 30 (if library inserts are shorter than sequencing reads, one base in forward and reverse sequencing fragments can be completely complementary, aligning two linker sequences with the sequencing fragments, and simultaneously aligning two paired sequencing fragments with each other, accurately removing the linker sequences at the tail ends of the sequencing fragments, wherein the parameter controls the alignment quality value of palindromic structures between the two paired sequencing fragments to be more than 30, and finally setting the lowest alignment of the excised linker sequences as 10; 3) Setting the leader parameter to 3, namely cutting out a plurality of continuous bases with the forefront sequencing quality value of the sequencing fragment lower than 3 (the sequencing quality at the two ends of the sequencing fragment is relatively poor); 4) Setting the TRAILING parameter to 3, namely cutting out a plurality of continuous bases with the sequencing quality value lower than 3 at the rearmost end of the sequencing fragment, and the reason is the same as the above; 5) The SLIDINGWINDOW parameter is set to 4:15, i.e. the software sliding window is set to the size of 4 bases, and if the mass average value of the 4 bases is lower than 15, the bases in the window are excised; 6) The MINLEN parameter is set to 36, i.e., the length of the sequenced fragment after trimming is less than 36 bases, and the sequenced fragment is discarded.
The trimmatic program uses the original fastq file as an input file, runs the processed fastq file, called a clean fastq file, and is used for the comparison operation of the sequencing fragments in the subsequent step 1.
Step 1, comparing the sequencing data of tumor cells of a patient with a preset corresponding reference genome to obtain a DNA comparison result and an RNA comparison result.
The method comprises the following steps: step 0 has resulted in a purified fastq file (i.e., a clean fastq file) and these sequenced fragments are then aligned to the corresponding reference genome. Taking into account the differences between the whole exon sequencing and the whole transcriptome database building modes, the genome alignment of the whole exon data is performed by adopting BWA software, and the genome alignment of the whole transcriptome data is performed by adopting STAR software:
important parameters for using BWA are: 1) The k parameter is set to 19, i.e. the shortest length of the seed sequence is 19; 2) The w parameter is set to be 100, and if the length of the comparison gap is greater than 100, the comparison gap is ignored; 3) The method comprises the steps of carrying out a first treatment on the surface of the 4) The r parameter is set to 1.5, if the seed length exceeds k×r, i.e. 19×1.5=28.5 bp, a new seed sequence is found inside the seed sequence. This parameter is a key parameter for the performance of the heuristic algorithm adjustment algorithm, and if the value is set too large, this can lead to a faster alignment speed, but can sacrifice part accuracy; 5) c, setting a parameter to be 500, and if the repetition number of the seed sequence exceeds 500, ignoring the seed sequence; 6) The chimSegmentMin is set to 0, only the correctly aligned sequencing fragments are output, and the chimeric fragments are not output;
Parameters using STAR are: 1) Setting the parameter genomdir to a directory path to store the reference genome file; 2) Setting a parameter readfilesIn as a path for inputting a clip file; 3) Setting a parameter twopassMode as a Basic mode, and setting a channel mapping mode, namely adopting a two-channel comparison mode; 4) Setting the parameter outReadUnmapp None to None, i.e., not outputting fragments that failed or were partially aligned; 5) Setting the parameter chimSegmentMin to 12, namely setting the minimum length of the embedded fragment to be 12bp; 6) Setting the parameter chimJuctionOverhangMin as 12, namely the minimum allowable protrusion length of the chimeric joint is 12bp; 7) Setting the parameter alignsjdbaverhangmin to 10, i.e. in the annotated binding region, the lowest run highlights the single stranded sequence of 10 bp; 8) The parameter alignnmatesgapmax is set to 200000, i.e. the maximum gap between the two fits is set to 200000bp; 9) The parameter alignIntronmax was set to 200000, i.e. the maximum intron size was set to 200000bp;10 Setting the parameter chimSegmentREadGapMax to 3, i.e.the maximum gap between the chimeric fragments is 3bp;11 Setting the parameter alignsjstitchmissmatchnmax to 5-1.5, the longest mismatch number that can be used to splice stitched fragments is limited to: (1) non-classical mode not exceeding 5bp, (2) GT/AG and CT/AC mode not limiting, (3) GC/AG and CT/GC mode not exceeding 5bp, and (4) AT/AC and GT/AT mode not exceeding 5bp;12 Setting the runthread parameter to the required number of parallel runs; 13 Setting the parameter outfiltermultimaplnmax to 2, i.e. allowing a single fragment to match up to two regions; 14 Setting the parameter outfiltermissmatchnoverlmax to 0.04, i.e. outputting the fragment alignment result only if the ratio of mismatch length to fragment length is not greater than 0.04; 15 Set the parameter outSAMtype to BAM SortedByCoordinate, i.e., set output as BAM files ordered by coordinates; 16 Setting the parameter outfileNameprefix as a prefix of an output file name, and automatically generating a plurality of corresponding result files according to the prefix by a program; 17 The parameter readfilesCommand is set to handle the shell command in the input file format so that its output result is a FASTA or FASTQ file.
Through this step, a sequencing sequence file, i.e., a BAM file, is obtained that has been aligned to the selected reference genome, which is the DNA alignment result and the RNA alignment result.
And 2, preprocessing DNA comparison results and RNA comparison results.
The method comprises the following steps: after the sequenced fragments are aligned to the reference genome using BWA or STAR, the alignment (including DNA and RNA alignment) is subjected to deduplication, addition of panel information, and alkaline matrix amount recalculation. These steps may operate using GATK.
The parameters for the deduplication of the sequencing alignment fragments using GATK were: 1) The parameter PG is set to MarkDapplicates to enable program record ID for PG record creation. This string may be suffixed to avoid collision with other program record IDs; 2) Setting the parameter PG_NAME to MarkDapilicates, i.e. to create the PN mark value of the PG log; 3) Setting MAX_FILE_HANDLES to 8000, namely setting the maximum number of FILE HANDLES which are kept open when the tail of the fragment overflows to the disk to 8000; 4) Setting the parameter SORTING_COLLECTION_SIZE_RATIO to 0.25, namely determining that the memory occupation amount used by some SORTING sets is 0.25 times of the total memory capacity; 5) The parameter optical_duration_pixel_distance is set to 100, i.e., the maximum offset between two repeated clusters is set to 100.
The parameters at the time of adding the sequencing alignment fragment group information by using GATK are as follows: 1) The parameter ID is set to 1, i.e., the read group ID default is set to 1; 2) The parameter version is set to false, i.e. the version number of the software is not displayed.
The parameters for mass recalculation of bases in the sequenced fragment using GATK were: 1) Setting the parameter R to hg19 reference genome fasta file; 2) The known variations were set to 1000 g_phase1.indexes, hg19.sites, vcf.gz, mills_and 1000 g_gold_standard.indexes, hg19.sites, vcf.gz, and dbsnp150, and the quality of the bases in the BAM file was re-evaluated using these known variations as references.
After this step, a quality-controlled BAM file is obtained, and the file is used in subsequent steps 3A, 3E, 3G, and 4A.
And 3, analyzing somatic variation, clone type, tumor purity, gene expression quantity in tumor cells and expression abundance of somatic variation alleles in tumor cells by using the DNA comparison result and the RNA comparison result to obtain an analysis result.
The method comprises the following specific steps:
and 3A, analyzing somatic mutation sites in tumor cells by using a DNA comparison result.
And (3) processing the BAM file in the step (2) to obtain a BAM file of a comparison result of the tumor sample and the normal control sample on the genome, and then comparing and analyzing the BAM file of the normal-tumor paired sample by using mutec 2 software to find out all potential mutation sites.
Parameters using mutec 2 were: 1) Setting the parameter base-quality-score-threshold to 18, i.e., the base mass fraction threshold to 18, the base below this threshold will fall to a minimum of 6; 2) Setting the parameter heterozygo to 0.001, i.e., setting the a priori likelihood value for any locus heterozygosity to 0.001; 3) Setting the index-heterozygo to 1.25E-4, i.e., setting the a priori likelihood value for any locus heterozygosity to 1.25E-4; 4) Setting the parameter mbq to 10, i.e. the minimum mass of acceptable bases allowed at the time of retrieving the mutation is 10; 5) Setting the parameter native-pair-hmm-threads as 4, namely setting the thread number used by the native pair HMM implementation as 4; 6) Setting the parameter minimum-mapping-quality to 4, i.e. discarding fragments with a comparison quality lower than 20; 7) The parameter min-read-length is set to 30, i.e. only fragments with a length of 30 or more are reserved.
Through this step, vcf files containing somatic mutation site information are obtained, which are used in subsequent steps 3B, 3C.
And step 3B, analyzing and obtaining the expression abundance of the variant alleles of the somatic cells and the gene expression quantity in tumor cells according to the comparison result of the mutation sites of the somatic cells and the binding RNA.
The method comprises the following steps: the vcf file storing the position information of the somatic cell mutation is converted into a bed file of position coordinates by using vcf2bed software, and the information of the positions in the RNA sequencing alignment file is analyzed by using a HaplotyperCaller of GATK.
Parameters using the biplotyperciller were: 1) -L provides a becf 2 becf generated becf file; 2) Reference uses the GRC37/hg19 reference genome; 3) -ERC selects GVCF format, outputting GVCF file; 4) Output-mode selects emit_all_sites and outputs ALL the loci in the target area, whether or not it is discriminated as a variance.
Next, the gvcf was genotyped using GenotypeGCFs command of GATK, and the wild-type base number and mutant base number of somatic mutation sites at RNA level were obtained.
Through the step, the expression abundance of the variant alleles of the somatic cells and the gene expression level in tumor cells are obtained and used in the step 3C and the step 5.
And step 3C, filtering false positive sites in the somatic mutation sites through the expression abundance of the somatic mutation alleles to obtain actual somatic mutation sites.
The method comprises the following steps: after step 3A is implemented, we can find out the somatic mutation sites of tumor cells different from normal cells, but many false positive sites caused by sequencing technology and the like are doped in the somatic mutation sites, and then in the step, the false positive sites are filtered by using FilterMutectctCalls software, and the mutation sites with RNA expression are further screened out by combining with the expression abundance of somatic mutation alleles.
Parameters using filtermutetctcalls software were: 1) Setting the parameter base-quality-score-threshold to 18, i.e. the base quality below the threshold 18 will drop to a minimum of 6; 2) Setting the parameter heterozygo to 0.001, i.e., setting the a priori likelihood value for any locus heterozygosity to 0.001; 3) Setting the parameter heterozygo-STDEV to 0.01, i.e. the heterozygosity standard deviation of SNP and indel to 0.01; 4) Setting the parameter index-heterozygo to 1.25E-4, i.e., setting the prior likelihood value for any locus heterozygosity to 1.25E-4; 5) Setting the parameter max-alt-all-count to 1, i.e., deleting mutation sites with multiple alleles; 6) The parameter max-germline-database was set to 0.1, i.e., the maximum posterior probability value that the allele was a germline variation was 0.1; 7) Setting the parameter max-strand-artifact-probability to 0.99, and filtering the variant site if the probability of the chain artifact exceeds 0.99; 8) Setting the parameter mbq to 10, i.e. filtering out bases below 10 in mass; 9) Setting the parameter min-medium-base-quality to 20, namely filtering mutation sites with the median value of the base quality of the allelic fragment less than 20; 10 Setting the parameter min-median-mapping-quality to 30, i.e., filtering out mutation sites with allele fragment alignment median less than 30; 11 Setting the parameter min-mean-read-position to 5, i.e., filtering out mutation sites within 5 bases near the end of the fragment; 12 Setting the parameter native-pair-hmm-threads to 4, namely the number of threads used by the native pair HMM implementation to be 4;13 Setting the parameter normal-P-value-threshold to 0.001, i.e., setting the P-value threshold of the normal artifact filter to 0.001;14 Setting the parameter output-mode to emit_variants_only, i.e. outputting ONLY mutation sites; 15 The parameter stand-call-conf is set to 10.0, i.e., the minimum phed-scaled confidence threshold for searching for variant sites is set to 10.0.
Through the step, the filtered somatic mutation vcf file is obtained and used for the subsequent step 3D.
And 3D, annotating and further filtering actual somatic mutation sites to obtain somatic mutation.
The method comprises the following steps: the function of the actual somatic mutation site was annotated with ANNOVAR software, whose parameters of use were: 1) Setting the parameter buildur as hg19, namely using a database constructed by the genes of hg19 as a reference; 2) The parameter maxgenetread is set to 6. Designating the maximum thread number allowed in the gene annotation as 6; 3) Annotation was made using refGene, avsnp150, clinvar, cosmic,1000g EAS, EXAC, gnomad, icgc, dbnsfp33 as reference. The mutation sites were further filtered according to the annotation result, leaving only non-synonymous mutation sites of the exon region.
Through this step, vcf files containing non-synonymous mutation site annotation information of the exon regions, i.e., somatic variations, were obtained for subsequent steps 3F and 4B.
And step 3E, calculating copy number variation files of the tumor.
The method comprises the following steps: in order to estimate the purity of a tumor, a copy number variation file (seg file) of the tumor needs to be prepared. In order to calculate the somatic copy number variation of tumors, the bam files beside the tumors and cancers were analyzed using cnvkit software, and the specific parameters of the analysis were: 1) Setting the parameter method as hybrid, i.e. the sequencing protocol as hybrid capture; 2) The parameter process is set to process each BAM in serial, i.e. each BAM is processed in parallel.
Through the step, a seg file containing tumor copy number variation is obtained and used for the subsequent step 3F.
And step 3F, calculating the tumor purity and the clone structure according to the somatic cell mutation and the copy number mutation file of the tumor.
The step may specifically include the following 3 parts:
I. obtaining MAF file
In order to estimate the purity of the tumor, not only a copy number variation file (seg file) of the tumor but also a MAF file containing non-synonymous mutation sites of an exon region is needed, after a vcf file containing non-synonymous mutation site annotation information of the exon region (i.e. the vcf file obtained in step 3D) is obtained, the vcf file is used for estimating the purity of the tumor, and the vcf file needs to be converted into the MAF file by using vcf2MAF software first, and specific use parameters of the vcf2MAF software are as follows: 1) Setting the parameter VEP-forks to 4, namely the number of running processes used when the VEP is operated is 4; 2) The parameter NCBI-build was set to GRCh37, i.e. MAF at the mutation site was assembled using NCBI's GRCh37 as reference.
From this section, MAF files containing non-synonymous mutation sites for the exon regions were obtained for subsequent sections II, III.
II. Calculation of tumor purity
In the above section I and step 3E, MAF files containing non-synonymous mutation sites in exon regions and copy number variation files (seg files) of tumors were obtained, respectively, and in this section, tumor purity was calculated by ABSOLUTE software using these two files as input files. The relevant parameters of ABSOLUTE software are: 1) Setting the parameter sigma p to 0, i.e. the temporary value exceeding the sample level variance for pattern search is set to 0; 2) The parameter max sigma h is set to 0.015. The maximum value beyond the sample level variance is set to 0.015; 3) Setting the parameter min ploidy to 0.95N, and designating the minimum ploidy value N to be considered by the algorithm to be 0.95N; 4) Setting the parameter max ploidy to 10N, i.e., designating the maximum ploidy value N to be considered to be 10N, and discarding models of potentially greater ploidy values; 5) The parameter primary treatment is set to the corresponding sample name. The input 'NA' is referenced to the use of the pan-cancer karyotype. If the provided input does not match the type in the list item, ABSOLUTE defaults to reference for a carcinoma; 6) Setting the parameter max as seg count to 1500, i.e., marking samples with allele sector numbers greater than 1500 as "failed"; 7) The parameter max sigma h is set to 0.005. Sometimes, due to noise in the data, ABSOLUTE modeling can be less than zero on the part of the genome that is attributed to tumor subcloning. This parameter specifies a maximum allowable component of the genome of 0.005, which can be modeled as less than zero without giving up a given solution; 8) The maximum genomic value that can be modeled as unclonable was set to 0.05 by setting the parameter max non clone to 0.05. This parameter can be increased appropriately for samples expected to have a higher proportion of heterogeneity.
Through this fraction, a purity value of the tumor tissue was obtained for the subsequent fraction III.
III, analysis of tumor clone Structure
The tumor purity was calculated and the cloning status of the tumor was analyzed based on MAF files containing non-synonymous mutation sites in the exon region and copy number variation files (seg files) of the tumor. Analysis of clone numbers was performed using pyClone software using MAF files containing non-synonymous mutation sites in the exon region, copy number variation files of tumors (seg files) and purity values of tumor tissue as input files. Specific parameters used by the pyClone software are: 1) Using a run analysis pipeline subcommand; 2) The in_files parameter uses files obtained according to maf arrangement; 3) The priority parameter uses total_copy_number; 4) tumor purity values were calculated from tumour_contents using ABSOLUTE.
Through this section, a cloning proportion file of all mutation sites, i.e. tumor cloning structure, was obtained for the subsequent step 5.
And step 3G, analyzing the gene expression quantity in the tumor cells according to the RNA comparison result.
The method comprises the following steps: and (3) analyzing the RNA-seq comparison result (contained in the BAM file with the quality controlled obtained in the step (2)) by using the featurecontrol software to obtain a gene expression magnitude FPKM (i.e. count value), and further converting the result into a TPM value.
Parameters using the featuresource software were: 1) The parameter minOverlap is set to 1. The minimum overlapping base number of the fragment allocation used fragment is 1; 2) The parameter T is set to 8, i.e. run in parallel using 8 threads.
The conversion formula is as follows:
Figure GDA0002524955080000151
wherein R is the count number of the genes, L is the length of the genes,
Figure GDA0002524955080000152
R k for the count number of gene k, L k For the length of gene k, n is the total number of genes.
The file containing all the gene expression amounts, i.e., the gene expression amounts, was obtained through this step and used in step 5.
And 4, predicting the genotype of the MHC class I molecules of the tumor cells of the patient, analyzing the affinity of the potential neoantigens according to the DNA analysis result and the genotype of the MHC class I molecules of the tumor cells of the patient, and calculating the polypeptide presentation efficacy.
The method comprises the following specific steps:
step 4A, predicting the genotype of MHC class I molecules of tumor cells of a patient.
The method comprises the following steps: presume HLA allele with polysover software, its relevant parameter is as follows, 1) the Bam parameter is set up as BAM file that step 2 gets; 2) Setting the parameter race to Asia, i.e. setting the patient's race to Asian (of course, european or African, etc. as the case requires); 3) Setting the parameter includeeFreq to 1, i.e. population level allele frequency is used as a priori condition; 4) Setting the parameter build as hg19, namely setting the sequencing data alignment genome as GRCh37/hg19; 5) Setting the parameter format as STDFQ, namely adopting Hiseq as sequencing data, and returning the sequencing data as a standard fastq file; 6) The insertecalculated is set to 1, i.e. an empirical insert size distribution model is used in the model.
Through this step, information on the typing of HLA in the tumor, i.e. the genotype of MHC class I molecules of the tumor cells of the patient, is obtained for the subsequent step 4C.
And step 4B, acquiring non-synonymous mutant protein data according to the annotated actual somatic mutation site.
The method comprises the following steps: and (3) screening the somatic variation obtained in the step (3D), and retaining the nonsensical mutation in the somatic variation. Simultaneously, downloading a fasta file of a human genome protein sequence provided by a UCSC website, extracting sequences of 15 amino acids before and after the mutated amino acid according to the mutated position in the annotated file of somatic mutation, and respectively obtaining wild type and mutated flanking amino acid sequences. All sequences are combined into a fasta file of polypeptide sequences.
Through this step, protein fasta files containing somatic mutation site polypeptide information, i.e., non-synonymous mutein data, were obtained for subsequent steps 4C and 4E.
And 4C, predicting the ability of MHC class I molecules to bind to wild type polypeptides and variant polypeptides according to the non-synonymous mutant protein data to obtain affinity prediction.
The method comprises the following steps: the ability of MHC class I molecules to bind polypeptides was predicted using MHC _i software provided on the IEDB functional network, with the following parameters associated: 1) The method parameter is set to be iedb_recommended, that is, a setting recommended by IEDB (ann, netmhcpan, pickpocket, etc. methods are included therein); 2) mhc parameters are set as HLA typing results obtained by polysolver; 3) The peptide length is set to the potential neoantigen polypeptide length, we generally set to 8,9, 10, 11, 12,13,14; 4) The input_file parameter is the fastq file obtained by the arrangement in the step 13.
Through this step, the affinities of all potential polypeptides to each HLA molecule, i.e. affinity predictions, were obtained for subsequent step 4E.
Step 4D, analyzing the polypeptide sequence to obtain the polypeptide dissociation efficiency, the transport efficiency and the MHC binding efficiency, and calculating the polypeptide presentation efficiency according to the obtained polypeptide dissociation efficiency, transport efficiency and MHC binding efficiency.
The method comprises the following steps: polypeptide sequences were analyzed using netCTLpan software to obtain polypeptide dissociation efficiency (Proteasome cleavage prediction), transport potency (transport efficiency), MHC binding potency, and presentation potency was calculated from the results obtained. Parameters using netCTLpan software are 1) -a HLA allele, and a plurality of HLA alleles binding to polypeptide sequences can be set, wherein the HLA alleles contain ann, netmhcpan, pickpocket and other method genes. 2) S is output ordered by sign column, set to 0, i.e. output by integrated result (presentation effectiveness). 3) L peptide fragment length, the setting requires calculation of a length that is too short.
Through this step, a potency value for each polypeptide presented by the HLA molecule, i.e., the polypeptide presentation potency, is obtained for the subsequent step 5.
Step 4E, obtaining the affinity of the potential new antigen according to the affinity prediction and the non-synonymous mutant protein data.
The method comprises the following steps: and (3) processing the result file by using a script according to the affinity prediction obtained in the step (4C) and the non-synonymous mutein data obtained in the step (4B). The main criteria are as follows: 1) Variant amino acids are located in the predicted polypeptide; 2) Removing if the wild-type sequence and the non-synonymous mutant sequence are not present at the same time in the predicted outcome; 3) Outputting information about wild-type and non-synonymous mutations, including HLA type, at the original starting and ending position, length, predicted sequence of the polypeptide, predicted binding polypeptide affinity value (IC 50) and sorting the wild-type and non-synonymous mutation information into a row.
Through this step, the affinity of the wild-type/mutant effective polypeptide (comprising the amino acid altered by somatic mutation) for HLA binding, i.e. the affinity of the potential neoantigen, is obtained for subsequent step 5.
And 5, scoring and sequencing the corresponding neoantigens according to the analysis result, the affinity of the potential neoantigens and the polypeptide presentation efficacy, and presenting the neoantigens.
The method comprises the following steps: the predicted result of binding polypeptide affinity, mutation frequency, tumor purity, polypeptide expression amount and the like are integrated, the predicted result is comprehensively scored according to related rules, and main indexes are affinity values (IC 50) of wild type and mutant type polypeptide sequences, polypeptide expression levels, polypeptide lengths and ranking of potential new antigens. The scoring rules are as follows:
score=abundance*dissimilarity*clonality
Wherein, the liquid crystal display device comprises a liquid crystal display device,
abundance=L(IC m )*Freq*tanh(TPM);
dissimilarity=(1–L(IC w /50000))/2;
clonality=NCTL*MC;
L(x)=1/(1+e x ) Tan () is calculated as hyperbolic tangent function, score is the fraction obtained by calculation of neoantigen, abundance is the abundance of mutant polypeptide, TPM is gene expression abundance, dissimilarity is Dissimilarity (Dissimilarity) is measured as the difference of affinity of mutant peptide fragment to corresponding normal peptide fragment, clonality is measured as the efficiency of neoantigen from mutation to short peptide NCTL and the proportion MC of neoantigen distributed in all tumor cells, freq is the expression abundance ratio of somatic variant allele, which is referred to as somatic cellRatio of variant allele expression abundance to gene expression in tumor cells, IC m For affinity prediction of mutant polypeptides, IC w For the binding affinity of the wild-type polypeptide obtained in affinity prediction, NCTL was the presentation potency and MC was the mutant gene cloning ratio.
In this example, the method further comprises the following steps:
and 6, optimizing the analysis method of the affinity of the potential neoantigen in the step 5 by using a machine learning method based on experimental results.
The method comprises the following three specific steps:
step 6A, data collection
The latest comprehensive dataset of MHC binding peptidases was collected from IEDB (http:// www.iedb.org/database_export_v3. Php) and the MHC binding peptide data from the own experiments was added.
Step 6B, data arrangement
The collected data were collated and the binding peptide fragment affinity value (IC 50) was measured at 50000nM with an IC50 value greater than 50000nM. The affinity values of the resulting binding peptide fragments were converted to the formula t= (1-log (aff)/log (50000)). If the repeated sequence data exists, taking the average value of the binding peptide affinity as the final data combination result, and adding the final data combination result to training data.
Step 6C, data set training
The results of the collection and sorting of step 6B were trained using different methods, including pickpocket, netMHCpan, netMHCcons and/or netMHC, etc., respectively. And updating the original database file of the mhc _i software by using the model file obtained by new training, so as to improve the accuracy of prediction.

Claims (11)

1. A method of predicting a clinical personalized tumor neoantigen comprising the steps of:
step 1, comparing sequencing data of tumor cells of a patient with a preset corresponding reference genome to obtain a DNA comparison result and an RNA comparison result;
step 2, preprocessing DNA comparison results and RNA comparison results;
step 3, analyzing somatic variation, clone type, tumor purity, gene expression quantity in tumor cells and expression abundance of somatic variation alleles in tumor cells by using DNA comparison results and RNA comparison results to obtain analysis results; step 3 comprises the following steps:
Step 3A, analyzing somatic mutation sites in tumor cells by using DNA comparison results;
step 3B, analyzing and obtaining the expression abundance of the variant alleles of the somatic cells and the gene expression quantity in tumor cells according to the comparison result of the mutation sites of the somatic cells and the binding RNA;
step 3C, filtering false positive sites in somatic mutation sites through the expression abundance of somatic mutation alleles to obtain actual somatic mutation sites;
step 3D, annotating and further filtering actual somatic mutation sites to obtain somatic mutation;
step 3E, calculating copy number variation files of the tumor;
step 3F, calculating the tumor purity and the clone structure according to the somatic cell variation and the copy number variation file of the tumor;
step 3G, analyzing the gene expression quantity in the tumor cells according to the RNA comparison result;
step 4, predicting the genotype of the MHC class I molecules of the tumor cells of the patient, analyzing the affinity of the potential neoantigens and calculating the polypeptide presentation efficacy according to the DNA analysis result and the genotype of the MHC class I molecules of the tumor cells of the patient; step 4 comprises the steps of:
step 4A, predicting the genotype of MHC class I molecules of tumor cells of a patient;
step 4B, acquiring non-synonymous mutant protein data according to the annotated actual somatic mutation sites;
Step 4C, predicting the ability of MHC class I molecules to bind wild type polypeptides and variant polypeptides according to the non-synonymous mutant protein data to obtain affinity prediction;
step 4D, analyzing the polypeptide sequence to obtain the polypeptide dissociation efficiency, the transport efficiency and the MHC binding efficiency, and calculating the polypeptide presentation efficiency according to the obtained polypeptide dissociation efficiency, transport efficiency and MHC binding efficiency;
step 4E, obtaining the affinity of the potential new antigen according to affinity prediction and non-synonymous mutant protein data;
step 5, scoring and sequencing the corresponding neoantigens according to the analysis result, the affinity of the potential neoantigens and the polypeptide presentation efficacy, and presenting the neoantigens;
the method for scoring each corresponding neoantigen according to the analysis result, the affinity of the potential neoantigen and the polypeptide presenting efficacy comprises the following steps:
score=abundance*dissimilarity*clonality
wherein, the liquid crystal display device comprises a liquid crystal display device,
abundance=L(IC m )*Freq*tanh(TPM);
dissimilarity=(1–L(IC w /50000))/2;
clonality=NCTL*MC;
L(x)=1/(1+e x ) Tan () is calculated as hyperbolic tangent function, score is the fraction obtained by calculation of neoantigen, abundance is the abundance of mutant polypeptide, TPM is gene expression abundance, dissimilarity is Dissimilarity measure the difference of affinity of mutant peptide fragment and corresponding normal peptide fragment, clonality measure the efficiency NCTL of neoantigen from mutation to short peptide and the proportion MC of neoantigen distributed in all tumor cells, freq is the expression abundance ratio of somatic variant allele, which is the ratio of expression abundance of somatic variant allele to gene expression amount in tumor cells, IC m For affinity prediction of mutant polypeptides, IC w For the binding affinity of the wild-type polypeptide obtained during affinity prediction, NCTL is the presentation efficacy, MC is the cloning ratio of mutant genes
And 6, optimizing the analysis method of the affinity of the potential neoantigen in the step 5 by using a machine learning method based on experimental results.
2. The method for predicting clinical personalized tumor neoantigen according to claim 1, wherein in step 1, the sequencing data of tumor cells of a patient is compared with a preset corresponding reference genome, and a Burrows-Wheeler transformation algorithm is adopted to obtain a DNA comparison result; and (5) obtaining an RNA comparison result by adopting a STAR algorithm.
3. The method of predicting a clinical personalized tumor neoantigen according to claim 1, further comprising, prior to step 1, the steps of:
and step 0, preprocessing the original sequencing data of the tumor cells of the patient to obtain the sequencing data of the tumor cells of the patient.
4. The method of claim 3, wherein in step 0, the pretreatment comprises trimming sequencing adapter sequences, trimming sequencing tag sequences, and removing sequences of poor sequencing quality, using trimmatic software.
5. The method of claim 1, wherein in step 2, the pretreatment comprises deduplication, addition of group information, and alkali matrix recalculation, using GATK software.
6. The method of claim 1, wherein in step 3D, the somatic mutation is: further filtering was performed on the annotated actual somatic mutation sites, leaving only non-synonymous somatic mutation sites in the exon region.
7. The method of claim 1, wherein step 3A uses mutec 2 software for analysis; step 3B adopts the HaplotyperCaller software analysis of GATK; step 3C, adopting FilterMutectCalls software to analyze; step 3D, annotating with ANNOVAR software; step 3E, adopting cnvkit software to calculate; in the step 3F, the calculated tumor purity is calculated by adopting ABSOULTE software, and the cloning structure is calculated by adopting pyClone software; the cloning structure refers to the cloning ratio of mutant genes, i.e., the cloning ratio of each individual cell mutation site.
8. The method of claim 1, wherein in step 3G, the analysis of the abundance of gene expression is performed using featuresources software to obtain counts and converting the counts into the number of sequenced transcripts per million sequenced transcripts.
9. The method of claim 8, wherein the conversion of the count value obtained to the number of transcript sequenced fragments per million sequenced fragments is given by:
Figure FDA0004137799990000031
wherein R is the count number of the genes, L is the length of the genes,
Figure FDA0004137799990000032
R k for the count number of gene k, L k For the length of gene k, n is the total number of genes.
10. The method of predicting a clinical personalized tumor neoantigen according to claim 1, wherein step 4A is analyzed using polsover software; step 4C uses IEDB provided MHC class I epitope predictors software for analysis; and step 4D, analyzing and calculating by adopting netCTLpan software.
11. The method of claim 1, wherein step 4B is specifically: downloading fasta files of human genome protein sequences provided by UCSC websites, extracting sequences of 15 amino acids before and after mutated amino acids according to annotated actual somatic mutation sites, respectively obtaining flanking amino acid sequences of wild type proteins and mutant proteins, and combining all the sequences into fasta files of polypeptide sequences, namely non-synonymous mutant protein data.
CN202010162494.9A 2020-03-10 2020-03-10 Prediction method of clinical individuation tumor neoantigen Active CN111415707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010162494.9A CN111415707B (en) 2020-03-10 2020-03-10 Prediction method of clinical individuation tumor neoantigen

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010162494.9A CN111415707B (en) 2020-03-10 2020-03-10 Prediction method of clinical individuation tumor neoantigen

Publications (2)

Publication Number Publication Date
CN111415707A CN111415707A (en) 2020-07-14
CN111415707B true CN111415707B (en) 2023-04-25

Family

ID=71492918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010162494.9A Active CN111415707B (en) 2020-03-10 2020-03-10 Prediction method of clinical individuation tumor neoantigen

Country Status (1)

Country Link
CN (1) CN111415707B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951887A (en) * 2020-07-27 2020-11-17 深圳市新合生物医疗科技有限公司 Leukocyte antigen and polypeptide binding affinity prediction method based on deep learning
CN112466396A (en) * 2020-12-04 2021-03-09 中山大学附属第一医院 Screening method of tumor high-affinity new antigen and application of tumor high-affinity new antigen in indication of treatment prognosis curative effect of PD-1 of liver cancer patient
CN114446389A (en) * 2022-02-08 2022-05-06 上海科技大学 Tumor neoantigen characteristic analysis and immunogenicity prediction tool and application thereof
CN115424740B (en) * 2022-09-30 2023-11-17 四川大学华西医院 Tumor immunotherapy effect prediction system based on NGS and deep learning
CN116825188B (en) * 2023-06-25 2024-04-09 北京泛生子基因科技有限公司 Method, device and computer readable storage medium for identifying tumor neoantigen at multiple groups of chemical layers based on high-throughput sequencing technology
CN116959579B (en) * 2023-09-19 2023-12-22 北京求臻医学检验实验室有限公司 System for reducing errors of second generation sequencing system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105524984A (en) * 2014-09-30 2016-04-27 深圳华大基因科技有限公司 Method and equipment for neoantigen epitope prediction
CN108796055A (en) * 2018-06-12 2018-11-13 深圳裕策生物科技有限公司 Tumor neogenetic antigen detection method, device and storage medium based on the sequencing of two generations
CN109125740A (en) * 2017-06-28 2019-01-04 四川大学 A kind of novel tumor vaccine and application thereof
CN109706065A (en) * 2018-12-29 2019-05-03 深圳裕策生物科技有限公司 Tumor neogenetic antigen load detection device and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2439580T3 (en) * 2003-02-28 2014-01-23 The Johns Hopkins University T cell regulation
US20120277999A1 (en) * 2010-10-29 2012-11-01 Pbd Biodiagnostics, Llc Methods, kits and arrays for screening for, predicting and identifying donors for hematopoietic cell transplantation, and predicting risk of hematopoietic cell transplant (hct) to induce graft vs. host disease (gvhd)
EP3091074B1 (en) * 2013-11-21 2019-08-07 Repertoire Genesis Incorporation T cell receptor and b cell receptor repertoire analysis system, and use of same in treatment and diagnosis
EP3083700B1 (en) * 2013-12-17 2023-10-11 The Brigham and Women's Hospital, Inc. Detection of an antibody against a pathogen
EP3314020A1 (en) * 2015-06-29 2018-05-02 The Broad Institute Inc. Tumor and microenvironment gene expression, compositions of matter and methods of use thereof
AU2017254477A1 (en) * 2016-04-18 2018-11-01 Jennifer G. ABELIN Improved HLA epitope prediction
CN111465989A (en) * 2017-10-10 2020-07-28 磨石肿瘤生物技术公司 Identification of neoantigens using hot spots
CN107704727B (en) * 2017-11-03 2020-01-31 杭州风起智能科技有限公司 Neoantigen activity prediction and sequencing method based on tumor neoantigen characteristic value
US20200368336A1 (en) * 2017-12-01 2020-11-26 Shanghai Jenomed Biotech Co., Ltd. Method for preparing personalized cancer vaccine
CN108491689B (en) * 2018-02-01 2019-07-09 杭州纽安津生物科技有限公司 Tumour neoantigen identification method based on transcript profile
CN110600077B (en) * 2019-08-29 2022-07-12 北京优迅医学检验实验室有限公司 Prediction method of tumor neoantigen and application thereof
CN110706742B (en) * 2019-09-30 2020-06-30 中生康元生物科技(北京)有限公司 Pan-cancer tumor neoantigen high-throughput prediction method and application thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105524984A (en) * 2014-09-30 2016-04-27 深圳华大基因科技有限公司 Method and equipment for neoantigen epitope prediction
CN109125740A (en) * 2017-06-28 2019-01-04 四川大学 A kind of novel tumor vaccine and application thereof
CN108796055A (en) * 2018-06-12 2018-11-13 深圳裕策生物科技有限公司 Tumor neogenetic antigen detection method, device and storage medium based on the sequencing of two generations
CN109706065A (en) * 2018-12-29 2019-05-03 深圳裕策生物科技有限公司 Tumor neogenetic antigen load detection device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄小兰,董海龙,隋延仿.肝癌MAGE-n抗原HLA-A2限制性细胞毒性T细胞表位的预测.现代肿瘤医学.2003,(第05期),7-9. *

Also Published As

Publication number Publication date
CN111415707A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
CN111415707B (en) Prediction method of clinical individuation tumor neoantigen
JP7217711B2 (en) Identification, production and use of neoantigens
CN108796055B (en) Method, device and storage medium for detecting tumor neoantigen based on second-generation sequencing
EP2994159B1 (en) Predicting immunogenicity of t cell epitopes
CN109584960B (en) Method, device and storage medium for predicting tumor neoantigen
CN110600077B (en) Prediction method of tumor neoantigen and application thereof
US11441160B2 (en) Compositions and methods for viral delivery of neoepitopes and uses thereof
CN110277135B (en) Method and system for selecting individualized tumor neoantigen based on expected curative effect
CN110706742B (en) Pan-cancer tumor neoantigen high-throughput prediction method and application thereof
CN107704727A (en) Neoantigen Activity Prediction and sort method based on tumour neoantigen characteristic value
CN110799196A (en) System for ranking immunogenic cancer-specific epitopes
WO2018232580A1 (en) Method and device for haplotype phasing of diploid genome based on third generation capture sequencing
CN115747327A (en) Novel antigen prediction methods involving frameshift mutations
Pagadala et al. Germline modifiers of the tumor immune microenvironment implicate drivers of cancer risk and immunotherapy response
CN114333999A (en) Method and system for detecting and screening tumor neoantigen by combining molecular omics and computing structure
KR20180087248A (en) Viral neo-epitopes and uses thereof
CN114333998A (en) Tumor neoantigen prediction method and system based on deep learning model
AU2019382854B2 (en) Method and system of targeting epitopes for neoantigen-based immunotherapy
WO2020187143A1 (en) Method for identifying neoantigens
WO2022266375A1 (en) Quantification of rna mutation expression
CN113129998B (en) Method for constructing prediction model of clinical individualized tumor neoantigen
CN113160884A (en) Method for predicting novel antigen derived from non-coding region by integrating multiple sets of mathematical data
WO2020101037A1 (en) Order-made medical core system
CN111599410A (en) Method for extracting new antigen for microsatellite unstable immunotherapy by integrating multiple sets of mathematical data and application
Schubert Advanced immunoinformatics approaches for precision medicine

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant