CN116825188B - Method, device and computer readable storage medium for identifying tumor neoantigens at the multi-omics level based on high-throughput sequencing technology

Info

Publication number
CN116825188B
Authority
CN
China
Prior art keywords
mutation
data
mutations
coverage
sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310750011.0A
Other languages
Chinese (zh)
Other versions
CN116825188A (en)
Inventor
马景娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Panshengzi Medical Laboratory Co ltd
Genetron Health Beijing Co ltd
Original Assignee
Guangzhou Panshengzi Medical Laboratory Co ltd
Genetron Health Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Panshengzi Medical Laboratory Co ltd, Genetron Health Beijing Co ltd
Priority to CN202310750011.0A
Publication of CN116825188A
Application granted
Publication of CN116825188B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50: Mutagenesis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50: Molecular design, e.g. of drugs

Landscapes

  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computing Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method, a device and a computer readable storage medium for identifying tumor neoantigens at the multi-omics level based on high-throughput sequencing technology. The technical problem to be solved is how to analyze and identify tumor neoantigens at the multi-omics level based on high-throughput sequencing technology. First, whole-exome sequencing data of a tumor sample, transcriptome sequencing data of the tumor sample and whole-exome sequencing data of a blood leukocyte sample are obtained from the tumor patient to be tested. The sequencing data are aligned to a reference genome to obtain alignment data, from which the mutation data of the tumor sample and the HLA class I typing of the patient are obtained. Mutant polypeptides are then determined from the mutation data, and whether each mutant polypeptide is a neoantigen is identified according to its binding capacity to the HLA class I antigens. The invention can identify more of the molecular events that may generate tumor neoantigens, and can be applied to the preparation of tumor vaccines and related medicaments.

Description

Method, device and computer readable storage medium for identifying tumor neoantigens at the multi-omics level based on high-throughput sequencing technology
Technical Field
The invention belongs to the field of identification and relates to a method, a device and a computer readable storage medium for identifying tumor neoantigens at the multi-omics level based on high-throughput sequencing technology.
Background
With the development of cancer treatment technologies, new approaches such as immunotherapy are gradually maturing and are increasingly used in clinical treatment owing to advantages such as low side effects and low recurrence rates. Although immunotherapy takes many forms, such as monoclonal antibodies, CAR-T, TCR-T and cancer vaccines, its general aim is to specifically recognize and kill tumor cells by activating or enhancing the patient's immune system.
Tumor neoantigens play an important role in the recognition of tumor cells by the immune system. A tumor neoantigen originates from a mutation in the genome or transcriptome of a tumor cell: the DNA or RNA fragment carrying the mutation is translated into a polypeptide which, owing to its immunogenicity, is recognized by human leukocyte antigens (HLA) and presented to immune cells. The identification of tumor neoantigens is therefore significant. On the one hand, the tumor neoantigen load can be estimated to predict the efficacy of immunotherapy. On the other hand, accurate identification of tumor neoantigens can assist in preparing more effective cancer vaccines.
For the identification of tumor neoantigens, the scientific community currently focuses mainly on genome-level single nucleotide mutations. However, tumor neoantigens do not have a single source: they can arise at the genomic level, for example through gene fusion, or at the transcriptome level, for example through alternative splicing and RNA editing. Focusing solely on genome-level single nucleotide mutations may therefore underestimate tumor immunogenicity. Moreover, the molecular events described above cause more substantial changes in the polypeptide sequence and thus generally make the polypeptide more immunogenic, so focusing only on single nucleotide mutations may miss polypeptides with greater immunogenicity.
Disclosure of Invention
The technical problem to be solved by the invention is how to analyze and identify tumor neoantigens and/or how to analyze and identify tumor neoantigens at the multi-omics level based on high-throughput sequencing technology.
In order to solve the above technical problem, the present invention first provides a method for identifying or assisting in identifying tumor neoantigens, which may include the following steps:
A1) Obtaining sequencing data of a tumor patient to be tested, wherein the sequencing data comprise whole-exome sequencing data of a tumor sample, transcriptome sequencing data of the tumor sample and whole-exome sequencing data of a blood leukocyte sample;
A2) Aligning the sequencing data to a reference genome to obtain alignment data, wherein the alignment data comprise whole-exome alignment data of the tumor sample, transcriptome alignment data of the tumor sample and whole-exome alignment data of the blood leukocyte sample, and obtaining mutation data of the tumor sample and the HLA class I antigens of the blood leukocyte sample from the alignment data; the mutation data include single nucleotide mutations and small-fragment insertion and/or deletion mutations detected in the whole-exome alignment data of the tumor sample, and alternative splicing mutations, RNA editing mutations and RNA gene fusion mutations detected in the transcriptome alignment data of the tumor sample;
A3) Determining the mutant polypeptides corresponding to the mutation data based on the mutation data, and identifying whether each mutant polypeptide is a neoantigen or a candidate neoantigen according to the binding capacity of the mutant polypeptide and the HLA class I antigens.
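Step A3) starts by turning each mutation into candidate polypeptides. The following is a minimal sketch of that idea for a single amino-acid substitution only. The peptide lengths of 8 to 11 residues, the 0-based position convention and the function names `windows_covering` and `mutant_wildtype_pairs` are illustrative assumptions; the text above does not specify them.

```python
from typing import Iterator, List, Sequence, Tuple


def windows_covering(protein: str, pos: int,
                     lengths: Sequence[int] = (8, 9, 10, 11)) -> Iterator[Tuple[int, str]]:
    """Yield (start, peptide) for every window that contains residue `pos` (0-based)."""
    if not 0 <= pos < len(protein):
        raise ValueError("pos lies outside the protein sequence")
    for k in lengths:
        first = max(0, pos - k + 1)        # leftmost start that still covers pos
        last = min(pos, len(protein) - k)  # rightmost start that fits in the protein
        for start in range(first, last + 1):
            yield start, protein[start:start + k]


def mutant_wildtype_pairs(wild: str, pos: int, alt: str,
                          lengths: Sequence[int] = (8, 9, 10, 11)) -> List[Tuple[str, str]]:
    """Pair each mutant peptide with the wild-type peptide at the same coordinates."""
    if not 0 <= pos < len(wild):
        raise ValueError("pos lies outside the protein sequence")
    if len(alt) != 1:
        raise ValueError("this sketch handles single-residue substitutions only")
    mutant = wild[:pos] + alt + wild[pos + 1:]
    return [(pep, wild[start:start + len(pep)])
            for start, pep in windows_covering(mutant, pos, lengths)]
```

Keeping the wild-type peptide alongside each mutant peptide is a design choice made here because the mutant-to-wild-type score ratio is needed later for two of the three identification conditions. Insertions, deletions, splice variants and fusions would need their own peptide construction and are not covered by this sketch.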
In the above method, the neoantigen is a tumor-specific antigen derived from a non-synonymous mutation.
The HLA class I antigen can be any HLA class I antigen subtype of the tumor patient to be tested.
The small-fragment insertions and/or deletions may range in length from 1 to 49 bp.
In the above method, identifying whether the mutant polypeptide is a neoantigen according to the binding capacity of the mutant polypeptide and the HLA class I antigen may be at least one of the following:
A3-1) a mutant polypeptide that is produced by a single nucleotide mutation, a small-fragment insertion and/or deletion mutation and/or an RNA editing mutation and satisfies condition A is a neoantigen or a candidate neoantigen, and such a mutant polypeptide that does not satisfy condition A is a non-neoantigen or a candidate non-neoantigen; condition A is that the binding score of the mutant polypeptide to the HLA class I antigen is less than 500, and the ratio of the binding score of the mutant polypeptide to the HLA class I antigen to the binding score of the corresponding wild-type polypeptide to the HLA class I antigen is less than 1;
A3-2) a mutant polypeptide that is produced by an RNA gene fusion mutation and satisfies condition B is a neoantigen or a candidate neoantigen, and such a mutant polypeptide that does not satisfy condition B is a non-neoantigen or a candidate non-neoantigen; condition B is that the binding score of the mutant polypeptide to the HLA class I antigen is less than 500;
A3-3) a mutant polypeptide that is produced by an alternative splicing mutation and satisfies condition C is a neoantigen or a candidate neoantigen, and such a mutant polypeptide that does not satisfy condition C is a non-neoantigen or a candidate non-neoantigen; condition C is that the binding score of the mutant polypeptide to the HLA class I antigen is less than 1000, and the ratio of the binding score of the mutant polypeptide to the HLA class I antigen to the binding score of the corresponding wild-type polypeptide to the HLA class I antigen is less than 1.
In the above method, the HLA class I antigen may be a human major histocompatibility complex (MHC) class I antigen. The method further comprises the following step: when more than one mutant polypeptide generated by the same mutation event satisfies condition A, B or C, the mutant polypeptide with the lowest binding score to the HLA class I antigen among them is selected as the neoantigen or candidate neoantigen.
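Conditions A, B and C and the lowest-score selection rule can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `PeptideCall` record, the event-type labels and the function names are hypothetical, and the code assumes a score for which lower means stronger binding, which is consistent with selecting the lowest score but whose unit the text does not name.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Hypothetical labels for the four event classes named in the text.
EVENT_TYPES = {"snv_indel", "rna_editing", "fusion", "splicing"}


@dataclass(frozen=True)
class PeptideCall:
    event_id: str                       # identifies the mutation event
    event_type: str                     # one of EVENT_TYPES
    peptide: str
    mutant_score: float                 # binding score of the mutant peptide
    wild_score: Optional[float] = None  # None when no wild-type score is used


def passes_condition(call: PeptideCall) -> bool:
    """Apply condition A, B or C depending on the event type."""
    if call.event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {call.event_type}")
    if call.event_type == "fusion":                    # condition B
        return call.mutant_score < 500
    if call.wild_score is None or call.wild_score <= 0:
        return False                                   # ratio cannot be evaluated
    limit = 1000 if call.event_type == "splicing" else 500  # condition C vs A
    return call.mutant_score < limit and call.mutant_score / call.wild_score < 1


def best_per_event(calls: Iterable[PeptideCall]) -> Dict[str, PeptideCall]:
    """Keep, for each mutation event, the passing peptide with the lowest score."""
    best: Dict[str, PeptideCall] = {}
    for call in calls:
        if not passes_condition(call):
            continue
        current = best.get(call.event_id)
        if current is None or call.mutant_score < current.mutant_score:
            best[call.event_id] = call
    return best
```

Treating a missing or non-positive wild-type score as a failed ratio test is a conservative choice made for this sketch; the text does not say how such cases are handled.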
In the above method, obtaining the mutation data of the tumor sample in A2) may include the following steps:
A2-1) screening the single nucleotide mutations and small-fragment insertion and/or deletion mutations in the mutation data according to whether each mutation is a non-synonymous mutation, the coverage of the mutation in the tumor sample alignment data and the blood leukocyte sample alignment data, the mutation frequency of the mutation in the tumor sample alignment data and the blood leukocyte sample alignment data, the expression level in the transcriptome alignment data of the gene carrying the mutation, the Transcript Support Level (TSL) value of the gene transcript carrying the mutation, and the result of a manual check of sequencing-read coverage at the mutation site;
The coverage may be the number of sequencing reads aligned to the site in the alignment data.
The Transcript Support Level (TSL) represents the degree of support for a gene transcript model. It is evaluated against the mRNA and EST alignments provided by UCSC and Ensembl, and each transcript is assigned a number from 1 to 5, with lower numbers representing stronger support.
A2-2) screening the alternative splicing mutations in the mutation data according to whether each mutation is a rare mutation site, the coverage of the corresponding mutation site in the alignment data, the frequency of the 5'/3' splice site corresponding to the mutation, and the immunogenicity score of the mutant polypeptide corresponding to the mutation;
A2-3) screening the RNA editing mutations in the mutation data according to whether each mutation is a non-synonymous mutation site, whether it is a rare mutation site, the population frequency of the mutation, the coverage and mutation frequency of the mutation in the tumor sample alignment data, whether the mutation is also detected in the whole-exome alignment data of the tumor sample, the expression level in the transcriptome alignment data of the gene carrying the mutation, the Transcript Support Level (TSL) value of the gene transcript carrying the mutation, and the result of a manual check of sequencing-read coverage at the mutation site;
A2-4) screening the RNA gene fusion mutations in the mutation data according to the coverage of the corresponding fusion breakpoint in the alignment data, the mutation frequency of the mutation in the tumor sample alignment data, whether the mutation is a fusion within a gene coding region, and the result of a manual check of sequencing-read coverage at the mutation site.
The fusion breakpoint described above may be the point at which the two fused genes break and join together.
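Coverage and mutation frequency recur in all four screening steps above. A minimal sketch of the two quantities for one site is given below, assuming that mutation frequency means the fraction of reads at the site that support the mutant allele. That definition is an assumption (the text defines coverage but not frequency), and `SiteCounts` is a hypothetical record that ignores reads supporting any third allele.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCounts:
    ref_reads: int  # reads aligned to the site that support the reference allele
    alt_reads: int  # reads aligned to the site that support the mutant allele

    @property
    def coverage(self) -> int:
        """Number of sequencing reads aligned to the site."""
        return self.ref_reads + self.alt_reads

    @property
    def mutation_frequency(self) -> float:
        """Fraction of aligned reads carrying the mutation (0.0 if uncovered)."""
        return self.alt_reads / self.coverage if self.coverage else 0.0
```

For example, a tumor site with 30 reference reads and 10 mutant reads has a coverage of 40 and a mutation frequency of 0.25.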
In the above method, the screening of A2-1) may specifically be: retaining, as the screened single nucleotide mutations and small-fragment insertion and/or deletion mutations, the mutations that are non-synonymous; have a coverage of 10 or more in the tumor sample alignment data (including genomic WES data and transcriptome RNA-seq data) and a coverage of 5 or more in the blood leukocyte sample alignment data; have a mutation frequency of 0.2 or more in the tumor sample alignment data (including genomic WES data and transcriptome RNA-seq data) and a mutation frequency of 0.02 or less in the blood leukocyte sample alignment data; lie in a gene whose RNA expression level in the transcriptome alignment data, measured as FPKM (Fragments Per Kilobase of transcript per Million fragments mapped), is 1 or more; lie in a gene transcript whose Transcript Support Level (TSL) is 1 or less; and pass the manual check of sequencing-read coverage at the mutation site;
The screening of A2-2) may specifically be: retaining, as the screened alternative splicing mutations, the mutations that are rare mutation sites (rare mutations with a population frequency of less than 1% in the GTEx database); have a coverage of 10 or more at the corresponding mutation site in the alignment data; have a 5'/3' splice-site frequency of 0.1 or more; and whose corresponding mutant polypeptide has an immunogenicity score of 0.5 or more;
The screening of A2-3) may specifically be: retaining, as the screened RNA editing mutations, the mutations that are non-synonymous mutation sites; are rare mutation sites (sites recorded in the RADAR or DARNED database with a population frequency of less than 5% in the 1000 Genomes or ESP database); have a coverage of 10 or more at the mutation site in the transcriptome RNA-seq alignment data; have a mutation frequency of 0.2 or more in the transcriptome RNA-seq data; have a mutation frequency of 0 in the tumor sample WES data (i.e. the mutation site is not present in the whole-exome alignment data); lie in a gene transcript whose Transcript Support Level (TSL) is 1 or less; and pass the manual check of sequencing-read coverage at the mutation site;
The screening of A2-4) may specifically be: retaining, as the screened RNA gene fusion mutations, the mutations whose fusion breakpoint in the alignment data lies within a gene coding region; whose fusion breakpoint has a coverage of 2 or more in the effective data; whose mutation frequency in the tumor sample alignment data, measured as FFPM (Fusion Fragments Per Million total reads), is 0.1 or more; and which pass the manual check of sequencing-read coverage at the mutation site.
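The four sets of thresholds above can be written as simple predicates. The sketch below is illustrative only: the function and parameter names are hypothetical, frequencies are expressed as fractions (1% as 0.01), `manual_ok` stands in for the outcome of the manual read-coverage check, and the tumor thresholds for the first filter are assumed to apply to the WES and RNA-seq data separately.

```python
def keep_snv_indel(*, nonsynonymous: bool, tumor_wes_cov: int, tumor_rna_cov: int,
                   normal_cov: int, tumor_wes_freq: float, tumor_rna_freq: float,
                   normal_freq: float, fpkm: float, tsl: int, manual_ok: bool) -> bool:
    """A2-1): single nucleotide mutations and small insertions/deletions."""
    return (nonsynonymous
            and tumor_wes_cov >= 10 and tumor_rna_cov >= 10 and normal_cov >= 5
            and tumor_wes_freq >= 0.2 and tumor_rna_freq >= 0.2
            and normal_freq <= 0.02
            and fpkm >= 1 and tsl <= 1 and manual_ok)


def keep_splicing(*, gtex_freq: float, site_cov: int, splice_site_freq: float,
                  immunogenicity: float) -> bool:
    """A2-2): splicing events that are rare in GTEx and well supported."""
    return (gtex_freq < 0.01 and site_cov >= 10
            and splice_site_freq >= 0.1 and immunogenicity >= 0.5)


def keep_rna_editing(*, nonsynonymous: bool, in_editing_db: bool,
                     population_freq: float, rna_cov: int, rna_freq: float,
                     wes_freq: float, tsl: int, manual_ok: bool) -> bool:
    """A2-3): RNA editing sites seen in RNA-seq but absent from tumor WES."""
    return (nonsynonymous and in_editing_db and population_freq < 0.05
            and rna_cov >= 10 and rna_freq >= 0.2 and wes_freq == 0
            and tsl <= 1 and manual_ok)


def keep_fusion(*, in_coding_region: bool, breakpoint_cov: int, ffpm: float,
                manual_ok: bool) -> bool:
    """A2-4): gene fusions with a supported breakpoint in a coding region."""
    return in_coding_region and breakpoint_cov >= 2 and ffpm >= 0.1 and manual_ok
```

Keyword-only arguments are used so that each threshold is tied to a named quantity at the call site, which makes it harder to swap, say, a tumor frequency with a normal-sample frequency by accident.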
The Transcript Support Level (TSL) value of the gene transcript carrying the mutation can be obtained using the pVAC-Seq software.
The binding score of a polypeptide to the HLA class I antigen can be predicted using the pVAC-Seq software.
The immunogenicity score may be calculated using the ASNEO software.
The software may be run with default parameters.
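The two normalized abundance measures used in the screening thresholds, FPKM and FFPM, follow from their expanded names. The sketch below shows the arithmetic only; the function names and arguments are hypothetical, and in practice these values would come from the expression and fusion-calling tools rather than being computed by hand.

```python
def fpkm(fragments_on_transcript: int, transcript_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments Per Kilobase of transcript per Million fragments mapped."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments_on_transcript / kilobases / millions


def ffpm(fusion_fragments: int, total_reads: int) -> float:
    """Fusion Fragments Per Million total reads."""
    if total_reads <= 0:
        raise ValueError("total read count must be positive")
    return fusion_fragments / (total_reads / 1_000_000)
```

As a worked example, 500 fragments on a 2,000 bp transcript in a library of 25 million mapped fragments give an FPKM of 10, and 5 fusion-supporting fragments in 50 million reads give an FFPM of 0.1, which sits exactly on the fusion threshold.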
In order to solve the above technical problem, the invention also provides a device for identifying or assisting in identifying tumor neoantigens, which may comprise the following modules:
B1) a sequencing data acquisition module, used for obtaining sequencing data of a tumor patient to be tested, wherein the sequencing data comprise whole-exome sequencing data of a tumor sample, transcriptome sequencing data of the tumor sample and whole-exome sequencing data of a blood leukocyte sample;
B2) a mutation detection module, used for aligning the sequencing data to a reference genome to obtain alignment data, wherein the alignment data comprise whole-exome alignment data of the tumor sample, transcriptome alignment data of the tumor sample and whole-exome alignment data of the blood leukocyte sample, and for obtaining mutation data of the tumor sample and the HLA class I antigens of the blood leukocyte sample from the alignment data; the mutation data comprise single nucleotide mutations and small-fragment insertion and/or deletion mutations detected in the whole-exome alignment data of the tumor sample, and alternative splicing mutations, RNA editing mutations and RNA gene fusion mutations detected in the transcriptome alignment data of the tumor sample;
B3) a neoantigen identification module, used for determining the mutant polypeptides corresponding to the mutation data based on the mutation data, and identifying whether each mutant polypeptide is a neoantigen or a candidate neoantigen according to the binding capacity of the mutant polypeptide and the HLA class I antigens.
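One way to picture the three modules is as three pluggable stages chained in order. The sketch below is a structural illustration only: the class `NeoantigenDevice`, its field names, the dictionary keys and the input check are assumptions made for this example, and the stages themselves are left as callables supplied by the user.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical keys for the three sequencing inputs named in module B1).
REQUIRED_INPUTS = {"tumor_wes", "tumor_rna", "blood_wes"}


@dataclass
class NeoantigenDevice:
    acquire: Callable[[str], Dict[str, Any]]            # B1) sequencing data acquisition
    detect: Callable[[Dict[str, Any]], Dict[str, Any]]  # B2) mutation detection
    identify: Callable[[Dict[str, Any]], List[str]]     # B3) neoantigen identification

    def run(self, patient_id: str) -> List[str]:
        """Chain the three modules for one patient."""
        sequencing = self.acquire(patient_id)
        missing = REQUIRED_INPUTS - set(sequencing)
        if missing:
            raise ValueError(f"missing sequencing inputs: {sorted(missing)}")
        calls = self.detect(sequencing)
        return self.identify(calls)
```

Passing the stages in as callables keeps the sketch independent of any particular aligner, variant caller or binding predictor.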
In the above device, the neoantigen is a tumor-specific antigen derived from a non-synonymous mutation.
The HLA class I antigen can be any HLA class I antigen subtype of the tumor patient to be tested.
The small-fragment insertions and/or deletions may range in length from 1 to 49 bp.
In the above device, identifying whether the mutant polypeptide is a neoantigen according to the binding capacity of the mutant polypeptide and the HLA class I antigen may be at least one of the following:
B3-1) a mutant polypeptide that is produced by a single nucleotide mutation, a small-fragment insertion and/or deletion mutation and/or an RNA editing mutation and satisfies condition A is a neoantigen or a candidate neoantigen, and such a mutant polypeptide that does not satisfy condition A is a non-neoantigen or a candidate non-neoantigen; condition A is that the binding score of the mutant polypeptide to the HLA class I antigen is less than 500, and the ratio of the binding score of the mutant polypeptide to the HLA class I antigen to the binding score of the corresponding wild-type polypeptide to the HLA class I antigen is less than 1;
B3-2) a mutant polypeptide that is produced by an RNA gene fusion mutation and satisfies condition B is a neoantigen or a candidate neoantigen, and such a mutant polypeptide that does not satisfy condition B is a non-neoantigen or a candidate non-neoantigen; condition B is that the binding score of the mutant polypeptide to the HLA class I antigen is less than 500;
B3-3) a mutant polypeptide that is produced by an alternative splicing mutation and satisfies condition C is a neoantigen or a candidate neoantigen, and such a mutant polypeptide that does not satisfy condition C is a non-neoantigen or a candidate non-neoantigen; condition C is that the binding score of the mutant polypeptide to the HLA class I antigen is less than 1000, and the ratio of the binding score of the mutant polypeptide to the HLA class I antigen to the binding score of the corresponding wild-type polypeptide to the HLA class I antigen is less than 1.
In the above device, the HLA class I antigen may be a human major histocompatibility complex (MHC) class I antigen. The device may further perform the following: when more than one mutant polypeptide generated by the same mutation event satisfies condition A, B or C, the mutant polypeptide with the lowest binding score to the HLA class I antigen among them is selected as the neoantigen or candidate neoantigen.
In the above device, the mutation data of the tumor sample in B2) may be obtained by a method comprising the following steps:
B2-1) screening the single nucleotide mutations and small-fragment insertion and/or deletion mutations in the mutation data according to whether each mutation is a non-synonymous mutation, the coverage of the mutation in the tumor sample alignment data and the blood leukocyte sample alignment data, the mutation frequency of the mutation in the tumor sample alignment data and the blood leukocyte sample alignment data, the expression level in the transcriptome alignment data of the gene carrying the mutation, the Transcript Support Level (TSL) value of the gene transcript carrying the mutation, and the result of a manual check of sequencing-read coverage at the mutation site;
The coverage described above may be the number of sequencing reads aligned to the site in the alignment data.
The Transcript Support Level (TSL) represents the degree of support for a gene transcript model. It is evaluated against the mRNA and EST alignments provided by UCSC and Ensembl, and each transcript is assigned a number from 1 to 5, with lower numbers representing stronger support.
B2-2) screening the alternative splicing mutations in the mutation data according to whether each mutation is a rare mutation site, the coverage of the corresponding mutation site in the alignment data, the frequency of the 5'/3' splice site corresponding to the mutation, and the immunogenicity score of the mutant polypeptide corresponding to the mutation;
B2-3) screening the RNA editing mutations in the mutation data according to whether each mutation is a non-synonymous mutation site, whether it is a rare mutation site, the population frequency of the mutation, the coverage and mutation frequency of the mutation in the tumor sample alignment data, whether the mutation is also detected in the whole-exome alignment data of the tumor sample, the expression level in the transcriptome alignment data of the gene carrying the mutation, the Transcript Support Level (TSL) value of the gene transcript carrying the mutation, and the result of a manual check of sequencing-read coverage at the mutation site;
B2-4) screening the RNA gene fusion mutations in the mutation data according to the coverage of the corresponding fusion breakpoint in the alignment data, the mutation frequency of the mutation in the tumor sample alignment data, whether the mutation is a fusion within a gene coding region, and the result of a manual check of sequencing-read coverage at the mutation site.
The fusion breakpoint described above may be the point at which the two fused genes break and join together.
In the above device, the screening of B2-1) may specifically be: retaining, as the screened single nucleotide mutations and small-fragment insertion and/or deletion mutations, the mutations that are non-synonymous; have a coverage of 10 or more in the tumor sample alignment data (including genomic WES data and transcriptome RNA-seq data) and a coverage of 5 or more in the blood leukocyte sample alignment data; have a mutation frequency of 0.2 or more in the sequencing data of the tumor sample (including genomic WES data and transcriptome RNA-seq data) and a mutation frequency of 0.02 or less in the blood leukocyte sample alignment data; lie in a gene whose RNA expression level (FPKM) in the transcriptome alignment data is 1 or more; lie in a gene transcript whose Transcript Support Level (TSL) is 1 or less; and pass the manual check of sequencing-read coverage at the mutation site;
The screening of B2-2) may specifically be: retaining, as the screened alternative splicing mutations, the mutations that are rare mutation sites (rare mutations with a population frequency of less than 1% in the GTEx database); have a coverage of 10 or more at the corresponding mutation site in the alignment data; have a 5'/3' splice-site frequency of 0.1 or more; and whose corresponding mutant polypeptide has an immunogenicity score of 0.5 or more;
The screening of B2-3) may specifically be: retaining, as the screened RNA editing mutations, the mutations that are non-synonymous mutation sites; are rare mutation sites (sites recorded in the RADAR or DARNED database with a population frequency of less than 5% in the 1000 Genomes or ESP database); have a coverage of 10 or more at the mutation site in the transcriptome RNA-seq alignment data; have a mutation frequency of 0.2 or more in the transcriptome RNA-seq data; have a mutation frequency of 0 in the tumor sample WES data (i.e. the mutation site is not present in the whole-exome alignment data); lie in a gene transcript whose Transcript Support Level (TSL) is 1 or less; and pass the manual check of sequencing-read coverage at the mutation site;
The screening of B2-4) may specifically be: retaining, as the screened RNA gene fusion mutations, the mutations whose fusion breakpoint in the alignment data lies within a gene coding region; whose fusion breakpoint has a coverage of 2 or more in the effective data; whose mutation frequency (FFPM) in the tumor sample alignment data is 0.1 or more; and which pass the manual check of sequencing-read coverage at the mutation site.
In the above device, the Transcript Support Level (TSL) value of the gene transcript carrying the mutation can be obtained using the pVAC-Seq software.
The binding score of a polypeptide to the HLA class I antigen can be predicted using the pVAC-Seq software.
The immunogenicity score may be calculated using the ASNEO software.
The software may be run with default parameters.
To solve the above technical problem, the present invention also provides a computer readable storage medium storing a computer program which, when run, performs the steps of the method described above.
Any of the following uses of the method and/or the device described above also falls within the scope of the present invention:
C1) use in the preparation of a tumor vaccine;
C2) use in the development or preparation of tumor-related medicaments.
Any of the following uses of the computer readable storage medium described above also falls within the scope of the present invention:
D1) use in the preparation of a tumor vaccine;
D2) use in the development or preparation of tumor-related medicaments.
The invention aims to provide a method for identifying tumor neoantigens through multi-omics analysis using high-throughput sequencing technology, thereby helping to assess tumor immunogenicity and to design tumor vaccines.
To achieve the above purpose, the invention adopts the following technical scheme: a multi-omics integration algorithm is constructed for high-throughput sequencing data. The algorithm takes as input whole-exome sequencing (WES) data of a tumor tissue sample, WES data of blood leukocytes and transcriptome (RNA-Seq) data of a tumor patient. After quality control and sequence alignment, the following four classes of mutation events are detected in parallel: (1) single nucleotide mutations and small-fragment insertions and/or deletions; (2) alternative splicing; (3) RNA editing; (4) gene fusion. The invention sets a series of conditions to evaluate mutation authenticity and explores the optimal screening parameters using supervised learning methods. Next, HLA affinity is assessed for the polypeptides produced by high-quality mutations. Finally, the immunogenic polypeptide sequences are output together with the degree of immunogenicity of each polypeptide.
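The parallel detection of the four event classes can be sketched with a thread pool. This is an assumption about what "in parallel" means in practice; the text may simply mean that the four analyses are independent branches. The detector callables, the `detect_events` name and the dictionary keys are hypothetical placeholders for whatever callers are actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Mapping

# A detector takes the alignment data and returns a list of event records.
Detector = Callable[[Mapping[str, Any]], List[Any]]


def detect_events(alignments: Mapping[str, Any],
                  detectors: Mapping[str, Detector]) -> Dict[str, List[Any]]:
    """Run every detector on the same alignments and collect results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(detectors))) as pool:
        futures = {name: pool.submit(fn, alignments)
                   for name, fn in detectors.items()}
        # result() re-raises any exception thrown inside a detector.
        return {name: future.result() for name, future in futures.items()}
```

A call might pass four detectors keyed as `snv_indel`, `splicing`, `rna_editing` and `fusion`; each result list would then go through its own screening step before HLA affinity assessment. Threads suit detectors that mostly wait on external tools; CPU-bound Python detectors would need processes instead.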
Owing to the above technical scheme, the invention has the following advantages:
1. Compared with traditional tumor neoantigen identification, which focuses on a single form of mutation, the integrated algorithm uses multi-omics analysis to identify more of the molecular events that may generate tumor neoantigens (single nucleotide mutation, small-fragment insertion and/or deletion, alternative splicing, RNA editing and gene fusion). The assessment of the tumor neoantigen load of a cancer patient is therefore more realistic;
2. Through multi-omics analysis, the integrated algorithm can identify more tumor neoantigens than traditional methods, which gives downstream researchers more room to screen for more effective cancer vaccines for cancer patients. Moreover, polypeptides resulting from the other three molecular events are generally more immunogenic than those from single nucleotide mutations and are more suitable for preparing cancer vaccines;
3. The integrated algorithm quantifies the immunogenicity of tumor neoantigens along dimensions such as HLA affinity and mutation abundance, and can better assist downstream researchers in screening and preparing cancer vaccines.
Drawings
FIG. 1 is a flow chart for detecting tumor neoantigens by integrated analysis of multi-omics data.
FIG. 2 shows the main filtering conditions for evaluating the authenticity of a neoantigen and the immunogenicity of the antigen.
FIG. 3 shows the results of tumor neoantigen detection in 36 lung adenocarcinoma samples. The abscissa is the sample number (1-36), the left ordinate lists the mutated genes producing the neoantigens, and the right ordinate gives the population frequency of each gene in the cohort of 36 lung adenocarcinoma samples. Only the 40 genes with the highest population frequency are shown in the figure; genes with a population frequency of 3% are not all shown, and 7 of them were selected in alphabetical order.
Detailed Description
The following detailed description of the invention is provided in connection with the accompanying drawings that are presented to illustrate the invention and not to limit the scope thereof. The examples provided below are intended as guidelines for further modifications by one of ordinary skill in the art and are not to be construed as limiting the invention in any way.
The experimental methods in the following examples, unless otherwise specified, are conventional methods, and are carried out according to techniques or conditions described in the literature in the field or according to the product specifications. Materials, reagents and the like used in the examples described below are commercially available unless otherwise specified.
Example 1. Neoantigen detection in multi-omics samples.
This example describes the detection of tumor neoantigens in a lung adenocarcinoma sample using multi-omics sequencing data. As shown in FIG. 1, the neoantigen detection process includes: filtering the raw sequencing data (WES data and RNA-seq data) of a tumor tissue sample from the patient to be tested and removing low-quality data to obtain valid data; aligning the valid data to a reference genome to obtain alignment data; performing quality control on the alignment data; and, once quality control is passed, detecting the following four types of mutations in the alignment data: (1) single nucleotide mutations and small-fragment insertions and/or deletions; (2) alternative splicing; (3) RNA editing; (4) gene fusion. True mutation events are then obtained through a series of screening and filtering steps. In parallel, HLA typing is performed on a blood leukocyte sample from the patient, the binding affinity between HLA and the new polypeptide fragments (candidate neoantigens) generated by the mutation events detected in the tumor tissue sample is predicted, and the final candidate neoantigens are selected according to defined screening conditions.
1. Acquisition of raw data.
1.1 Sequencing library construction.
Whole-exome sequencing (WES) libraries were constructed from the DNA of one tumor tissue sample and one blood leukocyte sample from a patient clinically diagnosed with lung adenocarcinoma, using Agilent's SureSelect Human All Exon V Kit according to the manufacturer's instructions, yielding a tumor tissue sample WES library and a blood leukocyte sample WES library, respectively.
Meanwhile, an RNA sequencing library was constructed from RNA extracted from the patient's tumor tissue sample using ABclonal's mRNA-seq Lib Prep Kit for Illumina according to the manufacturer's instructions, yielding a tumor tissue sample RNA library.
1.2 Acquisition of sequencing data.
High-throughput paired-end sequencing of the tumor tissue sample WES library, the blood leukocyte sample WES library and the tumor tissue RNA library was performed on an Illumina NovaSeq 6000 platform to obtain, respectively, WES raw data of the tumor tissue sample, WES raw data of the blood leukocyte sample and transcriptome sequencing (RNA-seq) raw data of the tumor tissue sample. The raw sequencing data are in FASTQ format.
2. Raw sequencing data quality control and alignment.
2.1 quality control of raw sequencing data.
WES raw data of the tumor tissue sample, transcriptome sequencing raw data, and WES raw data of the blood leukocyte sample were filtered using Trimmomatic (http://www.usadellab.org/cms/) as follows: for each read, any adapter sequence it contains is removed; bases at the head and tail of the read with a base quality below 5 are removed; the average base quality is calculated over a sliding window 4 bp long, and if the average falls below 15 the bases in that window are removed; reads whose remaining length is below 36 bp are defined as low-quality data. Filtering yields valid data that pass quality control.
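The quality-trimming rules above can be illustrated with a minimal Python sketch. It is not Trimmomatic itself: the function name `trim_read` is hypothetical, adapter removal is omitted, and the sliding-window step is simplified to cut the read at the first window whose mean quality falls below the threshold. In Trimmomatic's own step syntax, these thresholds would correspond to `LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:36`; the exact command line used is not given in the source.

```python
from typing import List, Optional, Tuple


def trim_read(quals: List[int], leading: int = 5, trailing: int = 5,
              window: int = 4, min_avg: float = 15.0,
              min_len: int = 36) -> Optional[Tuple[int, int]]:
    """Return the retained half-open interval (start, end), or None if the
    read is discarded as low quality. Simplified illustration only."""
    start, end = 0, len(quals)
    # Remove head bases with quality below the threshold.
    while start < end and quals[start] < leading:
        start += 1
    # Remove tail bases with quality below the threshold.
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # Scan 4 bp windows from the 5' end; cut at the first failing window.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break
    # Reads shorter than 36 bp after trimming count as low-quality data.
    if end - start < min_len:
        return None
    return (start, end)


if __name__ == "__main__":
    read_quals = [2, 3] + [30] * 40 + [10] * 6 + [2]
    print(trim_read(read_quals))
```

Returning an interval rather than a trimmed string keeps the sketch independent of how reads and quality strings are stored.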
2.2 genome alignment.
2.2.1 WES data alignment and quality control.
The WES valid data of the tumor tissue sample and the blood leukocyte sample obtained in step 2.1 were aligned to the human reference genome (GRCh37, 2009-2-27) using BWA-MEM (https://sourceforge.net/projects/bio-bwa/) software to obtain WES alignment data in BAM format.
Quality control was then performed on the WES alignment data. The quality control criteria are as follows: (1) the proportion of bases with quality Q30 (base call accuracy of 99.9%) is greater than 90%; (2) more than 95% of the data align to the reference genome; (3) average sequencing depth: greater than 200X for tumor samples and greater than 100X for blood leukocyte samples; (4) sites with a depth greater than 10X account for more than 95%. A sample passes quality control only if it meets all 4 criteria.
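The four WES criteria can be expressed as a single pass/fail check. The sketch below is illustrative only: the `WesQc` record and its field names are hypothetical, and the metrics are assumed to have been computed beforehand by whatever alignment statistics tool is in use.

```python
from dataclasses import dataclass


@dataclass
class WesQc:
    """Hypothetical container for pre-computed WES alignment metrics."""
    q30_fraction: float      # fraction of bases with quality >= Q30
    mapped_fraction: float   # fraction of data aligned to the reference
    mean_depth: float        # average sequencing depth
    fraction_gt_10x: float   # fraction of sites with depth > 10X


def wes_qc_pass(m: WesQc, is_tumor: bool) -> bool:
    """Apply the four WES criteria; all must hold at the same time."""
    min_depth = 200 if is_tumor else 100  # tumor > 200X, leukocytes > 100X
    return (m.q30_fraction > 0.90
            and m.mapped_fraction > 0.95
            and m.mean_depth > min_depth
            and m.fraction_gt_10x > 0.95)


if __name__ == "__main__":
    metrics = WesQc(0.93, 0.99, 150.0, 0.97)
    print(wes_qc_pass(metrics, is_tumor=True))
    print(wes_qc_pass(metrics, is_tumor=False))
```

The same metrics can pass for a blood leukocyte sample and fail for a tumor sample, because only the depth threshold differs between the two.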
2.2.2 transcriptome data alignment and quality control.
The transcriptome sequencing valid data of the tumor tissue sample obtained in step 2.1 were aligned to the human reference genomes (GRCh37, 2009-2-27 and GRCh38, 2013-12-17) using STAR (http://code.google.com/p/rna-star/) software to obtain transcriptome alignment data in BAM format (the alignment data obtained against the GRCh37 reference genome were used to detect alternative splicing mutations; the alignment data obtained against the GRCh38 reference genome were used to detect RNA editing and gene fusion).
Quality control was then performed on the transcriptome alignment data. The quality control criteria are as follows: (1) data volume greater than 10G; (2) the proportion of bases with quality Q30 is greater than 90%; (3) ribosomal RNA content below 15%; (4) reads uniquely aligned to the reference genome account for more than 80%; (5) all genes are uniformly covered (gene coverage is defined as the number of reads supporting each site in the region of a gene, from the 5' end to the 3' end, divided by the total depth in the gene region; coverage is uniform when the result is substantially the same at every site). A sample passes quality control only if it meets all 5 criteria.
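The gene coverage definition and the five transcriptome criteria can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and because the source does not quantify "substantially the same", the `max_ratio` tolerance of 2.0 is an arbitrary placeholder.

```python
from typing import List


def normalized_coverage(depths: List[int]) -> List[float]:
    """Per-site depth divided by the total depth over the gene region."""
    total = sum(depths)
    if total == 0:
        return [0.0] * len(depths)
    return [d / total for d in depths]


def is_uniform(depths: List[int], max_ratio: float = 2.0) -> bool:
    """Assumed uniformity test: the highest and lowest normalized values
    differ by at most max_ratio (the tolerance is not from the source)."""
    norm = normalized_coverage(depths)
    if not norm or min(norm) == 0:
        return False
    return max(norm) / min(norm) <= max_ratio


def rna_qc_pass(data_gb: float, q30_fraction: float, rrna_fraction: float,
                unique_fraction: float, uniform: bool) -> bool:
    """Apply the five transcriptome criteria; all must hold together."""
    return (data_gb > 10
            and q30_fraction > 0.90
            and rrna_fraction < 0.15
            and unique_fraction > 0.80
            and uniform)


if __name__ == "__main__":
    gene_depths = [10, 12, 11]  # made-up depths along one gene, 5' to 3'
    print(rna_qc_pass(12.0, 0.93, 0.05, 0.88, is_uniform(gene_depths)))
```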
The alignment data that passed quality control (i.e., the valid alignment data) were used in the subsequent analysis.
3. Mutation detection in the tumor sample.
In the tumor sample WES alignment data that passed quality control in step 2, single nucleotide variant (SNV) mutations were detected using MuTect (https://software.broadinstitute.org/cancer/cga/mutect) software, and small-fragment insertion and/or deletion (indel, 1-49 bp in length) mutations were detected using Strelka (https://github.com/Illumina/Strelka) software.
In the tumor sample transcriptome alignment data that passed quality control in step 2, alternative splicing mutations were detected using ASNEO (https://github.com/bm2-lab/ASNEO) software, RNA editing mutations were detected using GIREMI (https://github.com/zhqingit/giremi) software, and RNA gene fusions were detected using STAR-Fusion (https://github.com/STAR-Fusion) software.
4. Human leukocyte antigen (HLA) typing of the blood leukocyte sample.
HLA typing was performed on the blood leukocyte sample WES alignment data that passed quality control in step 2, using HLAminer (https://github.com/bcgsc/HLAminer) software to detect the major histocompatibility complex (MHC) class I antigen subtypes.
5. Mutation event screening.
The four types of mutations detected in step 3 (single nucleotide mutations and small-fragment insertion and/or deletion mutations, alternative splicing mutations, RNA editing mutations, and gene fusions) were screened to obtain the true mutations (FIG. 2). The screening criteria were as follows:
(1) Screening of single nucleotide mutations and small-fragment insertions and/or deletions: mutation sites that simultaneously meet the following 6 conditions are retained: 1) the mutation is a non-synonymous mutation; 2) the coverage at the mutation site is greater than or equal to 10 in the tumor sample (WES and RNA-seq) and greater than or equal to 5 in the blood leukocyte sample (WES); 3) the mutation frequency is greater than or equal to 0.2 in the valid alignment data of the tumor sample (WES and RNA-seq) and less than or equal to 0.02 in the valid alignment data of the corresponding blood leukocyte sample (WES); 4) the RNA expression level (RNA-seq) of the mutated gene in the tumor sample, measured as FPKM (Fragments Per Kilobase of transcript per Million fragments mapped), is greater than or equal to 1; 5) the transcript support level (TSL) of the transcript of the mutated gene is less than or equal to 1; 6) the IGV (Integrative Genomics Viewer) image of the mutation has been checked manually and the mutation confirmed as real;
The TSL indicates how well a gene transcript model is supported. It is determined by the pVAC-Seq software and evaluated from the mRNA and EST alignments provided by UCSC and Ensembl; each gene is assigned a number from 1 to 5, with lower numbers representing stronger support.
(2) Screening of alternative splicing mutations: mutations that simultaneously meet the following 4 conditions are retained: 1) the alternative splicing mutation has a population frequency of less than 1% in the GTEx database (https://gtexportal.org/home/); 2) the coverage at the alternative splicing mutation site in the valid alignment data (the number of sequencing reads aligned to the site) is greater than or equal to 10; 3) the 5'/3' splice site frequency is greater than or equal to 0.1 (the frequency can be calculated with sj2psi (https://pypi.org/project/sj2psi/) software); 4) the immunogenicity score (calculated by ASNEO software) is greater than or equal to 0.5;
(3) Screening of RNA editing mutations: mutations that simultaneously meet the following 6 conditions are retained: 1) the mutation is a non-synonymous mutation; 2) the mutation site is recorded in the RADAR (doi: 10.1093/nar/gkt996) or DARNED (doi: 10.1093/bioinformatics/btq285) database (these 2 databases record pathogenic, rare variant sites); 3) the mutation site has a population frequency of less than 5% in the 1000 Genomes (https://www.internationalgenome.org/) or ESP (https://evs.gs.washington.edu/EVS/) database (these sites are pathogenic, rare variant sites); 4) coverage and mutation frequency: in the RNA-seq valid alignment data of the tumor sample the coverage at the mutation site is greater than or equal to 10 and the mutation frequency is greater than or equal to 0.2, while the mutation frequency at the site in the DNA sequencing valid alignment data is equal to 0 (i.e., the mutation is absent from the whole-exome sequencing data); 5) the transcript support level (TSL) of the transcript of the mutated gene is less than or equal to 1; 6) the IGV image of the mutation has been checked manually and the mutation confirmed as real;
(4) Screening of gene fusions: mutations that simultaneously meet the following 4 conditions are retained: 1) the fusion site of the two fused genes lies within a gene coding region; 2) the coverage at the fusion breakpoint (the point at which the two genes are broken and joined) in the valid alignment data is greater than or equal to 2; 3) the mutation frequency (FFPM) of the fusion gene in the valid alignment data of the tumor sample is greater than or equal to 0.1; 4) the IGV image of the fusion site has been checked manually and the fusion confirmed as real.
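The four sets of screening conditions can be written as four predicates, sketched below in Python. All function and parameter names are hypothetical, each value is assumed to have been extracted from the upstream tool outputs, and manual IGV review is represented by a boolean flag. Reading "(WES and RNA-seq)" in conditions 2) and 3) of item (1) as requiring both data types to meet the threshold is an interpretation, not something the source states explicitly.

```python
def keep_snv_indel(*, nonsynonymous: bool, tumor_wes_cov: int,
                   tumor_rna_cov: int, normal_cov: int,
                   tumor_wes_vaf: float, tumor_rna_vaf: float,
                   normal_vaf: float, fpkm: float, tsl: int,
                   igv_confirmed: bool) -> bool:
    """Item (1): six conditions for SNVs and small indels."""
    return (nonsynonymous
            and min(tumor_wes_cov, tumor_rna_cov) >= 10  # assumed: both
            and normal_cov >= 5
            and min(tumor_wes_vaf, tumor_rna_vaf) >= 0.2  # assumed: both
            and normal_vaf <= 0.02
            and fpkm >= 1
            and tsl <= 1
            and igv_confirmed)


def keep_splicing(*, gtex_freq: float, coverage: int,
                  splice_site_freq: float, immunogenicity: float) -> bool:
    """Item (2): four conditions for alternative splicing mutations."""
    return (gtex_freq < 0.01 and coverage >= 10
            and splice_site_freq >= 0.1 and immunogenicity >= 0.5)


def keep_rna_editing(*, nonsynonymous: bool, in_radar_or_darned: bool,
                     pop_freq: float, rna_cov: int, rna_vaf: float,
                     dna_vaf: float, tsl: int, igv_confirmed: bool) -> bool:
    """Item (3): six conditions for RNA editing mutations."""
    return (nonsynonymous and in_radar_or_darned and pop_freq < 0.05
            and rna_cov >= 10 and rna_vaf >= 0.2 and dna_vaf == 0
            and tsl <= 1 and igv_confirmed)


def keep_fusion(*, in_coding_region: bool, breakpoint_cov: int,
                ffpm: float, igv_confirmed: bool) -> bool:
    """Item (4): four conditions for gene fusions."""
    return (in_coding_region and breakpoint_cov >= 2
            and ffpm >= 0.1 and igv_confirmed)


if __name__ == "__main__":
    # Made-up values for a single fusion call.
    print(keep_fusion(in_coding_region=True, breakpoint_cov=3,
                      ffpm=0.25, igv_confirmed=True))
```

Keyword-only arguments are used so that each threshold stays visibly tied to the quantity it applies to.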
6. Neoantigen prediction.
Neoantigen binding affinity was predicted for the mutation events retained in step 5, based on the human leukocyte antigen (HLA) types detected in step 4.
For the three types of mutation events comprising single nucleotide mutations and small-fragment insertions and/or deletions, RNA editing, and gene fusion, the binding score between the mutant polypeptide corresponding to each mutation and the HLA class I antigen was predicted using pVAC-Seq (https://pvac-seq.readthedocs.io/en/v4.0.6/index.html) software, and the predicted binding scores were screened by the following criteria to obtain the tumor (lung adenocarcinoma) neoantigens:
For single nucleotide mutations, small-fragment insertion and/or deletion mutations and RNA editing mutations, mutations are selected for which the binding score between the corresponding mutant polypeptide and the HLA class I antigen is less than 500, and the ratio of that score to the binding score between the corresponding wild-type polypeptide and the HLA class I antigen is less than 1;
For gene fusion mutation events, fusion mutations are retained for which the binding score of the corresponding mutant polypeptide is less than 500;
For alternative splicing mutation events, the antigen binding of the mutant polypeptide corresponding to each mutation was predicted using NetMHCpan (https://services.healthtech.dtu.dk/service.php?NetMHCpan-4.0) software, and the results were screened by the following criteria: alternative splicing mutations are selected for which the binding score between the mutant polypeptide and the HLA class I antigen is less than 1000, and the ratio of that score to the binding score between the corresponding wild-type polypeptide and the HLA class I antigen is less than 1.
Among all the mutant polypeptides generated by one mutation event that meet the above filtering criteria, the optimal result (the mutant polypeptide with the lowest HLA binding score) is selected as the final candidate neoantigen.
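The binding-score criteria and the final selection step can be sketched as below. This is an illustration under stated assumptions rather than the pVAC-Seq or NetMHCpan interface: the `Peptide` record, the event-type labels and the peptide sequences are made up, and the scores are assumed to have been parsed from the predictors' output beforehand. A lower score is treated as stronger binding, consistent with choosing the minimum score as the optimal result.

```python
from typing import List, NamedTuple, Optional


class Peptide(NamedTuple):
    """Hypothetical record for one predicted mutant polypeptide."""
    sequence: str
    mt_score: float            # binding score of the mutant polypeptide
    wt_score: Optional[float]  # wild-type score; None when unavailable


def passes_binding(p: Peptide, event_type: str) -> bool:
    """event_type is one of 'snv_indel', 'rna_editing', 'fusion',
    'splicing' (labels chosen for this sketch)."""
    if event_type == "fusion":
        # Fusions have only a mutant score, so no ratio test applies.
        return p.mt_score < 500
    cutoff = 1000 if event_type == "splicing" else 500
    if p.wt_score is None or p.wt_score <= 0:
        return False
    return p.mt_score < cutoff and p.mt_score / p.wt_score < 1


def best_candidate(peptides: List[Peptide],
                   event_type: str) -> Optional[Peptide]:
    """Among passing peptides of one event, keep the lowest mutant score."""
    passing = [p for p in peptides if passes_binding(p, event_type)]
    return min(passing, key=lambda p: p.mt_score) if passing else None


if __name__ == "__main__":
    candidates = [
        Peptide("AAAAAAAAA", 420.0, 900.0),  # passes, weaker binder
        Peptide("CCCCCCCCC", 120.0, 300.0),  # passes, strongest binder
        Peptide("DDDDDDDDD", 50.0, 40.0),    # fails: ratio is above 1
    ]
    print(best_candidate(candidates, "snv_indel"))
```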
Example 2. Application of the method for identifying tumor neoantigens at the multi-omics level based on high-throughput sequencing technology.
This example includes 36 lung adenocarcinoma samples of clinical origin. Each patient was tested using the method established in Example 1 for identifying tumor neoantigens at the multi-omics level based on high-throughput sequencing technology. The results are shown in FIG. 3, where each column represents a clinical sample, each row represents the gene affected by a mutation event, and different shapes represent different types of mutation events: DNA_solid (filled circles) represents single nucleotide mutations and small-fragment insertion and/or deletion mutations; RNA_AS (open triangles) represents alternative splicing mutations; RNA_coding (filled triangles) represents RNA editing mutations; RNA_fusion (open circles) represents gene fusion mutations. The label on the left of each row is the gene or fusion gene that produced the neoantigen, and the value on the right of each row is the proportion of the 36 samples carrying the mutation. The neoantigen detection results for the 36 samples are shown in Table 1.
TABLE 1 results of neoantigen detection for 36 samples
Note: mutation type A represents DNA_solid, B represents RNA_AS, C represents RNA_coding, and D represents RNA_fusion. RNA_fusion has only a mutant binding score; the wild-type binding score cannot be assessed and is therefore denoted NA.
Taking the detection data of sample 33 as an example (see Tables 2-5): in addition to the 3 genomic-level neoantigen polypeptides generated by genomic mutations detectable by traditional methods (KLHDC1, ZNF214, TP53), the integration algorithm of the invention also detected RNA editing mutations at 2 sites in the TUBGCP2 gene and three gene fusion mutations: PCDHA11-PCDHAC1, CYP7B1-CPS1 and PCDHGA3-PCDHGC3. On the one hand, these additionally detected transcriptome-level neoantigen polypeptides reflect the neoantigen load of the tumor sample more realistically; on the other hand, tumor neoantigen polypeptides detected at the transcriptome level are generally more immunogenic, giving downstream researchers more room to screen and prepare effective cancer vaccines.
TABLE 2 Details of the DNA_solid mutations detected in the sample
Note: in the base mutation column, the base to the left of ">" is the reference genome base and the base to the right is the mutated base; in the amino acid mutation column, the amino acid to the left of ">" is the reference amino acid and the amino acid to the right is the mutated amino acid.
TABLE 3 Details of the RNA_AS mutations detected in the sample
TABLE 4 Details of the RNA_edition mutations detected in the sample
Note: in the base mutation column, the base to the left of ">" is the reference genome base and the base to the right is the mutated base; in the amino acid mutation column, the amino acid to the left of ">" is the reference amino acid and the amino acid to the right is the mutated amino acid.
TABLE 5 Details of the RNA_fusion mutations detected in the sample
Note: the fusion sites detected are all in coding regions.
The present invention is described in detail above. It will be apparent to those skilled in the art that the present invention can be practiced in a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with respect to specific embodiments, it will be appreciated that the invention may be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.

Claims (5)

1. A method for identifying or aiding in the identification of a tumour neoantigen, characterised in that: the method comprises the following steps:
A1) Obtaining sequencing data of a tumor patient to be tested, wherein the sequencing data comprise whole-exome sequencing data of a tumor sample, transcriptome sequencing data of the tumor sample and whole-exome sequencing data of a blood leukocyte sample;
A2) Aligning the sequencing data to a reference genome to obtain alignment data, wherein the alignment data comprise whole-exome alignment data of the tumor sample, transcriptome alignment data of the tumor sample and whole-exome alignment data of the blood leukocyte sample, and obtaining mutation data of the tumor sample and HLA class I antigens of the blood leukocyte sample from the alignment data; the mutation data comprise single nucleotide mutations and small-fragment insertion and/or deletion mutations detected in the whole-exome alignment data of the tumor sample, and alternative splicing mutations, RNA editing mutations and RNA gene fusion mutations detected in the transcriptome alignment data of the tumor sample;
A3) Determining the mutant polypeptides corresponding to the mutation data based on the mutation data, and identifying whether each mutant polypeptide is a neoantigen or a candidate neoantigen according to the binding capacity between the mutant polypeptide and the HLA class I antigens;
In A2), obtaining the mutation data of the tumor sample includes the following steps:
A2-1) Screening the single nucleotide mutations and small-fragment insertion and/or deletion mutations in the mutation data according to: whether the mutation is a non-synonymous mutation; the coverage of the mutation in the tumor sample alignment data and in the blood leukocyte sample alignment data; the mutation frequency of the mutation in the tumor sample alignment data and in the blood leukocyte sample alignment data; the expression level, in the transcriptome data, of the gene in which the mutation is located; the transcript support level of the transcript of the gene in which the mutation is located; and the result of manually inspecting the sequencing-read coverage at the mutation site; the screening is as follows: among the single nucleotide mutations and small-fragment insertion and/or deletion mutations, retaining the non-synonymous mutations for which the coverage in the tumor sample is greater than or equal to 10, the coverage in the blood leukocyte sample alignment data is greater than or equal to 5, the mutation frequency in the tumor sample alignment data is greater than or equal to 0.2, the mutation frequency in the blood leukocyte sample alignment data is less than or equal to 0.02, the RNA expression level (FPKM) of the corresponding gene in the transcriptome alignment data is greater than or equal to 1, the transcript support level of the transcript of the gene in which the mutation is located is less than or equal to 1, and which pass manual inspection of the sequencing-read coverage at the mutation site, the retained mutations being the screened single nucleotide mutations and small-fragment insertion and/or deletion mutations; the alignment data include genomic WES data and transcriptome RNA-seq data;
A2-2) Screening the alternative splicing mutations in the mutation data according to: whether the mutation is a rare mutation site; the coverage of the mutation site in the sequencing data corresponding to the mutation; the frequency of the 5'/3' splice site corresponding to the mutation; and the immunogenicity score of the mutant polypeptide corresponding to the mutation; the screening is as follows: among the alternative splicing mutations, retaining the rare mutation sites for which the coverage of the mutation site in the corresponding alignment data is greater than or equal to 10, the frequency of the corresponding 5'/3' splice site is greater than or equal to 0.1, and the immunogenicity score of the corresponding mutant polypeptide is greater than or equal to 0.5; the rare mutation sites are mutation sites with a population frequency of less than 1% in the GTEx database;
A2-3) Screening the RNA editing mutations in the mutation data according to: whether the mutation is a non-synonymous mutation site; whether the mutation is a rare mutation site; the population frequency of the mutation; the coverage and mutation frequency of the mutation in the alignment data of the tumor sample; whether the mutation is detected as a mutation in the whole-exome alignment data of the tumor sample; the expression level, in the transcriptome data, of the gene in which the mutation is located; the transcript support level of the transcript of the gene in which the mutation is located; and the result of manually inspecting the sequencing-read coverage at the mutation site; the screening is as follows: among the RNA editing mutations, retaining the non-synonymous mutation sites that are rare mutation sites, for which the coverage of the mutation site in the transcriptome RNA-seq alignment data is greater than or equal to 10, the mutation frequency in the transcriptome RNA-seq data is greater than or equal to 0.2, the mutation frequency in the WES data of the tumor sample is equal to 0, the transcript support level of the transcript of the gene in which the mutation is located is less than or equal to 1, and which pass manual inspection of the sequencing-read coverage at the mutation site, the retained mutations being the screened RNA editing mutations; the rare mutation sites are sites that are recorded in the RADAR or DARNED database and have a population frequency of less than 5% in the 1000 Genomes or ESP database;
A2-4) Screening the RNA gene fusion mutations in the mutation data according to: the coverage of the corresponding fusion breakpoint in the sequencing data; the mutation frequency of the mutation in the alignment data of the tumor sample; whether the mutation is a fusion mutation within a gene coding region; and the result of manually inspecting the sequencing-read coverage at the mutation site; the screening is as follows: among the RNA gene fusion mutations, retaining the mutations for which the corresponding fusion breakpoint in the alignment data lies within a gene coding region, the coverage of the fusion breakpoint in the valid data is greater than or equal to 2, the mutation frequency (FFPM) in the alignment data of the tumor sample is greater than or equal to 0.1, and which pass manual inspection of the sequencing-read coverage at the mutation site, the retained mutations being the screened RNA gene fusion mutations;
Identifying whether the mutant polypeptide is a neoantigen according to the binding capacity between the mutant polypeptide and the HLA class I antigens is at least one of the following:
A3-1) A mutant polypeptide produced by a single nucleotide mutation, a small-fragment insertion and/or deletion mutation and/or an RNA editing mutation that satisfies condition A is a neoantigen or a candidate neoantigen, and a mutant polypeptide produced by such a mutation that does not satisfy condition A is not a neoantigen or not a candidate neoantigen; condition A is that the binding score between the mutant polypeptide and the HLA class I antigen is less than 500, and the ratio of that score to the binding score between the wild-type polypeptide corresponding to the mutant polypeptide and the HLA class I antigen is less than 1;
A3-2) A mutant polypeptide produced by an RNA gene fusion mutation that satisfies condition B is a neoantigen or a candidate neoantigen, and a mutant polypeptide produced by an RNA gene fusion mutation that does not satisfy condition B is not a neoantigen or not a candidate neoantigen; condition B is that the binding score of the mutant polypeptide is less than 500;
A3-3) A mutant polypeptide produced by an alternative splicing mutation that satisfies condition C is a neoantigen or a candidate neoantigen, and a mutant polypeptide produced by an alternative splicing mutation that does not satisfy condition C is not a neoantigen or not a candidate neoantigen; condition C is that the binding score between the mutant polypeptide and the HLA class I antigen is less than 1000, and the ratio of that score to the binding score between the wild-type polypeptide corresponding to the mutation and the HLA class I antigen is less than 1.
2. A device for identifying or aiding in the identification of a tumour neoantigen, characterised in that: the device comprises the following modules:
B1) Sequencing data acquisition module: configured to obtain sequencing data of a tumor patient to be tested, wherein the sequencing data comprise whole-exome sequencing data of a tumor sample, transcriptome sequencing data of the tumor sample and whole-exome sequencing data of a blood leukocyte sample;
B2) Mutation detection module: configured to align the sequencing data to a reference genome to obtain alignment data, wherein the alignment data comprise whole-exome alignment data of the tumor sample, transcriptome alignment data of the tumor sample and whole-exome alignment data of the blood leukocyte sample, and to obtain mutation data of the tumor sample and HLA class I antigens of the blood leukocyte sample from the alignment data; the mutation data comprise single nucleotide mutations and small-fragment insertion and/or deletion mutations in the whole-exome alignment data of the tumor sample, and alternative splicing mutations, RNA editing mutations and RNA gene fusion mutations in the transcriptome alignment data of the tumor sample;
B3) Neoantigen identification module: configured to determine the mutant polypeptides corresponding to the mutation data based on the mutation data, and to identify whether each mutant polypeptide is a neoantigen or a candidate neoantigen according to the binding capacity between the mutant polypeptide and the HLA class I antigens;
In B2), the mutation data of the tumor sample are obtained by a method comprising the following steps:
b2-1) screening for single nucleotide mutations and insertion and/or deletion mutations based on whether the single nucleotide mutations and small fragment insertion and/or deletion mutations in the mutation data are nonsensical mutations, the coverage of the mutations in the tumor sample alignment data and the blood leukocyte sample alignment data, the mutation frequency of the mutations in the tumor sample and the blood leukocyte sample alignment data, the expression level of the gene at which the mutations are located in the transcriptome data, the transcription support level value of the gene transcript at which the mutations are located, and the coverage detection result of the manual check mutation site sequencing read length of the mutations; the screening is as follows: retaining the nonsensical mutation in the mononucleotide mutation and the small fragment insertion and/or deletion mutation, wherein the coverage in the tumor sample is more than or equal to 10, the coverage in the blood leukocyte sample comparison data is more than or equal to 5, the mutation frequency in the comparison data of the tumor sample is more than or equal to 0.2, the mutation frequency in the comparison data of the blood leukocyte sample is less than or equal to 0.02, the RNA expression amount FPKM of the corresponding gene in the transcriptome comparison data is more than or equal to 1, the transcription support level of the gene transcript in which the mutation is positioned is less than or equal to 1, and the mutation of the coverage condition of the sequencing read length of the mutation site through manual examination is the mononucleotide mutation and the small fragment insertion and/or deletion mutation after screening; the alignment data includes genomic WES data and transcriptome RNA-seq data;
B2-2) screening for selective splice mutations based on whether the selective splice mutation in the mutation data is a rare mutation site, the coverage of the mutation site in the sequencing data corresponding to the mutation, the frequency of the 5'/3' splice site corresponding to the mutation, and the immunogenicity score of the mutant polypeptide corresponding to the mutation; the screening is as follows: retaining rare mutation sites in the selective shearing mutation, wherein the coverage of mutation sites in the comparison data corresponding to the mutation is more than or equal to 10, the frequency of 5'/3' shearing sites corresponding to the mutation is more than or equal to 0.1, and the immunogenicity score of mutant polypeptides corresponding to the mutation is more than or equal to 0.5; the rare mutation sites are rare mutation sites with the crowd frequency less than 1% in the GTEx database;
B2-3) screening RNA editing mutations according to: whether the RNA editing mutation in the mutation data is a non-synonymous mutation site; whether the mutation is a rare mutation site; the population frequency of the mutation; the coverage and mutation frequency of the mutation in the alignment data of the tumor sample; whether the mutation is detected in the whole-exome (WES) alignment data of the tumor sample; the expression level, in the transcriptome data, of the gene in which the mutation is located; the transcript support level value of the gene transcript in which the mutation is located; and the result of manually checking the sequencing read coverage at the mutation site; the screening is as follows: among the RNA editing mutations, retaining as the screened RNA editing mutations those that are non-synonymous mutation sites, are rare mutation sites, have a coverage greater than or equal to 10 in the transcriptome RNA-seq alignment data, have a mutation frequency greater than or equal to 0.2 in the transcriptome RNA-seq data, have a mutation frequency equal to 0 in the WES data of the tumor sample, have a transcript support level less than or equal to 1 for the gene transcript in which the mutation is located, and pass the manual check of sequencing read coverage at the mutation site; a rare mutation site is a site that is recorded in the RADAR or DARNED database and has a population frequency of less than 5% in the 1000 Genomes (1000g) or ESP database;
B2-4) screening RNA gene fusion mutations according to: the coverage, in the sequencing data, of the fusion breakpoint corresponding to the RNA gene fusion mutation in the mutation data; the mutation frequency of the mutation in the alignment data of the tumor sample; whether the mutation is a fusion within a gene coding region; and the result of manually checking the sequencing read coverage at the mutation site; the screening is as follows: among the RNA gene fusion mutations, retaining as the screened RNA gene fusion mutations those whose fusion breakpoint in the alignment data lies within a gene coding region, whose fusion breakpoint coverage in the effective data is greater than or equal to 2, whose mutation frequency (FFPM) in the alignment data of the tumor sample is greater than or equal to 0.1, and which pass the manual check of sequencing read coverage at the mutation site;
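For illustration only, the four screening rules above can be read as boolean predicates over per-mutation annotations. The following is a minimal Python sketch and not the claimed implementation: every function and parameter name is hypothetical, the manual read-coverage check is modeled as a flag supplied by a reviewer, and frequencies are assumed to be fractions (1% = 0.01).

```python
# Hypothetical predicates mirroring the thresholds stated in the claim text.
# All names are assumptions introduced for this sketch.

def keep_snv_indel(*, non_synonymous, tumor_cov, blood_cov, tumor_af,
                   blood_af, fpkm, tsl, manual_ok):
    """Single nucleotide mutations and small insertions/deletions."""
    return bool(
        non_synonymous
        and tumor_cov >= 10     # coverage in tumor alignment data
        and blood_cov >= 5      # coverage in blood leukocyte alignment data
        and tumor_af >= 0.2     # mutation frequency in tumor
        and blood_af <= 0.02    # mutation frequency in blood leukocytes
        and fpkm >= 1           # gene expression in RNA-seq
        and tsl <= 1            # transcript support level
        and manual_ok           # manual read-coverage check passed
    )


def keep_alt_splicing(*, gtex_population_af, site_cov, splice_site_freq,
                      immunogenicity):
    """Alternative splicing mutations; rare means < 1% in GTEx."""
    return bool(
        gtex_population_af < 0.01
        and site_cov >= 10
        and splice_site_freq >= 0.1   # 5'/3' splice-site frequency
        and immunogenicity >= 0.5
    )


def keep_rna_editing(*, non_synonymous, in_editing_db, population_af,
                     rna_cov, rna_af, wes_af, tsl, manual_ok):
    """RNA editing mutations; rare means recorded in an RNA-editing
    database and < 5% population frequency."""
    return bool(
        non_synonymous
        and in_editing_db and population_af < 0.05
        and rna_cov >= 10
        and rna_af >= 0.2
        and wes_af == 0         # not seen in tumor WES data
        and tsl <= 1
        and manual_ok
    )


def keep_fusion(*, in_coding_region, breakpoint_cov, ffpm, manual_ok):
    """RNA gene fusion mutations."""
    return bool(
        in_coding_region
        and breakpoint_cov >= 2
        and ffpm >= 0.1
        and manual_ok
    )
```

Keyword-only arguments are used so that each threshold is tied to a named quantity at the call site, which makes it harder to swap, for example, tumor and blood coverage by accident.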
identifying whether the mutant polypeptide is a neoantigen based on the binding capacity of the mutant polypeptide to the HLA class I antigen comprises at least one of the following:
B3-1) a mutant polypeptide produced by a single nucleotide mutation, a small-fragment insertion and/or deletion mutation and/or an RNA editing mutation that satisfies condition A is a neoantigen or a candidate neoantigen, and a mutant polypeptide produced by a single nucleotide mutation, a small-fragment insertion or deletion mutation and/or an RNA editing mutation that does not satisfy condition A is a non-neoantigen or a candidate non-neoantigen; condition A is that the binding affinity score of the mutant polypeptide to the HLA class I antigen is less than 500, and the ratio of the binding affinity score of the mutant polypeptide to the HLA class I antigen to the binding affinity score of the wild-type polypeptide corresponding to the mutant polypeptide is less than 1;
B3-2) a mutant polypeptide produced by an RNA gene fusion mutation that satisfies condition B is a neoantigen or a candidate neoantigen, and a mutant polypeptide produced by an RNA gene fusion mutation that does not satisfy condition B is a non-neoantigen or a candidate non-neoantigen; condition B is that the neoantigen binding affinity score of the mutant polypeptide is less than 500;
B3-3) a mutant polypeptide produced by an alternative splicing mutation that satisfies condition C is a neoantigen or a candidate neoantigen, and a mutant polypeptide produced by an alternative splicing mutation that does not satisfy condition C is a non-neoantigen or a candidate non-neoantigen; condition C is that the binding affinity score of the mutant polypeptide to the HLA class I antigen is less than 1000, and the ratio of the binding affinity score of the mutant polypeptide to the HLA class I antigen to the binding affinity score of the wild-type polypeptide corresponding to the mutation to the HLA class I antigen is less than 1.
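For illustration only, conditions A, B and C above can be expressed as a single decision function keyed on the mutation class. This is a minimal Python sketch under stated assumptions, not the claimed implementation: the function name, the mutation-class labels and the requirement that the wild-type score be positive are all hypothetical choices made for the example.

```python
from typing import Optional

# Hypothetical mutation-class labels, introduced only for this sketch.
CONDITION_A = {"snv", "indel", "rna_editing"}
CONDITION_B = {"fusion"}
CONDITION_C = {"alt_splicing"}


def is_candidate_neoantigen(mutation_class: str,
                            mutant_score: float,
                            wild_type_score: Optional[float] = None) -> bool:
    """Apply condition A, B or C to a mutant polypeptide.

    mutant_score / wild_type_score are the HLA class I binding scores of
    the mutant and corresponding wild-type polypeptides.
    """
    if mutation_class in CONDITION_B:
        # Condition B: score threshold only, no wild-type comparison.
        return mutant_score < 500

    if mutation_class in CONDITION_A:
        threshold = 500
    elif mutation_class in CONDITION_C:
        threshold = 1000
    else:
        raise ValueError(f"unknown mutation class: {mutation_class!r}")

    # Conditions A and C also need the mutant/wild-type ratio.
    if wild_type_score is None or wild_type_score <= 0:
        raise ValueError("a positive wild-type score is required")
    ratio = mutant_score / wild_type_score
    return mutant_score < threshold and ratio < 1
```

Under these assumptions, `is_candidate_neoantigen("alt_splicing", 800, 900)` is accepted by condition C, while the same scores for an `"snv"` are rejected because condition A requires a mutant score below 500.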
3. A computer-readable storage medium storing a computer program, characterized in that the computer program causes a computer to perform the steps of the method of claim 1.
4. Use of the method of claim 1 and/or the device of claim 2 in any of the following:
C1) the preparation of a tumor vaccine;
C2) the development or preparation of tumor-related medicaments.
5. Use of the computer-readable storage medium of claim 3 in any of the following:
D1) the preparation of a tumor vaccine;
D2) the development or preparation of tumor-related medicaments.
CN202310750011.0A 2023-06-25 2023-06-25 Method, device and computer readable storage medium for identifying tumor neoantigen at multiple groups of chemical layers based on high-throughput sequencing technology Active CN116825188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310750011.0A CN116825188B (en) 2023-06-25 2023-06-25 Method, device and computer readable storage medium for identifying tumor neoantigen at multiple groups of chemical layers based on high-throughput sequencing technology

Publications (2)

Publication Number Publication Date
CN116825188A (en) 2023-09-29
CN116825188B (en) 2024-04-09

Family

ID=88116110

Country Status (1)

Country Link
CN (1) CN116825188B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388773A (en) * 2018-02-01 2018-08-10 杭州纽安津生物科技有限公司 A kind of identification method of tumor neogenetic antigen
CN108491689A (en) * 2018-02-01 2018-09-04 杭州纽安津生物科技有限公司 Tumour neoantigen identification method based on transcript profile
CN109801678A (en) * 2019-01-25 2019-05-24 上海鲸舟基因科技有限公司 Based on the tumour antigen prediction technique of full transcript profile and its application
CN110600077A (en) * 2019-08-29 2019-12-20 北京优迅医学检验实验室有限公司 Prediction method of tumor neoantigen and application thereof
CN111415707A (en) * 2020-03-10 2020-07-14 四川大学 Prediction method of clinical individualized tumor neoantigen
CN112210596A (en) * 2020-09-08 2021-01-12 中生康元生物科技(北京)有限公司 Tumor neoantigen prediction method based on gene fusion event and application thereof
CN113035272A (en) * 2021-03-08 2021-06-25 深圳市新合生物医疗科技有限公司 Method and apparatus for obtaining new antigens for immunotherapy based on endosomal cell variation
CN113160887A (en) * 2021-04-23 2021-07-23 哈尔滨工业大学 Screening method of tumor neoantigen fused with single cell TCR sequencing data
CN113533741A (en) * 2021-06-23 2021-10-22 深圳市新合生物医疗科技有限公司 Method for predicting new antigen based on polypeptide structural index
CN113956342A (en) * 2021-12-22 2022-01-21 北京大学人民医院 Tumor neogenesis antigen polypeptide and application thereof
CN114446389A (en) * 2022-02-08 2022-05-06 上海科技大学 Tumor neoantigen characteristic analysis and immunogenicity prediction tool and application thereof
CN115747327A (en) * 2022-04-15 2023-03-07 成都朗谷生物科技股份有限公司 Novel antigen prediction methods involving frameshift mutations
WO2023068931A1 (en) * 2021-10-21 2023-04-27 Curevac Netherlands B.V. Cancer neoantigens

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant