CN109584960A

CN109584960A - Method, apparatus and storage medium for predicting tumor neoantigens

Info

Publication number: CN109584960A
Application number: CN201811531729.6A
Authority: CN
Inventors: 叶浩; 李祥永; 戴珩
Original assignee: Shanghai Whale Boat Gene Technology Co Ltd
Current assignee: Xukang medical technology (Suzhou) Co., Ltd
Priority date: 2018-12-14
Filing date: 2018-12-14
Publication date: 2019-04-05
Anticipated expiration: 2038-12-14
Also published as: CN109584960B

Abstract

The present invention relates to a method for predicting tumor neoantigens, comprising the steps of: (1) performing somatic mutation and gene fusion detection on a tumor-germline control sample pair; (2) generating a fusion mutant peptide and the corresponding wild-type peptides for each pair of fused genes; (3) generating mutant peptides and the corresponding wild-type peptides from each somatic mutation; (4) constructing a personal genome specific to the tumor sample and generating mutant peptides containing multiple mutations; (5) judging whether the single-mutation and multi-mutation peptides truly exist; (6) removing mutant peptides whose sequence is identical to a sequence at another position of the wild-type protein; (7) performing HLA typing, predicting the affinity between the neopeptides and the HLA molecules, and taking the high-affinity neopeptides as candidate tumor neoantigens. The present invention also provides a corresponding apparatus and computer storage medium. With the method, apparatus and storage medium of the invention, biomarkers of tumor therapy response can be assessed more effectively, and accurate candidate peptides can be provided for tumor vaccine design.

Description

Method, apparatus and storage medium for predicting tumor neoantigens

Technical field

The present invention relates to the field of bioinformatics, more particularly to the discovery of biomarkers for tumor immunotherapy, and specifically to a method for predicting the tumor neoantigens formed by somatic mutations and gene fusions, and to its application.

Background art

Tumor neoantigens

Tumor neoantigens are "non-self" neopeptides that are recognized by human antigen-presenting cells and do not originally exist in the human body. These "non-self" neopeptides derive mainly from the mutant proteins formed by tumor cell mutations. Specifically, the biological process of neoantigen presentation is divided into 5 steps: 1. antigen-presenting cells (APCs) take up tumor cells by endocytosis and cleave the proteins in the tumor cells (including mutant proteins) into short peptides; 2. transporter proteins inside the APC (TAP, endosome) transport these peptides into the endoplasmic reticulum; 3. the grooves of the HLA class I and class II molecules expressed in the endoplasmic reticulum anchor and bind the peptides to form stable complexes (class I molecules bind peptides of 8~11 amino acids, class II molecules bind peptides of 13~25 amino acids); 4. the MHC-peptide complexes in the endoplasmic reticulum are delivered via the Golgi apparatus to the APC surface; 5. the T-cell receptor (TCR) on the surface of immune T cells recognizes the HLA-peptide complexes on the APC surface and triggers the subsequent immune response. Tumor neoantigens are the key factor that triggers the body's immune system to mount a primary immune response against tumor cells.

Application of tumor neoantigens in tumor immunity

Immunotherapy aims to restore the immune system's ability to recognize and kill tumor cells and thereby eliminate the tumor. Unlike traditional radiotherapy and chemotherapy, which kill healthy cells and tumor cells alike, and unlike agents such as tyrosine kinase inhibitors, which kill directly by blocking the growth and survival signaling pathways of tumor cells, immunotherapy is a new and efficient mode of tumor treatment. In 2013, cancer immunotherapy was ranked first among the "top ten scientific breakthroughs of the year" by the US journal Science. In 2018, the Nobel Prize also went to the field of immunotherapy. Although immunotherapy is developing rapidly, this strategy of using the immune system to attack tumors is effective only for certain cancers and a minority of patients. Without any biomarker screening, the overall response rate of most solid tumors is below 30%. For tumors screened with high microsatellite instability/mismatch repair deficiency as the biomarker, the overall response rate of PD1 therapy rises to 50% or more. Screening immunotherapy patients with suitable biomarkers is therefore the key to realizing precision medicine in tumor immunity. In October 2018, tumor mutational burden was formally written into the non-small cell lung cancer practice guidelines of the US National Comprehensive Cancer Network. Neoantigens, as the final effectors through which tumor mutational burden triggers an immune response, can become a more accurate biomarker for assessing the benefit of immunotherapy.

Personalized tumor vaccine therapy based on tumor neoantigens is another important application scenario. A tumor vaccine returns the neoantigens detected in a patient's tumor cells to the body in order to stimulate an immune response that specifically eliminates the tumor cells presenting these neoantigens. Currently, neoantigens are returned to the body in the form of peptides, nucleic acids, or DC cells induced in vitro. Ott (PMID:29542692), Sahin (PMID:28678784), Carreno (PMID:25837513) et al. applied predicted neoantigens to small skin cancer cohorts in these 3 tumor vaccine forms respectively, and achieved good therapeutic effects. In summary, tumor neoantigens can serve as a biomarker for assessing the benefit of immunotherapy and can also be applied directly in tumor vaccine therapy.

Existing neoantigen prediction pipelines and methods

Whole-exome hybrid-capture sequencing based on next-generation sequencing makes high-throughput detection of tumor somatic mutations possible. At present, the common neoantigen prediction pipeline is: 1. construct a neopeptide library: annotate the somatic mutations to the protein level, and traverse the region around each protein mutation site to generate mutant peptides of 8~11 and 13~25 amino acids and the corresponding wild-type peptides; 2. predict the affinity of the HLA molecules for the neopeptides and their corresponding wild-type peptides: use open-source affinity prediction software to predict the affinity of the neopeptides and wild-type peptides for the HLA molecules, and filter with empirical thresholds to screen out potential neoantigens. The affinity prediction software commonly used for class I and class II HLA molecules is netMHC, netMHCpan, netMHCII and netMHCIIpan, developed by the Technical University of Denmark (http://www.cbs.dtu.dk/services/). There are currently two mainstream open-source neoantigen prediction tools: Topiary (https://github.com/openvax/topiary), developed by the OpenVax project group at the Icahn School of Medicine at Mount Sinai, and pVACtools (https://github.com/griffithlab/pVACtools), developed by Malachi Griffith's laboratory at the McDonnell Genome Institute, Washington University. Both tools have been used in several research papers on large-sample TCGA tumor data published in authoritative journals such as Cancer Cell and Immunity (PMID:29657128, PMID:29628290). Since Topiary and pVACtools both rely on tools such as netMHC and netMHCpan for affinity prediction, that step is not described further here.

For neopeptide library generation, both tools generate single-mutation peptides, taking each mutation as the unit. This has an obvious design defect, as shown in Figure 1: if two or more cis mutations occur within 8~11 or 13~25 amino acids of each other, conventional methods lose the neopeptides that contain multiple mutations. Yet these missing neopeptides formed by multiple cis mutations may also be neoantigens that are truly immunogenic in the body.

In addition, these open-source pipelines simply treat every mutant peptide formed by a mutation as a neopeptide. In fact, the peptides formed by some mutations are not real neopeptides, because the same sequence may be present at another position of the wild-type protein, away from the mutation site. This is especially common when the mutation occurs in a low-complexity region (for example, an insertion or deletion in a repetitive sequence). For example, the peptide at positions 523~530 of the wild-type PRX protein is LKVSEMKL, and the peptide at positions 471~478 is PKVSEMKL. The mutation chr19:40902691A>G changes amino acid 523 of PRX from L to P. Although the resulting mutant peptide PKVSEMKL differs from the wild-type peptide at positions 523~530, it is identical to the peptide at positions 471~478 of the wild-type PRX protein. The mutant peptide therefore already exists in wild-type PRX and is not a real neopeptide for the body. These design defects of conventional methods directly affect the accuracy of neoantigen prediction.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the above prior art and to provide a method for predicting tumor neoantigens that assesses biomarkers of tumor therapy response more effectively and provides accurate candidate peptides for tumor vaccine design.

To achieve the above purpose, one aspect of the present invention provides a method for predicting tumor neoantigens, constituted as follows:

The method comprises the steps of:

(1) performing somatic mutation and gene fusion detection on a tumor-germline control sample pair;

(2) generating a fusion peptide and the corresponding wild-type peptides for each pair of fused genes;

(3) generating mutant peptides and the corresponding wild-type peptides from each somatic mutation;

(4) constructing a personal genome specific to the tumor sample and generating multi-mutation peptides containing multiple mutations;

(5) judging, from the cis/trans relationship between mutations, whether the single-mutation and multi-mutation peptides truly exist, and generating the mutant peptides that truly exist;

(6) removing mutant peptides whose sequence is identical to a sequence at another position of the wild-type protein, and constructing the complete neopeptide library;

(7) performing HLA typing based on the bam file of the germline control sample, predicting the affinity between the neopeptides and the HLA molecules, and taking the high-affinity neopeptides as candidate tumor neoantigens.

Preferably, for the somatic mutation detection in step (1), after the Mutect2 tool outputs the somatic mutation results under default parameters, further quality-control filtering is performed, comprising: the mutation frequency is greater than 2%; the sequencing depth at the mutation site is greater than 10; at least 2 reads support the mutation, and the average base quality of those reads is > 20.

Preferably, for the gene fusion detection in step (1): if the input is whole-exome (WES) or whole-genome (WGS) sequencing data, gene fusions are detected with the FACTERA tool under default parameters; if the input is RNAseq data, gene fusions are detected with the STAR-Fusion tool under default parameters. As a further quality-control step to reduce false positives, the number of junction reads must be >= 1, where junction reads are reads that directly cover the fusion breakpoint.

Preferably, in step (2), the fusion breakpoints are annotated with AGFusion: according to the genomic coordinates of the 5' and 3' breakpoints, AGFusion annotates the fusion breakpoint and synthesizes the full-length fusion protein sequence; fusion peptides of length L that contain the fusion breakpoint, and the corresponding 5' and 3' wild-type peptides, are then extracted.

Preferably, the specific extraction rules are:

Determine the coordinate indices of the 5' fusion breakpoint on the 5' wild-type protein of length p5 and on the fusion protein of length g: align the fusion protein with the 5' wild-type protein sequence to obtain the longest identical fragment seq1, the coordinate index m of seq1 on the 5' wild-type protein, and the coordinate index t of seq1 on the fusion protein; with the length of seq1 denoted s1, the coordinate index of the 5' breakpoint is m+s1 on the wild-type protein and t+s1 on the fusion protein;

Determine the coordinate index of the 3' fusion breakpoint on the 3' wild-type protein of length p3: align the fusion protein with the 3' wild-type protein to obtain the longest identical fragment seq2 and the coordinate index n of seq2 on the 3' wild-type protein, the length of seq2 being s2;

Extract the fusion peptides of length L and the corresponding 5' and 3' wild-type peptides:

When the 3' fusion breakpoint does not cause a frameshift, each fusion peptide has two corresponding wild-type peptides, one from the 5' end and one from the 3' end. On the fusion protein, fusion peptides of length L are extracted from the minimum coordinate index t+s1-L to the maximum coordinate index t; on the 5' wild-type protein, from the minimum coordinate index m+s1-L to the maximum coordinate index m+s1; on the 3' wild-type protein, from n-L to the maximum coordinate index n; this gives the two corresponding wild-type peptides of length L;

When the 3' fusion breakpoint causes a frameshift, each fusion peptide has only one corresponding wild-type peptide, from the 5' end. On the fusion protein, fusion peptides of length L are extracted from the minimum coordinate index t+s1-L to the maximum coordinate index g-L; on the 5' wild-type protein, the corresponding wild-type peptides of length L are generated successively from the minimum coordinate index m+s1-L to the maximum coordinate index p5-L.

Preferably, in step (3),

SnpEff is used to annotate the base mutations on each somatic genome to each transcript in the Ensembl database and to its corresponding protein sequence;

Mutant peptides of length L and the corresponding wild-type peptides are extracted.

Preferably, the extraction rules are:

For missense mutations and non-frameshift mutations, centered on the mutation coordinate, L-1 amino acids are taken toward the 5' end and L-1 amino acids toward the 3' end, generating mutant peptides of 8~11 and 13~25 amino acids that contain the mutated amino acid, together with the corresponding wild-type peptides;

For a typical single point mutation, this yields 38 mutant/wild-type peptide pairs of 8~11 amino acids (8+9+10+11, one pair per window position for each length) and 247 mutant/wild-type peptide pairs of 13~25 amino acids;

For frameshift mutations, extraction starts L-1 amino acids before the mutation site and extends until the first stop codon appears, generating mutant peptides of 8~11 and 13~25 amino acids and the wild-type peptides at the corresponding coordinates.
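The window rule for a missense mutation can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the 0-based indexing and the toy protein are assumptions. It slides a window of length L across the mutated residue and pairs each mutant peptide with the wild-type peptide at the same coordinates, which reproduces the 38 and 247 pair counts stated above.

```python
def peptide_pairs(wild, mutant, pos, length):
    """Yield (mutant_peptide, wild_peptide) windows of `length` residues covering `pos`.

    `pos` is the 0-based index of the substituted residue; `wild` and `mutant`
    must be the same length (missense mutation, no frameshift).
    """
    first = max(0, pos - length + 1)        # leftmost window start that still covers pos
    last = min(pos, len(mutant) - length)   # rightmost window start that still covers pos
    for start in range(first, last + 1):
        yield mutant[start:start + length], wild[start:start + length]


def all_pairs(wild, mutant, pos, lengths):
    """Collect the peptide pairs for every requested peptide length."""
    return [pair for n in lengths for pair in peptide_pairs(wild, mutant, pos, n)]


# Toy 60-residue protein (hypothetical), with residue 30 changed from M to W.
wild = "ACDEFGHIKLMNPQRSTVWY" * 3
pos = 30
mutant = wild[:pos] + "W" + wild[pos + 1:]

class_i = all_pairs(wild, mutant, pos, range(8, 12))     # 8~11 amino acids
class_ii = all_pairs(wild, mutant, pos, range(13, 26))   # 13~25 amino acids
print(len(class_i), len(class_ii))  # 38 247
```

Near the ends of a protein fewer windows fit, so the counts are upper bounds for a mutation far enough from both termini.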

Preferably, in step (4),

All mutation information in the somatic mutation VCF file is imported into the human reference genome in a single pass, replacing the wild-type bases at the original coordinates, to generate the complete personal genome of the tumor sample;

The mutation-containing transcripts edited at the genome base level are translated into mutant protein sequences with the Biopython tool;

Based on the mutant protein coordinate index of each transcript annotated for the single mutations in step (3), mutant peptides of 8~11 and 13~25 amino acids containing the mutations are generated according to the peptide extraction rules of step (3);

In step (5):

The bam file of the tumor sample is read with pysam, the reads aligned to each mutation site are exported, and the Jaccard coefficient between every pair of mutation sites is calculated as $J(i,j) = \frac{|f(i) \cap f(j)|}{|f(i) \cup f(j)|}$, where f(i) and f(j) denote the sets of NGS reads supporting mutation i and mutation j respectively. When the Jaccard coefficient is 0, no tumor subclone carries both mutation i and mutation j, and the multi-mutation peptides generated in step (4) that contain both mutation i and mutation j must be removed. When the Jaccard coefficient is 1, all tumor subclones carry both mutation i and mutation j, and the single-mutation peptides generated in step (3) that contain only mutation i or only mutation j must be removed. When the Jaccard coefficient lies between 0 and 1, both the single-mutation peptides generated in step (3) that contain mutation i or mutation j and the multi-mutation peptides generated in step (4) that contain both mutation i and mutation j are retained.

Preferably, in step (7), 5 different next-generation-sequencing HLA typing tools are used, and the HLA typing result with the highest consistency among them is taken in order to reduce false positives; preferably, Polysolver, HLA-HD, HLA-PRG-LA, OptiType and Hla-genotyper are used to compute the HLA typing: each HLA allele in the 8 major HLA classes starts with a score of 0, the score is incremented by 1 each time the allele is detected by one tool, and the highest-scoring HLA allele is taken as the final typing result for each HLA class;
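The voting scheme can be sketched as below. This is an illustrative sketch only: the allele calls are invented, the locus is assumed to be the part of the allele name before `*`, and ties are simply returned together, which the text does not specify.

```python
from collections import Counter, defaultdict


def score_alleles(calls_by_tool):
    """Return {locus: Counter(allele -> number of tools that reported it)}."""
    scores = defaultdict(Counter)
    for alleles in calls_by_tool.values():
        for allele in set(alleles):          # a tool votes at most once per allele
            locus = allele.split("*")[0]     # assumption: "HLA-A*02:01" -> "HLA-A"
            scores[locus][allele] += 1
    return scores


def consensus(calls_by_tool):
    """Return {locus: sorted list of the highest-scoring allele(s)}."""
    result = {}
    for locus, counter in score_alleles(calls_by_tool).items():
        best = max(counter.values())
        result[locus] = sorted(a for a, s in counter.items() if s == best)
    return result


# Hypothetical calls from the five tools named in the text.
calls = {
    "Polysolver":    ["HLA-A*02:01", "HLA-A*24:02", "HLA-B*07:02"],
    "HLA-HD":        ["HLA-A*02:01", "HLA-A*24:02", "HLA-B*07:02"],
    "HLA-PRG-LA":    ["HLA-A*02:01", "HLA-A*11:01", "HLA-B*07:02"],
    "OptiType":      ["HLA-A*02:01", "HLA-A*24:02", "HLA-B*07:02"],
    "Hla-genotyper": ["HLA-A*02:01", "HLA-A*24:02", "HLA-B*08:01"],
}
scores = score_alleles(calls)
typing = consensus(calls)
print(typing)
```

A real pipeline would also need to normalize allele resolution across tools before counting, since the tools do not all report the same number of fields.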

The affinity between the neopeptides and the HLA molecules is predicted; specifically, when the affinity between a neopeptide and an HLA molecule is ≤ 500 nM and its relative rank is ≤ 2%, the neopeptide is identified as a candidate tumor neoantigen.
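The two thresholds can be applied as a simple filter over prediction records. The record layout below is an assumption for illustration (it is not the output format of any particular affinity predictor), and the peptides and values are invented.

```python
from dataclasses import dataclass


@dataclass
class BindingPrediction:
    peptide: str
    allele: str
    affinity_nm: float     # predicted binding affinity in nM (lower binds more strongly)
    rank_percent: float    # relative rank in percent (lower is better)


def is_candidate(pred, max_affinity_nm=500.0, max_rank_percent=2.0):
    """Both conditions from the text must hold: affinity <= 500 nM and rank <= 2%."""
    return pred.affinity_nm <= max_affinity_nm and pred.rank_percent <= max_rank_percent


# Invented example predictions.
predictions = [
    BindingPrediction("TAYIAKQRW", "HLA-A*02:01", 85.0, 0.4),    # passes both thresholds
    BindingPrediction("AYIAKQRWL", "HLA-A*02:01", 480.0, 3.1),   # rank too high
    BindingPrediction("YIAKQRWLP", "HLA-A*02:01", 2200.0, 1.5),  # affinity too weak
]
candidates = [p.peptide for p in predictions if is_candidate(p)]
print(candidates)  # ['TAYIAKQRW']
```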

The present invention also provides the use of the tumor neoantigens predicted as described above in the preparation of anti-tumor drugs or vaccines.

With the tumor neoantigen prediction of the invention and its application, generating a personal genome specific to the tumor sample remedies two major defects of current mainstream methods: 1. multi-mutation peptides are wrongly split up or lost outright; 2. mutant peptides formed by mutations in low-complexity regions (which are often also found in the wild-type protein) are mistaken for neopeptides. The neoantigen prediction method of the invention can therefore reflect the true neoantigen status of a tumor sample more accurately and more completely. In data from 13 hepatocellular carcinomas that received immunotherapy, it was confirmed that the neoantigen burden computed by the invention can be applied effectively to assessing the benefit of tumor immunotherapy. Because the accurate and comprehensive neoantigen prediction of the invention provides a reliable source of peptides for tumor vaccine design, application to tumor vaccines is another potential application scenario of the invention.

Brief description of the drawings

Fig. 1 shows the difference between prior-art methods and the present invention in constructing the neopeptide library.

Fig. 2 is a flow diagram of the neoantigen prediction provided by the invention.

Fig. 3 compares the method provided by the invention with the open-source tools Topiary and pVACtools in embodiment 1.

Fig. 4 is a survival curve analysis of the neoantigens computed by the invention in hepatocellular carcinoma immunotherapy samples.

Detailed description of the embodiments

In order to describe the technical content of the invention more clearly, it is further described below with reference to specific embodiments.

The tumor neoantigen prediction method based on constructing a tumor personal genome provided by the invention comprises: extraction rules for fusion peptides containing the fusion breakpoint and for the corresponding wild-type peptides; constructing a tumor personal genome and generating mutant peptides containing multiple mutations; taking full account of tumor heterogeneity by using the Jaccard coefficient of the NGS reads at the mutation sites to measure the cis/trans relationship between mutations, which ensures the accuracy of the generated mutant peptides; generating the most consistent HLA typing result from multiple different tools, which ensures the accuracy of the HLA typing as far as possible; and removing mutant peptides that are identical to sequences in the wild-type protein, which yields neopeptides in the true sense.

In the method provided by the invention: mutant peptides are generated from both gene fusions and somatic mutations, which ensures that the sources of mutant peptides are comprehensive; a personal genome of the tumor sample is constructed, so that mutant peptides containing multiple mutations are generated accurately; tumor heterogeneity is fully taken into account, and the cis/trans relationship between mutations ensures that the mutant peptides reflect the true situation in the body; mutant peptides identical to the wild-type protein are removed, which ensures the accuracy of the neopeptides; the most consistent result from multiple different HLA typing tools is taken, which improves the accuracy of the HLA typing as far as possible; and the time-consuming step of predicting the affinity between HLA molecules and neopeptides is processed in parallel, which effectively improves computational efficiency.

Based on somatic mutations detected by whole-exome or whole-genome sequencing, the invention constructs a personal genome specific to the tumor patient, comprehensively examines the NGS read information of multiple cis mutations, and provides a comprehensive and accurate method for generating neopeptides. On this basis, it integrates mainstream affinity prediction methods to compute neoantigens, so that biomarkers of tumor therapy response can be assessed more effectively and accurate candidate peptides can be provided for tumor vaccine design.

With reference to Fig. 2, the method for predicting tumor neoantigens provided by the invention comprises the following steps:

Step1: perform somatic mutation and gene fusion detection on the tumor-germline control sample pair.

The open-source tools Mutect2 and FACTERA are used for somatic mutation detection and gene fusion detection respectively (when the input is RNAseq data, gene fusions are detected with STAR-Fusion).

To ensure the reliability of the somatic mutation results, the following quality-control filters are applied in addition to the default Mutect2 filtering:

A. the mutation frequency is greater than 2%; B. the sequencing depth at the mutation site is greater than 10; C. at least 2 reads support the mutation, and the average base quality of those reads is > 20.
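The three filters can be expressed as one predicate over a per-variant record. This is a minimal sketch: the record fields are hypothetical (they are not Mutect2 field names), "mutation frequency" is assumed to mean the fraction of reads supporting the alternate allele, and the depths and read counts are invented.

```python
from dataclasses import dataclass


@dataclass
class SomaticCall:
    site: str
    depth: int                     # total reads covering the site
    alt_reads: int                 # reads supporting the mutation
    mean_alt_base_quality: float   # average base quality of the supporting reads

    @property
    def allele_fraction(self):
        return self.alt_reads / self.depth if self.depth else 0.0


def passes_qc(call):
    """Apply filters A, B and C from the text; all three must hold."""
    return (
        call.allele_fraction > 0.02              # A. mutation frequency > 2%
        and call.depth > 10                      # B. sequencing depth > 10
        and call.alt_reads >= 2                  # C. at least 2 supporting reads ...
        and call.mean_alt_base_quality > 20      # ... with average base quality > 20
    )


# Invented read counts; only the first site label comes from the text.
calls = [
    SomaticCall("chr19:40902691A>G", depth=120, alt_reads=18, mean_alt_base_quality=32.0),
    SomaticCall("toy:1", depth=8, alt_reads=3, mean_alt_base_quality=35.0),    # depth too low
    SomaticCall("toy:2", depth=200, alt_reads=3, mean_alt_base_quality=30.0),  # 1.5% frequency
    SomaticCall("toy:3", depth=40, alt_reads=1, mean_alt_base_quality=38.0),   # one supporting read
]
kept = [c.site for c in calls if passes_qc(c)]
print(kept)  # ['chr19:40902691A>G']
```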

For gene fusion breakpoint detection, after the FACTERA tool (or STAR-Fusion for RNAseq data) outputs the gene fusion results under default parameters, at least one junction read (a read that directly covers the fusion breakpoint) is additionally required.

Step2: generate a fusion peptide and the corresponding wild-type peptides for each pair of fused genes.

The AGFusion tool is used to annotate each transcript of the genes at the 5' and 3' ends of the fusion breakpoint and to generate the corresponding complete fusion protein sequence. Peptides of 8-11 and 13-25 amino acids that contain the fusion breakpoint are then extracted from the fusion protein sequence, together with the corresponding 5' and 3' wild-type peptides.

Specifically:

1. Fusion breakpoint annotation: according to the genomic coordinates of the 5' and 3' breakpoints (specifically chromosome number + coordinate, for example chr21:42866283), AGFusion annotates the fusion breakpoint and synthesizes the full-length fusion protein sequence.

2. Extract the fusion peptides of length L that contain the fusion breakpoint, and the corresponding 5' and 3' wild-type peptides.

A) Determine the coordinate indices of the 5' fusion breakpoint on the 5' wild-type protein of length p5 and on the fusion protein of length g. Align the fusion protein with the 5' wild-type protein sequence to obtain the longest identical fragment seq1, of length s1, together with the coordinate index m of seq1 on the 5' wild-type protein and its coordinate index t on the fusion protein; the coordinate index of the 5' breakpoint is then m+s1 on the wild-type protein and t+s1 on the fusion protein.

B) Determine the coordinate index of the 3' fusion breakpoint on the 3' wild-type protein of length p3. Align the fusion protein with the 3' wild-type protein to obtain the longest identical fragment seq2, of length s2, and the coordinate index n of seq2 on the 3' wild-type protein.

C) Extract the fusion peptides of length L and the corresponding 5' and 3' wild-type peptides.

When the 3' fusion breakpoint does not cause a frameshift, each fusion peptide has two corresponding wild-type peptides, one from the 5' end and one from the 3' end. On the fusion protein, fusion peptides of length L are extracted from the minimum coordinate index t+s1-L to the maximum coordinate index t. On the 5' wild-type protein, extraction runs from the minimum coordinate index m+s1-L to the maximum coordinate index m+s1. On the 3' wild-type protein, extraction runs from n-L to the maximum coordinate index n, giving the two corresponding wild-type peptides of length L.

When the 3' fusion breakpoint causes a frameshift, each fusion peptide has only one corresponding wild-type peptide, from the 5' end. On the fusion protein, fusion peptides of length L are extracted from the minimum coordinate index t+s1-L to the maximum coordinate index g-L. On the 5' wild-type protein, wild-type peptides of length L are extracted starting from the minimum coordinate index m+s1-L up to the maximum coordinate index p5-L.
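Steps A) to C) can be sketched for the in-frame case as follows. This is an illustration under stated assumptions, not the patent's code: the longest identical fragment is found with the standard-library `difflib.SequenceMatcher`, the sequences are invented toy proteins, and the window boundaries are read as "every peptide of length L with at least one residue on each side of the breakpoint", which may differ by one position from the index ranges given in the text.

```python
from difflib import SequenceMatcher


def longest_shared_block(wild, fusion):
    """Return (index on wild, index on fusion, length) of the longest identical fragment."""
    matcher = SequenceMatcher(None, wild, fusion, autojunk=False)
    match = matcher.find_longest_match(0, len(wild), 0, len(fusion))
    return match.a, match.b, match.size


def junction_peptides(fusion, breakpoint, length):
    """All windows of `length` residues that span `breakpoint`.

    `breakpoint` is the 0-based index of the first residue contributed by the
    3' partner, i.e. t + s1 in the notation of the text.
    """
    first = max(0, breakpoint - length + 1)
    last = min(breakpoint - 1, len(fusion) - length)
    return [fusion[s:s + length] for s in range(first, last + 1)]


# Toy proteins (hypothetical sequences): the fusion joins the first 15 residues
# of the 5' partner to the 3' partner from residue 8 onward.
five_prime = "MKTAYIAKQRQISFVKSHFSRQ"
three_prime = "GSHMLEDPVDAFWKLTRNQ"
fusion = five_prime[:15] + three_prime[8:]

m, t, s1 = longest_shared_block(five_prime, fusion)    # step A
n, t3, s2 = longest_shared_block(three_prime, fusion)  # step B
breakpoint_on_fusion = t + s1
peptides = junction_peptides(fusion, breakpoint_on_fusion, 9)  # step C, L = 9
print(breakpoint_on_fusion, len(peptides))  # 15 8
```

The frameshift case, where peptides continue past the breakpoint through the novel reading frame, is not covered by this sketch.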

Step3: generate single-mutation peptides and the corresponding wild-type peptides from each somatic mutation.

SnpEff is used to annotate the somatic base mutations on the genome to each transcript in the Ensembl database and to its corresponding protein sequence. Mutant peptides of 8-11 and 13-25 amino acids that contain the protein mutation site are extracted, together with the wild-type peptides at the corresponding positions.

Specifically:

A) SnpEff is used to annotate the base mutations on each somatic genome to each transcript in the Ensembl database and to its corresponding protein sequence.

B) Mutant peptides of length L and the corresponding wild-type peptides are extracted.

For missense mutations and non-frameshift mutations, centered on the mutation coordinate, L-1 amino acids are taken toward the 5' end and L-1 amino acids toward the 3' end (L is the length of the mutant peptide to be generated). This generates mutant peptides of 8-11 and 13-25 amino acids that contain the mutated amino acid, together with the corresponding wild-type peptides.

For a typical single point mutation, 38 mutant/wild-type peptide pairs of 8-11 amino acids and 247 mutant/wild-type peptide pairs of 13-25 amino acids are generated.

If the mutation is a frameshift mutation, extraction starts L-1 amino acids before the mutation site and extends until the first stop codon appears, generating mutant peptides of 8-11 and 13-25 amino acids and the wild-type peptides at the corresponding coordinates.
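The frameshift rule can be sketched as below, assuming the frameshifted protein has already been translated, with `*` marking stop codons. The function name, the 0-based index of the first altered residue and the toy sequence are assumptions made for illustration.

```python
def frameshift_peptides(mutant_protein, pos, length):
    """Windows of `length` residues that overlap the novel frameshift sequence.

    `pos` is the 0-based index of the first altered residue. Windows start
    L-1 residues before `pos` and never extend past the first stop codon.
    """
    stop = mutant_protein.find("*", pos)
    end = stop if stop != -1 else len(mutant_protein)   # exclusive end of usable sequence
    first = max(0, pos - length + 1)
    return [mutant_protein[s:s + length] for s in range(first, end - length + 1)]


# Toy example: 10 unchanged residues, 10 novel residues, a stop codon, then
# downstream sequence that is never translated.
mutant = "MKTAYIAKQR" + "WLPGHSTNEV" + "*" + "AAAA"
peptides = frameshift_peptides(mutant, pos=10, length=9)
print(len(peptides), peptides[0], peptides[-1])  # 10 TAYIAKQRW LPGHSTNEV
```

Unlike the missense case, the number of peptides here depends on how far the new reading frame runs before the first stop codon.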

Step4: construct a personal genome specific to the tumor sample and generate mutant peptides containing multiple mutations.

All mutated bases detected in Step1 are substituted in a batch for the corresponding bases on the human reference genome. The advantage is that multiple mutations occurring on the same gene can be captured simultaneously.

Losing or mis-annotating mutant peptides that contain multiple mutations is a major weakness of existing neoantigen prediction tools. The present invention remedies this defect by generating a personal genome specific to the tumor sample. Specifically, all mutation information in the somatic mutation VCF file is imported into the human reference genome in a single pass, replacing the wild-type bases at the original coordinates, to generate the complete personal genome of the tumor sample.

The mutation-containing transcripts edited at the genome base level are translated into mutant protein sequences with the Biopython tool. Based on the mutant protein coordinate index of each transcript annotated for the single mutations in Step3, mutant peptides of 8-11 and 13-25 amino acids containing the mutations are generated according to the peptide extraction rules of Step3.

Compared with editing at the protein level based on single-mutation annotation, the present invention generates the tumor personal genome by editing the bases of the genome and then annotates the entire transcript in one pass, which reflects the mutant protein resulting from multiple base mutations more accurately. This matters especially when point mutations fall within the triplet codon of the same amino acid. For example, chr11:56143803A>G and chr11:56143804G>A occur in the same codon; annotated individually to the protein level they are ORBU1:p.Gln235Arg and ORBU1:p.Gln235His respectively, which causes a conflict when editing at the protein level. In this step, care must be taken that the reference genome version used to call the mutations is the same as the reference genome version into which they are imported. The invention currently supports the major human reference genome versions GRCh37 and GRCh38, and can be further extended to the reference genomes of other species, such as rat and mouse.
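The benefit of editing at the base level can be seen in a small sketch. The text uses Biopython for translation; to stay dependency-free, this illustration builds the standard codon table by hand instead, handles only single-base substitutions on a forward-strand coding sequence, and uses an invented 12-base sequence. Two substitutions in the same codon give a residue that neither single-mutation annotation predicts.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_snvs(reference, snvs):
    """Apply (0-based position, ref base, alt base) substitutions in one pass."""
    seq = list(reference)
    for pos, ref, alt in snvs:
        if reference[pos] != ref:
            raise ValueError(f"reference mismatch at {pos}: expected {ref}, found {reference[pos]}")
        seq[pos] = alt
    return "".join(seq)


def translate(cds):
    """Translate a coding sequence codon by codon; '*' marks a stop codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


cds = "ATGAAGGGATAA"            # toy coding sequence: M K G stop
snv_a = (3, "A", "G")           # first base of codon 2:  AAG -> GAG
snv_b = (5, "G", "T")           # third base of codon 2:  AAG -> AAT

only_a = translate(apply_snvs(cds, [snv_a]))          # what single annotation of A predicts
only_b = translate(apply_snvs(cds, [snv_b]))          # what single annotation of B predicts
both = translate(apply_snvs(cds, [snv_a, snv_b]))     # personal-genome style, both at once
print(only_a, only_b, both)  # MEG* MNG* MDG*
```

Indels, splicing and reverse-strand transcripts would need additional handling that this sketch leaves out.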

Step5: judge to generate in the single mutation and Step4 generated in Step3 by the cis trans relationship between mutation The multimutation peptide true and false, generate the mutant peptide of necessary being.

According to the reads information in sequence alignment bam file, determine that the relationship between mutation is cis- or cis relationship, To judge the mutant peptide true and false containing multimutation and single mutation.By taking two mutation as an example, if two sport trans- dash forward Become, i.e., is that then the mutant peptide containing this pair of prominent peptide is removed by this, is only protected simultaneously comprising the two mutation without a reads It stays containing the mutant peptide being individually mutated.If two sport cis- mutation, only retain the mutation simultaneously containing the two mutation Peptide.

Due to Tumor Heterogeneity, the listed mutation in somatic mutation file is not entirely cis- mutation, i.e., these are prominent Change, which is not necessarily, to be appeared in the same tumour subclone, and multiple mutation are dispersed in different subclones in other words, are being surveyed Ordinal number appears on different NGS reads according to multiple mutation are above shown as.Therefore, tumor sample sequence is introduced in this step The bam file of comparison is used to judge the true and false of these multimutation peptide fragments and single mutation peptide fragment.Only indicate these mutation NGS reads overlaps, and just can guarantee the single mutation peptide fragment generated containing multimutation peptide and Step3 that Step4 is generated It is all necessary being.

Specifically, the BAM file of the tumor sample is read with pysam, the reads covering the chromosome coordinate of each mutation site are exported, and the Jaccard coefficient is calculated for every pair of mutation sites: $J(i, j) = \frac{|f(i) \cap f(j)|}{|f(i) \cup f(j)|}$, where f(i) and f(j) denote the sets of NGS reads that support mutation i and mutation j, respectively. When the Jaccard coefficient is 0, no tumor subclone carries mutation i and mutation j at the same time, and the mutant peptides containing both mutation i and mutation j must be removed. When the Jaccard coefficient is 1, all tumor subclones carry mutation i and mutation j at the same time, and the mutant peptides generated in Step3 that contain only mutation i or only mutation j must be removed. When the Jaccard coefficient is between 0 and 1, both the single-mutation peptides generated in Step3 and the multi-mutation peptides containing mutation i and mutation j generated in Step4 are retained. This step finally yields the library of true, complete mutant peptides.
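The decision rule above can be sketched as follows. This is an illustration only: it assumes the names of the reads supporting each mutation have already been collected from the BAM file (for example with pysam, which is not shown here), and the function names `jaccard` and `peptides_to_keep` are hypothetical. Treating two empty read sets as a coefficient of 0 is likewise an assumption of this sketch.

```python
def jaccard(reads_i, reads_j):
    """Jaccard coefficient between two sets of read names."""
    union = reads_i | reads_j
    if not union:
        # Assumption: with no supporting reads at all, report 0.
        return 0.0
    return len(reads_i & reads_j) / len(union)


def peptides_to_keep(reads_i, reads_j):
    """Return (keep_single_mutation_peptides, keep_multi_mutation_peptides)."""
    coefficient = jaccard(reads_i, reads_j)
    if coefficient == 0:
        # Trans: no read carries both mutations, so drop multi-mutation peptides.
        return True, False
    if coefficient == 1:
        # Cis in every supporting read: drop single-mutation peptides.
        return False, True
    # Mixed support: keep both kinds of peptide.
    return True, True
```

For example, `peptides_to_keep({"r1", "r2"}, {"r2", "r3"})` keeps both peptide types, because only one of the three supporting reads carries both mutations.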

Step6: Remove the mutant peptides that are identical to a sequence at another position of a wild-type protein, and construct the complete nascent polypeptide library.

A nascent polypeptide is a mutant peptide, caused by a mutation, that does not exist in any wild-type protein; only then can the immune system regard it as new. The mutant peptides formed in Step5 are therefore not fully equivalent to nascent polypeptides. In particular, for protein mutations that occur in repetitive regions, a sequence identical to the mutant peptide is easily found at another position of the wild-type protein, and such a peptide cannot be called a true neoantigen peptide. Existing open-source tools overlook this point. For every mutant peptide, this step obtains the wild-type protein sequence of each transcript through pyensembl and checks whether the mutant peptide occurs in the wild-type protein sequences.
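A minimal sketch of this check is given below. It assumes the wild-type protein sequences have already been retrieved (for example through pyensembl, which is not shown), and the function name `split_true_and_false` is hypothetical.

```python
def split_true_and_false(mutant_peptides, wild_type_proteins):
    """Separate mutant peptides into true and false nascent polypeptides.

    A peptide is treated as false if it occurs verbatim anywhere in any
    of the supplied wild-type protein sequences.
    """
    true_peptides, false_peptides = [], []
    for peptide in mutant_peptides:
        if any(peptide in protein for protein in wild_type_proteins):
            false_peptides.append(peptide)
        else:
            true_peptides.append(peptide)
    return true_peptides, false_peptides
```

For large peptide sets, an index of all wild-type k-mers would avoid the repeated substring scans; the linear scan is kept here only for clarity.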

Step7: Perform high-consistency HLA typing based on the BAM file of the germline control sample.

Given that concordance with first-generation sequencing, the gold standard for HLA typing, reaches only 84% at best (PMID: 27802932), the present invention uses five different second-generation-sequencing HLA typing tools and takes the most consistent HLA typing result in order to reduce false positives. Specifically, HLA typing is computed with Polysolver, HLA-HD, HLA-PRG-LA, OptiType, and hla-genotyper. Polysolver and OptiType type only class I HLA, whereas the other three can also be used for class II HLA. For each HLA allele in the 8 major HLA classes (A, B, C, DRB, DPA, DPB, DQA, DQB), the initial score is 0, and the score is increased by 1 each time the allele is detected by one of the tools. The highest-scoring HLA allele is taken as the final typing result for each HLA class.
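The scoring scheme can be sketched as a simple vote count. The input layout (a mapping from tool name to per-locus allele calls), the function name `consensus_typing`, and the handling of ties (all top-scoring alleles are returned) are assumptions of this sketch, not details stated in the description.

```python
from collections import Counter


def consensus_typing(calls_by_tool):
    """Score each allele by the number of tools reporting it, per HLA locus.

    calls_by_tool: {tool_name: {locus: [allele, ...]}}
    Returns {locus: [highest-scoring allele(s), sorted]}.
    """
    scores = {}
    for calls in calls_by_tool.values():
        for locus, alleles in calls.items():
            # set() so that a homozygous call adds one point per tool, not two.
            scores.setdefault(locus, Counter()).update(set(alleles))
    result = {}
    for locus, counter in scores.items():
        best = max(counter.values())
        result[locus] = sorted(a for a, s in counter.items() if s == best)
    return result
```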

Step8: Predict the affinity between nascent polypeptides and HLA molecules, and calculate the neoantigen load of the sample according to the affinity.

Neoantigen prediction in effect simulates how an HLA molecule in an antigen-presenting cell binds a nascent polypeptide through anchoring in its structural groove. Several existing mainstream tools, such as netMHC, netMHCpan, netMHCII, netMHCIIpan, and mhcflurry, train a neural network model specific to each HLA molecule on real affinity data for HLA molecules and short peptides, and then use the model to predict the affinity between that HLA molecule and nascent polypeptides. This class of algorithm is used by almost all reported neoantigen prediction tools, and the present invention also uses these mainstream affinity prediction algorithms.

The specific screening condition is as follows: when the affinity between a nascent polypeptide and an HLA molecule is ≤ 500 nM and the relative affinity rank is ≤ 2%, the nascent polypeptide is regarded as a neoantigen. The affinity is output as an IC50 value, which represents the concentration of the nascent polypeptide at which 50% of the HLA molecules are bound, in nM. The smaller the value, the higher the affinity between the peptide and the HLA protein encoded by the allele. The relative rank of the affinity is expressed as Rank (%), that is, the percentile rank of the nascent polypeptide's IC50 value within the IC50 data set of 400,000 randomly generated peptides. The smaller the value, the higher the relative affinity between the peptide and the HLA molecule. The total number of neoantigen-HLA molecule complexes that meet this threshold is called the tumor neoantigen load.
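As an illustration of the threshold and the resulting count, the following sketch assumes that affinity predictions are available as records with the hypothetical keys `peptide`, `allele`, `ic50_nm`, and `rank_pct`; the actual output format of the prediction tools is not reproduced here.

```python
def neoantigen_load(predictions, max_ic50_nm=500.0, max_rank_pct=2.0):
    """Count distinct peptide-HLA complexes that pass both thresholds."""
    passing = {
        (p["peptide"], p["allele"])
        for p in predictions
        if p["ic50_nm"] <= max_ic50_nm and p["rank_pct"] <= max_rank_pct
    }
    return len(passing)
```

Counting distinct (peptide, allele) pairs means that one peptide binding two HLA alleles contributes two complexes to the load.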

Embodiment 1

This embodiment starts from the mutation file of one non-small cell lung cancer sample; the specific mutation information is shown in Table 1. Topiary, pVACtools, and the method of the present invention are compared in predicting nascent polypeptides of 8 to 11 amino acids in length.

This embodiment serves as a validation of the scheme of the present invention and demonstrates its advantages over two current mainstream tools. Fig. 3 shows the comparison process and results between the neoantigen prediction method of the invention and the two mainstream open-source tools. Because the three tools use the same method to predict the affinity between peptides and HLA molecules, the comparison here focuses only on the differences in nascent polypeptide generation.

Table 1: Information on the 58 somatic mutation sites of the non-small cell lung cancer patient

In the method provided by the invention, whether a sequence is detected in a wild-type protein is the criterion for judging whether a generated nascent polypeptide is real, so all mutant polypeptides formed by mutations fall into two parts: true nascent polypeptides and false nascent polypeptides. In this example, the present invention generated 203 false nascent polypeptides (that is, mutant polypeptide sequences that can be found in wild-type proteins) and 1792 true nascent polypeptides. Topiary and pVACtools generated 1748 and 1702 nascent polypeptides, respectively.

Comparing the results of the three tools yields three clear findings: (1) the nascent polypeptides generated by the present invention and by pVACtools completely cover the results of Topiary; (2) 62 of the nascent polypeptide sequences predicted by Topiary and pVACtools can be found in wild-type proteins, which shows that simply calling every mutant polypeptide produced by a mutation a nascent polypeptide is ill-considered; (3) pVACtools and Topiary are weak at handling adjacent double-base mutations. For a double-base substitution, pVACtools forcibly splits it into two single-base mutations, which causes amino acid annotation errors, whereas Topiary ignores the whole double-base mutation, which reduces the total number of nascent polypeptides generated (4 in this sample). The 76 erroneous nascent polypeptides unique to pVACseq come from 4 mutations: chr15:28947425 G>A; chr15:28947426 A>G; chr4:145041707 C>A; chr4:145041708 T>C. These 4 mutations result from forcibly splitting the two adjacent double-base mutations in the somatic mutation file, chr15:28947425 GA>AG and chr4:145041707 CT>AC. Careful analysis of the 190 true nascent polypeptides unique to the present invention shows that they are concentrated mainly in the 9 mutations shown in Table 2. Of these, 4 double-base mutations are lost by the other tools, notably including the nascent polypeptides formed by an EGFR hotspot mutation. In addition, 3 mutations lose part of their transcript annotations, which leads to the loss of the corresponding nascent polypeptides.

In summary, the invention achieves the expected design effect and makes up for the shortcomings of existing tools. This helps to calculate the tumor neoantigen load accurately for assessing the effect of immunotherapy, and provides reliable polypeptide information for later tumor vaccine design.

Table 2: Information on the 9 mutations corresponding to the 190 nascent polypeptides unique to the present invention

Embodiment 2

Embodiment 2 is a concrete application of the method provided by the invention to tumor immunotherapy. It illustrates the application value of the invention in tumor immunotherapy and its advantage over the currently accepted tumor mutation load. A high tumor mutation load indicates more tumor somatic mutations, which means that more tumor neoantigens can be produced and that such tumor cells are more likely to be recognized by immune cells. This is the biological rationale for using tumor mutation load as a biomarker to assess the effect of immunotherapy. Embodiment 2 verifies the validity of the tumor neoantigen load calculated by the present invention as an immunotherapy biomarker.

Table 3 shows the overall survival (OS) data of 13 hepatocellular carcinoma patients treated with immunotherapy, together with the tumor mutation load detected by whole-exome sequencing and the sample neoantigen load calculated by the present invention. Here, tumor mutation load (TMB) is defined as the number of non-synonymous somatic mutations detected across the whole exome. Tumor neoantigen load (TNB) is the total number of nascent polypeptide-HLA molecule complexes that meet the threshold (affinity between nascent polypeptide and HLA molecule ≤ 500 nM and relative rank ≤ 2%). According to the median TMB, the patients can be divided into two groups: 7 in the high-TMB group and 6 in the low-TMB group. In the same way, the patients can be divided into a high-TNB group and a low-TNB group according to the median TNB. Fig. 4 shows the survival curves for the high and low groups of TNB and TMB. TNB significantly distinguishes the survival of the immunotherapy patients (p < 0.05; OS of the high-TNB group versus the low-TNB group is 565 days versus 185 days). Although the high-TMB group shows a trend toward longer OS than the low-TMB group, the difference is not statistically significant (p = 0.29; OS of the high-TMB group versus the low-TMB group is 336 days versus 304 days). This result shows that the neoantigen prediction method of the invention assesses the effect of tumor immunotherapy more accurately than tumor mutation load, and that neoantigen load has good application prospects as a biomarker.
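The median grouping can be sketched as follows. Assigning values equal to the median to the high group is an assumption of this sketch; it is consistent with the 7/6 split reported for 13 patients but is not stated explicitly in the description. The function name `split_by_median` and the patient identifiers are hypothetical, and no real patient values are used.

```python
from statistics import median


def split_by_median(load_by_patient):
    """Split patients into (high, low) groups at the median load.

    Assumption: patients whose load equals the median go to the high group.
    """
    cutoff = median(load_by_patient.values())
    high = sorted(p for p, v in load_by_patient.items() if v >= cutoff)
    low = sorted(p for p, v in load_by_patient.items() if v < cutoff)
    return high, low
```

The survival comparison between the two groups (for example, a log-rank test) would then be carried out with a dedicated statistics package, which is outside the scope of this sketch.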

Table 3

In summary, in generating nascent polypeptides, the neoantigen prediction method of the invention builds a personal genome specific to the tumor sample and thereby remedies two major defects of current mainstream methods: (1) multi-mutation peptides are wrongly split or lost outright; (2) mutant peptides formed by mutations in low-complexity regions, which are often found in wild-type proteins, are mistaken for nascent polypeptides. In addition, the invention covers the nascent polypeptides formed by both somatic mutations and gene fusions. The neoantigen prediction method of the invention can therefore reflect the true neoantigen status of a tumor sample more comprehensively and accurately. The data from 13 hepatocellular carcinoma patients who received immunotherapy confirm that the neoantigen load calculated by the invention can be applied effectively to assessing the benefit of tumor immunotherapy. Because the invention predicts neoantigens accurately and comprehensively and thus provides a reliable peptide source for tumor vaccine design, application to tumor vaccines is another potential use of the invention.

In this description, the invention has been described with reference to specific embodiments. It is clear, however, that various modifications and changes can still be made without departing from the spirit and scope of the invention. The description and the drawings are therefore to be regarded as illustrative rather than restrictive.

Claims

1. A method for predicting tumor neoantigens, characterized in that the method comprises the steps of:

(3) generating single-mutation peptides and the corresponding wild-type peptides based on each somatic mutation;

(5) judging, by the cis/trans relationship between mutations, whether the single-mutation and multi-mutation peptides are real, and generating the mutant peptides that actually exist;

(7) performing HLA typing based on the BAM file of the germline control sample, predicting the affinity between nascent polypeptides and HLA molecules, and taking the nascent polypeptides with high affinity as candidate tumor neoantigens.

2. The method for predicting tumor neoantigens according to claim 1, characterized in that, in step (1):

for somatic mutations, after the somatic mutation results are output under the default parameters of the Mutect2 tool, further quality-control filtering is performed, the quality-control filtering comprising: the mutation frequency is greater than 2%; the sequencing depth at the mutation site is greater than 10; at least 2 reads support the mutation, and the average base quality of those reads is > 20;

for gene fusion detection, if the input is whole-exome (WES) or whole-genome (WGS) sequencing data, gene fusions are detected with the FACTERA tool under default parameters; if the input is RNA-seq data, gene fusions are detected with the STAR-Fusion tool under default parameters, and further quality control requiring a junction reads count ≥ 1 is then applied to reduce false positives, the junction reads being reads that directly span the fusion breakpoint.
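The somatic quality-control thresholds recited in claim 2 can be illustrated with the following sketch. The record type `SomaticCall` and its field names are hypothetical and do not correspond to Mutect2 VCF field names; how these values are extracted from the caller output is not shown.

```python
from dataclasses import dataclass


@dataclass
class SomaticCall:
    vaf: float                     # mutation frequency as a fraction (0.02 = 2%)
    depth: int                     # sequencing depth at the mutation site
    alt_reads: int                 # number of reads supporting the mutation
    mean_alt_base_quality: float   # average base quality of the supporting reads


def passes_qc(call):
    """Apply the four thresholds: VAF > 2%, depth > 10, >= 2 reads, quality > 20."""
    return (
        call.vaf > 0.02
        and call.depth > 10
        and call.alt_reads >= 2
        and call.mean_alt_base_quality > 20
    )
```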

3. The method for predicting tumor neoantigens according to claim 1, characterized in that, in step (2), for fusion breakpoint annotation, the AGFusion tool is used to annotate the fusion breakpoint according to the genomic coordinates of the 5' and 3' breakpoints and to generate the full-length fusion protein sequence; fusion peptides of length L containing the fusion breakpoint, and the corresponding 5' and 3' wild-type peptides, are then intercepted;

preferably, the specific interception rule is:

determining the coordinate indices of the 5' fusion breakpoint on the 5' wild-type protein of length p5 and on the fusion protein of length g: the fusion protein is compared with the 5' wild-type protein sequence to obtain the longest identical fragment seq1, the coordinate index m of seq1 on the 5' wild-type protein, and the coordinate index t of seq1 on the fusion protein, the length of seq1 being s1; the coordinate index of the 5' breakpoint is then m+s1 on the wild-type protein and t+s1 on the fusion protein;

determining the coordinate index of the 3' fusion breakpoint on the 3' wild-type protein of length p3: the fusion protein is compared with the 3' wild-type protein to obtain the longest identical fragment seq2 and the coordinate index n of seq2 on the 3' wild-type protein, the length of seq2 being s2;

when the 3' fusion breakpoint does not cause a reading-frame shift, each fusion peptide has two corresponding wild-type peptides, one from the 5' end and one from the 3' end: on the fusion protein, fusion peptides of length L are intercepted from the minimum coordinate index t+s1-L to the maximum coordinate index t; on the 5' wild-type protein, from the minimum coordinate index m+s1-L to the maximum coordinate index m, and on the 3' wild-type protein, from index n-L to the maximum coordinate index n, the two corresponding wild-type peptides of length L are intercepted;

when the 3' fusion breakpoint causes a reading-frame shift, each fusion peptide has only one corresponding 5' wild-type peptide: on the fusion protein, fusion peptides of length L are intercepted from the minimum coordinate index t+s1-L to the maximum coordinate index g-L, and on the 5' wild-type protein, the corresponding wild-type peptides of length L are generated in turn from the minimum coordinate index m+s1-L to the maximum coordinate index p5-L.
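Locating the 5' breakpoint from the longest identical fragment, as recited in claim 3, can be illustrated with the Python standard library. This is a sketch under stated assumptions: using `difflib.SequenceMatcher` to find the longest identical fragment is a choice of this sketch, not a tool named in the claim, and the function name and toy sequences are hypothetical. The peptide interception windows themselves are not implemented here.

```python
from difflib import SequenceMatcher


def locate_five_prime_breakpoint(wild5, fusion):
    """Return the 5' breakpoint index on (5' wild-type protein, fusion protein).

    m and t are the start indices of the longest identical fragment (seq1)
    on the wild-type and fusion proteins, and s1 is its length.
    """
    matcher = SequenceMatcher(None, wild5, fusion, autojunk=False)
    match = matcher.find_longest_match(0, len(wild5), 0, len(fusion))
    m, t, s1 = match.a, match.b, match.size
    return m + s1, t + s1
```

With a toy 5' protein whose first 12 residues are retained in the fusion, the function returns 12 for both coordinate systems.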

4. The method for predicting tumor neoantigens according to claim 1, characterized in that, in step (3),

annotation is performed with SnpEff, and each somatic base mutation in the genome is annotated onto every transcript in the Ensembl database and onto the corresponding protein sequence.

5. The method for predicting tumor neoantigens according to claim 4, characterized in that the interception rule is:

for missense mutations and non-frameshift mutations, centered on the mutation coordinate, L-1 amino acids are taken toward the 5' end and L-1 amino acids toward the 3' end, generating mutant peptides that contain the mutated amino acid, with lengths of 8 to 11 and 13 to 25 amino acids, and the corresponding wild-type peptides;

for frameshift mutations, starting from L-1 amino acids before the mutation site and extending until the first stop codon appears, mutant peptides of 8 to 11 and 13 to 25 amino acids in length and the wild-type peptides at the corresponding coordinates are generated.
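For the missense case recited in claim 5, the windowing can be sketched as below. The sketch assumes a single-residue substitution, so that the wild-type and mutant proteins have the same length, and it interprets the rule as producing every window of length L that contains the mutated position; the function names and toy sequence are hypothetical. Frameshift handling is not shown.

```python
def window_starts(protein_length, pos, length):
    """Start indices of all windows of the given length that contain position pos."""
    first = max(0, pos - (length - 1))
    last = min(pos, protein_length - length)
    return range(first, last + 1)


def paired_windows(wild_protein, mutant_protein, pos, length):
    """Return (mutant_peptide, wild_type_peptide) pairs at identical coordinates."""
    return [
        (mutant_protein[s:s + length], wild_protein[s:s + length])
        for s in window_starts(len(mutant_protein), pos, length)
    ]
```

Running this for each L in 8 to 11 and 13 to 25 would give the full set of candidate peptides for one mutation under this interpretation.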

6. The method for predicting tumor neoantigens according to claim 1, characterized in that, in step (4),

the mutation information in the VCF file of somatic mutations is imported into the human reference genome all at once, replacing the wild-type bases at the original coordinates, to generate the personal genome specific to the tumor sample;

the mutation-containing transcripts edited at the genome base level are translated into mutant protein sequences with the Biopython tool;

based on the mutant protein coordinate index information of each transcript annotated for the single mutations in step (3), mutant peptides of 8 to 11 and 13 to 25 amino acids in length containing the mutations are generated according to the peptide interception rule of step (3);

in step (5):

the BAM file of the tumor sample is read with pysam, the reads aligned to each mutation site are exported, and the Jaccard coefficient is calculated for every pair of mutation sites: $J(i, j) = \frac{|f(i) \cap f(j)|}{|f(i) \cup f(j)|}$, where f(i) and f(j) denote the sets of NGS reads that support mutation i and mutation j, respectively; when the Jaccard coefficient is 0, no tumor subclone carries mutation i and mutation j at the same time, and the multi-mutation peptides generated in step (4) that contain both mutation i and mutation j must be removed; when the Jaccard coefficient is 1, all tumor subclones carry mutation i and mutation j at the same time, and the single-mutation peptides generated in step (3) that contain only mutation i or only mutation j must be removed; when the Jaccard coefficient is between 0 and 1, the single-mutation peptides containing mutation i or mutation j generated in step (3) and the multi-mutation peptides containing both mutation i and mutation j generated in step (4) are all retained.

7. The method for predicting tumor neoantigens according to claim 1, characterized in that, in step (7), five different second-generation-sequencing HLA typing tools are used, and false positives are reduced by taking the most consistent HLA typing result; preferably, HLA typing is computed with Polysolver, HLA-HD, HLA-PRG-LA, OptiType, and hla-genotyper; for each HLA allele in the 8 major HLA classes, the initial score is 0 and the score is increased by 1 each time the allele is detected by one of the tools, and the highest-scoring HLA allele is taken as the final typing result for each HLA class;

the prediction of the affinity between nascent polypeptides and HLA molecules is specifically: when the affinity between a nascent polypeptide and an HLA molecule is ≤ 500 nM and the relative rank is ≤ 2%, the nascent polypeptide is regarded as a candidate tumor neoantigen.

8. A device for predicting tumor neoantigens, characterized in that the device comprises a memory for storing a program and a processor for executing the program, so as to implement the method for predicting tumor neoantigens according to any one of claims 1 to 7.

9. A computer-readable storage medium, characterized in that it comprises a program, the program being executable by a processor to carry out the method for predicting tumor neoantigens according to any one of claims 1 to 7.