Background
At present, the treatment of malignant tumors still faces many difficulties, and new treatment strategies are urgently needed. In recent years, tumor immunotherapy has developed rapidly and has attracted increasing attention. Tumor immunotherapy refers to the elimination of tumor cells by the body's own immune system, and includes antibody therapy, cell therapy, tumor vaccines, and the like. The occurrence and development of tumors are often accompanied by fusion gene breakpoints involving multiple genes. A neoantigen is an epitope-specific antigen generated by tumor cell mutation; it is expressed only on tumor cells and does not induce immune tolerance in the body. A growing number of studies have shown that immunotherapy targeting neoantigens has achieved good clinical results in some cancer patients. Therefore, the screening and identification of tumor-specific neoantigens is one of the key technologies for improving the efficacy of tumor immunotherapy and is the basis for realizing individualized immunotherapy.
Gene fusion refers to the formation of a hybrid gene by joining together all or part of two or more genes that are normally independent of each other. Gene fusions can arise from chromosomal translocations, inversions, insertions, deletions, and the like. The occurrence of gene fusion is closely related to the occurrence and development of diseases such as tumors, and gene fusion events are commonly found in the tumors studied to date. An abnormal chromosomal structure (the Philadelphia chromosome) was found in chronic myeloid leukemia by Nowell and Hungerford in 1960. In 1980, the first study revealed a pathogenic role of the BCR and ABL1 gene fusion event in Burkitt's lymphoma. Gene fusion is an important type of variation in tumor cells and can greatly influence the functional characteristics of the cells.
Tumor immunotherapy based on tumor neoantigens has attracted wide attention in the medical field in recent years, and recent research and clinical results also show that tumor neoantigen immunotherapy has broad application prospects. The acquisition of tumor neoantigens relies on new amino acid or protein sequences generated by the genetic variation of tumors. Gene fusion can generate a large number of active abnormal protein sequences and thereby drive tumorigenesis, so fusion genes are important targets for the discovery of tumor neoantigens. However, whole-genome high-throughput screening of fusion gene tumor neoantigens remains a difficult problem in medicine. The present application therefore develops a high-throughput method for efficiently and accurately screening fusion gene tumor neoantigens, which can remarkably improve both the screening efficiency and the accuracy for fusion gene tumor neoantigens.
Disclosure of Invention
In view of the above problems of the prior art, it is an object of the present invention to provide a method for predicting tumor neoantigens based on gene fusion events. In this method, a scoring function built on the characteristic values of a fusion gene tumor neoantigen is used to calculate a score value for each neoantigen, the neoantigens are ranked according to their score values, and the ranked neoantigens are highly reliable. The method provided by the invention comprises multi-step quality control and comprehensive analysis, so that the accuracy of the result and the verification rate of the specific antigens are greatly improved and the workload of experimental verification is greatly reduced, laying a foundation for the subsequent design of anti-tumor vaccines, the development of anti-tumor drugs, and the evaluation of biomarkers of tumor treatment response.
The above object of the present invention is achieved by the following technical solutions:
The first aspect of the invention provides a scoring function for evaluating the credibility of a fusion gene tumor neoantigen, characterized in that the scoring function comprises the following characteristic values: the number of junction reads supporting the fusion gene, the number of spanning fragments supporting the fusion gene, the average single-base coverage of the gene upstream of the fusion event, and the average single-base coverage of the gene downstream of the fusion event.
Further, in a specific embodiment of the present invention, the scoring function is as follows:
Score = lg(IC/500) + [(JunctionReadCount + SpanningFragCount) × 2 / (UpstreamCov + DownstreamCov)];
wherein IC = mean(IC50[i:i+n]), i.e. the average of the affinity (IC50) values predicted by the various software tools; JunctionReadCount is the number of junction reads supporting the fusion gene; SpanningFragCount is the number of spanning fragments supporting the fusion gene; UpstreamCov is the average single-base coverage of the gene upstream of the fusion event; and DownstreamCov is the average single-base coverage of the gene downstream of the fusion event.
Further, the plurality of software includes NetMHCpan, NetMHCIIpan, NetMHC, NetMHCII.
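As an illustration only, the scoring function described above can be sketched in Python as follows. The function name, argument names, and input checks are hypothetical and are not part of the described method. The sketch assumes that "lg" denotes the base-10 logarithm and that IC is the arithmetic mean of the per-software IC50 values, and it keeps the sign of the lg term exactly as written in the text.

```python
import math
from statistics import mean


def fusion_neoantigen_score(ic50_values, junction_read_count,
                            spanning_frag_count, upstream_cov, downstream_cov):
    """Score = lg(IC/500) + (JunctionReadCount + SpanningFragCount) * 2
               / (UpstreamCov + DownstreamCov), as written in the text.

    ic50_values: IC50 predictions (nM) from the different affinity tools.
    """
    if not ic50_values:
        raise ValueError("at least one IC50 value is required")
    ic = mean(ic50_values)  # assumption: IC is the arithmetic mean
    if ic <= 0:
        raise ValueError("IC50 values must be positive")
    total_cov = upstream_cov + downstream_cov
    if total_cov <= 0:
        raise ValueError("summed coverage must be positive")
    affinity_term = math.log10(ic / 500)  # assumption: lg means log base 10
    support_term = (junction_read_count + spanning_frag_count) * 2 / total_cov
    return affinity_term + support_term
```

For example, with IC50 values of 500 and 500, 10 junction reads, 5 spanning fragments, and coverages of 20 and 10, the lg term is 0 and the read-support term is 30/30, giving a score of 1.0.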
The second aspect of the invention provides a prediction method of a fusion gene tumor neoantigen.
Further, the prediction method comprises obtaining the following characteristic values: tumor tissue RNA-bam file, fusion gene in tumor tissue, gene expression level and mutant polypeptide affinity prediction value.
Further, the prediction method comprises the credibility ranking of the fusion gene tumor neoantigens obtained by the scoring function according to the first aspect of the invention.
Further, the prediction method comprises the following steps:
(1) obtaining a tumor tissue sample and RNA-seq sequencing data;
(2) detecting the fusion gene of the tumor tissue;
(3) calculating the expression quantity of the fusion gene;
(4) annotation of fusion genes;
(5) extracting fusion polypeptide;
(6) identifying MHC molecule types;
(7) HLA affinity prediction;
(8) the scoring function of the first aspect of the invention is used to obtain the scoring order of the confidence level of the fusion gene tumor neoantigen.
Further, step (1) of the prediction method comprises: obtaining tumor tissue from a patient with any type of cancer, and completing RNA-seq sequencing of the tumor tissue on an Illumina high-throughput sequencing platform.
Further, the raw data obtained by the above sequencing needs to undergo quality control, reference genome alignment, and bam file processing.
Wherein, quality control: performing quality control on the raw RNA sequencing fastq data with FastQC software to obtain the quality-controlled data AO.clean.fq.gz;
reference genome alignment: aligning the quality-controlled RNA data to the reference genome with hisat2 software to obtain a bam file of the tumor RNA data;
bam file processing: the aligned bam file needs further processing, and the RNA data bam file is subjected to sorting and quality control processing to obtain the processed RNA-bam file.
Preferably, the expression level of the fusion gene is calculated using RSEM;
preferably, the fusion gene in the tumor tissue is detected using STAR-Fusion software;
preferably, the detected fusion gene is annotated using AGFusion;
preferably, the polypeptide extraction uses a sliding window, specifically, windows with a length of 8-15 amino acids are slid step by step over the positions upstream and downstream of the mutation site, with a step length of 1;
preferably, the MHC class I and MHC class II molecular types are identified using seq2HLA;
preferably, the affinity prediction is carried out using multiple software tools such as NetMHCpan, NetMHCIIpan, NetMHC and NetMHCII, and each tool yields a corresponding IC50 value as its affinity prediction result.
Further, the credibility ranking in step (8) of the prediction method is based on the score values of the fusion gene tumor neoantigens obtained by the scoring function according to the first aspect of the present invention; the neoantigens are ranked from the highest score to the lowest, and a higher score indicates a more credible fusion gene tumor neoantigen.
In a third aspect, the invention provides the use of a scoring function according to the first aspect of the invention for predicting fusion gene tumor neoantigens.
In a fourth aspect, the invention provides a fusion gene tumor neoantigen.
Further, the neoantigen is selected from one or more of the following group: RBP4_FRA10AC1, AHNAK_RPS11, SMURF1_KPNA7, STRN3_HECTD1.
In a fifth aspect, the present invention provides a method of screening for a neoantigen according to the fourth aspect of the invention.
Further, the screening method comprises the prediction method according to the second aspect of the present invention.
Further, the screening method further comprises verifying the predicted result.
Preferably, the verification of the predicted result comprises fusion gene verification and immunological verification;
preferably, the fusion gene verification refers to performing PCR verification on the fusion gene for predicting the neoantigen;
preferably, the immunological validation refers to ELISPOT validation of the neoantigen corresponding to the fusion event that is positively validated in the fusion gene validation result.
The sixth aspect of the invention provides the use of the neoantigen of the fourth aspect of the invention in the preparation of an anti-tumor drug or vaccine.
The seventh aspect of the present invention provides an apparatus for predicting a tumor neoantigen of a fusion gene.
Further, the apparatus comprises a memory for storing a program and a processor for executing the program to implement the prediction method according to the second aspect of the present invention.
An eighth aspect of the present invention provides a computer-readable storage medium.
Further, the computer readable storage medium includes a program, which is executable by a processor to perform the prediction method according to the second aspect of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, some terms are explained as follows.
The term "tumor neoantigen" as used herein refers to a neoantigen produced by a tumor cell as a result of a genetic change, which is recognized by an immune cell (via a T cell receptor) and results in activation of that immune cell. During the development of cancer cells, many gene mutations arise, and some of these mutations give rise to proteins that are absent from normal tissues and normal cells; such proteins may activate the immune system and cause it to attack the cancer cells. These abnormal proteins (abnormal antigens), which are produced by the genetic mutation of cancer cells and which activate the immune system (are recognized by immune cells), are tumor neoantigens. The tumor neoantigen is a key factor in stimulating the body's immune system to mount an initial immune response against tumor cells, and the screening and identification of tumor neoantigens is key to accelerating the development of individualized immunotherapy for tumor patients.
The term "fusion gene" as used herein refers to a new gene formed when partial or complete sequences of two different genes are fused together by some mechanism, such as genomic variation. In general, a fusion gene refers to a gene resulting from fusion at the genome level. However, fusion at the transcriptome level may also occur, primarily because the RNAs transcribed from two different genes are somehow fused together to form a new fused RNA, which may or may not encode a protein. A fusion gene produced at the genome level may or may not be expressed (e.g., because of disruption of the promoter region or for other reasons), depending on the fusion. Fusion genes are mainly produced by the following three mechanisms: (1) Chromosomal translocation. For example, segments on two chromosomes are exchanged with each other, resulting in the fusion of genes located on the two chromosomes; (2) Interstitial deletion. For example, the segment lying between two genes on a chromosome is deleted, which finally leads to the fusion of the two genes; (3) Chromosomal inversion. For example, a segment between genes on a chromosome is inverted, which eventually results in the fusion of the genes.
The invention has the advantages and beneficial effects that:
(1) The invention constructs a scoring function for the first time, which assigns weights to each of the key factors influencing the fusion gene tumor neoantigen and improves the reliability of the prediction result.
(2) The invention provides a prediction method for predicting the fusion gene tumor neoantigen only by using the RNA sequencing data of the tumor tissue, and the method does not depend on other data such as DNA and the like, thereby greatly shortening the prediction time.
(3) The invention aims at the tumor fusion gene to realize high-throughput and high-accuracy prediction of the tumor neoantigen of the fusion gene.
(4) The multi-step quality control and comprehensive analysis provided by the invention greatly improve the accuracy of the result, improve the verification rate of the neoantigens, shorten the application period, and ensure the reliability of the result.
(5) The prediction method provided by the invention can be applied to various cancer species, and the prediction of the fusion gene tumor neoantigen can be realized without distinguishing the cancer species.
(6) The invention provides fusion gene tumor neoantigens predicted and screened by the prediction method, which comprise RBP4_FRA10AC1, AHNAK_RPS11, SMURF1_KPNA7 and STRN3_HECTD1.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and accompanying drawings. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted or replaced with other elements, materials, methods in different instances.
Example 1 Prediction of fusion gene tumor neoantigens
The flow chart of the method for predicting the tumor neoantigen of the fusion gene is shown in figure 1. The specific process is as follows:
1. Material preparation
Tumor tissue was taken from a liver cancer patient, and RNA-seq sequencing of the tumor tissue was completed on a high-throughput sequencing platform such as Illumina.
2. Data quality control
Quality control was performed on the raw RNA-seq fastq data with FastQC software to obtain the quality-controlled (clean) data.
3. Data comparison
The quality-controlled RNA data were aligned to the reference genome with hisat2 software to obtain a bam file of the tumor RNA data.
4. Bam file processing
The aligned bam file needs further processing: the RNA data bam file is subjected to sorting and quality control processing to obtain the processed RNA-bam file.
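The alignment and bam processing steps above might be driven from a script along the following lines. This is a minimal sketch under stated assumptions: the text names hisat2 but does not name the tool used for sorting, so the use of samtools here is an assumption, and the function name, file names, and index prefix are hypothetical. The function only builds the command lines and does not run them.

```python
def build_alignment_commands(index_prefix, fq1, fq2, sample, threads=4):
    """Build (but do not run) command lines for alignment and bam processing.

    index_prefix: path prefix of a prebuilt hisat2 index (hypothetical path).
    fq1, fq2:     quality-controlled paired-end fastq files.
    sample:       sample name used to derive output file names (hypothetical).
    """
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    return [
        # Align the paired-end reads to the reference genome with hisat2.
        ["hisat2", "-p", str(threads), "-x", index_prefix,
         "-1", fq1, "-2", fq2, "-S", sam],
        # Assumption: samtools performs the coordinate sorting of the alignments.
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, sam],
        # Index the sorted bam so downstream tools can access it by position.
        ["samtools", "index", sorted_bam],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a system where both tools are installed; keeping the commands as lists avoids shell quoting problems with file names.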
5. Detection of fusion genes
The fusion genes in the tumor tissue were detected using STAR-Fusion software.
6. Quantification of gene expression
Calculation of gene expression level was carried out using RSEM.
7. Fusion gene annotation
The fusion gene obtained by the detection was annotated using AGFusion.
8. Fusion polypeptide extraction
Based on the genetic mutation and somatic mutation information obtained in the above steps, the polypeptides at the mutation site are extracted comprehensively and accurately, and the corresponding polypeptide sequences of the normal wild-type genotype are extracted as well. The polypeptide extraction uses a sliding window: windows with a length of 8-15 amino acids are slid step by step over the positions upstream and downstream of the mutation site to extract polypeptide sequences containing the mutant amino acid, with a step length of 1.
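The sliding-window extraction can be sketched as follows. This is an illustrative sketch, not the implementation used in the example: the function name and the 0-based `junction_index` convention are hypothetical, and it assumes that for a fusion the "mutation site" is the fusion junction, so that only windows containing residues from both sides of the junction are kept.

```python
def junction_peptides(fusion_protein, junction_index, min_len=8, max_len=15):
    """Extract every peptide of length min_len..max_len that spans the junction.

    fusion_protein: amino acid sequence of the fusion protein.
    junction_index: 0-based index of the first residue contributed by the
                    downstream gene (hypothetical convention).
    The window moves one residue at a time (step length 1).
    """
    peptides = []
    for length in range(min_len, max_len + 1):
        # The window must contain the last upstream residue (junction_index - 1)
        # and the first downstream residue (junction_index).
        first_start = max(0, junction_index - length + 1)
        last_start = min(junction_index - 1, len(fusion_protein) - length)
        for start in range(first_start, last_start + 1):
            peptides.append(fusion_protein[start:start + length])
    return peptides
```

With a toy sequence of ten "A" residues followed by ten "C" residues and `junction_index=10`, every returned peptide contains both letters, and the 8-residue window yields 7 peptides.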
9. MHC molecule type identification
Based on the RNA sequencing data, the MHC class I and MHC class II molecular types were identified using seq2HLA, and the HLA typing of the tumor patient was: HLA-A11:01, HLA-A26:01, HLA-B40:01, HLA-B38:01, HLA-C07:02, and HLA-C12:03.
10. HLA affinity prediction
Based on the polypeptide sequences and HLA types obtained in the above steps, a comprehensive prediction was carried out using multiple software tools (NetMHCpan, NetMHCIIpan, NetMHC and NetMHCII), and each tool yields a corresponding IC50 value as its affinity prediction result.
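Combining the per-software IC50 values into a single value per peptide and HLA allele could look like the sketch below. It does not call or parse the output of any of the prediction tools; the tuple layout of the input records and the function name are hypothetical, and the use of the arithmetic mean is an assumption taken from the "mean(IC50[i:i+n])" notation of the scoring function.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ic50(predictions):
    """Average the IC50 values (nM) reported by different tools.

    predictions: iterable of (peptide, hla_allele, tool_name, ic50) tuples
                 (hypothetical record layout).
    Returns a dict mapping (peptide, hla_allele) to the mean IC50.
    """
    grouped = defaultdict(list)
    for peptide, allele, _tool, ic50 in predictions:
        grouped[(peptide, allele)].append(float(ic50))
    # Assumption: the combined value IC is the arithmetic mean over the tools.
    return {key: mean(values) for key, values in grouped.items()}
```

The resulting per-peptide value plays the role of IC in the scoring function used in the next step.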
11. Ordering high affinity mutant polypeptides
The Score value of each predicted fusion gene tumor neoantigen is calculated using the following scoring function, and the Score value is positively correlated with the reliability of the neoantigen:
Score = lg(IC/500) + [(JunctionReadCount + SpanningFragCount) × 2 / (UpstreamCov + DownstreamCov)]
Wherein IC = mean(IC50[i:i+n]), i.e. the average of the affinity (IC50) values predicted by the various software tools; JunctionReadCount is the number of junction reads supporting the fusion gene; SpanningFragCount is the number of spanning fragments supporting the fusion gene; UpstreamCov is the average single-base coverage of the gene upstream of the fusion event; and DownstreamCov is the average single-base coverage of the gene downstream of the fusion event.
The neoantigens were ranked by score value to obtain the fusion gene tumor neoantigens with high reliability (see Table 1).
Table 1 Score ranking of fusion gene tumor neoantigens
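The ranking step can be illustrated with a short sketch. The candidate names and feature values below are invented placeholders for illustration only and are not data from this example. The score is computed with the terms and sign as written in the text, assuming "lg" is the base-10 logarithm.

```python
import math


def score(ic, junction_reads, spanning_frags, upstream_cov, downstream_cov):
    """Score = lg(IC/500) + (junction + spanning) * 2 / (upstream + downstream)."""
    return (math.log10(ic / 500)
            + (junction_reads + spanning_frags) * 2 / (upstream_cov + downstream_cov))


def rank_by_score(candidates):
    """Return (name, score) pairs sorted from the highest score to the lowest."""
    scored = [(name, score(*features)) for name, features in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: (IC, junction reads, spanning fragments,
# upstream coverage, downstream coverage). Placeholder values, not real data.
candidates = {
    "FUSION_A": (500.0, 10, 5, 20.0, 10.0),
    "FUSION_B": (50.0, 2, 1, 30.0, 30.0),
    "FUSION_C": (5000.0, 40, 20, 20.0, 20.0),
}
ranking = rank_by_score(candidates)
```

Sorting in descending order mirrors the rule that a higher score indicates a more credible candidate.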
Example 2 Validation of candidate fusion gene tumor neoantigens
1. PCR validation of the fusion genes used for neoantigen prediction
To rule out false-positive fusion events, PCR validation was performed on the fusion genes predicted to have corresponding neoantigens.
The experimental method was as follows: primers spanning the fusion point were designed on the upstream and downstream genes of each fusion gene using Primer Premier 5 software, namely, the upstream primer lies in the gene upstream of the fusion event and the downstream primer lies in the gene downstream of the fusion event. The primer sequences obtained are shown in Table 2. The experimental procedures were carried out according to the instructions of the kit used.
The experimental results were as follows: 5 positive fusion genes were obtained (see FIG. 2), namely: RBP4_FRA10AC1, AHNAK_RPS11, DNPH1_CRIP3, SMURF1_KPNA7, STRN3_HECTD1.
2. ELISPOT validation of neoantigens corresponding to positive fusion events
The neoantigen polypeptides corresponding to the 5 fusion genes validated above were further subjected to immunogenicity verification by ELISPOT experiments.
The experimental method was as follows: ELISPOT validation was performed on the 5 neoantigens corresponding to the positive fusion events.
The experimental results were as follows: the results obtained for the positive control and the negative control are shown in FIGS. 3 and 4, respectively. The results showed 4 immunogenic positive reactions (including one weak positive) and 1 negative reaction (see FIGS. 5-9). Finally, the neoantigens corresponding to the 4 positive results were taken as candidate neoantigens, namely: RBP4_FRA10AC1, AHNAK_RPS11, SMURF1_KPNA7, STRN3_HECTD1.
Table 2 Sequences of the fusion-point-spanning primers designed on the genes upstream and downstream of each fusion gene
Primer (5'-3') | Sequence
RBP4_FRA10AC1_F1 | GGCACCTTCACAGACACC (SEQ ID NO.1)
RBP4_FRA10AC1_R1 | AGCTCTATCCTCTAGGAGCTAC (SEQ ID NO.2)
RBP4_FRA10AC1_F2 | TGGGCACCTTCACAGACAC (SEQ ID NO.3)
RBP4_FRA10AC1_R2 | TCACATTAAGGAGCGGAGG (SEQ ID NO.4)
AHNAK_RPS11_F | GGGGATGATGAGGAGTACC (SEQ ID NO.5)
AHNAK_RPS11_R | TGAAGCGCACTGTCTTGCTC (SEQ ID NO.6)
DGCR2_GSC2_F | GTTGCAGCCGAGAGTGTG (SEQ ID NO.7)
DGCR2_GSC2_R | GTACTCACGTCAGGATACTGG (SEQ ID NO.8)
DNPH1_CRIP3_F | GACAGGACGCTGTACGAGC (SEQ ID NO.9)
DNPH1_CRIP3_R | GGCCTGAGTTAGGGTGACC (SEQ ID NO.10)
PLA2G6_TMEM184B_F | CACTCAGATGGATGTCACCG (SEQ ID NO.11)
PLA2G6_TMEM184B_R | GCTGACGGAGATGTTGTAGA (SEQ ID NO.12)
SMURF1_KPNA7_F | GACTGGGCTCGGCTGGAAG (SEQ ID NO.13)
SMURF1_KPNA7_R | CGCCTGGAGGATGCAAGA (SEQ ID NO.14)
STRN3_HECTD1_F | CACTACATCCAGCACGAGTG (SEQ ID NO.15)
STRN3_HECTD1_R | CTGGCTGGGTAGTTACAGGA (SEQ ID NO.16)
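As a quick sanity check on primers such as those in Table 2, their length, GC content, and a rough melting temperature can be computed with a few lines of Python. This sketch is not part of the described workflow, which used Primer Premier 5. The function names are hypothetical, and the melting temperature uses the simple Wallace rule, Tm = 2 × (A + T) + 4 × (G + C), which is only a coarse estimate for short oligonucleotides.

```python
def gc_content(seq):
    """Fraction of G and C bases in a primer sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in degrees Celsius using the Wallace rule."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Primer RBP4_FRA10AC1_F1 from Table 2 (SEQ ID NO.1).
primer = "GGCACCTTCACAGACACC"
summary = (len(primer), round(gc_content(primer), 3), wallace_tm(primer))
```

The same two functions can be applied to each forward and reverse primer pair to compare their estimated melting temperatures before ordering or re-designing primers.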
The above-described embodiments are only for illustrating the present invention and are not to be construed as limiting the present invention. As will be understood by those of ordinary skill in the art: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Sequence listing
<110> Zhongsheng Kangyuan Biotechnology (Beijing) Co., Ltd
<120> tumor neoantigen prediction method based on gene fusion event and application thereof
<141> 2020-09-08
<160> 16
<170> SIPOSequenceListing 1.0
<210> 1
<211> 18
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 1
ggcaccttca cagacacc 18
<210> 2
<211> 22
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 2
agctctatcc tctaggagct ac 22
<210> 3
<211> 19
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 3
tgggcacctt cacagacac 19
<210> 4
<211> 19
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 4
tcacattaag gagcggagg 19
<210> 5
<211> 19
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 5
ggggatgatg aggagtacc 19
<210> 6
<211> 20
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 6
tgaagcgcac tgtcttgctc 20
<210> 7
<211> 18
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 7
gttgcagccg agagtgtg 18
<210> 8
<211> 21
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 8
gtactcacgt caggatactg g 21
<210> 9
<211> 19
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 9
gacaggacgc tgtacgagc 19
<210> 10
<211> 19
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 10
ggcctgagtt agggtgacc 19
<210> 11
<211> 20
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 11
cactcagatg gatgtcaccg 20
<210> 12
<211> 20
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 12
gctgacggag atgttgtaga 20
<210> 13
<211> 19
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 13
gactgggctc ggctggaag 19
<210> 14
<211> 18
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 14
cgcctggagg atgcaaga 18
<210> 15
<211> 20
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 15
cactacatcc agcacgagtg 20
<210> 16
<211> 20
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 16
ctggctgggt agttacagga 20