CN111192632B

CN111192632B - Method and device for extracting gene fusion immunotherapy new antigen by integrating DNA and RNA deep sequencing data

Info

Publication number: CN111192632B
Application number: CN201911293011.2A
Authority: CN
Inventors: 万季; 潘有东; 汪健; 徐韵婉; 宋麒; 刘鹏; 夏迪
Original assignee: Shenzhen Neocura Biotechnology Corp
Current assignee: Shenzhen Neocura Biotechnology Corp
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2023-06-13
Anticipated expiration: 2039-12-16
Also published as: CN111192632A

Abstract

The invention discloses a method and a device for extracting gene fusion neoantigens for immunotherapy by integrating DNA and RNA deep sequencing data. The method comprises the following steps: S10, obtaining a genomic gene fusion sequence of a sample; S20, obtaining a transcriptome gene fusion sequence of the sample; S30, constructing a gene fusion proteome; S40, obtaining the sample neoantigens. The tumor-specific neoantigens discovered by the scheme of the invention are all derived from gene fusion, so the screening range of neoantigens is expanded and the arsenal of neoantigen-based immunotherapy is enriched. By analyzing and integrating the whole-exome sequencing data and the transcriptome sequencing data of the tumor sample, gene fusion events in tumor tissue are comprehensively detected, the false positive rate of fusion-derived neoantigens is reduced, the effectiveness of neoantigen vaccines is further improved, and the method is of important significance for improving the effect of clinical immunotherapy.

Description

Method and device for extracting gene fusion immunotherapy new antigen by integrating DNA and RNA deep sequencing data

Technical Field

The invention relates to the field of tumor immunotherapy, in particular to a method and a device for extracting a novel antigen for gene fusion immunotherapy by integrating DNA and RNA deep sequencing data.

Background

The concepts and methods for treating malignant tumors have developed substantially over the last decades. Traditional tumor treatments include surgery, radiotherapy and mutation-based targeted therapy; however, these treatments have limitations in terms of toxic side effects, drug resistance and the like. In recent years, immunotherapy, which inhibits and kills tumor cells by activating the immune system, has seen new breakthroughs. Existing immunotherapeutic approaches can be divided into three categories according to their mechanism of action: (1) immune checkpoint inhibitors, which activate the immune system by blocking its inhibitory pathways; (2) adoptive cellular immunotherapy, which modifies T lymphocytes to recognize antigens; and (3) neoantigen vaccine immunotherapy, which identifies tumor-tissue-specific antigens and prepares polypeptide or mRNA vaccines based on the predicted antigens for administration in vivo. Compared with the other two types of immunotherapy, neoantigen vaccine immunotherapy is not limited to specific cancer types and has low toxic side effects. Prediction of neoantigens relies on whole-exome sequencing of DNA and transcriptome sequencing of RNA from tissue samples to predict mutant polypeptides. Existing procedures generally consider mutant polypeptides resulting from DNA point mutations and small indels. Gene fusion is also an important source of mutant polypeptides. However, since gene fusion identification based on a single data source (DNA or RNA) generally has a high false positive rate, predicting fusion-derived neoantigens requires richer data and a stringent screening procedure to ensure the efficacy of the neoantigen vaccine. Therefore, integrating multiple data types to extract the neoantigens generated by gene fusion is of important significance for expanding the screening range of neoantigens and improving the effect of clinical application.

Disclosure of Invention

Aiming at the problems, the invention comprehensively considers the possibility of producing mutant polypeptide by fusion transcription and translation of tumor specific genes, and develops a bioinformatics method for obtaining tumor specific new antigens.

In a first aspect, the present invention provides a method for extracting a gene fusion immunotherapeutic neoantigen from deep sequencing data of integrated DNA and RNA, comprising the steps of:

S10, obtaining a genomic gene fusion sequence of a sample;

S20, obtaining a transcriptome gene fusion sequence of the sample;

S30, constructing a gene fusion proteome;

S40, obtaining the sample neoantigens.

In some embodiments of the invention, obtaining the genomic gene fusion sequence of the sample is based on whole-exome sequencing.

In some embodiments of the invention, obtaining the transcriptome gene fusion sequence of the sample is based on transcriptome sequencing.

In some embodiments of the invention, obtaining the genomic gene fusion sequence of the sample comprises the steps of:

S101, detecting genomic structural variation of a tumor sample;

S102, screening gene fusion events;

S103, obtaining a genomic gene fusion sequence.

In some embodiments of the invention, obtaining the transcriptome gene fusion sequence of the sample comprises the steps of:

S201, detecting transcriptome gene fusion events of a tumor sample;

S202, obtaining a transcriptome gene fusion sequence.

In some embodiments of the invention, the step S30 comprises performing reading-frame translation on the obtained genomic gene fusion sequence and transcriptome gene fusion sequence, respectively, to obtain gene fusion protein sequences, i.e., a gene fusion proteome;

preferably, when the reading-frame translation reaches the breakpoint position where fusion occurs, it is judged whether frame-shift translation occurs; if frame-shift translation occurs, all protein sequences after the breakpoint position are sources of potential neoantigen peptides, and if it does not occur, only sequences near the breakpoint can generate neoantigen peptides.

In some embodiments of the present invention, the step S30 generates peptide fragment sequences of a specified length as required;

preferably, the default peptide fragment length is 9 to 12 amino acids.

In some embodiments of the invention, the step S40 includes the steps of:

S401, identifying human leukocyte antigen (HLA) typing;

S402, predicting peptide affinity;

S403, screening the sample neoantigens based on the integrated peptide fragment information.

In some embodiments of the invention, the sample is tumor tissue, preferably human tumor tissue.

In some embodiments of the invention, in step S102, a gene fusion event is selected in which the breakpoint position is located within the gene, rather than in the intergenic region.

In some embodiments of the present invention, in the step S103, the gene sequences are extracted and spliced according to the breakpoint positions of the upstream and downstream genes involved in the gene fusion, respectively;

preferably, the method comprises the following steps:

S1031, determining the breakpoint positions of the upstream and downstream genes;

S1032, judging whether each breakpoint occurs in an exon region or an intron region;

S1033, judging which transcripts of the gene are affected by the breakpoint;

S1034, pairing each affected upstream gene transcript with each affected downstream gene transcript one by one, and obtaining a complete gene fusion transcript sequence according to conventional transcription rules.

In a second aspect, the present invention provides an apparatus for extracting a gene fusion immunotherapeutic neoantigen from deep sequencing data integrating DNA and RNA, comprising:

(1) A first unit for obtaining a genomic gene fusion sequence of a sample;

(2) A second unit for obtaining a transcriptome gene fusion sequence of the sample;

(3) A third unit for constructing a specific gene fusion proteome;

(4) And a fourth unit for obtaining a sample neoantigen.

The invention has the beneficial effects that:

Compared with the prior art, the scheme of the invention has the following advantages:

1. In terms of source, the tumor-specific neoantigens discovered by the scheme of the invention are all derived from gene fusion, and gene fusion events are widespread in different types of tumors, whereas current methods mainly obtain neoantigens by identifying somatic mutations. Therefore, the invention expands the screening range of neoantigens and enriches the arsenal (the "ammunition library") of neoantigen-based immunotherapy.

2. According to the invention, through analyzing and integrating the whole exome sequencing data and the transcriptome sequencing data of the tumor sample, the gene fusion event in tumor tissues is comprehensively detected, the false positive rate of the new antigen generated by fusion is reduced, the effectiveness of the new antigen vaccine is further improved, and the method has important significance in improving the clinical immunotherapy effect.

Drawings

FIG. 1 is a flow chart of a method for extracting a gene fusion immunotherapeutic neoantigen from deep sequencing data of integrated DNA and RNA according to one embodiment of the invention;

FIG. 2 is a schematic representation of the gene fusion sequence in a method for extracting a novel antigen for gene fusion immunotherapy by integrating DNA and RNA deep sequencing data according to an embodiment of the invention.

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention.

Before the embodiments of the invention are explained in further detail, it is to be understood that the invention is not limited in its scope to the particular embodiments described below; it is also to be understood that the terminology used in the examples of the invention is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the invention.

In order that those skilled in the art will better understand the present invention, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The terms "first," "second," "again," "then," "next," and the like as used in the specific embodiments herein are not intended to be limiting of the order.

As shown in FIG. 1, the left hand side is a computer-implemented flow chart for obtaining genomic fusion sequences of tumor tissue based on whole-exome sequencing in accordance with an embodiment of the present invention. As shown in fig. 1, the method includes the following steps performed by a processor:

S101, detecting genomic structural variation of a tumor sample.

First, genomic sequencing raw data is aligned to a human reference genome using genomic alignment software bwa; then, the bam file generated in the above step is used as an input, and genomic structural variation is detected by using variation detection software lumpy-sv.

S102, screening a gene fusion event.

Specifically, the structural variations are genotyped using the SVTyper software and the gene fusion events among them are screened out; the fusion genes are then annotated using the pyGeno package, and gene fusion events whose breakpoints lie inside genes rather than in intergenic regions are selected.
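The genic-breakpoint filter of S102 can be sketched as follows. This is a minimal illustration only: the in-memory annotation `GENES`, the gene names and the coordinates are hypothetical stand-ins for a real annotation lookup (the text uses pyGeno for this), and the requirement that the two breakpoints fall in different genes is an assumption of the sketch.

```python
from typing import List, NamedTuple, Optional, Tuple

class Gene(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str

# Hypothetical annotation; a real run would query a genome annotation instead.
GENES = [
    Gene("chr1", 1000, 5000, "GENE_A"),
    Gene("chr2", 20000, 30000, "GENE_B"),
]

Breakpoint = Tuple[str, int]  # (chromosome, position)

def gene_at(chrom: str, pos: int, genes: List[Gene] = GENES) -> Optional[str]:
    """Return the name of the gene containing the position, or None if intergenic."""
    for g in genes:
        if g.chrom == chrom and g.start <= pos <= g.end:
            return g.name
    return None

def keep_genic_fusions(events: List[Tuple[Breakpoint, Breakpoint]]) -> List[Tuple[str, str]]:
    """Keep events whose two breakpoints both fall inside genes (assumed distinct)."""
    kept = []
    for (chrom_up, pos_up), (chrom_down, pos_down) in events:
        up = gene_at(chrom_up, pos_up)
        down = gene_at(chrom_down, pos_down)
        if up is not None and down is not None and up != down:
            kept.append((up, down))
    return kept

events = [
    (("chr1", 1500), ("chr2", 25000)),  # both breakpoints inside genes
    (("chr1", 1500), ("chr2", 40000)),  # downstream breakpoint is intergenic
]
print(keep_genic_fusions(events))
```

A linear scan is enough for a sketch; an interval index would be the natural replacement for genome-scale annotation.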

S103, obtaining a genome gene fusion sequence.

Gene sequences are extracted according to the breakpoint positions of the upstream and downstream genes involved in the gene fusion, and are then spliced together. Specifically, after the breakpoint positions of the upstream and downstream genes are determined, it is first determined whether each breakpoint occurs in an exon region or an intron region. Since many genes have multiple transcripts, it is also necessary to determine which transcripts of the gene are affected by the breakpoint. Each affected upstream gene transcript is then paired with each affected downstream gene transcript one by one, and a complete gene fusion transcript sequence is obtained according to conventional transcription rules, which facilitates the subsequent reading-frame translation. These gene fusion transcript sequences are collectively referred to herein as genomic gene fusion sequences.
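The breakpoint classification and transcript pairing described for S103 can be sketched as below. The transcript identifiers and exon coordinates are hypothetical, and each transcript is reduced to a sorted list of exon intervals; this is an illustration of the logic, not the script used in the embodiment.

```python
from itertools import product
from typing import Dict, List, Tuple

Exons = List[Tuple[int, int]]  # sorted (start, end) pairs, 1-based inclusive

def locate_breakpoint(exons: Exons, pos: int) -> str:
    """Return 'exon', 'intron', or 'outside' for a position relative to one transcript."""
    if not exons or pos < exons[0][0] or pos > exons[-1][1]:
        return "outside"
    for start, end in exons:
        if start <= pos <= end:
            return "exon"
    return "intron"

def affected_transcripts(transcripts: Dict[str, Exons], pos: int) -> Dict[str, str]:
    """Map each transcript spanning the breakpoint to the region type it is hit in."""
    hits = {}
    for tx_id, exons in transcripts.items():
        region = locate_breakpoint(exons, pos)
        if region != "outside":
            hits[tx_id] = region
    return hits

def pair_transcripts(up_hits: Dict[str, str], down_hits: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair every affected upstream transcript with every affected downstream transcript."""
    return list(product(sorted(up_hits), sorted(down_hits)))

# Hypothetical upstream gene with three transcripts and downstream gene with one.
upstream = {
    "TX1": [(100, 200), (300, 400), (500, 600)],
    "TX2": [(100, 200), (500, 600)],
    "TX3": [(700, 800)],
}
downstream = {"TXA": [(1000, 1100), (1300, 1400)]}

up_hits = affected_transcripts(upstream, 350)
down_hits = affected_transcripts(downstream, 1200)
print(up_hits)
print(pair_transcripts(up_hits, down_hits))
```

Note that the same genomic breakpoint can fall in an exon of one transcript and an intron of another, which is why the region type is recorded per transcript.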

The middle part of fig. 1 is a flow chart of a computer-implemented transcriptome sequencing-based acquisition of a transcriptome gene fusion sequence of a tumor tissue according to an embodiment of the present invention. As shown in fig. 1, the method includes the following steps performed by a processor:

S201, detecting transcriptome gene fusion events of a tumor sample.

First, the raw transcriptome sequencing data are aligned to the human reference genome with the sequence alignment software STAR; gene fusions are then detected using the Arriba software.

S202, obtaining a transcriptome gene fusion sequence.

Gene sequences are extracted according to the breakpoint positions of the upstream and downstream genes involved in the gene fusion, and are then spliced together. It should be noted that, since RNA-seq sequences transcripts rather than the genome, the gene fusion breakpoint determined here is not the position in the genome where the fusion actually occurred, but a boundary in the mature mRNA sequence after transcription. Therefore, when the gene fusion sequence is generated, the sequences at the breakpoints of the upstream and downstream genes are joined directly, and transcription rules do not need to be considered. The rest of the processing is similar to step S103: the transcripts affected by the breakpoint positions are determined, and complete gene fusion transcript sequences are obtained. These gene fusion transcript sequences are collectively referred to herein as transcriptome gene fusion sequences.
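Because the RNA-level breakpoints are boundaries in mature mRNA, the joining step reduces to a direct concatenation. A minimal sketch, assuming the breakpoints have already been converted to 1-based positions within each transcript sequence (the function name and the toy sequences are hypothetical):

```python
def join_at_mrna_breakpoints(up_mrna: str, up_last: int, down_mrna: str, down_first: int) -> str:
    """Join two mature mRNA sequences directly at their breakpoints.

    up_last:    1-based index of the last upstream base kept.
    down_first: 1-based index of the first downstream base kept.
    """
    if not 1 <= up_last <= len(up_mrna):
        raise ValueError("upstream breakpoint lies outside the transcript")
    if not 1 <= down_first <= len(down_mrna):
        raise ValueError("downstream breakpoint lies outside the transcript")
    return up_mrna[:up_last] + down_mrna[down_first - 1:]

# Toy sequences: keep the first 6 upstream bases and the downstream bases from position 4 on.
fusion = join_at_mrna_breakpoints("ATGGCCAAA", 6, "TTTGGGCCCTAA", 4)
print(fusion)
```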

The lower left part of FIG. 1 is a computer-implemented flow chart for constructing a tumor-specific gene fusion protein set according to an embodiment of the present invention. As shown in fig. 1, the method includes the following steps performed by a processor:

S301, constructing a tumor-specific gene fusion proteome.

Reading-frame translation is performed on the genomic gene fusion sequences and the transcriptome gene fusion sequences obtained in steps S103 and S202, respectively, to obtain gene fusion protein sequences, i.e., the tumor-specific gene fusion proteome. In order to obtain peptide fragment sequences that are not present in normal human cells, when the reading-frame translation reaches the breakpoint position where fusion occurs, it is judged whether frame-shift translation occurs: if it occurs, all protein sequences after the breakpoint position are sources of potential neoantigen peptides; if it does not occur, only sequences near the breakpoint can generate neoantigen peptides. Finally, peptide fragment sequences of a specified length are generated as required. In the present invention, the default peptide fragment length is 9 to 12 amino acids.
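The frame-shift test and the peptide extraction of S301 can be sketched as follows. The sketch assumes upper-case coding sequence starting at the start codon and uses the standard genetic code; the function names, the definition of the junction index and the toy inputs are illustrative assumptions, not taken from the embodiment.

```python
from itertools import product
from typing import Iterable, List

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cds: str) -> str:
    """Translate from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def is_frameshift(up_coding_len: int, down_coding_offset: int) -> bool:
    """True if the downstream part is read out of its native frame.

    up_coding_len:      number of coding bases kept from the upstream gene.
    down_coding_offset: 0-based offset, within the downstream CDS, of the first base kept.
    """
    return up_coding_len % 3 != down_coding_offset % 3

def candidate_peptides(protein: str, junction_aa: int, frameshift: bool,
                       lengths: Iterable[int] = range(9, 13)) -> List[str]:
    """Enumerate peptides that may be novel.

    junction_aa is the 0-based index of the first residue encoded at or after the junction.
    In-frame fusions keep only windows covering that residue; frame-shifted fusions keep
    every window that reaches it or lies beyond it.
    """
    peptides = set()
    for k in lengths:
        for start in range(len(protein) - k + 1):
            end = start + k  # exclusive
            if end <= junction_aa:
                continue  # entirely upstream of the junction
            if not frameshift and start > junction_aa:
                continue  # entirely downstream and still in the native frame
            peptides.add(protein[start:end])
    return sorted(peptides)

print(translate("ATGGCCAAAGGGCCCTAA"))
print(is_frameshift(9, 3), is_frameshift(10, 3))
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
print(len(candidate_peptides(toy_protein, 10, False, [9])))
```

Windows that start exactly at the junction may still match the wild-type downstream protein; such peptides would be removed later when candidates are compared against the normal proteome.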

The right part of FIG. 1 is a schematic flow chart for obtaining a novel antigen generated based on tumor-specific gene fusion according to another embodiment of the present invention. As shown in fig. 1, the method includes the following steps performed by a processor:

S401, typing human leukocyte antigen (HLA) molecules.

Human leukocyte antigen molecule typing was calculated using leukocyte antigen molecule typing software HLA-LA.

S402, predicting peptide fragment affinity.

Affinity prediction was performed on tumor specific gene fusion proteomes using the software netMHCpan-4.0 and Human Leukocyte Antigen (HLA) molecular typing results.

S403, integrating and screening the tumor specific gene fusion neoantigen.

Peptide fragment information is integrated. Specifically, the source of each candidate peptide fragment is determined, including the upstream and downstream genes involved in the fusion and the corresponding transcript identifiers, and whether the gene fusion event comes from DNA sequencing or RNA sequencing; each peptide fragment is further annotated with its affinity to the HLA type identified in step S401, the expression level of the fusion gene in RNA sequencing, the rule frequency corresponding to the gene fusion event in DNA sequencing, the specific position of the peptide fragment in the fusion protein sequence, and the like. In the screening phase, the candidate peptide fragments are first compared with the normal human proteome, and sequences present in the normal proteome are filtered out; the candidate neoantigens are then ranked and screened using different indexes with corresponding weights to obtain the final tumor-specific gene fusion neoantigens. Specific indexes include the affinity of the peptide fragment to HLA, the expression level of the fusion gene, the sequence of the fusion genes and the physicochemical properties of the peptide fragment.
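The two screening stages of S403 can be sketched as a filter followed by a weighted ranking. The text does not disclose the actual weights or the scaling of each index, so the feature names, the weights and the assumption that every feature is pre-scaled to [0, 1] with higher meaning better are all hypothetical.

```python
from typing import Dict, Iterable, List

def remove_normal_peptides(candidates: List[dict], normal_proteome: Iterable[str]) -> List[dict]:
    """Drop candidates whose peptide occurs anywhere in a normal protein sequence."""
    proteins = list(normal_proteome)
    return [c for c in candidates if not any(c["peptide"] in p for p in proteins)]

def rank_candidates(candidates: List[dict], weights: Dict[str, float]) -> List[dict]:
    """Sort candidates by a weighted sum of their (pre-scaled) feature values, best first."""
    def score(c: dict) -> float:
        return sum(w * c[name] for name, w in weights.items())
    return sorted(candidates, key=score, reverse=True)

# Hypothetical data: one normal protein and three candidate peptides.
normal_proteome = ["MKTAYIAKQRQISFVKSHFSRQ"]
candidates = [
    {"peptide": "AYIAKQRQI", "affinity": 0.8, "expression": 0.8},  # present in normal protein
    {"peptide": "AYIAKQWWW", "affinity": 0.9, "expression": 0.2},
    {"peptide": "KQRQGGGLL", "affinity": 0.5, "expression": 0.9},
]
weights = {"affinity": 0.6, "expression": 0.4}  # assumed weights

novel = remove_normal_peptides(candidates, normal_proteome)
ranked = rank_candidates(novel, weights)
print([c["peptide"] for c in ranked])
```

An exact substring match is the simplest reading of "sequences existing in the normal proteome"; for a full proteome a precomputed set of all 9- to 12-mers would scale better.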

In some embodiments, specific parameters of the software used in the present invention are as follows:

Genomic sequencing data are aligned using bwa. An example command is:

bwa mem \
    -R '@RG\tID:sample\tLB:library\tSM:sample' \
    -t 20 \
    -M \
    bwa_index \
    sample_1.DNA.fq.gz sample_2.DNA.fq.gz

wherein -R specifies the read group header line written to the alignment output, -t specifies the number of threads, -M marks shorter split hits as secondary alignments, bwa_index is the index prefix of the reference genome, and sample_1.DNA.fq.gz and sample_2.DNA.fq.gz are the paired-end raw sequencing input files.

Structural variations are detected using lumpy-sv, which when run, requires some intermediate files to be generated with the commands it provides. Example commands for this are:

First, the lumpy_filter command is used to extract the split (soft-clipped) read alignments (sample.splt.bam) and the discordant read alignments (sample.disc.bam) from the bam file.

lumpy_filter sample.bam sample.splt.bam sample.disc.bam

Then, the insert length average and standard deviation of the sequencing data were calculated.

samtools view sample.bam | python pairend_distro.py \
    -r readlen \
    -X 4 \
    -N 10000 \
    -o sample.lib1.histo

wherein samtools view sample.bam reads the bam file and passes it as standard input to the pairend_distro.py script, which is provided with the lumpy-sv software; -r indicates the sequencing read length, -X indicates the threshold in standard deviations, -N indicates the number of read records the program takes from standard input, and -o indicates the output file name.
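To make the quantities concrete, the following sketch computes an insert-length mean and standard deviation from SAM text records. It is not the lumpy-sv script and applies no outlier handling; it simply assumes tab-separated SAM lines and takes the template length from column 9, counting each pair once via the mate with a positive value.

```python
import statistics
from typing import Iterable, Tuple

def insert_size_stats(sam_lines: Iterable[str], max_reads: int = 10000) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of positive template lengths."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        tlen = int(fields[8])  # SAM column 9: observed template length
        if tlen > 0:           # the leftmost mate of a pair carries the positive value
            sizes.append(tlen)
        if len(sizes) >= max_reads:
            break
    return statistics.mean(sizes), statistics.pstdev(sizes)

def make_record(name: str, tlen: int) -> str:
    """Build a minimal, made-up SAM record with the given template length."""
    return "\t".join([name, "99", "chr1", "100", "60", "100M", "=", "400", str(tlen), "*", "*"])

records = ["@HD\tVN:1.6"] + [make_record(f"r{i}", t) for i, t in enumerate([400, -400, 500, 600])]
mean, stdev = insert_size_stats(records)
print(mean, round(stdev, 2))
```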

Finally, structural variations were detected using the lumpy command.

lumpy \
    -mw 4 \
    -tt 0 \
    -pe id:sample,bam_file:sample.disc.bam,histo_file:sample.lib1.histo,mean:500,stdev:50,read_length:readlen,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
    -sr id:sample,bam_file:sample.splt.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
    > sample.vcf

wherein -mw indicates the minimum weight required for each structural variation event, -tt indicates the trim threshold, -pe introduces the parameters for processing the discordant reads bam file, -sr introduces the parameters for processing the split (soft-clipped) reads bam file, and sample.vcf is the output file. Specifically, id represents the sample name, bam_file represents the corresponding bam file, histo_file represents the insert length distribution file produced in the previous step, mean represents the mean insert length, stdev represents the standard deviation of the insert length, read_length represents the sequencing read length, discordant_z represents the standard score (z-score) threshold, back_distance represents the number of bases by which a structural variation site is extended, weight represents the sample weight, and min_mapping_threshold represents the minimum mapping quality.

Structural variations are genotyped using SVTyper. An example command is:

svtyper \
    -i sample.vcf \
    -B sample.bam \
    -o sample.svtyper.vcf

wherein -i indicates the structural variation result generated by lumpy-sv in the previous step, -B indicates the bwa alignment file, and -o indicates the output file name.

A script is written to process the SVTyper output and screen out gene fusion events, which are then annotated using the pyGeno package to obtain the positions in the genome at which the gene fusions occur. Specifically, it is inferred in which two genes the upstream and downstream breakpoints of each gene fusion are located. Fusions whose breakpoints lie in intergenic regions are not considered, because their transcription cannot be accurately deduced. The classes to be imported in the script are:

from pyGeno.Transcript import Transcript
from pyGeno.Genome import Genome
from pyGeno.Gene import Gene
from pyGeno.Exon import Exon
from pyGeno.Chromosome import Chromosome

It will be readily appreciated that a gene fusion sequence consists of two parts: the former part, or 5' end sequence, is a partial sequence from one gene (the upstream gene), and the latter part, or 3' end sequence, is a partial sequence from the other gene (the downstream gene). In order to obtain a genomic gene fusion sequence, a script is written to extract the partial sequences of the upstream and downstream genes and splice them together. Specifically, it is first determined whether the upstream and downstream breakpoints are located in an exon region or an intron region of the respective gene, and the fusion is then handled as one of four cases according to the regions in which the breakpoints lie, namely exon-exon, intron-intron, exon-intron and intron-exon, as shown in FIG. 2. Exon-exon means that the upstream and downstream breakpoints are both located in exon regions; the gene fusion sequence is formed by joining the 5' exon sequence before the upstream gene breakpoint to the 3' exon sequence after the downstream gene breakpoint. Intron-intron means that the upstream and downstream breakpoints are both located in intron regions; since intron sequences are not present in the mature mRNA, the gene fusion sequence is formed by joining the 5' exon sequence before the upstream gene breakpoint to the 3' exon sequence after the downstream gene breakpoint, and does not contain the intron sequence at the breakpoint. The exon-intron and intron-exon cases are slightly more complex: two transcript sequences can be deduced according to transcription rules, and in order to select neoantigens comprehensively, both gene fusion sequences are output (type 1 and type 2 in FIG. 2). The type 1 sequence omits the remaining portion of the exon that contains the breakpoint, and the type 2 sequence retains the remaining portion of the intron that contains the breakpoint.
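The exon-exon and intron-intron cases can be sketched with two small helpers. This is a simplified illustration that assumes plus-strand genes, 1-based inclusive coordinates and toy sequences; the mixed exon-intron and intron-exon cases, for which the text outputs two variants (type 1 and type 2), are deliberately not modelled.

```python
from typing import List, Tuple

Exons = List[Tuple[int, int]]  # sorted (start, end) pairs, 1-based inclusive

def upstream_part(seq: str, exons: Exons, breakpoint: int) -> str:
    """Exonic bases of the upstream transcript up to and including the breakpoint."""
    parts = []
    for start, end in exons:
        if start > breakpoint:
            break  # this exon and all later ones lie beyond the breakpoint
        parts.append(seq[start - 1:min(end, breakpoint)])
    return "".join(parts)

def downstream_part(seq: str, exons: Exons, breakpoint: int) -> str:
    """Exonic bases of the downstream transcript from the breakpoint onward."""
    parts = []
    for start, end in exons:
        if end < breakpoint:
            continue  # this exon lies entirely before the breakpoint
        parts.append(seq[max(start, breakpoint) - 1:end])
    return "".join(parts)

# Toy loci: each has exons at 1-4 and 9-12 with an intron at 5-8.
up_seq, up_exons = "AAAACCCCGGGGTTTT", [(1, 4), (9, 12)]
down_seq, down_exons = "TTTTGGGGCCCCAAAA", [(1, 4), (9, 12)]

# Exon-exon: both breakpoints inside exons, partial exons are joined.
exon_exon = upstream_part(up_seq, up_exons, 10) + downstream_part(down_seq, down_exons, 3)
# Intron-intron: both breakpoints inside introns, only whole exons are joined.
intron_intron = upstream_part(up_seq, up_exons, 6) + downstream_part(down_seq, down_exons, 6)
print(exon_exon, intron_intron)
```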

Transcriptome sequencing data was aligned using STAR, example commands are:

STAR \
    --runThreadN 20 \
    --genomeDir star_index \
    --readFilesIn sample_1.RNA.fq.gz sample_2.RNA.fq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMunmapped Within \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 3 \
    --chimSegmentMin 10 \
    --chimOutType WithinBAM SoftClip \
    --chimJunctionOverhangMin 10 \
    --chimScoreMin 1 \
    --chimScoreDropMax 30 \
    --chimScoreJunctionNonGTAG 0 \
    --chimScoreSeparation 1 \
    --alignSJstitchMismatchNmax 5 -1 5 5 \
    --chimSegmentReadGapMax 3

wherein --runThreadN indicates the number of threads; --genomeDir indicates the index directory; --readFilesIn indicates the raw sequencing data read in; --readFilesCommand indicates the command used to read the input files; --outSAMtype BAM SortedByCoordinate indicates that the output is a coordinate-sorted BAM file; --outSAMunmapped Within indicates that unaligned reads are also written to the output file; --outFilterMultimapNmax indicates the maximum number of multiple alignments allowed; --outFilterMismatchNmax indicates the maximum number of mismatches allowed; --chimSegmentMin enables output of chimeric (fusion) alignments, with 10 being the minimum number of aligned bases in each chimeric segment; --chimOutType WithinBAM SoftClip indicates the output format of chimeric alignments; --chimJunctionOverhangMin indicates the minimum number of bases overhanging a chimeric junction; --chimScoreMin indicates the minimum score of the chimeric segments; --chimScoreDropMax indicates the maximum allowed drop of the total score of all chimeric segments relative to the read length; --chimScoreJunctionNonGTAG indicates the penalty applied when the bases at the chimeric junction are not of the "GT/AG" form; --chimScoreSeparation indicates the minimum difference between the best and the second-best chimeric scores; --alignSJstitchMismatchNmax indicates the maximum number of mismatches allowed when stitching splice junctions; --chimSegmentReadGapMax indicates the maximum number of bases in the gap between chimeric segments within a read.

Gene fusions are detected using the Arriba software. An example command is:

arriba \
    -x Aligned.out.bam -o fusions.tsv \
    -a reference.fa -g annotation.gtf \
    -b blacklist.tsv

wherein -x indicates the input bam file; -o indicates the output file; -a indicates the reference genome sequence; -g indicates the gtf annotation file; -b indicates a blacklist file used to reduce false positives.

Based on the gene fusion detection result, a script is written to extract the transcriptome gene fusion sequences. The process is substantially similar to the previously described acquisition of genomic gene fusion sequences, except that transcriptome sequencing reads mature mRNA, so the sequences at the breakpoints of the upstream and downstream fusion genes are joined directly without the need to infer the transcription process.

Finally, the genomic gene fusion sequences and the transcriptome gene fusion sequences obtained above are translated in reading frame into fusion protein sequences, and the potential neoantigen peptide fragment sequences are extracted to construct the tumor-specific gene fusion proteome.

Human leukocyte antigen molecule typing was calculated using leukocyte antigen molecule typing software HLA-LA, with the following exemplary commands:

HLA-LA.pl --BAM sample.bam \
    --graph PRG_MHC_GRCh38_withIMGT \
    --sampleID sample --maxThreads threads \
    --workingDir out_dir --picard_sam2fastq_bin SamToFastq.jar

wherein --BAM indicates the input bam file; --graph indicates the population reference graph; --sampleID indicates the unique sample identifier; --maxThreads indicates the maximum number of threads; --workingDir indicates the output path; --picard_sam2fastq_bin indicates the tool used to convert the bam file into fastq files.

Affinity prediction was performed on tumor specific gene fusion proteomes using the software netMHCpan-4.0 and Human Leukocyte Antigen (HLA) molecular typing results. Example commands are:

netMHCpan -BA -l 9 -a HLA_type \
    -f filename -inptype 1 -xls -xlsfile peptide.xls

wherein -BA indicates affinity prediction; -l indicates the peptide fragment length; -a indicates the human leukocyte antigen (HLA) type; -f indicates the input file; -inptype indicates the type of the input file, 0 being a fasta file and 1 being a list of peptide fragment sequences; -xls indicates that the output is written as an xls file; -xlsfile indicates the name of the output file.
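The hand-off from HLA typing to affinity prediction can be sketched as below. The sketch uses only the netMHCpan options listed in the text; the input call format (for example `A*02:01:01G`), the two-field allele name passed to `-a`, the comma-separated allele list and all file names are assumptions for illustration and should be checked against the tools actually installed.

```python
import re
import shlex
from typing import List

def to_netmhcpan_allele(hla_call: str) -> str:
    """Convert a typing call such as 'A*02:01:01G' to a two-field name such as 'HLA-A02:01'."""
    locus, fields = hla_call.split("*")
    first, second = (re.sub(r"\D", "", f) for f in fields.split(":")[:2])
    return f"HLA-{locus}{first}:{second}"

def netmhcpan_command(peptide_file: str, alleles: List[str], length: int, xls_file: str) -> List[str]:
    """Assemble the argument list for one affinity-prediction run on a peptide list."""
    return [
        "netMHCpan", "-BA",
        "-l", str(length),
        "-a", ",".join(alleles),
        "-f", peptide_file,
        "-inptype", "1",
        "-xls", "-xlsfile", xls_file,
    ]

calls = ["A*02:01:01G", "B*07:02:01G"]  # hypothetical typing result
alleles = [to_netmhcpan_allele(c) for c in calls]

# One run per default peptide length (9 to 12); file names are placeholders.
commands = [
    netmhcpan_command(f"peptides.{k}.txt", alleles, k, f"peptide.{k}.xls")
    for k in range(9, 13)
]
for cmd in commands:
    print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.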

A script is written to integrate the peptide fragment information, compare the peptide fragments with the normal human proteome, filter out peptide fragments present in the normal proteome, and rank and screen the candidate neoantigens using different indexes with corresponding weights to obtain the final tumor-specific gene fusion neoantigens.

Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and arranged in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive.

While the preferred embodiments and examples of the present invention have been described in detail, the present invention is not limited to the above-described embodiments and examples, and various changes may be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims

1. A method for extracting a gene fusion immunotherapeutic neoantigen by integrating deep sequencing data of DNA and RNA, comprising the steps of:

S10, obtaining a genomic gene fusion sequence of a sample;

S20, obtaining a transcriptome gene fusion sequence of the sample;

S30, constructing a gene fusion proteome;

S40, obtaining a sample neoantigen;

wherein obtaining the genomic gene fusion sequence of the sample comprises the following steps:

S101, detecting genomic structural variation of a tumor sample;

S102, screening gene fusion events;

S103, obtaining a genomic gene fusion sequence;

wherein obtaining the transcriptome gene fusion sequence of the sample comprises the following steps:

S201, detecting transcriptome gene fusion events of a tumor sample;

S202, obtaining a transcriptome gene fusion sequence;

the S30 comprises performing reading-frame translation on the obtained genomic gene fusion sequence and the obtained transcriptome gene fusion sequence, respectively, to obtain gene fusion protein sequences, namely a gene fusion proteome; when the reading-frame translation reaches the breakpoint position where fusion occurs, judging whether frame-shift translation occurs; if frame-shift translation occurs, all protein sequences after the breakpoint position are sources of potential neoantigen peptides, and if frame-shift translation does not occur, only sequences near the breakpoint can generate neoantigen peptides;

the step S40 includes the steps of:

S401, identifying human leukocyte antigen (HLA) typing;

S402, predicting peptide affinity;

2. The method of claim 1, wherein obtaining the genomic gene fusion sequence of the sample is based on whole-exome sequencing.

3. The method of claim 1, wherein obtaining the transcriptome gene fusion sequence of the sample is based on transcriptome sequencing.

4. A method according to any one of claims 1 to 3, wherein peptide fragment sequences of a specified length are generated in S30 as required.

5. The method according to claim 4, wherein in S30, the default peptide fragment length is 9 to 12 amino acids.

6. A method according to any one of claims 1 to 3, wherein the sample is tumour tissue.

7. The method of claim 6, wherein the sample is human tumor tissue.

8. A method according to any one of claims 1 to 3, wherein in S102 a gene fusion event is selected wherein the breakpoint position is located within the gene and not in the intergenic region.

9. A method according to any one of claims 1 to 3, wherein in S103, the gene sequences are extracted and spliced according to the breakpoint positions of the upstream and downstream genes involved in the gene fusion, respectively.

10. The method according to claim 9, comprising the steps of:

S1031, determining the breakpoint positions of the upstream and downstream genes;

S1033, judging which transcripts of the gene are affected by the breakpoint;

11. A device for extracting a gene fusion immunotherapeutic neoantigen by integrating deep sequencing data of DNA and RNA, comprising:

(1) A first unit for obtaining a genomic gene fusion sequence of a sample;

(3) A third unit for constructing a specific gene fusion proteome;

(4) A fourth unit for obtaining a sample neoantigen;

S101, detecting genomic structural variation of a tumor sample;

S102, screening gene fusion events;

S103, obtaining a genomic gene fusion sequence;

S201, detecting transcriptome gene fusion events of a tumor sample;

S202, obtaining a transcriptome gene fusion sequence;

the construction of the specific gene fusion proteome comprises performing reading-frame translation on the obtained genomic gene fusion sequence and the obtained transcriptome gene fusion sequence, respectively, to obtain gene fusion protein sequences, namely a gene fusion proteome; when the reading-frame translation reaches the breakpoint position where fusion occurs, judging whether frame-shift translation occurs; if frame-shift translation occurs, all protein sequences after the breakpoint position are sources of potential neoantigen peptides, and if frame-shift translation does not occur, only sequences near the breakpoint can generate neoantigen peptides;

the obtaining of the sample neoantigen comprises the following steps:

S401, identifying human leukocyte antigen (HLA) typing;

S402, predicting peptide affinity;