CN111192632B - Method and device for extracting gene fusion immunotherapy new antigen by integrating DNA and RNA deep sequencing data - Google Patents

Publication number
CN111192632B
Authority
CN
China
Prior art keywords
gene fusion
sample
obtaining
sequence
gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911293011.2A
Other languages
Chinese (zh)
Other versions
CN111192632A (en
Inventor
万季
潘有东
汪健
徐韵婉
宋麒
刘鹏
夏迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Neocura Biotechnology Corp
Original Assignee
Shenzhen Neocura Biotechnology Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Neocura Biotechnology Corp
Priority to CN201911293011.2A
Publication of CN111192632A
Application granted
Publication of CN111192632B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Landscapes

  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method and a device for extracting gene fusion neoantigens for immunotherapy by integrating DNA and RNA deep sequencing data. The method comprises the following steps: S10, obtaining a genomic gene fusion sequence of a sample; S20, obtaining a transcriptome gene fusion sequence of the sample; S30, constructing a gene fusion proteome; S40, obtaining the neoantigens of the sample. The tumor-specific neoantigens discovered by this scheme are all derived from gene fusion, which expands the screening range of neoantigens and enriches the arsenal of neoantigen-based immunotherapy. By analyzing and integrating the whole-exome sequencing data and the transcriptome sequencing data of the tumor sample, gene fusion events in tumor tissue are comprehensively detected and the false positive rate of fusion-derived neoantigens is reduced. This improves the effectiveness of neoantigen vaccines and is of important significance for improving the effect of clinical immunotherapy.

Description

Method and device for extracting gene fusion immunotherapy new antigen by integrating DNA and RNA deep sequencing data
Technical Field
The invention relates to the field of tumor immunotherapy, and in particular to a method and a device for extracting gene fusion neoantigens for immunotherapy by integrating DNA and RNA deep sequencing data.
Background
The concepts and methods for treating malignant tumors have developed substantially over the last decades. Traditional tumor treatments include surgery, radiotherapy, and mutation-based targeted therapy; however, these treatments have limitations such as toxic side effects and drug resistance. In recent years, immunotherapy, which activates the immune system to inhibit and kill tumor cells, has achieved new breakthroughs. Existing immunotherapeutic approaches can be divided into three categories according to their mechanism of action: (1) immune checkpoint inhibitors, which activate the immune system by blocking its inhibitory pathways; (2) adoptive cellular immunotherapy, which modifies T lymphocytes so that they recognize antigens; and (3) neoantigen vaccine immunotherapy, which identifies tumor-tissue-specific antigens and prepares polypeptide or mRNA vaccines based on the predicted antigens for reinfusion into the body. Compared with the other two categories, neoantigen vaccine immunotherapy is not limited to specific cancer types and has few toxic side effects. Prediction of neoantigens relies on whole-exome sequencing of DNA and transcriptome sequencing of RNA from tissue samples to predict mutant polypeptides. Existing pipelines generally consider only mutant polypeptides resulting from DNA point mutations and small insertions and deletions (indels). Gene fusion is another important source of mutant polypeptides. However, because gene fusion identification based on a single data source (DNA or RNA) generally has a high false positive rate, predicting fusion-derived neoantigens requires richer data and a stringent screening procedure to ensure the efficacy of the neoantigen vaccine.
Therefore, integrating multiple data sources to extract the neoantigens generated by gene fusion is of important significance for expanding the screening range of neoantigens and improving clinical outcomes.
Disclosure of Invention
To address the above problems, the invention takes into account that tumor-specific gene fusions can be transcribed and translated into mutant polypeptides, and provides a bioinformatics method for obtaining tumor-specific neoantigens.
In a first aspect, the present invention provides a method for extracting gene fusion immunotherapy neoantigens by integrating DNA and RNA deep sequencing data, comprising the steps of:
S10, obtaining a genomic gene fusion sequence of a sample;
S20, obtaining a transcriptome gene fusion sequence of the sample;
S30, constructing a gene fusion proteome;
S40, obtaining the neoantigens of the sample.
In some embodiments of the invention, the genomic gene fusion sequence of the sample is obtained based on whole-exome sequencing.
In some embodiments of the invention, the transcriptome gene fusion sequence of the sample is obtained based on transcriptome sequencing.
In some embodiments of the invention, obtaining the genomic gene fusion sequence of the sample comprises the steps of:
S101, detecting genomic structural variation of a tumor sample;
S102, screening gene fusion events;
S103, obtaining a genomic gene fusion sequence.
In some embodiments of the invention, obtaining the transcriptome gene fusion sequence of the sample comprises the steps of:
S201, detecting transcriptome gene fusion events of a tumor sample;
S202, obtaining a transcriptome gene fusion sequence.
In some embodiments of the invention, step S30 comprises performing reading-frame translation on the obtained genomic gene fusion sequence and transcriptome gene fusion sequence, respectively, to obtain gene fusion protein sequences, i.e., a gene fusion proteome;
preferably, when reading-frame translation reaches the breakpoint position where fusion occurs, it is judged whether a frameshift occurs: if a frameshift occurs, the entire protein sequence after the breakpoint position is a source of potential neoantigen peptides; if no frameshift occurs, only the sequence near the breakpoint can generate neoantigen peptides.
In some embodiments of the invention, step S30 generates peptide fragment sequences of a specified length as required;
preferably, the default peptide fragment length is 9 to 12 amino acids.
In some embodiments of the invention, step S40 includes the steps of:
S401, identifying the human leukocyte antigen (HLA) type;
S402, predicting peptide fragment affinity;
S403, screening the neoantigens of the sample based on the integrated peptide fragment information.
In some embodiments of the invention, the sample is tumor tissue, preferably human tumor tissue.
In some embodiments of the invention, in step S102, gene fusion events whose breakpoints are located inside genes, rather than in intergenic regions, are selected.
In some embodiments of the invention, in step S103, gene sequences are extracted according to the breakpoint positions of the upstream and downstream genes involved in the gene fusion and then spliced together;
preferably, this comprises the following steps:
S1031, determining the breakpoint positions of the upstream and downstream genes;
S1032, judging whether each breakpoint occurs in an exon region or an intron region;
S1033, judging which transcripts of each gene are affected by the breakpoint;
S1034, pairing each affected upstream gene transcript with each affected downstream gene transcript, and obtaining a complete gene fusion transcript sequence according to conventional transcription rules.
In a second aspect, the present invention provides an apparatus for extracting a gene fusion immunotherapeutic neoantigen from deep sequencing data integrating DNA and RNA, comprising:
(1) a first unit for obtaining a genomic gene fusion sequence of a sample;
(2) a second unit for obtaining a transcriptome gene fusion sequence of the sample;
(3) a third unit for constructing a specific gene fusion proteome;
(4) a fourth unit for obtaining the neoantigens of the sample.
The invention has the following beneficial effects.
Compared with the prior art, the scheme of the invention has the following advantages:
1. In terms of source, the tumor-specific neoantigens discovered by the scheme of the invention are all derived from gene fusion, and gene fusion events are widespread across different tumor types, whereas current methods mainly obtain neoantigens by identifying somatic mutations. The invention therefore expands the screening range of neoantigens and enriches the arsenal of neoantigen-based immunotherapy.
2. By analyzing and integrating the whole-exome sequencing data and the transcriptome sequencing data of the tumor sample, the invention comprehensively detects gene fusion events in tumor tissue and reduces the false positive rate of fusion-derived neoantigens. This improves the effectiveness of neoantigen vaccines and is of important significance for improving the effect of clinical immunotherapy.
Drawings
FIG. 1 is a flow chart of a method for extracting a gene fusion immunotherapeutic neoantigen from deep sequencing data of integrated DNA and RNA according to one embodiment of the invention;
FIG. 2 is a schematic representation of the gene fusion sequence in a method for extracting a novel antigen for gene fusion immunotherapy by integrating DNA and RNA deep sequencing data according to an embodiment of the invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced in other embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention.
Before the embodiments of the invention are explained in further detail, it is to be understood that the invention is not limited in its scope to the particular embodiments described below; it is also to be understood that the terminology used in the examples of the invention is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the invention.
In order that those skilled in the art will better understand the present invention, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The terms "first," "second," "again," "then," "next," and the like as used in the specific embodiments herein are not intended to be limiting of the order.
The left part of FIG. 1 is a flow chart of a computer-implemented method for obtaining genomic gene fusion sequences of tumor tissue based on whole-exome sequencing, in accordance with an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps performed by a processor:
S101, detecting genomic structural variation of a tumor sample.
First, the raw genomic sequencing data are aligned to a human reference genome using the alignment software bwa; then, the BAM file generated in this step is used as input, and genomic structural variations are detected using the variation detection software lumpy-sv.
S102, screening gene fusion events.
Specifically, structural variations are genotyped using SVTyper, gene fusion events among the structural variations are screened out, the fusion genes are then annotated using the package pyGeno, and gene fusion events whose breakpoints are located inside genes rather than in intergenic regions are selected.
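The breakpoint filter of step S102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the invention: the gene table, its coordinates, and the helper names `GENES`, `gene_at` and `is_gene_fusion` are hypothetical, and a real pipeline would take gene intervals from an annotation source such as pyGeno. Requiring the two breakpoints to lie in different genes is also an assumption of this sketch.

```python
from typing import NamedTuple, Optional, Tuple

class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

# Hypothetical toy annotation; the coordinates are made up for illustration.
GENES = [
    Gene("GENE_A", "chr1", 1_000, 5_000),
    Gene("GENE_B", "chr2", 20_000, 30_000),
]

Breakpoint = Tuple[str, int]  # (chromosome, position)

def gene_at(chrom: str, pos: int) -> Optional[Gene]:
    """Return the gene containing the position, or None if it is intergenic."""
    for gene in GENES:
        if gene.chrom == chrom and gene.start <= pos <= gene.end:
            return gene
    return None

def is_gene_fusion(bp1: Breakpoint, bp2: Breakpoint) -> bool:
    """Keep an event only if both breakpoints fall inside genes.

    Assumption of this sketch: the two genes must also be different.
    """
    g1, g2 = gene_at(*bp1), gene_at(*bp2)
    return g1 is not None and g2 is not None and g1.name != g2.name

events = [
    (("chr1", 2_500), ("chr2", 25_000)),  # both breakpoints inside genes
    (("chr1", 2_500), ("chr2", 90_000)),  # second breakpoint is intergenic
]
kept = [event for event in events if is_gene_fusion(*event)]
```

A linear scan over the gene list is enough for a toy table; a genome-wide annotation would normally be queried through an interval index instead.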
S103, obtaining a genomic gene fusion sequence.
Gene sequences are extracted according to the breakpoint positions of the upstream and downstream genes involved in the gene fusion and then spliced together. Specifically, after the breakpoint positions of the upstream and downstream genes are determined, it is first determined whether each breakpoint occurs in an exon region or an intron region. Because many genes have multiple transcripts, it is also necessary to determine which transcripts of each gene are affected by the breakpoint. Each affected upstream gene transcript is then paired with each affected downstream gene transcript, and a complete gene fusion transcript sequence is derived according to conventional transcription rules, which facilitates the subsequent reading-frame translation. These gene fusion transcript sequences are collectively referred to herein as genomic gene fusion sequences.
The middle part of FIG. 1 is a flow chart of a computer-implemented method for obtaining transcriptome gene fusion sequences of tumor tissue based on transcriptome sequencing, according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps performed by a processor:
S201, detecting transcriptome gene fusion events of a tumor sample.
First, the raw transcriptome sequencing data are aligned to a human reference genome using the sequence alignment software STAR; gene fusions are then detected using the arriba software.
S202, obtaining a transcriptome gene fusion sequence.
Gene sequences are extracted according to the breakpoint positions of the upstream and downstream genes involved in the gene fusion and then spliced together. It should be noted that, because RNA-seq data consist of transcript sequences rather than genomic sequences, the detected gene fusion breakpoint is not the genomic position at which the fusion actually occurred but a boundary within the mature mRNA sequence after transcription. Therefore, when the gene fusion sequence is generated, the sequences at the breakpoints of the upstream and downstream genes are joined directly, and transcription rules do not need to be considered. The rest of the processing is similar to step S103: the transcripts affected by the breakpoint positions are determined, and complete gene fusion transcript sequences are obtained. These gene fusion transcript sequences are collectively referred to herein as transcriptome gene fusion sequences.
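Because the RNA-derived breakpoints already refer to mature mRNA, the join in step S202 reduces to string concatenation. The sketch below assumes a hypothetical helper `join_at_breakpoints` and toy sequences; the coordinate convention (number of upstream bases kept, 0-based index of the first downstream base kept) is chosen for this illustration and is not specified in the source.

```python
def join_at_breakpoints(upstream_mrna: str, downstream_mrna: str,
                        up_keep: int, down_start: int) -> str:
    """Join the 5' part of the upstream mRNA to the 3' part of the downstream mRNA.

    up_keep    -- number of upstream bases retained before the breakpoint
    down_start -- 0-based index of the first downstream base retained
    """
    return upstream_mrna[:up_keep] + downstream_mrna[down_start:]

# Toy sequences, not real transcripts.
fusion = join_at_breakpoints("ATGGCCAAA", "ATGTTTGGGTAA", up_keep=6, down_start=3)
```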
The lower left part of FIG. 1 is a flow chart of a computer-implemented method for constructing a tumor-specific gene fusion proteome according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps performed by a processor:
S301, constructing a tumor-specific gene fusion proteome.
Reading-frame translation is performed on the genomic gene fusion sequences and the transcriptome gene fusion sequences obtained in steps S103 and S202, respectively, to obtain gene fusion protein sequences, i.e., the tumor-specific gene fusion proteome. To obtain peptide fragment sequences that are not present in normal human cells, when reading-frame translation reaches the breakpoint position where fusion occurs, it is judged whether a frameshift occurs: if a frameshift occurs, the entire protein sequence after the breakpoint position is a source of potential neoantigen peptides; if no frameshift occurs, only the sequence near the breakpoint can generate neoantigen peptides. Finally, peptide fragment sequences of a specified length are generated as required. In the present invention, the default peptide fragment length is 9 to 12 amino acids.
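The frameshift rule and peptide generation of step S301 can be sketched as follows. This is an illustrative sketch, not the script of the invention: the function names, the coordinate convention (`up_keep` coding bases kept from the upstream CDS, `down_start` as the 0-based CDS index of the first downstream base kept), and the reading of "near the breakpoint" as "windows that span the junction" are all assumptions made here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cdna: str) -> str:
    """Translate from the first base until a stop codon or the last full codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE[cdna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def candidate_peptides(up_cds, down_cds, up_keep, down_start, lengths=range(9, 13)):
    """Return (frameshift, peptides) for one fusion of two coding sequences."""
    protein = translate(up_cds[:up_keep] + down_cds[down_start:])
    # The downstream part stays in its original frame only if both offsets
    # agree modulo 3.
    frameshift = (up_keep - down_start) % 3 != 0
    last_up = (up_keep - 1) // 3   # residue holding the last upstream base
    first_down = up_keep // 3      # residue holding the first downstream base
    peptides = []
    for k in lengths:
        for start in range(len(protein) - k + 1):
            end = start + k - 1
            if end < first_down:
                continue  # entirely upstream: normal sequence
            if not frameshift and start > last_up:
                continue  # in-frame and entirely downstream: normal sequence
            peptides.append(protein[start:start + k])
    return frameshift, peptides

# Toy example with 3-mers for readability; the default lengths are 9 to 12.
in_frame = candidate_peptides("ATGGCCAAA", "ATGTTTGGGCCCTAA", 6, 3, lengths=(3,))
shifted = candidate_peptides("ATGGCCAAA", "ATGTTTGGGCCCTAA", 6, 4, lengths=(3,))
```

In the in-frame case only windows covering the junction are kept, while in the frameshift case every window that reaches the breakpoint residue or beyond is kept, which mirrors the two branches described above.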
The right part of FIG. 1 is a flow chart of a method for obtaining neoantigens generated by tumor-specific gene fusion according to another embodiment of the present invention. As shown in FIG. 1, the method includes the following steps performed by a processor:
S401, typing human leukocyte antigen (HLA) molecules.
The HLA type is determined using the HLA typing software HLA-LA.
S402, predicting peptide fragment affinity.
Affinity is predicted for the tumor-specific gene fusion proteome using the software netMHCpan-4.0 and the HLA typing results.
S403, integrating and screening the tumor-specific gene fusion neoantigens.
The peptide fragment information is integrated. Specifically, the source of each candidate peptide fragment is determined, including the upstream and downstream genes involved in the fusion and the corresponding transcript identifiers, and whether the gene fusion event was detected by DNA sequencing or RNA sequencing. Each peptide fragment is further annotated with its affinity to the HLA types identified in step S401, the expression level of the fusion gene in RNA sequencing, the rule frequency corresponding to the gene fusion event in DNA sequencing, the specific position of the peptide fragment in the fusion protein sequence, and the like. In the screening phase, the candidate peptide fragments are first compared with the normal human proteome, and sequences present in the normal proteome are filtered out; the remaining candidate neoantigens are then ranked and screened using different indexes with corresponding weights to obtain the final tumor-specific gene fusion neoantigens. Specific indexes include the affinity of the peptide fragment to HLA, the expression level of the fusion gene, the sequence of the fusion genes, and the physicochemical properties of the peptide fragment.
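The two-stage screening of step S403 (filter against the normal proteome, then rank by weighted indexes) can be sketched as below. The field names, the normalisation of each index to a 0 to 1 score, and the weight values are hypothetical; the source does not disclose the actual indexes' scaling or weights.

```python
def screen_neoantigens(candidates, normal_proteome, weights):
    """Drop peptides found in any normal protein, then rank by weighted score.

    candidates      -- list of dicts with a 'peptide' key plus numeric score fields
    normal_proteome -- iterable of normal protein sequences
    weights         -- dict mapping score field name -> weight (hypothetical values)
    """
    proteins = list(normal_proteome)
    novel = [c for c in candidates
             if not any(c["peptide"] in protein for protein in proteins)]

    def weighted_score(candidate):
        return sum(w * candidate[field] for field, w in weights.items())

    return sorted(novel, key=weighted_score, reverse=True)

# Toy data; the scores are assumed to be pre-scaled to [0, 1], higher is better.
candidates = [
    {"peptide": "MAFGPKLVY", "affinity_score": 0.9, "expression_score": 0.2},
    {"peptide": "AFGPKLVYT", "affinity_score": 0.5, "expression_score": 0.9},
    {"peptide": "KLVYTQRSA", "affinity_score": 0.99, "expression_score": 0.99},
]
normal_proteome = ["MMKLVYTQRSAGG"]  # contains the third peptide
weights = {"affinity_score": 0.6, "expression_score": 0.4}

ranked = screen_neoantigens(candidates, normal_proteome, weights)
```

Substring search against every protein is quadratic and only suitable for a sketch; a production filter would typically index the normal proteome by k-mer.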
In some embodiments, specific parameters of the software used in the present invention are as follows:
Genomic sequencing data are aligned using bwa; an example command is:
bwa mem \
-R '@RG\tID:sample\tLB:library\tSM:sample' \
-t 20 \
-M bwa_index \
sample_1.DNA.fq.gz sample_2.DNA.fq.gz
wherein -R specifies the read-group header line written to the alignment result, -t specifies the number of threads, -M marks shorter split hits as secondary, bwa_index is the index file used, and sample_1.DNA.fq.gz and sample_2.DNA.fq.gz are the input raw sequencing data.
Structural variations are detected using lumpy-sv, which requires some intermediate files to be generated first with the commands it provides. Example commands are as follows.
First, the lumpy_filter command is used to extract the soft-clipped (split) read alignments (sample.splt.bam) and the discordant read alignments (sample.disc.bam) from the BAM file.
lumpy_filter sample.bam sample.splt.bam sample.disc.bam
Then, the mean and standard deviation of the insert length of the sequencing data are calculated.
samtools view sample.bam | python paired_distro.py \
-r readlen \
-X 4 \
-N 10000 \
-o sample.lib1.histo
wherein samtools view sample.bam reads the BAM file and passes it as standard input to the paired_distro.py script, which is provided with the lumpy-sv software; -r specifies the sequencing read length, -X specifies the threshold in standard deviations, -N specifies the number of lines the program reads from standard input, and -o specifies the output file name.
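The quantities estimated in this step, the mean and standard deviation of the insert length, can be illustrated with a short sketch. It is a simplified stand-in, not the script shipped with lumpy-sv: the function name `insert_size_stats` is hypothetical, the input is assumed to be a plain list of signed template lengths (SAM field 9), and the outlier handling controlled by -X is not reproduced.

```python
import statistics

def insert_size_stats(template_lengths, max_reads=10000):
    """Mean and population standard deviation of absolute template lengths.

    Only the first max_reads values are used (analogous to -N above);
    zero-length entries, which carry no insert information, are skipped.
    """
    sizes = [abs(t) for t in template_lengths[:max_reads] if t != 0]
    return statistics.mean(sizes), statistics.pstdev(sizes)

# Toy template lengths; real values would come from the alignment file.
mean, stdev = insert_size_stats([480, -520, 500, 0, 510, -490])
```

The resulting pair corresponds to the mean and stdev values passed to the -pe option of the lumpy command.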
Finally, structural variations are detected using the lumpy command.
lumpy \
-mw 4 \
-tt 0 \
-pe \
id:sample,bam_file:sample.disc.bam,histo_file:sample.lib1.histo,mean:500,stdev:50,read_length:readlen,discordant_z:5,back_distance:10,weight:1,min_mapping_threshold:20 \
-sr \
id:sample,bam_file:sample.splt.bam,back_distance:10,weight:1,min_mapping_threshold:20 \
> sample.vcf
wherein -mw specifies the minimum weight for each structural variation event, -tt specifies the trim threshold, -pe introduces the parameters for processing the discordant-read BAM file, -sr introduces the parameters for processing the split-read BAM file, and sample.vcf is the output file. Specifically, id is the sample name, bam_file is the corresponding BAM file, histo_file is the file recording the insert length distribution, mean is the mean insert length, stdev is the standard deviation of the insert length, read_length is the sequencing read length, discordant_z is the standard score (z-score) threshold, back_distance is the number of bases by which a structural variation site is extended, weight is the sample weight, and min_mapping_threshold is the minimum mapping quality.
Structural variations are genotyped using SVTyper; an example command is:
svtyper \
-i sample.vcf \
-B sample.bam \
-o sample.svtyper.vcf
wherein -i specifies the structural variation result generated by lumpy-sv in the previous step, -B specifies the BAM alignment file produced by bwa, and -o specifies the output file name.
A script is written to process the SVTyper output, screen out gene fusion events, and annotate them with the package pyGeno to obtain the genomic positions at which gene fusion occurs. Specifically, it is inferred which two genes contain the upstream and downstream breakpoints of the gene fusion. Fusions whose breakpoints lie in intergenic regions are not considered, because their transcription cannot be accurately inferred. The classes to be imported in the script are:
from pyGeno.Transcript import Transcript
from pyGeno.Genome import Genome
from pyGeno.Gene import Gene
from pyGeno.Exon import Exon
from pyGeno.Chromosome import Chromosome
It will be readily appreciated that a gene fusion sequence consists of two parts: the former part, the 5' end sequence, is a partial sequence from one gene (the upstream gene), and the latter part, the 3' end sequence, is a partial sequence from the other gene (the downstream gene). To obtain a genomic gene fusion sequence, a script is written to extract the partial sequences of the upstream and downstream genes and splice them together. Specifically, it is first determined whether the upstream and downstream breakpoints are located in an exon region or an intron region of the respective gene, and the fusion is then handled as one of four cases according to where the breakpoints lie: exon-exon, intron-intron, exon-intron, and intron-exon, as shown in FIG. 2. In the exon-exon case, both breakpoints are located in exon regions, and the gene fusion sequence is formed by connecting the 5' exon sequence before the upstream gene breakpoint to the 3' exon sequence after the downstream gene breakpoint. In the intron-intron case, both breakpoints are located in intron regions; because intron sequences are absent from mature mRNA, the gene fusion sequence is likewise formed by connecting the 5' exon sequence before the upstream gene breakpoint to the 3' exon sequence after the downstream gene breakpoint, and does not contain the intron sequence at the breakpoint. The exon-intron and intron-exon cases are slightly more complex: two transcript sequences can be deduced according to transcription rules, and so that neoantigens are selected comprehensively, both gene fusion sequences are output (type1 and type2 in FIG. 2). The type1 sequence does not contain the remaining exon sequence of the gene whose breakpoint is in the exon region, and the type2 sequence contains the remaining intron sequence of the gene whose breakpoint is in the intron region.
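The four breakpoint cases can be sketched as follows. This is one possible reading of the rules above, made without access to FIG. 2: the data model (lists of exon and intron strings), the breakpoint tuple `(kind, index, offset)`, and in particular the assumption that type2 keeps both the partial exon and the partial intron (a read-through transcript) are hypothetical.

```python
def upstream_parts(tx, bp):
    """Return (complete 5' exons, partial feature sequence before the break)."""
    kind, idx, offset = bp
    if kind == "exon":
        return "".join(tx["exons"][:idx]), tx["exons"][idx][:offset]
    return "".join(tx["exons"][:idx + 1]), tx["introns"][idx][:offset]

def downstream_parts(tx, bp):
    """Return (partial feature sequence after the break, complete 3' exons)."""
    kind, idx, offset = bp
    partial = tx["exons"][idx][offset:] if kind == "exon" else tx["introns"][idx][offset:]
    return partial, "".join(tx["exons"][idx + 1:])

def fusion_sequences(up_tx, up_bp, down_tx, down_bp):
    """Return the list of inferred fusion transcript sequences."""
    up_core, up_partial = upstream_parts(up_tx, up_bp)
    down_partial, down_core = downstream_parts(down_tx, down_bp)
    kinds = (up_bp[0], down_bp[0])
    if kinds == ("exon", "exon"):
        return [up_core + up_partial + down_partial + down_core]
    if kinds == ("intron", "intron"):
        return [up_core + down_core]  # broken introns are spliced out
    # Mixed cases: type1 skips the partial exon and splices out the intron;
    # type2 (assumed read-through) keeps the partial exon and partial intron.
    return [up_core + down_core,
            up_core + up_partial + down_partial + down_core]

# Toy transcripts: exons in upper case, introns in lower case.
up = {"exons": ["AAA", "CCC"], "introns": ["gt"]}
down = {"exons": ["GGG", "TTT"], "introns": ["ag"]}

exon_exon = fusion_sequences(up, ("exon", 1, 2), down, ("exon", 0, 1))
intron_intron = fusion_sequences(up, ("intron", 0, 1), down, ("intron", 0, 1))
exon_intron = fusion_sequences(up, ("exon", 1, 2), down, ("intron", 0, 1))
```

In a full pipeline this function would be applied to every pair of affected upstream and downstream transcripts, as described for step S103.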
Transcriptome sequencing data are aligned using STAR; an example command is:
STAR \
--runThreadN 20 \
--genomeDir star_index \
--readFilesIn sample_1.RNA.fq.gz sample_2.RNA.fq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outFilterMultimapNmax 1 \
--outFilterMismatchNmax 3 \
--chimSegmentMin 10 \
--chimOutType WithinBAM SoftClip \
--chimJunctionOverhangMin 10 \
--chimScoreMin 1 \
--chimScoreDropMax 30 \
--chimScoreJunctionNonGTAG 0 \
--chimScoreSeparation 1 \
--alignSJstitchMismatchNmax 5 -1 5 5 \
--chimSegmentReadGapMax 3
wherein --runThreadN specifies the number of threads; --genomeDir specifies the index path; --readFilesIn specifies the raw sequencing data to read; --readFilesCommand specifies the command used to read the input files; --outSAMtype BAM SortedByCoordinate specifies that the output is a coordinate-sorted BAM file; --outSAMunmapped Within specifies that unaligned reads are also written to the result file; --outFilterMultimapNmax specifies the maximum number of multiple alignments allowed for a read; --outFilterMismatchNmax specifies the maximum number of mismatches allowed; --chimSegmentMin enables the output of chimeric (fusion) alignments, with 10 being the minimum length in bases of each chimeric segment; --chimOutType WithinBAM SoftClip specifies the output format of chimeric alignments; --chimJunctionOverhangMin specifies the minimum overhang in bases at a chimeric junction; --chimScoreMin specifies the minimum total score of the chimeric segments; --chimScoreDropMax specifies the maximum allowed drop of the chimeric score relative to the read length; --chimScoreJunctionNonGTAG specifies the penalty applied when the bases at a chimeric junction are not of the form "GT/AG"; --chimScoreSeparation specifies the minimum difference between the best and the second-best chimeric scores; --alignSJstitchMismatchNmax specifies the maximum number of mismatches allowed when stitching splice junctions; --chimSegmentReadGapMax specifies the maximum gap in bases between chimeric segments within a read.
Gene fusions are detected using the arriba software. An example command is:
arriba \
-x Aligned.out.bam -o fusions.tsv \
-a reference.fa -g annotation.gtf \
-b blacklist.tsv
wherein -x specifies the input BAM file; -o specifies the output file; -a specifies the reference genome sequence; -g specifies the GTF annotation file; -b specifies a blacklist file used to reduce false positives.
Based on the gene fusion detection result, a script is written to extract the transcriptome gene fusion sequences. The process is substantially similar to the acquisition of genomic gene fusion sequences described above, except that transcriptome sequencing captures mature mRNA, so the sequences at the breakpoints of the upstream and downstream genes are joined directly without inferring the transcription process.
Finally, the genomic gene fusion sequences and transcriptome gene fusion sequences obtained above are translated in reading frame into fusion protein sequences, and potential neoantigen peptide fragment sequences are extracted to construct the tumor-specific gene fusion proteome.
The HLA type is determined using the HLA typing software HLA-LA, with the following example command:
HLA-LA.pl --BAM sample.bam \
--graph PRG_MHC_GRCh38_withIMGT \
--sampleID sample --maxThreads threads \
--workingDir out_dir --picard_sam2fastq_bin SamToFastq.jar
wherein --BAM specifies the input BAM file; --graph specifies the population reference graph; --sampleID specifies the unique sample identifier; --maxThreads specifies the maximum number of threads; --workingDir specifies the output path; --picard_sam2fastq_bin specifies the tool used to convert the BAM file into FASTQ files.
Affinity is predicted for the tumor-specific gene fusion proteome using the software netMHCpan-4.0 and the HLA typing results. An example command is:
netMHCpan -BA -l 9 -a HLA_type \
-f filename -inptype 1 -xls -xlsfile peptide.xls
wherein -BA requests affinity prediction; -l specifies the peptide fragment length; -a specifies the HLA type; -f specifies the input file; -inptype specifies the input file type, 0 being a FASTA file and 1 being a list of peptide fragment sequences; -xls specifies that the output is written as an xls file; -xlsfile specifies the output file name.
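Since the default peptide lengths span 9 to 12 amino acids, one way to organise the affinity step is to prepare one netMHCpan invocation per length. The sketch below only builds argument lists from the flags shown above and does not run the program; the helper name, the per-length input and output file names, and the comma-separated allele list passed to -a are assumptions of this illustration.

```python
def netmhcpan_commands(peptide_files, hla_alleles):
    """Build one netMHCpan argument list per peptide length.

    peptide_files -- dict mapping peptide length -> path of a file with one
                     peptide per line (hypothetical file names)
    hla_alleles   -- HLA alleles from the typing step, joined with commas
                     (assumed input format for -a)
    """
    allele_arg = ",".join(hla_alleles)
    return [
        ["netMHCpan", "-BA", "-l", str(length), "-a", allele_arg,
         "-f", path, "-inptype", "1",
         "-xls", "-xlsfile", f"peptide.{length}.xls"]
        for length, path in sorted(peptide_files.items())
    ]

commands = netmhcpan_commands(
    {10: "peptides_10.txt", 9: "peptides_9.txt"},
    ["HLA-A02:01", "HLA-B07:02"],
)
```

Each list can be handed to a process launcher as is; building argument lists rather than shell strings avoids quoting problems with file names.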
A script is written to integrate the peptide fragment information, compare the peptide fragments with the normal human proteome, filter out peptide fragments present in the normal proteome, and rank and screen the candidate neoantigens using different indexes with corresponding weights to obtain the final tumor-specific gene fusion neoantigens.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and arranged in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive.
While the preferred embodiments and examples of the present invention have been described in detail, the present invention is not limited to the above-described embodiments and examples, and various changes may be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (11)

1. A method for extracting a gene fusion immunotherapeutic neoantigen by integrating deep sequencing data of DNA and RNA, comprising the steps of:
S10, obtaining a genomic gene fusion sequence of a sample;
S20, obtaining a transcriptome gene fusion sequence of the sample;
S30, constructing a gene fusion proteome;
S40, obtaining sample neoantigens;
wherein obtaining the genomic gene fusion sequence of the sample comprises the following steps:
S101, detecting genomic structural variation of a tumor sample;
S102, screening gene fusion events;
S103, obtaining the genomic gene fusion sequence;
wherein obtaining the transcriptome gene fusion sequence of the sample comprises the following steps:
S201, detecting transcriptome gene fusion events of the tumor sample;
S202, obtaining the transcriptome gene fusion sequence;
wherein S30 comprises performing reading-frame translation on the obtained genomic gene fusion sequence and the obtained transcriptome gene fusion sequence, respectively, to obtain gene fusion protein sequences, namely the gene fusion proteome; and judging whether a frameshift occurs when translation reaches the breakpoint position where the fusion occurs, wherein if a frameshift occurs, the entire protein sequence after the breakpoint position is a source of potential neoantigen peptides, and if no frameshift occurs, only the sequence near the breakpoint can generate neoantigen peptides;
wherein S40 comprises the following steps:
S401, identifying the human leukocyte antigen (HLA) molecular typing;
S402, predicting peptide affinity;
S403, screening the sample neoantigens based on the integrated peptide information.
2. The method of claim 1, wherein obtaining the genomic gene fusion sequence of the sample is based on whole-exome sequencing.
3. The method of claim 1, wherein obtaining the transcriptome gene fusion sequence of the sample is based on transcriptome sequencing.
4. A method according to any one of claims 1 to 3, wherein in S30 peptide sequences of a specified length are generated as required.
5. The method according to claim 4, wherein in S30, the default peptide fragment is 9 to 12 amino acids in length.
6. A method according to any one of claims 1 to 3, wherein the sample is tumour tissue.
7. The method of claim 6, wherein the sample is human tumor tissue.
8. A method according to any one of claims 1 to 3, wherein in S102 gene fusion events are selected whose breakpoint positions are located within a gene and not in an intergenic region.
9. A method according to any one of claims 1 to 3, wherein in S103, the gene sequences are extracted and spliced according to the breakpoint positions of the upstream and downstream genes involved in the gene fusion, respectively.
10. The method according to claim 9, comprising the steps of:
S1031, determining the breakpoint positions of the upstream and downstream genes;
S1032, judging whether each breakpoint occurs in an exon region or an intron region;
S1033, judging which transcripts of the gene are affected by the breakpoint;
S1034, pairing each affected upstream gene transcript with each affected downstream gene transcript one by one, and obtaining the complete gene fusion transcript sequences according to conventional transcription rules.
11. A device for extracting a gene fusion immunotherapeutic neoantigen by integrating deep sequencing data of DNA and RNA, comprising:
(1) A first unit for obtaining a genomic gene fusion sequence of a sample;
(2) A second unit for obtaining a transcriptome gene fusion sequence of the sample;
(3) A third unit for constructing a specific gene fusion proteome;
(4) A fourth unit for obtaining a sample neoantigen;
wherein obtaining the genomic gene fusion sequence of the sample comprises the following steps:
S101, detecting genomic structural variation of a tumor sample;
S102, screening gene fusion events;
S103, obtaining the genomic gene fusion sequence;
wherein obtaining the transcriptome gene fusion sequence of the sample comprises the following steps:
S201, detecting transcriptome gene fusion events of the tumor sample;
S202, obtaining the transcriptome gene fusion sequence;
wherein constructing the specific gene fusion proteome comprises performing reading-frame translation on the obtained genomic gene fusion sequence and the obtained transcriptome gene fusion sequence, respectively, to obtain gene fusion protein sequences, namely the gene fusion proteome; and judging whether a frameshift occurs when translation reaches the breakpoint position where the fusion occurs, wherein if a frameshift occurs, the entire protein sequence after the breakpoint position is a source of potential neoantigen peptides, and if no frameshift occurs, only the sequence near the breakpoint can generate neoantigen peptides;
wherein obtaining the sample neoantigens comprises the following steps:
S401, identifying the human leukocyte antigen (HLA) molecular typing;
S402, predicting peptide affinity;
S403, screening the sample neoantigens based on the integrated peptide information.
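
For illustration only, and not as part of the claims, the frameshift judgment of S30 and the 9 to 12 amino acid peptide lengths of claim 5 can be sketched as below. The sketch assumes that both inputs are already spliced coding sequences with known breakpoint offsets; it ignores introns, untranslated regions, and transcript selection, and all function and parameter names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate from the first base until a stop codon or the sequence end."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def fusion_peptides(upstream_cds, downstream_cds, downstream_offset,
                    lengths=range(9, 13)):
    """Return (frameshift, peptides) for one fusion.

    upstream_cds: coding bases of the upstream gene retained before the
        breakpoint, starting at the start codon.
    downstream_offset: index of the first retained base in downstream_cds.
    """
    # The downstream part keeps its original frame only if the retained
    # upstream length and the downstream offset agree modulo 3.
    frameshift = (len(upstream_cds) - downstream_offset) % 3 != 0
    protein = translate(upstream_cds + downstream_cds[downstream_offset:])
    first_new = len(upstream_cds) // 3  # first residue using downstream bases
    peptides = []
    for k in lengths:
        for s in range(len(protein) - k + 1):
            reaches_downstream = s + k > first_new
            spans_junction = s < first_new and reaches_downstream
            # Frameshift: everything at or after the breakpoint is novel.
            # In frame: only peptides spanning the junction are novel.
            if (frameshift and reaches_downstream) or spans_junction:
                peptides.append(protein[s:s + k])
    return frameshift, peptides


# Toy sequences, for illustration only.
up = "ATGAAACTGGTTGATGAATTTGGTCATTGG"                 # MKLVDEFGHW
down = "ATGTCTCTGTATAATACTGTTGCTACTCTGCGTCCTAAATAA"   # MSLYNTVATLRPK*
in_frame = fusion_peptides(up, down, 3, lengths=(9,))
shifted = fusion_peptides(up, down, 4, lengths=(9,))
print(in_frame[0], len(in_frame[1]), shifted[0], len(shifted[1]))
```

With these toy sequences, the in-frame fusion yields only the 9-mers that span the junction, whereas the frameshifted fusion also yields every 9-mer lying entirely after the breakpoint.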
CN201911293011.2A 2019-12-16 2019-12-16 Method and device for extracting gene fusion immunotherapy new antigen by integrating DNA and RNA deep sequencing data Active CN111192632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911293011.2A CN111192632B (en) 2019-12-16 2019-12-16 Method and device for extracting gene fusion immunotherapy new antigen by integrating DNA and RNA deep sequencing data

Publications (2)

Publication Number Publication Date
CN111192632A CN111192632A (en) 2020-05-22
CN111192632B true CN111192632B (en) 2023-06-13

Family

ID=70707362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911293011.2A Active CN111192632B (en) 2019-12-16 2019-12-16 Method and device for extracting gene fusion immunotherapy new antigen by integrating DNA and RNA deep sequencing data

Country Status (1)

Country Link
CN (1) CN111192632B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113035272B (en) * 2021-03-08 2023-09-05 深圳市新合生物医疗科技有限公司 Method and device for obtaining immunotherapeutic new antigen based on intein cell variation
CN115240773B (en) * 2022-09-06 2023-07-28 深圳新合睿恩生物医疗科技有限公司 New antigen identification method and device, equipment and medium of tumor specific circular RNA

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491689A (en) * 2018-02-01 2018-09-04 杭州纽安津生物科技有限公司 Tumour neoantigen identification method based on transcript profile
US20180341746A1 (en) * 2017-05-25 2018-11-29 Koninklijke Philips N.V. System and method for detecting gene fusion
CN109706065A (en) * 2018-12-29 2019-05-03 深圳裕策生物科技有限公司 Tumor neogenetic antigen load detection device and storage medium
CN109801678A (en) * 2019-01-25 2019-05-24 上海鲸舟基因科技有限公司 Based on the tumour antigen prediction technique of full transcript profile and its application

Also Published As

Publication number Publication date
CN111192632A (en) 2020-05-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant