A method for detecting fusion genes based on transcriptome sequencing data
Technical field
The present invention relates to the field of transcriptome analysis, and more specifically to a method
for detecting fusion genes based on transcriptome sequencing data.
Background art
Gene rearrangement occurs from time to time in the genetic material of organisms. Rearrangement often
causes one or more genes or gene fragments that were not originally in the same cistron to fuse and be
transcribed as a single cistron, which can activate or inactivate certain genes or produce new functions.
The occurrence of many diseases is accompanied by gene fusion. For example, leukemia is often accompanied
by fusion genes such as bcr/abl, AML1/ETO, CBFβ/MYH11 and PML/RARα. Fusion genes have also been found in
a variety of solid tumors: EML4-ALK in non-small cell lung cancer, SLC45A3-ELK4 in prostate cancer,
PAX3-FOXO1 in rhabdomyosarcoma, and so on. Research has found that some fusion genes participate in the
pathogenesis of the associated disease. Detection of these fusion genes can therefore serve as one of the
diagnostic criteria, and the fusion genes can even serve as therapeutic targets.
At present, fusion genes are generally detected by analyzing second-generation transcriptome sequencing
data, mainly by aligning two kinds of reads. One kind is discordant paired-end reads, that is, read pairs
whose two mates align respectively to the 5' partner gene and the 3' partner gene of the fusion. The other
kind is junction reads, that is, reads whose alignment spans the fusion site. Fusion genes in the
transcription products can be detected by identifying the support provided by these two kinds of reads.
Many software tools have been developed on this basis, including SOAPfusion, deFuse, SOAPfuse,
FusionCatcher, FusionMap, TopHat-Fusion, ChimeraScan and STAR-Fusion. However, because of the complexity
of the transcriptome and the limited read length of second-generation RNA-seq, short-read alignment is
inherently very challenging. Repetitive sequences in the reference genome, and the incompleteness of the
reference genome itself, easily lead to wrong judgments about where a read aligns and whether that
alignment is unique. To control the false positives caused by alignment errors, candidate fusion genes
must be filtered with stricter parameters, but doing so often filters out many true positives as well. In
addition, library construction for second-generation sequencing randomly produces some chimeric reads in
which fragments from different genes are joined together. Existing fusion gene detection methods cannot
distinguish these chimeric sequences from real fusion gene sequences, so the randomly generated chimeric
reads are also detected as fusion genes and cause false positives. For these reasons, fusion gene
detection methods based purely on second-generation RNA-seq have difficulty balancing accuracy against
the false negative rate.
Third-generation transcriptome sequencing, also known as full-length transcriptome sequencing, produces
reads that are far longer than those of second-generation transcriptome sequencing. Compared with the
short reads of second-generation sequencing, the long reads of third-generation sequencing reduce the
false positives caused by alignment errors more effectively, but cannot avoid them completely. Moreover,
library construction for third-generation transcriptome sequencing also randomly produces some chimeric
reads in which fragments from different genes are joined together, which leads to false positive fusions.
Therefore, a new method for detecting fusion genes in transcription products is needed.
Summary of the invention
To solve the above problems, the present invention provides a method for detecting fusion genes based on
transcriptome sequencing data, comprising the following steps:
S1: performing second-generation transcriptome sequencing and third-generation transcriptome sequencing
on a sample to obtain second-generation transcriptome sequencing data and third-generation transcriptome
sequencing data, respectively;
S2: aligning the third-generation transcriptome sequencing data to a reference genome, identifying FLNC
(full-length non-chimeric) reads in which gene fusion may have occurred and the gene pairs that may
participate in fusion, extracting the sequences of said FLNC reads, and determining the fusion position;
S3: aligning the second-generation transcriptome sequencing data to the candidate fusion FLNC reads
obtained in S2, and identifying the gene pairs that have truly undergone fusion according to the numbers
of discordant paired-end reads and junction reads in the alignment result.
Further, S2 includes the following steps:
S2.1: aligning the third-generation transcriptome sequencing data to the reference genome to obtain FLNC
reads that align to two different positions on the reference genome;
S2.2: determining whether the two different positions on the reference genome, and the segments of the
FLNC read that correspond to those two positions, satisfy the fusion decision conditions; when all fusion
decision conditions are satisfied, determining that the FLNC read is an FLNC read in which gene fusion may
have occurred, obtaining the gene pair that may participate in fusion, extracting the sequence of said
FLNC read, and determining the fusion position.
Further, the fusion decision conditions are:
1) the two different positions on the reference genome correspond respectively to the 5' segment and the
3' segment of the FLNC read;
2) the positions of the 5' segment and the 3' segment within the FLNC read do not exceed a maximum overlap
length or a maximum gap length, and are not less than a minimum total length;
3) the alignments of the 5' segment and the 3' segment to the reference genome meet a minimum alignment
identity;
4) the two different positions on the reference genome satisfy one of the following conditions: A, they
are on different chromosomes; B, they are on the same chromosome but in opposite orientations; C, they are
on the same chromosome and in the same orientation, but separated by more than the maximum intron length
in the genome annotation; and
5) gene annotation information exists at both positions on the reference genome, and according to that
annotation it can be determined that the annotated gene structure at each of the two positions is
consistent with the gene structure of the corresponding 5' segment or 3' segment.
Further, the maximum overlap length and the maximum gap length are each 5-20 bp, the minimum total length
is 10-20% of the length of the FLNC read, the minimum alignment identity is 80-95%, and the maximum intron
length is 50 kb.
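The five fusion decision conditions can be sketched as a single predicate over the two aligned segments of one FLNC read. The following Python sketch is illustrative only: the `Segment` record, its field names and the function `is_candidate_fusion` are hypothetical and not part of the described method, the default thresholds are simply values chosen from within the ranges stated above, and the "minimum total length" is interpreted here (as an assumption) as the combined length of the two segments on the read.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned piece of an FLNC read (hypothetical record layout)."""
    chrom: str
    strand: str        # "+" or "-"
    ref_start: int     # 0-based start on the reference genome
    ref_end: int       # end on the reference genome (exclusive)
    read_start: int    # 0-based start on the FLNC read
    read_end: int      # end on the FLNC read (exclusive)
    identity: float    # alignment identity, 0.0 to 1.0
    annotated: bool    # True if an annotated gene with a consistent structure exists here


def is_candidate_fusion(five, three, read_len,
                        max_overlap=10, max_gap=10, min_total_frac=0.15,
                        min_identity=0.90, max_intron=50_000):
    # Condition 1: the two positions correspond to the 5' and 3' segments of the read.
    if five.read_start >= three.read_start:
        return False
    # Condition 2: limited overlap or gap between the segments on the read,
    # and a minimum combined length (assumed interpretation of "minimum total length").
    offset = three.read_start - five.read_end  # negative means overlap, positive means gap
    if offset < 0 and -offset > max_overlap:
        return False
    if offset > max_gap:
        return False
    total = (five.read_end - five.read_start) + (three.read_end - three.read_start)
    if total < min_total_frac * read_len:
        return False
    # Condition 3: both segments meet the minimum alignment identity.
    if five.identity < min_identity or three.identity < min_identity:
        return False
    # Condition 4: different chromosomes (A), opposite orientations (B),
    # or same orientation but farther apart than the maximum intron length (C).
    if five.chrom == three.chrom and five.strand == three.strand:
        distance = max(five.ref_start, three.ref_start) - min(five.ref_end, three.ref_end)
        if distance <= max_intron:
            return False
    # Condition 5: both positions carry a consistent gene annotation.
    return five.annotated and three.annotated


five = Segment("chr1", "+", 1000, 1500, 0, 500, 0.98, True)
three_other_chrom = Segment("chr2", "+", 2000, 2600, 495, 1095, 0.97, True)
three_nearby = Segment("chr1", "+", 11500, 12100, 495, 1095, 0.97, True)
print(is_candidate_fusion(five, three_other_chrom, read_len=1100))
print(is_candidate_fusion(five, three_nearby, read_len=1100))
```

In this example the first read is accepted because its segments lie on different chromosomes, while the second is rejected because its segments lie 10 kb apart on the same strand of one chromosome, which is within the 50 kb maximum intron length and is therefore compatible with ordinary splicing.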
Further, S3 includes the following steps:
S3.1: aligning the second-generation transcriptome sequencing data to the FLNC reads in which gene fusion
may have occurred, and identifying discordant paired-end reads and junction reads for each such FLNC read;
S3.2: when the number of FLNC reads in which gene fusion may have occurred that support a candidate gene
pair identified in S2 and the number of junction reads each meet their minimum decision number, and the
number of discordant read pairs meets the minimum decision number of pairs, determining that the candidate
gene pair has undergone fusion.
Further, the minimum decision number of FLNC reads in which gene fusion may have occurred is 1.
Further, the minimum decision number of junction reads is 1.
Further, the minimum decision number of discordant read pairs is 1 pair.
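The decision in S3.2 reduces to three threshold comparisons per candidate gene pair. A minimal Python sketch follows, assuming the supporting evidence has already been counted. The `FusionSupport` record, the function `is_confirmed_fusion` and the gene names are hypothetical; the default thresholds are the minimum decision numbers stated above.

```python
from dataclasses import dataclass


@dataclass
class FusionSupport:
    """Evidence counts for one candidate gene pair (hypothetical record layout)."""
    gene_pair: tuple         # (5' partner gene, 3' partner gene)
    flnc_reads: int          # third-generation FLNC reads supporting the pair
    junction_reads: int      # second-generation reads spanning the fusion site
    discordant_pairs: int    # second-generation pairs with one mate in each partner


def is_confirmed_fusion(support, min_flnc=1, min_junction=1, min_pairs=1):
    # All three kinds of evidence must reach their minimum decision number.
    return (support.flnc_reads >= min_flnc
            and support.junction_reads >= min_junction
            and support.discordant_pairs >= min_pairs)


candidates = [
    FusionSupport(("geneA", "geneB"), flnc_reads=2, junction_reads=5, discordant_pairs=3),
    FusionSupport(("geneC", "geneD"), flnc_reads=1, junction_reads=0, discordant_pairs=4),
]
confirmed = [c.gene_pair for c in candidates if is_confirmed_fusion(c)]
print(confirmed)
```

Requiring all three counts to be non-zero is what ties the two sequencing technologies together: a chimeric read produced in only one library supplies at most one kind of evidence and is filtered out, as with the second candidate here.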
Third-generation full-length transcriptome sequencing covers most transcript sequences and can therefore
sequence the fusion transcript of a fusion gene in full. Second-generation sequencing data has high
sequencing depth and can provide enough supporting reads even for low-abundance fusion genes. In addition,
because second-generation and third-generation sequencing are two different technologies, combining them
effectively avoids the false positives or false negatives caused by the systematic errors of a single
technology. For example, chimeric reads arise during library construction in both technologies. Because
these chimeric reads arise at random, requiring supporting evidence for a fusion gene in both libraries
effectively avoids the false positives they cause. The FLNC read itself, being a full-length transcript
sequence, also gives the second-generation RNA-seq alignment an accurate candidate fusion transcript as
its reference sequence, which greatly improves the efficiency and accuracy of aligning second-generation
data and detecting fusion genes.
By combining third-generation and second-generation transcriptome sequencing to detect fusion genes, the
present invention avoids the false positives that arise from short read length when second-generation
transcriptome sequencing is used alone, as well as the false positives caused by chimeric reads when a
single sequencing technology is used. It also avoids the difficulty of reconstructing fusion transcripts
when fusion genes are detected from second-generation data alone, so that fusion gene detection results
supported by both second-generation and third-generation sequencing evidence are more reliable.
Detailed description of embodiments
The principles and features of the present invention are described below with reference to an example.
The example is given only to explain the invention and is not intended to limit its scope.
We applied one embodiment of the method of the present invention to a soybean transcriptome project. In
this project the soybean transcriptome was sequenced with both the third-generation ISO-seq method and the
second-generation RNA-seq method. The sequenced sample was a mixed sample of different soybean tissues and
developmental stages. For third-generation transcriptome sequencing, two libraries were constructed, with
library sizes of 0.6-2.5 kb and >1.5 kb; the two libraries were sequenced on the PacBio RSII platform,
producing 16 cells and 7 cells respectively. For second-generation sequencing, one library with a library
size of 200 bp was constructed, and 6 G of RNA-seq data was obtained. The main analysis workflow was as
follows:
1) The raw third-generation sequencing data was preprocessed and quality-controlled with the three
pipelines RS_subreads, RS_ReadsofInsert and RS_Isoseq in the SMRT Analysis software, yielding full-length
FLNC reads. The second-generation data was preprocessed and quality-controlled with the FastQC software,
yielding clean RNA-seq reads;
2) The FLNC reads were aligned to the reference genome with the GMAP alignment software to obtain an
alignment result file;
3) The alignment result file was screened for FLNC alignments that were split across two positions on the
genome and satisfied the following conditions: (1) the two aligned positions correspond respectively to
the 5' and 3' ends of the FLNC read; (2) the overlap length and the gap length between the two segments
are both less than 10 bp; (3) the alignment identity of each segment is greater than 90%, and the total
aligned length of the FLNC read is greater than 90%; (4) if the two positions are on the same chromosome
and in the same orientation, they are at least 50 kb apart;
4) According to the soybean genome annotation file, FLNC reads for which both aligned positions carry a
gene annotation, and whose aligned exon/intron structure is identical to the structure of the annotated
genes, were selected as candidate fusion transcripts, and the position information of the participating
genes and of the fusion site was recorded;
5) The FLNC reads supporting candidate fusion genes were extracted as full-length fusion transcript
sequences, and the second-generation RNA-seq reads were aligned to these fusion transcript sequences.
Discordant paired-end reads and junction reads supporting each fusion were detected from the alignment
result;
6) Candidate fusion genes supported by at least one FLNC read, at least one pair of discordant paired-end
reads and at least one junction read were retained as the final fusion gene detection result.
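Step 5 depends on classifying each second-generation read pair relative to the fusion site recorded on the FLNC fusion transcript. The Python sketch below illustrates one way this could be done once the pairs have been aligned to the fusion transcript. It is not the implementation used in the project: the function names, the interval representation and the `min_anchor` parameter (a minimum number of bases required on each side of the fusion site) are assumptions introduced for illustration. A junction read is a read whose alignment spans the fusion site; a pair counts as discordant with respect to the genome when one mate lies entirely in the 5' partner and the other entirely in the 3' partner.

```python
def spans_junction(start, end, breakpoint, min_anchor=5):
    # True if the read covers the fusion site with at least min_anchor bases on each side.
    return start + min_anchor <= breakpoint <= end - min_anchor


def side_of(start, end, breakpoint):
    # Which part of the fusion transcript the read lies in.
    if end <= breakpoint:
        return "5p"      # entirely within the 5' partner
    if start >= breakpoint:
        return "3p"      # entirely within the 3' partner
    return "span"        # overlaps the fusion site


def count_support(pairs, breakpoint, min_anchor=5):
    """pairs: list of ((start1, end1), (start2, end2)) half-open alignment intervals
    on the fusion transcript; breakpoint: 0-based position of the first 3' partner base."""
    junction_reads = 0
    discordant_pairs = 0
    for mate1, mate2 in pairs:
        mates = (mate1, mate2)
        junction_reads += sum(spans_junction(s, e, breakpoint, min_anchor) for s, e in mates)
        if {side_of(s, e, breakpoint) for s, e in mates} == {"5p", "3p"}:
            discordant_pairs += 1
    return junction_reads, discordant_pairs


pairs = [
    ((300, 400), (600, 700)),   # one mate in each partner: discordant pair
    ((450, 550), (700, 800)),   # first mate spans the fusion site: junction read
    ((100, 200), (250, 350)),   # both mates in the 5' partner: no support
    ((498, 598), (650, 750)),   # only 2 bases before the site: anchor too short
]
print(count_support(pairs, breakpoint=500))  # (1, 1)
```

The resulting counts are the quantities that step 6 compares against the minimum of one junction read and one discordant pair.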
Through the above analysis we detected 225 fusion genes in total, of which 209 were interchromosomal
fusions and 16 were intrachromosomal fusions. We also plotted some of the fusion results with the JBrowser
visualization software, which further confirmed the reliability of the results.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the
invention. Any modification, equivalent replacement or improvement made within the spirit and principles
of the present invention shall fall within its scope of protection.