CN106650254B

CN106650254B - A method for detecting fusion genes based on transcriptome sequencing data

Info

Publication number: CN106650254B
Application number: CN201611168738.4A
Authority: CN
Inventors: 程艳兵
Original assignee: Wuhan Frasergen Co Ltd
Current assignee: Jiaxing Feisha Gene Information Co., Ltd.
Priority date: 2016-12-16
Filing date: 2016-12-16
Publication date: 2018-11-20
Anticipated expiration: 2036-12-16
Also published as: CN106650254A

Abstract

The present invention relates to a method for detecting fusion genes based on transcriptome sequencing data, comprising the following steps. S1: Perform second-generation transcriptome sequencing and third-generation transcriptome sequencing on a sample to obtain second-generation and third-generation transcriptome sequencing data, respectively. S2: Align the third-generation transcriptome sequencing data to a reference genome, identify the FLNC (full-length non-chimeric) reads in which gene fusion may have occurred and the gene pairs that may participate in the fusion, extract the sequences of those FLNC reads, and determine the fusion position. S3: Align the second-generation transcriptome sequencing data to the candidate fusion FLNC reads obtained in S2, and identify the gene pairs that are truly fused according to the number of discordant read pairs and the number of junction reads in the alignment results, together with the number of FLNC reads in which gene fusion may have occurred. By combining third-generation and second-generation transcriptome sequencing to detect fusion genes, the invention yields more reliable fusion detection results, because each result is supported by evidence from both sequencing technologies.

Description

A method for detecting fusion genes based on transcriptome sequencing data

Technical field

The present invention relates to the field of transcriptome analysis, and more specifically to a method for detecting fusion genes based on transcriptome sequencing data.

Background art

Gene rearrangement occurs from time to time in the genetic material of organisms. It often causes one or more genes or gene fragments that were not originally in the same cistron to form a fusion gene that is transcribed as a single cistron, which can activate or inactivate certain genes or produce new functions. Many diseases are accompanied by gene fusion. For example, leukemia is often accompanied by fusion genes such as bcr/abl, AML1/ETO, CBFβ/MYH11 and PML/RARα. Fusion genes have also been found in a variety of solid tumors: EML4-ALK in non-small cell lung cancer, SLC45A3-ELK4 in prostate cancer, PAX3-FOXO1 in rhabdomyosarcoma, and so on. Research has shown that some fusion genes take part in the pathogenesis of the associated disease. Detecting these fusion genes can therefore serve as a diagnostic criterion, and the fusion genes can even serve as therapeutic targets.

At present, fusion genes are generally detected by analyzing second-generation transcriptome sequencing data, mainly by aligning two kinds of reads. The first kind is discordant paired-end reads, that is, read pairs whose two mates align respectively to the 5' partner gene and the 3' partner gene of the fusion. The second kind is junction reads, that is, reads whose alignment spans the fusion site. By identifying the support from these two kinds of reads, fusion genes in the transcription products can be detected. Many software tools based on this approach have been developed, including SOAPfusion, deFuse, SOAPfuse, FusionCatcher, FusionMap, TopHat-Fusion, ChimeraScan and STAR-Fusion. However, because of the complexity of the transcriptome and the limited read length of second-generation RNA-seq, short-read alignment itself faces considerable challenges. Repetitive sequences in the reference genome, and the incompleteness of the reference genome itself, easily lead to wrong judgments about where a read aligns and whether that alignment is unique. To control the false positives caused by alignment errors, stricter filtering parameters must be applied to the candidate fusion genes, but doing so often filters out many true positives as well. In addition, library construction for second-generation sequencing randomly produces some chimeric reads, in which fragments from different genes are joined together by chance. Existing fusion gene detection methods cannot distinguish these chimeric sequences from real fusion gene sequences, so the randomly generated chimeric reads are also reported as fusion genes, producing false positives. For these reasons, fusion gene detection methods based purely on second-generation RNA-seq find it difficult to strike a balance between accuracy and false negative rate.

Third-generation transcriptome sequencing is also known as full-length transcriptome sequencing; each of its reads is far longer than a second-generation transcriptome sequencing read. Compared with the short reads of second-generation sequencing, the long reads of third-generation sequencing reduce the false positives caused by alignment errors more effectively, but they cannot avoid such false positives entirely. Moreover, library construction for third-generation transcriptome sequencing also randomly produces chimeric reads in which fragments from different genes are joined by chance, causing false-positive fusions.

Therefore, a new method for detecting fusion genes in transcription products is needed.

Summary of the invention

To solve the above problem, the present invention provides a method for detecting fusion genes based on transcriptome sequencing data, comprising the following steps:

S1: Perform second-generation transcriptome sequencing and third-generation transcriptome sequencing on a sample to obtain second-generation transcriptome sequencing data and third-generation transcriptome sequencing data, respectively;

S2: Align the third-generation transcriptome sequencing data to a reference genome, identify the FLNC reads in which gene fusion may have occurred and the gene pairs that may participate in the fusion, extract the sequences of those FLNC reads, and determine the fusion position;

S3: Align the second-generation transcriptome sequencing data to the candidate fusion FLNC reads obtained in S2, and identify the gene pairs that are truly fused according to the number of discordant read pairs and the number of junction reads in the alignment results.

Further, S2 includes the following steps:

S2.1: Align the third-generation transcriptome sequencing data to the reference genome to obtain FLNC reads that align to two different positions on the reference genome;

S2.2: Determine whether the two different positions on the reference genome, and the segments of the FLNC read corresponding to those two positions, meet the fusion gene decision conditions. When all of the fusion gene decision conditions are met, classify the FLNC read as an FLNC read in which gene fusion may have occurred and obtain the gene pair that may participate in the fusion, extract the sequence of that FLNC read, and determine the fusion position.

Further, the fusion gene decision conditions are:

1) the two different positions on the reference genome correspond respectively to the 5' segment and the 3' segment of the FLNC read;

2) the positions of the 5' segment and the 3' segment on the FLNC read do not exceed the maximum overlap length or the maximum gap length, and are not less than the minimum total length;

3) the alignments of the 5' segment and the 3' segment to the reference genome meet the minimum alignment identity;

4) the two different positions on the reference genome meet one of the following conditions: A, they are on different chromosomes; B, they are on the same chromosome but in opposite orientations; C, they are on the same chromosome and in the same orientation, but the distance between them exceeds the maximum intron length in the genome annotation; and

5) gene annotation information exists at both positions on the reference genome, and from that annotation it can be determined that the annotated gene structures at the two positions are consistent with the gene structures of the corresponding 5' segment and 3' segment, respectively.

Further, the maximum overlap length and the maximum gap length are 5-20 bp, the minimum total length is 10-20% of the FLNC read length, the minimum alignment identity is 80-95%, and the maximum intron length is 50 kb.
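As an illustration only, conditions 1) to 4) can be sketched as a filter over a pair of aligned segments. The `Segment` structure, the function name and the default thresholds below are assumptions introduced for this sketch; they are chosen from within the ranges stated above and are not prescribed by the method. Condition 5) (consistency with the gene annotation) is omitted because it depends on the annotation format, and the minimum total length is interpreted here as the number of read bases covered by the two segments.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned piece of an FLNC read (hypothetical structure for this sketch)."""
    read_start: int   # 0-based start on the read
    read_end: int     # exclusive end on the read
    chrom: str        # chromosome of the genomic alignment
    ref_start: int    # 0-based start on the reference genome
    ref_end: int      # exclusive end on the reference genome
    strand: str       # '+' or '-'
    identity: float   # alignment identity as a fraction, 0..1


def is_candidate_fusion(five, three, read_len,
                        max_overlap=10, max_gap=10,
                        min_total_frac=0.15, min_identity=0.90,
                        max_intron=50_000):
    """Return True if the two segments pass conditions 1) to 4).

    All thresholds are illustrative assumptions taken from the stated ranges.
    """
    # Condition 1: the 5' segment must come before the 3' segment on the read.
    if five.read_start >= three.read_start:
        return False

    # Condition 2: limited overlap or gap between the segments on the read,
    # and enough of the read covered by the two segments together.
    offset = three.read_start - five.read_end   # > 0 is a gap, < 0 an overlap
    if offset > max_gap or -offset > max_overlap:
        return False
    covered = ((five.read_end - five.read_start)
               + (three.read_end - three.read_start)
               - max(0, -offset))
    if covered < min_total_frac * read_len:
        return False

    # Condition 3: both segments must align with sufficient identity.
    if min(five.identity, three.identity) < min_identity:
        return False

    # Condition 4: A different chromosomes, B opposite orientation,
    # or C same chromosome and orientation but farther apart than any intron.
    if five.chrom != three.chrom:
        return True
    if five.strand != three.strand:
        return True
    distance = (max(five.ref_start, three.ref_start)
                - min(five.ref_end, three.ref_end))
    return distance > max_intron
```

A read split between two chromosomes passes, while a read whose two segments lie 9 kb apart on the same strand of one chromosome is rejected as a probable long intron rather than a fusion.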

Further, S3 includes the following steps:

S3.1: Align the second-generation transcriptome sequencing data to the FLNC reads in which gene fusion may have occurred, and for each such FLNC read identify the discordant read pairs and the junction reads;

S3.2: When the number of FLNC reads supporting a candidate gene pair identified in S2 and the number of junction reads each meet the minimum count, and the number of discordant read pairs meets the minimum pair count, determine that the candidate gene pair is fused.

Further, the minimum count of FLNC reads in which gene fusion may have occurred is 1.

Further, the minimum count of junction reads is 1.

Further, the minimum count of discordant read pairs is 1 pair.
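The decision rule in S3.2 amounts to three independent thresholds that must all be met. A minimal sketch follows, assuming the support counts have already been tallied per candidate gene pair; the function name, the dictionary layout and the gene names in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple

GenePair = Tuple[str, str]


def confirmed_fusions(support: Dict[GenePair, Tuple[int, int, int]],
                      min_flnc: int = 1,
                      min_junction: int = 1,
                      min_discordant_pairs: int = 1) -> List[GenePair]:
    """Keep candidate gene pairs that meet all three support thresholds.

    `support` maps a gene pair to a tuple of
    (FLNC read count, junction read count, discordant pair count).
    The default thresholds of 1 follow the values stated above.
    """
    confirmed = []
    for pair, (n_flnc, n_junction, n_discordant) in support.items():
        if (n_flnc >= min_flnc
                and n_junction >= min_junction
                and n_discordant >= min_discordant_pairs):
            confirmed.append(pair)
    return confirmed
```

For example, a candidate with counts `(2, 5, 3)` is kept, whereas a candidate with counts `(1, 0, 4)` is dropped because no short read spans its fusion site.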

Third-generation full-length transcriptome sequencing can cover most of a transcript sequence, and can therefore also sequence the complete transcript of a fusion gene. Second-generation sequencing data have high sequencing depth and can provide enough supporting reads even for low-abundance fusion genes. In addition, because second-generation and third-generation sequencing are two different technologies, using both effectively avoids the false positives or false negatives caused by the systematic errors of a single technology. Take the chimeric reads generated during library construction in each technology: because these chimeric reads arise at random, requiring supporting evidence for a fusion gene in both libraries effectively avoids the false positives they cause. The FLNC read, being a full-length transcript sequence, also gives the second-generation RNA-seq alignment an accurate reference sequence for the candidate fusion transcript, which greatly improves the efficiency and accuracy of aligning the second-generation data and detecting fusion genes.

By combining third-generation and second-generation transcriptome sequencing to detect fusion genes, the present invention avoids the false positives that arise from short read length when second-generation transcriptome sequencing is used alone, as well as the false positives caused by chimeric reads when any single sequencing technology is used. It also avoids the difficulty of reconstructing fusion transcripts when fusion genes are detected from second-generation data alone. As a result, fusion gene detection results supported by evidence from both second-generation and third-generation sequencing are more reliable.

Specific embodiments

The principles and features of the present invention are described below with reference to an example. The example is given only to explain the invention and is not intended to limit its scope.

We applied one embodiment of the method of the present invention to a soybean transcriptome project. In this project, the soybean transcriptome was sequenced by both third-generation Iso-Seq and second-generation RNA-seq. The sequenced sample was a mixture of different soybean tissues and developmental stages. For third-generation sequencing, two libraries were constructed, with library sizes of 0.6-2.5 kb and >1.5 kb; they were sequenced on the PacBio RSII platform, yielding 16 cells and 7 cells of data, respectively. For second-generation sequencing, one library with a size of 200 bp was constructed, yielding 6 G of RNA-seq data. The main analysis workflow was as follows:

1) The raw third-generation sequencing data were preprocessed and quality-controlled with three pipelines of the SMRT Analysis software, RS_subreads, RS_ReadsofInsert and IRS_Isoseq, to obtain full-length FLNC reads. The second-generation data were preprocessed and quality-controlled with FastQC to obtain clean RNA-seq reads;

2) The FLNC reads were aligned to the reference genome with the GMAP aligner to obtain an alignment result file;

3) The alignment result file was screened for FLNC alignments that were split between two positions on the genome and met the following conditions: (1) the two alignment positions correspond respectively to the 5' end and the 3' end of the FLNC read; (2) the overlap length and the gap length between the two segments are both less than 10 bp; (3) the alignment identity of each segment is greater than 90%, and the total aligned length of the FLNC read is greater than 90% of the read; (4) if the two positions are on the same chromosome and in the same orientation, they are at least 50 kb apart;

4) According to the soybean genome annotation file, FLNC reads for which both alignment positions carry a gene annotation, and whose aligned exon/intron structure is identical to the structure of the annotated genes, were selected as candidate fusion transcripts, and the positions of the genes participating in the fusion and the fusion site were recorded;

5) The FLNC reads supporting each candidate fusion gene were extracted as full-length fusion transcript sequences, and the second-generation RNA-seq reads were aligned to these fusion transcript sequences. Discordant read pairs and junction reads supporting each fusion gene were detected from the alignment results;

6) Candidate fusion genes supported by at least one FLNC read, at least one discordant read pair and at least one junction read were retained as the final fusion gene detection results.
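Step 5) can be illustrated with a small classifier for short-read pairs aligned to a single fusion transcript. This is a sketch under stated assumptions, not the procedure used in the project: each mate is represented by a half-open `(start, end)` interval on the fusion transcript, `breakpoint` is the index of the first base contributed by the 3' partner gene, and `min_anchor` (the number of bases a junction read must cover on each side of the fusion site) is a parameter introduced here for illustration.

```python
from collections import Counter


def classify_pair(mate1, mate2, breakpoint, min_anchor=10):
    """Classify a read pair aligned to a fusion transcript.

    Returns 'junction' if either mate spans the fusion site with at least
    `min_anchor` bases on each side, 'discordant' if the mates lie entirely
    on opposite sides of the fusion site, and None otherwise.
    """
    def spans(mate):
        start, end = mate
        return start <= breakpoint - min_anchor and end >= breakpoint + min_anchor

    if spans(mate1) or spans(mate2):
        return "junction"

    left, right = sorted((mate1, mate2))
    # Left mate ends within the 5' partner, right mate starts in the 3' partner.
    if left[1] <= breakpoint <= right[0]:
        return "discordant"
    return None


def count_support(pairs, breakpoint, min_anchor=10):
    """Return (junction count, discordant pair count) for one fusion transcript."""
    counts = Counter(classify_pair(m1, m2, breakpoint, min_anchor)
                     for m1, m2 in pairs)
    return counts["junction"], counts["discordant"]
```

A pair with both mates on the same side of the fusion site, or with a mate that overlaps the site by only a few bases, contributes to neither count, so it cannot by itself satisfy the thresholds in step 6).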

Through the above analysis, we detected 225 fusion genes in total, of which 209 were interchromosomal fusions and 16 were intrachromosomal fusions. We also plotted some of the fusion results with the JBrowser visualization software, which further confirmed the reliability of the results.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for detecting fusion genes based on transcriptome sequencing data, characterized by comprising the following steps:

S1: Perform second-generation transcriptome sequencing and third-generation transcriptome sequencing on a sample to obtain second-generation transcriptome sequencing data and third-generation transcriptome sequencing data, respectively;

S2: Align the third-generation transcriptome sequencing data to a reference genome, identify the FLNC reads in which gene fusion may have occurred and the gene pairs that may participate in the fusion, extract the sequences of those FLNC reads, and determine the fusion position;

S3: Align the second-generation transcriptome sequencing data to the candidate fusion FLNC reads obtained in S2, and identify the gene pairs that are truly fused according to the number of discordant read pairs and the number of junction reads in the alignment results, together with the number of FLNC reads in which gene fusion may have occurred.

2. The method according to claim 1, characterized in that S2 includes the following steps:

S2.1: Align the third-generation transcriptome sequencing data against the reference genome and its annotation file to obtain FLNC reads whose alignments are split between two different positions on the reference genome;

S2.2: Determine whether the two different positions on the reference genome, and the segments of the FLNC read corresponding to those two positions, meet the fusion gene decision conditions. When all of the fusion gene decision conditions are met, classify the FLNC read as an FLNC read in which gene fusion may have occurred and obtain the gene pair that may participate in the fusion, extract the sequence of that FLNC read, and determine the fusion position.

3. The method according to claim 2, characterized in that the fusion gene decision conditions are:

2) the positions of the 5' segment and the 3' segment on the FLNC read do not exceed the maximum overlap length or the maximum gap length, and are not less than the minimum total length;

4) the two different positions on the reference genome meet one of the following conditions: A, they are on different chromosomes; B, they are on the same chromosome but in opposite orientations; C, they are on the same chromosome and in the same orientation, but the distance between them exceeds the maximum intron length in the genome annotation; and

5) gene annotation information exists at both positions on the reference genome, and from that annotation it can be determined that the annotated gene structures at the two positions are consistent with the gene structures of the corresponding 5' segment and 3' segment, respectively.

4. The method according to claim 3, characterized in that the maximum overlap length and the maximum gap length are 5-20 bp, the minimum total length is 10-20% of the FLNC read length, the minimum alignment identity is 80-95%, and the maximum intron length is 50 kb.

5. The method according to any one of claims 1-4, characterized in that S3 includes the following steps:

S3.1: Align the second-generation transcriptome sequencing data to the FLNC reads in which gene fusion may have occurred, and for each such FLNC read identify the discordant read pairs and the junction reads;

S3.2: When the number of FLNC reads supporting a candidate gene pair identified in S2 and the number of junction reads each meet the minimum count, and the number of discordant read pairs meets the minimum pair count, determine that the candidate gene pair is fused.

6. The method according to claim 5, characterized in that the minimum count of FLNC reads in which gene fusion may have occurred is 1.

7. The method according to claim 5, characterized in that the minimum count of junction reads is 1.

8. The method according to claim 5, characterized in that the minimum count of discordant read pairs is 1 pair.