CN105830078A

CN105830078A - Systems and methods for using paired-end data in directed acyclic structure

Info

Publication number: CN105830078A
Application number: CN201480067827.2A
Authority: CN
Inventors: Deniz Kural; Nathan Meyvis
Original assignee: Seven Bridges Genomics Inc
Current assignee: Seven Bridges Genomics Inc
Priority date: 2013-10-21
Filing date: 2014-10-15
Publication date: 2016-08-03
Anticipated expiration: 2034-10-15
Also published as: SG11201603044SA; US20150112602A1; WO2015061103A1; AU2014340461B2; CA2927723A1; US9092402B2; EP3061022B1; US20150112658A1; JP2016539443A; CA2927723C; KR20160073406A; US20150302145A1; EP3061022A1; US10055539B2; US9063914B2; US20150310167A1; CN105830078B; WO2015061099A1; US10204207B2

Abstract

Methods of analyzing a transcriptome that involve obtaining at least one pair of paired-end reads from a transcriptome from an organism, finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes), identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.

Description

Systems and methods for using paired-end data in directed acyclic structure

Cross reference to related applications

This application claims the benefit of and priority to U.S. Patent Application Serial No. 14/157,979, filed January 17, 2014, which claims the benefit of and priority to U.S. Provisional Patent Application Serial No. 61/893,467, filed October 21, 2013, the contents of which are incorporated by reference herein.

Sequence listing

This application contains a sequence listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. The ASCII-formatted sequence listing, created on October 14, 2014, is named SBG-009-01WO-2-Sequences_ST25 and is 1,217 bytes in size.

Technical field

The present invention relates generally to bioinformatics, and particularly to analysis using directed acyclic data structures.

Background

The transcriptome is the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition. The transcriptome is central to organismal function, development, and disease. The very nature and vitality of an organism arises from all of the transcripts (including mRNAs, non-coding RNAs, and small RNAs), their expression levels, gene splicing patterns, and post-transcriptional modifications. In fact, a much greater fraction of the human genome than was expected is now known to be transcribed. See Bertone et al., 2004, Global identification of human transcribed sequences with genome tiling arrays, Science 306:2242-2246.

Methods for transcriptome analysis based on next-generation sequencing (NGS) technologies have been reported. See, e.g., Wang et al., 2009, RNA-Seq: a revolutionary tool for transcriptomics, Nat Rev Genet 10(1):57-63. The RNA-Seq (RNA sequencing) method promises to rapidly generate high volumes of transcriptome data.

However, RNA-Seq faces informatics challenges, such as storing and analyzing large amounts of data, which must be overcome to make good use of the reads. Aligning and analyzing RNA-Seq reads presents problems not only in the nature of the information involved but also in its volume. Mapping reads with very large data sets is computationally challenging, and analytical methods for differential expression are only beginning to emerge. See Garber et al., 2011, Computational methods for transcriptome annotation and quantification using RNA-Seq, Nat Meth 8(6):469-477. The computational demands of mapping the large number of reads from RNA-Seq are greatly multiplied by the large amount of reference data that is becoming available.

For example, the GENCODE Consortium seeks to identify all gene features in the human genome that have been reported. Harrow et al., 2012, GENCODE: The reference human genome annotation for The ENCODE Project, Genome Res 22:1760-1774. The volume of material encompassed by the GENCODE project is formidable. Not only do new protein-coding loci continue to be added, but the number of annotated alternative splicing transcripts also steadily increases. The GENCODE 7 release includes more than 20,000 protein-coding and almost 10,000 long non-coding RNA (lncRNA) loci, as well as more than 33,000 coding transcripts not represented in other sources. GENCODE also includes other features such as untranslated regions (UTRs), long intergenic non-coding RNA (lincRNA) genes, short non-coding RNAs, and alternative splice patterns. Even if an RNA-Seq study started with only a limited amount of new data, the volume of potential reference data from sources such as GENCODE is so large that making sense of the new data is computationally challenging.

Summary of the invention

The invention generally provides systems and methods for analysis of paired-end reads using a reference organized as a directed acyclic graph (DAG) or similar data structure. Features in the reference, such as exons and introns, provide nodes in the DAG, and those features are linked to one another in their canonical genomic order by edges. The DAG can scale to any size and can, in fact, be populated in the first instance by imported sequence from an extrinsic annotated reference. Where new data includes paired-end reads, each read of a pair could potentially align anywhere along a genome of length L, and in a complex DAG there may be many more positions (i.e., on average, if the DAG covers any given locus n-fold, then aligning a pair of paired-end reads amounts to choosing from (nL)^2 options). However, insert length information provides a means of constraining potential alignments not only to within a certain distance of one another in the DAG but also, by excluding paths, to a number of options that is linear in L. Paired-end reads may provide other information as well. For example, paired-end reads can provide a basis for modifying the DAG, can be used to infer isoform frequencies, or both. In fact, a set of paired-end reads with insert length information carries a quality of information that is well suited to focusing DAG-based alignment and to updating the DAG to represent isoforms that actually occur in nature but have not been discovered or noted. Paired-end reads are therefore particularly well suited to RNA-Seq projects, allowing very large RNA-Seq data sets to be analyzed against very large reference sets. Using paired-end RNA-Seq reads to discover all isoforms of a transcriptome is not computationally intractable, and transcriptomics can move at a pace and throughput not previously possible. One or any number of transcriptomes can be studied from one or any number of organisms to increase understanding of organismal function, development, and disease.
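The difference between the (nL)^2 search space and a search space that is linear in L can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the disclosed method; the function names and the `placements_per_anchor` parameter (the bounded number of downstream placements that remain once the insert length is applied) are hypothetical.

```python
def unconstrained_pair_options(n: int, L: int) -> int:
    """Each mate may independently start at any of n*L positions."""
    return (n * L) ** 2


def constrained_pair_options(n: int, L: int, placements_per_anchor: int) -> int:
    """The first mate still has n*L possible positions, but the insert
    length leaves only a bounded number of placements for the second mate
    (placements_per_anchor is an assumed, illustrative bound)."""
    return n * L * placements_per_anchor


# Example: a locus covered 3-fold over a length of 1,000 positions.
print(unconstrained_pair_options(3, 1000))   # quadratic in L
print(constrained_pair_options(3, 1000, 4))  # linear in L
```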

Aspects of the invention relate to creation of a DAG that includes features (e.g., exons, introns, etc.) of a reference, connected by edges at their proximal ends in genomic order. This DAG data structure has applications in transcriptomics including, for example, identifying isoforms. The DAG reference is used in transcriptomics involving RNA-Seq data that includes paired reads. RNA-Seq data can be analyzed using a DAG reference that includes features as nodes (e.g., created from an annotated genome or transcriptome). Systems and methods of the invention provide improved alignment using paired-end reads, construction of the DAG using paired-end read information, inference of isoform probabilities or frequencies, or a combination thereof.

In certain aspects, the invention provides a method of analyzing a transcriptome. The method includes obtaining at least one pair of paired-end reads from a transcriptome of an organism, finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences such as exons or transcripts and edges connecting pairs of nodes), identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment. The method may include excluding any path that is not a candidate path from any alignment calculation involving the pair of paired-end reads.
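The candidate-path step can be sketched as a depth-first walk from the node where the first read aligned. This is a minimal illustration under stated assumptions, not the claimed implementation: path length is taken as the summed length of the nodes on the path (both end nodes included), "substantially similar" is modeled as a fixed tolerance `tol`, and the exon names and lengths are made up.

```python
from typing import Dict, List


def candidate_paths(lengths: Dict[str, int],
                    edges: Dict[str, List[str]],
                    start: str,
                    insert_len: int,
                    tol: int) -> List[List[str]]:
    """Return every path beginning at `start` whose total node length is
    within `tol` of `insert_len`. All node lengths are assumed positive."""
    found: List[List[str]] = []

    def walk(path: List[str], total: int) -> None:
        if abs(total - insert_len) <= tol:
            found.append(path)
        # Lengths are positive, so any longer path can only overshoot further.
        if total >= insert_len + tol:
            return
        for nxt in edges.get(path[-1], []):
            walk(path + [nxt], total + lengths[nxt])

    walk([start], lengths[start])
    return found


# Hypothetical exon DAG: A -> B -> C -> D, with skipping edges A -> C and B -> D.
lengths = {"A": 100, "B": 50, "C": 200, "D": 100}
edges = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"]}
print(candidate_paths(lengths, edges, "A", insert_len=300, tol=20))
```

Only the paths returned here would then be passed to the more expensive alignment step; all other paths are excluded from the calculation for that read pair.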

The method is suited to any number of paired-end reads. The node and the downstream node may represent a pair of exons from which the pair of paired-end reads was obtained. The method may include identifying an isoform based on the determined optimal-scoring alignment. In some embodiments, the identified isoform is a novel isoform. The directed acyclic data structure may be updated to represent the novel isoform (e.g., by adding at least one new node or edge).

Related aspects of the invention provide a computer system for transcriptomics. The system has a processor coupled to a non-transitory memory and stores a directed acyclic data structure having nodes representing RNA sequences and edges connecting pairs of nodes. The system is operable to obtain a pair of paired-end reads from a transcriptome, find an alignment with an optimal score between a first read of the pair and a node in the data structure, identify candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads, and align the paired-end reads to the candidate paths to determine an optimal-scoring alignment. Preferably, the system excludes all non-candidate paths from subsequent alignment attempts. In some embodiments, the system is further operable to obtain a plurality of pairs of paired-end reads from cDNA fragments. Preferably, the system is operable to determine an optimal-scoring alignment for each of the plurality of pairs. The cDNA fragments may represent a single transcriptome. The system may identify an isoform, such as a novel isoform, based on the determined optimal-scoring alignment. The system is operable to update the directed acyclic data structure to represent the novel isoform. In some embodiments, updating the directed acyclic data structure to represent the novel isoform includes adding at least one new node. The node and the downstream node may represent a pair of exons from which the pair of paired-end reads was obtained. The system is operable to exclude any path that is not a candidate path from any alignment calculation involving the pair of paired-end reads.

Other aspects of the invention provide for expression analysis through, for example, systems and methods useful for inferring isoform frequencies.

In certain aspects, the invention provides a method of inferring isoform frequencies in a transcriptome. The method includes obtaining a plurality of pairs of paired-end reads from a transcriptome of an organism and optionally aligning those reads to a reference data structure. The data structure includes a plurality of features from a genome, each represented as a node. Pairs of nodes, or features, are connected at their proximal ends by edges (that is, within a pair, an edge connects the two members to each other at the ends that are proximal on the genome; because any member of a pair may also be a member of other pairs, either or both ends of a feature may be connected to other features by one or any number of edges). Distances between the aligned pairs of paired-end reads, and the frequencies of those distances, are determined. A set of isoform paths and isoform frequencies is determined such that representing the isoform paths through the structure at the isoform frequencies results in pairs of features being included in the isoform paths at the frequencies of the distances between the aligned pairs. Each path in the determined set of isoform paths may represent a transcript isoform present in the organism. The distance between an aligned pair may be the path length from the upstream end of the first aligned member of the pair to the downstream end of the second aligned member of the pair. In some embodiments, determining the set of isoform paths includes discovering a novel isoform path. The reference data structure may be updated to include the novel isoform path.

Determining the set of isoform paths may be performed algebraically, for example, by solving a set of linear equations. In some embodiments, an isoform frequency is obtained from biological data prior to solving the set of linear equations (e.g., so that the remaining variables in the linear equations can be determined). Additionally or alternatively, the frequencies of the distances between the aligned pairs may be normalized prior to determining the set of isoform paths and isoform frequencies.

The data structure may be built from an existing database of genetic features. For example, each node may represent a feature from that database.

In related aspects, the invention provides a computer system for inferring isoform frequencies in a transcriptome. The system includes a processor coupled to a non-transitory memory and stores a reference data structure that includes a plurality of features from a genome, each represented as a node, in which pairs of the plurality of features are connected at their proximal ends by edges. The system can be used to obtain or determine distances between aligned pairs of a plurality of pairs of paired-end reads from a transcriptome, and the frequencies of those distances. A set of isoform paths and isoform frequencies can then be determined such that representing the isoform paths through the structure at the isoform frequencies results in pairs of features being included in the isoform paths at the frequencies of the distances between the aligned pairs. The features may include exons, and the system is further operable to align the plurality of pairs of paired-end reads to the reference data structure. Each path in the determined set of isoform paths may represent a transcript isoform present in the organism. Determining the set of isoform paths may include discovering a novel isoform path. The system may update the reference data structure to include the novel isoform path. The distance between an aligned pair may include the path length from the upstream end of the first aligned member of the pair to the downstream end of the second aligned member of the pair. In some embodiments, determining the set of isoform paths includes solving a set of linear equations. In some embodiments, determining the set of isoform paths includes obtaining an isoform frequency from biological data prior to solving the set of linear equations. The system is operable to normalize the frequencies of the distances between the aligned pairs prior to determining the set of isoform paths and isoform frequencies. The data structure may include nodes representing each annotated feature in a remote genetic feature database.
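As a worked illustration of solving for isoform frequencies algebraically, consider a hypothetical gene with exons A, B, C, and D and three isoform paths: A-B-C-D (frequency f1), A-C-D (f2), and A-B-D (f3). The observed values below are invented for the example: suppose 80% of aligned pairs have a distance consistent with the A-B junction (used by f1 and f3), 70% are consistent with the C-D junction (used by f1 and f2), and the frequencies sum to 1. The sketch solves the resulting system with exact fractions using Gauss-Jordan elimination from the standard library only; it assumes a square, non-singular system.

```python
from fractions import Fraction
from typing import List, Sequence


def solve_linear(a: Sequence[Sequence[int]], b: Sequence[str]) -> List[Fraction]:
    """Solve a*x = b by Gauss-Jordan elimination with exact arithmetic.
    Assumes `a` is square and non-singular (otherwise StopIteration is raised)."""
    n = len(b)
    # Build the augmented matrix [a | b].
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]


# Rows: which isoforms (f1, f2, f3) contribute to each observed frequency.
coefficients = [
    [1, 0, 1],  # A-B junction: isoforms 1 and 3
    [1, 1, 0],  # C-D junction: isoforms 1 and 2
    [1, 1, 1],  # all frequencies sum to 1
]
observed = ["0.8", "0.7", "1"]
f1, f2, f3 = solve_linear(coefficients, observed)
print(f1, f2, f3)
```

If one frequency were already known from biological data, its column could be moved to the right-hand side, leaving a smaller system in the remaining variables.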

Brief description of the drawings

Fig. 1 shows a pictorial representation of a read being compared to a DAG.

Fig. 2 shows the actual matrices that correspond to the comparison.

Fig. 3 provides a diagram of methods of the invention.

Fig. 4 illustrates a hypothetical DAG and a pair of paired-end reads.

Fig. 5 shows candidate paths through the hypothetical DAG.

Fig. 6 shows the candidate paths from the hypothetical DAG.

Fig. 7 presents three alternative histories involving sequences.

Fig. 8 depicts an alignment according to the invention.

Fig. 9 shows an isoform frequency distribution.

Fig. 10 illustrates a system for implementing methods of the invention.

Detailed description of the invention

It is understood that genes do not in general code for single, specific proteins. Instead, different subsets of a given gene's exons are transcribed in different proportions at different times. The different proteins created by these different transcripts are called isoforms, and by extension the different transcripts themselves are also called isoforms. As discussed above, contemporary bioinformatics has provided at least partially annotated transcriptomes, potentially useful as references. For example, the Gencode project has catalogued not only genes but also exons (along with other regions) in the human genome.

Systems and methods described herein are used to efficiently exploit the information in those annotated transcriptomes. Indeed, the invention includes the insight that aligning paired-end reads to an annotated transcriptome is ideally suited to analytical approaches in which sequence data is stored as a DAG-based data structure.

Embodiments of the invention provide methods of creating a directed acyclic data structure from the data in an annotated transcriptome. In such a data structure, features (e.g., exons and introns) from the annotated transcriptome are represented as nodes, which are connected by edges. An isoform from the real-world transcriptome is thus represented by a corresponding path through the DAG-based data structure.

Aspects of the invention relate to creation of a DAG that includes features such as introns and exons from one or more known references. A DAG is understood in the art to refer to data that can be presented as a graph as well as to a graph that presents that data. The invention provides methods for storing a DAG as data that can be read by a computer system for bioinformatic processing or for presentation as a graph. A DAG can be saved in any suitable format including, for example, a list of nodes and edges, a matrix or a table representing a matrix, an array of arrays or similar variable structure representing a matrix, in a language built with syntax for graphs, in a general markup language purposed for a graph, or others.

In some embodiments, a DAG is stored as a list of nodes and edges. One such way is to create a text file that includes all nodes, with an ID assigned to each node, and all edges, each with the node ID of its starting and ending node. Thus, for example, were a DAG to be created for the two sentences "See Jane run" and "Run, Jane run", a case-insensitive text file could be created. Any suitable format could be used. For example, the text file could include comma-separated values. Naming this DAG "Jane" for future reference, in this format, the DAG "Jane" may read as follows: 1 see, 2 run, 3 jane, 4 run, 1-3, 2-3, 3-4. One of skill in the art will appreciate that this structure is applicable to the introns and exons represented in Figs. 7 and 8 and the accompanying discussion below.

In certain embodiments, a DAG is stored as a table representing a matrix (or an array of arrays or similar variable structure representing a matrix) in which the (i, j) entry of the N×N matrix denotes that node i and node j are connected (where N is a vector containing the nodes in genomic order). For the DAG to be acyclic simply requires that all non-zero entries be above the diagonal (assuming nodes are represented in genomic order). In a binary case, a 0 entry represents that no edge exists from node i to node j, and a 1 entry represents an edge from i to j. One of skill in the art will appreciate that a matrix structure allows values other than 0 and 1 to be associated with an edge. For example, any entry may be a numerical value indicating a weight, or a number of times used, reflecting some natural quality of observed data in the world. A matrix can be written to a text file as a table or a linear series of rows (e.g., row 1 first, followed by a separator, etc.), thus providing a simple serialization structure.
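The node-and-edge list for the DAG "Jane" can be read back into memory with a few lines of code. The sketch below is illustrative only: it assumes the list is written with a space between each node ID and its label, that edges are written as `start-end`, and that labels themselves contain neither commas nor hyphens. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_node_edge_list(text: str) -> Tuple[Dict[int, str], List[Tuple[int, int]]]:
    """Split a comma-separated node/edge list into a node table and an edge list."""
    nodes: Dict[int, str] = {}
    edges: List[Tuple[int, int]] = []
    for token in (t.strip() for t in text.split(",")):
        if "-" in token:
            # Edge entry such as "1-3": start node ID, end node ID.
            src, dst = token.split("-")
            edges.append((int(src), int(dst)))
        else:
            # Node entry such as "3 jane": node ID followed by its label.
            node_id, label = token.split(maxsplit=1)
            nodes[int(node_id)] = label
    return nodes, edges


jane_text = "1 see, 2 run, 3 jane, 4 run, 1-3, 2-3, 3-4"
jane_nodes, jane_edges = parse_node_edge_list(jane_text)
print(jane_nodes)
print(jane_edges)
```

Reading the edges back, the path 1, 3, 4 spells "see jane run" and the path 2, 3, 4 spells "run jane run", which are the two sentences the DAG was built from.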
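The matrix form and the acyclicity condition can be sketched as follows. This is a minimal example assuming nodes are numbered 1..N in genomic order and a binary matrix is wanted; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def adjacency_matrix(n: int, edges: Sequence[Tuple[int, int]]) -> List[List[int]]:
    """Build an n x n binary matrix with entry (i, j) = 1 for an edge from i to j.
    Node IDs are 1-based; matrix indices are 0-based."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        matrix[i - 1][j - 1] = 1
    return matrix


def all_entries_above_diagonal(matrix: Sequence[Sequence[int]]) -> bool:
    """True when every non-zero entry lies strictly above the diagonal, which
    guarantees the graph is acyclic when nodes are listed in genomic order."""
    return all(matrix[i][j] == 0
               for i in range(len(matrix))
               for j in range(i + 1))


jane_matrix = adjacency_matrix(4, [(1, 3), (2, 3), (3, 4)])
print(jane_matrix)
print(all_entries_above_diagonal(jane_matrix))
```

Replacing the 1 entries with counts or weights requires no change to the acyclicity check, since it only tests for zeros on and below the diagonal.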

Once the nodes are defined, one useful way to serialize a matrix DAG is to use comma-separated values for the entries. Using this format, the DAG "Jane" would include the same node definitions as above, followed by the matrix entries. This format could read as:

1 see, 2 run, 3 jane, 4 run

,,1,\,,1,\,,,1\,,,

where entries of zero (0) are simply omitted and "\" is a newline character.
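The matrix serialization shown above can be produced and read back with a pair of small helpers. The sketch assumes zero entries are written as empty fields, rows are joined with a backslash as the row separator, and entries are integers; the function names are hypothetical.

```python
from typing import List, Sequence


def serialize_matrix(matrix: Sequence[Sequence[int]]) -> str:
    """Write each row as comma-separated values, omitting zeros,
    and join the rows with a backslash."""
    return "\\".join(
        ",".join("" if value == 0 else str(value) for value in row)
        for row in matrix
    )


def deserialize_matrix(text: str) -> List[List[int]]:
    """Inverse of serialize_matrix: empty fields become zeros."""
    return [[int(cell) if cell else 0 for cell in row.split(",")]
            for row in text.split("\\")]


jane_rows = [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
jane_serialized = serialize_matrix(jane_rows)
print(jane_serialized)
```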

Embodiments of the invention include storing a DAG in a language built with syntax for graphs. For example, the DOT language, provided with the graph visualization software package known as Graphviz, provides a data structure that can be used to store a DAG with auxiliary information and that can be converted into graphic file formats using a number of tools available from the Graphviz web site. Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains. The Graphviz layout programs take descriptions of graphs in a simple text language and make diagrams in useful formats, such as images and SVG for web pages; PDF or PostScript for inclusion in other documents; or display in an interactive graph browser.

In related embodiments, a DAG is stored in a general markup language purposed for a graph, or in another format. Following the descriptions of a linear text file or a comma-separated matrix above, one of skill in the art will recognize that a language such as XML can be used (extended) to create labels (markup) defining nodes and their headers or IDs, edges, weights, etc. However a DAG is structured and stored, embodiments of the invention involve using nodes to represent features such as exons and introns. This provides a useful tool for analyzing RNA-Seq reads and for discovering, identifying, and representing isoforms.
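A DAG held as a node table and edge list can be written out as DOT text with plain string formatting. The sketch below only generates text in the DOT `digraph` syntax; it does not call Graphviz, and it assumes labels contain no double quotes. The `n` prefix on node names and the function name are illustrative choices.

```python
from typing import Dict, Sequence, Tuple


def to_dot(name: str, nodes: Dict[int, str], edges: Sequence[Tuple[int, int]]) -> str:
    """Render a DAG as DOT source: one labeled statement per node, one per edge."""
    lines = [f"digraph {name} {{"]
    for node_id, label in nodes.items():
        lines.append(f'    n{node_id} [label="{label}"];')
    for src, dst in edges:
        lines.append(f"    n{src} -> n{dst};")
    lines.append("}")
    return "\n".join(lines)


jane_dot = to_dot("Jane",
                  {1: "see", 2: "run", 3: "jane", 4: "run"},
                  [(1, 3), (2, 3), (3, 4)])
print(jane_dot)
```

The resulting text can be saved to a file and passed to a Graphviz layout program to render the diagram.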
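One way such markup might look is sketched below using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`dag`, `node`, `edge`, `source`, `target`, `weight`) are invented for this illustration and do not follow any particular published schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Sequence, Tuple


def to_xml(name: str, nodes: Dict[int, str], edges: Sequence[Tuple[int, int]]) -> str:
    """Serialize a DAG as XML with one element per node and per edge.
    Every edge is given an illustrative default weight of 1."""
    root = ET.Element("dag", name=name)
    for node_id, label in nodes.items():
        ET.SubElement(root, "node", id=str(node_id), label=label)
    for src, dst in edges:
        ET.SubElement(root, "edge", source=str(src), target=str(dst), weight="1")
    return ET.tostring(root, encoding="unicode")


jane_xml = to_xml("Jane",
                  {1: "see", 2: "run", 3: "jane", 4: "run"},
                  [(1, 3), (2, 3), (3, 4)])
print(jane_xml)
```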

Where nodes represent features or fragments of features, isoforms can be represented by paths through those fragments. An exon being "skipped" is represented by an edge connecting some previous exon to some later exon. Presented herein are techniques for constructing a DAG to represent alternative splicing or isoforms of genes. Alternative splicing is discussed in Lee and Wang, 2005, Bioinformatics analysis of alternative splicing, Brief Bioinf 6(1):23-33; Heber et al., 2002, Splicing graphs and EST assembly problems, Bioinformatics 18 Suppl:s181-188; Leipzig et al., 2004, The alternative splicing gallery (ASG): Bridging the gap between genome and transcriptome, Nucl Ac Res 23(13):3977-2983; and LeGault and Dewey, 2013, Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs, Bioinformatics 29(18):2300-2310, the contents of each of which are incorporated by reference. Additional discussion may be found in Florea et al., 2005, Gene and alternative splicing annotation with AIR, Genome Research 15:54-66; Kim et al., 2005, ECgene: Genome-based EST clustering and gene modeling for alternative splicing, Genome Research 15:566-576; and Xing et al., 2006, An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs, Nucleic Acids Research, 34, 3150-3160, the contents of each of which are incorporated by reference.
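The correspondence between isoforms and paths can be made concrete by enumerating every path between a first and a last exon. The following is a minimal sketch on a made-up three-exon gene in which an extra edge from E1 to E3 represents skipping of exon E2; the function name and exon labels are hypothetical, and exhaustive enumeration is only practical for small graphs.

```python
from typing import Dict, List


def enumerate_paths(edges: Dict[str, List[str]], source: str, sink: str) -> List[List[str]]:
    """Return every path from `source` to `sink` in a DAG given as an adjacency map."""
    if source == sink:
        return [[sink]]
    return [[source] + rest
            for nxt in edges.get(source, [])
            for rest in enumerate_paths(edges, nxt, sink)]


# E1 -> E2 -> E3 is the full transcript; E1 -> E3 skips exon E2.
exon_edges = {"E1": ["E2", "E3"], "E2": ["E3"], "E3": []}
isoform_paths = enumerate_paths(exon_edges, "E1", "E3")
print(isoform_paths)
```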

1. Using paired-end reads

The invention relates to transcriptomics involving RNA-Seq data that includes paired-end reads. Systems and methods of the invention provide for improved alignments using paired-end reads. In certain embodiments, paired-end reads are used to constrain alignments, thereby greatly reducing the computational cost of an RNA-Seq project.

Much sequencing data is paired-end data. There are ways in which features of such data ought to manifest themselves in alignments: the distance between the locations to which paired-end mates are assigned in alignment ought approximately to correspond to the insert size given by the sequencing run. In general, the information given by insert sizes is related to the information that alignment algorithms attempt to uncover.

Several methods for producing paired nucleic acid reads are known in the art. Typically, the methods involve some form of selectively fragmenting a nucleic acid sample, resulting in sequences of a known size distribution. The ends of the fragmented sample are typically protected with a functional group or by self-cyclizing, whereupon the remaining nucleic acid sample material is removed. The ends are then amplified and sequenced, resulting in a population of nucleic acid reads that are known to have originated some distance from each other.

Mate-pair sample preparation kits are commercially available, for example, the Nextera kit from Illumina (San Diego, CA). To prepare paired nucleic acids for sequencing on such a platform, a nucleic acid sample is fragmented into segments between 2 kb and 5 kb in length. The fragment ends are then biotinylated to facilitate later recovery using affinity column separation. The biotinylated ends are joined to create a circular DNA segment. At this point, non-circularized nucleic acids from the sample are digested and subsequently removed. The remaining circularized DNA is then re-fragmented to yield fragments suitable for clustering and sequencing. The original ends, which are biotinylated, can be removed from the remaining fragments of the circularized DNA with an affinity column or streptavidin-coated beads. The recovered ends are then end-repaired and poly-A tailed, and sequencing adapters are added prior to amplification and sequencing.

2. comparison

The present invention is provided to method and system sequence reads and the oriented acyclic data structure representing annotated reference compared.Using the alignment algorithm of the present invention, even if there is the big quantity being associated with RNA sequencing result and generally comprehensively reference, reading can also be mapped rapidly.With the most identical, DAG is the ideal goal of comparison.Compared with using annotated transcript profile in the workflow of certain comparison afterwards, it is much effective for carrying out and being compared by the DAG of annotated transcript profile structure.

By using DAG as with reference to obtaining a large amount of benefits.Such as, and compare with a reference, and the result being then attempt to adjust according to annotated transcript profile certain comparison is compared, and compares more accurate with DAG.This is primarily due to former approach and imposes factitious asymmetric between another sequence represented in the sequence and transcript profile of initial comparison.Compared with attempting to compare with the linear order of every kind of physical possibilities (quantity of this type of probability is by exponential increase generally in several abutments), comparing with the target representing all of related physical probability potentially is that calculating is upper the most much effective.

Embodiments of the invention comprise compares one or more readings with reference data structure.This comparison can comprise comparison based on DAG, its edge finding out the data structure optimally representing reading and node.This comparison can also comprise the example of exon comparison by reading with the node from DAG in a pair wise manner.

In contrast the part generally comprised along target is placed a sequence, introduce room according to algorithm, the degree of two sequences match is marked, and preferably along the most right with reference to different positions is repeated this.Optimum scoring coupling is considered as comparison, and represents the deduction of the content represented about sequence data.In certain embodiments, the comparison to a pair nucleotide sequence is marked and is included that the probability for displacement and insertion and deletion arranges value.When than right single base time, coupling or do not mate to replace probability and contribute to alignment score, displacement probability can be to be such as 1 and for not mating for-0.33 for coupling.Insertion and deletion is to deduct gap penalty from alignment score, and gap penalty can be such as-1.Gap penalty and displacement probability can be based on the Heuristics how developed about sequence or a priori assumptions.Their value affects the comparison obtained.Especially, whether relationship affect displacement or insertion and deletion between gap penalty and displacement probability will be favourable in the comparison obtained.
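A column-by-column scorer makes the example values concrete. The sketch below assumes the two sequences have already been aligned and padded with "-" for gap positions, and it reads the example gap penalty as -1 added per gap character; both the function name and that reading are assumptions made for illustration.

```python
def score_alignment(x_aligned: str, y_aligned: str,
                    match: float = 1.0,
                    mismatch: float = -0.33,
                    gap: float = -1.0) -> float:
    """Sum per-column scores for two equal-length aligned strings ('-' marks a gap)."""
    if len(x_aligned) != len(y_aligned):
        raise ValueError("aligned strings must have equal length")
    total = 0.0
    for a, b in zip(x_aligned, y_aligned):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # matching bases
        else:
            total += mismatch   # substitution
    return total


print(score_alignment("ACGT", "ACGT"))  # four matches
print(score_alignment("AC-T", "ACGT"))  # three matches and one indel
print(score_alignment("ACGT", "AGGT"))  # three matches and one substitution
```

With these values a single substitution costs less than a single indel, so an aligner using them would tend to prefer substitutions over gaps when both explain the data.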

Formally, comparison represents two deduction relations between sequence x and y.Such as, in certain embodiments, x and y is respectively mapped to comprise two other character string x' and the y' in space by the comparison A of sequence x and y so that: (i) lx'l=ly'l；(ii) remove space from x' and y' and should return to x and y respectively；And (III) is for any i, x'[i] and y'[i] cannot be both space.

Room be any one in x' or y' in the maximum substring of continuous gap.Comparison A can comprise following three kinds of regions: (i) mate to (such as, x'[i]=y'[i])；(ii) unmatched right, (such as, x'[i] ≠ y'[i], and both are not space)；Or (III) room (such as, or x'[i..j] or y'[i..j] be room).In certain embodiments, only mate to having high just scoring a.In certain embodiments, unmatched to having negative scoring b generally, and the room of length r also has negative scoring g+rs, wherein g, s < 0.For DNA, a general marking scheme (such as, BLAST being used) makes mark a=l, scoring b=-3, g=-5 and s=-2.The scoring of comparison A be all of coupling to, unmatched to and the summation of scoring in room.The alignment score of x and y can be defined as the maximum scores among all possible comparison of x and y.
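The three conditions on an alignment and the gap score g + rs can be checked mechanically. The sketch below uses "-" as the space character and the example values a = 1, b = -3, g = -5, s = -2; the function names are hypothetical. Adjacent gap columns in x' and in y' are treated as separate gaps.

```python
from itertools import groupby


def is_valid_alignment(x: str, y: str, xa: str, ya: str, space: str = "-") -> bool:
    """Check (i) equal length, (ii) removing spaces recovers x and y,
    and (iii) no column holds a space in both strings."""
    return (len(xa) == len(ya)
            and xa.replace(space, "") == x
            and ya.replace(space, "") == y
            and all(not (p == space and q == space) for p, q in zip(xa, ya)))


def affine_score(xa: str, ya: str, a: int = 1, b: int = -3,
                 g: int = -5, s: int = -2, space: str = "-") -> int:
    """Score matched pairs at a, mismatched pairs at b, and each maximal
    gap of length r at g + r*s."""
    kinds = []
    for p, q in zip(xa, ya):
        if p == space:
            kinds.append("gap_in_x")
        elif q == space:
            kinds.append("gap_in_y")
        elif p == q:
            kinds.append("match")
        else:
            kinds.append("mismatch")
    total = 0
    for kind, run in groupby(kinds):
        r = len(list(run))
        if kind == "match":
            total += a * r
        elif kind == "mismatch":
            total += b * r
        else:
            total += g + r * s
    return total


print(is_valid_alignment("ACGTT", "ACT", "ACGTT", "AC--T"))
print(affine_score("ACGTT", "AC--T"))  # 2 matches, one gap of length 2, 1 match
```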

In some embodiments, any pair has a score a defined by a 4 × 4 matrix B of substitution probabilities. For example, B(i,i) = 1 and 0 < B(i,j) < 1 for i ≠ j is one possible scoring system. For instance, where a transition is thought to be more biologically probable than a transversion, matrix B could include B(C,T) = 0.7 and B(A,T) = 0.3, or any other set of values desired or determined by methods known in the art.

Alignment according to some embodiments of the invention includes pairwise alignment. A pairwise alignment generally involves, for a sequence Q (query) having m characters and a reference genome T (target) of n characters, finding and evaluating possible local alignments between Q and T. For any 1 ≤ i ≤ n and 1 ≤ j ≤ m, the largest possible alignment score of T[h..i] and Q[k..j], where h ≤ i and k ≤ j, is computed (i.e., the best alignment score of any substring of T ending at position i and any substring of Q ending at position j). This can include examining all substrings with cm characters, where c is a constant depending on a similarity model, and aligning each substring separately with Q. Each alignment is scored, and the alignment with the preferred score is accepted as the alignment. One of skill in the art will appreciate that there are exact and approximate algorithms for sequence alignment. Exact algorithms will find the highest-scoring alignment but can be computationally expensive. Two well-known exact algorithms are Needleman-Wunsch (J Mol Biol, 48(3):443-453, 1970) and Smith-Waterman (J Mol Biol, 147(1):195-197, 1981; Adv. in Math. 20(3), 367-387, 1976). A further improvement to Smith-Waterman by Gotoh (J Mol Biol, 162(3), 705-708, 1982) reduces the calculation time from O(m^2 n) to O(mn), where m and n are the sizes of the sequences being compared, and is more amenable to parallel processing. In the field of bioinformatics, it is Gotoh's modified algorithm that is often referred to as the Smith-Waterman algorithm. Smith-Waterman approaches are being used to align larger sequence sets against larger reference sequences as parallel computing resources become more widely and cheaply available. See, e.g., Amazon's cloud computing resources. All of the journal articles referenced herein are incorporated by reference in their entireties.

The Smith-Waterman (SW) algorithm aligns linear sequences by rewarding overlap between bases in the sequences and penalizing gaps between the sequences. SW also differs from Needleman-Wunsch in that SW does not require the shorter sequence to span the string of letters describing the longer sequence. That is, SW does not assume that one sequence is a read of the entirety of the other sequence. Furthermore, because SW is not obligated to find an alignment that stretches across the entire length of the strings, a local alignment can begin and end anywhere within the two sequences.

In some embodiments, pairwise alignment proceeds according to dot-matrix methods, dynamic programming methods, or word methods. Dynamic programming methods generally implement the Smith-Waterman (SW) algorithm or the Needleman-Wunsch (NW) algorithm. Alignment according to the NW algorithm generally scores aligned characters according to a similarity matrix S(a,b) (e.g., the aforementioned matrix B) with a linear gap penalty d. Matrix S(a,b) generally supplies substitution probabilities. The SW algorithm is similar to the NW algorithm, but any negative scoring matrix cells are set to zero. The SW and NW algorithms, and implementations thereof, are described in more detail in U.S. Pat. 5,701,256 and U.S. Pub. 2009/0119313, both of which are incorporated by reference herein in their entirety.

An alignment program that implements a version of the Smith-Waterman algorithm is MUMmer, available from the SourceForge web site maintained by Geeknet (Fairfax, VA). MUMmer is a system for rapidly aligning genome-scale sequences (Kurtz, S., et al., Genome Biology, 5:R12 (2004); Delcher, A.L., et al., Nucl. Acids Res., 27:11 (1999)). For example, MUMmer 3.0 can find all 20-base-pair or longer exact matches between a pair of 5-megabase genomes in 13.7 seconds, using 78 MB of memory, on a 2.4 GHz Linux desktop computer. MUMmer can handle the 100s or 1000s of contigs from a shotgun sequencing project, and will align them to another set of contigs or a reference using the NUCmer program included with the system. If the species are too divergent for a DNA sequence alignment to detect similarity, then the PROmer program can generate alignments based upon the six-frame translations of both input sequences.

Other exemplary alignment programs include: Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) or the ELANDv2 component of the Consensus Assessment of Sequence and Variation (CASAVA) software (Illumina, San Diego, CA); RTG Investigator from Real Time Genomics, Inc. (San Francisco, CA); Novoalign from Novocraft (Selangor, Malaysia); Exonerate, European Bioinformatics Institute (Hinxton, UK) (Slater, G., and Birney, E., BMC Bioinformatics 6:31 (2005)); Clustal Omega, from University College Dublin (Dublin, Ireland) (Sievers F., et al., Mol Syst Biol 7, article 539 (2011)); ClustalW or ClustalX from University College Dublin (Dublin, Ireland) (Larkin M.A., et al., Bioinformatics, 23, 2947-2948 (2007)); and FASTA, European Bioinformatics Institute (Hinxton, UK) (Pearson W.R., et al., PNAS 85(8):2444-8 (1988); Lipman, D.J., Science 227(4693):1435-41 (1985)).

As discussed above, it may be preferable or desirable to implement the SW alignment algorithm, or a modified version of it (discussed in greater detail below), when aligning RNA-Seq reads to a directed acyclic annotated reference genome.

The SW algorithm is easily expressed for an n × m matrix H, representing the two strings of length n and m, in terms of equation (1) below:

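$$H_{k0} = H_{0l} = 0 \quad (\text{for } 0 \le k \le n \text{ and } 0 \le l \le m) \tag{1}$$

$$H_{ij} = \max\{\,H_{i-1,j-1} + s(a_i, b_j),\ H_{i-1,j} - W_{in},\ H_{i,j-1} - W_{del},\ 0\,\} \quad (\text{for } 1 \le i \le n \text{ and } 1 \le j \le m)$$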

In the equations above, $s(a_i, b_j)$ represents either a match bonus (when $a_i = b_j$) or a mismatch penalty (when $a_i \ne b_j$), and insertions and deletions are given the penalties $W_{in}$ and $W_{del}$, respectively. In most instances, the resulting matrix has many elements that are zero. This representation makes it easier to backtrace from high to low, right to left in the matrix, thus identifying the alignment.

Once the matrix has been fully populated with scores, the SW algorithm performs a backtrack to determine the alignment. Starting with the maximum value in the matrix, the algorithm backtracks based on which of the three values ($H_{i-1,j-1}$, $H_{i-1,j}$, or $H_{i,j-1}$) was used to compute the final maximum value for each cell. The backtracking stops when a zero is reached. The optimal-scoring alignment may contain more than the minimum possible number of insertions and deletions, while containing far fewer than the maximum possible number of substitutions.
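The fill and backtrack steps of equation (1) may be sketched as follows. This is a minimal illustration only: the function name `smith_waterman`, the default score values and the tie-breaking order during the backtrack are assumptions of this sketch, not part of the disclosure.

```python
def smith_waterman(a, b, match=1, mismatch=-1, w_in=2, w_del=2):
    """Fill H per equation (1), then backtrack from the maximum cell to a zero."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]      # H[k][0] = H[0][l] = 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,     # match or mismatch
                          H[i - 1][j] - w_in,      # gap penalised by W_in
                          H[i][j - 1] - w_del,     # gap penalised by W_del
                          0)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Backtrack: follow whichever neighbour produced each cell, stop at zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - w_in:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

score, top, bottom = smith_waterman("TTGGATATGGG", "GGATA")
print(score, top, bottom)
```

Because every cell is floored at zero, the alignment is free to start and end anywhere in either string, which is the local-alignment property described above.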

When applied as SW or SW-Gotoh, the techniques use a dynamic programming algorithm to perform local sequence alignment of two strings, S and A, of sizes m and n, respectively. This dynamic programming technique employs tables or matrices to preserve match scores and avoid recomputation for successive cells. Each element of the string can be indexed with respect to a letter of the sequence; that is, if S is the string ATCGAA, then S[1] = A.

Instead of representing the optimum alignment as $H_{i,j}$ (above), the optimum alignment can be represented as B[j,k] in equation (2) below:

$$B[j,k] = \max\bigl(p[j,k],\ i[j,k],\ d[j,k],\ 0\bigr) \quad (\text{for } 0 < j \le m,\ 0 < k \le n) \tag{2}$$

The arguments of the maximum function B[j,k] are outlined in equations (3)-(5) below, wherein MISMATCH_PENALTY, MATCH_BONUS, INSERTION_PENALTY, DELETION_PENALTY and OPENING_PENALTY are all constants, and all negative except for MATCH_BONUS. The match argument p[j,k] is given by equation (3) below:

$$p[j,k] = \max\bigl(p[j-1,k-1],\ i[j-1,k-1],\ d[j-1,k-1]\bigr) + \begin{cases} \text{MISMATCH\_PENALTY}, & \text{if } S[j] \ne A[k] \\ \text{MATCH\_BONUS}, & \text{if } S[j] = A[k] \end{cases} \tag{3}$$

The insertion argument i[j,k] is given by equation (4) below:

$$i[j,k] = \max\bigl(p[j-1,k] + \text{OPENING\_PENALTY},\ i[j-1,k],\ d[j-1,k] + \text{OPENING\_PENALTY}\bigr) + \text{INSERTION\_PENALTY} \tag{4}$$

The deletion argument d[j,k] is given by equation (5) below:

$$d[j,k] = \max\bigl(p[j,k-1] + \text{OPENING\_PENALTY},\ i[j,k-1] + \text{OPENING\_PENALTY},\ d[j,k-1]\bigr) + \text{DELETION\_PENALTY} \tag{5}$$

For all three arguments, the [0,0] element is set to zero to ensure that the backtrack goes to completion, i.e., p[0,0] = i[0,0] = d[0,0] = 0.

The scoring parameters are somewhat arbitrary and can be adjusted to achieve the desired behavior of the computations. One example of scoring parameter settings for DNA (Huang, Chapter 3: Bio-Sequence Comparison and Alignment, ser. Curr Top Comp Mol Biol., Cambridge, Mass.: The MIT Press, 2002) would be:

MATCH_BONUS:10

MISMATCH_PENALTY:-20

INSERTION_PENALTY:-40

OPENING_PENALTY:-10

DELETION_PENALTY:-5

The relationship between the gap penalties above (INSERTION_PENALTY, OPENING_PENALTY) helps limit the number of gap openings, i.e., it favors grouping gaps together, by setting the gap insertion penalty higher than the gap opening cost. Of course, alternative relationships between MISMATCH_PENALTY, MATCH_BONUS, INSERTION_PENALTY, OPENING_PENALTY and DELETION_PENALTY are possible.
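Equations (2)-(5) may be transcribed literally into a three-table fill, sketched below with the example parameter values listed above. The function name `local_score` is hypothetical, and initializing every boundary cell (row 0 and column 0) of all three tables to zero is an assumption of this sketch; the text itself fixes only the [0,0] elements.

```python
MATCH_BONUS = 10
MISMATCH_PENALTY = -20
INSERTION_PENALTY = -40
OPENING_PENALTY = -10
DELETION_PENALTY = -5

def local_score(S, A):
    """Return the largest B[j,k] obtained from equations (2)-(5)."""
    m, n = len(S), len(A)
    # Assumption: all boundary cells start at 0 (this includes the [0,0] elements).
    p = [[0] * (n + 1) for _ in range(m + 1)]      # match table
    ins = [[0] * (n + 1) for _ in range(m + 1)]    # insertion table, i[j,k] in the text
    dele = [[0] * (n + 1) for _ in range(m + 1)]   # deletion table, d[j,k] in the text
    best = 0
    for j in range(1, m + 1):
        for k in range(1, n + 1):
            sub = MATCH_BONUS if S[j - 1] == A[k - 1] else MISMATCH_PENALTY
            # Equation (3)
            p[j][k] = max(p[j - 1][k - 1], ins[j - 1][k - 1], dele[j - 1][k - 1]) + sub
            # Equation (4)
            ins[j][k] = max(p[j - 1][k] + OPENING_PENALTY,
                            ins[j - 1][k],
                            dele[j - 1][k] + OPENING_PENALTY) + INSERTION_PENALTY
            # Equation (5)
            dele[j][k] = max(p[j][k - 1] + OPENING_PENALTY,
                             ins[j][k - 1] + OPENING_PENALTY,
                             dele[j][k - 1]) + DELETION_PENALTY
            # Equation (2): the zero floor is applied only when forming B[j,k].
            best = max(best, p[j][k], ins[j][k], dele[j][k], 0)
    return best

print(local_score("ACGT", "ACGT"))
```

Keeping three tables lets the cost of opening a gap be charged separately from the cost of continuing one, which is how the parameter relationship described above can be made to favor grouped gaps.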

In some embodiments, the methods and systems of the invention incorporate multi-dimensional alignment algorithms. Multi-dimensional algorithms of the invention provide for a "look-back" type analysis of sequence information (as in Smith-Waterman), wherein the look-back is conducted through a multi-dimensional space that includes multiple pathways and multiple nodes. The multi-dimensional algorithm can be used to align RNA-Seq reads against the DAG-type reference. That alignment algorithm identifies the maximum value for $C_{i,j}$ by identifying the maximum score with respect to each sequence contained at a position on the DAG (e.g., the reference sequence construct). In fact, by looking "backwards" at the preceding positions, it is possible to identify the optimum alignment across a plurality of possible paths.

The algorithm of the invention is carried out on a read (a.k.a. "string") and a directed acyclic graph (DAG), discussed above. For the purpose of defining the algorithm, let S be the string being aligned, and let D be the directed acyclic graph to which S is being aligned. The elements of the string S are bracketed with indices beginning at 1. Thus, if S is the string ATCGAA, then S[1] = A, S[4] = G, etc.

In certain embodiments, for the DAG, each letter of the sequence of a node is represented as a separate element, d. A predecessor of d is defined as:

(i) If d is not the first letter of the sequence of its node, the letter preceding d in its node is its (only) predecessor;

(ii) If d is the first letter of the sequence of its node, the last letter of the sequence of any node (e.g., all exons upstream in the genome) that is a parent of d's node is a predecessor of d.

The set of all predecessors is, in turn, represented as P[d].

In order to find the "best" alignment, the algorithm seeks the value of M[j,d], the score of the optimal alignment of the first j elements of S with the portion of the DAG preceding (and including) d. This step is similar to finding $H_{i,j}$ in equation (1) above. Specifically, determining M[j,d] involves finding the maximum of a, i, e, and 0, as defined below:

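$$M[j,d] = \max\{a,\ i,\ e,\ 0\} \tag{6}$$

where

$$e = \max_{p^* \in P[d]} \bigl\{ M[j,p^*] + \text{DELETE\_PENALTY} \bigr\}$$

$$i = M[j-1,d] + \text{INSERT\_PENALTY}$$

$$a = \begin{cases} \max_{p^* \in P[d]} \bigl\{ M[j-1,p^*] + \text{MATCH\_SCORE} \bigr\}, & \text{if } S[j] = d \\ \max_{p^* \in P[d]} \bigl\{ M[j-1,p^*] + \text{MISMATCH\_PENALTY} \bigr\}, & \text{if } S[j] \ne d \end{cases}$$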

As described above, e is the highest of the alignment scores of the first j characters of S with the portions of the DAG up to, but not including, d, plus an additional DELETE_PENALTY. Accordingly, if d is not the first letter of the sequence of its node, then there is only one predecessor, p, and e is the alignment score of the first j characters of S with the DAG (up to and including p) plus the penalty, i.e., M[j,p] + DELETE_PENALTY. In the instance where d is the first letter of the sequence of its node, there can be multiple possible predecessors, and because DELETE_PENALTY is constant, maximizing M[j,p*] + DELETE_PENALTY is the same as choosing the predecessor with the highest alignment score with the first j characters of S.

In equation (6), i is the alignment score of the first j-1 characters of the string S with the DAG up to and including d, plus an INSERT_PENALTY, which is similar to the definition of the insertion argument in SW (see equation (1)).

Additionally, a is the highest of the alignment scores of the first j-1 characters of S with the portions of the DAG up to, but not including, d, plus either a MATCH_SCORE (if the jth character of S is the same as the character d) or a MISMATCH_PENALTY (if the jth character of S is not the same as the character d). As with e, this means that if d is not the first letter of the sequence of its node, then there is only one predecessor, i.e., p. In that case a is the alignment score of the first j-1 characters of S with the DAG (up to and including p), i.e., M[j-1,p], with either a MISMATCH_PENALTY or a MATCH_SCORE added, depending upon whether d and the jth character of S match. In the instance where d is the first letter of the sequence of its node, there can be multiple possible predecessors. In this case, maximizing M[j-1,p*] plus MISMATCH_PENALTY or MATCH_SCORE is the same as choosing the predecessor with the highest alignment score with the first j-1 characters of S (i.e., the highest of the candidate M[j-1,p*] arguments) and adding either a MISMATCH_PENALTY or a MATCH_SCORE depending on whether d and the jth character of S match.

Again, as in the SW algorithm, the penalties, e.g., DELETE_PENALTY, INSERT_PENALTY, MATCH_SCORE and MISMATCH_PENALTY, can be adjusted to encourage alignment with fewer gaps, etc.

As described in the equations above, the algorithm finds the optimal (i.e., maximum) value for each read by calculating not only the insertion, deletion, and match scores for that element, but also by looking backward (against the direction of the DAG) to any prior nodes on the DAG to find a maximum score. Thus, the algorithm is able to traverse the different paths through the DAG, which contain the known mutations. Because the graphs are directed, the backtracks, which move against the direction of the graph, follow the preferred isoform toward the origin of the graph, and the optimal alignment score identifies the most likely alignment within a high degree of certainty.

Fig. 1 shows a pictorial representation of the read being compared to the DAG. The top area of Fig. 1 presents a sequence read "ATCGAA" as well as a reference sequence TTGGATATGGG (SEQ ID NO: 1) and a known insertion event TTGGATCGAATTATGGG (SEQ ID NO: 2), where the insertion is underlined. The middle of Fig. 1 shows a relationship among the read, SEQ ID NO: 1, and SEQ ID NO: 2, keeping each sequence linear to aid in visualizing the relationship. The bottom area of Fig. 1 shows the alignment constructed using a DAG. In the depicted DAG, SEQ ID NOS: 1 and 2 can both be read by reading from the 5' end of the DAG to the 3' end of the DAG, albeit along different paths. The sequence read is shown as aligning to the upper path as depicted.

Fig. 2 shows the actual matrices that correspond to the comparison. Like the Smith-Waterman technique, the illustrated algorithm of the invention identifies the highest score and performs a backtrack to identify the proper location of the read. Figs. 1 and 2 also highlight that the invention produces an actual match for the string against the construct. In instances where the sequence reads include variants that were not included in the DAG, the aligned sequence will be reported with a gap, insertion, etc.
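The recurrence of equation (6) may be sketched for the read and sequences of Fig. 1 as follows. The sketch makes several assumptions that are not in the disclosure: the function name `align_read_to_dag`, the score values, the treatment of an element with no predecessor as following an empty prefix of score 0, and the particular split of SEQ ID NO: 2 into the three nodes TTGGAT, CGAATT and ATGGG (other splits yield the same two sequences). Node sequences are assumed to be non-empty and listed in topological order.

```python
MATCH_SCORE, MISMATCH_PENALTY = 1, -1      # illustrative values, not from the disclosure
INSERT_PENALTY, DELETE_PENALTY = -2, -2    # illustrative values, not from the disclosure

def align_read_to_dag(read, node_seqs, parents):
    """Return (best score, node index, offset within node) per equation (6).

    node_seqs: non-empty node sequences in topological order.
    parents:   dict mapping a node index to the list of its parent node indices.
    """
    # Flatten the DAG so that every letter of every node is a separate element d.
    elems, last = [], {}
    for n, seq in enumerate(node_seqs):
        elems.extend((n, k, base) for k, base in enumerate(seq))
        last[n] = len(elems) - 1             # index of the last letter of node n

    def predecessors(e):
        n, k, _ = elems[e]
        if k > 0:                            # rule (i): previous letter in the same node
            return [e - 1]
        return [last[p] for p in parents.get(n, [])]   # rule (ii): last letters of parents

    M = [[0] * len(elems) for _ in range(len(read) + 1)]
    best = (0, None, None)
    for j in range(1, len(read) + 1):
        for e, (n, k, base) in enumerate(elems):       # predecessors are filled first
            P = predecessors(e)
            sub = MATCH_SCORE if read[j - 1] == base else MISMATCH_PENALTY
            a = max((M[j - 1][p] for p in P), default=0) + sub
            i = M[j - 1][e] + INSERT_PENALTY
            dele = max((M[j][p] for p in P), default=0) + DELETE_PENALTY   # e in the text
            M[j][e] = max(a, i, dele, 0)
            if M[j][e] > best[0]:
                best = (M[j][e], n, k)
    return best

# Node 0 -> node 1 -> node 2 spells SEQ ID NO: 2; node 0 -> node 2 spells SEQ ID NO: 1.
nodes = ["TTGGAT", "CGAATT", "ATGGG"]
parents = {1: [0], 2: [0, 1]}
print(align_read_to_dag("ATCGAA", nodes, parents))   # (6, 1, 3)
```

In this sketch the read scores a full six matches only along the path through the insertion node, whereas against the linear reference of SEQ ID NO: 1 alone it must absorb a mismatch or gap.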

Aligning RNA-Seq reads to a DAG of an annotated reference allows one to discover isoforms in a transcriptome. Additionally, the possibility that different forms of alternative splicing have occurred (e.g., the possibility that one codon has been skipped in one of the relevant exons) can be anticipated "for free."

Exons can be automatically separated into two nodes, one corresponding to the first codon and one corresponding to the rest of the exon; this allows both kinds of splicing to be represented in the DAG even if they have not yet been observed. Anticipating these situations can be useful, given that one codon's worth of differences in an alignment will often not penalize a candidate alignment enough to induce a search for alternatives. (The largest such penalty would be that for three insertions or three deletions, and often the penalty will be smaller due to chance.)

Representing the transcriptome as a DAG permits a natural interpretation of an isoform as a path through the DAG. As is common when one represents a physical item in a mathematical structure, the benefits are numerous. For example, systems and methods of the invention allow sequences to be visualized in an intuitive way. Sequences can be compared, for example, by showing two paths through a DAG.

Additionally, methods of the invention are conducive to exploiting graph theorems and algorithms. The analysis of paths through DAGs is a vibrant research area in modern mathematics and computer science. The ability to exploit this research to improve the study of the transcriptome is therefore valuable. For example, if edges were given weights corresponding to the probabilities of junctions being spanned by reads, then the total weight of a path could carry information about the a priori likelihood that different isoforms are realized. Maximizing (or minimizing) the weights of paths through DAGs is a known problem, and a shortest-path algorithm could be used to find those paths quickly.
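The path-weight idea above may be sketched as a single pass over the nodes in topological order. The sketch assumes that each edge weight is a probability and that the weight of a path is the product of its edge probabilities; the function name `most_probable_path`, the exon labels and the probability values are hypothetical.

```python
import math

def most_probable_path(order, edges, source, sink):
    """order: nodes in topological order; edges: {(u, v): probability in (0, 1]}."""
    best = {source: 0.0}   # best log-probability of reaching each node from source
    back = {}              # predecessor on the best path
    for v in order:
        for (u, w), prob in edges.items():
            if w == v and u in best:
                cand = best[u] + math.log(prob)   # a product becomes a sum of logs
                if v not in best or cand > best[v]:
                    best[v], back[v] = cand, u
    # Recover the path by walking the back-pointers from sink to source.
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return path[::-1], math.exp(best[sink])

# Hypothetical junction probabilities for three exons.
edges = {("e1", "e2"): 0.7, ("e1", "e3"): 0.3, ("e2", "e3"): 0.9}
path, prob = most_probable_path(["e1", "e2", "e3"], edges, "e1", "e3")
print(path, round(prob, 2))
```

Negating the log-probabilities turns the same computation into a shortest-path problem, which is the connection to shortest-path algorithms noted above.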

Fig. 3 presents a diagram of methods of the invention. As discussed above, once the DAG has been built with exons connected in canonical order, paired-end reads are obtained. The paired-end reads are then aligned to the DAG, and alignments (with alignment scores that meet a predetermined criterion) are found between the reads and a DAG comprising nodes representing RNA sequences and edges connecting pairs of nodes. The predetermined criterion may be the highest-scoring alignment. Alignment can proceed by the modified Smith-Waterman algorithm described above. For a pair of paired-end reads, the alignment can include identifying a transcript in the transcriptome based on paths through the directed acyclic data structure that include the alignments. Alignment can be made efficient by using insert size as a constraint on alignment. Paired-end reads can be used to build the DAG (e.g., to add edges).

The method is suited for analysis of reads obtained by RNA-Seq. In certain embodiments, the directed acyclic data structure represents substantially all known features for at least one chromosome. The directed acyclic data structure may be created and stored prior to finding the alignments. In some embodiments, the alignments are found by comparing each sequence read to at least a majority of the possible paths through the directed acyclic data structure. The method may include assembling the plurality of sequence reads into a contig based on the found alignments.

3. Using insert sizes as constraints on alignment

One way in which insert size information can be combined with a DAG is to use insert sizes as a constraint on alignment. Gains in both speed and accuracy are possible with this method. The method can be more accurate than aligning the paired-end mates separately because it avoids discarding important information relevant to the alignment. It is noted that aligning two paired-end mates independently amounts, approximately, to choosing two locations among the L possible locations in the genome (if the genome has length L). Thus the aligning involves choosing among roughly L^2 possibilities. If, however, the first alignment is used to strongly constrain the other alignment to some region of constant size k around the first alignment, then there are only roughly kL possibilities, which is on the order of L rather than L^2. This will be much faster.
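As a rough worked illustration of the counts above, assume a genome length of L = 3 × 10^9 and a window of k = 10^3; both figures are assumptions chosen for illustration and are not taken from the disclosure.

```latex
\begin{aligned}
L^2 &= (3\times10^{9})^2 = 9\times10^{18} \\
kL &= 10^{3}\times 3\times10^{9} = 3\times10^{12} \\
\frac{L^2}{kL} &= \frac{L}{k} = 3\times10^{6}
\end{aligned}
```

Under these assumed figures, the constrained search considers about three million times fewer placements for the pair.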

Figs. 4-6 illustrate the use of insert size information to constrain alignments.

Fig. 4 illustrates a hypothetical DAG of arbitrary structure and a pair of hypothetical paired-end reads defining an insert size k. In the DAG as drawn, nodes are squares and edges are lines. In the depicted embodiment, the insert size is drawn from an approximately normal distribution with mean k bp. The first member of the pair is aligned to the DAG (e.g., using Smith-Waterman or modified Smith-Waterman as described above). In general, paired-end sequencing involves fragments with adapters at each end of the fragment, and paired-end reads that extend inwards from the fragment-adapter boundaries. There may be a gap between the paired-end reads. Insert length may be taken to mean the length of the fragment, i.e., the distance from one fragment-adapter boundary to the other fragment-adapter boundary. See, e.g., Lindgreen, 2012, AdapterRemoval: easy cleaning of next-generation sequence reads, BMC Res Notes 5:337 (e.g., Lindgreen, Fig. 1, panel C), the contents of which are incorporated by reference.

In Fig. 5, the left-most black square represents the exon to which the first member of the pair aligned. A set of distances is determined that accounts for most (e.g., 99.95%) of the expected insert size values. For example, where insert sizes are normally distributed, the set of distances can be those that are within 3 standard deviations of k. This determined set of distances is used to select downstream nodes and thus candidate paths. The selected downstream nodes represent potential alignment sites for the second member of the pair of paired-end reads. In Fig. 5, this set of downstream nodes is also shown as filled black squares. A path that begins at the upstream node and ends at one of the downstream nodes is a candidate path to which the pair of paired-end reads is aligned.
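The selection of downstream nodes may be sketched as a bounded traversal from the node to which the first read aligned. The sketch assumes that distance is measured from the start of the upstream node to the start of the downstream node along a path, and that the accepted window is the mean insert size plus or minus three standard deviations; the function name `candidate_downstream_nodes`, the node labels and the lengths are hypothetical.

```python
def candidate_downstream_nodes(lengths, children, start, mean_insert, sd, n_sd=3):
    """Return {node: sorted path distances} for nodes reachable at an accepted distance.

    lengths:  {node: sequence length in bases}
    children: {node: list of child nodes}
    """
    lo, hi = mean_insert - n_sd * sd, mean_insert + n_sd * sd
    hits = {}
    stack = [(start, 0)]   # (node, distance from the start of `start` to the start of node)
    while stack:
        node, dist = stack.pop()
        if node != start and lo <= dist <= hi:
            hits.setdefault(node, set()).add(dist)
        nxt = dist + lengths[node]
        if nxt <= hi:      # prune paths that already exceed the window
            for child in children.get(node, ()):
                stack.append((child, nxt))
    return {node: sorted(ds) for node, ds in hits.items()}

# Hypothetical DAG: A -> B -> D -> E and A -> C -> D -> E.
lengths = {"A": 100, "B": 150, "C": 50, "D": 120, "E": 80}
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(candidate_downstream_nodes(lengths, children, "A", mean_insert=250, sd=10))
```

Every path from the start node to a returned node at one of the listed distances is a candidate path in the sense described above; all other paths can be left out of the alignment computation for this pair.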

Fig. 6 shows the candidate paths from the DAG of Fig. 4. It is noted that Fig. 6 also presents a DAG that is contained within the DAG of Fig. 4. All of the candidate paths, and only the candidate paths, are illustrated in Fig. 6. The pair of hypothetical paired-end reads defining an insert size k (from Fig. 4) may be aligned to the DAG presented in Fig. 6 to save computational cost. Without recognizing this constraint on the alignment, each read of the pair could potentially align anywhere along a genome of length L, and in a complex DAG (e.g., the DAG in Fig. 4 is approximately 4-ply across its length) there may be many more locations (i.e., on average, if a DAG is n-ply across any locus, then aligning a pair of paired-end reads amounts to choosing from (nL)^2 options). Proceeding according to the illustrated method can make a potentially very expensive computation of alignment to a DAG much smaller, yielding much greater throughput with high-volume paired-end reads from RNA-Seq projects.

4. Sensitivity to varieties of alternative splicing.

Current techniques make it difficult or impossible to distinguish between several kinds of situations when a read spans the junction of exons. One such situation is where mutations have occurred. Another such situation arises when, even in the absence of mutations, a read appears not to match on either side of a junction because a form of alternative splicing has occurred in which part of a codon has been included or skipped. Sequencing errors present yet another such situation.

Consider, as an example, the following sequence:

AUGCAUUUCCGUGGAUAGCGAUGCAUACUCACGAUGAUGAAAAUGCAUCAGAAAUAG (SEQ ID NO: 3)

where plain text regions represent exons, the first underlined portion represents an intron, and the second underlined portion is a region that, due to alternative splicing, might or might not be included in the second exon. Thus the mature RNA might take either of the following two forms:

AUGCAUUUCCGUGGAUAGAUGAAAAUGCAUCAGAAAUAG (SEQ ID NO: 4)

AUGCAUUUCCGUGGAUAGAUGCAUCAGAAAUAG (SEQ ID NO: 5)

Now suppose that the following read is obtained: TGGATAGATGAA (SEQ ID NO: 6). Such a read could have several causal histories that differ in important ways.

Fig. 7 presents three alternative histories involving these sequences. In the first situation, depicted across the top third of Fig. 7, the read is taken from SEQ ID NO: 4, and no mutation or sequencing error has occurred. In the second situation, the read is taken from SEQ ID NO: 5, and a sequencing error has occurred. In the third situation, the read is taken from transcript SEQ ID NO: 5, and the read accurately reflects a mutation.

If a linear reference sequence consists only of the plain text region (dashed underline in Fig. 7), it is very likely that the alignment score will be so high, having been boosted by many matches and having suffered only one mismatch penalty, that the algorithm in use will not flag the alignment as possibly erroneous. Rather than examine some other file or database for splicing information, it will simply posit a mutation or a single sequencing error. Aligning against a DAG reference does better.
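The ambiguity described above can be checked directly on the sequences given. The sketch below counts the fewest mismatches with which the read of SEQ ID NO: 6 can be placed, without gaps, on each mature form. The function name `fewest_mismatches` is hypothetical, and converting U to T so that the read and the transcripts share an alphabet is an assumption of this sketch.

```python
def fewest_mismatches(read, reference):
    """Smallest Hamming distance between `read` and any same-length window of `reference`."""
    n = len(read)
    return min(
        sum(r != c for r, c in zip(read, reference[k:k + n]))
        for k in range(len(reference) - n + 1)
    )

# Mature forms from the text, written in the DNA alphabet of the read.
SEQ4 = "AUGCAUUUCCGUGGAUAGAUGAAAAUGCAUCAGAAAUAG".replace("U", "T")
SEQ5 = "AUGCAUUUCCGUGGAUAGAUGCAUCAGAAAUAG".replace("U", "T")
READ = "TGGATAGATGAA"   # SEQ ID NO: 6

print(fewest_mismatches(READ, SEQ4))   # 0: exact match to the form that keeps the region
print(fewest_mismatches(READ, SEQ5))   # 1: a single mismatch against the shorter form
```

A reference holding only the shorter form therefore sees a near-perfect alignment with one mismatch, which is indistinguishable from a mutation or sequencing error unless the alternative form is also represented.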

Fig. 8 depicts an alignment of SEQ ID NO: 3 against a DAG reference (appearing now in Fig. 8 as SEQ ID NO: 4, since the intron is gone). Here the accurate alignment is immediately recovered, as it will have the highest score of any possible alignment to this DAG. Due to the nature of a DAG, Fig. 8 also depicts:

AUGCAUUUCCGUGGAUAGAUGCAUCAGAAAUAG (SEQ ID NO: 5)

This is a very general phenomenon and will occur often, given how often small parts of exons are subject to inclusion or exclusion by alternative splicing. A system involving linear references will usually fail to detect mischaracterizations when these situations arise, and the only way to prevent these errors is to align against so many linear references as to make the process prohibitively expensive (in both time and money).

5. Incorporating paired-end data properly: modifying the DAG.

The invention provides additional ways of exploiting the relationship between the insert size and the correct alignments. One can view the current paradigm as treating the insert size as a constraint on a given alignment or pair of alignments. This is an incomplete description of the situation. It may be more accurate to view insert sizes as a constraint on the combination of the alignments and the transcript.

Methods of the invention involve using insert sizes as a quality control not only on the alignments but also on the transcript. If each of two paired-end mates has a high-scoring alignment in the transcript DAG, and if this pair of alignments corresponds to a grossly implausible insert size, it will often be more reasonable to adjust beliefs about the transcript (e.g., by adding edges to the graph) than to adjust beliefs about the correctness of each of the alignments.

This can be implemented by calculating, whenever two paired-end mates align to a DAG, the insert size or sizes implied by that alignment. (More than one size might be involved because the read might be consistent with several features.) If the quotient of the smallest such insert size and the expected value of the insert size is greater than some constant (perhaps 2), the DAG is examined to find a pair of unconnected nodes which, if connected, would entail the existence of an isoform for which the quotient of the implied insert size for that isoform and the expected value of the insert size is between 0.5 and 1.5. If such a pair of nodes exists, they are connected with an edge. If several such pairs exist, the pair that yields the ratio closest to 1 is chosen and connected.
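The decision rule above may be sketched as follows. The sketch assumes that the insert size implied by connecting each candidate pair of unconnected nodes has already been computed by some other means, and that the bounds 0.5 and 1.5 are inclusive; the function name `choose_new_edge`, the node labels and the sizes are hypothetical.

```python
def choose_new_edge(smallest_implied, expected, candidates,
                    trigger=2.0, lo=0.5, hi=1.5):
    """Pick the pair of unconnected nodes to join, or None if no edge should be added.

    smallest_implied: smallest insert size implied by the current alignment of the mates.
    expected:         expected value of the insert size.
    candidates:       {(u, v): insert size implied if nodes u and v were connected}.
    """
    if smallest_implied / expected <= trigger:
        return None                       # the implied insert size is not implausible
    ratios = {pair: size / expected for pair, size in candidates.items()}
    plausible = {pair: r for pair, r in ratios.items() if lo <= r <= hi}
    if not plausible:
        return None
    # If several pairs qualify, take the one whose ratio is closest to 1.
    return min(plausible, key=lambda pair: abs(plausible[pair] - 1.0))

candidates = {("e2", "e5"): 330, ("e1", "e5"): 420, ("e3", "e6"): 700}
print(choose_new_edge(900, 300, candidates))   # ratio 3.0 exceeds the trigger
```

A returned pair would then be connected with an edge in the DAG, representing the isoform that the insert size suggests.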

In certain aspects, the invention provides a method of analyzing a transcriptome. The method includes obtaining at least one pair of paired-end reads from a transcriptome; finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure (the data structure has nodes representing RNA sequences, such as exons or transcripts, and edges connecting pairs of nodes); identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to the insert length of the pair of paired-end reads; aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment; and, where necessary or preferred, modifying the DAG to include new edges that create the candidate path indicated by a pair of paired-end reads. Those skilled in the art will recognize when a path has a length substantially similar to the insert length of the pair of paired-end reads. For example, substantially similar may mean the same number of nucleotides, or may preferably mean ±10 or ±100 nucleotides. In certain embodiments, substantially similar may be user-defined to mean that the lengths are within 10% of each other, or within 25%, 50%, or 100% of each other. In a preferred embodiment, substantially similar lengths means that one length is less than twice the other. The method is suited to a plurality of paired-end reads, such as reads from cDNA fragments (e.g., cDNA fragments representing a single transcriptome). The node and the downstream node may represent a pair of exons from which the pair of paired-end reads was obtained. The method may include identifying an isoform based on the determined optimal-scoring alignment. In certain embodiments, the identified isoform is a novel isoform. The directed acyclic data structure may be updated to represent the novel isoform (e.g., by adding a new edge). The method may include excluding any path that is not a candidate path from any alignment calculation involving the pair of paired-end reads. Other aspects of the invention provide for expression analysis by, for example, systems and methods useful for inferring isoform frequency.
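The candidate-path step can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `candidate_paths`, the toy graph, the choice to measure path length as the sum of node sequence lengths, and the 10% tolerance are all assumptions made here for illustration.

```python
from typing import Dict, List, Tuple

def candidate_paths(
    edges: Dict[str, List[str]],
    node_len: Dict[str, int],
    start: str,
    insert_len: int,
    tolerance: float = 0.10,
) -> List[Tuple[List[str], int]]:
    """Return (path, length) for every path leaving `start` whose total
    length is within `tolerance` of `insert_len`.

    Assumption: path length is the sum of the node sequence lengths."""
    lo = insert_len * (1 - tolerance)
    hi = insert_len * (1 + tolerance)
    found = []
    stack = [([start], node_len[start])]
    while stack:  # depth-first walk; terminates because the graph is acyclic
        path, length = stack.pop()
        if length > hi:
            continue  # already too long; extending the path cannot help
        if len(path) > 1 and lo <= length <= hi:
            found.append((path, length))
        for nxt in edges.get(path[-1], []):
            stack.append((path + [nxt], length + node_len[nxt]))
    return found

# Hypothetical toy DAG: four exons with made-up lengths.
EDGES = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"]}
NODE_LEN = {"A": 100, "B": 150, "C": 120, "D": 90}

result = sorted(candidate_paths(EDGES, NODE_LEN, "A", insert_len=350))
print(result)
```

With these toy values only A-B-C (length 370) and A-B-D (length 340) fall within 10% of an insert length of 350, so the second read of the pair would be aligned to those two paths only.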

6. Using paired-end data: inferring isoform frequency.

A powerful technique arises not from using the paired-end structure to align properly, but from using the alignment, with the intermediate step of inferring insert sizes from it, to infer isoform frequency. If the frequencies with which pairs of exons occur in a sample are known (see, e.g., Harrow, 2012, Genome Research 22:1760), those frequencies can be used to construct the frequencies with which isoforms occur in the sample. Moreover, different pairs of exons will generally be separated by different distances. The distances observed between paired-end mates can therefore be used (slightly indirectly) to keep track of exon frequencies.

An illustrative example is as follows. Each pair of paired-end mates is first optionally aligned to the DAG reference. The distance between a reference point (e.g., the left-most position) on each member of the pair is noted and tracked. This distance can be obtained from the alignment to the DAG, although methods of inferring isoform frequency need not require aligning the pairs to the DAG.

One suitable method for noting and tracking the distance between the reference points of each aligned pair is to use a hash table. In the distance hash, the keys may be all distances between 1 and 1000, with the value of each key initialized to 0. Incrementing can then be done by looking up the key corresponding to the distance and incrementing its value. Note that this kind of hash-table operation can be done extremely quickly, indeed in constant time. Note also that hash tables are well supported by existing bioinformatics packages such as BioPerl. Each pair of exons is then associated with the sum of the values corresponding to distances near the true distance between the exons. (That is, all of the "hits" observed at distances close to the distance between those exons are added up. The quotient of that number and the total number of pairs will closely approximate the frequency with which that exon pair occurs, relative to all exon pairs, in the sample.)
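The distance hash described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the window of ±10 nucleotides used to decide which distances count as "near" the true exon-to-exon distance, and the example distances are hypothetical, and the denominator is taken here to be the total number of pairs recorded in the hash.

```python
def build_distance_hash(distances, max_distance=1000):
    """Keys are all distances 1..max_distance, each initialized to 0;
    every observed mate-pair distance increments its key (constant time)."""
    counts = {d: 0 for d in range(1, max_distance + 1)}
    for d in distances:
        if d in counts:  # distances outside the key range are ignored
            counts[d] += 1
    return counts

def exon_pair_frequencies(counts, pair_distance, window=10):
    """Sum the hits near each exon pair's true distance, then divide by the
    total number of recorded pairs (assumed denominator)."""
    total = sum(counts.values())
    freqs = {}
    for pair, true_d in pair_distance.items():
        hits = sum(counts.get(d, 0)
                   for d in range(true_d - window, true_d + window + 1))
        freqs[pair] = hits / total if total else 0.0
    return freqs

# Hypothetical observed distances between mate reference points.
observed = [200, 203, 198, 500, 505, 200]
counts = build_distance_hash(observed)
freqs = exon_pair_frequencies(counts, {"AB": 200, "BD": 500})
print(freqs)
```

Here four of the six pairs fall near distance 200 and two near distance 500, so the exon pair AB is assigned a frequency of about 0.67 and BD about 0.33.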

From these data, isoform frequency can be determined algebraically. The preferred form of the equations may depend on the structure of the relevant isoforms, but a general technique is as follows:

Suppose there are three isoforms: the first contains exons A, B, and C; the second contains exons A, B, and D; and the third contains exons B, C, and D. The exon-pair frequencies are obtained from the aligned reads; for example, [AB] denotes the frequency of the exon pair AB among all aligned reads (that is, each time the alignment connects exon i to exon j, the corresponding frequency is incremented by 1).

Then, if the frequencies of the three isoforms are (in order) f, g, and h, the exon-pair frequencies should be as follows:

[AB]=f+g

[BC]=f+h

[BD]=g+h

When the left-hand values are known from the alignment, the isoform frequencies can be solved for. In general, methods of the invention include writing equations that relate exon-pair frequencies to isoform frequencies and using the former to obtain the latter. Those skilled in the art will recognize algebraic methods for solving the equations described. To give an illustrative example, note that the algebra above yields:

f=[AB]-g,

and:

h=[BD]-g

Therefore:

[BC]=f+[BD]-g

and:

[BC]=[AB]-g+[BD]-g

Rearranging gives:

([AB]+[BD]-[BC])/2=g

Since [BC], [AB], and [BD] are all known from the alignment, g can be calculated. Solving for f and h is then straightforward.
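As a check on the algebra, the three-isoform example can be worked through in a few lines of Python. The function name and the numeric counts are hypothetical; the relations used are exactly [AB]=f+g, [BC]=f+h, and [BD]=g+h from the example above.

```python
def solve_isoform_frequencies(ab, bc, bd):
    """Solve [AB]=f+g, [BC]=f+h, [BD]=g+h for (f, g, h)."""
    g = (ab + bd - bc) / 2  # from [BC] = [AB] - g + [BD] - g
    f = ab - g              # from [AB] = f + g
    h = bd - g              # from [BD] = g + h
    return f, g, h

# Hypothetical exon-pair counts generated by f=50, g=30, h=20.
f, g, h = solve_isoform_frequencies(ab=80, bc=70, bd=50)
print(f, g, h)
```

Substituting the recovered values back into the three equations reproduces the input counts, which is a quick way to validate any larger system written in the same manner.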

In some circumstances, a "degrees of freedom" problem may arise in which the number of variables to be solved is not less than the number of input variables. Any suitable technique may be used to address this apparent problem, and the choice may depend on the particular circumstances. Suitable techniques include normalizing the variables; solving for more than one set of isoform frequencies and choosing one afterward; obtaining at least one input value externally (e.g., from the biological sample); or supplying input information (e.g., from the literature or by assumption).

It is worth noting a graphical interpretation of the key step: suppose a graph is generated with distance on the x-axis and frequency on the y-axis. Exon pairs will then appear as peaks on that graph, and often some of those peaks will also be readily interpretable as isoform frequencies.

Accordingly, the invention provides a method of inferring isoform frequency in a transcriptome by obtaining a plurality of pairs of paired-end reads from the transcriptome and optionally aligning those reads to a reference data structure. The data structure includes a plurality of exons from a genome, each represented as a node. Pairs of nodes, or exons, are connected at their proximal ends by an edge. The distances between the aligned pairs of the paired-end reads, and the frequencies of those distances, are determined. A set of isoform paths and isoform frequencies is determined such that representing the isoform paths through the structure at the isoform frequencies results in pairs of exons being included in the isoform paths at the frequencies of the distances between the aligned pairs. Each path of the determined set of isoform paths may represent a transcript isoform present in the organism. The distance between an aligned pair may be the path length from the upstream end of the first aligned member of the pair to the downstream end of the second aligned member. In certain embodiments, determining the set of isoform paths includes finding a novel isoform path. The reference data structure may be updated to include the novel isoform path.

7. Incorporating a priori exon-count distributions and isoform-count distributions.

In certain embodiments, the invention exploits situations in which a priori data exist on the relative frequency of isoform counts. The isoform distribution of transcripts can be inferred to correlate with the distribution of protein isoforms. Questions about the distribution of exon counts and isoform counts for a given gene, and the way in which these distributions are sensitive to other facts about a sequence (e.g., whether it is a coding or non-coding sequence), have been studied thoroughly, if incompletely.

Isoform frequencies may be obtained from the literature or by performing an analysis. For example, isoform frequencies may be inferred from paired-end RNA-Seq reads as described in Section 6, "Using paired-end data: inferring isoform frequency," above. In certain embodiments, insert sizes from paired-end read data are used to infer isoform frequencies, as described in Section 8, "Using insert sizes more effectively to infer isoform probability," below. In certain embodiments, isoform frequencies are obtained from a source such as a database or the literature.

Fig. 9 represents protein isoform distributions obtained by a study of SwissProt amino acid sequences, and is a reproduction of Figure 3 from Nakao et al., 2005, "Large-scale analysis of human alternative protein isoforms: pattern classification and correlation with subcellular localization signals," Nucl Ac Res 33(8):2355-2363. In addition, Chang et al. have used programming methods to derive quantitative distributions of various alternative splicing forms. Chang et al., 2005, "The application of alternative splicing graphs in quantitative analysis of alternative splicing form from EST database," Int J. Comp. Appl. Tech 22(1):14.

The information presented in Fig. 9 represents an isoform-count distribution that can be used to aid in mapping reads to a DAG. Exon-count distribution information can be obtained similarly.

For example, distributions of the number of exons per transcript have been shown. See, e.g., Harrow et al., 2012, "GENCODE: The reference human genome annotation for The ENCODE Project," Genome Res 22:1760-1774. Transcripts at protein-coding loci show a peak at four exons per transcript, while lncRNAs show a distinct peak at two exons (see Fig. 2 of Harrow et al., 2012). The exon-count distribution from Harrow et al. or the isoform distribution from Nakao et al. can be interpreted probabilistically, with the y-axis giving relative rather than absolute frequency. This information can be used to inform judgments about the likelihood that a candidate alignment corresponds to an actual nucleotide sequence. Prior-art techniques rely on the scores of the alignments. The invention provides a way of including a priori knowledge of isoform counts.

For example, for a large set of reads, 75% may map unambiguously to regions in a set of eight isoforms {I1, I2, ..., I8}, while the other 25% may be interpreted either as mapping to I5 or as mapping to some other combination of exons corresponding to another isoform, I9. In this situation, several kinds of facts are relevant to the decision of how to align the last 25% of the reads. Such relevant facts include (i) the scores of the alignments with I5 and with I9; (ii) any further knowledge we may have about regulatory mechanisms; and (iii) whether it is more likely a priori that there are eight or nine isoforms.

The invention incorporates isoform distribution information when mapping reads to a DAG reference. Isoform frequency distribution information can be used in several different ways. A first exemplary method of including a priori isoform counts in mapping reads includes: (1) constructing a DAG for the relevant region of the genome (the edges of this DAG may be weighted or unweighted); and (2) associating with each edge of the DAG a value indicating the edge's usage. If the number of reads aligned in a way that spans the edge is less than some variable U, the usage is set to "off." This variable intuitively reflects whether the edge has been "used." One way to estimate it might be to set it equal to (the number of reads in the sample) * (the quotient of a reasonable estimate of exon length and the average read length) * (0.005).

Further, the method includes (3) associating the relevant region of the genome with a variable corresponding to a usage threshold T, perhaps setting T equal to the expected value of the number of edges in the graph plus the standard deviation of the number of edges in the graph. These numbers could be calculated by estimating the number of edges in a DAG that has numbers of nodes and maximal paths corresponding to the empirical exon frequencies, isoform frequencies, or both, that would be expected for the gene in question. The empirical frequencies may be obtained, for example, from Harrow et al., 2012 (for exon-count distributions) or from Nakao et al., 2005 (for isoform distributions).

The method includes (4) modifying the scoring scheme used in Smith-Waterman or other algorithms: if the total number N of "used" edges (i.e., edges for which the Boolean usage value is "on" or "true") exceeds T, traversal of an unused edge is punished with some penalty. The penalty may be made to increase as N increases. A reasonable formula for the penalty might be max(0, (N - T) * (the insertion penalty) * 2). The 0 is there to avoid negative penalties, i.e., bonuses for using new edges. Such bonuses may be appropriate in some contexts, however, in which case there is no need to take the maximum of 0 and the latter expression.
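Steps (2) and (4) can be sketched in Python as below. The function names, the `allow_bonus` switch, and all numeric inputs are assumptions introduced for illustration; only the two formulas, U = reads * (exon length / read length) * 0.005 and max(0, (N - T) * insertion penalty * 2), come from the description.

```python
def usage_cutoff(num_reads, exon_len_estimate, mean_read_len, factor=0.005):
    """Estimate U: reads * (exon length / average read length) * 0.005."""
    return num_reads * (exon_len_estimate / mean_read_len) * factor

def mark_edge_usage(spanning_reads, cutoff):
    """An edge is 'on' (True) unless fewer than `cutoff` reads span it."""
    return {edge: n >= cutoff for edge, n in spanning_reads.items()}

def unused_edge_penalty(n_used, threshold, insertion_penalty, allow_bonus=False):
    """Penalty for traversing an unused edge once N used edges exceed T."""
    raw = (n_used - threshold) * insertion_penalty * 2
    # Clamp at 0 unless bonuses for new edges are deliberately allowed.
    return raw if allow_bonus else max(0, raw)

# Hypothetical inputs.
U = usage_cutoff(num_reads=1_000_000, exon_len_estimate=150, mean_read_len=100)
usage = mark_edge_usage({("A", "B"): 9000, ("A", "C"): 40}, U)
n_used = sum(usage.values())
print(U, usage, unused_edge_penalty(12, 10, 3), unused_edge_penalty(8, 10, 3))
```

With these values U is about 7500 reads, the A-B edge is marked used and the A-C edge unused, and the penalty is 12 when N=12 and T=10 but 0 when N=8.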

To avoid a form of path dependence in which the order in which reads are processed affects which paths are accepted in alignment, the alignment can be repeated several times and the results compared. This does not affect the asymptotic running time of the algorithm.

An alternative exemplary method of including a priori isoform counts in mapping reads includes:

(i) aligning each read to the DAG;

(ii) comparing the observed number of isoforms to the expected number of isoforms (e.g., as obtained from an a priori probability distribution of isoform counts, given background information about the gene); and

(iii) if the difference between the observed number and the expected number exceeds some threshold, re-aligning the reads that were assigned to the rarest candidate isoforms. In the re-alignment, any alignment to a rare candidate isoform is penalized with a weight that depends on the magnitude of the difference. The penalty may be max(0, (N - T) * (the insertion penalty) * 2), where N is the total number of "used" edges (i.e., edges for which the Boolean usage value is "on" or "true") and T is the usage threshold. In certain embodiments, T is the expected value of the number of edges in the graph plus the standard deviation of the number of edges in the graph.
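The selection of reads for re-alignment in steps (ii) and (iii) might look like the following Python sketch. The function name, the read identifiers, and the decision to count only an excess of observed over expected isoforms (rather than an absolute difference) are assumptions of this sketch, not details given in the description.

```python
from collections import Counter

def reads_to_realign(assignments, expected_isoforms, threshold=0, n_rarest=1):
    """Return the reads assigned to the rarest candidate isoform(s) when the
    observed isoform count exceeds the expected count by more than `threshold`.

    `assignments` maps a read identifier to the isoform it was aligned to."""
    counts = Counter(assignments.values())
    observed = len(counts)
    if observed - expected_isoforms <= threshold:
        return []  # the observed count is consistent with the prior
    ranked = sorted(counts.items(), key=lambda kv: kv[1])  # rarest first
    rare = {iso for iso, _ in ranked[:n_rarest]}
    return [read for read, iso in assignments.items() if iso in rare]

# Hypothetical first-pass alignment of six reads to three isoforms.
first_pass = {"r1": "I1", "r2": "I1", "r3": "I2",
              "r4": "I2", "r5": "I2", "r6": "I3"}
print(reads_to_realign(first_pass, expected_isoforms=2))
print(reads_to_realign(first_pass, expected_isoforms=3))
```

In the first call three isoforms are observed against an expectation of two, so the single read supporting the rarest isoform, I3, is returned for penalized re-alignment; in the second call nothing is returned.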

By applying methods that use an a priori probability distribution of isoform counts, reads are pushed to map to a DAG so as to produce a distribution close to the a priori distribution.

8. Using insert sizes more effectively to infer isoform probability.

Suppose two exons E1 and E2 occur in different isoforms, and are separated by distances L1 and L2 in those two isoforms. It will occasionally happen that a pair of paired-end reads is obtained for which one mate maps to E1 and the other to E2. The choice of which isoform to attribute the pair to can then be reduced to the question of whether the actual distance between the reads is inferred to be L1 or L2, or those distances shifted by an appropriate amount to account for the offsets of the mates from the exon edges.

Because we generally have good background information about the distribution of insert sizes in a paired-end sequencing run, it is valuable to convert the isoform-assignment question into a question about insert sizes. Given that there will usually be no other reason to believe that the reads are more likely to have come from one isoform than the other, we can perform a simple Bayesian calculation to assign the read probabilistically to the two isoforms, according to the ratio of the probabilities of insert sizes L1 and L2 in the distribution.

This is a significant improvement over current techniques, which generally view the reads in such a case as carrying equal evidential weight for both isoforms. Incorporating insert size in the manner described is therefore a way of not ignoring relevant evidence.

Thus, in these ambiguous cases, we can count the isoforms as follows:

(1) Calculate a probability distribution of insert sizes. (Denote by P(s) the probability of a read having insert size s.) Note that this need be calculated only once, and can be done in a pre-processing step.

(2) If each of the two paired-end mates maps inside an exon:

(2a) Make a list of all isoforms that include those exons.

(2b) Associate each isoform with the insert size that would be implied by the reads having been derived from that isoform.

(2c) Calculate P(s) for each of the insert sizes calculated in (2b).

(2d) Associate with each isoform the value of P(s) divided by the sum of all the P(s) values calculated in (2c).
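Steps (1) through (2d) can be sketched in Python as follows. The function names and example insert sizes are hypothetical, P(s) is estimated here as a simple empirical histogram (an assumption; any fitted distribution could be substituted), and the equal split used when every P(s) is zero is likewise a choice made for this sketch.

```python
from collections import Counter

def insert_size_distribution(observed_sizes):
    """Step (1): empirical P(s) from insert sizes seen in the run."""
    counts = Counter(observed_sizes)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def isoform_probabilities(implied_insert, p):
    """Steps (2c)-(2d): weight each isoform by P(s) for its implied insert
    size, normalized by the sum of those P(s) values."""
    weights = {iso: p.get(s, 0.0) for iso, s in implied_insert.items()}
    total = sum(weights.values())
    if total == 0:
        # Assumption: with no evidence either way, split equally.
        return {iso: 1 / len(weights) for iso in weights}
    return {iso: w / total for iso, w in weights.items()}

# Hypothetical run: 60 pairs with insert size 300, 20 with 400, 20 with 500.
P = insert_size_distribution([300] * 60 + [400] * 20 + [500] * 20)
# Step (2b): the same read pair implies insert size 300 in one isoform, 500 in the other.
probs = isoform_probabilities({"isoform_1": 300, "isoform_2": 500}, P)
print(probs)
```

With these numbers P(300)=0.6 and P(500)=0.2, so the pair contributes roughly 0.75 of a count to the first isoform and 0.25 to the second, rather than 0.5 to each.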

Note that the techniques described herein improve the read counts for isoforms in a way that does not affect the ability to use "downstream" tools (that is, tools that take read counts as input and provide other useful output).

9. Constructing a "maximal DAG."

In certain embodiments, it may be useful to construct a DAG in which every pair of nodes is connected by an edge from the node representing the "earlier" or "left-side" exon to the node representing the "later" or "right-side" exon.

Using Gencode or a similar annotated transcriptome, nodes are created to represent exons and other features. Gencode provides enough information to associate with each node such information as its sequence, the position it occupies in a reference sequence (and therefore its length), and so on. Thus, using a computer system that includes a processor coupled to a non-transitory memory, each of a plurality of features from a genome is represented as a node. For each pair of the plurality of features, an edge is created that connects the two members of the pair by their proximal ends. The biological interpretation of this move is to conjecture the existence of, or at least to facilitate the discovery of, an isoform representing each combination of exons that respects the order in which those exons occur in the genome. There are both inductive and biochemical reasons for respecting the order in which exons occur in the genome.

For example, reports suggest that exons in multi-exon pre-mRNAs are always kept in order. See, e.g., Black, 2005, "A simple answer for a splicing conundrum," PNAS 102:4927-8. Without being bound by any particular mechanism, it may be theorized that the splicing machinery recognizes the start and end of introns, as well as branch-point adenines or other conserved sequences proximal to splice junctions. The splicing machinery, or spliceosome, moves along the pre-mRNA from 5' to 3', removing introns and splicing exons. Thus there appears to be phenomenological as well as literature support for connecting all exons in the 5'-to-3' order in which they appear along the genomic DNA strand. See also Shao et al., 2006, "Bioinformatic analysis of exon repetition, exon scrambling and trans-splicing in humans," Bioinformatics 22:692-698. For certain assumptions, methods of the invention follow the assumption that human pre-mRNAs are spliced into linear molecules that retain the exon order defined by the genomic sequence.

In addition, where it is desired to include instances of exon scrambling (see, e.g., Carrington et al., 1985, "Polypeptide ligation occurs during post-translational modification of concanavalin A," Nature 313:64-67) in the directed acyclic data structure, the assumption that exons are spliced in their original order can generally be maintained by including duplicate exons as "fictitious" exons (or, in the alternative, the assumption can be abandoned).

Accordingly, where, for example, a genome includes the following exons in the following order: ABCDEFGHI

then, using dashes to represent edges and letters to represent nodes, the complete set of edges can be represented as:

A-C; A-D; A-E; A-F; A-G; A-H; A-I; B-C; B-D; B-E; B-F; B-G; B-H; B-I; C-D; C-E; C-F; C-G; C-H; C-I; D-E; D-F; D-G; D-H; D-I; E-F; E-G; E-H; E-I; F-G; F-H; F-I; G-H; G-I.

Indeed, this suffices to discover every isoform except one in which the genomic exon order is not preserved, for example: ADEHG

The acyclic data structure can then be "spiked" with an artificial, or fictitious, G, called G'. This isoform can then be represented by A-D-E-H-G', and the structure need not include any edge from H to G. However, it may be found that exon scrambling can be ignored and that genomic exon order is preserved by all included edges.
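The construction of a maximal DAG, and the use of a fictitious exon G', can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that every earlier exon is connected to every later one, including adjacent pairs such as A-B.

```python
from itertools import combinations

def maximal_dag_edges(exons):
    """Connect each exon to every exon that follows it in genomic order."""
    return list(combinations(exons, 2))

def is_path(path, edges):
    """True if each consecutive pair of nodes in `path` is joined by an edge."""
    edge_set = set(edges)
    return all(pair in edge_set for pair in zip(path, path[1:]))

exons = list("ABCDEFGHI")
edges = maximal_dag_edges(exons)

# A fictitious copy of G placed after I keeps the graph acyclic.
spiked_edges = maximal_dag_edges(exons + ["G'"])

print(len(edges))
print(is_path(list("ADEH"), edges))
print(is_path(list("ADEHG"), edges))
print(is_path(["A", "D", "E", "H", "G'"], spiked_edges))
```

Because edges only ever run from an earlier to a later position in the list, the structure cannot contain a cycle; the scrambled isoform ADEHG is not a path in the original graph but becomes one, as A-D-E-H-G', once G' is appended.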

A DAG of the invention may first be prepared by using input data from an annotated transcriptome. As such, ab initio, the DAG will include all known isoforms. If a maximal DAG is then made, the DAG will include all possible edges that preserve genomic exon order, and will thus include nodes and edges that represent not only known isoforms but also novel ones. The nodes and edges are stored in memory as a directed acyclic data structure, and the method includes aligning paired-end sequence reads to a path comprising a subset of the features connected by a subset of the edges, thereby identifying the isoform corresponding to the path.

When new data (e.g., paired-end RNA-Seq reads) are aligned to the DAG, those data may include exons that had not been identified in the annotated reference. Because the exon is novel in this example, so is the isoform. Use of a DAG as a reference is well suited to this situation, because aligning a novel isoform to an existing DAG can also guide the creation of a new node in the DAG to represent the novel exon. Thus, where an identified isoform is novel, aligning the transcript sequence can involve creating a new node in the directed acyclic data structure. A new node can also be created for splice variants, e.g., where one exon differs from another by a single codon.

Preferably, the maximal DAG is initially built, in whole or in part, prior to and independent of analyzing the paired-end reads. Thus, the method may include creating the edge for each pair of the plurality of features before aligning the transcript. For reasons related to the nature of directed acyclic data structures, the reference represented in the data structure can be arbitrarily large, e.g., hundreds or thousands of exons, introns, or both.

10. Systems of the invention

Figure 10 illustrates a computer system 401 useful for implementing the methods described herein. A system of the invention may include any one or any number of the components shown in Figure 10. Generally, system 401 may include a computer 433 and a server computer 409 capable of communicating with one another over a network 415. Additionally, data may optionally be obtained from a database 405 (e.g., local or remote). In some embodiments, the system includes an instrument 455 for obtaining sequencing data, which may be coupled to a sequencer computer 451 for initial processing of sequence reads.

In some embodiments, methods are performed by parallel processing, and server 409 includes a plurality of processors with a parallel architecture, i.e., a distributed network of processors and storage capable of comparing a plurality of sequences (e.g., RNA-Seq reads) to a reference sequence construct (e.g., a DAG). The system includes a plurality of processors that simultaneously execute a plurality of comparisons between a plurality of reads and the reference sequence construct. While other hybrid configurations are possible, the main memory in a parallel computer is typically either shared among all processing elements in a single address space, or distributed, i.e., each processing element has its own local address space. (Distributed memory refers to the fact that the memory is logically distributed, but often implies that it is physically distributed as well.) Distributed shared memory and memory virtualization combine the two approaches, where the processing element has its own local memory and access to the memory on non-local processors. Accesses to local memory are typically faster than accesses to non-local memory.

Computer architectures in which each element of main memory can be accessed with equal latency and bandwidth are known as uniform memory access (UMA) systems. Typically, UMA can be achieved only by a shared memory system, in which the memory is not physically distributed. A system that does not have this property is known as a non-uniform memory access (NUMA) architecture. Distributed memory systems have non-uniform memory access.

Processor-processor and processor-memory communication can be implemented in hardware in several ways, including via shared (either multiported or multiplexed) memory, a crossbar switch, a shared bus, or an interconnect network of a myriad of topologies including star, ring, tree, hypercube, fat hypercube (a hypercube with more than one processor at a node), or n-dimensional mesh.

Parallel computers based on interconnect networks must incorporate routing to enable the passing of messages between nodes that are not directly connected. The medium used for communication between the processors is likely to be hierarchical in large multiprocessor machines. Such resources are commercially available for purchase for dedicated use, or can be accessed via "the cloud," e.g., Amazon Cloud Computing.

A computer generally includes a processor coupled to memory and an input-output (I/O) mechanism via a bus. Memory can include RAM or ROM and preferably includes at least one tangible, non-transitory medium storing instructions executable to cause the system to perform the functions described herein. As one skilled in the art would recognize as necessary or best suited for performance of the methods of the invention, systems of the invention include one or more processors (e.g., a central processing unit (CPU), a graphics processing unit (GPU), etc.), computer-readable storage devices (e.g., main memory, static memory, etc.), or combinations thereof, which communicate with each other via a bus.

A processor may be any suitable processor known in the art, such as the processor sold under the trademark XEON E7 by Intel (Santa Clara, CA) or the processor sold under the trademark OPTERON 6200 by AMD (Sunnyvale, CA).

Input/output devices according to the invention may include a video display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT) monitor), an alphanumeric input device (e.g., a keyboard), a cursor control device (e.g., a mouse or trackpad), a disk drive unit, a signal generation device (e.g., a speaker), a touchscreen, an accelerometer, a microphone, a cellular radio frequency antenna, and a network interface device, which can be, for example, a network interface card (NIC), Wi-Fi card, or cellular modem.

Incorporation by reference

References and citations to other documents, such as patents, patent applications, patent publications, journals, books, papers, and web contents, have been made throughout this disclosure. All such documents are hereby incorporated herein by reference in their entirety for all purposes.

Equivalents

Various modifications of the invention and many further embodiments thereof, in addition to those shown and described herein, will become apparent to those skilled in the art from the full contents of this document, including references to the scientific and patent literature cited herein. The subject matter herein contains important information, exemplification, and guidance that can be adapted to the practice of this invention in its various embodiments and equivalents thereof.

Claims

1. A method of analyzing a transcriptome, the method comprising:

obtaining a pair of paired-end reads from a transcriptome;

finding an alignment with an optimal score between a first read of the pair and a node in a directed acyclic data structure, the data structure comprising nodes representing RNA sequences and edges connecting pairs of the nodes;

identifying candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads; and

aligning the paired-end reads to the candidate paths to determine an optimal-scoring alignment.

2. The method of claim 1, further comprising obtaining a plurality of paired-end reads from the transcriptome.

3. The method of claim 2, further comprising determining an optimal-scoring alignment for each pair of the plurality of paired-end reads.

4. The method of claim 3, wherein the plurality of paired-end reads is obtained from cDNA fragments from the transcriptome.

5. The method of claim 1, further comprising identifying an isoform based on the determined optimal-scoring alignment.

6. The method of claim 5, wherein the identified isoform is a novel isoform.

7. The method of claim 6, further comprising updating the directed acyclic data structure to represent the novel isoform.

8. The method of claim 7, wherein updating the directed acyclic data structure comprises adding at least one new node to represent the novel isoform.

9. The method of claim 8, wherein the node and the downstream node represent a pair of exons, and the pair of paired-end reads was obtained from the pair of exons.

10. The method of claim 1, further comprising excluding any path that is not a candidate path from any alignment calculation that includes the pair of paired-end reads.

11. The method of claim 1, further comprising:

aligning a plurality of pairs of paired-end reads to the directed acyclic data structure;

determining distances between the aligned pairs of the plurality of pairs of paired-end reads and frequencies of the distances between the aligned pairs; and

determining a set of isoform paths and isoform frequencies such that representing the isoform paths through the structure at the isoform frequencies results in pairs of features being included in the isoform paths at the frequencies of the distances between the aligned pairs.

12. The method of claim 11, wherein the features comprise exons.

13. The method of claim 12, wherein each path of the determined set of isoform paths represents a transcript isoform present in an organism.

14. The method of claim 11, wherein the distance between an aligned pair comprises the path length from an upstream end of a first aligned member of the pair to a downstream end of a second aligned member of the pair.

15. A system for transcriptomics, the system comprising:

a computer system comprising a processor coupled to a non-transitory memory having a directed acyclic data structure stored therein, the data structure comprising nodes representing RNA sequences and edges connecting pairs of the nodes, wherein the system is operable to:

obtain a pair of paired-end reads from cDNA fragments;

find an alignment with an optimal score between a first read of the pair and a node in the data structure;

identify candidate paths that include the node connected to a downstream node by a path having a length substantially similar to an insert length of the pair of paired-end reads; and

16. systems according to claim 15, further operable to obtain multiple both-end reading from cDNA fragment.

17. systems according to claim 16, further operable to identify the isomer of novelty based on scoring comparison optimal determined by described.

18. systems according to claim 17, further operable to update described oriented acyclic data structure to represent the isomer of described novelty.

19. systems according to claim 18, wherein update described oriented acyclic data structure and include adding at least one new node with the isomer representing described novelty.

20. The system according to claim 15, further operable to exclude any path that is not a candidate path from any alignment calculation that includes the pair of paired-end reads.

21. A method of inferring isoform frequency in a transcriptome, the method comprising:

aligning a plurality of pairs of paired-end reads from cDNA to a reference data structure, the reference data structure comprising a plurality of features from a genome, each represented as a node, wherein pairs of the plurality of features are connected at their proximal ends by an edge;

determining distances between the aligned pairs of the paired-end reads and the frequencies of the distances between the aligned pairs; and

determining a set of isoform paths and isoform frequencies such that representing the isoform paths through the structure at the isoform frequencies results in pairs of features being included in the isoform paths at the frequencies of the distances between the aligned pairs.

22. The method according to claim 21, wherein the cDNA represents substantially the entire transcriptome of an organism, and the features comprise exons.

23. The method according to claim 22, wherein each path in the determined set of isoform paths represents a transcript isoform present in the organism.

24. The method according to claim 23, wherein determining the set of isoform paths comprises discovering a novel isoform path.

25. The method according to claim 24, further comprising updating the reference data structure to include the novel isoform path.

26. The method according to claim 21, wherein the distances between the aligned pairs comprise the path from the upstream end of the first aligned member of each pair to the downstream end of the second aligned member of that pair.

27. The method according to claim 21, wherein determining the set of isoform paths comprises solving a set of linear equations.

28. The method according to claim 27, wherein determining the set of isoform paths comprises obtaining the isoform frequencies from biological data before solving the set of linear equations.

29. The method according to claim 21, further comprising normalizing the frequencies of the distances between the aligned pairs before determining the set of isoform paths and isoform frequencies.

30. The method according to claim 21, wherein the data structure comprises a node representing each annotated feature in a gene feature database.

31. A system for inferring isoform frequency in a transcriptome, the system comprising:

a computer system comprising a processor coupled to a non-transitory memory having stored therein a reference data structure, the reference data structure comprising a plurality of features from a genome, each represented as a node, wherein pairs of the plurality of features are connected at their proximal ends by an edge, wherein the system is operable to:

determine distances between pairs of a plurality of paired-end reads from cDNA and the frequencies of the distances between the pairs; and

determine a set of isoform paths and isoform frequencies such that representing the isoform paths through the structure at the isoform frequencies results in pairs of features being included in the isoform paths at the frequencies of the distances between the pairs.

32. The system according to claim 31, wherein the cDNA represents substantially the entire transcriptome of an organism, the features comprise exons, and the system is further operable to align the plurality of pairs of paired-end reads from cDNA to the reference data structure.

33. The system according to claim 32, wherein each path in the determined set of isoform paths represents a transcript isoform present in the organism.

34. The system according to claim 33, wherein determining the set of isoform paths comprises discovering a novel isoform path.

35. The system according to claim 34, further operable to update the reference data structure to include the novel isoform path.

36. The system according to claim 31, wherein the distances between the aligned pairs comprise the path from the upstream end of the first aligned member of each pair to the downstream end of the second aligned member of that pair.

37. The system according to claim 31, wherein determining the set of isoform paths comprises solving a set of linear equations.

38. The system according to claim 37, wherein determining the set of isoform paths comprises obtaining the isoform frequencies from biological data before solving the set of linear equations.

39. The system according to claim 31, further operable to normalize the frequencies of the distances between the aligned pairs before determining the set of isoform paths and isoform frequencies.

40. The system according to claim 31, wherein the data structure comprises a node representing each annotated feature in a remote gene feature database.
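
By way of non-limiting illustration only, the linear-equation approach recited in claims 27 and 37, together with the normalization recited in claims 29 and 39, might be sketched in Python as follows. The sketch assumes a toy model that is not taken from the claims: each observed pair of features contributes one equation stating that its normalized frequency equals the sum of the frequencies of the candidate isoforms containing both features, and the resulting system is solved by least squares through the normal equations. The names `solve_square`, `isoform_frequencies`, `pair_counts`, and `n_sampled`, as well as the exon labels and counts, are hypothetical.

```python
from fractions import Fraction


def solve_square(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    # Augmented matrix [a | b]; entries are expected to be Fractions.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system: more feature pairs are needed")
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def isoform_frequencies(isoforms, pair_counts, n_sampled):
    """Estimate one frequency per candidate isoform (toy model, see lead-in).

    isoforms    -- list of tuples of feature labels (hypothetical paths)
    pair_counts -- {(feature_a, feature_b): observed count}
    n_sampled   -- normalizing constant (assumed number of sampled fragments)
    """
    pairs = sorted(pair_counts)
    # a[p][i] is 1 when isoform i contains both features of pair p.
    a = [[Fraction(int(x in iso and y in iso)) for iso in isoforms]
         for (x, y) in pairs]
    # Normalize raw counts to frequencies before solving.
    b = [Fraction(pair_counts[p], n_sampled) for p in pairs]
    k = len(isoforms)
    # Normal equations (A^T A) f = A^T b give the least-squares solution.
    ata = [[sum(row[i] * row[j] for row in a) for j in range(k)]
           for i in range(k)]
    atb = [sum(row[i] * rhs for row, rhs in zip(a, b)) for i in range(k)]
    return solve_square(ata, atb)


# Invented example: three candidate isoform paths over exons A, B, C, D.
candidate_isoforms = [("A", "B", "D"), ("A", "C", "D"), ("A", "D")]
observed_counts = {("A", "B"): 50, ("A", "C"): 30, ("A", "D"): 100,
                   ("B", "D"): 50, ("C", "D"): 30}
frequencies = isoform_frequencies(candidate_isoforms, observed_counts, 100)
```

With these invented counts, the three candidate isoforms receive frequencies of 1/2, 3/10, and 1/5. Exact fractions are used only to keep the toy arithmetic free of rounding; a practical embodiment would more likely use a numerical solver and would need to handle noisy or inconsistent counts.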