CN105069325A

CN105069325A - Method for matching nucleic acid sequence information

Info

Publication number: CN105069325A
Application number: CN201510482636.9A
Authority: CN
Inventors: 盛司潼
Original assignee: 盛司潼
Current assignee: Shenzhen Malt Accelerator Technology Co., Ltd.
Priority date: 2012-07-28
Filing date: 2012-07-28
Publication date: 2015-11-18
Anticipated expiration: 2032-07-28
Also published as: CN102841988B; CN102841988A; CN105069325B

Abstract

The present invention relates to the field of information processing, and provides a method for matching nucleic acid sequence information. The method comprises the following steps: A, performing Burrows-Wheeler transform (BWT) on reference sequences in a database to obtain matched reference sequences, and storing the matched reference sequences in the database; B, performing interval marking on the matched reference sequences in the database; and C, performing consistency matching between nucleic acid sequence fragments and the matched reference sequences in the database sequentially and respectively to obtain matched nucleic acid sequences. By adopting the method for matching the nucleic acid sequence information, fast matching between the nucleic acid sequence information and the reference sequences can be realized.

Description

A method for matching nucleic acid sequence information

This application is a divisional of the application filed on July 28, 2012, with application number 201210263634.7, entitled "A system and method for matching nucleic acid sequence information".

Technical field

The present invention relates to the field of information processing and, more particularly, to a system and method for matching nucleic acid sequence information.

Background art

American scientists proposed the Human Genome Project in 1985. Through the joint efforts of scientists from the United States, the United Kingdom, France, Germany, Japan, and China, the "working draft" of the human genome was completed in 2000, and the human genome map and preliminary analysis results were published in 2001. The project's scope also included establishing computer analysis and management systems (that is, processing sequencing results with a computerized analysis system to obtain nucleic acid sequence information) and examining the related ethical, legal, and social issues. After the human genome map was published, researchers in China and abroad began actively mapping the genomes of other species. By comparing nucleic acid sequence information against an existing genome map (the reference sequence), and by analyzing gene expression profiles, gene mutations, and the like with techniques such as transcriptomics and proteomics, information on disease-related genes can be obtained. Matching and analyzing nucleic acid sequence information against genome maps to uncover the root causes of disease has become a topic of great interest in biochemistry and medicine, and gene sequencing technology is developing rapidly worldwide. Obtaining genetic information quickly and accurately from the vast volume of sequencing data, however, has become the bottleneck in the development of current gene sequencing technology.

A system for matching nucleic acid sequence information uses a computer to match the nucleic acid sequence fragments obtained by sequencing against a known reference sequence, that is, to compare them one by one, and carries out subsequent analysis according to the matching results. A method for matching nucleic acid sequence information is the process of matching carried out on such a system.

In one prior-art method for matching nucleic acid sequence information, the method comprises the steps of: A, according to the number n of allowed mismatches, dividing each nucleic acid sequence fragment into at least n+1 short fragments that participate in matching, to obtain a database of short fragments; B, building and storing reference sequence indexes according to the length of the short fragments that participate in matching, to obtain a database; C, matching the short fragments obtained from each nucleic acid sequence fragment one by one in the database, to obtain matching results. Because the reference sequence indexes are of equal length, probability dictates that multiple identical reference sequence indexes exist. In this scheme, every short fragment is matched against the reference sequence indexes in turn, so a short fragment must be matched against all reference sequence indexes (including each of the identical ones separately), which greatly reduces processing speed. Both the reference sequence and the nucleic acid sequences must also be segmented, which adds further workload and slows processing even more. In addition, the reference sequence indexes built from the reference sequence and the short fragments built from the nucleic acid sequences generate a large amount of information, which increases the storage space required by the information processing apparatus.

A new system and method for matching nucleic acid sequence information is therefore needed, one that can match nucleic acid sequences against reference sequences quickly.

Summary of the invention

The object of the present invention is to provide a system and method for matching nucleic acid sequence information, intended to solve the prior-art problem that matching nucleic acid sequence information against a reference sequence is slow.

To achieve this object, a system for matching nucleic acid sequence information comprises a database, a reference sequence transformation unit, a marking unit, and a matching unit. The database stores reference sequences. The reference sequence transformation unit performs a Burrows-Wheeler transform (BWT) on the reference sequences in the database to obtain matched reference sequences. The marking unit performs interval marking on the matched reference sequences in the database. The matching unit performs consistency matching of nucleic acid sequence fragments, in turn, against the matched reference sequences in the database to obtain matched nucleic acid sequences.

Consistency matching covers both the case where mismatches are allowed and the case where they are not. When N mismatches are allowed, a nucleic acid sequence fragment that differs from the matched reference sequence in the database at no more than N bases is considered a consistent match. When no mismatches are allowed, only a fragment that is identical to the matched reference sequence in the database is considered a consistent match. N is a positive integer.
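The definition above can be illustrated with a minimal sketch. The function name `is_consistent_match` and the idea of comparing a fragment with an equally long window of the reference are assumptions made for illustration; the patent itself performs the comparison through the BWT database rather than by direct window comparison.

```python
def is_consistent_match(fragment: str, window: str, max_mismatches: int = 0) -> bool:
    """Return True if `fragment` differs from `window` at no more than
    `max_mismatches` base positions (hypothetical helper, not from the patent)."""
    if len(fragment) != len(window):
        return False
    # Count the positions at which the two sequences carry different bases.
    mismatches = sum(a != b for a, b in zip(fragment, window))
    return mismatches <= max_mismatches


print(is_consistent_match("ACC", "ACT", max_mismatches=1))
print(is_consistent_match("ACC", "ACT", max_mismatches=0))
```

With one mismatch allowed, ACC is a consistent match for ACT; with none allowed, it is not.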

The reference sequence transformation unit comprises a reference sequence matrix module and a BWT matrix module. The reference sequence matrix module adds an identifier to the end or the front of a reference sequence in the database and cyclically shifts the reference sequence to obtain a reference sequence matrix. The BWT matrix module sorts the reference sequence matrix in lexicographic order to obtain a BWT reference sequence matrix. The reference sequence transformation unit may further comprise a matched reference sequence module, which takes the first and last columns of the BWT reference sequence matrix to obtain the matched reference sequence and stores it in the database.

The marking unit performs interval marking on the matched reference sequences in the database according to an arithmetic progression.

Further, the marking unit applies a second arithmetic progression within each interval of the first arithmetic progression to mark the matched reference sequences in the database more finely.

In any of the above technical solutions, the matching unit forms the reverse complement of a nucleic acid sequence fragment and performs consistency matching of the reverse-complement fragment against the matched reference sequences in the database to obtain matched nucleic acid sequences.

The matching unit uses backtracking to substitute bases, one position at a time, at the positions preceding the position where the reverse-complement fragment fails to match, and continues matching in the database from the substituted position.

In any of the above technical solutions, the system for matching nucleic acid sequence information further comprises an information receiving unit, which obtains nucleic acid sequence fragments and reference sequences through a USB interface, a disk drive interface, or the Internet.

To better realize the present invention, the present invention also provides a method for matching nucleic acid sequence information.

The method comprises the steps of: A, performing a BWT on the reference sequences in a database to obtain matched reference sequences, and storing the matched reference sequences in the database; B, performing interval marking on the matched reference sequences in the database; C, performing consistency matching of nucleic acid sequence fragments, in turn, against the matched reference sequences in the database to obtain matched nucleic acid sequences. The database stores the reference sequences, and steps A and B each operate on the reference sequences in the database.

Step A comprises: A1, adding an identifier to the end or the front of a reference sequence in the database and cyclically shifting the reference sequence to obtain a reference sequence matrix; A2, sorting the reference sequence matrix in lexicographic order to obtain a BWT reference sequence matrix, and storing it in the database. Step A2 may be followed by step A3: taking the first and last columns of the BWT reference sequence matrix to obtain the matched reference sequence, and storing it in the database.

In step B, interval marking is performed on the matched reference sequences in the database according to an arithmetic progression.

In step B, a second arithmetic progression is applied within each interval of the first arithmetic progression to mark the matched reference sequences in the database more finely.

In any of the above technical solutions, step C consists of forming the reverse complement of a nucleic acid sequence fragment and then performing consistency matching of the reverse-complement fragment against the matched reference sequences in the database to obtain matched nucleic acid sequences.

In step C, when mismatches are allowed, backtracking is used to substitute bases, one position at a time, at the positions preceding the position where the reverse-complement fragment fails to match, and matching continues in the database from the substituted position.
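The mismatch-tolerant matching in step C can be sketched as a backtracking search over a BWT. This is a simplified illustration under stated assumptions, not the patent's exact procedure: it tries a substitution at every position (rather than only at positions before the failing one), matches the fragment as given rather than its reverse complement, and the names `search_with_mismatches`, `first_row`, and `extend` are hypothetical.

```python
def search_with_mismatches(reference: str, fragment: str, max_mismatches: int) -> list[int]:
    """Return 0-based start positions where `fragment` occurs in `reference`
    with at most `max_mismatches` substituted bases (illustrative sketch)."""
    text = reference + "$"                      # "$" marks the end, sorts before A, C, G, T
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:])   # sorted suffix start positions
    bwt = "".join(text[i - 1] for i in order)          # last column of the sorted matrix
    alphabet = sorted(set(reference))
    # first_row[c]: index of the first sorted row that starts with base c.
    first_row = {c: sum(ch < c for ch in text) for c in alphabet}
    hits: set[int] = set()

    def extend(i: int, lo: int, hi: int, budget: int) -> None:
        if i < 0:                               # whole fragment consumed: record positions
            hits.update(order[lo:hi])
            return
        for c in alphabet:                      # try the true base and each substitution
            cost = 0 if c == fragment[i] else 1
            if cost > budget:
                continue
            new_lo = first_row[c] + bwt.count(c, 0, lo)
            new_hi = first_row[c] + bwt.count(c, 0, hi)
            if new_lo < new_hi:                 # range still non-empty: keep going
                extend(i - 1, new_lo, new_hi, budget - cost)

    extend(len(fragment) - 1, 0, n, max_mismatches)
    return sorted(hits)


print(search_with_mismatches("ACCACCTG", "ACG", 1))
```

When a branch yields an empty row range, the recursion returns and the next candidate base is tried, which is the backtracking behaviour the text describes.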

As can be seen from the above, in the present invention a nucleic acid sequence fragment is matched directly in the database without being segmented. Moreover, a fragment need not be matched one by one against every identical matched reference sequence; a single match covers all identical sequences, which improves overall processing speed. In addition, no reference sequence index needs to be built for the reference sequences in the database, and the matched reference sequences in the database need not be marked one by one, which greatly reduces the storage space the system requires.

Brief description of the drawings

Fig. 1 is a structural diagram of the system for matching nucleic acid sequence information in one embodiment of the invention.

Fig. 2 is a structural diagram of the system for matching nucleic acid sequence information in another embodiment of the invention.

Fig. 3 is a structural diagram of the reference sequence transformation unit in one embodiment of the invention.

Fig. 4 is a structural diagram of the reference sequence transformation unit in another embodiment of the invention.

Fig. 5 is a flowchart of the method for matching nucleic acid sequence fragments in one embodiment of the invention.

Fig. 6 is a structural diagram of the system for matching nucleic acid sequence information in another embodiment of the invention.

Fig. 7 is a flowchart of the method for transforming the reference sequence in one embodiment of the invention.

Fig. 8 is a flowchart of the method for matching nucleic acid sequence fragments in one embodiment of the invention.

Fig. 9 is a schematic diagram of consistency matching of a forward nucleic acid sequence fragment in one embodiment of the invention.

Fig. 10 is a schematic diagram of consistency matching of a reversed nucleic acid sequence fragment in one embodiment of the invention.

Fig. 11 is a schematic diagram of matching nucleic acid sequence fragments in one embodiment of the invention.

Fig. 12 is a schematic diagram of matching nucleic acid sequence fragments in one embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments.

For convenience in describing the technical solution, the nucleic acid sequence fragments and reference sequences in the following embodiments are short base sequences; they do not represent real nucleic acid sequence fragments or reference sequences. In general, nucleic acid sequence fragments are 20 bp or longer and reference sequences are 2000 bp or longer, although fragments shorter than 20 bp and reference sequences shorter than 2000 bp also occur.

The nucleic acid sequence fragments of the present invention are generally obtained by sequencing a given species; they can also be obtained by artificial synthesis, that is, as artificial sequences. The reference sequence is a known nucleic acid sequence that serves as the template for matching. By matching nucleic acid sequence fragments against the reference sequence, information such as whether the sequencing is accurate can be obtained from the matching results. Note that the nucleic acid sequence fragments in the present invention are not specifically limited and may be sequence fragments composed of the bases A, G, C, T or A, G, C, U, for example ATTACGTTA or UUCCUCAAGGU.

The present invention proposes a first embodiment. As shown in Fig. 1, the system for matching nucleic acid sequence information comprises a database, a reference sequence transformation unit, a marking unit, and a matching unit, each described in detail below.

(1) Database 1, for storing reference sequences.

The reference sequences stored in the database may be stored inside the system or outside it. A reference sequence is a base sequence, that is, nucleic acid sequence information. The reference sequence and the nucleic acid sequence fragments are nucleic acid sequence information of the same species. For example, if the nucleic acid sequence fragments are obtained by sequencing the nucleic acid of a paramecium, the corresponding reference sequence is the nucleic acid sequence information of the paramecium. The reference sequence and the nucleic acid sequence fragments may also be artificial sequences. Neither is specifically limited, except that the reference sequence is a known base sequence.

(2) Reference sequence transformation unit 2, for performing a BWT on the reference sequences in the database to obtain matched reference sequences.

The BWT is a transform method that Michael Burrows developed from an idea proposed by David Wheeler, refined, and successfully applied to practical data compression; it is a research focus in the field of lossless compression. The BWT is a reversible data transformation that operates on blocks of data.

After the reference sequence transformation unit performs the BWT on the reference sequences in the database, the resulting matched reference sequences are stored in the database automatically.

(3) Marking unit 3, for performing interval marking on the matched reference sequences in the database.

The manner of interval marking is not limited; an arithmetic progression or another sequence of numbers may be used to place marks at regular intervals. The data type used for the marks can be chosen as needed, for example Int or Byte.

(4) Matching unit 4, for performing consistency matching of nucleic acid sequence fragments, in turn, against the matched reference sequences in the database to obtain matched nucleic acid sequences.

The nucleic acid sequence fragments are stored inside the system or in storage outside the system. A whole fragment is matched directly against the matched reference sequences in the database, or is matched from its head and its tail simultaneously. Consistency matching means that, when N mismatches are allowed, a whole fragment with at most N bases that fail to match the matched reference sequence is considered matched and yields a matched nucleic acid sequence fragment; otherwise the fragment is considered unmatched and is discarded. All other fragments are matched in the database in the same way, yielding the matched nucleic acid sequences. The matched nucleic acid sequences can be output in readable form or stored in the system. When output, the information may include the start and end positions of each fragment on the reference sequence, the positions of its mismatches, and the number of mismatches.

In this embodiment, the system for matching nucleic acid sequence information may comprise a computer and a program on the computer for matching nucleic acid sequence information. During matching, the reference sequence transformation unit first performs the BWT on the reference sequences in the database; the marking unit then performs interval marking on the transformed reference sequences in the database; finally, the matching unit performs consistency matching of the nucleic acid sequence fragments in the database in turn. In this embodiment, whole fragments are matched directly in the database and identical matched reference sequences are matched only once, which improves matching efficiency. Moreover, the reference sequences stored in the database do not need to be segmented to build reference sequence indexes (if the index length is K, then in two adjacent reference sequence indexes the last K-1 bases of the former are identical to the first K-1 bases of the latter), and only interval marks are stored, so storage space is greatly reduced relative to the prior art.

Based on the first embodiment, the present invention proposes a second embodiment, in which the system for matching nucleic acid sequence information comprises a computer and, on it, a program for matching nucleic acid sequence information; the computer may also contain a program for controlling sequencers. A specific description follows, as shown in Fig. 2. The computer is connected to multiple sequencers, receives the sequencing data they produce, and processes that data to obtain nucleic acid sequence fragments. The nucleic acid sequence fragments may be obtained by processing the sequencing data of any commercially available sequencer. Preferably, they are obtained by processing sequencing data produced by Pstar series sequencers, MiSeq series sequencers, GS Junior/Senior sequencers, or SOLID sequencers. More preferably, they are obtained by processing sequencing data produced by Pstar series sequencers. The computer is any commercially available information processing apparatus with information processing and data storage functions.

Note that the nucleic acid sequence fragments in the computer may be obtained by receiving and processing sequencing data from a sequencer, or may be fragments stored directly in the computer or received by the computer from outside; the source of the nucleic acid sequence fragments is not specifically limited.

The reference sequence transformation unit of the above embodiment is described in further detail below. As shown in Fig. 3, the reference sequence transformation unit comprises a reference sequence matrix module and a BWT matrix module, each explained below.

(1) Reference sequence matrix module 21, for adding an identifier to the end or the front of a reference sequence in the database and cyclically shifting the reference sequence to obtain a reference sequence matrix.

An example is given below to make the operation of the reference sequence matrix module easier to understand. Reference sequences are generally long, typically from several thousand to several hundred million bases or more; the example here is only an aid to understanding and is not a real reference sequence. Suppose the reference sequence is ACCACCTG. First, a marker is added at the front or the end of the reference sequence. The marker symbol is not specifically limited; its purpose is to distinguish the two ends of the reference sequence. In this example the marker $ is appended at the end, giving ACCACCTG$. The reference sequence is then cyclically shifted to obtain the reference sequence matrix, as shown in the following table.

Table 1

Table 1 shows the reference sequence matrix obtained when the reference sequence matrix module processes the reference sequence ACCACCTG. Here A, G, C, and T are the corresponding nucleic acid bases.
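The construction of the reference sequence matrix can be reproduced with a short sketch. The function name `rotations` is hypothetical, and the direction of the cyclic shift (left rotation by one base per row) is an assumption, since the table itself is not reproduced in the text.

```python
def rotations(reference: str, marker: str = "$") -> list[str]:
    """Append the end marker and return every cyclic shift of the result,
    one per row (assumed here to be left rotations)."""
    text = reference + marker
    return [text[i:] + text[:i] for i in range(len(text))]


for row in rotations("ACCACCTG"):
    print(row)
```

For ACCACCTG this yields nine rows, starting with ACCACCTG$ and CCACCTG$A.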

(2) BWT matrix module 22, for sorting the reference sequence matrix in lexicographic order to obtain the BWT reference sequence matrix.

To make the explanation easier to follow, take the reference sequence matrix in Table 1 and assume that the marker $ sorts before A, G, C, and T. Sorting the reference sequence matrix in lexicographic order gives the BWT reference sequence matrix shown in Table 2.

Table 2

The BWT reference sequence matrix is stored in the database. Lexicographic order means sorting in dictionary order: A, B, C, ..., Z.
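The sorting step can be sketched as follows. The helper name `bwt_matrix` is hypothetical; the sketch relies on the fact that in ASCII the character $ already sorts before A, C, G, and T, which matches the assumption stated above.

```python
def bwt_matrix(reference: str, marker: str = "$") -> list[str]:
    """Return the cyclic shifts of reference + marker in lexicographic order."""
    text = reference + marker
    return sorted(text[i:] + text[:i] for i in range(len(text)))


matrix = bwt_matrix("ACCACCTG")
for row in matrix:
    print(row)

# The last column of the sorted matrix is the Burrows-Wheeler transform.
print("".join(row[-1] for row in matrix))
```

For ACCACCTG the sorted rows begin with $ACCACCTG, ACCACCTG$, and ACCTG$ACC, and the last column reads G$CCAACTC.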

In the above technical solution, a nucleic acid sequence fragment needs to be matched only once against identical stretches of the reference sequence. The reason is that, after the BWT, rows of the BWT reference sequence matrix that begin with the same bases are adjacent, so neighbouring rows expose their longest common prefix. During matching, if a fragment of length m matches row r, then every row whose longest common prefix with row r is at least m also matches. Only the longest common prefix needs to be determined; the fragment does not have to be matched again, so a single match suffices. For example, take the fragment ACC of length 3. The second and third rows of the BWT reference sequence matrix have a longest common prefix of length 3, namely ACC, so comparing against the common prefix completes the matching of the fragment in the database. This greatly improves the efficiency of fragment matching.
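The adjacency of rows that share a prefix can be checked with a small sketch. It uses binary search over the sorted rows, which is an illustrative choice and not necessarily how the patent locates the rows; `rows_with_prefix` is a hypothetical name, and row numbers are reported 1-based to match the text.

```python
from bisect import bisect_left, bisect_right


def rows_with_prefix(reference: str, fragment: str, marker: str = "$") -> list[int]:
    """Return the 1-based rows of the sorted matrix that start with `fragment`."""
    text = reference + marker
    matrix = sorted(text[i:] + text[:i] for i in range(len(text)))
    # Truncating sorted rows to the fragment length keeps them in sorted order,
    # so all rows equal to the fragment form one contiguous block.
    prefixes = [row[:len(fragment)] for row in matrix]
    first = bisect_left(prefixes, fragment)
    last = bisect_right(prefixes, fragment)
    return list(range(first + 1, last + 1))


print(rows_with_prefix("ACCACCTG", "ACC"))
```

For the example reference, the rows starting with ACC are rows 2 and 3, in agreement with the description.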

As shown in Fig. 4, the reference sequence transformation unit may further comprise a matched reference sequence module, described in detail below. Matched reference sequence module 23 takes the first and last columns of the BWT reference sequence matrix to obtain the matched reference sequence and stores it in the database.

To save storage space, an auxiliary matrix is used. Taking the matrix in Table 2 as an example, the resulting matched reference sequence is shown in Table 3.

Table 3

The database corresponding to Table 3 is more compact, which greatly reduces the storage space the database requires.
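Keeping only the first and last columns can be sketched as below. The function name `first_and_last_columns` is hypothetical, and the exact layout of Table 3 is not reproduced in the text, so this shows only the two columns themselves.

```python
def first_and_last_columns(reference: str, marker: str = "$") -> tuple[str, str]:
    """Return the first and last columns of the sorted rotation matrix."""
    text = reference + marker
    matrix = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in matrix)
    last = "".join(row[-1] for row in matrix)
    return first, last


first, last = first_and_last_columns("ACCACCTG")
print(first)
print(last)
```

Storing two columns of length n instead of the full n-by-n matrix is what makes the database compact.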

To search more efficiently, a further auxiliary matrix can be used. Continuing with the matrix in Table 2, the matched reference sequence information stored in the database is shown in Table 4.

Table 4

In the table, the third column gives the position in the reference sequence of each base in the first column, and the fourth column is the first column of the reference sequence matrix. In subsequent fragment matching, the position of a nucleic acid sequence fragment in the reference sequence can therefore be read off directly, which makes the database easier to use and improves the efficiency of subsequent matching.
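The position column can be sketched as follows. The helper name `first_column_positions` is hypothetical, and positions are reported 1-based on the assumption that the table counts from 1; the table itself is not reproduced in the text.

```python
def first_column_positions(reference: str, marker: str = "$") -> list[tuple[str, int]]:
    """For each row of the sorted matrix, return its first base together with
    that base's 1-based position in reference + marker."""
    text = reference + marker
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    return [(text[i], i + 1) for i in order]


for base, position in first_column_positions("ACCACCTG"):
    print(base, position)
```

The two rows that begin with A carry positions 1 and 4, which are exactly the places where ACC starts in ACCACCTG.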

The reference sequence transformation unit in the second embodiment processes the database so that it is easier to use and fragment matching is more efficient, and it saves considerable storage space relative to prior-art databases. Overall, this technical solution overcomes both the slow sequence matching and the large storage requirement of conventional techniques.

The marking unit in the second embodiment is described in detail below. Marking unit 3 performs interval marking on the matched reference sequences in the database according to an arithmetic progression. Based on the above scheme, the reference sequence or matched reference sequence is marked so that the position of a nucleic acid sequence fragment can be determined during matching. The marking scheme of the marking unit is described below.

Table 5

The marks in Table 5 follow an arithmetic progression. The common difference of the progression is not limited; in this embodiment it is 256. Because marks are placed only at intervals, the storage space occupied by the database is greatly reduced. In addition, when the reference sequence or matched reference sequence is long, the Int type is preferred for the marks (it can mark reference sequences up to 2³¹ in length); when the reference sequence or matched reference sequence is short, the Byte type is preferred. Compared with marking using the LongInt type, this further reduces the storage space occupied by the database.
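One common way to use interval marks is to store positions only for rows whose reference position is a multiple of the common difference and to recover any other position by stepping back through the BWT until a marked row is reached. The sketch below follows that standard approach as an assumption; the patent does not spell out its recovery procedure, and the names `build_marks`, `lf_step`, and `locate` are hypothetical. A small step of 4 is used so the toy reference exercises the walk.

```python
def build_marks(reference: str, step: int) -> tuple[str, dict[int, int]]:
    """Return the BWT string and a sparse {row: position} map holding only
    positions that are multiples of `step` (0-based)."""
    text = reference + "$"
    order = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in order)
    marks = {row: pos for row, pos in enumerate(order) if pos % step == 0}
    return bwt, marks


def lf_step(bwt: str, row: int) -> int:
    """Map a row to the row whose position is one smaller (last-to-first mapping)."""
    c = bwt[row]
    smaller = sum(1 for ch in bwt if ch < c)   # rows starting with a smaller character
    rank = bwt[:row].count(c)                  # earlier occurrences of c in the BWT
    return smaller + rank


def locate(bwt: str, marks: dict[int, int], row: int) -> int:
    """Recover the 0-based position of `row` by walking to the nearest marked row."""
    steps = 0
    while row not in marks:
        row = lf_step(bwt, row)
        steps += 1
    return marks[row] + steps


bwt, marks = build_marks("ACCACCTG", step=4)
print(len(marks), [locate(bwt, marks, r) for r in range(len(bwt))])
```

Here only three of the nine positions are stored, yet every position is recovered. Position 0 is always marked, so the walk terminates before wrapping around.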

The marking unit is described in further detail below. The marking unit also applies a second arithmetic progression within each interval of the first arithmetic progression to mark the matched reference sequences in the database more finely. To make the explanation clearer, the further function of the marking unit is shown in Table 6, building on Table 4.

Table 6

With the technical solution of this embodiment, once a nucleic acid sequence fragment has been matched on the reference sequence or matched reference sequence, its exact position on the reference sequence can be obtained more quickly. For example, suppose the start position of a match is position 274 of the reference sequence. If the reference sequence carries only one level of marks, the position is found by counting 18 steps forward from mark 256. With the second level of marks, 256+16=272 is known directly, so only two further steps are needed to reach the start of the match, which greatly improves the efficiency of the matching unit. The data type of the marks in this scheme is not specifically limited; the Byte type is preferred because, relative to other data types, it greatly reduces the storage space occupied by the database.
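The arithmetic in this example can be written out as a short sketch. The coarse difference of 256 comes from the text; the fine difference of 16 is inferred from the sum 256+16=272 and is an assumption, as is the function name `nearest_marks`.

```python
def nearest_marks(position: int, coarse: int = 256, fine: int = 16) -> tuple[int, int, int]:
    """Split a position into the nearest coarse mark at or below it, the nearest
    fine mark offset within that coarse interval, and the remaining steps."""
    coarse_mark = (position // coarse) * coarse
    fine_mark = ((position - coarse_mark) // fine) * fine
    remainder = position - coarse_mark - fine_mark
    return coarse_mark, fine_mark, remainder


print(nearest_marks(274))             # two-level marking
print(nearest_marks(274, fine=256))   # effectively one level only
```

With two levels, position 274 is reached in 2 steps from 256+16; with one level it takes 18 steps from 256. Because the fine marks only ever hold offsets smaller than the coarse difference, a small data type suffices for them.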

Note that the above example applies only two levels of marks to the reference sequence or matched reference sequence. If more levels are needed, they can be added in the same manner, which is not repeated here. The marking scheme of the present invention is not limited to the example given.

Matching unit 4 in the second embodiment is described further below. Matching unit 4 reverses a nucleic acid sequence fragment to form a reversed fragment, or forms its reverse complement, and performs consistency matching of the fragment, the reversed fragment, or the reverse-complement fragment against the matched reference sequences in the database.

Specific embodiments are given below for consistency matching of forward fragments, reversed fragments, and reverse-complement fragments against the reference sequences or matched reference sequences in the database.

We make the following assumptions: the nucleic acid sequence fragment is ACC; the reference sequence is held in the database in transformed form as G$CCAACTC (the last column of the BWT reference sequence matrix of ACCACCTG$); and the corresponding matched reference sequence information in the database is shown in Table 7.

Table 7

(1) forward nucleic acid sequence fragments carries out consistance coupling.Its detailed process as shown in Figure 9.

For convenience of description, marked 1. 2. 3. 4. 5. 6. below often arranging respectively.Wherein, the 1. to the 5. row be in database, mate reference sequences information, the 6. row are nucleic acid sequence fragments.The 1. row and the 2. row be respectively BWT reference sequences matrix the 1. row and last arrange, the 3. row marked the position of base in reference sequences 1. arranged, the position (mark mode of reference sequences position can adopt the mark mode of compartment) of the 4. row mark reference sequences, 5. the be classified as reference sequences.In this programme, last position that ACC matches is 3, from the 3. row, find the position at 3 places, from this position, in the 1. row, find the position of base C, then in BWT reference sequences matrix, obtain maximum common prefix, if maximum-prefix is more than or equal to 3, then according to the 3. row can obtain all positions of nucleic acid sequence fragments on reference sequences.This technical scheme, nucleic acid sequence fragments only need be mated once by the reference sequences identical with reference sequences, just can obtain the reference position of all reference sequences corresponding to nucleic acid sequence fragments, thus substantially increases the efficiency that nucleic acid sequence fragments carries out consistance coupling.

(2) Consistency matching of the reverse nucleic acid sequence fragment.

To aid understanding, this technical solution is described below in terms of the BWT reference sequence matrix, as shown in Figure 10.

The above technical solution is explained in detail as follows. In the first row, C, CC and ACC are the first base, the first two bases and the first three bases of the reverse nucleic acid sequence fragment, and consistency matching proceeds in turn starting from the first base. The upper and lower positions indicated by the arrows are, respectively, the start and the end of the range of matched positions. Every matching step of the reverse nucleic acid sequence fragment follows the scheme shown in the table, which gives the matching of the first three bases in total. The matching result in the table shows that there are two matched positions. From the corresponding matched reference sequence information in the database, the matched start positions are the first and the fourth positions of the reference sequence. In this technical solution, a nucleic acid sequence fragment needs to be matched only once against identical reference sequences to obtain all of the corresponding start positions on the reference sequence, which greatly improves the efficiency of consistency matching.
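The stepwise matching of C, CC and ACC, each step narrowing a start and end row, resembles the backward search commonly used with BWT-based indexes. The following is a minimal sketch of that standard technique, not necessarily the exact procedure of Figure 10. The function names `bwt_index` and `backward_search` are hypothetical, the identifier `$` is assumed to sort before A, C, G and T, and the reference ACCACCTG is borrowed from the matrix example later in this description.

```python
def bwt_index(reference, end_marker="$"):
    """Return the first column, the last column and the row positions of the
    sorted rotation matrix (assumes end_marker sorts before A, C, G, T)."""
    text = reference + end_marker
    n = len(text)
    # Row r of the sorted matrix is the rotation starting at offset order[r].
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    first = "".join(text[i] for i in order)
    last = "".join(text[i - 1] for i in order)
    return first, last, order


def backward_search(fragment, first, last, order):
    """Return the sorted 1-based start positions of fragment in the reference."""
    lo, hi = 0, len(first)            # current row interval [lo, hi)
    for base in reversed(fragment):   # extend the match one base to the left
        start = first.find(base)      # first row beginning with this base
        if start < 0:
            return []
        # Number of occurrences of base in the last column above rows lo and hi.
        lo = start + last.count(base, 0, lo)
        hi = start + last.count(base, 0, hi)
        if lo >= hi:
            return []
    return sorted(order[r] + 1 for r in range(lo, hi))


first, last, order = bwt_index("ACCACCTG")
print(last)                                        # last column of the matrix
print(backward_search("ACC", first, last, order))  # start positions of ACC
```

Under these assumptions the fragment ACC is located at the first and fourth positions of ACCACCTG, and the last column computed for that reference is G$CCAACTC, the same string as in the assumption stated above.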

(3) Consistency matching of the reverse complementary nucleic acid sequence fragment.

An example is given below. As shown in A of Figure 11, the nucleic acid sequence fragment is converted into its reverse complementary nucleic acid sequence fragment, and step (3) is carried out following (2). Specifically, the reverse complementary nucleic acid sequence fragment of ACC is GGT, and its matching process is shown in B of Fig. 9. In this technical solution, the reverse complementary nucleic acid sequence fragment is matched from back to front, and the position of the corresponding reference sequence is found in column 3. In this specific embodiment, the last position matched by GGT is 6. Position 6 is located in column 3; starting from that row, the last occurrence of base C, the complement of base G, is found in column 1, and the longest common prefix is then obtained in the BWT reference sequence matrix. If the longest common prefix is greater than or equal to 3, all positions of the nucleic acid sequence fragment on the reference sequence can be obtained from column 3. With this technical solution, a nucleic acid sequence fragment needs to be matched only once against identical reference sequences to obtain all of the corresponding start positions on the reference sequence, which greatly improves the efficiency of consistency matching.
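The conversion of a fragment into its reverse complement can be sketched as follows. The function name `reverse_complement` is hypothetical, and only the four bases A, C, G and T are assumed to occur.

```python
# Translation table pairing each base with its complement (A-T, C-G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment):
    """Complement every base, then reverse the order of the bases."""
    return fragment.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACC"))  # GGT, as in the example above
```

Applying the function twice returns the original fragment, so the same routine converts in both directions.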

The consistency matching described in the embodiments of this scheme covers both a complete match and the case in which each nucleic acid sequence fragment is allowed N mismatches, that is, at most N bases may fail to match. A specific scheme for the case with mismatches is given below, as shown in Figure 5. In the figure, when matching reaches a certain base C that cannot be matched, the base "C" immediately preceding it is substituted and matching is retried. If matching still fails after that base "C" has been changed in turn to "T", "G" and "A", substitution is applied to the base one position further back. Once the base "C" at that position is changed to "T", a match is found and the remaining matching continues. This technical solution covers the case in which mismatches are allowed: when one mismatch is allowed, at most one base of the nucleic acid sequence fragment may fail to match in the database; when N mismatches are allowed, at most N bases of the nucleic acid sequence fragment may fail to match in the database, that is, the fragment can be matched in the database after at most N arbitrary positions have been modified. This technical solution allows gene mutations to be detected while still matching quickly. At the same time, base substitution (that is, correcting base identification errors) solves the problem of nucleic acid sequences that cannot be matched because of base identification errors, and thereby safeguards the accuracy of nucleic acid sequencing.
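Matching with at most N substituted bases can be sketched as a backtracking search over a BWT-based index. This is a simplified variant under stated assumptions: it tries every base at every position rather than substituting only after a failure as in Figure 5, and the names `build_index` and `match_with_mismatches`, the identifier `$` and the reference ACCACCTG are all hypothetical.

```python
def build_index(reference, end_marker="$"):
    """First column, last column and row positions of the sorted rotations."""
    text = reference + end_marker
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    first = "".join(text[i] for i in order)
    last = "".join(text[i - 1] for i in order)
    return first, last, order


def match_with_mismatches(fragment, reference, max_mismatches):
    """Return {1-based start position: number of substituted bases} for every
    placement of fragment that needs at most max_mismatches substitutions."""
    first, last, order = build_index(reference)
    hits = {}

    def extend(i, lo, hi, used):
        if i < 0:                      # whole fragment consumed
            for row in range(lo, hi):
                pos = order[row] + 1
                hits[pos] = min(used, hits.get(pos, used))
            return
        for base in "ACGT":            # the original base or a substitute
            cost = used + (base != fragment[i])
            if cost > max_mismatches:
                continue               # mismatch budget exhausted
            start = first.find(base)
            if start < 0:
                continue
            new_lo = start + last.count(base, 0, lo)
            new_hi = start + last.count(base, 0, hi)
            if new_lo < new_hi:        # otherwise backtrack to the next base
                extend(i - 1, new_lo, new_hi, cost)

    extend(len(fragment) - 1, 0, len(first), 0)
    return hits


print(match_with_mismatches("ACT", "ACCACCTG", 0))  # no exact match
print(match_with_mismatches("ACT", "ACCACCTG", 1))  # one substitution allowed
```

With no mismatch allowed the fragment ACT is not found; with one mismatch allowed it is placed at positions 1, 4 and 5 of the hypothetical reference, each with one substituted base.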

For any of the above technical solutions, the present invention proposes a third embodiment, in which the system for matching nucleic acid sequence information comprises an information receiving unit, a database, a reference sequence transformation unit, a marking unit and a matching unit, as shown in Figure 6. The database, reference sequence transformation unit, marking unit and matching unit are not described again in this embodiment; for their specific technical solutions, refer to any of the above technical solutions. Only the information receiving unit is described further below. The information receiving unit is used to receive nucleic acid sequence fragment information and reference sequence information. The system may comprise a computer, and the computer may comprise a USB interface, a disk drive interface or an Internet network interface. Preferably, the information receiving unit obtains the nucleic acid sequence fragments and the reference sequence through the USB interface, the disk drive interface or the Internet. The information receiving unit stores the received information, with the nucleic acid sequence fragments and the reference sequence stored separately. The matching unit can obtain nucleic acid sequence fragments from the database that stores them and consistency-match them against the matched reference sequence in the database that stores the reference sequence, thereby obtaining a matching result. The matching result can be output in readable form, including for example the length of each nucleic acid sequence fragment, the number of matches of each nucleic acid sequence fragment, the nucleic acid sequence fragments that could not be matched, and the matched positions. The output is a matter of form only and is not elaborated here.

Based on the first embodiment, the present invention proposes a fourth embodiment. The method for matching nucleic acid sequence information comprises the following steps.

Step S1: perform a BWT on the reference sequence in the database to obtain a matched reference sequence, and store the matched reference sequence in the database.

The reference sequence stored in the database is a reference sequence stored inside the computer or in a memory outside the computer. The reference sequence is a base sequence, that is, nucleic acid sequence information. The reference sequence and the nucleic acid sequence fragments are nucleic acid sequence information of the same species. For example, if the nucleic acid sequence fragments are obtained by sequencing the nucleic acid of paramecium, the corresponding reference sequence is the nucleic acid sequence information of paramecium. The reference sequence and the nucleic acid sequence fragments may also be obtained from artificial sequences. The reference sequence and the nucleic acid sequence fragments are not specifically limited, except that the reference sequence is a known base sequence.

The BWT is a transform method that Mike Burrows developed from a transformation idea proposed by David Wheeler, improved, and successfully applied to practical data compression; the transform is a current research focus in the field of lossless compression. The BWT is a reversible data transformation that operates on blocks of data. Its core idea is to sort the character matrix obtained by cyclically rotating a character string and to transform it on that basis. After the BWT has been performed on the reference sequence in the database, the resulting matched reference sequence is stored in the database.

Step S2: interval-mark the matched reference sequence in the database.

Step S3: form the reverse complementary nucleic acid sequence fragment from the nucleic acid sequence fragment, then consistency-match the reverse complementary nucleic acid sequence fragment against the matched reference sequence in the database to obtain matched nucleic acid sequences.

The nucleic acid sequence fragment is stored within the system or in a memory outside the system. Either the whole nucleic acid sequence fragment is consistency-matched directly against the matched reference sequence in the database, or the head and the tail of the whole fragment are consistency-matched against the matched reference sequence in the database simultaneously. Consistency matching means that, when N mismatches are allowed, a whole nucleic acid sequence fragment in which at most N bases fail to match the matched reference sequence is deemed to match, yielding one matched nucleic acid sequence fragment; otherwise, the fragment is deemed not to match and is discarded. All other nucleic acid sequence fragments are consistency-matched in the database in the same manner, yielding the matched nucleic acid sequences. The matched nucleic acid sequences can be output in readable form or stored in the database. When the matched nucleic acid sequences are output, the output information may include the start position and end position of each nucleic acid sequence fragment on the reference sequence, the positions of the mismatches in each fragment, and the number of mismatches.

In the technical solution of this embodiment, whole nucleic acid sequence fragments are consistency-matched directly in the database, and identical matched reference sequences are matched only once, which improves matching efficiency. In addition, the reference sequence stored in the database does not need to be segmented to build a reference sequence index (supposing the index length is K, then in any two adjacent index entries the last K-1 bases of the former are identical to the first K-1 bases of the latter), and it is interval-marked. Compared with the prior art, the storage space is greatly reduced.

Step S1 comprises: S11, adding an identifier at the end or the front of the reference sequence in the database, and cyclically shifting this reference sequence to obtain a reference sequence matrix; S12, sorting the reference sequence matrix in lexicographic order to obtain the BWT reference sequence matrix.

An example is given for this technical solution. Suppose the nucleic acid sequence fragment to be matched is CCACC and the BWT reference sequence matrix is the matrix shown below.

Row 1: $ACCACCTG

Row 2: ACCACCTG$

Row 3: ACCTG$ACC

Row 4: CACCTG$AC

Row 5: CCACCTG$A

Row 6: CCTG$ACCA

Row 7: CTG$ACCAC

Row 8: G$ACCACCT

Row 9: TG$ACCACC

When comparing, the first base of the nucleic acid sequence fragment is C, so comparison need only start from the fourth row of the BWT matrix. The second base of the fragment is compared with the second base of the fourth row of the BWT matrix. If the second base matches, the third base of the fragment is compared with the third base of the fourth row; if the second base does not match, comparison moves to the fifth row. This procedure is repeated until the seventh row has been compared. Note that if the nucleic acid sequence fragment has only M bases, only the n-th base of the fragment need be compared with the n-th base of the row of the BWT matrix currently being compared, and only M positions are compared. In this technical solution, the BWT reference sequence matrix sorted in lexicographic order is the matched reference sequence. When a nucleic acid sequence fragment is compared with the matched reference sequence and its first base is A, only the rows of the BWT reference sequence matrix whose first column is A need be compared; likewise, when its first base is G, C or T, only the rows whose first column is G, C or T need be compared. This greatly increases the speed of comparison.
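The row-restricted comparison above can be sketched as follows. The matrix is copied from the example; the function names `candidate_rows` and `matching_rows` are hypothetical, and the binary search on the first column is an implementation choice assumed here, not something stated in the description.

```python
from bisect import bisect_left, bisect_right

# BWT reference sequence matrix from the example (rows 1 to 9).
MATRIX = [
    "$ACCACCTG", "ACCACCTG$", "ACCTG$ACC",
    "CACCTG$AC", "CCACCTG$A", "CCTG$ACCA",
    "CTG$ACCAC", "G$ACCACCT", "TG$ACCACC",
]


def candidate_rows(fragment, matrix):
    """1-based first and last row whose first column equals the first base
    of the fragment (the range is empty when first > last)."""
    first_column = [row[0] for row in matrix]   # sorted, so bisect applies
    lo = bisect_left(first_column, fragment[0])
    hi = bisect_right(first_column, fragment[0])
    return lo + 1, hi


def matching_rows(fragment, matrix):
    """1-based rows whose first len(fragment) bases equal the fragment."""
    start, end = candidate_rows(fragment, matrix)
    m = len(fragment)                           # only M positions are compared
    return [r for r in range(start, end + 1) if matrix[r - 1][:m] == fragment]


print(candidate_rows("CCACC", MATRIX))  # rows 4 to 7 start with C
print(matching_rows("CCACC", MATRIX))   # only row 5 matches
```

For the fragment CCACC only rows 4 to 7 are examined, and the match is found in row 5.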

The following may also be included after step S12: S13, obtaining the first column and the last column of the BWT reference sequence matrix to obtain the matched reference sequence, and storing it in the database. The specific flowchart of step S1 is shown in Figure 7. First, an identifier is added at the end or the front of the reference sequence. The identifier may be any character other than A, G, C and T, and is added in order to distinguish the head and the tail of the reference sequence. A reference sequence of length X has length X+1 after the character is added. Then, the reference sequence with the added identifier is cyclically shifted to obtain an (X+1) × (X+1) reference sequence matrix. Next, the reference sequence matrix is sorted in lexicographic order to obtain the BWT reference sequence matrix; lexicographic order here means sorting in the order A, B, C, D, ... of the Chinese phonetic alphabet. Finally, the first column and the last column are extracted and stored. Preferably, the added identifier is treated as the largest or the smallest character.
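Steps S11 to S13 can be sketched as follows for a short reference sequence. The function names `bwt_reference_matrix` and `matched_reference` are hypothetical, and the identifier `$` is assumed to be treated as the smallest character, which holds for the default string ordering used here. Building the full matrix is only practical for small examples.

```python
def bwt_reference_matrix(reference, identifier="$"):
    """S11 and S12: append the identifier, form all cyclic shifts,
    then sort the shifts in lexicographic order."""
    if identifier in "AGCT":
        raise ValueError("identifier must differ from A, G, C and T")
    text = reference + identifier                   # length X + 1
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)                        # (X+1) x (X+1) matrix


def matched_reference(matrix):
    """S13: keep only the first and the last column of the matrix."""
    first = "".join(row[0] for row in matrix)
    last = "".join(row[-1] for row in matrix)
    return first, last


matrix = bwt_reference_matrix("ACCACCTG")
for number, row in enumerate(matrix, start=1):
    print(number, row)
print(matched_reference(matrix))
```

For the reference ACCACCTG this reproduces the nine-row matrix of the example above, with first column $AACCCCGT and last column G$CCAACTC.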

In the above technical solution, a nucleic acid sequence fragment needs to be matched only once against identical reference sequences. The reason is that, after the BWT, adjacent rows of the BWT reference sequence matrix in the database exhibit their longest common prefix. During matching, if a nucleic acid sequence fragment of length m matches row r, then every row whose longest common prefix with row r of the BWT reference sequence matrix is at least m also matches. Only the longest common prefix needs to be determined, and there is no need to consistency-match the fragment again; that is, matching is needed only once. For example, suppose the nucleic acid sequence fragment has length 3 and is ACC. The longest common prefix of the second and third rows of the BWT reference sequence matrix has length 3 and is ACC in both rows, so comparing against the common prefix is sufficient to match the fragment in the database. This technical solution solves not only the slow fragment matching of conventional techniques but also their large storage requirement: nucleic acid sequence fragments are matched quickly, and the matched reference sequence occupies little space.
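The use of the longest common prefix between adjacent rows can be illustrated as follows. The matrix is the one from the example above; the function names `common_prefix_length` and `rows_sharing_prefix` are hypothetical.

```python
# BWT reference sequence matrix from the example (rows 1 to 9).
MATRIX = [
    "$ACCACCTG", "ACCACCTG$", "ACCTG$ACC",
    "CACCTG$AC", "CCACCTG$A", "CCTG$ACCA",
    "CTG$ACCAC", "G$ACCACCT", "TG$ACCACC",
]


def common_prefix_length(a, b):
    """Length of the longest common prefix of two rows."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def rows_sharing_prefix(matrix, row, m):
    """Given one matched 1-based row and the fragment length m, return all
    1-based rows that match as well, using only adjacent-row prefixes."""
    lo = hi = row - 1
    while lo > 0 and common_prefix_length(matrix[lo - 1], matrix[lo]) >= m:
        lo -= 1
    while hi + 1 < len(matrix) and common_prefix_length(matrix[hi], matrix[hi + 1]) >= m:
        hi += 1
    return list(range(lo + 1, hi + 2))


print(common_prefix_length(MATRIX[1], MATRIX[2]))  # rows 2 and 3 share ACC
print(rows_sharing_prefix(MATRIX, 2, 3))           # ACC matched at row 2
```

Once ACC has been matched against row 2, row 3 is accepted from the common prefix alone, without comparing the fragment a second time.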

In this embodiment, in step S2 the matched reference sequence in the database is interval-marked, so that the start position of a match can be obtained quickly when a nucleic acid sequence fragment is matched. Preferably, in step S2 the matched reference sequence in the database is interval-marked according to an arithmetic progression, which greatly reduces the storage space of the database. More preferably, in step S2 the matched reference sequence in the database is further marked, within each interval of the arithmetic progression, again according to an arithmetic progression. This not only allows the matched position of a nucleic acid sequence fragment to be obtained quickly but also further reduces the storage space of the database. In this technical solution, the further marking may restart the numbering, so that, relative to the preferred scheme, a data type occupying less space can be used for storage. For example, the preferred scheme uses the Int type for marking, and the more preferred scheme may use the Byte type for the further marking.
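The description does not spell out how an unmarked position is recovered, so the following sketch substitutes a well-known technique, the sampled position table of FM-index style indexes, as one possible reading of interval marking. Positions are stored only where the reference offset belongs to the arithmetic progression 0, step, 2*step, ..., and other rows are resolved by walking back one base at a time. The names `build_sampled_index` and `locate`, the identifier `$` and the step of 3 are assumptions for illustration.

```python
def build_sampled_index(reference, step, end_marker="$"):
    """Keep the first and last columns, plus positions only for rows whose
    0-based reference offset is a multiple of step."""
    text = reference + end_marker
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    first = "".join(text[i] for i in order)
    last = "".join(text[i - 1] for i in order)
    marks = {row: pos for row, pos in enumerate(order) if pos % step == 0}
    return first, last, marks


def locate(row, first, last, marks):
    """Recover the 1-based reference position of a 0-based matrix row."""
    steps = 0
    while row not in marks:
        base = last[row]
        # Move to the row of the rotation that starts one base earlier.
        row = first.find(base) + last.count(base, 0, row)
        steps += 1
    return marks[row] + steps + 1


first, last, marks = build_sampled_index("ACCACCTG", step=3)
print(len(marks), "marked rows out of", len(first))
print([locate(row, first, last, marks) for row in range(len(first))])
```

Offset 0 is always marked, so the walk terminates after fewer than `step` moves. A larger step stores fewer positions at the cost of a longer walk.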

In this embodiment, in step S3 the reverse complementary nucleic acid sequence fragment is formed from the nucleic acid sequence fragment and is consistency-matched against the matched reference sequence in the database to obtain matched nucleic acid sequences. Further, in step S3, a backtracking method is used to perform base substitution successively at the positions preceding the position at which the reverse complementary nucleic acid sequence fragment cannot be matched, and matching continues in the database from the substituted position.

Based on the fourth embodiment, the present invention proposes a fifth embodiment; the flowchart of the method for matching a nucleic acid sequence fragment is shown in Figure 8. First, a nucleic acid sequence fragment is matched against the matched reference sequence in the database. If the match succeeds, this matching ends. If the match does not succeed, it is determined whether mismatches are allowed. If mismatches are not allowed, this matching ends. If mismatches are allowed, base substitution is performed starting from the position immediately before the position that cannot be matched, and matching is then carried out again.

An example is given below, as shown in Figure 12. In this example, after a base in the nucleic acid sequence fragment has been substituted, consistency matching against the matched reference sequence in the database continues. Here, once the base A at the third position of the nucleic acid sequence has been replaced with T, the fragment matches the matched reference sequence in the database completely, and the matching of this nucleic acid sequence fragment is complete.

In this embodiment, the number of allowed mismatches is not specifically limited and is determined according to the length of the nucleic acid sequence fragment: a longer fragment may be allowed more mismatches and a shorter fragment fewer. For example, a nucleic acid sequence fragment 50 bp long is allowed 4 mismatches, a fragment 30 bp long is allowed 2 mismatches, and a fragment 10 bp long is allowed 0 mismatches. In this embodiment, when a nucleic acid sequence fragment is matched against the reference sequence and mismatches are not allowed, a fragment that cannot be matched completely is deemed not to match. When M mismatches are allowed, base substitution is allowed at no more than M positions of a nucleic acid sequence fragment; if the fragment still cannot be matched after base substitution has been carried out at M positions, the fragment is deemed not to match; otherwise, it is deemed to match.

It should be noted that the present invention is typically applied to, but is not limited to, the matching of nucleic acid sequence fragments in the field of biochemical sequencing; the method set forth in the present invention can also be applied in other similar fields of information processing.

The foregoing are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for matching nucleic acid sequence information, characterized in that the method comprises the following steps:

A. performing a BWT on the reference sequence in a database to obtain a matched reference sequence, and storing the matched reference sequence in the database;

B. interval-marking the matched reference sequence in the database;

C. consistency-matching nucleic acid sequence fragments, successively and respectively, against the matched reference sequence in the database to obtain matched nucleic acid sequences.

2. The method for matching nucleic acid sequence information according to claim 1, characterized in that step A comprises:

A1. adding an identifier at the end or the front of the reference sequence in the database, and cyclically shifting the reference sequence to obtain a reference sequence matrix;

A2. sorting the reference sequence matrix in lexicographic order to obtain a BWT reference sequence matrix, and storing it in the database.

3. The method for matching nucleic acid sequence information according to claim 2, further comprising, after step A2:

A3. obtaining the first column and the last column of the BWT reference sequence matrix, and storing them in the database.

4. The method for matching nucleic acid sequence information according to claim 1, characterized in that step B is: interval-marking the matched reference sequence in the database according to an arithmetic progression.

5. The method for matching nucleic acid sequence information according to claim 1, characterized in that step C is: reversing the nucleic acid sequence fragment to form a reverse nucleic acid sequence fragment, or forming the reverse complementary nucleic acid sequence fragment from the nucleic acid sequence fragment, and then consistency-matching the reverse nucleic acid sequence fragment or the reverse complementary nucleic acid sequence fragment successively against the matched reference sequence in the database to obtain matched nucleic acid sequences.

6. The method for matching nucleic acid sequence information according to any one of claims 1 to 5, characterized in that step C is: forming the reverse complementary nucleic acid sequence fragment from the nucleic acid sequence fragment, and then consistency-matching the reverse complementary nucleic acid sequence fragment against the matched reference sequence in the database to obtain matched nucleic acid sequences.

7. The method for matching nucleic acid sequence information according to claim 6, characterized in that, in step C, when mismatches are allowed, a backtracking method is used to perform base substitution successively at the positions preceding the position at which the reverse complementary nucleic acid sequence fragment cannot be matched, and consistency matching continues in the database from the substituted position.