CN106682393B

CN106682393B - Genome sequence comparison method and device

Info

Publication number: CN106682393B
Application number: CN201611074255.8A
Authority: CN
Inventors: 何光铸; 王东辉; 蔡文君; 刘凯
Original assignee: UNITED ELECTRONICS CO Ltd
Current assignee: Ronglian Technology Group Co., Ltd
Priority date: 2016-11-29
Filing date: 2016-11-29
Publication date: 2019-05-17
Anticipated expiration: 2036-11-29
Also published as: CN106682393A

Abstract

The invention discloses a genome sequence comparison method and device. The method comprises: reading a partial genome sequence from a genome sequence file to be aligned; aligning the partial genome sequence against a reference genome sequence using a bidirectional BWT alignment algorithm, a single-end dynamic programming alignment algorithm, and a paired-end dynamic programming alignment algorithm; after alignment by any of these algorithms, when no unaligned sequence remains in the partial genome sequence, reading a new partial genome sequence from the file and continuing the alignment according to the above steps; and repeating the above steps until the entire genome sequence file has been aligned, then outputting the alignment result. The proposed method and device address the long run time, slow processing, and high resource consumption of genome sequence alignment.

Description

Genome sequence comparison method and device

Technical field

The present invention relates to the technical field of data processing, and in particular to a genome sequence comparison method and device.

Background art

Genome sequence alignment is a common basic processing step in genomic data analysis; its purpose is to locate the position of each sequencing read on the reference genome. The human reference genome is about 3 Gb (roughly 3 billion bases) long, sequencing reads are generally 100 bp to 150 bp long, and the total sequence data from a typical whole-genome sequencing run is about 100 GB. To align these sequences, the industry generally uses open-source alignment software, the better-known tools at present being BWA and Bowtie2. Processing typically takes 10 hours or more, making alignment the main time-consuming step in genomic data analysis. These commonly used second-generation sequencing alignment algorithms generally suffer from long run times, slow processing, and high resource consumption.

Summary of the invention

In view of this, the object of the invention is to propose a genome sequence comparison method and device that address the long run time, slow processing, and high resource consumption of genome sequence alignment algorithms.

To this end, the genome sequence comparison method provided by the invention comprises:

Reading a partial genome sequence from a genome sequence file to be aligned;

Aligning the partial genome sequence against a reference genome sequence according to a bidirectional BWT alignment algorithm;

After alignment by the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, re-aligning each such pair against the reference genome sequence according to a single-end dynamic programming alignment algorithm;

After alignment by the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, re-aligning each such pair against the reference genome sequence according to a paired-end dynamic programming alignment algorithm;

After alignment by any of the foregoing algorithms, when no unaligned sequence remains in the partial genome sequence, reading a new partial genome sequence from the genome sequence file to be aligned and continuing the alignment according to the above steps;

Repeating the above steps until the entire genome sequence file has been aligned, and outputting the alignment result.
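The tiered control flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `align_in_batches`, `tier1`, `tier2`, and `tier3` are hypothetical, and each tier is assumed to be a caller-supplied function that returns a pair of positions in which `None` marks an unaligned read.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Pair = Tuple[str, str]                       # (read A, read A') as base strings
Hit = Tuple[Optional[int], Optional[int]]    # aligned positions; None = unaligned


def align_in_batches(
    batches: Iterable[List[Pair]],
    tier1: Callable[[Pair], Hit],            # stand-in for bidirectional BWT
    tier2: Callable[[Pair, Hit], Hit],       # stand-in for single-end DP
    tier3: Callable[[Pair], Hit],            # stand-in for paired-end DP
) -> List[Tuple[Pair, Hit]]:
    """Run every pair through tier 1 and escalate only the unaligned ones."""
    results = []
    for batch in batches:                    # one "partial genome sequence" at a time
        for pair in batch:
            hit = tier1(pair)
            # Exactly one read aligned: use it as the anchor for tier 2.
            if (hit[0] is None) != (hit[1] is None):
                hit = tier2(pair, hit)
            # Neither read aligned: fall through to tier 3.
            if hit[0] is None and hit[1] is None:
                hit = tier3(pair)
            results.append((pair, hit))
    return results
```

Only the pairs that a cheaper tier could not place reach the more expensive tiers, which is the idea the method relies on to match algorithmic cost to data difficulty.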

In some optional embodiments, aligning the genome sequence according to the bidirectional BWT alignment algorithm specifically comprises:

Segmenting the reads using the pigeonhole principle;

Building the BWT of the reference genome sequence, its suffix array, and the BWT of the reversed reference genome sequence;

Using backward search and forward search to look up the position of each read, or each segment of a read, on the reference genome sequence in both directions, from right to left and from left to right.

In some optional embodiments, aligning the genome sequence according to the single-end dynamic programming alignment algorithm specifically comprises:

Determining the specific position on the reference genome sequence to which one read of a pair of reads has been aligned;

Choosing a particular range around the specific position according to a preset position range threshold;

Aligning, within the particular range, the other read of the pair that has not yet been aligned, using a dynamic programming algorithm.
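The three steps above can be sketched as a "mate rescue" routine. The sketch assumes unit edit costs, assumes the unaligned read lies downstream of the anchor in reference orientation, and ignores reverse complements; the names `fit_align`, `rescue_mate`, `min_gap`, `max_gap`, and `max_edits` are hypothetical, and the patent does not specify the scoring scheme.

```python
from typing import Optional, Tuple


def fit_align(read: str, window: str) -> Tuple[int, int]:
    """Align the whole read to any substring of window (edit distance).

    Returns (cost, start), where start is the window offset of the match.
    """
    n = len(window)
    prev_cost = [0] * (n + 1)            # an alignment may start anywhere for free
    prev_start = list(range(n + 1))
    for i in range(1, len(read) + 1):
        cur_cost = [i] + [0] * n
        cur_start = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = prev_cost[j - 1] + (read[i - 1] != window[j - 1])
            skip_read = prev_cost[j] + 1      # read base has no partner
            skip_ref = cur_cost[j - 1] + 1    # window base has no partner
            best = min(sub, skip_read, skip_ref)
            cur_cost[j] = best
            if best == sub:
                cur_start[j] = prev_start[j - 1]
            elif best == skip_read:
                cur_start[j] = prev_start[j]
            else:
                cur_start[j] = cur_start[j - 1]
        prev_cost, prev_start = cur_cost, cur_start
    end = min(range(n + 1), key=lambda j: prev_cost[j])
    return prev_cost[end], prev_start[end]


def rescue_mate(reference: str, anchor_pos: int, anchor_len: int, mate: str,
                min_gap: int, max_gap: int, max_edits: int) -> Optional[Tuple[int, int]]:
    """Search for the mate only inside the window implied by the anchor."""
    lo = anchor_pos + anchor_len + min_gap
    hi = min(len(reference), anchor_pos + anchor_len + max_gap + len(mate))
    window = reference[lo:hi]
    if not window:
        return None
    cost, start = fit_align(mate, window)
    if cost > max_edits:
        return None
    return lo + start, cost
```

Because the dynamic programming table covers only the window rather than the whole reference, its cost grows with the read length times the window length, which is what makes this tier affordable.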

In some optional embodiments, aligning the genome sequence according to the paired-end dynamic programming alignment algorithm specifically comprises:

Constructing seeds for each read in a pair of reads;

Aligning each seed to the reference genome sequence;

If, in some region of the reference genome sequence, both reads of the pair have a corresponding seed aligned, taking that region as a candidate region for the final alignment position;

Aligning the two reads of the pair separately within the candidate region using a dynamic programming algorithm.
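The seed-and-candidate-region idea can be sketched as below. Note the substitution: seeds are located here with a plain k-mer hash index as a stand-in for the BWT lookup, purely to keep the example short. The names `build_kmer_index`, `seed_hits`, `candidate_regions`, `k`, and `max_span` are hypothetical, and both reads are assumed to be in reference orientation.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_hits(read: str, index: Dict[str, List[int]], k: int) -> Set[int]:
    """Read start positions implied by exact hits of non-overlapping seeds."""
    starts = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], ()):
            if pos - offset >= 0:
                starts.add(pos - offset)
    return starts


def candidate_regions(read_a: str, read_b: str, index: Dict[str, List[int]],
                      k: int, max_span: int) -> List[Tuple[int, int]]:
    """Regions where seeds of both reads land within max_span of each other."""
    regions = []
    for a in sorted(seed_hits(read_a, index, k)):
        for b in sorted(seed_hits(read_b, index, k)):
            if b >= a and b + len(read_b) - a <= max_span:
                regions.append((a, b + len(read_b)))
    return regions
```

Each returned region would then be handed to a dynamic programming aligner, once per read, so that mismatches outside the exact seeds are resolved only where both reads plausibly belong.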

Another aspect of the invention provides a genome sequence comparison device, comprising:

A data acquisition module, configured to read a partial genome sequence from a genome sequence file to be aligned;

A comparison module, configured to: align the partial genome sequence against the reference genome sequence according to the bidirectional BWT alignment algorithm; after alignment by the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, re-align each such pair against the reference genome sequence according to the single-end dynamic programming alignment algorithm; after alignment by the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, re-align each such pair against the reference genome sequence according to the paired-end dynamic programming alignment algorithm; and, after alignment by any of the foregoing algorithms, when no unaligned sequence remains in the partial genome sequence, read a new partial genome sequence from the genome sequence file to be aligned and continue the alignment according to the above steps.

In some optional embodiments, the comparison module is specifically configured to:

Segment the reads using the pigeonhole principle;

In some optional embodiments, the comparison module is specifically configured to:

Construct seeds for each read in a pair of reads;

Align each seed to the reference genome sequence;

As the above embodiments show, the genome sequence comparison method and device provided by the invention use multiple levels of alignment algorithms: once one level has finished, the sequences it failed to align are passed to the next level for further alignment. The complexity of the algorithm is thereby matched to the complexity of the data, and each level is optimized, which speeds up the algorithm as a whole. With the same resources and without sacrificing alignment accuracy, the method and device can shorten the alignment time for a human whole-genome sequence to about 4 hours, a significant reduction compared with the prior art, thereby improving sequencing efficiency.

Brief description of the drawings

Fig. 1 is a flow diagram of one embodiment of the genome sequence comparison method provided by the invention;

Fig. 2 is a schematic diagram of the module structure of one embodiment of the genome sequence comparison device provided by the invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in more detail below in conjunction with specific embodiments and with reference to the drawings.

It should be noted that all uses of "first" and "second" in the embodiments of the invention serve to distinguish two non-identical entities or parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments; this is not repeated in the embodiments that follow.

To this end, a first aspect of the embodiments of the invention proposes a genome sequence comparison method that addresses the long run time, slow processing, and high resource consumption of genome sequence alignment. Fig. 1 is a flow diagram of one embodiment of the genome sequence comparison method provided by the invention.

The genome sequence comparison method comprises the following steps:

Step 101: obtain the reference genome sequence and the genome sequence file to be aligned. The files are obtained in the conventional way. The genome sequence file to be aligned may be in FASTQ format.

The method divides sequence alignment into three levels. Each time, a portion of the sequences is read from the input genome sequence file and passed through the level 1, level 2, and level 3 alignment algorithms in turn; sequences that one level fails to align proceed to the next level. The specific steps are as follows.

Step 102: read a partial genome sequence from the genome sequence file to be aligned.
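Reading the file in portions, as step 102 describes, might look like the following sketch. It assumes the common four-line FASTQ layout (header, bases, `+` separator, qualities) with no line wrapping; the function name `read_fastq_batches` and the `batch_size` parameter are hypothetical, and the patent does not state how large each portion is.

```python
import io
from itertools import islice
from typing import Iterator, List, TextIO, Tuple

Record = Tuple[str, str, str]   # (read name, bases, quality string)


def read_fastq_batches(handle: TextIO, batch_size: int) -> Iterator[List[Record]]:
    """Yield lists of up to batch_size FASTQ records until the file is exhausted."""
    while True:
        lines = [line.rstrip("\n") for line in islice(handle, 4 * batch_size)]
        if not lines:
            return
        if len(lines) % 4 != 0:
            raise ValueError("truncated FASTQ record")
        batch = []
        for i in range(0, len(lines), 4):
            header, bases, separator, quals = lines[i:i + 4]
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError("malformed FASTQ record")
            batch.append((header[1:], bases, quals))
        yield batch


# Usage: three records read two at a time give batches of sizes 2 and 1.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n@r3\nTTTT\n+\nIIII\n")
sizes = [len(batch) for batch in read_fastq_batches(demo, 2)]
```

Holding only one batch in memory at a time keeps the memory footprint independent of the total file size.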

Step 103: align the partial genome sequence against the reference genome sequence according to the bidirectional BWT alignment algorithm (level 1; bidirectional BWT: Bi-directional Burrows-Wheeler Transform). The bidirectional BWT alignment algorithm handles reads with at most 4 base mismatches. A read is a sequence obtained in high-throughput sequencing; each read is a stretch of bases. In bioinformatic analysis, aligning each read to the reference genome reveals the differences between the sequenced sample and the reference genome, from which variants are identified.

Optionally, aligning the genome sequence according to the bidirectional BWT alignment algorithm may specifically comprise the following steps:

Segmenting the reads using the pigeonhole principle, with each segment allowing 0 to 2 base mismatches;

Then searching with the bidirectional BWT alignment algorithm, which comprises:

Using backward search and forward search to look up the position of each read, or each segment of a read, on the reference genome sequence in both directions, from right to left and from left to right.

Bidirectional BWT alignment is relatively inefficient when it has to handle many base mismatches. With at most 4 base mismatches allowed, the reads are segmented according to the pigeonhole principle so that each segment allows 0 to 2 mismatches. The bidirectional BWT then only has to handle at most 2 base mismatches, which greatly improves efficiency.
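The pigeonhole argument can be made concrete: if a read carries at most e mismatches and is cut into k segments, at least one segment carries at most ⌊e/k⌋ of them. The sketch below assumes two segments, which turns the 4-mismatch budget into a 2-mismatch budget per segment; the patent does not state the number of segments, and the names `pigeonhole_segments` and `hamming` are hypothetical.

```python
from typing import List, Tuple


def pigeonhole_segments(read: str, max_mismatches: int,
                        n_segments: int) -> List[Tuple[int, str, int]]:
    """Split a read into near-equal segments.

    Returns (offset, segment, per-segment mismatch budget) triples. By the
    pigeonhole principle at least one segment stays within its budget.
    """
    budget = max_mismatches // n_segments
    base, extra = divmod(len(read), n_segments)
    segments = []
    start = 0
    for i in range(n_segments):
        length = base + (1 if i < extra else 0)   # spread the remainder
        segments.append((start, read[start:start + length], budget))
        start += length
    return segments


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))
```

For a 10-base read with a budget of 4 mismatches and two segments, the call `pigeonhole_segments("ACGTACGTAC", 4, 2)` yields two 5-base segments, each with a budget of 2.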

The commonly used alignment software BWA builds the BWT of the reference sequence together with the corresponding index and suffix array (SA), then uses backward search: each read, or each segment of a read, is searched from right to left for its position in the genome. In addition to the conventional BWT index (denoted B), the bidirectional BWT used in this patent builds a second BWT index (denoted B') on the reversed reference sequence. Using B, B', and SA, the positions of reads or seeds in the genome are searched in both directions by backward and forward search, which significantly improves alignment efficiency.
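A toy version of the two indexes may help. The sketch below builds B and SA for a short reference, runs an exact backward search, and then answers the same query through an index B' built on the reversed reference. It is a simplified illustration, not the patented algorithm: it handles exact matches only, does not keep the synchronized intervals a full bidirectional search uses, and builds the suffix array by naive sorting, which suits only tiny inputs. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Index = Tuple[List[int], str, Dict[str, int], Dict[str, List[int]]]


def build_index(reference: str) -> Index:
    """Return (suffix array, BWT, C table, Occ table) for reference + '$'."""
    text = reference + "$"                       # '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)       # text[-1] is '$' when i == 0
    counts: Dict[str, int] = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):                    # C[ch]: characters smaller than ch
        c_table[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):                  # occ[ch][i]: count of ch in bwt[:i]
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, bwt, c_table, occ


def backward_search(pattern: str, index: Index) -> List[int]:
    """Exact-match positions, consuming the pattern from right to left."""
    sa, bwt, c_table, occ = index
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def search_via_reverse_index(pattern: str, reverse_index: Index, ref_len: int) -> List[int]:
    """Same query on B': consumes the pattern left to right in forward terms."""
    hits = backward_search(pattern[::-1], reverse_index)
    return sorted(ref_len - p - len(pattern) for p in hits)
```

Running a backward search over the reversed text consumes the pattern from its first base to its last, which is why the second index makes left-to-right extension possible.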

Step 104: determine whether the partial genome sequence contains at least one pair of reads in which only one read has been aligned. If so, go to step 105; if not, go to step 106.

Step 105: according to the single-end dynamic programming alignment algorithm (level 2), re-align against the reference genome sequence each pair of reads in the partial genome sequence in which only one read has been aligned. That is, when the level 1 bidirectional BWT alignment algorithm has aligned one read of a pair (A, A') to the reference genome sequence (A or A') but not the other (A' or A), the pair continues to the level 2 alignment algorithm.

Optionally, aligning the genome sequence according to the single-end dynamic programming alignment algorithm may specifically comprise the following steps:

Determining the specific position (position pos) on the reference genome sequence to which one read (A or A') of a pair of reads (A, A') has been aligned. Paired-end sequencing data come in pairs: if one read (A or A') of a pair (A, A') aligns at position pos on the reference genome sequence, then the other read (A' or A) should in theory align within a certain area around pos, namely the candidate region;

Accordingly, choosing a particular range around the specific position (position pos) according to a preset position range threshold. The threshold can be chosen as needed, for example with reference to the error tolerance. Specifically, in paired-end sequencing, once a pair of reads is aligned to the genome, the distance between the two reads plus the lengths of the two reads equals the length of the sequenced fragment, and this principle determines where the candidate region lies. For example, if the sequenced fragment is 500 bp and each read is 150 bp, the theoretical distance between the two reads after alignment is 200 bp. Because fragment lengths vary, the theoretical distance is roughly 100 bp to 200 bp;

Aligning, within the particular range, the other read of the pair that has not yet been aligned (A' or A) using a dynamic programming algorithm.

Step 106: determine whether the partial genome sequence contains at least one pair of reads in which neither read has been aligned. If so, go to step 107; if not, go to step 108.
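The fragment-length arithmetic used in step 105 to place the candidate region can be written out as a short worked example. The function name `gap_range` is hypothetical, and the 400 bp lower bound on fragment length is an assumption chosen only because it reproduces the 100 bp to 200 bp range quoted in the text.

```python
from typing import Tuple


def gap_range(min_fragment: int, max_fragment: int, read_len: int) -> Tuple[int, int]:
    """Distance between two reads of a pair: fragment length minus both read lengths."""
    if min_fragment > max_fragment:
        raise ValueError("min_fragment must not exceed max_fragment")
    shortest = max(0, min_fragment - 2 * read_len)   # reads may touch or overlap
    longest = max(0, max_fragment - 2 * read_len)
    return shortest, longest


# A 500 bp fragment with two 150 bp reads leaves 500 - 2 * 150 = 200 bp between them.
exact = gap_range(500, 500, 150)
# Fragments between an assumed 400 bp and 500 bp give a 100 bp to 200 bp gap.
spread = gap_range(400, 500, 150)
```

The resulting gap range, added to the end of the aligned read, bounds where the dynamic programming search for the other read needs to look.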

Step 107: according to the paired-end dynamic programming alignment algorithm (level 3), re-align against the reference genome sequence each pair of reads in the partial genome sequence in which neither read has been aligned. That is, when neither A nor A' of a pair of reads (A, A') has been aligned to the reference genome sequence by the level 1 bidirectional BWT alignment algorithm or the level 2 single-end dynamic programming alignment algorithm, the pair continues to the level 3 alignment algorithm.

Optionally, aligning the genome sequence according to the paired-end dynamic programming alignment algorithm may specifically comprise the following steps:

Constructing seeds (substrings of a read) for each read (A and A') in a pair of reads;

Specifically, each read of a pair of reads (A, A') is divided into many segments to construct seeds (substrings of a read). When a pair of reads is aligned to the genome, the distance between the two reads falls within a certain range, so the distance between the seeds of the two reads should also fall within a certain range;

Aligning each seed to the reference genome sequence;

Specifically, the regions to which paired seeds align (that is, seeds whose mutual distance meets the requirement) are retrieved to determine the candidate alignment region for the pair of reads. The reads are then aligned to the candidate region with a dynamic programming algorithm.

If, in some region of the reference genome sequence, both reads of the pair (A and A') have a corresponding seed aligned, that region is a candidate region for the final alignment position;

Aligning the two reads of the pair (A and A') separately within the candidate region using a dynamic programming algorithm; once the alignment is complete, go to step 108.

Step 108: determine whether the entire genome sequence file to be aligned has been aligned. If not, return to step 102; if so, go to step 109.

Step 109: output the alignment result. Optionally, the output of the genome sequence alignment is a BAM file. BAM is a storage format for genome sequence alignment results; it records the position of each sequence on the reference genome sequence and the details of its alignment.
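BAM is the compressed binary counterpart of the text-based SAM format. As a hedged illustration of what a single alignment record carries, the sketch below formats one SAM-style text line with the eleven mandatory fields; the function name `format_sam_record` and the sample values are assumptions, mate fields are left unset for brevity, and a real pipeline would normally rely on an existing library to write BAM.

```python
def format_sam_record(name: str, flag: int, ref_name: str, pos0: int,
                      mapq: int, cigar: str, bases: str, quals: str) -> str:
    """Build one tab-separated SAM line from a 0-based alignment position."""
    fields = [
        name,            # QNAME: read name
        str(flag),       # FLAG: bitwise flags
        ref_name,        # RNAME: reference sequence name
        str(pos0 + 1),   # POS: SAM positions are 1-based
        str(mapq),       # MAPQ: mapping quality
        cigar,           # CIGAR: alignment operations
        "*",             # RNEXT: mate reference, unset here
        "0",             # PNEXT: mate position, unset here
        "0",             # TLEN: template length, unset here
        bases,           # SEQ: read bases
        quals,           # QUAL: base qualities
    ]
    return "\t".join(fields)


# Usage: a read aligned at 0-based position 16 is reported at position 17.
line = format_sam_record("read1", 0, "chr1", 16, 60, "8M", "GGCATGCA", "IIIIIIII")
```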

As the above embodiments show, the genome sequence comparison method provided by the invention uses multiple levels of alignment algorithms: once one level has finished, the sequences it failed to align are passed to the next level for further alignment. The complexity of the algorithm is thereby matched to the complexity of the data, and each level is optimized, which speeds up the algorithm as a whole. With the same resources and without sacrificing alignment accuracy, the method can shorten the alignment time for a human whole-genome sequence to about 4 hours, a significant reduction compared with the prior art, thereby improving sequencing efficiency.

A second aspect of the embodiments of the invention proposes a genome sequence comparison device that addresses the long run time, slow processing, and high resource consumption of genome sequence alignment. Fig. 2 is a schematic diagram of the module structure of one embodiment of the genome sequence comparison device provided by the invention.

The genome sequence comparison device comprises:

A data acquisition module 201, configured to read a partial genome sequence from a genome sequence file to be aligned;

A comparison module 202, configured to: align the partial genome sequence against the reference genome sequence according to the bidirectional BWT alignment algorithm; after alignment by the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, re-align each such pair against the reference genome sequence according to the single-end dynamic programming alignment algorithm; after alignment by the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, re-align each such pair against the reference genome sequence according to the paired-end dynamic programming alignment algorithm; and, after alignment by any of the foregoing algorithms, when no unaligned sequence remains in the partial genome sequence, read a new partial genome sequence from the genome sequence file to be aligned and continue the alignment according to the above steps;

After the comparison module 202 finishes aligning a partial genome sequence, if the genome sequence file to be aligned has not yet been fully aligned, the data acquisition module 201 reads the next partial genome sequence from the file and sends it to the comparison module 202 for alignment, until the entire genome sequence file has been aligned.

In some optional embodiments, the comparison module 202 is specifically configured to:

Segment the reads using the pigeonhole principle;

Construct seeds for each read in a pair of reads;

Align each seed to the reference genome sequence;

As the above embodiments show, the genome sequence comparison device provided by the invention uses multiple levels of alignment algorithms: once one level has finished, the sequences it failed to align are passed to the next level for further alignment. The complexity of the algorithm is thereby matched to the complexity of the data, and each level is optimized, which speeds up the algorithm as a whole. With the same resources and without sacrificing alignment accuracy, the device can shorten the alignment time for a human whole-genome sequence to about 4 hours, a significant reduction compared with the prior art, thereby improving sequencing efficiency.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the disclosure (including the claims) is limited to these examples. Within the spirit of the invention, technical features of the above embodiments or of different embodiments may also be combined, steps may be carried out in any order, and many other variations of the different aspects of the invention described above exist, which are not given in detail for the sake of brevity.

In addition, to simplify the description and discussion, and so as not to obscure the invention, well-known power and ground connections to integrated circuit (IC) chips and other components may or may not be shown in the drawings provided. Devices may also be shown in block diagram form to avoid obscuring the invention, which also takes into account the fact that the details of implementing such block diagram devices depend heavily on the platform on which the invention is to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the invention, it will be apparent to those skilled in the art that the invention can be practiced without these specific details or with variations of them. These descriptions should therefore be regarded as illustrative rather than restrictive.

Although the invention has been described in conjunction with specific embodiments, many alternatives, modifications, and variations of those embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the embodiments discussed can be used with other memory architectures (for example, dynamic RAM (DRAM)).

The embodiments of the invention are intended to cover all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A genome sequence comparison method, characterized by comprising:

Step 101: reading a partial genome sequence from a genome sequence file to be aligned;

Step 102: aligning the partial genome sequence against a reference genome sequence according to a bidirectional BWT alignment algorithm;

Step 103: after alignment by the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, re-aligning each such pair against the reference genome sequence according to a single-end dynamic programming alignment algorithm;

Step 104: after alignment by the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, re-aligning each such pair against the reference genome sequence according to a paired-end dynamic programming alignment algorithm;

Step 105: after alignment by any of the foregoing algorithms, when no unaligned sequence remains in the partial genome sequence, reading a new partial genome sequence from the genome sequence file to be aligned and continuing the alignment according to the above steps;

Repeating step 102 to step 105 until the entire genome sequence file has been aligned, and outputting the alignment result;

Wherein aligning the genome sequence according to the single-end dynamic programming alignment algorithm specifically comprises: determining the specific position on the reference genome sequence to which one read of a pair of reads has been aligned; choosing a particular range around the specific position according to a preset position range threshold; and aligning, within the particular range, the other read of the pair that has not yet been aligned, using a dynamic programming algorithm;

And aligning the genome sequence according to the paired-end dynamic programming alignment algorithm specifically comprises: constructing seeds for each read in a pair of reads; aligning each seed to the reference genome sequence; if, in some region of the reference genome sequence, both reads of the pair have a corresponding seed aligned, taking that region as a candidate region for the final alignment position; and aligning the two reads of the pair separately within the candidate region using a dynamic programming algorithm.

2. The method according to claim 1, characterized in that aligning the genome sequence according to the bidirectional BWT alignment algorithm specifically comprises:

Segmenting the reads using the pigeonhole principle;

Using backward search and forward search to look up the position of each read, or each segment of a read, on the reference genome sequence in both directions, from right to left and from left to right.

3. A genome sequence comparison device, characterized by comprising:

A comparison module, configured to: align the partial genome sequence against the reference genome sequence according to a bidirectional BWT alignment algorithm; after alignment by the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, re-align each such pair against the reference genome sequence according to a single-end dynamic programming alignment algorithm; after alignment by the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, re-align each such pair against the reference genome sequence according to a paired-end dynamic programming alignment algorithm; and, after alignment by any of the foregoing algorithms, when no unaligned sequence remains in the partial genome sequence, read a new partial genome sequence from the genome sequence file to be aligned and repeat the steps from aligning the partial genome sequence against the reference genome sequence according to the bidirectional BWT alignment algorithm through re-aligning, according to the paired-end dynamic programming alignment algorithm, each pair of reads in which neither read has been aligned;

The comparison module carries out the single-end dynamic programming alignment algorithm, which specifically comprises: determining the specific position on the reference genome sequence to which one read of a pair of reads has been aligned; choosing a particular range around the specific position according to a preset position range threshold; and aligning, within the particular range, the other read of the pair that has not yet been aligned, using a dynamic programming algorithm;

The comparison module also carries out the paired-end dynamic programming alignment algorithm, which specifically comprises: constructing seeds for each read in a pair of reads; aligning each seed to the reference genome sequence; if, in some region of the reference genome sequence, both reads of the pair have a corresponding seed aligned, taking that region as a candidate region for the final alignment position; and aligning the two reads of the pair separately within the candidate region using a dynamic programming algorithm.

4. The device according to claim 3, characterized in that the comparison module is specifically configured to:

Segment the reads using the pigeonhole principle;