WO2013078623A1

WO2013078623A1 - Method and device for gap closure in nucleotide sequence assembly

Info

Publication number: WO2013078623A1
Application number: PCT/CN2011/083178
Authority: WO
Inventors: 刘兵行; 李振宇; 陈燕香; 李英睿; 汪建; 王俊; 杨焕明
Original assignee: 深圳华大基因科技有限公司; 深圳华大基因研究院
Priority date: 2011-11-29
Filing date: 2011-11-29
Publication date: 2013-06-06

Abstract

Disclosed are a method and a device for gap closure in nucleotide sequence assembly, one end of a gap having a first contig, and the other end of the gap having a second contig. The method comprises the following steps of: finding a nucleotide sequence at one end of the first contig close to the gap and taking the nucleotide sequence as a node read; using all reads that overlap the node read as a gap closure read set, and selecting from the set a read as a seed read; splicing the seed read and the first contig, so as to form a new first contig; determining whether one end of the new first contig after the extension process close to the gap overlaps one end of the second contig close to the gap; if the two ends do not overlap, performing gap closure continuously on the basis of the new first contig; and if the two ends overlap, connecting the first contig and the second contig, so as to complete the gap closure.

Description

Method and device for filling holes in nucleic acid sequence assembly

[Technical Field]

The invention relates to the field of genetic engineering technology, in particular to a method and device for filling holes in nucleic acid sequence assembly.

[Background Art]

In the field of gene sequencing, the spread of second-generation sequencing technology has steadily lowered sequencing costs, driving whole-genome sequencing of more and more species. By its working principle, however, second-generation sequencing produces short fragments. In practice, the sequenced fragments are only tens to hundreds of bases long, which increases the difficulty of analyzing the sequencing data.

Sequencing data are generally analyzed by genome assembly. Genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships between the non-repeat regions. The unassembled regions between the non-repeat regions tend to form gaps, which are called holes.

In the prior art, whether genome assembly is based on Sanger sequencing technology or on second-generation sequencers such as Solexa, the initial assembly contains a large number of unassembled regions, and these regions are often closely related to sequence repeats. The repeats associated with holes can be divided into tandem repeats and transposon repeats. Prior-art hole-filling programs can handle simple transposon repeats relatively accurately, but have difficulty with long tandem repeats.

In terms of assembly methods, the prior art offers two main approaches to the long tandem repeat problem. The first is local assembly based on overlap; the second is local assembly based on a de Bruijn graph.

Overlap-based local assembly has difficulty identifying the exact site that causes a conflict, so it easily introduces insertions/deletions (indels).

Local assembly based on a de Bruijn graph can identify the conflicting sites caused by repeats, but it has difficulty resolving the conflict and must break the sequence there, which affects the number of holes that are filled.

Clearly, neither method handles long tandem repeats effectively.

In terms of assembly tools, the prior art offers two main hole-filling programs: the Gapcloser program, which corresponds to overlap-based local assembly, and the SOAPdenovo program, which is based on local assembly with de Bruijn graphs.

However, both programs have drawbacks:

First, the hole-filling software Gapcloser performs local assembly based on overlaps between base sequence fragments. Because it does not take into account the complexity of the situation inside the hole, it is prone to errors when processing complex holes, which reduces overall accuracy. In addition, Gapcloser consumes a large amount of memory and time, so it is not suitable for filling the holes of an initial large-genome assembly.

Second, hole filling in the SOAPdenovo assembly software performs a secondary assembly using a de Bruijn graph built from the region inside the hole. Although it can effectively resolve shorter holes, the number of holes it can fill is limited.

In summary, how to fill nucleic acid sequence holes effectively, improving both the accuracy and the fill rate while saving time and memory, is one of the research directions in the field of gene sequencing.

[Summary of the Invention]

The technical problem to be solved by the present invention is to provide a method and device for filling holes in nucleic acid sequence assembly that can fill nucleic acid sequence holes effectively and improve both the accuracy and the fill rate of hole filling.

In order to solve the above technical problem, one technical solution adopted by the present invention is to provide a method for filling holes in nucleic acid sequence assembly, wherein one end of the hole has a first contig and the other end has a second contig, the method including the following steps. Determining a node read: finding a nucleic acid sequence at the end of the first contig close to the hole and taking it as the node read. Selecting a hole-filling read set: finding, among the reads used for filling the hole, all reads that overlap the node read, and taking them as the hole-filling read set. Selecting a seed read: selecting one read from the hole-filling read set as the seed read. Extension processing: splicing the seed read with the first contig to form a new first contig. Judging processing: judging whether the end of the new first contig close to the hole, after the extension processing, overlaps the end of the second contig close to the hole. Loop processing: if the two ends do not overlap, repeating the steps of determining the node read, selecting the hole-filling read set, selecting the seed read, and extension processing on the basis of the new first contig. Completing the hole filling: if the two ends overlap, connecting the first contig and the second contig to complete the hole filling.

The step of selecting the hole-filling read set includes: determining whether the node read and the reads used for filling the hole share a common fixed-length short string, and using pattern recognition between reads that share such a string to determine the set of all reads that overlap.

The step of performing pattern recognition between reads that share a common fixed-length short string includes: obtaining the overlap length between the reads by sliding a window and extending it stepwise.

The step of determining the node read further includes: using the frequency of sliding windows in the hole-filling reads within the hole, and the distance between occurrences of the same sliding window, to identify whether a tandem repeat exists in the hole, and then inferring the repeat pattern of the sequence inside the hole from the tandem repeat.
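As an illustration of the sliding-window idea only, the following minimal Python sketch counts how often the same window recurs within a read and at what distance, and takes the most frequent distance as the repeat period. The function name `infer_tandem_period`, the window size, and the `min_support` threshold are hypothetical choices and are not taken from the patent.

```python
from collections import Counter


def infer_tandem_period(reads, window=4, min_support=3):
    """Guess a tandem-repeat period from recurring sliding windows.

    For every read, each window of `window` bases is compared with the
    previous occurrence of the same window in that read; the distance
    between the two occurrences is tallied. The most frequent distance is
    returned as the period if it is seen at least `min_support` times
    (an assumed threshold), otherwise None.
    """
    distances = Counter()
    for read in reads:
        last_seen = {}  # window string -> most recent start position
        for pos in range(len(read) - window + 1):
            w = read[pos:pos + window]
            if w in last_seen:
                distances[pos - last_seen[w]] += 1
            last_seen[w] = pos
    if not distances:
        return None
    period, support = distances.most_common(1)[0]
    return period if support >= min_support else None
```

For a read made of four copies of a 5-base unit the sketch reports a period of 5, and for a read with no recurring window it reports None.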

The step of selecting a seed read includes: filtering the hole-filling reads by alignment rate, that is, selecting a hole-filling read with a 100% alignment rate as the seed read.

The step of selecting a seed read includes: recognizing and handling short similar repeats in the hole-filling reads, that is, when a short similar repeat is identified, selecting a hole-filling read with a long overlap as the seed read.

The step of selecting the hole-filling read set includes: filtering the hole-filling reads by position, that is, locating the reads according to the paired-end relationship, calculating the position of each hole-filling read within the hole, and filtering the reads according to that position.
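The position filter can be sketched as follows, under assumptions the text does not state: the mate of each read is anchored in forward orientation on the first contig, the insert size is measured from the start of the anchored mate to the end of the other read, and a fixed tolerance absorbs insert-size variation. All names and the tolerance value are hypothetical.

```python
def estimate_position_in_hole(mate_start, contig_len, insert_size, read_len):
    """Start of the unanchored read, measured from the first base of the hole.

    `mate_start` is the start of the anchored mate on the first contig;
    the hole is assumed to begin right after the contig's last base.
    """
    read_start = mate_start + insert_size - read_len
    return read_start - contig_len


def position_filter(pairs, contig_len, hole_len, insert_size, tolerance=50):
    """Keep reads whose estimated position touches the hole.

    `pairs` holds (name, mate_start, read_len) tuples. A read is kept when
    its estimated start lies between -(read_len + tolerance) and
    hole_len + tolerance; the tolerance is an assumed parameter.
    """
    kept = []
    for name, mate_start, read_len in pairs:
        pos = estimate_position_in_hole(mate_start, contig_len, insert_size, read_len)
        if -read_len - tolerance <= pos <= hole_len + tolerance:
            kept.append((name, pos))
    return kept
```

With a 1000 bp contig, a 300 bp hole, and a 500 bp insert, a 100 bp read whose mate starts at contig position 900 is placed 300 bp into the hole and kept, while reads estimated to lie well before or beyond the hole are dropped.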

The step of selecting the hole-filling read set includes: filtering the hole-filling reads by length, that is, selecting short paired-end reads in the inner region of the hole and long single-end reads at the two ends of the hole.

Wherein, before the step of connecting the first contig and the second contig when the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, the method includes: judging against the estimated hole length whether the overlap has occurred prematurely; if so, the step of connecting the first contig and the second contig is not performed, and the steps of determining a node read, selecting a hole-filling read set, and selecting a seed read continue on the basis of the new first contig, wherein in the step of selecting a seed read, a non-overlapping read other than those in the hole-filling read set is selected as the seed read.

The step of completing the hole filling includes: performing sequence connection, wherein the sequence connection is a direct connection of the two contigs, a connection of a sequence extended from one end with the contig at the other end, or a connection of sequences extended from both ends.

Before the step of performing the sequence connection, the method includes: judging the credibility of the accuracy of the sequence connection. When the first credibility exists, the first credibility is selected for sequence connection; when the first credibility does not exist but the second credibility does, the second credibility is selected for sequence connection; when neither the first nor the second credibility exists but the third credibility does, the third credibility is selected for sequence connection. The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span and support the join; the second credibility means that reads span the join between the two connected sequences, although the two sequences may not overlap; the third credibility means that the two connected sequences overlap but no evidence supports the overlap region.
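The three-level credibility rule can be read as a simple priority order. The sketch below is one possible encoding; the `Join` record and its fields (`has_overlap`, `in_repeat`, `spanning_reads`) are hypothetical names introduced for illustration, not structures described in the patent.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Join:
    """A candidate connection between two sequences (hypothetical record)."""
    has_overlap: bool      # the two sequences overlap
    in_repeat: bool        # the overlap lies in a repeat
    spanning_reads: int    # number of reads that span the join


def credibility(join: Join) -> Optional[int]:
    """Return 1, 2, or 3 for the first, second, or third credibility, else None."""
    if join.has_overlap and not join.in_repeat and join.spanning_reads > 0:
        return 1  # overlap, not a repeat, supported by spanning reads
    if join.spanning_reads > 0:
        return 2  # spanning reads exist; the sequences may not overlap
    if join.has_overlap:
        return 3  # overlap with no supporting evidence
    return None


def choose_join(candidates: Sequence[Join]) -> Optional[Join]:
    """Pick the candidate with the best (lowest-numbered) credibility level."""
    ranked = [(level, i) for i, j in enumerate(candidates)
              if (level := credibility(j)) is not None]
    if not ranked:
        return None
    return candidates[min(ranked)[1]]
```

Ties within a level are broken here by input order, which is an arbitrary choice of the sketch.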

In order to solve the above technical problem, another technical solution adopted by the present invention is to provide a method for filling holes in nucleic acid sequence assembly, wherein one end of the hole has a first contig and the other end has a second contig, the method including the following steps. Determining node reads: finding a nucleic acid sequence at the end of the first contig close to the hole as a first node read, and finding a nucleic acid sequence at the end of the second contig close to the hole as a second node read. Selecting hole-filling read sets: finding, among the reads used for filling the hole, all reads that overlap the first node read as a first hole-filling read set, and all reads that overlap the second node read as a second hole-filling read set. Selecting seed reads: selecting one read from the first hole-filling read set as a first seed read, and one read from the second hole-filling read set as a second seed read. Extension processing: splicing the first seed read with the first contig to form a new first contig, and splicing the second seed read with the second contig to form a new second contig. Judging processing: judging whether the end of the new first contig close to the hole, after the extension processing, overlaps the end of the new second contig close to the hole. Loop processing: if the two ends do not overlap, repeating the steps of determining the node reads, selecting the hole-filling read sets, selecting the seed reads, and extension processing on the basis of the new first and second contigs. Completing the hole filling: if the two ends overlap, connecting the first contig and the second contig to complete the hole filling.

In order to solve the above technical problem, another technical solution adopted by the present invention is to provide a device for filling holes in nucleic acid sequence assembly, the device comprising: a determining unit, configured to determine a node read, wherein one end of the hole has a first contig and the other end has a second contig, and a nucleic acid sequence found at the end of the first contig close to the hole is taken as the node read; a first selection unit, configured to select a hole-filling read set by finding, among the reads used for filling the hole, all reads that overlap the node read; a second selection unit, configured to select one read from the hole-filling read set as the seed read; an extension unit, configured to perform extension processing by splicing the seed read with the first contig to form a new first contig; a first judging unit, configured to judge whether the end of the new first contig close to the hole, after the extension processing, overlaps the end of the second contig close to the hole; a loop unit, configured to perform loop processing, in which, if the two ends do not overlap, the determination of the node read, the selection of the hole-filling read set, the selection of the seed read, and the extension processing continue on the basis of the new first contig; and a connection unit, configured to connect the first contig and the second contig when the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, completing the hole filling.

The device includes a second judging unit, configured to determine whether the node read and the reads used for filling the hole share a common fixed-length short string, and to use pattern recognition between reads that share such a string to determine the set of all reads that overlap.

The second judging unit is further configured to obtain the overlap length between reads that share a common fixed-length short string by sliding a window and extending it stepwise.

The device includes an identifying unit, configured to use the frequency of sliding windows in the hole-filling reads within the hole, and the distance between occurrences of the same sliding window, to identify whether a tandem repeat exists in the hole, and then to infer the repeat pattern of the sequence inside the hole from the tandem repeat.

The second selection unit is further configured to filter the hole-filling reads by alignment rate, that is, to select a hole-filling read with a 100% alignment rate as the seed read.

The second selection unit is further configured to recognize and handle short similar repeats in the hole-filling reads, that is, when a short similar repeat is recognized, to select a hole-filling read with a longer overlap as the seed read.

The first selection unit is further configured to filter the hole-filling reads by position, that is, to locate the reads according to the paired-end relationship, calculate the position of each hole-filling read within the hole, and filter the reads according to that position.

The first selection unit is further configured to filter the hole-filling reads by length, that is, to select short paired-end reads in the inner region of the hole and long single-end reads at the two ends of the hole.

The device includes a third judging unit, configured to judge against the estimated hole length whether the end of the new first contig close to the hole has overlapped the end of the second contig close to the hole prematurely; if so, the first contig and the second contig are not connected, hole filling continues on the basis of the new first contig, and when the seed read is selected, a non-overlapping read other than those in the hole-filling read set is selected as the seed read.

The connection unit is specifically configured, when the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, to connect the contigs at the two ends directly, to connect a sequence extended from one end with the contig at the other end, or to connect sequences extended from both ends.

The device includes a fourth judging unit, configured to judge the credibility of the accuracy of the sequence connection when the connection unit connects sequences. When the first credibility exists, the first credibility is selected so that the connection unit performs the sequence connection; when the first credibility does not exist but the second credibility does, the second credibility is selected; when neither the first nor the second credibility exists but the third credibility does, the third credibility is selected. The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span and support the join; the second credibility means that reads span the join between the two connected sequences, although the two sequences may not overlap; the third credibility means that the two connected sequences overlap but no evidence supports the overlap region.

In order to solve the above technical problem, another technical solution adopted by the present invention is to provide a device for filling holes in nucleic acid sequence assembly, the device comprising: a determining unit, configured to determine node reads, wherein one end of the hole has a first contig and the other end has a second contig, a nucleic acid sequence found at the end of the first contig close to the hole is taken as a first node read, and a nucleic acid sequence found at the end of the second contig close to the hole is taken as a second node read; a first selection unit, configured to select hole-filling read sets by finding, among the reads used for filling the hole, all reads that overlap the first node read as a first hole-filling read set and all reads that overlap the second node read as a second hole-filling read set; a second selection unit, configured to select one read from the first hole-filling read set as a first seed read and one read from the second hole-filling read set as a second seed read; an extension unit, configured to perform extension processing by splicing the first seed read with the first contig to form a new first contig and splicing the second seed read with the second contig to form a new second contig; a first judging unit, configured to judge whether the end of the new first contig close to the hole, after the extension processing, overlaps the end of the new second contig close to the hole; a loop unit, configured to perform loop processing, in which, if the two ends do not overlap, the determination of the node reads, the selection of the hole-filling read sets, the selection of the seed reads, and the extension processing continue on the basis of the new first and second contigs; and a connection unit, configured to connect the first contig and the second contig when the end of the new first contig close to the hole overlaps the end of the new second contig close to the hole, completing the hole filling.

The beneficial effects of the invention are as follows. In contrast to the prior art, in which the number of holes that can be filled is limited and the accuracy of hole filling is low, the present invention finds a nucleic acid sequence on the first contig at one end of the hole as the node read, finds the read having the smallest overlap with the node read and takes it as the seed read, and splices the seed read with the first contig so that the first contig is extended into a new first contig. It then judges whether the end of the new first contig close to the hole overlaps the end of the second contig close to the hole; if not, the extension is repeated until the two ends overlap, whereupon the first contig and the second contig are connected and the hole filling is complete. Extending the contig in this way not only greatly increases the number of holes filled and the efficiency of hole filling, but also improves its accuracy and saves hole-filling resources.

[Description of the Drawings]

Figure 1 is a flow chart of a first embodiment of the method for filling holes in nucleic acid sequence assembly of the present invention;

Figure 2 is a flow chart of a second embodiment of the method for filling holes in nucleic acid sequence assembly of the present invention;

Figure 3 is a schematic structural view of an embodiment of the device for filling holes in nucleic acid sequence assembly of the present invention.

The Chinese and English terms used in this document are compared and defined as follows:

English term	Translated term	Definition
PE read	Double-end (paired-end) read	The sequences of the two ends of a longer DNA fragment, obtained by sequencing a double-end library; the library construction method also gives the distance between the two end sequences.
Read	Reading order	Base sequence generated during sequencing.
Block	Window	An artificially selected stretch of nucleotide sequence of a certain length on a DNA sequence.
Contig	Contiguous group	A linear, ordered sequence assembled from reads according to their overlap relationships.
Overlap	Overlapping	The part that two sequences have in common during sequence splicing.
Kmer	Fixed-length string	A DNA sequence of length K; K is usually taken as 17.
Single read	Single-end read	Sequence information obtained mainly by the Sanger sequencing method, which yields the sequence of one end of a longer DNA fragment or the sequence of a shorter fragment.
Scaffold	Connecting bracket	The result of linking contigs using linkage information from plasmids, BACs, mRNA, or other sources of double-end reads, in which the contigs are ordered and oriented.
Gap	Hole	Genome assembly usually first masks the repeat regions and then, with the aid of double-end reads (PE reads), determines the relationships between the non-repeat regions; the unassembled region between non-repeat regions forms a gap, called the hole region.
Repeat	Sequence repeat	A nucleotide sequence that occurs repeatedly in the genome sequence.
Indel	Insertion/deletion	The insertion or deletion of a sequence that alters the structure of the DNA sequence.

[Detailed Description]

The invention will now be described in detail in conjunction with the drawings and embodiments.

Referring to Figure 1, Figure 1 is a flow chart of a first embodiment of the method for filling holes in nucleic acid sequence assembly of the present invention. In this method, one end of the hole has a first contig and the other end has a second contig, and the method includes the following steps:

Step 101: Determine a node read: find a nucleic acid sequence at the end of the first contig close to the hole and take it as the node read;

Step 102: Select a hole-filling read set: find, among the reads used for filling the hole, all reads that overlap the node read, and take them as the hole-filling read set;

Wherein, the step of selecting the hole-filling read set involves the read overlap calculation described below (see step B2 on the next page);

Step 103: Select a seed read: select one read from the hole-filling read set as the seed read;

In this embodiment, the read having the smallest overlap with the node read is selected as the seed read. Of course, in other embodiments the read with the smallest overlap need not be selected; for example, a read with a slightly longer overlap may be selected as the seed read;

Step 104: Extension processing: splice the seed read with the first contig to form a new first contig;

Step 105: Judging processing: judge whether the end of the new first contig close to the hole, after the extension processing, overlaps the end of the second contig close to the hole;

Step 106: Loop processing: if the end of the new first contig close to the hole does not overlap the end of the second contig close to the hole, repeat the steps of determining the node read, selecting the hole-filling read set, selecting the seed read, and extension processing on the basis of the new first contig;

Step 107: Complete the hole filling: if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, connect the first contig and the second contig to complete the hole filling.
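Steps 101 to 107 can be sketched as a greedy loop. The Python below is a minimal illustration under simplifying assumptions that are not in the patent: reads are error-free and already oriented like the first contig, overlaps are exact suffix-prefix matches, and the node length, minimum overlap, and round limit (`node_len`, `min_overlap`, `max_rounds`) are hypothetical parameters.

```python
def suffix_prefix_overlap(left, right, min_len=1):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def fill_hole(contig1, contig2, reads, node_len=8, min_overlap=4, max_rounds=1000):
    """Extend contig1 toward contig2; return the joined sequence or None."""
    current = contig1
    for _ in range(max_rounds):
        # Step 101: the node read is the hole-side end of the current contig.
        node = current[-node_len:]
        # Step 102: collect reads that overlap the node and reach past it.
        candidates = []
        for read in reads:
            ov = suffix_prefix_overlap(node, read, min_overlap)
            if ov and len(read) > ov:
                candidates.append((ov, read))
        if not candidates:
            return None  # no read extends the contig; the hole stays open
        # Step 103: the seed read is the candidate with the smallest overlap.
        ov, seed = min(candidates)
        # Step 104: splice the non-overlapping part of the seed onto the contig.
        current += seed[ov:]
        # Steps 105 and 107: connect once the new end overlaps the second contig.
        joined = suffix_prefix_overlap(current, contig2, min_overlap)
        if joined:
            return current + contig2[joined:]
        # Step 106: otherwise loop again on the new first contig.
    return None
```

On a 30-base toy sequence whose middle 10 bases form the hole, three overlapping 10-base reads are enough for the loop to rebuild the full sequence in three rounds; with no reads, the function returns None.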

As can be seen from the above, in contrast to the prior art, in which the number of holes that can be filled is limited and the accuracy of hole filling is low, the present invention finds a nucleic acid sequence on the first contig at one end of the hole as the node read, finds the read having the smallest overlap with the node read and takes it as the seed read, and splices the seed read with the first contig so that the first contig is extended into a new first contig. It then judges whether the end of the new first contig close to the hole overlaps the end of the second contig close to the hole; if not, the extension is repeated until the two ends overlap, whereupon the first contig and the second contig are connected and the hole filling is complete. Extending the contig in this way not only greatly increases the number of holes filled and the efficiency of hole filling, but also improves its accuracy and saves hole-filling resources.

In other embodiments, the step of selecting the hole-filling read set includes: determining whether the node read and the reads used for filling the hole share a common fixed-length short string (Kmer), and using pattern recognition between reads that share one to determine the set of all overlapping reads. This approach quickly finds all overlapping reads. Of course, the common fixed-length short string approach need not be used; other approaches are not described here.

In the step of determining whether the node read and the reads used for filling the hole share a common fixed-length short string, an algorithm such as hashing may be used. The pattern recognition step may obtain the overlap length between reads that share a common fixed-length short string by sliding a window and extending it stepwise.
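A minimal sketch of this two-stage search, assuming exact matching: a hash table of Kmers proposes candidate reads and their alignment offset, and the implied overlap is then verified one window at a time. The function names are hypothetical, and using the Kmer length as the window size is a simplification of this sketch rather than something the patent specifies.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Hash every Kmer of every read to the (read index, offset) pairs where it occurs."""
    index = defaultdict(list)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((i, pos))
    return index


def overlap_length(node, read, node_pos, read_pos, window):
    """Verify the overlap implied by a shared Kmer, window by window.

    The shared Kmer fixes where the read sits relative to the node; the
    overlapping region is compared in steps of `window` bases and its
    length is returned, or 0 as soon as any window mismatches.
    """
    shift = node_pos - read_pos           # read start in node coordinates
    start = max(0, shift)
    end = min(len(node), shift + len(read))
    for lo in range(start, end, window):
        hi = min(lo + window, end)
        if node[lo:hi] != read[lo - shift:hi - shift]:
            return 0
    return end - start


def find_overlapping_reads(node, reads, k):
    """Map read index -> overlap length for reads sharing a Kmer with the node."""
    index = build_kmer_index(reads, k)
    best = {}
    for node_pos in range(len(node) - k + 1):
        for read_idx, read_pos in index.get(node[node_pos:node_pos + k], ()):
            length = overlap_length(node, reads[read_idx], node_pos, read_pos, k)
            if length > best.get(read_idx, 0):
                best[read_idx] = length
    return best
```

Only reads that share at least one Kmer with the node are ever compared base by base, which is where the speed-up over all-against-all comparison comes from.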

In the embodiment of the present invention, the level of a hole is obtained from the size of the hole and a judgment standard set by the system. Nucleic acid sequence holes are divided into small, middle, and large holes, and each hole is filled according to its level using the corresponding base sequence fragments. The levels are defined as follows: a hole shorter than 100 bp is a small hole, a hole between 100 bp and 1.5 kb long is a middle hole, and a hole longer than 1.5 kb is a large hole. Of course, this is only one possible definition; the size limits of each level are merely exemplary and are not limiting.
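The exemplary thresholds translate directly into a small classifier. In this sketch the function name is hypothetical, and treating exactly 100 bp and exactly 1.5 kb as middle holes is an assumption, since the text does not say how the boundaries are handled.

```python
SMALL_MAX_BP = 100     # holes shorter than this are small
MIDDLE_MAX_BP = 1500   # holes longer than this are large


def classify_hole(length_bp):
    """Return 'small', 'middle', or 'large' for a hole length in base pairs."""
    if length_bp < SMALL_MAX_BP:
        return "small"
    if length_bp <= MIDDLE_MAX_BP:
        return "middle"
    return "large"
```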

In a specific implementation, middle holes are filled with the first embodiment of the hole-filling method described above, while small and large holes are handled as follows:

1. Filling a hole at the small-hole level includes the following steps:

1) searching for the reads that fall within the small hole and are used to determine its length, and calculating the actual hole length of the small hole from those reads;

2) determining whether the actual hole length is greater than a first threshold preset by the system and, if so, using the bases of the reads to complete the hole filling of the small hole.

2. Filling a hole at the large-hole level includes the following steps:

1) obtaining the positions of the reads within the large hole according to the paired-end relationship, sorting the reads by position, and assigning reads that cover a continuous stretch to one block according to their positions;

2) assembling within each block in the way middle holes are treated;

3) connecting the assembly results of the blocks to obtain the sequence within the large hole.
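Step 1) of the large-hole procedure, grouping position-sorted reads into blocks of continuous coverage, can be sketched as below. The representation of a read as a (start, length) pair and the choice to treat directly adjacent reads as continuous are assumptions of this sketch.

```python
def group_into_blocks(reads):
    """Group reads into blocks whose coverage of the hole is continuous.

    `reads` is a list of (start, length) pairs giving each read's estimated
    position within the hole. Reads are sorted by start; a read joins the
    current block if it begins at or before the block's current end,
    otherwise it opens a new block.
    """
    blocks = []
    block_end = 0
    for start, length in sorted(reads):
        end = start + length
        if blocks and start <= block_end:
            blocks[-1].append((start, length))
            block_end = max(block_end, end)
        else:
            blocks.append([(start, length)])
            block_end = end
    return blocks
```

Each resulting block can then be assembled on its own in the way middle holes are treated, and the block results connected in position order.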

For a more detailed description of the above steps, please refer to the following one by one.

First, a scaffold that forms a gene sequence hole is acquired and analyzed. Among them, the original scaffold is broken to form a contig, and the gap between the two contigs is a hole. In the embodiment of the present invention, by reading the contig for the hole filling, the size of the hole and the contiguous group before and after the hole can be accurately obtained. Moreover, it is also possible to simultaneously acquire the contig length and sequence information, as well as the information of the holes before and after the contig.

In a specific implementation, the embodiment of the present invention further partitions all acquired nucleic acid sequence holes and contigs according to the user's settings, and stores the associated contigs and reads in corresponding folders. For example, if the user sets 4 folders, the acquired holes and contigs are divided into 4 parts, 4 folders are generated, and the associated contigs and reads are stored one by one in the partitioned folders. After this partitioning, each folder contains the contigs and reads needed to fill its holes, so that during subsequent hole filling they can be obtained directly from the corresponding folder. In this example, the partitioning reduces the memory required to roughly a quarter of the original, saving space, and shortens the search during hole filling, thereby reducing the time spent filling holes.

Then, the reads used for hole filling are read in for each nucleic acid sequence hole. In the embodiment of the present invention, most of the reads used for hole filling are paired-end (PE) reads from Solexa sequencing; the remainder are long single-end reads from Sanger sequencing.

The two reads of a PE pair support each other: they come from the two ends of one insert, and the inserts used for hole filling are generally 180 bp, 500 bp and 800 bp long. In the embodiment of the present invention, with high-depth sequencing, an insert can be reconstructed from the overlaps among multiple PE reads. Therefore, for a nucleic acid sequence hole, if a read overlaps the contig at one end of the hole, its direction is consistent with that of the contig, and it is a PE read, then its mate falls within the hole or on a contig flanking the hole, and the hole can be filled.

As for long reads, a long read can span a nucleic acid sequence hole of small length. If every base of the long read is reliable, the base at each position of the long read can be used to fill such a short hole exactly.

In the embodiment of the present invention, the following are obtained for each read acquired for a nucleic acid sequence hole: the positional relationship between the read and the hole, the contig and scaffold to which the read belongs, and the sequence of the read itself.

Based on the hole levels above, the hole-filling process comprises: A, filling small holes; B, filling middle holes; and C, filling large holes. The process for each level is described below.

A. For a small hole, first find all reads that fall within the small hole and analyze them, looking for reads that overlap the contigs on both sides of the hole. These reads are used to calculate the actual hole length: because such a read falls in the hole and overlaps the contigs on both sides, removing the parts that overlap the two contigs leaves exactly the sequence within the hole. Specifically, a hole length is calculated for each read that spans the hole, and a frequency table of these lengths is built to represent the range of possible hole lengths. A frequency table is needed because joining errors can make different reads imply different hole lengths when they are connected to the contigs. The hole length with the highest frequency in the table is selected as the actual hole length.
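The frequency-table step can be illustrated with a short sketch. It assumes each spanning read has already been aligned, so that its length and its overlap with the left and right contigs are known; the function name and the tuple layout are hypothetical.

```python
from collections import Counter


def actual_hole_length(spanning_reads):
    """Pick the most frequent hole length implied by the spanning reads.

    spanning_reads: iterable of (read_len, left_overlap, right_overlap),
    a hypothetical input layout. Each read implies a hole length equal to
    the part of the read that overlaps neither contig.
    """
    lengths = [rl - lo - ro for rl, lo, ro in spanning_reads]
    if not lengths:
        return None  # no spanning read: fall back to the middle-hole method
    table = Counter(lengths)  # frequency table of implied hole lengths
    return table.most_common(1)[0][0]
```

A negative result indicates that the two contigs themselves overlap. Ties between equally frequent lengths are resolved arbitrarily in this sketch.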

After the actual hole length is obtained, if it is greater than the first threshold preset by the system (for example, 0), the bases in the hole portion of the reads that imply this length may be the true bases of the hole, and all reads that imply the actual hole length are analyzed base by base to determine the base at each position. If the actual hole length is less than the first threshold (for example, 0), it is determined that the ends of the two contigs overlap. It is then further determined whether the overlap is a repeat: if so, the repeat pattern is determined; otherwise, the overlapping part is trimmed from the end of a contig.
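One simple way to determine the base at each position is a per-column majority vote over the hole portions of the supporting reads. The text does not specify the voting rule, so the majority rule and the function name below are assumptions made purely for illustration.

```python
from collections import Counter


def consensus(hole_segments):
    """Majority base at each position across equal-length hole segments.

    hole_segments: the in-hole portions of the reads that imply the
    chosen hole length (hypothetical input; majority voting is assumed).
    """
    if not hole_segments:
        return ""
    length = len(hole_segments[0])
    if any(len(s) != length for s in hole_segments):
        raise ValueError("segments must have equal length")
    # zip(*segments) iterates over alignment columns.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*hole_segments)
    )
```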

In a specific implementation, because few reads span a small hole, the reliability of the bases in the reads used to determine the hole length constrains whether those reads can be used to fill the hole. To ensure the accuracy of the sequence filled into the hole, this embodiment also searches for other reads that fall within the small hole without spanning it, and aligns them against the reads used to determine the hole length. If the alignment mismatch rate is below the tolerance (usually 3%), every base of the in-hole portion of the length-determining read is considered reliable and can be used to fill the hole. If the mismatch rate exceeds the tolerance (usually 3%), those bases are considered unreliable and the unreliable bases are trimmed. This ensures the accuracy of the sequence filled into small holes.

In the embodiment of the present invention, not every small hole has a read that can be used to determine its length. When no such read can be found, the small hole is processed with the middle-hole filling method of the embodiment of the present invention, described below.

B. Middle holes are processed as in the first embodiment of the hole-filling method in nucleic acid sequence assembly of the present invention. The specific implementation is as follows:

B1) Read-based repeat identification. All possible blocks are extracted from the reads within the hole. In the embodiment of the present invention, the block size is set to 6 bp or 12 bp. A block is a window that contains a fixed number of bases and slides along the read one base at a time. Specifically, suppose the window contains X bases: initially the window covers bases 1 to X; after the first slide it covers bases 2 to X+1; and so on, moving forward one base per slide. After the nth slide, the window covers bases n+1 to X+n.
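The sliding window can be sketched as a generator. The name `sliding_blocks` is hypothetical, and offsets are 0-based here, whereas the text counts bases from 1.

```python
def sliding_blocks(read, size=6):
    """Yield (offset, block) for every window of `size` bases.

    The window advances one base per step, so offset n (0-based)
    corresponds to bases n+1 .. n+size in the 1-based wording above.
    """
    for offset in range(len(read) - size + 1):
        yield offset, read[offset:offset + size]
```

A read shorter than the block size yields no blocks.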

In a specific implementation, to identify tandem repeats, the embodiment of the present invention records the block frequency (block_freq) and the distance between identical blocks (block_dis) for analysis. If block_freq reaches its maximum at a certain value of block_dis, and that block_dis equals the number of bases in the block, it is determined that a tandem repeat exists in the sequence.
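A minimal sketch of this test follows, assuming that block_dis is measured between consecutive occurrences of the same block within one sequence and that block_freq is the number of times each distance is observed. Both interpretations, and the function names, are assumptions rather than details given in the text.

```python
from collections import Counter, defaultdict


def block_distance_table(seq, size=6):
    """Count how often each distance separates consecutive identical blocks."""
    positions = defaultdict(list)
    for i in range(len(seq) - size + 1):
        positions[seq[i:i + size]].append(i)
    table = Counter()  # block_dis -> block_freq
    for offsets in positions.values():
        for a, b in zip(offsets, offsets[1:]):
            table[b - a] += 1
    return table


def has_tandem_repeat(seq, size=6):
    """True if the most frequent block distance equals the block size."""
    table = block_distance_table(seq, size)
    if not table:
        return False  # no block occurs twice
    peak_dis, _ = table.most_common(1)[0]
    return peak_dis == size
```

For example, five copies of a 6 bp unit give a single peak at distance 6, while a sequence with no repeated block gives an empty table.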

Moreover, the embodiment of the present invention further infers the tandem repeat mode from the information obtained while identifying tandem repeats: if the sequence contains only one tandem repeat, it is determined to be a single tandem mode; if it contains multiple tandem repeats, whether interleaved or not, it is determined to be a multiple tandem mode.

In a specific implementation, to identify repeats, the embodiment of the present invention also records the block frequency, calculates the expected depth of a block in the hole, and analyzes the distribution of block depths in the hole. If the frequency of a block in the hole is a multiple of the expected block depth, the block is judged to be a repeat.

B2) Calculation of read overlaps. The overlap calculation first uses hashing to quickly determine whether two reads share a common kmer; reads that share a kmer may overlap. A kmer is a contiguous sequence of bases of length k. In a genome, the kmer distribution is closely related to genome size, error rate and heterozygosity. Pattern recognition is then performed on each pair of reads that may overlap.
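The hashing step can be sketched with a dictionary that maps each kmer to the reads containing it. The function name, the default k, and the choice to ignore reverse complements are assumptions made to keep the illustration short.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=5):
    """Return index pairs (i, j), i < j, of reads that share at least one kmer.

    Only the forward strand is indexed; reverse complements are ignored
    in this sketch.
    """
    index = defaultdict(set)  # kmer -> indices of reads containing it
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the returned pairs need the more expensive detailed comparison; all other pairs are skipped.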

In a specific implementation, a maximum overlap length is set first and this region is divided into several blocks. A block is taken from the front end of one read and searched for inside the other read. If it is found, a detailed comparison is made to obtain the overlap length; if it is not found, the next block is read. In the embodiment of the present invention, to tolerate errors (that is, up to three mismatches are allowed between the overlapping bases of two reads), the number of blocks can be increased appropriately.
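The detailed comparison can be illustrated as follows. Note that this sketch replaces the block-seeded search described above with a plain scan over all candidate overlap lengths, which is simpler but slower. The function name and the `min_overlap` parameter are hypothetical; the default of three mismatches follows the text.

```python
def overlap_length(left, right, min_overlap=5, max_mismatches=3):
    """Longest suffix of `left` matching a prefix of `right`.

    Up to `max_mismatches` substitutions are tolerated; indels are not
    handled. Assumes min_overlap >= 1. Returns 0 if no overlap qualifies.
    """
    longest = min(len(left), len(right))
    for n in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(left[-n:], right[:n]))
        if mismatches <= max_mismatches:
            return n
    return 0
```

A minimum overlap is needed because very short overlaps trivially pass a fixed mismatch allowance.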

B3) Identification of extension conflicts. In the embodiment of the present invention, the nucleic acid sequence within the hole is extended from both ends.

For front-end extension, starting from the node read, find all reads that overlap the node read and select the one with the smallest overlap with the node read as the seed read. All the other reads should then overlap the seed read, and these overlaps are necessarily longer than the overlap between the seed read and the node read. If some read does not overlap the seed read, a conflict is declared; it is resolved by replacing the seed read, that is, by searching for a new seed read. This ensures that the seed read found is correct.

The extension then proceeds. The seed read is treated as part of the contig, and a new seed read is sought as described above. If one is found, the sequence continues to extend. Otherwise, the extension at this end is finished, and it is necessary to wait for the extension from the other end before determining whether the two extensions overlap and hence whether the hole can be filled. In some cases, a forward extension may select as its next read a read that previously served as a seed read, which would cause an infinite loop within that region; this is treated as a conflict, and the extension is terminated when the extending read is a previously used read. Back-end extension in the embodiment of the present invention is similar to front-end extension and is not described in detail here.
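Front-end extension with smallest-overlap seed selection and the loop check can be sketched as below. This is a simplified illustration under several assumptions: overlaps are exact, the node read is taken as the last `node_len` bases of the contig, and the check that every other candidate also overlaps the seed read is omitted. All names are hypothetical.

```python
def suffix_prefix_overlap(left, right, min_overlap):
    """Longest exact overlap between a suffix of left and a prefix of right."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def extend_front(contig, reads, min_overlap=4, node_len=8):
    """Extend the contig rightwards, one seed read at a time."""
    used = set()  # indices of reads already used as seed reads
    while True:
        node = contig[-node_len:]  # node read at the contig end
        candidates = []
        for i, read in enumerate(reads):
            n = suffix_prefix_overlap(node, read, min_overlap)
            if 0 < n < len(read):  # the read must add new bases
                candidates.append((n, i))
        if not candidates:
            break  # nothing left to extend with
        n, seed = min(candidates)  # smallest overlap with the node read
        if seed in used:
            break  # a previous seed read came back: stop to avoid a loop
        used.add(seed)
        contig += reads[seed][n:]  # treat the seed read as part of the contig
    return contig
```

In the repeat case, for example a contig ending in `ACGTACGT` and a single read `ACGTACGTACGT`, the same read is selected again on the second round and the extension stops instead of looping.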

Local assembly for hole filling suffers from highly similar repeats, so conflict identification should be as sensitive as possible, and the reads used for hole filling should have a low error rate. In the embodiment of the present invention, the reads are error-corrected in advance to improve read quality and to ensure the accuracy of both ends of each read.

B4) Conflict handling. During their research, the inventors of the present invention found that extension conflicts have two causes: base errors in the seed read, and branching caused by repeats. Based on these two cases, the embodiment of the present invention uses the following strategies to handle conflicts when selecting the hole-filling read set and the seed read:

A1) Alignment-rate filtering: a read must have a 100% alignment rate to be extended as a seed read. Alignment-rate filtering uses the following strategy:

Search all the reads against a contig, find the reads that overlap the contig, and initially select the read with the smallest overlap with the contig as the seed read. The other reads that overlap the contig must then also overlap the seed read. The seed read is compared with these other reads. If the mismatch ratio in the overlap between the initially selected seed read and the other reads exceeds the second threshold set by the system (for example, 3%), the seed read is judged unreliable and a new seed read is selected, whose overlap with the contig is longer than that of the unreliable seed read but shorter than that of the remaining reads. This is repeated until a reliable seed read is found, or the extension is abandoned because no seed read can be found. In this way, the embodiment of the present invention avoids extension failures caused by base errors in the seed read.

In a specific implementation, overlaps between reads are computed by stepwise block extension: a block is selected from the seed read, a target read is chosen, and the bases of the block are searched for in the target read. If they are found, the block is moved forward one unit along the seed read and compared with the target read again, and this is repeated until the block no longer matches. The overlap length between the seed read and the target read is thereby obtained. This length must reach a third threshold, which is 1 kmer, to show that the overlap between the two reads is not accidental and is genuinely reliable. If the previous seed read itself contains a sequencing error, a large number of reads may be filtered out; in this case, a loop is used to replace the previous seed read.

A2) Position filtering: reads are located by their paired-end relationships, the position of each read within the hole is calculated, and reads are filtered by position, thereby reducing conflicts caused by long repeats in the hole. To ensure that positions within the hole are calculated accurately, the embodiment of the present invention can set strict filtering conditions.

A3) Read-length filtering: among the reads, PE reads are short, whereas single-end reads (single reads) are usually longer. Longer single-end reads overlap one end of the hole. In the embodiment of the present invention, short paired-end reads are preferentially used for extension in the interior of the hole, and long single-end reads are preferentially used for extension at the two ends of the hole.

A4) End filtering: based on the expected hole length, if the extended sequence overlaps the other end too early, a non-overlapping read is selected instead, that is, a read positioned just behind the extended read that does not overlap it and whose placement does not conflict with the expected hole length. This ensures that the repeat region is crossed. In the embodiment of the present invention, end filtering can occur only once.

A5) Identification and handling of short similar repeats: short similar repeats are usually shorter than 50 bp and lie close together, and they can ultimately cause base deletions in the filled nucleic acid sequence. When a short similar repeat is recognized, the embodiment of the present invention preferentially selects a read with a longer overlap as the seed read, which effectively avoids the problem.

B5) Sequence connection. In hole filling, the embodiment of the present invention requires not only accurate assembly but also accurate connection. Accurate assembly guarantees a low base error rate on the one hand and supports accurate connection on the other. Accurate connection directly determines whether an insertion or deletion ultimately occurs. Moreover, possible extension errors must be taken into account when connecting. According to connection quality, the sequence connections of the embodiment of the present invention can be divided into the following three credibility levels:

B1) First credibility: the two connected sequences overlap, the overlap is not a repeat, and reads span and support the connection.

B2) Second credibility: reads span the connection between the two sequences, although the two sequences may not overlap.

B3) Third credibility: the two connected sequences overlap by at least 8 bp, but the overlap region is not supported by other evidence and may be a repeat.

All three credibility levels may occur. The first is of the highest quality, but this does not guarantee that the connection is correct; the second is of the next highest quality and likewise is not guaranteed to be correct. Therefore, connections within a hole are classified and processed according to the actual situation. They fall into three categories: the contigs at the two ends are connected directly; an extension from one end is connected to the contig at the other end; or extensions from both ends are connected. For each of these, the credibility of the connection is judged when the sequences are connected. If the first credibility is present, the first credibility is selected and the sequences are connected. If the first credibility is absent but the second is present, the second credibility is selected and the sequences are connected. If neither the first nor the second credibility is present but the third is, the third credibility is selected and the sequences are connected.
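The selection order can be sketched as a simple cascade. The function name and the three input flags are hypothetical; in practice each flag would come from the overlap, repeat and read-support analyses described above.

```python
def connection_credibility(overlap_len, is_repeat, has_spanning_read,
                           min_weak_overlap=8):
    """Return 1, 2 or 3 for the best credibility level present, else None.

    overlap_len: overlap between the two sequences in bp (0 if none).
    is_repeat: whether the overlap region looks like a repeat.
    has_spanning_read: whether some read spans the connection.
    """
    if overlap_len > 0 and not is_repeat and has_spanning_read:
        return 1  # overlap, not a repeat, supported by spanning reads
    if has_spanning_read:
        return 2  # spanning reads only; the sequences may not overlap
    if overlap_len >= min_weak_overlap:
        return 3  # at least 8 bp overlap without supporting evidence
    return None   # no basis for connecting
```

Checking the levels in this order means a lower level is used only when every higher level is absent.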

C. For large holes, the hole is mainly divided into several smaller holes, each of which is processed according to the middle-hole procedure.

Because the PE insert size available for hole filling is limited, the longest supporting PE insert is 800 bp. When the hole is longer than 1.5 kb, then, after subtracting the lengths by which the inserts overlap the contigs at the two ends, two 800 bp inserts cannot overlap each other, so no complete path can be found to fill the large hole entirely. To avoid this limitation of PE inserts, in the embodiment of the present invention the large hole is divided into several middle holes, the middle holes are assembled separately, and the assembly results are finally connected. The details are as follows:

C1) Calculate the position of each read in the hole from its PE relationship, sort the reads by position, and group reads that cover the hole contiguously into blocks.

C2) Assemble each block using the middle-hole procedure.

C3) Connect the assembly results of the blocks to obtain the sequence within the large hole.
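Step C1 amounts to sorting reads by position and merging those that cover the hole without a break. A minimal sketch follows, assuming each read is represented by its estimated start position in the hole and its length; the function name and this representation are hypothetical.

```python
def split_into_blocks(reads):
    """Group reads into blocks of contiguous coverage.

    reads: iterable of (start, length) positions within the hole.
    A new block starts whenever a read begins beyond the end of the
    coverage accumulated so far.
    """
    blocks = []
    block_end = None
    for start, length in sorted(reads):
        if block_end is None or start > block_end:
            blocks.append([])          # coverage break: open a new block
            block_end = start + length
        else:
            block_end = max(block_end, start + length)
        blocks[-1].append((start, length))
    return blocks
```

Each returned block can then be assembled with the middle-hole procedure (C2) and the results joined (C3).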

In the embodiment of the present invention, a hole can be extended from one end or from both ends. The technical solution of extending from both ends is described below:

Figure 2 is a flow chart of a second embodiment of the hole-filling method in nucleic acid sequence assembly of the present invention. In this method, one end of the hole has a first contig and the other end has a second contig, and the hole is filled by the following steps:

Step 201, determining node reads: find a nucleic acid sequence at the end of the first contig near the hole as the first node read, and find a nucleic acid sequence at the end of the second contig near the hole as the second node read;

Step 202, selecting hole-filling read sets: from the reads used for hole filling, find all reads that overlap the first node read as the first hole-filling read set, and all reads that overlap the second node read as the second hole-filling read set;

Step 203, selecting seed reads: select one read from the first hole-filling read set as the first seed read, and one read from the second hole-filling read set as the second seed read;

Specifically, the read with the smallest overlap with the first node read is selected from the first hole-filling read set as the first seed read, and the read with the smallest overlap with the second node read is selected from the second hole-filling read set as the second seed read;

Step 204, extension: splice the first seed read onto the first contig to form a new first contig, and splice the second seed read onto the second contig to form a new second contig;

Step 205, judgment: determine whether, after extension, the end of the new first contig near the hole overlaps the end of the new second contig near the hole;

Step 206, loop: if the two ends do not overlap, repeat the steps of determining node reads, selecting hole-filling read sets, selecting seed reads and extension on the basis of the new first and second contigs;

Step 207, completing the hole filling: if the two ends overlap, connect the first contig and the second contig to complete the hole filling.
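Steps 201 to 207 can be sketched as a loop that extends both contigs toward each other until they overlap. This is a simplified illustration, not the claimed implementation: overlaps are exact, the overlap check is performed at the start of each round, conflicts and repeats are not handled, and a hypothetical `max_rounds` bound replaces the loop detection. All function and parameter names are assumptions.

```python
def overlap(left, right, min_overlap):
    """Longest exact overlap between a suffix of left and a prefix of right."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def pick_seed(node, reads, min_overlap, forward):
    """Steps 202-203: return (overlap, read) with the smallest overlap, or None."""
    best = None
    for read in reads:
        if forward:
            n = overlap(node, read, min_overlap)   # read continues to the right
        else:
            n = overlap(read, node, min_overlap)   # read continues to the left
        if 0 < n < len(read) and (best is None or n < best[0]):
            best = (n, read)
    return best


def fill_hole(first, second, reads, min_overlap=4, node_len=8, max_rounds=100):
    """Return the joined sequence, or None if the hole cannot be closed."""
    for _ in range(max_rounds):
        n = overlap(first, second, min_overlap)        # step 205
        if n:
            return first + second[n:]                  # step 207
        # Step 201: node reads are the contig ends next to the hole.
        left = pick_seed(first[-node_len:], reads, min_overlap, True)
        right = pick_seed(second[:node_len], reads, min_overlap, False)
        if left is None and right is None:
            return None                                # neither end can extend
        if left:                                       # step 204, first contig
            first += left[1][left[0]:]
        if right:                                      # step 204, second contig
            second = right[1][:-right[0]] + second
    return None                                        # step 206 bound reached
```

In this sketch both ends are extended in the same round; extending them alternately would only change the order of the two splicing operations.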

As can be seen from the above, prior-art methods for determining hole length are limited and fill holes with low accuracy. In the present invention, a nucleic acid sequence on the first contig at one end of the hole is determined as the first node read; the read with the smallest overlap with the first node read is found and used as the first seed read; and the first seed read is spliced onto the first contig, so that the first contig is extended into a new first contig. At the same time, a nucleic acid sequence on the second contig at the other end of the hole is determined as the second node read; the read with the smallest overlap with the second node read is found and used as the second seed read; and the second seed read is spliced onto the second contig, so that the second contig is extended into a new second contig. It is then determined whether the end of the new first contig near the hole overlaps the end of the new second contig near the hole. If not, the extension is repeated until the two ends overlap, whereupon the first contig and the second contig are connected to complete the hole filling. Extending the contigs in this way not only greatly increases the number of holes filled and the efficiency of hole filling, but also improves the accuracy of hole filling and saves hole-filling time.

Of course, the two ends can be extended simultaneously or alternately, or one end can be extended in one round and the other end in the next; details are not repeated here.

Fig. 3 is a structural diagram of an embodiment of the hole-filling device in nucleic acid sequence assembly of the present invention. The device includes: a determining unit 31, a first selecting unit 32, a second selecting unit 33, an extending unit 34, a first judging unit 35, a loop unit 36, a connecting unit 38, an identifying unit 39, a second judging unit 40, a third judging unit 41 and a fourth judging unit 37.

The determining unit 31 is configured to determine a node read: one end of the hole has a first contig and the other end has a second contig, and a nucleic acid sequence at the end of the first contig near the hole is found as the node read. The first selecting unit 32 is configured to select the hole-filling read set, finding, from the reads used for hole filling, all reads that overlap the node read. The second selecting unit 33 is configured to select the seed read, choosing one read from the hole-filling read set. The extending unit 34 is configured to perform extension, splicing the seed read onto the first contig to form a new first contig. The first judging unit 35 is configured to judge whether, after extension, the end of the new first contig near the hole overlaps the end of the second contig near the hole. The loop unit 36 is configured to perform loop processing: if the two ends do not overlap, then, on the basis of the new first contig, the determining unit 31 again determines the node read, the first selecting unit 32 selects the hole-filling read set, the second selecting unit 33 selects the seed read, and the extending unit 34 performs extension. The connecting unit 38 is configured to complete the hole filling: if the end of the new first contig near the hole overlaps the end of the second contig near the hole, it connects the first contig and the second contig. The connecting unit 38 is also used for sequence connection, which can be divided into direct connection of the contigs at the two ends, connection of an extension from one end to the contig at the other end, and connection of extensions from both ends.

The identifying unit 39 is configured to use the frequency of sliding windows over the reads in the hole and the distance between identical sliding windows to identify whether a tandem repeat exists in the hole, and further to infer the repeat pattern of the sequence in the hole from the tandem repeat. The second judging unit 40 is configured to determine whether the node read and the reads used for hole filling share a common fixed-length short string (kmer), and to use pattern recognition to determine, among the reads sharing a common kmer, all reads that overlap. The third judging unit 41 is configured so that, according to the estimated hole length, if the end of the new first contig near the hole overlaps the end of the second contig near the hole too early, the first contig and the second contig are not connected, hole filling continues on the basis of the new first contig, and, when the seed read is selected, a non-overlapping read outside the hole-filling read set is selected as the seed read.

The fourth judging unit 37 is configured to judge the credibility of a sequence connection when the connecting unit 38 performs the connection. If the first credibility is present, the first credibility is selected and the connecting unit 38 performs the sequence connection; if the first credibility is absent but the second is present, the second credibility is selected and the connecting unit 38 performs the sequence connection; if neither the first nor the second credibility is present but the third is, the third credibility is selected and the connecting unit 38 performs the sequence connection. Here, the first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span and support the connection; the second credibility means that reads span the connection between the two sequences, although the two sequences may not overlap; the third credibility means that the two connected sequences overlap but the overlap region has no supporting evidence.

The hole filling device in the assembly of the nucleic acid sequence of the present invention has the following hole filling process: before the hole filling, the identification unit 39 determines whether there is a tandem repeat in the hole, and further infers the repeated pattern in the segment sequence, so as to facilitate the hole filling. get on. In the process of filling the hole, one end of the hole has a first contig and the other end has a second contig. First, the determining unit 31 finds a nucleic acid sequence at the end of the first contig close to the hole, as a node reading, and then a second. Judging unit 40 Determining an overlapping area between the node reading order and the reading order for filling the hole, according to the second determining unit 40 As a result of the judgment, the first selection unit 32 finds all the reading sequences overlapping the node reading order as the complement hole reading set from the reading order for the hole filling, and then the second selecting unit 33 selects from the complement hole reading set. The read sequence having the smallest overlap with the node read sequence is used as the seed read sequence, and the extension unit 34 splices the seed read sequence with the first contig to form a new first contig. Then, the first determining unit 35 determines whether there is overlap between one end of the new first contig close to the hole and the end of the second contig close to the hole, if one end of the new first contig close to the hole is close to the second contig If there is no overlap at one end of the hole, the loop unit 36 continues to perform the process of determining the node read order, selecting the fill hole read set, selecting the seed read sequence, and the extension process on the basis of the new first contig, and finally the connection unit 38 completes the hole filling. 
If one end of the new first contig close to the hole overlaps with one end of the second contig close to the hole, the first contig is connected to the second contig, and the connecting unit 38 completes the merging. The third judging unit 41 is configured to not connect the first contig and the second if the end of the new first contig close to the hole overlaps with the end hole of the second contig close to the hole according to the predicted hole length. The contig, continuing to perform the process of determining the node read order, selecting the fill hole read set, and selecting the seed read order based on the new first contig, and selecting the fill order in the process of selecting the seed read order Non-overlapping reads outside the collection are used as seed reads. The fourth judging unit 37 performs credibility judgment on the accuracy of the sequence connection when the connecting unit 38 is connected in sequence, and when the first credibility exists, selects the first credibility, and causes the connecting unit 38 to perform sequence connection; If there is no first credibility, but there is a second credibility, the second credibility is selected, so that the connecting unit 38 performs sequence connection; there is no first credibility and second credibility, but there is a third credibility In the case of reliability, the third credibility is selected, so that the connection unit 38 performs sequence connection, wherein the first credibility is that the two sequences of the connection overlap and are not repeated, and the read sequence crosses the support; The two credibility is that the two sequences connected have a read sequence across the connection, and the two sequences may not overlap; the third credibility is that the two sequences connected have overlap, and the overlap region has no evidence to support. 
After the hole filling is completed, the connecting unit 38 performs sequence connection. The sequence connection can be divided into three types: direct connection of the overlapping contigs at the two ends, connection of the extended sequence at one end with the contig at the other end, and connection of the extended sequences at both ends.
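The overlap determination made by the second judging unit 40 in the process above is characterized in the claims as finding a common fixed-length short string and then obtaining the overlap length by stepwise sliding-window extension. The Python sketch below is one possible reading of that idea, offered as an assumption rather than as the patented procedure: the function names, the window size `k`, exact matching, and the restriction to overlaps between the end of the node read and the start of a candidate read are all choices made for illustration.

```python
from typing import Dict, List, Tuple


def kmer_positions(seq: str, k: int) -> Dict[str, List[int]]:
    """Index every fixed-length short string (k-mer) of `seq` by start position."""
    positions: Dict[str, List[int]] = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return positions


def overlap_length(node: str, read: str, k: int = 4) -> int:
    """Overlap between the end of `node` and the start of `read`, or 0.

    The last k-mer of the node read serves as the common short string; from
    each place it occurs in `read`, a window of size k slides one base at a
    time towards the start of `read` and must agree with the node read.
    """
    if len(node) < k or len(read) < k:
        return 0
    best = 0
    for pos in kmer_positions(read, k).get(node[-k:], []):
        n = pos + k                      # candidate overlap length
        if n > len(node):
            continue
        shift = len(node) - n            # node index aligned with read index 0
        if all(read[s:s + k] == node[shift + s:shift + s + k]
               for s in range(pos - 1, -1, -1)):
            best = max(best, n)
    return best


def hole_filling_read_set(node: str, reads: List[str],
                          k: int = 4) -> List[Tuple[str, int]]:
    """All reads overlapping the node read, paired with their overlap length."""
    hits = []
    for read in reads:
        n = overlap_length(node, read, k)
        if n:
            hits.append((read, n))
    return hits
```

Indexing the short strings first means that reads sharing no k-mer with the node read are discarded without any base-by-base comparison, which is presumably the reason for the two-stage design.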

Clearly, the embodiments of the present invention take the processing of medium holes as the core: a small hole can be converted into a medium hole for processing, and a large hole is decomposed entirely into medium holes, each of which is processed by the medium-hole method. Holes of different levels thus correspond to different processing modes, the hole filling process is refined to the interior of the hole, and the information of the hole itself is fully used to complete the hole filling, so that all holes can be filled effectively. This greatly improves the accuracy of hole filling, saves hole filling time and memory space, and is conducive to the development and promotion of gene sequencing technology.

The above are only embodiments of the present invention and are not intended to limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims

A method of filling a hole in a nucleic acid sequence assembly, the hole having a first contig at one end and a second contig at the other end, comprising the steps of:

Determining a node read: finding a nucleic acid sequence at the end of the first contig close to the hole and taking it as the node read;

Selecting a hole-filling read set: finding, among the reads used for hole filling, all reads that overlap the node read and taking them as the hole-filling read set;

Selecting a seed read: selecting one read from the hole-filling read set as the seed read;

Extension processing: splicing the seed read with the first contig to form a new first contig;

Judgment processing: determining whether the end of the new first contig close to the hole after the extension processing overlaps the end of the second contig close to the hole;

Loop processing: if the end of the new first contig close to the hole does not overlap the end of the second contig close to the hole, continuing to perform, on the basis of the new first contig, the steps of determining the node read, selecting the hole-filling read set, selecting the seed read, and extension processing;

Completing the hole filling: if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, connecting the first contig and the second contig to complete the hole filling.
The method of claim 1 wherein:

The step of selecting a hole-filling read set comprises:

Judging whether the node read and the reads used for hole filling share a common fixed-length short string, and performing pattern matching between the reads that share the common fixed-length short string to determine the set of all overlapping reads.
The method of claim 2 wherein:

The step of performing pattern matching between the reads sharing the common fixed-length short string comprises: using stepwise sliding-window extension between the reads sharing the common fixed-length short string to obtain the overlap length between the reads.
The method of claim 1 wherein:

Before the step of determining the node read, the method further comprises: using the sliding-window frequencies of the hole-filling reads in the hole and the distances between identical sliding windows to identify whether there is a tandem repeat in the hole, and further inferring the repeat pattern of the sequence in the hole from the tandem repeat.
The method of claim 1 wherein:

The step of selecting a seed read comprises: performing alignment-rate filtering on the hole-filling reads, that is, selecting a hole-filling read with a 100% alignment rate as the seed read.
The method of claim 1 wherein:

The step of selecting a seed read comprises: identifying and handling short similar repeats in the hole-filling reads, that is, when a short similar repeat is identified, selecting the hole-filling read with the longer overlap as the seed read.
The method of claim 1 wherein:

The step of selecting the hole-filling read set comprises: performing position filtering on the hole-filling reads, that is, locating the reads according to their paired-end relationship, calculating the position of each hole-filling read in the hole, and filtering the reads according to that position.
The method of claim 1 wherein:

The step of selecting the hole-filling read set comprises: performing length filtering on the hole-filling reads, that is, selecting short paired-end reads in the inner region of the hole and long single-end reads at the two ends of the hole.
The method of claim 1 wherein:

If the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, the method comprises, before the step of connecting the first contig and the second contig: if, judged against the estimated hole length, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole prematurely, not performing the step of connecting the first contig and the second contig, but continuing to perform, on the basis of the new first contig, the steps of determining the node read, selecting the hole-filling read set, and selecting the seed read, wherein, in the step of selecting the seed read, a non-overlapping read outside the hole-filling read set is selected as the seed read.
The method of claim 1 wherein:

The step of completing the hole filling comprises: performing sequence connection, wherein the sequence connection is a direct connection of the contigs at the two ends, a connection of the extended sequence at one end with the contig at the other end, or a connection of the extended sequences at both ends.
The method of claim 10 wherein:

Before the step of performing the sequence connection, the method comprises: judging the credibility of the accuracy of the sequence connection; when a first credibility exists, selecting the first credibility and performing the sequence connection; when no first credibility exists but a second credibility exists, selecting the second credibility and performing the sequence connection; and when neither the first credibility nor the second credibility exists but a third credibility exists, selecting the third credibility and performing the sequence connection; wherein the first credibility is that the two connected sequences overlap, the overlap is not a repeat, and reads span and support the connection; the second credibility is that reads span the connection between the two connected sequences, which may not overlap; and the third credibility is that the two connected sequences overlap but no evidence supports the overlap region.
A method of filling a hole in a nucleic acid sequence assembly, the hole having a first contig at one end and a second contig at the other end, comprising the steps of:

Determining node reads: finding a nucleic acid sequence at the end of the first contig close to the hole and taking it as the first node read; and finding a nucleic acid sequence at the end of the second contig close to the hole and taking it as the second node read;

Selecting hole-filling read sets: finding, among the reads used for hole filling, all reads that overlap the first node read and taking them as the first hole-filling read set; and finding, among the reads used for hole filling, all reads that overlap the second node read and taking them as the second hole-filling read set;

Selecting seed reads: selecting one read from the first hole-filling read set as the first seed read; and selecting one read from the second hole-filling read set as the second seed read;

Extension processing: splicing the first seed read with the first contig to form a new first contig; and splicing the second seed read with the second contig to form a new second contig;

Judgment processing: determining whether the end of the new first contig close to the hole after the extension processing overlaps the end of the new second contig close to the hole;

Loop processing: if the end of the new first contig close to the hole does not overlap the end of the new second contig close to the hole, continuing to perform, on the basis of the new first and second contigs, the steps of determining the node reads, selecting the hole-filling read sets, selecting the seed reads, and extension processing;

Completing the hole filling: if the end of the new first contig close to the hole overlaps the end of the new second contig close to the hole, connecting the first contig and the second contig to complete the hole filling.
A hole filling device for assembling a nucleic acid sequence, characterized in that the device comprises:

a determining unit, configured to determine a node read, wherein the hole has a first contig at one end and a second contig at the other end, and a nucleic acid sequence found at the end of the first contig close to the hole is taken as the node read;

a first selection unit, configured to select a hole-filling read set by finding, among the reads used for hole filling, all reads that overlap the node read and taking them as the hole-filling read set;

a second selecting unit, configured to select a seed read by selecting one read from the hole-filling read set as the seed read;

an extension unit, configured to perform extension processing by splicing the seed read with the first contig to form a new first contig;

a first determining unit, configured to perform judgment processing by determining whether the end of the new first contig close to the hole after the extension processing overlaps the end of the second contig close to the hole;

a looping unit, configured to perform loop processing, that is, if the end of the new first contig close to the hole does not overlap the end of the second contig close to the hole, to continue to perform, on the basis of the new first contig, the processes of determining the node read, selecting the hole-filling read set, selecting the seed read, and extension processing;

a connecting unit, configured to connect the first contig and the second contig when the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, thereby completing the hole filling.
The device of claim 13 wherein said device comprises:

a second determining unit, configured to determine whether the node read and the reads used for hole filling share a common fixed-length short string, and to perform pattern matching between the reads that share the common fixed-length short string to determine the set of overlapping reads.
The device of claim 14 wherein:

The second determining unit is further configured to obtain the overlap length between the reads by using stepwise sliding-window extension between the reads sharing the common fixed-length short string.
The device of claim 13 wherein said device comprises:

a recognition unit, configured to use the sliding-window frequencies of the hole-filling reads in the hole and the distances between identical sliding windows to identify whether there is a tandem repeat in the hole, and to further infer the repeat pattern of the sequence in the hole from the tandem repeat.
The device of claim 13 wherein:

The second selection unit is further configured to perform alignment-rate filtering on the hole-filling reads, that is, to select a hole-filling read with a 100% alignment rate as the seed read.
The device of claim 13 wherein:

The second selecting unit is further configured to identify and handle short similar repeats in the hole-filling reads, that is, when a short similar repeat is identified, to select the hole-filling read with the longer overlap as the seed read.
The device of claim 13 wherein:

The first selection unit is further configured to perform position filtering on the hole-filling reads, that is, to locate the reads according to their paired-end relationship, calculate the position of each hole-filling read in the hole, and filter the reads according to that position.
The device of claim 13 wherein:

The first selection unit is further configured to perform length filtering on the hole-filling reads, that is, to select short paired-end reads in the inner region of the hole and long single-end reads at the two ends of the hole.
The device of claim 13 wherein said device comprises:

a third judging unit, configured so that, if, judged against the predicted hole length, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole prematurely, the first contig and the second contig are not connected and hole filling continues on the basis of the new first contig, wherein, when the seed read is selected, a non-overlapping read outside the hole-filling read set is selected as the seed read.
The device of claim 13 wherein:

The connecting unit is specifically configured to perform, when the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, a direct connection of the contigs at the two ends, a connection of the extended sequence at one end with the contig at the other end, or a connection of the extended sequences at both ends.

The device includes a fourth determining unit, configured to judge the credibility of the accuracy of the sequence connection when the connecting unit performs sequence connection: when a first credibility exists, the first credibility is selected and the connecting unit performs the sequence connection; when no first credibility exists but a second credibility exists, the second credibility is selected and the connecting unit performs the sequence connection; and when neither the first credibility nor the second credibility exists but a third credibility exists, the third credibility is selected and the connecting unit performs the sequence connection; wherein the first credibility is that the two connected sequences overlap, the overlap is not a repeat, and reads span and support the connection; the second credibility is that reads span the connection between the two connected sequences, which may not overlap; and the third credibility is that the two connected sequences overlap but no evidence supports the overlap region.
A hole filling device for assembling a nucleic acid sequence, characterized in that the device comprises:

a determining unit, configured to determine node reads, wherein the hole has a first contig at one end and a second contig at the other end, a nucleic acid sequence found at the end of the first contig close to the hole is taken as the first node read, and a nucleic acid sequence found at the end of the second contig close to the hole is taken as the second node read;

a first selecting unit, configured to select hole-filling read sets by finding, among the reads used for hole filling, all reads that overlap the first node read and taking them as the first hole-filling read set, and by finding, among the reads used for hole filling, all reads that overlap the second node read and taking them as the second hole-filling read set;

a second selecting unit, configured to select seed reads by selecting one read from the first hole-filling read set as the first seed read and selecting one read from the second hole-filling read set as the second seed read;

an extension unit, configured to perform extension processing by splicing the first seed read with the first contig to form a new first contig and splicing the second seed read with the second contig to form a new second contig;

a first determining unit, configured to perform judgment processing by determining whether the end of the new first contig close to the hole after the extension processing overlaps the end of the new second contig close to the hole;

a looping unit, configured to perform loop processing, that is, if the end of the new first contig close to the hole does not overlap the end of the new second contig close to the hole, to continue to perform, on the basis of the new first and second contigs, the processes of determining the node reads, selecting the hole-filling read sets, selecting the seed reads, and extension processing;

a connecting unit, configured to connect the first contig and the second contig when the end of the new first contig close to the hole overlaps the end of the new second contig close to the hole, thereby completing the hole filling.