WO2013078623A1 - Method and device for gap closure in nucleotide sequence assembly - Google Patents

Method and device for gap closure in nucleotide sequence assembly

Publication number
WO2013078623A1
WO2013078623A1 PCT/CN2011/083178 CN2011083178W WO2013078623A1 WO 2013078623 A1 WO2013078623 A1 WO 2013078623A1 CN 2011083178 W CN2011083178 W CN 2011083178W WO 2013078623 A1 WO2013078623 A1 WO 2013078623A1
Authority
WO
WIPO (PCT)
Prior art keywords
hole
reading
contig
sequence
new
Prior art date
Application number
PCT/CN2011/083178
Other languages
French (fr)
Chinese (zh)
Inventor
刘兵行
李振宇
陈燕香
李英睿
汪建
王俊
杨焕明
Original Assignee
深圳华大基因科技有限公司
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因科技有限公司, 深圳华大基因研究院 filed Critical 深圳华大基因科技有限公司
Priority to PCT/CN2011/083178 priority Critical patent/WO2013078623A1/en
Publication of WO2013078623A1 publication Critical patent/WO2013078623A1/en

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20: Sequence assembly

Definitions

  • The invention relates to the field of genetic engineering technology, in particular to a method and device for filling holes in nucleic acid sequence assembly.
  • The cost of sequencing keeps falling, driving whole-genome sequencing of more and more species.
  • The principle of second-generation sequencing technology means that the sequenced fragments are short. In practice, the sequenced fragments contain only tens to hundreds of bases, which makes the resulting data harder to analyze.
  • Genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships between the non-repetitive regions. The unassembled regions between the non-repetitive regions form gaps, which are called holes.
  • Local assembly of the graph can identify the conflicting sites caused by repeats, but the conflicts are difficult to resolve and the graph has to be broken at those sites, which affects the number of holes.
  • The hole-filling software Gapcloser performs local assembly based on overlapping base sequence segments. Because the method does not take into account the complexity of the situation inside the hole, it easily makes errors when processing complex holes, which reduces overall accuracy.
  • Because Gapcloser consumes a large amount of memory and is time-consuming, it is not suitable for first-pass hole filling of large genomes.
  • Hole filling in the SOAPdenovo assembly software performs a secondary assembly of the region inside the hole based on a de Bruijn graph. Although it effectively resolves shorter holes, the number of holes it can fill is limited.
  • The technical problem to be solved by the present invention is to provide a method and device for filling holes in nucleic acid sequence assembly that can effectively fill the holes in a nucleic acid sequence and improve both the accuracy and the hole-filling rate.
  • A technical solution adopted by the present invention is to provide a method for filling holes in nucleic acid sequence assembly, wherein one end of the hole has a first contig and the other end has a second contig, the method including the following steps:
  • Determining a node read: finding a nucleic acid sequence at the end of the first contig close to the hole as the node read.
  • Selecting a hole-filling read set: from the reads available for filling the hole, finding all reads that overlap the node read as the hole-filling read set.
  • Selecting a seed read: selecting one read from the hole-filling read set as the seed read.
  • Extension processing: splicing the seed read with the first contig to form a new first contig.
  • Judging processing: judging whether, after the extension processing, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole.
  • Loop processing: if the two ends do not overlap, repeating the steps of determining the node read, selecting the hole-filling read set, selecting the seed read, and extension processing on the basis of the new first contig.
  • Completing the hole filling: if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, connecting the first contig and the second contig to complete the hole filling.
  • The step of selecting the hole-filling read set includes: determining whether the node read and the reads for filling the hole share a common fixed-length short string, and applying pattern recognition between reads that share a common fixed-length short string to determine the set of all overlapping reads.
  • The step of applying pattern recognition between reads that share a common fixed-length short string includes: obtaining the overlap length between the reads by sliding a window in a stepwise-extending manner.
  • The step of determining the node read further includes: using the sliding-window frequency of the hole-filling reads in the hole and the distance between identical sliding windows to identify whether there is a tandem repeat in the hole, and inferring the repeat pattern of the sequence in the hole from the tandem repeat.
  • The step of selecting a seed read includes: performing alignment-rate filtering on the hole-filling reads, that is, selecting a hole-filling read with a 100% alignment rate as the seed read.
  • The step of selecting a seed read includes: identifying short similar repeats among the hole-filling reads, that is, when a short similar repeat is identified, selecting a hole-filling read with a longer overlap as the seed read.
  • The step of selecting the hole-filling read set includes: position filtering of the hole-filling reads, that is, locating the reads by their paired-end relationship, calculating the position of each hole-filling read in the hole, and filtering the reads by position.
  • The step of selecting the hole-filling read set includes: length filtering of the hole-filling reads, that is, selecting short paired-end reads in the inner region of the hole and long single-end reads at both ends of the hole.
  • Before the step of connecting the first contig and the second contig, the method includes: if, judged against the estimated hole length, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole prematurely, the step of connecting the first contig and the second contig is not performed; the steps of determining a node read, selecting a hole-filling read set, and selecting a seed read continue on the basis of the new first contig, and in the step of selecting a seed read, a non-overlapping read from outside the hole-filling read set is selected as the seed read.
  • The step of completing the hole filling comprises performing sequence connection, wherein the sequence connection is a direct connection of the two contigs, a connection of an extended end to the contig at the other end, or a connection of the extended sequences from both ends.
  • The method comprises: judging the credibility of the sequence connection when the sequences are connected. When a connection of the first credibility exists, it is selected for the sequence connection; when there is no first credibility but there is a second credibility, the second credibility is selected; when there is neither a first nor a second credibility but there is a third credibility, the third credibility is selected. The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and a read spans the overlap in support; the second credibility means that a read spans the junction of the two connected sequences, which need not overlap; the third credibility means that the two connected sequences overlap but no evidence supports the overlap region.
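  • The three credibility levels above can be read as a simple priority rule. The following Python sketch is only an illustration under stated assumptions: the `Join` record, its boolean fields, and the function names are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Join:
    """Hypothetical description of one candidate sequence connection."""
    name: str
    overlaps: bool         # the two sequences overlap
    in_repeat: bool        # the overlap lies in a repeat
    spanned_by_read: bool  # at least one read spans the junction


def credibility(join: Join) -> Optional[int]:
    """Return 1, 2 or 3 (1 is most credible), or None if no level applies."""
    if join.overlaps and not join.in_repeat and join.spanned_by_read:
        return 1  # overlap, not a repeat, supported by a spanning read
    if join.spanned_by_read:
        return 2  # a read spans the junction; overlap is not required
    if join.overlaps:
        return 3  # overlap exists but nothing supports it
    return None


def best_join(candidates: List[Join]) -> Optional[Join]:
    """Pick the candidate with the best (lowest-numbered) credibility level."""
    ranked = [(credibility(j), j) for j in candidates]
    ranked = [(level, j) for level, j in ranked if level is not None]
    if not ranked:
        return None
    return min(ranked, key=lambda pair: pair[0])[1]
```

  • With this ordering, a third-credibility connection is used only when no first- or second-credibility connection is available.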
  • Another technical solution adopted by the present invention is to provide a method for filling holes in nucleic acid sequence assembly, wherein the hole has a first contig at one end and a second contig at the other end, the method including the following steps:
  • Determining node reads: finding a nucleic acid sequence at the end of the first contig close to the hole as a first node read, and finding a nucleic acid sequence at the end of the second contig close to the hole as a second node read.
  • Selecting hole-filling read sets: from the reads for filling the hole, finding all reads that overlap the first node read as a first hole-filling read set, and all reads that overlap the second node read as a second hole-filling read set.
  • Selecting seed reads: selecting one read from the first hole-filling read set as a first seed read, and one read from the second hole-filling read set as a second seed read.
  • Extension processing: splicing the first seed read with the first contig to form a new first contig, and splicing the second seed read with the second contig to form a new second contig.
  • Judging and loop processing: judging whether the end of the new first contig close to the hole overlaps the end of the new second contig close to the hole; if not, repeating the steps of determining node reads, selecting hole-filling read sets, selecting seed reads, and extension processing on the basis of the new contigs.
  • Completing the hole filling: if the end of the new first contig close to the hole overlaps the end of the new second contig close to the hole, connecting the first contig and the second contig to complete the hole filling.
  • A hole-filling device for nucleic acid sequence assembly, comprising: a determining unit configured to determine a node read, wherein one end of the hole has a first contig and the other end has a second contig, and a nucleic acid sequence is found at the end of the first contig close to the hole as the node read; and a first selection unit configured to select a hole-filling read set.
  • The second selection unit is configured to select the seed read, selecting one read from the hole-filling read set as the seed read.
  • The extension unit is configured to perform extension processing, splicing the seed read with the first contig to form a new first contig.
  • The first judging unit is configured to perform judging processing, determining whether, after the extension processing, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole.
  • A loop unit is configured for loop processing: if the end of the new first contig close to the hole does not overlap the end of the second contig close to the hole, the processing of determining the node read, selecting the hole-filling read set, selecting the seed read, and extension continues on the basis of the new first contig. A connecting unit is configured to connect the first contig and the second contig to complete the hole filling when the end of the new first contig close to the hole overlaps the end of the second contig close to the hole.
  • The second judging unit is configured to determine whether the node read and the reads for filling the hole share a common fixed-length short string, and to apply pattern recognition between reads that share a common fixed-length short string to determine the set of all overlapping reads.
  • The second judging unit is further configured to obtain the overlap length between reads that share a common fixed-length short string by sliding a window in a stepwise-extending manner.
  • The device includes an identifying unit configured to use the sliding-window frequency of the hole-filling reads in the hole and the distance between identical sliding windows to identify whether there is a tandem repeat in the hole, and to infer the repeat pattern of the sequence in the hole from the tandem repeat.
  • The second selection unit is further configured to perform alignment-rate filtering on the hole-filling reads, that is, to select a hole-filling read with a 100% alignment rate as the seed read.
  • The second selection unit is further configured to identify short similar repeats among the hole-filling reads, that is, when a short similar repeat is recognized, to select a hole-filling read with a longer overlap as the seed read.
  • The first selection unit is further configured to perform position filtering on the hole-filling reads, that is, to locate the reads by their paired-end relationship, calculate the position of each hole-filling read in the hole, and filter the reads by position.
  • The first selection unit is further configured to perform length filtering on the hole-filling reads, that is, to select short paired-end reads in the inner region of the hole and long single-end reads at both ends of the hole.
  • The third judging unit is configured so that if, judged against the estimated hole length, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole prematurely, the first contig and the second contig are not connected; hole filling continues on the basis of the new first contig, and when the seed read is selected, a non-overlapping read from outside the hole-filling read set is selected as the seed read.
  • The connecting unit is specifically configured, when the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, to connect the contigs at the two ends directly, to connect an extended end to the contig at the other end, or to connect the extended sequences from both ends.
  • The device includes a fourth judging unit configured to judge the credibility of the sequence connection when the connecting unit connects sequences. When a first credibility exists, the first credibility is selected for the connecting unit to perform the sequence connection; when there is no first credibility but there is a second credibility, the second credibility is selected; when there is neither a first nor a second credibility but there is a third credibility, the third credibility is selected. The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and a read spans the overlap in support; the second credibility means that a read spans the junction of the two connected sequences, which need not overlap; the third credibility means that the two connected sequences overlap but no evidence supports the overlap region.
  • A hole-filling device for nucleic acid sequence assembly, comprising: a determining unit configured to determine node reads, wherein one end of the hole has a first contig and the other end has a second contig, a nucleic acid sequence is found at the end of the first contig close to the hole as a first node read, and a nucleic acid sequence is found at the end of the second contig close to the hole as a second node read; a first selection unit configured to select hole-filling read sets, finding, from the reads for filling the hole, all reads that overlap the first node read as a first hole-filling read set and all reads that overlap the second node read as a second hole-filling read set; and a second selection unit configured to select seed reads, selecting one read from the first hole-filling read set as a first seed read and one read from the second hole-filling read set as a second seed read.
  • The beneficial effects of the invention are as follows. Prior-art hole-filling methods fill only a limited number of holes, and their accuracy is not high.
  • In contrast, the present invention first finds a nucleic acid sequence at one end of the hole and uses it as a node read.
  • Figure 1 is a flow chart showing a first embodiment of a method of filling a hole in the assembly of a nucleic acid sequence of the present invention.
  • Figure 2 is a flow chart showing a second embodiment of a method of filling a hole in the assembly of a nucleic acid sequence of the present invention.
  • Figure 3 is a schematic view showing the structure of an embodiment of a hole-filling device in the assembly of the nucleic acid sequence of the present invention.
  • Kmer (fixed-length string): a DNA sequence of length K; K is usually taken as 17.
  • Single read (single-end read): sequence information obtained mainly by the Sanger sequencing method, which yields the sequence of one end of a longer DNA fragment or the measured sequence of a shorter fragment.
  • Scaffold (connecting bracket): the result of linking contigs using linkage information from paired-end reads of plasmids, BACs, mRNA, or other sources, in which the contigs are ordered and oriented.
  • Gap (hole): genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships between the non-repetitive regions; the unassembled regions between the non-repetitive regions form gaps, called hole regions.
  • PE read: paired-end (double-end) read.
  • Figure 1 is a flow chart showing a first embodiment of a method of filling a hole in the assembly of a nucleic acid sequence of the present invention.
  • In the hole-filling method, one end of the hole has a first contig and the other end has a second contig, and the method includes the following steps:
  • Step 101: Determine a node read: find a nucleic acid sequence at the end of the first contig close to the hole as the node read;
  • Step 102: Select a hole-filling read set: from the reads for filling the hole, find all reads that overlap the node read as the hole-filling read set;
  • The step of selecting the hole-filling read set involves the overlap calculation between reads (see step B2 below);
  • Step 103: Select a seed read: select one read from the hole-filling read set as the seed read;
  • In this embodiment, the read with the smallest overlap with the node read is selected as the seed read.
  • In other embodiments, the read with the smallest overlap need not be selected as the seed read; for example, a read with a slightly longer overlap may be selected instead;
  • Step 104: Extension processing: splice the seed read with the first contig to form a new first contig;
  • Step 105: Judging processing: determine whether, after the extension processing, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole;
  • Step 106: Loop processing: if the end of the new first contig close to the hole does not overlap the end of the second contig close to the hole, continue the steps of determining the node read, selecting the hole-filling read set, selecting the seed read, and extension processing on the basis of the new first contig;
  • Step 107: Complete the hole filling: if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, connect the first contig and the second contig to complete the hole filling.
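  • Steps 101 to 107 can be sketched as a greedy extension loop. The Python sketch below is a minimal illustration, not the patented implementation: it assumes error-free reads, exact suffix/prefix overlaps only, and hypothetical parameters `node_len`, `min_ov`, and `max_rounds` that the text does not specify.

```python
from typing import List, Optional


def suffix_prefix_overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def fill_hole(first: str, second: str, reads: List[str],
              node_len: int = 8, min_ov: int = 4,
              max_rounds: int = 100) -> Optional[str]:
    """Extend `first` read by read until it overlaps `second`; None if it cannot."""
    contig = first
    for _ in range(max_rounds):
        node = contig[-node_len:]                      # step 101: node read
        fill_set = []                                  # step 102: overlapping reads
        for read in reads:
            n = suffix_prefix_overlap(node, read, min_ov)
            if n and len(read) > n:                    # read must add new bases
                fill_set.append((n, read))
        if not fill_set:
            return None                                # extension is stuck
        n, seed = min(fill_set)                        # step 103: smallest overlap
        contig += seed[n:]                             # step 104: new first contig
        ov = suffix_prefix_overlap(contig, second, min_ov)  # step 105: judge
        if ov:
            return contig + second[ov:]                # step 107: connect contigs
        # step 106: otherwise loop on the new first contig
    return None


first, second = "ACGTACGGTC", "TTGACCATGG"
reads = ["ACGGTCAAGCTT", "AAGCTTGCATTT", "GCATTTGACCAT"]
print(fill_hole(first, second, reads))
```

  • In this toy input the three reads tile a 10-base hole, so the loop needs three rounds of extension before the extended contig overlaps the second contig.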
  • As can be understood from the above, prior-art hole-filling methods fill only a limited number of holes, and their accuracy is not high.
  • In contrast, the present invention finds a nucleic acid sequence on the first contig at one end of the hole as a node read.
  • The read with the smallest overlap with the node read is used as the seed read, and the seed read is spliced with the first contig to form a new first contig; that is, the first contig is extended. It is then judged whether the end of the new first contig close to the hole overlaps the end of the second contig close to the hole. If not, the extension is repeated in a loop until the two ends overlap, and the first contig and the second contig are then connected to complete the hole filling.
  • Extending the contig in this way not only greatly increases the number of holes filled and the efficiency of hole filling, but also improves hole-filling accuracy and saves hole-filling time.
  • The step of selecting a hole-filling read set includes: determining whether the node read and the reads for filling the hole share a common fixed-length short string (Kmer), and applying pattern recognition between reads that share a common fixed-length short string to determine the set of all overlapping reads. This approach finds the set of all overlapping reads quickly.
  • In other embodiments, the common fixed-length short string method described above need not be used; the alternatives are not described here.
  • An algorithm such as hashing may be used to determine whether the node read and the reads for filling the hole share a common fixed-length short string.
  • The pattern recognition step may obtain the overlap length between reads that share a common fixed-length short string by sliding a window in a stepwise-extending manner.
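  • The hash-based pre-filter described above can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, a Python `dict` stands in for the hash method, and the demonstration uses k = 5 for brevity, whereas the text states that K is usually 17.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_kmer_index(reads: List[str], k: int) -> Dict[str, Set[int]]:
    """Map every k-mer to the indices of the reads that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    return index


def reads_sharing_kmer(node: str, index: Dict[str, Set[int]], k: int) -> Set[int]:
    """Indices of reads that share at least one k-mer with the node read."""
    hits: Set[int] = set()
    for i in range(len(node) - k + 1):
        hits |= index.get(node[i:i + k], set())
    return hits


reads = ["GGTCAAGCTT", "TTTTTTTTTT", "CGGTCAAG"]
index = build_kmer_index(reads, k=5)
print(sorted(reads_sharing_kmer("ACGGTCAA", index, k=5)))
```

  • Only the reads returned by the pre-filter need the more expensive overlap computation; sharing a k-mer makes an overlap possible but does not prove it.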
  • The level of each hole is obtained from the size of the hole and the criteria set by the system. Nucleic acid sequence holes are divided into small holes, medium holes, and large holes, and each hole is filled according to its level using the corresponding base sequence segments.
  • The holes are classified as follows: a hole shorter than 100 bp is defined as a small hole, a hole between 100 bp and 1.5 kb long is defined as a medium hole, and a hole longer than 1.5 kb is defined as a large hole.
  • The above is only one possible definition of the hole levels; the size limits are merely exemplary and are not limiting.
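  • The exemplary size limits above translate into a small classification function. In this sketch the function name is hypothetical, and treating both boundaries (100 bp and 1.5 kb) as belonging to the medium level is an assumption, since the text does not say which side the boundaries fall on.

```python
def classify_hole(length_bp: int) -> str:
    """Classify a hole by length using the exemplary 100 bp and 1.5 kb limits."""
    if length_bp < 100:
        return "small"
    if length_bp <= 1500:   # assumption: boundaries count as medium
        return "medium"
    return "large"


for length in (35, 100, 800, 1500, 4000):
    print(length, classify_hole(length))
```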
  • A scaffold containing nucleic acid sequence holes is acquired and analyzed.
  • The original scaffold is broken into contigs, and the gap between two adjacent contigs is a hole.
  • In this way, the size of each hole and the contigs before and after it can be obtained accurately.
  • The embodiment of the present invention further partitions all acquired nucleic acid sequence holes and contigs according to the user's setting and stores the associated contigs and reads in corresponding folders. For example, if the user sets 4 folders, the acquired holes and contigs are divided into 4 parts, 4 folders are generated, and the associated contigs and reads are stored one by one in the partitioned folders. After partitioning, each folder contains contigs and the reads for filling their holes, so during subsequent hole filling the contigs and reads can be fetched directly from the corresponding folder. With this partitioning, the memory required at any one time drops to roughly a quarter of the original, saving space, and the search time during hole filling is reduced, shortening the overall hole-filling time.
  • The reads for filling the hole are read in.
  • Most of the reads for filling the hole are PE reads from Solexa sequencing; the remainder are long single-end reads from Sanger sequencing.
  • PE reads support each other: the two reads of a pair come from the two ends of an insert, and the inserts used for filling holes are generally 180 bp, 500 bp, and 800 bp long.
  • With high-throughput, high-coverage sequencing, an insert can be reconstructed from the overlapping relationships among multiple PE reads.
  • If the read that has a PE relationship with a given read falls within the nucleic acid sequence hole or on a contig bordering the hole, the hole can be filled.
  • Because a long read is itself long, it can span a nucleic acid sequence hole of small length. If every base of the long read is trustworthy, the base at each position of the long read can be used to fill such a hole exactly.
  • The positional relationship between each read and the nucleic acid sequence hole, the contig and scaffold to which the read belongs, and the sequence information of the read itself are acquired.
  • The hole-filling process specifically includes: A, the hole-filling process for small holes; B, the hole-filling process for medium holes; and C, the hole-filling process for large holes.
  • A. The hole-filling process for small holes.
  • The specific method is as follows. A hole length can be calculated from each read that spans the hole. A frequency table is built over all such reads to represent the range of the hole length; the table is needed because errors in the connection cause different reads to imply different hole lengths when they are connected to the contigs. The hole length with the highest frequency in the table is selected as the actual hole length.
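  • The frequency-table step above amounts to taking the most frequent implied hole length. The sketch below assumes the per-read implied lengths have already been computed (a negative value meaning the two contigs overlap); the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def actual_hole_length(implied_lengths: Iterable[int]) -> Tuple[int, Counter]:
    """Return the most frequent implied hole length and the full frequency table."""
    table = Counter(implied_lengths)
    if not table:
        raise ValueError("no read spans the hole")
    length, _count = table.most_common(1)[0]
    return length, table


length, table = actual_hole_length([42, 41, 42, 43, 42])
print(length, dict(table))
```

  • Reads whose implied length equals the selected value are the ones that would then be analyzed base by base.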
  • The bases that these reads place inside the hole may be the true bases of the hole, so all reads that imply the actual hole length are analyzed base by base to determine the base at each position. If the determined actual hole length is less than a first threshold preset by the system, such as 0, the contigs at the two ends are judged to overlap. It is then determined whether the overlap is a repeat; if so, the repeat pattern is determined; otherwise the overlapping end of a contig is trimmed.
  • The present embodiment also searches for other reads that fall within the small hole but do not span it, and aligns them against the reads used to determine the length of the small hole. If the alignment mismatch rate is below the tolerance (usually 3%), every base that the length-determining read places in the hole is considered trustworthy and can be used to fill the hole.
  • If the alignment mismatch rate is above the tolerance (usually 3%), the bases that the length-determining read places in the hole are considered untrustworthy, and the untrustworthy bases are trimmed. This ensures the accuracy of the reads filled into small holes.
  • The hole-filling process for medium holes is described below.
  • The block is set to 6 bp or 12 bp.
  • A block is a window that contains a fixed number of bases and slides along the read one base at a time. Specifically, suppose a window contains X bases. The window first takes the 1st to the Xth base; after the first slide it takes the 2nd to the (X+1)th base, and so on. Each slide moves the window forward by one base, so after the nth slide the window takes the (n+1)th to the (X+n)th base.
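  • The block window described above can be written as a short generator. This is a sketch: the name `sliding_blocks` is hypothetical, and the window size of 3 is used only to keep the printed output short (the text sets the block to 6 bp or 12 bp).

```python
from typing import Iterator


def sliding_blocks(read: str, size: int) -> Iterator[str]:
    """Yield every window of `size` bases, advancing one base per slide."""
    for start in range(len(read) - size + 1):
        yield read[start:start + size]


print(list(sliding_blocks("ACGTAC", 3)))
```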
  • The embodiment of the present invention records the block frequency (block_freq) and the distance between identical blocks (block_dis) for analysis. If block_freq reaches its maximum at a particular block_dis value, and that distance equals the number of bases in the block, the sequence is judged to contain a tandem repeat.
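  • One way to read the block_freq / block_dis criterion is sketched below. It is an interpretation, not the patent's code: distances are measured between consecutive occurrences of the same block, the peak of the distance histogram is compared with the block size, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def block_statistics(seq: str, size: int) -> Tuple[Counter, Counter]:
    """Count each block (block_freq) and the distances between repeats of a block (block_dis)."""
    block_freq: Counter = Counter()
    block_dis: Counter = Counter()
    last_seen: Dict[str, int] = {}
    for start in range(len(seq) - size + 1):
        block = seq[start:start + size]
        block_freq[block] += 1
        if block in last_seen:
            block_dis[start - last_seen[block]] += 1
        last_seen[block] = start
    return block_freq, block_dis


def has_tandem_repeat(seq: str, size: int) -> bool:
    """True if the most frequent distance between identical blocks equals the block size."""
    _freq, block_dis = block_statistics(seq, size)
    if not block_dis:
        return False          # no block occurs twice
    peak_distance, _count = block_dis.most_common(1)[0]
    return peak_distance == size


print(has_tandem_repeat("ACGTTG" * 4, 6))
print(has_tandem_repeat("ACGTTGCATGCCAGTA", 6))
```

  • Under this reading, a repeat whose unit length differs from the block size is not reported, which matches the stated condition that the distance equal the number of bases in the block.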
  • The embodiment of the present invention further infers the tandem repeat mode from the information obtained while judging the tandem repeat: if the sequence contains only one tandem repeat, it is determined to be a single-mode tandem repeat; if it contains multiple tandem repeats, whether or not they cross one another, it is determined to be a multi-mode tandem repeat.
  • The embodiment of the present invention also records the block frequency and determines the repeat status in the hole by calculating the expected depth of a block in the hole and analyzing the block depth distribution in the hole: if the frequency of a block in the hole is several times its expected depth, the block is judged to be a repeat.
  • The overlap calculation first uses a hash method to determine quickly whether any two reads share a common Kmer; reads that share a common Kmer may overlap.
  • A Kmer is defined as a contiguous sequence of bases of length k.
  • The distribution of Kmers is closely related to the genome size, the error rate, and the heterozygosity. Pattern recognition is then performed on each pair of reads that may overlap.
  • the number of blocks can be appropriately raised.
  • For the front-end extension, starting from the node read, all reads that overlap the node read are found, and the read with the smallest overlap with the node read is selected as the seed read. The other reads should then overlap the seed read, and each of these overlaps is necessarily longer than the overlap between the seed read and the node read. If some read does not overlap the seed read, a conflict is judged to have occurred, and it is resolved by replacing the seed read, that is, by finding a new seed read. This ensures the correctness of the seed read that is found.
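  • The seed check above can be sketched as follows. This is a strict interpretation with assumptions stated plainly: overlaps are exact suffix/prefix matches, each candidate is placed by its overlap with the node read, a candidate "overlaps the seed" when the two placed reads agree on every shared position, and a seed is accepted only if all other candidates agree with it. The names `overlap`, `agree`, `pick_seed`, and `min_ov` are hypothetical.

```python
from typing import List, Optional


def overlap(node: str, read: str) -> int:
    """Longest suffix of node that equals a prefix of read."""
    for n in range(min(len(node), len(read)), 0, -1):
        if node.endswith(read[:n]):
            return n
    return 0


def agree(seed: str, n: int, other: str, m: int) -> bool:
    """Do seed (node overlap n) and other (node overlap m) match where they coincide?"""
    shift = m - n                      # other starts `shift` bases before seed
    a, b = (other[shift:], seed) if shift >= 0 else (other, seed[-shift:])
    common = min(len(a), len(b))
    return common > 0 and a[:common] == b[:common]


def pick_seed(node: str, reads: List[str], min_ov: int) -> Optional[str]:
    """Try candidates from smallest overlap upward; replace the seed on conflict."""
    scored = sorted((overlap(node, r), r) for r in reads)
    scored = [(n, r) for n, r in scored if n >= min_ov]
    for i, (n, seed) in enumerate(scored):
        others = [(m, r) for j, (m, r) in enumerate(scored) if j != i]
        if all(agree(seed, n, r, m) for m, r in others):
            return seed
    return None                        # every candidate seed conflicts


node = "GTACGGTC"
print(pick_seed(node, ["GGTCAAGCTT", "ACGGTCAAGC"], min_ov=4))
```

  • A production implementation would presumably tolerate a minority of erroneous reads rather than rejecting every seed, but the text does not specify how, so the sketch stays strict.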
  • the embodiments of the present invention are extended. Treating the seed reading as part of the contig, continue to find the new seed reading as described above. If it can be found, it is judged that the sequence will continue to extend. Otherwise, it is judged that the sequence extension is over, and it is necessary to wait for the extension of the other end to determine two. Whether there is overlap in the end extensions to determine whether the hole can be completed.
  • the forward-extended read sequence may find the sequence that was previously the seed read sequence as its extended read sequence, which will cause an infinite loop extending within the range and treat it as a conflict.
  • the extension is terminated.
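Seed selection with the conflict check can be sketched as below. The sketch assumes equal-length, error-free reads and exact suffix-prefix matching; the names `overlap_len` and `choose_seed` are hypothetical, and the replacement strategy (trying the read with the next-smallest overlap when a conflict occurs) is an assumption, since the text does not say how the new seed read is found.

```python
def overlap_len(left, right, min_len=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0

def choose_seed(node_read, candidates, min_len=3):
    """Pick the read with the smallest node overlap that passes the conflict check."""
    scored = sorted(
        (overlap_len(node_read, read, min_len), idx)
        for idx, read in enumerate(candidates)
    )
    scored = [(n, idx) for n, idx in scored if n > 0]
    for pos, (seed_overlap, seed_idx) in enumerate(scored):
        seed = candidates[seed_idx]
        # Every read with an equal or larger node overlap must also overlap
        # the seed, by at least as much as the seed overlaps the node read.
        consistent = all(
            overlap_len(candidates[idx], seed, min_len) >= seed_overlap
            for _, idx in scored[pos + 1:]
        )
        if consistent:
            return seed
        # Conflict: replace the seed by trying the next candidate (assumption).
    return None
```

A read that appears to overlap the node read but is inconsistent with the other reads causes the first choice to be rejected, which is the situation the conflict rule is meant to catch.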
  • Back-end extension in the embodiment of the present invention is analogous to front-end extension and is not described again here.
  • Conflict detection should be as sensitive as possible, and the reads used for hole filling should have a low error rate.
  • The reads are error-corrected in advance to improve read quality and to ensure the accuracy of both ends of each read.
  • Alignment-rate filtering: a read must have a 100% alignment rate to be used for extension as a seed read.
  • Alignment-rate filtering uses the following strategies:
  • The embodiment of the present invention thereby avoids failures to extend caused by base errors in the seed read.
  • Overlaps between reads are computed by stepwise block extension: select a block from the seed read, choose a target read, and check whether the bases of the block can be found in the target read. If so, move the block in the seed read forward by one unit and compare again, repeating until no match is found. The overlap length between the seed read and the target read is then known. A third threshold, stated as 1. Kmer, is applied to this length to ensure that the overlap between the two reads is not accidental and is genuinely credible. If the previous seed read itself contains a sequencing error, a large number of reads may be filtered out; in that case a loop is provided to replace the previous seed read.
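A minimal sketch of the stepwise block extension follows. The function name `stepwise_overlap` is hypothetical, the block is advanced one base at a time, and exact matching is assumed; the threshold on the resulting length is left to the caller because its value is not clear from the text.

```python
def stepwise_overlap(seed, target, block_len):
    """Estimate the seed/target overlap length by sliding a block one base at a time."""
    for start in range(len(seed) - block_len + 1):
        pos = target.find(seed[start:start + block_len])
        if pos == -1:
            continue  # this block does not occur in the target; try the next one
        steps = 0
        # Move the block forward by one base while it still matches the target
        # at the correspondingly shifted position.
        while (start + steps + block_len <= len(seed)
               and pos + steps + block_len <= len(target)
               and seed[start + steps:start + steps + block_len]
                   == target[pos + steps:pos + steps + block_len]):
            steps += 1
        # The first matching block covers block_len bases; each further step adds one.
        return block_len + steps - 1
    return 0
```

The returned length would then be compared against the third threshold before the overlap is accepted.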
  • Position filtering: reads are located using their paired-end relationship, the position of each read within the hole is calculated, and reads are filtered by position, which reduces conflicts caused by long repeats within the hole.
  • The embodiment of the present invention can apply strict filtering conditions.
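Position filtering could be sketched as follows. Everything about the coordinate model is an assumption made for illustration: the mate is taken to map in forward orientation on the contig to the left of the hole, coordinates are 0-based, `insert_size` is the outer fragment length, and `tolerance` is a hypothetical parameter expressing how strict the filter is.

```python
def expected_hole_offset(mate_start, contig_len, insert_size, read_len):
    """Expected start of a read inside the hole (0 = first hole base)."""
    fragment_end = mate_start + insert_size
    return fragment_end - read_len - contig_len

def filter_by_position(reads, contig_len, insert_size, current_offset, tolerance):
    """Keep reads whose expected position lies near the current extension point.

    `reads` is a list of (name, mate_start, read_len) tuples.
    """
    kept = []
    for name, mate_start, read_len in reads:
        offset = expected_hole_offset(mate_start, contig_len, insert_size, read_len)
        if abs(offset - current_offset) <= tolerance:
            kept.append(name)
    return kept
```

A smaller `tolerance` corresponds to the strict filtering conditions mentioned above, at the cost of discarding reads whose true insert size deviates from the expected value.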
  • Read-length filtering: among the reads, PE reads are short, whereas single-end reads (single reads) are usually longer. The longer single-end reads overlap one end of the hole.
  • Short paired-end reads are used preferentially for extension in the interior of the hole, and long single-end reads are used preferentially for extension at the two ends of the hole.
  • End filtering: if, judged against the expected hole length, the extending sequence overlaps the other end too early, a non-overlapping read is selected instead, that is, a read located immediately after the extending read, having no overlap with it, and whose placement does not conflict with the expected hole length. This ensures that the repeat region is crossed.
  • End filtering can occur only once.
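The "too early" test in end filtering can be illustrated with a small length calculation. The function name `is_premature_overlap` and the `tolerance` parameter are hypothetical; the text does not state how far below the expected hole length an overlap must fall to count as premature.

```python
def is_premature_overlap(orig_first_len, new_first_len, overlap,
                         expected_hole_len, tolerance):
    """True if the contigs meet after filling far fewer bases than expected.

    overlap is the number of bases the extended first contig shares with
    the second contig; those bases lie outside the hole.
    """
    filled = new_first_len - orig_first_len - overlap
    return filled < expected_hole_len - tolerance
```

For example, with an expected hole length of 800 bp and a tolerance of 100 bp, meeting the other contig after filling only 250 bp would be flagged as premature, so the connection is skipped and a non-overlapping read is chosen instead.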
  • Handling and identification of short similar repeats: short similar repeats are usually shorter than 50 bp and lie relatively close together, which can ultimately cause base deletions in the assembled nucleic acid sequence.
  • The embodiment of the present invention preferentially selects a read with a longer overlap as the seed read, which effectively avoids the short-similar-repeat problem.
  • Sequence connection: during hole filling, the embodiment of the present invention requires not only accurate assembly but also accurate connection. Accurate assembly guarantees a low base error rate and is also a precondition for accurate connection. Accurate connection directly determines whether an insertion/deletion ultimately occurs. Extension errors must also be taken into account when connecting.
  • Sequence connections in the embodiment of the present invention are divided into the following three credibility levels according to connection quality:
  • All three credibility levels may occur. The first credibility level has the highest quality, but that does not guarantee that the connection is correct.
  • The second credibility level has the second-highest quality and likewise does not guarantee correctness. The connections within a hole are therefore classified and handled according to the actual situation.
  • Connections within a hole fall into three categories: direct connection of the contigs at the two ends, connection of an extension from one end to the contig at the other end, and connection of the extensions from both ends.
  • If a connection of the first credibility level exists, the first credibility is selected and the sequences are connected; if there is no first credibility but a second credibility exists, the second credibility is selected and the sequences are connected; if there is neither a first nor a second credibility but a third credibility exists, the third credibility is selected and the sequences are connected.
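The tiered selection can be sketched as below, using the three credibility levels as they are defined in the device description (overlap that is not a repeat and is spanned by reads; reads spanning the junction; overlap without supporting evidence). The function names and the boolean inputs are hypothetical simplifications.

```python
def classify_connection(has_overlap, is_repeat, has_spanning_read):
    """Return credibility level 1, 2, or 3 for a candidate connection, or None."""
    if has_overlap and not is_repeat and has_spanning_read:
        return 1  # overlap, not a repeat, and reads span the junction
    if has_spanning_read:
        return 2  # reads span the junction; the sequences may not overlap
    if has_overlap:
        return 3  # overlap only, with no supporting evidence
    return None

def select_connection(candidates):
    """Pick the candidate with the best (lowest-numbered) credibility level.

    `candidates` is a list of (name, has_overlap, is_repeat, has_spanning_read).
    """
    ranked = []
    for name, has_overlap, is_repeat, has_spanning_read in candidates:
        level = classify_connection(has_overlap, is_repeat, has_spanning_read)
        if level is not None:
            ranked.append((level, name))
    if not ranked:
        return None
    level, name = min(ranked)
    return name, level
```

Lower-level candidates are considered only when no higher-level candidate exists, which mirrors the fallback order in the text.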
  • A large hole is divided into multiple holes, each of which is processed in the same way as a medium hole.
  • The longest supported PE insert size is 800 bp.
  • When the hole is longer than 1.5 kb, then after subtracting the length by which the inserts overlap the contigs at the two ends, the two 800 bp inserts cannot overlap each other, so no complete path can be found that fills the large hole entirely.
  • The large hole is therefore divided into several medium holes, the medium holes are assembled separately, and the assembly results are finally connected, as detailed below:
  • Each block is assembled in the same way as a medium hole.
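Dividing a large hole into medium-sized blocks might be sketched as follows. The 1.5 kb limit comes from the text; splitting into equally sized contiguous blocks is an assumption made only for illustration, as the text does not say how block boundaries are chosen, and the name `split_large_hole` is hypothetical.

```python
import math

def split_large_hole(hole_len, max_medium_len=1500):
    """Split a hole into contiguous (start, end) blocks no longer than max_medium_len."""
    if hole_len <= max_medium_len:
        return [(0, hole_len)]
    n_blocks = math.ceil(hole_len / max_medium_len)
    base, extra = divmod(hole_len, n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        # Spread the remainder over the first `extra` blocks.
        size = base + (1 if i < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks
```

Each block would then be filled with the medium-hole procedure and the results joined in order.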
  • A hole can be extended from one end or from both ends.
  • The technical solution of extending from both ends is described below:
  • Figure 2 is a flow chart showing a second embodiment of a method of filling a hole in the assembly of a nucleic acid sequence of the present invention.
  • In this hole-filling method, one end of the hole has a first contig and the other end has a second contig, and the hole is filled by the following steps:
  • Step 201, determine the node reads: find a nucleic acid sequence at the end of the first contig close to the hole as the first node read, and find a nucleic acid sequence at the end of the second contig close to the hole as the second node read;
  • Step 202, select the hole-filling read sets: from the reads used for hole filling, find all reads that overlap the first node read as the first hole-filling read set, and find all reads that overlap the second node read as the second hole-filling read set;
  • Step 203, select the seed reads: select one read from the first hole-filling read set as the first seed read, and select one read from the second hole-filling read set as the second seed read;
  • Specifically, the read with the smallest overlap with the first node read is selected from the first hole-filling read set as the first seed read, and the read with the smallest overlap with the second node read is selected from the second hole-filling read set as the second seed read;
  • Step 204, extension: splice the first seed read onto the first contig to form a new first contig, and splice the second seed read onto the second contig to form a new second contig;
  • Step 205, judgment: determine whether, after extension, the end of the new first contig close to the hole overlaps the end of the new second contig close to the hole;
  • Step 206, loop: if the end of the new first contig close to the hole does not overlap the end of the new second contig close to the hole, repeat the steps of determining the node reads, selecting the hole-filling read sets, selecting the seed reads, and extension on the basis of the new first and second contigs;
  • Step 207, complete the hole filling: if the end of the new first contig close to the hole overlaps the end of the new second contig close to the hole, connect the first contig and the second contig to complete the hole filling.
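Steps 201 to 207 can be sketched as a simple loop. This is a toy version under strong assumptions: reads are error-free and on the same strand, overlaps are exact suffix-prefix matches of at least `min_len` bases, the whole contig end stands in for the node read, and `max_rounds` is a hypothetical safeguard against endless extension. None of the function names come from the source.

```python
def overlap_len(left, right, min_len):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0

def extend_right(contig, reads, min_len):
    """Splice the read with the smallest overlap onto the right end (steps 201-204)."""
    best = None
    for read in reads:
        n = overlap_len(contig, read, min_len)
        if 0 < n < len(read) and (best is None or n < best[0]):
            best = (n, read)
    return None if best is None else contig + best[1][best[0]:]

def extend_left(contig, reads, min_len):
    """Mirror image of extend_right for the contig on the other side of the hole."""
    best = None
    for read in reads:
        n = overlap_len(read, contig, min_len)
        if 0 < n < len(read) and (best is None or n < best[0]):
            best = (n, read)
    return None if best is None else best[1][:-best[0]] + contig

def fill_hole(first, second, reads, min_len=4, max_rounds=100):
    """Extend both contigs until they overlap, then join them; None if stuck."""
    for _ in range(max_rounds):
        n = overlap_len(first, second, min_len)
        if n:  # step 205 succeeded, so perform step 207
            return first + second[n:]
        new_first = extend_right(first, reads, min_len)
        new_second = extend_left(second, reads, min_len)
        if new_first is None and new_second is None:
            return None  # neither end can be extended any further
        first = new_first or first      # step 206: loop on the new contigs
        second = new_second or second
    return None
```

With two short contigs and a handful of reads tiling the region between them, `fill_hole` returns the joined sequence; it returns `None` when no read extends either end.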
  • The present invention determines a nucleic acid sequence on the first contig at one end of the hole as the first node read.
  • The first contig thus extends to form a new first contig. At the same time, a nucleic acid sequence on the second contig at the other end of the hole is determined as the second node read, the read with the smallest overlap with the second node read is found and taken as the second seed read, and the second seed read is spliced onto the second contig to form a new second contig; that is, the second contig extends to form a new second contig. It is then determined whether the end of the new first contig close to the hole overlaps the end of the new second contig close to the hole. If not, the extension process is repeated until the end of the new first contig close to the hole and the end of the new second contig close to the hole overlap, connecting the first
  • The two ends can be extended simultaneously or alternately, or one end can be extended in one round and the other end in the next; details are not repeated here.
  • Fig. 3 is a view showing the structure of an embodiment of a hole-filling device in the assembly of the nucleic acid sequence of the present invention.
  • The device includes: a determining unit 31, a first selecting unit 32, a second selecting unit 33, an extending unit 34, a first judging unit 35, a loop unit 36, a connecting unit 38, an identifying unit 39, a second judging unit 40, a third judging unit 41, and a fourth judging unit 37.
  • The determining unit 31 is configured to determine the node read: one end of the hole has a first contig and the other end has a second contig, and a nucleic acid sequence is found at the end of the first contig close to the hole as the node read.
  • The first selecting unit 32 is configured to select the hole-filling read set: from the reads used for hole filling, all reads that overlap the node read are found as the hole-filling read set.
  • The second selecting unit 33 is configured to select the seed read: one read is selected from the hole-filling read set as the seed read.
  • The extending unit 34 is configured for extension: the seed read is spliced onto the first contig to form a new first contig.
  • The first judging unit 35 is configured for judgment: it determines whether, after extension, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole.
  • The loop unit 36 is configured for looping: if the end of the new first contig close to the hole does not overlap the end of the second contig close to the hole, then on the basis of the new first contig the determining unit 31 again determines the node read, the first selecting unit 32 selects the hole-filling read set, the second selecting unit 33 selects the seed read, and extension is performed.
  • The connecting unit 38 is configured to complete the hole filling: if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, it connects the first contig and the second contig to complete the hole filling. The connecting unit 38 performs sequence connection, which is divided into direct connection of the contigs at the two ends, connection of an extension from one end to the contig at the other end, and connection of the extensions from both ends.
  • The identifying unit 39 is configured to use the sliding-window frequencies of the hole-filling reads within the hole and the distances between identical sliding windows to identify whether a tandem repeat exists in the hole, and to infer the repeat pattern of the sequence in the hole from the tandem repeat.
  • The second judging unit 40 is configured to determine whether the node read and the reads used for hole filling share a common fixed-length short string (kmer), and to apply pattern recognition among the reads that share a kmer to determine the set of all overlapping reads.
  • The third judging unit 41 is configured so that, if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole too early when judged against the expected hole length, the first contig and the second contig are not connected; hole filling continues on the basis of the new first contig, and when the seed read is selected, a non-overlapping read outside the hole-filling read set is chosen as the seed read.
  • The fourth judging unit 37 is configured to judge the credibility of the sequence connection when the connecting unit 38 performs sequence connection. When the first credibility exists, the first credibility is selected and the connecting unit 38 performs the connection; when there is no first credibility but the second credibility exists, the second credibility is selected and the connecting unit 38 performs the connection; when there is neither the first nor the second credibility but the third credibility exists, the third credibility is selected and the connecting unit 38 performs the connection.
  • The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span the junction in support.
  • The second credibility means that reads span the junction of the two connected sequences, which may not overlap.
  • The third credibility means that the two connected sequences overlap but no evidence supports the overlap region.
  • The hole-filling device in nucleic acid sequence assembly of the present invention operates as follows. Before hole filling, the identifying unit 39 determines whether a tandem repeat exists in the hole and infers the repeat pattern of the sequence, to facilitate hole filling. During hole filling, one end of the hole has a first contig and the other end has a second contig. First, the determining unit 31 finds a nucleic acid sequence at the end of the first contig close to the hole as the node read. The second judging unit 40 then determines the overlap regions between the node read and the reads used for hole filling. Based on that result, the first selecting unit 32 finds, among the reads used for hole filling, all reads that overlap the node read as the hole-filling read set; the second selecting unit 33 then selects from that set the read with the smallest overlap with the node read as the seed read, and the extending unit 34 splices the seed read onto the first contig to form a new first contig.
  • The first judging unit 35 determines whether the end of the new first contig close to the hole overlaps the end of the second contig close to the hole. If it does not, the loop unit 36 repeats the process of determining the node read, selecting the hole-filling read set, selecting the seed read, and extension on the basis of the new first contig. If it does, the connecting unit 38 connects the first contig to the second contig and completes the hole filling.
  • The third judging unit 41 is configured so that, if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole too early when judged against the expected hole length, the first contig and the second contig are not connected; the process of determining the node read, selecting the hole-filling read set, and selecting the seed read continues on the basis of the new first contig, and when the seed read is selected, a non-overlapping read outside the hole-filling read set is chosen as the seed read.
  • The fourth judging unit 37 judges the credibility of the sequence connection when the connecting unit 38 performs sequence connection. When the first credibility exists, the first credibility is selected and the connecting unit 38 performs the connection; when there is no first credibility but the second credibility exists, the second credibility is selected and the connecting unit 38 performs the connection; when there is neither the first nor the second credibility but the third credibility exists, the third credibility is selected and the connecting unit 38 performs the connection. The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span the junction in support; the second credibility means that reads span the junction of the two connected sequences, which may not overlap; the third credibility means that the two connected sequences overlap but no evidence supports the overlap region.
  • The connecting unit 38 performs sequence connection, which is divided into direct connection of the contigs at the two ends, connection of an extension from one end to the contig at the other end, and connection of the extensions from both ends.
  • The medium-hole processing method is the core: a small hole can be converted into a medium hole for processing, and a large hole is fully decomposed into medium holes that are processed by the medium-hole method.
  • Holes of different size classes are handled in different ways, the hole-filling process is refined down to the individual hole, and the information within each hole is fully used to complete the filling, so that all holes can be filled effectively. This greatly improves hole-filling accuracy, saves hole-filling time and memory, and benefits the development and adoption of gene sequencing technology.

Abstract

Disclosed are a method and a device for gap closure in nucleotide sequence assembly, one end of a gap having a first contig, and the other end of the gap having a second contig. The method comprises the following steps of: finding a nucleotide sequence at one end of the first contig close to the gap and taking the nucleotide sequence as a node read; using all reads that overlap the node read as a gap closure read set, and selecting from the set a read as a seed read; splicing the seed read and the first contig, so as to form a new first contig; determining whether one end of the new first contig after the extension process close to the gap overlaps one end of the second contig close to the gap; if the two ends do not overlap, performing gap closure continuously on the basis of the new first contig; and if the two ends overlap, connecting the first contig and the second contig, so as to complete the gap closure.

Description

Method and device for filling holes in nucleic acid sequence assembly
[Technical Field]
The invention relates to the field of genetic engineering, and in particular to a method and device for filling holes in nucleic acid sequence assembly.
[Background]
In the field of gene sequencing, the spread of second-generation sequencing technology has steadily lowered sequencing costs and driven whole-genome sequencing of more species. The principle of second-generation sequencing means that sequenced fragments are short. In practice, a sequenced fragment has only a few dozen to about one hundred bases, which undoubtedly makes the analysis of sequencing data more difficult.
When analyzing sequencing data, genome assembly is generally used. Genome assembly usually first masks repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships between non-repeat regions; however, the unassembled regions between non-repeat regions readily form gaps, referred to here as holes.
In the prior art, for both genome assembly based on Sanger sequencing and genome assembly based on second-generation sequencers such as Solexa, the initial assembly contains a large number of unassembled regions, which are often closely related to sequence repeats. Hole-related sequence repeats can be divided into tandem repeats and transposon repeats. Existing hole-filling programs can handle simple transposon repeats fairly accurately but have difficulty with long tandem repeats.
In terms of assembly methods, the prior art has two main approaches to the long tandem repeat problem: local assembly based on overlap, and local assembly based on De Bruijn graphs.
Overlap-based local assembly has difficulty identifying the exact site where a repeat causes a conflict, so it readily introduces insertions/deletions (indels).
Local assembly based on De Bruijn graphs can identify the conflict sites caused by repeats but has difficulty resolving the conflicts and must break the sequence there, which reduces the number of holes filled.
Clearly, neither approach handles long tandem repeats effectively.
In terms of assembly tools, the prior art has two main hole-filling programs: Gapcloser, which corresponds to overlap-based local assembly, and SOAPdenovo, which corresponds to local assembly based on De Bruijn graphs.
However, both programs have drawbacks:
First, the hole-filling software Gapcloser performs local assembly on base sequence segments using the overlap method. Because it does not account for the complexity of the situation inside the hole, it is prone to errors on complex holes, which lowers overall accuracy. Moreover, Gapcloser consumes a large amount of memory and is slow, so it is unsuitable for primary hole filling of large genomes.
Second, the hole-filling stage of the SOAPdenovo assembler performs a secondary assembly of the region inside the hole based on De Bruijn graphs. Although it can effectively resolve short holes, the number of holes it fills is limited.
In summary, how to fill holes in a nucleic acid sequence effectively, improve hole-filling accuracy and hole-filling rate, and save hole-filling time and memory is one of the research directions in the field of gene sequencing.
[Summary of the Invention]
The main technical problem solved by the present invention is to provide a method and device for filling holes in nucleic acid sequence assembly that can effectively fill the holes of a nucleic acid sequence to be completed and improve hole-filling accuracy and hole-filling rate.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a method for filling holes in nucleic acid sequence assembly, where one end of the hole has a first contig and the other end has a second contig, comprising the following steps: determining the node read: finding a nucleic acid sequence at the end of the first contig close to the hole as the node read; selecting the hole-filling read set: from the reads used for hole filling, finding all reads that overlap the node read as the hole-filling read set; selecting the seed read: selecting one read from the hole-filling read set as the seed read; extension: splicing the seed read onto the first contig to form a new first contig; judgment: determining whether, after extension, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole; looping: if the end of the new first contig close to the hole does not overlap the end of the second contig close to the hole, repeating the steps of determining the node read, selecting the hole-filling read set, selecting the seed read, and extension on the basis of the new first contig; completing the hole filling: if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, connecting the first contig and the second contig to complete the hole filling.
Wherein the step of selecting the hole-filling read set includes: determining whether the node read and the reads used for hole filling share a common fixed-length short string, and applying pattern recognition among the reads that share a common fixed-length short string to determine the set of all overlapping reads.
Wherein the step of applying pattern recognition for comparison among the reads that share a common fixed-length short string includes: obtaining the overlap length between the reads by stepwise sliding-window extension among the reads that share the common fixed-length short string.
Wherein, before the step of determining the node read, the method further includes: using the sliding-window frequencies of the hole-filling reads within the hole and the distances between identical sliding windows to identify whether a tandem repeat exists in the hole, and inferring the repeat pattern of the sequence in the hole from the tandem repeat.
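One way to illustrate the use of distances between identical sliding windows is the sketch below, which tallies those distances within a single sequence. The function name `tandem_periods` and the per-sequence scope are assumptions; how the actual method combines window frequencies across all hole-filling reads is not specified in the text.

```python
from collections import defaultdict

def tandem_periods(seq, window):
    """Return {distance: count} for consecutive occurrences of identical windows."""
    positions = defaultdict(list)
    for i in range(len(seq) - window + 1):
        positions[seq[i:i + window]].append(i)
    periods = defaultdict(int)
    for occurrences in positions.values():
        # A window that recurs at a constant distance suggests a tandem repeat
        # whose unit length equals that distance.
        for a, b in zip(occurrences, occurrences[1:]):
            periods[b - a] += 1
    return dict(periods)
```

A single dominant distance would be consistent with one repeat unit, while several distinct distances may point to more than one repeat pattern in the hole.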
Wherein the step of selecting the seed read includes: applying alignment-rate filtering to the hole-filling reads, that is, selecting a hole-filling read with a 100% alignment rate as the seed read.
Wherein the step of selecting the seed read includes: handling and identifying short similar repeats in the hole-filling reads, that is, when a short similar repeat is identified, selecting a hole-filling read with a longer overlap as the seed read.
Wherein the step of selecting the hole-filling read set includes: applying position filtering to the hole-filling reads, that is, locating reads by their paired-end relationship, calculating the position of each hole-filling read within the hole, and filtering reads by position.
Wherein the step of selecting the hole-filling read set includes: applying length filtering to the hole-filling reads, that is, using short paired-end reads in the interior of the hole and long single-end reads at the two ends of the hole.
Wherein, if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, then before the step of connecting the first contig and the second contig the method includes: if, judged against the expected hole length, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole too early, the step of connecting the first contig and the second contig is not performed; instead, the steps of determining the node read, selecting the hole-filling read set, and selecting the seed read are repeated on the basis of the new first contig, and in the step of selecting the seed read a non-overlapping read outside the hole-filling read set is chosen as the seed read.
Wherein the step of completing the hole filling includes: performing sequence connection, where the sequence connection is a direct connection of the contigs at the two ends, a connection of an extension from one end to the contig at the other end, or a connection of the extensions from both ends.
Wherein, before the step of performing the sequence connection, the method includes: judging the credibility of the accuracy of the sequence connection. When the first credibility exists, the first credibility is selected and the sequences are connected; when there is no first credibility but the second credibility exists, the second credibility is selected and the sequences are connected; when there is neither the first nor the second credibility but the third credibility exists, the third credibility is selected and the sequences are connected. The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span the junction in support; the second credibility means that reads span the junction of the two connected sequences, which may not overlap; the third credibility means that the two connected sequences overlap but no evidence supports the overlap region.
To solve the above technical problem, another technical solution adopted by the present invention is to provide a hole-filling method in nucleic acid sequence assembly, wherein one end of the hole has a first contig and the other end has a second contig, the method including the following steps: determining node reads: finding a nucleic acid sequence at the end of the first contig near the hole as a first node read, and finding a nucleic acid sequence at the end of the second contig near the hole as a second node read; selecting hole-filling read sets: from the reads used for hole filling, finding all reads that overlap the first node read as a first hole-filling read set, and finding all reads that overlap the second node read as a second hole-filling read set; selecting seed reads: selecting one read from the first hole-filling read set as a first seed read, and selecting one read from the second hole-filling read set as a second seed read; extension: splicing the first seed read with the first contig to form a new first contig, and splicing the second seed read with the second contig to form a new second contig; judgment: judging whether, after the extension, the end of the new first contig near the hole overlaps the end of the new second contig near the hole; loop: if the end of the new first contig near the hole does not overlap the end of the new second contig near the hole, continuing, on the basis of the new first and second contigs, to perform the steps of determining node reads, selecting hole-filling read sets, selecting seed reads, and extension; completing the hole filling: if the end of the new first contig near the hole overlaps the end of the new second contig near the hole, connecting the first contig and the second contig to complete the hole filling.
To solve the above technical problem, yet another technical solution adopted by the present invention is to provide a hole-filling device in nucleic acid sequence assembly, the device including: a determining unit for determining a node read, wherein one end of the hole has a first contig and the other end has a second contig, and a nucleic acid sequence found at the end of the first contig near the hole serves as the node read; a first selecting unit for selecting a hole-filling read set by finding, among the reads used for hole filling, all reads that overlap the node read; a second selecting unit for selecting a seed read by choosing one read from the hole-filling read set; an extension unit for splicing the seed read with the first contig to form a new first contig; a first judging unit for judging whether, after the extension, the end of the new first contig near the hole overlaps the end of the second contig near the hole; a loop unit for continuing, on the basis of the new first contig, the determination of the node read, the selection of the hole-filling read set, the selection of the seed read, and the extension, if the end of the new first contig near the hole does not overlap the end of the second contig near the hole; and a connecting unit for connecting the first contig and the second contig to complete the hole filling when the end of the new first contig near the hole overlaps the end of the second contig near the hole.
Wherein, the device includes a second judging unit for judging whether the node read and the reads used for hole filling share a common fixed-length short string (kmer), and for determining, by pattern recognition among the reads that share a common fixed-length short string, the set of all overlapping reads.
Wherein, the second judging unit is further used to obtain the overlap length between the reads that share the common fixed-length short string by stepwise sliding-window extension.
Wherein, the device includes an identifying unit for using the sliding-window frequencies of the hole-filling reads within the hole, together with the distances between identical sliding windows, to identify whether a tandem repeat exists within the hole, and for further inferring the repeat pattern of the in-hole sequence from the tandem repeat.
Wherein, the second selecting unit is further used to filter the hole-filling reads by alignment rate, that is, to select a hole-filling read with a 100% alignment rate as the seed read.
Wherein, the second selecting unit is further used to identify and handle short similar repeats among the hole-filling reads, that is, when a short similar repeat is identified, to select the short hole-filling read with the longer overlap as the seed read.
Wherein, the first selecting unit is further used to filter the hole-filling reads by position, that is, to locate the reads by their paired-end relationship, calculate the position of each hole-filling read within the hole, and filter the reads according to position.
Wherein, the first selecting unit is further used to filter the hole-filling reads by length, that is, to use short paired-end reads in the interior of the hole and long single-end reads at the two ends of the hole.
Wherein, the device includes a third judging unit which, according to the estimated hole length, if the end of the new first contig near the hole overlaps the end of the second contig near the hole prematurely, does not allow the first contig and the second contig to be connected, but continues the hole filling on the basis of the new first contig, and, when the seed read is selected, chooses a non-overlapping read outside the hole-filling read set as the seed read.
Wherein, the connecting unit is specifically used, when the end of the new first contig near the hole overlaps the end of the second contig near the hole, to perform a direct connection of the contigs at the two ends, a connection of the extension from one end with the contig at the other end, or a connection of the extended sequences from both ends;
The device includes a fourth judging unit for judging the confidence in the accuracy of the sequence connection when the connecting unit connects the sequences: when the first confidence level is available, the first confidence level is selected and the connecting unit performs the sequence connection; when the first confidence level is not available but the second is, the second confidence level is selected and the connecting unit performs the sequence connection; when neither the first nor the second confidence level is available but the third is, the third confidence level is selected and the connecting unit performs the sequence connection. The first confidence level means that the two connected sequences overlap, the overlap is not a repeat, and reads span the junction in support of it; the second confidence level means that reads span the junction of the two connected sequences, although the two sequences may not overlap; the third confidence level means that the two connected sequences overlap but no evidence supports the overlap region.
To solve the above technical problem, yet another technical solution adopted by the present invention is to provide a hole-filling device in nucleic acid sequence assembly, the device including: a determining unit for determining node reads, wherein one end of the hole has a first contig and the other end has a second contig, a nucleic acid sequence found at the end of the first contig near the hole serves as a first node read, and a nucleic acid sequence found at the end of the second contig near the hole serves as a second node read; a first selecting unit for selecting hole-filling read sets by finding, among the reads used for hole filling, all reads that overlap the first node read as a first hole-filling read set, and all reads that overlap the second node read as a second hole-filling read set; a second selecting unit for selecting seed reads by choosing one read from the first hole-filling read set as a first seed read and one read from the second hole-filling read set as a second seed read; an extension unit for splicing the first seed read with the first contig to form a new first contig, and splicing the second seed read with the second contig to form a new second contig; a first judging unit for judging whether, after the extension, the end of the new first contig near the hole overlaps the end of the new second contig near the hole; a loop unit for continuing, on the basis of the new first and second contigs, the determination of the node reads, the selection of the hole-filling read sets, the selection of the seed reads, and the extension, if the end of the new first contig near the hole does not overlap the end of the new second contig near the hole; and a connecting unit for connecting the first contig and the second contig to complete the hole filling when the end of the new first contig near the hole overlaps the end of the new second contig near the hole.
The beneficial effects of the present invention are as follows. In contrast to prior-art hole-filling methods, which fill only a limited number of holes and with low accuracy, the present invention determines a nucleic acid sequence on the first contig at one end of the hole as a node read, finds the read that has the minimum overlap with the node read, takes that read as the seed read, and splices the seed read with the first contig to form a new first contig; that is, the first contig is extended into a new first contig. It then judges whether the end of the new first contig near the hole overlaps the end of the second contig near the hole. If not, the extension is repeated until the end of the new first contig near the hole overlaps the end of the second contig near the hole, whereupon the first contig and the second contig are connected to complete the hole filling. Extending the contig cyclically in this way not only greatly increases the number of holes filled and the efficiency of filling, but also improves the accuracy of hole filling and saves time.
[Description of the Drawings]
Figure 1 is a flow chart of a first embodiment of the hole-filling method in nucleic acid sequence assembly according to the present invention;
Figure 2 is a flow chart of a second embodiment of the hole-filling method in nucleic acid sequence assembly according to the present invention;
Figure 3 is a schematic structural diagram of an embodiment of the hole-filling device in nucleic acid sequence assembly according to the present invention.
The English terms used herein, their translated names where these differ, and their definitions are as follows:
PE read (paired-end read): the sequences of the two ends of a longer DNA fragment, obtained by sequencing after a paired-end library construction method has captured the two ends and the distance between them.
read: a base sequence produced during sequencing.
block (window): a nucleotide sequence of a given length selected artificially on a DNA sequence.
contig: a linear, ordered sequence formed by a set of reads through their overlap relationships.
overlap: the part that two sequences have in common during sequence splicing.
kmer (fixed-length short string): a DNA sequence of length K; K is usually 17.
single read (single-end read): sequence information obtained mainly by Sanger sequencing, namely the sequence of one end of a longer DNA fragment or the full read-through of a shorter sequence.
scaffold: the result of joining contigs using the linking information of paired-end reads from plasmids, BACs, mRNA, or other sources; the contigs within a scaffold are ordered and oriented.
gap (hole): genome assembly usually first masks repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships among the non-repeat regions; the unassembled regions between non-repeat regions form gaps, referred to as hole regions.
repeat (sequence repeat): a nucleotide sequence that occurs repeatedly in the genome sequence.
indel (insertion/deletion): the insertion or deletion of a stretch of sequence, which changes the structure of the DNA sequence.
[Detailed Description]
The present invention is described in detail below with reference to the drawings and embodiments.
Figure 1 shows a flow chart of a first embodiment of the hole-filling method in nucleic acid sequence assembly according to the present invention. In the hole-filling method, one end of the hole has a first contig and the other end has a second contig, and the method includes the following steps:
Step 101, determining a node read: finding a nucleic acid sequence at the end of the first contig near the hole as the node read;
Step 102, selecting a hole-filling read set: from the reads used for hole filling, finding all reads that overlap the node read as the hole-filling read set;
Here, the step of selecting the hole-filling read set involves the calculation of read overlaps described later (see step B2 below);
Step 103, selecting a seed read: selecting one read from the hole-filling read set as the seed read;
Here, the read that has the minimum overlap with the node read is selected from the hole-filling read set as the seed read; in other embodiments, of course, a read other than the minimum-overlap read may be selected as the seed read, for example a read with a somewhat longer overlap;
Step 104, extension: splicing the seed read with the first contig to form a new first contig;
Step 105, judgment: judging whether, after the extension, the end of the new first contig near the hole overlaps the end of the second contig near the hole;
Step 106, loop: if the end of the new first contig near the hole does not overlap the end of the second contig near the hole, continuing, on the basis of the new first contig, to perform the steps of determining a node read, selecting a hole-filling read set, selecting a seed read, and extension;
Step 107, completing the hole filling: if the end of the new first contig near the hole overlaps the end of the second contig near the hole, connecting the first contig and the second contig to complete the hole filling.
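Steps 101 to 107 can be sketched as a simple loop. The sketch below is a minimal illustration under stated assumptions, not the implementation of the invention: sequences are plain strings on one strand, overlaps are exact suffix-prefix matches, and the names `suffix_prefix_overlap`, `fill_hole`, `node_len`, `min_overlap` and `max_rounds` are all hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def fill_hole(first, second, reads, node_len=8, min_overlap=4, max_rounds=100):
    """Extend the first contig read by read until it overlaps the second contig.

    Returns the connected sequence, or None if the hole cannot be closed.
    """
    contig = first
    for _ in range(max_rounds):
        # Steps 105 and 107: connect as soon as the extended contig overlaps the second contig.
        n = suffix_prefix_overlap(contig, second, min_overlap)
        if n:
            return contig + second[n:]
        # Step 101: the node read is the end of the contig next to the hole.
        node = contig[-node_len:]
        # Step 102: collect reads that overlap the node read and reach beyond it.
        candidates = []
        for read in reads:
            k = suffix_prefix_overlap(node, read, min_overlap)
            if k and len(read) > k:
                candidates.append((k, read))
        if not candidates:
            return None
        # Step 103: take the read with the smallest overlap as the seed read.
        k, seed = min(candidates)
        # Step 104: splice the seed read onto the contig; step 106 is the next loop round.
        contig += seed[k:]
    return None
```

Choosing the smallest overlap gives the longest extension per round, which is one plausible reason for the default choice in step 103. Exact matching is a simplification made here; real reads contain sequencing errors and repeats, which the filtering and confidence rules in the text are meant to address.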
As can be seen from the above, in contrast to prior-art hole-filling methods, which fill only a limited number of holes and with low accuracy, the present invention determines a nucleic acid sequence on the first contig at one end of the hole as a node read, finds the read that has the minimum overlap with the node read, takes that read as the seed read, and splices the seed read with the first contig to form a new first contig; that is, the first contig is extended into a new first contig. It then judges whether the end of the new first contig near the hole overlaps the end of the second contig near the hole. If not, the extension is repeated until the end of the new first contig near the hole overlaps the end of the second contig near the hole, whereupon the first contig and the second contig are connected to complete the hole filling. Extending the contig cyclically in this way not only greatly increases the number of holes filled and the efficiency of filling, but also improves the accuracy of hole filling and saves time.
In other embodiments, the step of selecting the hole-filling read set includes: judging whether the node read and the reads used for hole filling share a common fixed-length short string (kmer), and determining, by pattern recognition among the reads that share a common fixed-length short string, the set of all overlapping reads. This approach quickly finds the set of all overlapping reads. Of course, methods other than the common fixed-length short string may also be used; they are not described further here.
In the step of judging whether the node read and the reads used for hole filling share a common fixed-length short string, an algorithm such as hashing may be used to make the judgment. The step of comparing by pattern recognition may obtain the overlap length between the reads that share the common fixed-length short string by stepwise sliding-window extension.
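One way to picture the two stages just described is a hash index of kmers followed by a window-by-window check of the implied alignment. The sketch below is an assumption-laden illustration: the function names, the exact-match comparison, and the use of windows of width k are choices made here, not details given in the text.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Hash every kmer of every read to the (read index, position) pairs where it occurs."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((rid, pos))
    return index


def overlap_by_extension(node, read, node_pos, read_pos, k):
    """Check the alignment implied by a shared kmer, one window of width k at a time.

    Returns the overlap length, or 0 if any window disagrees.
    """
    shift = node_pos - read_pos  # node coordinate at which the read would start
    if shift < 0:
        return 0  # the read would start before the node read
    length = min(len(node) - shift, len(read))
    for start in range(0, length, k):
        end = min(start + k, length)
        if node[shift + start:shift + end] != read[start:end]:
            return 0
    return length


def overlapping_reads(node, reads, k):
    """Map read index to overlap length for every read that overlaps the node read."""
    index = build_kmer_index(reads, k)
    best = {}
    for node_pos in range(len(node) - k + 1):
        for rid, read_pos in index.get(node[node_pos:node_pos + k], []):
            n = overlap_by_extension(node, reads[rid], node_pos, read_pos, k)
            if n > best.get(rid, 0):
                best[rid] = n
    return best
```

The index restricts the expensive comparison to reads that already share a kmer with the node read, which is where the speed-up claimed in the text would come from.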
In the embodiments of the present invention, the level of a hole is obtained from the size of the hole and the criteria set in the system. Holes in the gene sequence are divided into three levels, namely small holes, medium holes, and large holes, and each hole is filled according to its level and the corresponding base sequence segments. Holes are classified as follows: a hole shorter than 100 bp is defined as a small hole, a hole between 100 bp and 1.5 kb in length is defined as a medium hole, and a hole longer than 1.5 kb is defined as a large hole. Of course, this is only one possible definition of the hole levels; the sizes given are merely exemplary and are not limiting.
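The classification above amounts to two thresholds. A minimal sketch follows; the function name `classify_hole` and the parameter names are hypothetical, and the default thresholds are the exemplary 100 bp and 1.5 kb values from the text.

```python
def classify_hole(length_bp, small_max=100, medium_max=1500):
    """Classify a hole by its length in base pairs using the exemplary thresholds."""
    if length_bp < small_max:
        return "small"   # shorter than 100 bp
    if length_bp <= medium_max:
        return "medium"  # from 100 bp up to 1.5 kb
    return "large"       # longer than 1.5 kb
```

Keeping the thresholds as parameters reflects the statement that the sizes are exemplary rather than fixed.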
In a specific implementation, medium holes are filled using the first embodiment of the hole-filling method in nucleic acid sequence assembly described above, and in addition:
I. Filling a hole whose level is small includes the following steps:
1) finding the reads that fall within the small hole and can be used to determine its length, and calculating the actual hole length of the small hole from those reads;
2) judging whether the actual hole length is greater than a first threshold preset in the system, and, if it is, using the bases of those reads to fill the small hole.
II. Filling a hole whose level is large includes the following steps:
1) obtaining the position of each read within the large hole according to the paired-end relationship, sorting the reads by position, and treating each stretch covered by consecutive reads as one block;
2) assembling each block in the same way as a medium hole is processed;
3) joining the assembly results of the blocks to obtain the sequence within the large hole.
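Step 1) of the large-hole procedure, sorting reads by position and splitting them into blocks of continuous coverage, can be sketched as follows. The representation of a placed read as a `(start, sequence)` pair and the name `group_into_blocks` are assumptions made for illustration; reads that merely abut are treated here as continuous, which is also an assumption.

```python
def group_into_blocks(placed_reads):
    """Split reads into blocks of continuous coverage.

    placed_reads is a list of (start, sequence) pairs, where start is the
    estimated position of the read inside the hole.
    """
    blocks = []
    block_end = None
    for start, seq in sorted(placed_reads):
        end = start + len(seq)
        if block_end is None or start > block_end:
            # Coverage is interrupted, so a new block begins.
            blocks.append([])
            block_end = end
        else:
            block_end = max(block_end, end)
        blocks[-1].append((start, seq))
    return blocks
```

Each returned block could then be assembled in the same way as a medium hole, and the block results joined, as in steps 2) and 3).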
The above steps are described in more detail, one by one, below.
First, the scaffolds in which the gene sequence holes occur are obtained and analyzed. An original scaffold, once broken up, yields contigs, and the space between two contigs is a hole. By reading the contigs used for hole filling, the embodiments of the present invention can accurately obtain the size of each hole and the contigs before and after it. The length and sequence of each contig, as well as information on the holes before and after each contig, can be obtained at the same time.
In a specific implementation, the embodiments of the present invention also partition all the acquired nucleic acid sequence holes and contigs according to the user's settings, and store each contig together with its associated reads in the corresponding folder. For example, if the user specifies 4 folders, all the acquired nucleic acid sequence holes and contigs are divided into 4 parts, 4 folders are created, and the associated contigs and reads are stored, in one-to-one correspondence, in the partitioned folders. After this partitioning, each folder contains the contigs and reads used for hole filling, so that during subsequent hole filling they can be obtained directly from the corresponding folder. Clearly, this partitioning reduces the memory originally required by a quarter and saves space; it also shortens the search time during hole filling and thus reduces the time the hole filling takes.
Next, the reads used for hole filling are read for each nucleic acid sequence hole. In the embodiments of the present invention, most of the reads used for hole filling are PE reads from Solexa sequencing; the remainder are long single-end reads from Sanger sequencing.
The PE reads support one another. The two reads of a PE pair come from the two ends of an insert, and the inserts used for hole filling generally have sizes of 180 bp, 500 bp, and 800 bp. Through high-throughput, high-coverage sequencing, the embodiments of the present invention can reconstruct an insert from the overlap relationships among multiple PE reads. Therefore, for a given nucleic acid sequence hole, if a read overlaps the contig at one end of the hole and its orientation agrees with that of the contig, then, if the read is a PE read, the read paired with it falls either within the hole or on the contig after the hole, and the hole can be filled.
As for long reads, because a long read is itself long, it can span a nucleic acid sequence hole of small length. If every base of the long read is reliable, the bases at each position of the long read can be used to fill the small hole accurately.
In the embodiments of the present invention, for each read acquired within a nucleic acid sequence hole, the following are obtained at the same time: the positional relationship between the read and the hole, the contig and scaffold to which the read belongs, and the sequence of the read itself.
Based on the hole levels described above, hole filling specifically includes: A, filling small holes; B, filling medium holes; and C, filling large holes. The filling process for each level is described below.
A. For a small hole, the reads that fall within the hole are found first. All reads within the small hole are collected and analyzed, and those that overlap the contigs on both sides of the hole are selected and used to calculate the actual hole length. Because such a read falls within the hole and overlaps the contigs on both sides, removing the parts that overlap the two contigs leaves the in-hole sequence, so these reads can be used to calculate the actual length of the hole. Specifically, each read that spans the hole yields one hole length, and all such reads together give a frequency table that characterizes the range of the hole length. A frequency table arises because possible errors in the connection cause different reads to indicate different hole lengths when joined to the contigs. The hole length with the highest frequency in the table is taken as the actual hole length.
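The frequency-table estimate described above can be sketched with a counter. The sketch assumes that each spanning read is summarized as a triple of read length, overlap with the left contig, and overlap with the right contig; this representation and the name `estimate_hole_length` are hypothetical.

```python
from collections import Counter


def estimate_hole_length(spanning_reads):
    """Return (most frequent hole length, frequency table) for reads spanning the hole.

    spanning_reads holds (read_length, left_overlap, right_overlap) triples.
    What remains of a read after removing both overlaps is the in-hole part.
    """
    table = Counter(length - left - right for length, left, right in spanning_reads)
    if not table:
        return None, table
    best_length, _count = table.most_common(1)[0]
    return best_length, table
```

With this representation a result of zero or less would mean that the two overlaps together cover the whole read, that is, the contig ends themselves overlap.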
Once the actual hole length has been obtained, if it is greater than the first threshold preset in the system, for example 0, the bases of the in-hole sequences that indicate this hole length may be the true bases of the small hole, and all reads indicating the actual hole length can be analyzed base by base to determine the base at each position. If the actual hole length is less than the first threshold preset in the system, for example 0, the two contig ends are judged to overlap, and it is further judged whether the overlap is a repeat: if it is, its repeat pattern is determined; otherwise the overlap length is trimmed from the end of a contig.
In a specific implementation, because few reads span a small gap, the reliability of the bases on the reads used to determine the small-gap length constrains whether those reads can be used for gap filling. To ensure the accuracy of the sequence filled into the gap, this embodiment looks for other reads that fall within the small gap but do not span it, and aligns them against the reads used to determine the gap length. If the mismatch rate of the alignment is below the tolerance (typically 3%), every base of the in-gap portion of the length-determining read is considered reliable and can be used for gap filling. If the mismatch rate is above the tolerance (typically 3%), the in-gap bases of that read are considered unreliable and the unreliable part is trimmed off. This ensures the accuracy of the reads filled into the small gap.
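A minimal sketch of the 3% check is given below. It assumes the supporting reads have already been placed against the spanning read, so that each pair of segments is aligned position by position; the function names are hypothetical, and treating a rate exactly equal to the tolerance as reliable is an assumption, since the text only covers "less than" and "greater than".

```python
def mismatch_rate(segment_a, segment_b):
    """Fraction of mismatching positions between two aligned segments."""
    n = min(len(segment_a), len(segment_b))
    if n == 0:
        raise ValueError("segments must not be empty")
    mismatches = sum(x != y for x, y in zip(segment_a, segment_b))
    return mismatches / n


def in_gap_bases_trusted(aligned_pairs, tolerance=0.03):
    """True if every supporting alignment stays within the tolerance.

    aligned_pairs: list of (spanning_segment, supporting_segment) strings,
    an assumed input layout for this illustration.
    """
    return all(mismatch_rate(a, b) <= tolerance for a, b in aligned_pairs)
```

If the check fails, the unreliable in-gap portion of the spanning read would be trimmed rather than used for filling.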
In this embodiment, not every small gap has reads that can be used to determine its length. When no such read can be found, the small gap is handled with the medium-gap filling procedure of this embodiment, described below.
B. Medium gaps are handled as shown in the first embodiment of the gap filling method in nucleic acid sequence assembly described above. The specific implementation is as follows:
B1) Read-based identification of repeat features. All possible windows (blocks) are extracted from the reads inside the medium gap. In this embodiment the block size is set to 6 bp or 12 bp. A block is a window that contains a fixed number of bases and slides along the read one base at a time. Specifically, if a window contains X bases, it first covers bases 1 to X; after the first slide it covers bases 2 to X+1; and so on. Each slide moves the window forward by one base, so after the n-th slide the window covers bases n+1 to X+n.
In a specific implementation, to identify tandem repeats, this embodiment records the block frequency (block_freq) and the distance between identical blocks (block_dis) for analysis. If block_freq reaches its maximum at some distance block_dis, and that distance equals the number of bases in the block, the sequence is judged to contain a tandem repeat.
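The sliding window and the block_freq/block_dis test can be sketched as follows. This is one simplified reading of the criterion, in which block_dis is taken as the distance between consecutive occurrences of the same block within a single read; the function names are assumptions for the example.

```python
from collections import Counter


def sliding_blocks(read, size):
    """All windows of `size` bases, sliding one base at a time."""
    return [(i, read[i:i + size]) for i in range(len(read) - size + 1)]


def has_tandem_repeat(read, size=6):
    """Apply the block_dis criterion to one read.

    Counts, for every distance between consecutive identical blocks,
    how often that distance occurs (the block_freq per block_dis), and
    reports a tandem repeat when the most frequent distance equals the
    block size, as stated in the text.
    """
    last_seen = {}
    freq_by_distance = Counter()
    for pos, block in sliding_blocks(read, size):
        if block in last_seen:
            freq_by_distance[pos - last_seen[block]] += 1
        last_seen[block] = pos
    if not freq_by_distance:
        return False  # no block occurs twice
    distance, _freq = freq_by_distance.most_common(1)[0]
    return distance == size
```

Following the text literally, a repeat whose unit is shorter than the block (for example a 3 bp unit scanned with 6 bp blocks) is not reported by this test, because the dominant distance then differs from the block size.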
Furthermore, this embodiment uses the information obtained while detecting tandem repeats to infer the tandem repeat pattern: if the sequence contains only one kind of tandem repeat, it is judged to be a single-pattern tandem repeat; if several tandem repeats exist, whether interleaved or not, it is judged to be a multi-pattern tandem repeat.
In a specific implementation, to identify tandem repeats, this embodiment also records the block frequency and assesses repeats in the gap by calculating the expected block depth in the gap and analyzing the distribution of block depth in the gap. If the block frequency in the gap is a multiple of the expected block depth, a repeat is present.
B2) Calculation of overlaps between reads. The overlap calculation first uses a hash method to quickly determine whether reads share a common k-mer; reads that share a k-mer may overlap. A k-mer is defined as a contiguous base sequence of length k. In a genome, the k-mer distribution is closely related to genome size, error rate, and heterozygosity. Each pair of reads that may overlap is then aligned using pattern recognition.
In a specific implementation, a maximum overlap is set first and this region is divided into several blocks. Blocks are read from the front end of one read and searched for inside the other read. If a block is found, a detailed comparison is performed to obtain the overlap length; if not, the next block is read. For error tolerance (up to 3 mismatched bases are allowed within the overlap of two reads), the number of blocks can be increased appropriately.
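The two-stage overlap calculation (hash lookup of shared k-mers, then a detailed comparison of candidate pairs) can be sketched as below. The detailed comparison here is a plain suffix-prefix scan with a mismatch budget, swapped in for the block-by-block search described above for brevity; the function names and the `min_len` parameter are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Index every k-mer in a hash table and return read-index pairs
    that share at least one k-mer (possible overlaps)."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def overlap_length(a, b, min_len, max_mismatch=3):
    """Longest suffix of `a` that matches a prefix of `b` with at most
    `max_mismatch` mismatches; 0 if none reaches `min_len`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_mismatch:
            return n
    return 0
```

Note that `min_len` should be clearly larger than `max_mismatch`, otherwise very short overlaps pass trivially.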
B3) Identification of extension conflicts. This embodiment assembles the sequence inside the gap by extending from both ends.
For front-end extension, starting from the initial node read, all reads that overlap the node read are found, and the read with the smallest overlap with the node read is chosen as the seed read. All other reads should then overlap the seed read, and these overlaps must be longer than the overlap between the seed read and the node read. If some read does not overlap the seed read, a conflict is judged to have occurred, and it is resolved by replacing the seed read, that is, by searching for a new seed read. This ensures that the selected seed read is correct.
Extension then proceeds. The seed read is treated as part of the contig, and a new seed read is searched for in the same way. If one is found, the sequence continues to extend; otherwise extension at this end is finished, and the extension from the other end is awaited to determine whether the two extensions overlap and hence whether the gap can be closed. In some cases an extending read may find a read previously used as a seed read as its extension read, which would cause an infinite loop within that range. This is treated as a conflict: extension terminates when the extension read is a previously used read. Back-end extension is analogous to front-end extension and is not described in detail here.
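Seed selection, conflict detection, and loop termination for extension from one end can be sketched as follows. This is a simplified illustration that uses exact suffix-prefix overlaps and checks consistency only against reads with longer overlaps; the function names and the `min_len` and `max_steps` parameters are assumptions of the example.

```python
def exact_overlap(a, b, min_len):
    """Longest exact suffix of `a` equal to a prefix of `b` (>= min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def consistent(seed, seed_ov, other, other_ov):
    """Check that `other` (longer overlap, so it starts earlier) agrees
    with `seed` where the two reads are expected to overlap."""
    shift = other_ov - seed_ov  # seed starts `shift` bases after `other`
    shared = other[shift:]
    return bool(shared) and seed[:len(shared)] == shared[:len(seed)]


def pick_seed(node, reads, min_len):
    """Return (seed, overlap): the read with the smallest overlap that is
    consistent with all reads of longer overlap, or None."""
    cands = sorted((exact_overlap(node, r, min_len), r) for r in reads)
    cands = [(ov, r) for ov, r in cands if 0 < ov < len(r)]
    for i, (ov, seed) in enumerate(cands):
        # A conflict means this seed is replaced by the next candidate.
        if all(consistent(seed, ov, r, o) for o, r in cands[i + 1:]):
            return seed, ov
    return None


def extend(contig, reads, min_len, max_steps=1000):
    """Extend `contig` to the right until no seed is found or a seed
    read is reused (which would otherwise loop forever)."""
    used = set()
    for _ in range(max_steps):
        picked = pick_seed(contig, reads, min_len)
        if picked is None:
            break
        seed, ov = picked
        if seed in used:
            break
        used.add(seed)
        contig += seed[ov:]
    return contig
```

In this sketch a seed that disagrees with the other overlapping reads is simply skipped in favor of the candidate with the next larger overlap.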
Local assembly for gap filling inherently faces highly similar repeats, so conflict detection should be as sensitive as possible and the reads used for gap filling should have a lower error rate. This embodiment performs error correction on the reads in advance to improve read quality and ensure the accuracy of both ends of each read.
B4) Conflict handling. The inventors found during their research that extension conflicts have two causes: base errors in the seed read, and branching at repeats. Based on these two cases, this embodiment applies the following strategies when selecting the gap-filling read set and the seed read:
a1) Alignment rate filtering: a read must have a 100% alignment rate to be used as a seed read for extension. Alignment rate filtering uses the following strategy:
All reads on a contig are searched, the reads that overlap the contig are found, and the read with the smallest overlap with the contig is initially selected as the seed read. The other reads that overlap the contig must then also overlap the seed read. The seed read is compared with these other reads. If the mismatch rate in the overlap between the initially selected seed read and the other reads exceeds a second threshold set by the system (3%), the seed read is judged unreliable and a new seed read is selected whose overlap with the contig is longer than that of the unreliable seed read and shorter than those of the remaining reads. This is repeated until a seed read is found, or until extension is abandoned because no seed read can be found. In this way the embodiment avoids failures to extend caused by base errors in the seed read.
In a specific implementation, overlaps between reads are aligned by stepwise block extension: a block is selected from the seed read, a target read is set, and it is checked whether the bases in the block can be found in the target read. If so, the block in the seed read is moved forward by one unit and compared with the target read again, and this is repeated until no match is possible. This yields the length of the overlap between the seed read and the target read. The length must reach a third threshold, 1 k-mer, to indicate that the overlap between the two reads is not accidental and is reliable. If the preceding seed read itself contains a sequencing error, a large number of reads may be filtered out; a loop is therefore provided to replace the preceding seed read.
a2) Position filtering: reads are located using paired-end relationships, their positions within the gap are calculated, and reads are filtered by position, which reduces conflicts caused by long repeated fragments in the gap. To ensure that in-gap positions are calculated accurately, this embodiment may set strict filtering conditions.
a3) Read length filtering: among the acquired reads, paired-end (PE) reads are short, whereas single-end reads (single reads) are usually longer. The longer single-end reads all overlap one end of the gap. This embodiment preferentially uses short paired-end reads for extension in the interior of the gap and long single-end reads for extension at the two ends of the gap.
a4) End filtering: based on the estimated gap length, if the extension read overlaps the other end too early, a non-overlapping read is selected instead, namely a read that is positioned immediately after the extension read, does not overlap it, and whose placement does not conflict with the estimated gap length. This ensures that one repeat region is crossed. In this embodiment, end filtering may occur only once.
a5) Identification and handling of short similar repeats: short similar repeats are usually shorter than 50 bp and lie close to each other, and they eventually cause base deletions in the in-gap sequence. When short similar repeats are identified, this embodiment preferentially selects reads with longer overlaps as seed reads for extension, which effectively avoids the short similar repeat problem.
B5) Sequence joining. Gap filling requires not only accurate assembly but also accurate joining. Accurate assembly lowers the base error rate and also supports accurate joining, while accurate joining directly determines whether insertions or deletions are ultimately introduced. Joining must also take extension errors into account. In this embodiment, sequence joins are divided into the following three confidence levels according to join quality:
b1) First confidence level: the two joined sequences overlap, the overlap is not a repeat, and a read spans the join in support.
b2) Second confidence level: a read spans the join between the two sequences; the two sequences may not overlap.
b3) Third confidence level: the two joined sequences overlap by at least 8 bp, but the overlap region has no supporting evidence and may be a repeat.
All three confidence levels may occur. The first has the highest quality, which does not mean it is necessarily correct; the second has the next highest quality and likewise is not necessarily correct. Joins inside the gap are therefore classified and handled carefully according to the actual situation. Joins inside the gap fall into three types: direct joining of the contigs at the two ends, joining of an extension from one end to the contig at the other end, and joining of the extension sequences from both ends. For each type, the three confidence levels are checked, that is, the accuracy of the join is assessed when the sequences are joined. If the first confidence level is present, it is selected and the sequences are joined; if the first is absent but the second is present, the second is selected and the sequences are joined; if the first and second are absent but the third is present, the third is selected and the sequences are joined.
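The priority order of the three confidence levels can be sketched as a small classifier. The enum, the function name, and the boolean inputs are assumptions made for this illustration; how the overlap, the repeat status, and the spanning read are detected is outside the scope of the sketch.

```python
from enum import IntEnum


class Confidence(IntEnum):
    FIRST = 1   # overlap, not a repeat, supported by a spanning read
    SECOND = 2  # a read spans the join; the sequences may not overlap
    THIRD = 3   # overlap of at least 8 bp without supporting evidence


def classify_join(overlap_len, is_repeat, has_spanning_read,
                  min_third_overlap=8):
    """Return the best available confidence level for a join, or None
    when none of the three levels applies (no join is made)."""
    if overlap_len > 0 and not is_repeat and has_spanning_read:
        return Confidence.FIRST
    if has_spanning_read:
        return Confidence.SECOND
    if overlap_len >= min_third_overlap:
        return Confidence.THIRD
    return None
```

Checking the levels in this fixed order mirrors the rule that a lower level is used only when every higher level is absent.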
C. Large gaps are handled mainly by dividing the large gap into several medium gaps and processing each one with the medium-gap procedure.
Gap filling is limited by the PE insert size: the longest supported PE insert is 800 bp. When the gap length exceeds 1.5 kb, after subtracting the overlaps with the contigs at both ends, two 800 bp inserts cannot overlap each other, so no complete path can be found that fills the large gap entirely. To avoid the blank regions that PE reads may leave, this embodiment divides the large gap into several medium gaps, assembles each medium gap separately, and finally joins the assembly results, as follows:
c1) Calculate the in-gap positions of the reads from the PE relationships, sort the reads by their positions in the gap, and treat each stretch covered by contiguous reads as one block.
c2) Assemble each block separately using the medium-gap procedure.
c3) Join the assembly results of the blocks to obtain the sequence inside the large gap.
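Step c1 can be sketched as a sweep over reads sorted by their in-gap position. The input layout (start position and length per read) and the `min_overlap` parameter are assumptions of this example; the text itself only requires contiguous read coverage within a block.

```python
def split_into_blocks(placed_reads, min_overlap=1):
    """Group reads into blocks of contiguous coverage.

    placed_reads: list of (start, length) in gap coordinates, where the
    start is assumed to come from the paired-end relationship.
    A new block starts when a read does not overlap the current block by
    at least `min_overlap` bases.
    """
    blocks = []
    block_end = None  # exclusive end of the current block
    for start, length in sorted(placed_reads):
        if block_end is None or start > block_end - min_overlap:
            blocks.append([])
            block_end = start + length
        else:
            block_end = max(block_end, start + length)
        blocks[-1].append((start, length))
    return blocks


placed = [(0, 100), (50, 100), (400, 100), (450, 100), (300, 50)]
blocks = split_into_blocks(placed)
```

Each resulting block would then be assembled with the medium-gap procedure, and the block assemblies joined in positional order.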
This embodiment may fill a gap by extending from one end or by extending from both ends. The technical solution for extension from both ends is described below:
Fig. 2 shows a flowchart of the second embodiment of the gap filling method in nucleic acid sequence assembly of the present invention. In this method, one end of the gap has a first contig and the other end has a second contig, and the gap filling procedure comprises the following steps:
Step 201, determine node reads: find a nucleic acid sequence at the end of the first contig adjacent to the gap as the first node read, and find a nucleic acid sequence at the end of the second contig adjacent to the gap as the second node read;
Step 202, select gap-filling read sets: from the reads used for gap filling, find all reads that overlap the first node read as the first gap-filling read set, and find all reads that overlap the second node read as the second gap-filling read set;
Step 203, select seed reads: select one read from the first gap-filling read set as the first seed read, and select one read from the second gap-filling read set as the second seed read;
Here, the read with the smallest overlap with the first node read is selected from the first gap-filling read set as the first seed read, and the read with the smallest overlap with the second node read is selected from the second gap-filling read set as the second seed read;
Step 204, extension: splice the first seed read onto the first contig to form a new first contig, and splice the second seed read onto the second contig to form a new second contig;
Step 205, judgment: determine whether, after extension, the gap-adjacent end of the new first contig overlaps the gap-adjacent end of the new second contig;
Step 206, iteration: if the gap-adjacent end of the new first contig does not overlap the gap-adjacent end of the new second contig, repeat the steps of determining node reads, selecting gap-filling read sets, selecting seed reads, and extension on the basis of the new first and second contigs;
Step 207, complete gap filling: if the gap-adjacent end of the new first contig overlaps the gap-adjacent end of the new second contig, join the first contig and the second contig to complete gap filling.
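Steps 201 to 207 can be sketched as a loop that extends both contigs by one seed read per round and stops when their gap-adjacent ends overlap. This is a minimal illustration with exact overlaps and without the conflict filters described earlier; the whole contig is used in place of a separate node read, and all function names and the `min_len` and `max_rounds` parameters are assumptions of the example.

```python
def overlap(a, b, min_len):
    """Longest exact suffix of `a` equal to a prefix of `b` (>= min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def step_right(contig, reads, min_len):
    """Steps 201-204 for the first contig; None if no seed read exists."""
    cands = [(overlap(contig, r, min_len), r) for r in reads]
    cands = [(ov, r) for ov, r in cands if 0 < ov < len(r)]
    if not cands:
        return None
    ov, seed = min(cands)  # seed read = smallest overlap
    return contig + seed[ov:]


def step_left(contig, reads, min_len):
    """Steps 201-204 for the second contig, extending toward the gap."""
    cands = [(overlap(r, contig, min_len), r) for r in reads]
    cands = [(ov, r) for ov, r in cands if 0 < ov < len(r)]
    if not cands:
        return None
    ov, seed = min(cands)
    return seed[:-ov] + contig


def close_gap(first, second, reads, min_len, max_rounds=1000):
    """Return the joined sequence, or None if the gap cannot be closed."""
    for _ in range(max_rounds):
        ov = overlap(first, second, min_len)       # step 205
        if ov:
            return first + second[ov:]             # step 207
        new_first = step_right(first, reads, min_len)
        new_second = step_left(second, reads, min_len)
        if new_first is None and new_second is None:
            return None                            # neither end can extend
        first = new_first or first                 # step 206
        second = new_second or second
    return None
```

Both ends are extended in the same round here; alternating the two ends would only change the order of the two `step_*` calls.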
As can be seen from the above, prior-art gap filling methods close only a limited number of gaps and with low accuracy. In contrast, the present invention determines a nucleic acid sequence on the first contig at one end of the gap as the first node read, finds the read with the smallest overlap with the first node read, uses it as the first seed read, and splices it onto the first contig to form a new first contig; that is, the first contig is extended into the new first contig. Likewise, it determines a nucleic acid sequence on the second contig at the other end of the gap as the second node read, finds the read with the smallest overlap with the second node read, uses it as the second seed read, and splices it onto the second contig to form a new second contig. It then determines whether the gap-adjacent end of the new first contig overlaps the gap-adjacent end of the new second contig. If not, the extension is repeated until the two ends overlap, at which point the first and second contigs are joined to complete gap filling. This iterative contig extension greatly increases the number of gaps closed and the filling efficiency, improves filling accuracy, and saves time.
Of course, the two ends may be extended simultaneously, alternately, or one end at some times and the other end at other times; this is not described further here.
Fig. 3 shows a schematic structural diagram of an embodiment of the gap filling device in nucleic acid sequence assembly of the present invention. The device comprises: a determining unit 31, a first selecting unit 32, a second selecting unit 33, an extending unit 34, a first judging unit 35, a loop unit 36, a joining unit 38, an identifying unit 39, a second judging unit 40, a third judging unit 41, and a fourth judging unit 37.
The determining unit 31 determines the node read: one end of the gap has a first contig and the other end has a second contig, and a nucleic acid sequence at the end of the first contig adjacent to the gap is taken as the node read. The first selecting unit 32 selects the gap-filling read set: from the reads used for gap filling, it finds all reads that overlap the node read as the gap-filling read set. The second selecting unit 33 selects the seed read: it selects one read from the gap-filling read set as the seed read. The extending unit 34 performs extension: it splices the seed read onto the first contig to form a new first contig. The first judging unit 35 performs the judgment: it determines whether, after extension, the gap-adjacent end of the new first contig overlaps the gap-adjacent end of the second contig. The loop unit 36 performs iteration: if the gap-adjacent end of the new first contig does not overlap the gap-adjacent end of the second contig, then on the basis of the new first contig it causes the determining unit 31 to determine the node read, the first selecting unit 32 to select the gap-filling read set, the second selecting unit 33 to select the seed read, and the extension to be performed again. The joining unit 38 completes gap filling: if the gap-adjacent end of the new first contig overlaps the gap-adjacent end of the second contig, it joins the first contig and the second contig. The joining unit 38 also performs sequence joining, which falls into three types: direct joining of the contigs at the two ends, joining of an extension from one end to the contig at the other end, and joining of the extension sequences from both ends. The identifying unit 39 uses the sliding-window frequency of the gap-filling reads in the gap and the distance between identical sliding windows to identify whether a tandem repeat exists in the gap, and further infers the repeat pattern of the in-gap sequence from the tandem repeat. The second judging unit 40 determines whether the node read and the reads used for gap filling share a common fixed-length short string (k-mer), and uses pattern recognition among reads that share a k-mer to determine the set of all overlapping reads. The third judging unit 41 acts on the estimated gap length: if the gap-adjacent end of the new first contig overlaps the gap-adjacent end of the second contig too early, the first and second contigs are not joined, gap filling continues on the basis of the new first contig, and, when the seed read is selected, a non-overlapping read outside the gap-filling read set is chosen as the seed read. The fourth judging unit 37 assesses the confidence of a join when the joining unit 38 joins sequences: if the first confidence level is present, it is selected and the joining unit 38 joins the sequences; if the first is absent but the second is present, the second is selected and the joining unit 38 joins the sequences; if the first and second are absent but the third is present, the third is selected and the joining unit 38 joins the sequences. Here, the first confidence level means that the two joined sequences overlap, the overlap is not a repeat, and a read spans the join in support; the second confidence level means that a read spans the join between the two sequences, which may not overlap; the third confidence level means that the two joined sequences overlap but the overlap region has no supporting evidence.
The hole-filling device for nucleic acid sequence assembly of the present invention fills a hole as follows. Before hole filling, the identification unit 39 determines whether a tandem repeat exists within the hole and, if so, further infers the repeat pattern of that stretch of sequence to facilitate hole filling. During hole filling, the hole has a first contig at one end and a second contig at the other end. First, the determining unit 31 finds a stretch of nucleic acid sequence at the end of the first contig near the hole and takes it as the node read. The second judging unit 40 then evaluates the overlap regions between the node read and the reads available for hole filling. Based on the result from the second judging unit 40, the first selecting unit 32 finds, among the reads available for hole filling, all reads that overlap the node read and takes them as the hole-filling read set. The second selecting unit 33 then selects from the hole-filling read set the read with the smallest overlap with the node read as the seed read, and the extension unit 34 splices the seed read onto the first contig to form a new first contig. The first judging unit 35 then judges whether the end of the extended new first contig near the hole overlaps the end of the second contig near the hole. If the two ends do not overlap, the loop unit 36 repeats, on the basis of the new first contig, the steps of determining the node read, selecting the hole-filling read set, selecting the seed read and extending, until the connecting unit 38 finally completes the hole filling. If the two ends overlap, the first contig and the second contig are connected and the connecting unit 38 completes the hole filling.
The third judging unit 41 works from the expected hole length: if the end of the new first contig near the hole overlaps the end of the second contig near the hole prematurely, the first contig and the second contig are not connected. Instead, the steps of determining the node read, selecting the hole-filling read set and selecting the seed read continue on the basis of the new first contig, and in the seed-selection step a non-overlapping read outside the hole-filling read set is used as the seed read. The fourth judging unit 37 judges the credibility of a sequence connection when the connecting unit 38 connects sequences. When the first credibility is available, it is selected and the connecting unit 38 performs the sequence connection; when the first credibility is not available but the second is, the second credibility is selected and the connecting unit 38 performs the sequence connection; when neither the first nor the second credibility is available but the third is, the third credibility is selected and the connecting unit 38 performs the sequence connection. The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span the junction in support. The second credibility means that reads span the junction, although the two sequences may not overlap. The third credibility means that the two connected sequences overlap, but no evidence supports the overlap region.
After hole filling is complete, the connecting unit 38 connects the sequences. Sequence connections fall into three types: direct connection of the contigs at the two ends, connection of an extension from one end to the contig at the other end, and connection of the extension sequences from both ends.
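The core cycle of the hole-filling process described above (node read, hole-filling read set, seed read, extension, overlap judgment) can be sketched in Python as follows. This is only an illustration under simplifying assumptions: reads are error-free strings on one strand, overlaps are exact suffix-prefix matches rather than the pattern recognition performed by the second judging unit, and the names `overlap_len`, `fill_hole`, `node_len`, `min_overlap` and `max_rounds` are hypothetical. The tandem-repeat check, the premature-overlap check and the credibility judgment are omitted.

```python
from typing import List, Optional


def overlap_len(left: str, right: str, min_len: int) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    Returns 0 when no overlap of at least `min_len` (>= 1) bases exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def fill_hole(first: str, second: str, reads: List[str],
              node_len: int = 8, min_overlap: int = 4,
              max_rounds: int = 1000) -> Optional[str]:
    """Extend `first` toward `second` one seed read at a time.

    Returns the connected sequence, or None if the hole cannot be closed.
    """
    pool = list(reads)
    for _ in range(max_rounds):
        # Node read: the stretch at the hole-side end of the first contig.
        node = first[-node_len:]
        # Hole-filling read set: reads that overlap the node read and
        # reach beyond it into the hole.
        candidates = []
        for read in pool:
            k = overlap_len(node, read, min_overlap)
            if k and len(read) > k:
                candidates.append((k, read))
        if not candidates:
            return None
        # Seed read: the candidate with the smallest overlap.
        k, seed = min(candidates)
        # Extension: splice the seed read onto the first contig.
        first += seed[k:]
        pool.remove(seed)
        # Judgment: does the new first contig now overlap the second contig?
        k = overlap_len(first, second, min_overlap)
        if k:
            return first + second[k:]   # connect the contigs, hole filled
    return None
```

For example, with a 30-base sequence whose middle 10 bases are hidden in the hole, `fill_hole(genome[:10], genome[20:], reads, node_len=8, min_overlap=4)` rebuilds the sequence when the reads tile the hole. The `max_rounds` bound is a safeguard added for the sketch; it is not part of the described process.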
Clearly, the embodiments of the present invention center on the treatment of medium holes: a small hole can be converted into a medium hole for processing, while a large hole is decomposed entirely into medium holes that are processed in the same way. In the embodiments, holes of different classes have their own processing modes, hole filling is refined to the level of the individual hole, and the properties of the hole itself are fully exploited to complete the filling. All holes can therefore be filled effectively, which greatly improves hole-filling accuracy, saves hole-filling time and memory, and benefits the development and adoption of gene sequencing technology.
The above are merely embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (23)

  1. A hole-filling method in nucleic acid sequence assembly, the hole having a first contig at one end and a second contig at the other end, characterized by comprising the following steps:
    determining a node read: finding a stretch of nucleic acid sequence at the end of the first contig near the hole and taking it as the node read;
    selecting a hole-filling read set: finding, among the reads available for hole filling, all reads that overlap the node read and taking them as the hole-filling read set;
    selecting a seed read: selecting one read from the hole-filling read set as the seed read;
    extending: splicing the seed read onto the first contig to form a new first contig;
    judging: judging whether the end of the extended new first contig near the hole overlaps the end of the second contig near the hole;
    looping: if the end of the new first contig near the hole does not overlap the end of the second contig near the hole, repeating the steps of determining a node read, selecting a hole-filling read set, selecting a seed read and extending on the basis of the new first contig;
    completing the hole filling: if the end of the new first contig near the hole overlaps the end of the second contig near the hole, connecting the first contig and the second contig to complete the hole filling.
  2. The method according to claim 1, characterized in that:
    the step of selecting a hole-filling read set comprises:
    judging whether the node read and the reads available for hole filling share a common fixed-length short string, and applying pattern recognition among the reads that share a common fixed-length short string to determine the set of all overlapping reads.
  3. The method according to claim 2, characterized in that:
    the step of applying pattern recognition for comparison among the reads that share a common fixed-length short string comprises: obtaining the overlap length between the reads by stepwise sliding-window extension among the reads that share the common fixed-length short string.
  4. The method according to claim 1, characterized in that:
    before the step of determining a node read, the method further comprises: using the sliding-window frequencies of the hole-filling reads within the hole and the distances between identical sliding windows to identify whether a tandem repeat exists within the hole, and further inferring the repeat pattern of the sequence within the hole from the tandem repeat.
  5. The method according to claim 1, characterized in that:
    the step of selecting a seed read comprises: filtering the hole-filling reads by alignment rate, that is, selecting a hole-filling read with a 100% alignment rate as the seed read.
  6. The method according to claim 1, characterized in that:
    the step of selecting a seed read comprises: processing and identifying short similar repeats in the hole-filling reads, that is, when a short similar repeat is identified, selecting a hole-filling read with a longer overlap as the seed read.
  7. The method according to claim 1, characterized in that:
    the step of selecting a hole-filling read set comprises: filtering the hole-filling reads by position, that is, locating the reads according to their paired-end relationships, computing the position of each hole-filling read within the hole, and filtering the reads according to position.
  8. The method according to claim 1, characterized in that:
    the step of selecting a hole-filling read set comprises: filtering the hole-filling reads by length, that is, using short paired-end reads in the interior of the hole and long single-end reads at the two ends of the hole.
  9. The method according to claim 1, characterized in that:
    if the end of the new first contig near the hole overlaps the end of the second contig near the hole, the method comprises, before the step of connecting the first contig and the second contig: based on the expected hole length, if the end of the new first contig near the hole overlaps the end of the second contig near the hole prematurely, not performing the step of connecting the first contig and the second contig, and continuing, on the basis of the new first contig, the steps of determining a node read, selecting a hole-filling read set and selecting a seed read, wherein, in the step of selecting a seed read, a non-overlapping read outside the hole-filling read set is used as the seed read.
  10. The method according to claim 1, characterized in that:
    the step of completing the hole filling comprises: performing a sequence connection, the sequence connection being a direct connection of the contigs at the two ends, a connection of an extension from one end to the contig at the other end, or a connection of the extension sequences from both ends.
  11. The method according to claim 10, characterized in that:
    before the step of performing the sequence connection, the method comprises: judging the credibility of the sequence connection; when the first credibility is available, selecting the first credibility and performing the sequence connection; when the first credibility is not available but the second credibility is, selecting the second credibility and performing the sequence connection; when neither the first nor the second credibility is available but the third credibility is, selecting the third credibility and performing the sequence connection; wherein the first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span the junction in support; the second credibility means that reads span the junction, although the two sequences may not overlap; and the third credibility means that the two connected sequences overlap, but no evidence supports the overlap region.
  12. A hole-filling method in nucleic acid sequence assembly, the hole having a first contig at one end and a second contig at the other end, characterized by comprising the following steps:
    determining node reads: finding a stretch of nucleic acid sequence at the end of the first contig near the hole and taking it as the first node read; and finding a stretch of nucleic acid sequence at the end of the second contig near the hole and taking it as the second node read;
    selecting hole-filling read sets: finding, among the reads available for hole filling, all reads that overlap the first node read and taking them as the first hole-filling read set; and finding, among the reads available for hole filling, all reads that overlap the second node read and taking them as the second hole-filling read set;
    selecting seed reads: selecting one read from the first hole-filling read set as the first seed read; and selecting one read from the second hole-filling read set as the second seed read;
    extending: splicing the first seed read onto the first contig to form a new first contig; and splicing the second seed read onto the second contig to form a new second contig;
    judging: judging whether the end of the extended new first contig near the hole overlaps the end of the new second contig near the hole;
    looping: if the end of the new first contig near the hole does not overlap the end of the new second contig near the hole, repeating the steps of determining node reads, selecting hole-filling read sets, selecting seed reads and extending on the basis of the new first and second contigs;
    completing the hole filling: if the end of the new first contig near the hole overlaps the end of the new second contig near the hole, connecting the first contig and the second contig to complete the hole filling.
  13. A hole-filling device in nucleic acid sequence assembly, characterized in that the device comprises:
    a determining unit, configured to determine a node read, the hole having a first contig at one end and a second contig at the other end, by finding a stretch of nucleic acid sequence at the end of the first contig near the hole and taking it as the node read;
    a first selecting unit, configured to select a hole-filling read set by finding, among the reads available for hole filling, all reads that overlap the node read and taking them as the hole-filling read set;
    a second selecting unit, configured to select a seed read by selecting one read from the hole-filling read set as the seed read;
    an extension unit, configured to perform extension by splicing the seed read onto the first contig to form a new first contig;
    a first judging unit, configured to judge whether the end of the extended new first contig near the hole overlaps the end of the second contig near the hole;
    a loop unit, configured to perform looping: if the end of the new first contig near the hole does not overlap the end of the second contig near the hole, repeating the determination of the node read, the selection of the hole-filling read set, the selection of the seed read and the extension on the basis of the new first contig;
    a connecting unit, configured to connect the first contig and the second contig to complete the hole filling when the end of the new first contig near the hole overlaps the end of the second contig near the hole.
  14. The device according to claim 13, characterized in that the device comprises:
    a second judging unit, configured to judge whether the node read and the reads available for hole filling share a common fixed-length short string, and to apply pattern recognition among the reads that share a common fixed-length short string to determine the set of all overlapping reads.
  15. The device according to claim 14, characterized in that:
    the second judging unit is further configured to obtain the overlap length between the reads by stepwise sliding-window extension among the reads that share the common fixed-length short string.
  16. The device according to claim 13, characterized in that the device comprises:
    an identification unit, configured to use the sliding-window frequencies of the hole-filling reads within the hole and the distances between identical sliding windows to identify whether a tandem repeat exists within the hole, and to further infer the repeat pattern of the sequence within the hole from the tandem repeat.
  17. The device according to claim 13, characterized in that:
    the second selecting unit is further configured to filter the hole-filling reads by alignment rate, that is, to select a hole-filling read with a 100% alignment rate as the seed read.
  18. The device according to claim 13, characterized in that:
    the second selecting unit is further configured to process and identify short similar repeats in the hole-filling reads, that is, when a short similar repeat is identified, to select a hole-filling read with a longer overlap as the seed read.
  19. The device according to claim 13, characterized in that:
    the first selecting unit is further configured to filter the hole-filling reads by position, that is, to locate the reads according to their paired-end relationships, compute the position of each hole-filling read within the hole, and filter the reads according to position.
  20. The device according to claim 13, characterized in that:
    the first selecting unit is further configured to filter the hole-filling reads by length, that is, to use short paired-end reads in the interior of the hole and long single-end reads at the two ends of the hole.
  21. The device according to claim 13, characterized in that the device comprises:
    a third judging unit, configured to, based on the expected hole length, if the end of the new first contig near the hole overlaps the end of the second contig near the hole prematurely, not connect the first contig and the second contig and continue hole filling on the basis of the new first contig, wherein, when the seed read is selected, a non-overlapping read outside the hole-filling read set is used as the seed read.
  22. The device according to claim 13, characterized in that:
    the connecting unit is specifically configured to, when the end of the new first contig near the hole overlaps the end of the second contig near the hole, connect the contigs at the two ends directly, connect an extension from one end to the contig at the other end, or connect the extension sequences from both ends;
    the device comprises a fourth judging unit, configured to judge the credibility of the sequence connection when the connecting unit connects the sequences; when the first credibility is available, the first credibility is selected and the connecting unit performs the sequence connection; when the first credibility is not available but the second credibility is, the second credibility is selected and the connecting unit performs the sequence connection; when neither the first nor the second credibility is available but the third credibility is, the third credibility is selected and the connecting unit performs the sequence connection; wherein the first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span the junction in support; the second credibility means that reads span the junction, although the two sequences may not overlap; and the third credibility means that the two connected sequences overlap, but no evidence supports the overlap region.
  23. A hole-filling device in nucleic acid sequence assembly, characterized in that the device comprises:
    a determining unit, configured to determine node reads, the hole having a first contig at one end and a second contig at the other end, by finding a stretch of nucleic acid sequence at the end of the first contig near the hole and taking it as the first node read, and finding a stretch of nucleic acid sequence at the end of the second contig near the hole and taking it as the second node read;
    a first selecting unit, configured to select hole-filling read sets by finding, among the reads available for hole filling, all reads that overlap the first node read and taking them as the first hole-filling read set, and finding, among the reads available for hole filling, all reads that overlap the second node read and taking them as the second hole-filling read set;
    a second selecting unit, configured to select seed reads by selecting one read from the first hole-filling read set as the first seed read, and selecting one read from the second hole-filling read set as the second seed read;
    an extension unit, configured to perform extension by splicing the first seed read onto the first contig to form a new first contig, and splicing the second seed read onto the second contig to form a new second contig;
    a first judging unit, configured to judge whether the end of the extended new first contig near the hole overlaps the end of the new second contig near the hole;
    a loop unit, configured to perform looping: if the end of the new first contig near the hole does not overlap the end of the new second contig near the hole, repeating the determination of the node reads, the selection of the hole-filling read sets, the selection of the seed reads and the extension on the basis of the new first and second contigs;
    a connecting unit, configured to connect the first contig and the second contig to complete the hole filling when the end of the new first contig near the hole overlaps the end of the new second contig near the hole.
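The shared fixed-length short string test and the stepwise sliding-window extension recited in claims 2, 3, 14 and 15 can be illustrated with the following Python sketch. It is an assumption-laden simplification: exact matching stands in for the pattern recognition named in the claims, only the first occurrence of each short string in the node read is indexed, and the names `kmer_index`, `overlap_by_extension` and the parameter `k` (the length of the short string) are hypothetical.

```python
from typing import Dict


def kmer_index(seq: str, k: int) -> Dict[str, int]:
    """Map each fixed-length short string (k-mer) of `seq` to its first position."""
    index: Dict[str, int] = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], i)
    return index


def overlap_by_extension(node: str, read: str, k: int) -> int:
    """Overlap length between `node` and `read`, or 0 if none is found.

    A shared k-mer seeds the comparison. The matching window is then
    extended one base at a time in both directions until a mismatch or
    the end of either sequence is reached.
    """
    index = kmer_index(node, k)
    for j in range(len(read) - k + 1):
        i = index.get(read[j:j + k])
        if i is None:
            continue                      # no shared short string here
        lo_n, lo_r = i, j                 # extend the window to the left
        while lo_n > 0 and lo_r > 0 and node[lo_n - 1] == read[lo_r - 1]:
            lo_n -= 1
            lo_r -= 1
        hi_n, hi_r = i + k, j + k         # extend the window to the right
        while hi_n < len(node) and hi_r < len(read) and node[hi_n] == read[hi_r]:
            hi_n += 1
            hi_r += 1
        # An end overlap must run to an end of one sequence on each side.
        reaches_left = lo_n == 0 or lo_r == 0
        reaches_right = hi_n == len(node) or hi_r == len(read)
        if reaches_left and reaches_right:
            return hi_n - lo_n
    return 0
```

Reads that share no short string with the node read are rejected after a dictionary lookup, so the base-by-base extension runs only on plausible candidates.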
PCT/CN2011/083178 2011-11-29 2011-11-29 Method and device for gap closure in nucleotide sequence assembly WO2013078623A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/083178 WO2013078623A1 (en) 2011-11-29 2011-11-29 Method and device for gap closure in nucleotide sequence assembly

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/083178 WO2013078623A1 (en) 2011-11-29 2011-11-29 Method and device for gap closure in nucleotide sequence assembly

Publications (1)

Publication Number Publication Date
WO2013078623A1 (en) 2013-06-06

Family

ID=48534608

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/083178 WO2013078623A1 (en) 2011-11-29 2011-11-29 Method and device for gap closure in nucleotide sequence assembly

Country Status (1)

Country Link
WO (1) WO2013078623A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998026096A1 (en) * 1996-12-12 1998-06-18 Smithkline Beecham Corporation Method for rapid gap closure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QU, XUEPING ET AL.: "Progress in Contig", PROGRESS IN BIOTECHNOLOGY, vol. 19, no. 3, 1999, pages 2 - 6 *
ZHAO, DONGSHENG ET AL.: "Computing Resource of AMMS Biomedicine Super-computing Center and Its Applications.", BULL. ACAD. MIL. MED. SCI., vol. 29, no. 4, August 2005 (2005-08-01), pages 363 - 367 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11876834

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/11/2014)

122 Ep: pct application non-entry in european phase

Ref document number: 11876834

Country of ref document: EP

Kind code of ref document: A1