WO2013078623A1 - Method and device for filling gaps in a nucleotide sequence assembly - Google Patents

Publication number
WO2013078623A1
Authority
WO
WIPO (PCT)
Prior art keywords
hole
reading
contig
sequence
new
Prior art date
Application number
PCT/CN2011/083178
Other languages
English (en)
Chinese (zh)
Inventor
刘兵行
李振宇
陈燕香
李英睿
汪建
王俊
杨焕明
Original Assignee
深圳华大基因科技有限公司
深圳华大基因研究院
Application filed by 深圳华大基因科技有限公司, 深圳华大基因研究院
Priority to PCT/CN2011/083178
Publication of WO2013078623A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20: Sequence assembly

Definitions

  • The invention relates to the field of genetic engineering, and in particular to a method and device for filling holes (gaps) in nucleic acid sequence assembly.
  • Sequencing costs keep falling, which drives whole-genome sequencing of more and more species.
  • The principle of second-generation sequencing technology means that the sequenced fragments are short. In practice, sequenced fragments are only tens to hundreds of bases long, which increases the difficulty of analyzing the sequencing data.
  • Genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships between the non-repetitive regions. The unassembled regions between the non-repetitive regions form gaps, which are called holes.
  • Graph-based local assembly can identify the conflict sites caused by repeats, but it has difficulty resolving the conflicts and has to break the assembly at those sites, which affects the number of holes.
  • The hole-filling software Gapcloser performs local assembly based on overlapping base sequence segments. Because it does not take into account the complexity of the situation inside the hole, it easily makes errors when processing complex holes, which reduces overall accuracy.
  • Because Gapcloser consumes a large amount of memory and is time-consuming, it is not suitable for first-pass hole filling in large genomes.
  • The hole filling built into the SOAPdenovo assembly software performs a secondary assembly of the region inside the hole based on a de Bruijn graph. Although it can effectively close short holes, the number of holes it can fill is limited.
  • The technical problem to be solved by the present invention is to provide a method and device for filling holes in nucleic acid sequence assembly that can effectively fill the holes of a nucleic acid sequence awaiting hole filling, and that improve both the accuracy and the fill rate of hole filling.
  • A technical solution adopted by the present invention is to provide a method for filling holes in nucleic acid sequence assembly, wherein one end of the hole has a first contig and the other end has a second contig, the method including the following steps.
  • Determining a node read sequence: finding a nucleic acid sequence at the end of the first contig close to the hole as the node read sequence.
  • Selecting a hole-filling read set: finding, among the read sequences used for filling the hole, all read sequences that overlap the node read sequence, as the hole-filling read set.
  • Selecting a seed read sequence: selecting one read sequence from the hole-filling read set as the seed read sequence.
  • Extension processing: splicing the seed read sequence with the first contig to form a new first contig.
  • Judging: judging whether, after the extension processing, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole.
  • Loop processing: if the end of the new first contig close to the hole does not overlap the end of the second contig close to the hole, the steps of determining the node read sequence, selecting the hole-filling read set, selecting the seed read sequence and extension processing are performed again on the basis of the new first contig.
  • Completing the hole filling: if the end of the new first contig close to the hole …
  • The step of selecting the hole-filling read set includes: determining whether the node read sequence and the read sequences used for filling the hole share a common fixed-length short string, and applying pattern recognition between the read sequences that share a common fixed-length short string to determine the set of all overlapping read sequences.
  • The step of applying pattern recognition between the read sequences that share the common fixed-length short string includes: obtaining the overlap length between the read sequences by sliding a window forward step by step.
  • The step of determining the node read sequence further includes: using the sliding-window frequency of the hole-filling read sequences in the hole and the distance between identical sliding windows to identify whether the hole contains a tandem repeat, and then inferring from the tandem repeat the repeat pattern of the sequence inside the hole.
  • The step of selecting a seed read sequence includes: filtering the hole-filling read sequences by alignment rate, that is, selecting a hole-filling read sequence with a 100% alignment rate as the seed read sequence.
  • The step of selecting a seed read sequence includes: identifying and handling short similar repeats in the hole-filling read sequences, that is, when a short similar repeat is identified, selecting a hole-filling read sequence with a longer overlap as the seed read sequence.
  • The step of selecting the hole-filling read set includes: filtering the hole-filling read sequences by position, that is, locating the read sequences according to their paired-end relationship, calculating the position of each hole-filling read sequence in the hole, and filtering the read sequences according to that position.
  • The step of selecting the hole-filling read set includes: filtering the hole-filling read sequences by length, that is, selecting short paired-end read sequences in the inner region of the hole and long single-end read sequences at both ends of the hole.
  • Before the step of connecting the first contig and the second contig, the method includes: if, according to the estimated hole length, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole prematurely, the step of connecting the first contig and the second contig is not performed. Instead, the steps of determining a node read sequence, selecting a hole-filling read set and selecting a seed read sequence continue on the basis of the new first contig, and in the step of selecting a seed read sequence, a non-overlapping read sequence from outside the hole-filling read set is selected as the seed read sequence.
  • The step of completing the hole filling includes: performing sequence connection, wherein the sequence connection is a direct connection of the two contigs, a connection of a sequence extended from one end with the contig at the other end, or a connection of the sequences extended from both ends.
  • The method includes: judging the credibility of the accuracy of the sequence connection when sequences are connected. When a first credibility exists, the first credibility is selected for sequence connection; when there is no first credibility but there is a second credibility, the second credibility is selected for sequence connection; when there is neither a first nor a second credibility but there is a third credibility, the third credibility is selected for sequence connection. The first credibility is that the two connected sequences overlap, the overlap is not a repeat, and a read sequence spans the junction in support. The second credibility is that a read sequence spans the junction of the two connected sequences, which need not overlap. The third credibility is that the two connected sequences overlap but no evidence supports the overlap region.
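  • The three credibility levels above amount to a priority ordering over candidate connections. The following is a minimal sketch of that ordering, not the patented implementation: the `Join` record, its field names and the helper functions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Join:
    """Hypothetical record for one candidate connection of two sequences."""
    overlaps: bool           # the two sequences share an overlap
    overlap_is_repeat: bool  # the overlap lies in a repeat
    spanned_by_read: bool    # at least one read sequence crosses the junction


def credibility(join):
    """Return 1, 2 or 3 for the levels described in the text, or None."""
    if join.overlaps and not join.overlap_is_repeat and join.spanned_by_read:
        return 1  # overlap, not a repeat, supported by a spanning read
    if join.spanned_by_read:
        return 2  # spanning read only; the sequences need not overlap
    if join.overlaps:
        return 3  # overlap without supporting evidence
    return None   # no level applies


def pick_join(candidates):
    """Pick the candidate with the best (lowest-numbered) credibility level."""
    ranked = [(credibility(j), i) for i, j in enumerate(candidates)
              if credibility(j) is not None]
    if not ranked:
        return None
    return candidates[min(ranked)[1]]
```

  • A candidate that matches no level is never chosen in this sketch; how the method treats such a case is not stated in the text.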
  • Another technical solution adopted by the present invention is to provide a method for filling holes in nucleic acid sequence assembly, wherein the hole has a first contig at one end and a second contig at the other end, the method including the following steps. Determining node read sequences: finding a nucleic acid sequence at the end of the first contig close to the hole as a first node read sequence, and finding a nucleic acid sequence at the end of the second contig close to the hole as a second node read sequence. Selecting hole-filling read sets: finding, among the read sequences used for filling the hole, all read sequences that overlap the first node read sequence as a first hole-filling read set, and all read sequences that overlap the second node read sequence as a second hole-filling read set. Selecting seed read sequences: selecting one read sequence from the first hole-filling read set as a first seed read sequence, and one read sequence from the second hole-filling read set as a second seed read sequence. Extension processing: splic
  • … the steps of selecting, selecting a seed read sequence, and extension processing. Completing the hole filling: if the end of the new first contig near the hole overlaps the end of the new second contig near the hole, the first contig and the second contig are connected to complete the hole filling.
  • A hole-filling device in nucleic acid sequence assembly, comprising: a determining unit configured to determine a node read sequence, wherein one end of the hole has a first contig and the other end has a second contig, and a nucleic acid sequence is found at the end of the first contig near the hole as the node read sequence; and a first selection unit configured to select a hole-filling read set.
  • The second selection unit is configured to select the seed read sequence, selecting one read sequence from the hole-filling read set as the seed read sequence.
  • The extension unit is configured to perform extension processing, splicing the seed read sequence with the first contig to form a new first contig.
  • The first judging unit is configured to perform the judging processing on the result of the extension processing.
  • The loop unit is configured for loop processing: if the end of the new first contig near the hole does not overlap the end of the second contig near the hole, the processing of determining the node read sequence, selecting the hole-filling read set, selecting the seed read sequence and extension processing continues on the basis of the new first contig. The connection unit is configured to connect the first contig and the second contig to complete the hole filling when the end of the new first contig close to the hole overlaps the end of the second contig close to the hole.
  • The second judging unit is configured to determine whether the node read sequence and the read sequences used for filling the hole share a common fixed-length short string, and to apply pattern recognition between the read sequences that share a common fixed-length short string to determine the set of all overlapping read sequences.
  • The second judging unit is further configured to obtain the overlap length between read sequences that share the common fixed-length short string by sliding a window forward step by step.
  • The device includes an identifying unit configured to use the sliding-window frequency of the hole-filling read sequences in the hole and the distance between identical sliding windows to identify whether the hole contains a tandem repeat, and then to infer from the tandem repeat the repeat pattern of the sequence inside the hole.
  • The second selection unit is further configured to filter the hole-filling read sequences by alignment rate, that is, to select a hole-filling read sequence with a 100% alignment rate as the seed read sequence.
  • The second selection unit is further configured to identify and handle short similar repeats in the hole-filling read sequences, that is, when a short similar repeat is identified, to select a hole-filling read sequence with a longer overlap as the seed read sequence.
  • The first selection unit is further configured to filter the hole-filling read sequences by position, that is, to locate the read sequences according to their paired-end relationship, calculate the position of each hole-filling read sequence in the hole, and filter the read sequences according to that position.
  • The first selection unit is further configured to filter the hole-filling read sequences by length, that is, to select short paired-end read sequences in the inner region of the hole and long single-end read sequences at both ends of the hole.
  • The third judging unit is configured so that, according to the estimated hole length, if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole prematurely, the first contig and the second contig are not connected; hole filling continues on the basis of the new first contig, and when the seed read sequence is selected, a non-overlapping read sequence from outside the hole-filling read set is selected as the seed read sequence.
  • The connecting unit is specifically configured, when the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, to connect the contigs at the two ends directly, to connect a sequence extended from one end with the contig at the other end, or to connect the sequences extended from both ends.
  • The device includes a fourth judging unit configured to judge the credibility of the accuracy of the sequence connection when the connecting unit connects sequences. When a first credibility exists, the first credibility is selected and the connecting unit performs the sequence connection; when there is no first credibility but there is a second credibility, the second credibility is selected and the connecting unit performs the sequence connection; when there is neither a first nor a second credibility but there is a third credibility, the third credibility is selected and the connecting unit performs the sequence connection. The first credibility is that the two connected sequences overlap, the overlap is not a repeat, and a read sequence spans the junction in support. The second credibility is that a read sequence spans the junction of the two connected sequences, which need not overlap. The third credibility is that the two connected sequences overlap but no evidence supports the overlap region.
  • A hole-filling device in nucleic acid sequence assembly, comprising: a determining unit configured to determine node read sequences, wherein one end of the hole has a first contig and the other end has a second contig, a nucleic acid sequence is found at the end of the first contig near the hole as a first node read sequence, and a nucleic acid sequence is found at the end of the second contig near the hole as a second node read sequence; a first selection unit configured to select hole-filling read sets, finding, among the read sequences used for filling the hole, all read sequences that overlap the first node read sequence as a first hole-filling read set and all read sequences that overlap the second node read sequence as a second hole-filling read set; and a second selection unit configured to select seed read sequences, selecting one read sequence from the first hole-filling read set as a first seed read sequence and one read sequence from the second hole-filling read set as a second seed
  • The beneficial effects of the invention are as follows. In contrast to the prior art, in which the number of holes that can be filled is limited and the accuracy of hole filling is low, the present invention first finds a nucleic acid sequence at one end of the hole and uses it as the node read sequence.
  • Figure 1 is a flow chart showing a first embodiment of a method of filling a hole in the assembly of a nucleic acid sequence of the present invention
  • FIG. 2 is a flow chart showing a second embodiment of a method of filling a hole in the assembly of a nucleic acid sequence of the present invention
  • Fig. 3 is a schematic view showing the structure of an embodiment of a hole-filling device in the assembly of the nucleic acid sequence of the present invention.
  • Kmer (fixed-length string): a DNA sequence of length K; K is usually 17.
  • Single read (single-end read sequence): sequence information obtained mainly by the Sanger sequencing method, which yields the sequence of one end of a longer DNA fragment or the sequence of a shorter fragment.
  • Scaffold (connected scaffold): the result of linking contigs using linkage information from plasmids, BACs, mRNA, or other sources of paired-end reads, in which the contigs are ordered and oriented.
  • Gap (hole): genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships between the non-repetitive regions; the unassembled regions between the non-repetitive regions form gaps, called hole regions.
  • PE read: paired-end read sequence.
  • Figure 1 is a flow chart showing a first embodiment of a method of filling a hole in the assembly of a nucleic acid sequence of the present invention.
  • In the hole-filling method, one end of the hole has a first contig and the other end has a second contig, and the method includes the following steps:
  • Step 101, determining a node read sequence (read): finding a nucleic acid sequence at the end of the first contig close to the hole as the node read sequence;
  • Step 102, selecting a hole-filling read set: finding, among the read sequences used for filling the hole, all read sequences that overlap the node read sequence, as the hole-filling read set;
  • The step of selecting the hole-filling read set involves the overlap calculation between read sequences described later (see step B2);
  • Step 103, selecting a seed read sequence: selecting one read sequence from the hole-filling read set as the seed read sequence;
  • The read sequence with the smallest overlap with the node read sequence is selected as the seed read sequence.
  • Alternatively, the read sequence with the smallest overlap need not be selected as the seed read sequence; for example, a read sequence with a slightly longer overlap may be selected instead;
  • Step 104, extension processing: splicing the seed read sequence with the first contig to form a new first contig;
  • Step 105, judging: judging whether, after the extension processing, the end of the new first contig close to the hole overlaps the end of the second contig close to the hole;
  • Step 106, loop processing: if the end of the new first contig close to the hole does not overlap the end of the second contig close to the hole, the steps of determining the node read sequence, selecting the hole-filling read set, selecting the seed read sequence and extension processing are performed again on the basis of the new first contig;
  • Step 107, completing the hole filling: if the end of the new first contig close to the hole overlaps the end of the second contig close to the hole, the first contig and the second contig are connected to complete the hole filling.
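  • Steps 101 to 107 can be sketched as a simple greedy loop. This is an illustrative sketch under strong assumptions, not the patented implementation: sequences are plain strings on one strand, overlaps are exact suffix/prefix matches, and the function names and the parameters `node_len`, `min_overlap` and `max_rounds` are hypothetical.

```python
def suffix_prefix_overlap(left, right, min_len):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def fill_hole(first_contig, second_contig, reads,
              node_len=8, min_overlap=4, max_rounds=1000):
    """Extend the first contig read by read until it overlaps the second contig.

    Returns the joined sequence, or None if the extension gets stuck.
    """
    contig = first_contig
    for _ in range(max_rounds):  # guard against endless extension in repeats
        # Steps 105 and 107: if the contig now reaches the second contig, join them.
        n = suffix_prefix_overlap(contig, second_contig, min_overlap)
        if n:
            return contig + second_contig[n:]
        # Step 101: the node read sequence is the contig end next to the hole.
        node = contig[-node_len:]
        # Step 102: collect reads that overlap the node and extend beyond it.
        candidates = []
        for read in reads:
            n = suffix_prefix_overlap(node, read, min_overlap)
            if n and len(read) > n:
                candidates.append((n, read))
        if not candidates:
            return None
        # Step 103: the seed is the read with the smallest overlap.
        n, seed = min(candidates)
        # Step 104: splice the non-overlapping part of the seed onto the contig.
        contig += seed[n:]
    return None
```

  • For example, with a toy sequence `ACGTACCTGAAGTCTTGGCATCGA`, a first contig `ACGTACCT`, a second contig `GGCATCGA` and the reads `ACCTGAAG`, `GAAGTCTT` and `TCTTGGCA`, the loop extends the first contig three times before the two contigs overlap. Step 106 corresponds to the next iteration of the `for` loop.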
  • As can be understood from the above, the prior-art hole-filling methods can fill only a limited number of holes, and their accuracy is low.
  • The present invention instead finds a nucleic acid sequence on the first contig at one end of the hole as the node read sequence.
  • The read sequence with the smallest overlap with the node read sequence is used as the seed read sequence, and the seed read sequence is spliced with the first contig to form a new first contig; that is, the first contig is extended. It is then judged whether the end of the new first contig close to the hole overlaps the end of the second contig close to the hole. If not, the extension is repeated until the two ends overlap, whereupon the first contig and the second contig are connected to complete the hole filling. Extending the contig in this way not only greatly increases the number of holes filled and the efficiency of hole filling, but also improves its accuracy.
  • The step of selecting a hole-filling read set includes: determining whether the node read sequence and the read sequences used for filling the hole share a common fixed-length short string (Kmer), and applying pattern recognition between read sequences that share a common fixed-length short string to determine the set of all overlapping read sequences. This method quickly finds the set of all overlapping read sequences.
  • Other methods that do not rely on a common fixed-length short string may also be used; they are not described here.
  • In practice, a hash method or a similar algorithm may be used to determine whether the node read sequence and the read sequences used for filling the hole share a common fixed-length short string.
  • The pattern recognition step may obtain the overlap length between the read sequences that share the common fixed-length short string by sliding a window forward step by step.
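  • The hash-based test for a shared fixed-length short string can be sketched with a dictionary that maps each Kmer to the reads containing it. This is only an illustration: the function names are hypothetical, a Python `dict` stands in for whatever hash structure the method uses, and the example uses a small `k` even though the text gives 17 as the usual value.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map every k-mer to the set of indices of the reads that contain it."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(i)
    return index


def reads_sharing_kmer(node, index, k):
    """Indices of reads that share at least one k-mer with the node read."""
    hits = set()
    for p in range(len(node) - k + 1):
        hits |= index.get(node[p:p + k], set())
    return hits
```

  • Reads returned by `reads_sharing_kmer` are only candidates: as the text says, a shared Kmer means an overlap is possible, and the overlap itself still has to be confirmed by the pattern recognition step.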
  • The level of a hole is determined from its size and the criteria set by the system. Nucleic acid sequence holes are divided into small holes, middle holes and large holes, and each hole is filled according to its level and the corresponding base sequence segments.
  • Holes are classified as follows: a hole shorter than 100 bp is defined as a small hole, a hole between 100 bp and 1.5 kb long is defined as a middle hole, and a hole longer than 1.5 kb is defined as a large hole.
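  • The classification can be written as a small function. The thresholds come from the text; treating exactly 100 bp and exactly 1.5 kb as middle holes is an assumption, as are the function and parameter names.

```python
def classify_hole(length_bp, small_max=100, large_min=1500):
    """Classify a hole by length, using the example thresholds from the text.

    Assumption: the boundaries 100 bp and 1500 bp both count as "middle".
    """
    if length_bp < small_max:
        return "small"
    if length_bp <= large_min:
        return "middle"
    return "large"
```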
  • The above is only one possible definition of the hole levels; the size ranges are merely exemplary and are not limiting.
  • A scaffold that contains nucleic acid sequence holes is acquired and analyzed.
  • The original scaffold is broken into contigs, and the gap between two contigs is a hole.
  • In this way, the size of each hole and the contigs before and after it can be obtained accurately.
  • The embodiment of the present invention further divides all acquired nucleic acid sequence holes and contigs according to the user's setting, and stores the associated contigs and read sequences in corresponding folders. For example, if the user sets 4 folders, the acquired nucleic acid sequence holes and contigs are divided into 4 parts, 4 folders are generated, and the associated contigs and read sequences are stored in the folders one by one. After this division, each folder contains contigs and the read sequences for filling their holes, and in the subsequent hole-filling process the contigs and read sequences can be obtained directly from the corresponding folder. With this division into four, the memory required is reduced to roughly a quarter of the original, which saves space, and the search time during hole filling is reduced, which shortens hole filling.
  • The read sequences used for filling the hole are read in.
  • Most of the read sequences used for filling the hole are PE read sequences from Solexa sequencing results; the remainder are long single-end read sequences from Sanger sequencing results.
  • PE read sequences support each other: a pair of PE read sequences comes from the two ends of one insert, and the inserts used for hole filling generally have sizes of 180 bp, 500 bp and 800 bp.
  • With high-throughput, high-coverage sequencing, an insert can be restored from the overlap relationships among multiple PE read sequences.
  • When the read sequence paired with a given read sequence falls within the nucleic acid sequence hole or on a contig adjoining the hole, it can be used to fill the hole.
  • As for long read sequences: because a long read sequence is itself long, it can span a short nucleic acid sequence hole. If every base of the long read sequence is trustworthy, the base at each position of the long read sequence can be used to fill such a short hole exactly.
  • The positional relationship between each read sequence and the nucleic acid sequence hole, the contig and scaffold to which the read sequence belongs, and the sequence information of the read sequence itself are acquired.
  • The hole-filling process specifically includes: A, the hole-filling process for small holes; B, the hole-filling process for middle holes; and C, the hole-filling process for large holes.
  • A. The hole-filling process for small holes:
  • The specific method is as follows. A hole length can be calculated from each read sequence that spans the hole. A frequency table is built over all such read sequences to represent the range of hole lengths. The table is needed because possible alignment errors cause different read sequences, when connected to the contigs, to indicate different hole lengths. The hole length with the highest frequency in the table is selected as the actual hole length.
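  • The frequency table can be sketched with a counter over the hole lengths implied by the spanning read sequences. The function name and the input format (a plain list of implied lengths) are assumptions; how each length is derived from an alignment is outside this sketch.

```python
from collections import Counter


def estimate_hole_length(spanning_lengths):
    """Return the most frequent implied hole length and the frequency table.

    `spanning_lengths` holds one implied hole length per spanning read.
    Ties go to the length that was seen first (an arbitrary choice here).
    """
    if not spanning_lengths:
        raise ValueError("no read sequence spans the hole")
    table = Counter(spanning_lengths)
    length, _count = table.most_common(1)[0]
    return length, table
```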
  • The bases that a read sequence indicating this hole length places inside the hole may be the true bases of the hole, so all read sequences that indicate the actual hole length are analyzed base by base to determine the base at each position. If the determined actual hole length is less than a first threshold preset by the system, such as 0, the contigs at the two ends are judged to overlap. It is then determined whether the overlap is a repeat; if so, the repeat pattern is determined; otherwise the overlapping part is trimmed from the end of the contig.
  • The present embodiment searches for other read sequences that fall within the small hole but do not span it, and aligns them with the read sequences used to determine the length of the small hole. If the alignment error is below the tolerance (usually 3%), each base that the length-determining read sequence places in the hole is considered trustworthy and can be used to fill the hole.
  • If the alignment error exceeds the tolerance (usually 3%), the bases that the length-determining read sequence places in the hole are considered untrustworthy, and the untrustworthy part is trimmed. This ensures the accuracy of the read sequences filled into small holes.
  • B. The hole-filling process for middle holes is as follows.
  • The block size is set to 6 bp or 12 bp.
  • A block is a window that contains a fixed number of bases and slides along the read sequence one base at a time. Specifically, suppose a window contains X bases: initially the window covers bases 1 to X; after the first slide it covers bases 2 to X+1; and so on, moving forward one base per slide. After the nth slide the window covers bases n+1 to X+n.
  • This embodiment records the block frequency (block_freq) and the distance between identical blocks (block_dis) for analysis. If block_freq reaches a maximum at a particular block_dis value, and that distance equals the number of bases in the block, the sequence is judged to contain a tandem repeat.
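  • One way to read the block_freq and block_dis criterion is sketched below. It is an interpretation under assumptions, not the patented procedure: block_dis is taken as the distance between consecutive occurrences of the same block, the "maximum" is taken as the most common distance, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def block_distance_profile(seq, block_size):
    """Slide a block one base at a time and profile repeated blocks.

    Returns (block_freq, block_dis): block_freq maps each block to its
    number of occurrences, block_dis counts the distances between
    consecutive occurrences of the same block.
    """
    positions = defaultdict(list)
    for p in range(len(seq) - block_size + 1):
        positions[seq[p:p + block_size]].append(p)
    block_freq = {block: len(ps) for block, ps in positions.items()}
    block_dis = Counter()
    for ps in positions.values():
        for a, b in zip(ps, ps[1:]):
            block_dis[b - a] += 1
    return block_freq, block_dis


def has_tandem_repeat(seq, block_size=6):
    """Apply the criterion from the text: dominant distance == block size."""
    _, block_dis = block_distance_profile(seq, block_size)
    if not block_dis:
        return False  # no block occurs twice
    distance, _count = block_dis.most_common(1)[0]
    return distance == block_size
```

  • With this reading, a repeat unit whose length differs from the block size would not satisfy the criterion, which may be why the text allows more than one block size (6 bp or 12 bp).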
  • The embodiment of the present invention further infers the tandem repeat pattern from the information obtained while judging the tandem repeat: if the sequence contains only one tandem repeat, it is judged to be a single tandem repeat pattern; if it contains several tandem repeats, whether interleaved or not, it is judged to be a multiple tandem repeat pattern.
  • The embodiment of the present invention records the block frequency and determines the repeat status inside the hole by calculating the expected block depth in the hole and analyzing the block depth distribution in the hole: if the frequency of a block in the hole is a multiple of the expected block depth in the hole, the block is judged to be a repeat.
  • The overlap calculation first uses a hash method to quickly determine whether read sequences share a common kmer; read sequences that share a common kmer may overlap.
  • A kmer is defined as a contiguous sequence of k bases.
  • The kmer distribution is closely related to genome size, error rate and heterozygosity. Pattern recognition is then performed on each pair of read sequences that may overlap.
  • The number of blocks can be increased as appropriate.
  • Front-end extension: for the starting node read sequence, all read sequences that overlap the node read sequence are found, and the read sequence with the smallest overlap with the node read sequence is selected as the seed read sequence. The other read sequences should then overlap the seed read sequence, and these overlaps are necessarily longer than the overlap between the seed read sequence and the node read sequence. If a read sequence does not overlap the seed read sequence, a conflict is judged to have occurred, and it is resolved by replacing the seed read sequence, that is, by finding a new seed read sequence. This ensures the correctness of the seed read sequence that is found.
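  • The seed choice with conflict checking can be sketched as follows. This is a simplified illustration, not the patented procedure: candidates are given as (overlap with the node read sequence, read sequence) pairs, "overlaps the seed" is checked as exact agreement at the position implied by the two overlap lengths, and replacing a conflicting seed is assumed to mean trying the candidate with the next smallest overlap. All names are hypothetical.

```python
def consistent_with_seed(seed, seed_ov, read, read_ov):
    """Check that `read` agrees with `seed` where the two should overlap.

    A read with a larger node overlap starts `read_ov - seed_ov` bases to
    the left of the seed, so its remaining bases must match the seed start.
    """
    shift = read_ov - seed_ov
    tail = read[shift:]
    n = min(len(tail), len(seed))
    return n > 0 and tail[:n] == seed[:n]


def choose_seed(candidates):
    """Pick the smallest-overlap read that no remaining candidate contradicts.

    `candidates` is a list of (overlap_with_node, read) pairs. Returns the
    chosen pair, or None when there are no candidates.
    """
    ordered = sorted(candidates)
    for i, (seed_ov, seed) in enumerate(ordered):
        others = ordered[i + 1:]  # discarded seeds are ignored (assumption)
        if all(consistent_with_seed(seed, seed_ov, read, ov)
               for ov, read in others):
            return seed_ov, seed
    return None
```

  • For example, with the node read sequence `ACGTACCT`, the candidates `(3, "CCTTTTTT")`, `(4, "ACCTGAAG")` and `(6, "GTACCTGA")` lead to the first candidate being rejected, because the other two reads disagree with it, and `(4, "ACCTGAAG")` being chosen as the seed.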
  • the embodiments of the present invention are extended. Treating the seed reading as part of the contig, continue to find the new seed reading as described above. If it can be found, it is judged that the sequence will continue to extend. Otherwise, it is judged that the sequence extension is over, and it is necessary to wait for the extension of the other end to determine two. Whether there is overlap in the end extensions to determine whether the hole can be completed.
  • the forward-extended read sequence may find the sequence that was previously the seed read sequence as its extended read sequence, which will cause an infinite loop extending within the range and treat it as a conflict.
  • the extension is terminated.
  • Back-end extension in the embodiment of the present invention is similar to front-end extension and is not described in detail here.
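The front-end extension described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: reads are assumed to be error-free and of equal length, the node read is taken to be the last `node_len` bases of the contig, and all names (`overlap_len`, `select_seed`, `extend_front`, `min_len`, `max_steps`) are hypothetical.

```python
def overlap_len(left, right, min_len=3):
    """Longest suffix of left equal to a proper prefix of right (0 if < min_len)."""
    for n in range(min(len(left), len(right) - 1), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def select_seed(node, reads, min_len=3):
    """Pick the read with the smallest overlap with node; replace it on conflict."""
    hits = sorted({(overlap_len(node, r, min_len), r) for r in reads})
    hits = [(o, r) for o, r in hits if o > 0]
    while hits:
        seed_ov, seed = hits[0]
        # Every other candidate must overlap the seed at least as much as the
        # seed overlaps the node; otherwise the seed is in conflict.
        if all(overlap_len(r, seed, min_len) >= seed_ov for _, r in hits[1:]):
            return seed
        hits.pop(0)  # conflict: discard this seed and look for a new one
    return None


def extend_front(contig, reads, node_len=8, min_len=3, max_steps=1000):
    """Extend contig to the right, one seed read at a time."""
    used = set()
    for _ in range(max_steps):
        node = contig[-node_len:]
        seed = select_seed(node, reads, min_len)
        if seed is None or seed in used:  # no seed, or a loop back to an old seed
            break
        used.add(seed)
        contig += seed[overlap_len(node, seed, min_len):]
    return contig
```

Choosing the smallest overlap makes each step add as many new bases as possible, while the check against the other candidates guards against a seed that no other read supports. Back-end extension can reuse the same functions on reversed sequences.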
  • Conflict identification should be as sensitive as possible, and the reads used for hole filling should also have a low error rate.
  • The reads are error-corrected in advance to improve read quality and to ensure the accuracy of both ends of each read.
  • Alignment-rate filtering: a read must have a 100% alignment rate to be used as a seed read for extension.
  • The alignment-rate filtering uses the following strategies:
  • The embodiment of the present invention thereby avoids the problem that extension fails because of a base error in the seed read.
  • The overlap between reads is computed by stepwise block extension: a block is selected from the seed read, a target read is chosen, and the bases in the block are searched for in the target read. If they are found, the block is moved forward by one unit in the seed read and compared with the target read again, and this is repeated until no match is found. The overlap length between the seed read and the target read is then known. A third threshold is applied to this length; the third threshold is 1 kmer, so that an overlap between two reads is non-accidental and truly credible. If the previous seed read itself contains a sequencing error, a large number of reads may be filtered out; in this case a loop is set up to replace the previous seed read.
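The stepwise block extension described above can be sketched as follows. The sketch assumes the block starts at the beginning of the seed read, that a "unit" is one base, and that the first occurrence of the block in the target read is the relevant one; the function name and the parameter `k` (block length) are hypothetical.

```python
def stepwise_overlap(seed, target, k):
    """Overlap length found by sliding a k-base block along seed, one base at a time.

    Returns 0 when the first block of seed does not occur in target.
    """
    if len(seed) < k:
        return 0
    start = target.find(seed[:k])  # position of the first block in the target
    if start < 0:
        return 0
    steps = 1  # number of consecutive blocks matched so far
    while (steps + k <= len(seed)
           and start + steps + k <= len(target)
           and seed[steps:steps + k] == target[start + steps:start + steps + k]):
        steps += 1
    return steps + k - 1  # matched blocks cover steps + k - 1 bases
```

The returned length can then be compared with the threshold to decide whether the overlap is credible.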
  • Position filtering: reads are positioned according to their paired-end relationships, the position of each read within the hole is calculated, and reads are filtered by position, which reduces conflicts caused by long repeats in the hole.
  • the embodiment of the present invention can set strict filtering conditions.
  • Read-length filtering: among the reads, paired-end (PE) reads are short, whereas single-end reads (single reads) are usually longer. Longer single-end reads overlap one end of the hole.
  • Short paired-end reads are preferentially used for extension in the interior of the hole, and long single-end reads are preferentially used for extension at the two ends of the hole.
  • End filtering: if, according to the expected hole length, the extended sequence overlaps the other end too early, a non-overlapping read is selected, that is, a read that lies immediately after the extending read, does not overlap it, and whose placement does not conflict with the expected hole length. This ensures that the repeat region is crossed.
  • End filtering can occur only once.
  • Processing and recognition of short similar repeats: short similar repeats are usually shorter than 50 bp and lie relatively close together, which eventually causes base deletions in the assembled nucleic acid sequence.
  • The embodiment of the present invention preferentially selects a read with a longer overlap as the seed read, which effectively avoids the problem of short similar repeats.
  • Sequence connection: in the hole-filling process, the embodiment of the invention requires not only accurate assembly but also accurate connection. Accurate assembly guarantees a low base error rate on the one hand and enables an accurate connection on the other. The accuracy of the connection directly determines whether an insertion/deletion will eventually occur. Moreover, the connection must take extension errors into account.
  • The sequence connections of the embodiment of the present invention can be divided into the following three credibility levels according to connection quality:
  • All three credibility levels may be present. The first credibility is of higher quality, but this does not mean that it is correct.
  • The second credibility is of the second-highest quality; likewise, this does not mean that it is correct. Therefore, the connections in the hole are classified and processed according to the actual situation.
  • The connections in the hole are divided into three categories: the contigs at the two ends connect directly; one end is extended and connects to the contig at the other end; or both ends are extended and connect.
  • When the first credibility exists, the first credibility is selected and the sequences are connected; if there is no first credibility but there is a second credibility, the second credibility is selected and the sequences are connected; if there is neither a first nor a second credibility but there is a third credibility, the third credibility is selected and the sequences are connected.
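The priority rule above can be sketched as follows, using the three credibility definitions given in this document (first: the two sequences overlap, the overlap is not a repeat, and reads span the connection; second: reads span the connection, whether or not the sequences overlap; third: the sequences overlap without supporting evidence). The function names and the tuple layout of a candidate are assumptions made for illustration.

```python
def credibility(overlaps, in_repeat, spanned_by_read):
    """Return credibility level 1, 2 or 3 for a candidate connection, else None."""
    if overlaps and not in_repeat and spanned_by_read:
        return 1
    if spanned_by_read:
        return 2
    if overlaps:
        return 3
    return None


def choose_connection(candidates):
    """Pick the candidate with the best (lowest-numbered) credibility level.

    candidates -- list of (name, overlaps, in_repeat, spanned_by_read)
    Returns (level, name), or None when no candidate is connectable.
    """
    ranked = [(credibility(o, rep, sp), name) for name, o, rep, sp in candidates]
    ranked = [(level, name) for level, name in ranked if level is not None]
    return min(ranked) if ranked else None
```

A lower-level candidate is used only when no higher-level one exists, which mirrors the fallback order in the text.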
  • A large hole is divided into multiple holes, each of which is processed in the same way as a middle hole.
  • The longest insert size of the supporting PE reads is 800 bp.
  • If the hole is longer than 1.5 kb, then after subtracting the overlaps with the contigs at the two ends, two 800 bp inserts cannot overlap each other; that is, no complete path can be found that fully fills a large hole.
  • The large hole is therefore divided into several middle holes, the middle holes are assembled separately, and the assembly results are finally connected. The details are as follows:
  • Each segment is assembled in the same way as a middle hole.
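The division of a large hole can be sketched as follows. The text gives the 1.5 kb limit but not the exact way the segments are chosen, so the near-equal split used here, the function name `split_hole`, and the parameter `max_middle` are assumptions made purely for illustration.

```python
import math


def split_hole(hole_len, max_middle=1500):
    """Split a hole of hole_len bases into segments of at most max_middle bases.

    Assumption: segments are made as equal in length as possible.
    """
    if hole_len <= max_middle:
        return [hole_len]  # already a middle (or small) hole
    n = math.ceil(hole_len / max_middle)
    base, extra = divmod(hole_len, n)
    return [base + 1] * extra + [base] * (n - extra)
```

For example, under this assumption a 4000 bp hole becomes three segments of 1334, 1333 and 1333 bp, each short enough to be handled as a middle hole.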
  • The hole can be extended from one end or from both ends.
  • The technical solution of extending from both ends is described below:
  • Figure 2 is a flow chart showing a second embodiment of a method of filling a hole in the assembly of a nucleic acid sequence of the present invention.
  • In this hole-filling method, one end of the hole has a first contig and the other end has a second contig, and filling the hole comprises the following steps:
  • Step 201, determine the node reads: find a nucleic acid sequence at the end of the first contig near the hole as the first node read, and find a nucleic acid sequence at the end of the second contig near the hole as the second node read;
  • Step 202, select the hole-filling read sets: from the reads used for hole filling, find all reads that overlap the first node read as the first hole-filling read set, and find all reads that overlap the second node read as the second hole-filling read set;
  • Step 203, select the seed reads: select one read from the first hole-filling read set as the first seed read, and select one read from the second hole-filling read set as the second seed read;
  • Specifically, the read with the smallest overlap with the first node read is selected from the first hole-filling read set as the first seed read, and the read with the smallest overlap with the second node read is selected from the second hole-filling read set as the second seed read;
  • Step 204, extension: splice the first seed read with the first contig to form a new first contig, and splice the second seed read with the second contig to form a new second contig;
  • Step 205, judgment: judge whether, after the extension, the end of the new first contig near the hole overlaps the end of the new second contig near the hole;
  • Step 206, loop: if the end of the new first contig near the hole does not overlap the end of the new second contig near the hole, repeat the steps of determining the node reads, selecting the hole-filling read sets, selecting the seed reads, and extension on the basis of the new first and second contigs;
  • Step 207, complete the hole filling: if the end of the new first contig near the hole overlaps the end of the new second contig near the hole, connect the first contig and the second contig to complete the hole filling.
  • The present invention determines a nucleic acid sequence on the first contig at one end of the hole as the first node read.
  • The first contig is extended to form a new first contig. At the same time, a nucleic acid sequence on the second contig at the other end of the hole is determined as the second node read, the read with the smallest overlap with the second node read is found and used as the second seed read, and the second seed read is spliced with the second contig to form a new second contig, that is, the second contig is extended. It is then judged whether the end of the new first contig near the hole overlaps the end of the new second contig near the hole; if not, the extension is repeated until the end of the new first contig near the hole and the end of the new second contig near the hole overlap, connecting the first
  • The two ends can be extended simultaneously or alternately; it is also possible to extend one end at one time and the other end at another time. Details are not repeated here.
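Steps 201 to 207 can be sketched as a simple loop. This is a simplified illustration under stated assumptions: reads are error-free, the whole contig end is used in place of a separate node read, seed selection uses only the smallest-overlap rule without conflict checking, both ends are extended once per round, and all names (`overlap_len`, `extend_right`, `extend_left`, `fill_hole`, `min_len`, `max_rounds`) are hypothetical.

```python
def overlap_len(left, right, min_len):
    """Longest suffix of left that equals a prefix of right (0 if < min_len)."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def extend_right(contig, reads, min_len):
    """Steps 201-204 for one end: splice on the read with the smallest overlap."""
    hits = [(overlap_len(contig, r, min_len), r) for r in reads]
    hits = [(o, r) for o, r in hits if 0 < o < len(r)]  # read must add new bases
    if not hits:
        return contig
    o, seed = min(hits)
    return contig + seed[o:]


def extend_left(contig, reads, min_len):
    """Extend towards the left by running extend_right on reversed sequences."""
    reversed_reads = [r[::-1] for r in reads]
    return extend_right(contig[::-1], reversed_reads, min_len)[::-1]


def fill_hole(first, second, reads, min_len=3, max_rounds=50):
    """Extend both contigs until their hole-side ends overlap, then join them."""
    for _ in range(max_rounds):
        o = overlap_len(first, second, min_len)          # step 205
        if o:
            return first + second[o:]                    # step 207
        new_first = extend_right(first, reads, min_len)
        new_second = extend_left(second, reads, min_len)
        if (new_first, new_second) == (first, second):
            return None                                  # neither end can extend
        first, second = new_first, new_second            # step 206
    return None
```

Returning `None` when neither end can be extended leaves the hole open rather than forcing a join.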
  • Fig. 3 is a view showing the structure of an embodiment of a hole-filling device in the assembly of the nucleic acid sequence of the present invention.
  • The device includes: a determining unit 31, a first selecting unit 32, a second selecting unit 33, an extending unit 34, a first judging unit 35, a loop unit 36, a connecting unit 38, an identifying unit 39, a second judging unit 40, a third judging unit 41, and a fourth judging unit 37.
  • The determining unit 31 is configured to determine the node read: one end of the hole has a first contig and the other end has a second contig, and a nucleic acid sequence at the end of the first contig near the hole is taken as the node read.
  • The first selecting unit 32 is configured to select the hole-filling read set: from the reads used for hole filling, all reads that overlap the node read are taken as the hole-filling read set.
  • The second selecting unit 33 is configured to select the seed read: one read is selected from the hole-filling read set as the seed read.
  • The extending unit 34 is configured to perform the extension: the seed read is spliced with the first contig to form a new first contig.
  • The first judging unit 35 is configured to judge whether, after the extension, the end of the new first contig near the hole overlaps the end of the second contig near the hole.
  • The loop unit 36 is configured to perform the loop: if the end of the new first contig near the hole does not overlap the end of the second contig near the hole, then, on the basis of the new first contig, the determining unit 31 again determines the node read, the first selecting unit 32 selects the hole-filling read set, the second selecting unit 33 selects the seed read, and the extending unit 34 performs the extension.
  • The connecting unit 38 is configured to complete the hole filling: if the end of the new first contig near the hole overlaps the end of the second contig near the hole, the first contig and the second contig are connected and the hole filling is complete. The connecting unit 38 also performs the sequence connection, which is divided into direct connection of the contigs at the two ends, connection of one extended end to the contig at the other end, and connection of two extended ends.
  • The identifying unit 39 is configured to use the frequency of sliding windows of the hole-filling reads within the hole and the distance between identical sliding windows to identify whether a tandem repeat exists in the hole, and to further infer the repeat pattern of the sequence in the hole from the tandem repeats.
  • The second judging unit 40 is configured to judge whether the node read and the reads used for hole filling share a common fixed-length short string (kmer), and to use pattern recognition to determine the set of all reads that overlap among the reads sharing a common kmer.
  • The third judging unit 41 is configured so that, if according to the predicted hole length the end of the new first contig near the hole overlaps the end of the second contig near the hole too early, the contigs are not connected; hole filling continues on the basis of the new first contig, and, when the seed read is selected, a non-overlapping read outside the hole-filling read set is selected as the seed read.
  • The fourth judging unit 37 is configured to judge the credibility of the sequence connection when the connecting unit 38 performs the sequence connection. When the first credibility exists, the first credibility is selected and the connecting unit 38 performs the sequence connection; when there is no first credibility but there is a second credibility, the second credibility is selected and the connecting unit 38 performs the sequence connection; when there is neither a first nor a second credibility but there is a third credibility, the third credibility is selected and the connecting unit 38 performs the sequence connection.
  • The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span the connection in support of it.
  • The second credibility means that reads span the connection between the two connected sequences, although the two sequences may not overlap.
  • The third credibility means that the two connected sequences overlap, but no evidence supports the overlap region.
  • The hole-filling device for nucleic acid sequence assembly of the present invention fills a hole as follows. Before hole filling, the identifying unit 39 determines whether a tandem repeat exists in the hole and further infers the repeat pattern of the sequence, which facilitates the hole filling. During hole filling, one end of the hole has a first contig and the other end has a second contig. First, the determining unit 31 finds a nucleic acid sequence at the end of the first contig near the hole as the node read.
  • The second judging unit 40 then determines the overlap regions between the node read and the reads used for hole filling. According to its result, the first selecting unit 32 finds, among the reads used for hole filling, all reads that overlap the node read and takes them as the hole-filling read set. The second selecting unit 33 then selects from the hole-filling read set the read with the smallest overlap with the node read as the seed read, and the extending unit 34 splices the seed read with the first contig to form a new first contig.
  • The first judging unit 35 judges whether the end of the new first contig near the hole overlaps the end of the second contig near the hole. If it does not, the loop unit 36 continues, on the basis of the new first contig, to determine the node read, select the hole-filling read set, select the seed read, and perform the extension. If the end of the new first contig near the hole overlaps the end of the second contig near the hole, the connecting unit 38 connects the first contig to the second contig and the hole filling is complete.
  • If, according to the predicted hole length, the end of the new first contig near the hole overlaps the end of the second contig near the hole too early, the third judging unit 41 does not connect the first contig and the second contig. Instead, the steps of determining the node read, selecting the hole-filling read set, and selecting the seed read continue on the basis of the new first contig, and, when the seed read is selected, a non-overlapping read outside the hole-filling read set is selected as the seed read.
  • The fourth judging unit 37 judges the credibility of the sequence connection when the connecting unit 38 performs the sequence connection. When the first credibility exists, the first credibility is selected and the connecting unit 38 performs the sequence connection; when there is no first credibility but there is a second credibility, the second credibility is selected and the connecting unit 38 performs the sequence connection; when there is neither a first nor a second credibility but there is a third credibility, the third credibility is selected and the connecting unit 38 performs the sequence connection. The first credibility means that the two connected sequences overlap, the overlap is not a repeat, and reads span the connection in support of it; the second credibility means that reads span the connection between the two connected sequences, although the two sequences may not overlap; the third credibility means that the two connected sequences overlap, but no evidence supports the overlap region.
  • The connecting unit 38 performs the sequence connection, which is divided into direct connection of the contigs at the two ends, connection of one extended end to the contig at the other end, and connection of two extended ends.
  • The middle-hole processing method is the core: a small hole can be converted into a middle hole for processing, and a large hole is decomposed entirely into middle holes, which are processed by the middle-hole method.
  • Holes of different levels are handled by different processing modes, the hole-filling process is refined down to the individual hole, and the hole itself is fully exploited to complete the hole filling, so that all holes can be filled effectively. This greatly improves the accuracy of hole filling, saves hole-filling time and memory, and is conducive to the development and promotion of gene sequencing technology.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method and a device for closing a gap in a nucleotide sequence assembly, one end of the gap having a first contig and the other end of the gap having a second contig. The method comprises the steps of: selecting a nucleotide sequence at the end of the first contig near the gap and taking that nucleotide sequence as a node read; using all reads that overlap the node read as the read set for gap closure, and selecting one read from the set as the seed read; splicing the seed read and the first contig to form a new first contig; determining whether the end of the new first contig near the gap, after the extension process, overlaps the end of the second contig near the gap; if the two ends do not overlap, continuing gap closure from the new first contig; and if the two ends overlap, connecting the first contig and the second contig to complete the gap closure.
PCT/CN2011/083178 2011-11-29 2011-11-29 Method and device for closing a gap in a nucleotide sequence assembly WO2013078623A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/083178 WO2013078623A1 (fr) 2011-11-29 2011-11-29 Method and device for closing a gap in a nucleotide sequence assembly

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/083178 WO2013078623A1 (fr) 2011-11-29 2011-11-29 Method and device for closing a gap in a nucleotide sequence assembly

Publications (1)

Publication Number Publication Date
WO2013078623A1 (fr) 2013-06-06

Family

ID=48534608

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/083178 WO2013078623A1 (fr) 2011-11-29 2011-11-29 Method and device for closing a gap in a nucleotide sequence assembly

Country Status (1)

Country Link
WO (1) WO2013078623A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998026096A1 (fr) * 1996-12-12 1998-06-18 Smithkline Beecham Corporation Procede de comblement rapide d'espace

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QU, XUEPING ET AL.: "Progress in Contig", PROGRESS IN BIOTECHNOLOGY, vol. 19, no. 3, 1999, pages 2 - 6 *
ZHAO, DONGSHENG ET AL.: "Computing Resource of AMMS Biomedicine Super-computing Center and Its Applications.", BULL. ACAD. MIL. MED. SCI., vol. 29, no. 4, August 2005 (2005-08-01), pages 363 - 367 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11876834

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/11/2014)

122 Ep: pct application non-entry in european phase

Ref document number: 11876834

Country of ref document: EP

Kind code of ref document: A1