WO2013078625A1 - Gap closure method and device in nucleotide sequence assembly - Google Patents

Gap closure method and device in nucleotide sequence assembly

Info

Publication number
WO2013078625A1
WO2013078625A1 PCT/CN2011/083186 CN2011083186W WO2013078625A1 WO 2013078625 A1 WO2013078625 A1 WO 2013078625A1 CN 2011083186 W CN2011083186 W CN 2011083186W WO 2013078625 A1 WO2013078625 A1 WO 2013078625A1
Authority
WO
WIPO (PCT)
Prior art keywords
hole
sequence
contig
length
actual
Prior art date
Application number
PCT/CN2011/083186
Other languages
French (fr)
Chinese (zh)
Inventor
刘兵行
李振宇
陈燕香
李英睿
汪建
王俊
杨焕明
Original Assignee
深圳华大基因科技有限公司
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因科技有限公司, 深圳华大基因研究院
Priority to PCT/CN2011/083186
Publication of WO2013078625A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly

Definitions

  • The invention relates to the field of genetic engineering technology, and in particular to a method and device for filling holes in nucleic acid sequence assembly.
  • The cost of sequencing keeps falling, which is driving whole-genome sequencing of more and more species.
  • The principle of second-generation sequencing technology means that the sequenced fragments are short. In practice, the sequenced fragments contain only tens to hundreds of bases, which increases the difficulty of analyzing the sequencing data.
  • Genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships between the non-repetitive regions. The unassembled region between non-repetitive regions is likely to form a gap, which is called a hole.
  • Graph-based local assembly can identify the conflicting sites caused by repeats, but the conflicts are difficult to resolve and the assembly has to be broken at those sites, which affects the number of holes.
  • The prior art mainly has two hole-filling programs: the Gapcloser program, which is based on overlap-based local assembly, and the SOAPdenovo program, which is based on local assembly of the De Bruijn graph.
  • The hole-filling software Gapcloser performs local assembly based on overlapping base sequence segments. Because this method does not take into account the complexity of the situation inside the hole, it easily causes errors when processing complex holes and reduces the overall accuracy.
  • Because Gapcloser consumes a large amount of memory and is time-consuming, it is not suitable for filling the holes of large genomes.
  • Hole filling in the SOAPdenovo assembly software performs a secondary assembly of the region inside the hole based on the De Bruijn graph. Although it can effectively resolve holes of smaller length, the number of holes it can fill is limited.
  • The technical problem to be solved by the present invention is to provide a method and device for filling holes in nucleic acid sequence assembly that can accurately obtain the intra-hole sequence characterizing the hole length and improve the accuracy of hole filling.
  • To solve this problem, the present invention adopts a technical solution that provides a method for filling a hole in nucleic acid sequence assembly, the hole having a first contig at one end and a second contig at the other end. The method comprises the following steps: obtaining, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig; calculating the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence, that is, the part of an overlapping read that falls into the hole; determining whether there is a tandem repeat in the regions where the read from which the actual hole length was calculated overlaps the first contig and the second contig, respectively, and whether there is a tandem repeat in the part of that read that falls into the hole; and, if the results of all of the above determinations are no, obtaining the intra-hole sequence and completing the hole filling.
  • After the step of performing each determination, the method includes: if any one or more of the above determinations is yes, performing no hole filling.
  • After the step of calculating the actual hole length of the hole from the overlapping reads, the method includes: determining whether there is a tandem repeat in the overlapping region of the first contig and the second contig; if the determination is no, cutting off the overlapping region of the first contig or the second contig to complete the hole filling; and if the determination is yes, performing no hole filling.
  • The step of calculating the actual hole length of the hole from the overlapping reads includes: calculating a hole length from each overlapping read to form a hole length frequency table; and selecting the hole length with the highest frequency in the hole length frequency table as the actual hole length.
  • After the step of calculating the actual hole length of the hole from the overlapping reads, the method includes: determining whether the calculated actual hole length falls within a predetermined hole length range, and whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency. If both determinations are yes, the method proceeds to the step of determining whether there is a tandem repeat in the regions where the read overlaps the first contig and the second contig and in the part of the read that falls into the hole, and to the step of determining whether there is a tandem repeat in the overlapping region of the first contig and the second contig. If any one or more of the determinations is no, hole filling is stopped.
  • After the step of calculating the actual hole length of the hole from the overlapping reads, the method includes: if the actual hole length is 0, connecting the first contig and the second contig to complete the hole filling.
  • The step of obtaining the intra-hole sequence and completing the hole filling comprises: comparing the obtained intra-hole sequence with all reads used for filling the hole to determine the accuracy of the bases of the intra-hole sequence; and, if the intra-hole sequence is accurate, obtaining the intra-hole sequence and completing the hole filling.
  • Another technical solution adopted by the present invention is a hole-filling device for nucleic acid sequence assembly, the device comprising: an obtaining module, configured to obtain, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig; a calculating module, configured to calculate the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence, that is, the part of an overlapping read that falls into the hole; a first determining module, configured to determine whether there is a tandem repeat in the regions where the read from which the actual hole length was calculated overlaps the first contig and the second contig, respectively, and whether there is a tandem repeat in the part of that read that falls into the hole; and a first hole-filling module, configured to obtain the intra-hole sequence and complete the hole filling when all determination results of the first determining module are no.
  • The device includes: a canceling module, configured to perform no hole filling when any one or more of the determination results of the first determining module is yes.
  • The device includes: a continuation determining module, configured to determine, after the calculating module has calculated the actual hole length, whether there is a tandem repeat in the overlapping region of the first contig and the second contig; and a second hole-filling module, configured to cut off the overlapping region of the first contig or the second contig and complete the hole filling when the continuation determining module determines no; wherein the canceling module is configured to perform no hole filling when the continuation determining module determines yes.
  • The calculating module includes: a calculating unit, configured to calculate a hole length from each overlapping read to form a hole length frequency table; and a selecting unit, configured to select the hole length with the highest frequency in the hole length frequency table as the actual hole length.
  • The device includes: a second determining module, configured to determine whether the actual hole length calculated by the calculating module falls within a predetermined hole length range, and whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency; and a third determining module, configured to cause the first determining module and the continuation determining module to operate when both determination results of the second determining module are yes; wherein the canceling module is configured to perform no hole filling when any one or more of the determination results of the second determining module is no.
  • The device includes: a connection module, configured to connect the first contig and the second contig to complete the hole filling when the calculated actual hole length is 0.
  • The first hole-filling module comprises: a comparing unit, configured to compare the obtained intra-hole sequence with all reads used for filling the hole to determine the accuracy of the bases of the intra-hole sequence; and a hole-filling unit, configured to obtain the intra-hole sequence and complete the hole filling if the intra-hole sequence is accurate.
  • The present invention obtains, from the reads used for hole filling, the reads that overlap both the first contig and the second contig, calculates hole lengths from these reads to obtain the actual hole length, and then checks the actual hole length. If there is no tandem repeat in the regions where the read from which the actual hole length was calculated overlaps the first contig and the second contig, and none in the part of that read that falls into the hole, the intra-hole sequence characterizing the hole length is obtained and the hole filling is completed.
  • FIG. 1 is a schematic flow chart of an embodiment of the method for filling a hole in nucleic acid sequence assembly according to the present invention.
  • FIG. 2 is a schematic view showing the connection of a hole during nucleic acid sequence assembly according to the present invention.
  • FIG. 3 is a schematic view showing the structure of an embodiment of the hole-filling device for nucleic acid sequence assembly according to the present invention.
  • Kmer (fixed-length string): a DNA sequence of length K; K is usually taken to be 17.
  • Single read (single-end read): sequence information obtained mainly by the Sanger sequencing method, which yields the sequence of one end of a longer DNA fragment or the sequence of a shorter fragment.
  • Scaffold (connecting bracket): the result of linking contigs using linkage information from plasmids, BACs, mRNA, or other sources of paired-end reads, in which the contigs are ordered and oriented.
  • Gap (hole): genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships between the non-repetitive regions; the unassembled region between non-repetitive regions forms a gap, called the hole region.
  • PE read: paired-end (double-end) read.
  • FIG. 1 is a schematic flow chart of an embodiment of the method for filling a hole in nucleic acid sequence assembly according to the present invention.
  • One end of the hole has a first contig and the other end has a second contig.
  • The hole-filling method includes the following steps:
  • Step 101: Obtain, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig.
  • A hole with a length of less than 100 bp is defined as a small hole, and such holes are filled by the method of the present invention. First, all reads located in the small hole are found and analyzed, and the reads that fall into the hole and overlap the contigs on both sides of the hole are selected.
  • Step 102: Calculate the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence, that is, the part of an overlapping read that falls into the hole.
  • The actual hole length of the hole is calculated from the remaining part of the read, which falls completely within the hole.
  • A frequency table is formed to characterize the range of hole lengths.
  • The frequency table is needed because possible errors in the connection cause different reads to indicate different hole lengths when they are joined to the contigs. The hole length with the highest frequency in the frequency table is selected as the actual hole length.
  • The first condition to be checked is whether the calculated actual hole length falls within the predetermined hole length range.
  • Because the present invention focuses on assembling holes with a hole length of less than 100 bp, it is first necessary to determine whether the calculated actual hole length falls within the predetermined range of less than 100 bp.
  • The second condition is whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency. The first frequency here is the frequency second only to the maximum frequency in the hole length frequency table, that is, the second-highest frequency.
  • The hole length corresponding to the first frequency is the first hole length. In other words, for the actual hole length to be credible, the number of reads supporting the actual hole length must be more than double the number of reads supporting the first hole length. If the calculated actual hole length is less than 100 bp and the number of reads supporting it is more than double the number of reads supporting the first hole length, the calculated actual hole length is considered close to the true hole length.
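The selection of the actual hole length described in step 102 can be illustrated with a minimal Python sketch. The function name `pick_actual_hole_length` and the input format (one integer hole length per spanning read) are assumptions made for this illustration and are not taken from the patent; how ties at the maximum frequency are handled is also an assumption (a tie is rejected here).

```python
from collections import Counter

MAX_SMALL_HOLE = 100  # bp; the small-hole limit stated in the text


def pick_actual_hole_length(hole_lengths, max_len=MAX_SMALL_HOLE):
    """Return the accepted actual hole length, or None if it is not credible.

    hole_lengths: one hole length per spanning read (hypothetical input format).
    """
    if not hole_lengths:
        return None
    table = Counter(hole_lengths)        # hole length frequency table
    ranked = table.most_common()         # sorted by descending frequency
    best_len, best_freq = ranked[0]
    # "First frequency": the runner-up frequency (0 if only one length occurs).
    first_freq = ranked[1][1] if len(ranked) > 1 else 0
    if best_len >= max_len:              # condition 1: less than 100 bp
        return None
    if best_freq <= 2 * first_freq:      # condition 2: more than double
        return None
    return best_len
```

For example, five reads supporting a length of 12 against two reads supporting 13 would pass the second condition (5 > 4), whereas three against two would not.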
  • Step 103: Determine whether there is a tandem repeat in the regions where the read from which the actual hole length was calculated overlaps the first contig and the second contig, respectively, and whether there is a tandem repeat in the part of that read that falls into the hole.
  • The check of the actual hole length includes: a repeat check of the regions where the read overlaps the contigs, a repeat check of the region falling within the hole, and a repeat check of the overlapping region of the first contig and the second contig.
  • The repeat check of the overlap regions and of the region inside the hole checks for tandem repeats, that is, it performs repeat feature recognition.
  • The tandem repeat check is generally implemented by a block check method, which determines whether the tandem repeat in the hole is multi-mode or single-mode, and determines the distance between tandem repeat blocks.
  • The check of the actual hole length provides support for the subsequent hole assembly.
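The text does not spell out the block check method, so the following Python sketch only illustrates what a tandem repeat check on a short region could look like. It is a simple period scan, not the patent's block check; the function name `has_tandem_repeat` and the parameters `max_unit` and `min_copies` are hypothetical.

```python
def has_tandem_repeat(seq, max_unit=6, min_copies=3):
    """Return True if some unit of 1..max_unit bases occurs at least
    min_copies times back to back anywhere in seq.

    Illustrative period scan only; thresholds are assumptions.
    """
    n = len(seq)
    for unit in range(1, max_unit + 1):
        run = 0  # consecutive positions i with seq[i] == seq[i - unit]
        for i in range(unit, n):
            if seq[i] == seq[i - unit]:
                run += 1
                # unit * (min_copies - 1) matches means min_copies full copies
                if run >= unit * (min_copies - 1):
                    return True
            else:
                run = 0
    return False
```

Under the method described above, such a check would be applied to the two read/contig overlap regions and to the part of the read inside the hole, and the hole would be left unfilled if any of them returns True.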
  • Step 104: If the results of all of the above determinations are no, obtain the intra-hole sequence and complete the hole filling.
  • Because the method of the present invention does not assemble holes containing repeats, the intra-hole sequence is obtained and the hole filling is completed only if all determination results of step 103 are no; if any one or more of the determination results of step 103 is yes, the hole is not filled by the method of the present invention.
  • The obtained intra-hole sequence is compared with all the reads used for filling the hole to determine the accuracy of the bases of the intra-hole sequence, and if the intra-hole sequence is accurate, it is used to complete the hole filling.
  • The bases of the intra-hole sequence characterizing the hole length may be the true bases of the hole. All reads that characterize the actual hole length can be analyzed base by base to determine the base at each position.
  • The present embodiment searches for other reads that fall within the small hole but do not span it, and aligns them with the read used to determine the length of the small hole. If the alignment error rate is below the threshold (usually 3%), each base of the part of that read that falls into the hole is considered trustworthy and can be used to fill the hole. If the alignment error rate is above the threshold (usually 3%), the bases of the part of that read that falls into the hole are considered untrustworthy, and the untrustworthy bases are trimmed off. This ensures the accuracy of the reads filled into the small holes.
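The accuracy check in step 104 can be sketched as a mismatch-rate test. The sketch below assumes that each supporting read has already been placed on the candidate intra-hole sequence without gaps, at a known offset; this placement format, the function names, and the decision to reject when nothing can be compared are assumptions for illustration. Only the 3% threshold comes from the text.

```python
def mismatch_rate(candidate, placed_reads):
    """Fraction of compared bases that disagree with the candidate sequence.

    placed_reads: list of (offset, fragment) pairs, ungapped placements of
    supporting reads on the candidate (hypothetical input format).
    Returns None if no base could be compared.
    """
    compared = mismatches = 0
    for offset, fragment in placed_reads:
        for i, base in enumerate(fragment):
            pos = offset + i
            if 0 <= pos < len(candidate):
                compared += 1
                if candidate[pos] != base:
                    mismatches += 1
    return mismatches / compared if compared else None


def is_trustworthy(candidate, placed_reads, max_rate=0.03):
    """Accept the candidate only if its mismatch rate is below max_rate."""
    rate = mismatch_rate(candidate, placed_reads)
    return rate is not None and rate < max_rate
```

A real implementation would need an aligner that tolerates insertions and deletions; the ungapped comparison here only shows where the threshold enters.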
  • FIG. 2 shows a schematic diagram of the connection of a hole during nucleic acid sequence assembly according to the present invention.
  • The two ends of the hole have a first contig x and a second contig y, respectively.
  • A is a read that overlaps both the first contig and the second contig during hole filling.
  • a and b are the lengths of the overlaps between read A and the contigs x and y, respectively.
  • c is the length of the part of read A that falls into the hole. The sizes of a and b are not limited, and the lengths of x and y are greater than the length of A.
  • First, a read A that overlaps both the first contig and the second contig is selected from the reads used for filling the hole; its overlap with the first contig x has length a, and its overlap with the second contig y has length b.
  • Then the length of the hole between the first contig x and the second contig y is calculated. As shown in the figure, the length c of the part of read A that falls into the hole is the hole length. In practice, hole lengths must be calculated from multiple reads that overlap both the first contig and the second contig, and one hole length can be calculated for each read that spans the hole.
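With the quantities of FIG. 2, the hole length indicated by one read is c = |A| - a - b. The following Python sketch derives a, b, and c by exact suffix/prefix matching. Exact matching, the function names, and the `min_overlap` parameter are simplifying assumptions for illustration; the patent does not specify how the overlaps are found.

```python
def longest_suffix_prefix(left, right, min_overlap=1):
    """Length of the longest suffix of `left` equal to a prefix of `right`
    (0 if none reaches min_overlap)."""
    lowest = max(min_overlap, 1)
    for k in range(min(len(left), len(right)), lowest - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def hole_length_from_read(contig_x, contig_y, read, min_overlap=3):
    """Return c = len(read) - a - b for a read spanning the hole, else None."""
    a = longest_suffix_prefix(contig_x, read, min_overlap)  # overlap with x
    b = longest_suffix_prefix(read, contig_y, min_overlap)  # overlap with y
    if a == 0 or b == 0:
        return None  # the read does not overlap both contigs
    return len(read) - a - b
```

For instance, with x ending in `CGTA`, y starting with `GGCAT`, and the read `CGTATCGGCAT`, the sketch gives a = 4, b = 5, and c = 2. A negative result would correspond to contigs whose ends overlap each other.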
  • A frequency table characterizing the range of hole lengths is formed, and the hole length with the highest frequency in the frequency table is selected as the actual hole length.
  • The calculated actual hole length must also satisfy the following two conditions.
  • The calculated actual hole length is less than 100 bp, and the number of reads supporting the actual hole length is more than double the number of reads supporting the first hole length.
  • The first frequency is the frequency second only to the maximum frequency in the hole length frequency table, and the hole length corresponding to the first frequency is the first hole length.
  • Because the hole-filling method of the present invention does not assemble holes containing repeats, it is necessary to determine whether there is a tandem repeat in the regions a and b where read A, from which the actual hole length was calculated, overlaps the first contig and the second contig, respectively, and whether there is a tandem repeat in the sequence c that falls into the hole. If there is no tandem repeat, the intra-hole sequence c is obtained and the hole filling is completed.
  • The number of reads that overlap both the first contig and the second contig and are used to calculate the hole length is not limited.
  • The present invention obtains, from the reads used for hole filling, the reads that overlap both the first contig and the second contig, calculates hole lengths from these reads to obtain the actual hole length, and then checks the actual hole length. If there is no tandem repeat in the regions where the read from which the actual hole length was calculated overlaps the first contig and the second contig, and none in the part of that read that falls into the hole, the intra-hole sequence characterizing the hole length is obtained and the hole filling is completed.
  • The hole-filling method in nucleic acid sequence assembly comprises: obtaining, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig, and calculating the hole length from those reads. If the determined actual hole length is less than a first threshold preset by the system, for example 0, it is determined that the ends of the two contigs overlap. It is then determined whether the overlap is a repeat and, if so, the repeat mode is determined; otherwise, the overlap length is trimmed from the end of a contig.
  • The details are as follows:
  • Because of the overlap, the assembly software cannot join the two contigs into one contig, so on output the first contig and the second contig are separated and the length of the hole between them becomes a negative number, forming a negative hole.
  • If the repeat check of the overlapping region of the first contig and the second contig finds no tandem repeat, the overlapping region of the first contig or the second contig is cut off and the hole filling is complete. If there is a tandem repeat, the negative value of the negative hole length must be determined, and the hole is then filled according to the user's needs.
  • The method of filling a hole in nucleic acid sequence assembly comprises: obtaining, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig, and calculating the hole length from those reads. If the determined actual hole length is equal to the first threshold preset by the system, for example 0, the first contig and the second contig are connected to complete the hole filling.
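The handling of zero-length and negative holes described above can be sketched in Python as follows. The function name `join_contigs`, the choice to trim the overlap from the second contig rather than the first, and the use of None for an unfilled hole are assumptions for illustration; the text only says that the overlapping region of the first or the second contig is cut off.

```python
def join_contigs(contig_x, contig_y, hole_length, overlap_is_tandem_repeat=False):
    """Join two contigs across a zero-length or negative hole.

    Returns the joined sequence, or None when the hole is left unfilled.
    """
    if hole_length == 0:
        # The contigs are directly adjacent: concatenate them.
        return contig_x + contig_y
    if hole_length < 0:
        if overlap_is_tandem_repeat:
            return None  # repeated overlap: not filled automatically
        # Cut the overlapping bases off the start of the second contig.
        return contig_x + contig_y[-hole_length:]
    raise ValueError("positive hole lengths need an intra-hole sequence")
```

For a negative hole of length -3 between `ACGTAC` and `TACGGA`, the three overlapping bases `TAC` are kept once and the result is `ACGTACGGA`.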
  • FIG. 3 is a schematic structural view of an embodiment of the hole-filling device for nucleic acid sequence assembly according to the present invention.
  • The device includes: an obtaining module 31, a calculating module 32, a first determining module 33, a first hole-filling module 34, a canceling module 35, a second hole-filling module 36, a second determining module 37, a third determining module 38, a connection module 39, and a continuation determining module 40.
  • The calculating module 32 includes a calculating unit 321 and a selecting unit 322.
  • The first hole-filling module 34 includes a comparing unit 341 and a hole-filling unit 342.
  • The obtaining module 31 is configured to obtain, from the reads used for hole filling, the reads that overlap both the first contig and the second contig.
  • The calculating module 32 is configured to calculate the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence, that is, the part of an overlapping read that falls into the hole. The calculating unit 321 is configured to calculate a hole length from each overlapping read to form a hole length frequency table, and the selecting unit 322 is configured to select the hole length with the highest frequency in the hole length frequency table as the actual hole length.
  • The first determining module 33 is configured to determine whether there is a tandem repeat in the regions where the read from which the actual hole length was calculated overlaps the first contig and the second contig, respectively, and whether there is a tandem repeat in the part of that read that falls into the hole.
  • The first hole-filling module 34 is configured to obtain the intra-hole sequence and complete the hole filling when all determination results of the first determining module 33 are no. The comparing unit 341 is configured to compare the obtained intra-hole sequence with all reads used for hole filling to determine the accuracy of the bases of the intra-hole sequence, and the hole-filling unit 342 is configured to obtain the intra-hole sequence and complete the hole filling if the bases of the intra-hole sequence are accurate.
  • The canceling module 35 is configured to perform no hole filling when a determination result in the hole-filling process does not satisfy the required condition.
  • The continuation determining module 40 is configured to determine, after the calculating module 32 has calculated the actual hole length, whether there is a tandem repeat in the overlapping region of the first contig and the second contig. The second hole-filling module 36 is configured to cut off the overlapping region of the first contig or the second contig and complete the hole filling when the continuation determining module 40 determines no.
  • The second determining module 37 is configured to determine whether the actual hole length calculated by the calculating module 32 falls within a predetermined hole length range, and whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency. The third determining module 38 is configured to cause the first determining module 33 and the continuation determining module 40 to operate when both determination results of the second determining module 37 are yes.
  • The connection module 39 is configured to connect the first contig and the second contig to complete the hole filling when the calculated actual hole length is 0.
  • The modules operate as follows. First, the obtaining module 31 obtains, from the reads used for hole filling, the reads that overlap both the first contig and the second contig. The calculating module 32 then calculates the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence, that is, the part of an overlapping read that falls into the hole. The hole length is calculated as follows: the calculating unit 321 calculates a hole length from each overlapping read to form a hole length frequency table, and the selecting unit 322 then selects the hole length with the highest frequency in the table as the actual hole length.
  • The second determining module 37 determines whether the calculated actual hole length falls within the predetermined hole length range, and whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency. The two conditions are explained as follows. Because the present invention assembles holes with a hole length of less than 100 bp, it is first necessary to determine whether the calculated actual hole length falls within the predetermined range of less than 100 bp.
  • The first frequency is the frequency second only to the maximum frequency in the hole length frequency table, and the hole length corresponding to the first frequency is the first hole length.
  • If both determination results of the second determining module 37 are yes, the third determining module 38 causes the first determining module 33 and the continuation determining module 40 to operate.
  • If any one or more of the determination results of the second determining module 37 is no, the canceling module 35 performs no hole filling.
  • The first determining module 33 determines whether there is a tandem repeat in the regions where the read from which the actual hole length was calculated overlaps the first contig and the second contig, respectively, and whether there is a tandem repeat in the part of that read that falls into the hole. When all determination results of the first determining module 33 are no, the first hole-filling module 34 obtains the intra-hole sequence and completes the hole filling. To improve the accuracy of hole filling, while the first hole-filling module 34 obtains the intra-hole sequence and completes the hole filling, the comparing unit 341 compares the obtained intra-hole sequence with all the reads used for filling the hole to determine the accuracy of the bases of the intra-hole sequence.
  • If the intra-hole sequence is accurate, the hole-filling unit 342 obtains the intra-hole sequence and completes the hole filling.
  • If any one or more of the determination results of the first determining module 33 is yes, the canceling module 35 performs no hole filling.
  • The continuation determining module 40 determines whether there is a tandem repeat in the overlapping region of the first contig and the second contig. If the determination is no, the second hole-filling module 36 cuts off the overlapping region of the first contig or the second contig to complete the hole filling; if the determination is yes, the canceling module 35 performs no hole filling.
  • If the actual hole length is 0, the connection module 39 connects the first contig and the second contig to complete the hole filling.
  • The present invention obtains, from the reads used for hole filling, the reads that overlap both the first contig and the second contig, calculates hole lengths from these reads to obtain the actual hole length, and then checks the actual hole length. If there is no tandem repeat in the regions where the read from which the actual hole length was calculated overlaps the first contig and the second contig, and none in the part of that read that falls into the hole, the intra-hole sequence characterizing the hole length is obtained and the hole filling is completed.
  • The level of each hole is obtained according to the size of the hole and the criteria set by the system, wherein gene sequence holes are divided into small holes, middle holes, and large holes, and each hole is filled according to the level of the nucleic acid sequence hole and the corresponding base sequence segments.
  • The holes are classified as follows: a hole with a length of less than 100 bp is defined as a small hole, a hole with a length between 100 bp and 1.5 kb is defined as a middle hole, and a hole with a length greater than 1.5 kb is defined as a large hole.
  • the above is only one possible definition of the gap classes, and the size of each gap is
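The small/medium/large classification above can be sketched as a small function. This is a minimal illustration only: the name `classify_gap` is hypothetical, and placing the exact boundary values (100 bp and 1.5 kb) in the medium class is an assumption, since the text does not say which class the boundaries belong to.

```python
def classify_gap(length_bp: int) -> str:
    """Return 'small', 'medium' or 'large' for a gap length given in bp.

    Thresholds follow the text: < 100 bp small, 100 bp to 1.5 kb medium,
    > 1.5 kb large. Boundary handling is an assumption of this sketch.
    """
    if length_bp < 0:
        raise ValueError("gap length must be non-negative")
    if length_bp < 100:
        return "small"
    if length_bp <= 1500:
        return "medium"
    return "large"
```

Because the text notes that this is only one possible definition, the two thresholds would naturally become parameters in a real tool.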
  • a scaffold containing the nucleic acid sequence gap is acquired and analyzed.
  • the original scaffold is broken into contigs, and the space between two adjacent contigs is a gap.
  • the size of the gap and the contigs before and after it can be accurately obtained.
  • this embodiment further divides all acquired nucleic acid sequence gaps and contigs according to the user's setting, and stores the associated contigs and reads in corresponding folders. For example, if the user sets 4 folders, the acquired nucleic acid sequence gaps and contigs are divided into 4 parts, 4 folders are generated, and the associated contigs and reads are stored one by one in the prepared folders. After this division, each folder contains contigs and the reads used to fill their gaps, so in the subsequent gap-filling process the contigs and reads can be obtained directly from the corresponding folder. Dividing the data in this way allows the originally required memory to be reduced by a quarter, saving space, and shortens the search time during gap filling, thereby reducing the time spent on gap filling.
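The division into a user-chosen number of parts can be sketched as follows. The text does not say how gaps are assigned to folders, so the round-robin assignment and the name `partition_round_robin` are assumptions; writing each part to its own folder is left out.

```python
def partition_round_robin(items: list[str], n_parts: int) -> list[list[str]]:
    """Split gap records into n_parts groups of near-equal size.

    Each group stands for the contents of one folder; round-robin
    assignment is an assumption of this sketch.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    parts: list[list[str]] = [[] for _ in range(n_parts)]
    for i, item in enumerate(items):
        parts[i % n_parts].append(item)
    return parts
```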
  • the reads used for gap filling are read for the nucleic acid sequence gap.
  • most of the reads used for gap filling are PE reads from Solexa sequencing results; the remainder are long single-end reads from Sanger sequencing results.
  • PE reads support each other: the two reads of a pair come from the two ends of one insert, and the inserts used for gap filling are generally 180 bp, 500 bp and 800 bp long.
  • with the high sequencing depth of high-throughput sequencing, an insert can be reconstructed through the overlap relationships among PE reads.
  • a read whose PE mate falls within the nucleic acid sequence gap, or on a contig flanking the gap, can be used to fill the gap.
  • because a long read is itself long, it can span a nucleic acid sequence gap of small length. If every base of the long read is reliable, the base at each position of the long read can be used to fill such a gap exactly.
  • the positional relationship between each read and the nucleic acid sequence gap, the contig and scaffold to which the read belongs, and the sequence of the read itself are acquired.
  • the gap-filling process specifically includes: A, gap filling for small gaps; B, gap filling for medium gaps; and C, gap filling for large gaps.
  • the block is set to 6 bp or 12 bp.
  • a block is a window that contains a fixed number of bases and slides along the read one base at a time. Specifically, suppose the window contains X bases: initially the window covers the 1st to the Xth base; after the first slide it covers the 2nd to the (X+1)th base; and so on, moving forward by one base per slide. After the nth slide, the window covers the (n+1)th to the (X+n)th base.
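The sliding window just described can be sketched in a few lines. The function name `sliding_blocks` is hypothetical; in 0-based indexing, the window after the nth slide is `read[n:n + X]`, which matches the (n+1)th to (X+n)th bases in the 1-based wording of the text.

```python
def sliding_blocks(read: str, block_size: int) -> list[str]:
    """Return every block of block_size bases, sliding one base at a time."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # A read shorter than the block yields no blocks (empty range).
    return [read[i:i + block_size] for i in range(len(read) - block_size + 1)]
```

A read of length L therefore yields L - X + 1 blocks, for example 3 blocks of 4 bases from a 6-base read.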
  • the present embodiment records the block frequency (block_freq) and the distance of the same block (b_dis) for analysis. If the frequency block_freq has a maximum value at a certain distance b_dis value, and the distance b_dis is equal to or equal to the number of bases in the block, it is determined that there is a tandem repetition in the sequence.
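One way to collect the two quantities named above, the block frequency and the distance between identical blocks, is sketched below. The function names are hypothetical, and measuring the distance between consecutive occurrences of the same block is an assumption. The sketch only reports the distance at which the frequency peaks; the comparison of that distance with the block size, and any minimum frequency, are left to the caller because the text does not state them unambiguously.

```python
from collections import Counter, defaultdict


def block_distance_counts(seq: str, block_size: int) -> Counter:
    """Count distances (b_dis) between consecutive occurrences of identical blocks."""
    positions = defaultdict(list)  # block -> start positions, in increasing order
    for i in range(len(seq) - block_size + 1):
        positions[seq[i:i + block_size]].append(i)
    distances = Counter()
    for starts in positions.values():
        for a, b in zip(starts, starts[1:]):
            distances[b - a] += 1
    return distances


def peak_distance(seq: str, block_size: int):
    """Return (b_dis, frequency) at the frequency maximum, or None if no block recurs."""
    counts = block_distance_counts(seq, block_size)
    if not counts:
        return None
    # Ties on frequency are broken in favour of the shorter distance.
    return max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
```

For a perfect repeat of the unit `ACG`, every 6 bp block recurs 3 bases later, so the histogram has a single peak at distance 3.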
  • this embodiment further infers the mode of the tandem repeat from the information obtained during the tandem-repeat judgment above: if there is only one tandem repeat in the sequence, it is judged to be a single tandem-repeat mode; if there are several tandem repeats, whether crossing or not, it is judged to be a multiple tandem-repeat mode.
  • this embodiment records the block frequency and determines the repeat situation within the gap by calculating the expected depth of blocks in the gap and analyzing the block depth distribution in the gap: if the block frequency in the gap is greater than a multiple of the expected block depth in the gap, a repeat exists.
  • the overlap calculation first uses a hash method to determine quickly whether reads share a common fixed-length short string (kmer); reads that share a kmer may overlap.
  • A kmer is defined as a contiguous base sequence of length k.
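The hash-based pre-screen for possibly overlapping reads can be sketched with a dictionary that maps each kmer to the reads containing it. This is an illustration under assumptions: the function name is hypothetical, reads are taken on one strand only (reverse complements are ignored), and sharing a single kmer is treated as sufficient to make a pair a candidate.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlap_pairs(reads: list[str], k: int) -> set[tuple[int, int]]:
    """Return index pairs (i, j) with i < j of reads that share at least one kmer."""
    index = defaultdict(set)  # kmer -> indices of the reads that contain it
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(idx)
    pairs = set()
    for owners in index.values():
        pairs.update(combinations(sorted(owners), 2))
    return pairs
```

Only the pairs returned here would need the more expensive overlap check, which is the point of hashing first.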
  • the distribution of kmers is closely related to genome size, error rate, and heterozygosity. Pattern recognition is then performed on each pair of reads that may overlap.
  • the number of blocks can be appropriately raised.
  • for the front-end extension, starting from the node read, find all reads that overlap the node read and select the read with the smallest overlap with the node read as the seed read. All the other reads should then overlap the seed read, and those overlaps are necessarily longer than the overlap between the seed read and the node read. If a read does not overlap the seed read, a conflict is deemed to have occurred, and it is resolved by replacing the seed read, i.e. by finding a new seed read. This ensures that the seed read found is correct.
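The seed selection rule can be sketched with exact suffix-prefix overlaps. This is a simplified illustration, not the procedure of the embodiment: both function names and the `min_overlap` parameter are hypothetical, overlaps are exact (no mismatches), and "replacing the seed read" is interpreted as trying the candidate with the next-smallest overlap.

```python
def suffix_prefix_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of left that equals a prefix of right."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def choose_seed(node: str, reads: list[str], min_overlap: int = 3):
    """Pick a seed read for forward extension of node, or None if none fits."""
    # Reads overlapping the end of the node, smallest overlap first.
    scored = sorted(((suffix_prefix_overlap(node, r), r) for r in reads),
                    key=lambda pair: pair[0])
    candidates = [(ov, r) for ov, r in scored if ov >= min_overlap]
    for seed_ov, seed in candidates:
        others = [r for ov, r in candidates if ov > seed_ov]
        # Every read lying between the node and the seed must overlap the seed
        # by more than the seed overlaps the node; otherwise this seed
        # conflicts and the next candidate is tried.
        if all(suffix_prefix_overlap(r, seed) > seed_ov for r in others):
            return seed
    return None
```

With a node `ACGTTGCAAG` and reads `TGCAAGGCTT` (overlap 6) and `AAGGCTTACC` (overlap 3), the second read becomes the seed, and the first read overlaps it by 7 bases, which is more than 3 as the text requires.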
  • this embodiment then extends the sequence: the seed read is treated as part of the contig, and the search for a new seed read continues as described above. If one is found, the sequence continues to extend; otherwise the extension from this end is over, and it is necessary to wait for the extension from the other end to determine whether the extensions from the two ends overlap and hence whether the gap can be closed.
  • a forward-extending read may find a read that was previously a seed read as its extending read, which would cause an infinite loop of extension within that range; this is treated as a conflict.
  • the extension is terminated.
  • the back-end extension in this embodiment is similar to the front-end extension and is not described in detail here.
  • conflict identification should be as sensitive as possible, and the reads used for the gap should also have a low error rate.
  • the reads are pre-corrected to improve read quality and to ensure the accuracy of both ends of each read.
  • alignment-rate filtering: a read must have a 100% alignment rate to be extended as a seed read.
  • alignment-rate filtering uses the following strategies:
  • this embodiment avoids the problem that extension fails because of base errors in the seed read.
  • the overlap between reads is determined by stepwise extension of a block: a block is selected from the seed read, a target read is set, and the bases in the block are searched for in the target read. If they are found, the block in the seed read is moved forward by one unit and compared with the target read again, and this is repeated until no match is found. The overlap length between the seed read and the target read is thereby obtained. The overlap length must reach a third threshold, which is 1 kmer, so that the overlap between the two reads is non-accidental and truly credible. If the previous seed read itself contains a sequencing error, a large number of reads may be filtered out; in that case a loop is provided to replace the previous seed read.
  • position filtering: reads are positioned according to their paired-end relationship, the position of each read within the gap is calculated, and reads are filtered by position, which reduces conflicts caused by long repeated segments within the gap. To ensure that positions within the gap are calculated accurately, this embodiment can set strict filtering conditions.
  • read-length filtering: among the reads, PE reads are short, whereas single-end reads (single reads) are usually longer, and the longer single-end reads overlap one end of the gap. In this embodiment, short paired-end reads are extended preferentially in the inner region of the gap, and long single-end reads are extended preferentially at the two ends of the gap.
  • end filtering: according to the expected gap length, if the extending read overlaps the other end too early, a non-overlapping read is selected instead, i.e. a read that lies just behind the extending read, has no overlap with it, and does not conflict with the expected gap length. This ensures that the repeat region is crossed. In this embodiment, end filtering can occur only once.
  • handling and recognition of short similar repeats: short similar repeats are usually shorter than 50 bp and lie relatively close together, which eventually causes base deletions in the assembled nucleic acid sequence.
  • this embodiment preferentially selects a read with a longer overlap as the seed read for extension, which effectively avoids the short similar repeat problem.
  • all three of the above credibility levels may be present; the first credibility level has the highest quality, but that does not mean it is correct.
  • the second credibility level has the second-highest quality and, likewise, is not necessarily correct. The connections within the gap are therefore classified and handled in this embodiment according to the actual situation.
  • the connections within the gap fall into three categories: the contigs at the two ends are connected directly; one end extends to the contig at the other end; or the extensions from the two ends meet.
  • if there is a first credibility level, it is selected and the sequences are connected; if there is no first credibility level but there is a second, the second is selected and the sequences are connected; if there is neither a first nor a second credibility level but there is a third, the third is selected and the sequences are connected.
  • the large gap is divided into several gaps, each of which is processed by the medium-gap procedure.
  • the longest insert supporting PE reads is 800 bp.
  • when the gap length exceeds 1.5 kb, after the overlaps between the inserts and the contigs at the two ends are subtracted, two 800 bp inserts cannot overlap each other, i.e. no complete path can be found that fills a large gap entirely.
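The arithmetic behind the 1.5 kb limit can be written out as follows. The symbols are introduced here for illustration and do not appear in the text: $G$ is the gap length, and $o_1$, $o_2$ are the lengths by which the two 800 bp inserts overlap the contigs at the two ends. The two inserts can meet inside the gap only if the parts that reach into the gap together exceed $G$.

```latex
\[
(800 - o_1) + (800 - o_2) > G
\;\Longleftrightarrow\;
o_1 + o_2 < 1600 - G ,
\qquad\text{and}\qquad
G > 1500 \;\Rightarrow\; 1600 - G < 100 .
\]
```

So for a gap longer than 1.5 kb the two inserts could overlap only if their combined overlap with the contigs were under 100 bp; with larger anchoring overlaps they cannot meet, which is why a large gap is split into medium gaps.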
  • the large gap is divided into several medium gaps, the medium gaps are assembled separately, and finally the assembly results are connected; the details are as follows:
  • each block is assembled in the same way as a medium gap.


Abstract

Disclosed is a gap closure method in nucleotide sequence assembly. One end of a gap has a first contig, and the other end has a second contig. The method comprises: acquiring, from reads for gap closure, a read overlapping both the first contig and the second contig (101); calculating an actual gap length of the gap according to the read overlapping the first contig and the second contig, a sequence characterizing the actual gap length being an in-gap sequence, which is the part of that read that falls within the gap (102); judging whether a tandem repeat exists in the overlap region between the read from which the actual gap length was calculated and the first contig, in the overlap region between that read and the second contig, and in the part of that read that falls within the gap (103); and, if all of the foregoing judgment results are negative, acquiring the in-gap sequence so as to complete the gap closure (104). Also disclosed is a gap closure device in nucleotide sequence assembly.

Description

Gap closure method and device in nucleic acid sequence assembly
[Technical Field]
The present invention relates to the field of genetic engineering, and in particular to a gap closure method and device in nucleic acid sequence assembly.
[Background Art]
In the field of gene sequencing, the spread of second-generation sequencing technology has steadily lowered sequencing costs and has driven whole-genome sequencing of more species. The principle of second-generation sequencing means that the sequenced fragments are short: in practice they are only a few dozen to about one hundred bases long, which makes the analysis of the sequencing data more difficult.
Sequencing data are generally analyzed by genome assembly. Genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships among the non-repetitive regions. The unassembled regions between the non-repetitive regions, however, tend to form gaps.
In the prior art, both genome assembly based on Sanger sequencing and genome assembly based on second-generation sequencers such as Solexa leave a large number of unassembled regions in the initial assembly, and these regions are often closely related to sequence repeats. Gap-related sequence repeats can be divided into tandem repeats and transposon repeats. Existing gap-filling programs can handle simple transposon repeats fairly accurately, but have difficulty with long tandem repeats.
In terms of assembly methods, the prior art has two main approaches to the long tandem repeat problem: local assembly based on overlap, and local assembly based on the de Bruijn graph.
Overlap-based local assembly has difficulty identifying the exact site at which a repeat causes a conflict, so it is prone to introducing insertions/deletions (indels).
Local assembly based on the de Bruijn graph can identify the conflict sites caused by repeats, but it has difficulty resolving the conflicts and must break the sequence there, which reduces the number of gaps that are filled.
Clearly, neither approach handles long tandem repeat sequences effectively.
In terms of assembly tools, the prior art has two main gap-filling programs: Gapcloser, which corresponds to overlap-based local assembly, and SOAPdenovo, which corresponds to local assembly based on the de Bruijn graph.
Both programs, however, have drawbacks:
First, the gap-filling software Gapcloser performs local assembly of base sequence segments by the overlap method. Because it does not take the complexity of the situation inside the gap into account, it tends to make errors on complex gaps, which lowers overall accuracy. Moreover, Gapcloser consumes a large amount of memory and time, which makes it unsuitable for primary gap filling of large genomes.
Second, the gap-filling stage of the SOAPdenovo assembly software performs a secondary assembly of the in-gap region based on the de Bruijn graph. Although it handles short gaps effectively, the number of gaps it fills is limited.
[Summary of the Invention]
The main technical problem solved by the present invention is to provide a gap closure method and device in nucleic acid sequence assembly that can accurately obtain the in-gap sequence characterizing the gap length and improve the accuracy of gap closure.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a gap closure method in nucleic acid sequence assembly, in which one end of the gap has a first contig and the other end has a second contig. The method comprises the following steps: acquiring, from the reads used for gap filling, the reads that overlap both the first contig and the second contig; calculating the actual gap length of the gap from the reads that overlap both contigs, wherein the sequence characterizing the actual gap length is the in-gap sequence, i.e. the part of such a read that falls within the gap; judging whether a tandem repeat exists in the overlap regions between the read from which the actual gap length was calculated and the first and second contigs, respectively, and whether a tandem repeat exists in the part of that read that falls within the gap; and, if all of the above judgments are negative, acquiring the in-gap sequence and completing the gap closure.
After the judging step, the method includes: if any one or more of the above judgments is affirmative, not performing gap filling.
After the step of calculating the actual gap length from the reads that overlap both contigs, the method includes: judging whether a tandem repeat exists in the overlap region of the first contig and the second contig; if not, cutting off the overlap region of the first contig or the second contig to complete the gap closure; if so, not performing gap filling.
The step of calculating the actual gap length from the reads that overlap both contigs includes: calculating a gap length from each read that overlaps both contigs to form a gap-length frequency table; and selecting the gap length with the highest frequency in the table as the actual gap length.
After the step of calculating the actual gap length from the reads that overlap both contigs, the method includes: judging whether the calculated actual gap length falls within a predetermined gap-length range, and whether the frequency of the actual gap length in the gap-length frequency table is at least double the first frequency; if both judgments are affirmative, performing the step of judging whether a tandem repeat exists in the overlap regions between the read from which the actual gap length was calculated and the first and second contigs and in the part of that read that falls within the gap, and the step of judging whether a tandem repeat exists in the overlap region of the first contig and the second contig; if any one or more of the judgments is negative, stopping gap filling.
After the step of calculating the actual gap length from the reads that overlap both contigs, the method includes: if the actual gap length is 0, connecting the first contig and the second contig to complete the gap closure.
The step of acquiring the in-gap sequence and completing the gap closure includes: comparing the acquired in-gap sequence with all reads used for gap filling to determine the accuracy of the bases of the in-gap sequence; and, if the bases of the in-gap sequence are accurate, acquiring the in-gap sequence and completing the gap closure.
To solve the above technical problem, another technical solution adopted by the present invention is to provide a gap closure device in nucleic acid sequence assembly. The device comprises: an acquiring module, configured to acquire, from the reads used for gap filling, the reads that overlap both the first contig and the second contig; a calculating module, configured to calculate the actual gap length of the gap from the reads that overlap both contigs, wherein the sequence characterizing the actual gap length is the in-gap sequence, i.e. the part of such a read that falls within the gap; a first judging module, configured to judge whether a tandem repeat exists in the overlap regions between the read from which the actual gap length was calculated and the first and second contigs, respectively, and whether a tandem repeat exists in the part of that read that falls within the gap; and a first gap-filling module, configured to acquire the in-gap sequence and complete the gap closure when all judgments of the first judging module are negative.
The device includes: a canceling module, configured not to perform gap filling when any one or more of the judgments of the first judging module is affirmative.
The device includes: a continuation judging module, configured to judge, after the calculating module has calculated the actual gap length, whether a tandem repeat exists in the overlap region of the first contig and the second contig; and a second gap-filling module, configured to cut off the overlap region of the first contig or the second contig and complete the gap closure when the continuation judging module's judgment is negative; wherein the canceling module is configured not to perform gap filling when the continuation judging module's judgment is affirmative.
The calculating module includes: a calculating unit, configured to calculate a gap length from each read that overlaps both contigs to form a gap-length frequency table; and a selecting unit, configured to select the gap length with the highest frequency in the table as the actual gap length.
The device includes: a second judging module, configured to judge whether the actual gap length calculated by the calculating module falls within a predetermined gap-length range, and whether the frequency of the actual gap length in the gap-length frequency table is at least double the first frequency; and a third judging module, configured to put the first judging module and the continuation judging module into operation when all judgments of the second judging module are affirmative; wherein the canceling module is configured not to perform gap filling when any one or more of the judgments of the second judging module is negative.
The device includes: a connection module, configured to connect the first contig and the second contig and complete the gap closure when, after the calculating module has calculated the actual gap length, the actual gap length is 0.
The first gap-filling module includes: a comparing unit, configured to compare the acquired in-gap sequence with all reads used for gap filling to determine the accuracy of the bases of the in-gap sequence; and a gap-filling unit, configured to acquire the in-gap sequence and complete the gap closure if the bases of the in-gap sequence are accurate.
The beneficial effects of the present invention are as follows. In contrast to the prior art, the present invention acquires, from the reads used for gap filling, the reads that overlap both the first contig and the second contig. The gap length is calculated from these reads to obtain the actual gap length, and the gap length is then checked. If no tandem repeat exists in the overlap regions between the read from which the gap length was calculated and the first and second contigs, nor in the part of that read that falls within the gap, the in-gap sequence characterizing the gap length is acquired and the gap closure is completed. In this way, the in-gap sequence characterizing the gap length can be obtained accurately during gap filling, which improves the accuracy of gap closure.
[Description of the Drawings]
Fig. 1 is a schematic flowchart of an embodiment of the gap closure method in nucleic acid sequence assembly of the present invention;
Fig. 2 is a schematic diagram of the connection of a gap during nucleic acid sequence assembly of the present invention;
Fig. 3 is a schematic structural diagram of an embodiment of the gap closure device in nucleic acid sequence assembly of the present invention.
The English terms used herein, their Chinese equivalents, and their definitions are as follows:
PE read (双末端读序, paired-end read): the sequences of the two ends of a longer DNA fragment, obtained by sequencing after a paired-end library construction method has captured the two ends of the fragment and the distance between them
read (读序): a base sequence produced during sequencing
block (窗口, window): a nucleotide sequence of a given length chosen manually on a DNA sequence
contig (重叠群): a linear, ordered sequence formed from a set of reads through their overlap relationships
overlap (重叠): the identical part of two sequences during sequence assembly
kmer (定长短串, fixed-length short string): a DNA sequence of length K; K is usually 17
single read (单端读序, single-end read): sequence information obtained mainly by the Sanger sequencing method, i.e. the sequence of one end of a longer DNA fragment or the full read-through of a shorter fragment
scaffold (连接支架): the result of linking contigs using the linkage information of paired-end reads from plasmids, BACs, mRNA, or other sources, in which the contigs are ordered and oriented
gap (洞, hole): genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationships among the non-repetitive regions; the unassembled regions between the non-repetitive regions form gaps, referred to as hole regions
repeat (序列重复, sequence repeat): a nucleotide sequence that occurs repeatedly in the genome sequence
indel (插入/缺失, insertion/deletion): the insertion or deletion of a stretch of sequence, which changes the structure of the DNA sequence
【具体实施方式】【detailed description】
下面结合附图和实施例对本发明进行详细说明。The invention will now be described in detail in conjunction with the drawings and embodiments.
图1是本发明核酸序列组装中的补洞方法一实施例的流程示意图,1 is a schematic flow chart of an embodiment of a method for filling a hole in the assembly of a nucleic acid sequence of the present invention,
如图1所示,在所述补洞方法中,洞的一端具有第一重叠群(Contig),其另一端具有第二重叠群,所述补洞方法包括以下步骤:As shown in FIG. 1, in the hole filling method, one end of the hole has a first contig and the other end has a second contig, and the hole filling method includes the following steps:
步骤101,从用于补洞的读序中获取与第一重叠群和第二重叠群均有重叠的读序;Step 101: Obtain a reading sequence that overlaps the first contig group and the second contig group from the reading sequence used for filling the hole;
在本实施例中,对于长度小于100bp的洞,我们定义为小洞,采用本发明方法进行补洞。首先查找小洞内的所有读序(read)并进行分析,选择落入洞内,且与洞两边重叠群均有重叠的读序。In the present embodiment, for a hole having a length of less than 100 bp, we define a small hole, and the hole is filled by the method of the present invention. First, find all the readings in the small hole and analyze them, and select the readings that fall into the hole and overlap with the overlapping groups on both sides of the hole.
Step 102: calculate the actual gap length from the reads that overlap both contigs, where the sequence representing the actual gap length is the intra-gap sequence, that is, the part of such a read that falls inside the gap.
A read that falls into the gap and overlaps the contigs on both sides spans the gap. Once the parts of the read that overlap the two contigs are removed, what remains is the intra-gap sequence, so these reads can be used to calculate the actual gap length. This embodiment calculates the actual length of a small gap from that remaining sequence, which lies entirely inside the gap.
Each read that spans the gap yields one gap length, so the full set of spanning reads yields a frequency table that covers a range of gap lengths. The table has more than one entry because errors in joining can make different reads imply different gap lengths when they are joined to the contigs. The gap length with the highest frequency in the table is selected as the actual gap length.
Once the actual gap length has been obtained, its credibility must be assessed to ensure accurate gap filling. Two conditions are applied. The first is whether the calculated actual gap length falls within a predetermined range: because the present invention focuses on assembling gaps shorter than 100 bp, the calculated length must first be checked against this range of less than 100 bp. The second is whether the frequency of the actual gap length in the gap-length frequency table is at least double the first frequency, where the first frequency is the frequency in the table that is second only to the maximum, and the gap length corresponding to it is the first gap length. In other words, for the actual gap length to be credible, the number of reads supporting it must be at least double the number of reads supporting the first gap length. When the calculated actual gap length is less than 100 bp and its supporting reads are at least double those of the first gap length, the calculated value is closer to the true gap length.
Step 103: determine whether a tandem repeat exists in the regions where the read that yielded the actual gap length overlaps the first contig and the second contig, and whether a tandem repeat exists in the part of that read that falls inside the gap.
Checking the actual gap length includes a repeat check of the regions where the read overlaps the contigs, a repeat check of the part of the read that falls inside the gap, and a repeat check of the region where the first contig and the second contig overlap each other. The repeat checks of the overlap regions and of the intra-gap region are intended to detect tandem repeats, that is, to identify repeat features. Tandem repeats are generally detected with a window (block) method. The repeat check determines whether a gap containing tandem repeats is of the multi-mode or the single-mode type, as well as the distance between tandem repeat blocks. The check of the actual gap length supports the subsequent gap-filling assembly.
Step 104: if the results of all of the above determinations are negative, obtain the intra-gap sequence and complete the gap filling.
Because the method of the present invention does not assemble gaps that contain repeats, the intra-gap sequence is obtained and the gap is filled only if every determination in step 103 is negative. If one or more of the determinations in step 103 is positive, the gap is not filled with the method of the present invention.
After the intra-gap sequence is obtained, it is compared with all of the reads used for gap filling to determine whether its bases are accurate. If they are, the intra-gap sequence is used to complete the gap filling.
The specific process is as follows:
After the intra-gap sequence representing the actual gap length is obtained, if the actual gap length is greater than a first threshold preset by the system, for example 0, the bases of that intra-gap sequence may be the true bases of the small gap. All reads that yield this actual gap length can then be analyzed base by base to determine the base at each position.
In practice, very few reads span a small gap, so the reliability of the bases of the reads used to determine the small-gap length limits whether those reads can be used for gap filling. To ensure that the sequence filled into the gap is accurate, this embodiment looks for other reads that fall into the small gap without spanning it and compares them with the reads used to determine the small-gap length. If the mismatch rate of the comparison is below a threshold (typically 3%), every base of the intra-gap part of the read used to determine the small-gap length is considered reliable, and the read can be used for gap filling. If the mismatch rate is above the threshold (typically 3%), the bases of the intra-gap part of that read are considered unreliable, and the unreliable part is cut off. This ensures the accuracy of the sequence filled into the small gap.
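The reliability check just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the representation of a non-spanning read as an `(offset, bases)` pair already placed on the candidate sequence, the ungapped base-by-base comparison, and the choice to treat a candidate with no supporting reads as unreliable are all assumptions. Only the 3% threshold comes from the text.

```python
def mismatch_rate(candidate, placed_reads):
    """Fraction of compared bases that disagree with the candidate intra-gap sequence.

    placed_reads is an iterable of (offset, bases) pairs, where offset is the
    0-based position in `candidate` at which the read fragment starts
    (hypothetical representation). Returns None if nothing could be compared.
    """
    compared = 0
    mismatches = 0
    for offset, bases in placed_reads:
        for i, base in enumerate(bases):
            pos = offset + i
            if 0 <= pos < len(candidate):
                compared += 1
                if candidate[pos] != base:
                    mismatches += 1
    return mismatches / compared if compared else None


def is_reliable(candidate, placed_reads, max_rate=0.03):
    """True if the mismatch rate is below max_rate (3% in the text)."""
    rate = mismatch_rate(candidate, placed_reads)
    # Assumption: without any supporting read the candidate is not trusted.
    return rate is not None and rate < max_rate
```

For example, `is_reliable("ACGTACGTAC", [(0, "ACGTA"), (5, "CGTAC")])` accepts the candidate because the two supporting fragments agree with it at every position.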
To explain clearly how gaps are filled during sequence assembly, FIG. 2 shows a schematic diagram of the joining performed when filling a gap according to the above method. As shown in the figure, the gap is flanked by a first contig x and a second contig y. A is a read that overlaps both the first contig and the second contig, a and b are the lengths of the overlaps between the read and the two contigs, and c is the length of the part of read A that falls inside the gap. The sizes of a and b are not limited, and x and y are each longer than A.
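The relationship among a, b, and c in FIG. 2 amounts to c = |A| - a - b, with the intra-gap sequence being the middle slice of the read. A minimal sketch, assuming the overlap lengths have already been determined by some alignment step (the function name and the argument names are hypothetical):

```python
def intra_gap_sequence(read, left_overlap, right_overlap):
    """Return (c, sequence) for a read A that spans the gap.

    left_overlap  -- a, the number of read bases overlapping the first contig
    right_overlap -- b, the number of read bases overlapping the second contig
    c = len(read) - a - b; the sequence is empty when c is zero or negative.
    """
    if left_overlap <= 0 or right_overlap <= 0:
        raise ValueError("a spanning read must overlap both contigs")
    gap_length = len(read) - left_overlap - right_overlap
    return gap_length, read[left_overlap:len(read) - right_overlap]
```

With a 12-base read whose first 5 bases overlap x and whose last 4 bases overlap y, the function returns a gap length of 3 and the 3 bases in between.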
First, a read A that overlaps both the first contig and the second contig is selected from the reads used for gap filling; it overlaps the first contig x by a and the second contig y by b. The length of the gap between the first contig x and the second contig y is then calculated: as shown in the figure, the length c of the part of read A that falls inside the gap is the gap length. In practice, several reads that overlap both contigs are needed to calculate the gap length. Each read that spans the gap yields one gap length, so the full set of such reads yields a frequency table covering a range of gap lengths, and the gap length with the highest frequency in the table is selected as the actual gap length. The calculated actual gap length must also satisfy two conditions: it must be less than 100 bp, and the number of reads supporting it must be at least double the number of reads supporting the first gap length. Here the first frequency is the frequency in the gap-length frequency table that is second only to the maximum, and the gap length corresponding to it is the first gap length.
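The frequency-table selection and the two credibility conditions can be sketched as below. The function and parameter names are hypothetical. The 100 bp limit comes from the text; the support ratio is exposed as `min_ratio` with a default of 2.0, which is one reading of the threshold stated above. As a further assumption, a tie for the highest frequency is treated as a failed support check.

```python
from collections import Counter


def estimate_gap_length(lengths_from_reads, max_small_gap=100, min_ratio=2.0):
    """Pick the most frequent gap length and apply both credibility checks.

    lengths_from_reads -- one gap length per spanning read
    Returns the accepted gap length, or None if either check fails.
    """
    if not lengths_from_reads:
        return None
    table = Counter(lengths_from_reads)      # gap-length frequency table
    ranked = table.most_common()             # highest count first
    best_length, best_count = ranked[0]
    # Condition 1: the length must lie in the small-gap range.
    if best_length >= max_small_gap:
        return None
    # Condition 2: enough support relative to the runner-up length.
    if len(ranked) > 1:
        runner_up_count = ranked[1][1]
        if best_count < min_ratio * runner_up_count:
            return None
    return best_length
```

Given the lengths `[12, 12, 12, 12, 13, 11]`, length 12 has four supporting reads against one for the runner-up, so it is accepted; given `[12, 12, 12, 13, 13]` it is rejected because three is less than twice two.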
After the gap length closest to the true gap length has been calculated, and because the gap-filling method of the present invention does not assemble gaps that contain repeats, it is necessary to determine whether a tandem repeat exists in the overlap regions a and b between read A and the first and second contigs, and whether a tandem repeat exists in the intra-gap sequence c. If none of them contains a tandem repeat, the intra-gap sequence c is obtained and the gap is filled. The number of reads overlapping both contigs that are used to calculate the gap length is not limited.
As can be seen from the above, and in contrast to the prior art, the present invention obtains, from the reads used for gap filling, the reads that overlap both the first contig and the second contig. The gap length is calculated from these reads to give the actual gap length, and that length is then checked. If neither the regions where the read that yielded the gap length overlaps the two contigs nor the part of that read that falls inside the gap contains a tandem repeat, the intra-gap sequence representing the gap length is obtained and the gap is filled. In this way the intra-gap sequence can be obtained accurately, which improves the accuracy of gap filling.
In other embodiments, the gap-filling method in nucleic acid sequence assembly includes: obtaining, from the reads used for gap filling, the reads that overlap both the first contig and the second contig, and calculating the gap length from them. If the actual gap length so determined is less than the first threshold preset by the system, for example 0, the ends of the two contigs are judged to overlap. It is then determined whether the overlap is a repeat: if so, its repeat mode is determined; otherwise, the overlap length is trimmed from the end of a contig. The details are as follows:
If the end of the first contig nearest the gap overlaps the end of the second contig nearest the gap, the assembly software cannot, because of that overlap, join the two contigs into one during assembly. The two contigs are therefore output separately, and the so-called gap length between them becomes negative, forming a negative gap. For a negative gap, if the repeat check of the region where the first and second contigs overlap finds no tandem repeat, the overlap region of either the first contig or the second contig is cut off and the gap filling is complete. If a tandem repeat exists, a variable value for the negative gap length must be determined, and the gap is then filled according to the user's requirements.
In another embodiment, the gap-filling method in nucleic acid sequence assembly includes: obtaining, from the reads used for gap filling, the reads that overlap both the first contig and the second contig, and calculating the gap length from them. If the actual gap length so determined equals the first threshold preset by the system, for example 0, the first contig and the second contig are joined directly and the gap filling is complete.
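The three cases (positive, zero, and negative gap length) can be sketched in one joining routine. This is an illustration under stated assumptions: the function name is hypothetical, the first threshold is taken to be 0 as in the example above, the tandem-repeat checks are assumed to have already passed, and the overlap is trimmed from the second contig (the text allows trimming either one).

```python
def join_contigs(left, right, gap_length, gap_sequence=""):
    """Join two contigs given the actual gap length.

    gap_length > 0  : insert the intra-gap sequence between the contigs
    gap_length == 0 : concatenate the contigs directly
    gap_length < 0  : the contig ends overlap; trim the overlap once
    """
    if gap_length > 0:
        if len(gap_sequence) != gap_length:
            raise ValueError("intra-gap sequence does not match the gap length")
        return left + gap_sequence + right
    if gap_length == 0:
        return left + right
    overlap = -gap_length
    # Sanity check: the claimed overlap must really be shared by both ends.
    if left[-overlap:] != right[:overlap]:
        raise ValueError("contig ends do not overlap as claimed")
    return left + right[overlap:]
```

For instance, joining `"ACGTTT"` and `"TTTGCA"` with a gap length of -3 keeps a single copy of the shared `TTT`.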
FIG. 3 is a schematic structural diagram of an embodiment of the gap-filling device in nucleic acid sequence assembly according to the present invention. As shown in the figure, the device includes an acquisition module 31, a calculation module 32, a first judgment module 33, a first gap-filling module 34, a cancellation module 35, a second gap-filling module 36, a second judgment module 37, a third judgment module 38, a connection module 39, and a continued-judgment module 40. The calculation module 32 includes a calculation unit 321 and a selection unit 322. The first gap-filling module 34 includes a comparison unit 341 and a gap-filling unit 342.
The acquisition module 31 obtains, from the reads used for gap filling, the reads that overlap both the first contig and the second contig. The calculation module 32 calculates the actual gap length from those reads, where the sequence representing the actual gap length is the intra-gap sequence, that is, the part of such a read that falls inside the gap. The calculation unit 321 calculates a gap length from each read that overlaps both contigs and builds the gap-length frequency table. The selection unit 322 selects the gap length with the highest frequency in the table as the actual gap length. The first judgment module 33 determines whether a tandem repeat exists in the regions where the read that yielded the actual gap length overlaps the first and second contigs, and whether a tandem repeat exists in the part of that read that falls inside the gap. The first gap-filling module 34 obtains the intra-gap sequence and completes the gap filling when all determinations of the first judgment module 33 are negative. The comparison unit 341 compares the obtained intra-gap sequence with all reads used for gap filling to determine whether its bases are accurate. The gap-filling unit 342 obtains the intra-gap sequence and completes the gap filling if those bases are accurate. The cancellation module 35 cancels gap filling when any determination made during the process fails to meet its condition.
After the calculation module 32 has calculated the actual gap length, the continued-judgment module 40 determines whether a tandem repeat exists in the region where the first contig and the second contig overlap. When the continued-judgment module 40 finds none, the second gap-filling module 36 cuts off the overlap region of the first contig or the second contig and completes the gap filling.
The second judgment module 37 determines whether the actual gap length calculated by the calculation module 32 falls within the predetermined gap-length range, and whether the frequency of the actual gap length in the gap-length frequency table is at least double the first frequency. When both determinations of the second judgment module 37 are positive, the third judgment module 38 causes the first judgment module 33 and the continued-judgment module 40 to operate.
After the calculation module has calculated the actual gap length, the connection module 39 joins the first contig and the second contig and completes the gap filling when the actual gap length is 0.
The modules operate as follows. First, the acquisition module 31 obtains, from the reads used for gap filling, the reads that overlap both the first contig and the second contig. The calculation module 32 then calculates the actual gap length from those reads, where the sequence representing the actual gap length is the intra-gap sequence, that is, the part of such a read that falls inside the gap. Specifically, the calculation unit 321 calculates a gap length from each read that overlaps both contigs and builds the gap-length frequency table, and the selection unit 322 selects the gap length with the highest frequency in the table as the actual gap length. To bring the calculated gap length as close as possible to the true gap length, the calculated value must also be assessed: the second judgment module 37 determines whether the calculated actual gap length falls within the predetermined gap-length range, and whether the frequency of the actual gap length in the gap-length frequency table is at least double the first frequency. These two conditions are explained as follows. Because the present invention assembles gaps shorter than 100 bp, the calculated actual gap length must first be checked against the predetermined range of less than 100 bp. The first frequency is the frequency in the gap-length frequency table that is second only to the maximum, and the gap length corresponding to it is the first gap length. In other words, for the actual gap length to be credible, the number of reads supporting it must be at least double the number of reads supporting the first gap length. When the calculated actual gap length is less than 100 bp and its supporting reads are at least double those of the first gap length, the calculated value is closer to the true gap length. When both determinations of the second judgment module 37 are positive, the third judgment module 38 causes the first judgment module 33 and the continued-judgment module 40 to operate. When one or more determinations of the second judgment module 37 are negative, the cancellation module 35 cancels gap filling.
The first judgment module 33 determines whether a tandem repeat exists in the regions where the read that yielded the actual gap length overlaps the first and second contigs, and whether a tandem repeat exists in the part of that read that falls inside the gap. When all determinations of the first judgment module 33 are negative, the first gap-filling module 34 obtains the intra-gap sequence and completes the gap filling. To improve accuracy, while the first gap-filling module 34 is doing so, the comparison unit 341 compares the obtained intra-gap sequence with all reads used for gap filling to determine whether its bases are accurate; if they are, the gap-filling unit 342 obtains the intra-gap sequence and completes the gap filling. When one or more determinations of the first judgment module 33 are positive, the cancellation module 35 cancels gap filling.
When the calculated gap length is negative, that is, when a negative gap arises because the first contig and the second contig overlap, the continued-judgment module 40 first determines whether a tandem repeat exists in the region where the two contigs overlap. If not, the second gap-filling module 36 cuts off the overlap region of the first contig or the second contig and completes the gap filling. If so, the cancellation module 35 cancels gap filling.
When the calculated gap length is 0, that is, when there is no gap between the first contig and the second contig, the connection module 39 joins the two contigs and completes the gap filling.
As can be seen from the above, and in contrast to the prior art, the present invention obtains, from the reads used for gap filling, the reads that overlap both the first contig and the second contig. The gap length is calculated from these reads to give the actual gap length, and that length is then checked. If neither the regions where the read that yielded the gap length overlaps the two contigs nor the part of that read that falls inside the gap contains a tandem repeat, the intra-gap sequence representing the gap length is obtained and the gap is filled. In this way the intra-gap sequence can be obtained accurately, which improves the accuracy of gap filling.
It should be noted that, in the embodiments of the present invention, reads suitable for determining the gap length cannot be found for every small gap. When no such read can be found, when the calculated gap length is inaccurate, or when the intra-gap sequence contains repeats, other gap-filling approaches are required. The assembly of gaps is described in detail below through a specific embodiment.
In this embodiment, the level of each gap is determined from its size and from criteria set by the system. Gaps in the sequence are divided into small, medium, and large gaps, and each gap is filled according to its level and the corresponding base sequence segments. Gaps are classified as follows: a gap shorter than 100 bp is defined as a small gap, a gap between 100 bp and 1.5 kb is defined as a medium gap, and a gap longer than 1.5 kb is defined as a large gap. These definitions are only one possibility; the size limits are given as examples and are not limiting.
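The classification above maps directly onto a small helper. The function name and the parameter names are hypothetical; the 100 bp and 1.5 kb limits are the example values from the text, exposed as parameters because the text states they are not limiting.

```python
def classify_gap(length_bp, small_limit=100, large_limit=1500):
    """Classify a gap as 'small', 'medium', or 'large' by its length in bp."""
    if length_bp < small_limit:
        return "small"      # shorter than 100 bp
    if length_bp <= large_limit:
        return "medium"     # between 100 bp and 1.5 kb
    return "large"          # longer than 1.5 kb
```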
Gap filling is described step by step below.
First, the scaffolds that give rise to the gaps are acquired and analyzed. Each original scaffold is broken into contigs, and the space between two contigs is a gap. By reading the contigs used for gap filling, this embodiment can accurately obtain the size of each gap and the contigs before and after it. It can also obtain, at the same time, the length and sequence of each contig and information about the gaps before and after each contig.
In practice, this embodiment also partitions all of the acquired gaps and contigs according to the user's settings and stores each group of associated contigs and reads in its own folder. For example, if the user specifies 4 folders, all acquired gaps and contigs are divided into 4 parts, 4 folders are created, and the associated contigs and reads are stored in the corresponding folder. After this partitioning, each folder contains the contigs and reads needed for gap filling, so later processing can fetch them directly from the relevant folder. Partitioning in this way reduces the memory originally required to roughly a quarter, which saves space, and it also shortens searches during gap filling and therefore reduces the time gap filling takes.
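The partitioning idea can be illustrated in memory, without touching the file system. The text does not say how gaps are assigned to folders, so the round-robin assignment below, like the function name, is purely an assumption chosen for simplicity.

```python
def partition_gaps(gaps, n_parts):
    """Split a list of gap records into n_parts groups (round-robin).

    Each gap record is assumed to carry its own flanking contigs and reads,
    so every group can later be processed independently of the others.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    for index, gap in enumerate(gaps):
        parts[index % n_parts].append(gap)
    return parts
```

Each group would then be written to its own folder, and only one group needs to be held in memory at a time.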
Next, the reads used for gap filling are read for each gap. In this embodiment, most of these reads are PE reads from Solexa sequencing, and the rest are long single-end reads from Sanger sequencing.
PE reads support one another. A pair of PE reads comes from the two ends of an insert, and the inserts used for gap filling generally have sizes of 180 bp, 500 bp, and 800 bp. With high-throughput, high-coverage sequencing, an insert can be reconstructed from the overlaps among multiple PE reads. Therefore, for a given gap, if a read overlaps the contig at one end of the gap and its orientation agrees with that of the contig, that is, if the read is a PE read, then its mate either falls inside the gap or falls on the contig after the gap, and the gap can be filled.
Long reads, because of their length, can span gaps of small length. If every base of a long read is reliable, the bases at each position of that read can be used to fill a short gap accurately.
In this embodiment, for every read acquired within a gap, the following are obtained at the same time: the position of the read relative to the gap, the contig and scaffold to which the read belongs, and the sequence of the read itself.
To ensure both the accuracy and the rate of gap filling, the gap-filling process in this embodiment depends on the gap level described above and comprises: A, filling small gaps; B, filling medium gaps; and C, filling large gaps. The process for each level is described below.
A、对于小洞的补洞过程,请参阅图1和图3所描述的具体实施例。 A. For the hole filling process of the small hole, please refer to the specific embodiments described in FIG. 1 and FIG.
B、对于中洞的处理,具体实施方式如下:B. For the treatment of the middle hole, the specific implementation is as follows:
B1) Read-based recognition of repeat features. All possible windows must be extracted from the reads within the medium gap. In this embodiment the block is set to 6 bp or 12 bp. A block is a window that contains a fixed number of bases and slides along the read one base at a time. Specifically, if a window contains X bases, it initially covers bases 1 to X; after the first slide it covers bases 2 to X+1, and so on. Each slide moves the window forward by one base, so after the n-th slide the window covers bases n+1 to X+n.
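By way of illustration only, the sliding window described above might be sketched in Python as follows. The function name `sliding_blocks` and the use of 0-based offsets are assumptions of this sketch and are not taken from the embodiment.

```python
def sliding_blocks(read, block_size=6):
    """Return (offset, block) for every window of block_size bases.

    The window slides one base at a time, so the n-th slide (0-based
    offset n) covers bases n+1 to block_size+n in 1-based numbering.
    """
    return [(i, read[i:i + block_size])
            for i in range(len(read) - block_size + 1)]


# An 8-base read yields three 6-base blocks.
for offset, block in sliding_blocks("ACGTACGT", 6):
    print(offset, block)
```

A read shorter than the block size simply yields no blocks.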
In practice, to identify tandem repeats, this embodiment records the block frequency (block_freq) and the distance between identical blocks (b_dis) and analyses them. If block_freq reaches its maximum at some value of b_dis, and that distance b_dis is [greater/less] than or equal to the number of bases in the block, the sequence is judged to contain a tandem repeat.
In addition, this embodiment infers the mode of the tandem repeat from the information obtained in the above determination: if only one kind of tandem repeat occurs in the sequence, it is judged to be single-mode; if several tandem repeats occur, whether interleaved or not, it is judged to be multi-mode.
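As a minimal sketch of the block_freq / b_dis analysis, the following Python counts how often identical blocks recur at each distance and reports the distance at which the count peaks. The names `block_distance_profile`, `dominant_period` and the `min_support` cut-off are assumptions of this sketch; the comparison between b_dis and the block size, and the single-mode versus multi-mode classification, are deliberately left out.

```python
from collections import Counter


def block_distance_profile(seq, block_size=6):
    """Map each distance b_dis to the number of times a block recurs
    exactly that far after its previous identical occurrence."""
    last_seen = {}
    profile = Counter()
    for i in range(len(seq) - block_size + 1):
        block = seq[i:i + block_size]
        if block in last_seen:
            profile[i - last_seen[block]] += 1
        last_seen[block] = i
    return profile


def dominant_period(seq, block_size=6, min_support=3):
    """Return the b_dis with the highest frequency, or None when no
    distance is supported by at least min_support recurrences
    (min_support is a hypothetical threshold)."""
    profile = block_distance_profile(seq, block_size)
    if not profile:
        return None
    b_dis, block_freq = max(profile.items(), key=lambda kv: (kv[1], -kv[0]))
    return b_dis if block_freq >= min_support else None
```

For a sequence made of eight copies of `ACG`, every block recurs three bases later, so the profile has a single peak at a distance of 3.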
In practice, to identify tandem repeats, this embodiment also records block frequencies and judges repetition within the gap by calculating the expected block depth within the gap and analysing the distribution of block depth within the gap. If the block frequency within the gap is a multiple of the expected block depth, a repeat is present.
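The depth comparison can be sketched as below, purely as an illustration. How the expected depth is computed is not specified here, so it is passed in as a parameter; `over_represented_blocks` and the `fold` factor of 2 are assumptions of this sketch.

```python
from collections import Counter


def over_represented_blocks(reads, expected_depth, block_size=6, fold=2.0):
    """Return the blocks whose observed frequency across the in-gap
    reads is at least `fold` times the expected block depth."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - block_size + 1):
            counts[read[i:i + block_size]] += 1
    return {block: n for block, n in counts.items()
            if n >= fold * expected_depth}
```

A non-empty result would be taken as a hint that the gap contains repeated sequence.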
B2) Calculating overlaps between reads. The overlap calculation first uses a hash method to determine quickly whether reads share a common fixed-length short string (kmer); reads that share a kmer may overlap. A kmer is defined as a contiguous base sequence of length k. In a genome, the distribution of kmers is closely related to genome size, error rate and heterozygosity. Each pair of reads that may overlap is then aligned using pattern recognition.
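A minimal sketch of the hashing step, assuming reads are plain strings identified by their list index: every kmer is stored in a hash table, and any two reads that share a kmer become a candidate pair for the later detailed comparison. The function name `candidate_overlap_pairs` and the default k are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlap_pairs(reads, k=12):
    """Return the set of read-index pairs that share at least one kmer."""
    index = defaultdict(set)          # kmer -> indices of reads containing it
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs
```

Sharing a kmer only makes an overlap possible; each candidate pair still has to be compared in detail.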
In practice, a maximum overlap is set first and this region is divided into several blocks. Blocks are read in turn from the front end of one read and searched for inside the other read. If a block is found, a detailed comparison is carried out to obtain the overlap length; if it is not found, the next block is read. For fault tolerance (that is, up to 3 mismatched bases are allowed within the overlap between two reads), the number of blocks may be increased appropriately.
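The block-seeded comparison might look roughly like the sketch below, which looks for a suffix of `left` that matches a prefix of `right`. Blocks are taken from the front of `right` within the maximum-overlap region and located in `left`; each hit implies one alignment, which is then checked base by base. The function name, the parameter defaults and the choice of non-overlapping blocks are assumptions of this sketch.

```python
def suffix_prefix_overlap(left, right, block_size=6, max_overlap=60,
                          max_mismatches=3):
    """Return the length of a suffix(left)/prefix(right) overlap with at
    most max_mismatches mismatched bases, or 0 if none is found."""
    for j in range(0, max_overlap - block_size + 1, block_size):
        block = right[j:j + block_size]
        if len(block) < block_size:
            break                      # ran off the end of `right`
        p = left.find(block)
        while p != -1:
            start = p - j              # where right[0] would sit on `left`
            length = len(left) - start
            if start >= 0 and 0 < length <= min(len(right), max_overlap):
                mismatches = sum(a != b
                                 for a, b in zip(left[start:], right[:length]))
                if mismatches <= max_mismatches:
                    return length
            p = left.find(block, p + 1)
    return 0
```

Because a block must match exactly, a mismatch inside the first block hides the overlap from that block; trying further blocks, as the paragraph above suggests, gives the comparison another chance to seed the alignment.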
B3) Identifying extension conflicts.
For extension from the front end: starting from the node read, all reads that overlap the node read are found, and the read with the smallest overlap with the node read is selected as the seed read. All the other reads should then overlap the seed read, and the lengths of these overlaps are necessarily greater than the length of the overlap between the seed read and the node read. If some read does not overlap the seed read, a conflict is deemed to have occurred, and it is resolved by replacing the seed read, that is, by searching for a new seed read. This ensures that the seed read that is found is correct.
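The seed choice and conflict test can be sketched as follows, for illustration only. The overlap lengths with the node read are assumed to be available as a dictionary, and `overlap_len(a, b)` is a hypothetical callback that returns the overlap length between two reads (0 when they do not overlap).

```python
def initial_seed(node_overlaps):
    """Pick the read whose overlap with the node read is smallest.

    node_overlaps maps read id -> overlap length with the node read.
    """
    return min(node_overlaps, key=node_overlaps.get)


def conflicting_reads(seed, node_overlaps, overlap_len):
    """Return the reads that overlap the node read but not the seed;
    a non-empty result signals an extension conflict."""
    return [r for r in node_overlaps
            if r != seed and overlap_len(seed, r) == 0]
```

When `conflicting_reads` returns anything, the seed would be replaced and the test repeated with the new seed.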
This embodiment then performs extension. The seed read is treated as part of the contig, and a new seed read is sought in the manner described above. If one is found, the sequence continues to extend; otherwise extension is deemed to have ended, and the extension from the other end must be awaited to determine whether the two extensions overlap and hence whether the gap can be closed. In some cases, of course, a read extending forward may find a sequence that previously served as a seed read as its next extension read, which would cause an infinite loop of extension within that range. This is handled as a conflict: extension is terminated when the extension read is a read that has already been used. Extension from the back end in this embodiment is analogous to extension from the front end and is not described in detail here.
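A minimal sketch of the extension loop with the loop guard, assuming a hypothetical callback `next_seed(read_id)` that returns the next seed read or `None` when no seed can be found:

```python
def extend(start, next_seed):
    """Follow seed reads from `start` until no seed is found or a
    previously used read comes up again.

    Returns (path, reason) where reason is "ended" or "cycle".
    """
    path = [start]
    used = {start}
    current = start
    while True:
        nxt = next_seed(current)
        if nxt is None:
            return path, "ended"       # wait for extension from the other end
        if nxt in used:
            return path, "cycle"       # treated as a conflict; stop here
        path.append(nxt)
        used.add(nxt)
        current = nxt
```

Keeping the set of used reads is what turns a potential infinite loop into an ordinary termination condition.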
Local assembly for gap closure inherently faces highly similar repeats, so conflict identification should be as sensitive as possible, and the reads used for gap closure should have a lower error rate. This embodiment corrects errors in the reads beforehand to improve read quality and to ensure the accuracy of both ends of each read.
B4) Conflict handling. Extension conflicts have two causes: a base error within the seed read, or a repeat branch. In view of these two cases, this embodiment applies the following strategies to handle conflicts when selecting the set of gap-closure reads and the seed read:
a1) Alignment-rate filtering: a read must have a 100% alignment rate to be extended as a seed read. Alignment-rate filtering uses the following strategy:
All reads on a given contig are searched, the reads that overlap the contig are found, and the read with the smallest overlap with the contig is initially selected as the seed read. The other reads that overlap the contig must then also overlap the seed read. The seed read is aligned against these other reads. If the alignment error tolerance of the overlaps between the initially selected seed read and the other reads exceeds a second threshold set by the system, the second threshold being 3%, the seed read is judged unreliable and a new seed read is selected whose overlap with the contig is longer than that of the unreliable seed read and shorter than those of the other reads. This is repeated until a seed read is found, or until extension is abandoned because no seed read can be found. In this way, this embodiment avoids the failure to extend that would otherwise result from base errors in the seed read.
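The seed replacement loop described above might be sketched as follows. `mismatch_rate(a, b)` is a hypothetical callback giving the fraction of mismatched bases in the overlap of two reads, and requiring at least one supporting read (`min_support`) before accepting a seed is an assumption of this sketch.

```python
def select_reliable_seed(contig_overlaps, mismatch_rate,
                         max_rate=0.03, min_support=1):
    """Try candidate seeds in order of increasing overlap with the contig.

    contig_overlaps maps read id -> overlap length with the contig.
    A candidate is accepted when its mismatch rate against every read
    with a longer contig overlap stays within max_rate (3% in the text).
    Returns the seed id, or None when extension should be abandoned.
    """
    ranked = sorted(contig_overlaps, key=contig_overlaps.get)
    for i, seed in enumerate(ranked):
        others = ranked[i + 1:]        # reads with longer contig overlaps
        if len(others) < min_support:
            break
        if all(mismatch_rate(seed, other) <= max_rate for other in others):
            return seed
    return None
```

Each rejected candidate is dropped from further comparisons, so the next candidate always has a longer overlap with the contig than the one it replaces.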
In practice, overlaps between reads are aligned by stepwise block extension: a block is selected from the seed read, a target read is set, and it is checked whether the bases in the block can be found in the target read. If so, the block in the seed read is moved forward by one unit and compared with the target read again, and this is repeated until no further match is possible. The length of the overlap between the seed read and the target read is thereby obtained. A third threshold of 1 kmer is required for this length, to indicate that the overlap between the two reads is not coincidental and is genuinely reliable. If the preceding seed read itself contains a sequencing error, a large number of reads may be filtered out; a loop is therefore provided for replacing the preceding seed read.
a2) Position filtering: reads are located according to the paired-end relationship, the position of each read within the gap is calculated, and reads are filtered by position, thereby reducing conflicts caused by long repeated segments within the gap. To ensure that positions within the gap are calculated accurately, this embodiment may set strict filtering conditions.
a3) Read-length filtering: among the reads obtained, PE reads are short, whereas single-end reads (single reads) are usually longer. The longer single-end reads all overlap one end of the gap. In this embodiment, short paired-end reads are preferred for extension in the interior of the gap, and long single-end reads are preferred for extension at the two ends of the gap.
a4) End filtering: based on the estimated gap length, if the extension read overlaps the other end too early, a non-overlapping read is selected instead, that is, a read positioned immediately after the extension read, not overlapping it, and whose placement does not conflict with the estimated gap length. This ensures that a repeat region is crossed once. In this embodiment, end filtering may occur only once.
a5) Handling and identifying short similar repeats: short similar repeats are usually shorter than 50 bp and lie close to one another, and they ultimately cause base deletions in the sequence within the gap. When short similar repeats are identified, this embodiment preferentially selects a read with a longer overlap as the seed read for extension, which effectively avoids the problem of short similar repeats.
B5) Sequence connection. During gap closure, this embodiment requires not only accurate assembly but also accurate connection. Accurate assembly both lowers the base error rate and ensures that connections can be made accurately, while accurate connection directly determines whether insertions or deletions are ultimately produced. Possible extension errors must also be considered when connecting. By connection quality, the sequence connections of this embodiment fall into the following three confidence levels:
b1) First confidence level: the two connected sequences overlap, the overlap is not a repeat, and a read spans the connection in support.
b2) Second confidence level: a read spans the connection of the two sequences; the two sequences may not overlap.
b3) Third confidence level: the two connected sequences overlap by at least 8 bp, and the overlap region is not supported by evidence and may be a repeat.
All three confidence levels may occur. The first is of higher quality, but that does not mean it is necessarily correct; the second is of the next-highest quality and, likewise, is not necessarily correct. Connections within the gap are therefore classified and handled in detail according to actual conditions. Connections within a gap fall into three types: the contigs at the two ends connect directly; an extension from one end connects with the contig at the other end; or the extended sequences from both ends connect. For each of these three types, it is determined which of the three confidence levels are present, that is, the accuracy of the sequence connection is assessed when the sequences are connected. If the first confidence level is present, it is selected and the sequences are connected. If the first is absent but the second is present, the second is selected and the sequences are connected. If the first and second are absent but the third is present, the third is selected and the sequences are connected.
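The tiered choice can be sketched as a small classification function. The function name and its three inputs are assumptions of this sketch; in particular, whether an overlap is a repeat and whether a read spans the connection are taken as already determined.

```python
def connection_confidence(overlap_len, overlap_is_repeat, has_spanning_read,
                          min_overlap=8):
    """Return 1, 2 or 3 for the best confidence level that applies to a
    connection, or None when none applies."""
    if overlap_len > 0 and not overlap_is_repeat and has_spanning_read:
        return 1    # overlap, not a repeat, supported by a spanning read
    if has_spanning_read:
        return 2    # spanning read only; the sequences may not overlap
    if overlap_len >= min_overlap:
        return 3    # at least 8 bp overlap without supporting evidence
    return None
```

The levels are tested from best to worst, mirroring the order of preference stated above.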
C. Large gaps are handled mainly by dividing the large gap into several medium gaps, which are then processed according to the procedure for medium gaps.
Because the PE size is limited during gap closure and the longest insert supporting PE reads is 800 bp, when the gap length exceeds 1.5 kb two 800 bp inserts cannot overlap each other once the lengths of their overlaps with the contigs at the two ends are subtracted; that is, no complete path can be found that fills the large gap entirely. To avoid the blank regions that PE reads may leave, this embodiment divides the large gap into several medium gaps, assembles each medium gap separately, and finally connects the assembly results, as follows:
c1) The position of each read within the gap is calculated from the PE relationship, the reads are sorted by their position within the gap, and each stretch covered by contiguous reads is identified from the positions as one region.
c2) Each region is assembled separately in the manner used for medium gaps.
c3) The assembly results of the regions are connected to obtain the sequence within the large gap.
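Step c1) amounts to merging read intervals into stretches of contiguous coverage. A minimal sketch, assuming each read has already been placed in the gap as a `(start, end)` pair with an exclusive end; the function name and the rule that touching intervals count as contiguous are assumptions of this sketch.

```python
def contiguous_regions(read_spans):
    """Merge in-gap read intervals into regions of continuous coverage.

    read_spans is a list of (start, end) positions within the gap.
    """
    regions = []
    for start, end in sorted(read_spans):
        if regions and start <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)   # extend current region
        else:
            regions.append([start, end])                # coverage break: new region
    return [tuple(region) for region in regions]
```

Each returned region would then be assembled like a medium gap, and the results connected in positional order.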
The above are merely embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (14)

  1. A gap closure method in nucleic acid sequence assembly, the gap having a first contig at one end and a second contig at the other end, characterized by comprising the following steps:
    obtaining, from the reads used for gap closure, reads that overlap both the first contig and the second contig;
    calculating the actual gap length of the gap from the reads that overlap both contigs, wherein the sequence characterizing the actual gap length is the in-gap sequence, namely the portion of the reads overlapping both contigs that falls within the gap;
    determining whether a tandem repeat exists in the regions where the read from which the actual gap length was calculated overlaps the first contig and the second contig respectively, and whether a tandem repeat exists in the portion of that read that falls within the gap;
    if the results of all the above determinations are negative, obtaining the in-gap sequence and completing gap closure.
  2. The method according to claim 1, characterized in that
    after the step of making the determinations, the method comprises:
    if any one or more of the above determinations is affirmative, not performing gap closure.
  3. The method according to claim 1, characterized in that
    after the step of calculating the actual gap length of the gap from the reads that overlap both contigs, the method comprises:
    determining whether a tandem repeat exists in the overlap region of the first contig and the second contig;
    if the determination is negative, cutting off the overlap region of the first contig or the second contig and completing gap closure;
    if the determination is affirmative, not performing gap closure.
  4. The method according to claim 1, characterized in that
    the step of calculating the actual gap length of the gap from the reads that overlap both contigs comprises:
    calculating the length of the gap from each read that overlaps both contigs, to form a gap-length frequency table;
    selecting the gap length with the highest frequency in the gap-length frequency table as the actual gap length.
  5. The method according to claim 4, characterized in that
    after the step of calculating the actual gap length of the gap from the reads that overlap both contigs, the method comprises:
    determining whether the calculated actual gap length falls within a predetermined gap-length range, and whether the frequency representing the actual gap length in the gap-length frequency table is one or more times the first frequency;
    if all of the above determinations are affirmative, performing the step of determining whether a tandem repeat exists in the regions where the read from which the actual gap length was calculated overlaps the first contig and the second contig respectively and in the portion of that read that falls within the gap, and the step of determining whether a tandem repeat exists in the overlap region of the first contig and the second contig;
    if any one or more of the above determinations is negative, stopping gap closure.
  6. The method according to claim 1, characterized in that
    after the step of calculating the actual gap length of the gap from the reads that overlap both contigs, the method comprises:
    if the actual gap length is 0, connecting the first contig and the second contig and completing gap closure.
  7. The method according to claim 1, characterized in that
    the step of obtaining the in-gap sequence and completing gap closure comprises:
    aligning the obtained in-gap sequence against all the reads used for gap closure to determine the accuracy of the bases of the in-gap sequence;
    if the bases of the in-gap sequence are accurate, obtaining the in-gap sequence and completing gap closure.
  8. A gap closure device in nucleic acid sequence assembly, characterized in that the device comprises:
    an obtaining module, configured to obtain, from the reads used for gap closure, reads that overlap both the first contig and the second contig;
    a calculation module, configured to calculate the actual gap length of the gap from the reads that overlap both contigs, wherein the sequence characterizing the actual gap length is the in-gap sequence, namely the portion of the reads overlapping both contigs that falls within the gap;
    a first determination module, configured to determine whether a tandem repeat exists in the regions where the read from which the actual gap length was calculated overlaps the first contig and the second contig respectively, and whether a tandem repeat exists in the portion of that read that falls within the gap;
    a first gap closure module, configured to obtain the in-gap sequence and complete gap closure when all determination results of the first determination module are negative.
  9. The device according to claim 8, characterized in that
    the device comprises:
    a cancellation module, configured not to perform gap closure when any one or more of the determination results of the first determination module is affirmative.
  10. The device according to claim 8, characterized in that
    the device comprises:
    a further determination module, configured to determine, after the calculation module has calculated the actual gap length, whether a tandem repeat exists in the overlap region of the first contig and the second contig;
    a second gap closure module, configured to cut off the overlap region of the first contig or the second contig and complete gap closure when the determination of the further determination module is negative;
    wherein the cancellation module is configured not to perform gap closure when the determination of the further determination module is affirmative.
  11. The device according to claim 8, characterized in that
    the calculation module comprises:
    a calculation unit, configured to calculate the length of the gap from each read that overlaps both contigs, to form a gap-length frequency table;
    a selection unit, configured to select the gap length with the highest frequency in the gap-length frequency table as the actual gap length.
  12. The device according to claim 11, characterized in that
    the device comprises:
    a second determination module, configured to determine whether the actual gap length calculated by the calculation module falls within a predetermined gap-length range, and whether the frequency representing the actual gap length in the gap-length frequency table is one or more times the first frequency;
    a third determination module, configured to cause the first determination module and the further determination module to operate when all determination results of the second determination module are affirmative;
    wherein the cancellation module is configured not to perform gap closure when any one or more of the determination results of the second determination module is negative.
  13. The device according to claim 8, characterized in that
    the device comprises:
    a connection module, configured to connect the first contig and the second contig and complete gap closure when, after the calculation module has calculated the actual gap length, the actual gap length is 0.
  14. The device according to claim 8, characterized in that
    the first gap closure module comprises:
    a comparison unit, configured to align the obtained in-gap sequence against all the reads used for gap closure to determine the accuracy of the bases of the in-gap sequence;
    a gap closure unit, configured to obtain the in-gap sequence and complete gap closure if the bases of the in-gap sequence are accurate.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/083186 WO2013078625A1 (en) 2011-11-29 2011-11-29 Gap closure method and device in nucleotide sequence assembly

Publications (1)

Publication Number Publication Date
WO2013078625A1 (en) 2013-06-06

Family

ID=48534610

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/083186 WO2013078625A1 (en) 2011-11-29 2011-11-29 Gap closure method and device in nucleotide sequence assembly

Country Status (1)

Country Link
WO (1) WO2013078625A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070082358A1 (en) * 2005-10-11 2007-04-12 Roderic Fuerst Sequencing by synthesis based ordered restriction mapping

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DUAN, BIAO: "Study on No.27, 28 Linkage Group Based on the Fine Map for the Silkworm (Bombyx Mori) Genome", CHINA DOCTORAL DISSERTATIONS FULL-TEXT DATABASE (BASIC SCIENCES), March 2011 (2011-03-01), pages A006 - 49 *
KOREN ET AL.: "An algorithm for automated closure during assembly", BMC BIOINFORMATICS, vol. 457, November 2010 (2010-11-01), pages 1 - 7, XP021071784 *
ZHAO, DONGSHENG ET AL.: "Computing resource of AMMS biomedicine super-computing center and its applications", BULLETIN OF THE ACADEMY OF MILITARY MEDICAL SCIENCES, vol. 29, no. 4, August 2005 (2005-08-01), pages 363 - 367 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11876617

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/11/2014)

122 Ep: pct application non-entry in european phase

Ref document number: 11876617

Country of ref document: EP

Kind code of ref document: A1