WO2013078625A1 - Method and device for filling holes in nucleic acid sequence assembly - Google Patents

Method and device for filling holes in nucleic acid sequence assembly

Publication number
WO2013078625A1
Authority
WO
WIPO (PCT)
Prior art keywords
hole
sequence
contig
length
actual
Prior art date
Application number
PCT/CN2011/083186
Other languages
English (en)
Chinese (zh)
Inventor
刘兵行
李振宇
陈燕香
李英睿
汪建
王俊
杨焕明
Original Assignee
深圳华大基因科技有限公司
深圳华大基因研究院
Priority date
Filing date
Publication date
Application filed by 深圳华大基因科技有限公司 and 深圳华大基因研究院
Priority to PCT/CN2011/083186
Publication of WO2013078625A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20: Sequence assembly

Definitions

  • The invention relates to the field of genetic engineering technology, in particular to a method and device for filling holes in nucleic acid sequence assembly.
  • The cost of sequencing keeps falling, which drives whole-genome sequencing of more and more species.
  • The principle of second-generation sequencing technology dictates that the sequenced fragments are short. In practice, the reads are only tens to hundreds of bases long, which increases the difficulty of analyzing the sequencing data.
  • Genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationship between the non-repetitive regions. The unassembled region between non-repetitive regions is likely to form a gap, which is called a hole.
  • Graph-based local assembly can identify the conflicting sites caused by repeats, but the conflicts are difficult to resolve and the assembly has to be broken at those sites, which affects the number of holes.
  • The prior art mainly offers two hole-filling programs: the Gapcloser program, which uses overlap-based local assembly, and the hole-filling step of the SOAPdenovo program, which uses local assembly based on the de Bruijn graph.
  • The hole-filling software Gapcloser performs local assembly from base sequence segments that overlap one another. Because this approach does not take into account the complexity of the situation inside the hole, it is prone to errors when processing complex holes, which reduces overall accuracy.
  • Because Gapcloser consumes a large amount of memory and is time-consuming, it is not suitable for filling holes in large genomes.
  • Hole filling in the SOAPdenovo assembly software performs a secondary assembly of the region inside the hole using a de Bruijn graph. Although it can effectively resolve short holes, the number of holes it can fill is limited.
  • The technical problem to be solved by the present invention is to provide a method and device for filling holes in nucleic acid sequence assembly that accurately obtain the intra-hole sequence characterizing the hole length and improve the accuracy of hole filling.
  • The present invention adopts a technical solution that provides a method for filling a hole in a nucleic acid sequence assembly, the hole having a first contig at one end and a second contig at the other end. The method comprises the following steps: obtaining, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig; calculating the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence, that is, the part of an overlapping read that falls into the hole; determining whether a tandem repeat exists in the regions where the reads used to calculate the actual hole length overlap the first contig and the second contig, respectively, and whether a tandem repeat exists in the part of those reads that falls into the hole; and, if the result of every determination is no, obtaining the intra-hole sequence and completing the hole filling.
  • After the determinations, the method includes: if any one or more of the above determinations is yes, the hole is not filled.
  • After the step of calculating the actual hole length from the overlapping reads, the method includes: determining whether a tandem repeat exists in the region where the first contig and the second contig overlap each other; if not, cutting off the overlapping region of the first contig or of the second contig and completing the hole filling; if so, not filling the hole.
  • The step of calculating the actual hole length from the overlapping reads includes: calculating a hole length from each overlapping read to form a hole length frequency table; and selecting the hole length with the highest frequency in the hole length frequency table as the actual hole length.
  • After the step of calculating the actual hole length from the overlapping reads, the method includes: determining whether the calculated actual hole length falls within a predetermined hole length range, and whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency; if both determinations are yes, performing the step of determining whether tandem repeats exist in the read-contig overlap regions and in the part of the reads that falls into the hole, and the step of determining whether a tandem repeat exists in the overlapping region of the first contig and the second contig; if either determination is no, the hole filling is stopped.
  • After the step of calculating the actual hole length from the overlapping reads, the method includes: if the actual hole length is 0, connecting the first contig and the second contig to complete the hole filling.
  • The step of obtaining the intra-hole sequence and completing the hole filling comprises: comparing the obtained intra-hole sequence with all reads used for filling the hole to determine the accuracy of the bases of the intra-hole sequence; and, if the intra-hole sequence is accurate, obtaining it and completing the hole filling.
  • The present invention further provides a hole-filling device for nucleic acid sequence assembly, the device comprising: an obtaining module, configured to obtain, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig; a calculating module, configured to calculate the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence, that is, the part of an overlapping read that falls into the hole; a first judging module, configured to determine whether a tandem repeat exists in the regions where the reads used to calculate the actual hole length overlap the first contig and the second contig, respectively, and whether a tandem repeat exists in the part of those reads that falls into the hole; and a first hole-filling module, configured to obtain the intra-hole sequence and complete the hole filling when every determination of the first judging module is no.
  • The device includes a canceling module, configured not to fill the hole when any one or more of the determinations of the first judging module is yes.
  • The device includes: a continuation determining module, configured to determine, after the calculating module calculates the actual hole length, whether a tandem repeat exists in the overlapping region of the first contig and the second contig; and a second hole-filling module, configured to cut off the overlapping region of the first contig or of the second contig and complete the hole filling when the continuation determining module determines no; wherein the canceling module does not fill the hole when the continuation determining module determines yes.
  • The calculating module includes: a calculating unit, configured to calculate a hole length from each overlapping read to form a hole length frequency table; and a selecting unit, configured to select the hole length with the highest frequency in the hole length frequency table as the actual hole length.
  • The device includes: a second judging module, configured to determine whether the actual hole length calculated by the calculating module falls within a predetermined hole length range, and whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency; and a third judging module, configured to cause the first judging module and the continuation determining module to operate when both determinations of the second judging module are yes; wherein the canceling module does not fill the hole when any one or more of the determinations of the second judging module is no.
  • The device includes a connection module, configured to connect the first contig and the second contig to complete the hole filling when the calculated actual hole length is 0.
  • The first hole-filling module comprises: a comparing unit, configured to compare the obtained intra-hole sequence with all reads used for filling the hole to determine the accuracy of the bases of the intra-hole sequence; and a hole-filling unit, configured to obtain the intra-hole sequence and complete the hole filling if the intra-hole sequence is accurate.
  • The present invention obtains, from the reads used for hole filling, the reads that overlap both the first contig and the second contig, calculates a hole length from each of these reads to obtain the actual hole length, and then checks that hole length. If no tandem repeat exists in the regions where the reads used to calculate the hole length overlap the first contig and the second contig, nor in the part of those reads that falls into the hole, the intra-hole sequence characterizing the hole length is obtained and the hole filling is completed.
  • FIG. 1 is a schematic flow chart of an embodiment of a method for filling a hole in the assembly of a nucleic acid sequence of the present invention
  • FIG. 2 is a schematic view showing the connection of a hole in the assembly process of the nucleic acid sequence of the present invention
  • Fig. 3 is a schematic view showing the structure of an embodiment of a hole-filling device in the assembly of the nucleic acid sequence of the present invention.
  • Kmer (fixed-length string): a DNA sequence of length K; K is usually taken as 17.
  • Single read (single-end read): mainly a read obtained by the Sanger sequencing method, which yields the sequence of one end of a longer DNA fragment or the sequence of a shorter fragment.
  • Scaffold (linked contigs): the result of linking contigs using linkage information from paired-end reads of plasmids, BACs, mRNA, or other sources, in which the contigs are ordered and oriented.
  • Gap (hole): genome assembly usually first masks the repeat regions and then, with the aid of paired-end reads (PE reads), determines the relationship between the non-repetitive regions; the unassembled region between non-repetitive regions forms a gap, called the hole region.
  • PE read: paired-end read.
  • FIG. 1 is a schematic flow chart of an embodiment of a method for filling a hole in the assembly of a nucleic acid sequence of the present invention
  • one end of the hole has a first contig and the other end has a second contig
  • the hole filling method includes the following steps:
  • Step 101: Obtain, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig;
  • A hole with a length of less than 100 bp is defined as a small hole, and such holes are filled by the method of the present invention. First, all reads located in the small hole are found and analyzed, and the reads that fall into the hole and overlap the contigs on both sides of the hole are selected.
  • Step 102: Calculate the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence, that is, the part of an overlapping read that falls into the hole;
  • The hole length is calculated from the remaining part of the read, which falls entirely within the hole.
  • A frequency table is formed to characterize the range of hole lengths.
  • The frequency table is needed because possible errors in the alignment mean that different reads, when joined to the contigs, can indicate different hole lengths. The hole length with the highest frequency in the frequency table is selected as the actual hole length.
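  • As a minimal sketch of the frequency-table step described above, the following Python snippet tallies the hole lengths suggested by individual spanning reads and picks the most frequent one. The function name `actual_hole_length` and the tie-breaking rule (prefer the shorter length) are illustrative assumptions, not part of the described method.

```python
from collections import Counter


def actual_hole_length(hole_lengths):
    """Return (most frequent hole length, frequency table).

    hole_lengths: one hole length (in bp) per read that spans the hole.
    Ties are broken in favour of the shorter length (an assumption).
    """
    table = Counter(hole_lengths)
    if not table:
        return None, table
    best = max(table, key=lambda length: (table[length], -length))
    return best, table


length, table = actual_hole_length([12, 12, 13, 12, 11])
print(length, dict(table))
```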
  • The first condition is whether the calculated actual hole length falls within the predetermined hole length range.
  • Because the present invention focuses on assembling holes with a hole length of less than 100 bp, it is first necessary to check whether the calculated actual hole length falls within the predetermined range of less than 100 bp.
  • The second condition is whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency. The first frequency here is the second-highest frequency in the hole length frequency table, that is, the frequency that is smaller only than the maximum frequency.
  • The hole length corresponding to the first frequency is the first hole length. In other words, for the actual hole length to be credible, the number of reads supporting the actual hole length must be more than double the number of reads supporting the first hole length. If the calculated actual hole length is less than 100 bp and the number of reads supporting it is more than double the number of reads supporting the first hole length, the calculated actual hole length is considered close to the true hole length.
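  • The two conditions above can be sketched as a small check on the frequency table. This is an illustrative Python sketch only: the function name `is_credible`, its parameters, and the handling of a table with a single entry are assumptions. Zero and negative hole lengths are treated separately in the text and are not singled out here.

```python
from collections import Counter


def is_credible(table, max_length=100, min_ratio=2):
    """Check the two conditions on the most frequent hole length.

    table: Counter mapping hole length (bp) -> number of supporting reads.
    """
    ranked = table.most_common(2)
    if not ranked:
        return False
    best_length, best_freq = ranked[0]
    # Condition 1: the hole length must be below the predetermined limit.
    if best_length >= max_length:
        return False
    # Condition 2: support must be MORE than double the runner-up's support.
    if len(ranked) == 1:
        return True  # no competing hole length (assumption)
    return best_freq > min_ratio * ranked[1][1]


print(is_credible(Counter({12: 9, 13: 2})))
```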
  • Step 103: Determine whether a tandem repeat exists in the regions where the reads used to calculate the actual hole length overlap the first contig and the second contig, respectively, and whether a tandem repeat exists in the part of those reads that falls into the hole;
  • The inspection of the actual hole length includes: a repeat check of the regions where the reads overlap the contigs, a repeat check of the region falling within the hole, and a repeat check of the region where the first contig and the second contig overlap each other.
  • The repeat check of the overlap regions and of the region inside the hole is a check for tandem repeats, that is, repeat feature recognition.
  • The tandem repeat check is generally implemented by a block check method, which determines whether the tandem repeat in the hole is a multi-pattern or a single-pattern repeat and determines the distance between tandem repeat blocks.
  • The inspection of the actual hole length provides support for the subsequent hole assembly.
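  • The text does not spell out the block check method, so the following Python sketch swaps in a much simpler technique: a brute-force scan for a short motif repeated back to back. The function name `has_tandem_repeat` and the thresholds `max_unit` and `min_copies` are assumptions chosen for illustration and are not values from the source.

```python
def has_tandem_repeat(seq, min_unit=1, max_unit=6, min_copies=3):
    """Return True if seq contains a motif of min_unit..max_unit bases
    repeated at least min_copies times consecutively."""
    n = len(seq)
    for unit in range(min_unit, max_unit + 1):
        for start in range(0, n - unit * min_copies + 1):
            motif = seq[start:start + unit]
            copies = 1
            pos = start + unit
            # Count how many further copies of the motif follow directly.
            while seq[pos:pos + unit] == motif:
                copies += 1
                pos += unit
            if copies >= min_copies:
                return True
    return False


print(has_tandem_repeat("ACGTACACACACGT"))
print(has_tandem_repeat("ACGTTGCAAGCT"))
```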
  • Step 104: If the result of every determination above is no, the intra-hole sequence is obtained and the hole filling is completed.
  • Because the method of the present invention does not assemble holes that contain repeats, the intra-hole sequence is obtained and the hole filling is completed only if every determination in step 103 is no; if any one or more of the determinations in step 103 is yes, the method of the present invention is not used to fill the hole.
  • To improve accuracy, the obtained intra-hole sequence is compared with all reads used for filling the hole to determine the accuracy of the bases of the intra-hole sequence; if the intra-hole sequence is accurate, it is used to complete the hole filling.
  • The bases of the intra-hole sequence characterizing the hole length may be the true bases of the hole. All reads that characterize the actual hole length can be analyzed base by base to determine the base at each position.
  • The present embodiment also searches for other reads that fall within the small hole but do not span it, and compares them with the reads used to determine the length of the small hole. If the mismatch rate of the alignment is below the tolerance (usually 3%), every base of the part of those reads that falls into the hole is considered trustworthy and can be used to fill the hole. If the mismatch rate is above the tolerance (usually 3%), the bases of that part are considered untrustworthy and the untrustworthy bases are trimmed off. This ensures the accuracy of the reads filled into the small hole.
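  • A hedged sketch of the 3% tolerance check follows. It assumes that the supporting reads have already been placed at known offsets relative to the candidate intra-hole sequence and that the comparison is ungapped; the names `mismatch_rate` and `is_trustworthy`, and the treatment of a rate exactly equal to the tolerance as untrustworthy, are illustrative assumptions.

```python
def mismatch_rate(candidate, placed_reads):
    """Fraction of compared bases that disagree with the candidate.

    placed_reads: iterable of (offset, read) pairs, where offset is the
    position in the candidate at which the read starts (may be negative).
    """
    compared = mismatches = 0
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            pos = offset + i
            if 0 <= pos < len(candidate):
                compared += 1
                if candidate[pos] != base:
                    mismatches += 1
    return mismatches / compared if compared else None


def is_trustworthy(candidate, placed_reads, tolerance=0.03):
    rate = mismatch_rate(candidate, placed_reads)
    return rate is not None and rate < tolerance


good = is_trustworthy("ACGTACGTAC", [(0, "ACGTA"), (5, "CGTAC")])
bad = is_trustworthy("ACGTACGTAC", [(0, "ACGTT")])
print(good, bad)
```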
  • FIG. 2 shows a schematic diagram of the connection of the hole in the assembly process of the nucleic acid sequence of the present invention.
  • The two ends of the hole have the first contig x and the second contig y, respectively.
  • A is a read that overlaps both the first contig and the second contig during the hole filling process.
  • a and b are the lengths of the overlaps between the read and the first and second contigs, respectively.
  • c is the length of the part of read A that falls into the hole. The sizes of a and b are not limited, and the lengths of x and y are greater than the length of A.
  • A read A that overlaps both the first contig and the second contig is selected from the reads used for filling the hole; the read overlaps the first contig x over length a and the second contig y over length b.
  • The length of the hole between the first contig x and the second contig y is then calculated. As shown in the figure, the length c of the part of read A that falls into the hole is the hole length. In practice, the hole length must be calculated from multiple reads that overlap both the first contig and the second contig; one hole length can be calculated from each read that spans the hole.
  • A frequency table is formed, which characterizes the range of hole lengths, and the hole length with the highest frequency in the frequency table is selected as the actual hole length.
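  • Using the quantities a, b and c from FIG. 2, the hole length implied by a single spanning read can be sketched as c = len(A) - a - b. The Python function below is a minimal illustration under the assumption that the overlap lengths a and b are already known from alignment; the name `hole_from_read` is hypothetical.

```python
def hole_from_read(read, a, b):
    """Return (c, intra_hole_sequence) for one read spanning the hole.

    a: length of the read's overlap with the first contig (read prefix).
    b: length of the read's overlap with the second contig (read suffix).
    c may be zero or negative when the two contigs abut or overlap.
    """
    if a < 0 or b < 0:
        raise ValueError("overlap lengths must be non-negative")
    c = len(read) - a - b
    intra = read[a:len(read) - b] if c > 0 else ""
    return c, intra


print(hole_from_read("AAACCGGG", 3, 3))
```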
  • The calculated actual hole length must also satisfy the following two conditions.
  • The calculated actual hole length is less than 100 bp, and the number of reads supporting the actual hole length is more than double the number of reads supporting the first hole length.
  • The first frequency is the second-highest frequency in the hole length frequency table, that is, the frequency that is smaller only than the maximum frequency, and the hole length corresponding to the first frequency is the first hole length.
  • Because the hole filling method of the present invention does not assemble holes that contain repeats, it is necessary to determine whether a tandem repeat exists in the regions a and b where read A overlaps the first contig and the second contig, respectively, and whether a tandem repeat exists in the sequence c that falls into the hole. If there is no tandem repeat, the intra-hole sequence c is obtained and the hole filling is completed.
  • the number of readings required to calculate the length of the hole and overlap with the first contig and the second contig is not limited.
  • The present invention obtains, from the reads used for hole filling, the reads that overlap both the first contig and the second contig.
  • A hole length is calculated from each of these reads to obtain the actual hole length, and the hole length is then checked. If no tandem repeat exists in the regions where the reads used to calculate the hole length overlap the first contig and the second contig, nor in the part of those reads that falls into the hole, the intra-hole sequence characterizing the hole length is obtained and the hole filling is completed.
  • In another embodiment, the hole-filling method in nucleic acid sequence assembly comprises: obtaining, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig, and calculating the hole length from these reads. If the determined actual hole length is less than the first threshold preset by the system, for example 0, it is determined that the ends of the two contigs overlap. It is then determined whether the overlap is a repeat and, if so, what the repeat pattern is; otherwise the overlap length is cut off from the end of a contig.
  • the details are as follows:
  • Because of the overlap, the assembly software cannot join the two contigs into one contig, so the first contig and the second contig are output separately and the length of the hole between them becomes a negative number, forming a negative hole.
  • If the repeat check of the overlapping region of the first contig and the second contig finds no tandem repeat, the overlapping region of the first contig or of the second contig is cut off and the hole filling is completed. If there is a tandem repeat, the negative value of the hole length has to be determined, and the hole is then filled according to the user's needs.
  • In another embodiment, the method of filling a hole in the assembly of the nucleic acid sequence comprises: obtaining, from the reads used for filling the hole, the reads that overlap both the first contig and the second contig, and calculating the hole length from these reads. If the determined actual hole length is equal to the first threshold preset by the system, for example 0, the first contig and the second contig are connected to complete the hole filling.
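  • The zero-length and negative-length cases can be sketched together. In this illustrative Python snippet the overlap is trimmed from the start of the second contig, although the text allows trimming either contig; the function name `join_contigs` and the choice to return `None` when a tandem repeat is present (leaving that case for separate handling) are assumptions.

```python
def join_contigs(first, second, hole_length, overlap_has_tandem_repeat=False):
    """Join two contigs whose hole length is zero or negative.

    hole_length == 0: the contigs abut and are simply concatenated.
    hole_length < 0:  the contigs overlap by -hole_length bases; the
                      overlap is cut from the second contig before joining.
    """
    if hole_length > 0:
        raise ValueError("positive holes need an intra-hole sequence")
    if hole_length == 0:
        return first + second
    if overlap_has_tandem_repeat:
        return None  # not joined automatically in this sketch
    return first + second[-hole_length:]


print(join_contigs("ACGTAC", "ACGGT", -2))
```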
  • FIG. 3 is a schematic structural view of an embodiment of a hole-filling device in the assembly of the nucleic acid sequence of the present invention.
  • The device includes: an obtaining module 31, a calculating module 32, a first judging module 33, a first hole-filling module 34, a canceling module 35, a second hole-filling module 36, a second judging module 37, a third judging module 38, a connection module 39, and a continuation determining module 40.
  • The calculating module 32 includes a calculating unit 321 and a selecting unit 322.
  • The first hole-filling module 34 includes a comparing unit 341 and a hole-filling unit 342.
  • The obtaining module 31 is configured to obtain, from the reads used for hole filling, the reads that overlap both the first contig and the second contig. The calculating module 32 is configured to calculate the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence, that is, the part of an overlapping read that falls into the hole. The calculating unit 321 is configured to calculate a hole length from each overlapping read to form a hole length frequency table, and the selecting unit 322 is configured to select the hole length with the highest frequency in the hole length frequency table as the actual hole length.
  • The first judging module 33 is configured to determine whether a tandem repeat exists in the regions where the reads used to calculate the actual hole length overlap the first contig and the second contig, respectively, and whether a tandem repeat exists in the part of those reads that falls into the hole. The first hole-filling module 34 is configured to obtain the intra-hole sequence and complete the hole filling when every determination of the first judging module 33 is no. The comparing unit 341 is configured to compare the obtained intra-hole sequence with all reads used for hole filling to determine the accuracy of the bases of the intra-hole sequence, and the hole-filling unit 342 is configured to obtain the intra-hole sequence and complete the hole filling if the intra-hole sequence is accurate.
  • The canceling module 35 is configured not to fill the hole when a determination made during the hole filling process does not satisfy the required condition.
  • The continuation determining module 40 is configured to determine, after the calculating module 32 calculates the actual hole length, whether a tandem repeat exists in the overlapping region of the first contig and the second contig. The second hole-filling module 36 is configured to cut off the overlapping region of the first contig or of the second contig and complete the hole filling when the continuation determining module 40 determines no.
  • The second judging module 37 is configured to determine whether the actual hole length calculated by the calculating module 32 falls within a predetermined hole length range, and whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency. The third judging module 38 is configured to cause the first judging module 33 and the continuation determining module 40 to operate when both determinations of the second judging module 37 are yes.
  • The connection module 39 is configured to connect the first contig and the second contig to complete the hole filling when the calculated actual hole length is 0.
  • The modules operate as follows. First, the obtaining module 31 obtains, from the reads used for hole filling, the reads that overlap both the first contig and the second contig. The calculating module 32 then calculates the actual hole length of the hole from the overlapping reads, wherein the sequence characterizing the actual hole length is the intra-hole sequence. The hole length is calculated as follows: the calculating unit 321 calculates a hole length from each overlapping read to form a hole length frequency table, and the selecting unit 322 then selects the hole length with the highest frequency in the table as the actual hole length.
  • The second judging module 37 then determines whether the calculated actual hole length falls within the predetermined hole length range, and whether the frequency of the actual hole length in the hole length frequency table is more than double the first frequency. The two conditions are explained as follows: because the present invention assembles holes with a hole length of less than 100 bp, it is first necessary to check whether the calculated actual hole length falls within the predetermined range of less than 100 bp.
  • The first frequency here is the second-highest frequency in the hole length frequency table, that is, the frequency that is smaller only than the maximum frequency, and the hole length corresponding to the first frequency is the first hole length.
  • If both determinations of the second judging module 37 are yes, the third judging module 38 causes the first judging module 33 and the continuation determining module 40 to operate.
  • If either determination of the second judging module 37 is no, the canceling module 35 does not perform the hole filling.
  • The first judging module 33 determines whether a tandem repeat exists in the regions where the reads used to calculate the actual hole length overlap the first contig and the second contig, respectively, and whether a tandem repeat exists in the part of those reads that falls into the hole. When every determination of the first judging module 33 is no, the first hole-filling module 34 obtains the intra-hole sequence and completes the hole filling. To improve the accuracy of the hole filling, while the first hole-filling module 34 obtains the intra-hole sequence and completes the hole filling, the comparing unit 341 compares the obtained intra-hole sequence with all reads used for filling the hole to determine the accuracy of the bases of the intra-hole sequence.
  • If the intra-hole sequence is accurate, the hole-filling unit 342 obtains the intra-hole sequence and completes the hole filling.
  • If any determination of the first judging module 33 is yes, the canceling module 35 does not perform the hole filling.
  • The continuation determining module 40 determines whether a tandem repeat exists in the overlapping region of the first contig and the second contig. If the determination is no, the second hole-filling module 36 cuts off the overlapping region of the first contig or of the second contig to complete the hole filling; if the determination is yes, the canceling module 35 does not fill the hole.
  • If the actual hole length is 0, the connection module 39 connects the first contig and the second contig to complete the hole filling.
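  • The interplay of the modules can be summarized as a small decision function. This Python sketch is illustrative only: the function name `decide_action`, the returned labels, and the order in which the checks are applied are assumptions, since the text does not fix a strict ordering between the zero or negative hole cases and the other checks.

```python
def decide_action(hole_length, credible, read_overlap_repeat,
                  in_hole_repeat, contig_overlap_repeat):
    """Map the outcome of the checks to one action label.

    credible: result of the range and frequency checks (second judging module).
    read_overlap_repeat / in_hole_repeat: results of the first judging module.
    contig_overlap_repeat: result of the continuation determining module.
    """
    if not credible:
        return "skip"                      # canceling module
    if hole_length == 0:
        return "connect"                   # connection module
    if hole_length < 0:
        # Negative hole: the two contigs overlap each other.
        return "skip" if contig_overlap_repeat else "trim_overlap"
    if read_overlap_repeat or in_hole_repeat:
        return "skip"                      # canceling module
    return "fill"                          # first hole-filling module


print(decide_action(12, True, False, False, False))
```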
  • The present invention obtains, from the reads used for hole filling, the reads that overlap both the first contig and the second contig.
  • A hole length is calculated from each of these reads to obtain the actual hole length, and the hole length is then checked. If no tandem repeat exists in the regions where the reads used to calculate the hole length overlap the first contig and the second contig, nor in the part of those reads that falls into the hole, the intra-hole sequence characterizing the hole length is obtained and the hole filling is completed.
  • The level of the hole is obtained according to the size of the hole and the criteria set by the system, wherein nucleic acid sequence holes are divided into small holes, middle holes, and large holes, and the hole is filled according to the level of the nucleic acid sequence hole and the corresponding base sequence segments.
  • the holes are classified as follows: a hole shorter than 100 bp is defined as a small hole, a hole with a length between 100 bp and 1.5 kb is defined as a middle hole, and a hole longer than 1.5 kb is defined as a large hole.
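The classification above amounts to a simple threshold test, sketched here in Python. The function name `classify_hole` is hypothetical. The text does not say which class the exact boundary values belong to, so assigning 100 bp and 1.5 kb to the middle class is an assumption.

```python
SMALL_MAX_BP = 100     # holes shorter than this are "small"
MIDDLE_MAX_BP = 1500   # 1.5 kb; holes longer than this are "large"


def classify_hole(length_bp: int) -> str:
    """Classify a hole by its length in base pairs.

    Boundary handling (exactly 100 bp or exactly 1500 bp counted as
    "middle") is an assumption, not something the source specifies.
    """
    if length_bp < SMALL_MAX_BP:
        return "small"
    if length_bp <= MIDDLE_MAX_BP:
        return "middle"
    return "large"


for n in (60, 400, 2300):
    print(n, classify_hole(n))
```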
  • the above is only one of the definitions of various holes, and the size of each hole is
  • a scaffold that forms a gene sequence hole is acquired and analyzed.
  • the original scaffold is broken into contigs, and the gap between two adjacent contigs is a hole.
  • the size of the hole and the contigs before and after the hole can be accurately obtained.
  • the embodiment further divides all acquired nucleic acid sequence holes and contigs according to the user's setting, and stores the associated contigs and reads in the corresponding folders. For example, if the user sets 4 folders, the obtained nucleic acid sequence holes and contigs are divided into 4 parts, 4 folders are generated, and the associated contigs and reads are stored, part by part, in the prepared folders. After this division, each folder contains contigs and the reads for filling their holes, and in the subsequent hole filling process the contigs and reads for a hole can be obtained directly from the corresponding folder. With four parts, the memory required for each part falls to about one quarter of the original, which saves space; the search time during hole filling is also reduced, which shortens the time spent filling holes.
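The division into a user-chosen number of parts might look like the following Python sketch. It only splits an in-memory list into nearly equal consecutive chunks. Writing each chunk and its associated reads to a folder is omitted, and the name `partition` is hypothetical.

```python
def partition(items, n_parts):
    """Split `items` into `n_parts` consecutive chunks of nearly equal size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks receive one additional item.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


holes = [f"hole_{i}" for i in range(10)]
for folder_id, chunk in enumerate(partition(holes, 4)):
    print(folder_id, chunk)
```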
  • the reads used for filling the nucleic acid sequence hole are read in.
  • most of the reads used for filling the hole are paired-end (PE) reads from Solexa sequencing results; the remainder are long single-end reads from Sanger sequencing results.
  • PE reads support each other: the two reads of a pair come from the two ends of one insert. The inserts used for filling the hole generally have lengths of 180 bp, 500 bp and 800 bp.
  • when the inserts are sequenced at high coverage, an insert can be reconstructed from the overlap relationships among PE reads.
  • when the read that has a PE relationship with a given read falls within the nucleic acid sequence hole or on a contig flanking the hole, the nucleic acid sequence hole can be filled.
  • long reads: because a long read is itself long, it can span a nucleic acid sequence hole of small length. If every base of the long read is reliable, the base at each position of the long read can be used to fill such a hole exactly.
  • the positional relationship between each read and the nucleic acid sequence hole, the contig and the scaffold to which the read belongs, and the sequence information of the read itself are acquired.
  • the hole filling process specifically includes: A, hole filling of small holes; B, hole filling of middle holes; and C, hole filling of large holes.
  • the block is set to 6 bp or 12 bp.
  • a block is a window that contains a fixed number of bases and slides along the read one base at a time. Specifically, suppose a window contains X bases: at first the window takes the 1st to the Xth base; after the first slide it takes the 2nd to the (X+1)th base, and so on. Each slide moves the window forward by one base, so after the nth slide the window takes the (n+1)th to the (X+n)th base.
  • the present embodiment records the block frequency (block_freq) and the distance of the same block (b_dis) for analysis. If the frequency block_freq has a maximum value at a certain distance b_dis value, and the distance b_dis is equal to or equal to the number of bases in the block, it is determined that there is a tandem repetition in the sequence.
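The block-frequency analysis can be sketched in Python as follows, under several assumptions. The sketch slides a block along the sequence, records the distance between consecutive occurrences of the same block, and reports the distance at which the count peaks. The names `block_distance_profile` and `tandem_period` and the `min_count` cutoff are hypothetical. The further condition relating b_dis to the number of bases in the block is not applied here.

```python
from collections import Counter


def block_distance_profile(seq: str, block_size: int) -> Counter:
    """For each distance, count how often two consecutive occurrences
    of the same block lie that many bases apart."""
    last_seen = {}
    profile = Counter()
    for i in range(len(seq) - block_size + 1):
        block = seq[i:i + block_size]          # window slides one base at a time
        if block in last_seen:
            profile[i - last_seen[block]] += 1
        last_seen[block] = i
    return profile


def tandem_period(seq: str, block_size: int = 6, min_count: int = 3):
    """Return the distance with the highest block frequency, or None.

    `min_count` is an illustrative cutoff, not a value from the source.
    """
    profile = block_distance_profile(seq, block_size)
    if not profile:
        return None
    distance, count = max(profile.items(), key=lambda kv: (kv[1], -kv[0]))
    return distance if count >= min_count else None


print(tandem_period("ACG" * 6))
print(tandem_period("ACGTTGCAAGCTTAGGCATC"))
```

For the first input every 6-base block recurs three bases later, so the profile peaks sharply at distance 3. The second input contains no repeated block and yields no period.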
  • the present embodiment further infers the mode of the tandem repeat from the information obtained while judging the tandem repeat: if the sequence contains only one tandem repeat, it is determined to be a single tandem mode; if it contains multiple tandem repeats, whether they cross each other or not, it is determined to be a multi-tandem mode.
  • the present embodiment records the block frequency and determines whether a repeat exists in the hole by calculating the expected depth of blocks in the hole and analyzing the block depth distribution in the hole: if the block frequency in the hole exceeds the expected block depth in the hole by a multiple, a repeat exists.
  • the overlap calculation first uses a hash method to determine quickly whether two reads share a fixed-length short string (kmer); reads that share a kmer may overlap.
  • Kmer is defined as a contiguous sequence of bases of length k.
  • the distribution of kmers is closely related to the size of the genome, the error rate, and the heterozygosity. After that, pattern recognition is performed for each pair of reads that may overlap.
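As an illustration of the hash-based pre-screening, the following Python sketch indexes every kmer of every read in a dictionary and reports the pairs of reads that share at least one kmer. It is a minimal sketch, not the patent's implementation, and the names `kmers` and `candidate_pairs` are hypothetical. The pairs it returns are only candidates and still have to be checked for a real overlap.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq: str, k: int) -> set:
    """All contiguous substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(reads, k: int) -> set:
    """Pairs of read indices (i, j), i < j, that share at least one kmer."""
    index = defaultdict(set)               # kmer -> indices of reads containing it
    for rid, seq in enumerate(reads):
        for km in kmers(seq, k):
            index[km].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ACGTAC", "GTACGG", "TTTTTT"]
print(candidate_pairs(reads, 4))
```

Here only the first two reads share a kmer (`GTAC`), so only that pair goes on to the more expensive comparison.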
  • the number of blocks can be appropriately raised.
  • front-end extension: for the starting node read, find all reads that overlap the node read, and select the read with the smallest overlap with the node read as the seed read. The other reads should then overlap the seed read, and each of these overlaps is necessarily longer than the overlap between the seed read and the node read. If a read does not overlap the seed read, a conflict is determined; the conflict is resolved by replacing the seed read, i.e. by finding a new seed read. This method ensures the correctness of the seed read that is found.
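The seed selection rule can be sketched in Python under simplifying assumptions: exact matches only, and the seed is replaced by trying the candidate with the next-smallest overlap. That replacement strategy and the names `overlap` and `pick_seed` are hypothetical.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Longest suffix of `left` equal to a proper prefix of `right`
    (at least `min_len` bases), or 0 if there is none."""
    for n in range(min(len(left), len(right) - 1), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def pick_seed(node: str, reads, min_len: int = 3):
    """Pick the read with the smallest overlap with `node` such that every
    other overlapping read overlaps it by more than that amount."""
    cands = [(overlap(node, r, min_len), i) for i, r in enumerate(reads)]
    cands = sorted((ov, i) for ov, i in cands if ov > 0)
    for ov, i in cands:
        others = [j for _, j in cands if j != i]
        if all(overlap(reads[j], reads[i], min_len) > ov for j in others):
            return reads[i]
        # Conflict: some read does not fit this seed, so try the next one.
    return None


node = "ACGTTGCAAG"
print(pick_seed(node, ["TGCAAGCT", "AAGCTTAG"]))
print(pick_seed(node, ["TGCAAGCT", "AAGCTTAG", "CAAGTTTT"]))
```

In the first call `AAGCTTAG` overlaps the node by 3 bases while the other read overlaps it by 5, so it is accepted as the seed. In the second call the extra read is consistent with no seed, and the function returns `None` to signal an unresolved conflict.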
  • the extension then proceeds: the seed read is treated as part of the contig, and a new seed read is sought as described above. If one is found, the sequence continues to extend. Otherwise the extension of this end is judged to be over, and it is necessary to wait for the extension of the other end and then determine whether the two end extensions overlap, in order to decide whether the hole can be completed.
  • a read being extended forward may find a read that was previously a seed read as its extending read, which would cause an infinite loop of extension within that range; this is treated as a conflict.
  • the extension is terminated.
  • the back-end extension of this embodiment is similar to the front-end extension and is not described in detail here.
  • conflict identification should be as sensitive as possible, and the reads used for the hole should also have a low error rate.
  • the reads are pre-corrected to improve read quality and to ensure the accuracy of both ends of each read.
  • alignment-rate filtering: a read must have a 100% alignment rate to be extended as a seed read.
  • the alignment-rate filtering uses the following strategies:
  • the present embodiment avoids the problem that extension fails because of base errors in the seed read.
  • the overlap between reads is computed by stepwise extension of a block: a block is selected from the seed read, a target read is chosen, and the bases in the block are searched for in the target read. If they are found, the block is moved forward by one unit in the seed read and compared with the target read again, and this is repeated until no match is found. The overlap length between the seed read and the target read is then obtained. This length must reach a third threshold, which is 1 kmer, to show that the overlap between the two reads is non-accidental and credible. If the previous seed read itself contains a sequencing error, a large number of reads may be filtered out; in this case a loop is set up to replace the previous seed read.
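The stepwise block extension can be sketched in Python as follows. The sketch assumes exact matching and takes the first occurrence of the block in the target read. The names `stepwise_overlap` and `passes_third_threshold` and the `start` parameter are hypothetical.

```python
def stepwise_overlap(seed: str, target: str, start: int, block_size: int) -> int:
    """Length of the matching stretch found by sliding a block one base at a
    time from position `start` in `seed`, or 0 if the first block is absent."""
    block = seed[start:start + block_size]
    j = target.find(block)                 # position of the block in the target
    if len(block) < block_size or j < 0:
        return 0
    i, length = start, block_size
    # Move the block forward by one base in both reads while it still matches.
    while (i + 1 + block_size <= len(seed)
           and j + 1 + block_size <= len(target)
           and seed[i + 1:i + 1 + block_size] == target[j + 1:j + 1 + block_size]):
        i += 1
        j += 1
        length += 1
    return length


def passes_third_threshold(length: int, k: int) -> bool:
    """The overlap counts as credible only if it is at least one kmer long."""
    return length >= k


n = stepwise_overlap("ACGTACGGTTCA", "CGGTTCATTG", start=5, block_size=3)
print(n, passes_third_threshold(n, k=6))
```

In this toy case the block `CGG` is found at the start of the target and can be slid four more times, giving a 7-base overlap (`CGGTTCA`) that exceeds a 6-base kmer.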
  • position filtering: reads are located by their paired-end relationship, the position of each read within the hole is calculated, and the reads are filtered by position, which reduces the conflicts caused by long repeated segments in the hole. To ensure the accuracy of the calculated positions within the hole, the embodiment can set strict filtering conditions.
  • read length filtering: among the reads, PE reads are short, while single-end reads (single reads) are usually longer. Longer single-end reads overlap one end of the hole. In this embodiment, short paired-end reads are preferentially extended in the inner region of the hole, and long single-end reads are preferentially extended at both ends of the hole.
  • end filtering: according to the expected hole length, if the extended read overlaps the other end too early, a non-overlapping read is selected instead, i.e. a read positioned just behind the extended read that has no overlap with it and whose placement does not conflict with the expected hole length. This ensures that the repeat region is crossed. In this embodiment, end filtering can occur only once.
  • processing and recognition of short similar repeats: short similar repeats are usually shorter than 50 bp and lie relatively close together, which can eventually cause base deletions in the nucleic acid sequence.
  • the present embodiment preferentially selects a read with a longer overlap as the seed read for extension, which effectively avoids the problem of short similar repeats.
  • all three credibility levels described above may be present. The first credibility level has the highest quality, but this does not mean that it is correct.
  • the second credibility level has the second-highest quality; likewise, this does not mean that it is correct. Therefore, the connection situations in the hole in this embodiment are classified and processed according to the actual use situation.
  • the connections in the hole are divided into three categories: the contigs at both ends are directly connected, one end extends to the contig at the other end, or both ends extend.
  • a credibility, and the sequence is connected; if there is no first credibility but there is a second credibility, the second credibility is selected and the sequence is connected; if there is neither a first nor a second credibility but there is a third credibility, the third credibility is selected and the sequence is connected.
  • the large hole is divided into multiple holes, each of which is processed in the same way as a middle hole.
  • the longest insert supporting PE reads is 800 bp.
  • when the length of the hole exceeds 1.5 kb, after the lengths by which the inserts overlap the contigs at the two ends are subtracted, two 800 bp inserts cannot overlap each other; that is, it is impossible to find a complete path that fully fills a large hole.
  • the large hole is divided into a plurality of middle holes, the middle holes are assembled separately, and finally the assembly results are connected. The details are as follows:
  • each block is assembled in the manner used for a middle hole.
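The 1.5 kb limit for large holes follows from simple arithmetic, sketched here in Python. The function name `inserts_can_meet` and the `anchor` parameter (the number of insert bases lying on the flanking contig) are hypothetical, and the 50 bp anchor in the example is an illustrative value, not a figure from the source.

```python
def inserts_can_meet(hole_len: int, insert_len: int = 800, anchor: int = 0) -> bool:
    """True if two inserts, one anchored on each flanking contig, reach far
    enough into the hole to overlap each other."""
    reach = insert_len - anchor      # bases of each insert that lie inside the hole
    return 2 * reach > hole_len


# With an assumed 50 bp anchor on each contig, two 800 bp inserts
# cover at most 2 * 750 = 1500 bp of the hole.
print(inserts_can_meet(1400, anchor=50))
print(inserts_can_meet(1600, anchor=50))
```

Under these assumptions a hole longer than the combined reach of the two inserts leaves an uncovered middle region, which is why a large hole has to be split into smaller pieces before assembly.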

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a method for closing a gap (hole) in a nucleotide sequence assembly. One end of the gap has a first contig and the other end has a second contig. The method comprises the steps of: acquiring, from the reads used for gap closing, a read that overlaps both the first contig and the second contig (101); calculating the actual gap length from the read that overlaps the first contig and the second contig, the sequence characterizing the actual gap length being the intra-gap sequence, i.e. the part of that read that lies within the gap (102); judging whether a tandem repeat exists in the region where the read used to calculate the actual gap length overlaps the first contig and in the region where it overlaps the second contig, and whether a tandem repeat exists in the part of that read that lies within the gap (103); and, if all of these judgments are negative, acquiring the intra-gap sequence to complete the gap closing (104). The invention also relates to a device for closing a gap in a nucleotide sequence assembly.
PCT/CN2011/083186 2011-11-29 2011-11-29 Method and device for closing a gap in a nucleotide sequence assembly WO2013078625A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/083186 WO2013078625A1 (fr) 2011-11-29 2011-11-29 Method and device for closing a gap in a nucleotide sequence assembly

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/083186 WO2013078625A1 (fr) 2011-11-29 2011-11-29 Method and device for closing a gap in a nucleotide sequence assembly

Publications (1)

Publication Number Publication Date
WO2013078625A1 (fr) 2013-06-06

Family

ID=48534610

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/083186 WO2013078625A1 (fr) 2011-11-29 2011-11-29 Method and device for closing a gap in a nucleotide sequence assembly

Country Status (1)

Country Link
WO (1) WO2013078625A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070082358A1 (en) * 2005-10-11 2007-04-12 Roderic Fuerst Sequencing by synthesis based ordered restriction mapping

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DUAN, BIAO: "Study on No.27, 28 Linkage Group Based on the Fine Map for the Silkworm (Bombyx Mori) Genome", CHINA DOCTORAL DISSERTATIONS FULL-TEXT DATABASE (BASIC SCIENCES), March 2011 (2011-03-01), pages A006 - 49 *
KOREN ET AL.: "An algorithm for automated closure during assembly", BMC BIOINFORMATICS, vol. 457, November 2010 (2010-11-01), pages 1 - 7, XP021071784 *
ZHAO, DONGSHENG ET AL.: "Computing resource of AMMS biomedicine super-computing center and its applications", BULLETIN OF THE ACADEMY OF MILITARY MEDICAL SCIENCES, vol. 29, no. 4, August 2005 (2005-08-01), pages 363 - 367 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11876617

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/11/2014)

122 Ep: pct application non-entry in european phase

Ref document number: 11876617

Country of ref document: EP

Kind code of ref document: A1