WO2010066114A1 - 一种测序序列纠错方法、系统及基因组装设备 - Google Patents

一种测序序列纠错方法、系统及基因组装设备 Download PDF

Info

Publication number
WO2010066114A1
WO2010066114A1 PCT/CN2009/001426 CN2009001426W WO2010066114A1 WO 2010066114 A1 WO2010066114 A1 WO 2010066114A1 CN 2009001426 W CN2009001426 W CN 2009001426W WO 2010066114 A1 WO2010066114 A1 WO 2010066114A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
high frequency
frequency short
sequencing
short string
Prior art date
Application number
PCT/CN2009/001426
Other languages
English (en)
French (fr)
Inventor
李瑞强
朱红梅
李胜霆
王俊
杨焕明
汪建
Original Assignee
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院 filed Critical 深圳华大基因研究院
Priority to EP09831391.9A priority Critical patent/EP2377948B1/en
Priority to US13/132,038 priority patent/US8751165B2/en
Priority to JP2011539874A priority patent/JP5344774B2/ja
Publication of WO2010066114A1 publication Critical patent/WO2010066114A1/zh
Priority to HK12101570.2A priority patent/HK1161313A1/zh

Links

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • C12Q1/6874Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to the field of genetic engineering technology, and in particular, to a sequencing sequence error correction method and a system gray genetic assembly device.
  • An object of the embodiments of the present invention is to provide a sequencing sequence error correction method, which aims to solve the problem that the utilization ratio of the sequencing sequence of the existing sequencing sequence error correction method is relatively low.
  • the embodiment of the present invention is implemented as follows, a sequencing sequence error correction method, the method comprising the steps of: receiving a sequencing sequence, constructing a high frequency short string table according to a preset high frequency threshold; traversing the received sequence of each sequence Combining the high frequency short string table to find each region in the sequencing sequence with the highest number of consecutive high frequency short strings in each sequencing sequence; Constructing a left sequence of all high frequency short strings on the left side of the searched region according to the corresponding received sequencing sequence and the high frequency short string table, and/or the structure on the right side of the region is all high The right sequence of the short string;
  • Another object of the embodiments of the present invention is to provide a sequencing sequence error correction system, the system comprising: a high frequency short string statistics unit, configured to receive a sequencing sequence, and construct a
  • Another object of embodiments of the present invention is to provide a gene assembly apparatus comprising the above-described sequencing sequence error correction system.
  • the high frequency short string table is constructed according to the preset high frequency threshold, and the sequence of the discontinuous high frequency short string region in each sequencing sequence is recombined into a continuous high frequency short according to the constructed high frequency short string table.
  • the sequence of the string, the sequence after recombination retained as much as possible the number and length of the original sequencing sequence, improved the utilization of the sequence, and confirmed by experiments that the proportion and depth of the error-free sequence in the sequence after error correction were Great improvement; error correction processing
  • the resulting sequence can be cut into longer high frequency short strings, resulting in fewer high frequency short strings, thereby reducing memory usage during subsequent short sequence assembly.
  • FIG. 1 is a flowchart showing an implementation of a sequencing sequence error correction method according to an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a left tree according to an embodiment of the present invention. The structural diagram of the wrong system.
  • the high frequency short string table is constructed according to the preset high frequency threshold, and the sequence of the discontinuous high frequency short string region in each sequencing sequence is reconstructed into a continuous high frequency short according to the constructed high frequency short string table.
  • the sequence of strings is constructed according to the preset high frequency threshold, and the sequence of the discontinuous high frequency short string region in each sequencing sequence is reconstructed into a continuous high frequency short according to the constructed high frequency short string table. The sequence of strings.
  • step S101 a sequencing sequence is received, and a high frequency short string (kmer) table is constructed according to a preset high frequency threshold.
  • kmer short string
  • step S102 traversing each of the received sequencing sequences, and searching for each region in each sequencing sequence of each sequencing sequence with the highest frequency of the high frequency short string in combination with the high frequency short string table; in step S103, according to the corresponding receiving Sequencing sequence and high frequency short string table, in The left side of the found area is constructed by a left sequence of high frequency short strings, and/or a right sequence of high frequency short strings is constructed on the right side of the area; in step S104, the area and the structure are constructed The left and/or right sequences are combined into a corresponding sequencing sequence.
  • the foregoing step S101 is specifically:
  • a short-frequency string of short strings obtained by cutting and appearing more than a preset high-frequency threshold.
  • the length of each sequence of the received sequence is not limited in the logic of the processing program, but is generally below 200 base length (bp)
  • the preset length n of the short string is 17 bp
  • the preset high frequency threshold is 5 times. It is considered that a short string of five or more times is a high frequency short string, and a high frequency short string is added to the high frequency short string table.
  • the preset length n of the short string can be from 1 to less than any integer within the base length of the sequencing sequence, but the memory and operation time overhead is increased when the value of n is greater than 17 bp, and the value of n is increased.
  • n is preferably 17 bp.
  • the high frequency threshold can be determined according to the frequency distribution of the short string cut. The frequency distribution should theoretically have two peaks, one peak is due to sequencing error, and the second peak is due to sequencing depth, so The first valley is generally taken as the high frequency threshold. Then, the region with the highest number of consecutive high-frequency short strings of each sequencing sequence is searched, and step S102 is:
  • the longest region found in each sequencing sequence is taken as the region with the highest number of high-frequency short strings. It is assumed here that the region with the highest frequency of short-frequency strings in each sequencing sequence is [sl, s2], wherein Sl, s2 are the starting bases of the longest continuous high-frequency short-string region, and the number of bases corresponding to the first base of the corresponding sequencing sequence. If a sequencing sequence is Where 1 ⁇ is the base length of the sequencing sequence, Xi represents the ith base of the sequence, and the longest continuous sequence of high-frequency short strings is [26, 46], then X 26 X 27 . . X 46 is the longest high frequency sequence in this sequencing sequence.
  • Step 1 From the corresponding sequencing sequence The ith base starts to take the sequence of n-1 length as the root node of the tree, and the four sub-bases of C, G, and T are used to construct a left-side tree with a depth of si for the leaves of each node.
  • the tree is shown in Figure 2, where the depth si is 26;
  • Step 2 Traverse the left tree, find a path that is all high-frequency short strings, and construct a high-frequency short string from the leaf nodes according to the path. sequence.
  • the tree is traversed from the root node, the root node is a sequence N 1 of length n-1, and its child nodes are four bases A, C, G, and T in turn, and whether the short string kmeri - LrH ⁇ is The high frequency short string, that is, whether the short string is determined in the high frequency short string table, and if not, the path corresponding to the corresponding base is ended; if yes, whether the value further determined is related to the corresponding sequencing sequence X2X3 across X 49 X 5 .
  • the smallest path found is a path that is all high-frequency short strings, and the sequence traversed from the leaf node to the root node according to the path is the left sequence of all high-frequency short strings that need to be constructed.
  • the sequence traversed from the leaf node to the root node according to the path is the left sequence of all high-frequency short strings that need to be constructed.
  • multiple paths with equal total scores and the smallest scores are obtained, one is randomly taken, and then the left sequence of all high-frequency short strings that need to be constructed is traversed from the leaf nodes to the root nodes.
  • Step 4 Take the sequence of n-1 length from the ⁇ s 2 bases of the corresponding sequencing sequence as the root node of the tree, and construct a depth of l n with the four bases of C, G and T as the nodes of each node. - ( s2 - 1 ) the right tree, where l n is the base length of the test sequence, and its construction is the same as step 1 above, and will not be described again; Step 4. Traverse the right tree and find a high frequency A short-string path is constructed according to the path from the root node to the right sequence of the high-frequency short-string. The manner of finding the minimum path corresponds to step 2 above, and details are not described herein.
  • the obtained left sequence is added to the corresponding longest high frequency sequence X sl X sl+1 ?? X s2 left And adding the obtained right sequence to the right of the longest high frequency sequence X sl X sl+1 ... X s2 to obtain the corresponding sequencing sequence after error correction processing.
  • the region with the highest number of consecutive high frequency short sequences in the corresponding sequencing sequence is [1, s2] or [si, 1 radiation], that is, the region is on the left or right side of the corresponding sequencing sequence, then only
  • Table 2 It can be seen from Table 1 and Table 2 that after error correction processing, the error-free sequence is in the measured sequence. The proportion increased by about 30%, and the error-free sequence depth increased by about 10%.
  • the following is a memory resource estimation required to implement error correction processing by using the sequencing sequence error correction method provided by the embodiment of the present invention.
  • the short string is 17 bases in length
  • the memory is occupied 16Go, since each thread is processing a file. All the sequences stored in the file need to be read into the memory.
  • One sequencing sequence occupies 50 bytes, the sequence name occupies 50 bytes, and each file stores 10M sequencing sequences. Then, the sequencing sequence stored in one file is performed. Error correction processing requires 1G of memory.
  • each thread also has a separate dynamic programming table that occupies 1G of memory, so one thread occupies 2G of memory, and needs to occupy 24G of memory when 4 threads are opened by default.
  • the time consumption of the short-string frequency and the output frequency table is related to the file capacity and the input/output status. It takes about 100s to process a file and 606 files in the African genome sequence. The first step of outputting the frequency table takes 15 hours.
  • the memory consumed by the subsequent short-sequence assembled genome can be reduced by half.
  • the short-frequency short strings are merged by the high-frequency short strings (that is, the low-frequency short strings are corrected to the high-frequency short strings), and the subsequent assembly strategy only needs to cut the sequence into longer short strings (for example, Assembly of 25 bases in length will reduce memory usage.
  • the second step it takes only 7 hours to split the 606 files of the African human genome sequence into 6 parts by 6 threads, and the error correction process takes a total of 22 hours.
  • all or part of the steps of the foregoing embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium. Storage medium, such as ROM/RAM, disk, CD, etc. This program is used to perform the following steps:
  • FIG. 3 shows the structure of the sequencing sequence error correction system provided by the embodiment of the present invention, and only parts related to the embodiment of the present invention are shown for convenience of description.
  • the system can be used for genetic assembly equipment, can be a software unit, a hardware unit or a combination of hardware and software running in these devices, or can be integrated into these devices as a stand-alone pendant or run in the application system of these devices.
  • the high-frequency short-string statistical unit 301 receives the sequencing sequence and constructs a high-frequency short-string table according to a preset high-frequency threshold. The implementation manner is as described above, and details are not described herein.
  • the high frequency region searching unit 302 traverses each of the received sequencing sequences, and combines the high frequency short string table to find each region in each sequencing sequence with the highest number of consecutive high frequency short strings.
  • the sequence constructing unit 303 constructs a left sequence of all high frequency short strings on the left side of the searched area according to the corresponding received sequencing sequence and the high frequency short string table, and/or the entire right side of the area is constructed The right sequence of high frequency short strings.
  • Sequence reduction unit 304, the region and the left and/or right sequence of the construct The columns are combined into the corresponding sequencing sequence.
  • the high frequency short string statistics unit 301 includes: a short string cutting module 3011, which receives the sequencing sequence, and cuts each received sequencing sequence into a short string of a preset length.
  • the high-frequency short-string acquisition module 3012 constructs the high-frequency short-string table according to a short string obtained by cutting and having a number of occurrences exceeding a preset high-frequency threshold, and the implementation manner is as described above, and is not further described.
  • the sequence construction unit 303 includes: a left tree construction module 3031, taking a sequence of n-1 length from the si base of the corresponding sequencing sequence as a root node of the tree, and four bases of C, G, and T A tree of depth si is constructed for the leaves of each node, and the definitions of sl and n and the implementation manner of the left tree construction module 3031 are as described above, and are not described again.
  • the left sequence construction module 3032 traverses the tree on the left side to find a path of a high-frequency short string, and constructs a left sequence of all high-frequency short strings from the leaf node according to the path, and the implementation manner is as described above. Let me repeat.
  • the right tree structure module 3033 takes the sequence of n-1 length from the ⁇ s base of the corresponding sequencing sequence as the root node of the tree, and constructs a leaf with the four bases of C, G and T as the nodes.
  • the definition of the s2, n, 1 are as described above and will not be described again.
  • the right sequence construction module 3034 traverses the right tree, finds a path that is all high frequency short strings, and constructs a right sequence of all high frequency short strings from the root node according to the path, and the implementation manner is as described above, no longer Narration.
  • the high frequency short string table is constructed according to the preset high frequency threshold, and the knot is The constructed high-frequency short-string table recombines the sequences of the non-contiguous high-frequency short-string regions in each sequencing sequence into a sequence of continuous high-frequency short strings, and the reconstructed sequence retains the number and length of the original sequencing sequences, thereby improving The utilization of the sequence, and experimentally confirmed that the proportion and depth of the error-free sequence in the sequence after error correction are greatly improved; the sequence after error correction can be cut into longer high-frequency short strings, which is reduced in the subsequent The use of memory during short sequence assembly.
  • the present invention also provides a gene assembly device comprising the aforementioned sequencing sequence error correction system, which consumes less memory during assembly than the uncorrected sequencing sequence, because the sequence after error correction can be cut into High-frequency short strings of longer base lengths are assembled and fewer high-frequency short strings are obtained, reducing memory usage.
  • a gene assembly device comprising the aforementioned sequencing sequence error correction system, which consumes less memory during assembly than the uncorrected sequencing sequence, because the sequence after error correction can be cut into High-frequency short strings of longer base lengths are assembled and fewer high-frequency short strings are obtained, reducing memory usage.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Organic Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Description

一种测序序列纠错方法、 系统及基因组装设备 技术领域 本发明属于基因工程技术领域, 尤其涉及一种测序序列纠错方 法、 系统灰基因组装设备。
背景技术 基于现有的基因测序技术, 碱基测错的可能性是存在的, 碱基 测错后对于后续的数据分析、 短序列组装等都存在一定的影响。 由 于在高测序深度情况下, 无错序列包含低频短串的概率极低, 因此 现有的糾错策略是将测序序列中低频短串简单的屏蔽掉, 删除含有 一定比例低频短串的序列, 而实际上并没有进行有效的纠正, 测序 序列的利用率比较低。
发明内容 本发明实施例的目的在于提供一种测序序列纠错方法, 旨在解 决现有测序序列纠错方法测序序列的利用率比较低的问题。 本发明实施例是这样实现的, 一种测序序列糾错方法, 所述方 法包括下述步骤: 接收测序序列, 根据预设的高频阈值构造高频短串表; 遍历接收到的各测序序列, 结合所述高频短串表在各测序序列 上查找每个测序序列中连续为高频短串最多的区域; 根据相应接收到的测序序列和所述高频短串表, 在查找到的所 述区域左侧构造全是高频短串的左序列, 和 /或在所述区域的右侧构 造全是高频短串的右序列;
Figure imgf000004_0001
本发明实施例的另一目的在于提供一种测序序列纠错系统, 所 述系统包括: 高频短串统计单元, 用于接收测序序列, 根据预设的高频阔值 构造高频短串表; 高频区域查找单元, 用于遍历接收到的各测序序列, 结合所述 高频短串表在各测序序列 Jl每个测序序列中查找连续为高频短串最 多的区域; 序列构造单元, 用于根据相应接收到的测序序列和所述高频短 串表, 在查找到的所述区域左侧构造全是高频短串的左序列, 和 /或 在所述区域的右侧构造全是高频短串的右序列; 以及 序列組合单元, 用于将所述区域与构造的所述左序列和 /或右序 列组合成相应测序序列。 本发明实施例的另一目的在于提供包含上述测序序列纠错系统 的基因組装设备。 在本发明实施例中, 根据预设的高频阈值构造高频短串表, 結 合构建的高频短串表将各测序序列中非连续高频短串区域的序列重 新组合为连续高频短串的序列, 重组后的序列尽量多的保留了原测 序序列的个数和长度, 提高了序列的利用率, 并通过实验证实了纠 错处理后的序列中无错序列的比例和深度均有很大提高; 纠错处理 后的序列可以切割成更长的高频短串, 获得更少的高频短串, 从而 降低在后续短序列组装过程中内存的使用。
附图说明 图 1是本发明实施例提供的测序序列糾错方法的实现流程图; 图 2是本发明实施例提供的左侧树的结构示意图; 图 3是本发明实施例提供的测序序列纠错系统的结构图。
具体实施方式 为了使本发明的目的、 技术方案及优点更加清楚明白, 以下结 合附图及实施例, 对本发明进行进一步详细说明。 应当理解, 此处 所描述的具体实施例仅仅用以解释本发明, 并不用于限定本发明。 在本发明实施例中, 根据预设的高频阈值构造高频短串表, 结 合构建的高频短串表将各测序序列中非连续高频短串区域的序列重 新构造为连续高频短串的序列。 图 1 示出了本发明实施例提供的测序序列纠错方法的实现流 程, 详述如下: 在步骤 S101中,接收测序序列,根据预设的高频阔值构造高频 短串 (kmer )表; 在步驟 S102中,遍历接收到的各测序序列, 结合高频短串表在 各测序序列上每个测序序列中查找连续为高频短串最多的区域; 在步骤 S103中,根据相应接收到的测序序列和高频短串表, 在 查找到的区域左侧构造全是高频短串的左序列, 和 /或在所述区域的 右侧构造全是高频短串的右序列; 在步驟 S104 中, 将所述区域与构造的所述左序列和 /或右序列 组合成相应测序序列。 在本发明实施例中, 上述步骤 S101具体为:
1.接收测序序列 , 将接收到的各测序序列逐个碱基切割成预设 长度的短串;
2.将切割得到的且出现次数超过预设的高频阔值的短串构造出 高频短串表。 这里, 接收到的各测序序列长度在处理程序逻辑上没有限制, 但一般在 200碱基长度 (bp)以下, 短串的预设长度 n为 17 bp, 预设 的高频阈值为 5次, 认为出现 5次以上的短串即为高频短串, 将高 频短串添加到高频短串表。 当然, 短串的预设长度 n可以取从 1到 小于测序序列碱基长度内的任意整数, 但是在 n的取值大于 17 bp 时内存和运算时间的开销会加大, 在 n的取值小于 17 bp时纠错效 果不理想, 所以 n最好取 17bp。 高频阔值可以根据切割成的短串的 频率分布来确定, 频率分布在理论上应该存在两个峰值, 笫一个峰 是由于测序错误造成的, 第二个峰是由于测序深度造成的, 所以一 般取第一个谷值为高频阈值。 接着, 查找各测序序列连续高频短串最多的区域, 步骤 S102具 体为:
1.遍历接收到的各测序序列, 结合高频短串表, 在各测序序列 上查找连续为高频短串的区域, 即顺序遍历测序序列的短串, 如果 该短串出现在高频短串表中, 则认为该短串为高频短串; 否则, 认 为该短串不是高频短串, 这样遍历完各测序序列即可得到各测序序 列相应的连续为高频短串的区域;
2.在各测序序列中取查找到的最长的区域作为其连续为高频短 串最多的区域, 这里假设各测序序列中连续为高频短串最多的区域 为 [sl, s2], 其中 sl、 s2 为查找到的最长的连续为高频短串的区域 的起始碱基、 结^基距离相应测序序列首个碱基的数目。 如果一个测序序列为
Figure imgf000007_0001
其中 1η为该测序序列 的碱基长度, Xi表示该序列的第 i个碱基, 该测序序列最长的连续 为高频短串的区域为 [26, 46], 则 X26X27...... X46为该测序序列中最 长的高频序列。 然后, 根据原测序序列和高频短串表, 在 [si, s2】的左侧和右侧 各构造一条全是高频短串的序列, 上述步骤 S103具体为: 步骤 1.从相应测序序列的第 si个碱基开始取 n-1长度的序列 作为树的根节点, 以 、 C, G、 T四种减基为各节点的叶子构造一 棵深度为 si的左侧树, 其构造的树如图 2所示, 这里, 深度 si即 为 26; 步驟 2.遍历左侧树, 找到一条全是高频短串的路径, 根据该路 径从叶子节点向上构造全是高频短串的左序列。 这里, 从根节点开始向下遍历树, 根节点为长度为 n-1的序列 N1 ? 其子节点 依次为 A、 C、 G、 T四种碱基, 考察短串 kmeri - LrH^是否是高频短串, 即判断高频短串表中是否有该短串, 如 果否, 则结束相应碱基对应的路径; 如果是, 则进一步判断 的值 是否与相应测序序列 X2X3...... X49X5。中相应减基 Xsl4的值相同, 如果相同则 1级节点分数 scores 0, 否则 1级节点分数 scorei = 1, 并继续在 kme 左端取 n-1长度的序列 N2, 按照上述方式考察短串 kmer2 = L2 + N2, 其子节点 L2依次为 A、 C, G、 T四种碱基。 按照 该规则向叶子节点迭代、 判断, 并在迭代结束后, 找到一条总分数 sCOre=§scorei最小路径, 其中 SCor 为相应路径中笫 i级节点的分数。
i=l
找到的最小路径即为全是高频短串的路径, 根据该路径从叶子节点 向根节点遍历得到的序列即为需要构造的全是高频短串的左序列。 当然, 如果迭代结束后, 得到多条总分数 score都相等且都最小的 路径, 则随机取一条, 然后从叶子节点向根节点遍历得到需要构造 的全是高频短串的左序列。 当然, 也可以从下向上遍历树, 来查找 一条全是高频短串的路径。 步骤 3. 从相应测序序列的笫 s2个碱基开始取 n-1长度的序列 作为树的根节点, 以 、 C, G、 T四种碱基为各节点的叶子构造一 棵深度为 ln - ( s2 - 1 )的右侧树,其中 ln为该测试序列的碱基长度, 其构造方式与上述步骤 1相同, 不再赘述; 步驟 4.遍历右侧树, 找到一条全是高频短串的路径, 根据该路 径从根节点向下构造全是高频短串的右序列, 其查找最小路径的方 式与上述步骤 2相应, 不再赘述。 在得到相应测序序列左侧和右侧的全是高频短串的序列后, 将 得到的左序列添加到相应最长的高频序列 XslXsl+1...... Xs2左边, 并 将得到的右序列添加到相应最长的高频序列 XslXsl+1...... Xs2右边, 即得到经过纠错处理后的相应测序序列。 当然, 如果相应测序序列中连续为高频短串最多的区域为 [1 , s2】或 [si, 1„], 即该区域在相应测序序列的左侧或右侧, 则只需要在
[1, s2】的右侧构造一条全是高频短串的右序列, 或者只需要在 [sl, ln]的左侧构造一条全是高频短串的左序列。 此时, 在还原相应测序 序列时, 只需要将得到的左序列添加到相应最长的高频序列左边, 或者将得到的有序列添加到相应最长的高频序列右边。
下面通过实验来说明本发明测序序列纠错方法的技术效果, 在 这个实验中使用本发明上迷实施例提供的方法对人类基因质量控制 序列 (Human control BAC ) 和非洲人类基因組序列 (Human genome )进行纠错处理, 纠错前的数据对比如表 1所示, 糾错后的 数据对比如表 2所示:
Figure imgf000009_0001
表 1
Figure imgf000009_0002
表 2 由表 1、 表 2可知, 经纠错处理后, 使无错序列在被测序列中 所占比例提高 30%左右, 无错序列深度提高约 10%。 下面是采用本发明实施例提供的测序序列纠错方法实现纠错处 理时所需要的内存资源估计, 在短串为 17碱基长度时, 占用内存 16Go 另外, 由于每个线程在处理一个文件时需要将该文件中存储 的所有序列读入内存, 殳一条测序序列占用 50字节, 其序列名称 占用 50个字节, 每个文件存储 10M个测序序列, 那么对一个文件 中存储的测序序列进行糾错处理需要占用 1G 内存。 并且, 每个线 程还有单独的动态规划表占用 1G内存,那么一个线程占用 2G内存, 在默认开设 4个线程时需要占用 24G内存。 另外, 统计短串频数、 输出频数表的耗时跟文件容量的多少和 输入 /输出状况有关。 处理一个文件约需 100s, 非洲人基因組序列总 共有 606个文件, 第一步输出频数表需要耗时 15h。
错处理后, 后续的短序列组装基因组所耗内存可以降低一半。 并且, 纠完错后低频的短串被高频的短串合并了 (即低频短串被纠正为高 频短串) , 后面的組装策略只需要将序列切割成更长的短串 (例如 25碱基长度 )进行组装, 这样就会降低内存的使用。 进一步地, 为了提高纠错处理的速度, 可采用多个线程将所有 需纠错文件拆分处理。处理一个文件约需 1000s,采用 4个线程处理 100个文件需要耗时 1000s*100/4 = 25000s = 71ι。 第二步采用 6个线 程将非洲人类基因组序列的 606个文件拆分成 6份处理时只需要耗 时 7h, 纠错处理总共耗时 22h。 本领域普通技术人员可以理解, 实现上述实施例方法中的全部 或部分步驟是可以通过程序来指令相关的硬件来完成, 所述的程序 可以在存储于一计算机可读取存储介质中, 所述的存储介质, 如 ROM/RAM, 磁盘、 光盘等, 该程序用来执行如下步驟:
1.接收测序序列, 根据预设的高频阁值构造高频短串表;
2.遍历接收到的各测序序列, 结合所迷高频短串表在各测序序 列上查找每个测序序列中连续为高频短串最多的区域;
3. 根据相应接收到的测序序列和所述高频短串表, 在查找到的 所述区域左侧构造全是高频短串的左序列, 和 /或在所述区域的右侧 构造全是高频短串的右序列;
4. 将所述区域与构造的所述左序列和 /或右序列組合成相应测 序序列。 图 3示出了本发明实施例提供的测序序列纠错系统的结构, 为 了便于说明仅示出了与本发明实施例相关的部分。 该系统可以用于基因组装设备, 可以是运行于这些设备内的软 件单元、 硬件单元或者软硬件相结合的单元, 也可以作为独立的挂 件集成到这些设备中或者运行于这些设备的应用系统中, 其中: 高频短串统计单元 301, 接收测序序列, 根据预设的高频阈值 构造高频短串表, 其实现方式如上所述, 不再赘述。 高频区域查找单元 302, 遍历接收到的各测序序列, 结合高频 短串表在各测序序列上每个测序序列中查找连续为高频短串最多的 区域。 序列构造单元 303, 根据相应接收到的测序序列和高频短串表, 在查找到的区域左侧构造全是高频短串的左序列, 和 /或在所述区域 的右侧构造全是高频短串的右序列。 序列还原单元 304, 将所述区域与构造的所述左序列和 /或右序 列组合成相应测序序列。 其中, 高频短串统计单元 301包括: 短串切割模块 3011, 接收测序序列, 将接收到的各测序序列逐 个碱基切割成预设长度的短串。 高频短串采集模块 3012, 将根据切割得到的且出现次数超过预 设的高频阈值的短串构造出所述高频短串表,其实现方式如上所述, 不再赞述。 另外, 序列构造单元 303包括: 左侧树构造模块 3031, 从相应测序序列的第 si个碱基开始取 n-1长度的序列作为树的根节点, 以 、 C、 G、 T四种碱基为各节 点的叶子构造一棵深度为 si的树, 其 sl、 n的定义及该左侧树构造 模块 3031的实现方式如上所述, 不再赘述。 左序列构造模块 3032, 遍历左侧树, 找到一奈全是高频短串的 路径, 根据该路径从叶子节点向上构造全是高频短串的左序列, 其 实现方式如上所述,. 不再赘述。 右侧树构造模块 3033, 从相应测序序列的笫 s2个碱基开始取 n-1长度的序列作为树的根节点, 以 、 C, G、 T四种碱基为各节 点的叶子构造一棵深度为 1„- ( s2 - l ) 的右侧树, 其 s2、 n、 1„的 定义及该左侧树构造模块 3031的实现方式如上所述, 不再赘述。 右序列构造模块 3034, 遍历右侧树, 找到一条全是高频短串的 路径, 根据该路径从根节点向下构造全是高频短串的右序列, 其实 现方式如上所述, 不再赘述。 在本发明实施例中, 根据预设的高频阈值构造高频短串表, 结 合构建的高频短串表将各测序序列中非连续高频短串区域的序列重 新组合为连续高频短串的序列, 重组后的序列保留了原测序序列的 个数和长度, 提高了序列的利用率, 并通过实验证实了糾错处理后 的序列中无错序列的比例和深度均有很大提高; 纠错处理后的序列 可以切割成更长的高频短串, 降低在后续短序列组装过程中内存的 使用。 本发明还提供了一种包含前述测序序列纠错系统的基因组装设 备,在组装时消耗内存可以比对未糾错的测序序列的内存消耗要少, 这是由于纠错后的序列可以切割成更长碱基长度的高频短串来进行 组装, 并获得更少的高频短串, 从而降低了内存的使用。 以上所述仅为本发明的较佳实施例而已,并不用以限制本发明, 凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等, 均应包含在本发明的保护范围之内。

Claims

1. 一种测序序列纠错方法, 其特征在于, 所述方法包括下述步 骤:
接收测序序列, 根据预设的高频阈值构造高频短串表; 遍历接收到的各测序序列, 结合所述高频短串表在各测序序列 上查找每个测序序列中连续为高频短串最多的区域;
根据相应接收到的测序序列和所述高频短串表, 在查找到的所 述区域左侧构造全是高频短串的左序列, 和 /或在所述区域的右侧构 造全是高频短串的右序列; 序列: 、, ^一 A5、 、
2. 如权利要求 1所述的方法, 其特征在于, 所述接收测序序列, 根据预设的高频阈值构造高频短串表的步驟具体为:
接收测序序列, 将接收到的各测序序列逐个碱基切割成预设长 度的短串;
根据切割得到的且出现次数超过预设的高频阈值的短串构造出 所述高频短串表。
3. 如权利要求 2所述的方法, 其特征在于, 所迷预设的高频阈 值根据切割成的预设长度的短串的频率分布确定, 所述预设长度为 17个碱基长度。
4. 如权利要求 1 所述的方法, 其特征在于, 所迷根据相应接收 到的测序序列和所述高频短串表, 在查找到的所述区域左侧构造全 是高频短串的左序列的步驟具体为:
从相应测序序列的第 si个碱基开始取 n-1长度的序列作为树的 才艮节点, 以 、 C, G、 T四种碱基为各节点的叶子构造一棵深度为 sl的左侧树;
遍历所述左侧树, 找到一条全是高频短串的路径, 根据所述路 径从叶子节点向上构造全是高频短串的左序列;
所述根据相应接收到的测序序列和所述高频短串表, 在所述区 域的右侧构造全是高频短串的右序列的步骤具体为:
从相应测序序列的第 s2个碱基开始取 n-1长度的序列作为树的 才艮节点, 以 、 C、 G、 T四种碱基为各节点的叶子构造一棵深度为 ln - ( s2 - 1 ) 的右侧树;
遍历所述右侧树, 找到一条全是高频短串的路径, 根据所述路 径从根节点向下构造全是高频短串的右序列;
其中, sl、 s2分别为查找到的所述连续为高频短串最多的区域 的起始碱基、 结束碱基距离相应测序序列首个碱基的数目, n 为所 述高频短串的碱基长度, 1„为相应测序序列的碱基长度。
5. 如权利要求 1 所述的方法, 其特征在于, 所述接收到的测序 序列的长度小于等于 200碱基长度。
6. —种测序序列纠错系统, 其特征在于, 所述系统包括:
高频短串统计单元, 用于接收测序序列, 根据预设的高频闹值 构造高频短串表;
高频区域查找单元, 用于遍历接收到的各测序序列, 结合所述 高频短串表在各测序序列上每个测序序列中查找连续为高频短串最 多的区域;
序列构造单元, 用于才艮据相应接收到的测序序列和所述高频短 串表, 在查找到的所述区域左侧构造全是高频短串的左序列, 和 /或 在所述区域的右侧构造全是高频短串的右序列; 以及
序列组合单元, 用于将所述区域与构造的所述左序列和 /或右序 列组合成相应测序序列。
7. 如权利要求 6所述的系统, 其特征在于, 所述高频短串统计 单元包括:
短串切割模块, 用于接收测序序列, 将接收到的各测序序列逐 个碱基切割成预设长度的短串; 以及
高频短串采集模块, 用于根据切割得到的且出现次数超过预设 的高频阈值的短串构造出所述高频短串表。
8. 如权利要求 7 所述的系统, 其特征在于, 所述预设的高频阈 值根据所述短串切割模块切割成的预设长度的短串的频率分布确 定, 所述预设长度为 17个碱基长度。
9. 如权利要求 6 所述的系统, 其特征在于, 所述序列构造单元 包括:
左侧树构造模块, 用于从相应测序序列的第 si 个碱基开始取 n-1长度的序列作为树的根节点, 以 、 C, G、 T四种碱基为各节 点的叶子构造一棵深度为 si的树;
左序列构造模块, 用于遍历所述左侧树, 找到一奈全是高频短 串的路径, 根据所述路径从叶子节点向上构造全是高频短串的左序 列;
右侧树构造模块, 用于从相应测序序列的第 s2 个碱基开始取 11-1长度的序列作为树的根节点, 以 、 C G、 T四种碱基为各节 点的叶子构造一棵深度为 ln - ( s2 - l ) 的右侧树; 以及
右序列构造模块, 用于遍历所述右侧树, 找到一条全是高频短 串的路径,根据所述路径从根节点向下构造全是高频短串的右序列; 其中, sl、 s2分别为查找到的所述连续为高频短串最多的区域 的起始碱基、 结束碱基距离相应测序序列首个碱基的数目, n 为所 述高频短串的碱基长度, ln为相应测序序列的碱基长度。
10. 一种包含权利要求 6至 9任一项所述测序序列纠错系统的基 因组装设备。
PCT/CN2009/001426 2008-12-12 2009-12-11 一种测序序列纠错方法、系统及基因组装设备 WO2010066114A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP09831391.9A EP2377948B1 (en) 2008-12-12 2009-12-11 Error correcting method of test sequence, corresponding system and gene assembly equipment
US13/132,038 US8751165B2 (en) 2008-12-12 2009-12-11 Error correcting method of test sequence, corresponding system and gene assembly equipment
JP2011539874A JP5344774B2 (ja) 2008-12-12 2009-12-11 テスト配列の誤り訂正方法、対応するシステム及び遺伝子のアセンブリ装置
HK12101570.2A HK1161313A1 (zh) 2008-12-12 2012-02-17 種測序序列糾錯方法、系統及基因組裝設備

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2008102183406A CN101457253B (zh) 2008-12-12 2008-12-12 一种测序序列纠错方法、系统及设备
CN200810218340.6 2008-12-12

Publications (1)

Publication Number Publication Date
WO2010066114A1 true WO2010066114A1 (zh) 2010-06-17

Family

ID=40768373

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/001426 WO2010066114A1 (zh) 2008-12-12 2009-12-11 一种测序序列纠错方法、系统及基因组装设备

Country Status (6)

Country Link
US (1) US8751165B2 (zh)
EP (1) EP2377948B1 (zh)
JP (1) JP5344774B2 (zh)
CN (1) CN101457253B (zh)
HK (1) HK1161313A1 (zh)
WO (1) WO2010066114A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3084426B1 (en) * 2013-12-18 2020-04-15 Pacific Biosciences Of California, Inc. Iterative clustering of sequence reads for error correction
CN103971031B (zh) * 2014-05-04 2017-05-17 南京师范大学 一种面向大规模基因数据的读段定位方法
JP6520362B2 (ja) * 2014-08-25 2019-05-29 富士通株式会社 生成方法、装置、及びプログラム
US20160335298A1 (en) * 2015-05-12 2016-11-17 Extreme Networks, Inc. Methods, systems, and non-transitory computer readable media for generating a tree structure with nodal comparison fields and cut values for rapid tree traversal and reduced numbers of full comparisons at leaf nodes
EP3295345B1 (en) 2015-05-14 2023-01-25 Life Technologies Corporation Barcode sequences, and related systems and methods
CN105063208B (zh) * 2015-08-10 2018-03-06 北京吉因加科技有限公司 一种血浆中游离的目标dna低频突变富集测序方法
CN111385022B (zh) * 2018-12-29 2022-02-25 深圳市海思半导体有限公司 误码检测方法及相关设备
CN114937475A (zh) * 2022-04-12 2022-08-23 桂林电子科技大学 一种PacBio测序数据纠错结果的自动化评估方法

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1360057A (zh) * 2001-11-16 2002-07-24 北京华大基因研究中心 一种基于重复序列识别的全基因组测序数据的拼接方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2400890A1 (en) * 2000-02-22 2001-08-30 Pe Corporation (Ny) Method and system for the assembly of a whole genome using a shot-gun data set
CN2684471Y (zh) * 2004-03-15 2005-03-09 北京格林威尔科技发展有限公司 内嵌误码测试功能的光端机

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1360057A (zh) * 2001-11-16 2002-07-24 北京华大基因研究中心 一种基于重复序列识别的全基因组测序数据的拼接方法

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
See also references of EP2377948A4 *
SONG YAJUN ET AL.: "Detection and analysis of complete genome sequence ofYersinia pestis human-avirulent strain 91001", MED J CHIN PLA, vol. 9, no. 3, March 2004 (2004-03-01), pages 192 - 199, XP008143927 *
ZHANG GUIGANG ET AL.: "An approach based on fast Walsh transform and heuristic search to DNA fragments assembly", JOURNAL OF COMMUNICATION AND COMPUTER, vol. 4, no. 9, September 2007 (2007-09-01), pages 5 - 8, XP008146050 *

Also Published As

Publication number Publication date
EP2377948A4 (en) 2014-07-30
CN101457253B (zh) 2011-08-31
HK1161313A1 (zh) 2012-08-24
JP5344774B2 (ja) 2013-11-20
US8751165B2 (en) 2014-06-10
EP2377948B1 (en) 2016-05-18
US20110295784A1 (en) 2011-12-01
CN101457253A (zh) 2009-06-17
JP2012511752A (ja) 2012-05-24
EP2377948A1 (en) 2011-10-19

Similar Documents

Publication Publication Date Title
WO2010066114A1 (zh) 一种测序序列纠错方法、系统及基因组装设备
Coja-Oghlan et al. Contagious sets in expanders
US10370246B1 (en) Portable and low-error DNA-based data storage
Batuwita et al. Efficient resampling methods for training support vector machines with imbalanced datasets
EP1506621B1 (en) Decoding of chain reaction codes through inactivation of recovered symbols
US7558290B1 (en) Method and apparatus of data compression for computer networks
US10680645B2 (en) System and method for data storage, transfer, synchronization, and security using codeword probability estimation
Wandelt et al. Adaptive efficient compression of genomes
US11372929B2 (en) Sorting an array consisting of a large number of elements
US20130262941A1 (en) Data Alignment Over Multiple Physical Lanes
WO2015021879A1 (zh) 一种数据正则表达式的挖掘方法及装置
WO2020151150A1 (zh) 一种基于dcgan的音乐生成方法及装置
He et al. De novo assembly methods for next generation sequencing data
CN108712232A (zh) 一种用于连续变量量子密钥分发系统中的多码字并行译码方法
Song et al. Super-robust data storage in DNA by de Bruijn graph-based decoding
CN115312129A (zh) 高通量测序背景下的基因数据压缩方法、装置及相关设备
WO2009124419A1 (zh) 拓扑抽象方法、拓扑抽象装置以及路由控制器
Yan et al. Scalable de novo genome assembly using pregel
US20240134825A1 (en) Fastq/fasta compression systems and methods
WO2020124275A1 (en) Method, system, and computing device for optimizing computing operations of gene sequencing system
Pal et al. Improved algorithms for finding edit distance based motifs
Ye et al. SparseAssembler: de novo Assembly with the Sparse de Bruijn Graph
Sinnamon Analysis of self-adjusting heaps
CN113285985A (zh) 一种多数据中心背景下基于遗传算法的rs码节点修复方法
Limasset Novel approaches for the exploitation of high throughput sequencing data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09831391

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2011539874

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2009831391

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 13132038

Country of ref document: US