WO2010066116A1 - Method and system for constructing a scaffold, and genome sequencing device - Google Patents

Method and system for constructing a scaffold, and genome sequencing device

Info

Publication number
WO2010066116A1
Authority
WO
WIPO (PCT)
Prior art keywords
segment
segment connection
scaffold
size
contig
Prior art date
Application number
PCT/CN2009/001428
Other languages
English (en)
French (fr)
Inventor
朱红梅
倪培相
李瑞强
方晓东
王俊
杨焕明
汪建
Original Assignee
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院
Priority to EP09831393.5A (EP2377949B1)
Priority to US13/132,027 (US20110288845A1)
Priority to JP2011539875A (JP2012511753A)
Publication of WO2010066116A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G16B30/20: Sequence assembly

Definitions

  • The invention belongs to the field of genetic engineering, and in particular relates to a method and a system for constructing a scaffold, and to a genome sequencing device.
  • Genomics research is the comparative analysis of an organism's complete genetic information, aimed at understanding the mechanisms and functions of that information as a whole.
  • One of the most fundamental tasks in genomics research is obtaining the complete genome sequence of an organism.
  • The prior art provides first-generation sequencing technology, represented by whole-genome shotgun sequencing (Sanger sequencing), and second-generation sequencing technology, represented by Solexa, SOLiD and 454, for obtaining the complete genome sequence of an organism.
  • The Sanger sequencing process is as follows: the whole genome is first broken into DNA fragments of different sizes to construct a shotgun library, the shotgun library is sequenced at random, and bioinformatics methods are finally used to assemble the sequenced fragments into the whole-genome sequence. Its read length is relatively long.
  • The Solexa sequencing process is as follows: the whole genome is first broken into DNA fragments of about 100-200 bp, adapters are ligated to the DNA fragments, and the fragments are amplified by polymerase chain reaction (PCR) to make a library. The adapter-ligated DNA fragments are then bound to a chip (flow cell) carrying adapters, and the different DNA fragments are amplified by reaction. In the next reaction, four fluorescently labeled dyes are applied for sequencing by synthesis. Solexa sequencing is characterized by high throughput, low cost, a low sequencing error rate and a short read length.
  • PCR: polymerase chain reaction.
  • Scaffold construction has always been an important step in the assembly pipeline. It is mainly used to determine the positional relationships between contigs and to build the basic skeleton for genome assembly.
  • The quality of this method directly affects the final whole-genome sequence.
  • Existing scaffold construction methods complete the assembly task by joining sequencing reads that overlap one another.
  • When the read length is short, the overlaps between reads are correspondingly short, so existing scaffold construction methods have low accuracy.
  • The read lengths of second-generation sequencing technologies, represented by Solexa, SOLiD and 454, are markedly shorter than those of first-generation technologies, which makes existing scaffold construction methods difficult to apply to second-generation sequencing for assembling genome sequencing reads.
  • An object of the present invention is to provide a scaffold construction method that solves the problem that existing scaffold construction methods are difficult to apply to second-generation sequencing for assembling genome sequencing reads.
  • One aspect of the present invention provides a scaffold construction method comprising the following steps:
  • mapping the paired-end information obtained by sequencing onto contigs;
  • obtaining the gap sizes between contigs from the paired-end information mapped onto the contigs;
  • constructing scaffolds from the gap sizes between contigs and the forward/reverse relationships between contigs.
  • Another object of the present invention is to provide a scaffold construction system comprising:
  • a paired-end information mapping unit configured to map the paired-end information obtained by sequencing onto contigs;
  • a gap size acquisition unit configured to obtain the gap sizes between contigs from the paired-end information mapped onto the contigs;
  • a Scaffold construction unit configured to construct scaffolds from the calculated gap sizes between contigs and the forward/reverse relationships between contigs, yielding a complete scaffold graph.
  • Another object of the present invention is to provide a genome sequencing device comprising the above scaffold construction system.
  • In embodiments of the present invention, the contigs can be assembled into a complete scaffold graph, so that even when the read length of the sequencing technology is short, the assembly of sequencing reads can be completed by the above scaffold construction method, with a reduced assembly error rate.
  • FIG. 1 is a flow chart of one embodiment of the scaffold construction method of the present invention.
  • FIG. 2 is a flow chart of another embodiment of the scaffold construction method of the present invention.
  • FIG. 3 is a schematic diagram of constructing a scaffold graph from paired-end information mapped onto contigs, according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the masking of repeat contigs according to an embodiment of the present invention.
  • FIGS. 5a and 5b are schematic diagrams of linearizing a scaffold graph according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of the recovery of a repeat contig according to an embodiment of the present invention.
  • FIG. 7 is a structural block diagram of one embodiment of the scaffold construction system of the present invention.
  • FIG. 8 is a structural block diagram of another embodiment of the scaffold construction system of the present invention.
  • In embodiments of the present invention, the paired-end information obtained by sequencing is mapped onto contigs, the gap sizes between contigs are calculated from multiple read pairs, and the contigs are then assembled into a complete scaffold graph according to the calculated gap sizes and the forward/reverse relationships between contigs.
  • FIG. 1 is a flow chart of one embodiment of the scaffold construction method of the present invention, described in detail as follows.
  • In step 102, the paired-end information obtained by sequencing (also called paired-end reads) is mapped onto contigs.
  • Various sequencing technologies can be used to sequence the genome under test. For example, first-generation or second-generation sequencing can be used, yielding many short sequences with a forward/reverse relationship (called paired-end information).
  • One embodiment of the present invention sequences the genome under test with second-generation sequencing, which is characterized by high throughput and short read length; this can reduce the complexity of the scaffold construction method.
  • When the paired-end information is mapped onto contigs, any of several mapping programs can be used, such as soap, eland, maq or BLAT. After mapping, the position and direction of each read on the contigs are known.
  • FIG. 3 shows the result of mapping the paired-end information onto the contigs.
  • In step 104, the gap sizes between contigs are obtained from the paired-end information mapped onto the contigs.
  • Two contigs are joined by a read pair, and the length of the gap between them can be derived from each read pair mapped onto the two contigs. If several read pairs span the two contigs, the median or the mean of the gap lengths derived from the individual pairs is taken as the final gap size.
  • In one embodiment, the number of read pairs spanning two contigs is recorded as a weight, a threshold is chosen according to the actual situation, and only links whose weight exceeds the threshold are treated as valid links, which improves the accuracy of the link relationships.
  • In one embodiment, the mean of the gaps calculated from the multiple read pairs between two contigs is used as the gap size between them.
  • For example, when three read pairs lie between contig1 and contig2 after mapping, the mean gap length between contig1 and contig2 is calculated from those three pairs and taken as the gap size between contig1 and contig2.
  • The mean gap length is calculated in this way for every pair of contigs linked by paired-end information.
  • If the gap size between two contigs calculated from one read pair is $X_i$, and $X_i$ follows a normal distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$, then the mean of the gap sizes calculated from $N$ read pairs between the contigs follows the distribution $N(\mu, \sigma^2/N)$.
  • In step 106, the scaffolds between contigs are constructed from the calculated gap sizes and the forward/reverse relationships between contigs, and the contigs are assembled into a complete scaffold graph.
  • The forward/reverse relationship between contigs can be determined directly from the relative positions of the paired reads as given by the raw experimental data.
  • Referring to FIG. 3, once the gap size between contig1 and contig2 has been calculated from the three read pairs between them, the scaffold between contig1 and contig2 can be constructed from that gap size and their forward/reverse relationship. Proceeding in the same way for all contigs linked by paired-end information yields the scaffolds between all such contigs, so that they are assembled into a complete scaffold graph, as shown in FIG. 4.
  • FIG. 2 is a flow chart of another embodiment of the scaffold construction method of the present invention.
  • In step 202, the paired-end information obtained by sequencing is mapped onto contigs.
  • In step 204, the mean gap length between contigs is calculated from the multiple read pairs mapped onto the contigs and taken as the gap size between them.
  • In step 206, a scaffold graph is constructed from the gap sizes between contigs and the forward/reverse relationships between contigs.
  • In step 208, repeat contigs in the constructed scaffold graph are detected and masked.
  • A scaffold graph built by the above method may contain multiple repeats, which lowers the accuracy of genome sequencing; masking the repeat contigs at this step improves that accuracy.
  • If a contig is linked in the same direction to multiple mutually overlapping contigs, it is considered a repeat contig.
  • When a repeat contig is detected, it is masked.
  • For the scaffold graph of FIG. 4, contig R is linked in the reverse direction to contigs A and B, which overlap each other, and in the forward direction to contigs D, E and F, of which E and F overlap. Contig R is therefore a repeat contig and is masked.
  • In another embodiment, the scaffold construction method further includes the following step:
  • In step 210, the scaffold graph is linearized according to the gap sizes between the contigs in the graph and the forward/reverse relationships of the contigs.
  • If the scaffold graph built in step 206 contains repeats, the repeats are first masked in step 208 and the masked graph is then linearized; if it contains no repeats, the constructed scaffold graph is linearized directly.
  • Linearization places each contig at its appropriate position in the subgraph according to the gap sizes and forward/reverse relationships; if no two contigs overlap significantly, the subgraph is converted into a linear structure according to the positional relationships between the contigs.
  • For the scaffold graph of FIG. 5a, the gap size and forward/reverse relationship are known between contigs A and B, between E and D, between A and E, and between E and C. From these, the linear order AEBCD is obtained directly; that is, the scaffold graph of FIG. 5a can be linearized directly into the scaffold graph of FIG. 5b.
  • In another embodiment, the scaffold construction method further includes recalculating the gap sizes between contigs in the linearized scaffold graph.
  • The recalculation proceeds as follows: according to the order of the contigs on the linearized scaffold graph, the gap size between each two adjacent contigs is calculated directly and the adjacent contigs are re-linked, turning the original scaffold graph into a truly linear structure.
  • After the links AB, AC, EC and ED of FIG. 5a are converted into the links AE, EB, BC and CD of FIG. 5b, the gap sizes between contigs are obtained by directly adding or subtracting the previously calculated gap sizes.
  • The scaffold construction method further includes the following step:
  • In step 212, when a masked repeat contig lies between two unique contigs, the masked repeat contig is restored.
  • For example, if the previously masked contig R lies between the two unique contigs A and D in the scaffold graph, the masked repeat contig R is restored directly.
  • FIG. 7 is a structural block diagram of one embodiment of the scaffold construction system of the present invention.
  • This embodiment of the scaffold construction system includes a paired-end information mapping unit 71, a gap size acquisition unit 72 and a Scaffold construction unit 73, where:
  • The paired-end information mapping unit 71 maps the paired-end information obtained by sequencing onto contigs.
  • Various sequencing technologies can be used to sequence the genome under test. For example, first-generation or second-generation sequencing can be used, yielding many short sequences with a forward/reverse relationship (called paired-end information).
  • One embodiment of the present invention sequences the genome under test with second-generation sequencing, which is characterized by high throughput and short read length, to reduce the complexity of the scaffold construction method.
  • When the paired-end information is mapped onto contigs, any of several mapping programs can be used, such as soap, eland, maq or BLAT. After mapping, the position and direction of each read are known. The result of mapping the paired-end information onto the contigs is shown in FIG. 3.
  • The gap size acquisition unit 72 obtains the gap sizes between contigs from the paired-end information mapped onto the contigs. For example, it calculates the mean or median gap length between contigs from the multiple read pairs mapped between them, takes that value as the gap size, and records the number of read pairs spanning the two contigs as a weight.
  • If the gap size between two contigs calculated from one read pair is $X_i$, and $X_i$ follows a normal distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$, then the mean of the gap sizes calculated from $N$ read pairs between the contigs follows the distribution $N(\mu, \sigma^2/N)$.
  • The Scaffold construction unit 73 constructs the scaffolds between contigs from the calculated gap sizes and the forward/reverse relationships between contigs, and assembles the contigs into a complete scaffold graph.
  • The forward/reverse relationship between contigs can be determined directly from the relative positions of the paired reads as given by the raw experimental data.
  • FIG. 8 is a structural block diagram of another embodiment of the scaffold construction system of the present invention.
  • This embodiment of the scaffold construction system includes a paired-end information mapping unit 71, a gap size acquisition unit 72 and a Scaffold construction unit 73, and optionally includes a repeat masking unit 84, a linearization unit 85 and a repeat recovery unit 86.
  • For the paired-end information mapping unit 71, the gap size acquisition unit 72 and the Scaffold construction unit 73, see the corresponding description of FIG. 7; for brevity they are not described again here.
  • Because the scaffold graph constructed by the Scaffold construction unit 73 may contain multiple repeats, the scaffold construction system further includes a repeat masking unit 84.
  • The repeat masking unit 84 detects the repeats in the constructed scaffold graph and masks them. In an embodiment of the present invention, if a contig is linked in the same direction to multiple mutually overlapping contigs, it is considered a repeat contig.
  • The scaffold construction system further includes a linearization unit 85.
  • The linearization unit 85 linearizes the scaffold graph according to the gap sizes between the contigs in the graph and the forward/reverse relationships of the contigs. Specifically, each contig is placed at its appropriate position in the subgraph according to the gap sizes and forward/reverse relationships; if no two contigs overlap significantly, the subgraph is converted into a linear structure according to the positional relationships between the contigs.
  • The gap size acquisition unit 72 recalculates the gap sizes between contigs in the linearized scaffold graph.
  • After the scaffold graph has been repeat-masked and its subgraphs linearized, the gap sizes between contigs have changed, so a previously masked contig may now lie exactly between two unique contigs. To reduce the internal gaps of the scaffold and fill it as far as possible, the scaffold construction system further includes a repeat recovery unit 86.
  • The repeat recovery unit 86 restores a masked repeat contig when it lies between two unique contigs.
  • For example, if the previously masked contig R lies between the two unique contigs A and D of the resulting scaffold graph, the masked repeat contig R is restored directly.
  • An embodiment of the scaffold construction system may include only the repeat masking unit 84 or only the linearization unit 85, may include both, or may include all three of the repeat masking unit 84, the linearization unit 85 and the repeat recovery unit 86.
  • The scaffold construction system may be a software unit, a hardware unit, or a combined hardware and software unit built into a genome sequencing device, or it may be integrated as a separate add-on into a genome sequencing device or into the system of a genome sequencing device.
  • In embodiments of the present invention, the gap size between contigs is obtained from multiple read pairs, which greatly improves the accuracy of gap-size estimation in scaffold construction.
  • The contigs can then be assembled into a complete scaffold graph, so that even when the read length of the sequencing technology is short, the scaffold construction method can complete the assembly of sequencing reads, with a reduced assembly error rate.
  • Embodiments of the invention mask repeats in the constructed scaffold graph, which avoids scaffold misassembly caused by repeats and greatly improves the accuracy of scaffold construction. Linearizing the constructed scaffold graph determines the positional relationships between contigs and increases the length covered by the scaffold.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Description

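Method and System for Constructing a Scaffold, and Genome Sequencing Device
Technical Field
The present invention belongs to the field of genetic engineering, and in particular relates to a method and a system for constructing a scaffold, and to a genome sequencing device.
Background Art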
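Genomics research is the comparative analysis of an organism's complete genetic information, aimed at understanding, as a whole, the mechanisms and functions of that information. One of the most fundamental tasks in genomics research is obtaining the complete genome sequence of an organism. The prior art provides first-generation sequencing technology, represented by whole-genome shotgun sequencing (Sanger sequencing), and second-generation sequencing technology, represented by Solexa, SOLiD and 454, for obtaining the complete genome sequence of an organism.
The Sanger sequencing process is, in brief, as follows: the whole genome is first broken into DNA fragments of different sizes to construct a shotgun library, the shotgun library is sequenced at random, and bioinformatics methods are finally used to assemble the sequenced fragments into the whole-genome sequence. Its characteristic feature is a relatively long read length.
The Solexa sequencing process is, in brief, as follows: the whole genome is first broken into DNA fragments of about 100-200 bp, adapters are ligated to the DNA fragments, and the fragments are amplified by polymerase chain reaction (PCR) to make a library. The adapter-ligated DNA fragments are then bound to a chip (flow cell) carrying adapters, and the different DNA fragments are amplified by reaction. In the next reaction, four fluorescently labeled dyes are applied for sequencing by synthesis. Solexa sequencing is characterized by high throughput, low cost, a low sequencing error rate and a short read length.
Scaffold construction has always been an important step in the assembly pipeline. It is mainly used to determine the positional relationships between contigs and to build the basic skeleton for genome assembly, and its quality directly affects the final whole-genome sequence. Existing scaffold construction methods complete the assembly task by joining sequencing reads that overlap one another. When the read length is short, the overlaps between reads are correspondingly short, which makes existing scaffold construction methods inaccurate. Because the read lengths of second-generation sequencing technologies, represented by Solexa, SOLiD and 454, are markedly shorter than those of first-generation technologies, existing scaffold construction methods are difficult to apply to second-generation sequencing for assembling genome sequencing reads.
Summary of the Invention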
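An object of the present invention is to provide a scaffold construction method that solves the problem that existing scaffold construction methods are difficult to apply to second-generation sequencing for assembling genome sequencing reads.
One aspect of the present invention provides a scaffold construction method, the method comprising the following steps:
mapping the paired-end information obtained by sequencing onto contigs;
obtaining the gap sizes between contigs from the paired-end information mapped onto the contigs;
constructing scaffolds from the gap sizes between contigs and the forward/reverse relationships between contigs, to obtain a scaffold graph.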
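Another object of the present invention is to provide a scaffold construction system, the system comprising:
a paired-end information mapping unit, configured to map the paired-end information obtained by sequencing onto contigs;
a gap size acquisition unit, configured to obtain the gap sizes between contigs from the paired-end information mapped onto the contigs;
a Scaffold construction unit, configured to construct scaffolds from the calculated gap sizes between contigs and the forward/reverse relationships between contigs, to obtain a complete scaffold graph.
Another object of the present invention is to provide a genome sequencing device comprising the above scaffold construction system.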
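In embodiments of the present invention, the paired-end information obtained by sequencing is mapped onto contigs, and the gap size between contigs is obtained from the multiple read pairs between them, which greatly improves the accuracy of gap-size estimation in scaffold construction. The contigs are then assembled into a complete scaffold graph according to the calculated gap sizes and forward/reverse relationships. As a result, even when the read length of the sequencing technology is short, the assembly of sequencing reads can be completed by the above scaffold construction method, with a reduced assembly error rate.
Brief Description of the Drawings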
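FIG. 1 is a flow chart of one embodiment of the scaffold construction method of the present invention;
FIG. 2 is a flow chart of another embodiment of the scaffold construction method of the present invention;
FIG. 3 is a schematic diagram of constructing a scaffold graph from paired-end information mapped onto contigs, according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the masking of repeat contigs according to an embodiment of the present invention;
FIG. 5a and FIG. 5b are schematic diagrams of linearizing a scaffold graph according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the recovery of a repeat contig according to an embodiment of the present invention;
FIG. 7 is a structural block diagram of one embodiment of the scaffold construction system of the present invention;
FIG. 8 is a structural block diagram of another embodiment of the scaffold construction system of the present invention.
Detailed Description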
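To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. In the drawings, the same reference numerals denote the same or similar components or elements.
In embodiments of the present invention, the paired-end information obtained by sequencing is mapped onto contigs, the gap sizes between contigs are calculated from multiple read pairs, and the contigs are then assembled into a complete scaffold graph according to the calculated gap sizes and forward/reverse relationships.
FIG. 1 shows a flow chart of one embodiment of the scaffold construction method of the present invention, described in detail as follows: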
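In step 102, the paired-end information obtained by sequencing (also called paired-end reads) is mapped onto contigs.
In embodiments of the present invention, various sequencing technologies can be used to sequence the genome under test. For example, first-generation or second-generation sequencing can be used, yielding many short sequences with a forward/reverse relationship (called paired-end information). One embodiment of the present invention sequences the genome under test with second-generation sequencing, which is characterized by high throughput and short read length; this can reduce the complexity of the scaffold construction method.
When the paired-end information obtained by sequencing is mapped onto contigs, any of several mapping methods can be used; mapping programs such as soap, eland, maq or BLAT can all perform the mapping. After mapping, the position and direction of each read on the contigs are known.
Suppose the paired-end information obtained by sequencing is reads1 and reads1', reads2 and reads2', and reads3 and reads3'. FIG. 3 shows the result of mapping this paired-end information onto the contigs.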
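In step 104, the gap sizes between contigs are obtained from the paired-end information mapped onto the contigs.
Two contigs are joined by a read pair, and the length of the gap between the two contigs can be derived from each read pair mapped onto them. If several read pairs lie between two contigs, the median or the mean of the gap lengths derived from the individual pairs is taken as the final gap size between the contigs.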
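The per-pair derivation and the median or mean aggregation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the library insert size is known and that the forward read maps to the left contig and the reverse read to the right contig. The names `PairHit`, `gap_from_pair`, `estimate_gap` and the `insert_size` parameter are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class PairHit:
    """One read pair bridging two contigs (hypothetical record layout)."""
    start_on_left: int  # 0-based start of the forward read on the left contig
    end_on_right: int   # 0-based exclusive end of the reverse read on the right contig


def gap_from_pair(hit, left_len, insert_size):
    """Gap implied by one pair, assuming a known insert size (an assumption)."""
    # Bases of the sequenced fragment lying on the left contig.
    covered_left = left_len - hit.start_on_left
    # Bases of the sequenced fragment lying on the right contig.
    covered_right = hit.end_on_right
    return insert_size - covered_left - covered_right


def estimate_gap(hits, left_len, insert_size, use_median=True):
    """Aggregate the per-pair gaps by median (default) or mean."""
    gaps = [gap_from_pair(h, left_len, insert_size) for h in hits]
    return median(gaps) if use_median else mean(gaps)
```

With three bridging pairs, as in the FIG. 3 example, `estimate_gap` returns a single gap size for the contig pair; the median is less sensitive than the mean to a single mis-mapped pair.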
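In one embodiment of the present invention, the number of read pairs spanning two contigs is recorded as a weight, a threshold is chosen according to the actual situation, and only links whose weight exceeds the threshold are treated as valid links, which improves the accuracy of the link relationships.
In embodiments of the present invention, the mean of the gaps calculated from the multiple read pairs between two contigs is used as the gap size between them. Referring to FIG. 3, when three read pairs lie between contig1 and contig2 after mapping, the mean gap length between contig1 and contig2 is calculated from those three pairs and taken as the gap size between contig1 and contig2. The mean gap length is calculated in this way for every pair of contigs linked by paired-end information. At the same time, the number of read pairs between contig1 and contig2, namely 3, is recorded as the weight, and the link between contig1 and contig2 is considered valid only when this weight exceeds the preset threshold, which improves the accuracy of the link relationships.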
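The weighting rule above can be sketched in a few lines. This is an illustration under stated assumptions: each bridging read pair is represented as a tuple of two contig names, and the pair is normalized by sorting so that orientation is ignored. The function names are hypothetical, and the strict "greater than" comparison follows the text.

```python
from collections import Counter


def count_link_weights(pair_links):
    """Count bridging read pairs per contig pair.

    pair_links: iterable of (contig_a, contig_b) tuples, one per read pair.
    Sorting the pair ignores orientation, which is a simplification.
    """
    weights = Counter()
    for a, b in pair_links:
        weights[tuple(sorted((a, b)))] += 1
    return weights


def valid_links(weights, threshold):
    """Keep only links whose weight is strictly greater than the threshold."""
    return {link: w for link, w in weights.items() if w > threshold}
```

The threshold itself is left to the caller, since the text says it is chosen according to the actual situation.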
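If the gap size between two contigs calculated from one read pair is $X_i$, and $X_i$ follows a normal distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$, then the mean of the gap sizes calculated from $N$ read pairs between the contigs follows the distribution $N(\mu, \sigma^2/N)$. Thus, when the coverage of paired-end reads on the contigs is high, the accuracy of gap estimation between contigs in scaffold construction can be greatly improved.
In step 106, the scaffolds between contigs are constructed from the calculated gap sizes and the forward/reverse relationships between contigs, and the contigs are assembled into a complete scaffold graph. The forward/reverse relationship between contigs can be determined directly from the relative positions of the paired reads as given by the raw experimental data.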
Referring to FIG. 3, once the gap size between contig1 and contig2 has been calculated from the three read pairs between them, the scaffold between contig1 and contig2 shown in FIG. 3 can be constructed from that gap size and the forward/reverse relationship between contig1 and contig2. Proceeding in the same way, using the gap sizes and forward/reverse relationships of all contigs linked by paired-end information, yields the scaffolds between all such contigs, so that they are assembled into a complete scaffold graph, as shown in FIG. 4.
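One possible in-memory form for the scaffold graph is an adjacency map that records, for each contig, its neighbors in the forward and reverse directions together with the gap size. The sketch below is an assumption about representation only; the patent does not specify a data structure, and `build_scaffold_graph` and the `"+"`/`"-"` keys are hypothetical. Contig strand orientation is ignored for simplicity.

```python
from collections import defaultdict


def build_scaffold_graph(links):
    """Build a directed scaffold graph.

    links: iterable of (left, right, gap), meaning `right` follows `left`
    after `gap` bases.
    """
    graph = defaultdict(lambda: {"+": {}, "-": {}})
    for left, right, gap in links:
        graph[left]["+"][right] = gap   # right lies in the forward direction of left
        graph[right]["-"][left] = gap   # left lies in the reverse direction of right
    return graph
```

Storing both directions makes the later repeat check, which asks about neighbors on the same side of a contig, a simple dictionary lookup.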
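FIG. 2 shows a flow chart of another embodiment of the scaffold construction method of the present invention.
As shown in FIG. 2, in step 202 the paired-end information obtained by sequencing is mapped onto contigs.
In step 204, the mean gap length between contigs is calculated from the multiple read pairs mapped onto the contigs and taken as the gap size between them.
In step 206, a scaffold graph is constructed from the gap sizes between contigs and the forward/reverse relationships between contigs.
In step 208, repeat contigs in the constructed scaffold graph are detected and masked. A scaffold graph built by the above method may contain multiple repeats, which lowers the accuracy of genome sequencing; masking the repeat contigs at this step improves that accuracy.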
In embodiments of the present invention, if a contig is linked in the same direction to multiple mutually overlapping contigs, it is considered a repeat contig. When a repeat contig is detected, it is masked.
For the scaffold graph shown in FIG. 4, contig R is linked in the reverse direction to contigs A and B, which overlap each other, and in the forward direction to contigs D, E and F, of which E and F overlap. Contig R is therefore a repeat contig, and it is masked.
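The repeat rule above can be sketched as follows, under stated assumptions: the graph uses a hypothetical layout `{contig: {"+": {neighbor: gap}, "-": {neighbor: gap}}}`, and two neighbors on the same side are taken to "overlap" when the intervals implied by their gaps and lengths intersect. The patent does not define the overlap test in this detail, so the interval test is an illustrative choice.

```python
from itertools import combinations


def intervals_overlap(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def is_repeat(neighbors, lengths):
    """A contig is a repeat if two neighbors on the same side overlap."""
    for direction in ("+", "-"):
        # Place each neighbor at its gap distance from the contig.
        spans = [(gap, gap + lengths[name])
                 for name, gap in neighbors[direction].items()]
        if any(intervals_overlap(x, y) for x, y in combinations(spans, 2)):
            return True
    return False


def mask_repeats(graph, lengths):
    """Return (graph without repeat contigs, set of masked repeats)."""
    repeats = {c for c, nb in graph.items() if is_repeat(nb, lengths)}
    masked = {
        c: {d: {n: g for n, g in nb[d].items() if n not in repeats}
            for d in ("+", "-")}
        for c, nb in graph.items() if c not in repeats
    }
    return masked, repeats
```

The masked repeats are returned rather than discarded so that a later step can try to restore them.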
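To obtain scaffolds of sufficient length within a controllable error range, so that as many contigs as possible have their correct relative positions determined, in another embodiment of the present invention the scaffold construction method further includes the following step:
In step 210, the scaffold graph is linearized according to the gap sizes between the contigs in the graph and the forward/reverse relationships of the contigs.
In embodiments of the present invention, if the scaffold graph built in step 206 contains repeats, the repeats are first masked in step 208 and the masked graph is then linearized; if the scaffold graph built in step 206 contains no repeats, it is linearized directly. The linearization step is as follows:
According to the gap sizes between the contigs in the scaffold graph and the forward/reverse relationships between the contigs, each contig is placed at its appropriate position in the subgraph; if no two contigs overlap significantly, the subgraph is converted into a linear structure according to the positional relationships between the contigs.
For the scaffold graph shown in FIG. 5a, the gap size and forward/reverse relationship are known between contigs A and B, between E and D, between A and E, and between E and C. From these gap sizes and forward/reverse relationships, the linear order AEBCD is obtained directly; that is, the scaffold graph of FIG. 5a can be linearized directly into the scaffold graph of FIG. 5b.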
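Linearization can be sketched as assigning each contig a coordinate from the known gaps and then sorting by coordinate. The sketch below is an illustration, not the patent's procedure: it assumes every contig is reachable from a chosen root, that all contigs share one orientation, and that any overlap at all blocks linearization (the text only requires no significant overlap). Contig lengths and gap values used with it are hypothetical.

```python
from collections import deque


def place_contigs(links, lengths, root):
    """Assign a start coordinate to every contig reachable from `root`.

    links: iterable of (left, right, gap), meaning `right` starts `gap`
    bases after the end of `left`.
    """
    neighbors = {}
    for left, right, gap in links:
        offset = lengths[left] + gap  # start(right) - start(left)
        neighbors.setdefault(left, []).append((right, offset))
        neighbors.setdefault(right, []).append((left, -offset))
    start = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for other, offset in neighbors.get(current, []):
            if other not in start:
                start[other] = start[current] + offset
                queue.append(other)
    return start


def linearize(links, lengths, root):
    """Return (contig order, start coordinates); fail if contigs overlap."""
    start = place_contigs(links, lengths, root)
    order = sorted(start, key=start.get)
    for a, b in zip(order, order[1:]):
        if start[a] + lengths[a] > start[b]:
            raise ValueError(f"{a} and {b} overlap; cannot linearize")
    return order, start
```

For example, with links A-B, A-C, E-C and E-D, contig lengths of 100 and gaps of 300, 500, 350 and 650 respectively (all invented for illustration), `linearize` yields the order A, E, B, C, D.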
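Because the scaffold graph has been linearized, the gap sizes between the contigs in the scaffold graph may have changed. To accurately reflect the gap sizes between the contigs in the linearized scaffold graph, in another embodiment of the present invention the scaffold construction method further includes the following step:
Recalculating the gap sizes between the contigs in the linearized scaffold graph.
The gap sizes between the contigs in the linearized scaffold graph are recalculated as follows: according to the order of the contigs in the linearized scaffold graph, the gap size between each pair of adjacent contigs is calculated directly, and adjacent contigs are reconnected, so that the original scaffold graph is converted into a truly linear structure. Referring to Fig. 5a and Fig. 5b, after the connections AB, AC, EC and ED in Fig. 5a are converted into the connections AE, EB, BC and CD in Fig. 5b, the gap sizes between the contigs are obtained by directly adding and subtracting the previously calculated gap sizes. For example, the gap size between A and E can be expressed simply as AE = AC - EC.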
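The recalculation for adjacent contigs can be sketched from placed coordinates. The source states the relation simply as AE = AC - EC; the sketch below instead works from assumed start positions and lengths, so it also subtracts the length of the contig on the left. Whether contig lengths are absorbed into the patent's gap definition is not stated, so this is an assumption of the example, as are all names and numbers.

```python
def adjacent_gaps(order, start, lengths):
    """Gap between each pair of neighbouring contigs in the linear order."""
    gaps = {}
    for left, right in zip(order, order[1:]):
        gaps[left + right] = start[right] - (start[left] + lengths[left])
    return gaps

# Hypothetical coordinates for the linear order A, E, B, C, D.
lengths = {name: 100 for name in "ABCDE"}
start = {"A": 0, "E": 150, "B": 300, "C": 450, "D": 600}
gaps = adjacent_gaps(list("AEBCD"), start, lengths)
print(gaps)
```

After this step only the links AE, EB, BC and CD remain, which is the linear structure described for Fig. 5b.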
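After repeat contigs have been masked and the subgraphs linearized, the gap sizes between the contigs in the scaffold graph have changed, and a previously masked contig may happen to lie between two unique contigs. To reduce the size of the gaps inside the scaffold, so that the scaffold is filled as far as possible, the scaffold construction method further includes the following step:
In step 212, when a masked repeat contig lies between two unique contigs, the masked repeat contig is restored. Referring to Fig. 6, which shows the scaffold graph obtained after steps 208 and 210, if the previously masked contig R lies between the two unique contigs A and D in the scaffold graph, the previously masked repeat contig R is restored directly.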
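The restoration step can be sketched as a check of whether a masked repeat fits into the gap between two neighbouring unique contigs. The function `restore_repeats`, the idea of carrying an expected start position for each masked repeat, and the numbers are all assumptions made for illustration; the patent does not describe how the position of the masked contig is tracked.

```python
def restore_repeats(order, start, lengths, masked):
    """Re-insert masked repeats that fit between two neighbouring contigs.

    masked maps a repeat name to (expected_start, length). start and
    lengths are updated in place for every restored repeat.
    """
    restored = list(order)
    for name, (r_start, r_len) in masked.items():
        for i in range(len(restored) - 1):
            left, right = restored[i], restored[i + 1]
            left_end = start[left] + lengths[left]
            if left_end <= r_start and r_start + r_len <= start[right]:
                restored.insert(i + 1, name)
                start[name], lengths[name] = r_start, r_len
                break
    return restored

# Hypothetical layout following Fig. 6: R lies between unique contigs A and D.
start = {"A": 0, "D": 500}
lengths = {"A": 100, "D": 100}
result = restore_repeats(["A", "D"], start, lengths, {"R": (200, 150)})
print(result)  # ['A', 'R', 'D']
```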
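Fig. 7 shows a structural block diagram of an embodiment of the scaffold construction system of the present invention. As shown in Fig. 7, this embodiment of the scaffold construction system includes a paired-end information mapping unit 71, a gap size acquisition unit 72 and a scaffold construction unit 73, wherein:
The paired-end information mapping unit 71 maps the paired-end information obtained by sequencing onto the contigs. In this embodiment of the present invention, the genome to be sequenced may be sequenced with any of several sequencing technologies, for example first-generation or second-generation sequencing technology, to obtain multiple short sequences having paired-end relationships (referred to as paired-end information). One embodiment of the present invention sequences the genome with second-generation sequencing technology, which is characterized by high throughput and short read length, in order to reduce the complexity of the scaffold construction method. The paired-end information may be mapped onto the contigs by various mapping methods; mapping programs such as soap, eland, maq or BLA can all perform the mapping. Once the paired-end information has been mapped onto the contigs, its positions and orientations are obtained. The result of mapping the paired-end information onto the contigs is shown in Fig. 3.
The gap size acquisition unit 72 obtains the gap sizes between contigs from the paired-end information mapped onto the contigs. For example, the average or median length of the gap between contigs is calculated from the multiple pairs of paired-end information mapped between the contigs and is taken as the gap size between the contigs, and the number of pairs of paired-end information spanning the two contigs is recorded as a weight.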
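The gap estimate and its weight can be sketched as follows. The geometry used to turn one read pair into a gap value (library insert size minus the parts of the insert that fall on each contig) is an assumption of this example, as are the function names and the numbers; the patent states only that the mean or median over multiple pairs is used and that the pair count is recorded as a weight.

```python
from statistics import mean, median

def gap_from_pair(insert_size, contig1_len, read1_start, read2_end):
    """Gap implied by one read pair spanning contig1 -> contig2.

    read1_start: 0-based start of the forward read on contig1.
    read2_end:   distance from the start of contig2 to the far end
                 of the reverse read.
    """
    return insert_size - (contig1_len - read1_start) - read2_end

def estimate_gap(pairs, insert_size, contig1_len, use_median=False):
    """Return (gap size, weight) for all pairs linking two contigs."""
    gaps = [gap_from_pair(insert_size, contig1_len, s, e) for s, e in pairs]
    size = median(gaps) if use_median else mean(gaps)
    return size, len(gaps)  # weight = number of spanning pairs

# Three hypothetical pairs, as in the contig1/contig2 example of Fig. 3.
pairs = [(700, 100), (750, 160), (800, 190)]
gap, weight = estimate_gap(pairs, insert_size=500, contig1_len=1000)
print(gap, weight)
```

The median is less sensitive to a single mis-mapped pair, which may be why the text offers it as an alternative to the mean.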
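In this embodiment of the present invention, if the gap size between contigs calculated from a single pair of paired-end information is Xi, and Xi follows a normal distribution N(μ, σ^2) with expectation μ and variance σ^2, then the mean of the gap sizes calculated from N pairs of paired-end information between the contigs follows the distribution N(μ, σ^2/N). Thus, when the coverage of paired-end information on the contigs is high, the accuracy of the gap estimates between contigs in scaffold construction can be greatly improved.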
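The distribution of the mean follows from a short calculation, shown here under the additional assumption (not stated explicitly in the text) that the N per-pair estimates are independent:

```latex
\begin{align}
\bar{X} &= \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad X_i \sim N(\mu, \sigma^2) \text{ independent} \\
\mathbb{E}[\bar{X}] &= \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}[X_i] = \mu \\
\operatorname{Var}(\bar{X}) &= \frac{1}{N^2}\sum_{i=1}^{N} \operatorname{Var}(X_i) = \frac{\sigma^2}{N} \\
\bar{X} &\sim N\!\left(\mu, \frac{\sigma^2}{N}\right)
\end{align}
```

As a purely illustrative figure, with σ = 50 bp and N = 25 pairs the standard deviation of the averaged gap estimate is 50/√25 = 10 bp.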
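The scaffold construction unit 73 constructs the scaffold between the contigs according to the calculated gap sizes between the contigs and the paired-end relationships between the contigs, so that the contigs are built into a complete scaffold graph. The paired-end relationship between contigs can be determined directly from the relative positions of the paired-end relationships given by the raw experimental data.
Referring to Fig. 3, once the gap size between contig1 and contig2 has been calculated from the 3 pairs of paired-end information between contig1 and contig2 shown in Fig. 3, the scaffold between contig1 and contig2 shown in Fig. 3 can be constructed according to that gap size and the paired-end relationship between contig1 and contig2. By analogy, using the gap sizes and the paired-end relationships between all contigs that are linked by paired-end information, the scaffolds between all such contigs can be constructed, so that all contigs linked by paired-end information are built into a complete scaffold graph, as shown in Fig. 4.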
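Fig. 8 shows a structural block diagram of an embodiment of the scaffold construction system of the present invention. As shown in Fig. 8, this embodiment of the scaffold construction system includes the paired-end information mapping unit 71, the gap size acquisition unit 72 and the scaffold construction unit 73, and optionally includes a repeat masking unit 84, a linearization unit 85 and a repeat restoring unit 86. For the paired-end information mapping unit 71, the gap size acquisition unit 72 and the scaffold construction unit 73, see the corresponding description of Fig. 7; for brevity, they are not described again in detail here.
The scaffold graph constructed by the scaffold construction unit 73 may contain multiple repeat contigs, which lowers the accuracy of genome sequencing. To improve that accuracy, in another embodiment of the present invention the scaffold construction system further includes the repeat masking unit 84. The repeat masking unit 84 detects repeat contigs in the constructed scaffold graph and masks the detected repeat contigs. In this embodiment of the present invention, if a contig is connected in the same direction to multiple contigs that overlap one another, the contig is regarded as a repeat contig.
In order to obtain scaffolds of sufficient length within a controllable error range, so that as many contigs as possible have their correct relative positions determined, in another embodiment of the present invention the scaffold construction system further includes the linearization unit 85. The linearization unit 85 linearizes the scaffold graph according to the gap sizes between the contigs in the scaffold graph and the paired-end relationships of the contigs. The process is as follows: according to the gap sizes between the contigs in the scaffold graph and the paired-end relationships between the contigs, each contig is placed at its appropriate position in the subgraph. If there is no significant overlap between any two contigs, the subgraph is converted into a linear structure according to the positional relationships between those contigs.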
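Because the scaffold graph has been linearized, the gap sizes between the contigs in the scaffold graph may have changed. To accurately reflect the gap sizes between the contigs in the linearized scaffold graph, in another embodiment of the present invention the gap size acquisition unit 72 recalculates the gap sizes between the contigs in the linearized scaffold graph.
The gap sizes between the contigs in the linearized scaffold graph are recalculated as follows: according to the order of the contigs in the linearized scaffold graph, the gap size between each pair of adjacent contigs is calculated directly, and adjacent contigs are reconnected, so that the original scaffold graph is converted into a truly linear structure. Referring to Fig. 5a and Fig. 5b, the connections AB, AE, AC and ED in Fig. 5a are converted into the connections AE, EB, BC and CD in Fig. 5b. After linearization, the gap sizes between the contigs are obtained by directly adding and subtracting the previously calculated gap sizes. For example, the gap size between A and E is expressed as AE = AC - EC.
After repeat contigs have been masked and the subgraphs linearized, the gap sizes between the contigs in the scaffold graph have changed, and a previously masked contig may happen to lie between two unique contigs. To reduce the size of the gaps inside the scaffold, so that the scaffold is filled as far as possible, the scaffold construction system further includes the repeat restoring unit 86. The repeat restoring unit 86 restores a masked repeat contig when the masked repeat contig lies between two unique contigs.
Referring to Fig. 6, which shows the scaffold graph obtained by the scaffold construction unit 73, if the previously masked contig R lies between the two unique contigs A and D in the scaffold graph, the previously masked repeat contig R is restored directly.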
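It should be noted that although Fig. 8 shows the repeat masking unit 84, the linearization unit 85 and the repeat restoring unit 86 together, those skilled in the art will understand that, in addition to the paired-end information mapping unit 71, the gap size acquisition unit 72 and the scaffold construction unit 73, an embodiment of the scaffold construction system may include only the repeat masking unit 84 or only the linearization unit 85, may include both the repeat masking unit 84 and the linearization unit 85, or may include the repeat masking unit 84, the linearization unit 85 and the repeat restoring unit 86 together.
For ease of explanation, the above description shows only the parts relevant to the embodiments of the present invention. Those skilled in the art should understand that the scaffold construction system may be a software unit, a hardware unit or a combined software and hardware unit built into a genome sequencing device, or may be integrated as a standalone add-on into a genome sequencing device or into an application system of a genome sequencing device.
In the embodiments of the present invention, the paired-end information obtained by sequencing is mapped onto the contigs, and the gap sizes between the contigs are calculated from the multiple pairs of paired-end information between them, which greatly improves the accuracy of the gap size estimates in scaffold construction. The contigs can then be built into a complete scaffold graph from the calculated gap sizes and paired-end relationships, so that even when the sequencing technology used has a short read length, the sequenced fragments can still be assembled by the above scaffold construction method, with a lower assembly error rate. In addition, the embodiments of the present invention mask repeat contigs in the constructed scaffold graph, which avoids scaffold misassembly caused by repeats and greatly improves the accuracy of the constructed scaffold; linearize the constructed scaffold graph, which determines the positional relationships between the contigs and increases the length covered by the scaffold; and restore the masked repeat contigs, which makes full use of the repeat information and fills the gaps inside the scaffold as far as possible.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.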

Claims

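1. A method for constructing a scaffold, characterized in that the method comprises the following steps:
mapping paired-end information obtained by sequencing onto contigs;
obtaining the gap sizes between the contigs according to the paired-end information mapped onto the contigs;
constructing a scaffold according to the gap sizes between the contigs and the paired-end relationships between the contigs, to obtain a scaffold graph.
2. The method according to claim 1, characterized in that the step of obtaining the gap sizes between the contigs according to the paired-end information mapped onto the contigs comprises:
calculating, from multiple pairs of paired-end information mapped onto the contigs, the average length or median length of the gap between the contigs as the gap size between the contigs.
3. The method according to claim 1, characterized in that the method further comprises the following step:
detecting repeat contigs in the scaffold graph, and masking the detected repeat contigs.
4. The method according to claim 3, characterized in that a repeat contig is a contig that is connected in the same direction to multiple contigs that overlap one another.
5. The method according to claim 1, characterized in that the method further comprises the following step:
linearizing the scaffold graph according to the gap sizes between the contigs in the scaffold graph and the paired-end relationships between the contigs.
6. The method according to claim 5, characterized in that the method further comprises the following step:
recalculating the gap sizes between the contigs in the linearized scaffold graph.
7. The method according to claim 3 or 4, characterized in that the method further comprises the following step:
restoring a masked repeat contig when the masked repeat contig lies between two unique contigs.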
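8. A system for constructing a scaffold, characterized in that the system comprises:
a paired-end information mapping unit, configured to map paired-end information obtained by sequencing onto contigs;
a gap size acquisition unit, configured to obtain the gap sizes between the contigs according to the paired-end information mapped onto the contigs;
a scaffold construction unit, configured to construct a scaffold according to the gap sizes between the contigs and the paired-end relationships between the contigs, to obtain a scaffold graph.
9. The system according to claim 8, characterized in that the system further comprises:
a repeat masking unit, configured to detect repeat contigs in the scaffold graph and to mask the detected repeat contigs.
10. The system according to claim 9, characterized in that the system further comprises:
a linearization unit, configured to linearize the scaffold graph according to the gap sizes between the contigs in the scaffold graph and the paired-end relationships between the contigs.
11. The system according to claim 10, characterized in that the gap size acquisition unit is further configured to recalculate the gap sizes between the contigs in the linearized scaffold graph.
12. The system according to claim 9, characterized in that the system further comprises:
a repeat restoring unit, configured to restore a masked repeat contig when the masked repeat contig lies between two unique contigs.
13. The system according to claim 8, characterized in that the gap size acquisition unit calculates, from multiple pairs of paired-end information mapped onto the contigs, the average length or median length of the gap between the contigs as the gap size between the contigs.
14. A genome sequencing device comprising the system for constructing a scaffold according to any one of claims 8 to 13.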
PCT/CN2009/001428 2008-12-12 2009-12-11 Construction method and system of fragments assembling scaffold, and genome sequencing device WO2010066116A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP09831393.5A EP2377949B1 (en) 2008-12-12 2009-12-11 Construction method and system of fragments assembling scaffold
US13/132,027 US20110288845A1 (en) 2008-12-12 2009-12-11 Construction method and system of fragments assembling scaffold, and genome sequencing device
JP2011539875A JP2012511753A (ja) 2008-12-12 2009-12-11 断片アセンブリングスキャフォールドの構築方法及びシステム、並びにゲノム配列決定装置

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200810218342.5 2008-12-12
CN2008102183425A CN101504697B (zh) 2008-12-12 2008-12-12 Construction method and system of fragments assembling scaffold

Publications (1)

Publication Number Publication Date
WO2010066116A1 true WO2010066116A1 (zh) 2010-06-17

Family

ID=40976941

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/001428 WO2010066116A1 (zh) 2008-12-12 2009-12-11 Construction method and system of fragments assembling scaffold, and genome sequencing device

Country Status (5)

Country Link
US (1) US20110288845A1 (zh)
EP (1) EP2377949B1 (zh)
JP (1) JP2012511753A (zh)
CN (1) CN101504697B (zh)
WO (1) WO2010066116A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349350A (zh) * 2020-11-09 2021-02-09 山西大学 Method for strain identification based on a Dunaliella core genome sequence

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101504697B (zh) * 2008-12-12 2010-09-08 深圳华大基因研究院 Construction method and system of fragments assembling scaffold
CN102206704B (zh) * 2011-03-02 2013-11-20 深圳华大基因科技服务有限公司 Method and device for assembling genome sequences
US10395757B2 (en) 2011-12-02 2019-08-27 Bgi Tech Solutions Co., Ltd. Parental genome assembly method
CN102982252A (zh) * 2012-12-05 2013-03-20 北京诺禾致源生物信息科技有限公司 Scaffold sequence assembly strategy for highly heterozygous diploid genomes
CN104850761B (zh) * 2014-02-17 2017-11-07 深圳华大基因科技有限公司 Nucleic acid sequence assembly method and device
CN104017883B (zh) * 2014-06-18 2015-11-18 深圳华大基因科技服务有限公司 Method and system for assembling genome sequences
CN104239750B (zh) * 2014-08-25 2017-07-28 北京百迈客生物科技有限公司 De novo genome assembly method based on high-throughput sequencing data
WO2016134034A1 (en) 2015-02-17 2016-08-25 Dovetail Genomics Llc Nucleic acid sequence assembly
CN106021978B (zh) * 2016-04-06 2019-03-29 晶能生物技术(上海)有限公司 De novo sequencing data assembly method based on the Irys optical mapping platform
CN111180014A (zh) * 2020-01-03 2020-05-19 中国检验检疫科学研究院 Virus sequence assembly method based on low-depth siRNA data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001063543A2 (en) * 2000-02-22 2001-08-30 Pe Corporation (Ny) Method and system for the assembly of a whole genome using a shot-gun data set
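CN1360057A (zh) * 2001-11-16 2002-07-24 北京华大基因研究中心 Assembly method for whole-genome sequencing data based on repeat sequence identification
CN101504697A (zh) * 2008-12-12 2009-08-12 深圳华大基因研究院 Genome sequencing device, and construction method and system of its fragments assembling scaffold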

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAN YUJUN ET AL.: "Applications of the double-barreled data in whole-genome shotgun sequence assembly and analysis", SCIENCE IN CHINA SER. C LIFE SCIENCES, vol. 48, no. 3, 2005, pages 300 - 306, XP008143825 *
PEVZNER PAVEL A. ET AL: "Fragment assembly with double-barreled data", BIOINFORMATICS, vol. 17, no. 1, 2001, pages S225 - S233, XP008140748 *
See also references of EP2377949A4 *

Also Published As

Publication number Publication date
EP2377949A4 (en) 2014-12-17
US20110288845A1 (en) 2011-11-24
EP2377949B1 (en) 2018-11-21
CN101504697B (zh) 2010-09-08
JP2012511753A (ja) 2012-05-24
EP2377949A1 (en) 2011-10-19
CN101504697A (zh) 2009-08-12

Similar Documents

Publication Publication Date Title
WO2010066116A1 (zh) Construction method and system of fragments assembling scaffold, and genome sequencing device
Koren et al. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly
US20240120021A1 (en) Methods and systems for large scale scaffolding of genome assemblies
Ghurye et al. Integrating Hi-C links with assembly graphs for chromosome-scale assembly
Kolmogorov et al. Chromosome assembly of large and complex genomes using multiple references
Wick et al. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads
Ghurye et al. Modern technologies and algorithms for scaffolding assembled genomes
Walker et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement
Myers The fragment assembly string graph
Deshpande et al. Cerulean: a hybrid assembly using high throughput short and long reads
Zhu et al. P_RNA_scaffolder: a fast and accurate genome scaffolder using paired-end RNA-sequencing reads
Ma et al. Hybrid assembly of ultra-long Nanopore reads augmented with 10x-Genomics contigs: Demonstrated with a human genome
Pham et al. Pathset graphs: a novel approach for comprehensive utilization of paired reads in genome assembly
Varma et al. Fassem: Fpga based acceleration of de novo genome assembly
Baptista et al. Is reliance on an inaccurate genome sequence sabotaging your experiments?
Sung et al. An O(m log m)-Time Algorithm for Detecting Superbubbles
Turner et al. Next-generation sequencing of vertebrate experimental organisms
Indrischek et al. The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies
Edera et al. Computational detection of plant RNA editing events
CN113963749A (zh) Automated assembly method, system, device and storage medium for high-throughput sequencing data
Wajid et al. The A, C, G, and T of genome assembly
CN103699819A (zh) Vertex extension method for variable-length k-mer queries based on a multi-step bidirectional De Bruijn graph
Luo et al. GapReduce: A gap filling algorithm based on partitioned read sets
Li et al. A novel scaffolding algorithm based on contig error correction and path extension
Kuosmanen et al. On using Longer RNA-seq Reads to Improve Transcript Prediction Accuracy.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09831393

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2011539875

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2009831393

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 13132027

Country of ref document: US