CN107784198B - Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence - Google Patents

Publication number
CN107784198B
Authority
CN
China
Prior art keywords
sequence
generation
genome
level
framework
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610741984.8A
Other languages
Chinese (zh)
Other versions
CN107784198A (en)
Inventor
邓天全
贺丽娟
杨林峰
刘亚斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Technology Solutions Co Ltd
Original Assignee
BGI Technology Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Technology Solutions Co Ltd
Priority to CN201610741984.8A
Publication of CN107784198A
Application granted
Publication of CN107784198B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Abstract

The invention discloses a method and a system for the combined assembly of second-generation sequences and third-generation single-molecule real-time sequencing sequences. The method comprises: assembling the second-generation sequences to obtain a first-level second-generation genome scaffold sequence; filling gaps in the first-level second-generation genome scaffold sequence with the second-generation sequences to obtain a second-level second-generation genome scaffold sequence; filling gaps in the second-generation genome scaffold sequence with the third-generation single-molecule real-time sequencing sequences to obtain a first-level second/third-generation scaffold sequence; splicing the self-error-corrected third-generation single-molecule real-time sequencing sequences with the first-level second/third-generation scaffold sequence on the basis of their mutual overlaps to obtain a second-level second/third-generation scaffold sequence; aligning the second-generation sequences to the second-level second/third-generation scaffold sequence to identify invalid alignment regions, and replacing those regions with invalid sequence (N bases) to obtain a third-level second/third-generation scaffold sequence; and filling gaps in the third-level second/third-generation scaffold sequence with the second-generation sequences to obtain the final genome assembly sequence. The method can improve genome assembly metrics and accuracy.

Description

Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence
Technical Field
The invention relates to the technical field of nucleotide sequence assembly, in particular to a method and a system for jointly assembling a second-generation sequence and a third-generation single-molecule real-time sequencing sequence.
Background
At present, whole-genome shotgun sequencing (WGS) is the mainstream design for genome assembly. Libraries of DNA inserts of different lengths are combined for paired-end sequencing according to the repeat-sequence characteristics of the genome, so that single-base accuracy and genome completeness can be ensured provided the average whole-genome sequencing depth is sufficient. As next-generation sequencing (NGS) has matured and become widespread, sequencing costs have fallen sharply, and whole-genome shotgun sequencing based on second-generation sequencing technology has become the mainstream scheme for genome projects.
However, for complex genomes with high heterozygosity (heterozygosity being the state in which different alleles are present at one or more loci on homologous chromosomes) and a high proportion of repetitive sequences, this scheme is easily disturbed by those features: assembly results fall short of the required standard, and data analysis and assembly become difficult. The scheme is therefore not well suited to complex genomes.
N50 is defined as follows. The assembled contigs (Contig) or scaffold sequences (Scaffold) are sorted from longest to shortest and their lengths are accumulated; when the cumulative length first exceeds 50% of the total length of all assembled sequences, the length of the last contig or scaffold added is the N50. N50 is an important indicator for evaluating the completeness of a genome assembly.
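The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the patent: the function name `n50` and the example lengths are made up, and the sketch uses the common "reaches or exceeds 50%" convention.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that brings the cumulative length to at least
        # half of the total defines the N50.
        if running >= half_total:
            return length
    return 0  # empty input


# Made-up lengths: total is 290, half is 145, and 80 + 70 = 150 crosses it.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

The same function applies to both Contig N50 and Scaffold N50; only the list of lengths passed in differs.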
The third-generation PacBio single-molecule real-time (SMRT) sequencing technology currently available offers ultra-long read lengths and can assemble complex genomic regions, such as highly repetitive sequences, transposon regions and highly variable regions, at a high level. Contig N50 and Scaffold N50 become longer and the assembly more complete and accurate, so third-generation sequencing is being used for whole-genome assembly in more and more species. However, it has demanding sample requirements, high cost and a high single-base error rate (for example, sequences from the PacBio RS II platform have an average error rate of about 15%). If whole-genome self-assembly is performed with third-generation sequencing alone, an ordinary genome (neither highly repetitive nor highly heterozygous) generally requires a data volume of more than 50 times the genome size, and a complex genome requires even more, which is very expensive. At present this approach is therefore mainly used for bacterial and fungal genomes and for animal and plant genomes of a few hundred megabases (Mb) or less.
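The data-volume requirement above is simple arithmetic, sketched here in Python. The function names are illustrative assumptions; the 1.1 Gb genome size and the 17 Gb and 9.2 Gb data volumes are taken from the worked example later in this document.

```python
def required_data_gb(genome_size_gb, fold_depth):
    """Data volume (Gb) needed to reach a given average sequencing depth."""
    return genome_size_gb * fold_depth


def average_depth(data_gb, genome_size_gb):
    """Average sequencing depth (fold) given a data volume and genome size."""
    return data_gb / genome_size_gb


# Third-generation-only assembly of a 1.1 Gb genome at 50-fold depth
# would need about 55 Gb of data.
third_gen_only_gb = required_data_gb(1.1, 50)

# Depth actually represented by the example's filtered (17 Gb) and
# self-error-corrected (9.2 Gb) PacBio data.
filtered_depth = average_depth(17.0, 1.1)
corrected_depth = average_depth(9.2, 1.1)
```

Under these numbers the PacBio data in the example amounts to well under 50-fold depth, which is consistent with using it for gap filling and splicing rather than for stand-alone assembly.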
Given the low cost and high accuracy of second-generation sequencing and the ultra-long read length of third-generation PacBio sequencing, hybrid assembly of second-generation sequences and third-generation PacBio sequences is currently a good way to improve genome assembly metrics and accuracy. The hybrid assembly software DBG2OLC (reference: DBG2OLC: Efficient Assembly of Large Genomes Using the Compressed Overlap Graph (2014)) performs well on simple genomes, but its results on complex genomes are not ideal.
Disclosure of Invention
The invention provides a method and a system for the combined assembly of second-generation sequences and third-generation single-molecule real-time sequencing sequences, which can improve genome assembly metrics and accuracy.
According to a first aspect, the invention provides a method for the combined assembly of second-generation sequences and third-generation single-molecule real-time sequencing sequences, comprising: assembling the second-generation sequences to obtain a first-level second-generation genome scaffold sequence; filling gaps in the first-level second-generation genome scaffold sequence with the second-generation sequences to obtain a second-level second-generation genome scaffold sequence; filling gaps in the second-generation genome scaffold sequence with the third-generation single-molecule real-time sequencing sequences to obtain a first-level second/third-generation scaffold sequence; splicing the self-error-corrected third-generation single-molecule real-time sequencing sequences with the first-level second/third-generation scaffold sequence on the basis of their mutual overlaps to obtain a second-level second/third-generation scaffold sequence; aligning the second-generation sequences to the second-level second/third-generation scaffold sequence to identify invalid alignment regions, and replacing those regions with invalid sequence (N bases) to obtain a third-level second/third-generation scaffold sequence; and filling gaps in the third-level second/third-generation scaffold sequence with the second-generation sequences to obtain the final genome assembly sequence.
Further, the method also comprises: splicing the second-level second-generation genome scaffold sequence using the paired-end relationships among the second-generation sequence reads to obtain a third-level second-generation genome scaffold sequence.
Further, the method also comprises: performing self-error correction on the third-generation single-molecule real-time sequencing sequences using the alignment relationships among those sequences, to obtain self-error-corrected third-generation single-molecule real-time sequencing sequences.
Further, before self-error correction, adapters, short sequences and sequences with low quality values are filtered out of the third-generation single-molecule real-time sequencing sequences to obtain filtered sequences.
Further, the third-generation single-molecule real-time sequencing sequences used to fill gaps in the second-generation genome scaffold sequence are the sequences before self-error correction.
Alternatively, the third-generation single-molecule real-time sequencing sequences used to fill gaps in the second-generation genome scaffold sequence are the sequences after self-error correction.
Further, the step of obtaining the third-level second/third-generation scaffold sequence specifically comprises: aligning the second-generation sequences to the second-level second/third-generation scaffold sequence to obtain an alignment result; calculating the coverage of the second-level second/third-generation scaffold sequence to identify the valid-sequence regions that are not covered; and replacing the uncovered valid-sequence regions with invalid sequence (N bases) to obtain the third-level second/third-generation scaffold sequence.
According to a second aspect, the invention provides a system for the combined assembly of second-generation sequences and third-generation single-molecule real-time sequencing sequences, comprising: a second-generation sequence assembly unit for assembling the second-generation sequences to obtain a first-level second-generation genome scaffold sequence; a second-generation sequence gap-filling unit for filling gaps in the first-level second-generation genome scaffold sequence with the second-generation sequences to obtain a second-level second-generation genome scaffold sequence; a third-generation sequence gap-filling unit for filling gaps in the second-generation genome scaffold sequence with the third-generation single-molecule real-time sequencing sequences to obtain a first-level second/third-generation scaffold sequence; a first splicing unit for splicing the self-error-corrected third-generation single-molecule real-time sequencing sequences with the first-level second/third-generation scaffold sequence on the basis of their mutual overlaps to obtain a second-level second/third-generation scaffold sequence; an alignment and replacement unit for aligning the second-generation sequences to the second-level second/third-generation scaffold sequence to identify invalid alignment regions and replacing those regions with invalid sequence (N bases) to obtain a third-level second/third-generation scaffold sequence; and a final gap-filling unit for filling gaps in the third-level second/third-generation scaffold sequence with the second-generation sequences to obtain the final genome assembly sequence.
Further, the system also comprises: a second splicing unit for splicing the second-level second-generation genome scaffold sequence using the paired-end relationships among the second-generation sequence reads to obtain a third-level second-generation genome scaffold sequence.
Further, the system also comprises: a self-error-correction unit for performing self-error correction on the third-generation single-molecule real-time sequencing sequences using the alignment relationships among those sequences, to obtain self-error-corrected third-generation single-molecule real-time sequencing sequences.
In the genome assembly method and system provided by the invention, sequencing combines second-generation sequencing technology with third-generation single-molecule real-time sequencing, and assembly is carried out level by level, which improves assembly metrics and accuracy.
Drawings
FIG. 1 is a flow chart of one embodiment of self-error correction of third-generation single-molecule real-time sequencing (PacBio sequencing) sequences;
FIG. 2 is a flow chart of one embodiment of the combined assembly method for second-generation sequences and third-generation single-molecule real-time sequencing sequences of the present invention;
FIG. 3 is a flow chart of one embodiment of obtaining the invalid sequence of the genome in the combined assembly method for second-generation sequences and third-generation single-molecule real-time sequencing sequences of the present invention;
FIG. 4 is a block diagram of one embodiment of the combined assembly system for second-generation sequences and third-generation single-molecule real-time sequencing sequences of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and accompanying drawings.
In one embodiment, the invention provides a method and a system for the combined assembly of sequences obtained by second-generation sequencing technology and by third-generation (for example PacBio) single-molecule real-time (SMRT) sequencing technology. Sequencing combines whole-genome shotgun sequencing based on second-generation sequencing technology with third-generation (for example PacBio) single-molecule real-time sequencing, which addresses the assembly of both simple and complex genomes.
The terms involved in the present invention are described below:
The second-generation sequence refers to a sequencing sequence obtained by second-generation sequencing technology.
The third-generation single-molecule real-time sequencing sequence refers to a sequencing sequence obtained by third-generation sequencing technology, in particular a single-molecule real-time sequencing sequence as typified by PacBio sequencing; in the present invention it may also be called a "third-generation sequence".
The first-level second-generation genome scaffold sequence refers to the genome scaffold sequence obtained by the first-level assembly of the second-generation sequences, where "second-generation" indicates that the sequence is obtained by second-generation sequencing technology. Similarly, the second-level second-generation genome scaffold sequence refers to the genome scaffold sequence obtained by second-level processing of the first-level second-generation genome scaffold sequence, specifically by filling gaps in the first-level second-generation genome scaffold sequence. The third-level second-generation genome scaffold sequence refers to the genome scaffold sequence obtained by third-level processing of the second-level second-generation genome scaffold sequence.
The first-level second/third-generation scaffold sequence refers to the scaffold sequence obtained by the first level of joint processing of second-generation and third-generation sequences; in the invention, specifically, it is obtained by filling gaps in the second-generation genome scaffold sequence with the third-generation single-molecule real-time sequencing sequences. The second-generation genome scaffold sequence used here may be either the second-level or the third-level second-generation genome scaffold sequence. Similarly, the second-level second/third-generation scaffold sequence refers to the scaffold sequence obtained by second-level processing of the first-level second/third-generation scaffold sequence; in the invention, specifically, it is obtained by splicing the self-error-corrected third-generation single-molecule real-time sequencing sequences with the first-level second/third-generation scaffold sequence on the basis of their mutual overlaps. The third-level second/third-generation scaffold sequence refers to the scaffold sequence obtained by third-level processing of the second-level second/third-generation scaffold sequence.
The invention involves self-error correction of third-generation single-molecule real-time sequencing sequences, for example PacBio sequences. FIG. 1 shows a flow chart of one embodiment of self-error correction of third-generation single-molecule real-time sequencing (for example PacBio sequencing) sequences, comprising:
In step 102, the raw sequence data obtained by third-generation single-molecule real-time sequencing (for example PacBio sequencing) are filtered to remove adapters, short sequences and sequences with excessively low quality values, yielding third-generation single-molecule real-time sequencing sequence data of higher quality.
In step 104, because the average error rate of third-generation single-molecule real-time sequencing (for example PacBio sequencing) sequences is generally as high as 15%, the sequences filtered in step 102 are self-corrected using the alignment relationships among the sequences, in order to improve the efficiency of gap filling in step 208 and the accuracy of splicing in step 210 of FIG. 2 below. This yields the self-error-corrected third-generation single-molecule real-time sequencing sequences. For example, the error-correction software MHAP may be used (reference: Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing). The error-correction function of the FALCON assembly software may also be used; FALCON can be downloaded from https://github.com/PacificBiosciences/FALCON. After self-error correction the sequence error rate is lower and accuracy is higher; the data volume is also smaller, which shortens alignment time and improves efficiency.
FIG. 2 shows a flow chart of one embodiment of the combined assembly method for second-generation sequences and third-generation single-molecule real-time sequencing sequences, which specifically comprises:
In step 202, the second-generation sequences are assembled to obtain the first-level second-generation genome scaffold sequence.
To assemble the second-generation sequences, each read (sequencing sequence) is cut successively into short sequences of length K (for example 30-100 bp), called K-mers, with adjacent K-mers overlapping by K-1 bases. The K-mers are stored in a hash table and form the vertices of a de Bruijn graph; K-mers that are adjacent within a read are regarded as connected, forming the edges of the graph. Once all reads have been processed, the complete de Bruijn graph is obtained. Paths caused by sequencing errors and heterozygous sites are removed from the graph, and the bases along each linear K-mer path are joined to form first-level contig (Contig) sequences. The reads are then aligned to the contig sequences, and the relative positions and orientations of the contigs are established from the paired-end relationships of the reads, forming the first-level scaffold sequence (Scaffold), that is, the first-level second-generation genome scaffold sequence. Assembly at this stage can be performed with the assembly software SOAPdenovo or Platanus. For example, SOAPdenovo from BGI performs short-sequence assembly based on a de Bruijn graph to obtain the first-level scaffold sequence. Software reference: Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res (2009). The software is freely available online at http:// soap. The Platanus assembly software is likewise available at http:// plant.
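The K-mer and de Bruijn graph construction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SOAPdenovo or Platanus implementation: the function names are hypothetical, a plain dictionary stands in for the hash table, and error and heterozygosity pruning, reverse complements and in-degree checks are all omitted.

```python
def kmers(read, k):
    """Cut a read into successive K-mers that overlap by K-1 bases."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each K-mer (vertex) to the set of K-mers that follow it in a read."""
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for kmer in ks:
            graph.setdefault(kmer, set())
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)  # adjacent K-mers in a read form an edge
    return graph


def walk_linear_path(graph, start):
    """Join the bases along an unbranched path into a contig string.

    Simplification: only out-degree is checked; a real assembler also
    checks in-degree and prunes error-induced tips and bubbles first.
    """
    contig = start
    node = start
    visited = {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop on a cycle
            break
        contig += nxt[-1]  # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


# Two made-up reads that overlap by "GTAC".
toy_graph = build_de_bruijn(["ACGTAC", "GTACGG"], k=4)
toy_contig = walk_linear_path(toy_graph, "ACGT")
```

In this toy case the two reads share the K-mer `GTAC`, so the walk passes from the first read into the second and yields a single contig spanning both.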
In step 204, the second-generation sequences are used to fill gaps in the first-level second-generation genome scaffold sequence to obtain the second-level second-generation genome scaffold sequence.
After the first-level scaffold sequence of step 202 has been built, the invalid bases (N) in the scaffold sequence can be filled using the paired-end relationships among the reads. For example, gap filling can be performed with the KGF software from BGI, or with GapCloser, the gap-filling software that accompanies SOAPdenovo, which is freely available at soap. The gap-closing tool that accompanies Platanus (gap_close) can also be used for this stage.
In step 206, the second-level second-generation genome scaffold sequence is spliced using the paired-end relationships among the second-generation sequence reads to obtain the third-level second-generation genome scaffold sequence.
The genome scaffold sequences are spliced using the paired-end relationships among the reads; in this embodiment the SSPACE software can be used. This step serves to increase the scaffold (Scaffold) N50, but it is optional in the assembly method of the invention: step 206 may be skipped, proceeding directly from step 204 to the gap filling of step 208. In general, splicing in step 206 helps the assembled sequences become longer, so when the second-level second-generation genome scaffold sequence is not very long, this step can be performed to increase Scaffold N50 effectively. Splicing may, however, also introduce assembly errors, so the step can be omitted when the N50 of the second-level second-generation genome scaffold sequence already meets the required standard.
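The idea of splicing sequences by paired-end relationships can be illustrated with a small Python sketch. It is not how SSPACE works internally; the function name, the input format and the `min_support` threshold are hypothetical, and orientation and distance estimation are left out. It only shows the first step: counting read pairs whose two ends align to different sequences and keeping the well-supported links.

```python
from collections import Counter


def supported_links(pair_alignments, min_support=3):
    """Count read pairs that link two different sequences.

    pair_alignments: iterable of (name_of_sequence_hit_by_read_1,
    name_of_sequence_hit_by_read_2) tuples, one per read pair.
    min_support: hypothetical minimum number of pairs needed to trust a link.
    """
    counts = Counter()
    for first, second in pair_alignments:
        if first != second:  # pairs inside one sequence give no link
            counts[tuple(sorted((first, second)))] += 1
    return {link: n for link, n in counts.items() if n >= min_support}


# Made-up alignments: four pairs join s1 and s2, one pair joins s2 and s3.
toy_pairs = [("s1", "s2")] * 3 + [("s2", "s1"), ("s2", "s3"), ("s1", "s1")]
toy_links = supported_links(toy_pairs)
```

Requiring several supporting pairs per link reflects the trade-off noted above: more joins raise Scaffold N50, but weakly supported joins are a source of assembly errors.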
In step 208, the third-generation single-molecule real-time sequencing sequences are used to fill gaps in the second-generation genome scaffold sequence to obtain the first-level second/third-generation scaffold sequence.
The genome scaffold sequence obtained after step 206 or step 204 is gap-filled with third-generation single-molecule real-time sequencing (for example PacBio sequencing) sequences to obtain the first-level second/third-generation scaffold sequence. The third-generation sequences may be those obtained in step 102, before self-error correction, or those obtained in step 104, after self-error correction. This step can be implemented with the gap-filling software PBJelly.
In step 210, the self-error-corrected third-generation single-molecule real-time sequencing sequences are spliced with the first-level second/third-generation scaffold sequence on the basis of their mutual overlaps to obtain the second-level second/third-generation scaffold sequence.
In step 212, the second-generation sequences are aligned to the second-level second/third-generation scaffold sequence to identify invalid alignment regions, and those regions are replaced with invalid sequence (N bases) to obtain the third-level second/third-generation scaffold sequence.
The specific implementation of this step is shown in FIG. 3. First, in step 302, the second-generation sequences are aligned to the second-level second/third-generation scaffold sequence to obtain an alignment result; next, in step 304, the coverage of the second-level second/third-generation scaffold sequence is calculated to identify the valid-sequence regions that are not covered; finally, in step 306, the uncovered valid-sequence regions are replaced with invalid sequence (N bases) to obtain the third-level second/third-generation scaffold sequence.
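Steps 304 and 306 can be sketched in Python as a per-base coverage count followed by masking. This is a minimal illustration under assumptions that are not in the patent: alignments are given as 0-based half-open `(start, end)` intervals on one sequence, and the `min_depth` threshold is a hypothetical parameter (the text only speaks of regions that are "not covered").

```python
def mask_uncovered(scaffold, alignments, min_depth=1):
    """Replace bases not covered by aligned reads with the invalid base N.

    scaffold: sequence string, possibly already containing N runs.
    alignments: iterable of (start, end) half-open intervals of aligned reads.
    min_depth: hypothetical minimum read depth for a base to count as covered.
    """
    depth = [0] * len(scaffold)
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, len(scaffold))):
            depth[pos] += 1

    masked = []
    for base, d in zip(scaffold, depth):
        if base in "Nn" or d >= min_depth:
            masked.append(base)  # keep covered bases and existing N bases
        else:
            masked.append("N")   # uncovered valid base becomes invalid
    return "".join(masked)


# Made-up 10-base sequence with no read support at positions 4 and 5.
toy_masked = mask_uncovered("ACGTACGTAC", [(0, 4), (6, 10)])
```

The masked positions then become gaps that the final second-generation gap-filling step can attempt to close with accurately sequenced bases.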
In step 214, the second-generation sequences are used to fill gaps in the third-level second/third-generation scaffold sequence to obtain the final genome assembly sequence.
In this step, the invalid bases (N) in the scaffold sequence can be filled using the paired-end relationships among the reads. For example, gap filling can be performed with the KGF software from BGI, or with GapCloser, the gap-filling software that accompanies SOAPdenovo, which is freely available at soap. The gap-closing tool that accompanies Platanus (gap_close) can also be used for this stage.
Corresponding to the combined assembly method described above, an embodiment of the invention also provides a system for the combined assembly of second-generation sequences and third-generation single-molecule real-time sequencing sequences. As shown in FIG. 4, the system comprises: a second-generation sequence assembly unit 402 for assembling the second-generation sequences to obtain a first-level second-generation genome scaffold sequence; a second-generation sequence gap-filling unit 404 for filling gaps in the first-level second-generation genome scaffold sequence with the second-generation sequences to obtain a second-level second-generation genome scaffold sequence; a third-generation sequence gap-filling unit 408 for filling gaps in the second-generation genome scaffold sequence with the third-generation single-molecule real-time sequencing sequences to obtain a first-level second/third-generation scaffold sequence; a first splicing unit 410 for splicing the self-error-corrected third-generation single-molecule real-time sequencing sequences with the first-level second/third-generation scaffold sequence on the basis of their mutual overlaps to obtain a second-level second/third-generation scaffold sequence; an alignment and replacement unit 412 for aligning the second-generation sequences to the second-level second/third-generation scaffold sequence to identify invalid alignment regions and replacing those regions with invalid sequence (N bases) to obtain a third-level second/third-generation scaffold sequence; and a final gap-filling unit 414 for filling gaps in the third-level second/third-generation scaffold sequence with the second-generation sequences to obtain the final genome assembly sequence.
As a further improvement, the system also comprises: a second splicing unit 406 for splicing the second-level second-generation genome scaffold sequence using the paired-end relationships among the second-generation sequence reads to obtain a third-level second-generation genome scaffold sequence; and a self-error-correction unit 416 for performing self-error correction on the third-generation single-molecule real-time sequencing sequences using the alignment relationships among those sequences, to obtain self-error-corrected third-generation single-molecule real-time sequencing sequences.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic or optical disk, and the like.
A specific example applying the method of the invention to a plant genome of 1.1 Gb is given below. In this example, genome sequencing and assembly were carried out through the following steps:
(I) PacBio data processing
1) Adapter sequences were removed from the raw off-machine data (Raw data), together with short sequences of less than 500 bp and sequences with an RQ value below 0.8, giving 17 Gb of filtered sequence data.
2) The filtered sequences were self-corrected with the MHAP software, giving 9.2 Gb of self-error-corrected PacBio sequences. Reference: Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing.
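The length and quality filter of item 1) can be sketched in Python as follows. The 500 bp and 0.8 thresholds come from the example; the `Read` record, its field names and the function name are illustrative assumptions, and adapter removal is not modelled.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal record for one PacBio read."""
    name: str
    sequence: str
    rq: float  # read quality value reported for the read


def filter_reads(reads, min_length=500, min_rq=0.8):
    """Drop reads shorter than min_length or with RQ below min_rq."""
    return [
        read for read in reads
        if len(read.sequence) >= min_length and read.rq >= min_rq
    ]


# Made-up reads: r2 is too short, r3 has too low an RQ value.
toy_reads = [
    Read("r1", "A" * 600, 0.85),
    Read("r2", "A" * 400, 0.90),
    Read("r3", "A" * 700, 0.75),
]
kept_names = [read.name for read in filter_reads(toy_reads)]
```

Reads that sit exactly on a threshold (500 bp, or RQ 0.8) are kept here, matching the wording "less than 500 bp" and "below 0.8".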
(II) Assembly and gap filling of the second-generation sequences with the Platanus assembly software
1) Building contigs
To assemble the second-generation sequences, each read (sequencing sequence) is cut successively into short sequences of length K, called K-mers, with adjacent K-mers overlapping by K-1 bases. The K-mers are stored in a hash table and form the vertices of a de Bruijn graph; K-mers that are adjacent within a read are regarded as connected, forming the edges of the graph. Once all reads have been processed, the complete de Bruijn graph is obtained. Paths caused by sequencing errors and heterozygous sites are removed from the graph, and the bases along each linear K-mer path are joined to form first-level contig sequences. This assembly produced sequences totalling about 1029 Mb, with a Scaffold N50 of about 1583 bp and a Contig N50 of about 1583 bp.
2) Scaffold construction
The scaffold sequence is built from the paired-end information among the reads. Reads from libraries of different insert sizes are first aligned to the contig sequences; the order of the contigs is then determined from the paired-end reads whose two ends both align to contigs, and the contigs are arranged into scaffold sequences.
A scaffold sequence of about 971 Mb was obtained, with a Scaffold N50 of about 902 Kb and a Contig N50 of about 6.5 Kb.
3) Gap filling of the genome scaffold with the second-generation sequences
After the gap-closing step of Platanus (gap_close), a new scaffold sequence of about 962 Mb was obtained, with a Scaffold N50 of about 912 Kb and a Contig N50 of about 26 Kb.
With steps 1) to 3) complete, steps 202 and 204 of FIG. 2 have been carried out and the second-level second-generation genome scaffold sequence is obtained.
(III) splicing the skeleton sequences of the genome by utilizing the pairwise relation between reads
And (3) splicing the skeleton sequences of the second-level second-generation genome by using SSPACE software by utilizing the pair relationship between the second-generation small fragment and the large fragment to obtain the third-level second-generation genome skeleton sequence in the step 206 of the figure 2. The genomic backbone sequence size was approximately 975Mb, the size of Scaffold N50 was approximately 1283Kb, and the size of Contig N50 was approximately 26 Kb.
(IV) filling holes in the second-generation genome skeleton by using the self-error-corrected Pacbio sequence
Holes in the third-level second-generation genome framework are filled using the 9.2Gb of self-error-corrected Pacbio sequences and the third-generation hole-filling software PBJelly, to obtain the hole-filled first-level second-third-generation framework sequence.
After the PBJelly software hole filling, a new framework sequence with the size of about 1004Mb can be obtained, the Scaffold N50 is about 1305Kb, and the Contig N50 is about 139 Kb.
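The idea of filling an N region with a long read that spans it can be sketched as below. This is not how PBJelly works internally; it is a simplified illustration that assumes exact matches between the long read and the flanking bases on both sides of the hole, whereas real tools use error-tolerant alignment. The function name and the `flank` parameter are hypothetical.

```python
import re

def fill_gaps(scaffold, long_reads, flank=8):
    """Replace each run of N with long-read sequence bounded by both flanks."""
    out, pos = [], 0
    for gap in re.finditer(r"N+", scaffold):
        left = scaffold[max(0, gap.start() - flank):gap.start()]
        right = scaffold[gap.end():gap.end() + flank]
        filler = gap.group()  # keep the Ns if no read spans the hole
        if len(left) == flank and len(right) == flank:
            for read in long_reads:
                i = read.find(left)
                j = read.find(right, i + flank) if i != -1 else -1
                if j != -1:
                    filler = read[i + flank:j]
                    break
        out.append(scaffold[pos:gap.start()])
        out.append(filler)
        pos = gap.end()
    out.append(scaffold[pos:])
    return "".join(out)
```

A hole is left unchanged when no read contains both flanks, so the sketch never removes existing valid bases.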
(V) splicing the self-corrected Pacbio sequence and the genome framework sequence by utilizing the mutual overlapping relation
The 9.2Gb of self-error-corrected Pacbio sequences and the genome framework sequence are spliced using SSPACE-LongRead software to obtain a new second-level second-third-generation framework sequence with a size of about 1007Mb, wherein the Scaffold N50 is about 1504Kb and the Contig N50 is about 139 Kb.
The reference for SSPACE-LongRead is: SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information.
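Splicing by mutual overlap can be pictured with the following sketch, in which a long read that contains the end of one framework sequence and the start of another links the two. It is an illustration only, not the SSPACE-LongRead algorithm: exact matching, the `anchor` length, and the function names are all assumptions made for the example.

```python
def link_by_long_read(scaffolds, read, anchor=6):
    """Return (name_a, name_b, gap) if the read contains the end of a and the start of b."""
    for name_a, seq_a in scaffolds.items():
        i = read.find(seq_a[-anchor:])
        if i == -1:
            continue
        for name_b, seq_b in scaffolds.items():
            if name_b == name_a:
                continue
            j = read.find(seq_b[:anchor], i + anchor)
            if j != -1:
                # Bases of the read lying between the two anchors
                return name_a, name_b, j - (i + anchor)
    return None

def join(scaffolds, link):
    """Join two linked sequences, padding the estimated distance with N."""
    name_a, name_b, gap = link
    return scaffolds[name_a] + "N" * gap + scaffolds[name_b]
```

Joining in this way lengthens scaffolds without adding valid bases, which is consistent with the Scaffold N50 rising while the Contig N50 stays at about 139 Kb in this step.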
(VI) obtaining the invalid alignment region of the genome skeleton sequence and replacing the invalid sequence
SOAPaligner software is used to align the second-generation sequences to the second-level second-third-generation framework sequence, with all alignments of repeated sequences output; the alignment result is obtained and the coverage of the second-level second-third-generation framework sequence is counted; the valid sequence regions without coverage are selected; and finally, these uncovered valid sequence regions are replaced with invalid sequence to obtain the third-level second-third-generation framework sequence. The SOAPaligner software is freely available in soap.
A new Scaffold sequence of about 1007Mb in size was obtained by substitution, with a Scaffold N50 of about 1504Kb and a Contig N50 of about 62 Kb.
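The coverage-based replacement can be sketched as below. The sketch assumes that "invalid sequence" means N bases, as the N regions mentioned in the later hole-filling step suggest, and that alignments are given as half-open `(start, end)` intervals on the framework sequence; both the input format and the function name are hypothetical.

```python
def mask_uncovered(scaffold, alignments):
    """Replace bases not covered by any aligned read with N."""
    covered = [False] * len(scaffold)
    for start, end in alignments:
        for i in range(max(0, start), min(len(scaffold), end)):
            covered[i] = True
    return "".join(
        base if covered[i] or base in "Nn" else "N"
        for i, base in enumerate(scaffold)
    )
```

For example, `mask_uncovered("ACGTACGTAC", [(0, 4), (6, 10)])` returns "ACGTNNGTAC". Masking of this kind splits contigs at the uncovered regions while leaving scaffold lengths unchanged, which agrees with the Contig N50 falling from about 139 Kb to about 62 Kb while the Scaffold N50 stays at about 1504Kb.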
(VII) second generation sequence for filling holes in genome skeleton
When only one end of a paired-end read, or only part of a read, aligns to a contig, the other end can be located in an N region of the framework sequence according to the insert size, so that invalid bases in the framework sequence can be converted into valid bases. In this step, the KGF software of BGI is used for hole filling, together with GapCloser, the hole-filling software that accompanies SOAPdenovo; GapCloser can be freely obtained at the soap.
Hole filling with the KGF software gave a new framework sequence with a size of about 1006Mb, a Scaffold N50 of about 1504Kb and a Contig N50 of about 88Kb. Hole filling with the GapCloser software then gave the final genome assembly sequence with a size of 1006Mb, a Scaffold N50 of about 1503Kb and a Contig N50 of about 210 Kb.
The foregoing is a more detailed description of the present invention that is presented in conjunction with specific embodiments, and the practice of the invention is not to be considered limited to those descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A method for the combined assembly of second-generation sequences and third-generation single-molecule real-time sequencing sequences, characterized by comprising the following steps:
assembling the second-generation sequences to obtain a first-level second-generation genome skeleton sequence;
filling holes in the first-level second-generation genome skeleton sequence with the second-generation sequences to obtain a second-level second-generation genome skeleton sequence;
filling holes in the second-level second-generation genome skeleton sequence with the third-generation single-molecule real-time sequencing sequences to obtain a first-level second-third-generation framework sequence;
splicing the self-error-corrected third-generation single-molecule real-time sequencing sequences with the first-level second-third-generation framework sequence by utilizing their mutual overlapping relation to obtain a second-level second-third-generation framework sequence;
comparing the second-generation sequences with the second-level second-third-generation framework sequence to obtain an invalid comparison region, and replacing the region with an invalid sequence to obtain a third-level second-third-generation framework sequence;
and filling holes in the third-level second-third-generation framework sequence with the second-generation sequences to obtain a final genome assembly sequence.
2. The joint assembly method of claim 1, further comprising:
and splicing the second-level second-generation genome framework sequences by utilizing the pair relationship between the second-generation sequence reads to obtain a third-level second-generation genome framework sequence.
3. The joint assembly method of claim 1, further comprising:
performing self-error correction on the third-generation single-molecule real-time sequencing sequences by utilizing the comparison relation among the sequences to obtain the self-error-corrected third-generation single-molecule real-time sequencing sequences.
4. The joint assembly method of claim 3, wherein the third generation single molecule real-time sequencing sequence filters linkers, short sequences, and low quality value sequences to obtain filtered sequences prior to self-error correction.
5. The joint assembly method according to claim 1, wherein the third generation single molecule real-time sequencing sequence used for hole filling of the second generation genome backbone sequence is a sequence before self-error correction.
6. The joint assembly method according to claim 1, wherein the third generation single molecule real-time sequencing sequence used for hole filling of the second generation genome backbone sequence is a self-error-corrected sequence.
7. The joint assembly method according to claim 1, wherein the step of obtaining the third-level second-third-generation framework sequence specifically comprises:
comparing the second-generation sequences with the second-level second-third-generation framework sequence to obtain a comparison result;
calculating the coverage of the second-level second-third-generation framework sequence to determine the valid sequence regions of the second-level second-third-generation framework sequence that are not covered;
and replacing the uncovered valid sequence regions with an invalid sequence to obtain the third-level second-third-generation framework sequence.
8. A system for the combined assembly of second-generation sequences and third-generation single-molecule real-time sequencing sequences, the system comprising:
a second-generation sequence assembling unit for assembling the second-generation sequences to obtain a first-level second-generation genome skeleton sequence;
a second-generation sequence hole filling unit for filling holes in the first-level second-generation genome skeleton sequence with the second-generation sequences to obtain a second-level second-generation genome skeleton sequence;
a third-generation sequence hole filling unit for filling holes in the second-level second-generation genome skeleton sequence with the third-generation single-molecule real-time sequencing sequences to obtain a first-level second-third-generation framework sequence;
a first splicing unit for splicing the self-error-corrected third-generation single-molecule real-time sequencing sequences with the first-level second-third-generation framework sequence by utilizing their mutual overlapping relation to obtain a second-level second-third-generation framework sequence;
a comparison replacement unit for comparing the second-generation sequences with the second-level second-third-generation framework sequence to obtain an invalid comparison region and replacing the region with an invalid sequence to obtain a third-level second-third-generation framework sequence;
and a final hole filling unit for filling holes in the third-level second-third-generation framework sequence with the second-generation sequences to obtain a final genome assembly sequence.
9. The joint assembly system of claim 8, further comprising:
a second splicing unit for splicing the second-level second-generation genome framework sequence by utilizing the pair relationship between the second-generation sequence reads to obtain a third-level second-generation genome framework sequence.
10. The joint assembly system of claim 8, further comprising:
and the self-error correction unit is used for carrying out self-error correction on the third-generation single-molecule real-time sequencing sequence by utilizing the comparison relation among the sequences to obtain the self-error-corrected third-generation single-molecule real-time sequencing sequence.
CN201610741984.8A 2016-08-26 2016-08-26 Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence Active CN107784198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610741984.8A CN107784198B (en) 2016-08-26 2016-08-26 Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610741984.8A CN107784198B (en) 2016-08-26 2016-08-26 Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence

Publications (2)

Publication Number Publication Date
CN107784198A CN107784198A (en) 2018-03-09
CN107784198B true CN107784198B (en) 2021-06-15

Family

ID=61441081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610741984.8A Active CN107784198B (en) 2016-08-26 2016-08-26 Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence

Country Status (1)

Country Link
CN (1) CN107784198B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110858503A (en) * 2018-08-11 2020-03-03 中国科学院昆明动物研究所 Method for assembling genome de novo by comprehensively applying third-generation ultralong sequencing reads and second-generation linked reads
CN109411020B (en) * 2018-11-01 2022-02-11 中国水产科学研究院 Method for hole filling of whole genome sequence by using long sequencing reads
CN114657175A (en) * 2022-04-08 2022-06-24 武汉百奥微帆生物科技有限公司 Virus genome assembly method based on third-generation sequencing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104508130A (en) * 2012-06-29 2015-04-08 麻省理工学院 Massively parallel combinatorial genetics
CN104531848A (en) * 2014-12-11 2015-04-22 杭州和壹基因科技有限公司 Method and system for assembling genome sequence
CN104951672A (en) * 2015-06-19 2015-09-30 中国科学院计算技术研究所 Splicing method and system of second generation and third generation genomic sequencing data combination

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201220924D0 (en) * 2012-11-21 2013-01-02 Cancer Res Inst Royal Materials and methods for determining susceptibility or predisposition to cancer
US9670530B2 (en) * 2014-01-30 2017-06-06 Illumina, Inc. Haplotype resolved genome sequencing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
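Sagar M. Utturkar et al. Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences. Bioinformatics. 2014, Vol. 30 (No. 19), *
Research status of the assembly problem in high-throughput sequencing; Xu Penghao; Journal of Shandong Agriculture and Engineering University; 20160115 (No. 1); pp. 42-44 *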

Also Published As

Publication number Publication date
CN107784198A (en) 2018-03-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 1250820)
GR01 Patent grant