CN106022002B - A gap-filling method based on third-generation PacBio sequencing data - Google Patents

A gap-filling method based on third-generation PacBio sequencing data

Info

Publication number
CN106022002B
Authority
CN
China
Prior art keywords
region
data
filling
hole
generations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610325767.0A
Other languages
Chinese (zh)
Other versions
CN106022002A (en)
Inventor
詹东亮
蔡庆乐
王兆宝
罗亚丹
范崇仪
王军
王军一
范玉美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Original Assignee
HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Priority to CN201610325767.0A
Publication of CN106022002A
Application granted
Publication of CN106022002B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention proposes a gap-filling method based on third-generation PacBio sequencing data that greatly reduces alignment time during gap filling and significantly speeds up genome gap filling. Using the corresponding software, the third-generation PacBio sequencing data are aligned to the two ends of each gap in the genome; the aligned portions of the PacBio reads are intercepted and clustered according to the gap to which they belong; errors are corrected with the dazcon software; and the corrected data are used to join the sequences.

Description

A gap-filling method based on third-generation PacBio sequencing data
Technical field
The present invention relates to the technical field of bioinformatics, and in particular to a gap-filling method for DNA assembly that uses third-generation PacBio sequencing data to fill gaps in genomic data.
Background
Third-generation PacBio sequencing is known for its long read length. With the P6-C4 reagent currently used for sequencing, the average read length reaches 10-15 kb, and sequencing shows no obvious GC bias, so in theory it is well suited to filling gaps in a genome. The existing software for gap filling based on third-generation PacBio sequencing data is PBjelly, but it relies on the blasr aligner. Because blasr aligns very slowly, constructing the scaffolds as a whole also takes a very long time. In particular, for genomes larger than 1 Gb with sequencing depth above 10X, it usually takes several months.
Summary of the invention
The purpose of the present invention is to solve the problems raised above by proposing a gap-filling method based on third-generation PacBio sequencing data that greatly reduces alignment time during gap filling and significantly speeds up genome gap filling. Using the corresponding software, the third-generation PacBio sequencing data are aligned to the two ends of each gap in the genome; the aligned portions of the PacBio reads are intercepted and clustered according to the gap to which they belong; errors are corrected with the dazcon software; and the corrected data are used to join the sequences.
The present invention is achieved by the following technical solution:
The present invention is a gap-filling method based on third-generation PacBio sequencing data, comprising the following steps:
(1) extract unique-kmers from the contigs;
(2) use the unique-kmers as seeds, perform alignment, and intercept the aligned regions;
(3) cluster and error-correct the aligned regions;
(4) join the sequences using the error-corrected data.
Preferably, in step (1) the Jellyfish software is used to count k-mers in the second-generation Illumina sequencing data, k-mers that occur exactly once are taken as unique-kmers, and these unique-kmers are stored using a bit file or the GATB open-source package.
Preferably, for k ≤ 17 the unique-kmers are stored in a bit file (*.bit file) of size 2G, and for k > 17 they are stored in a (*.h5) file of the GATB open-source package.
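The 2G figure is consistent with the size of the k-mer space at k = 17: there are 4^17 = 2^34 possible k-mers, and one bit per k-mer needs 2^34 bits, which is 2 GiB. The sketch below illustrates such a presence bitmap. It assumes a 2-bit base encoding (A=0, C=1, G=2, T=3) and one bit per k-mer, neither of which the patent specifies; the names `bitmap_bytes`, `kmer_index`, and `KmerBitmap` are hypothetical.

```python
# Sketch of a one-bit-per-k-mer presence bitmap.
# Assumption: 2-bit base encoding; the patent does not describe the file layout.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def bitmap_bytes(k: int) -> int:
    """Bytes needed to hold one bit for each of the 4**k possible k-mers."""
    return (4 ** k + 7) // 8


def kmer_index(kmer: str) -> int:
    """Pack a k-mer into an integer, two bits per base."""
    index = 0
    for base in kmer:
        index = (index << 2) | ENCODE[base]
    return index


class KmerBitmap:
    """In-memory bitmap marking which k-mers are present."""

    def __init__(self, k: int):
        self.k = k
        self.bits = bytearray(bitmap_bytes(k))

    def add(self, kmer: str) -> None:
        i = kmer_index(kmer)
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, kmer: str) -> bool:
        i = kmer_index(kmer)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))
```

For k > 17 the bitmap grows by a factor of four per extra base (8 GiB at k = 18), which may be why a different storage format is used above that threshold.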
Preferably, step (2) comprises the following steps:
2.1 use the unique-kmers as seeds;
2.2 cluster the alignment relationships of the seeds in advance and compute the optimal alignment range;
If two reads can be aligned, they are collinear and the slope between their seeds is 1; the straight line that gathers the most points is taken as the aligned region.
2.3 align region by region;
The overall aligned region is first divided into small regions of 100 bp. Suppose it is divided into n regions containing a bases in total. LCS similarity is then computed for each small region. Suppose b regions have similarity greater than 0.8 and the small regions contain c similar bases in total. Similarity is evaluated in the following two dimensions:
Regional similarity = b/n
Base similarity = c/a
Finally, only alignments for which both similarity values are greater than 0.7 are retained.
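The two similarity measures in step 2.3 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: both sequences are cut at the same offsets (real PacBio alignments drift because of insertions and deletions), a is taken as the reference length, and c is summed over all windows. The function names are hypothetical.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def two_dimension_similarity(ref: str, read: str, window: int = 100,
                             window_cutoff: float = 0.8):
    """Return (regional similarity b/n, base similarity c/a)."""
    if not ref:
        raise ValueError("reference segment is empty")
    a = len(ref)          # total bases (assumption: reference length)
    n = b = c = 0
    for start in range(0, a, window):
        ref_win = ref[start:start + window]
        read_win = read[start:start + window]
        matched = lcs_length(ref_win, read_win)
        n += 1
        c += matched      # assumption: c sums similar bases over all windows
        if matched / len(ref_win) > window_cutoff:
            b += 1
    return b / n, c / a


def keep_alignment(ref: str, read: str, cutoff: float = 0.7) -> bool:
    """Retain the alignment only if both measures exceed the cutoff."""
    regional, base = two_dimension_similarity(ref, read)
    return regional > cutoff and base > cutoff
```

Because each LCS table covers only a 100 bp window, memory stays small regardless of how long the overall aligned region is.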
Preferably, steps (3) and (4) comprise the following steps:
3.1 extend the aligned region by a certain length on each side (this can be set to 500 bp), intercept it, and record the gap to which the region corresponds;
3.2 cluster the intercepted regions by the gap to which they belong;
3.3 error-correct the data of each cluster with the dazcon software, then join the data.
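Steps 3.1 and 3.2 amount to clipping each read around its aligned region and grouping the clipped pieces by gap (the "hole" between adjacent contigs). A minimal sketch, assuming 0-based half-open coordinates on the read and hypothetical names (`Hit`, `intercept`, `cluster_by_gap`); the dazcon error-correction call of step 3.3 is deliberately left out because its invocation is not described in the text.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One aligned region of a read, tagged with the gap it belongs to."""
    gap_id: str
    read_id: str
    start: int  # aligned region start on the read (0-based, inclusive)
    end: int    # aligned region end on the read (exclusive)


def intercept(read_seq: str, start: int, end: int, flank: int = 500) -> str:
    """Extend the aligned region by `flank` bases on each side, clipped to the read."""
    lo = max(0, start - flank)
    hi = min(len(read_seq), end + flank)
    return read_seq[lo:hi]


def cluster_by_gap(hits, reads, flank: int = 500):
    """Group intercepted read segments by the gap they were aligned to."""
    clusters = defaultdict(list)
    for hit in hits:
        segment = intercept(reads[hit.read_id], hit.start, hit.end, flank)
        clusters[hit.gap_id].append(segment)
    return dict(clusters)
```

Each value of the returned dictionary is the set of segments that would be handed to the error-correction step for one gap.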
Compared with existing software, the advantages of this method are:
1. Unique-kmers are extracted from the contigs, which improves accuracy and reduces alignment time.
A genome contains many repetitive sequences, and some short tandem repeats occur hundreds or even thousands of times. These affect the accuracy of alignment software and increase alignment time. To improve alignment accuracy and reduce alignment time, this method extracts the k-mers that occur only once in the contigs as unique-kmers, and only unique-kmers are used as seeds during alignment. The Jellyfish software is used here to count k-mers and select the unique-kmers.
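The idea of keeping only k-mers that occur once can be illustrated without Jellyfish. The sketch below substitutes an in-memory `collections.Counter` for Jellyfish, so it only suits small inputs. It counts forward-strand k-mers only and skips k-mers containing N, both of which are assumptions; the function name `unique_kmers` is hypothetical.

```python
from collections import Counter


def unique_kmers(contigs, k: int) -> set:
    """Return the k-mers that occur exactly once across all contigs."""
    counts = Counter()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # assumption: ambiguous bases are skipped
                counts[kmer] += 1
    return {kmer for kmer, count in counts.items() if count == 1}
```

For example, in the contig `ACGTACGA` with k = 3 the k-mer `ACG` occurs twice and is discarded, while `CGT`, `GTA`, `TAC`, and `CGA` are kept as seeds.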
2. The third-generation PacBio sequencing data are aligned to the two ends of each gap in the genome and the aligned data are intercepted, which saves memory, reduces alignment time, and improves accuracy.
Many alignment methods use the longest common subsequence (LCS) algorithm. This method also uses it for alignment, but with the following three improvements:
1) unique-kmers are used as seeds;
2) the alignment relationships of the seeds are clustered in advance and the optimal alignment range is computed.
If two reads can be aligned, they are collinear and the slope between their seeds is 1. The straight line that gathers the most points is taken as the aligned region.
3) alignment is performed region by region.
Most alignment software computes the longest common subsequence (LCS) directly over the whole region. For large aligned regions, for example regions longer than 100 kb, computing over the whole region both wastes memory and takes a great deal of time. The improvement made in this method solves this problem and also significantly improves accuracy.
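The seed clustering in improvement 2) can be pictured as follows: if seeds lie on a line of slope 1, the difference between read position and reference position (the diagonal) is constant, so the densest diagonal marks the candidate aligned region. The sketch below bins seeds by diagonal. The band width of 50 bp is a hypothetical tolerance for indel drift in long reads, not a value from the patent, and `best_diagonal` is a hypothetical name.

```python
from collections import defaultdict


def best_diagonal(seeds, band: int = 50):
    """Pick the diagonal band holding the most seed hits.

    seeds: iterable of (read_pos, ref_pos) pairs from shared unique k-mers.
    Returns ((read_min, read_max), (ref_min, ref_max), seed_count),
    or None when there are no seeds.
    """
    bins = defaultdict(list)
    for read_pos, ref_pos in seeds:
        # Seeds on one slope-1 line share (read_pos - ref_pos).
        bins[(read_pos - ref_pos) // band].append((read_pos, ref_pos))
    if not bins:
        return None
    best = max(bins.values(), key=len)
    read_positions = [r for r, _ in best]
    ref_positions = [t for _, t in best]
    return ((min(read_positions), max(read_positions)),
            (min(ref_positions), max(ref_positions)),
            len(best))
```

Fixed bins can split one true diagonal across two neighbouring bands; a production version would likely merge adjacent bands or use a sliding window.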
For the aligned third-generation PacBio sequencing data, the aligned region is extended by a certain length at each end (typically set to 500 bp) and that portion is intercepted. This ensures that the PacBio sequencing data and the DNA sequences at the two ends of the aligned gap share a common region.
3. The aligned third-generation PacBio sequencing data are clustered, error-corrected, and joined, which saves gap-filling time.
The data intercepted in the previous step are clustered by the gap to which they belong, the clustered data of each gap are error-corrected with the dazcon software, and sequences are joined according to the common sequence shared by the corrected data and the two ends of the gap, completing the gap filling. The advantage is that only the gap regions are error-corrected; there is no need to correct whole sequences, which greatly reduces gap-filling time.
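The final joining step relies on the corrected sequence sharing sequence with both ends of the gap. One simple way to picture it, assuming exact matches between the corrected sequence and short anchors taken from the contig ends, is shown below. The patent does not describe how the shared sequence is located, so the exact-match search, the 20 bp anchor length, and the name `fill_gap` are all illustrative assumptions.

```python
def fill_gap(left_contig: str, right_contig: str, corrected: str,
             anchor: int = 20):
    """Join two contigs through an error-corrected read segment.

    Looks for the last `anchor` bases of the left contig and the first
    `anchor` bases of the right contig inside the corrected sequence and
    uses the bases between them as the gap fill. Returns None if either
    anchor is not found.
    """
    left_anchor = left_contig[-anchor:]
    right_anchor = right_contig[:anchor]
    i = corrected.find(left_anchor)
    if i < 0:
        return None
    fill_start = i + len(left_anchor)
    j = corrected.find(right_anchor, fill_start)
    if j < 0:
        return None
    return left_contig + corrected[fill_start:j] + right_contig
```

Real data would need a tolerant comparison in place of `str.find`, since residual errors can remain after correction.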
Description of the drawings
Fig. 1: flow chart of the present invention.
Detailed description of the embodiments
An embodiment of the present invention is further described below with reference to the accompanying drawing:
Embodiment:
1. Extract unique-kmers from the contigs. In step (1) the Jellyfish software is used to count k-mers in the second-generation Illumina sequencing data, and k-mers that occur exactly once are taken as unique-kmers. For k ≤ 17 they are stored in a bit file (*.bit file) of size 2G, and for k > 17 they are stored in a (*.h5) file of the GATB open-source package. Here, the fragments of length k into which all the data are broken are called k-mers, and second-generation Illumina sequencing data are second-generation sequencing data obtained with sequencers from Illumina.
A program is written according to the above method; its usage is as follows:
Put the contig paths into a file, file.lst
Then run the program to obtain the unique-kmers:
Because k=17 is chosen, the result is stored in a bit file: k17.bit
2. Use the unique-kmers as seeds, perform alignment, and intercept the aligned regions;
2.1 use the unique-kmers as seeds;
2.2 cluster the alignment relationships of the seeds in advance and compute the optimal alignment range;
If two reads can be aligned, they are collinear and the slope between their seeds is 1; the straight line that gathers the most points is taken as the aligned region.
2.3 align region by region.
The overall aligned region is first divided into small regions of 100 bp. Suppose it is divided into n regions containing a bases in total. LCS similarity is then computed for each small region. Suppose b regions have similarity greater than 0.8 and the small regions contain c similar bases in total. Similarity is evaluated in the following two dimensions:
Regional similarity = b/n
Base similarity = c/a
Finally, only alignments for which both similarity values are greater than 0.7 are retained.
3. Cluster and error-correct the aligned regions, and join the sequences using the corrected data.
3.1 extend the aligned region by a certain length on each side (this can be set to 500 bp), intercept it, and record the gap to which the region corresponds;
3.2 cluster the intercepted regions by the gap to which they belong;
3.3 error-correct the data of each cluster with the dazcon software, then join the data.
According to the alignment and gap-filling method described above, alignment and gap filling are written as a single pipeline for convenient invocation; usage is as follows:
Description of how to prepare input.cfg:
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the core technical features of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (4)

1. A gap-filling method based on third-generation PacBio sequencing data, characterized in that the gap-filling method comprises the following steps:
(1) extract unique-kmers from the contigs;
(2) use the unique-kmers as seeds, perform alignment, and intercept the aligned regions, comprising the following steps:
2.1 use the unique-kmers as seeds;
2.2 cluster the alignment relationships of the seeds in advance and compute the optimal alignment range;
if two reads can be aligned, they are collinear and the slope between their seeds is 1, and the straight line that gathers the most points is taken as the aligned region;
2.3 align region by region;
the overall aligned region is first divided into small regions of 100 bp; suppose it is divided into n regions containing a bases in total; LCS similarity is then computed for each small region; suppose b regions have similarity greater than 0.8 and the small regions contain c similar bases in total; similarity is evaluated in the following two dimensions:
Regional similarity = b/n;
Base similarity = c/a;
finally, only alignments for which both similarity values are greater than 0.7 are retained;
(3) cluster and error-correct the aligned regions;
(4) join the sequences using the error-corrected data.
2. The gap-filling method based on third-generation PacBio sequencing data according to claim 1, characterized in that in step (1) the Jellyfish software is used to count k-mers in the second-generation Illumina sequencing data, k-mers that occur exactly once are taken as unique-kmers, and these unique-kmers are stored using a bit file or the GATB open-source package.
3. The gap-filling method based on third-generation PacBio sequencing data according to claim 2, characterized in that for k ≤ 17 the unique-kmers are stored in a bit file (*.bit) of size 2G, and for k > 17 they are stored in a *.h5 file of the GATB open-source package.
4. The gap-filling method based on third-generation PacBio sequencing data according to claim 1, characterized in that steps (3) and (4) comprise the following steps:
3.1 extend the aligned region by a certain length on each side, intercept it, and record the gap to which the region corresponds;
3.2 cluster the intercepted regions by the gap to which they belong;
3.3 error-correct the data of each cluster with the dazcon software, then join the data.
CN201610325767.0A 2016-05-17 2016-05-17 A gap-filling method based on third-generation PacBio sequencing data Active CN106022002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610325767.0A CN106022002B (en) 2016-05-17 2016-05-17 A gap-filling method based on third-generation PacBio sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610325767.0A CN106022002B (en) 2016-05-17 2016-05-17 A gap-filling method based on third-generation PacBio sequencing data

Publications (2)

Publication Number Publication Date
CN106022002A CN106022002A (en) 2016-10-12
CN106022002B true CN106022002B (en) 2019-03-29

Family

ID=57097500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610325767.0A Active CN106022002B (en) 2016-05-17 2016-05-17 A gap-filling method based on third-generation PacBio sequencing data

Country Status (1)

Country Link
CN (1) CN106022002B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599617B (en) * 2016-12-20 2019-02-15 福建师范大学 An error-correction method for massive sequencing data running on a distributed system
CN108629156B (en) * 2017-03-21 2020-08-28 深圳华大基因科技服务有限公司 Method, device and computer-readable storage medium for error correction of third-generation sequencing data
CN107256335A (en) * 2017-06-02 2017-10-17 肖传乐 A third-generation sequencing sequence alignment method based on global seed scoring optimization
CN107229842A (en) * 2017-06-02 2017-10-03 肖传乐 A third-generation sequencing sequence correction method based on local graphs
CN108763871B (en) * 2018-06-05 2022-05-31 北京诺禾致源科技股份有限公司 Gap-filling method and device based on third-generation sequencing sequences
CN109411020B (en) * 2018-11-01 2022-02-11 中国水产科学研究院 Method for gap filling of whole-genome sequences using long sequencing reads

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104164479A (en) * 2014-04-04 2014-11-26 深圳华大基因科技服务有限公司 Heterozygous genome processing method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104017883B (en) * 2014-06-18 2015-11-18 深圳华大基因科技服务有限公司 Method and system for assembling genome sequences
CN104531848A (en) * 2014-12-11 2015-04-22 杭州和壹基因科技有限公司 Method and system for assembling genome sequence
CN104951672B (en) * 2015-06-19 2017-08-29 中国科学院计算技术研究所 An assembly method and system combining second-generation and third-generation genome sequencing data

Also Published As

Publication number Publication date
CN106022002A (en) 2016-10-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant