CN106021997B - An alignment method for third-generation PacBio sequencing data - Google Patents
- Publication number: CN106021997B
- Application number: CN201610329027.4A
- Authority: CN (China)
- Prior art keywords: kmer, unique, generations, comparison, sequencing data
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Landscapes
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides an alignment method for third-generation PacBio sequencing data that effectively reduces alignment errors caused by repetitive sequences. A k-mer model is built from second-generation Illumina data and unique k-mers (unique-kmers) are extracted from it; when aligning third-generation PacBio sequencing data, these unique-kmers serve as the alignment seeds, which greatly reduces the influence of repetitive sequences and increases alignment speed.
Description
Technical field
The present invention relates to the field of bioinformatics, and in particular to a method for aligning DNA sequences. It models second-generation Illumina sequencing data to extract key information, and uses that information to assist the alignment of third-generation PacBio sequencing data.
Background
Third-generation PacBio sequencing data has a single-pass error rate of about 15%, and few alignment tools are designed specifically for third-generation data. The two most widely used at present are: (1) blasr; (2) dalign.
Both are excellent third-generation aligners that tolerate the high error rate of PacBio data. However, genomes themselves contain repetitive sequences that are highly similar to one another. These aligners also align and report such repeats, which affects downstream biological analyses (for example assembly and expression analysis).
Summary of the invention
The purpose of the present invention is to solve the problem described above by providing an alignment method for third-generation PacBio sequencing data that effectively reduces alignment errors caused by repetitive sequences. A k-mer model is built from second-generation Illumina data and unique-kmers are extracted from it; when aligning third-generation PacBio sequencing data, these unique-kmers serve as the alignment seeds, which greatly reduces the influence of repetitive sequences and increases alignment speed.
The present invention is achieved through the following technical solution:
The present invention is an alignment method for third-generation PacBio sequencing data, comprising the following steps:
(1) build a k-mer model from Illumina sequencing data and extract unique-kmers from it;
(2) use the unique-kmers as alignment seeds to screen candidate reads;
(3) align the candidate reads in detail.
Preferably, k-mer counting is performed on the second-generation Illumina sequencing data using the jellyfish software; based on the k-mer distribution plot, the k-mers within twice the main peak are taken as unique-kmers, and the unique-kmers are stored using a bit file or the open-source GATB package.
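The selection rule can be illustrated with a minimal sketch. It uses plain Python counting as a stand-in for jellyfish, and the names `count_kmers`, `select_unique_kmers` and `min_freq` are hypothetical. Two points are assumptions rather than statements of the patent: frequencies below `min_freq` are ignored when locating the main peak (otherwise low-frequency error k-mers could be mistaken for the peak), and "within twice the main peak" is read as an inclusive upper bound.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def select_unique_kmers(counts, min_freq=2):
    """Return (main peak frequency, set of k-mers with frequency <= 2 * peak)."""
    # Histogram: frequency -> number of distinct k-mers with that frequency.
    histogram = Counter(counts.values())
    # Assumption: frequencies below min_freq are treated as sequencing
    # errors and skipped when searching for the main peak.
    candidates = {f: n for f, n in histogram.items() if f >= min_freq}
    # Highest bar wins; ties are broken in favour of the lower frequency.
    peak = max(candidates, key=lambda f: (candidates[f], -f))
    # Assumption: the bound "within twice the main peak" is inclusive.
    unique = {kmer for kmer, c in counts.items() if c <= 2 * peak}
    return peak, unique
```

With real data the counts would come from a k-mer counter rather than from `count_kmers`; only the thresholding step is the point of the sketch.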
Preferably, for k ≤ 17 the unique-kmers are stored in a single bit file (*.bit) of size 2G, while for k > 17 they are stored in a (*.h5) file of the open-source GATB package.
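The 2G figure is consistent with one bit per possible k-mer: with four bases there are 4^17 = 2^34 possible 17-mers, and 2^34 bits is exactly 2 GiB. The sketch below shows one way such a bit set could work. The 2-bit base encoding, the bit order and the names `kmer_to_index` and `KmerBitSet` are assumptions; the patent does not describe the layout of the *.bit file.

```python
# Assumed 2-bit encoding of the four bases (not specified by the patent).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_index(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    index = 0
    for base in kmer:
        index = (index << 2) | BASE_CODE[base]
    return index


def bitset_size_bytes(k):
    """Bytes needed to hold one bit for each of the 4**k possible k-mers."""
    return (4 ** k + 7) // 8


class KmerBitSet:
    """Membership set for k-mers backed by a flat bit array."""

    def __init__(self, k):
        self.k = k
        self.bits = bytearray(bitset_size_bytes(k))

    def add(self, kmer):
        i = kmer_to_index(kmer)
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, kmer):
        i = kmer_to_index(kmer)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))
```

Because the array grows by a factor of four with each extra base, this layout stops being practical beyond k = 17, which fits the switch to a different storage format for larger k. K-mers containing characters other than A, C, G and T are not handled in this sketch.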
Preferably, in step (2), using the unique-kmers from step (1), reads that share more than 3 unique-kmers with each other are selected as candidate reads.
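A minimal sketch of this screening step follows. The function name `screen_candidates` and the inverted-index approach are illustrative choices, not the patent's program. The sketch also assumes reads are compared on the forward strand only; reverse-complement handling is left out.

```python
from collections import defaultdict
from itertools import combinations


def screen_candidates(reads, k, unique_kmers, min_shared=3):
    """Return sorted pairs (i, j) of read indices that share more than
    min_shared distinct unique k-mers."""
    # Inverted index: unique k-mer -> ids of the reads that contain it.
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in unique_kmers:
                index[kmer].add(read_id)
    # Count, for each pair of reads, how many unique k-mers they share.
    shared = defaultdict(int)
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            shared[(a, b)] += 1
    return sorted(pair for pair, n in shared.items() if n > min_shared)
```

Since a unique-kmer is expected to occur in few reads, each index entry stays short, so pair counting avoids comparing every read against every other read.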
Preferably, step (3) comprises the following steps:
a. First cluster the aligned seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the x-axis is the aligned position on read1 and the y-axis is the aligned position on read2. Each point represents a seed shared by the two reads. Cluster the seeds along straight lines of slope 1, and take the line that gathers the most points as the aligned region;
b. Then divide the alignment range into small regions, compute the similarity of each region with the LCS algorithm, and score the alignment as a whole, as follows:
Suppose the alignment range is divided into n regions, b of which have similarity greater than 0.8, and the similar bases across these small regions total c. The region similarity is then b/n and the base similarity is c/a; only data for which both values are greater than 0.7 are finally retained.
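The scoring in step b can be sketched as follows, using the 100 bp segment length mentioned in the embodiment. Several details are assumptions because the text leaves them open: a is taken to be the total number of bases in the alignment range, c is summed over all segments, both reads are cut at the same offsets, and a segment's similarity is its LCS length divided by the longer of the two pieces. The names `lcs_length` and `score_alignment` are hypothetical.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence, using two DP rows."""
    prev = [0] * (len(t) + 1)
    for ch_s in s:
        cur = [0]
        for j, ch_t in enumerate(t, 1):
            if ch_s == ch_t:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score_alignment(seq1, seq2, segment_len=100,
                    region_cutoff=0.8, keep_cutoff=0.7):
    """Score two substrings covering the alignment range, segment by segment."""
    # Assumption: a is the number of bases in the alignment range.
    a = max(len(seq1), len(seq2))
    if a == 0:
        return {"region_similarity": 0.0, "base_similarity": 0.0, "keep": False}
    n = b = c = 0
    for start in range(0, a, segment_len):
        s = seq1[start:start + segment_len]
        t = seq2[start:start + segment_len]
        common = lcs_length(s, t)
        n += 1
        # Assumption: c accumulates similar bases over all segments.
        c += common
        if common / max(len(s), len(t)) > region_cutoff:
            b += 1
    region_similarity = b / n
    base_similarity = c / a
    keep = region_similarity > keep_cutoff and base_similarity > keep_cutoff
    return {"region_similarity": region_similarity,
            "base_similarity": base_similarity,
            "keep": keep}
```

Running LCS on 100 bp pieces keeps each dynamic-programming table small, whereas a single LCS over a range of length L needs time proportional to L squared.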
The beneficial effects of the present invention are as follows:
1. Extracting unique-kmers from second-generation Illumina sequencing data improves alignment accuracy and speed.
Genomes contain many repetitive sequences, and some short tandem repeats even occur hundreds or thousands of times, which lowers alignment accuracy and lengthens alignment time. To improve accuracy and shorten alignment time, we extract the k-mers that occur only once in the contigs as unique-kmers. Because second-generation Illumina sequencing data is of very high quality, when the sequencing is sufficiently random and deep (typically ~40x), running Jellyfish k-mer counting on the Illumina data yields the k-mer distribution plot (Fig. 1). The k-mers within twice the peak value are taken as unique-kmers. For k ≤ 17 they are stored in a single bit file (*.bit file) of size 2G; for k > 17 they are stored in a file (*.h5 file) using GATB (an open-source framework). The second-generation Illumina sequencing data used are of high quality, and Jellyfish runs multithreaded, is fast, and uses little memory; together these give the method high data-processing quality and a clear processing-speed advantage;
2. Using unique-kmers as alignment seeds to screen candidate reads saves alignment time and increases alignment speed.
Because a unique-kmer, in probability and in theory, occurs only once in a haploid genome, the influence of repetitive sequences is avoided. In turn, because repeats no longer interfere, the candidate reads found are highly accurate, which saves considerable alignment time and greatly increases alignment speed.
3. Aligning the candidate reads in detail saves memory and alignment time and increases alignment speed.
Many alignment tools use the longest common subsequence (LCS) algorithm and run it directly over the whole region, which wastes a great deal of memory and time for alignment regions larger than 100k. This method also uses LCS, but with two improvements: (1) the seed matches are clustered in advance to compute the best alignment range; (2) the range is aligned region by region. This saves memory and alignment time and increases alignment speed.
Description of the drawings
Fig. 1: k-mer distribution plot
All data are broken into fragments of length k (called k-mers). The x-axis is the k-mer frequency and the y-axis is the number of distinct k-mers with that frequency. The k-mers within twice the peak value are taken as unique-kmers.
Fig. 2: Schematic of computing the alignment range
Each point in the figure represents a seed shared by the two reads; the x-axis is the aligned position on read1 and the y-axis is the aligned position on read2. The seeds are clustered along straight lines of slope 1, the line with the most clustered points is selected, and that region is taken as the aligned range.
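The clustering shown in Fig. 2 can be sketched by grouping seeds on their diagonal offset: points on a line of slope 1 satisfy y = x + offset, so seeds with the same value of y - x lie on the same line. The function name `best_diagonal` and the `band` parameter are hypothetical. Grouping nearby offsets into bands wider than 1 is an assumption added here to tolerate small shifts between diagonals; the patent only mentions lines of slope 1.

```python
from collections import defaultdict


def best_diagonal(seeds, band=1):
    """Cluster seeds (x, y) by diagonal offset y - x and return the largest
    cluster together with its x range and y range."""
    if not seeds:
        return [], None, None
    groups = defaultdict(list)
    for x, y in seeds:
        # Seeds whose offsets fall into the same band share a bucket.
        groups[(y - x) // band].append((x, y))
    best = max(groups.values(), key=len)
    xs = [x for x, _ in best]
    ys = [y for _, y in best]
    return best, (min(xs), max(xs)), (min(ys), max(ys))
```

The returned ranges give the span on read1 and read2 that the detailed alignment would then cover.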
Fig. 3: flow chart of the present invention
Detailed description of the embodiments
The embodiment of the present invention is further described below with reference to the drawings:
Embodiment:
(1) Build a k-mer model from second-generation Illumina sequencing data and extract unique-kmers from it
K-mer counting is performed on the second-generation Illumina sequencing data using the jellyfish software: all data are broken into fragments of length k (called k-mers), and in the resulting distribution plot the x-axis is the k-mer frequency and the y-axis is the number of distinct k-mers with that frequency. Based on the k-mer distribution plot, the k-mers within twice the main peak are taken as unique-kmers. For k ≤ 17 they are stored in a single bit file (*.bit) of size 2G; for k > 17 they are stored in a (*.h5) file of the open-source GATB package. Here, second-generation Illumina sequencing data means second-generation sequencing data produced by Illumina sequencers.
Following the above method, a program was written to extract unique-kmers. Its usage is as follows:
A concrete example run is as follows:
From the second-generation Illumina sequencing data, select about 40X of data and write it to a file named fq.lst:
Then run the program to obtain the unique-kmers:
Because k=17 was chosen, the result is stored in a bit file: k17.bit
(2) Align against third-generation PacBio sequencing data using the unique-kmers to screen candidate reads
These unique-kmers are used as the alignment seeds; reads that share more than 3 unique-kmers are taken as candidate reads. Here, third-generation PacBio sequencing data means third-generation sequencing data produced by PacBio sequencers.
Following the above method, an alignment program was written to align third-generation PacBio sequencing data. Its usage is as follows:
A concrete example run is as follows:
Given two third-generation PacBio sequencing data files, read1.fa and read2.fa, together with the unique-kmer file extracted from the second-generation Illumina sequencing data, k17.bit, run the following command to perform the alignment:
(3) Align the candidate reads in detail.
a. First cluster the aligned seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the x-axis is the aligned position on read1 and the y-axis is the aligned position on read2. Each point represents a seed shared by the two reads. Cluster the seeds along straight lines of slope 1, and take the line that gathers the most points as the aligned region;
b. Then divide the alignment range into small regions (the segment length can be set to 100bp), compute the similarity of each region with the LCS algorithm, and score the alignment as a whole, as follows:
Suppose the alignment range is divided into n regions, b of which have similarity greater than 0.8, and the similar bases across these small regions total c. The region similarity is then b/n and the base similarity is c/a; only data for which both values are greater than 0.7 are finally retained.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the core technical features of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.
Claims (4)
1. An alignment method for third-generation PacBio sequencing data, characterized in that it comprises the following steps:
(1) build a k-mer model from second-generation Illumina sequencing data and extract unique-kmers from it;
(2) use the unique-kmers as alignment seeds to align against third-generation PacBio sequencing data and screen out candidate reads;
(3) align the candidate reads in detail, comprising the following steps:
a. First cluster the aligned seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the x-axis is the aligned position on read1 and the y-axis is the aligned position on read2. Each point represents a seed shared by the two reads. Cluster the seeds along straight lines of slope 1, and take the line that gathers the most points as the aligned region;
b. Then divide the alignment range into small regions, compute the similarity of each region with the LCS algorithm, and score the alignment as a whole, as follows:
Suppose the alignment range is divided into n regions, b of which have similarity greater than 0.8, and the similar bases across these small regions total c. The region similarity is then b/n and the base similarity is c/a; only data for which both values are greater than 0.7 are finally retained.
2. The alignment method for third-generation PacBio sequencing data according to claim 1, characterized in that, in step (1), k-mer counting is performed on the second-generation Illumina sequencing data using the jellyfish software, the k-mers within twice the main peak of the k-mer distribution plot are taken as unique-kmers, and the unique-kmers are stored using a bit file or the open-source GATB package.
3. The alignment method for third-generation PacBio sequencing data according to claim 2, characterized in that, for k ≤ 17, the unique-kmers are stored in a bit file *.bit of size 2G, and for k > 17 the unique-kmers are stored in a *.h5 file of the open-source GATB package.
4. The alignment method for third-generation PacBio sequencing data according to claim 1, characterized in that, in step (2), using the unique-kmers from step (1), reads that share more than 3 unique-kmers with each other are selected as candidate reads.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610329027.4A CN106021997B (en) | 2016-05-17 | 2016-05-17 | An alignment method for third-generation PacBio sequencing data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610329027.4A CN106021997B (en) | 2016-05-17 | 2016-05-17 | An alignment method for third-generation PacBio sequencing data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106021997A CN106021997A (en) | 2016-10-12 |
CN106021997B (en) | 2019-03-29 |
Family ID: 57098804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610329027.4A Active CN106021997B (en) | 2016-05-17 | 2016-05-17 | A kind of comparison method of three generations PacBio sequencing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021997B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108614954B (en) * | 2016-12-12 | 2020-07-28 | 深圳华大基因科技服务有限公司 | Method and device for short sequence error correction of second-generation sequence |
CN108460245B (en) * | 2017-02-21 | 2020-11-06 | 深圳华大基因科技服务有限公司 | Method and apparatus for optimizing second generation assembly results using third generation sequences |
CN107256335A (en) * | 2017-06-02 | 2017-10-17 | 肖传乐 | A kind of preferred three generations's sequencing sequence comparison method of being given a mark based on global seed |
CN107229842A (en) * | 2017-06-02 | 2017-10-03 | 肖传乐 | A kind of three generations's sequencing sequence bearing calibration based on Local map |
CN111564181B (en) * | 2020-04-02 | 2024-06-04 | 北京百迈客生物科技有限公司 | Method for carrying out metagenome assembly based on second-generation and third-generation ONT technology |
CN114420209A (en) * | 2022-03-28 | 2022-04-29 | 山东大学 | Sequencing data-based pathogenic microorganism detection method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104164479A (en) * | 2014-04-04 | 2014-11-26 | 深圳华大基因科技服务有限公司 | Heterozygous genome processing method |
CN104531848A (en) * | 2014-12-11 | 2015-04-22 | 杭州和壹基因科技有限公司 | Method and system for assembling genome sequence |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104017883B (en) * | 2014-06-18 | 2015-11-18 | 深圳华大基因科技服务有限公司 | The method and system of assembling genome sequence |
CN104951672B (en) * | 2015-06-19 | 2017-08-29 | 中国科学院计算技术研究所 | Joining method and system associated with a kind of second generation, three generations's gene order-checking data |
Also Published As
Publication number | Publication date |
---|---|
CN106021997A (en) | 2016-10-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |