CN106021997A - Third-generation PacBio sequencing data comparison method - Google Patents
- Publication number
- CN106021997A (application CN201610329027.4A)
- Authority
- CN
- China
- Prior art keywords
- comparison
- kmer
- sequencing data
- unique
- generations
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Landscapes
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a comparison (alignment) method for third-generation PacBio sequencing data that effectively reduces the alignment errors caused by repetitive sequences. The method builds a k-mer model from second-generation Illumina data and extracts unique k-mers (unique-kmers); when third-generation PacBio sequencing data are aligned, these unique-kmers serve as the alignment seeds, which greatly reduces the influence of repetitive sequences and increases alignment speed.
Description
Technical field
The present invention relates to the technical field of bioinformatics, and specifically to a method for aligning DNA sequences. It models second-generation Illumina sequencing data to extract key information, and uses that information to assist the alignment of third-generation PacBio sequencing data.
Background
Third-generation PacBio sequencing data have a single-pass error rate of about 15%, and few alignment programs are designed specifically for third-generation data. The two most widely used at present are: (1) blasr; (2) dalign.
Both are excellent third-generation aligners that tolerate the high error rate of PacBio reads. However, genomes contain repetitive sequences, which closely resemble one another. These aligners align and report such repetitive sequences as well, which interferes with downstream biological analyses (such as assembly and expression analysis).
Summary of the invention
The object of the present invention is to solve the problem described above by providing a method for aligning third-generation PacBio sequencing data that effectively reduces the alignment errors caused by repetitive sequences. The method builds a k-mer model from second-generation Illumina data, extracts unique-kmers, and uses these unique-kmers as the seeds when aligning third-generation PacBio sequencing data, which greatly reduces the influence of repetitive sequences and increases alignment speed.
The present invention is achieved by the following technical solution:
The present invention is a method for aligning third-generation PacBio sequencing data, comprising the following steps:
(1) build a k-mer model from Illumina sequencing data and extract unique-kmers from it;
(2) use the unique-kmers as alignment seeds to screen candidate reads;
(3) perform a detailed alignment of the candidate reads.
Preferably, the jellyfish software is used to count k-mers in the second-generation Illumina sequencing data; the k-mers lying within two times the main peak of the k-mer distribution plot are taken as unique-kmers, and the unique-kmers are stored in a bit file or with the open-source GATB package.
Preferably, for k ≤ 17 the unique-kmers are stored in a bit file (*.bit) of size 2G, and for k > 17 they are stored in an (*.h5) file of the open-source GATB package.
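The 2G figure is consistent with a direct-addressed bitmap: at 2 bits per base a 17-mer maps to an integer below 4^17 = 2^34, and one presence bit for each possible 17-mer takes 2^34 bits, i.e. 2 GiB. The following is a minimal Python sketch of that idea, not the program described in this document; the function names and the A/C/G/T to 0-3 encoding are assumptions made for illustration, and a small k is used so that the bitmap is only a few bytes.

```python
# Hypothetical sketch: a direct-addressed bitmap of unique k-mers.
# The 2-bit base encoding and all names here are illustrative assumptions.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_index(kmer: str) -> int:
    """Pack a k-mer into an integer, 2 bits per base."""
    idx = 0
    for base in kmer:
        idx = (idx << 2) | CODE[base]
    return idx

def bitmap_bytes(k: int) -> int:
    """Bytes needed for one presence bit per possible k-mer (4**k bits)."""
    return (4 ** k + 7) // 8

def build_bitmap(unique_kmers, k: int) -> bytearray:
    """Set the bit addressed by each unique k-mer."""
    bits = bytearray(bitmap_bytes(k))
    for kmer in unique_kmers:
        i = kmer_index(kmer)
        bits[i >> 3] |= 1 << (i & 7)
    return bits

def contains(bits: bytearray, kmer: str) -> bool:
    """Test whether the bit addressed by this k-mer is set."""
    i = kmer_index(kmer)
    return bool(bits[i >> 3] & (1 << (i & 7)))

bits = build_bitmap(["ACGT", "TTTT"], k=4)   # 4**4 bits = 32 bytes
print(len(bits), contains(bits, "ACGT"), contains(bits, "AAAA"))
print(bitmap_bytes(17))                      # 2**31 bytes = 2 GiB
```

Under this layout the bitmap grows fourfold with each extra base (8 GiB at k = 18), which would be one reason to switch to a different store for k > 17.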
Preferably, in step (2), using the unique-kmers from step (1), reads that share more than 3 unique-kmers are selected as candidate reads.
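The screening rule can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the unique-kmers are held in an ordinary set rather than a bit file, and reverse-complement matches are ignored for brevity.

```python
# Hypothetical sketch of candidate screening by shared unique k-mers.
def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_unique_kmers(read1: str, read2: str, unique: set, k: int) -> set:
    """Unique k-mers that occur in both reads."""
    in_read1 = {km for km in kmers(read1, k) if km in unique}
    return {km for km in kmers(read2, k) if km in in_read1}

def is_candidate_pair(read1, read2, unique, k, min_shared=3) -> bool:
    """Keep the pair only if it shares MORE than min_shared unique k-mers."""
    return len(shared_unique_kmers(read1, read2, unique, k)) > min_shared

unique = {"ACGT", "CGTA", "GTAC", "ACGG", "GGTC"}
read1 = "ACGTACGGTCA"
read2 = "TTACGTACGGAA"
read3 = "GGTCAAAA"
print(is_candidate_pair(read1, read2, unique, k=4))  # shares 4 unique k-mers
print(is_candidate_pair(read1, read3, unique, k=4))  # shares only 1
```

Because only k-mers present in the unique set are ever compared, k-mers that come from repeats never contribute a seed.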
Preferably, step (3) comprises the following steps:
First, cluster the aligned seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the abscissa is the aligned position on read1 and the ordinate is the aligned position on read2. Each point represents a seed shared by the two reads. The seeds are clustered along straight lines of slope 1, and the line that gathers the most points is taken as the aligned region;
Then divide the alignment range into small regions, compute the similarity of each region with the LCS algorithm, and score the alignment as a whole, as follows:
Suppose the alignment range is divided into n regions, of which b have a similarity greater than 0.8, and the total number of similar bases in these small regions is c. The region similarity is then b/n and the base similarity is c/a. Finally, only data for which both values are greater than 0.7 are retained.
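The scoring rule can be illustrated with a short Python sketch. Several points are assumptions rather than statements of this document: the text does not define a, so it is taken here to be the number of bases in the alignment range; the two reads are cut into segments at the same offsets; a segment's similarity is its LCS length divided by the segment length; and c counts similar bases only in the segments that pass the 0.8 cut. All names are hypothetical.

```python
# Hypothetical sketch of segment-wise LCS scoring (assumptions in comments).
def lcs_length(x: str, y: str) -> int:
    """Classic dynamic-programming LCS length, keeping one row at a time."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def score_alignment(s1, s2, seg_len=100, region_cut=0.8, keep_cut=0.7):
    """Return (region similarity b/n, base similarity c/a, keep?)."""
    a = min(len(s1), len(s2))        # assumed meaning of 'a': bases in the range
    if a == 0:
        raise ValueError("empty alignment range")
    s1, s2 = s1[:a], s2[:a]
    starts = range(0, a, seg_len)
    n = len(starts)
    b = c = 0
    for st in starts:
        x, y = s1[st:st + seg_len], s2[st:st + seg_len]
        lcs = lcs_length(x, y)
        if lcs / len(x) > region_cut:  # segment counts as similar
            b += 1
            c += lcs                   # assumed: c sums only passing segments
    region_sim, base_sim = b / n, c / a
    return region_sim, base_sim, region_sim > keep_cut and base_sim > keep_cut

same = "ACGT" * 50
print(score_alignment(same, same))            # identical ranges are kept
print(score_alignment("A" * 200, "C" * 200))  # unrelated ranges are dropped
```

Each LCS table here is at most seg_len by seg_len, so memory stays small regardless of how long the alignment range is.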
The beneficial effects of the present invention are as follows:
1. Extracting unique-kmers from second-generation Illumina sequencing data improves both the accuracy and the speed of alignment.
Genomes contain many repetitive sequences, and some short tandem repeats occur hundreds or even thousands of times; this lowers alignment accuracy and lengthens alignment time. To improve accuracy and reduce alignment time, we extract the k-mers that occur only once in the contigs as unique-kmers. Because second-generation Illumina sequencing data are of high quality, when the sequencing is sufficiently deep and random (typically about 40x), counting k-mers in the Illumina data with the Jellyfish software yields the k-mer distribution plot (Fig. 1). The k-mers lying within 2 times the peak are taken as unique-kmers. For k ≤ 17 they are stored in a bit file of size 2G (*.bit file); for k > 17 they are stored in a file (*.h5 file) using GATB (an open-source framework). Because the Illumina data used are of high quality, and Jellyfish runs multithreaded, is fast, and consumes little memory, the whole method achieves high data-processing quality and a marked advantage in processing speed;
2. Using unique-kmers as alignment seeds to screen candidate reads saves alignment time and increases alignment speed.
In theory, and with high probability, a unique-kmer occurs only once in the haploid genome, so the influence of repetitive sequences is avoided. Because that influence is avoided, the candidate reads found are highly accurate, which saves a great deal of alignment time and greatly increases alignment speed.
3. Performing the detailed alignment on the candidate reads saves memory and alignment time and increases alignment speed.
Many alignment programs use the longest common subsequence (LCS) algorithm, but they run LCS directly over the whole region, which wastes a great deal of memory and time for aligned regions longer than 100k. Our method also uses this algorithm, with two improvements: (1) the alignment relationships of the seeds are clustered in advance to compute the optimal alignment range; (2) the alignment is carried out region by region. This saves memory and alignment time and increases alignment speed.
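A rough count of dynamic-programming cells shows the size of this saving. The sketch below assumes the textbook quadratic LCS table, takes "100k" to mean two ranges of 100,000 bases each, and uses the 100bp segment length mentioned in the embodiment; the function name is hypothetical.

```python
# Hypothetical arithmetic sketch: cells filled by a quadratic LCS table.
def lcs_table_cells(length: int, seg_len=None) -> int:
    """Cells needed to compare two ranges of equal length.

    With seg_len=None the whole range is compared in one table;
    otherwise the range is cut into segments compared pairwise.
    """
    if seg_len is None:
        return length * length
    full, rest = divmod(length, seg_len)
    return full * seg_len * seg_len + rest * rest

whole = lcs_table_cells(100_000)        # one 100,000 x 100,000 table
split = lcs_table_cells(100_000, 100)   # 1,000 tables of 100 x 100
print(whole, split, whole // split)
```

Under these assumptions the segmented comparison fills 10^7 cells instead of 10^10, a thousandfold reduction, and only one 100 x 100 table needs to be held in memory at a time.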
Brief description of the drawings
Fig. 1: k-mer distribution plot
All data are broken into fragments of length k (called k-mers). The abscissa is the frequency of a k-mer and the ordinate is the number of distinct k-mers with that frequency. The k-mers lying within 2 times the peak are taken as unique-kmers.
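The selection rule read off Fig. 1 can be sketched in Python as follows. This is a hedged illustration, not the program described in this document: the k-mer counts are supplied as a plain dictionary, "within 2 times the peak" is interpreted as a frequency of at most twice the main-peak frequency, and the `error_cutoff` parameter (which ignores very low frequencies, typically sequencing errors, both when locating the peak and when selecting) is an assumption that the text does not state.

```python
# Hypothetical sketch: pick unique k-mers from a k-mer count table.
from collections import Counter

def select_unique_kmers(counts: dict, error_cutoff: int = 3):
    """Return (unique k-mers, main-peak frequency).

    counts maps each k-mer to its frequency in the Illumina data.
    error_cutoff is an assumed lower bound that excludes likely error k-mers.
    """
    histo = Counter(counts.values())      # frequency -> number of distinct k-mers
    usable = {f: n for f, n in histo.items() if f >= error_cutoff}
    if not usable:
        return set(), None
    # Main peak: the frequency shared by the most distinct k-mers
    # (ties are broken in favour of the lower frequency).
    peak = max(usable, key=lambda f: (usable[f], -f))
    unique = {km for km, f in counts.items() if error_cutoff <= f <= 2 * peak}
    return unique, peak

counts = {"AAAC": 1, "ACGG": 1, "AACG": 40, "ACGT": 41, "CGTA": 40,
          "GTAC": 38, "CGGT": 40, "TACG": 120}
unique, peak = select_unique_kmers(counts)
print(peak, sorted(unique))
```

In this toy table the main peak is at frequency 40, so the k-mer seen 120 times is treated as repeat-derived and dropped, as are the two k-mers seen once.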
Fig. 2: schematic of computing the alignment range
Each point in the figure represents a seed shared by two reads; the abscissa is the aligned position on read1 and the ordinate is the aligned position on read2. The seeds are clustered along straight lines of slope 1, the line with the largest cluster is selected, and this region is taken as the alignment range.
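Seeds on the same slope-1 line share the same offset (position on read2 minus position on read1), so the clustering in Fig. 2 can be sketched by grouping seeds by offset. The sketch below is illustrative only: the `band` tolerance, which lets seeds drift slightly off an exact diagonal because of insertions and deletions, is an assumption, fixed-width binning is a simplification (seeds near a bin edge may be split), and the function name is hypothetical.

```python
# Hypothetical sketch: cluster shared seeds along slope-1 diagonals.
from collections import defaultdict

def best_diagonal(seeds, band: int = 50):
    """seeds: list of (pos_on_read1, pos_on_read2) for shared unique k-mers.

    Returns ((min1, max1), (min2, max2), seed_count) for the diagonal band
    holding the most seeds, or None if there are no seeds.
    """
    groups = defaultdict(list)
    for p1, p2 in seeds:
        groups[(p2 - p1) // band].append((p1, p2))   # same band = same line
    if not groups:
        return None
    best = max(groups.values(), key=len)
    xs = [p1 for p1, _ in best]
    ys = [p2 for _, p2 in best]
    return (min(xs), max(xs)), (min(ys), max(ys)), len(best)

seeds = [(10, 110), (50, 152), (200, 305), (400, 20), (30, 700)]
print(best_diagonal(seeds))
```

Here three seeds have offsets of about 100 and fall in the same band, so the alignment range is taken from them; the two stray seeds, which would correspond to spurious matches, are ignored.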
Fig. 3: flow chart of the present invention
Detailed description of the embodiments
Embodiments of the invention are described in further detail below with reference to the accompanying drawings:
Embodiment:
(1) Build a k-mer model from second-generation Illumina sequencing data and extract unique-kmers from it
The jellyfish software is used to count k-mers in the second-generation Illumina sequencing data: all data are broken into fragments of length k (called k-mers); the abscissa is the frequency of a k-mer and the ordinate is the number of distinct k-mers with that frequency. The k-mers lying within two times the main peak of the k-mer distribution plot are taken as unique-kmers. For k ≤ 17 they are stored in a bit file (*.bit) of size 2G; for k > 17 they are stored in an (*.h5) file of the open-source GATB package. Here, second-generation Illumina sequencing data means second-generation sequencing data produced on Illumina sequencers.
Following the method above, a program was written to extract the unique-kmers. Its command usage is as follows:
A concrete example run is as follows:
Select about 40X of data from the second-generation Illumina sequencing data and write it into a file, fq.lst:
Then run the program to obtain the unique-kmers:
Because k=17 was chosen, the result is stored in the bit file k17.bit
(2) Compare the unique-kmers against the third-generation PacBio sequencing data to screen candidate reads
The unique-kmers are used as the seeds for alignment; reads that share more than 3 unique-kmers are taken as candidate reads. Here, third-generation PacBio sequencing data means third-generation sequencing data produced on PacBio sequencers.
Following the method above, an alignment program was written to align the third-generation PacBio sequencing data. Its command usage is as follows:
A concrete example run is as follows:
Given two third-generation PacBio sequencing data files, read1.fa and read2.fa, and a unique-kmer file extracted from second-generation Illumina sequencing data, k17.bit, run the following command to perform the alignment:
(3) Perform a detailed alignment of the candidate reads.
First, cluster the aligned seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the abscissa is the aligned position on read1 and the ordinate is the aligned position on read2. Each point represents a seed shared by the two reads. The seeds are clustered along straight lines of slope 1, and the line that gathers the most points is taken as the aligned region;
Then divide the alignment range into small regions (the segment length can be set to 100bp), compute the similarity of each region with the LCS algorithm, and score the alignment as a whole, as follows:
Suppose the alignment range is divided into n regions, of which b have a similarity greater than 0.8, and the total number of similar bases in these small regions is c. The region similarity is then b/n and the base similarity is c/a. Finally, only data for which both values are greater than 0.7 are retained.
The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the core technical features of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.
Claims (5)
1. A method for aligning third-generation PacBio sequencing data, characterized in that it comprises the following steps:
(1) building a k-mer model from second-generation Illumina sequencing data and extracting unique-kmers from it;
(2) using the unique-kmers as alignment seeds, comparing them against third-generation PacBio sequencing data, and screening out candidate reads;
(3) performing a detailed alignment of the candidate reads.
2. The method for aligning third-generation PacBio sequencing data according to claim 1, characterized in that in step (1) the jellyfish software is used to count k-mers in the second-generation Illumina sequencing data, the k-mers lying within two times the main peak of the k-mer distribution plot are taken as unique-kmers, and the unique-kmers are stored in a bit file or with the open-source GATB package.
3. The method for aligning third-generation PacBio sequencing data according to claim 2, characterized in that for k ≤ 17 the unique-kmers are stored in a bit file (*.bit) of size 2G, and for k > 17 the unique-kmers are stored in an (*.h5) file of the open-source GATB package.
4. The method for aligning third-generation PacBio sequencing data according to claim 1, characterized in that in step (2), using the unique-kmers of step (1), reads that share more than 3 unique-kmers are screened out as candidate reads.
5. The method for aligning third-generation PacBio sequencing data according to claim 1, characterized in that step (3) comprises the following steps:
first clustering the aligned seeds and computing the most probable alignment range, as follows:
setting up a coordinate system in which the abscissa is the aligned position on read1 and the ordinate is the aligned position on read2, each point representing a seed shared by the two reads; clustering the seeds along straight lines of slope 1, and taking the line that gathers the most points as the aligned region;
then dividing the alignment range into small regions, computing the similarity of each region with the LCS algorithm, and scoring the alignment as a whole, as follows:
supposing the alignment range is divided into n regions, of which b have a similarity greater than 0.8, and the total number of similar bases in these small regions is c, the region similarity is b/n and the base similarity is c/a; finally, only data for which both values are greater than 0.7 are retained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610329027.4A CN106021997B (en) | 2016-05-17 | 2016-05-17 | Comparison method for third-generation PacBio sequencing data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610329027.4A CN106021997B (en) | 2016-05-17 | 2016-05-17 | Comparison method for third-generation PacBio sequencing data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106021997A (en) | 2016-10-12 |
CN106021997B (en) | 2019-03-29 |
Family
ID=57098804
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610329027.4A Active CN106021997B (en) | 2016-05-17 | 2016-05-17 | Comparison method for third-generation PacBio sequencing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021997B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104017883A (en) * | 2014-06-18 | 2014-09-03 | 深圳华大基因科技服务有限公司 | Method and system for assembling genomic sequence |
CN104164479A (en) * | 2014-04-04 | 2014-11-26 | 深圳华大基因科技服务有限公司 | Heterozygous genome processing method |
CN104531848A (en) * | 2014-12-11 | 2015-04-22 | 杭州和壹基因科技有限公司 | Method and system for assembling genome sequence |
CN104951672A (en) * | 2015-06-19 | 2015-09-30 | 中国科学院计算技术研究所 | Splicing method and system of second generation and third generation genomic sequencing data combination |
Non-Patent Citations (2)
Title |
---|
KONSTANTINOS PATIS et al.: "Evaluation of DNA scaffolding techniques using pacbio long reads", published online: HTTPS://WWW.MYSCIENCEWORK.COM/PUBLICATION/SHOW/EVALUATION-DNA-SCAFFOLDING-TECHNIQUES-USING-PACBIO-LONG-READS-DC66B81E * |
任毅鹏 (REN Yipeng) et al.: "Full-length transcriptome sequencing based on the PacBio platform", 《中国科学》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108614954A (en) * | 2016-12-12 | 2018-10-02 | 深圳华大基因科技服务有限公司 | Method and device for short sequence error correction of second-generation sequence |
CN108614954B (en) * | 2016-12-12 | 2020-07-28 | 深圳华大基因科技服务有限公司 | Method and device for short sequence error correction of second-generation sequence |
CN108460245A (en) * | 2017-02-21 | 2018-08-28 | 深圳华大基因科技服务有限公司 | Method and apparatus for optimizing second-generation assembly results using third-generation sequences |
CN107256335A (en) * | 2017-06-02 | 2017-10-17 | 肖传乐 | Third-generation sequencing sequence alignment method based on global seed scoring optimization |
WO2018218787A1 (en) * | 2017-06-02 | 2018-12-06 | 肖传乐 | Third-generation sequencing sequence correction method based on local graph |
WO2018218788A1 (en) * | 2017-06-02 | 2018-12-06 | 肖传乐 | Third-generation sequencing sequence alignment method based on global seed scoring optimization |
CN111564181A (en) * | 2020-04-02 | 2020-08-21 | 北京百迈客生物科技有限公司 | Metagenome assembly method based on second-generation and third-generation ONT (ONT) technologies |
CN111564181B (en) * | 2020-04-02 | 2024-06-04 | 北京百迈客生物科技有限公司 | Method for carrying out metagenome assembly based on second-generation and third-generation ONT technology |
CN114420209A (en) * | 2022-03-28 | 2022-04-29 | 山东大学 | Sequencing data-based pathogenic microorganism detection method and system |
Also Published As
Publication number | Publication date |
---|---|
CN106021997B (en) | 2019-03-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||