CN106021997A - Third-generation PacBio sequencing data comparison method - Google Patents

Info

Publication number
CN106021997A
Authority
CN
China
Prior art keywords
comparison
kmer
sequencing data
unique
generations
Prior art date
Legal status
Granted
Application number
CN201610329027.4A
Other languages
Chinese (zh)
Other versions
CN106021997B (en)
Inventor
詹东亮
王军
王军一
郝美荣
何荣军
俞凯成
高金龙
蔡庆乐
Current Assignee
HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Original Assignee
HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Priority date
Filing date
Publication date
Application filed by HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Priority to CN201610329027.4A (patent CN106021997B)
Publication of CN106021997A
Application granted
Publication of CN106021997B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a third-generation PacBio sequencing data comparison method capable of effectively reducing comparison errors caused by repeated sequences. According to the method, a k-mer model is established by using second-generation Illumina data; unique-kmer is extracted; and in third-generation PacBio sequencing data comparison, unique-kmer is used as a seed used in the comparison, so that the influence of the repeated sequences can be greatly reduced and the comparison speed can be increased.

Description

An alignment (comparison) method for third-generation PacBio sequencing data
Technical field
The present invention relates to the field of bioinformatics, and specifically to a DNA sequence alignment method. It models second-generation Illumina sequencing data to extract key information, and uses that information to assist the alignment of third-generation PacBio sequencing data.
Background
Third-generation PacBio sequencing data has a single-pass error rate of about 15%. Few alignment programs are designed specifically for third-generation data; the two most widely used at present are: (1) blasr; (2) dalign.
Both are excellent third-generation aligners that tolerate the high error rate of PacBio data. However, genomes contain repetitive sequences, which are highly similar to one another. These aligners align the repeats and report them, which affects downstream biological analysis (such as assembly and expression analysis).
Summary of the invention
The object of the present invention is to solve the problem described above by providing an alignment method for third-generation PacBio sequencing data that effectively reduces alignment errors caused by repetitive sequences. The method builds a k-mer model from second-generation Illumina data, extracts unique-kmers, and uses these unique-kmers as the seeds for aligning third-generation PacBio sequencing data. This greatly reduces the influence of repetitive sequences and increases alignment speed.
The present invention is achieved by the following technical solution:
The present invention is an alignment method for third-generation PacBio sequencing data, comprising the following steps:
(1) build a k-mer model from second-generation Illumina sequencing data and extract unique-kmers from it;
(2) use the unique-kmers as alignment seeds to screen candidate reads;
(3) perform detailed alignment on the candidate reads.
Preferably, the jellyfish software is used to compute k-mer statistics on the second-generation Illumina sequencing data; according to the k-mer distribution plot, the k-mers within two times the main peak are taken as unique-kmers; and the unique-kmers are stored in a bit file or with the open-source GATB package.
Preferably, for k ≤ 17, the unique-kmers are stored in a bit file (*.bit) of size 2G; for k > 17, they are stored in an (*.h5) file of the open-source GATB package.
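The 2G figure is consistent with a presence bitmap that has one bit per possible k-mer: with 2 bits per base there are 4^17 = 2^34 possible 17-mers, and 2^34 bits is 2 GiB. The sketch below illustrates that reading; the patent does not describe the file layout, so the encoding (A=0, C=1, G=2, T=3), the bit order, and the names `kmer_to_index` and `KmerBitSet` are assumptions made for illustration.

```python
# Hypothetical layout: one bit per possible k-mer, indexed by a 2-bit-per-base code.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_index(kmer):
    """Encode a k-mer as an integer in [0, 4**k) using 2 bits per base."""
    index = 0
    for base in kmer:
        index = (index << 2) | BASE_CODE[base]
    return index


def bitset_size_bytes(k):
    """Number of bytes needed to hold one bit for each of the 4**k k-mers."""
    return (4 ** k + 7) // 8


class KmerBitSet:
    """In-memory presence bitmap; its bytes could be written to a *.bit file."""

    def __init__(self, k):
        self.k = k
        self.bits = bytearray(bitset_size_bytes(k))

    def add(self, kmer):
        i = kmer_to_index(kmer)
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, kmer):
        i = kmer_to_index(kmer)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))
```

Under this layout the bitmap doubles in size four times over for each extra base (8 GiB at k = 18), which may be why larger k is handed to a different storage format.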
Preferably, in step (2), the unique-kmers from step (1) are used: if the number of unique-kmers shared between reads is greater than 3, those reads are selected as candidate reads.
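The screening rule can be sketched as follows. This is a minimal illustration rather than the patented program: it ignores reverse-complement matches, records only the first position of each unique-kmer in a read, and takes the unique-kmer collection as any container supporting `in` (a Python set here). The function names are hypothetical.

```python
def unique_seeds(read, k, unique_kmers):
    """Map each unique-kmer found in `read` to the first position where it occurs."""
    seeds = {}
    for pos in range(len(read) - k + 1):
        kmer = read[pos:pos + k]
        if kmer in unique_kmers and kmer not in seeds:
            seeds[kmer] = pos
    return seeds


def shared_seeds(read1, read2, k, unique_kmers):
    """Return (position on read1, position on read2) for every shared unique-kmer."""
    s1 = unique_seeds(read1, k, unique_kmers)
    s2 = unique_seeds(read2, k, unique_kmers)
    return [(s1[kmer], s2[kmer]) for kmer in s1 if kmer in s2]


def is_candidate_pair(read1, read2, k, unique_kmers, min_shared=3):
    """Keep the pair only if it shares more than `min_shared` unique-kmers."""
    return len(shared_seeds(read1, read2, k, unique_kmers)) > min_shared
```

The seed coordinates returned by `shared_seeds` are the points that the detailed alignment step later clusters.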
Preferably, step (3) comprises the following steps:
a. First cluster the aligned seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the abscissa is the aligned position on read1 and the ordinate is the aligned position on read2. Each point represents a seed shared by the two reads. The seeds are clustered along straight lines of slope 1, and the line that gathers the most points is taken as the aligned region;
b. Then divide the alignment range into small regions, compute the similarity of each region with the LCS algorithm, and score the whole alignment, as follows:
Suppose the alignment range is divided into n regions, of which b have similarity greater than 0.8, and the total number of similar bases in these small regions is c. The region similarity is then b/n and the base similarity is c/a. Only data for which both values are greater than 0.7 are retained.
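The scoring step can be sketched as below, under several stated assumptions. The text does not define a, so the sketch takes it to be the number of bases in the alignment range (the shorter of the two sequences passed in). A region's similarity is taken as its LCS length divided by the longer of the two segment lengths, c is summed over the regions that pass the 0.8 cutoff, and segments are paired at equal offsets, which ignores drift caused by insertions and deletions. The function names and the 100 bp default are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using two rolling DP rows."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score_alignment(seq1, seq2, segment=100, seg_cutoff=0.8, keep_cutoff=0.7):
    """Return (region similarity b/n, base similarity c/a, keep flag)."""
    a = min(len(seq1), len(seq2))  # assumption: a = bases in the alignment range
    if a == 0:
        raise ValueError("empty alignment range")
    n = b = c = 0
    for start in range(0, a, segment):
        s1 = seq1[start:start + segment]
        s2 = seq2[start:start + segment]
        n += 1
        lcs = lcs_length(s1, s2)
        if lcs / max(len(s1), len(s2)) > seg_cutoff:
            b += 1    # region counted as similar
            c += lcs  # similar bases contributed by that region
    region_sim = b / n
    base_sim = c / a
    return region_sim, base_sim, region_sim > keep_cutoff and base_sim > keep_cutoff
```

For two identical 200 bp sequences, both values are 1.0 and the pair is kept; if the second half of one sequence is replaced by unrelated bases, both values fall to 0.5 and the pair is dropped.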
The beneficial effects of the present invention are as follows:
1. Unique-kmers are extracted from second-generation Illumina sequencing data, improving alignment accuracy and speed.
Genomes contain many repetitive sequences, and some short tandem repeats occur hundreds or even thousands of times, which reduces alignment accuracy and increases alignment time. To improve accuracy and reduce alignment time, we extract the k-mers that occur only once in the contigs as unique-kmers. Because second-generation Illumina sequencing data is of high quality, when the sequencing depth is sufficient and random (typically ~40x), running the Jellyfish software on the second-generation Illumina sequencing data yields the k-mer distribution plot (Fig. 1). The k-mers in the region within 2 times the peak are taken as unique-kmers. For k ≤ 17, they are stored in a bit file (*.bit file) of size 2G; for k > 17, GATB (an open-source framework) is used to store the unique-kmers in a file (*.h5 file). The second-generation Illumina sequencing data used is of high quality, and Jellyfish runs multithreaded, is fast, and consumes little memory, so the whole method achieves high data-processing quality and a marked speed advantage;
2. Using unique-kmers as alignment seeds to screen candidate reads saves alignment time and improves alignment speed.
Because a unique-kmer, probabilistically and in theory, occurs only once in the haploid genome, the influence of repetitive sequences is avoided. In turn, because repeats no longer interfere, the candidate reads found are highly accurate, which saves a great deal of alignment time and substantially improves alignment speed.
3. Detailed alignment is performed only on the candidate reads, saving memory and alignment time and improving alignment speed.
Many alignment programs use the longest common subsequence (LCS) algorithm, but they usually run LCS over the entire region, which wastes memory and time for aligned regions longer than 100k. Our method also uses this algorithm, with two improvements: (1) the seed alignment relationships are clustered in advance to compute the optimal alignment range; (2) the alignment is performed region by region. This saves memory and alignment time and improves alignment speed.
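A rough count of dynamic-programming cells shows why region-by-region LCS is cheaper. The sketch assumes a textbook quadratic LCS table and two sequences of equal length; the 100-base segment length is the value suggested in the embodiment, and the function names are illustrative.

```python
def dp_cells_full(length):
    """Cells in one LCS table covering the whole region (length x length)."""
    return length * length


def dp_cells_segmented(length, segment=100):
    """Cells when the region is split into independent segment x segment tables."""
    full, rest = divmod(length, segment)
    return full * segment * segment + rest * rest


region = 100_000
print(dp_cells_full(region))       # 10000000000
print(dp_cells_segmented(region))  # 10000000
```

For a 100k-base region this is 10^10 cells against 10^7, a thousandfold reduction in work, and only one 100 x 100 table has to be held in memory at a time.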
Description of the drawings
Fig. 1: k-mer distribution plot
All data are broken into fragments of length k (called k-mers). The abscissa is the frequency of a k-mer and the ordinate is the number of distinct k-mers with that frequency. The k-mers in the region within 2 times the peak are taken as unique-kmers.
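The selection rule described for Fig. 1 can be sketched in a few lines. Here a plain `Counter` stands in for the jellyfish counting step, and the main peak is located by walking down the initial low-frequency slope (typically sequencing errors) and taking the highest point after it. That peak-finding heuristic, the strict `< 2 * peak` bound, and the function names are assumptions; the text only says that k-mers within two times the main peak are kept, and it describes no lower cutoff, so low-frequency k-mers are not filtered here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads (stand-in for the jellyfish step)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def main_peak(histogram):
    """Frequency at the main peak of a {frequency: number of distinct k-mers} histogram."""
    freqs = sorted(histogram)
    i = 0
    # Walk down the initial error slope to the first valley.
    while i + 1 < len(freqs) and histogram[freqs[i + 1]] < histogram[freqs[i]]:
        i += 1
    return max(freqs[i:], key=lambda f: histogram[f])


def select_unique_kmers(counts):
    """Return (k-mers whose frequency is below twice the main peak, peak frequency)."""
    histogram = Counter(counts.values())
    peak = main_peak(histogram)
    return {kmer for kmer, c in counts.items() if c < 2 * peak}, peak
```

With ~40x data the main peak would sit near the sequencing depth, so k-mers from two-copy or higher repeats, which appear at roughly twice that frequency or more, fall outside the kept range.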
Fig. 2: schematic of computing the alignment range
Each point in the figure represents a seed shared by the two reads; the abscissa is the aligned position on read1 and the ordinate is the aligned position on read2. The seeds are clustered along straight lines of slope 1, the line with the most clustered points is selected, and this region is taken as the aligned range.
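Seeds that lie on the same line of slope 1 share the same offset (position on read2 minus position on read1), so the clustering in Fig. 2 can be sketched by grouping seeds on that offset. The `band` parameter, which merges nearby offsets to tolerate small shifts, is an assumption added for illustration and is not described in the text; `best_diagonal` is a hypothetical name.

```python
from collections import defaultdict


def best_diagonal(seeds, band=1):
    """Group (pos1, pos2) seeds by diagonal and return the densest group's range.

    Returns ((min pos1, max pos1), (min pos2, max pos2), number of seeds).
    """
    if not seeds:
        raise ValueError("no seeds to cluster")
    clusters = defaultdict(list)
    for pos1, pos2 in seeds:
        # Points on one slope-1 line share the intercept pos2 - pos1.
        clusters[(pos2 - pos1) // band].append((pos1, pos2))
    best = max(clusters.values(), key=len)
    xs = [p for p, _ in best]
    ys = [q for _, q in best]
    return (min(xs), max(xs)), (min(ys), max(ys)), len(best)


seeds = [(0, 100), (10, 110), (20, 120), (5, 300), (40, 140)]
print(best_diagonal(seeds))  # ((0, 40), (100, 140), 4)
```

The stray seed at (5, 300), as a repeat-induced match would be, lands on a different diagonal and is excluded from the alignment range.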
Fig. 3: flow chart of the present invention
Detailed description of the embodiments
The embodiments of the present invention are further described below with reference to the accompanying drawings:
Embodiment:
(1) Build a k-mer model from second-generation Illumina sequencing data and extract unique-kmers from it
The jellyfish software is used to compute k-mer statistics on the second-generation Illumina sequencing data. All data are broken into fragments of length k (called k-mers); in the resulting distribution plot, the abscissa is the k-mer frequency and the ordinate is the number of distinct k-mers with that frequency. According to the k-mer distribution plot, the k-mers within two times the main peak are taken as unique-kmers. For k ≤ 17 they are stored in a bit file (*.bit) of size 2G; for k > 17 they are stored in an (*.h5) file of the open-source GATB package. Here, second-generation Illumina sequencing data refers to the second-generation sequencing data obtained from Illumina sequencers.
Following the above method, the following program was written to extract unique-kmers. The command usage is as follows:
A concrete example is run as follows:
Select about 40X of data from the second-generation Illumina sequencing data and write it into a file named fq.lst:
Then run the program to obtain the unique-kmers:
Because k=17 was chosen, the result is stored in the bit file k17.bit.
(2) Compare the unique-kmers against the third-generation Pacbio sequencing data to screen candidate reads
The unique-kmers are used as the seeds for alignment. If the number of unique-kmers shared between reads is greater than 3, those reads are taken as candidate reads. Here, third-generation Pacbio sequencing data refers to the third-generation sequencing data obtained from Pacbio sequencers.
Following the above method, an alignment program was written to align the third-generation Pacbio sequencing data. The command usage is as follows:
A concrete example is run as follows:
Two third-generation Pacbio sequencing data files, read1.fa and read2.fa, are used, together with the unique-kmer file extracted from second-generation Illumina sequencing data, k17.bit. The following command is run to perform the alignment:
(3) Perform detailed alignment on the candidate reads.
a. First cluster the aligned seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the abscissa is the aligned position on read1 and the ordinate is the aligned position on read2. Each point represents a seed shared by the two reads. The seeds are clustered along straight lines of slope 1, and the line that gathers the most points is taken as the aligned region;
b. Then divide the alignment range into small regions (the segment length can be set to 100 bp), compute the similarity of each region with the LCS algorithm, and score the whole alignment, as follows:
Suppose the alignment range is divided into n regions, of which b have similarity greater than 0.8, and the total number of similar bases in these small regions is c. The region similarity is then b/n and the base similarity is c/a. Only data for which both values are greater than 0.7 are retained.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the core technical features of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (5)

1. An alignment method for third-generation PacBio sequencing data, characterized in that it comprises the following steps:
(1) build a k-mer model from second-generation Illumina sequencing data and extract unique-kmers from it;
(2) use the unique-kmers as alignment seeds and compare them against the third-generation Pacbio sequencing data to screen out candidate reads;
(3) perform detailed alignment on the candidate reads.
2. The alignment method for third-generation PacBio sequencing data according to claim 1, characterized in that, in step (1), the jellyfish software is used to compute k-mer statistics on the second-generation Illumina sequencing data; according to the k-mer distribution plot, the k-mers within two times the main peak are taken as unique-kmers; and the unique-kmers are stored in a bit file or with the open-source GATB package.
3. The alignment method for third-generation PacBio sequencing data according to claim 2, characterized in that, for k ≤ 17, the unique-kmers are stored in a bit file (*.bit) of size 2G, and for k > 17, the unique-kmers are stored in an (*.h5) file of the open-source GATB package.
4. The alignment method for third-generation PacBio sequencing data according to claim 1, characterized in that, in step (2), the unique-kmers from step (1) are used: if the number of unique-kmers shared between reads is greater than 3, those reads are selected as candidate reads.
5. The alignment method for third-generation PacBio sequencing data according to claim 1, characterized in that step (3) comprises the following steps:
a. First cluster the aligned seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the abscissa is the aligned position on read1 and the ordinate is the aligned position on read2. Each point represents a seed shared by the two reads. The seeds are clustered along straight lines of slope 1, and the line that gathers the most points is taken as the aligned region;
b. Then divide the alignment range into small regions, compute the similarity of each region with the LCS algorithm, and score the whole alignment, as follows:
Suppose the alignment range is divided into n regions, of which b have similarity greater than 0.8, and the total number of similar bases in these small regions is c. The region similarity is then b/n and the base similarity is c/a. Only data for which both values are greater than 0.7 are retained.
CN201610329027.4A 2016-05-17 2016-05-17 Alignment method for third-generation PacBio sequencing data Active CN106021997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610329027.4A CN106021997B (en) 2016-05-17 2016-05-17 Alignment method for third-generation PacBio sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610329027.4A CN106021997B (en) 2016-05-17 2016-05-17 Alignment method for third-generation PacBio sequencing data

Publications (2)

Publication Number Publication Date
CN106021997A true CN106021997A (en) 2016-10-12
CN106021997B CN106021997B (en) 2019-03-29

Family

ID=57098804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610329027.4A Active CN106021997B (en) 2016-05-17 2016-05-17 Alignment method for third-generation PacBio sequencing data

Country Status (1)

Country Link
CN (1) CN106021997B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107256335A (en) * 2017-06-02 2017-10-17 肖传乐 A kind of preferred three generations's sequencing sequence comparison method of being given a mark based on global seed
CN108460245A (en) * 2017-02-21 2018-08-28 深圳华大基因科技服务有限公司 The method and apparatus for assembling result using two generation of three generations's sequence optimisation
CN108614954A (en) * 2016-12-12 2018-10-02 深圳华大基因科技服务有限公司 A kind of method and apparatus of the short sequencing error corrections of two generation sequences
WO2018218787A1 (en) * 2017-06-02 2018-12-06 肖传乐 Third-generation sequencing sequence correction method based on local graph
CN111564181A (en) * 2020-04-02 2020-08-21 北京百迈客生物科技有限公司 Metagenome assembly method based on second-generation and third-generation ONT (ONT) technologies
CN114420209A (en) * 2022-03-28 2022-04-29 山东大学 Sequencing data-based pathogenic microorganism detection method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104017883A (en) * 2014-06-18 2014-09-03 深圳华大基因科技服务有限公司 Method and system for assembling genomic sequence
CN104164479A (en) * 2014-04-04 2014-11-26 深圳华大基因科技服务有限公司 Heterozygous genome processing method
CN104531848A (en) * 2014-12-11 2015-04-22 杭州和壹基因科技有限公司 Method and system for assembling genome sequence
CN104951672A (en) * 2015-06-19 2015-09-30 中国科学院计算技术研究所 Splicing method and system of second generation and third generation genomic sequencing data combination

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KONSTANTINOS PATIS等: "Evaluation of DNA scaffolding techniques using pacbio long reads", 《网站在线公开:HTTPS://WWW.MYSCIENCEWORK.COM/PUBLICATION/SHOW/EVALUATION-DNA-SCAFFOLDING-TECHNIQUES-USING-PACBIO-LONG-READS-DC66B81E》 *
任毅鹏等: "基于Pacbio平台的全长转录组测序", 《中国科学》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108614954A (en) * 2016-12-12 2018-10-02 深圳华大基因科技服务有限公司 A kind of method and apparatus of the short sequencing error corrections of two generation sequences
CN108614954B (en) * 2016-12-12 2020-07-28 深圳华大基因科技服务有限公司 Method and device for short sequence error correction of second-generation sequence
CN108460245A (en) * 2017-02-21 2018-08-28 深圳华大基因科技服务有限公司 The method and apparatus for assembling result using two generation of three generations's sequence optimisation
CN107256335A (en) * 2017-06-02 2017-10-17 肖传乐 A kind of preferred three generations's sequencing sequence comparison method of being given a mark based on global seed
WO2018218787A1 (en) * 2017-06-02 2018-12-06 肖传乐 Third-generation sequencing sequence correction method based on local graph
WO2018218788A1 (en) * 2017-06-02 2018-12-06 肖传乐 Third-generation sequencing sequence alignment method based on global seed scoring optimization
CN111564181A (en) * 2020-04-02 2020-08-21 北京百迈客生物科技有限公司 Metagenome assembly method based on second-generation and third-generation ONT (ONT) technologies
CN111564181B (en) * 2020-04-02 2024-06-04 北京百迈客生物科技有限公司 Method for carrying out metagenome assembly based on second-generation and third-generation ONT technology
CN114420209A (en) * 2022-03-28 2022-04-29 山东大学 Sequencing data-based pathogenic microorganism detection method and system

Also Published As

Publication number Publication date
CN106021997B (en) 2019-03-29

Similar Documents

Publication Publication Date Title
CN106021997A (en) Third-generation PacBio sequencing data comparison method
US7945097B2 (en) Classifying digital ink into a writing or a drawing
CN110399878B (en) Form format recovery method, computer readable medium and computer
CN106022002A (en) Three-generation PacBio sequencing data-based hole filling method
CN108228825B (en) A kind of station address data cleaning method based on participle
US20150095769A1 (en) Layout Analysis Method And System
CN102903136B (en) A kind of handwriting electronization method and system
CN1492377A (en) Form processing system and method
CN108073930A (en) A kind of target detection and tracking based on multiple irregular ROI
CN109960808A (en) A kind of text recognition method, device, equipment and computer readable storage medium
CN106021985A (en) Genome data compression method
CN102117373A (en) Sign data entry method and device
CN101964048A (en) Character recognition method and system
CN106255979A (en) Row dividing method
CN107480466A (en) Genomic data storage method and electronic equipment
CN106802958A (en) Conversion method and system of the CAD data to GIS data
WO2013097817A1 (en) Method and system for generating control instruction according to change of glyph outline
CN100456317C (en) Program, method and device for determining line direction
CN106156772B (en) For determining the method and apparatus of word spacing and for the method and system of participle
CN104112287B (en) Method and device for segmenting characters in picture
CN106022003B (en) A kind of scaffold construction method based on three generations's PacBio sequencing data
CN102385630B (en) A kind of method and system that file mark is carried out in file
US20240046686A1 (en) Document Extraction Template Induction
CN109189966A (en) A kind of trapping patterns search method based on shape feature
US11656881B2 (en) Detecting repetitive patterns of user interface actions

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant