CN106021997B - An alignment method for third-generation PacBio sequencing data

Info

Publication number
CN106021997B
Authority
CN
China
Prior art keywords
kmer
unique
generations
comparison
sequencing data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610329027.4A
Other languages
Chinese (zh)
Other versions
CN106021997A (en
Inventor
詹东亮
王军
王军一
郝美荣
何荣军
俞凯成
高金龙
蔡庆乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Original Assignee
HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-05-17
Filing date
2016-05-17
Publication date
2019-03-29
Application filed by HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Priority to CN201610329027.4A
Publication of CN106021997A
Application granted
Publication of CN106021997B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an alignment method for third-generation PacBio sequencing data that effectively reduces alignment errors caused by repetitive sequences. A k-mer model is built from second-generation Illumina data and unique-kmers are extracted from it. When third-generation PacBio sequencing data is aligned, these unique-kmers serve as the alignment seeds, which greatly reduces the influence of repetitive sequences and improves alignment speed.

Description

An alignment method for third-generation PacBio sequencing data
Technical field
The present invention relates to the field of bioinformatics, and in particular to a method for aligning DNA sequences. It builds a model from second-generation Illumina sequencing data, extracts key information from it, and uses that information to assist the alignment of third-generation PacBio sequencing data.
Background art
Third-generation PacBio sequencing data has a single-pass error rate of about 15%, and few alignment tools are designed specifically for third-generation data. The two most commonly used at present are: (1) blasr; (2) dalign.
Both are excellent third-generation aligners that tolerate the high error rate of PacBio data. However, a genome contains repetitive sequences that are highly similar to one another. These aligners also align and report such repetitive sequences, which affects downstream biological analysis (for example assembly and expression analysis).
Summary of the invention
The object of the present invention is to solve the problem described above by providing an alignment method for third-generation PacBio sequencing data that effectively reduces alignment errors caused by repetitive sequences. A k-mer model is built from second-generation Illumina data and unique-kmers are extracted from it. When third-generation PacBio sequencing data is aligned, these unique-kmers are used as the alignment seeds, which greatly reduces the influence of repetitive sequences and improves alignment speed.
The present invention is achieved by the following technical solution:
The present invention is an alignment method for third-generation PacBio sequencing data, comprising the following steps:
(1) build a k-mer model from Illumina sequencing data and extract unique-kmers from it;
(2) use the unique-kmers as alignment seeds to screen candidate reads;
(3) align the candidate reads in detail.
Preferably, k-mer counting is performed on the second-generation Illumina sequencing data with the jellyfish software; based on the k-mer distribution, the k-mers lying within twice the main peak are taken as unique-kmers, and the unique-kmers are stored either in a bit file or with the GATB open-source package.
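The selection rule above can be illustrated with a small in-memory sketch. It is not the jellyfish workflow itself: the function names, the use of canonical k-mers (a k-mer and its reverse complement counted together), and the `min_count` cutoff for discarding low-frequency error k-mers are assumptions made for illustration, and "within twice the main peak" is read here as a frequency of at most twice the peak frequency.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def main_peak(counts, min_count=2):
    """Frequency shared by the most distinct k-mers, ignoring frequencies below min_count.

    min_count is an assumption of this sketch: it skips the low-frequency
    k-mers that sequencing errors typically produce.
    """
    histogram = Counter(c for c in counts.values() if c >= min_count)
    if not histogram:
        raise ValueError("no k-mer reaches min_count")
    # Ties are broken towards the smaller frequency so the result is deterministic.
    return min(histogram, key=lambda freq: (-histogram[freq], freq))


def select_unique_kmers(counts, peak, min_count=2):
    """Keep k-mers whose frequency lies in [min_count, 2 * peak]."""
    return {kmer for kmer, c in counts.items() if min_count <= c <= 2 * peak}
```

On real data the counting would be done by a dedicated k-mer counter; only the peak detection and the frequency filter correspond to the rule stated in the text.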
Preferably, for k ≤ 17 the unique-kmers are stored in a single bit file (*.bit) of size 2G, and for k > 17 they are stored in a (*.h5) file of the GATB open-source package.
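The 2G figure is consistent with keeping one presence bit for every possible 17-mer: with 2 bits per base a k-mer maps to an integer in [0, 4^k), and 4^17 = 2^34 bits = 2^31 bytes = 2 GiB. The sketch below shows that encoding with an in-memory bitset. The actual layout of the *.bit file is not described in the text, so the base-to-code mapping, the bit order, and the class and function names are assumptions.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_index(kmer):
    """Pack a k-mer into an integer using 2 bits per base (A=0, C=1, G=2, T=3)."""
    index = 0
    for base in kmer:
        index = (index << 2) | _CODE[base]
    return index


def bitset_size_bytes(k):
    """Bytes needed for one presence bit per possible k-mer."""
    return (4 ** k + 7) // 8


class KmerBitset:
    """One bit per possible k-mer; a set bit marks the k-mer as a unique-kmer."""

    def __init__(self, k):
        self.k = k
        self.bits = bytearray(bitset_size_bytes(k))

    def add(self, kmer):
        index = kmer_to_index(kmer)
        self.bits[index >> 3] |= 1 << (index & 7)

    def __contains__(self, kmer):
        # k-mers of the wrong length or with non-ACGT bases are never members.
        if len(kmer) != self.k or any(base not in _CODE for base in kmer):
            return False
        index = kmer_to_index(kmer)
        return bool(self.bits[index >> 3] & (1 << (index & 7)))
```

A lookup is a single shift and mask, which is why a flat bitset is attractive while 4^k bits still fit in memory; for larger k the table grows fourfold per extra base, which is in line with the switch to a different storage format for k > 17.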
Preferably, in step (2), using the unique-kmers from step (1), reads that share more than 3 unique-kmers with one another are selected as candidate reads.
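The screening rule can be sketched with an inverted index from unique-kmer to the reads that contain it. This is a minimal illustration under stated assumptions: reads are held in a dictionary, both reads are on the same strand, and the function names and the `min_shared` parameter are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def unique_kmers_in_read(read, k, unique_kmers):
    """Set of unique-kmers that occur in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)} & unique_kmers


def candidate_pairs(reads, k, unique_kmers, min_shared=3):
    """Return {(name_a, name_b): shared} for read pairs sharing more than min_shared unique-kmers.

    reads maps a read name to its sequence. Strand handling is omitted,
    which is a simplification of this sketch.
    """
    index = defaultdict(set)  # unique-kmer -> names of reads containing it
    for name, sequence in reads.items():
        for kmer in unique_kmers_in_read(sequence, k, unique_kmers):
            index[kmer].add(name)

    shared = Counter()
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return {pair: n for pair, n in shared.items() if n > min_shared}
```

Because only unique-kmers enter the index, a k-mer from a repeat never links two reads, which is the effect the method relies on.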
Preferably, step (3) comprises the following steps:
A. First cluster the matched seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the abscissa is the matched position on read1 and the ordinate is the matched position on read2. Each point represents a seed shared by the two reads. Cluster the seeds along straight lines of slope 1, and take the line that gathers the most points as the aligned region;
B. Then divide the alignment range into small regions, compute the similarity of each region with the LCS algorithm, and score the alignment as a whole, as follows:
Suppose the alignment range is divided into n regions, b of which have a similarity greater than 0.8, and the similar bases in these regions total c. The region similarity is then b/n and the base similarity is c/a. Only data for which both values exceed 0.7 are retained.
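Step A can be sketched as follows. Seeds on a line of slope 1 share the same offset d = pos2 - pos1, so clustering along such lines amounts to grouping seeds by d. The tolerance `band` (how far d may drift within one cluster, for example because of insertions and deletions) is not given in the text and is an assumption, as are the function name and the sliding-window strategy.

```python
def best_diagonal_cluster(seeds, band=50):
    """Find the largest group of seeds lying near one line of slope 1.

    seeds is a list of (pos1, pos2) seed start positions on read1 and read2.
    Returns (cluster, (start1, end1), (start2, end2)); the ranges are built
    from seed start positions, so the seed length is not added to the ends.
    """
    if not seeds:
        return [], None, None

    ordered = sorted(seeds, key=lambda s: s[1] - s[0])  # sort by diagonal offset
    best_lo = best_hi = 0
    lo = 0
    for hi in range(len(ordered)):
        # Shrink the window until all offsets in it differ by at most `band`.
        while (ordered[hi][1] - ordered[hi][0]) - (ordered[lo][1] - ordered[lo][0]) > band:
            lo += 1
        if hi - lo > best_hi - best_lo:
            best_lo, best_hi = lo, hi

    cluster = ordered[best_lo:best_hi + 1]
    xs = [s[0] for s in cluster]
    ys = [s[1] for s in cluster]
    return cluster, (min(xs), max(xs)), (min(ys), max(ys))
```

Seeds far from the winning diagonal, such as chance matches elsewhere on the reads, are dropped before any base-level comparison is attempted.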
The beneficial effects of the present invention are as follows:
1. Unique-kmers are extracted from second-generation Illumina sequencing data, which improves alignment accuracy and speed.
A genome contains many repetitive sequences, and some short tandem repeats occur hundreds or even thousands of times, which reduces alignment accuracy and increases alignment time. To improve accuracy and reduce alignment time, we extract the k-mers that occur only once in the contigs as unique-kmers. Because second-generation Illumina sequencing data is of very high quality, when the sequencing depth is sufficient and the sampling is random (typically ~40x), counting k-mers in the Illumina data with the Jellyfish software yields the k-mer distribution (Fig. 1). The k-mers in the region within twice the peak are taken as unique-kmers. For k ≤ 17 they are stored in a single bit file (*.bit file) of size 2G, and for k > 17 they are stored in a file (*.h5 file) using GATB (an open-source framework). The second-generation Illumina sequencing data used is of high quality, and the Jellyfish software runs multithreaded, is fast, and has low memory consumption, which gives the whole method high data-processing quality and a clear speed advantage;
2. Unique-kmers are used as alignment seeds to screen candidate reads, which saves alignment time and improves alignment speed.
Because a unique-kmer, both probabilistically and in theory, occurs only once in a haploid genome, the influence of repetitive sequences can be avoided. Moreover, because that influence is avoided, the candidate reads found are highly accurate, which saves considerable alignment time and greatly increases alignment speed.
3. The candidate reads are aligned in detail, which saves memory and alignment time and improves alignment speed.
Many aligners use the longest common subsequence (LCS) algorithm and run LCS directly over the whole region, which wastes a great deal of memory and time for aligned regions longer than 100k. This method also uses LCS, but with two improvements: (1) the seed matches are clustered in advance to compute the optimal alignment range; (2) the alignment is carried out region by region. This saves memory and alignment time and improves alignment speed.
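A rough cost comparison shows why aligning region by region helps. It assumes the textbook dynamic-programming LCS, whose table has one cell per pair of positions, an aligned region of length L on both reads, and sub-regions of length s; the symbols L and s are introduced here for illustration, with s = 100 taken from the segment length suggested in the embodiment.

```latex
\begin{align*}
\text{whole region:}\quad & L \times L = L^{2} \text{ table cells} \\
\text{region by region:}\quad & \frac{L}{s} \times s^{2} = L\,s \text{ table cells} \\
L = 10^{5},\ s = 100:\quad & L^{2} = 10^{10}, \qquad L\,s = 10^{7}, \qquad \frac{L^{2}}{L\,s} = \frac{L}{s} = 10^{3}
\end{align*}
```

Under these assumptions the work drops by a factor of about 1000 for a 100k region, and only one s × s table needs to be held in memory at a time.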
Brief description of the drawings
Fig. 1: k-mer distribution
All data are broken into fragments of length k (called k-mers). The abscissa is the k-mer frequency and the ordinate is the number of distinct k-mers at that frequency. The k-mers in the region within twice the peak are taken as unique-kmers.
Fig. 2: Schematic of computing the alignment range
Each point in the figure represents a seed shared by the two reads. The abscissa is the matched position on read1 and the ordinate is the matched position on read2. The seeds are clustered along straight lines of slope 1, the line with the largest cluster is selected, and that region is taken as the aligned range.
Fig. 3: Flow chart of the present invention
Detailed description of the embodiments
The embodiments of the present invention are further described below with reference to the accompanying drawings:
Embodiment:
(1) Build a k-mer model from second-generation Illumina sequencing data and extract unique-kmers from it
K-mer counting is performed on the second-generation Illumina sequencing data with the jellyfish software. All data are broken into fragments of length k (called k-mers); the abscissa is the k-mer frequency and the ordinate is the number of distinct k-mers at that frequency. Based on the k-mer distribution, the k-mers within twice the main peak are taken as unique-kmers. For k ≤ 17 they are stored in a single bit file (*.bit) of size 2G, and for k > 17 they are stored in a (*.h5) file of the GATB open-source package. Here, second-generation Illumina sequencing data means second-generation sequencing data produced by Illumina sequencers.
Following the above method, a program was written to extract the unique-kmers. Its command-line usage is as follows:
A concrete example is as follows:
From the second-generation Illumina sequencing data, about 40X of data is selected and listed in a file named fq.lst:
The program is then run to obtain the unique-kmers:
Because k = 17 is chosen, the result is stored in a bit file: k17.bit
(2) Compare the unique-kmers against the third-generation PacBio sequencing data to screen candidate reads
These unique-kmers are used as the alignment seeds. Reads that share more than 3 unique-kmers are taken as candidate reads. Here, third-generation PacBio sequencing data means third-generation sequencing data produced by PacBio sequencers.
Following the above method, an alignment program was written to align third-generation PacBio sequencing data. Its command-line usage is as follows:
A concrete example is as follows:
Given two third-generation PacBio sequencing data files, read1.fa and read2.fa, and the unique-kmer file extracted from the second-generation Illumina sequencing data, k17.bit, the alignment is run with the following command:
(3) Align the candidate reads in detail.
A. First cluster the matched seeds and compute the most probable alignment range, as follows:
Set up a coordinate system in which the abscissa is the matched position on read1 and the ordinate is the matched position on read2. Each point represents a seed shared by the two reads. Cluster the seeds along straight lines of slope 1, and take the line that gathers the most points as the aligned region;
B. Then divide the alignment range into small regions (the segment length may be set to 100 bp), compute the similarity of each region with the LCS algorithm, and score the alignment as a whole, as follows:
Suppose the alignment range is divided into n regions, b of which have a similarity greater than 0.8, and the similar bases in these regions total c. The region similarity is then b/n and the base similarity is c/a. Only data for which both values exceed 0.7 are retained.
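The scoring in step B can be sketched as below. Several details are assumptions because the text leaves them open: `a` is taken to be the total number of bases in the alignment range (the longer of the two ranges), `c` is summed over the regions that pass the 0.8 threshold, a region's similarity is its LCS length divided by the region length, and segment i of read1 is simply paired with segment i of read2. The function and parameter names are hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, keeping two rows of the DP table."""
    previous = [0] * (len(y) + 1)
    for ch in x:
        current = [0]
        for j, other in enumerate(y, 1):
            if ch == other:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def score_alignment(range1, range2, seg_len=100, seg_min=0.8, keep_min=0.7):
    """Return (region_similarity, base_similarity, keep) for two aligned ranges.

    region_similarity = b / n and base_similarity = c / a, following the text;
    the meaning of `a` and the pairing of segments are assumptions of this sketch.
    """
    a = max(len(range1), len(range2))  # assumed: total bases in the alignment range
    if a == 0:
        raise ValueError("empty alignment range")

    n = b = c = 0
    for start in range(0, a, seg_len):
        seg1 = range1[start:start + seg_len]
        seg2 = range2[start:start + seg_len]
        n += 1
        common = lcs_length(seg1, seg2)
        if common / max(len(seg1), len(seg2)) > seg_min:
            b += 1
            c += common

    region_similarity = b / n
    base_similarity = c / a
    keep = region_similarity > keep_min and base_similarity > keep_min
    return region_similarity, base_similarity, keep
```

With reads that contain insertions and deletions, fixed segment pairing drifts out of register, so a practical version would anchor each segment on the clustered seeds from step A rather than on raw offsets.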
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the core technical features of the present invention, and such improvements and refinements shall also fall within the scope of protection of the present invention.

Claims (4)

1. An alignment method for third-generation PacBio sequencing data, characterized in that it comprises the following steps:
(1) building a k-mer model from second-generation Illumina sequencing data and extracting unique-kmers from it;
(2) using the unique-kmers as alignment seeds, comparing them against the third-generation PacBio sequencing data, and screening out candidate reads;
(3) aligning the candidate reads in detail, comprising the following steps:
A. first clustering the matched seeds and computing the most probable alignment range, as follows:
setting up a coordinate system in which the abscissa is the matched position on read1 and the ordinate is the matched position on read2, each point representing a seed shared by the two reads; clustering the seeds along straight lines of slope 1, and taking the line that gathers the most points as the aligned region;
B. then dividing the alignment range into small regions, computing the similarity of each region with the LCS algorithm, and scoring the alignment as a whole, as follows:
supposing the alignment range is divided into n regions, b of which have a similarity greater than 0.8, and the similar bases in these regions total c, the region similarity is b/n and the base similarity is c/a; only data for which both values exceed 0.7 are retained.
2. The alignment method for third-generation PacBio sequencing data according to claim 1, characterized in that, in step (1), k-mer counting is performed on the second-generation Illumina sequencing data with the jellyfish software, the k-mers within twice the main peak of the k-mer distribution are taken as unique-kmers, and the unique-kmers are stored in a bit file or with the GATB open-source package.
3. The alignment method for third-generation PacBio sequencing data according to claim 2, characterized in that, for k ≤ 17, the unique-kmers are stored in a bit file *.bit of size 2G, and for k > 17, the unique-kmers are stored in a *.h5 file of the GATB open-source package.
4. The alignment method for third-generation PacBio sequencing data according to claim 1, characterized in that, in step (2), using the unique-kmers from step (1), reads that share more than 3 unique-kmers with one another are selected as candidate reads.
CN201610329027.4A 2016-05-17 2016-05-17 A kind of comparison method of three generations PacBio sequencing data Active CN106021997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610329027.4A CN106021997B (en) 2016-05-17 2016-05-17 A kind of comparison method of three generations PacBio sequencing data

Publications (2)

Publication Number Publication Date
CN106021997A CN106021997A (en) 2016-10-12
CN106021997B true CN106021997B (en) 2019-03-29

Family

ID=57098804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610329027.4A Active CN106021997B (en) 2016-05-17 2016-05-17 A kind of comparison method of three generations PacBio sequencing data

Country Status (1)

Country Link
CN (1) CN106021997B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant