CN106021997B

CN106021997B - Alignment method for third-generation PacBio sequencing data

Info

Publication number: CN106021997B
Application number: CN201610329027.4A
Authority: CN
Inventors: 詹东亮; 王军; 王军一; 郝美荣; 何荣军; 俞凯成; 高金龙; 蔡庆乐
Original assignee: HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Current assignee: HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Priority date: 2016-05-17
Filing date: 2016-05-17
Publication date: 2019-03-29
Anticipated expiration: 2036-05-17
Also published as: CN106021997A

Abstract

The present invention provides an alignment method for third-generation PacBio sequencing data that effectively reduces the alignment errors caused by repetitive sequences. A k-mer model is built from second-generation Illumina data and the unique k-mers (unique-kmers) are extracted from it. When aligning third-generation PacBio sequencing data, these unique-kmers are used as the alignment seeds, which greatly reduces the influence of repetitive sequences and increases alignment speed.

Description

An alignment method for third-generation PacBio sequencing data

Technical field

The present invention relates to the technical field of bioinformatics, and in particular to a DNA sequence alignment method. It builds a model from second-generation Illumina sequencing data, extracts key information from that model, and uses this key information to assist the alignment of third-generation PacBio sequencing data.

Background technique

The single-pass error rate of third-generation PacBio sequencing data is about 15%, and few alignment tools specifically support third-generation data. The two most commonly used at present are: (1) blasr; (2) dalign.

Both are excellent third-generation alignment tools that tolerate the high error rate of PacBio data. However, genomes themselves contain repetitive sequences that are highly similar to one another. These tools align and report such repetitive sequences as well, which affects downstream biological analysis (for example assembly and expression analysis).

Summary of the invention

The object of the present invention is to solve the problem described above by providing an alignment method for third-generation PacBio sequencing data that effectively reduces the alignment errors caused by repetitive sequences. A k-mer model is built from second-generation Illumina data and the unique-kmers are extracted from it. When aligning third-generation PacBio sequencing data, these unique-kmers are used as the alignment seeds, which greatly reduces the influence of repetitive sequences and increases alignment speed.

The present invention is achieved by the following technical solutions:

The present invention is an alignment method for third-generation PacBio sequencing data, comprising the following steps:

(1) a k-mer model is built from Illumina sequencing data, and the unique-kmers are extracted from it;

(2) the unique-kmers are used as alignment seeds to screen for candidate reads;

(3) the candidate reads are aligned in detail.

Preferably, k-mer counting is performed on the second-generation Illumina sequencing data with the jellyfish software; based on the k-mer distribution plot, the k-mers lying within twice the main peak are taken as unique-kmers, and the unique-kmers are stored using either a bit file or the open-source GATB package.
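As an illustration of this selection rule, the following Python sketch counts k-mers, locates the main peak of the frequency histogram, and keeps the k-mers whose frequency is at most twice that peak. It is not the patented program: the function names, the in-memory Counter used in place of jellyfish, the min_freq cutoff that discards low-frequency (likely erroneous) k-mers, and the reading of "within twice the main peak" as an inclusive upper bound are all assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer of length k across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def main_peak(counts, min_freq=2):
    """Return the frequency shared by the largest number of distinct k-mers.

    Frequencies below min_freq are skipped on the assumption that they are
    dominated by sequencing errors (an assumption, not stated in the text).
    """
    histogram = Counter(f for f in counts.values() if f >= min_freq)
    if not histogram:
        raise ValueError("no k-mer reaches min_freq")
    # Ties are broken in favour of the lower frequency.
    return min(histogram, key=lambda f: (-histogram[f], f))


def select_unique_kmers(counts, min_freq=2):
    """Keep k-mers whose frequency lies between min_freq and twice the peak."""
    peak = main_peak(counts, min_freq)
    return {kmer for kmer, f in counts.items() if min_freq <= f <= 2 * peak}
```

On real data the counting step would be delegated to a dedicated k-mer counter; only the histogram and threshold logic is the point of the sketch.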

Preferably, for k≤17 the unique-kmers are stored in a single bit file (*.bit) of size 2G, and for k>17 they are stored in a (*.h5) file of the open-source GATB package.
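The 2G figure is consistent with one presence bit per possible 17-mer: there are 4^17 = 2^34 distinct 17-mers, and 2^34 bits is 2^31 bytes, i.e. 2 GiB. A minimal Python sketch of such a bit array follows. The 2-bit base encoding (A=0, C=1, G=2, T=3), the bit order within each byte, and the class and function names are assumptions for illustration, since the layout of the actual *.bit file is not described here.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_index(kmer):
    """Encode a k-mer as an integer using 2 bits per base."""
    index = 0
    for base in kmer:
        index = (index << 2) | BASE_CODE[base]
    return index


class KmerBitSet:
    """One presence bit per possible k-mer, i.e. 4**k bits in total."""

    def __init__(self, k):
        self.k = k
        # 4**k bits rounded up to whole bytes (2 GiB when k = 17).
        self.bits = bytearray((4 ** k + 7) // 8)

    def add(self, kmer):
        """Mark a k-mer as present."""
        index = kmer_to_index(kmer)
        self.bits[index >> 3] |= 1 << (index & 7)

    def __contains__(self, kmer):
        """Return True if the k-mer was added; unknown bases never match."""
        if len(kmer) != self.k or any(b not in BASE_CODE for b in kmer):
            return False
        index = kmer_to_index(kmer)
        return bool(self.bits[index >> 3] & (1 << (index & 7)))
```

Lookup is a single shift and mask, which is why a flat bit array is attractive for small k; beyond k = 17 the table grows fourfold per extra base, which matches the switch to a different storage format for k>17.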

Preferably, in step (2), using the unique-kmers from step (1), reads that share more than 3 unique-kmers with one another are selected as candidate reads.
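The screening rule can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the unique-kmers are held in any container that supports membership tests, only the forward strand is considered, and every pair of reads is compared directly, whereas a practical implementation would index the seeds rather than enumerate all pairs.

```python
from itertools import combinations


def unique_seeds(read, k, unique_kmers):
    """Return the set of unique-kmers that occur in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)
            if read[i:i + k] in unique_kmers}


def candidate_pairs(reads, k, unique_kmers, min_shared=3):
    """Yield (name1, name2, shared) for read pairs sharing more than min_shared seeds.

    reads maps a read name to its sequence.
    """
    seeds = {name: unique_seeds(seq, k, unique_kmers)
             for name, seq in reads.items()}
    for a, b in combinations(sorted(seeds), 2):
        shared = seeds[a] & seeds[b]
        if len(shared) > min_shared:
            yield a, b, shared
```

The default min_shared=3 mirrors the "more than 3" threshold stated above.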

Preferably, step (3) comprises the following steps:

A. The matched seeds are first clustered and the most probable alignment range is calculated, as follows:

A coordinate system is established in which the abscissa is the matched position on read1 and the ordinate is the matched position on read2. Each point represents a seed shared by the two reads. The seeds are clustered along straight lines of slope 1, and the line that gathers the most points is taken as the aligned region;

B. The alignment range is then divided into small regions; for each region the similarity is calculated with the LCS algorithm, and the alignment as a whole is then scored, as follows:

Suppose the alignment range is divided into n regions, b of which have a similarity greater than 0.8, and the total number of similar bases in these regions is c. The region similarity is then b/n and the base similarity is c/a. Only the data for which both values are greater than 0.7 are retained.
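The scoring of step B can be sketched in Python as below. Several details are assumptions made for the example because the text does not fix them: both sequences are cut at the same offsets, a region's similarity is its LCS length divided by the region length, c sums the LCS lengths of the regions that pass the 0.8 cutoff, and a is taken to be the total number of bases compared. The function names are hypothetical.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence, using O(len(t)) memory."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        curr = [0]
        for j, other in enumerate(t, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def score_alignment(seq1, seq2, region_len=100,
                    region_cutoff=0.8, keep_cutoff=0.7):
    """Return (region similarity b/n, base similarity c/a, keep flag)."""
    a = min(len(seq1), len(seq2))  # assumed meaning of "a"
    if a == 0:
        return 0.0, 0.0, False
    seq1, seq2 = seq1[:a], seq2[:a]
    n = b = c = 0
    for start in range(0, a, region_len):
        r1 = seq1[start:start + region_len]
        r2 = seq2[start:start + region_len]
        lcs = lcs_length(r1, r2)
        n += 1
        if lcs / len(r1) > region_cutoff:
            b += 1
            c += lcs
    region_sim = b / n
    base_sim = c / a
    keep = region_sim > keep_cutoff and base_sim > keep_cutoff
    return region_sim, base_sim, keep
```

With 100 bp regions, a 100 kb range needs 1,000 LCS tables of about 100 x 100 cells each, roughly 10^7 cell updates in total, compared with roughly 10^10 for a single LCS over the whole range.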

Beneficial effects of the present invention are as follows:

1. Extracting the unique-kmers from second-generation Illumina sequencing data improves both the accuracy and the speed of alignment.

A genome contains many repetitive sequences, and some short tandem repeats occur hundreds or even thousands of times; this lowers alignment accuracy and lengthens alignment time. To improve accuracy and shorten alignment time, the k-mers that occur only once in the contigs are extracted as unique-kmers. Because second-generation Illumina sequencing data is of very high quality, when the sequencing depth is sufficient and random (typically ~40x), k-mer counting on the second-generation Illumina sequencing data with the Jellyfish software yields the k-mer distribution plot (Fig. 1). The k-mers in the region within 2 times the peak are taken as unique-kmers. For k≤17 they are stored in a single bit file (*.bit) of size 2G, and for k>17 they are stored in a file (*.h5) using GATB (an open-source framework). The second-generation Illumina sequencing data used is of high quality, and the Jellyfish software runs multithreaded, is fast, and has low memory consumption, so the method as a whole achieves high data-processing quality and a clear advantage in processing speed.

2. Using the unique-kmers as alignment seeds to screen for candidate reads saves alignment time and increases alignment speed.

In probabilistic and theoretical terms, a unique-kmer occurs only once in a haploid genome, so the influence of repetitive sequences can be avoided. Moreover, because that influence is avoided, the candidate reads found are highly accurate, which saves a great deal of alignment time and substantially increases alignment speed.

3. Aligning the candidate reads in detail saves memory and alignment time and increases alignment speed.

Many alignment tools use the longest common subsequence (LCS) algorithm and apply it directly to the whole region, which wastes a great deal of memory and time for aligned regions longer than 100k. The present method also uses this algorithm, but with two improvements: (1) the seed matches are clustered in advance to calculate the optimal alignment range; (2) the alignment is carried out region by region. This saves memory and alignment time and increases alignment speed.
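The first improvement, clustering the seed matches along lines of slope 1, can be sketched in Python as follows. A seed at position x on read1 and y on read2 lies on the line y - x = const, so grouping by that offset groups the seeds by diagonal. The function names are hypothetical, and the band parameter, which merges neighbouring diagonals to tolerate the drift that insertions and deletions introduce, is an assumption added for the example rather than something stated in the text.

```python
from collections import defaultdict


def best_diagonal(seeds, band=1):
    """Group seed hits by diagonal and return the most populated group.

    seeds: iterable of (pos1, pos2), the positions of a shared seed on
           read1 and read2.
    band:  width of the offset bins; band=1 means exact slope-1 lines.
    """
    groups = defaultdict(list)
    for pos1, pos2 in seeds:
        groups[(pos2 - pos1) // band].append((pos1, pos2))
    if not groups:
        return None
    return max(groups.values(), key=len)


def alignment_range(points):
    """Bounding seed start positions of the cluster on each read."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), max(xs)), (min(ys), max(ys))
```

The returned coordinates are seed start positions; the seed length k would be added to the upper bounds to obtain the full range handed to the region-by-region LCS step.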

Brief description of the drawings

Fig. 1: k-mer distribution plot

All the data is broken into fragments of length k (called k-mers). The abscissa is the k-mer frequency and the ordinate is the number of distinct k-mers with that frequency. The k-mers in the region within 2 times the peak are taken as unique-kmers.

Fig. 2: schematic diagram of the alignment range calculation

Each point in the figure represents a seed shared by the two reads; the abscissa is the matched position on read1 and the ordinate is the matched position on read2. The seeds are clustered along straight lines of slope 1, the line with the largest cluster is selected, and this region is taken as the aligned range.

Fig. 3: flow chart of the present invention

Specific embodiment

The embodiment of the present invention is further elaborated with reference to the accompanying drawing:

Embodiment:

(1) A k-mer model is built from second-generation Illumina sequencing data, and the unique-kmers are extracted from it

K-mer counting is performed on the second-generation Illumina sequencing data with the jellyfish software. All the data is broken into fragments of length k (called k-mers); in the resulting distribution plot the abscissa is the k-mer frequency and the ordinate is the number of distinct k-mers with that frequency. Based on the k-mer distribution plot, the k-mers lying within twice the main peak are taken as unique-kmers. For k≤17 they are stored in a single bit file (*.bit) of size 2G, and for k>17 they are stored in a (*.h5) file of the open-source GATB package. Here, second-generation Illumina sequencing data refers to second-generation sequencing data produced by sequencers from Illumina.

A program was written according to the above method to extract the unique-kmers. Its command-line usage is as follows:

A concrete example is run as follows:

About 40X of second-generation Illumina sequencing data is selected and written to a file named fq.lst:

The program is then run to obtain the unique-kmers:

Because k=17 is chosen, the result is stored in a bit file: k17.bit

(2) The unique-kmers are matched against the third-generation PacBio sequencing data to screen for candidate reads

The unique-kmers are used as the alignment seeds: reads that share more than 3 unique-kmers with one another are taken as candidate reads. Here, third-generation PacBio sequencing data refers to third-generation sequencing data produced by sequencers from PacBio.

An alignment program was written according to the above method to align the third-generation PacBio sequencing data. Its command-line usage is as follows:

A concrete example is run as follows:

Two third-generation PacBio sequencing data files, read1.fa and read2.fa, are used together with the unique-kmer file extracted from the second-generation Illumina sequencing data, k17.bit. The alignment is run with the following command:

(3) The candidate reads are aligned in detail.

B. The alignment range is then divided into small regions (the region length can be set to 100bp); for each region the similarity is calculated with the LCS algorithm, and the alignment as a whole is then scored, as follows:

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and modifications without departing from the core technical features of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. An alignment method for third-generation PacBio sequencing data, characterized in that it comprises the following steps:

(1) a k-mer model is built from second-generation Illumina sequencing data, and the unique-kmers are extracted from it;

(2) the unique-kmers are used as alignment seeds and matched against the third-generation PacBio sequencing data to screen out candidate reads;

(3) the candidate reads are aligned in detail, comprising the following steps:

a coordinate system is established in which the abscissa is the matched position on read1 and the ordinate is the matched position on read2, each point represents a seed shared by the two reads, the seeds are clustered along straight lines of slope 1, and the line that gathers the most points is taken as the aligned region;

B. the alignment range is then divided into small regions; for each region the similarity is calculated with the LCS algorithm, and the alignment as a whole is then scored, as follows:

suppose the alignment range is divided into n regions, b of which have a similarity greater than 0.8, and the total number of similar bases in these regions is c; the region similarity is then b/n and the base similarity is c/a, and only the data for which both values are greater than 0.7 are retained.

2. The alignment method for third-generation PacBio sequencing data according to claim 1, characterized in that, in step (1), k-mer counting is performed on the second-generation Illumina sequencing data with the jellyfish software, the k-mers lying within twice the main peak of the k-mer distribution plot are taken as unique-kmers, and the unique-kmers are stored using a bit file or the open-source GATB package.

3. The alignment method for third-generation PacBio sequencing data according to claim 2, characterized in that, for k≤17, the unique-kmers are stored in a bit file *.bit of size 2G, and for k>17 the unique-kmers are stored in a *.h5 file of the open-source GATB package.

4. The alignment method for third-generation PacBio sequencing data according to claim 1, characterized in that, in step (2), using the unique-kmers from step (1), reads that share more than 3 unique-kmers with one another are selected as candidate reads.