CN106021997A

CN106021997A - Third-generation PacBio sequencing data comparison method

Info

Publication number: CN106021997A
Application number: CN201610329027.4A
Authority: CN
Inventors: 詹东亮; 王军; 王军一; 郝美荣; 何荣军; 俞凯成; 高金龙; 蔡庆乐
Original assignee: HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Current assignee: HANGZHOU HEYI GENE TECHNOLOGY Co Ltd
Priority date: 2016-05-17
Filing date: 2016-05-17
Publication date: 2016-10-12
Anticipated expiration: 2036-05-17
Also published as: CN106021997B

Abstract

The invention provides an alignment (comparison) method for third-generation PacBio sequencing data that effectively reduces the alignment errors caused by repetitive sequences. In this method, a k-mer model is built from second-generation Illumina data and unique-kmers are extracted; when third-generation PacBio sequencing data is aligned, the unique-kmers are used as alignment seeds, which greatly reduces the influence of repetitive sequences and increases alignment speed.

Description

An alignment method for third-generation PacBio sequencing data

Technical field

The present invention relates to the technical field of bioinformatics, and specifically to a method for aligning DNA sequences. It models second-generation Illumina sequencing data to extract key information and uses that information to assist the alignment of third-generation PacBio sequencing data.

Background art

Third-generation PacBio sequencing data has a single-pass error rate of about 15%, and few alignment tools are designed specifically for third-generation data. The two most commonly used at present are: (1) blasr; (2) dalign.

Both are excellent third-generation aligners that tolerate the high error rate of PacBio reads. However, genomes themselves contain repetitive sequences, which are highly similar to one another. These aligners also align and report such repetitive sequences, which affects downstream biological analyses (such as assembly and expression analysis).

Summary of the invention

The object of the present invention is to solve the problem described above by providing a method for aligning third-generation PacBio sequencing data that effectively reduces the alignment errors caused by repetitive sequences. It builds a k-mer model from second-generation Illumina data and extracts unique-kmers; when third-generation PacBio sequencing data is aligned, these unique-kmers serve as the alignment seeds, which greatly reduces the influence of repetitive sequences and increases alignment speed.

The present invention is achieved by the following technical solutions:

The present invention is a method for aligning third-generation PacBio sequencing data, comprising the following steps:

(1) build a k-mer model from Illumina sequencing data and extract unique-kmers from it;

(2) use the unique-kmers as alignment seeds to screen candidate reads;

(3) align the candidate reads in detail.

Preferably, jellyfish software is used to compute k-mer statistics on the second-generation Illumina sequencing data; based on the k-mer distribution plot, the k-mers whose frequency lies within twice the main peak are taken as unique-kmers, and the unique-kmers are stored in a bit file or with the open-source GATB package.
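The selection rule above can be illustrated with a minimal Python sketch. It is not the program used in the invention: `count_kmers` is a toy stand-in for jellyfish, and the function names, the `min_freq` parameter (used to skip the low-frequency bins that sequencing errors fill in real data) and the strict `< 2 * peak` cutoff are assumptions made for illustration. Reverse-complement (canonical) k-mers are ignored for brevity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads (toy stand-in for jellyfish)."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def main_peak(counts, min_freq=2):
    """Return the frequency shared by the largest number of distinct k-mers.

    Frequencies below min_freq are skipped (assumption: in real data the
    lowest bins are dominated by sequencing-error k-mers).
    """
    hist = Counter(f for f in counts.values() if f >= min_freq)
    if not hist:
        raise ValueError("no k-mer reaches min_freq")
    # Most populated frequency; ties go to the lower frequency.
    return max(hist, key=lambda f: (hist[f], -f))


def select_unique_kmers(counts, peak):
    """Keep k-mers whose frequency is below twice the main peak.

    Whether the boundary is strict or inclusive is an assumption here.
    """
    return {kmer for kmer, c in counts.items() if c < 2 * peak}
```

With a count table whose main peak is at frequency 10, a k-mer seen 40 times is treated as repetitive and dropped, while k-mers seen 10 or 11 times are kept.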

Preferably, when k ≤ 17 the unique-kmers are stored in a bit file (*.bit) of size 2G, and when k > 17 they are stored in an (*.h5) file of the open-source GATB package.
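The 2G figure is consistent with a bitmap that reserves one bit for every possible 17-mer: 4^17 = 2^34 bits = 2^31 bytes. The sketch below shows such a bitmap in Python. The actual layout of the *.bit file is not given here, so the 2-bit base encoding, the bit order and the class name `KmerBitmap` are assumptions; k-mers containing N are not handled.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base (assumed encoding)


def kmer_index(kmer):
    """Map a k-mer to an integer in [0, 4**k) using 2 bits per base."""
    idx = 0
    for base in kmer:
        idx = (idx << 2) | ENCODE[base]
    return idx


def bitmap_bytes(k):
    """Bytes needed for one bit per possible k-mer."""
    return (4 ** k + 7) // 8


class KmerBitmap:
    """In-memory bitmap: bit i is set when the k-mer with index i is unique."""

    def __init__(self, k):
        self.k = k
        self.bits = bytearray(bitmap_bytes(k))

    def add(self, kmer):
        i = kmer_index(kmer)
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, kmer):
        i = kmer_index(kmer)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))
```

For k = 17, `bitmap_bytes(17)` equals 2 * 1024**3, which matches the stated file size; each additional base multiplies the size by four, which is one plausible reason for switching to a different container when k > 17.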

Preferably, in step (2), using the unique-kmers from step (1), reads that share more than 3 unique-kmers are selected as candidate reads.
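A minimal sketch of this screening step follows. The function names are hypothetical, `unique` can be any container supporting `in` (for example a Python set), only the first occurrence of each unique-kmer in a read is recorded, and reverse-complement matches are ignored; these are simplifications for illustration rather than details of the invention.

```python
def seed_hits(read, k, unique):
    """Return {kmer: first position} for the unique-kmers found in a read."""
    hits = {}
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in unique and kmer not in hits:
            hits[kmer] = i
    return hits


def shared_seeds(read1, read2, k, unique):
    """List (position on read1, position on read2) for each shared unique-kmer."""
    h1 = seed_hits(read1, k, unique)
    h2 = seed_hits(read2, k, unique)
    return sorted((h1[m], h2[m]) for m in h1.keys() & h2.keys())


def is_candidate_pair(read1, read2, k, unique, min_shared=3):
    """Two reads are candidates when they share more than min_shared unique-kmers."""
    return len(shared_seeds(read1, read2, k, unique)) > min_shared
```

The seed coordinates returned by `shared_seeds` are the points that the detailed alignment step later places in its coordinate system.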

Preferably, step (3) comprises the following steps:

First, the matched seeds are clustered and the most probable alignment range is calculated, as follows:

A coordinate system is set up in which the abscissa is the matched position on read1 and the ordinate is the matched position on read2. Each point represents a seed shared by the two reads. The seeds are clustered along straight lines of slope 1, and the line that gathers the most points is taken as the aligned region;

Then the alignment range is divided into small regions; for each region the LCS algorithm is used to compute the similarity, and the whole is then scored, as follows:

Suppose the alignment range is divided into n regions, of which b regions have a similarity greater than 0.8, and the total number of similar bases in these small regions is c. The region similarity is then b/n and the base similarity is c/a, where a is taken here to be the total number of bases in the alignment range. Finally, only the data for which both values are greater than 0.7 are retained.
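The seed clustering in the first step can be sketched as follows. Seeds on a line of slope 1 share the same diagonal value y - x; because PacBio reads contain insertions and deletions, the sketch groups seeds whose diagonals are close rather than identical. The tolerance parameter `band`, its default value and the function names are assumptions, not values from the invention.

```python
def best_diagonal_cluster(seeds, band=50):
    """Return the largest group of seeds lying near one slope-1 line.

    seeds: list of (position on read1, position on read2).
    Seeds are sorted by diagonal (y - x) and chained while consecutive
    diagonals differ by at most `band` (assumed tolerance for indels).
    """
    if not seeds:
        return []
    by_diag = sorted(seeds, key=lambda s: s[1] - s[0])
    best, current = [], [by_diag[0]]
    for seed in by_diag[1:]:
        prev = current[-1]
        if (seed[1] - seed[0]) - (prev[1] - prev[0]) <= band:
            current.append(seed)
        else:
            if len(current) > len(best):
                best = current
            current = [seed]
    if len(current) > len(best):
        best = current
    return best


def alignment_range(cluster):
    """Bounding interval of the clustered seed start positions on each read."""
    xs = [x for x, _ in cluster]
    ys = [y for _, y in cluster]
    return (min(xs), max(xs)), (min(ys), max(ys))
```

In this sketch, seeds that fall on distant diagonals, which is where matches caused by repeats tend to land, are left out of the winning cluster and therefore do not widen the alignment range.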

The beneficial effects of the present invention are as follows:

1. Unique-kmers are extracted from second-generation Illumina sequencing data, which improves both the accuracy and the speed of alignment.

A genome contains many repetitive sequences, and some short tandem repeats occur hundreds or even thousands of times, which lowers alignment accuracy and lengthens alignment time. To improve accuracy and reduce alignment time, we extract the k-mers that occur only once in the contigs as unique-kmers. Because second-generation Illumina sequencing data is of high quality, when the sequencing depth is sufficient and random (typically about 40x), running Jellyfish on the Illumina data yields the k-mer distribution plot (Fig. 1). The k-mers within twice the peak are taken as unique-kmers. When k ≤ 17 they are stored in a bit file of size 2G (*.bit file); when k > 17 they are stored in a file (*.h5 file) using GATB (an open-source framework). The second-generation Illumina data used here is of high quality, and Jellyfish runs multithreaded, is fast and uses little memory, so the method as a whole achieves high data-processing quality and a significant speed advantage;

2. Unique-kmers are used as alignment seeds to screen candidate reads, which saves alignment time and improves alignment speed.

Because a unique-kmer, in probabilistic and theoretical terms, occurs only once in a haploid genome, the influence of repetitive sequences is avoided. As a result, the candidate reads found are highly accurate, which saves a great deal of alignment time and greatly increases alignment speed.

3. Candidate reads are aligned in detail, which saves memory and alignment time and improves alignment speed.

Many alignment tools use the longest common subsequence (LCS) algorithm, but they run LCS directly over the whole region, which wastes a great deal of memory and time when the alignment region exceeds 100k. Our method also uses this algorithm, with two improvements: (1) the seed matches are clustered in advance to compute the optimal alignment range; (2) the alignment is performed region by region. This saves memory and alignment time and improves alignment speed.
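The region-by-region LCS scoring can be sketched in Python as below. Several points are assumptions made for illustration: both reads are cut at the same fixed offsets (a real implementation would have to follow the indel drift between the reads), the similarity of a segment is its LCS length divided by the longer segment length, c is summed over the segments that pass the 0.8 cutoff, and a is the length of the longer input. The 100 bp segment length follows the value suggested in the embodiment.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using two DP rows."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score_alignment(seq1, seq2, seg_len=100, region_cut=0.8, keep_cut=0.7):
    """Return (region similarity b/n, base similarity c/a, keep flag).

    seq1 and seq2 are the parts of the two reads inside the alignment range.
    """
    a = max(len(seq1), len(seq2))  # total bases (assumed definition of a)
    n = b = c = 0
    for start in range(0, min(len(seq1), len(seq2)), seg_len):
        s1 = seq1[start:start + seg_len]
        s2 = seq2[start:start + seg_len]
        n += 1
        lcs = lcs_length(s1, s2)
        if lcs / max(len(s1), len(s2)) > region_cut:
            b += 1
            c += lcs
    if n == 0:
        return 0.0, 0.0, False
    region_sim = b / n
    base_sim = c / a
    return region_sim, base_sim, region_sim > keep_cut and base_sim > keep_cut
```

Each LCS call here works on at most `seg_len` bases per read, so the dynamic-programming work per segment is bounded by `seg_len` squared and the total grows linearly with the length of the alignment range, instead of quadratically as it would for one LCS over the whole region.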

Brief description of the drawings

Fig. 1: k-mer distribution plot

All data is broken into fragments of length k (called k-mers). The abscissa is the frequency with which a k-mer occurs, and the ordinate is the number of distinct k-mers with that frequency. The k-mers within twice the peak are taken as unique-kmers.

Fig. 2: Schematic diagram of calculating the alignment range

Each point in the figure represents a seed shared by the two reads; the abscissa is the seed's position on read1 and the ordinate is its position on read2. The seeds are clustered along straight lines of slope 1, the line with the most clustered seeds is selected, and its region is taken as the alignment range.

Fig. 3: Flow chart of the present invention

Detailed description of the embodiments

The embodiments of the present invention are further described below with reference to the accompanying drawings:

Embodiment:

(1) Build a k-mer model from second-generation Illumina sequencing data and extract unique-kmers from it

Jellyfish software is used to compute k-mer statistics on the second-generation Illumina sequencing data. All data is broken into fragments of length k (called k-mers); the abscissa is the frequency with which a k-mer occurs, and the ordinate is the number of distinct k-mers with that frequency. Based on the k-mer distribution plot, the k-mers within twice the main peak are taken as unique-kmers. When k ≤ 17 they are stored in a bit file (*.bit) of size 2G; when k > 17 they are stored in an (*.h5) file of the open-source GATB package. Here, second-generation Illumina sequencing data refers to second-generation sequencing data produced by Illumina sequencers.

Following the above method, a program was written to extract unique-kmers. Its command-line usage is as follows:

A concrete example is carried out as follows:

Select about 40X of data from the second-generation Illumina sequencing data and list it in a file, fq.lst:

Then run the program to obtain the unique-kmers:

Because k=17 is chosen, the result is stored in a bit file: k17.bit

(2) Use the unique-kmers to align the third-generation PacBio sequencing data and screen candidate reads

The unique-kmers are used as the alignment seeds; reads that share more than 3 unique-kmers are taken as candidate reads. Here, third-generation PacBio sequencing data refers to third-generation sequencing data produced by PacBio sequencers.

Following the above method, an alignment program was written to align the third-generation PacBio sequencing data. Its command-line usage is as follows:

A concrete example is carried out as follows:

Two third-generation PacBio data files, read1.fa and read2.fa, are used, together with the unique-kmer file extracted from the second-generation Illumina sequencing data, k17.bit. The following command is run to perform the alignment:

(3) Align the candidate reads in detail.

Then the alignment range is divided into small regions (the segment length can be set to 100 bp); for each segment the LCS algorithm is used to compute the similarity, and the whole is then scored, as follows:

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the core technical features of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for aligning third-generation PacBio sequencing data, characterized in that it comprises the following steps:

(1) building a k-mer model from second-generation Illumina sequencing data and extracting unique-kmers from it;

(2) using the unique-kmers as alignment seeds, aligning them against third-generation PacBio sequencing data, and screening out candidate reads;

(3) aligning the candidate reads in detail.

2. The method for aligning third-generation PacBio sequencing data according to claim 1, characterized in that, in step (1), jellyfish software is used to compute k-mer statistics on the second-generation Illumina sequencing data, the k-mers within twice the main peak of the k-mer distribution plot are taken as unique-kmers, and the unique-kmers are stored in a bit file or with the open-source GATB package.

3. The method for aligning third-generation PacBio sequencing data according to claim 2, characterized in that, when k ≤ 17, the unique-kmers are stored in a bit file (*.bit) of size 2G, and when k > 17, the unique-kmers are stored in an (*.h5) file of the open-source GATB package.

4. The method for aligning third-generation PacBio sequencing data according to claim 1, characterized in that, in step (2), using the unique-kmers from step (1), reads that share more than 3 unique-kmers are screened out as candidate reads.

5. The method for aligning third-generation PacBio sequencing data according to claim 1, characterized in that step (3) comprises the following steps:

setting up a coordinate system in which the abscissa is the matched position on read1 and the ordinate is the matched position on read2, each point representing a seed shared by the two reads; clustering the seeds along straight lines of slope 1, and taking the line that gathers the most points as the aligned region;

then dividing the alignment range into small regions, computing the similarity of each region with the LCS algorithm, and scoring the whole as follows:

supposing the alignment range is divided into n regions, of which b regions have a similarity greater than 0.8, and the total number of similar bases in these small regions is c, the region similarity is b/n and the base similarity is c/a, where a is taken here to be the total number of bases in the alignment range; finally, only the data for which both values are greater than 0.7 are retained.