CN106599617A

CN106599617A - Mass sequencing data error correcting method applied to distributed system

Info

Publication number: CN106599617A
Application number: CN201611186654.3A
Authority: CN
Inventors: 林劼; 江育娥
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2017-04-26
Anticipated expiration: 2036-12-20
Also published as: CN106599617B

Abstract

The invention discloses a massive sequencing data error correction method applied to a distributed system. The distributed system comprises a master node, a switch, and a plurality of compute nodes, the compute nodes being connected to the master node via the switch. The method comprises the following steps: 1, preprocessing the sequencing data and determining a grouping standard for the sequencing data; 2, dividing the sequencing data into partitions, balancing the load of each compute node in the distributed system, and transmitting the sequencing data to the compute nodes; and 3, performing distributed error correction on the sequencing data. Compared with a centralized system, the method offers high speed, high accuracy, and low cost when processing massive sequencing data.

Description

A massive sequencing data error correction method running on a distributed system

Technical field

The present invention relates to the interdisciplinary field of biological gene technology and computer science and technology, and more particularly to a massive sequencing data error correction method running on a distributed system.

Background art

Next-generation high-throughput sequencing (next generation sequencing, NGS, often called second-generation sequencing or new-generation sequencing in Chinese) makes whole-genome analysis and personalized genetic medicine possible. Compared with traditional Sanger sequencing, next-generation sequencing is fast and inexpensive, but its drawback is that it produces a large amount of short sequence data together with the errors those data carry. Owing to the limitations of experimental technique, these short sequences inevitably contain errors. If the errors are not corrected before sequence assembly, the assembly algorithm works from erroneous data and the quality of the final sequence is reduced. Correcting the short sequence data before assembling them into long sequences (contigs) is therefore a very important step; it is the precondition and guarantee for reconstructing reliable long sequences.

Errors in sequencing data have always been a major problem affecting sequence quality and downstream analysis. The error rate in next-generation sequencing is related to base quality and is jointly influenced by multiple factors such as the sequencer itself, the sequencing reagents, and the sample. Sequencing errors not only interfere with normal assembly of the sequencing data but also prevent correct identification of the genetic polymorphisms present in the sample, making it difficult to obtain valuable results. Because the sequencing experiment is complex and each stage involves many uncontrollable random factors, it is difficult to eliminate sequencing errors completely through standardization and improvement of experimental technique alone.

Next-generation sequencing breaks the whole sequence to be measured into short fragments (reads) and measures each short read multiple times. All error correction methods rest on one premise: most of the reads produced by sequencing are correct, and only a minority contain errors. For example, during error detection and correction, suppose there are M identical copies of sequence A and N identical copies of sequence B, and the Hamming distance between sequence A and sequence B is within the specified threshold (y). In this case sequence A and sequence B are generally considered to come from the same region of the original sequence to be measured. The values M and N are then compared: the more numerous sequence is regarded as correct, and the less numerous sequence is corrected to the more numerous one.

The error correction methods currently in use fall mainly into three kinds: (1) methods based on the k-spectrum; (2) methods based on suffix trees/suffix arrays; (3) methods based on multiple sequence alignment (MSA).

Existing error correction algorithms have high computational complexity and low execution efficiency, and they place heavy demands on computing resources, so they are not suited to massive sequencing data. Processing massive data requires a large amount of memory and a very long running time. In particular, when whole-sequence sequencing produces massive data, an ordinary server cannot provide enough memory and computing power, and a supercomputer is needed.
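The majority premise described above can be illustrated with a minimal Python sketch. The function names `hamming` and `majority_correct`, the equal-length assumption for reads, and the sample reads are illustrative assumptions and are not taken from the original text.

```python
from collections import Counter

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))

def majority_correct(reads: list[str], threshold: int) -> list[str]:
    """Replace each read by the most frequent read within `threshold` mismatches."""
    counts = Counter(reads)
    corrected = []
    for read in reads:
        best = read
        for other, n in counts.items():
            # A more numerous neighbour within the threshold is taken as correct.
            if hamming(read, other) <= threshold and n > counts[best]:
                best = other
        corrected.append(best)
    return corrected

# M = 5 copies of sequence A, N = 1 copy of sequence B, Hamming distance 1.
reads = ["ACGTAC"] * 5 + ["ACGTTC"]
print(majority_correct(reads, threshold=1))
```

Here the single copy of `ACGTTC` is rewritten to `ACGTAC` because M > N and the two sequences lie within the threshold.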

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art and to provide a massive sequencing data error correction method running on a distributed system.

The technical solution adopted by the present invention is:

A massive sequencing data error correction method running on a distributed system, wherein the distributed system comprises a master node, a switch, and several compute nodes, the compute nodes being connected to the master node through the switch, and the massive sequencing data error correction method comprises the following steps:

1) preprocessing the sequencing data and determining the grouping standard of the sequencing data;

2) partitioning the sequencing data, balancing the load of each compute node of the distributed system, and transmitting the sequencing data to the compute nodes;

3) performing distributed error correction on the sequencing data.

Determining the grouping standard of the sequencing data in step 1 specifically comprises the following steps:

1-1, data sampling: the data are sampled according to the characteristics of the sequencing data to be processed, ensuring that the sampled sequencing data are reasonably representative;

1-2, clustering of the sampled sequencing data: a sequence similarity algorithm is applied to calculate the similarity between the short sequences of the sampled sequencing data, and a statistical method is applied to group the sampled sequencing data into classes of similar sequences;

1-3, feature extraction for each class: the sequencing data making up each class are combined and processed, and features are extracted for quickly computing the distance between a short sequence and that class.

Balancing the load of each compute node of the distributed system in step 2 comprises the following steps:

2-1, determining the distance between each sequencing datum and each sample cluster and, from the calculated distances, determining the clusters to which each sequencing datum belongs;

2-2, performing the load-balancing calculation according to the quantity of sequencing data in each cluster;

2-3, assigning clusters to designated compute nodes according to the load balance;

2-4, sending the sequencing data to the corresponding compute nodes.

In step 2-2 the cluster standard data volume is obtained from the cluster sequencing data quantities: cluster standard data volume = min(differences between the sequencing data quantities of the classes), and the number of standard data units of each cluster = that cluster's sequencing data quantity / standard data volume, with the fractional part rounded up.

In step 2-3 the load-balancing calculation is carried out according to the standard data volume of each class and the standard computing capacity of the compute nodes to determine the compute node corresponding to each cluster's data, subject to the constraints that one compute node may process the data of one or more clusters and that one cluster can be assigned to only one compute node.

When the sequencing data are transmitted in step 2-4, they are compressed with a compression algorithm and a check code is added.

In step 2, before the data are sent to the compute nodes, the data are first preprocessed to determine which node each sequencing datum should be sent to.
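The standard data volume calculation of step 2-2 can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, zero differences between equally sized clusters are ignored, and the smallest cluster size is used as a fallback when every cluster has the same size. The source does not say how those cases are handled.

```python
import math
from itertools import combinations

def standard_data_volume(cluster_sizes):
    """Smallest non-zero difference between any two cluster sizes."""
    diffs = [abs(a - b) for a, b in combinations(cluster_sizes, 2) if a != b]
    # Assumption: if all clusters have the same size, use that size as the unit.
    return min(diffs) if diffs else min(cluster_sizes)

def standard_units(cluster_sizes):
    """Number of standard data units per cluster, fractional part rounded up."""
    unit = standard_data_volume(cluster_sizes)
    return [math.ceil(size / unit) for size in cluster_sizes]

sizes = [1200, 900, 2500]
print(standard_data_volume(sizes))  # 300
print(standard_units(sizes))        # [4, 3, 9]
```

With cluster sizes 1200, 900, and 2500 the pairwise differences are 300, 1300, and 1600, so the unit is 300 and the clusters count as 4, 3, and 9 units.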

Performing distributed error correction on the sequencing data in step 3 comprises the following steps:

3-1, applying an error correction algorithm to the sequencing data on each compute node and calculating scores;

3-2, comprehensive decision: the score data are aggregated, and the error correction scheme is determined from the scores of the compute nodes.

In step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm to identify errors in the sequencing data on that node, calculates the error probability and error positions of each short sequence, and calculates an error correction score.

In step 3-2 each compute node uses the sequence number as the key and the score data as the value, and redistributes them to the compute nodes with a hash function, so that the three error correction scores of the same sequence are all delivered to the same compute node; a voting (election) algorithm then determines the error correction scheme for that sequence, and the determined error correction schemes are aggregated and returned as the error correction result.
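The hash redistribution and election of step 3-2 can be sketched in a single process as follows. The tuple layout `(read_id, correction, score)`, the function names, and the use of CRC32 as the hash function are assumptions made for illustration; the election rule shown (highest score wins) follows the statement in the description that the highest-scoring correction is chosen.

```python
import zlib
from collections import defaultdict

def target_node(read_id: str, num_nodes: int) -> int:
    """Deterministic hash, so every node routes a given sequence number identically."""
    return zlib.crc32(read_id.encode("utf-8")) % num_nodes

def shuffle_and_elect(node_outputs, num_nodes):
    """node_outputs: one list of (read_id, correction, score) tuples per compute node."""
    # inbox[node][read_id] collects every candidate routed to that node.
    inbox = defaultdict(lambda: defaultdict(list))
    for outputs in node_outputs:
        for read_id, correction, score in outputs:
            inbox[target_node(read_id, num_nodes)][read_id].append((score, correction))
    result = {}
    for per_read in inbox.values():
        for read_id, candidates in per_read.items():
            # Election: the highest-scoring candidate becomes the correction scheme.
            result[read_id] = max(candidates, key=lambda c: c[0])[1]
    return result

outputs = [
    [("r1", "ACGT", 0.9)],
    [("r1", "ACGA", 0.4), ("r2", "TTTT", 0.7)],
    [("r1", "ACGT", 0.8)],
]
print(shuffle_and_elect(outputs, num_nodes=3))
```

A deterministic hash such as CRC32 is used here instead of Python's built-in `hash`, whose string hashing varies between processes by default and would send the same key to different nodes.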

By adopting the above technical solution, the present invention makes full use of the performance of a distributed computing platform and proposes a solution for error correction of massive second-generation biological sequencing data in a distributed environment. The method takes full account of how reasonably the sequencing data are distributed and of the effect of load balance on the performance of the distributed computing platform. Cluster centers of the sequencing data are determined by sampling and clustering, and each sequencing datum is assigned by comparison with those centers. A unit-based method computes the unit computing capacity of the nodes and the unit cluster data volume, which serve as the basis for designing the load-balancing method. After the sequencing data are distributed, each compute node applies an error correction algorithm to compute error correction scores for the data on that node. Because the data volume each node must process is greatly reduced, the error correction efficiency of the whole system improves significantly. Finally the scores are aggregated and the highest-scoring correction is chosen as the error correction scheme.

The distributed massive sequencing data error correction solution of the present invention provides bioinformatics with a practical sequencing data error correction tool and offers a new approach for other massive-data applications, thereby enriching research on distributed computing and promoting the development of bioinformatics and high-performance parallel computing. Compared with a centralized system, the method of the present invention is faster, more accurate, and less costly when processing massive sequencing data.

Description of the drawings

The present invention is described in further detail below with reference to the drawings and specific embodiments;

Fig. 1 is a schematic diagram of the distributed system architecture of the present invention;

Fig. 2 is a schematic flow chart of the massive sequencing data error correction method running on a distributed system according to the present invention.

Specific embodiment

As shown in Fig. 1 and Fig. 2, the present invention discloses a massive sequencing data error correction method running on a distributed system. The distributed system comprises a master node, a switch, and several compute nodes, the compute nodes being connected to the master node through the switch. Each compute node may be a PC server or a desktop PC; the demands on the hardware environment are low, and because load balancing is performed, the compute nodes need not have a uniform configuration.

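The massive sequencing data error correction method comprises the following steps:

1) preprocessing the sequencing data and determining the grouping standard of the sequencing data;

2) partitioning the sequencing data, balancing the load of each compute node of the distributed system, and transmitting the sequencing data to the compute nodes;

3) performing distributed error correction on the sequencing data.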

The sequencing data to be processed are randomly sampled. The sample size is N/m³, where N is the total number of sequences and m is the number of compute nodes, with a minimum of roughly 1000-3000 sequences. The sampled data are then simulated with the Monte Carlo method, finally yielding a simulated sequencing data set of 1000-2000 sequences that can represent the overall characteristics of the sequencing data.

1-2, clustering of the sampled sequencing data: a sequence similarity algorithm is applied to calculate the similarity between the short sequences of the sampled sequencing data, and a statistical method is applied to group the sampled sequencing data into classes of similar sequences;

The Hamming distances between the short sequences of the sample data set are calculated. A hierarchical clustering algorithm is then applied to group the sampled sequencing data into classes of similar sequences according to Hamming distance. There are two clustering principles. The first groups sequences whose Hamming distance is within 5 into the same class; if a class contains fewer than 3 short sequences, the class has too few samples and is discarded. The second sets the number of clusters n, generally n > 6m, and applies the hierarchical clustering algorithm to group the sample data into n classes.
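The first clustering principle can be sketched as below. The source does not name a linkage criterion, so this sketch assumes single-linkage hierarchical clustering cut at Hamming distance 5, which is equivalent to merging every pair of reads within the threshold (done here with a union-find structure). Equal-length reads and the function names are also assumptions.

```python
def hamming(a, b):
    """Mismatch count; assumes both reads have the same length."""
    return sum(x != y for x, y in zip(a, b))

def cluster_by_threshold(reads, max_dist=5, min_size=3):
    """Single-linkage clusters at distance <= max_dist, dropping small classes."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the cluster representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) comparisons, acceptable for a sample of one to two thousand reads.
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if hamming(reads[i], reads[j]) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i, read in enumerate(reads):
        groups.setdefault(find(i), []).append(read)
    # Classes with fewer than min_size reads have too few samples and are discarded.
    return [g for g in groups.values() if len(g) >= min_size]

sample = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC", "CCCCCCCCCG"]
print(cluster_by_threshold(sample))
```

In this example the three A-rich reads form one class, while the two C-rich reads form a class of size 2 and are dropped.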

1-3, feature extraction for each class: the sequencing data making up each class are combined and processed, and features are extracted for quickly computing the distance between a short sequence and that class.

The sample sequencing data of each class are concatenated into one long sequence: the short sequences within the class are joined one after another in order, with a separator symbol between adjacent sequences. A probabilistic suffix array algorithm builds a suffix array from each class's long sequence and calculates the branch probability of each node. This step outputs a data structure recording the suffix array and branch probabilities of the class's long sequence, which constitutes the feature of that class of sample sequencing data.

In the feature extraction method, the sequences within a class are linearly combined, and the weight of each subsequence is determined from its frequency of occurrence and probability.

Here, the probabilistic suffix array (PSA) is an implementation of variable-length Markov models (VLMMs) built on the conventional suffix array. Like a suffix array, a PSA can represent all N(N + 1)/2 substrings from root to leaf. In the VLMM realized by the PSA model, the depth of the string at each node represents the length of the substring. By limiting the depth of the corresponding leaf nodes, strings of the same length can be represented, and hence the conditional probability of a given state occurring after a given sequence, which is the transition probability in the transition matrix. A transition probability is the relative frequency computed from one symbol and the symbols of the preceding substring path in the observed data along the given path. The conditional probability determined by the substring length can be calculated from a specific path in the PSA model. Because a suffix array records the string of each node by marking its start and end positions, and has N nodes, it can represent all N(N + 1)/2 substrings from root to leaf in linear space. A transition matrix with N(N + 1)/2 transition probabilities can therefore be represented by a probabilistic suffix array, a linear data structure of N nodes.

For a given sequencing datum s, its similarity to the probabilistic suffix array extracted from each class of the sample is calculated as follows: starting from the root node, the nodes of the PSA are visited and matched against s, and the matching probability between s and the probabilistic suffix array is calculated from the transition probabilities of the matched nodes. The 3 classes closest to s according to the matching probability are then taken as the clusters to which s belongs.
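The idea of scoring a read against each class with a variable-length Markov model and keeping the 3 best classes can be sketched as follows. This is a simplified stand-in and not a probabilistic suffix array: contexts are stored in a plain dictionary rather than a suffix array, the maximum context length of 3, the add-one smoothing, the ACGT alphabet, and all function names are assumptions introduced for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed read alphabet

def train_vlmm(reads, max_order=3):
    """Count next-symbol frequencies for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i, symbol in enumerate(read):
            for k in range(min(i, max_order) + 1):
                counts[read[i - k:i]][symbol] += 1
    return counts

def log_match_probability(model, read, max_order=3):
    """Log-probability of the read, using the longest context seen in training."""
    total = 0.0
    for i, symbol in enumerate(read):
        context = ""
        for k in range(min(i, max_order), -1, -1):
            if read[i - k:i] in model:
                context = read[i - k:i]
                break
        seen = model.get(context, {})
        # Add-one smoothing (an assumption) keeps unseen transitions non-zero.
        prob = (seen.get(symbol, 0) + 1) / (sum(seen.values()) + len(ALPHABET))
        total += math.log(prob)
    return total

def closest_classes(models, read, top=3):
    """Names of the `top` classes with the highest matching probability."""
    ranked = sorted(models, key=lambda name: log_match_probability(models[name], read),
                    reverse=True)
    return ranked[:top]

models = {
    "a": train_vlmm(["ACGTACGTACGT"] * 3),
    "b": train_vlmm(["TTTTGGGGTTTT"] * 3),
    "c": train_vlmm(["CCCCCCAAAAAA"] * 3),
    "d": train_vlmm(["GAGAGAGAGAGA"] * 3),
}
print(closest_classes(models, "ACGTACGT"))
```

Returning three classes rather than one mirrors the text: each read is later corrected on up to three nodes, which is why three scores per read reach the election step.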

2-2, the load-balancing calculation is performed according to the quantity of sequencing data in each cluster. The standard computing capacity is calculated from the number and processing capacity of the compute nodes; the computing capacity of each node is an integer multiple of the standard computing capacity, at least 1. The cluster standard data volume is calculated from the cluster sequencing data quantities; the data quantity of each cluster is an integer multiple of the standard data volume, at least 1.

In step 2-3 the load-balancing calculation is carried out according to the standard data volume of each class and the standard computing capacity of the compute nodes to determine the compute node corresponding to each cluster's data, subject to the constraints that one compute node may process the data of one or more clusters and that one cluster can be assigned to only one compute node. The calculation method is: first compute the ratio between the standard computing capacity and the standard data volume, then compute the assignment using a solution to the knapsack problem.
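The assignment constraints of step 2-3 can be illustrated with a small sketch. The source states that a knapsack-problem solution is used but does not give the formulation, so the code below substitutes a plain greedy heuristic (largest cluster first, placed on the node whose load relative to capacity would stay lowest). The function name, the dictionary inputs, and the heuristic itself are assumptions and should not be read as the patented calculation.

```python
def assign_clusters(cluster_units, node_capacity):
    """Greedy assignment of whole clusters to nodes.

    cluster_units: {cluster name: number of standard data units}
    node_capacity: {node name: multiple of the standard computing capacity}
    """
    load = {node: 0 for node in node_capacity}
    assignment = {}
    # Place the largest clusters first; each cluster goes to exactly one node.
    for cluster, units in sorted(cluster_units.items(), key=lambda kv: kv[1], reverse=True):
        node = min(load, key=lambda n: (load[n] + units) / node_capacity[n])
        assignment[cluster] = node
        load[node] += units
    return assignment, load

clusters = {"c1": 9, "c2": 4, "c3": 3, "c4": 2}
nodes = {"n1": 2, "n2": 1}
assignment, load = assign_clusters(clusters, nodes)
print(assignment)  # {'c1': 'n1', 'c2': 'n2', 'c3': 'n1', 'c4': 'n2'}
print(load)        # {'n1': 12, 'n2': 6}
```

In this example node n1 has twice the capacity of n2 and ends up with twice the load, so both nodes carry 6 units per unit of capacity.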

2-4, the sequencing data are sent to the corresponding compute nodes. According to the load-balancing method above, the sequencing data are transmitted to the corresponding compute nodes. Before the data are sent to the compute nodes, they are first preprocessed to determine which node each sequencing datum should be sent to. For transmission, the sequencing data are compressed with a compression algorithm and a check code is added, to ensure that the compressed package is not damaged or lost in transit. The compressed sequencing data packages are transferred to the corresponding compute nodes using an FTP program. A compute node that receives data first verifies it; after confirming that no transmission error occurred, it decompresses the package into its working directory, which completes reception. If a transmission error is found, the node sends a retransmission request so that the master node transmits that node's compressed data package again.

3-1, an error correction algorithm is applied to the sequencing data on each compute node and scores are calculated. After the data are distributed according to the load-balancing scheme, and assuming every compute node has the same processing capacity, the data volume each compute node must process is about 3N/m, where N is the total amount of sequencing data and m is the number of nodes. When the number of nodes m is relatively large (for example m = 50), the data volume each node must process is greatly reduced; a conventional sequencing data error correction algorithm is then invoked to process the sequencing data, and an error correction score is returned according to the result.

For example, in step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm to identify errors in the sequencing data on that node, calculates the error probability and error positions of each short sequence, and calculates an error correction score.

3-2, comprehensive decision: the score data are aggregated, and the error correction scheme is determined from the scores of the compute nodes.
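The compress, check, and verify cycle of step 2-4 can be sketched as follows. The source names neither the compression algorithm nor the check code, so gzip and a SHA-256 digest are assumptions here, as are the function names `pack` and `unpack`; the FTP transfer itself is omitted.

```python
import gzip
import hashlib

def pack(reads):
    """Master-node side: compress the reads and compute a check code."""
    payload = gzip.compress("\n".join(reads).encode("ascii"))
    return payload, hashlib.sha256(payload).hexdigest()

def unpack(payload, checksum):
    """Compute-node side: verify first, decompress only if the check passes."""
    if hashlib.sha256(payload).hexdigest() != checksum:
        # In the described workflow this is where a retransmission is requested.
        raise ValueError("checksum mismatch: request retransmission")
    text = gzip.decompress(payload).decode("ascii")
    return text.split("\n") if text else []

payload, digest = pack(["ACGT", "TTGA"])
print(unpack(payload, digest))  # ['ACGT', 'TTGA']
```

Verifying before decompressing matches the order given in the text: the node confirms there is no transmission error and only then unpacks the data into its working directory.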

Claims

1. A massive sequencing data error correction method running on a distributed system, the distributed system comprising a master node, a switch, and several compute nodes, the compute nodes being connected to the master node through the switch, characterized in that the massive sequencing data error correction method comprises the following steps:

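1) preprocessing the sequencing data and determining the grouping standard of the sequencing data;

2) partitioning the sequencing data, balancing the load of each compute node of the distributed system, and transmitting the sequencing data to the compute nodes;

3) performing distributed error correction on the sequencing data.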

2. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that determining the grouping standard of the sequencing data in step 1 specifically comprises the following steps:

3. The massive sequencing data error correction method running on a distributed system according to claim 2, characterized in that balancing the load of each compute node of the distributed system in step 2 comprises the following steps:

2-4, sending the sequencing data to the corresponding compute nodes.

4. The massive sequencing data error correction method running on a distributed system according to claim 3, characterized in that in step 2-2 the cluster standard data volume is obtained from the cluster sequencing data quantities: cluster standard data volume = min(differences between the sequencing data quantities of the classes), and the number of standard data units of each cluster = that cluster's sequencing data quantity / standard data volume, with the fractional part rounded up.

5. The massive sequencing data error correction method running on a distributed system according to claim 3, characterized in that in step 2-3 the load-balancing calculation is carried out according to the standard data volume of each class and the standard computing capacity of the compute nodes to determine the compute node corresponding to each cluster's data, subject to the constraints that one compute node may process the data of one or more clusters and that one cluster can be assigned to only one compute node.

6. The massive sequencing data error correction method running on a distributed system according to claim 3, characterized in that when the sequencing data are transmitted in step 2-4, they are compressed with a compression algorithm and a check code is added.

7. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that in step 2, before the data are sent to the compute nodes, the data are first preprocessed to determine which node each sequencing datum should be sent to.

8. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that performing distributed error correction on the sequencing data in step 3 comprises the following steps:

9. The massive sequencing data error correction method running on a distributed system according to claim 8, characterized in that in step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm to identify errors in the sequencing data on that node, calculates the error probability and error positions of each short sequence, and calculates an error correction score.

10. The massive sequencing data error correction method running on a distributed system according to claim 8, characterized in that in step 3-2 each compute node uses the sequence number as the key and the score data as the value, and redistributes them to the compute nodes with a hash function, so that the three error correction scores of the same sequence are all delivered to the same compute node; a voting (election) algorithm then determines the error correction scheme for that sequence, and the determined error correction schemes are aggregated and returned as the error correction result.