CN106599617B

CN106599617B - A massive sequencing data error correction method running on a distributed system

Info

Publication number: CN106599617B
Application number: CN201611186654.3A
Authority: CN
Inventors: 林劼; 江育娥
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2019-02-15
Anticipated expiration: 2036-12-20
Also published as: CN106599617A

Abstract

The invention discloses a method for correcting errors in massive sequencing data that runs on a distributed system. The distributed system comprises a master node, a switch, and several compute nodes, and the compute nodes are connected to the master node through the switch. The method comprises the following steps: 1) preprocessing the sequencing data and determining the grouping standard of the sequencing data; 2) partitioning the sequencing data, balancing the load across the compute nodes of the distributed system, and transmitting the sequencing data to the compute nodes; 3) performing distributed error correction on the sequencing data. Compared with a centralized system, the method offers higher speed, higher accuracy, and lower cost when processing massive sequencing data.

Description

A massive sequencing data error correction method running on a distributed system

Technical field

The invention relates to the interdisciplinary field of biological gene technology and computer science and technology, and more particularly to a massive sequencing data error correction method running on a distributed system.

Background technique

Next-generation high-throughput sequencing (next generation sequencing, NGS, often called second-generation sequencing or new-generation sequencing in Chinese) makes whole-genome analysis and personalized genetic medicine possible. Compared with traditional Sanger sequencing, next-generation sequencing is fast and inexpensive, but its drawback is that sequencing produces a large number of short sequences that carry errors. Owing to the limitations of experimental technique, errors in these short sequences are unavoidable. If the errors are not corrected before sequence assembly, the assembly algorithm works from erroneous data and the quality of the final sequence is reduced. Correcting the short sequence data before it is assembled into long sequences (contigs) is therefore a very important step, and it is the precondition and guarantee for assembling reliable long sequences.

Errors in sequencing data have always been a major problem affecting sequence quality and downstream analysis. The error rate in next-generation sequencing is related to base quality and is jointly influenced by multiple factors such as the sequencer itself, the sequencing reagents, and the sample. Sequencing errors not only interfere with normal assembly of the sequencing data but also prevent correct identification of the genetic polymorphisms present in the sample, making it difficult to obtain valuable results. Because the sequencing experiment is complex and each stage involves many uncontrollable random factors, it is difficult to eliminate sequencing errors completely through standardizing and improving the experimental technique alone.

Next-generation sequencing decomposes the whole sequence to be measured into short fragments (reads), and each short read is measured repeatedly. All error correction methods rest on the same premise: most of the reads produced by sequencing are correct, and only a minority contain errors. For example, suppose that during correction there are M identical copies of sequence A and N identical copies of sequence B, and that the Hamming distance between A and B is within a specified threshold (y). In this case A and B are generally considered to come from the same region of the original sequence. The values M and N are then compared: the more numerous sequence is regarded as correct, and the less numerous sequence is corrected to match it.
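The majority rule described above can be illustrated with a minimal Python sketch. The function names, the example reads, and the threshold value are hypothetical and chosen only for illustration; this is not the correction algorithm used by the invention.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_by_majority(reads, threshold):
    """Replace each read by a more frequent read within the Hamming threshold, if one exists."""
    counts = Counter(reads)
    corrected = []
    for read in reads:
        best = read
        for other, n in counts.items():
            if (
                other != read
                and len(other) == len(read)
                and hamming(read, other) <= threshold
                and n > counts[best]
            ):
                best = other
        corrected.append(best)
    return corrected


# Hypothetical input: 5 copies of A, 1 copy of B (one mismatch from A), 2 unrelated reads.
reads = ["ACGTAC"] * 5 + ["ACGTAT"] + ["TTTTTT"] * 2
result = correct_by_majority(reads, threshold=1)
```

Here the single copy of ACGTAT lies within the threshold of the more numerous ACGTAC and is rewritten to it, while the unrelated reads are left unchanged.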

The error correction methods in current use fall into three main kinds: (1) methods based on the k-spectrum; (2) methods based on suffix trees or suffix arrays; (3) methods based on multiple sequence alignment (MSA).

Existing error correction algorithms have high computational complexity and low execution efficiency, and they place heavy demands on computing resources, so they are poorly suited to massive sequencing data. Processing massive data requires a large amount of memory and a very long running time. In particular, when whole-sequence sequencing generates massive data, an ordinary server cannot provide enough memory or computing power, and a supercomputer is needed.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and to provide a massive sequencing data error correction method running on a distributed system.

The technical solution adopted by the invention is as follows:

A massive sequencing data error correction method running on a distributed system, wherein the distributed system comprises a master node, a switch, and several compute nodes, the compute nodes being connected to the master node through the switch, and the method comprises the following steps:

1) preprocessing the sequencing data and determining the grouping standard of the sequencing data;

2) partitioning the sequencing data, balancing the load across the compute nodes of the distributed system, and transmitting the sequencing data to the compute nodes;

3) performing distributed error correction on the sequencing data.

Determining the grouping standard of the sequencing data in step 1 specifically comprises the following steps:

1-1, data sampling: the data is sampled according to the characteristics of the sequencing data to be processed, ensuring that the sampled sequencing data is representative;

1-2, clustering of the sampled sequencing data: a sequence similarity algorithm is applied to compute the similarity between the short sequences of the sampled data, and a statistical method is applied to group the sampled sequencing data into classes of similar sequences;

1-3, feature extraction for each class: the sequencing data forming each class is combined and processed, and features are extracted for quickly determining the distance between a short sequence and that class.

Balancing the load across the compute nodes of the distributed system in step 2 comprises the following steps:

2-1, determining the distance between each sequencing datum and each sample cluster and, from the computed distances, determining the clusters to which each sequencing datum belongs;

2-2, performing the load balance calculation according to the number of sequencing data in each cluster;

2-3, assigning clusters to designated compute nodes according to the load balance result;

2-4, transmitting the sequencing data to the corresponding compute nodes.

In step 2-2 the cluster standard data amount is obtained from the cluster sizes: cluster standard data amount = min(differences between the sequencing data quantities of the classes), and the number of standard data amounts corresponding to each cluster = the sequencing data quantity of that cluster / the standard data amount, with any fractional part rounded up.
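The formula above leaves some room for interpretation. The following Python sketch takes one possible reading, in which the standard data amount is the smallest non-zero pairwise difference between cluster sizes; that reading, the fallback used when all clusters have the same size, the function names, and the example sizes are all assumptions made for illustration.

```python
import math
from itertools import combinations


def standard_data_amount(cluster_sizes):
    """Assumed reading: smallest non-zero pairwise difference between cluster sizes."""
    diffs = [abs(a - b) for a, b in combinations(cluster_sizes, 2) if a != b]
    # Assumed fallback when every cluster has the same size: use that size.
    return min(diffs) if diffs else min(cluster_sizes)


def standard_units(cluster_sizes):
    """Cluster size divided by the standard data amount, fractional part rounded up."""
    unit = standard_data_amount(cluster_sizes)
    return [math.ceil(size / unit) for size in cluster_sizes]


# Hypothetical cluster sizes (number of reads per cluster).
sizes = [1200, 900, 2500, 400]
unit = standard_data_amount(sizes)   # smallest difference: 1200 - 900
units = standard_units(sizes)
```

Rounding up guarantees that every cluster counts for at least one standard data amount.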

In step 2-3 the load balance calculation is carried out according to the cluster standard data amount and the standard computing capability of the compute nodes, and determines which compute node handles each cluster's data, subject to two conditions: one compute node may process the data of one or more clusters, and each cluster is assigned to exactly one compute node.

In step 2-4 the sequencing data is compressed with a compression algorithm before transmission, and a check code is added.

In step 2, before the data is transmitted to the compute nodes, the data is first preprocessed to determine which node each sequencing datum should be sent to.

Performing distributed error correction on the sequencing data in step 3 comprises the following steps:

3-1, applying an error correction algorithm to the sequencing data on each compute node and computing scores;

3-2, integration and decision: the score data is aggregated, and the error correction scheme is determined from the scores of the compute nodes.

In step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm on the data held on that node to identify errors, computes the probability of error and the error positions for each short sequence, and computes an error correction score.

In step 3-2, each compute node uses the sequence number as the key and the score data as the value, and a hash function redistributes the records across the compute nodes so that the three error correction scores for the same sequence all reach the same compute node. That node applies an election algorithm to determine the error correction scheme for the sequence, and the determined schemes are collected and returned as the error correction result.
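The redistribution and election in step 3-2 can be sketched in Python as follows. The record layout (sequence number, candidate correction, score), the use of CRC-32 as the hash function, and the rule that the highest score wins are assumptions made for illustration; the text does not specify the hash function or the details of the election algorithm.

```python
import zlib
from collections import defaultdict


def route(sequence_id: str, num_nodes: int) -> int:
    """Deterministic hash of the sequence number, mapped to a node index."""
    # CRC-32 is used because Python's built-in hash() of str varies between processes.
    return zlib.crc32(sequence_id.encode("utf-8")) % num_nodes


def shuffle(records, num_nodes):
    """Group (sequence_id, candidate, score) records by destination node."""
    buckets = defaultdict(list)
    for record in records:
        buckets[route(record[0], num_nodes)].append(record)
    return buckets


def elect(records):
    """For each sequence, keep the candidate correction with the highest score."""
    best = {}
    for seq_id, candidate, score in records:
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (candidate, score)
    return {seq_id: candidate for seq_id, (candidate, _) in best.items()}


# Hypothetical scores: three candidates per read, produced on different nodes.
records = [
    ("read-1", "ACGT", 0.4), ("read-1", "ACGA", 0.9), ("read-1", "ACGT", 0.5),
    ("read-2", "TTGA", 0.7), ("read-2", "TTGC", 0.2), ("read-2", "TTGA", 0.6),
]
buckets = shuffle(records, num_nodes=4)
final = {}
for node_records in buckets.values():
    final.update(elect(node_records))
```

Because the destination depends only on the sequence number, all candidates for one read meet on the same node, so each node can run the election locally.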

By adopting the above technical solution, the invention makes full use of the performance of a distributed computing platform and proposes a solution for correcting errors in massive second-generation biological sequencing data in a distributed environment. The method takes full account of how the distribution of the sequencing data and the load balance affect the performance of the distributed computing platform. Cluster centers of the sequencing data are determined by sampling and clustering, and each sequencing datum is assigned by comparing it with these centers. A unit-based approach is used: the computing capability of a unit node and the data amount of a unit cluster are computed and serve as the basis for the load balancing method. After the sequencing data has been distributed, each compute node applies an error correction algorithm to its local data and scores the corrections. Because the amount of data each node must process is greatly reduced, the error correction efficiency of the whole system improves markedly. Finally the scores are aggregated, and the highest-scoring correction is elected as the error correction scheme.

The invention provides a distributed solution for correcting errors in massive sequencing data, gives bioinformatics a practical sequencing data error correction tool, and offers a new approach for other massive-data applications, thereby enriching research on distributed computing and advancing research and development in bioinformatics and high-performance parallel computing. Compared with a centralized system, the method offers higher speed, higher accuracy, and lower cost when processing massive sequencing data.

Brief description of the drawings

The invention is described in further detail below with reference to the drawings and specific embodiments;

Fig. 1 is a schematic diagram of the distributed system architecture of the invention;

Fig. 2 is a flow diagram of the massive sequencing data error correction method running on a distributed system according to the invention.

Specific embodiment

As shown in Fig. 1 and Fig. 2, the invention discloses a massive sequencing data error correction method running on a distributed system. The distributed system comprises a master node, a switch, and several compute nodes, and the compute nodes are connected to the master node through the switch. Each compute node may be a PC server or a desktop PC, so the hardware requirements are modest, and because the load can be balanced, the compute nodes need not share a uniform configuration.

The massive sequencing data error correction method comprises the following steps:

3) performing distributed error correction on the sequencing data.

The sequencing data to be processed is randomly sampled. The sample size is N/m³, where N is the total number of sequences and m is the number of compute nodes, and is at least about 1000-3000 sequences. The sampled data is then simulated with the Monte Carlo method, finally yielding a simulated sequencing data set of 1000-2000 sequences that represents the overall characteristics of the sequencing data.

The Hamming distance between every pair of short sequences in the sample data set is computed. A hierarchical clustering algorithm is then applied to group the sampled sequencing data into classes of similar sequences according to Hamming distance. Two clustering principles are available. Under the first, sequences within a Hamming distance of 5 are grouped into the same class; if a class contains fewer than 3 short sequences, its sample is too small and the class is discarded. Under the second, the number of clusters n is set in advance, generally with n > 6m, and the hierarchical clustering algorithm groups the sample data into n classes.
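The first clustering principle can be sketched in plain Python as below. The distance threshold of 5 and the minimum class size of 3 come from the text; the choice of complete linkage (which keeps every pair inside a class within the threshold), the function names, and the example reads are assumptions for illustration, and the quadratic search is meant only for small samples.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_distance=5, min_size=3):
    """Complete-linkage agglomerative clustering of equal-length reads."""
    clusters = [[read] for read in reads]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(hamming(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break  # no remaining pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Classes with too few reads are discarded.
    return [c for c in clusters if len(c) >= min_size]


# Hypothetical sample: two groups of similar reads and one outlier.
reads = [
    "AAAAAAAAAAAA", "AAAAAAAAAAAC", "AAAAAAAAAACC",
    "GGGGGGGGGGGG", "GGGGGGGGGGGT", "GGGGGGGGGGTT",
    "ACGTACGTACGT",
]
clusters = cluster_reads(reads)
```

The outlier never comes within the threshold of either group, so it remains a singleton and is dropped by the minimum-size rule.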

The sample sequencing data of each class is assembled into one concatenated long sequence: the short sequences of the class are joined one after another in order, with a separator symbol between consecutive sequences. The probabilistic suffix array algorithm is applied to build a suffix array from the long sequence of each class, and the branch probability of each node is computed. This step outputs a data structure that records the suffix array and the branch probabilities of the class's long sequence, and this data structure is the feature of that class of sample sequencing data.

In the feature extraction method, the sequences in a class are linearly combined, and the weight of each subsequence is determined by its frequency of occurrence and its probability.

The probabilistic suffix array (PSA) is an implementation of variable-length Markov models (VLMMs) built on the traditional suffix array. Like a suffix array, a PSA can represent all N(N+1)/2 substrings from root to leaf. In the VLMM realized by the PSA model, the depth of the string at each node represents the length of a substring. Limiting the depth of the leaf nodes makes it possible to represent strings of a given length and to express the conditional probability that a state occurs after a given sequence, which is the transition probability in the transition matrix. A transition probability is a relative frequency computed from the symbols on a given path and the symbols of the substring path preceding the observed data. The conditional probability determined by the substring length can be computed by following the corresponding path in the PSA model. Because a suffix array records the string at each node by marking its start and end positions, and has N nodes, it can represent all N(N+1)/2 substrings from root to leaf in linear space. A transition matrix with N(N+1)/2 transition probabilities can therefore be represented by a probabilistic suffix array, a linear data structure with N nodes.
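The space argument can be checked on a small example. The Python sketch below builds a plain suffix array with a naive sort and enumerates every substring occurrence as a (start, end) pair; it illustrates only the N versus N(N+1)/2 counts, not the branch probabilities of a probabilistic suffix array, and the function names and example string are hypothetical.

```python
def suffix_array(text: str):
    """Naive construction: start positions of all suffixes, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def substring_spans(text: str):
    """Every substring occurrence is a prefix of some suffix, given as (start, end)."""
    n = len(text)
    return [(start, end) for start in range(n) for end in range(start + 1, n + 1)]


text = "ACGTAC"               # N = 6
sa = suffix_array(text)       # N entries, one per suffix
spans = substring_spans(text)  # N(N+1)/2 substring occurrences
```

The array stores only N start positions, yet each of the N(N+1)/2 substring occurrences can be recovered as a prefix of one of the listed suffixes.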

Given a sequencing datum s, its similarity to each class is computed by comparing it with the probabilistic suffix array extracted for that class of the sample. Specifically, starting from the root node, the nodes of the PSA are visited until the corresponding node is matched, and the matching probability between s and the probabilistic suffix array is computed from the transition probabilities of the matched nodes. The 3 classes most similar to s according to the matching probability are then taken as the clusters to which s belongs.
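As a simplified stand-in for the PSA lookup, the following Python sketch scores a read against each class with a fixed-order Markov model and keeps the 3 best classes. This is not a probabilistic suffix array: the fixed context length, the add-one smoothing, the function names, and the example classes are all assumptions made to keep the illustration short.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train(reads, order=2):
    """Count (context -> next symbol) transitions over the reads of one class."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(order, len(read)):
            counts[read[i - order:i]][read[i]] += 1
    return counts


def log_match_probability(read, counts, order=2):
    """Sum of log transition probabilities, with add-one smoothing for unseen events."""
    total = 0.0
    for i in range(order, len(read)):
        context = counts.get(read[i - order:i], {})
        seen = sum(context.values())
        total += math.log((context.get(read[i], 0) + 1) / (seen + len(ALPHABET)))
    return total


def closest_classes(read, models, top=3, order=2):
    """Names of the `top` classes with the highest matching probability for the read."""
    ranked = sorted(
        models,
        key=lambda name: (-log_match_probability(read, models[name], order), name),
    )
    return ranked[:top]


# Hypothetical classes, each trained on a few identical reads.
models = {
    "polyA": train(["AAAAAAAA"] * 3),
    "polyC": train(["CCCCCCCC"] * 3),
    "polyG": train(["GGGGGGGG"] * 3),
    "alt": train(["ACACACAC"] * 3),
}
belonging = closest_classes("AAAAAAAA", models)
```

Keeping three classes per read mirrors the text, where each sequencing datum is assigned to its 3 most similar classes rather than to a single one.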

2-2, the load balance calculation is performed according to the number of sequencing data in each cluster. A standard computing capability is calculated from the number and processing capacity of the compute nodes; the computing capability of each node is an integer multiple of the standard computing capability, and at least 1 times. A cluster standard data amount is calculated from the cluster sizes; the size of each cluster is an integer multiple of the standard data amount, and at least 1 times.

In step 2-3 the load balance calculation is carried out according to the cluster standard data amount and the standard computing capability of the compute nodes, and determines which compute node handles each cluster's data, subject to two conditions: one compute node may process the data of one or more clusters, and each cluster is assigned to exactly one compute node. The calculation first computes the ratio between one standard computing capability and one standard data amount, and then derives the allocation using a solution method for the knapsack problem.
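The text does not say which knapsack formulation is used. As an illustration of the constraints only, the Python sketch below swaps in a simple greedy heuristic: clusters are placed largest first on the node whose load relative to its capacity would be lowest after placement. The node names, capacities, and cluster loads are hypothetical, and capacities and loads are expressed in the standard units described above.

```python
def assign_clusters(cluster_units, node_capacity):
    """Greedy placement: each cluster goes to exactly one node, largest clusters first."""
    load = {node: 0 for node in node_capacity}
    assignment = {}
    ordered = sorted(cluster_units.items(), key=lambda kv: (-kv[1], kv[0]))
    for cluster, units in ordered:
        # Pick the node with the lowest relative load after adding this cluster.
        target = min(
            load,
            key=lambda node: ((load[node] + units) / node_capacity[node], node),
        )
        assignment[cluster] = target
        load[target] += units
    return assignment, load


# Hypothetical input: cluster loads in standard data amounts,
# node capacities in multiples of the standard computing capability.
cluster_units = {"c1": 4, "c2": 3, "c3": 9, "c4": 2}
node_capacity = {"node-a": 2, "node-b": 1}
assignment, load = assign_clusters(cluster_units, node_capacity)
```

In this example the node with twice the capacity ends up with twice the load, and no cluster is split across nodes. A greedy heuristic does not guarantee an optimal allocation in general, which is presumably why the text refers to a knapsack solution.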

2-4, the sequencing data is transmitted to the corresponding compute nodes according to the load balancing method above. Before the data is transmitted to the compute nodes, it is first preprocessed to determine which node each sequencing datum should be sent to. For transmission, the sequencing data is compressed with a compression algorithm and a check code is added to ensure that the compressed package is not corrupted or lost in transit. The compressed package is sent to the corresponding compute node with an FTP program. The compute node first verifies the received data; once it has confirmed that there is no transmission error, it decompresses the package into its working directory, which completes data reception. If a transmission error is found, the node issues a retransmission request so that the master node sends that node's compressed package again.
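The compress, check, and verify cycle can be sketched with the Python standard library. The text names neither the compression algorithm nor the check code, so zlib compression, a CRC-32 checksum, and the 4-byte header layout used here are assumptions; the FTP transfer itself is omitted.

```python
import zlib


def pack(reads):
    """Compress the reads and prepend a 4-byte CRC-32 of the compressed payload."""
    payload = "\n".join(reads).encode("ascii")
    compressed = zlib.compress(payload)
    checksum = zlib.crc32(compressed)
    return checksum.to_bytes(4, "big") + compressed


def unpack(packet):
    """Verify the checksum, then decompress; a mismatch signals a retransmission request."""
    expected = int.from_bytes(packet[:4], "big")
    compressed = packet[4:]
    if zlib.crc32(compressed) != expected:
        raise ValueError("checksum mismatch: request retransmission")
    return zlib.decompress(compressed).decode("ascii").split("\n")


# Hypothetical reads destined for one compute node.
reads = ["ACGTACGT", "TTGACCAA", "GGGTTTAA"]
packet = pack(reads)
received = unpack(packet)

# Simulate corruption of the last byte in transit.
corrupted = packet[:-1] + bytes([packet[-1] ^ 0xFF])
```

Calling `unpack(corrupted)` raises `ValueError`, which corresponds to the point at which the compute node would ask the master node to resend the package.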

3-1, an error correction algorithm is applied to the sequencing data on each compute node and scores are computed. After the data has been distributed according to the load balancing scheme, and assuming that all compute nodes have the same processing capacity, each compute node has about 3N/m data to process, where N is the total amount of sequencing data and m is the number of nodes. When the number of compute nodes m is relatively large (for example m = 50), the amount of data each node must process is greatly reduced. Each node then calls a conventional sequencing data error correction algorithm to process its data and returns error correction scores based on the result.
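The factor of 3 is consistent with each sequencing datum being assigned to its 3 most similar classes in step 2-1, although the text does not state this derivation explicitly. Under that assumption, and with a hypothetical total of N = 10^8 reads on m = 50 nodes, the per-node load works out as:

```latex
\[
\text{load per node} \approx \frac{3N}{m}
  = \frac{3 \times 10^{8}}{50}
  = 6 \times 10^{6} \ \text{reads}
\]
```

Each node therefore handles about 6% of the original data volume, even though every read is processed three times overall.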

For example, in step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm on the data held on that node to identify errors, computes the probability of error and the error positions for each short sequence, and computes an error correction score.

Claims

1. A massive sequencing data error correction method running on a distributed system, the distributed system comprising a master node, a switch, and several compute nodes, the compute nodes being connected to the master node through the switch, characterized in that the method comprises the following steps:

1) preprocessing the sequencing data and determining the grouping standard of the sequencing data; determining the grouping standard of the sequencing data in step 1) specifically comprises the following steps:

1-1, data sampling: the data is sampled according to the characteristics of the sequencing data to be processed, ensuring that the sampled sequencing data is representative;

1-2, clustering of the sampled sequencing data: a sequence similarity algorithm is applied to compute the similarity between the short sequences of the sampled data, and a statistical method is applied to group the sampled sequencing data into classes of similar sequences;

1-3, feature extraction for each class: the sequencing data forming each class is combined and processed, and features are extracted for quickly determining the distance between a short sequence and that class;

2) partitioning the sequencing data, balancing the load across the compute nodes of the distributed system, and transmitting the sequencing data to the compute nodes;

2-1, determining the distance between each sequencing datum and each sample cluster and, from the computed distances, determining the clusters to which each sequencing datum belongs;

2-4, transmitting the sequencing data to the corresponding compute nodes;

3) performing distributed error correction on the sequencing data;

performing distributed error correction on the sequencing data in step 3) comprises the following steps:

3-2, integration and decision: the score data is aggregated, and the error correction scheme is determined from the scores of the compute nodes.

2. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2-2 the cluster standard data amount is obtained from the cluster sizes, where cluster standard data amount = min(differences between the sequencing data quantities of the classes), and the number of standard data amounts corresponding to each cluster = the sequencing data quantity of that cluster / the standard data amount, with any fractional part rounded up.

3. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2-3 the load balance calculation is carried out according to the cluster standard data amount and the standard computing capability of the compute nodes, and determines which compute node handles each cluster's data, such that one compute node processes the data of one or more clusters and each cluster is assigned to exactly one compute node.

4. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2-4 the sequencing data is compressed with a compression algorithm before transmission, and a check code is added.

5. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2, before the data is transmitted to the compute nodes, the data is first preprocessed to determine which node each sequencing datum should be sent to.

6. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm on the data held on that node to identify errors, computes the probability of error and the error positions for each short sequence, and computes an error correction score.

7. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 3-2 each compute node uses the sequence number as the key and the score data as the value, and a hash function redistributes the records across the compute nodes so that the three error correction scores for the same sequence all reach the same compute node; that node applies an election algorithm to determine the error correction scheme for the sequence, and the determined schemes are collected and returned as the error correction result.