CN106599617A - Mass sequencing data error correcting method applied to distributed system - Google Patents
- Publication number: CN106599617A
- Application number: CN201611186654.3A
- Authority: CN (China)
- Prior art keywords: sequencing data, data, cluster, sequencing, compute node
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses an error correction method for massive sequencing data that runs on a distributed system. The distributed system comprises a master node, a switch, and a plurality of compute nodes connected to the master node through the switch. The method comprises the following steps: 1, preprocessing the sequencing data and determining a grouping standard for the sequencing data; 2, partitioning the sequencing data, balancing the load across the compute nodes of the distributed system, and transmitting the sequencing data to the compute nodes; and 3, performing distributed error correction on the sequencing data. Compared with a centralized system, the method is faster, more accurate, and cheaper when processing massive sequencing data.
Description
Technical field
The present invention relates to the interdisciplinary field of genomics and computer science, and more particularly to an error correction method for massive sequencing data that runs on a distributed system.
Background technology
Next-generation sequencing (NGS, also commonly called second-generation sequencing) makes whole-genome analysis and personalized genomic medicine possible. Compared with traditional Sanger sequencing, NGS is faster and cheaper, but it produces a large volume of short sequences (reads), and those reads carry errors. Because of the limits of the experimental technique, errors in short sequences are unavoidable. If they are not corrected before assembly, the assembly algorithm works from erroneous data and the quality of the final sequence suffers. Correcting the short sequence data before assembling it into long sequences (contigs) is therefore an important step and a precondition for reconstructing reliable long sequences.
Errors in sequencing data have long been a major problem for sequence quality and downstream analysis. The error rate of next-generation sequencing is related to base quality and is affected jointly by several factors, including the sequencer itself, the sequencing reagents, and the sample. Sequencing errors not only interfere with normal assembly of the sequencing data but also prevent correct identification of the genetic polymorphisms present in a sample, making it hard to obtain meaningful results. Because the sequencing experiment is complex and each stage involves many uncontrollable random factors, sequencing errors are difficult to eliminate completely through standardizing and improving the experimental technique alone.
Next-generation sequencing breaks the whole sequence to be measured into short fragments (reads) and measures each fragment repeatedly. All error correction methods rest on one premise: most of the reads produced by sequencing are correct, and only a minority contain errors. For example, suppose that while searching for errors there are M identical copies of sequence A and N identical copies of sequence B, and that the Hamming distance between A and B is within a specified threshold (y). A and B are then generally assumed to come from the same region of the original sequence. M and N are compared: the sequence with the larger count is taken to be correct, and the sequence with the smaller count is corrected to match it.
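The majority rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names `hamming` and `correct_by_majority` and the sample reads are hypothetical, reads are assumed to have equal length, and every pair of distinct reads is compared, which would be far too slow for real data.

```python
from collections import Counter

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def correct_by_majority(reads: list[str], threshold: int) -> list[str]:
    """Replace each read by the most frequent read within `threshold` mismatches."""
    counts = Counter(reads)  # how many identical copies of each read exist
    corrected = []
    for read in reads:
        best = read
        for other, n in counts.items():
            # A more frequent neighbour within the threshold is taken as correct.
            if (len(other) == len(read)
                    and hamming(read, other) <= threshold
                    and n > counts[best]):
                best = other
        corrected.append(best)
    return corrected

# Hypothetical data: 5 copies of A, 2 copies of B (1 mismatch from A), 1 outlier.
reads = ["ACGTAC"] * 5 + ["ACGTAA"] * 2 + ["TTTTTT"]
print(correct_by_majority(reads, threshold=1))
```

Here the two copies of `ACGTAA` are rewritten to the more frequent `ACGTAC`, while `TTTTTT` is left alone because no read lies within the threshold.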
The error correction methods in current use fall into three main groups: (1) methods based on the k-spectrum; (2) methods based on suffix trees or suffix arrays; (3) methods based on multiple sequence alignment (MSA).
Existing error correction algorithms have high computational complexity, run inefficiently, and demand substantial computing resources, so they are poorly suited to massive sequencing data. Processing such data requires a large amount of memory and a long run time. In particular, when whole-sequence sequencing produces massive data, an ordinary server cannot supply enough memory or computing power, and a supercomputer is needed.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing an error correction method for massive sequencing data that runs on a distributed system.
The technical solution adopted by the present invention is as follows:
An error correction method for massive sequencing data running on a distributed system, where the distributed system comprises a master node, a switch, and several compute nodes connected to the master node through the switch. The method comprises the following steps:
1) Preprocess the sequencing data and determine the grouping standard for the sequencing data;
2) Partition the sequencing data, balance the load across the compute nodes of the distributed system, and transmit the sequencing data to the compute nodes;
3) Perform distributed error correction on the sequencing data.
Determining the grouping standard of the sequencing data in step 1 comprises the following steps:
1-1, data sampling: sample the data according to the characteristics of the sequencing data to be processed, ensuring that the sampled sequencing data are reasonably representative;
1-2, clustering of the sampled data: apply a sequence similarity algorithm to compute the similarity between the short sequences in the sample, and apply a statistical method to group the sampled sequencing data into clusters of similar sequences;
1-3, per-cluster feature extraction: combine the sequencing data that make up each cluster and extract features used to quickly compute the distance between a short sequence and that cluster.
Balancing the load of the compute nodes of the distributed system in step 2 comprises the following steps:
2-1, determine the distance between each piece of sequencing data and each sample cluster, and from those distances determine the home clusters of each piece of sequencing data;
2-2, perform the load-balancing calculation based on the quantity of sequencing data in each cluster;
2-3, assign clusters to specific compute nodes according to the load-balancing result;
2-4, transmit the sequencing data to the corresponding compute nodes.
In step 2-2, the standard cluster data volume is derived from the quantities of sequencing data in the clusters: standard cluster data volume = min(difference in sequencing data quantity between clusters), and the number of standard units for each cluster = that cluster's sequencing data quantity / standard cluster data volume, with any fractional part rounded up.
In step 2-3, the load-balancing calculation uses the standard cluster data volume and the standard computing capacity of the compute nodes to determine which compute node handles each cluster's data. One compute node may process the data of one or more clusters, but each cluster is assigned to exactly one compute node.
In step 2-4, the sequencing data are compressed with a compression algorithm before transmission, and a check code is added.
In step 2, before the data are sent to the compute nodes, they are preprocessed to determine which node each piece of sequencing data should be sent to.
Performing distributed error correction on the sequencing data in step 3 comprises the following steps:
3-1, each compute node applies an error correction algorithm to its sequencing data and computes scores;
3-2, combined decision: the score data are collected, and the error correction scheme is determined from the scores of the compute nodes.
In step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm to identify errors in the sequencing data on that node, computes the error probability and error positions of each short sequence, and computes an error correction score.
In step 3-2, each compute node uses the sequence number as the key and the score data as the value, and redistributes the pairs among the compute nodes with a hash function, so that the three error correction scores of the same sequence all arrive at the same compute node. A voting algorithm then determines the error correction scheme for that sequence, and the selected schemes are collected and returned as the error correction result.
By adopting the above technical solution, the present invention makes full use of the performance of a distributed computing platform and proposes a solution for error correction of massive second-generation sequencing data in a distributed environment. The method takes into account both how reasonably the sequencing data are distributed and how load balance affects the performance of the distributed computing platform. Cluster centers for the sequencing data are determined by sampling and clustering, and each piece of sequencing data is assigned by comparison with those centers. A unit-based approach computes the unit computing capacity of a node and the unit data volume of a cluster, and these serve as the basis for the load-balancing method. Once the sequencing data are distributed, each compute node applies the error correction algorithm to the data on that node and computes error correction scores. Because the amount of data each node has to process is greatly reduced, the error correction efficiency of the whole system improves significantly. Finally, the scores are collected and the highest-scoring scheme is chosen as the error correction scheme.
The distributed error correction solution of the present invention provides bioinformatics with a practical tool for correcting massive sequencing data and offers an approach that other massive-data applications can draw on, thereby enriching research on distributed computing and promoting the development of bioinformatics and high-performance parallel computing. Compared with a centralized system, the method is faster, more accurate, and cheaper when processing massive sequencing data.
Description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the distributed system architecture of the present invention;
Fig. 2 is a flow chart of the error correction method for massive sequencing data running on a distributed system according to the present invention.
Detailed description of the embodiments
As shown in Fig. 1 and Fig. 2, the present invention discloses an error correction method for massive sequencing data that runs on a distributed system. The distributed system comprises a master node, a switch, and several compute nodes connected to the master node through the switch. Each compute node can be a PC server or a desktop PC. The hardware requirements are modest, and because load balancing is performed, the compute nodes do not need identical configurations.
The error correction method for massive sequencing data comprises the following steps:
1) Preprocess the sequencing data and determine the grouping standard for the sequencing data;
2) Partition the sequencing data, balance the load across the compute nodes of the distributed system, and transmit the sequencing data to the compute nodes;
3) Perform distributed error correction on the sequencing data.
Determining the grouping standard of the sequencing data in step 1 comprises the following steps:
1-1, data sampling: sample the data according to the characteristics of the sequencing data to be processed, ensuring that the sampled sequencing data are reasonably representative.
The sequencing data to be processed are sampled at random. The sample size is N/m3, where N is the total number of sequences and m is the number of compute nodes, and it should be at least about 1000-3000 sequences. The sampled data are then simulated with the Monte Carlo method, finally yielding a simulated sequencing data set of 1000-2000 sequences that represents the overall characteristics of the sequencing data.
1-2, clustering of the sampled data: apply a sequence similarity algorithm to compute the similarity between the short sequences in the sample, and apply a statistical method to group the sampled sequencing data into clusters of similar sequences.
The Hamming distance between every pair of short sequences in the sample data set is computed. A hierarchical clustering algorithm then groups the sampled sequencing data into clusters of similar sequences according to Hamming distance. There are two clustering criteria. Under the first, sequences within a Hamming distance of 5 are placed in the same cluster; if a cluster contains fewer than 3 short sequences, it has too few samples and is discarded. Under the second, the number of clusters n is fixed in advance, generally with n > 6m, and the hierarchical clustering algorithm groups the sample data into n clusters.
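The first clustering criterion can be illustrated with the following Python sketch. It is only one possible reading: it treats "within a Hamming distance of 5" as single-linkage clustering cut at distance 5, which is the same as taking connected components of the graph that links any two reads at distance 5 or less. The function name `cluster_samples`, the union-find helper, and the sample reads are hypothetical, and reads are assumed to have equal length.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def cluster_samples(samples: list[str], max_dist: int = 5, min_size: int = 3) -> list[list[str]]:
    """Single-linkage clustering cut at `max_dist`, dropping clusters smaller than `min_size`."""
    parent = list(range(len(samples)))  # union-find forest over sample indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Merge every pair of reads that lie within the distance threshold.
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if hamming(samples[i], samples[j]) <= max_dist:
                parent[find(i)] = find(j)

    groups: dict[int, list[str]] = {}
    for i, read in enumerate(samples):
        groups.setdefault(find(i), []).append(read)
    # Clusters with too few members are discarded, as in the text.
    return [g for g in groups.values() if len(g) >= min_size]

samples = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC", "GGGGGGGGGG", "GGGGGGGGGT"]
print(cluster_samples(samples))
```

In this example the three A-rich reads form one cluster, and the two G-rich reads are discarded because their cluster has fewer than 3 members. The all-pairs loop is quadratic in the sample size, which is tolerable only because the sample is limited to a few thousand sequences.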
1-3, per-cluster feature extraction: combine the sequencing data that make up each cluster and extract features used to quickly compute the distance between a short sequence and that cluster.
The sample sequencing data of each cluster are concatenated in order into one long sequence: each short sequence in the cluster is joined to the next, with a separator symbol between adjacent sequences. A probabilistic suffix array algorithm builds a suffix array from the long sequence of each cluster and computes the branch probability at each node. This step outputs a data structure that records the suffix array and branch probabilities of the cluster's long sequence; this is the feature of that cluster's sample sequencing data.
To extract the data features, the sequences in a cluster are combined linearly, and the weight of each subsequence is determined from its frequency and probability of occurrence.
The probabilistic suffix array (PSA) is an implementation of variable-length Markov models (VLMMs) built on the conventional suffix array. Like a suffix array, a PSA can represent all N(N + 1)/2 substrings from root to leaf. In a VLMM realized with the PSA model, the depth of the string at each node represents the length of the substring. Limiting the depth of the leaf nodes fixes the length of the strings represented, and therefore also represents the conditional probability that a given state follows a given sequence, which is the transition probability in the transition matrix. A transition probability is the relative frequency of a symbol, computed from the symbols on the substring path that precedes it in the observed data. The conditional probability determined by the substring length can be computed by following a single path in the PSA model. Because a suffix array records the string at each node by marking its start and end positions, and has N nodes, it can represent all N(N + 1)/2 substrings from root to leaf in linear space. A transition matrix with N(N + 1)/2 transition probabilities can thus be represented by a probabilistic suffix array, a linear data structure of N nodes.
Balancing the load of the compute nodes of the distributed system in step 2 comprises the following steps:
2-1, determine the distance between each piece of sequencing data and each sample cluster, and from those distances determine the home clusters of each piece of sequencing data.
For a given piece of sequencing data s, its similarity to the probabilistic suffix array extracted from each sample cluster is computed as follows: starting from the root node, traverse the nodes of the PSA, match the corresponding nodes, and compute the matching probability between s and the probabilistic suffix array from the transition probabilities of those nodes. The 3 clusters closest to s according to matching probability are then taken as its home clusters.
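The selection of the 3 home clusters can be sketched as follows. This sketch does not implement a probabilistic suffix array. As a stand-in it trains a simple fixed-order Markov model per cluster and scores a read by its smoothed log-probability under each model, which captures the idea of computing a matching probability from transition probabilities. The function names, the model order of 2, the add-one smoothing, and the sample clusters are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def train_model(cluster_reads: list[str], order: int = 2):
    """Count (context -> next symbol) transitions over all reads of one cluster."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in cluster_reads:
        for i in range(order, len(read)):
            counts[read[i - order:i]][read[i]] += 1
    return counts

def log_match_probability(read: str, counts, order: int = 2, alphabet: str = "ACGT") -> float:
    """Log-probability of `read` under one cluster model, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(read)):
        nxt = counts.get(read[i - order:i], {})  # .get avoids creating new entries
        seen = sum(nxt.values())
        total += math.log((nxt.get(read[i], 0) + 1) / (seen + len(alphabet)))
    return total

def home_clusters(read: str, models: dict, k: int = 3) -> list[str]:
    """Names of the k clusters whose models give `read` the highest probability."""
    scores = {name: log_match_probability(read, m) for name, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical clusters, each represented by its sampled reads.
models = {
    "a": train_model(["ACACACAC"]),
    "b": train_model(["ACACGTGT"]),
    "c": train_model(["GTGTGTGT"]),
    "d": train_model(["TTTTTTTT"]),
}
print(home_clusters("ACACACAC", models))
```

Working in log space avoids numerical underflow when many transition probabilities are multiplied. A real implementation following the text would replace `train_model` and `log_match_probability` with the PSA construction and traversal.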
2-2, perform the load-balancing calculation based on the quantity of sequencing data in each cluster. A standard computing capacity is derived from the number and processing power of the compute nodes; the computing capacity of each node is an integer multiple of the standard computing capacity, and at least 1 times it. A standard cluster data volume is derived from the quantities of sequencing data in the clusters; the size of each cluster is an integer multiple of the standard data volume, and at least 1 times it.
In step 2-2, the standard cluster data volume is derived from the quantities of sequencing data in the clusters: standard cluster data volume = min(difference in sequencing data quantity between clusters), and the number of standard units for each cluster = that cluster's sequencing data quantity / standard cluster data volume, with any fractional part rounded up.
2-3, assign clusters to specific compute nodes according to the load-balancing result.
In step 2-3, the load-balancing calculation uses the standard cluster data volume and the standard computing capacity of the compute nodes to determine which compute node handles each cluster's data. One compute node may process the data of one or more clusters, but each cluster is assigned to exactly one compute node. The calculation first computes the ratio between the standard computing capacity and the standard data volume, and then computes the assignment using a solution method for the knapsack problem.
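The assignment of clusters to nodes can be sketched as below. The text calls for a knapsack-style solution without giving details, so this sketch swaps in a simple greedy heuristic instead: clusters are placed largest first on the node whose load relative to its capacity would be smallest afterwards. The function name `assign_clusters`, the unit values, and the node names are hypothetical. Cluster sizes and node capacities are expressed in the standard units of step 2-2.

```python
def assign_clusters(cluster_units: dict[str, int], node_capacity: dict[str, int]):
    """Greedy placement: each cluster goes to exactly one node, largest cluster first."""
    load = {node: 0 for node in node_capacity}  # standard data units per node
    placement = {}
    for cluster, units in sorted(cluster_units.items(), key=lambda kv: -kv[1]):
        # Pick the node whose relative load stays lowest after taking this cluster.
        node = min(load, key=lambda n: (load[n] + units) / node_capacity[n])
        placement[cluster] = node
        load[node] += units
    return placement, load

# Hypothetical input: cluster sizes in standard data units, node capacities in
# standard computing units (node n1 is twice as powerful as n2).
cluster_units = {"c1": 4, "c2": 3, "c3": 2, "c4": 2, "c5": 1}
node_capacity = {"n1": 2, "n2": 1}
placement, load = assign_clusters(cluster_units, node_capacity)
print(placement)
print(load)
```

With these inputs the node with twice the capacity ends up with twice the load. A greedy heuristic does not guarantee an optimal balance in general; an exact knapsack or bin-packing solver could replace it where the number of clusters is small enough.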
2-4, transmit the sequencing data to the corresponding compute nodes. The sequencing data are sent according to the load-balancing scheme above. Before the data are sent to the compute nodes, they are preprocessed to determine which node each piece of sequencing data should go to. For transmission, the sequencing data are compressed with a compression algorithm and a check code is added, so that loss in the compressed package during transmission can be detected. The compressed packages are sent to the corresponding compute nodes with an FTP program. A compute node first verifies the data it receives; once it confirms that there is no transmission error, it decompresses the package into its working directory, which completes reception. If it finds a transmission error, it sends a retransmission request so that the master node sends that node's compressed package again.
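The compress, checksum, and verify cycle can be sketched with the Python standard library. The text does not name a compression algorithm or a check code, so zlib and SHA-256 are assumptions here, as are the function names `pack` and `unpack`. The FTP transfer itself is left out; the sketch covers only what the sender prepares and what the receiver checks.

```python
import hashlib
import zlib

def pack(reads: list[str]) -> tuple[bytes, str]:
    """Sender side: compress the reads and compute a check code over the package."""
    payload = zlib.compress("\n".join(reads).encode("ascii"))
    digest = hashlib.sha256(payload).hexdigest()
    return payload, digest

def unpack(payload: bytes, digest: str) -> list[str]:
    """Receiver side: verify the check code first, then decompress."""
    if hashlib.sha256(payload).hexdigest() != digest:
        # In the described system this is where a retransmission request is sent.
        raise ValueError("checksum mismatch: request retransmission")
    return zlib.decompress(payload).decode("ascii").split("\n")

reads = ["ACGTACGT", "ACGTACGA", "TTGACCAT"]
payload, digest = pack(reads)
print(unpack(payload, digest) == reads)
```

Verifying before decompressing means a damaged package is rejected without touching the working directory, which matches the order of operations in the text.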
Performing distributed error correction on the sequencing data in step 3 comprises the following steps:
3-1, each compute node applies an error correction algorithm to its sequencing data and computes scores. After the data are distributed according to the load-balancing scheme, and assuming all compute nodes have the same processing power, each compute node has to process about 3N/m sequences, where N is the total quantity of sequencing data and m is the number of nodes (the factor 3 reflects the 3 home clusters assigned to each sequence in step 2-1). When the number of nodes m is fairly large (for example m = 50), the amount of data each node has to process is greatly reduced, so a conventional sequencing error correction algorithm can be run on the data, and an error correction score is returned based on the result.
For example, in step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm to identify errors in the sequencing data on that node, computes the error probability and error positions of each short sequence, and computes an error correction score.
3-2, combined decision: the score data are collected, and the error correction scheme is determined from the scores of the compute nodes.
In step 3-2, each compute node uses the sequence number as the key and the score data as the value, and redistributes the pairs among the compute nodes with a hash function, so that the three error correction scores of the same sequence all arrive at the same compute node. A voting algorithm then determines the error correction scheme for that sequence, and the selected schemes are collected and returned as the error correction result.
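The redistribution and voting in step 3-2 can be sketched as follows. The text does not specify the hash function or the voting rule, so this sketch assumes CRC32 for the hash and picks the highest-scoring proposal for each sequence. The function names and the proposal tuples are hypothetical, and the redistribution is simulated in one process rather than over a network.

```python
import zlib
from collections import defaultdict

def target_node(seq_id: str, num_nodes: int) -> int:
    """Map a sequence ID to a node index with a deterministic hash."""
    # zlib.crc32 is used because Python's built-in hash() of strings varies per process.
    return zlib.crc32(seq_id.encode("utf-8")) % num_nodes

def redistribute(proposals, num_nodes: int):
    """Group (seq_id, corrected_read, score) proposals by destination node and sequence."""
    buckets = defaultdict(lambda: defaultdict(list))
    for seq_id, corrected, score in proposals:
        buckets[target_node(seq_id, num_nodes)][seq_id].append((score, corrected))
    return buckets

def elect(buckets) -> dict[str, str]:
    """On each node, keep the highest-scoring correction for every sequence it holds."""
    result = {}
    for per_node in buckets.values():
        for seq_id, candidates in per_node.items():
            result[seq_id] = max(candidates)[1]  # tuples compare by score first
    return result

# Hypothetical proposals: three scored corrections per sequence, one per home cluster.
proposals = [
    ("r1", "ACGT", 0.9), ("r1", "ACGA", 0.4), ("r1", "ACGT", 0.7),
    ("r2", "TTTT", 0.2), ("r2", "TTTA", 0.8), ("r2", "TTTT", 0.5),
]
print(elect(redistribute(proposals, num_nodes=4)))
```

Because the destination depends only on the sequence ID, all three proposals for one sequence land on the same node, so the decision for that sequence needs no further communication.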
By adopting the above technical solution, the present invention makes full use of the performance of a distributed computing platform and proposes a solution for error correction of massive second-generation sequencing data in a distributed environment. The method takes into account both how reasonably the sequencing data are distributed and how load balance affects the performance of the distributed computing platform. Cluster centers for the sequencing data are determined by sampling and clustering, and each piece of sequencing data is assigned by comparison with those centers. A unit-based approach computes the unit computing capacity of a node and the unit data volume of a cluster, and these serve as the basis for the load-balancing method. Once the sequencing data are distributed, each compute node applies the error correction algorithm to the data on that node and computes error correction scores. Because the amount of data each node has to process is greatly reduced, the error correction efficiency of the whole system improves significantly. Finally, the scores are collected and the highest-scoring scheme is chosen as the error correction scheme.
The distributed error correction solution of the present invention provides bioinformatics with a practical tool for correcting massive sequencing data and offers an approach that other massive-data applications can draw on, thereby enriching research on distributed computing and promoting the development of bioinformatics and high-performance parallel computing. Compared with a centralized system, the method is faster, more accurate, and cheaper when processing massive sequencing data.
Claims (10)
1. An error correction method for massive sequencing data running on a distributed system, the distributed system comprising a master node, a switch, and several compute nodes connected to the master node through the switch, characterized in that the method comprises the following steps:
1) Preprocess the sequencing data and determine the grouping standard for the sequencing data;
2) Partition the sequencing data, balance the load across the compute nodes of the distributed system, and transmit the sequencing data to the compute nodes;
3) Perform distributed error correction on the sequencing data.
2. The error correction method for massive sequencing data running on a distributed system according to claim 1, characterized in that determining the grouping standard of the sequencing data in step 1 comprises the following steps:
1-1, data sampling: sample the data according to the characteristics of the sequencing data to be processed, ensuring that the sampled sequencing data are reasonably representative;
1-2, clustering of the sampled data: apply a sequence similarity algorithm to compute the similarity between the short sequences in the sample, and apply a statistical method to group the sampled sequencing data into clusters of similar sequences;
1-3, per-cluster feature extraction: combine the sequencing data that make up each cluster and extract features used to quickly compute the distance between a short sequence and that cluster.
3. The error correction method for massive sequencing data running on a distributed system according to claim 2, characterized in that balancing the load of the compute nodes of the distributed system in step 2 comprises the following steps:
2-1, determine the distance between each piece of sequencing data and each sample cluster, and from those distances determine the home clusters of each piece of sequencing data;
2-2, perform the load-balancing calculation based on the quantity of sequencing data in each cluster;
2-3, assign clusters to specific compute nodes according to the load-balancing result;
2-4, transmit the sequencing data to the corresponding compute nodes.
4. The error correction method for massive sequencing data running on a distributed system according to claim 3, characterized in that in step 2-2 the standard cluster data volume is derived from the quantities of sequencing data in the clusters: standard cluster data volume = min(difference in sequencing data quantity between clusters), and the number of standard units for each cluster = that cluster's sequencing data quantity / standard cluster data volume, with any fractional part rounded up.
5. The mass sequencing data error correction method running on a distributed system according to claim 3, characterized
in that: in step 2-3, a load-balancing calculation is performed according to the standard data amount of each class and the standard computing capability of the compute nodes to determine
the compute node corresponding to each cluster's data, subject to the constraints that one compute node processes the data of one or more clusters and that one cluster
can be assigned to only one compute node for calculation.
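One way to satisfy the constraint that each cluster goes to exactly one node is a greedy heuristic, sketched below. The claim does not name a balancing algorithm, so the largest-first greedy rule is an assumption, as are the names `assign_clusters_to_nodes`, `cluster_units`, and `node_capacity` (a relative computing capability per node).

```python
def assign_clusters_to_nodes(cluster_units, node_capacity):
    """Greedy sketch: place the largest clusters first, each on the node whose
    load relative to its capacity would be lowest after the placement."""
    load = {node: 0 for node in node_capacity}
    placement = {}
    ordered = sorted(cluster_units.items(), key=lambda kv: kv[1], reverse=True)
    for cid, units in ordered:
        best = min(load, key=lambda n: (load[n] + units) / node_capacity[n])
        placement[cid] = best  # a cluster is never split across nodes
        load[best] += units
    return placement, load


placement, load = assign_clusters_to_nodes(
    {"a": 2, "b": 3, "c": 5},
    {"n1": 1.0, "n2": 1.0},
)
```

With fewer clusters than nodes this heuristic leaves some nodes idle, so a real scheduler would need an extra rule if every node must receive at least one cluster.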
6. The mass sequencing data error correction method running on a distributed system according to claim 3, characterized
in that: when the sequencing data are transmitted in step 2-4, the sequencing data are compressed using a compression algorithm and a check code is added.
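A minimal sketch of compress-then-checksum framing follows. The claim names neither the compression algorithm nor the check code, so zlib (DEFLATE) and CRC-32 from the Python standard library are assumptions, and the 4-byte big-endian header layout is hypothetical.

```python
import zlib


def pack_reads(reads):
    """Compress a batch of reads and prepend a CRC-32 check code (assumed framing)."""
    payload = zlib.compress("\n".join(reads).encode("ascii"))
    checksum = zlib.crc32(payload)
    return checksum.to_bytes(4, "big") + payload


def unpack_reads(message):
    """Verify the check code, then decompress; raise if the message was corrupted."""
    checksum = int.from_bytes(message[:4], "big")
    payload = message[4:]
    if zlib.crc32(payload) != checksum:
        raise ValueError("check code mismatch: message corrupted in transit")
    text = zlib.decompress(payload).decode("ascii")
    return text.split("\n") if text else []
```

Computing the check code over the compressed payload lets the receiver reject a damaged message before spending time on decompression.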
7. The mass sequencing data error correction method running on a distributed system according to claim 1, characterized
in that: in step 2, before the data are sent to the compute nodes, the data are first preprocessed to determine which node
each piece of sequencing data should be sent to.
8. The mass sequencing data error correction method running on a distributed system according to claim 1, characterized
in that performing distributed error correction on the sequencing data in step 3 comprises the following steps:
3-1, applying an error correction algorithm on each compute node to process the sequencing data and calculate scores;
3-2, comprehensive decision calculation: collecting the score data and determining the error correction scheme according to the scores from each compute node.
9. The mass sequencing data error correction method running on a distributed system according to claim 8, characterized
in that: in step 3-1, after receiving the sequencing data, a compute node runs the HiTEC error correction algorithm to identify errors in the sequencing data on that node,
calculates the error probability and error position of each short sequence, and calculates a score for the error correction.
10. The mass sequencing data error correction method running on a distributed system according to claim 8, characterized
in that: in step 3-2, each compute node uses the sequence number as the key and the score data as the value, and redistributes them to the compute nodes using a hash
function, so that the three error correction scores of the same sequence are all distributed to the same compute
node; an election algorithm is used to calculate the error correction scheme of that sequence, and the determined error correction schemes are collected and returned as the error correction
result.
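The redistribution and election in step 3-2 can be sketched as a hash partition followed by a vote. The claim does not define the hash function or the election rule, so CRC-32 modulo the node count and a majority vote with the highest score as tie-breaker are assumptions; the record layout `(sequence_id, proposed_correction, score)` and all function names are hypothetical.

```python
import zlib
from collections import Counter, defaultdict


def target_node(sequence_id, node_count):
    """Deterministic hash partition: equal sequence numbers always map to one node."""
    return zlib.crc32(str(sequence_id).encode("utf-8")) % node_count


def shuffle_scores(records, node_count):
    """Group (sequence_id, correction, score) records by their target node."""
    buckets = defaultdict(list)
    for seq_id, correction, score in records:
        buckets[target_node(seq_id, node_count)].append((seq_id, correction, score))
    return buckets


def elect(records):
    """Assumed election rule: most votes wins, ties broken by the highest score."""
    by_sequence = defaultdict(list)
    for seq_id, correction, score in records:
        by_sequence[seq_id].append((correction, score))
    result = {}
    for seq_id, proposals in by_sequence.items():
        votes = Counter(correction for correction, _ in proposals)
        best = {}
        for correction, score in proposals:
            best[correction] = max(score, best.get(correction, score))
        result[seq_id] = max(votes, key=lambda c: (votes[c], best[c]))
    return result
```

A stable hash such as CRC-32 is used instead of Python's built-in `hash()`, which is salted per process for strings and would therefore route the same key differently on different nodes.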
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611186654.3A CN106599617B (en) | 2016-12-20 | 2016-12-20 | Mass sequencing data error correction method running on a distributed system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599617A (en) | 2017-04-26 |
CN106599617B (en) | 2019-02-15 |
Family
ID=58600461
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611186654.3A (Expired - Fee Related) CN106599617B (en) | 2016-12-20 | 2016-12-20 | Mass sequencing data error correction method running on a distributed system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599617B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110737696A (en) * | 2019-10-12 | 2020-01-31 | 北京百度网讯科技有限公司 | Data sampling method, device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130173738A1 (en) * | 2012-01-04 | 2013-07-04 | International Business Machines Corporation | Administering Globally Accessible Memory Space In A Distributed Computing System |
CN104270437A (en) * | 2014-09-25 | 2015-01-07 | 中国科学院大学 | Mass data processing and visualizing system and method of distributed mixed architecture |
CN104615752A (en) * | 2015-02-12 | 2015-05-13 | 北京嘀嘀无限科技发展有限公司 | Information classification method and system |
US20160180018A1 (en) * | 2014-10-28 | 2016-06-23 | Bisn Laboratory Services Ltd. | Molecular and bioinformatics methods for direct sequencing |
CN106022002A (en) * | 2016-05-17 | 2016-10-12 | 杭州和壹基因科技有限公司 | Three-generation PacBio sequencing data-based hole filling method |
Non-Patent Citations (1)
Title |
---|
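Jiang Yu'e et al.: "A Survey of Error Correction Methods for Next-Generation Sequencing" (下一代测序纠错方法综述), Journal of Beijing University of Technology (《北京工业大学学报》) * |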
Also Published As
Publication number | Publication date |
---|---|
CN106599617B (en) | 2019-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sirén et al. | | Indexing graphs for path queries with applications in genome research |
Keller et al. | | Including RNA secondary structures improves accuracy and robustness in reconstruction of phylogenetic trees |
US20090119313A1 | | Determining structure of binary data using alignment algorithms |
WO2017120128A1 | | Systems and methods for adaptive local alignment for graph genomes |
PEREIRA | | Sousa |
EP2759952B1 | | Efficient genomic read alignment in an in-memory database |
Arram et al. | | Hardware acceleration of genetic sequence alignment |
Ng et al. | | Reconfigurable acceleration of genetic sequence alignment: A survey of two decades of efforts |
US20180247016A1 | | Systems and methods for providing assisted local alignment |
CN115146865A | | Task optimization method based on artificial intelligence and related equipment |
Roberts et al. | | Fragment assignment in the cloud with eXpress-D |
von Haeseler et al. | | Network models for sequence evolution |
Sirén et al. | | Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit |
Gupta et al. | | Fast processing and querying of 170tb of genomics data via a repeated and merged bloom filter (rambo) |
Ng et al. | | Acceleration of short read alignment with runtime reconfiguration |
Ionescu et al. | | Local rank distance |
CN112559482B | | Binary data classification processing method and system based on distribution |
CN106599617A | | Mass sequencing data error correcting method applied to distributed system |
CN106021992A | | Computation pipeline of location-dependent variant calls |
CN113344125A | | Long text matching identification method and device, electronic equipment and storage medium |
CN103699819A | | Peak expanding method for multistep bidirectional De Bruijn image-based elongating kmer inquiry |
Saeed et al. | | A high performance multiple sequence alignment system for pyrosequencing reads from multiple reference genomes |
CN110176276A | | Analysis of biological information orderly management method and system |
Muggli et al. | | A succinct solution to Rmap alignment |
Esmat et al. | | A parallel hash-based method for local sequence alignment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190215 |