CN106599617B - Method for correcting errors in massive sequencing data on a distributed system - Google Patents
Method for correcting errors in massive sequencing data on a distributed system
- Publication number: CN106599617B
- Application number: CN201611186654.3A
- Authority: CN (China)
- Prior art keywords: sequencing data; data; sequencing; compute node; node
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a method for correcting errors in massive sequencing data that runs on a distributed system. The distributed system comprises a master node, a switch, and several compute nodes connected to the master node through the switch. The method comprises the following steps: 1) preprocess the sequencing data and determine its grouping criteria; 2) partition the sequencing data, balance the load of the compute nodes of the distributed system, and transmit the sequencing data to the compute nodes; 3) perform distributed error correction on the sequencing data. Compared with a centralized system, the method offers higher speed, higher accuracy, and lower cost when processing massive sequencing data.
Description
Technical field
The invention relates to the interdisciplinary field between biological gene technology and computer science and technology, and in particular to a method for correcting errors in massive sequencing data that runs on a distributed system.
Background art
Next-generation sequencing (NGS, commonly called second-generation or new-generation sequencing in Chinese) makes whole-genome analysis and personalized genomic medicine possible. Compared with traditional Sanger sequencing, NGS is fast and inexpensive, but it produces a large number of short sequences (reads) that carry errors. Because of the limitations of the experimental technique, errors in these short sequences are unavoidable. If they are not corrected before assembly, the assembly algorithm works from erroneous data and the quality of the final sequence suffers. Correcting the short sequences before they are assembled into long sequences (contigs) is therefore an essential step and a precondition for reliable assembly.
Errors in sequencing data have long been a major problem for sequence quality and downstream analysis. The error rate in next-generation sequencing is related to base quality and is jointly affected by factors such as the sequencer itself, the sequencing reagents, and the sample. Sequencing errors not only interfere with normal assembly of the data but also prevent correct identification of genetic polymorphisms in the sample, making it hard to obtain meaningful results. Because the sequencing experiment is complex and each stage involves many uncontrollable random factors, standardizing and improving the experimental technique alone cannot fully eliminate sequencing errors.
Next-generation sequencing breaks the whole target sequence into short fragments (reads) and measures each fragment many times. All error correction methods rest on one premise: most of the reads produced are correct and only a minority contain errors. For example, suppose there are M identical copies of sequence A and N identical copies of sequence B, and the Hamming distance between A and B is within a specified threshold (y). A and B are then generally assumed to come from the same region of the original target sequence. The values M and N are compared: the more numerous sequence is taken to be correct, and the less numerous one is corrected to match it.
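The majority rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example reads, and the brute-force comparison of every read against every distinct read are hypothetical and are not taken from the patent.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def correct_by_majority(reads, threshold):
    """Replace each read by the most frequent read within `threshold` of it."""
    counts = Counter(reads)
    corrected = []
    for read in reads:
        best = read
        for other, n in counts.items():
            close = len(other) == len(read) and hamming(read, other) <= threshold
            if close and n > counts[best]:
                best = other
        corrected.append(best)
    return corrected


# Hypothetical data: 5 copies of A, 1 copy of B at distance 1, 2 unrelated reads.
reads = ["ACGTAC"] * 5 + ["ACGTTC"] + ["GGGGGG"] * 2
result = correct_by_majority(reads, threshold=1)
print(Counter(result))
```

Here the single copy of ACGTTC is rewritten to ACGTAC because the latter is more numerous and lies within the threshold, while the GGGGGG reads are left alone.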
Three kinds of error correction methods are currently in use: (1) methods based on the k-spectrum; (2) methods based on suffix trees or suffix arrays; (3) methods based on multiple sequence alignment (MSA).
Existing error correction algorithms have high computational complexity, run inefficiently, and demand substantial computing resources, so they are ill-suited to massive sequencing data. Processing large datasets requires a great deal of memory and a long running time. In particular, when whole-genome sequencing generates massive data, an ordinary server cannot provide enough memory or computing power and a supercomputer is needed.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a method for correcting errors in massive sequencing data that runs on a distributed system.
The technical solution adopted by the invention is as follows:
A method for correcting errors in massive sequencing data that runs on a distributed system, where the distributed system comprises a master node, a switch, and several compute nodes connected to the master node through the switch. The method comprises the following steps:
1) Preprocess the sequencing data and determine its grouping criteria;
2) Partition the sequencing data, balance the load of the compute nodes of the distributed system, and transmit the sequencing data to the compute nodes;
3) Perform distributed error correction on the sequencing data.
Determining the grouping criteria in step 1) specifically comprises the following steps:
1-1, Data sampling: sample the data according to the characteristics of the sequencing data to be processed, ensuring that the sampled data is representative;
1-2, Clustering the sampled data: apply a sequence similarity algorithm to compute the similarity between the short sequences of the sampled data, then apply a statistical method to group the sampled data into clusters of similar sequences;
1-3, Per-cluster feature extraction: combine the sequencing data of each cluster and compute over it, extracting features that allow the distance between a short sequence and that cluster to be determined quickly.
Balancing the load of the compute nodes in step 2) comprises the following steps:
2-1, Determine the distance between each sequencing read and each sample cluster and, from those distances, compute the clusters to which each read belongs;
2-2, Perform the load-balancing calculation from the number of reads each cluster holds;
2-3, Assign clusters to specific compute nodes according to the load-balancing result;
2-4, Transmit the sequencing data to the corresponding compute nodes.
In step 2-2, the standard cluster data volume is derived from the cluster read counts: standard cluster data volume = min(differences between the read counts of the clusters), and the number of standard units for each cluster = that cluster's read count / standard data volume, with any fractional part rounded up.
In step 2-3, the load-balancing calculation uses the standard cluster data volume and the standard computing capacity of the compute nodes to determine which compute node handles each cluster, subject to two constraints: one compute node may process the data of one or more clusters, and a cluster may be assigned to only one compute node.
In step 2-4, the sequencing data is compressed with a compression algorithm and a checksum is added before transmission.
In step 2), before data is transmitted to the compute nodes, it is first preprocessed to determine which node each sequencing read should be sent to.
Distributed error correction in step 3) comprises the following steps:
3-1, Apply an error correction algorithm to the sequencing data on each compute node and compute scores;
3-2, Integrate and decide: aggregate the score data and determine the error correction scheme from the scores of the compute nodes.
In step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm on that data to identify errors, computes the error likelihood and error positions of each short sequence, and computes an error correction score.
In step 3-2, each compute node uses the sequence number as the key and the score data as the value and redistributes the pairs to the compute nodes with a hash function, so that the three error correction scores of the same sequence all arrive at the same compute node. That node uses a voting algorithm to compute the error correction scheme for the sequence, and the aggregated scheme is returned as the error correction result.
With the above technical solution, the invention makes full use of a distributed computing platform and proposes a distributed solution for correcting errors in massive second-generation biological sequencing data. The method takes into account how the reasonableness of the data distribution and the load balance affect the performance of the platform. Cluster centers for the sequencing data are determined by clustering a sample, and each read is assigned by comparing it with those centers. A unit-based method computes the computing capacity of a unit node and the data volume of a unit cluster, and these serve as the basis for the load-balancing scheme. Once the data is distributed, each compute node applies an error correction algorithm to its local data and scores the corrections. Because each node has far less data to process, the error correction efficiency of the whole system improves markedly. Finally the scores are aggregated and the highest-scoring correction is selected as the error correction scheme.
The distributed solution gives bioinformatics a practical tool for correcting sequencing data and offers a new approach for other massive-data applications, thereby enriching research in distributed computing and advancing research and development in bioinformatics and high-performance parallel computing. Compared with a centralized system, the method offers higher speed, higher accuracy, and lower cost when processing massive sequencing data.
Brief description of the drawings
The invention is described in further detail below with reference to the drawings and specific embodiments;
Fig. 1 is a schematic diagram of the distributed system architecture of the invention;
Fig. 2 is a flow diagram of the method of the invention for correcting errors in massive sequencing data on a distributed system.
Detailed description of the embodiments
As shown in Fig. 1 and Fig. 2, the invention discloses a method for correcting errors in massive sequencing data that runs on a distributed system. The distributed system comprises a master node, a switch, and several compute nodes connected to the master node through the switch. Each compute node may be a PC server or a desktop PC. Hardware requirements are modest and, because the load can be balanced, the compute nodes need not share a uniform configuration.
The method for correcting errors in massive sequencing data comprises the following steps:
1) Preprocess the sequencing data and determine its grouping criteria;
2) Partition the sequencing data, balance the load of the compute nodes of the distributed system, and transmit the sequencing data to the compute nodes;
3) Perform distributed error correction on the sequencing data.
Determining the grouping criteria in step 1) specifically comprises the following steps:
1-1, Data sampling: sample the data according to the characteristics of the sequencing data to be processed, ensuring that the sampled data is representative;
Random sampling is performed on the sequencing data to be processed. The sample size is N/m3, where N is the total number of sequences and m is the number of compute nodes, with a minimum of roughly 1000-3000 sequences. The sampled data is then simulated with the Monte Carlo method, finally yielding a simulated dataset of 1000-2000 sequences that represents the overall characteristics of the sequencing data.
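A minimal sketch of the random sampling in step 1-1 follows. It rests on two assumptions that are not confirmed by the source: the sample size printed as "N/m3" is read as N divided by m cubed, and the stated minimum is applied as a simple floor of 1000 reads. The function name and the example data are hypothetical, and the Monte Carlo simulation step is not shown.

```python
import random


def sample_reads(reads, num_nodes, min_size=1000, seed=0):
    """Draw a uniform random sample without replacement.

    ASSUMPTION: the sample size "N/m3" is read as N / m**3, and
    `min_size` stands in for the stated minimum of roughly 1000-3000 reads.
    """
    n = len(reads)
    size = max(min_size, n // num_nodes ** 3)
    size = min(size, n)  # cannot sample more reads than exist
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(reads, size)


# Hypothetical input: 5000 placeholder reads and 4 compute nodes.
reads = [f"read{i}" for i in range(5000)]
sample = sample_reads(reads, num_nodes=4)
print(len(sample))  # 1000: the floor applies because 5000 // 4**3 is only 78
```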
1-2, Clustering the sampled data: apply a sequence similarity algorithm to compute the similarity between the short sequences of the sampled data, then apply a statistical method to group the sampled data into clusters of similar sequences;
The Hamming distance between every pair of short sequences in the sample dataset is computed. A hierarchical clustering algorithm then groups the sampled data into clusters of similar sequences according to Hamming distance. Two clustering rules are available. Under the first, sequences within a Hamming distance of 5 are placed in the same cluster; a cluster with fewer than 3 short sequences has too few samples and is discarded. Under the second, the number of clusters n is fixed in advance, generally n > 6m, and the hierarchical clustering algorithm groups the sample data into n clusters.
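The first clustering rule can be illustrated as follows. The source does not say which linkage the hierarchical clustering uses, so this sketch assumes single linkage (two reads end up in the same cluster if a chain of reads within distance 5 connects them) and implements it with a union-find structure. All names and the example reads are hypothetical, and reads are assumed to have equal length.

```python
def hamming(a, b):
    """Mismatch count for two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_dist=5, min_size=3):
    """Single-linkage grouping: reads within max_dist share a cluster."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Pairwise comparison is quadratic, acceptable for a sample of a few thousand reads.
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if hamming(reads[i], reads[j]) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i, read in enumerate(reads):
        groups.setdefault(find(i), []).append(read)
    # Clusters with fewer than min_size reads are discarded.
    return [g for g in groups.values() if len(g) >= min_size]


sample = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT", "CCCCCCCC", "CCCCCCCG"]
clusters = cluster_reads(sample)
print(clusters)
```

The three A-rich reads form one cluster; the two C-rich reads are more than 5 mismatches away from them and, with only 2 members, are dropped.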
1-3, Per-cluster feature extraction: combine the sequencing data of each cluster and compute over it, extracting features that allow the distance between a short sequence and that cluster to be determined quickly.
The sample sequences of each cluster are concatenated into one long sequence: the short sequences in the cluster are joined in order, each to the next, with a separator symbol between two sequences. A probabilistic suffix array algorithm then builds a suffix array from each cluster's long sequence and computes the branch probability at each node. This step outputs a data structure that records the suffix array and branch probabilities of the cluster's long sequence, and this data structure is the feature of that cluster of sample data.
In the feature extraction method, the sequences in a cluster are linearly combined, and the weight of each subsequence is determined by its frequency of occurrence and its probability.
Here, the probabilistic suffix array (PSA) is an implementation of variable-length Markov models (VLMMs) built on the traditional suffix array. Like a suffix array, a PSA can represent all N(N+1)/2 substrings from root to leaf. In the VLMM realized by the PSA model, the depth of each node's string represents the length of the substring. By limiting the depth of the leaf nodes, the model can represent strings of equal length, and it can also represent the conditional probability that a state occurs after a given sequence, which is the transition probability in the transition matrix. A transition probability is a relative frequency computed from the symbol on the given path and the symbols on the substring path that precedes it in the observed data. The conditional probability determined by the substring length can be computed along a path identified in the PSA model. Because a suffix array records each node's string by its start and end positions and has N nodes, it can represent all N(N+1)/2 substrings from root to leaf in linear space. A transition matrix with N(N+1)/2 transition probabilities can therefore be represented by a probabilistic suffix array, a linear data structure with N nodes.
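To make the notion of a transition probability as a relative frequency concrete, the sketch below counts which symbol follows each context in a cluster's concatenated long sequence. It is a dictionary-based stand-in, not a suffix array: it uses more memory than the linear-space PSA described above and is shown only to illustrate the probabilities being stored. The separator "$", the maximum context depth, and all names are assumptions.

```python
from collections import defaultdict


def build_context_model(sequence, max_depth=3, separator="$"):
    """Count, for every context of length 0..max_depth, which symbol follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(sequence):
        if symbol == separator:
            continue
        for depth in range(max_depth + 1):
            if i - depth < 0:
                break
            context = sequence[i - depth:i]
            if separator in context:
                break  # a context must not span two reads
            counts[context][symbol] += 1
    return counts


def transition_probability(counts, context, symbol):
    """Relative frequency of `symbol` after `context` (0.0 if the context is unseen)."""
    total = sum(counts[context].values()) if context in counts else 0
    return counts[context][symbol] / total if total else 0.0


# Hypothetical cluster of three reads joined with a separator.
long_sequence = "ACGT$ACGA$ACGT"
model = build_context_model(long_sequence)
p = transition_probability(model, "ACG", "T")
print(p)  # "ACG" is followed by T twice and by A once
```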
Balancing the load of the compute nodes in step 2) comprises the following steps:
2-1, Determine the distance between each sequencing read and each sample cluster and, from those distances, compute the clusters to which each read belongs;
For a given sequencing read s, its similarity to the probabilistic suffix array extracted for each sample cluster is computed as follows: starting from the root node, the nodes of the PSA are visited and matched against s, and the matching probability between s and that PSA is computed from the transition probabilities of the matched nodes. The 3 clusters with the highest matching probability for s are then taken as the clusters to which it belongs.
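The assignment of a read to its 3 most similar clusters might look like the sketch below. In place of a PSA it uses a fixed-order Markov model per cluster with add-one smoothing, which is an assumption made to keep the example short; the model order, the alphabet, the cluster names, and the reads are all hypothetical.

```python
import heapq
import math
from collections import Counter

ALPHABET = "ACGT"


def train(cluster_reads, order=2):
    """Count (context, next symbol) pairs over one cluster's reads."""
    counts = Counter()
    for read in cluster_reads:
        for i in range(order, len(read)):
            counts[(read[i - order:i], read[i])] += 1
    return counts


def log_match(read, counts, order=2):
    """Log-probability of the read under the model, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(read)):
        ctx = read[i - order:i]
        seen = sum(counts[(ctx, s)] for s in ALPHABET)
        total += math.log((counts[(ctx, read[i])] + 1) / (seen + len(ALPHABET)))
    return total


def home_clusters(read, models, k=3):
    """Return the ids of the k clusters with the highest matching score."""
    return heapq.nlargest(k, models, key=lambda cid: log_match(read, models[cid]))


models = {
    "ac": train(["ACACACAC"]),
    "ag": train(["AGAGAGAG"]),
    "cg": train(["CGCGCGCG"]),
    "tt": train(["TTTTTTTT"]),
}
homes = home_clusters("ACACACAG", models)
print(homes)
```

The read shares most of its contexts with the "ac" cluster, so that cluster ranks first; the remaining two slots go to clusters that match it equally poorly.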
2-2, Perform the load-balancing calculation from the number of reads each cluster holds. A standard computing capacity is computed from the number and processing power of the compute nodes; the computing capacity of each node is an integer multiple of the standard computing capacity, at least 1. A standard cluster data volume is computed from the cluster read counts; the size of each cluster is an integer multiple of the standard data volume, at least 1.
In step 2-2, the standard cluster data volume is derived from the cluster read counts: standard cluster data volume = min(differences between the read counts of the clusters), and the number of standard units for each cluster = that cluster's read count / standard data volume, with any fractional part rounded up.
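The unit calculation in step 2-2 can be sketched as below. The formula in the source is terse, so this is one possible reading and not a definitive one: the standard data volume is taken as the smallest non-zero difference between any two cluster sizes, with a fallback to the smallest cluster size when all clusters are equal. Function names and the example sizes are hypothetical.

```python
from itertools import combinations


def standard_data_volume(cluster_sizes):
    """ASSUMPTION: smallest non-zero difference between any two cluster sizes."""
    diffs = [abs(a - b) for a, b in combinations(cluster_sizes, 2) if a != b]
    # Fallback (also an assumption): if all sizes are equal, use that size.
    return min(diffs) if diffs else min(cluster_sizes)


def standard_units(cluster_sizes):
    """Size of each cluster in standard units, fractional part rounded up."""
    unit = standard_data_volume(cluster_sizes)
    return [-(-size // unit) for size in cluster_sizes]  # integer ceiling division


sizes = [1200, 900, 2500]
print(standard_data_volume(sizes))  # 300
print(standard_units(sizes))        # [4, 3, 9]
```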
2-3, Assign clusters to specific compute nodes according to the load-balancing result;
In step 2-3, the load-balancing calculation uses the standard cluster data volume and the standard computing capacity of the compute nodes to determine which compute node handles each cluster, subject to two constraints: one compute node may process the data of one or more clusters, and a cluster may be assigned to only one compute node. The allocation is computed as follows: first compute the ratio between one unit of standard computing capacity and one standard data volume, then compute the assignment using a solution method for the knapsack problem.
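The source names a knapsack-style solution without specifying it. The sketch below swaps in a simple greedy heuristic (largest cluster first, placed on the node whose load relative to capacity would be lowest), which respects both constraints but is not an exact knapsack solver. The cluster and node identifiers and their values are hypothetical.

```python
def assign_clusters(cluster_units, node_capacity):
    """Greedy heuristic: biggest cluster first, to the relatively least-loaded node.

    cluster_units: {cluster_id: size in standard data units}
    node_capacity: {node_id: capacity in standard compute units}
    Each cluster goes to exactly one node; a node may receive several clusters.
    """
    load = {node: 0 for node in node_capacity}
    assignment = {}
    for cid, units in sorted(cluster_units.items(), key=lambda kv: -kv[1]):
        node = min(load, key=lambda n: (load[n] + units) / node_capacity[n])
        assignment[cid] = node
        load[node] += units
    return assignment, load


clusters = {"c1": 9, "c2": 4, "c3": 3, "c4": 2}
nodes = {"n1": 2, "n2": 1}  # n1 is assumed to be twice as powerful as n2
assignment, load = assign_clusters(clusters, nodes)
print(assignment)
print(load)  # {'n1': 12, 'n2': 6}: load is proportional to capacity
```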
2-4, Transmit the sequencing data to the corresponding compute nodes according to the load-balancing scheme above. Before transmission, the data is first preprocessed to determine which node each sequencing read should be sent to. The data is compressed with a compression algorithm and a checksum is added, to ensure that the compressed archive is not lost or corrupted in transit. The compressed archive is sent to the corresponding compute node by an FTP program. The compute node first verifies the data it receives; once it confirms there is no transmission error, it decompresses the archive into its working directory, which completes reception. If a transmission error is found, the node issues a retransmission request so that the master node sends that node's compressed archive again.
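The compress, checksum, and verify cycle of step 2-4 can be sketched with the Python standard library. The source does not name a compression algorithm or a checksum, so zlib and SHA-256 are assumptions, as are the function names; the FTP transfer itself and the retransmission request are not shown.

```python
import hashlib
import zlib


def pack(reads):
    """Compress reads and return (payload, checksum) for transmission."""
    payload = zlib.compress("\n".join(reads).encode("ascii"))
    return payload, hashlib.sha256(payload).hexdigest()


def unpack(payload, checksum):
    """Verify the checksum, then decompress; raise if the payload is damaged."""
    if hashlib.sha256(payload).hexdigest() != checksum:
        raise ValueError("checksum mismatch: request retransmission")
    return zlib.decompress(payload).decode("ascii").split("\n")


payload, digest = pack(["ACGT", "TTGA"])
received = unpack(payload, digest)
print(received)  # ['ACGT', 'TTGA']

# Simulate a transmission error by flipping the bits of the first byte.
damaged = bytes([payload[0] ^ 0xFF]) + payload[1:]
try:
    unpack(damaged, digest)
    detected = False
except ValueError:
    detected = True
print(detected)  # True
```

Verifying before decompressing means a damaged archive is rejected without touching the working directory, which matches the order of operations described above.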
Distributed error correction in step 3) comprises the following steps:
3-1, Apply an error correction algorithm to the sequencing data on each compute node and compute scores. After the data is distributed according to the load-balancing scheme, and assuming all compute nodes have the same processing capacity, each compute node must process about 3N/m reads, where N is the total number of reads and m is the number of nodes. When the number of nodes m is fairly large (for example m = 50), the data volume per node is greatly reduced. A conventional sequencing error correction algorithm can then be applied to the data, and an error correction score is returned based on its result.
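The factor of 3 in this estimate presumably comes from step 2-1, where each read is assigned to its 3 most similar clusters and is therefore processed three times. A brief worked instance, assuming equal node capacity and an illustrative total of N = 10^9 reads (a figure chosen here for the example, not one given in the source):

```latex
\[
\text{reads per node} \approx \frac{3N}{m},
\qquad
m = 50,\; N = 10^{9}
\;\Longrightarrow\;
\frac{3 \times 10^{9}}{50} = 6 \times 10^{7}.
\]
```

Each node thus handles about 6% of the total read count, even though every read is processed on three nodes.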
For example, in step 3-1, after a compute node receives its sequencing data, it runs the HiTEC error correction algorithm on that data to identify errors, computes the error likelihood and error positions of each short sequence, and computes an error correction score.
3-2, Integrate and decide: aggregate the score data and determine the error correction scheme from the scores of the compute nodes.
In step 3-2, each compute node uses the sequence number as the key and the score data as the value and redistributes the pairs to the compute nodes with a hash function, so that the three error correction scores of the same sequence all arrive at the same compute node. That node uses a voting algorithm to compute the error correction scheme for the sequence, and the aggregated scheme is returned as the error correction result.
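The redistribution and election in step 3-2 can be sketched as a hash partition followed by a per-sequence maximum. The source does not specify the hash function or the election rule; SHA-256 is assumed for a hash that is stable across machines, and the election is read as picking the highest-scoring correction. The record layout, names, and scores are hypothetical.

```python
import hashlib
from collections import defaultdict


def target_node(seq_id, num_nodes):
    """Stable hash partitioning: the same sequence id always maps to the same node."""
    digest = hashlib.sha256(str(seq_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def shuffle(scored, num_nodes):
    """Group (seq_id, correction, score) records by target node, then by sequence."""
    buckets = defaultdict(lambda: defaultdict(list))
    for seq_id, correction, score in scored:
        buckets[target_node(seq_id, num_nodes)][seq_id].append((score, correction))
    return buckets


def elect(candidates):
    """Pick the correction with the highest score."""
    return max(candidates)[1]


# Hypothetical records: three scored corrections for sequence 7, one for sequence 8.
scored = [(7, "ACGT", 0.6), (7, "ACGA", 0.9), (7, "ACGT", 0.7), (8, "TTTT", 0.5)]
buckets = shuffle(scored, num_nodes=4)
results = {
    seq_id: elect(candidates)
    for per_node in buckets.values()
    for seq_id, candidates in per_node.items()
}
print(results)
```

Python's built-in `hash` is avoided here because it is randomized per process for strings, so different nodes could route the same key differently.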
Claims (7)
1. A method for correcting errors in massive sequencing data that runs on a distributed system, the distributed system comprising a master node, a switch, and several compute nodes connected to the master node through the switch, characterized in that the method comprises the following steps:
1) preprocessing the sequencing data and determining its grouping criteria; determining the grouping criteria in step 1) specifically comprises the following steps:
1-1, data sampling: sampling the data according to the characteristics of the sequencing data to be processed, ensuring that the sampled data is representative;
1-2, clustering the sampled data: applying a sequence similarity algorithm to compute the similarity between the short sequences of the sampled data, and applying a statistical method to group the sampled data into clusters of similar sequences;
1-3, per-cluster feature extraction: combining the sequencing data of each cluster and computing over it, extracting features that allow the distance between a short sequence and that cluster to be determined quickly;
2) partitioning the sequencing data, balancing the load of the compute nodes of the distributed system, and transmitting the sequencing data to the compute nodes;
balancing the load of the compute nodes in step 2) comprises the following steps:
2-1, determining the distance between each sequencing read and each sample cluster and, from those distances, computing the clusters to which each read belongs;
2-2, performing the load-balancing calculation from the number of reads each cluster holds;
2-3, assigning clusters to specific compute nodes according to the load-balancing result;
2-4, transmitting the sequencing data to the corresponding compute nodes;
3) performing distributed error correction on the sequencing data;
distributed error correction in step 3) comprises the following steps:
3-1, applying an error correction algorithm to the sequencing data on each compute node and computing scores;
3-2, integrating and deciding: aggregating the score data and determining the error correction scheme from the scores of the compute nodes.
2. a kind of magnanimity sequencing data error correcting method for running on distributed system according to claim 1, feature
It is: cluster normal data amount is obtained according to cluster sequencing data quantity in step 2-2, cluster normal data amount=min is (all kinds of
Sequencing data quantity is poor), each clusters corresponding normal data amount=such cluster sequencing data quantity/normal data amount;Decimal
Part rounds up.
3. a kind of magnanimity sequencing data error correcting method for running on distributed system according to claim 1, feature
It is: according to the criterion calculation ability of cluster normal data amount and calculate node in step 2-3, carries out load balance calculating, really
Determine the corresponding calculate node of cluster data, and meets 1 calculate node and handle one or more cluster corresponding datas, and one poly-
Class can only be assigned to a calculate node and be calculated.
4. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2-4, when the sequencing data is transmitted, it is compressed with a compression algorithm and a check code is added.
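A minimal sketch of compressing a batch of reads and adding a check code before transmission follows. The claim names neither the compression algorithm nor the check code, so DEFLATE via Python's `zlib` and a CRC-32 checksum are assumptions chosen for illustration, as is the 4-byte header layout.

```python
import struct
import zlib

def pack_reads(reads):
    """Compress newline-joined reads and prepend a CRC-32 of the payload."""
    payload = zlib.compress("\n".join(reads).encode("ascii"))
    checksum = zlib.crc32(payload)
    return struct.pack(">I", checksum) + payload  # 4-byte big-endian header

def unpack_reads(message):
    """Verify the check code, then decompress and split back into reads."""
    (expected,) = struct.unpack(">I", message[:4])
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: payload corrupted in transit")
    text = zlib.decompress(payload).decode("ascii")
    return text.split("\n") if text else []

message = pack_reads(["ACGTACGT", "TTGACCAA"])
print(unpack_reads(message))
```

The receiving compute node rejects a message whose checksum does not match, so a corrupted transfer can be requested again instead of being passed to the error correction step.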
5. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2), before the data is transmitted to the compute nodes, the data is first preprocessed to determine which node each piece of sequencing data should be transmitted to.
6. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 3-1, after a compute node receives the sequencing data, it runs the HiTEC error correction algorithm to perform error identification on the sequencing data of that node, calculates the error probability and error positions of each short sequence, and calculates an error correction score.
7. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 3-2, each compute node uses the sequence number as the key and the score data as the value, and redistributes them to the compute nodes using a hash function, so that the three error correction scores of the same sequence are all distributed to the same compute node; that node calculates the error correction scheme of the sequence using an election algorithm, and the determined error correction schemes are aggregated and returned as the error correction result.
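The hash redistribution and election of step 3-2 can be sketched as follows. The record layout `(sequence number, proposed correction, score)`, the use of CRC-32 as the hash function, and the majority vote with ties broken by total score are all assumptions for illustration; the claim only states that a hash function and an election algorithm are used.

```python
import zlib
from collections import defaultdict

def target_node(seq_id: int, n_nodes: int) -> int:
    """Deterministic hash of the sequence number onto a node index."""
    return zlib.crc32(str(seq_id).encode("ascii")) % n_nodes

def shuffle(records, n_nodes):
    """Route every (seq_id, correction, score) record to its target node."""
    buckets = defaultdict(list)
    for seq_id, correction, score in records:
        buckets[target_node(seq_id, n_nodes)].append((seq_id, correction, score))
    return buckets

def elect(records):
    """Per sequence, pick the correction with most votes (ties: higher total score)."""
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))
    for seq_id, correction, score in records:
        entry = tally[seq_id][correction]
        entry[0] += 1        # vote count
        entry[1] += score    # accumulated score
    return {
        seq_id: max(options, key=lambda c: tuple(options[c]))
        for seq_id, options in tally.items()
    }

# Assumed example: three scored proposals for each of two sequences.
records = [
    (7, "ACGT", 0.90), (7, "ACGT", 0.80), (7, "ACGA", 0.95),
    (8, "TTGA", 0.60), (8, "TTGC", 0.70), (8, "TTGA", 0.50),
]
buckets = shuffle(records, n_nodes=3)
result = {}
for node_records in buckets.values():  # each node elects independently
    result.update(elect(node_records))
print(result)
```

A content-based hash such as CRC-32 is used here instead of Python's built-in `hash` because the latter is randomized per process for strings, which would send the same key to different nodes on different machines.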
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611186654.3A CN106599617B (en) | 2016-12-20 | 2016-12-20 | A kind of magnanimity sequencing data error correcting method running on distributed system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599617A (en) | 2017-04-26 |
CN106599617B (en) | 2019-02-15 |
Family
ID=58600461
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611186654.3A Expired - Fee Related CN106599617B (en) | 2016-12-20 | 2016-12-20 | A kind of magnanimity sequencing data error correcting method running on distributed system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599617B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110737696A (en) * | 2019-10-12 | 2020-01-31 | 北京百度网讯科技有限公司 | Data sampling method, device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104270437A (en) * | 2014-09-25 | 2015-01-07 | 中国科学院大学 | Mass data processing and visualizing system and method of distributed mixed architecture |
CN104615752A (en) * | 2015-02-12 | 2015-05-13 | 北京嘀嘀无限科技发展有限公司 | Information classification method and system |
CN106022002A (en) * | 2016-05-17 | 2016-10-12 | 杭州和壹基因科技有限公司 | Three-generation PacBio sequencing data-based hole filling method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8805952B2 (en) * | 2012-01-04 | 2014-08-12 | International Business Machines Corporation | Administering globally accessible memory space in a distributed computing system |
GB2531741A (en) * | 2014-10-28 | 2016-05-04 | Bisn Laboratory Services Ltd | Molecular and bioinformatics methods for direct sequencing |
Non-Patent Citations (1)
Title |
---|
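A survey of next-generation sequencing error correction methods; Jiang Yu'e et al.; Journal of Beijing University of Technology; 2016-03-31; Vol. 42, No. 3; pp. 377-384 |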
Also Published As
Publication number | Publication date |
---|---|
CN106599617A (en) | 2017-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sirén et al. | Indexing graphs for path queries with applications in genome research | |
US20140214334A1 (en) | Efficient genomic read alignment in an in-memory database | |
Quicke et al. | Utility of the DNA barcoding gene fragment for parasitic wasp phylogeny (Hymenoptera: Ichneumonoidea): data release and new measure of taxonomic congruence | |
EP2759952B1 (en) | Efficient genomic read alignment in an in-memory database | |
CN105243297A (en) | Quick comparing and positioning method for gene sequence segments on reference genome | |
CN104112005B (en) | Distributed mass fingerprint identification method | |
US20170017717A1 (en) | Sequence Data Analyzer, DNA Analysis System and Sequence Data Analysis Method | |
Sirén et al. | Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit | |
Evangelista et al. | Assessing support for Blaberoidea phylogeny suggests optimal locus quality | |
CN110867231A (en) | Disease prediction method, device, computer equipment and medium based on text classification | |
Pei et al. | CLADES: A classification‐based machine learning method for species delimitation from population genetic data | |
CN110600135A (en) | Breast cancer prediction system based on improved random forest algorithm | |
CN104573405B (en) | Phylogenetic tree rebuilding method for building sub trees on basis of big trees | |
CN106599617B (en) | A kind of magnanimity sequencing data error correcting method running on distributed system | |
CN109857892B (en) | Semi-supervised cross-modal Hash retrieval method based on class label transfer | |
CN103440292B (en) | Multimedia information retrieval method and system based on bit vectors | |
CN114821818A (en) | Motion data analysis method and system based on intelligent sports | |
Van Etten et al. | A k-mer-based approach for phylogenetic classification of taxa in environmental genomic data | |
CN114491081A (en) | Electric power data tracing method and system based on data blood relationship graph | |
CN113344125A (en) | Long text matching identification method and device, electronic equipment and storage medium | |
CN111984745A (en) | Dynamic expansion method, device, equipment and storage medium for database field | |
Zou et al. | HPTree: reconstructing phylogenetic trees for ultra-large unaligned DNA sequences via NJ model and Hadoop | |
Borges et al. | Distinguishing between spectral clustering and cluster analysis of mass spectra | |
CN106529212B (en) | Biological sequence evolution information extracting method based on sequence dependent Frequency matrix | |
Fan et al. | Coupled feature mapping and correlation mining for cross-media retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190215 |