CN106599617A - Mass sequencing data error correcting method applied to distributed system - Google Patents

Info

Publication number
CN106599617A
Authority
CN
China
Prior art keywords
sequencing data
data
cluster
sequencing
calculate node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611186654.3A
Other languages
Chinese (zh)
Other versions
CN106599617B (en)
Inventor
林劼
江育娥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University
Priority to CN201611186654.3A
Publication of CN106599617A
Application granted
Publication of CN106599617B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a mass sequencing data error correcting method applied to a distributed system. The distributed system comprises a main node, a switch and a plurality of computation nodes, the computation nodes being connected to the main node via the switch. The method comprises the following steps: 1, preprocessing the sequencing data and determining a grouping standard for the sequencing data; 2, partitioning the sequencing data, balancing the load of each computation node in the distributed system, and transmitting the sequencing data to the computation nodes; and 3, performing distributed error correction on the sequencing data. Compared with a centralized system, the method processes mass sequencing data with higher speed and accuracy and at lower cost.

Description

A mass sequencing data error correction method running on a distributed system
Technical field
The present invention relates to the interdisciplinary field of biological gene technology and computer science and technology, and more particularly to a mass sequencing data error correction method running on a distributed system.
Background art
Next-generation high-throughput sequencing (next generation sequencing, NGS, commonly called second-generation or new-generation sequencing in Chinese) makes whole-genome analysis and personalized genetic medicine possible. Compared with traditional Sanger sequencing, next-generation sequencing is fast and inexpensive, but its drawback is that it produces a large volume of short sequence data together with the errors that data carries. Owing to the limitations of the experimental technique, these short sequences inevitably contain errors. If the errors are not corrected before sequence assembly, the assembly algorithm works from erroneous data and the quality of the final sequence is reduced. Correcting the short sequence data before it is assembled into long sequences (contigs) is therefore a very important step, and is the precondition and guarantee for reconstructing reliable long sequences.
Errors in sequencing data have always been a major problem affecting sequence quality and subsequent analysis. The error rate in next-generation sequencing is related to base quality and is jointly influenced by multiple factors such as the sequencer itself, the sequencing reagents, and the sample. Sequencing errors not only interfere with normal assembly of the sequencing data but also prevent correct identification of the genetic polymorphisms present in the sample, making it difficult to obtain valuable results. Because the sequencing experiment is complex and each step involves many uncontrollable random factors, it is difficult to eliminate sequencing errors completely purely by standardizing and improving the experimental technique.
Next-generation sequencing breaks the whole sequence to be measured into short fragments (reads) and measures each short read multiple times. All error correction methods rest on the same premise: most of the sequenced reads are correct, and only a minority contain errors. For example, during error correction, suppose there are M identical copies of sequence A and N identical copies of sequence B, and the Hamming distance between A and B is within a specified threshold (y). In this case A and B are generally considered to come from the same region of the original sequence to be measured. The values M and N are then compared: the sequence with the larger count is regarded as correct, and the sequence with the smaller count is corrected to the sequence with the larger count.
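The majority rule described above can be sketched in a few lines of Python. This is only an illustration of the premise, not the method claimed in this patent; the function names and the choice to compare every pair of distinct reads are assumptions made for the example.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length reads differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def correct_by_majority(reads: list[str], threshold: int) -> list[str]:
    """Replace each read with the most frequent read within `threshold` of it.

    A read is left unchanged when no more frequent read lies within the
    Hamming distance threshold (the threshold y in the text).
    """
    counts = Counter(reads)
    corrected = []
    for read in reads:
        best = read
        for other, n in counts.items():
            if (len(other) == len(read)
                    and n > counts[best]
                    and hamming(read, other) <= threshold):
                best = other
        corrected.append(best)
    return corrected
```

With five copies of "ACGT", two copies of "ACGA" and a threshold of 1, the two rarer reads are rewritten to "ACGT", while an unrelated read such as "TTTT" is left alone.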
The error correction methods currently in use fall mainly into three kinds: (1) methods based on the k-spectrum; (2) methods based on suffix trees/suffix arrays; (3) methods based on multiple sequence alignment (MSA).
Existing error correction algorithms have high computational complexity, low execution efficiency, and very high demands on computing resources, so they are unsuitable for mass sequencing data. Processing mass data requires a large amount of memory and a very long run time. In particular, when complete-sequence sequencing produces mass data, an ordinary server cannot provide enough memory or computing power, and a supercomputer is needed.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a mass sequencing data error correction method running on a distributed system.
The technical solution adopted by the present invention is:
A mass sequencing data error correction method running on a distributed system, where the distributed system comprises a main node, a switch, and a plurality of computation nodes, the computation nodes being connected to the main node via the switch. The method comprises the following steps:
1) preprocessing the sequencing data and determining the grouping standard of the sequencing data;
2) partitioning the sequencing data, balancing the load of each computation node of the distributed system, and transmitting the sequencing data to the computation nodes;
3) performing distributed error correction on the sequencing data.
Determining the grouping standard of the sequencing data in step 1 specifically comprises the following steps:
1-1, data sampling: sample the data according to the characteristics of the sequencing data to be processed, ensuring that the sampled sequencing data is reasonably representative;
1-2, clustering of the sampled sequencing data: apply a sequence similarity algorithm to calculate the similarity between the short sequences of the sampled sequencing data, and apply a statistical method to group the sampled sequencing data into classes of similar sequences;
1-3, feature extraction for each class: combine the sequencing data making up each class and compute over it, extracting features used to quickly determine the distance between a short sequence and that class.
Balancing the load of each computation node of the distributed system in step 2 comprises the following steps:
2-1, determine the distance between each piece of sequencing data and each sample cluster, and from the calculated distances determine the clusters to which each piece of sequencing data belongs;
2-2, perform the load balancing calculation according to the quantity of sequencing data in each cluster;
2-3, assign clusters to specified computation nodes according to the load balancing result;
2-4, transmit the sequencing data to the corresponding computation nodes.
In step 2-2 the cluster standard data amount is obtained from the cluster sequencing data quantities: cluster standard data amount = min(differences between the sequencing data quantities of the classes), and the number of standard data amounts for each cluster = that cluster's sequencing data quantity / standard data amount, with any fractional part rounded up.
In step 2-3 the load balancing calculation is performed according to the class standard data amounts and the standard computing capability of the computation nodes, determining the computation node for each cluster's data, such that one computation node processes the data of one or more clusters and one cluster can be assigned to only one computation node.
In step 2-4, when the sequencing data is transmitted, it is compressed with a compression algorithm and a check code is added.
In step 2, before the data is sent to the computation nodes, it is preprocessed to determine which node each piece of sequencing data should be sent to.
Performing distributed error correction on the sequencing data in step 3 comprises the following steps:
3-1, each computation node applies an error correction algorithm to its sequencing data and calculates scores;
3-2, comprehensive decision: the score data is aggregated and the error correction scheme is determined from the scores of the computation nodes.
In step 3-1, after a computation node receives its sequencing data, it runs the HiTEC error correction algorithm to identify errors in the sequencing data on that node, calculates the error probability and error positions of each short sequence, and calculates error correction scores.
In step 3-2, each computation node uses the sequence number as the key and the score data as the value and redistributes them to the computation nodes with a hash function, so that the three error correction scores of the same sequence are all sent to the same computation node. An election algorithm then determines the error correction scheme for that sequence, and the determined error correction schemes are collected and returned as the error correction result.
By adopting the above technical solution, the present invention makes full use of the performance of a distributed computing platform and proposes a solution for error correction of mass biological second-generation sequencing data in a distributed environment. The method takes full account of how reasonably the sequencing data is distributed and of the effect of load balancing on the performance of the distributed computing platform. Cluster centers of the sequencing data are determined by sampling and clustering, and the clusters to which each piece of sequencing data belongs are determined by comparison with those centers. A unit method is used: the unit node computing capability and the unit cluster data amount are calculated and serve as the basis of the load balancing method. Once the sequencing data is distributed, each computation node applies an error correction algorithm to identify errors in its own data and score the corrections. Because the amount of data each node has to process is greatly reduced, the error correction efficiency of the whole system improves significantly. Finally the scores are aggregated and the highest-scoring correction is elected as the error correction scheme.
The mass sequencing data error correction solution of the present invention, based on a distributed environment, provides bioinformatics with a practical sequencing data error correction tool and offers a new approach for other mass data applications, thereby enriching research in distributed computing and promoting research and development in bioinformatics and high-performance parallel computing. Compared with a centralized system, the method of the present invention is faster, more accurate, and lower in cost when processing mass sequencing data.
Description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments;
Fig. 1 is a schematic diagram of the distributed system architecture of the present invention;
Fig. 2 is a flow diagram of the mass sequencing data error correction method running on a distributed system of the present invention.
Specific embodiment
As shown in Fig. 1 and Fig. 2, the present invention discloses a mass sequencing data error correction method running on a distributed system. The distributed system comprises a main node, a switch, and a plurality of computation nodes connected to the main node via the switch. Each computation node can be a PC server or a desktop PC; the hardware requirements are modest, and because load balancing is performed, the computation nodes do not all need the same configuration.
The mass sequencing data error correction method comprises the following steps:
1) preprocessing the sequencing data and determining the grouping standard of the sequencing data;
2) partitioning the sequencing data, balancing the load of each computation node of the distributed system, and transmitting the sequencing data to the computation nodes;
3) performing distributed error correction on the sequencing data.
Determining the grouping standard of the sequencing data in step 1 specifically comprises the following steps:
1-1, data sampling: sample the data according to the characteristics of the sequencing data to be processed, ensuring that the sampled sequencing data is reasonably representative;
Random sampling is performed on the sequencing data to be processed. The sample size is N/m3, where N is the total number of sequences and m is the number of computation nodes, with a minimum of about 1000-3000 sequences. The sampled data is then simulated with the Monte Carlo simulation method, finally yielding a simulated sequencing data set of 1000-2000 sequences that can represent the overall characteristics of the sequencing data.
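A minimal sketch of the random sampling step follows. It reads "N/m3" as N divided by m cubed, which is an assumption because the exponent formatting cannot be recovered from the text; the function name, the default floor of 1000 and the `seed` parameter are likewise hypothetical. The Monte Carlo simulation step is not shown.

```python
import random


def sample_reads(reads, num_nodes, min_size=1000, seed=None):
    """Draw a uniform random sample of reads without replacement.

    The target size is N // m**3 (an assumed reading of "N/m3"),
    raised to `min_size` and capped at the number of reads N.
    """
    n = len(reads)
    target = max(n // num_nodes ** 3, min_size)
    rng = random.Random(seed)  # seeded generator so a run can be repeated
    return rng.sample(reads, min(target, n))
```

For example, with 5000 reads and 4 nodes the raw target of 78 reads is below the floor, so 1000 reads are drawn.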
1-2, clustering of the sampled sequencing data: apply a sequence similarity algorithm to calculate the similarity between the short sequences of the sampled sequencing data, and apply a statistical method to group the sampled sequencing data into classes of similar sequences;
The Hamming distance between the short sequences of the sample data set is calculated. A hierarchical clustering algorithm is then applied to group the sampled sequencing data into classes of similar sequences according to Hamming distance. There are two clustering rules. The first groups sequences whose Hamming distance is within 5 into the same class; if a class contains fewer than 3 short sequences, it has too few samples and is discarded. The second sets the number of clusters n, generally n > 6m, and applies the hierarchical clustering algorithm to group the sample data into n classes.
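The first clustering rule can be illustrated as follows. The patent does not state which linkage the hierarchical clustering uses, so this sketch assumes single linkage cut at the distance threshold, which reduces to finding connected components with a union-find structure; all names are hypothetical, and the quadratic pairwise comparison is only reasonable for a sample of a few thousand reads.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_dist=5, min_size=3):
    """Single-linkage clustering cut at `max_dist`, via union-find.

    Classes with fewer than `min_size` reads are discarded.
    """
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            same_len = len(reads[i]) == len(reads[j])
            if same_len and hamming(reads[i], reads[j]) <= max_dist:
                parent[find(i)] = find(j)  # merge the two classes

    groups = {}
    for i, read in enumerate(reads):
        groups.setdefault(find(i), []).append(read)
    return [g for g in groups.values() if len(g) >= min_size]
```

The second rule (a fixed number of classes n) would need a full agglomerative procedure that stops at n classes and is not covered by this sketch.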
1-3, feature extraction for each class: combine the sequencing data making up each class and compute over it, extracting features used to quickly determine the distance between a short sequence and that class.
The sample sequencing data of each class is assembled into one concatenated long sequence, in which the short sequences of the class are joined one after another in order, with a separator symbol between adjacent sequences. A probabilistic suffix array algorithm builds a suffix array from the long sequence of each class and calculates the branch probability of each node. This step outputs a data structure recording the suffix array and branch probabilities of the class's long sequence, which is the feature of that class of sample sequencing data.
In the feature extraction method, the sequences in a class are linearly combined, and the weight of each subsequence is determined by its frequency and probability of occurrence.
The probabilistic suffix array (PSA) is an implementation of variable-length Markov models (VLMMs) based on the traditional suffix array. Like a suffix array, a PSA can represent all N(N + 1)/2 substrings from root to leaf. In the VLMM implemented by the PSA model, the depth of each node's string represents the length of the substring. By limiting the depth of the leaf nodes, strings of the same length can be represented, and so can the conditional probability of a given state occurring after a given sequence, which is the transition probability in the transition matrix. A transition probability is the relative frequency of a symbol, computed from the observed data, given the substring on the path that precedes it. The conditional probability determined by the substring length can be calculated by following a specific path in the PSA model. Because a suffix array records the string of each node by marking its start and end positions, and has N nodes, it can represent all N(N + 1)/2 substrings from root to leaf in linear space. A transition matrix with N(N + 1)/2 transition probabilities can therefore be represented by a probabilistic suffix array, a linear data structure with N nodes.
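To make the idea of transition probabilities as relative frequencies concrete, the sketch below builds a small variable-length Markov model for one class. It deliberately swaps the suffix array for a plain dictionary of context counts, so it shows the probabilities a PSA would encode but not its linear-space layout. The separator symbol, the `max_depth` limit and the function names are assumptions.

```python
from collections import defaultdict

SEP = "$"  # hypothetical separator symbol placed between reads


def build_vlmm(reads, max_depth=3):
    """Count next-symbol frequencies for every context up to `max_depth`."""
    text = SEP.join(reads)  # the concatenated long sequence of the class
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(text):
        if symbol == SEP:
            continue
        for depth in range(max_depth + 1):
            if depth > i:
                break
            context = text[i - depth:i]
            if SEP in context:
                break  # contexts must not span two reads
            counts[context][symbol] += 1
    return counts


def transition_prob(counts, context, symbol):
    """Relative frequency of `symbol` after the longest known suffix of `context`."""
    while context and context not in counts:
        context = context[1:]  # back off to a shorter context
    following = counts.get(context, {})
    total = sum(following.values())
    return following.get(symbol, 0) / total if total else 0.0
```

For the class ["ACAC", "ACAG"], the symbol after "A" is "C" three times and "G" once, so `transition_prob(model, "A", "C")` is 0.75.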
Balancing the load of each computation node of the distributed system in step 2 comprises the following steps:
2-1, determine the distance between each piece of sequencing data and each sample cluster, and from the calculated distances determine the clusters to which each piece of sequencing data belongs;
For a given piece of sequencing data s, its similarity to the probabilistic suffix array extracted from each class in the sample is calculated as follows: starting from the root node, the nodes in the PSA are visited and matched against s, and the matching probability between s and the probabilistic suffix array is calculated from the transition probabilities of the matched nodes. The 3 classes closest to s by matching probability are then taken as its assigned clusters.
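The selection of the 3 closest classes might look like the sketch below. It assumes each class model is available as a dictionary mapping a context string to next-symbol probabilities, scores a read by summing log transition probabilities with a fixed context length, and uses a small floor for unseen transitions; these representation choices and all names are assumptions, not the PSA traversal described above.

```python
import heapq
import math


def match_log_prob(read, model, order=2, floor=1e-6):
    """Sum of log transition probabilities of `read` under one class model.

    `model` maps a context string to {next_symbol: probability}.
    Unseen or zero-probability transitions are scored with `floor`.
    """
    total = 0.0
    for i, symbol in enumerate(read):
        context = read[max(0, i - order):i]
        p = model.get(context, {}).get(symbol, 0.0)
        total += math.log(max(p, floor))
    return total


def assign_clusters(read, models, k=3):
    """Return the ids of the k class models that best match `read`."""
    scores = {cid: match_log_prob(read, m) for cid, m in models.items()}
    return heapq.nlargest(k, scores, key=scores.get)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied along a read.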
2-2, perform the load balancing calculation according to the quantity of sequencing data in each cluster. The standard computing capability is calculated from the number and processing capability of the computation nodes; the computing capability of each node is an integer multiple of the standard computing capability, at least 1. The cluster standard data amount is calculated from the cluster sequencing data quantities; the size of each cluster is an integer multiple of the standard data amount, at least 1.
In step 2-2 the cluster standard data amount is obtained from the cluster sequencing data quantities: cluster standard data amount = min(differences between the sequencing data quantities of the classes), and the number of standard data amounts for each cluster = that cluster's sequencing data quantity / standard data amount, with any fractional part rounded up.
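The unit calculation above can be worked through in code. This sketch interprets "min(differences between the sequencing data quantities of the classes)" as the smallest non-zero pairwise difference between class sizes, and falls back to the smallest class size when all classes are equal; both points are assumptions, since the text does not say how ties are handled.

```python
import math
from itertools import combinations


def standard_amount(cluster_sizes):
    """Smallest non-zero pairwise difference between class sizes.

    Falls back to the smallest class size when all sizes are equal
    (an assumption; the source does not cover this case).
    """
    diffs = [abs(a - b) for a, b in combinations(cluster_sizes, 2) if a != b]
    return min(diffs) if diffs else min(cluster_sizes)


def standard_units(cluster_sizes):
    """Size of each class in standard data amounts, rounded up, at least 1."""
    std = standard_amount(cluster_sizes)
    return [max(1, math.ceil(size / std)) for size in cluster_sizes]
```

For class sizes 1000, 1500 and 2300 the pairwise differences are 500, 1300 and 800, so the standard data amount is 500 and the classes weigh 2, 3 and 5 units.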
2-3, assign clusters to specified computation nodes according to the load balancing result;
In step 2-3 the load balancing calculation is performed according to the class standard data amounts and the standard computing capability of the computation nodes, determining the computation node for each cluster's data, such that one computation node processes the data of one or more clusters and one cluster can be assigned to only one computation node. The calculation is: first compute the ratio between one standard computing capability and the standard data amount, then compute the allocation using a knapsack problem solution.
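The patent names a knapsack problem solution without giving details. As a stand-in, the sketch below uses a simple greedy heuristic (largest cluster first, placed on the node whose load relative to its capacity would stay lowest). It is not an exact knapsack solver and may produce a suboptimal allocation; the function and parameter names are hypothetical.

```python
def assign_clusters_to_nodes(cluster_units, node_capacity):
    """Greedy load balancing in which each cluster goes to exactly one node.

    cluster_units: {cluster_id: size in standard data amounts}
    node_capacity: {node_id: capability in standard computing units, >= 1}
    Returns (assignment, load) where load is in standard data amounts.
    """
    load = {node: 0 for node in node_capacity}
    assignment = {}
    # Place the largest clusters first so small ones can even things out.
    for cid, units in sorted(cluster_units.items(), key=lambda kv: -kv[1]):
        node = min(load, key=lambda n: (load[n] + units) / node_capacity[n])
        assignment[cid] = node
        load[node] += units
    return assignment, load
```

A node may end up with several clusters, but no cluster is ever split across nodes, which matches the constraint stated in step 2-3.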
2-4, transmit the sequencing data to the corresponding computation nodes. The sequencing data is sent to the computation nodes according to the load balancing method above. Before the data is sent, it is preprocessed to determine which node each piece of sequencing data should go to. For transmission, the sequencing data is compressed with a compression algorithm and a check code is added to ensure that the compressed package suffers no loss in transit. The compressed sequencing data package is sent to the corresponding computation node with an FTP program. The computation node first verifies the received data; once it confirms there is no transmission error, it decompresses the package into its working directory, completing reception. If a transmission error is found, the node sends a retransmission request so that the main node sends that node's compressed data package again.
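The compress, check and verify cycle can be sketched with the Python standard library. The patent does not name a compression algorithm or check code, so zlib and SHA-256 are assumptions here, as are the function names; the FTP transfer itself is not shown.

```python
import hashlib
import zlib

DIGEST_LEN = 32  # length of a SHA-256 digest in bytes


def pack(reads):
    """Compress reads and prepend a SHA-256 digest of the compressed bytes."""
    payload = zlib.compress("\n".join(reads).encode("ascii"))
    return hashlib.sha256(payload).digest() + payload


def unpack(package):
    """Verify the digest, then decompress; raise ValueError on a mismatch."""
    digest, payload = package[:DIGEST_LEN], package[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        # The receiving node would send a retransmission request here.
        raise ValueError("checksum mismatch, request retransmission")
    return zlib.decompress(payload).decode("ascii").split("\n")
```

The receiver calls `unpack` on each package and treats a `ValueError` as the signal to ask the main node for the package again.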
Performing distributed error correction on the sequencing data in step 3 comprises the following steps:
3-1, each computation node applies an error correction algorithm to its sequencing data and calculates scores. After the data is distributed according to the load balancing scheme, and assuming all computation nodes have the same processing capability, each computation node has to process about 3N/m sequences, where N is the total amount of sequencing data and m is the number of nodes; the factor 3 arises because each sequence is assigned to its 3 closest clusters. When the number of nodes m is fairly large (for example m = 50), the amount of data each node processes is greatly reduced, and a conventional sequencing data error correction algorithm can then be called to process the sequencing data and return error correction scores based on the results.
For example, in step 3-1, after a computation node receives its sequencing data, it runs the HiTEC error correction algorithm to identify errors in the sequencing data on that node, calculates the error probability and error positions of each short sequence, and calculates error correction scores.
3-2, comprehensive decision: the score data is aggregated and the error correction scheme is determined from the scores of the computation nodes.
In step 3-2, each computation node uses the sequence number as the key and the score data as the value and redistributes them to the computation nodes with a hash function, so that the three error correction scores of the same sequence are all sent to the same computation node. An election algorithm then determines the error correction scheme for that sequence, and the determined error correction schemes are collected and returned as the error correction result.
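The redistribution and election in step 3-2 resemble a shuffle followed by a per-key reduction. The sketch below assumes each score record is a (sequence number, proposed correction, score) tuple, uses CRC32 as the hash function, and elects the highest-scoring proposal; the record layout, the hash choice and the function names are assumptions.

```python
import zlib
from collections import defaultdict


def target_node(seq_id, num_nodes):
    """Stable hash partition: the same sequence number always maps to the same node."""
    return zlib.crc32(str(seq_id).encode("utf-8")) % num_nodes


def shuffle(records, num_nodes):
    """Group (seq_id, correction, score) records by destination node."""
    buckets = defaultdict(list)
    for seq_id, correction, score in records:
        buckets[target_node(seq_id, num_nodes)].append((seq_id, correction, score))
    return buckets


def elect(records):
    """For each sequence number keep the correction with the highest score."""
    best = {}
    for seq_id, correction, score in records:
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (correction, score)
    return {seq_id: correction for seq_id, (correction, _) in best.items()}
```

Because the partition depends only on the sequence number, all three scores for a sequence land in the same bucket, so each node can run `elect` on its own bucket without further communication.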

Claims (10)

1. A mass sequencing data error correction method running on a distributed system, the distributed system comprising a main node, a switch, and a plurality of computation nodes, the computation nodes being connected to the main node via the switch, characterized in that the method comprises the following steps:
1) preprocessing the sequencing data and determining the grouping standard of the sequencing data;
2) partitioning the sequencing data, balancing the load of each computation node of the distributed system, and transmitting the sequencing data to the computation nodes;
3) performing distributed error correction on the sequencing data.
2. The mass sequencing data error correction method running on a distributed system according to claim 1, characterized in that determining the grouping standard of the sequencing data in step 1 specifically comprises the following steps:
1-1, data sampling: sampling the data according to the characteristics of the sequencing data to be processed, ensuring that the sampled sequencing data is reasonably representative;
1-2, clustering of the sampled sequencing data: applying a sequence similarity algorithm to calculate the similarity between the short sequences of the sampled sequencing data, and applying a statistical method to group the sampled sequencing data into classes of similar sequences;
1-3, feature extraction for each class: combining the sequencing data making up each class and computing over it, extracting features used to quickly determine the distance between a short sequence and that class.
3. The mass sequencing data error correction method running on a distributed system according to claim 2, characterized in that balancing the load of each computation node of the distributed system in step 2 comprises the following steps:
2-1, determining the distance between each piece of sequencing data and each sample cluster, and from the calculated distances determining the clusters to which each piece of sequencing data belongs;
2-2, performing the load balancing calculation according to the quantity of sequencing data in each cluster;
2-3, assigning clusters to specified computation nodes according to the load balancing result;
2-4, transmitting the sequencing data to the corresponding computation nodes.
4. The mass sequencing data error correction method running on a distributed system according to claim 3, characterized in that in step 2-2 the cluster standard data amount is obtained from the cluster sequencing data quantities: cluster standard data amount = min(differences between the sequencing data quantities of the classes), and the number of standard data amounts for each cluster = that cluster's sequencing data quantity / standard data amount, with any fractional part rounded up.
5. The mass sequencing data error correction method running on a distributed system according to claim 3, characterized in that in step 2-3 the load balancing calculation is performed according to the class standard data amounts and the standard computing capability of the computation nodes, determining the computation node for each cluster's data, such that one computation node processes the data of one or more clusters and one cluster can be assigned to only one computation node.
6. The mass sequencing data error correction method running on a distributed system according to claim 3, characterized in that in step 2-4, when the sequencing data is transmitted, it is compressed with a compression algorithm and a check code is added.
7. The mass sequencing data error correction method running on a distributed system according to claim 1, characterized in that in step 2, before the data is sent to the computation nodes, it is preprocessed to determine which node each piece of sequencing data should be sent to.
8. The mass sequencing data error correction method running on a distributed system according to claim 1, characterized in that performing distributed error correction on the sequencing data in step 3 comprises the following steps:
3-1, each computation node applies an error correction algorithm to its sequencing data and calculates scores;
3-2, comprehensive decision: the score data is aggregated and the error correction scheme is determined from the scores of the computation nodes.
9. The mass sequencing data error correction method running on a distributed system according to claim 8, characterized in that in step 3-1, after a computation node receives its sequencing data, it runs the HiTEC error correction algorithm to identify errors in the sequencing data on that node, calculates the error probability and error positions of each short sequence, and calculates error correction scores.
10. The mass sequencing data error correction method running on a distributed system according to claim 8, characterized in that in step 3-2 each computation node uses the sequence number as the key and the score data as the value and redistributes them to the computation nodes with a hash function, so that the three error correction scores of the same sequence are all sent to the same computation node; an election algorithm determines the error correction scheme for that sequence, and the determined error correction schemes are collected and returned as the error correction result.
CN201611186654.3A 2016-12-20 2016-12-20 Massive sequencing data error correction method running on a distributed system Active CN106599617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611186654.3A CN106599617B (en) 2016-12-20 2016-12-20 Massive sequencing data error correction method running on a distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611186654.3A CN106599617B (en) 2016-12-20 2016-12-20 Massive sequencing data error correction method running on a distributed system

Publications (2)

Publication Number Publication Date
CN106599617A true CN106599617A (en) 2017-04-26
CN106599617B CN106599617B (en) 2019-02-15

Family

ID=58600461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611186654.3A Active CN106599617B (en) 2016-12-20 2016-12-20 Massive sequencing data error correction method running on a distributed system

Country Status (1)

Country Link
CN (1) CN106599617B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737696A (en) * 2019-10-12 2020-01-31 北京百度网讯科技有限公司 Data sampling method, device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130173738A1 (en) * 2012-01-04 2013-07-04 International Business Machines Corporation Administering Globally Accessible Memory Space In A Distributed Computing System
CN104270437A (en) * 2014-09-25 2015-01-07 中国科学院大学 Mass data processing and visualizing system and method of distributed mixed architecture
CN104615752A (en) * 2015-02-12 2015-05-13 北京嘀嘀无限科技发展有限公司 Information classification method and system
US20160180018A1 (en) * 2014-10-28 2016-06-23 Bisn Laboratory Services Ltd. Molecular and bioinformatics methods for direct sequencing
CN106022002A (en) * 2016-05-17 2016-10-12 杭州和壹基因科技有限公司 Three-generation PacBio sequencing data-based hole filling method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
江育娥等: "下一代测序纠错方法综述", 《北京工业大学学报》 *

Also Published As

Publication number Publication date
CN106599617B (en) 2019-02-15

Similar Documents

Publication Publication Date Title
US10381106B2 (en) Efficient genomic read alignment in an in-memory database
Keller et al. Including RNA secondary structures improves accuracy and robustness in reconstruction of phylogenetic trees
Georganas et al. meraligner: A fully parallel sequence aligner
US20090119313A1 (en) Determining structure of binary data using alignment algorithms
WO2017120128A1 (en) Systems and methods for adaptive local alignment for graph genomes
PEREIRA de Sousa
Ng et al. Reconfigurable acceleration of genetic sequence alignment: A survey of two decades of efforts
CN112735528A (en) Gene sequence comparison method and system
EP2759952A1 (en) Efficient genomic read alignment in an in-memory database
US20180247016A1 (en) Systems and methods for providing assisted local alignment
CN115146865A (en) Task optimization method based on artificial intelligence and related equipment
Sirén et al. Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit
von Haeseler et al. Network models for sequence evolution
Gupta et al. Fast processing and querying of 170tb of genomics data via a repeated and merged bloom filter (rambo)
Ionescu et al. Local rank distance
Ng et al. Acceleration of short read alignment with runtime reconfiguration
CN114491081A (en) Electric power data tracing method and system based on data blood relationship graph
CN112559482B (en) Binary data classification processing method and system based on distribution
CN106599617A (en) Mass sequencing data error correcting method applied to distributed system
US20120278362A1 (en) Taxonomic classification system
CN106021992A (en) Computation pipeline of location-dependent variant calls
CN103699819A (en) Peak expanding method for multistep bidirectional De Bruijn image-based elongating kmer inquiry
Saeed et al. A high performance multiple sequence alignment system for pyrosequencing reads from multiple reference genomes
Muggli et al. A succinct solution to Rmap alignment
Esmat et al. A parallel hash‐based method for local sequence alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant