CN106599617B - A massive sequencing data error correction method running on a distributed system - Google Patents

A massive sequencing data error correction method running on a distributed system

Info

Publication number
CN106599617B
Authority
CN
China
Prior art keywords
sequencing data
data
sequencing
compute node
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201611186654.3A
Other languages
Chinese (zh)
Other versions
CN106599617A (en)
Inventor
林劼
江育娥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University
Priority to CN201611186654.3A
Publication of CN106599617A
Application granted
Publication of CN106599617B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention discloses a massive sequencing data error correction method that runs on a distributed system. The distributed system comprises a master node, a switch, and several compute nodes, the compute nodes being connected to the master node through the switch. The method comprises the following steps: 1) preprocessing the sequencing data to determine the grouping criteria for the sequencing data; 2) partitioning the sequencing data, balancing the load across the compute nodes of the distributed system, and transmitting the sequencing data to the compute nodes; 3) performing distributed error correction on the sequencing data. Compared with a centralized system, the method is faster, more accurate, and less costly when processing massive sequencing data.

Description

A massive sequencing data error correction method running on a distributed system
Technical Field
The present invention relates to the interdisciplinary field between biological gene technology and computer science and technology, and in particular to a massive sequencing data error correction method that runs on a distributed system.
Background
Next-generation high-throughput sequencing (next generation sequencing, NGS; often called second-generation sequencing or new-generation sequencing in Chinese) makes whole-genome analysis and personalized genomic medicine possible. Compared with traditional Sanger sequencing, next-generation sequencing is fast and inexpensive, but its drawback is that it produces a large number of short sequences, and these carry errors. Because of the limitations of experimental technique, errors in these short sequences are unavoidable. If they are not corrected before sequence assembly, the assembly algorithm works from erroneous data and the quality of the final sequence suffers. Correcting the short sequence data before it is assembled into long sequences (contigs) is therefore a very important step, and a precondition and guarantee for assembling reliable long sequences.
Errors in sequencing data have long been a major problem for sequence quality and downstream analysis. The error rate in next-generation sequencing is related to base quality and is jointly affected by multiple factors such as the sequencer itself, the sequencing reagents, and the sample. Sequencing errors not only interfere with normal assembly of the sequencing data but also prevent correct identification of the genetic polymorphisms present in the sample, making it difficult to obtain valuable results. Because the sequencing experiment is complex and each stage involves many uncontrollable random factors, it is difficult to eliminate sequencing errors completely through standardization and improvement of experimental technique alone.
Next-generation sequencing breaks the whole sequence to be measured into short fragments (reads) and measures each fragment many times. All error correction methods rest on one premise: most of the reads produced by sequencing are correct, and only a minority contain errors. For example, suppose that during correction there are M identical copies of sequence A and N identical copies of sequence B, and that the Hamming distance between A and B is within a prescribed threshold (y). In that case A and B are generally considered to come from the same region of the original sequence. The values M and N are then compared: the more numerous sequence is taken to be correct, and the less numerous one is corrected to match it.
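The majority premise above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: reads are equal-length strings, and the function names `hamming` and `correct_by_majority` are hypothetical and do not come from the patent.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def correct_by_majority(reads, y):
    """Replace each read by a more frequent read within Hamming distance y, if any."""
    counts = Counter(reads)  # M, N, ... : how often each distinct read occurs
    corrected = []
    for read in reads:
        best = read
        for other, n in counts.items():
            close = len(other) == len(read) and hamming(read, other) <= y
            if close and n > counts[best]:
                best = other  # the more numerous neighbour is taken as correct
        corrected.append(best)
    return corrected

reads = ["ACGTAC"] * 5 + ["ACGTAA"] * 1 + ["TTGGCC"] * 2
print(correct_by_majority(reads, y=1))
```

Here the single copy of ACGTAA lies within distance 1 of the five copies of ACGTAC and is rewritten to match them, while TTGGCC has no close neighbour and is left unchanged.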
Three kinds of error correction methods are mainly in use at present: (1) methods based on the k-spectrum; (2) methods based on suffix trees or suffix arrays; (3) methods based on multiple sequence alignment (MSA).
Existing error correction algorithms have high computational complexity, low execution efficiency, and very high demands on computing resources, which makes them unsuitable for massive sequencing data. Processing massive data requires large amounts of memory and very long running times. In particular, when whole-sequence sequencing generates massive data, an ordinary server cannot provide enough memory or computing power, and a supercomputer is needed.
Summary of the Invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a massive sequencing data error correction method that runs on a distributed system.
The technical solution adopted by the present invention is as follows:
A massive sequencing data error correction method running on a distributed system, wherein the distributed system comprises a master node, a switch, and several compute nodes, the compute nodes being connected to the master node through the switch, and the method comprises the following steps:
1) preprocessing the sequencing data to determine the grouping criteria for the sequencing data;
2) partitioning the sequencing data, balancing the load across the compute nodes of the distributed system, and transmitting the sequencing data to the compute nodes;
3) performing distributed error correction on the sequencing data.
Determining the grouping criteria for the sequencing data in step 1 specifically comprises the following steps:
1-1, data sampling: sample the sequencing data to be processed according to its characteristics, ensuring that the sampled sequencing data is representative;
1-2, clustering the sampled sequencing data: apply a sequence similarity algorithm to compute the similarity between the short sequences in the sample, then apply a statistical method to group the sampled sequencing data into classes of similar sequences;
1-3, extracting features for each class: combine the sequencing data of each class, compute over the combination, and extract features used to quickly determine the distance between a short sequence and that class.
Balancing the load across the compute nodes of the distributed system in step 2 comprises the following steps:
2-1, determine the distance between each piece of sequencing data and each sample cluster, and from the computed distances determine the clusters to which each piece of sequencing data belongs;
2-2, perform the load-balancing calculation according to the amount of sequencing data in each cluster;
2-3, assign clusters to specific compute nodes according to the load-balancing result;
2-4, transmit the sequencing data to the corresponding compute nodes.
In step 2-2, a cluster standard data amount is derived from the amounts of sequencing data in the clusters: cluster standard data amount = min(differences between the sequencing data amounts of the classes), and the number of standard units for each cluster = the sequencing data amount of that cluster / standard data amount, with any fractional part rounded up.
In step 2-3, the load-balancing calculation is performed from the cluster standard data amount and the standard computing capacity of the compute nodes to determine the compute node for each cluster's data, subject to the constraints that one compute node handles the data of one or more clusters and that a cluster can be assigned to only one compute node.
In step 2-4, the sequencing data is compressed with a compression algorithm before transmission, and a checksum is added.
In step 2, before the data is transmitted to the compute nodes, it is first preprocessed to determine which node each piece of sequencing data should be sent to.
Performing distributed error correction on the sequencing data in step 3 comprises the following steps:
3-1, apply an error correction algorithm to the sequencing data on each compute node and compute scores;
3-2, aggregate and decide: collect the score data and determine the error correction scheme from the scores of the compute nodes.
In step 3-1, after a compute node receives sequencing data, it runs the HiTEC error correction algorithm on the node's sequencing data to identify errors, computes the error likelihood and error positions of each short sequence, and computes an error correction score.
In step 3-2, each compute node uses the sequence number as the key and the score data as the value and redistributes them to the compute nodes with a hash function, so that the three error correction scores of the same sequence are all sent to the same compute node. That node determines the error correction scheme for the sequence with an election algorithm, and the determined schemes are aggregated and returned as the error correction result.
By adopting the above technical solution, the invention makes full use of the performance of a distributed computing platform and proposes a solution for correcting errors in massive second-generation biological sequencing data in a distributed environment. The method takes full account of how the reasonableness of the data distribution and the load balance affect the performance of the distributed computing platform. Cluster centers for the sequencing data are determined by clustering a sample, and each piece of sequencing data is assigned by comparison with these centers. Using a unit-based approach, the computing capacity of a unit node and the data amount of a unit cluster are calculated and used as the basis for designing the load-balancing method. After the sequencing data is distributed, each compute node applies an error correction algorithm to its local data to detect errors and score corrections. Because the amount of data each node has to process is greatly reduced, the error correction efficiency of the whole system improves significantly. Finally, the scores are aggregated and the highest-scoring correction is elected as the error correction scheme.
The distributed solution for correcting errors in massive sequencing data provides bioinformatics with a practical sequencing data error correction tool and offers a new approach for other massive-data applications, thereby enriching research on distributed computing and advancing research and development in bioinformatics and high-performance parallel computing. Compared with a centralized system, the method is faster, more accurate, and less costly when processing massive sequencing data.
Brief Description of the Drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments;
Fig. 1 is a schematic diagram of the distributed system architecture of the present invention;
Fig. 2 is a flow diagram of the massive sequencing data error correction method of the present invention running on a distributed system.
Detailed Description
As shown in Fig. 1 and Fig. 2, the present invention discloses a massive sequencing data error correction method running on a distributed system. The distributed system comprises a master node, a switch, and several compute nodes, the compute nodes being connected to the master node through the switch. Each compute node can be a PC server or a desktop PC, so the hardware requirements are low, and because load balancing is performed, the compute nodes do not need to have a uniform configuration.
The massive sequencing data error correction method comprises the following steps:
1) preprocessing the sequencing data to determine the grouping criteria for the sequencing data;
2) partitioning the sequencing data, balancing the load across the compute nodes of the distributed system, and transmitting the sequencing data to the compute nodes;
3) performing distributed error correction on the sequencing data.
Determining the grouping criteria for the sequencing data in step 1 specifically comprises the following steps:
1-1, data sampling: sample the sequencing data to be processed according to its characteristics, ensuring that the sampled sequencing data is representative;
The sequencing data to be processed is sampled at random. The sample size is N/m^3, where N is the total number of sequences and m is the number of compute nodes, and is at least about 1000-3000 sequences. The sampled data is then simulated with the Monte Carlo method, finally yielding a simulated sequencing data set of 1000-2000 sequences that represents the overall characteristics of the sequencing data.
1-2, clustering the sampled sequencing data: apply a sequence similarity algorithm to compute the similarity between the short sequences in the sample, then apply a statistical method to group the sampled sequencing data into classes of similar sequences;
The Hamming distance between the short sequences of the sample data set is computed. A hierarchical clustering algorithm then groups the sampled sequencing data into classes of similar sequences according to Hamming distance. Two clustering principles are available. Under the first, sequences within a Hamming distance of 5 are grouped into the same class; if a class contains fewer than 3 short sequences, it has too few samples and is discarded. Under the second, the number of clusters n is set in advance, generally n > 6m, and the hierarchical clustering algorithm groups the sample data into n classes.
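The first clustering principle can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes equal-length reads and single-linkage merging (two reads end up in the same class if a chain of reads within distance 5 connects them), which is one possible reading of "hierarchical clustering" since the linkage rule is not specified. The name `cluster_reads` and its parameters are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, max_dist=5, min_size=3):
    """Single-linkage clustering cut at max_dist; classes below min_size are dropped."""
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge every pair of reads whose Hamming distance is within the threshold.
    for i, j in combinations(range(len(reads)), 2):
        if hamming(reads[i], reads[j]) <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i, read in enumerate(reads):
        groups.setdefault(find(i), []).append(read)
    # A class with too few samples is discarded.
    return [g for g in groups.values() if len(g) >= min_size]

sample = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT",
          "CCCCCCCCCC", "CCCCCCCCCG",
          "GGGGGGGGGG", "GGGGGGGGGT", "GGGGGGGGTT", "GGGGGGGTTT"]
for cls in cluster_reads(sample):
    print(len(cls), cls)
```

The pairwise comparison is quadratic in the sample size, which is acceptable here only because the sample is limited to a few thousand sequences.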
1-3, extracting features for each class: combine the sequencing data of each class, compute over the combination, and extract features used to quickly determine the distance between a short sequence and that class.
The sample sequencing data of each class is concatenated into one long sequence: the short sequences in the class are joined one after another in order, with a separator symbol between adjacent sequences. The probabilistic suffix array algorithm is applied to build a suffix array from the long sequence of each class and to compute the branch probability at each node. This step outputs a data structure that records the suffix array built from the class's long sequence together with the branch probabilities; this is the feature of each class of sample sequencing data.
To extract the data features, the sequences in a class are combined linearly, and the weight of each subsequence is determined from its frequency of occurrence and its probability.
The probabilistic suffix array (PSA) is an implementation of variable-length Markov models (VLMMs) built on the traditional suffix array. Like a suffix array, a PSA can represent all N(N+1)/2 substrings from root to leaf. In a VLMM implemented with the PSA model, the depth of the string at each node represents the length of the substring. By limiting the depth of the leaf nodes, the model can represent strings of equal length, and it can also represent the conditional probability of a state occurring given a sequence, which is the transition probability in the transition matrix. A transition probability is a relative frequency computed from the symbol on the given path and the symbols of the preceding substring path in the observed data. The conditional probability determined by the substring length can be computed along a path identified in the PSA model. Because a suffix array records the string at each node by marking its start and end positions, and a suffix array has N nodes, all N(N+1)/2 substrings from root to leaf can be represented in linear space. A transition matrix with N(N+1)/2 transition probabilities can therefore be represented by a probabilistic suffix array, a linear data structure with N nodes.
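The transition probabilities described above are relative frequencies of a symbol following a context. The sketch below shows that idea only; it swaps the probabilistic suffix array for a plain dictionary of context counts, so it does not have the linear-space property of a PSA. The names `train_vlmm`, `transition_prob`, and `max_depth` are hypothetical, and counting is done per read rather than on the concatenated long sequence.

```python
from collections import defaultdict

def train_vlmm(reads, max_depth=3):
    """Count next-symbol occurrences after every context of length 0..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i, symbol in enumerate(read):
            # Contexts are the 0..max_depth symbols immediately before position i.
            for k in range(min(i, max_depth) + 1):
                counts[read[i - k:i]][symbol] += 1
    return counts

def transition_prob(counts, context, symbol):
    """Relative frequency of `symbol` after `context` (0.0 if the context is unseen)."""
    following = counts.get(context)
    if not following:
        return 0.0
    return following.get(symbol, 0) / sum(following.values())

model = train_vlmm(["ACGT", "ACGA"])
print(transition_prob(model, "AC", "G"))   # G always follows AC in these reads
print(transition_prob(model, "ACG", "T"))  # T follows ACG in one of two reads
```

Limiting `max_depth` plays the role of limiting leaf depth in the text: it bounds the context length and therefore the order of the Markov model.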
Balancing the load across the compute nodes of the distributed system in step 2 comprises the following steps:
2-1, determine the distance between each piece of sequencing data and each sample cluster, and from the computed distances determine the clusters to which each piece of sequencing data belongs;
Given a piece of sequencing data s, its similarity to the probabilistic suffix array extracted from each class of the sample is computed. Specifically, starting from the root node, the nodes of the PSA are traversed to find the matching node, and the matching probability between s and the probabilistic suffix array is computed from the transition probabilities of that node. The 3 classes most similar to s by matching probability are then taken as the clusters to which s belongs.
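Choosing the 3 home clusters for a read can be sketched as below. This is an illustration under several assumptions that the patent does not state: each class model is a dictionary of context counts rather than a PSA, the matching probability is accumulated as a log-probability, unseen contexts back off to shorter ones, and a small floor probability stands in for symbols never seen in a class. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict

def train(reads, depth=2):
    """Context counts for one class: counts[context][next_symbol]."""
    counts = defaultdict(lambda: defaultdict(int))
    for r in reads:
        for i, s in enumerate(r):
            for k in range(min(i, depth) + 1):
                counts[r[i - k:i]][s] += 1
    return counts

def log_match(model, read, depth=2, floor=1e-6):
    """Log matching probability of `read`, backing off to shorter contexts."""
    total = 0.0
    for i, s in enumerate(read):
        p = floor  # assumption: floor for symbols the class never produced
        for k in range(min(i, depth), -1, -1):  # longest context first
            following = model.get(read[i - k:i])
            if following and following.get(s, 0) > 0:
                p = following[s] / sum(following.values())
                break
        total += math.log(p)
    return total

def home_clusters(models, read, top=3):
    """Names of the `top` classes with the highest matching probability."""
    ranked = sorted(models, key=lambda name: log_match(models[name], read),
                    reverse=True)
    return ranked[:top]

models = {
    "poly_a": train(["AAAAAA", "AAAAAT"]),
    "poly_c": train(["CCCCCC"]),
    "poly_g": train(["GGGGGG"]),
    "mixed": train(["ACGTAC", "CGTACG"]),
}
print(home_clusters(models, "AAAAAA"))
```

Because every read is sent to 3 clusters, each read is later corrected three times, which is where the three scores per sequence in step 3-2 come from.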
2-2, perform the load-balancing calculation according to the amount of sequencing data in each cluster. From the number and processing capacity of the compute nodes, a standard computing capacity is calculated; the computing capacity of each node is an integer multiple of the standard computing capacity, at least 1. From the cluster sequencing data amounts, the cluster standard data amount is calculated; the size of each cluster is an integer multiple of the standard data amount, at least 1.
In step 2-2, the cluster standard data amount is derived from the amounts of sequencing data in the clusters: cluster standard data amount = min(differences between the sequencing data amounts of the classes), and the number of standard units for each cluster = the sequencing data amount of that cluster / standard data amount, with any fractional part rounded up.
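The unit calculation of step 2-2 can be worked through in a few lines. This sketch reads "min(differences between the sequencing data amounts of the classes)" as the smallest non-zero pairwise difference; the fallback for the case where all clusters have the same size is an assumption, since the text does not cover it. The function names are hypothetical.

```python
import math
from itertools import combinations

def standard_amount(cluster_sizes):
    """Smallest non-zero pairwise difference between cluster sizes."""
    diffs = [abs(a - b) for a, b in combinations(cluster_sizes, 2) if a != b]
    # Assumption: if all clusters are the same size, use that size as the unit.
    return min(diffs) if diffs else min(cluster_sizes)

def standard_units(cluster_sizes):
    """Size of each cluster in standard units, fractional part rounded up."""
    unit = standard_amount(cluster_sizes)
    return [math.ceil(size / unit) for size in cluster_sizes]

sizes = [1200, 900, 2500, 400]
print(standard_amount(sizes))  # 300, the difference between 1200 and 900
print(standard_units(sizes))   # [4, 3, 9, 2]
```

Rounding up guarantees that every cluster counts for at least 1 unit, matching the requirement that each cluster is an integer multiple of the standard data amount.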
2-3, assign clusters to specific compute nodes according to the load-balancing result;
In step 2-3, the load-balancing calculation is performed from the cluster standard data amount and the standard computing capacity of the compute nodes to determine the compute node for each cluster's data, subject to the constraints that one compute node handles the data of one or more clusters and that a cluster can be assigned to only one compute node. The calculation first computes the ratio between one unit of standard computing capacity and one unit of standard data amount, and then computes the assignment with a knapsack-problem solution.
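The text does not say which knapsack formulation is used, so the sketch below substitutes a simple greedy heuristic: clusters are taken largest first and each is placed on the node whose load relative to its capacity would be lowest after the addition. It keeps the constraint that a cluster goes to exactly one node, but it does not guarantee an optimal assignment or that every node receives a cluster. The names `assign_clusters`, `cluster_units`, and `node_capacity` are hypothetical.

```python
def assign_clusters(cluster_units, node_capacity):
    """Greedy assignment of whole clusters to nodes.

    cluster_units: cluster name -> size in standard data units
    node_capacity: node name -> capacity in standard computing units
    """
    load = {node: 0 for node in node_capacity}
    plan = {node: [] for node in node_capacity}
    # Largest clusters first, so small ones can even out the remaining gaps.
    for cluster, units in sorted(cluster_units.items(), key=lambda kv: -kv[1]):
        node = min(load, key=lambda n: (load[n] + units) / node_capacity[n])
        load[node] += units
        plan[node].append(cluster)
    return plan, load

clusters = {"c1": 9, "c2": 4, "c3": 3, "c4": 2}
nodes = {"n1": 2, "n2": 1}  # n1 has twice the standard computing capacity
plan, load = assign_clusters(clusters, nodes)
print(plan)  # {'n1': ['c1', 'c3'], 'n2': ['c2', 'c4']}
print(load)  # {'n1': 12, 'n2': 6}
```

In this example both nodes end up with the same relative load (12/2 and 6/1), which is the balance the unit-based calculation aims for.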
2-4, transmit the sequencing data to the corresponding compute nodes. The sequencing data is sent to the corresponding compute nodes according to the load-balancing method above. Before the data is transmitted, it is first preprocessed to determine which node each piece of sequencing data should go to. For transmission, the sequencing data is compressed with a compression algorithm and a checksum is added, to ensure that no compressed package is lost or damaged in transit. The compressed packages are sent to the corresponding compute nodes by an FTP program. A compute node first verifies the data it receives; once it confirms there is no transmission error, it decompresses the package into its working directory, completing reception. If a transmission error is found, the node issues a retransmission request so that the master node resends that node's compressed package.
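The compress, verify, and decompress cycle can be sketched with the Python standard library. The patent names neither the compression algorithm nor the checksum, so zlib and SHA-256 are assumptions made here for illustration, and the FTP transfer itself is left out. The names `pack` and `unpack` are hypothetical.

```python
import hashlib
import zlib

def pack(reads):
    """Master side: compress the reads and compute a checksum of the package."""
    payload = zlib.compress("\n".join(reads).encode("ascii"))
    return payload, hashlib.sha256(payload).hexdigest()

def unpack(payload, checksum):
    """Compute-node side: verify first, then decompress."""
    if hashlib.sha256(payload).hexdigest() != checksum:
        # In the described workflow the node would now request a retransmission.
        raise ValueError("checksum mismatch: request retransmission")
    return zlib.decompress(payload).decode("ascii").split("\n")

payload, digest = pack(["ACGTAC", "TTGGCC"])
print(unpack(payload, digest))  # ['ACGTAC', 'TTGGCC']
```

Verifying the checksum before decompressing means a damaged package is rejected without touching the working directory.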
Performing distributed error correction on the sequencing data in step 3 comprises the following steps:
3-1, apply an error correction algorithm to the sequencing data on each compute node and compute scores. After the data is distributed according to the load-balancing scheme, and assuming every compute node has the same processing capacity, each compute node has about 3N/m sequences to process, where N is the total amount of sequencing data and m is the number of nodes; the factor 3 arises because each sequence is assigned to 3 clusters. When the number of compute nodes m is large (for example m=50), the amount of data each node has to process is greatly reduced. A conventional sequencing data error correction algorithm is then invoked on the data, and an error correction score is returned based on the result.
For example, in step 3-1, after a compute node receives sequencing data, it runs the HiTEC error correction algorithm on the node's sequencing data to identify errors, computes the error likelihood and error positions of each short sequence, and computes an error correction score.
3-2, aggregate and decide: collect the score data and determine the error correction scheme from the scores of the compute nodes.
In step 3-2, each compute node uses the sequence number as the key and the score data as the value and redistributes them to the compute nodes with a hash function, so that the three error correction scores of the same sequence are all sent to the same compute node. That node determines the error correction scheme for the sequence with an election algorithm, and the determined schemes are aggregated and returned as the error correction result.
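The redistribution and election of step 3-2 can be sketched as a small key-value shuffle. The record layout (sequence number, candidate correction, score), the use of CRC32 as the hash function, and the rule that the highest score wins are assumptions for illustration; the patent only states that a hash function and an election algorithm are used. All names are hypothetical.

```python
import zlib
from collections import defaultdict

def target_node(seq_id, num_nodes):
    """Node index for a sequence number; crc32 is stable across processes."""
    return zlib.crc32(seq_id.encode("utf-8")) % num_nodes

def shuffle(records, num_nodes):
    """Group (seq_id, candidate, score) records by the node that will decide them."""
    buckets = defaultdict(list)
    for seq_id, candidate, score in records:
        buckets[target_node(seq_id, num_nodes)].append((seq_id, candidate, score))
    return buckets

def elect(bucket):
    """On one node: keep the highest-scoring candidate for every sequence."""
    best = {}
    for seq_id, candidate, score in bucket:
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (candidate, score)
    return {seq_id: cand for seq_id, (cand, _) in best.items()}

records = [
    ("r1", "ACGT", 0.9), ("r1", "ACGA", 0.4), ("r1", "ACGT", 0.7),
    ("r2", "TTGA", 0.2), ("r2", "TTGC", 0.8), ("r2", "TTGA", 0.5),
]
result = {}
for bucket in shuffle(records, num_nodes=4).values():
    result.update(elect(bucket))
print(result)
```

Because the node index depends only on the sequence number, all three scores for a sequence land in the same bucket, so each election can be decided locally without further communication.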

Claims (7)

1. A massive sequencing data error correction method running on a distributed system, the distributed system comprising a master node, a switch, and several compute nodes, the compute nodes being connected to the master node through the switch, characterized in that the method comprises the following steps:
1) preprocessing the sequencing data to determine the grouping criteria for the sequencing data; determining the grouping criteria for the sequencing data in step 1) specifically comprises the following steps:
1-1, data sampling: sample the sequencing data to be processed according to its characteristics, ensuring that the sampled sequencing data is representative;
1-2, clustering the sampled sequencing data: apply a sequence similarity algorithm to compute the similarity between the short sequences in the sample, then apply a statistical method to group the sampled sequencing data into classes of similar sequences;
1-3, extracting features for each class: combine the sequencing data of each class, compute over the combination, and extract features used to quickly determine the distance between a short sequence and that class;
2) partitioning the sequencing data, balancing the load across the compute nodes of the distributed system, and transmitting the sequencing data to the compute nodes;
balancing the load across the compute nodes of the distributed system in step 2) comprises the following steps:
2-1, determine the distance between each piece of sequencing data and each sample cluster, and from the computed distances determine the clusters to which each piece of sequencing data belongs;
2-2, perform the load-balancing calculation according to the amount of sequencing data in each cluster;
2-3, assign clusters to specific compute nodes according to the load-balancing result;
2-4, transmit the sequencing data to the corresponding compute nodes;
3) performing distributed error correction on the sequencing data;
performing distributed error correction on the sequencing data in step 3) comprises the following steps:
3-1, apply an error correction algorithm to the sequencing data on each compute node and compute scores;
3-2, aggregate and decide: collect the score data and determine the error correction scheme from the scores of the compute nodes.
2. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2-2, a cluster standard data amount is derived from the amounts of sequencing data in the clusters, cluster standard data amount = min(differences between the sequencing data amounts of the classes), and the number of standard units for each cluster = the sequencing data amount of that cluster / standard data amount, with any fractional part rounded up.
3. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2-3, the load-balancing calculation is performed from the cluster standard data amount and the standard computing capacity of the compute nodes to determine the compute node for each cluster's data, subject to the constraints that one compute node handles the data of one or more clusters and that a cluster can be assigned to only one compute node.
4. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2-4, the sequencing data is compressed with a compression algorithm before transmission, and a checksum is added.
5. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 2), before the data is transmitted to the compute nodes, it is first preprocessed to determine which node each piece of sequencing data should be sent to.
6. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 3-1, after a compute node receives sequencing data, it runs the HiTEC error correction algorithm on the node's sequencing data to identify errors, computes the error likelihood and error positions of each short sequence, and computes an error correction score.
7. The massive sequencing data error correction method running on a distributed system according to claim 1, characterized in that: in step 3-2, each compute node uses the sequence number as the key and the score data as the value and redistributes them to the compute nodes with a hash function, so that the three error correction scores of the same sequence are all sent to the same compute node; that node determines the error correction scheme for the sequence with an election algorithm, and the determined schemes are aggregated and returned as the error correction result.
CN201611186654.3A 2016-12-20 2016-12-20 A massive sequencing data error correction method running on a distributed system Expired - Fee Related CN106599617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611186654.3A CN106599617B (en) 2016-12-20 2016-12-20 A kind of magnanimity sequencing data error correcting method running on distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611186654.3A CN106599617B (en) 2016-12-20 2016-12-20 A kind of magnanimity sequencing data error correcting method running on distributed system

Publications (2)

Publication Number Publication Date
CN106599617A CN106599617A (en) 2017-04-26
CN106599617B true CN106599617B (en) 2019-02-15

Family

ID=58600461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611186654.3A Expired - Fee Related CN106599617B (en) 2016-12-20 2016-12-20 A kind of magnanimity sequencing data error correcting method running on distributed system

Country Status (1)

Country Link
CN (1) CN106599617B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737696A (en) * 2019-10-12 2020-01-31 北京百度网讯科技有限公司 Data sampling method, device, electronic equipment and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN104270437A (en) * 2014-09-25 2015-01-07 中国科学院大学 Mass data processing and visualizing system and method of distributed mixed architecture
CN104615752A (en) * 2015-02-12 2015-05-13 北京嘀嘀无限科技发展有限公司 Information classification method and system
CN106022002A (en) * 2016-05-17 2016-10-12 杭州和壹基因科技有限公司 Three-generation PacBio sequencing data-based hole filling method

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US8805952B2 (en) * 2012-01-04 2014-08-12 International Business Machines Corporation Administering globally accessible memory space in a distributed computing system
GB2531741A (en) * 2014-10-28 2016-05-04 Bisn Laboratory Services Ltd Molecular and bioinformatics methods for direct sequencing

Non-Patent Citations (1)

Title
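Review of error correction methods for next-generation sequencing (下一代测序纠错方法综述); Jiang Yu'e et al.; Journal of Beijing University of Technology; 2016-03-31; Vol. 42, No. 3; pp. 377-384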

Also Published As

Publication number Publication date
CN106599617A (en) 2017-04-26

Similar Documents

Publication Publication Date Title
Sirén et al. Indexing graphs for path queries with applications in genome research
US20140214334A1 (en) Efficient genomic read alignment in an in-memory database
Quicke et al. Utility of the DNA barcoding gene fragment for parasitic wasp phylogeny (Hymenoptera: Ichneumonoidea): data release and new measure of taxonomic congruence
EP2759952B1 (en) Efficient genomic read alignment in an in-memory database
CN105243297A (en) Quick comparing and positioning method for gene sequence segments on reference genome
CN104112005B (en) Distributed mass fingerprint identification method
US20170017717A1 (en) Sequence Data Analyzer, DNA Analysis System and Sequence Data Analysis Method
Sirén et al. Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit
Evangelista et al. Assessing support for Blaberoidea phylogeny suggests optimal locus quality
CN110867231A (en) Disease prediction method, device, computer equipment and medium based on text classification
Pei et al. CLADES: A classification‐based machine learning method for species delimitation from population genetic data
CN110600135A (en) Breast cancer prediction system based on improved random forest algorithm
CN104573405B (en) Phylogenetic tree rebuilding method for building sub trees on basis of big trees
CN106599617B (en) A kind of magnanimity sequencing data error correcting method running on distributed system
CN109857892B (en) Semi-supervised cross-modal Hash retrieval method based on class label transfer
CN103440292B (en) Multimedia information retrieval method and system based on bit vectors
CN114821818A (en) Motion data analysis method and system based on intelligent sports
Van Etten et al. A k-mer-based approach for phylogenetic classification of taxa in environmental genomic data
CN114491081A (en) Electric power data tracing method and system based on data blood relationship graph
CN113344125A (en) Long text matching identification method and device, electronic equipment and storage medium
CN111984745A (en) Dynamic expansion method, device, equipment and storage medium for database field
Zou et al. HPTree: reconstructing phylogenetic trees for ultra-large unaligned DNA sequences via NJ model and Hadoop
Borges et al. Distinguishing between spectral clustering and cluster analysis of mass spectra
CN106529212B (en) Biological sequence evolution information extracting method based on sequence dependent Frequency matrix
Fan et al. Coupled feature mapping and correlation mining for cross-media retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190215