CN106228035A - Efficient clustering method based on locality-sensitive hashing and a nonparametric Bayesian method - Google Patents


Info

Publication number
CN106228035A
Authority
CN
China
Prior art keywords
cluster
sequence
mer
clustering method
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610534138.9A
Other languages
Chinese (zh)
Other versions
CN106228035B (en)
Inventor
陈宁
陈挺
蒋林浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201610534138.9A
Publication of CN106228035A
Application granted
Publication of CN106228035B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to an efficient clustering method based on locality-sensitive hashing and a nonparametric Bayesian method. The method can effectively process massive sequence data, including 16S rRNA and 18S rRNA data. Because it uses an efficient block-and-iterate strategy that avoids comparing large numbers of dissimilar sequences, the method can quickly produce clustering results for large-scale datasets, and is currently the most efficient method in the field of bioinformatics for large-scale clustering problems. In addition, because the DP-means algorithm estimates cluster centers more accurately, the clustering results produced by the method achieve high accuracy.

Description

Efficient clustering method based on locality-sensitive hashing and a nonparametric Bayesian method
Technical field
The invention belongs to the field of computer applications (bioinformatics), and specifically relates to an efficient clustering method based on locality-sensitive hashing and a nonparametric Bayesian method.
Background
With the rapid development of second-generation sequencing technology in recent years, large numbers of high-quality DNA/RNA sequencing fragments can be obtained from environmental samples quickly and cheaply. The Earth Microbiome Project obtained 1.3 billion 16S rRNA gene sequences from 15,000 environmental samples collected around the world; the human gut microbiome project group obtained 1.1 billion 16S rRNA gene sequences from 531 test samples. Such large-scale sequencing yields useful information for environmental management, disease research, drug development and so on. When analyzing these data, one of the most basic tasks is to cluster the sequences: individual gene sequences are grouped by similarity into microbial groups (operational taxonomic units, OTUs), which are then computed on and analyzed as a whole.
In practice, the similarity of sequences is usually defined by the edit distance between them. Bioinformatics already offers many software tools for clustering gene sequences, mainly of the following kinds: hierarchical clustering, greedy clustering, and probabilistic-model methods. Software using hierarchical clustering mainly includes DOTUR (Schloss and Handelsman, 2005) and Mothur (Schloss et al., 2009). This approach builds a phylogenetic tree; once an OTU threshold is defined, the clustering result can be read directly off the tree. It usually requires computing the pairwise similarity of all sequences and holding it in memory before the tree is constructed. Its time and space complexity are both very high, so it cannot cope with massive data. Software using greedy algorithms mainly includes CD-HIT (Fu et al., 2012) and UCLUST (Edgar, 2010). These tools take the first sequence as the center of the first cluster (representing the whole cluster), and each subsequent sequence is compared with the existing cluster centers. If it lies within the threshold of a center, it is assigned to that cluster; if its distance to every cluster center exceeds the threshold, it forms a new cluster of its own. This algorithm is so efficient that many current studies use clustering tools based on it. Its drawback is that the clustering result depends heavily on the order of the input sequences, and its accuracy is relatively poor. Software using probabilistic models mainly includes CROP (Hao et al., 2011), which uses an unsupervised Bayesian method: clusters are modeled with a Gaussian mixture model and the model parameters are estimated by MCMC. In practice, this algorithm is more accurate because it uses a soft threshold, but it is very slow because MCMC requires many iterations.
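The order dependence of greedy clustering described above can be illustrated with a minimal Python sketch. It is not the CD-HIT or UCLUST implementation; the function names `edit_distance` and `greedy_cluster` and the "distance less than or equal to threshold" rule are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first center within `threshold`,
    otherwise open a new cluster with that sequence as its center."""
    centers, clusters = [], []
    for s in seqs:
        for idx, c in enumerate(centers):
            if edit_distance(s, c) <= threshold:
                clusters[idx].append(s)
                break
        else:  # no existing center was close enough
            centers.append(s)
            clusters.append([s])
    return clusters


# The same three sequences in two different input orders.
order_a = greedy_cluster(["AAAA", "AAAT", "AATT"], threshold=1)
order_b = greedy_cluster(["AAAT", "AAAA", "AATT"], threshold=1)
```

With these toy inputs the first order yields two clusters and the second yields one, because the choice of the first center changes which sequences fall within the threshold.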
Locality-sensitive hashing (LSH) is a method for improving the efficiency of similarity search by designing hash functions that satisfy a special property, namely locality sensitivity.
Summary of the invention
The technical problem to be solved by the invention is to provide an efficient and accurate gene sequence clustering method that overcomes the inaccuracy of greedy algorithms and the dependence of their results on the input order, while keeping the computation speed comparable to that of existing tools, so that massive gene data can truly be processed quickly and efficiently.
The object of the invention is to provide an efficient clustering method based on locality-sensitive hashing (LSH) and a nonparametric Bayesian method (DP-means). The nonparametric Bayesian method makes the clustering result more accurate than that of traditional greedy algorithms, while the LSH partitioning strategy accelerates the whole clustering process. Specifically, the method comprises the following steps:
S1. Remove all duplicate sequences from the dataset, and convert each gene sequence into a 4^K-dimensional k-mer count vector
General metagenomic 16S rRNA and 18S rRNA samples may contain many identical gene sequences. The number of such sequences indicates the abundance of a microorganism, and within a cluster, a sequence with higher abundance is more likely to be the cluster center. Therefore a weight is set for every sequence, initialized to the number of copies of that sequence in the dataset. In addition, each gene sequence is converted into a 4^K-dimensional k-mer count vector, as preprocessing for the LSH grouping used in the subsequent iterations.
Here, a 4^K-dimensional k-mer count vector is a vector of dimension 4^K in which each dimension corresponds to one k-mer and records the number of times that k-mer occurs in the sequence.
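The deduplication and weight initialization of step S1 can be sketched in a few lines of Python. This is only an illustration under the assumption that reads are plain strings; the function name `deduplicate` is hypothetical.

```python
from collections import Counter


def deduplicate(reads):
    """Collapse identical reads. The weight of each unique sequence is
    its number of copies in the input, as described for step S1."""
    weights = Counter(reads)
    # Counter keeps first-seen order, so the output order is deterministic.
    return list(weights.items())


unique_with_weights = deduplicate(["ACGT", "ACGT", "TTGA", "ACGT"])
```

Each returned pair holds a unique sequence and its weight, which later steps can use when choosing cluster centers.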
S2. Group the k-mer count vectors using an LSH algorithm for the 2-norm (Euclidean) distance, so that similar sequences are assigned to the same group
We assume that sequences with higher similarity have k-mer count vectors that are closer under the 2-norm. The k-mer count vectors are therefore grouped with an LSH algorithm for the 2-norm distance, so that the k-mer vectors within a group are close to each other and the corresponding sequences are highly similar. Thanks to this property, clustering within each individual group avoids distance computations between large numbers of dissimilar sequences. These computations are the most time-consuming part of the whole program, so reducing them greatly improves overall efficiency.
That is, the distance between k-mer count vectors under the 2-norm is taken as an approximation of sequence similarity; mapping the k-mer count vectors with LSH divides the original gene sequences into groups and avoids distance computations between large numbers of dissimilar sequences.
S3. Cluster each group using the DP-means algorithm, and use the Message Passing Interface (MPI) and multithreading to implement parallel processing
DP-means is a nonparametric Bayesian method. Its clustering procedure is very similar to K-means, but through a clustering threshold λ, DP-means learns the number of clusters automatically, so no fixed K needs to be specified. With DP-means we can obtain more accurate clustering results than with greedy algorithms. After LSH has grouped the original data, each group is clustered with DP-means, accelerated by MPI and multithreading.
The DP-means algorithm uses the pairwise edit distance between sequences as the measure of sequence similarity. In the step of DP-means that adjusts the cluster centers, a sampling approach is used: within each cluster, a subset is selected at random with probability proportional to the sequence weights, and the exact cluster center of this subset is taken as the center of the whole cluster.
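The sampling-based center update can be sketched as follows. The text does not specify how the "exact" center of the sampled subset is defined; this sketch assumes it is the medoid, i.e. the sampled sequence minimizing the weighted sum of squared edit distances to the other sampled sequences. The names `sample_center` and `sample_size` are hypothetical.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def sample_center(members, weights, sample_size, rng):
    """Draw a subset with probability proportional to the weights and
    return the medoid of that subset (assumed definition of the center)."""
    k = min(sample_size, len(members))
    # Weighted sampling with replacement; duplicates collapse in the set.
    picked = sorted(set(rng.choices(range(len(members)), weights=weights, k=k)))

    def cost(i):
        return sum(weights[j] * edit_distance(members[i], members[j]) ** 2
                   for j in picked)

    return members[min(picked, key=cost)]


center = sample_center(["AAAA", "AAAT", "AATT"], [5, 1, 1], 2, random.Random(1))
```

Restricting the medoid search to a small sample keeps the update cost at O(s^2) distance computations for a sample of size s, instead of quadratic in the full cluster size.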
S4. Merge the clustering results produced by the groups in step S3, and run steps S2 and S3 iteratively until the overall clustering result converges
The center of each cluster is picked out as the representative sequence of the whole cluster, and these centers form a new set. Because the mapping performed by LSH under a single hash function can, with some probability, assign nearby sequences to two different groups, the LSH grouping and DP-means clustering process is run iteratively on this new set of cluster centers. This ensures that similar sequences have a high probability of being placed in the same group and hence clustered together by DP-means, so that the overall clustering result converges.
This iterative process is equivalent to the step in the LSH algorithm that performs multiple mappings with multiple hash tables. If we assume that DP-means always gives the correct clustering result within each group, the total number of iterations needed can be estimated from LSH theory.
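One way to make the iteration estimate concrete, under assumptions not stated in the text: if two similar sequences land in the same group with probability p in each round, and rounds are independent, the probability that they are never grouped together in L rounds is (1 - p)^L. Requiring this to be at most a tolerance δ gives L ≥ ln δ / ln(1 - p). The function name and parameters below are hypothetical.

```python
import math


def rounds_needed(p_collide: float, miss_prob: float) -> int:
    """Smallest number of independent LSH rounds L such that
    (1 - p_collide) ** L <= miss_prob."""
    if not (0.0 < p_collide < 1.0 and 0.0 < miss_prob < 1.0):
        raise ValueError("both probabilities must lie strictly between 0 and 1")
    return math.ceil(math.log(miss_prob) / math.log(1.0 - p_collide))


# Example: 50% collision chance per round, at most 1% chance of missing a pair.
rounds = rounds_needed(0.5, 0.01)
```

For these example values seven rounds suffice, since 0.5^7 is below 0.01 while 0.5^6 is not.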
S5. Finally, hash all remaining gene sequences by a longer k-mer (Big-Kmer Mapping), cluster the sequences that share an identical k-mer once more, and obtain the final clustering result
As the LSH iteration approaches convergence, most sequences within an LSH group are dissimilar, so only a few sequences are clustered in each iteration and efficiency drops. In this situation another hashing strategy, Big-Kmer Mapping, is used: all remaining sequences are hashed by a longer k-mer, and the sequences sharing an identical k-mer are clustered one final time. Because the clustering result is already close to convergence, the amount of clustering work at this point is small and the step completes quickly. In addition, this step handles another extreme case, in which sequence lengths in the dataset differ too much. Such a difference can cause two similar sequences to be placed in different groups by LSH, and Big-Kmer Mapping resolves this situation.
In applications, the k-mer length is typically chosen as 25% of the mean gene sequence length in the dataset, which ensures that within a true cluster the common substring of most sequences exceeds this length. Step S5 thus effectively addresses both the failure of LSH when sequence lengths differ too much and the loss of LSH efficiency near convergence.
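A minimal sketch of the Big-Kmer Mapping idea: choose the k-mer length as 25% of the mean sequence length and index sequences by the long k-mers they contain, so that only sequences sharing a k-mer are compared in the final clustering pass. The function name `big_kmer_groups` and the returned data layout are assumptions; the final clustering itself is not shown.

```python
from collections import defaultdict


def big_kmer_groups(seqs, fraction=0.25):
    """Return the chosen k and a map from each long k-mer to the sorted
    indices of the sequences containing it (shared k-mers only)."""
    mean_len = sum(len(s) for s in seqs) / len(seqs)
    k = max(1, int(fraction * mean_len))
    index = defaultdict(set)
    for i, s in enumerate(seqs):
        for start in range(len(s) - k + 1):
            index[s[start:start + k]].add(i)
    # Keep only k-mers shared by at least two sequences: these define
    # the candidate sets for the final clustering pass.
    shared = {kmer: sorted(ids) for kmer, ids in index.items() if len(ids) > 1}
    return k, shared


k, shared = big_kmer_groups(["ACGTACGTACGT", "ACGTACGTAC", "TTTTGGGGCCCC"])
```

In this toy input the first two sequences differ in length but share every long k-mer, so they end up as candidates for the same cluster, while the third sequence shares none.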
The gene sequences referred to in the efficient clustering method of the invention include 16S rRNA, 18S rRNA and the like.
The method of the invention can effectively process massive sequence data, including 16S rRNA and 18S rRNA data. Because it uses an efficient block-and-iterate strategy that avoids comparing large numbers of dissimilar sequences, the method can quickly produce clustering results for large-scale datasets, and is currently the most efficient method in the field of bioinformatics for large-scale clustering problems. In addition, because the DP-means algorithm estimates cluster centers more accurately, the clustering results produced by the method achieve high accuracy.
Brief description of the drawings
Fig. 1 is the pseudocode of the core algorithm of the invention.
Fig. 2 is the flow chart of the clustering algorithm based on locality-sensitive hashing and a nonparametric Bayesian method proposed by the invention.
Fig. 3-A is a visualization of the clustering result of Embodiment 2 on the simulated dataset Sim5.
Fig. 3-B shows the clustering accuracy of the 5 methods in Comparative Example 1 on each dataset.
Fig. 3-C shows the deviation from the ground truth of the number of clusters predicted by the 5 methods in Comparative Example 1.
Fig. 4-A shows the time Embodiment 3 needs to process various data scales with different numbers of CPU cores.
Fig. 4-B shows the running time of the 3 software tools in Comparative Example 2 under different data scales and CPU configurations.
Fig. 4-C shows the number of clusters estimated by the different methods in Comparative Example 2.
Detailed description of the embodiments
The following embodiments illustrate the invention but do not limit its scope.
Embodiment 1
The clustering algorithm based on locality-sensitive hashing (LSH) and a nonparametric Bayesian method (DP-means) involved in this embodiment is described in detail below. The pseudocode of the core algorithm of the invention is shown in Fig. 1; the flow chart of the proposed clustering algorithm based on locality-sensitive hashing and a nonparametric Bayesian method is shown in Fig. 2.
S1. Remove all duplicate sequences from the dataset, and convert each gene sequence into a 4^K-dimensional k-mer count vector
General metagenomic 16S rRNA samples may contain many identical gene sequences. The number of such sequences indicates the abundance of a microorganism, and within a cluster, a sequence with higher abundance is more likely to be the cluster center. Therefore a weight is set for every sequence, initialized to the number of copies of that sequence in the dataset. In addition, each gene sequence is converted into a 4^K-dimensional k-mer count vector. In the implementation, a gene sequence over the alphabet A, T, C, G can be regarded as a base-4 integer, so a sequence of length K corresponds naturally to an integer in [0, 4^K). A sliding window is used to count all k-mers of length K in the sequence and obtain the count vector, as preprocessing for the subsequent LSH.
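The base-4 encoding and sliding-window counting described above can be sketched as follows. The digit assignment in `BASE_CODE` and the handling of ambiguous bases (restarting the window) are assumptions; the text only states that a length-K sequence maps to an integer in [0, 4^K).

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed digit assignment


def kmer_count_vector(seq: str, k: int) -> list:
    """Return the 4**k-dimensional count vector of all length-k windows."""
    counts = [0] * (4 ** k)
    mask = 4 ** k - 1   # keeps only the last k base-4 digits of the index
    index = 0           # base-4 integer encoding of the current window
    valid = 0           # consecutive unambiguous bases seen so far
    for base in seq:
        digit = BASE_CODE.get(base)
        if digit is None:          # e.g. 'N': restart the window
            index, valid = 0, 0
            continue
        index = (index * 4 + digit) & mask
        valid += 1
        if valid >= k:
            counts[index] += 1
    return counts


vec = kmer_count_vector("ACGT", 2)
```

Updating the index incrementally makes the cost linear in the sequence length, independent of k. The vector itself has 4^k entries, so a sparse representation may be preferable for larger k.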
S2. Group the k-mer vectors using an LSH algorithm for the 2-norm distance (Gionis et al., 1999; Datar et al., 2004), so that similar sequences are assigned to the same group
LSH is an efficient algorithm for the nearest neighbor search problem; it has quasi-linear time complexity and is very efficient in practice. The core idea of LSH is to use a specific family of hash functions to compute hash values of the sample points, so that two points that are closer in 2-norm space have a higher probability of obtaining the same hash value. Specifically, for a point $v$ in the space, the family of hash functions that can be used under the 2-norm distance metric is (Datar et al., 2004):

$$
h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor
$$

where $a$ is a random vector, each dimension of which is sampled independently from the standard normal distribution; $v$ is the original data point; $w$ is a positive integer used to discretize the hash value; and $b \in [0, w)$ is an offset. Under this hash function, the probability that data points $v_1$ and $v_2$ have the same hash value is:

$$
P_c = P\bigl(h_{a,b}(v_1) = h_{a,b}(v_2)\bigr) = \int_0^w \frac{1}{c}\, g\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right) dt
$$

where $c = \|v_1 - v_2\|_2$ is the distance between the two points and $g$ is the probability density function of the absolute value of a standard normal variable. This probability decreases monotonically as the distance between the data points increases.
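The monotonic decrease of the collision probability can be checked numerically. The sketch below assumes, following Datar et al. (2004), that g is the density of the absolute value of a standard normal variable, and evaluates the integral with a simple midpoint rule; the function names are hypothetical.

```python
import math


def half_normal_pdf(x: float) -> float:
    """Density of |Z| for Z ~ N(0, 1), for x >= 0."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)


def collision_probability(c: float, w: float, steps: int = 10000) -> float:
    """Midpoint-rule estimate of
    P_c = integral from 0 to w of (1/c) g(t/c) (1 - t/w) dt."""
    h = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += half_normal_pdf(t / c) / c * (1.0 - t / w)
    return total * h


# Collision probabilities for increasing distances c at fixed width w = 4.
probabilities = [collision_probability(c, 4.0) for c in (0.5, 1.0, 2.0, 4.0)]
```

The values in `probabilities` decrease as c grows, which is the property that lets a single hash function separate distant k-mer vectors while tending to keep close ones together.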
By applying the LSH algorithm to the k-mer count vectors, similar sequences are hashed into the same group, which avoids a large number of unnecessary distance computations. In the actual program, the value of w is set automatically according to the group size specified by the user.
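A minimal sketch of grouping vectors with one random hash function, assuming the standard 2-stable form h(v) = floor((a · v + b) / w) of Datar et al. (2004), which the text cites. Here w is passed in directly; how the program derives w from the requested group size is not described in the text and is not modeled. The function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_hash(dim: int, w: float, rng: random.Random):
    """Draw one hash function h(v) = floor((a . v + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # standard normal entries
    b = rng.random() * w                           # offset in [0, w)

    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

    return h


def lsh_groups(vectors, w: float, seed: int = 0):
    """Partition vector indices by their value under a single hash function."""
    h = make_hash(len(vectors[0]), w, random.Random(seed))
    groups = defaultdict(list)
    for idx, v in enumerate(vectors):
        groups[h(v)].append(idx)
    return list(groups.values())


groups = lsh_groups([[1, 0, 2], [1, 0, 2], [50, 40, 30]], w=4.0)
```

Identical vectors always share a group; nearby vectors share one with high probability, and a different seed in a later round gives pairs that were split another chance to meet.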
S3. Cluster each group using the DP-means algorithm (Kulis and Jordan, 2012), and use the Message Passing Interface (MPI) and multithreading to implement parallel processing
The DP-means algorithm is the limiting case of the Dirichlet process mixture model (hereinafter DPMM) in the small-variance sense. DPMM is a nonparametric Bayesian method for clustering. When Gibbs sampling is used for inference, the following formula is used:

$$
p(z_i = k) =
\begin{cases}
\dfrac{n_{-i,k}}{\psi}\,(2\pi\sigma^2)^{-d/2}\exp\!\left(-\dfrac{1}{2\sigma^2}D^2(x_i,\mu_k)\right), & k = 1,\dots,K \\[2ex]
\dfrac{\alpha}{\psi}\,\bigl(2\pi(\rho+\sigma^2)\bigr)^{-d/2}\exp\!\left(-\dfrac{1}{2(\rho+\sigma^2)}D^2(x_i,x_i)\right), & k = K+1
\end{cases}
$$

where $x_i$ is the i-th data point; $z_i$ is the index of the cluster to which that data point belongs; $\mu_k$ is the center (a sequence) of the k-th cluster; the function $D$ measures the similarity of two sequences, i.e. the edit distance; $\alpha$ is the parameter of the DPMM that controls the number of clusters; and $\psi$ is a normalization constant. The formula assumes that each individual cluster is modeled by a Gaussian, and gives the probability that the i-th sequence is assigned to the k-th cluster ($k = 1,\dots,K$) or forms a new cluster of its own ($k = K+1$).
Using the small-variance approach, we let the variance parameter $\sigma$ of the Gaussian model tend to 0, and the formula above becomes:

$$
p(z_i = k) =
\begin{cases}
\dfrac{n_{-i,k}}{\psi'}\exp\!\left(-\dfrac{1}{2\sigma^2}D^2(x_i,\mu_k)\right), & k = 1,\dots,K \\[2ex]
\dfrac{1}{\psi'}\exp\!\left(-\dfrac{\lambda}{2\sigma^2}-\dfrac{D^2(x_i,x_i)}{2(\rho+\sigma^2)}\right), & k = K+1
\end{cases}
$$

In these probabilities, as $\sigma$ tends to 0, only the term corresponding to the smallest value among $\{D^2(x_i,\mu_1),\dots,D^2(x_i,\mu_K),\lambda\}$ receives a non-zero probability (equal to 1); all other terms equal 0. We should therefore assign $x_i$ directly to the corresponding cluster. This is the main content of the DP-means algorithm. The pseudocode of the algorithm is shown in Fig. 1.
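The hard-assignment rule can be sketched as a small DP-means loop over sequences. This is an illustration, not the pseudocode of Fig. 1: it assumes that a new cluster is opened when the squared edit distance to every existing center exceeds λ, and that the center update picks the medoid of each cluster (the sampling-based update and the sequence weights are omitted). The names `dp_means`, `lam` and `max_iter` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def dp_means(seqs, lam, max_iter=20):
    """Return (centers, assignment) for a non-empty list of sequences."""
    centers = [seqs[0]]
    assign = [0] * len(seqs)
    for _ in range(max_iter):
        # Assignment step: nearest center, or a new cluster if D^2 > lam.
        for i, s in enumerate(seqs):
            d2 = [edit_distance(s, c) ** 2 for c in centers]
            best = min(range(len(centers)), key=d2.__getitem__)
            if d2[best] > lam:
                centers.append(s)
                best = len(centers) - 1
            assign[i] = best
        # Update step: medoid of each cluster (assumed center definition).
        updated = []
        for k, old in enumerate(centers):
            members = [seqs[i] for i, z in enumerate(assign) if z == k]
            if members:
                updated.append(min(
                    members,
                    key=lambda m: sum(edit_distance(m, x) ** 2 for x in members)))
            else:
                updated.append(old)  # keep the center of an emptied cluster
        if updated == centers:
            break
        centers = updated
    return centers, assign


centers, assign = dp_means(["AAAA", "AAAT", "GGGG", "GGGC"], lam=4)
```

Unlike K-means, the number of clusters is not an input: it is determined by how often the threshold λ is exceeded.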
After LSH has partitioned the original data into blocks, each block can be clustered with the DP-means algorithm. Moreover, because the clustering of each block is completely independent, parallelism can easily be used to accelerate the program. In our implementation, MPI and multithreading are used together to form a hybrid parallel structure. With MPI, the program can run simultaneously on multiple machines of a cluster. On a single server, MPI requires inter-process communication whereas multithreading can use shared memory directly, so multithreading is used for multi-core parallelism on a single server. This structure maximizes parallel efficiency and makes the algorithm run faster.
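Because the blocks are independent, the per-block work maps naturally onto a worker pool. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor` purely as a stand-in for the single-server multithreading layer; the MPI layer across machines is not shown, and `cluster_block` is a hypothetical placeholder that merely groups identical strings instead of running DP-means.

```python
from concurrent.futures import ThreadPoolExecutor


def cluster_block(block):
    """Placeholder per-block clustering (hypothetical): groups identical
    strings. A real implementation would run DP-means on the block."""
    groups = {}
    for s in block:
        groups.setdefault(s, []).append(s)
    return list(groups.values())


def cluster_all_blocks(blocks, workers=4):
    """Cluster independent blocks concurrently. `map` returns results
    in the same order as the input blocks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cluster_block, blocks))


results = cluster_all_blocks([["A", "A", "B"], ["C"]])
```

Note that in CPython, threads do not speed up CPU-bound pure-Python code, so this sketch shows the structure only; a compiled implementation with shared-memory threads is what the text describes.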
S4. Merge the clustering results produced by the groups in step S3, and run steps S2 and S3 iteratively until the overall clustering result converges
In this step, the clusters produced in all blocks in S3 are put together, and the center sequence of each cluster serves as the representative of that cluster. In the next iteration (LSH partitioning and DP-means clustering), only these center sequences are considered. In the implementation, the weights of all sequences in each cluster are accumulated onto its center sequence, so that the center represents the whole cluster in the next level of iteration. If two center sequences are clustered together at the next level, the clusters they represent are merged, and so on. Iteration continues until the result converges, i.e. until the number of clustering events occurring across all blocks of an LSH round falls below a threshold. This iterative process is equivalent to mapping the data multiple times with multiple hash tables in the original LSH algorithm, which avoids the situation where, under one particular mapping, some nearby points receive different hash values.
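The merge bookkeeping of step S4 can be sketched as follows. The data layout (dictionaries with `center`, `weight` and `members`) and the rule that the heaviest center represents a merged cluster are assumptions for illustration; in the described method the new center would come from the DP-means update. The returned merge count is the quantity compared against the convergence threshold.

```python
def merge_round(clusters, group_of):
    """Merge clusters whose centers were clustered together in the next round.

    clusters: list of {"center": str, "weight": int, "members": list}
    group_of: group_of[i] is the id of the next-level cluster joined by
              the center of clusters[i]
    Returns (merged_clusters, number_of_merges).
    """
    merged = {}
    for cluster, group in zip(clusters, group_of):
        entry = merged.setdefault(group, {"center": cluster["center"],
                                          "top": cluster["weight"],
                                          "weight": 0, "members": []})
        entry["weight"] += cluster["weight"]        # accumulate weights
        entry["members"].extend(cluster["members"])
        if cluster["weight"] > entry["top"]:        # heaviest center wins (assumed)
            entry["center"], entry["top"] = cluster["center"], cluster["weight"]
    result = [{"center": e["center"], "weight": e["weight"], "members": e["members"]}
              for e in merged.values()]
    return result, len(clusters) - len(result)


merged, merges = merge_round(
    [{"center": "AAAA", "weight": 3, "members": ["AAAA", "AAAT"]},
     {"center": "AAAC", "weight": 5, "members": ["AAAC"]},
     {"center": "GGGG", "weight": 2, "members": ["GGGG"]}],
    group_of=[0, 0, 1])
```

An outer loop would repeat LSH grouping, DP-means and `merge_round` on the centers until the merge count drops below the chosen threshold.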
S5. Finally, hash all remaining gene sequences by a longer k-mer (Big-Kmer Mapping), cluster the sequences that share an identical k-mer once more, and obtain the final clustering result
When the LSH iteration is close to convergence, very few clustering events occur in each iteration, and the efficiency of the algorithm suffers greatly. Therefore, once the LSH iteration is close to convergence, another strategy, Big-Kmer Mapping, is used: all remaining sequences are hashed by a longer k-mer, and the sequences containing an identical k-mer are clustered once more to obtain the final clustering result. This big-k-mer mapping scheme solves not only the inefficiency near convergence but also the problem of excessive differences in sequence length: if two sequences differ greatly in length, then even if they come from the same species, their k-mer count vectors differ greatly and they will not be assigned to the same group by the LSH algorithm.
Embodiment 2: Cluster analysis of simulated 16S rRNA gene datasets
To compare the clustering results of the algorithm with the ground truth of the datasets, we carried out comparison experiments on simulated datasets. The simulated datasets were generated with the software Grinder, in 5 groups; the parameters of each group are shown in Table 1 below:
Table 1. Generation parameters of the simulated datasets
The visualization of the clustering result of the Embodiment 1 method on Sim5 is shown in Fig. 3-A. The area of each circle in Fig. 3-A is proportional to the cluster size (the number of sequences in the cluster). In each pair of circles, the left one represents the ground truth and the right one the clustering result of this algorithm; the overlapping area represents their intersection. The figure shows the comparison for the 60 largest clusters in Sim5 (accounting for 45% of all sequences in the dataset). It can be seen that the algorithm identifies almost all of these clusters accurately.
Embodiment 3: Cluster analysis of the Taihu Lake microbial 16S rRNA metagenomic dataset
This dataset is a 16S rRNA metagenomic dataset collected to study water pollution in Taihu Lake. It comprises 81 surface-water microbial samples collected in 9 different months of 2012. The whole dataset contains 316,153,464 raw sequences of length 80 bp, with a file size of 30 GB.
Fig. 4-A shows the time needed by the Embodiment 1 method (DACE) to process various data scales with different numbers of CPU cores. By using the message passing interface and multithreading, the Embodiment 1 algorithm (DACE) makes effective use of computing resources and greatly reduces the running time as the number of CPU cores increases. In particular, when the data scale is large, the running time decreases in proportion to the increase in the number of CPU cores. The scalability of the Embodiment 1 algorithm (DACE) is therefore very high, making it well suited to processing ultra-large datasets.
Comparative Example 1
On the same simulated 16S rRNA gene datasets as in Embodiment 2 (Sim1 to Sim5), the clustering results of the Embodiment 1 method (hereinafter DACE) were compared with those of three software tools widely used in the field: CD-HIT, UCLUST and CROP. In the experimental results, CD-HIT-ACC denotes the accurate mode of the CD-HIT software. The comparison is shown in Fig. 3-B and Fig. 3-C. Fig. 3-B shows the clustering accuracy (Normalized Mutual Information, NMI) of the 5 methods on each dataset; a larger score means a more accurate clustering result. The Embodiment 1 method (DACE) achieves the highest score on 4 of the 5 datasets (Sim1, Sim3, Sim4, Sim5). Fig. 3-C shows the deviation from the ground truth of the number of clusters (OTU numbers) predicted by the 5 methods. The results of the Embodiment 1 method (DACE) on Sim1, Sim3 and Sim4 are the closest to the ground truth. The simulation experiments show that the accuracy of the Embodiment 1 method is generally better than that of all the classical algorithms.
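For reference, the NMI score used as the accuracy measure can be computed from two label lists as sketched below. Several normalizations of mutual information are in use and the text does not say which one was applied; this sketch assumes normalization by the arithmetic mean of the two entropies. The function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(truth, pred):
    """Normalized mutual information 2 I(U;V) / (H(U) + H(V)) between
    two labelings of the same items (assumed normalization)."""
    n = len(truth)
    pu, pv = Counter(truth), Counter(pred)
    puv = Counter(zip(truth, pred))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    hu, hv = entropy(pu), entropy(pv)
    mi = sum(c / n * math.log(c * n / (pu[u] * pv[v]))
             for (u, v), c in puv.items())
    if hu + hv == 0.0:
        return 1.0  # both labelings place every item in a single cluster
    return 2.0 * mi / (hu + hv)


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # labels renamed only
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # unrelated partition
```

The score depends only on the partition, not on the label names, so a clustering that matches the ground truth up to renaming scores 1, and an unrelated partition scores 0.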
Comparative Example 2
On the Taihu Lake microbial 16S rRNA metagenomic dataset of Embodiment 3, the clustering efficiency of the Embodiment 1 method (hereinafter DACE) was compared with that of the two software tools CD-HIT and UCLUST. CROP, mentioned in Comparative Example 1, is too slow and is therefore excluded from this comparison.
The comparison is shown in Fig. 4-B and Fig. 4-C. Fig. 4-B shows the running time of the 3 software tools under different data scales and CPU configurations. It should be noted that CD-HIT supports only multithreaded parallelism, so in our experimental environment it supports at most 20 CPU cores; UCLUST does not support parallel computing and therefore runs on a single core; the Embodiment 1 method (DACE) supports any number of CPU cores, and in our experimental environment at most 80 cores were used. The vertical axis of Fig. 4-B is logarithmic. On the complete dataset (316M sequences), with the same number of cores (20), DACE is 6 times faster than CD-HIT and 80 times faster than CD-HIT-ACC; with 80 cores, DACE needs only 25 minutes to complete the clustering, whereas CD-HIT needs 5.25 hours and CD-HIT-ACC and UCLUST need about 68 hours. The running efficiency of the present algorithm is thus significantly better than that of all the classical algorithms. Fig. 4-C shows the number of clusters (OTU numbers) estimated by the different methods. Because the Taihu Lake microbial dataset has no ground truth, correctness cannot be assessed rigorously; however, clustering methods generally overestimate the number of clusters, and the number estimated by DACE is smaller than that of the other methods.
In Fig. 3-B, the horizontal axis Dataset denotes the simulated dataset, the vertical axis NMI Score denotes the clustering accuracy, and Program denotes the 5 methods compared. In Fig. 3-C, the vertical axis Bias of predicted OTU numbers denotes the deviation of the number of clusters predicted by each method from the ground truth, and Program denotes the 5 methods compared. In Fig. 4-A, the horizontal axis #Cores denotes the number of CPU cores, the vertical axis Running time (minutes) denotes the software running time in minutes, and DataSize denotes the scale of the dataset used (1M = 10^6 sequences). In Fig. 4-B, the horizontal axis #Sequences denotes the number of sequences in the dataset, Running time (minutes) denotes the software running time in minutes, and Program denotes the software compared. In Fig. 4-C, the horizontal axis #Sequences denotes the number of sequences in the dataset, Predicted OTU numbers (K) denotes the predicted number of clusters (in thousands), and App denotes the methods compared.
Although the invention has been described in detail above by way of a general description and specific embodiments, modifications or improvements can be made on the basis of the invention, as will be apparent to those skilled in the art. Such modifications or improvements, made without departing from the spirit of the invention, fall within the scope of protection claimed.

Claims (8)

1. one kind based on local sensitivity Hash and the efficient clustering method of imparametrization bayes method, it is characterised in that include Following steps:
S1. remove all repetitive sequences in data set, and gene order is converted to dimension is 4KK-mer count vector;
S2. utilize the LSH algorithm for two norm distances that described k-mer count vector is grouped so that similar sequence It is assigned to same group;
S3. utilize DP-means algorithm that each packet is clustered, and utilize MPI and multithreading to realize parallel processing;
S4. the cluster result that packet each in step S3 produces is merged, and iteration operating procedure S2 and S3 makes overall cluster Result restrains;
S5. finally according to the k-mer that length is bigger, all remaining gene orders are done Hash, the sequence of identical k-mer will be had Arrange the cluster that tries again, draw final cluster result.
2. The efficient clustering method according to claim 1, characterized in that step S2 comprises treating the approximate two-norm distance between k-mer count vectors as the degree of similarity between sequences, and dividing the original gene sequences into several groups by applying an LSH mapping to the k-mer count vectors.
3. The efficient clustering method according to claim 1, characterized in that the DP-means algorithm in step S3 uses the edit distance between two sequences as the measure of sequence similarity.
4. The efficient clustering method according to claim 1, characterized in that, while each group is clustered with the DP-means algorithm in step S3, a sampling method is used to adjust the cluster centers: within each cluster, a subset is selected at random with probability proportional to sequence weight, and the exact cluster center of this subset is chosen as the center of the whole cluster.
5. The efficient clustering method according to claim 1, characterized in that the iterative process in step S4 is equivalent to the step in the LSH algorithm of performing multiple mappings through multiple hash tables.
6. The efficient clustering method according to claim 1, characterized in that step S4 assumes that the DP-means algorithm always gives the correct clustering result within each group, and then estimates the total number of iterations needed from LSH theory.
7. The efficient clustering method according to claim 1, characterized in that the k-mer length in step S5 is chosen as 25% of the mean gene sequence length in the data set.
8. The efficient clustering method according to any one of claims 1-7, characterized in that the gene sequences include 16S rRNA and 18S rRNA.
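
For illustration only, steps S1 and S2 of claim 1 can be sketched in Python as below. This is a minimal sketch under stated assumptions and is not the patented implementation: the function names (`kmer_count_vector`, `make_l2_hash`, `lsh_groups`), the parameters `width`, `num_hashes` and `seed`, and the choice of the standard p-stable hash family for Euclidean distance, h(v) = floor((a·v + b) / w) with Gaussian a and uniform b, are all assumptions. The claims only state that an LSH algorithm for the two-norm distance is used.

```python
import math
import random
from collections import defaultdict

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_count_vector(seq, k):
    """Return the k-mer count vector of dimension 4**k for one sequence (S1)."""
    vec = [0] * (4 ** k)
    for i in range(len(seq) - k + 1):
        code = 0
        valid = True
        for ch in seq[i:i + k]:
            if ch not in BASE_INDEX:   # skip windows with ambiguous bases (e.g. N)
                valid = False
                break
            code = code * 4 + BASE_INDEX[ch]
        if valid:
            vec[code] += 1
    return vec

def make_l2_hash(dim, width, rng):
    """Build one p-stable LSH function for Euclidean distance (assumed family)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian projection direction
    b = rng.uniform(0.0, width)                    # random offset in [0, width)
    def h(vec):
        projection = sum(x * y for x, y in zip(a, vec))
        return math.floor((projection + b) / width)
    return h

def lsh_groups(seqs, k=3, width=4.0, num_hashes=2, seed=0):
    """Deduplicate sequences (S1) and bucket them by concatenated LSH keys (S2)."""
    rng = random.Random(seed)
    unique = list(dict.fromkeys(seqs))             # drop exact duplicates, keep order
    hashes = [make_l2_hash(4 ** k, width, rng) for _ in range(num_hashes)]
    groups = defaultdict(list)
    for s in unique:
        v = kmer_count_vector(s, k)
        key = tuple(h(v) for h in hashes)
        groups[key].append(s)
    return list(groups.values())
```

In this sketch, sequences whose count vectors are identical always land in the same bucket, while vectors that are close in Euclidean distance share a bucket only with high probability. That is why the method repeats the grouping over several iterations (step S4). Each resulting group would then be passed to the per-group clustering of step S3.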
CN201610534138.9A 2016-07-07 2016-07-07 Efficient clustering method based on local sensitivity Hash and imparametrization bayes method Active CN106228035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610534138.9A CN106228035B (en) 2016-07-07 2016-07-07 Efficient clustering method based on local sensitivity Hash and imparametrization bayes method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610534138.9A CN106228035B (en) 2016-07-07 2016-07-07 Efficient clustering method based on local sensitivity Hash and imparametrization bayes method

Publications (2)

Publication Number Publication Date
CN106228035A true CN106228035A (en) 2016-12-14
CN106228035B CN106228035B (en) 2019-03-01

Family

ID=57519340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610534138.9A Active CN106228035B (en) 2016-07-07 2016-07-07 Efficient clustering method based on local sensitivity Hash and imparametrization bayes method

Country Status (1)

Country Link
CN (1) CN106228035B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1710558A (en) * 2005-07-07 2005-12-21 复旦大学 Gene chip expression spectral-data clustering method based on main cluster cutting
CN103793438A (en) * 2012-11-05 2014-05-14 山东省计算中心 MapReduce based parallel clustering method
CN103631928A (en) * 2013-12-05 2014-03-12 中国科学院信息工程研究所 LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
王忠伟 等: "基于LSH的高维大数据k近邻搜索算法", 《电子学报》 *
赵跃华 等: "面向海量病毒样本家族聚类方法的研究", 《计算机工程与应用》 *
郑奇斌 等: "结合局部敏感哈希的k近邻数据填补算法", 《计算机应用》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109378039A (en) * 2018-08-20 2019-02-22 中国矿业大学 Oncogene based on discrete constraint and the norm that binds expresses spectral-data clustering method
CN109378039B (en) * 2018-08-20 2022-02-25 中国矿业大学 Tumor gene expression profile data clustering method based on discrete constraint and capping norm
WO2021143016A1 (en) * 2020-01-15 2021-07-22 平安科技(深圳)有限公司 Approximate data processing method and apparatus, medium and electronic device
CN113114454A (en) * 2021-03-01 2021-07-13 暨南大学 Efficient privacy outsourcing k-means clustering method
CN113114454B (en) * 2021-03-01 2022-11-29 暨南大学 Efficient privacy outsourcing k-means clustering method
CN114023389A (en) * 2022-01-05 2022-02-08 成都齐碳科技有限公司 Analysis method of metagenome data
CN114420215A (en) * 2022-03-28 2022-04-29 山东大学 Large-scale biological data clustering method and system based on spanning tree
CN114420215B (en) * 2022-03-28 2022-09-16 山东大学 Large-scale biological data clustering method and system based on spanning tree
CN117573875A (en) * 2023-12-05 2024-02-20 安芯网盾(北京)科技有限公司 Method and device for optimizing homonymy file clustering algorithm

Also Published As

Publication number Publication date
CN106228035B (en) 2019-03-01

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant