CN106228035A - Efficient clustering method based on locality-sensitive hashing and a nonparametric Bayesian method - Google Patents
- Publication number
- CN106228035A (application CN201610534138.9A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- sequence
- mer
- clustering method
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The present invention relates to an efficient clustering method based on locality-sensitive hashing (LSH) and a nonparametric Bayesian method. The method can effectively process massive sequence data, including 16S rRNA and 18S rRNA data. Because it uses an efficient block-iteration scheme that avoids comparing large numbers of dissimilar sequences, the method quickly produces clustering results on large-scale data sets, and is presented as the most efficient method currently available in bioinformatics for large-scale clustering problems. In addition, because the DP-means algorithm estimates cluster centers more accurately, the clustering results achieve high accuracy.
Description
Technical field
The invention belongs to the field of computer applications (bioinformatics), and specifically relates to an efficient clustering method based on locality-sensitive hashing and a nonparametric Bayesian method.
Background art
In recent years, with the rapid development of second-generation (next-generation) sequencing technology, large numbers of high-quality DNA/RNA sequencing fragments can be obtained from environmental samples quickly and cheaply. The Earth Microbiome Project obtained 1.3 billion 16S rRNA gene sequences from 15,000 environmental samples collected around the world; a human gut microbiome project obtained 1.1 billion 16S rRNA gene sequences from 531 test samples. Such large-scale sequencing yields much information useful for environmental management, disease research, drug development, and so on. When analyzing these data, one of the most basic tasks is to cluster the sequences: individual gene sequences are grouped by similarity into microbial groups (operational taxonomic units, OTUs), which are then computed on and analyzed as a whole.
In practice, the similarity between sequences is usually defined by their edit distance. Bioinformatics already has many software tools for clustering gene sequences, which fall mainly into three categories: hierarchical clustering, greedy clustering, and probabilistic-model methods. Software using hierarchical clustering mainly includes DOTUR (Schloss and Handelsman, 2005) and Mothur (Schloss et al., 2009). This approach typically builds a phylogenetic tree; once a threshold defining a microbial group is chosen, the clustering result can be read directly off the tree. It usually requires computing the pairwise similarity of all sequences and holding the results in memory before tree construction begins. Its time and space complexity are both high, so it cannot cope with massive data. Software using greedy algorithms mainly includes CD-HIT (Fu et al., 2012) and UCLUST (Edgar, 2010). These tools take the first sequence as the center (the representative) of the first cluster and compare each subsequent sequence with the existing cluster centers. If the sequence lies within the threshold of a center, it is assigned to that cluster; if its distance to every cluster center exceeds the threshold, it forms a new cluster of its own. This algorithm is very efficient, so much so that a great deal of current research uses clustering tools based on it. Its drawback is that the result depends heavily on the order of the input sequences, and accuracy is comparatively poor. Software using a probabilistic model mainly includes CROP (Hao et al., 2011), which uses an unsupervised Bayesian method: clusters are modeled with a Gaussian mixture model, and the model parameters are estimated by Markov chain Monte Carlo (MCMC). In practice this algorithm is more accurate because it uses a soft threshold, but it is very slow because MCMC requires many iterations.
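The order dependence of greedy clustering can be seen in a small sketch. The code below is only an illustration of the greedy scheme described above, not the implementation of CD-HIT or UCLUST; the function names, the first-match assignment rule, and the toy reads are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first center within `threshold`, else open a new cluster."""
    centers, clusters = [], []
    for s in seqs:
        for idx, c in enumerate(centers):
            if edit_distance(s, c) <= threshold:
                clusters[idx].append(s)
                break
        else:  # no existing center is close enough
            centers.append(s)
            clusters.append([s])
    return clusters


# The same three reads, presented in two different orders (hypothetical data).
forward = greedy_cluster(["AAAA", "AATT", "TTTT"], threshold=2)
reordered = greedy_cluster(["AATT", "AAAA", "TTTT"], threshold=2)
```

With these toy reads the first order yields two clusters and the second yields one, because the first sequence seen always becomes a center.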
Locality-sensitive hashing (LSH) is a method that improves the efficiency of similarity search by designing hash functions with a special property, namely locality sensitivity.
Summary of the invention
The technical problem to be solved by the present invention is to provide an efficient and accurate gene sequence clustering method that overcomes the inaccuracy of greedy algorithms and their dependence on the order of the input sequences, while keeping the computation speed comparable to that of existing tools, so that massive gene data can truly be processed quickly and efficiently.
The object of the invention is to provide an efficient clustering method based on locality-sensitive hashing (LSH) and a nonparametric Bayesian method (DP-means). The nonparametric Bayesian method makes the clustering result more accurate than that of traditional greedy algorithms, while the LSH partitioning strategy accelerates the whole clustering process. Specifically, the method comprises the following steps:
S1. Remove all duplicate sequences from the data set, and convert each gene sequence into a k-mer count vector of dimension $4^K$
Typical metagenomic 16S rRNA and 18S rRNA samples may contain many identical gene sequences. The number of such copies indicates the abundance of a microorganism, and within a cluster the most abundant sequence is the most likely to be the cluster center. We therefore assign each sequence a weight, initialized to the number of times that sequence is repeated in the data set. At the same time, each gene sequence is converted into a k-mer count vector of dimension $4^K$, as preprocessing for the LSH grouping used in the subsequent iterations.
Here, a k-mer count vector of dimension $4^K$ means a vector with $4^K$ components, each of which corresponds to one k-mer of length K and records the number of times that k-mer occurs in the sequence.
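The deduplication and weight initialization of S1 can be sketched as follows. This is a minimal illustration assuming the reads fit in memory; `deduplicate` and the toy reads are hypothetical names and data, not taken from the patent.

```python
from collections import Counter


def deduplicate(seqs):
    """Return the unique sequences and their copy counts, used as initial weights."""
    counts = Counter(seqs)
    unique = list(counts)                    # first-seen order is preserved
    weights = [counts[s] for s in unique]
    return unique, weights


unique, weights = deduplicate(["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC"])
```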
S2. Group the k-mer count vectors with an LSH algorithm for the L2-norm (Euclidean) distance, so that similar sequences are assigned to the same group
We assume that the k-mer count vectors of highly similar sequences are close under the L2 norm, and therefore group the k-mer count vectors with an LSH algorithm for the L2 distance, so that the k-mer vectors in the same group are close to one another and the corresponding sequences are highly similar. With this property, clustering within each group separately avoids distance computations between large numbers of dissimilar sequences. These computations are the most time-consuming part of the whole program, so reducing them greatly improves overall efficiency.
That is, the L2 distance between k-mer count vectors is treated as an approximation of sequence similarity; by applying the LSH mapping to the k-mer count vectors, the original gene sequences are divided into groups, which avoids distance computations between large numbers of dissimilar sequences.
S3. Cluster each group with the DP-means algorithm, using the Message Passing Interface (MPI) and multithreading for parallel processing
DP-means is a nonparametric Bayesian method. Its clustering procedure closely resembles K-means, but through a clustering threshold λ, DP-means learns the number of clusters automatically, with no need to fix K in advance. With DP-means we obtain more accurate clustering results than with greedy algorithms. After LSH has grouped the original data, we cluster each group with DP-means and accelerate this step with MPI and multithreading.
The DP-means algorithm uses the pairwise edit distance between sequences as the measure of sequence similarity. In the step of DP-means that updates the cluster centers, a sampling approach is used: within each cluster, a subset is selected at random with probability proportional to sequence weight, and the exact cluster center of that subset is taken as the center of the whole cluster.
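The DP-means step with edit distance and a sampled center update can be sketched as below. This is a hedged illustration, not the patented implementation: the squared-distance penalty test, sampling with replacement, the weighted-medoid definition of the "exact center", and all names and parameter values (`dp_means`, `sample_size`, `lam`) are assumptions.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def dp_means(seqs, weights, lam, max_iter=20, sample_size=50, seed=0):
    """DP-means on sequences: squared edit distance, penalty `lam`, sampled medoid update."""
    rng = random.Random(seed)
    centers = [seqs[0]]
    for _ in range(max_iter):
        members = {}                                   # center index -> member indices
        for i, s in enumerate(seqs):
            d2 = [edit_distance(s, c) ** 2 for c in centers]
            k = min(range(len(centers)), key=d2.__getitem__)
            if d2[k] > lam:                            # farther than lam from every center
                centers.append(s)                      # open a new cluster at this sequence
                k = len(centers) - 1
            members.setdefault(k, []).append(i)
        new_centers = []
        for k in sorted(members):
            idx = members[k]
            # Sample candidates with probability proportional to weight (with replacement).
            cand = set(rng.choices(idx, weights=[weights[i] for i in idx],
                                   k=min(sample_size, len(idx))))
            # Exact weighted medoid of the sampled subset.
            best = min(cand, key=lambda c: sum(
                weights[j] * edit_distance(seqs[c], seqs[j]) ** 2 for j in cand))
            new_centers.append(seqs[best])
        clusters = [[seqs[i] for i in members[k]] for k in sorted(members)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


centers, clusters = dp_means(["AAAA", "AAAT", "CCCC", "CCCG"], [3, 1, 2, 1], lam=4)
```

Because the center is chosen from a random sample, the returned centers can differ between seeds even when the clusters themselves are identical.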
S4. Merge the clustering results produced by the groups in step S3, and run steps S2 and S3 iteratively until the overall clustering result converges
We pick out the center of each cluster as the representative sequence of the whole cluster, and these centers form a new set. Because the mapping that LSH performs under a single hash function can, with some probability, assign nearby sequences to two different groups, we iterate the LSH grouping and DP-means clustering on this new set of cluster centers. This ensures that similar sequences are placed in the same group with high probability and are then clustered together by DP-means, so that the overall clustering result converges.
This iterative process is equivalent to the step in the LSH algorithm that performs multiple mappings using multiple hash tables. Assuming that DP-means always gives the correct clustering result within each group, LSH theory can be used to estimate the total number of iterations required.
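One way such an estimate could be made is sketched below. It assumes that each round uses an independent hash function and that two similar sequences share a group with a known probability p per round; the probability of missing them in every one of L rounds is then (1 - p)^L. The function name and the example values are hypothetical.

```python
import math


def iterations_needed(p_same_group: float, miss_prob: float) -> int:
    """Smallest L with (1 - p)^L <= miss_prob, for per-round collision probability p."""
    if not (0.0 < p_same_group < 1.0 and 0.0 < miss_prob < 1.0):
        raise ValueError("both probabilities must lie strictly between 0 and 1")
    return math.ceil(math.log(miss_prob) / math.log(1.0 - p_same_group))


# Example: a 50% chance per round, and at most a 1% chance of never meeting.
rounds = iterations_needed(p_same_group=0.5, miss_prob=0.01)
```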
S5. Finally, hash all remaining gene sequences by a longer k-mer (Big-Kmer Mapping), cluster once more the sequences that share an identical k-mer, and obtain the final clustering result.
As the LSH iteration approaches convergence, most sequences within an LSH group are dissimilar, only a few sequences are clustered in each iteration, and efficiency declines. In this situation we therefore use a different hashing strategy (Big-Kmer Mapping): all remaining sequences are hash-mapped by a longer k-mer, and sequences that share an identical k-mer are clustered one final time. Because the clustering result is already close to convergence, the clusters at this point are very small, so this step completes quickly. In addition, this step handles another extreme case, in which sequence lengths in the data set differ too much. Such a difference causes LSH to place two similar sequences in different groups, and Big-Kmer Mapping resolves this.
In practice, the k-mer length is typically set to 25% of the mean gene sequence length in the data set, which ensures that within a true cluster the common substring of most sequences exceeds this length. Step S5 thus effectively addresses both the failure of LSH when sequence lengths differ too much and the loss of LSH efficiency near convergence.
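The Big-Kmer Mapping can be sketched as a dictionary from long k-mers to the sequences that contain them. The sketch below indexes every window of each sequence, which is an assumption (the text does not say which k-mers of a sequence are hashed); `big_kmer_buckets` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def big_kmer_buckets(seqs, fraction=0.25):
    """Map each k-mer of length fraction * mean length to the sequences containing it."""
    mean_len = sum(len(s) for s in seqs) / len(seqs)
    k = max(1, int(mean_len * fraction))
    buckets = defaultdict(set)
    for idx, s in enumerate(seqs):
        for start in range(len(s) - k + 1):
            buckets[s[start:start + k]].add(idx)
    # Keep only k-mers shared by at least two sequences: these are candidate merges.
    shared = {kmer: sorted(ids) for kmer, ids in buckets.items() if len(ids) > 1}
    return k, shared


k, shared = big_kmer_buckets(
    ["ACGTACGTACGTACGT", "TTACGTACGTACGTAAGGCC", "GGGGGGGGGGGG"])
```

Here the first two sequences differ in length but still share long substrings, so they land in common buckets and can be compared in the final clustering pass.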
The gene sequences handled by the efficient clustering method of the invention include 16S rRNA, 18S rRNA, and the like.
The method of the invention can effectively process massive sequence data, including 16S rRNA and 18S rRNA data. Because it uses an efficient block-iteration scheme that avoids comparing large numbers of dissimilar sequences, the method quickly produces clustering results for large-scale data sets, and is presented as the most efficient method currently available in bioinformatics for large-scale clustering problems. In addition, because cluster centers are estimated more accurately in the DP-means algorithm, the clustering results achieve high accuracy.
Brief description of the drawings
Fig. 1 is the pseudocode of the core algorithm of the invention.
Fig. 2 is a flowchart of the proposed clustering algorithm based on locality-sensitive hashing and a nonparametric Bayesian method.
Fig. 3-A visualizes the clustering result of Embodiment 2 on the simulated data set Sim5.
Fig. 3-B shows the clustering accuracy of the 5 methods in Comparative Example 1 on each data set.
Fig. 3-C shows the deviation of the number of clusters predicted by the 5 methods in Comparative Example 1 from the ground truth.
Fig. 4-A shows the time Embodiment 3 needs to process data of various scales with different numbers of CPU cores.
Fig. 4-B shows the running time of the 3 software tools in Comparative Example 2 under different data scales and CPU configurations.
Fig. 4-C shows the number of clusters estimated by the different methods in Comparative Example 2.
Detailed description of the embodiments
The following embodiments illustrate the invention but do not limit its scope.
Embodiment 1
The clustering algorithm of this embodiment, based on locality-sensitive hashing (LSH) and a nonparametric Bayesian method (DP-means), is described in detail below. The pseudocode of the core algorithm is shown in Fig. 1, and the flowchart of the proposed clustering algorithm is shown in Fig. 2.
S1. Remove all duplicate sequences from the data set, and convert each gene sequence into a k-mer count vector of dimension $4^K$
Typical metagenomic 16S rRNA samples may contain many identical gene sequences. The number of such copies indicates the abundance of a microorganism, and within a cluster the most abundant sequence is the most likely to be the cluster center. We therefore assign each sequence a weight, initialized to the number of times that sequence is repeated in the data set. At the same time, each gene sequence is converted into a k-mer count vector of dimension $4^K$. In the implementation, a gene sequence over the alphabet A, T, C, G can be regarded as an integer in base 4, so a sequence of length K corresponds naturally to an integer in $[0, 4^K)$. A sliding window counts all k-mers of length K in the sequence to obtain the count vector, as preprocessing for the subsequent LSH step.
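The base-4 encoding and sliding window can be sketched as follows. The digit assignment (A=0, C=1, G=2, T=3) and the handling of ambiguous bases are assumptions; the text only states that each k-mer corresponds to an integer in $[0, 4^K)$.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed digit assignment


def kmer_count_vector(seq: str, k: int):
    """Count every length-k window of `seq`, indexed by its base-4 integer value."""
    counts = [0] * (4 ** k)
    mask = 4 ** k - 1
    index, valid = 0, 0            # valid = consecutive unambiguous bases seen so far
    for base in seq.upper():
        code = BASE_CODE.get(base)
        if code is None:           # ambiguous base such as N: restart the window
            index, valid = 0, 0
            continue
        index = ((index << 2) | code) & mask   # shift in one base-4 digit
        valid += 1
        if valid >= k:
            counts[index] += 1
    return counts


vec = kmer_count_vector("ACGTAC", 2)
```

The rolling update means each window costs constant time, so the whole vector is built in one pass over the sequence.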
S2. Group the k-mer vectors with an LSH algorithm for the L2-norm distance (Gionis et al., 1999; Datar et al., 2004), so that similar sequences are assigned to the same group
LSH is an efficient algorithm for the nearest-neighbor search problem with pseudo-linear time complexity, and it is very efficient in practice. Its core idea is to compute hash values of the sample points with a specific family of hash functions, such that two points that are closer in the L2 space have a higher probability of receiving the same hash value. Specifically, for points in the space, the family of hash functions that can be used under the L2 distance metric is:
$$h_{\mathbf{a},b}(\mathbf{v}) = \left\lfloor \frac{\mathbf{a}\cdot\mathbf{v} + b}{w} \right\rfloor$$
where $\mathbf{a}$ is a random vector whose components are drawn independently from the standard normal distribution; $\mathbf{v}$ is the original data point; $w$ is a positive integer used to discretize the hash value; and $b \in [0, w)$ is an offset.
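Under this hash function, the probability that data points $\mathbf{v}_1$ and $\mathbf{v}_2$ receive the same hash value is:
$$p(c) = \int_0^w \frac{1}{c}\, f\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right) dt$$
where $c = \lVert \mathbf{v}_1 - \mathbf{v}_2 \rVert_2$ is the distance between the two points and $f$ is the probability density of the absolute value of a standard normal variable (Datar et al., 2004). This probability decreases monotonically as the distance between the data points increases.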
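The monotonic decrease can be checked numerically. The sketch below evaluates the closed form of the collision probability given by Datar et al. (2004) for the Gaussian hash family; the function names and the choice w = 4 are illustrative assumptions.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def collision_probability(c: float, w: float) -> float:
    """P[h(v1) == h(v2)] for two points at L2 distance c, with bucket width w."""
    if c == 0:
        return 1.0
    r = w / c
    return (1.0 - 2.0 * std_normal_cdf(-r)
            - 2.0 / (math.sqrt(2.0 * math.pi) * r) * (1.0 - math.exp(-r * r / 2.0)))


# Collision probability at increasing distances for a fixed bucket width.
probs = [collision_probability(c, w=4.0) for c in (0.5, 1.0, 2.0, 4.0, 8.0)]
```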
Applying the LSH algorithm to the k-mer count vectors hashes similar sequences into the same group, which avoids a large number of unnecessary distance computations. In the actual program, the value of w is set automatically according to the group size specified by the user.
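The grouping step can be sketched with the hash family h(v) = floor((a · v + b) / w) of Datar et al. (2004). This is a minimal pure-Python illustration; the use of a tuple of hash values as the bucket key, the fixed w, and the names `make_hash` and `lsh_groups` are assumptions, and the automatic choice of w from a target group size is not shown.

```python
import math
import random
from collections import defaultdict


def make_hash(dim: int, w: float, rng: random.Random):
    """Draw one function h(v) = floor((a . v + b) / w) with Gaussian a and uniform b."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

    return h


def lsh_groups(vectors, w: float, num_hashes: int = 1, seed: int = 0):
    """Bucket vector indices by the tuple of their hash values."""
    rng = random.Random(seed)
    hashes = [make_hash(len(vectors[0]), w, rng) for _ in range(num_hashes)]
    groups = defaultdict(list)
    for idx, v in enumerate(vectors):
        groups[tuple(h(v) for h in hashes)].append(idx)
    return list(groups.values())


# Two identical count vectors and one distant vector (hypothetical data).
vectors = [[2, 0, 1, 0], [2, 0, 1, 0], [0, 5, 0, 9]]
groups = lsh_groups(vectors, w=4.0)
```

Identical vectors always share a bucket; distant vectors are separated only with high probability, which is why the method iterates the grouping.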
S3. Cluster each group with the DP-means algorithm (Kulis and Jordan, 2012), using the Message Passing Interface (MPI) and multithreading for parallel processing
The DP-means algorithm is the small-variance asymptotic limit of the Dirichlet process mixture model (hereinafter DPMM). DPMM is a nonparametric Bayesian method for clustering; when Gibbs sampling is used to estimate the result, the following formula is used:
where $x_i$ is the i-th data point; $z_i$ is the index of the cluster assigned to this data point; $\mu_k$ is the center (a sequence) of the k-th cluster; the function D measures the similarity of two sequences, i.e. their edit distance; α is the parameter of the DPMM that controls the number of clusters; and ψ is a normalization constant. The formula assumes that each individual cluster is modeled by a Gaussian, and gives the probability that the i-th sequence is assigned to the k-th cluster (k = 1, ..., K) or forms a new cluster of its own (k = K+1).
Applying the small-variance method, we let the variance σ of the Gaussian model tend to 0, and the above formula becomes:
It can be seen that, among these probabilities, only the term corresponding to the smallest element of $\{D^2(x_i, \mu_1), \ldots, D^2(x_i, \mu_K), \lambda\}$ receives a non-zero probability (equal to 1); all other terms are 0. We should therefore assign $x_i$ directly to the corresponding cluster. This is the main content of the DP-means algorithm; its pseudocode is shown in Fig. 1.
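The formulas referred to in this passage are not reproduced in this text. As a hedged restatement only, the limiting assignment rule described here, and the objective that DP-means decreases in the Euclidean setting of Kulis and Jordan (2012), can be written with the symbols defined above; carrying the objective over to edit distance is an assumption.

```latex
% Hard assignment in the small-variance limit: the candidate K+1 means "open a new cluster".
z_i = \arg\min_{k \in \{1, \dots, K+1\}} d_{ik},
\qquad
d_{ik} =
\begin{cases}
D^2(x_i, \mu_k), & k = 1, \dots, K, \\
\lambda,         & k = K + 1.
\end{cases}

% Objective (assumed analogue for edit distance): within-cluster cost plus a penalty per cluster.
\min_{K,\, z,\, \mu} \;\; \sum_{k=1}^{K} \sum_{i \,:\, z_i = k} D^2(x_i, \mu_k) \;+\; \lambda K
```

The penalty term λK is what lets the number of clusters be learned: opening a cluster is worthwhile only when it reduces the distance cost by more than λ.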
After LSH has partitioned the original data into blocks, each block can be clustered with DP-means. Moreover, because the clustering of each block is completely independent, the program is easily accelerated with parallel techniques. In the concrete implementation we use MPI and multithreading together in a hybrid parallel structure. With MPI, the program can run simultaneously on multiple machines of a cluster. On a single server, MPI requires inter-process communication whereas multithreading can use shared memory directly, so we use multithreading for multi-core parallelism within one server. This structure maximizes parallel efficiency and makes the algorithm run faster.
S4. Merge the clustering results produced by the groups in step S3, and run steps S2 and S3 iteratively until the overall clustering result converges
In this step, we gather all the clusters produced in the blocks in S3 and take the center sequence of each cluster as its representative; in the next iteration (LSH partitioning and DP-means clustering), only these center sequences are considered. In the implementation, the weights of all sequences in a cluster are concentrated on its center sequence, so that it represents the whole cluster in the next level of iteration. If two center sequences are clustered together at the next level, the clusters they represent are merged, and so on. Iteration continues until the result converges, that is, until the number of merges occurring across all blocks in one LSH round falls below a threshold. This iterative process is equivalent to mapping the data multiple times with multiple hash tables in the original LSH algorithm, which prevents nearby points from being separated because they receive different hash values under one particular mapping.
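The bookkeeping of one merge round can be sketched as follows. The data layout (lists of member ids, one weight per center, and groups of cluster indices returned by the next round) and the function name `merge_round` are assumptions made for illustration.

```python
def merge_round(clusters, weights, center_groups):
    """Merge clusters whose center sequences were clustered together in the next round.

    clusters:       list of lists of member ids, one list per current cluster
    weights:        total weight carried by each cluster's center sequence
    center_groups:  next-round result, each entry a list of cluster indices
    """
    merged_clusters, merged_weights = [], []
    for group in center_groups:
        merged_clusters.append([m for c in group for m in clusters[c]])
        merged_weights.append(sum(weights[c] for c in group))   # weight moves to the new center
    merges = sum(len(group) - 1 for group in center_groups)     # how many clusters disappeared
    return merged_clusters, merged_weights, merges


clusters = [[0, 1], [2], [3, 4, 5]]
weights = [5, 1, 7]
new_clusters, new_weights, merges = merge_round(clusters, weights, [[0, 1], [2]])
```

A driver loop would repeat LSH grouping, DP-means, and this merge until `merges` falls below a chosen threshold.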
S5. Finally, hash all remaining gene sequences by a longer k-mer (Big-Kmer Mapping), cluster once more the sequences that share an identical k-mer, and obtain the final clustering result
As the LSH iteration approaches convergence, very few clusters are merged in each iteration, and the efficiency of the algorithm suffers considerably. In this situation we therefore use a different strategy, Big-Kmer Mapping: all remaining sequences are hash-mapped by a longer k-mer, and sequences containing an identical k-mer are clustered once more to obtain the final clustering result. This big-k-mer mapping not only resolves the inefficiency near convergence, it also resolves the problem of excessive differences in sequence length: if two sequences differ greatly in length, then even if they come from the same species their k-mer count vectors differ greatly, so LSH will not assign them to the same group.
Embodiment 2: Cluster analysis of simulated 16S rRNA gene data sets
To compare the clustering results of the algorithm with the ground truth of the data sets, we carried out comparison experiments on simulated data sets. The simulated data sets were generated with the software Grinder, five groups in total; the parameters of each group are listed in Table 1 below:
Table 1. Generation parameters of the simulated data sets
Fig. 3-A visualizes the clustering result of the method of Embodiment 1 on Sim5. The area of each circle in Fig. 3-A is proportional to the cluster size (the number of sequences in the cluster). In each pair of circles, the left one represents the ground truth and the right one the clustering result of this algorithm; the overlapping area represents their intersection. The figure shows the comparison for the 60 largest clusters in Sim5 (accounting for 45% of the sequences in the data set), and the algorithm recovers nearly all of these clusters exactly.
Embodiment 3: Cluster analysis of the Taihu Lake microbial 16S rRNA metagenomic data set
This data set is a 16S rRNA metagenomic data set collected to study water pollution in Taihu Lake. It comprises 81 surface-water microbial samples collected in 9 different months of 2012. The whole data set contains 316,153,464 raw sequences of length 80 bp, with a file size of 30 GB.
Fig. 4-A shows the time the method of Embodiment 1 (DACE) needs to process data of various scales with different numbers of CPU cores. Using the message passing interface and multithreading, DACE makes effective use of computing resources and substantially reduces running time as the number of CPU cores increases. In particular, for larger data the running time decreases in proportion to the increase in CPU cores. The scalability of DACE is therefore high, making it well suited to ultra-large-scale data sets.
Comparative example 1
The method of Embodiment 1 (hereinafter DACE) was compared with three tools widely used in the field, CD-HIT, UCLUST, and CROP, on the same 16S rRNA gene data sets as in Embodiment 2 (Sim1 to Sim5). In the experimental results, CD-HIT-ACC denotes the accurate mode of CD-HIT. The results are shown in Fig. 3-B and Fig. 3-C. Fig. 3-B shows the clustering accuracy (normalized mutual information, NMI) of the 5 methods on each data set; a larger score means a more accurate clustering. DACE achieves the highest score on 4 of the 5 data sets (Sim1, Sim3, Sim4, Sim5). Fig. 3-C shows the deviation of the number of clusters (OTUs) predicted by the 5 methods from the ground truth; the results of DACE on Sim1, Sim3, and Sim4 are the closest to the ground truth. The simulation experiments show that the accuracy of the method of Embodiment 1 is generally better than that of the classical algorithms.
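For reference, NMI can be computed from two label assignments as sketched below. The normalization by the geometric mean of the two entropies is an assumption; the text does not state which NMI variant was used in the experiments.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred) -> float:
    """Normalized mutual information I(U;V) / sqrt(H(U) * H(V))."""
    n = len(labels_true)
    count_u = Counter(labels_true)
    count_v = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mutual = sum(c / n * math.log(c * n / (count_u[u] * count_v[v]))
                 for (u, v), c in joint.items())
    h_u = -sum(c / n * math.log(c / n) for c in count_u.values())
    h_v = -sum(c / n * math.log(c / n) for c in count_v.values())
    if h_u == 0.0 or h_v == 0.0:       # one side has a single cluster
        return 1.0 if h_u == h_v else 0.0
    return mutual / math.sqrt(h_u * h_v)


perfect = nmi([0, 0, 1, 1], ["a", "a", "b", "b"])     # same partition, different names
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])         # partitions share no information
```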
Comparative example 2
The method of Embodiment 1 (hereinafter DACE) was compared for clustering efficiency with two tools, CD-HIT and UCLUST, on the same Taihu Lake microbial 16S rRNA metagenomic data set as in Embodiment 3. CROP, mentioned in Comparative Example 1, is too slow and is therefore excluded from this comparison.
The results are shown in Fig. 4-B and Fig. 4-C. Fig. 4-B shows the running time of the 3 tools under different data scales and CPU configurations. Note that CD-HIT supports only multithreaded parallelism, so in our experimental environment it can use at most 20 CPU cores; UCLUST does not support parallel computation and therefore runs on a single core; DACE supports any number of CPU cores, and in our environment used at most 80 cores. The vertical axis of Fig. 4-B is logarithmic. On the complete data set (316M sequences), with the same number of cores (20), DACE is 6 times faster than CD-HIT and 80 times faster than CD-HIT-ACC. With 80 cores, DACE needs only 25 minutes to complete the clustering, whereas CD-HIT needs 5.25 hours and CD-HIT-ACC and UCLUST need about 68 hours. The running efficiency of the present algorithm is thus significantly better than that of the classical algorithms. Fig. 4-C shows the number of clusters (OTUs) estimated by the different methods. Because the Taihu Lake microbial data set has no ground-truth clustering, correctness cannot be established rigorously; however, clustering methods generally overestimate the number of clusters, and the number estimated by DACE is smaller than that of the other methods.
In Fig. 3-B, the horizontal axis Dataset denotes the simulated data set, the vertical axis NMI Score denotes clustering accuracy, and Program denotes the 5 methods compared. In Fig. 3-C, the vertical axis Bias of predicted OTU numbers denotes the deviation of each method's predicted number of clusters from the ground truth, and Program denotes the 5 methods compared. In Fig. 4-A, the horizontal axis #Cores denotes the number of CPU cores, the vertical axis Running time (minutes) denotes the software running time in minutes, and DataSize denotes the size of the data set used (1M = $10^6$ sequences). In Fig. 4-B, the horizontal axis #Sequences denotes the number of sequences in the data set, Running time (minutes) denotes the software running time in minutes, and Program denotes the software compared. In Fig. 4-C, the horizontal axis #Sequences denotes the number of sequences in the data set, Predicted OTU numbers (K) denotes the predicted number of clusters (in thousands), and App denotes the methods compared.
Although the invention has been described in detail above by way of a general description and specific embodiments, modifications or improvements can be made on the basis of the invention, as will be apparent to those skilled in the art. Such modifications or improvements, made without departing from the spirit of the invention, fall within the scope of protection claimed.
Claims (8)
1. An efficient clustering method based on locality-sensitive hashing and a nonparametric Bayesian method, characterized by comprising the following steps:
S1. removing all duplicate sequences from the data set, and converting each gene sequence into a k-mer count vector of dimension $4^K$;
S2. grouping the k-mer count vectors with an LSH algorithm for the L2-norm distance, so that similar sequences are assigned to the same group;
S3. clustering each group with the DP-means algorithm, using MPI and multithreading for parallel processing;
S4. merging the clustering results produced by the groups in step S3, and running steps S2 and S3 iteratively until the overall clustering result converges;
S5. finally, hashing all remaining gene sequences by a longer k-mer, clustering once more the sequences that share an identical k-mer, and obtaining the final clustering result.
2. The efficient clustering method according to claim 1, characterized in that step S2 comprises treating the L2 distance between k-mer count vectors as an approximation of sequence similarity, and dividing the original gene sequences into groups by applying the LSH mapping to the k-mer count vectors.
3. The efficient clustering method according to claim 1, characterized in that the DP-means algorithm in step S3 uses the pairwise edit distance between sequences as the measure of sequence similarity.
4. The efficient clustering method according to claim 1, characterized in that, in clustering each group with the DP-means algorithm in step S3, a sampling method is used to update the cluster centers: within each cluster, a subset is selected at random with probability proportional to sequence weight, and the exact cluster center of that subset is taken as the center of the whole cluster.
5. The efficient clustering method according to claim 1, characterized in that the iterative process in step S4 is equivalent to the step of the LSH algorithm that performs multiple mappings using multiple hash tables.
6. The efficient clustering method according to claim 1, characterized in that step S4 assumes that the DP-means algorithm always gives the correct clustering result within each group, and the total number of iterations required is then estimated from LSH theory.
7. The efficient clustering method according to claim 1, characterized in that the k-mer length in step S5 is chosen as 25% of the mean gene sequence length in the data set.
8. The efficient clustering method according to any one of claims 1 to 7, characterized in that the gene sequences include 16S rRNA and 18S rRNA.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610534138.9A CN106228035B (en) | 2016-07-07 | 2016-07-07 | Efficient clustering method based on locality-sensitive hashing and a nonparametric Bayesian method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610534138.9A CN106228035B (en) | 2016-07-07 | 2016-07-07 | Efficient clustering method based on locality-sensitive hashing and nonparametric Bayesian methods |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106228035A (en) | 2016-12-14 |
CN106228035B (en) | 2019-03-01 |
Family
ID=57519340
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610534138.9A Active CN106228035B (en) | 2016-07-07 | 2016-07-07 | Efficient clustering method based on locality-sensitive hashing and nonparametric Bayesian methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106228035B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1710558A (en) * | 2005-07-07 | 2005-12-21 | 复旦大学 | Gene chip expression profile data clustering method based on main cluster cutting |
CN103793438A (en) * | 2012-11-05 | 2014-05-14 | 山东省计算中心 | MapReduce based parallel clustering method |
CN103631928A (en) * | 2013-12-05 | 2014-03-12 | 中国科学院信息工程研究所 | LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system |
Non-Patent Citations (3)
Title |
---|
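王忠伟 et al.: "LSH-based k-nearest-neighbor search algorithm for high-dimensional big data", 《电子学报》 * |
赵跃华 et al.: "Research on family clustering methods for massive virus samples", 《计算机工程与应用》 * |
郑奇斌 et al.: "k-nearest-neighbor data imputation algorithm combined with locality-sensitive hashing", 《计算机应用》 * |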
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109378039A (en) * | 2018-08-20 | 2019-02-22 | 中国矿业大学 | Tumor gene expression profile data clustering method based on discrete constraint and capping norm |
CN109378039B (en) * | 2018-08-20 | 2022-02-25 | 中国矿业大学 | Tumor gene expression profile data clustering method based on discrete constraint and capping norm |
WO2021143016A1 (en) * | 2020-01-15 | 2021-07-22 | 平安科技(深圳)有限公司 | Approximate data processing method and apparatus, medium and electronic device |
CN113114454A (en) * | 2021-03-01 | 2021-07-13 | 暨南大学 | Efficient privacy outsourcing k-means clustering method |
CN113114454B (en) * | 2021-03-01 | 2022-11-29 | 暨南大学 | Efficient privacy outsourcing k-means clustering method |
CN114023389A (en) * | 2022-01-05 | 2022-02-08 | 成都齐碳科技有限公司 | Analysis method of metagenome data |
CN114420215A (en) * | 2022-03-28 | 2022-04-29 | 山东大学 | Large-scale biological data clustering method and system based on spanning tree |
CN114420215B (en) * | 2022-03-28 | 2022-09-16 | 山东大学 | Large-scale biological data clustering method and system based on spanning tree |
CN117573875A (en) * | 2023-12-05 | 2024-02-20 | 安芯网盾(北京)科技有限公司 | Method and device for optimizing homologous file clustering algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN106228035B (en) | 2019-03-01 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Rannala et al. | Species delimitation | |
CN106228035B (en) | Efficient clustering method based on locality-sensitive hashing and nonparametric Bayesian methods | |
Do et al. | ProbCons: Probabilistic consistency-based multiple sequence alignment | |
Wang et al. | On the complexity of multiple sequence alignment | |
Cai et al. | ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time | |
Bader et al. | Industrial applications of high-performance computing for phylogeny reconstruction | |
Moret et al. | High-performance algorithm engineering for computational phylogenetics | |
Liao et al. | A new unsupervised binning approach for metagenomic sequences based on n-grams and automatic feature weighting | |
Rasheed et al. | A map-reduce framework for clustering metagenomes | |
Warnow | Large-scale multiple sequence alignment and phylogeny estimation | |
US12062417B2 (en) | System, method and computer accessible-medium for multiplexing base calling and/or alignment | |
Heaps et al. | Bayesian modelling of compositional heterogeneity in molecular phylogenetics | |
Nawaz et al. | Using alignment-free and pattern mining methods for SARS-CoV-2 genome analysis | |
US20220208540A1 (en) | System for Identifying Structures of Molecular Compounds from Mass Spectrometry Data | |
Divakar et al. | Molecular phylogenetic and phylogenomic approaches in studies of lichen systematics and evolution | |
Saeed et al. | A high performance multiple sequence alignment system for pyrosequencing reads from multiple reference genomes | |
Wu | On biological validity indices for soft clustering algorithms for gene expression data | |
Pissis et al. | MoTeX: A word-based HPC tool for MoTif eXtraction | |
Elkhani et al. | Membrane computing to model feature selection of microarray cancer data | |
Pipes et al. | A rapid phylogeny-based method for accurate community profiling of large-scale metabarcoding datasets | |
Martin et al. | Machine learning substantiates biologically meaningful species delimitations in the phylogenetically complex North American box turtle genus Terrapene | |
Wiegert | Use Cases of Predictive Modeling for Phylogenetic Inference and Placements | |
Dey et al. | Biochemical property based positional matrix: A new approach towards genome sequence comparison | |
Ebrahimpour Boroojeny et al. | Theory of graph traversal edit distance, extensions, and applications | |
Zheng | Real-Time DNA Streams Processing on Mobile Devices | |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |