CN110298371A - The method and apparatus of data clusters - Google Patents
Method and apparatus for data clustering
- Publication number
- CN110298371A CN201810239450.4A
- Authority
- CN
- China
- Prior art keywords
- data
- cluster
- cluster radius
- clustered
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and apparatus for data clustering and relates to the field of computer technology. One specific embodiment of the method comprises: determining a minimum cluster radius set of a data set to be clustered; determining at least one cluster radius from the minimum cluster radius set; and clustering the data set to be clustered according to the at least one cluster radius. This embodiment can obtain multiple cluster radii, so that the data set can be divided into clusters of different densities based on those radii, which improves clustering accuracy and reduces computational complexity.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for data clustering.
Background art
In recent years, with the continuous innovation and improvement of information technology, users performing interactive data mining often apply different analysis techniques to carry out tasks such as clustering, matching, filtering, and visualization, so that they can make wise decisions. Clustering groups the data of the same cluster (or class) together as far as possible and separates the data of different clusters (or classes) as far as possible, and it is widely used in market analysis, information security, finance, entertainment, and other fields. How to cluster data accurately is therefore becoming increasingly important for interactive data mining in the big-data era.
The prior art generally clusters data with the K-MEANS algorithm or the single-density DBSCAN algorithm. The K-MEANS algorithm takes as input the number of clusters k and a database containing n data items, and outputs k clusters that satisfy a minimum-variance criterion. The single-density DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise, a representative density-based clustering algorithm) defines a cluster as a maximal set of density-connected points; it can divide regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a spatial data set that contains noise.
In the course of realizing the present invention, the inventor found at least the following problems in the prior art. First, the number of clusters for the K-MEANS algorithm must be given in advance, the initial cluster centers must be determined manually, and the algorithm cannot detect outliers automatically. Here an outlier class is a class that contains very few data points once clustering is complete, and the data points in an outlier class are called outliers. Second, the single-density DBSCAN algorithm clusters data based on a single radius; if the data set has multiple densities, the algorithm cannot accurately distinguish the varying densities, and it needs to compute Euclidean distances, which is computationally expensive.
Summary of the invention
In view of this, embodiments of the present invention provide a method and apparatus for data clustering that can improve clustering accuracy and reduce computational complexity.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for data clustering is provided.
A method for data clustering according to an embodiment of the present invention comprises: determining a minimum cluster radius set of a data set to be clustered; determining at least one cluster radius from the minimum cluster radius set; and clustering the data set to be clustered according to the at least one cluster radius.
Optionally, determining the minimum cluster radius set of the data set to be clustered comprises: for each data item in the data set to be clustered, selecting multiple reference data items from the data set to be clustered based on a data selection rule, then calculating the Manhattan distances between that data item and the reference data items, and calculating the mean of those Manhattan distances; and generating the minimum cluster radius set from the mean Manhattan distance corresponding to each data item.
Optionally, the data selection rule comprises at least one of a rule for selecting reference data by distance and a rule for selecting reference data by quantity. The rule for selecting reference data by distance comprises: for a data item a in data set A, selecting from data set A the data whose Manhattan distance to a is less than a preset distance value. The rule for selecting reference data by quantity comprises: for a data item a in data set A, sorting the other data in data set A by their Manhattan distance to a in ascending order, and then selecting a preset number of data items starting from the one with the smallest distance.
Optionally, determining at least one cluster radius from the minimum cluster radius set comprises: arranging the elements of the minimum cluster radius set in ascending or descending order; calculating the inflection points of the sorted minimum cluster radius set using an inflection point algorithm; and determining at least one cluster radius from the elements corresponding to the inflection points.
Optionally, clustering the data set to be clustered according to the at least one cluster radius comprises: taking the at least one cluster radius in ascending order, clustering a target data set based on each cluster radius to obtain the clusters and the noise data set corresponding to that cluster radius, wherein the target data set is the noise data set obtained from the previous clustering pass, and the target data set of the first pass is the data set to be clustered.
To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for data clustering is provided.
An apparatus for data clustering according to an embodiment of the present invention comprises: a first determining module for determining a minimum cluster radius set of a data set to be clustered; a second determining module for determining at least one cluster radius from the minimum cluster radius set; and a clustering module for clustering the data set to be clustered according to the at least one cluster radius.
Optionally, the first determining module is further configured to: for each data item in the data set to be clustered, select multiple reference data items from the data set to be clustered based on a data selection rule, then calculate the Manhattan distances between that data item and the reference data items, and calculate the mean of those Manhattan distances; and generate the minimum cluster radius set from the mean Manhattan distance corresponding to each data item.
Optionally, the data selection rule comprises at least one of a rule for selecting reference data by distance and a rule for selecting reference data by quantity. The rule for selecting reference data by distance comprises: for a data item a in data set A, selecting from data set A the data whose Manhattan distance to a is less than a preset distance value. The rule for selecting reference data by quantity comprises: for a data item a in data set A, sorting the other data in data set A by their Manhattan distance to a in ascending order, and then selecting a preset number of data items starting from the one with the smallest distance.
Optionally, the second determining module is further configured to: arrange the elements of the minimum cluster radius set in ascending or descending order; calculate the inflection points of the sorted minimum cluster radius set using an inflection point algorithm; and determine at least one cluster radius from the elements corresponding to the inflection points.
Optionally, the clustering module is further configured to: take the at least one cluster radius in ascending order and cluster a target data set based on each cluster radius to obtain the clusters and the noise data set corresponding to that cluster radius, wherein the target data set is the noise data set obtained from the previous clustering pass, and the target data set of the first pass is the data set to be clustered.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, an electronic device is provided.
An electronic device according to an embodiment of the present invention comprises: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data clustering method of the embodiments of the present invention.
To achieve the above object, according to a further aspect of the embodiments of the present invention, a computer-readable medium is provided.
A computer-readable medium according to an embodiment of the present invention stores a computer program which, when executed by a processor, implements the data clustering method of the embodiments of the present invention.
One embodiment of the above invention has the following advantages or beneficial effects. Multiple cluster radii can be obtained, so the data set can be divided into clusters of different densities based on those radii, which improves clustering accuracy and reduces computational complexity. In the embodiments of the present invention, reference data are selected for each data item, and the minimum cluster radius set of the data set is then determined by computing the Manhattan distances between each data item and its reference data, which greatly reduces the complexity of distance computation while preserving its accuracy. Reference data are selected from both the distance and the quantity perspectives, so the selection can be adapted to the actual situation, further improving the practicality and accuracy of the scheme. Cluster radii are chosen from the minimum cluster radius set using an inflection point algorithm, so that different cluster radii can be determined and the data set can be divided into clusters of different densities. The data set is clustered with the cluster radii in ascending order, so that clusters of very high density, medium density, and lower density are picked out of the data set in turn, further improving clustering accuracy.
Further effects of the above non-conventional optional implementations are explained below in connection with specific embodiments.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of a data set with different densities;
Fig. 2 is a schematic diagram of the main steps of the data clustering method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of how the data clustering method according to an embodiment of the present invention determines cluster radii;
Fig. 4 is a schematic diagram of the main flow of the data clustering method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the main modules of the data clustering apparatus according to an embodiment of the present invention;
Fig. 6 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;
Fig. 7 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Clustering means dividing a data set into different clusters (or classes) according to some specific criterion (such as a distance criterion or a similarity criterion, that is, the distance or similarity between data points), so that the similarity of user features within the same cluster (or class) is as large as possible, or their distance as small as possible, while the difference between user features in different clusters (or classes) is as large as possible. In the prior art, the basic steps of the K-MEANS algorithm for data clustering are: (1) arbitrarily select k objects from the n data objects as initial cluster centers; (2) based on the mean (center object) of each cluster, compute the distance between each object and these center objects, and reassign each object according to the minimum distance; (3) recompute the mean (center object) of each cluster that has changed; (4) compute a standard measure function; when a certain condition is met, for example when the function converges, the algorithm terminates; if the condition is not met, return to step (2).
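The four K-MEANS steps above can be sketched in a few lines of Python. This is a minimal illustration and not code from the patent: the function name `kmeans`, the `max_iter` and `seed` parameters, the use of squared Euclidean distance, and the choice of "centers stopped moving" as the convergence condition are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on tuples of numbers; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step (1): arbitrary initial centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # step (2): assign every object to its nearest center (squared Euclidean)
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # step (3): recompute the mean of each cluster
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])   # an empty cluster keeps its center
        # step (4): assumed stopping rule, stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Note that `k` has to be supplied by the caller and every point receives a cluster label, which matches the two limitations the text attributes to K-MEANS: the cluster count is fixed in advance and no point is ever reported as an outlier.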
Although the K-MEANS algorithm can cluster data, k must be determined in advance and noise points (that is, outliers) cannot be identified, which gave rise to the single-density DBSCAN algorithm. The DBSCAN algorithm requires two input parameters: the scan radius (eps) and the minimum number of included points (minPts). Its steps are as follows: start from an arbitrary unvisited point and find all nearby points within distance eps of it (eps inclusive); if the number of nearby points is not less than minPts, the current point and its nearby points form a cluster, and the starting point is marked as visited; then, recursively, every point in the cluster that is not yet marked as visited is processed in the same way, thereby expanding the cluster; if the number of nearby points is less than minPts, the point is provisionally marked as a noise point; once the cluster is fully expanded, that is, once all points in the cluster are marked as visited, the same procedure is applied to the remaining unvisited points.
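The DBSCAN procedure described above can be sketched as follows. This is an illustrative sketch rather than the patent's implementation: the names `dbscan`, `euclidean`, and `NOISE` are hypothetical, the recursion is written as a queue, and the neighbor count is assumed to include the point itself.

```python
from collections import deque

NOISE = -1

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def dbscan(points, eps, min_pts, dist=euclidean):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    n = len(points)
    labels = [None] * n                      # None means "not visited yet"

    def neighbors(i):
        # every point within eps of point i (eps inclusive, i itself included)
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE                # provisional, may become a border point
            continue
        labels[i] = cluster
        queue = deque(nbrs)
        while queue:                         # iterative form of the recursive expansion
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster          # noise reached from a dense point joins it
            if labels[j] is not None:
                continue                     # already visited
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)         # j is dense too, so keep expanding
        cluster += 1
    return labels
```

For example, `dbscan(points, eps=1.5, min_pts=3)` uses one radius for the whole data set, which is exactly the single-density behavior the following paragraphs identify as a limitation.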
Although the single-density DBSCAN algorithm needs no preset number of clusters and can detect noise points, it cannot accurately distinguish varying densities when the data set has multiple densities. For example, Fig. 1 is a schematic diagram of a data set with different densities. Three classes can be observed in Fig. 1, each with a different density: the leftmost class has a low density, the middle class a medium density, and the rightmost class a very high density. The prior-art data clustering methods cannot accurately distinguish these three classes: if the chosen radius is small, the data on the left may all be judged to be outliers; if the chosen radius is large, the two distinct higher-density classes are merged into one class, so clustering accuracy is low. Moreover, data collection is becoming ever easier, so databases keep growing in size and complexity; for data such as various types of trade transaction data, documents, and gene expression data, the dimensionality (number of attributes) can often reach hundreds or thousands, or even more. The present invention therefore provides a data clustering method that can divide a data set into clusters of different densities based on multiple cluster radii (that is, scan radii).
Fig. 2 is a schematic diagram of the main steps of the data clustering method according to an embodiment of the present invention. As shown in Fig. 2, the main steps of the data clustering method of the embodiment of the present invention may include:
Step S201: determine the minimum cluster radius set of the data set to be clustered. In the present invention, the data set to be clustered is obtained first, and its minimum cluster radius set is then determined from the data set.
Step S202: determine at least one cluster radius from the minimum cluster radius set. After the minimum cluster radius set is obtained in step S201, at least one cluster radius is selected from the set.
Step S203: cluster the data set to be clustered according to the at least one cluster radius. The data set is clustered according to the cluster radii obtained in step S202. In the present invention, DBSCAN clustering is performed on the data set based on multiple cluster radii.
In an embodiment of the present invention, determining the minimum cluster radius set of the data set to be clustered may comprise: for each data item in the data set to be clustered, selecting multiple reference data items from the data set based on a data selection rule (in the present invention, "multiple" means a preset number, which may be one; the specific value is determined by the size of the data set), then calculating the Manhattan distances between the data item and the reference data items, and calculating the mean of these Manhattan distances; and generating the minimum cluster radius set from the mean Manhattan distance corresponding to each data item. In the present invention, suppose the data set A to be clustered contains n data items. For each data item a, the m reference data items of a are determined first, then the Manhattan distance between a and each of the m reference data items is calculated, and the mean Manhattan distance d is obtained. Data set A thus has n mean Manhattan distances, which constitute the minimum cluster radius set D. Because the m reference data items of a are the m data items with the smallest Manhattan distance to a, the set formed by the mean Manhattan distances of all data items is called the minimum cluster radius set in the present invention. For j-dimensional data, the Manhattan distance between two data items x and y is computed as:
$$\operatorname{dist}(x, y) = \left( \sum_{i=1}^{j} \lvert x_i - y_i \rvert^{r} \right)^{1/r}$$
where r = 1.
If r in the above formula is set to 2, the formula gives the Euclidean distance. The Euclidean distance requires computing a sum of squares and a square root, so it is slower and more expensive to compute. The Manhattan distance requires only simple addition and subtraction of values, and its computational cost is much lower than that of the Euclidean distance, which greatly reduces computational overhead and improves clustering performance and speed.
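As a rough illustration of how the minimum cluster radius set D could be built, the sketch below computes, for each data item, the mean Manhattan distance to its m nearest other items. The function names `manhattan` and `min_cluster_radius_set` are hypothetical, and the brute-force pairwise scan is an assumption chosen for clarity, not a statement of how the patent implements the step.

```python
def manhattan(x, y):
    """Manhattan distance: sum of absolute coordinate differences (r = 1)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def min_cluster_radius_set(data, m):
    """For each item, the mean Manhattan distance to its m nearest other items.

    Assumes len(data) >= 2 and m >= 1.
    """
    radii = []
    for i, a in enumerate(data):
        dists = sorted(manhattan(a, b) for j, b in enumerate(data) if j != i)
        nearest = dists[:m]                  # the m reference items of a
        radii.append(sum(nearest) / len(nearest))
    return radii

data = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(min_cluster_radius_set(data, m=2))     # [1.0, 1.5, 1.5, 9.0]
```

The isolated point `(5, 5)` gets a much larger value than the three tightly packed points, which is the property the later inflection-point step relies on to separate density levels.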
In an embodiment of the present invention, the data selection rule may include at least one of a rule for selecting reference data by distance and a rule for selecting reference data by quantity. In the present invention, the pairwise Manhattan distances within data set A are calculated first. Selecting reference data by distance means that, for a data item a, the data whose Manhattan distance to a is less than a preset distance value are selected from data set A. Selecting reference data by quantity means that, for a data item a, the other data in data set A are sorted by their Manhattan distance to a in ascending order, and a preset number of reference data items are then selected starting from the one with the smallest distance. In the embodiments of the present invention, reference data are selected from both the distance and the quantity perspectives, so the selection can be adapted to the actual situation, further improving the practicality and accuracy of the scheme.
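The two selection rules can be written side by side as a short sketch. The helper names `refs_by_distance` and `refs_by_count` and their parameters are hypothetical and introduced only for illustration; both return indices into the data set rather than the data items themselves.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def refs_by_distance(data, i, max_dist):
    """Distance rule: other items whose Manhattan distance to data[i] is below max_dist."""
    return [j for j, b in enumerate(data)
            if j != i and manhattan(data[i], b) < max_dist]

def refs_by_count(data, i, count):
    """Quantity rule: the `count` other items closest to data[i], nearest first."""
    others = [j for j in range(len(data)) if j != i]
    others.sort(key=lambda j: manhattan(data[i], data[j]))
    return others[:count]

data = [(0, 0), (0, 1), (2, 0), (5, 5)]
print(refs_by_distance(data, 0, max_dist=3))   # [1, 2]
print(refs_by_count(data, 3, count=2))         # [2, 1]
```

One practical difference is that the distance rule may return no reference data at all for an isolated item, while the quantity rule always returns `count` items as long as the data set is large enough.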
In an embodiment of the present invention, determining at least one cluster radius from the minimum cluster radius set may comprise: arranging the elements of the minimum cluster radius set in ascending or descending order; calculating the inflection points of the sorted minimum cluster radius set using an inflection point algorithm; and determining at least one cluster radius from the elements corresponding to the inflection points. In the present invention, the elements of the minimum cluster radius set D are arranged from largest to smallest or from smallest to largest, the change points of the minimum distance are calculated by an inflection point calculation method, and multiple cluster radii are then selected for DBSCAN clustering. Mathematically, an inflection point is a point at which a curve changes between bending upward and bending downward; intuitively, it is a point where the tangent crosses the curve (that is, the boundary between the concave and convex parts of the curve). At an inflection point, the second derivative of the curve's function either changes sign (from positive to negative or from negative to positive) or does not exist. The inflection point algorithm takes the inflection points obtained by computing the second derivative as the "minimum distance change points".
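For a sorted list of values, the second-derivative test described above has a simple discrete analogue: look for sign changes in the second difference. The sketch below is one possible reading of the "inflection point algorithm", which the text does not specify in detail; the function name `inflection_indices` is hypothetical, and real distance curves would likely need smoothing before this test gives stable results.

```python
def inflection_indices(values):
    """Indices at which the discrete second difference of `values` changes sign.

    The reported index is the first position where the new sign appears.
    Zero second differences (perfectly straight stretches) are skipped.
    """
    # second[k] approximates the second derivative at index k + 1
    second = [values[i - 1] - 2 * values[i] + values[i + 1]
              for i in range(1, len(values) - 1)]
    hits = []
    prev = 0
    for k, s in enumerate(second):
        if s == 0:
            continue
        if prev != 0 and (s > 0) != (prev > 0):
            hits.append(k + 1)
        prev = s
    return hits

cubic = [x ** 3 for x in range(-3, 4)]   # bends downward, then upward
print(inflection_indices(cubic))         # [4]
```

The values of the sorted minimum cluster radius set at the returned indices would then serve as candidate cluster radii.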
Fig. 3 is a schematic diagram of how the data clustering method according to an embodiment of the present invention determines cluster radii. As shown in Fig. 3, taking two-dimensional data as an example, the left side of Fig. 3 shows the distribution of a data set of 1000 two-dimensional points. The minimum cluster radius set D is generated for this data set in the manner described above, and its elements are then arranged in descending order, producing the minimum cluster radius distribution shown on the right side of Fig. 3. The horizontal axis is the sequence number of the 1000 data points, and the vertical axis is the mean Manhattan distance from the corresponding data point to its 4 nearest reference data items (the number of reference data items is 4). The curve contains three relatively flat sections, and an inflection point appears before each flat section, so the first ordinate of each flat section can be taken as a cluster radius value. For this figure, the inflection point algorithm yields three cluster radius values: 1834, corresponding to the 65th data point; 586, corresponding to the 443rd data point; and 361, corresponding to the 590th data point.
In an embodiment of the present invention, clustering the data set to be clustered according to the at least one cluster radius may comprise: taking the at least one cluster radius in ascending order and clustering a target data set based on each cluster radius to obtain the clusters and the noise data set corresponding to that cluster radius. The target data set is the noise data set obtained from the previous clustering pass, and the target data set of the first pass is the data set to be clustered. In the present invention, suppose the data set to be clustered is A and three cluster radii d1, d2, and d3 have been obtained, with d1 < d2 < d3. First, DBSCAN clustering is performed on data set A with cluster radius d1 as the scan radius, yielding the clusters based on d1 and the noise data set c1. DBSCAN clustering is then performed on noise data set c1 with cluster radius d2, yielding the clusters based on d2 and the noise data set c2. Finally, DBSCAN clustering is performed on noise data set c2 with cluster radius d3, yielding the clusters based on d3 and the noise data set c3. This completes the clustering of the entire data set A based on three densities.
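The pass-by-pass procedure with d1 < d2 < d3 can be sketched as a loop that re-clusters only the noise left by the previous pass. This is an illustrative sketch under stated assumptions: the compact `dbscan` helper, the function `multi_radius_cluster`, the shared `min_pts` value for every pass, and the use of Manhattan distance inside DBSCAN are choices made for the example, since the text does not fix them.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def dbscan(points, eps, min_pts):
    """Compact DBSCAN; returns labels 0, 1, ... with -1 for noise."""
    n = len(points)
    labels = [None] * n

    def nbrs(i):
        return [j for j in range(n) if manhattan(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = nbrs(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = nbrs(j)
            if len(more) >= min_pts:
                seeds.extend(more)
        cluster += 1
    return labels

def multi_radius_cluster(points, radii, min_pts):
    """Cluster with each radius in ascending order, re-clustering only the noise."""
    final = [-1] * len(points)
    remaining = list(range(len(points)))      # indices still treated as noise
    next_id = 0
    for eps in sorted(radii):
        if not remaining:
            break
        subset = [points[i] for i in remaining]   # target data set of this pass
        labels = dbscan(subset, eps, min_pts)
        for idx, lab in zip(remaining, labels):
            if lab != -1:
                final[idx] = next_id + lab
        next_id += max(labels) + 1                # number of clusters found this pass
        remaining = [idx for idx, lab in zip(remaining, labels) if lab == -1]
    return final
```

Points still labeled `-1` after the last radius correspond to the final noise data set (c3 in the three-radius example above).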
Fig. 4 is a schematic diagram of the main flow of the data clustering method according to an embodiment of the present invention. As shown in Fig. 4, the main flow of the data clustering method of the embodiment of the present invention may include: step S401, obtain the data set A to be clustered; step S402, for each data item a in the data set to be clustered, select the m reference data items corresponding to a and calculate the m Manhattan distances between a and its m reference data items, so as to obtain the mean d of the m Manhattan distances; step S403, generate the minimum cluster radius set D from the mean Manhattan distance d corresponding to each data item; step S404, arrange the elements of the minimum cluster radius set D in ascending or descending order, so as to obtain the inflection points of the minimum cluster radius set D; step S405, obtain the cluster radii from the elements corresponding to the inflection points; step S406, taking the cluster radii in ascending order, cluster the target data set based on each cluster radius to obtain the clusters and the noise data set corresponding to that cluster radius. The target data set in step S406 is the noise data set obtained from the previous clustering pass, and the target data set of the first pass is the data set A to be clustered.
As can be seen from the technical solution for data clustering according to the embodiments of the present invention, multiple cluster radii can be obtained, so the data set can be divided into clusters of different densities based on those radii, which improves clustering accuracy and reduces computational complexity. In the embodiments of the present invention, reference data are selected for each data item, and the minimum cluster radius set of the data set is then determined by computing the Manhattan distances between each data item and its reference data, which greatly reduces the complexity of distance computation while preserving its accuracy. Reference data are selected from both the distance and the quantity perspectives, so the selection can be adapted to the actual situation, further improving the practicality and accuracy of the scheme. Cluster radii are chosen from the minimum cluster radius set using an inflection point algorithm, so that different cluster radii can be determined and the data set can be divided into clusters of different densities. The data set is clustered with the cluster radii in ascending order, so that clusters of very high density, medium density, and lower density are picked out of the data set in turn, further improving clustering accuracy.
Fig. 5 is a schematic diagram of the main modules of a data clustering device according to an embodiment of the present invention. As shown in Fig. 5, the data clustering device 500 of the embodiment of the present invention mainly comprises the following modules: a first determining module 501, a second determining module 502, and a clustering module 503.
The first determining module 501 can be used to obtain the minimum cluster radius set of the data set to be clustered. The second determining module 502 can be used to determine at least one cluster radius from the minimum cluster radius set. The clustering module 503 can be used to cluster the data set to be clustered according to the at least one cluster radius.
In the embodiment of the present invention, the first determining module 501 can also be used to: for each data item in the data set to be clustered, select multiple reference data from the data set to be clustered based on a data selection rule, then calculate the Manhattan distances between the data item and the multiple reference data, and calculate the mean of those Manhattan distances; and generate the minimum cluster radius set from the mean Manhattan distance corresponding to each data item.
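The averaging step can be written compactly as follows. This is only a notational sketch: the symbols D, R(a), r(a), and m are introduced here for illustration and do not appear in the original text, and each data item is assumed to be a vector of m numeric coordinates.

```latex
% Manhattan distance between a data item a and a reference item b
D(a, b) = \sum_{k=1}^{m} \lvert a_k - b_k \rvert

% Mean Manhattan distance from a to its selected reference data R(a)
r(a) = \frac{1}{\lvert R(a) \rvert} \sum_{b \in R(a)} D(a, b)

% Minimum cluster radius set of the data set A to be clustered
\{\, r(a) : a \in A \,\}
```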
In the embodiment of the present invention, the data selection rule may include at least one of: a rule for selecting reference data by distance, and a rule for selecting reference data by quantity.
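The two selection rules, and how they feed the minimum cluster radius set, might look like the sketch below. It follows the rule definitions given in claim 3 (distance below a preset value; a preset number of nearest items). The function names, the exclusion of the item itself under the distance rule, and the skipping of items with no reference data are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]


def manhattan(p: Point, q: Point) -> float:
    """Manhattan (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def select_by_distance(data: List[Point], i: int, max_dist: float) -> List[Point]:
    """Distance rule: other items closer to data[i] than a preset distance."""
    return [q for j, q in enumerate(data)
            if j != i and manhattan(data[i], q) < max_dist]


def select_by_count(data: List[Point], i: int, k: int) -> List[Point]:
    """Quantity rule: the k other items nearest to data[i]."""
    others = [q for j, q in enumerate(data) if j != i]
    others.sort(key=lambda q: manhattan(data[i], q))  # ascending distance
    return others[:k]


def min_cluster_radius_set(data: List[Point],
                           select: Callable[[List[Point], int], List[Point]]
                           ) -> List[float]:
    """Mean Manhattan distance from each item to its selected reference data."""
    radii = []
    for i, p in enumerate(data):
        refs = select(data, i)
        if refs:  # assumption: items without reference data contribute nothing
            radii.append(sum(manhattan(p, q) for q in refs) / len(refs))
    return radii
```

For example, `min_cluster_radius_set(data, lambda d, i: select_by_count(d, i, 4))` would use the four nearest items as reference data for every item.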
In the embodiment of the present invention, the second determining module 502 can also be used to: arrange the elements of the minimum cluster radius set in ascending or descending order; calculate the inflection points of the sorted minimum cluster radius set using an inflection point algorithm; and determine at least one cluster radius according to the elements corresponding to the inflection points.
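The text does not specify which inflection point algorithm is used. As a purely illustrative stand-in, the sketch below sorts the set in ascending order and treats a sharp relative jump between consecutive elements as an inflection. The function name `radii_from_inflections` and the parameter `jump_ratio` are hypothetical.

```python
from typing import List


def radii_from_inflections(min_radii: List[float],
                           jump_ratio: float = 2.0) -> List[float]:
    """Pick cluster radii at sharp jumps in the sorted minimum-radius values."""
    ordered = sorted(min_radii)  # ascending order
    if not ordered:
        return []
    radii = []
    for cur, nxt in zip(ordered, ordered[1:]):
        # Heuristic (an assumption): a jump by more than jump_ratio marks an
        # inflection; the element just before the jump is kept as a radius.
        if cur > 0 and nxt > jump_ratio * cur:
            radii.append(cur)
    # Fall back to the largest value so that at least one radius is returned.
    return radii or [ordered[-1]]
```

Under this heuristic, each plateau in the sorted values corresponds to one density level, and the last element of a plateau is taken as the radius for that level.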
In the embodiment of the present invention, the clustering module 503 can also be used to: in ascending order of the at least one cluster radius, cluster a target data set based on each cluster radius to obtain the clusters corresponding to that cluster radius and a noise data set, wherein the target data set is the noise data set obtained from the previous round of clustering, and the target data set for the first round of clustering is the data set to be clustered.
From the above description it can be seen that multiple cluster radii can be obtained, so that the data set can be divided into clusters of different densities based on those radii, which improves clustering accuracy and reduces computational complexity. In the embodiment of the present invention, corresponding reference data are selected for each data item, and the minimum cluster radius set of the data set is then determined by calculating the Manhattan distances between the data item and its reference data, so that the complexity of distance calculation is greatly reduced while the accuracy of the distance calculation is preserved. In the embodiment of the present invention, reference data are selected from several perspectives, namely distance and quantity, so that the selection can be adapted to the actual situation, further improving the practicability and accuracy of the solution. In the embodiment of the present invention, cluster radii are chosen from the minimum cluster radius set using an inflection point algorithm, so that different cluster radii can be determined and the data set can in turn be divided into clusters of different densities. In the embodiment of the present invention, the data set is clustered in ascending order of cluster radius, so that clusters of high density, medium density, and low density can be extracted from the data set in turn, further improving clustering accuracy.
Fig. 6 shows an exemplary system architecture 600 to which the data clustering method or the data clustering device of the embodiment of the present invention can be applied.
As shown in Fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves as the medium that provides communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user can use the terminal devices 601, 602, 603 to interact with the server 605 through the network 604, to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 601, 602, 603, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (examples only).
The terminal devices 601, 602, 603 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.
The server 605 can be a server that provides various services, for example a back-end management server (example only) that supports shopping websites browsed by users with the terminal devices 601, 602, 603. The back-end management server can analyze and otherwise process received data such as information query requests, and feed the processing results (for example, targeted push information or product information; examples only) back to the terminal devices.
It should be noted that the data clustering method provided by the embodiment of the present invention is generally executed by the server 605; correspondingly, the data clustering device is generally located in the server 605.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 6 are merely illustrative. There can be any number of terminal devices, networks, and servers, as required by the implementation.
Referring now to Fig. 7, it shows a schematic structural diagram of a computer system 700 suitable for implementing a terminal device of the embodiment of the present invention. The terminal device shown in Fig. 7 is only an example and should not impose any limitation on the function or scope of use of the embodiment of the present invention.
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in read-only memory (ROM) 702 or a program loaded from a storage section 708 into random access memory (RAM) 703. The RAM 703 also stores the various programs and data required for the operation of the system 700. The CPU 701, ROM 702, and RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT) or liquid crystal display (LCD), a loudspeaker, and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage section 708 as needed.
In particular, according to the embodiments disclosed by the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments disclosed by the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above-mentioned functions defined in the system of the present invention are performed.
It should be noted that the computer-readable medium of the present invention can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media can include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. Also in the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to: wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each box in a flowchart or block diagram can represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can occur in an order different from that marked in the drawings. For example, two boxes shown in succession can actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flowchart, and any combination of boxes in a block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiment of the present invention can be implemented in software or in hardware. The described modules can also be provided in a processor; for example, it can be described as: a processor comprising a first determining module, a second determining module, and a clustering module. The names of these modules do not, under certain circumstances, limit the modules themselves; for example, the first determining module can also be described as "a module that determines the minimum cluster radius set of the data set to be clustered".
In another aspect, the present invention also provides a computer-readable medium, which can be included in the device described in the above embodiment, or can exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: determine the minimum cluster radius set of the data set to be clustered; determine at least one cluster radius from the minimum cluster radius set; and cluster the data set to be clustered according to the at least one cluster radius.
According to the technical solution of the embodiment of the present invention, multiple cluster radii can be obtained, so that the data set can be divided into clusters of different densities based on those radii, which improves clustering accuracy and reduces computational complexity. In the embodiment of the present invention, corresponding reference data are selected for each data item, and the minimum cluster radius set of the data set is then determined by calculating the Manhattan distances between the data item and its reference data, so that the complexity of distance calculation is greatly reduced while the accuracy of the distance calculation is preserved. In the embodiment of the present invention, reference data are selected from several perspectives, namely distance and quantity, so that the selection can be adapted to the actual situation, further improving the practicability and accuracy of the solution. In the embodiment of the present invention, cluster radii are chosen from the minimum cluster radius set using an inflection point algorithm, so that different cluster radii can be determined and the data set can in turn be divided into clusters of different densities. In the embodiment of the present invention, the data set is clustered in ascending order of cluster radius, so that clusters of high density, medium density, and low density can be extracted from the data set in turn, further improving clustering accuracy.
The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can occur depending on design requirements and other factors. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (12)
1. A data clustering method, characterized by comprising:
determining a minimum cluster radius set of a data set to be clustered;
determining at least one cluster radius from the minimum cluster radius set; and
clustering the data set to be clustered according to the at least one cluster radius.
2. The method according to claim 1, characterized in that obtaining the minimum cluster radius set of the data set to be clustered comprises:
for each data item in the data set to be clustered, selecting multiple reference data from the data set to be clustered based on a data selection rule, then calculating multiple Manhattan distances between the data item and the multiple reference data, and calculating the mean of the multiple Manhattan distances; and
generating the minimum cluster radius set according to the mean Manhattan distance corresponding to each data item.
3. The method according to claim 2, characterized in that the data selection rule comprises at least one of a rule for selecting reference data by distance and a rule for selecting reference data by quantity, wherein the rule for selecting reference data by distance comprises: for a data item a in a data set A, selecting from the data set A the data whose Manhattan distance to the data item a is less than a preset distance value; and
the rule for selecting reference data by quantity comprises: for a data item a in the data set A, arranging the Manhattan distances between the other data in the data set A and the data item a in ascending order, and then selecting a preset number of data, starting from the other data corresponding to the minimum distance.
4. The method according to claim 1, characterized in that determining at least one cluster radius from the minimum cluster radius set comprises:
arranging the elements of the minimum cluster radius set in ascending or descending order;
calculating the inflection points of the sorted minimum cluster radius set using an inflection point algorithm; and
determining at least one cluster radius according to the elements corresponding to the inflection points.
5. The method according to claim 1, characterized in that clustering the data set to be clustered according to the at least one cluster radius comprises:
in ascending order of the at least one cluster radius, clustering a target data set based on each cluster radius to obtain the clusters corresponding to that cluster radius and a noise data set, wherein the target data set is the noise data set obtained from the previous round of clustering, and the target data set for the first round of clustering is the data set to be clustered.
6. A data clustering device, characterized by comprising:
a first determining module for determining a minimum cluster radius set of a data set to be clustered;
a second determining module for determining at least one cluster radius from the minimum cluster radius set; and
a clustering module for clustering the data set to be clustered according to the at least one cluster radius.
7. The device according to claim 6, characterized in that the first determining module is further used to:
for each data item in the data set to be clustered, select multiple reference data from the data set to be clustered based on a data selection rule, then calculate multiple Manhattan distances between the data item and the multiple reference data, and calculate the mean of the multiple Manhattan distances; and
generate the minimum cluster radius set according to the mean Manhattan distance corresponding to each data item.
8. The device according to claim 7, characterized in that the data selection rule comprises at least one of a rule for selecting reference data by distance and a rule for selecting reference data by quantity, wherein the rule for selecting reference data by distance comprises: for a data item a in a data set A, selecting from the data set A the data whose Manhattan distance to the data item a is less than a preset distance value; and
the rule for selecting reference data by quantity comprises: for a data item a in the data set A, arranging the Manhattan distances between the other data in the data set A and the data item a in ascending order, and then selecting a preset number of data, starting from the other data corresponding to the minimum distance.
9. The device according to claim 6, characterized in that the second determining module is further used to:
arrange the elements of the minimum cluster radius set in ascending or descending order;
calculate the inflection points of the sorted minimum cluster radius set using an inflection point algorithm; and
determine at least one cluster radius according to the elements corresponding to the inflection points.
10. The device according to claim 6, characterized in that the clustering module is further used to:
in ascending order of the at least one cluster radius, cluster a target data set based on each cluster radius to obtain the clusters corresponding to that cluster radius and a noise data set, wherein the target data set is the noise data set obtained from the previous round of clustering, and the target data set for the first round of clustering is the data set to be clustered.
11. An electronic device, characterized by comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 5.
12. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810239450.4A CN110298371A (en) | 2018-03-22 | 2018-03-22 | The method and apparatus of data clusters |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810239450.4A CN110298371A (en) | 2018-03-22 | 2018-03-22 | The method and apparatus of data clusters |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110298371A true CN110298371A (en) | 2019-10-01 |
Family
ID=68025586
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810239450.4A Pending CN110298371A (en) | 2018-03-22 | 2018-03-22 | The method and apparatus of data clusters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110298371A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2003136467A (en) * | 2003-12-16 | 2005-05-27 | Открытое акционерное общество "Научно-производственное предпри тие "Радар ммс" (RU) | METHOD FOR AUTOMATIC CLUSTERING OBJECTS |
TW201109949A (en) * | 2009-09-01 | 2011-03-16 | Univ Nat Pingtung Sci & Tech | Density-based data clustering method |
US20140334739A1 (en) * | 2013-05-08 | 2014-11-13 | Xyratex Technology Limited | Methods of clustering computational event logs |
CN105913077A (en) * | 2016-04-07 | 2016-08-31 | 华北电力大学(保定) | Data clustering method based on dimensionality reduction and sampling |
CN107688955A (en) * | 2016-08-03 | 2018-02-13 | 浙江工业大学 | A kind of city commercial circle group variety division methods based on adaptive DBSCAN Density Clusterings |
CN106503086A (en) * | 2016-10-11 | 2017-03-15 | 成都云麒麟软件有限公司 | The detection method of distributed local outlier |
CN106709503A (en) * | 2016-11-23 | 2017-05-24 | 广西中烟工业有限责任公司 | Large spatial data clustering algorithm K-DBSCAN based on density |
Non-Patent Citations (3)
Title |
---|
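储岳中 et al.: "Research on optimizing the dynamic nearest-neighbor clustering algorithm", Computer Engineering and Design (计算机工程与设计), vol. 32, no. 05, pages 1687 - 1690 * |
温海波: "Application of an improved clustering algorithm to DRDoS attack detection", Journal of Anhui Jianzhu University (安徽建筑大学学报), vol. 25, no. 01, 15 February 2017 (2017-02-15), pages 70 - 75 * |
罗维佳 et al.: "A k-medoids clustering algorithm for LBSN", Journal of University of Science and Technology of China (中国科学技术大学学报), vol. 47, no. 01, pages 70 - 79 * |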
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113033584A (en) * | 2019-12-09 | 2021-06-25 | Oppo广东移动通信有限公司 | Data processing method and related equipment |
CN113033584B (en) * | 2019-12-09 | 2023-07-07 | Oppo广东移动通信有限公司 | Data processing method and related equipment |
CN112215287A (en) * | 2020-10-13 | 2021-01-12 | 中国光大银行股份有限公司 | Distance-based multi-section clustering method and device, storage medium and electronic device |
CN112215287B (en) * | 2020-10-13 | 2024-04-12 | 中国光大银行股份有限公司 | Multi-section clustering method and device based on distance, storage medium and electronic device |
CN116204800A (en) * | 2022-11-30 | 2023-06-02 | 北京码牛科技股份有限公司 | Controllable clustering method, system, terminal and storage medium for position point division |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110019211A (en) | The methods, devices and systems of association index | |
CN107305637B (en) | Data clustering method and device based on K-Means algorithm | |
CN109697641A (en) | The method and apparatus for calculating commodity similarity | |
CN108764319A (en) | A kind of sample classification method and apparatus | |
CN107908616B (en) | Method and device for predicting trend words | |
CN109614402A (en) | Multidimensional data query method and device | |
CN110298371A (en) | The method and apparatus of data clusters | |
CN109002925A (en) | Traffic prediction method and apparatus | |
CN107480205A (en) | A kind of method and apparatus for carrying out data partition | |
CN110362815A (en) | Text vector generation method and device | |
CN110020312A (en) | The method and apparatus for extracting Web page text | |
CN110389873A (en) | A kind of method and apparatus of determining server resource service condition | |
CN109903105A (en) | A kind of method and apparatus for improving end article attribute | |
CN107392259A (en) | The method and apparatus for building unbalanced sample classification model | |
CN110019367A (en) | A kind of method and apparatus of statistical data feature | |
US20130151519A1 (en) | Ranking Programs in a Marketplace System | |
CN109785072A (en) | Method and apparatus for generating information | |
CN110443264A (en) | A kind of method and apparatus of cluster | |
CN110263791A (en) | A kind of method and apparatus in identification function area | |
CN110309142A (en) | The method and apparatus of regulation management | |
CN110503117A (en) | The method and apparatus of data clusters | |
US20210209122A1 (en) | Information push method and apparatus, device, and storage medium | |
CN112418258A (en) | Feature discretization method and device | |
CN109754273A (en) | The method and apparatus for promoting any active ues quantity | |
CN110019802A (en) | A kind of method and apparatus of text cluster |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||