CN110298371A - The method and apparatus of data clusters - Google Patents

Publication number
CN110298371A
Authority
CN
China
Prior art keywords
data
cluster
cluster radius
clustered
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810239450.4A
Other languages
Chinese (zh)
Inventor
李树海
王硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201810239450.4A
Publication of CN110298371A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and an apparatus for data clustering and relates to the field of computer technology. One specific embodiment of the method comprises: determining a minimum cluster radius set of a data set to be clustered; determining at least one cluster radius from the minimum cluster radius set; and clustering the data set to be clustered according to the at least one cluster radius. The embodiment can obtain multiple cluster radii and, based on them, divide the data set into clusters of different densities, which improves clustering accuracy and reduces computational complexity.

Description

Method and apparatus for data clustering
Technical field
The present invention relates to the field of computer technology, and in particular to a method and an apparatus for data clustering.
Background art
In recent years, as information technology has continued to innovate and improve, users performing interactive data mining often apply different analysis techniques, such as clustering, matching, filtering and visualization, to carry out various tasks and make sound decisions. Clustering brings the data of the same cluster (or class) together as far as possible and separates the data of different clusters (or classes) as far as possible. It is widely used in market analysis, information security, finance, entertainment and other fields, so clustering data accurately is becoming increasingly important for interactive data mining in the big data era.
The prior art generally clusters data with the K-MEANS algorithm or with the single-density DBSCAN algorithm. The K-MEANS algorithm takes as input the number of clusters k and a database containing n data items, and outputs k clusters that satisfy a minimum-variance criterion. The single-density DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise, a representative density-based clustering algorithm) defines a cluster as a maximal set of density-connected points. It can divide regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a spatial data set that contains noise.
In the course of making the present invention, the inventors found at least the following problems in the prior art. First, the K-MEANS algorithm requires the number of clusters to be given in advance and the initial cluster centers to be chosen manually, and it cannot detect outliers automatically. Here, an outlier class is a class that contains very few data points once clustering is complete, and the data points in an outlier class are called outliers. Second, the single-density DBSCAN algorithm clusters data with a single radius, so if the data set has several densities the algorithm cannot accurately distinguish the varying densities. The algorithm also needs to compute Euclidean distances, which is computationally expensive.
Summary of the invention
In view of this, embodiments of the present invention provide a method and an apparatus for data clustering that can improve clustering accuracy and reduce computational complexity.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for data clustering is provided.
A method for data clustering according to an embodiment of the present invention comprises: determining a minimum cluster radius set of a data set to be clustered; determining at least one cluster radius from the minimum cluster radius set; and clustering the data set to be clustered according to the at least one cluster radius.
Optionally, determining the minimum cluster radius set of the data set to be clustered comprises: for each data item in the data set to be clustered, selecting multiple reference data items from the data set to be clustered based on a data selection rule, then calculating the Manhattan distances between the data item and the multiple reference data items, and calculating the mean of those Manhattan distances; and generating the minimum cluster radius set from the mean Manhattan distance corresponding to each data item.
Optionally, the data selection rule comprises at least one of a rule for selecting reference data by distance and a rule for selecting reference data by quantity. The rule for selecting reference data by distance comprises: for a data item a in a data set A, selecting from the data set A the data whose Manhattan distance to the data item a is less than a preset distance value. The rule for selecting reference data by quantity comprises: for a data item a in a data set A, sorting the other data in the data set A by their Manhattan distance to the data item a in ascending order, and then selecting a preset number of data items starting from the one with the smallest distance.
Optionally, determining at least one cluster radius from the minimum cluster radius set comprises: sorting the elements of the minimum cluster radius set in ascending or descending order; calculating the inflection points of the sorted minimum cluster radius set with an inflection point algorithm; and determining at least one cluster radius from the elements corresponding to the inflection points.
Optionally, clustering the data set to be clustered according to the at least one cluster radius comprises: taking the at least one cluster radius in ascending order, clustering a target data set based on each cluster radius to obtain the clusters corresponding to that cluster radius and a noise data set, wherein the target data set is the noise data set obtained from the previous clustering pass, and the target data set of the first clustering pass is the data set to be clustered.
To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for data clustering is provided.
An apparatus for data clustering according to an embodiment of the present invention comprises: a first determining module for determining a minimum cluster radius set of a data set to be clustered; a second determining module for determining at least one cluster radius from the minimum cluster radius set; and a clustering module for clustering the data set to be clustered according to the at least one cluster radius.
Optionally, the first determining module is further configured to: for each data item in the data set to be clustered, select multiple reference data items from the data set to be clustered based on a data selection rule, then calculate the Manhattan distances between the data item and the multiple reference data items, and calculate the mean of those Manhattan distances; and generate the minimum cluster radius set from the mean Manhattan distance corresponding to each data item.
Optionally, the data selection rule comprises at least one of a rule for selecting reference data by distance and a rule for selecting reference data by quantity. The rule for selecting reference data by distance comprises: for a data item a in a data set A, selecting from the data set A the data whose Manhattan distance to the data item a is less than a preset distance value. The rule for selecting reference data by quantity comprises: for a data item a in a data set A, sorting the other data in the data set A by their Manhattan distance to the data item a in ascending order, and then selecting a preset number of data items starting from the one with the smallest distance.
Optionally, the second determining module is further configured to: sort the elements of the minimum cluster radius set in ascending or descending order; calculate the inflection points of the sorted minimum cluster radius set with an inflection point algorithm; and determine at least one cluster radius from the elements corresponding to the inflection points.
Optionally, the clustering module is further configured to: take the at least one cluster radius in ascending order and cluster a target data set based on each cluster radius to obtain the clusters corresponding to that cluster radius and a noise data set, wherein the target data set is the noise data set obtained from the previous clustering pass, and the target data set of the first clustering pass is the data set to be clustered.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, an electronic device is provided.
An electronic device according to an embodiment of the present invention comprises: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for data clustering of the embodiments of the present invention.
To achieve the above object, according to a further aspect of the embodiments of the present invention, a computer-readable medium is provided.
A computer-readable medium according to an embodiment of the present invention stores a computer program which, when executed by a processor, implements the method for data clustering of the embodiments of the present invention.
An embodiment of the above invention has the following advantages or beneficial effects: multiple cluster radii can be obtained, so that the data set can be divided into clusters of different densities based on them, which improves clustering accuracy and reduces computational complexity; the reference data corresponding to each data item are selected, and the minimum cluster radius set of the data set is then determined by calculating the Manhattan distances between each data item and its reference data, which greatly reduces the complexity of the distance calculation while preserving its accuracy; reference data are selected from both the distance and the quantity perspectives, so the selection can be adapted to the actual situation, further improving the practicality and accuracy of the scheme; cluster radii are chosen from the minimum cluster radius set with an inflection point algorithm, so different cluster radii can be determined and the data set can be divided into clusters of different densities; and the data set is clustered with the cluster radii in ascending order, so that high-density, medium-density and low-density clusters can be picked out of the data set in turn, further improving clustering accuracy.
Further effects of the above non-conventional optional implementations are described below in connection with specific embodiments.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of a data set with different densities;
Fig. 2 is a schematic diagram of the main steps of the data clustering method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of how the data clustering method according to an embodiment of the present invention determines cluster radii;
Fig. 4 is a schematic diagram of the main flow of the data clustering method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the main modules of the data clustering apparatus according to an embodiment of the present invention;
Fig. 6 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;
Fig. 7 is a structural schematic diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the drawings. They include various details of the embodiments to aid understanding, and these details should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
Clustering means dividing a data set into different clusters (or classes) according to a specific criterion (such as a distance criterion or a similarity criterion, that is, the distance or similarity between data points), so that user features within the same cluster (or class) are as similar as possible, or as close together as possible, while user features in different clusters (or classes) differ as much as possible. The basic steps of the prior-art K-MEANS algorithm for data clustering are: (1) arbitrarily select k objects from the n data objects as initial cluster centers; (2) compute the distance between each object and the mean (center object) of each cluster, and reassign each object to the cluster whose center is nearest; (3) recompute the mean (center object) of every cluster that has changed; (4) compute the criterion function; if a given condition is met, for example the function has converged, the algorithm terminates; otherwise return to step (2).
Although the K-MEANS algorithm can cluster data, k must be determined in advance and noise points (that is, outliers) cannot be identified, which is why the single-density DBSCAN algorithm emerged. DBSCAN requires two input parameters: the scan radius (eps) and the minimum number of points (minPts). Its steps are as follows: start from an arbitrary unvisited point and find all nearby points whose distance to it is within eps (inclusive). If the number of nearby points is not less than minPts, the current point and its nearby points form a cluster, and the starting point is marked as visited; then, recursively, every point in the cluster that is not yet marked as visited is processed in the same way, which expands the cluster. If the number of nearby points is less than minPts, the point is provisionally marked as a noise point. Once the cluster is fully expanded, that is, all points in the cluster are marked as visited, the same procedure is applied to the remaining unvisited points.
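The DBSCAN steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the patent: points are assumed to be tuples of numbers, the Euclidean distance that the background section attributes to the prior-art algorithm is used, and the recursion is replaced by an explicit work list. The names `dbscan`, `neighbors` and `NOISE` are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8

NOISE = -1  # label for points that end up in no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not yet visited"

    def neighbors(i):
        # Indices of all points within eps of point i (inclusive, includes i).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # provisional: may still join a cluster later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point, not expanded further
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            nearby = neighbors(j)
            if len(nearby) >= min_pts:
                queue.extend(nearby)  # j is dense enough to expand the cluster
    return labels
```

The neighbor search here scans every point, so the sketch runs in quadratic time; it is meant to show the control flow, not to scale.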
Although the single-density DBSCAN algorithm does not need a preset number of clusters and can detect noise points, it cannot accurately distinguish varying densities when the data set has several densities. For example, Fig. 1 is a schematic diagram of a data set with different densities. Three classes can be observed in Fig. 1, each with a different density: the leftmost class has a low density, the middle class a medium density, and the rightmost class a very high density. The prior-art clustering methods cannot accurately separate these three classes: if a small radius is chosen, all the data on the left may be judged to be outliers; if a large radius is chosen, data from two different higher-density classes are merged into one class, which lowers clustering accuracy. Moreover, data collection is becoming ever easier, so databases keep growing in size and complexity. For data such as various types of trade transaction data, documents and gene expression data, the number of dimensions (attributes) can reach hundreds or thousands, or even more. The present invention therefore provides a method for data clustering that can divide a data set into clusters of different densities based on multiple cluster radii (that is, scan radii).
Fig. 2 is a schematic diagram of the main steps of the data clustering method according to an embodiment of the present invention. As shown in Fig. 2, the main steps of the data clustering method of the embodiment of the present invention may include:
Step S201: determine the minimum cluster radius set of the data set to be clustered. In the present invention, the data set to be clustered is obtained first, and its minimum cluster radius set is then determined from the data set.
Step S202: determine at least one cluster radius from the minimum cluster radius set. After the minimum cluster radius set has been obtained in step S201, at least one cluster radius is selected from the set.
Step S203: cluster the data set to be clustered according to the at least one cluster radius. The data set is clustered with the cluster radii obtained in step S202. In the present invention, DBSCAN clustering is performed on the data set based on multiple cluster radii.
In the embodiment of the present invention, determining the minimum cluster radius set of the data set to be clustered may include: for each data item in the data set to be clustered, selecting multiple reference data items from the data set to be clustered based on a data selection rule (in the present invention, "multiple" means a preset number, which may be one; the specific value is determined by the size of the data set), then calculating the Manhattan distances between the data item and the reference data items, and calculating the mean of those Manhattan distances; and generating the minimum cluster radius set from the mean Manhattan distance corresponding to each data item. In the present invention, suppose the data set A to be clustered contains n data items. For each data item a, the m reference data items of a are determined first, the Manhattan distances between a and the m reference data items are then calculated, and their mean d is taken. The data set A thus yields n Manhattan distance means, which constitute the minimum cluster radius set D. Because the m reference data items of a are the m data items with the smallest Manhattan distance to a, the set formed by the mean Manhattan distances of all data items is called the minimum cluster radius set in the present invention. For j-dimensional data, the Manhattan distance between two data items x and y is calculated as follows:
$$d(x, y) = \left( \sum_{i=1}^{j} \lvert x_i - y_i \rvert^{r} \right)^{1/r}$$
where r = 1.
If r in the above formula is set to 2, the formula gives the Euclidean distance. The Euclidean distance requires computing a sum of squares and a square root, so it is slower and more computationally expensive. The Manhattan distance requires only simple additions and subtractions of values, so its computational complexity is significantly lower than that of the Euclidean distance, which greatly reduces computing overhead and improves clustering performance and speed.
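The relationship between the two distances can be illustrated with a short Python sketch. The function names `minkowski_distance` and `manhattan_distance` are hypothetical and chosen only for this example; the first evaluates the general order-r distance, the second is the r = 1 special case written without any powers or roots.

```python
def minkowski_distance(x, y, r):
    """Distance of order r between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)


def manhattan_distance(x, y):
    # r = 1: only subtractions, absolute values and additions are needed.
    return sum(abs(a - b) for a, b in zip(x, y))


x, y = (0, 0), (3, 4)
print(manhattan_distance(x, y))        # order 1 (Manhattan)
print(minkowski_distance(x, y, r=2))   # order 2 (Euclidean)
```

For the points (0, 0) and (3, 4), the Manhattan distance is |0 - 3| + |0 - 4| = 7, while the Euclidean distance is the square root of 9 + 16, that is, 5.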
In the embodiment of the present invention, the data selection rule may include at least one of a rule for selecting reference data by distance and a rule for selecting reference data by quantity. In the present invention, the pairwise Manhattan distances within the data set A are calculated first. Selecting reference data by distance means that, for a data item a, the data whose Manhattan distance to a is less than a preset distance value are selected from the data set A. Selecting reference data by quantity means that, for a data item a, the other data in the data set A are sorted by their Manhattan distance to a in ascending order, and a preset number of reference data items are then selected starting from the one with the smallest distance. Because the embodiment of the present invention selects reference data from both the distance and the quantity perspectives, the selection can be adapted to the actual situation, which further improves the practicality and accuracy of the scheme.
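The two selection rules and the resulting minimum cluster radius set D can be sketched as follows. This is an illustrative Python sketch, not code from the patent: all function names are hypothetical, data items are assumed to be tuples of numbers, and `min_cluster_radius_set` arbitrarily uses the quantity rule with m reference items. It assumes at least two data items and m >= 1.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def references_by_count(data, i, m):
    """Quantity rule: distances from data[i] to its m nearest other items."""
    dists = sorted(manhattan(data[i], q) for j, q in enumerate(data) if j != i)
    return dists[:m]


def references_by_distance(data, i, max_dist):
    """Distance rule: distances from data[i] to all items closer than max_dist."""
    dists = (manhattan(data[i], q) for j, q in enumerate(data) if j != i)
    return [d for d in dists if d < max_dist]


def min_cluster_radius_set(data, m):
    """One mean reference distance d per data item; together they form D."""
    radii = []
    for i in range(len(data)):
        dists = references_by_count(data, i, m)
        radii.append(sum(dists) / len(dists))
    return radii
```

For example, with the four items (0, 0), (0, 1), (0, 3) and (10, 10) and m = 2, the item (0, 0) has nearest distances 1 and 3, so its entry in D is 2.0, whereas the isolated item (10, 10) has nearest distances 17 and 19 and gets 18.0. Isolated or sparse items therefore receive large values in D, which is what lets the sorted set reveal density levels.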
In the embodiment of the present invention, determining at least one cluster radius from the minimum cluster radius set may include: sorting the elements of the minimum cluster radius set in ascending or descending order; calculating the inflection points of the sorted minimum cluster radius set with an inflection point algorithm; and determining at least one cluster radius from the elements corresponding to the inflection points. In the present invention, the elements of the minimum cluster radius set D are sorted from largest to smallest or from smallest to largest, the points at which the minimum distance changes are calculated with an inflection point calculation method, and multiple cluster radii are then selected for DBSCAN clustering. An inflection point, in mathematics, is a point at which a curve changes its direction of concavity (upward or downward); intuitively, it is a point where the tangent crosses the curve, that is, the boundary between the concave and convex parts of the curve. At an inflection point, the second derivative of the function either changes sign (from positive to negative or from negative to positive) or does not exist. The inflection point algorithm takes the inflection points obtained by calculating the second derivative as the "minimum distance change points".
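The source does not spell out the inflection point algorithm, so the following Python sketch is only one possible reading: it approximates the second derivative of the sorted sequence by the discrete second difference and reports the positions where its sign changes. The names `inflection_indices` and `candidate_radii` are hypothetical, and real radius sets are usually noisy enough that some smoothing would be needed before this test is useful.

```python
def inflection_indices(values):
    """Indices at which the discrete second difference changes sign.

    The reported index is the first element whose second difference
    carries the new sign; zero second differences are skipped.
    """
    second = [values[i - 1] - 2 * values[i] + values[i + 1]
              for i in range(1, len(values) - 1)]
    hits = []
    prev_sign = 0
    for k, s in enumerate(second):
        sign = (s > 0) - (s < 0)
        if sign == 0:
            continue
        if prev_sign != 0 and sign != prev_sign:
            hits.append(k + 1)  # second[k] is centered on values[k + 1]
        prev_sign = sign
    return hits


def candidate_radii(radius_set):
    """Sort D in descending order and read off the values at inflections."""
    ordered = sorted(radius_set, reverse=True)
    return sorted({ordered[i] for i in inflection_indices(ordered)})
```

The result of `candidate_radii` is returned in ascending order so that it can be fed directly into a clustering pass that processes radii from smallest to largest.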
Fig. 3 is a schematic diagram of how the data clustering method according to an embodiment of the present invention determines cluster radii. As shown in Fig. 3, taking two-dimensional data as an example, the left side of Fig. 3 shows the distribution of a two-dimensional data set of 1000 items. The minimum cluster radius set D is generated for this data set in the manner described above and its elements are sorted in descending order, which produces the minimum cluster radius distribution shown on the right side of Fig. 3. The horizontal axis is the index of the 1000 data items, and the vertical axis is the mean Manhattan distance from the corresponding data item to its 4 nearest reference data items (the number of reference data items is 4). The curve contains three relatively flat segments, and an inflection point appears before each flat segment, so the first ordinate of each flat segment can be taken as a cluster radius. For this figure, the inflection point algorithm yields three cluster radius values: 1834 at the 65th data point, 586 at the 443rd data point and 361 at the 590th data point.
In the embodiment of the present invention, clustering the data set to be clustered according to the at least one cluster radius may include: taking the at least one cluster radius in ascending order, clustering a target data set based on each cluster radius to obtain the clusters corresponding to that cluster radius and a noise data set. The target data set is the noise data set obtained from the previous clustering pass, and the target data set of the first clustering pass is the data set to be clustered. In the present invention, suppose the data set to be clustered is A and three cluster radii d1, d2 and d3 have been obtained, with d1 < d2 < d3. First, DBSCAN clustering is performed on the data set A with d1 as the scan radius, which yields the clusters based on d1 and a noise data set c1. DBSCAN clustering is then performed on the noise data set c1 with d2, which yields the clusters based on d2 and a noise data set c2. Finally, DBSCAN clustering is performed on the noise data set c2 with d3, which yields the clusters based on d3 and a noise data set c3. The whole data set A has thus been clustered at three densities.
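The successive passes over the noise data set can be sketched in Python as follows. This is a hedged illustration rather than the implementation of the patent: `cluster_by_radii` is a hypothetical name, the same `min_pts` value is assumed for every pass (the source does not say how it is chosen), and the compact `dbscan` helper is a stand-in that uses the Manhattan distance and a brute-force neighbor search.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def dbscan(points, eps, min_pts):
    """Compact DBSCAN stand-in; returns a label per point, -1 for noise."""
    labels = [None] * len(points)

    def near(i):
        return [j for j, q in enumerate(points)
                if manhattan(points[i], q) <= eps]

    cid = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = near(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise
            continue
        cid += 1
        labels[i] = cid
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            more = near(j)
            if len(more) >= min_pts:
                seeds.extend(more)
    return labels


def cluster_by_radii(points, radii, min_pts):
    """Cluster with each radius in ascending order, reusing the noise set."""
    clusters = []             # (radius, members) pairs, densest first
    remaining = list(points)  # target data set of the current pass
    for eps in sorted(radii):
        labels = dbscan(remaining, eps, min_pts)
        groups = {}
        for p, label in zip(remaining, labels):
            groups.setdefault(label, []).append(p)
        remaining = groups.pop(-1, [])  # noise feeds the next pass
        clusters.extend((eps, members) for _, members in sorted(groups.items()))
    return clusters, remaining
```

Whatever is still in `remaining` after the largest radius corresponds to the final noise data set (c3 in the three-radius example above).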
Fig. 4 is the schematic diagram of the main flow of the method for data clusters according to an embodiment of the present invention, as shown in figure 4, this The main flow of the method for the data clusters of inventive embodiments may include: step S401, obtain data set A to be clustered;Step Rapid S402 chooses the corresponding m reference data of data a for each data a in data set to be clustered, and calculates data a M manhatton distance between m corresponding reference data is to get the mean value d of m manhatton distance;Step S403 generates min cluster radius set D according to the mean value d of the corresponding manhatton distance of each data;Step S404, will be minimum Element in cluster radius set D is arranged according to the sequence of ascending or descending order, to get turning for min cluster radius set D Point;Step S405 obtains cluster radius according to the corresponding element of inflection point;Step S406, with the sequence of cluster radius from small to large, Target data set is clustered based on each cluster radius to obtain the corresponding cluster of cluster radius and noise data collection.Its In, target data set in step S406 is the noise data collection obtained after preceding primary cluster, the target data set clustered for the first time For data set A to be clustered.
The technical solution of data clusters according to an embodiment of the present invention, which can be seen that, can obtain multiple cluster radius, from And data set can be divided into the cluster of different densities based on multiple cluster radius, the accuracy rate of cluster is improved, calculating is reduced Complexity;Its corresponding reference data is selected for each data in the embodiment of the present invention, then by calculating data and ginseng The min cluster radius set that the manhatton distance between data determines data set is examined, so as to guarantee that it is accurate that distance calculates Property under the premise of, substantially reduce distance calculate complexity;From distance and the multiple angle Selection ginsengs of quantity in the embodiment of the present invention Data are examined, so as to be adaptively adjusted according to the actual situation, further increase the practicability and accuracy of scheme;The present invention Cluster radius is chosen from min cluster radius set using inflection point algorithm in embodiment, may thereby determine that out different clusters Radius, and then data set can be divided into the cluster of different densities;According to cluster radius from small to large suitable in the embodiment of the present invention Ordered pair data set is clustered, and density is very high, density is placed in the middle and the lesser cluster of density so as to successively choosing from data set, Further increase the accuracy rate of cluster.
Fig. 5 is the schematic diagram of the main modular of the device of data clusters according to an embodiment of the present invention.As shown in figure 5, this The device 500 of the data clusters of inventive embodiments mainly comprises the following modules: the first determining module 501, the second determining module 502 With cluster module 503.
Wherein, the first determining module 501 can be used for obtaining the min cluster radius set of data set to be clustered.Second really Cover half block 502 can be used for determining at least one cluster radius from min cluster radius set.Cluster module 503 can be used for basis At least one cluster radius clusters data set to be clustered.
In the embodiment of the present invention, the first determining module 501 can also be used in: for every number in data set to be clustered According to choosing multiple reference datas from data set to be clustered based on data decimation rule, then calculate data and multiple references Multiple manhatton distances between data, and calculate the mean value of multiple manhatton distances;According to the corresponding Manhattan of each data The average generation min cluster radius set of distance.
In the embodiment of the present invention, data decimation rule may include: according to distance selection reference data rule and according to number Amount chooses at least one of reference data rule.
In the embodiment of the present invention, the second determining module 502 can also be used in: by the element in min cluster radius set according to The sequence of ascending or descending order arranges;The inflection point of the min cluster radius set after sequence is calculated using inflection point algorithm;According to inflection point Corresponding element determines at least one cluster radius.
In the embodiment of the present invention, the clustering module 503 can also be used to: taking the at least one cluster radius in order from small to large, cluster a target data set based on each cluster radius to obtain the clusters corresponding to that cluster radius and a noise data set, wherein the target data set is the noise data set obtained from the previous round of clustering, and the target data set for the first round of clustering is the data set to be clustered.
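The round-by-round procedure can be sketched as below. The per-radius clustering step is not detailed in this passage, so a DBSCAN-style pass with a minimum neighbour count `min_pts` is assumed here; `cluster_once`, `multi_radius_cluster`, `min_pts` and the sample points are all hypothetical.

```python
from collections import deque
from typing import List, Sequence, Tuple

Point = Sequence[float]


def manhattan(p: Point, q: Point) -> float:
    return sum(abs(a - b) for a, b in zip(p, q))


def cluster_once(points: List[Point], radius: float,
                 min_pts: int) -> Tuple[List[List[Point]], List[Point]]:
    """One DBSCAN-style pass (assumed): returns (clusters, noise)."""
    n = len(points)
    # Neighbour lists include the point itself.
    nbrs = [[j for j in range(n) if manhattan(points[i], points[j]) <= radius]
            for i in range(n)]
    label = [None] * n
    clusters = []
    for i in range(n):
        if label[i] is not None or len(nbrs[i]) < min_pts:
            continue  # already assigned, or not a core point
        cid, members = len(clusters), []
        queue = deque([i])
        while queue:
            j = queue.popleft()
            if label[j] is not None:
                continue
            label[j] = cid
            members.append(points[j])
            if len(nbrs[j]) >= min_pts:  # only core points expand the cluster
                queue.extend(nbrs[j])
        clusters.append(members)
    noise = [points[i] for i in range(n) if label[i] is None]
    return clusters, noise


def multi_radius_cluster(points, radii, min_pts=3):
    """Cluster with radii from small to large; each round's noise set
    becomes the next round's target data set."""
    target, found = list(points), []
    for r in sorted(radii):
        clusters, target = cluster_once(target, r, min_pts)
        found.extend((r, c) for c in clusters)
    return found, target


dense = [(0, 0), (0, 1), (1, 0), (1, 1)]
sparse = [(20, 20), (20, 24), (24, 20), (24, 24)]
found, noise = multi_radius_cluster(dense + sparse + [(100, 100)], [4, 1])
print([(r, len(c)) for r, c in found], noise)
```

With these sample points, the radius 1 round should pick out the dense group, the radius 4 round should pick out the sparser group from the remaining noise, and the isolated point stays in the final noise set.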
From the above description, it can be seen that multiple cluster radii can be obtained, so that the data set can be divided into clusters of different densities based on those radii, which improves the accuracy of clustering and reduces the complexity of calculation. In the embodiment of the present invention, corresponding reference data are selected for each data item, and the minimum cluster radius set of the data set is then determined by calculating the Manhattan distances between the data item and its reference data, so that the complexity of distance calculation is greatly reduced while the accuracy of distance calculation is preserved. In the embodiment of the present invention, reference data are selected from several angles, namely distance and quantity, so that the selection can be adapted to the actual situation, further improving the practicability and accuracy of the solution. In the embodiment of the present invention, cluster radii are chosen from the minimum cluster radius set using an inflection point algorithm, so that different cluster radii can be determined and the data set can in turn be divided into clusters of different densities. In the embodiment of the present invention, the data set is clustered in order of cluster radius from small to large, so that high-density, medium-density and low-density clusters are extracted from the data set in turn, further improving the accuracy of clustering.
Fig. 6 shows an exemplary system architecture 600 to which the data clustering method or the data clustering device of the embodiment of the present invention can be applied.
As shown in Fig. 6, the system architecture 600 may include terminal devices 601, 602 and 603, a network 604 and a server 605. The network 604 serves as the medium providing communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various connection types, such as wired or wireless communication links or fiber optic cables.
A user can use the terminal devices 601, 602, 603 to interact with the server 605 through the network 604 in order to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 601, 602, 603, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software (examples only).
The terminal devices 601, 602, 603 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.
The server 605 can be a server that provides various services, for example a back-end management server that provides support for a shopping website browsed by users with the terminal devices 601, 602, 603 (example only). The back-end management server can analyze and otherwise process received data such as information query requests, and feed the processing results (such as targeted push information or product information, examples only) back to the terminal devices.
It should be noted that the data clustering method provided by the embodiment of the present invention is generally executed by the server 605; correspondingly, the data clustering device is generally arranged in the server 605.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 6 are only illustrative. There can be any number of terminal devices, networks and servers, as required by the implementation.
Referring now to Fig. 7, it shows a schematic structural diagram of a computer system 700 suitable for implementing a terminal device of the embodiment of the present invention. The terminal device shown in Fig. 7 is only an example and should not impose any limitation on the function or scope of use of the embodiment of the present invention.
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in read-only memory (ROM) 702 or a program loaded from the storage section 708 into random access memory (RAM) 703. The RAM 703 also stores the various programs and data required for the operation of the system 700. The CPU 701, ROM 702 and RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse and the like; an output section 707 including a cathode ray tube (CRT) or liquid crystal display (LCD), a loudspeaker and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage section 708 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the disclosed embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above-described functions defined in the system of the present invention are performed.
It should be noted that the computer-readable medium described in the present invention can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media can include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium can be any tangible medium that contains or stores a program, which program can be used by or in connection with an instruction execution system, apparatus or device. In the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless, electric wire, optical cable, RF and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions and operations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each box in a flowchart or block diagram can represent a module, a program segment or a portion of code, and that module, program segment or portion of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can occur in an order different from that marked in the drawings. For example, two boxes shown in succession can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams or flowcharts, and combinations of boxes in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiment of the present invention can be implemented in software or in hardware. The described modules can also be arranged in a processor; for example, this can be described as: a processor comprising a first determining module, a second determining module and a clustering module. The names of these modules do not, under certain circumstances, constitute a limitation on the modules themselves; for example, the first determining module can also be described as "a module for determining the minimum cluster radius set of the data set to be clustered".
In another aspect, the present invention also provides a computer-readable medium, which can be included in the device described in the above embodiment, or can exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: determine the minimum cluster radius set of the data set to be clustered; determine at least one cluster radius from the minimum cluster radius set; and cluster the data set to be clustered according to the at least one cluster radius.
According to the technical solution of the embodiment of the present invention, multiple cluster radii can be obtained, so that the data set can be divided into clusters of different densities based on those radii, which improves the accuracy of clustering and reduces the complexity of calculation. In the embodiment of the present invention, corresponding reference data are selected for each data item, and the minimum cluster radius set of the data set is then determined by calculating the Manhattan distances between the data item and its reference data, so that the complexity of distance calculation is greatly reduced while the accuracy of distance calculation is preserved. In the embodiment of the present invention, reference data are selected from several angles, namely distance and quantity, so that the selection can be adapted to the actual situation, further improving the practicability and accuracy of the solution. In the embodiment of the present invention, cluster radii are chosen from the minimum cluster radius set using an inflection point algorithm, so that different cluster radii can be determined and the data set can in turn be divided into clusters of different densities. In the embodiment of the present invention, the data set is clustered in order of cluster radius from small to large, so that high-density, medium-density and low-density clusters are extracted from the data set in turn, further improving the accuracy of clustering.
The specific embodiments described above do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations and substitutions can occur. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (12)

1. A data clustering method, characterized by comprising:
determining a minimum cluster radius set of a data set to be clustered;
determining at least one cluster radius from the minimum cluster radius set;
clustering the data set to be clustered according to the at least one cluster radius.
2. The method according to claim 1, characterized in that obtaining the minimum cluster radius set of the data set to be clustered comprises:
for each data item in the data set to be clustered, choosing multiple reference data from the data set to be clustered based on a data selection rule, then calculating the multiple Manhattan distances between the data item and the multiple reference data, and calculating the mean of the multiple Manhattan distances;
generating the minimum cluster radius set from the mean Manhattan distance corresponding to each data item.
3. The method according to claim 2, characterized in that the data selection rule comprises at least one of a rule for choosing reference data according to distance and a rule for choosing reference data according to quantity, wherein the rule for choosing reference data according to distance comprises: for a data item a in a data set A, choosing from the data set A the data whose Manhattan distance to the data item a is less than a preset distance value; and
the rule for choosing reference data according to quantity comprises: for a data item a in a data set A, arranging the Manhattan distances between the other data in the data set A and the data item a from small to large, and then choosing a predetermined number of data starting from the other data corresponding to the minimum distance.
4. The method according to claim 1, characterized in that determining at least one cluster radius from the minimum cluster radius set comprises:
arranging the elements of the minimum cluster radius set in ascending or descending order;
calculating the inflection points of the sorted minimum cluster radius set using an inflection point algorithm;
determining at least one cluster radius according to the elements corresponding to the inflection points.
5. The method according to claim 1, characterized in that clustering the data set to be clustered according to the at least one cluster radius comprises:
taking the at least one cluster radius in order from small to large, clustering a target data set based on each cluster radius to obtain the clusters corresponding to that cluster radius and a noise data set, wherein the target data set is the noise data set obtained from the previous round of clustering, and the target data set for the first round of clustering is the data set to be clustered.
6. A data clustering device, characterized by comprising:
a first determining module for determining a minimum cluster radius set of a data set to be clustered;
a second determining module for determining at least one cluster radius from the minimum cluster radius set;
a clustering module for clustering the data set to be clustered according to the at least one cluster radius.
7. The device according to claim 6, characterized in that the first determining module is further used to:
for each data item in the data set to be clustered, choose multiple reference data from the data set to be clustered based on a data selection rule, then calculate the multiple Manhattan distances between the data item and the multiple reference data, and calculate the mean of the multiple Manhattan distances;
generate the minimum cluster radius set from the mean Manhattan distance corresponding to each data item.
8. The device according to claim 7, characterized in that the data selection rule comprises at least one of a rule for choosing reference data according to distance and a rule for choosing reference data according to quantity, wherein the rule for choosing reference data according to distance comprises: for a data item a in a data set A, choosing from the data set A the data whose Manhattan distance to the data item a is less than a preset distance value; and
the rule for choosing reference data according to quantity comprises: for a data item a in a data set A, arranging the Manhattan distances between the other data in the data set A and the data item a from small to large, and then choosing a predetermined number of data starting from the other data corresponding to the minimum distance.
9. The device according to claim 6, characterized in that the second determining module is further used to:
arrange the elements of the minimum cluster radius set in ascending or descending order;
calculate the inflection points of the sorted minimum cluster radius set using an inflection point algorithm;
determine at least one cluster radius according to the elements corresponding to the inflection points.
10. The device according to claim 6, characterized in that the clustering module is further used to:
taking the at least one cluster radius in order from small to large, cluster a target data set based on each cluster radius to obtain the clusters corresponding to that cluster radius and a noise data set, wherein the target data set is the noise data set obtained from the previous round of clustering, and the target data set for the first round of clustering is the data set to be clustered.
11. An electronic device, characterized by comprising:
one or more processors;
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 5.
12. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 5.
CN201810239450.4A 2018-03-22 2018-03-22 The method and apparatus of data clusters Pending CN110298371A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810239450.4A CN110298371A (en) 2018-03-22 2018-03-22 The method and apparatus of data clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810239450.4A CN110298371A (en) 2018-03-22 2018-03-22 The method and apparatus of data clusters

Publications (1)

Publication Number Publication Date
CN110298371A true CN110298371A (en) 2019-10-01

Family

ID=68025586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810239450.4A Pending CN110298371A (en) 2018-03-22 2018-03-22 The method and apparatus of data clusters

Country Status (1)

Country Link
CN (1) CN110298371A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215287A (en) * 2020-10-13 2021-01-12 中国光大银行股份有限公司 Distance-based multi-section clustering method and device, storage medium and electronic device
CN113033584A (en) * 2019-12-09 2021-06-25 Oppo广东移动通信有限公司 Data processing method and related equipment
CN116204800A (en) * 2022-11-30 2023-06-02 北京码牛科技股份有限公司 Controllable clustering method, system, terminal and storage medium for position point division

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2003136467A (en) * 2003-12-16 2005-05-27 Открытое акционерное общество "Научно-производственное предпри тие "Радар ммс" (RU) METHOD FOR AUTOMATIC CLUSTERING OBJECTS
TW201109949A (en) * 2009-09-01 2011-03-16 Univ Nat Pingtung Sci & Tech Density-based data clustering method
US20140334739A1 (en) * 2013-05-08 2014-11-13 Xyratex Technology Limited Methods of clustering computational event logs
CN105913077A (en) * 2016-04-07 2016-08-31 华北电力大学(保定) Data clustering method based on dimensionality reduction and sampling
CN106503086A (en) * 2016-10-11 2017-03-15 成都云麒麟软件有限公司 The detection method of distributed local outlier
CN106709503A (en) * 2016-11-23 2017-05-24 广西中烟工业有限责任公司 Large spatial data clustering algorithm K-DBSCAN based on density
CN107688955A (en) * 2016-08-03 2018-02-13 浙江工业大学 A kind of city commercial circle group variety division methods based on adaptive DBSCAN Density Clusterings

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
储岳中 等: "动态最近邻聚类算法的优化研究", 计算机工程与设计, vol. 32, no. 05, pages 1687 - 1690 *
温海波: "改进聚类算法在 DRDoS 攻击检测中的应用研究", 安徽建筑大学学报, vol. 25, no. 01, 15 February 2017 (2017-02-15), pages 70 - 75 *
罗维佳 等: "面向LBSN的k-medoids聚类算法", 中国科学技术大学学报, vol. 47, no. 01, pages 70 - 79 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033584A (en) * 2019-12-09 2021-06-25 Oppo广东移动通信有限公司 Data processing method and related equipment
CN113033584B (en) * 2019-12-09 2023-07-07 Oppo广东移动通信有限公司 Data processing method and related equipment
CN112215287A (en) * 2020-10-13 2021-01-12 中国光大银行股份有限公司 Distance-based multi-section clustering method and device, storage medium and electronic device
CN112215287B (en) * 2020-10-13 2024-04-12 中国光大银行股份有限公司 Multi-section clustering method and device based on distance, storage medium and electronic device
CN116204800A (en) * 2022-11-30 2023-06-02 北京码牛科技股份有限公司 Controllable clustering method, system, terminal and storage medium for position point division

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination