CN109002513A - Data clustering method and device

Info

Publication number
CN109002513A
Authority
CN
China
Prior art keywords
data
uncertain
data set
mass center
initial mass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810723419.8A
Other languages
Chinese (zh)
Other versions
CN109002513B (en
Inventor
陈力铭
叶朱荪
张峰
马新杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Information Technology Co Ltd
Original Assignee
Shenzhen Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Information Technology Co Ltd
Priority to CN201810723419.8A
Publication of CN109002513A
Application granted
Publication of CN109002513B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data clustering method and device. When uncertain data to be clustered is obtained, the information needed to cluster it is calculated from the uncertainty probability density function of the uncertain data. For example, for each data set, the preset initial centroid of the data set is recalculated based on the uncertainty probability density function of the uncertain data; the expected squared error from the uncertain data to the recalculated preset initial centroid of that data set, plus the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, is taken as the expected sum of squared errors of the uncertain data relative to that data set. The data set with the smallest expected sum of squared errors is determined as the target data set, and the uncertain data is assigned to the target data set. Clustering uncertain data based on its uncertainty probability density function in this way improves the accuracy of uncertain-data clustering.

Description

Data clustering method and device
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a data clustering method and device.
Background art
Because of inaccurate measurement, sampling error, outdated data sources, or other reasons, data is often uncertain (such data is referred to as uncertain data for short), especially in applications that interact with the real environment, such as location-based services and sensor monitoring. For example, a location-based service that tracks moving targets (such as vehicles or people) cannot track the exact instantaneous position of every moving target, so the recorded change in position of each moving target is uncertain. This uncertainty affects data management, for example data querying and data clustering.
Data uncertainty currently falls into two types: existential uncertainty and value uncertainty. In the first type, it is uncertain whether a target or data tuple exists at all. For example, a data tuple in a relational database may be associated with a probability value that expresses the confidence that it exists. In the second type, a datum is modeled as a closed region, and the probability density function (PDF) of the datum bounds its possible values. For these two types, there are two existing approaches to data clustering:
One approach uses the EM (Expectation Maximization) algorithm to fit mixture densities to the uncertain data, and the other is the fuzzy C-means clustering algorithm. Neither method, however, accounts for the effect of uncertainty on clustering, which reduces clustering accuracy.
Summary of the invention
In view of this, the purpose of the present invention is to provide a data clustering method and device for improving the accuracy of uncertain-data clustering. The technical solution is as follows:
The present invention provides a data clustering method, the method comprising:
when uncertain data to be clustered is obtained, for any data set: assigning the uncertain data to the data set, and recalculating the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data;
for any data set: calculating the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determining this sum as the expected sum of squared errors of the uncertain data relative to the data set;
determining the data set with the smallest expected sum of squared errors as the target data set;
assigning the uncertain data to the target data set.
Preferably, for any data set, assigning the uncertain data to the data set and recalculating the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data comprises:
based on the formula:
obtaining the preset initial centroid c_j of the j-th data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
Preferably, for any data set, calculating the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets comprises:
based on the formula:
obtaining the sum of the expected squared error from the uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
The present invention also provides a data clustering method, the method comprising:
when uncertain data to be clustered is obtained, determining, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set;
determining the data set with the smallest expected distance as the target data set of the uncertain data, and assigning the uncertain data to the target data set;
recalculating the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and iteratively performing the steps of determining the expected distance from the uncertain data to the preset initial centroid of each data set and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
Preferably, determining the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data comprises:
based on the formula:
obtaining the expected distance from the uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function.
Preferably, recalculating the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data comprises:
based on the formula:
obtaining the preset initial centroid c_j of the target data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
The present invention also provides a data clustering device, the device comprising:
a first computing unit, configured to, when uncertain data to be clustered is obtained, for any data set: assign the uncertain data to the data set and recalculate the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data;
a second computing unit, configured to, for any data set: calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determine this sum as the expected sum of squared errors of the uncertain data relative to the data set;
a determination unit, configured to determine the data set with the smallest expected sum of squared errors as the target data set;
a division unit, configured to assign the uncertain data to the target data set.
Preferably, the first computing unit is configured to, based on the formula:
obtain the preset initial centroid c_j of the j-th data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function;
or
the second computing unit is configured to, based on the formula:
obtain the sum of the expected squared error from the uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
The present invention also provides a data clustering device, the device comprising:
a determination unit, configured to, when uncertain data to be clustered is obtained, determine the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data;
a division unit, configured to determine the data set with the smallest expected distance as the target data set of the uncertain data and assign the uncertain data to the target data set;
a computing unit, configured to recalculate the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and to trigger the determination unit and the division unit to iteratively perform the steps of determining the expected distance from the uncertain data to the preset initial centroid of each data set and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
Preferably, the determination unit is configured to, based on the formula:
obtain the expected distance from the uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function;
or
the computing unit is configured to, based on the formula:
obtain the preset initial centroid c_j of the target data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
As can be seen from the above technical solution, when uncertain data to be clustered is obtained, the information needed to cluster it is calculated from the uncertainty probability density function of the uncertain data. For example, for each data set, the preset initial centroid of the data set is recalculated based on the uncertainty probability density function of the uncertain data; the expected squared error from the uncertain data to the recalculated preset initial centroid of that data set, plus the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, is taken as the expected sum of squared errors of the uncertain data relative to that data set. The data set with the smallest expected sum of squared errors is then determined as the target data set, and the uncertain data is assigned to it. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of uncertain-data clustering.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a data clustering method provided by an embodiment of the present invention;
Fig. 2 is another flowchart of a data clustering method provided by an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of a data clustering device provided by an embodiment of the present invention;
Fig. 4 is another structural schematic diagram of a data clustering device provided by an embodiment of the present invention.
Detailed description of the embodiments
At present, the data clustering problem is to find a set C of data sets C_j (j from 1 to K), where each data set C_j is formed around a similarity-based mean c_j (regarded as the preset initial centroid of C_j). Different clustering algorithms may use different objective functions, but the main idea is to minimize the distance between data within the same data set and to maximize the distance between data in different data sets. Minimizing the distance within a data set can in turn be viewed as minimizing the distance between the data in that data set and minimizing the distance between each datum and the preset initial centroid of the data set.
Starting from a hard clustering algorithm, the K-means algorithm, the applicant studied clustering algorithms suitable for uncertain data. The purpose of the K-means algorithm is to find a set C of K data sets that minimizes the sum of squared errors (SSE). The sum of squared errors is calculated as follows:

$$\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| c_j - x_i \|^2$$

Here ||·|| denotes the distance between a datum x_i and the preset initial centroid c_j of a data set. For example, the Euclidean distance between two m-dimensional points x and y is defined as:

$$\| x - y \| = \sqrt{\sum_{l=1}^{m} (x_l - y_l)^2}$$

The preset initial centroid of a data set C_j is defined in vector form as:

$$c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$
Correspondingly, the K-means algorithm proceeds as follows:
Assign initial values for cluster means c_1 to c_K
repeat
    for i = 1 to n do
        Assign each data point x_i to cluster C_j where ||c_j - x_i|| is the minimum
    end for
    for j = 1 to K do
        Recalculate cluster mean c_j of cluster C_j
    end for
until convergence
return C
In brief: 1) set a preset initial centroid for each data set; 2) calculate the distance ||c_j - x_i|| between each datum to be clustered and the preset initial centroid of each data set, and assign the datum to the data set with the smallest distance; 3) recalculate the preset initial centroid of the data set with the smallest distance; 4) repeat steps 2) and 3) until a preset condition is met.
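The four steps above can be sketched in a few lines of Python. This is a minimal illustration of plain K-means, not code from the patent; the function name `kmeans`, the random choice of initial centroids, and the stopping rule (no assignment changes, or `max_iter` reached) are assumptions made for the example.

```python
import random
from math import dist

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of coordinate tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct points as the preset initial centroids (assumed choice).
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to the data set with the nearest centroid.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # Step 4: stop when no assignment changes.
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a data set becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns one label per point, with the two groups receiving different labels.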
As the K-means algorithm above shows, clustering with K-means does not take uncertainty into account. The applicant therefore concluded that, when clustering uncertain data, the information needed for clustering should be calculated from the uncertainty probability density function of the uncertain data. For example, the distance ||c_j - x_i|| between each datum to be clustered and the preset initial centroid of each data set is replaced by the expected distance E(||c_j - x_i||), and the preset initial centroid is calculated from the uncertainty probability density function of the uncertain data; alternatively, the goal of clustering is taken to be minimizing the expected sum of squared errors. Either way, the accuracy of uncertain-data clustering improves.
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 shows a flowchart of a data clustering method provided by an embodiment of the present invention. The method targets uncertain data and is used to improve the accuracy of uncertain-data clustering. It may include the following steps:
101: When uncertain data to be clustered is obtained, for any data set: assign the uncertain data to the data set and, based on the uncertainty probability density function of the uncertain data, recalculate the preset initial centroid of the data set.
That is, a preset initial centroid is set in advance for each data set. When uncertain data to be clustered is obtained, it is assigned to any one data set, and the preset initial centroid of that data set is then recalculated based on the uncertainty probability density function of the uncertain data. In other words, the uncertainty probability density function of the uncertain data changes the preset initial centroid of the data set.
In this embodiment, one way to recalculate the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data is:
based on the formula:
obtain the preset initial centroid c_j of the j-th data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
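One form of this centroid formula that is consistent with the symbols defined here is sketched below. It takes the centroid to be the mean of the expected values of the data in C_j, which is an assumption rather than a quotation of the original formula; |C_j|, the number of data in C_j, is a symbol introduced for this sketch.

```latex
c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} E(x_i)
    = \frac{1}{|C_j|} \sum_{x_i \in C_j} \int x_i \, f(x_i) \, dx_i
```

With this form, a datum whose probability density function is concentrated at a single point contributes exactly that point, so the expression reduces to the ordinary K-means mean when there is no uncertainty.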
Take three data sets as an example: data set 1, data set 2, and data set 3. When the uncertain data x_i is assigned to data set 1, the preset initial centroid of data set 1 can be recalculated with the above formula, so that the uncertainty probability density function of the uncertain data changes the preset initial centroid of the data set.
The uncertainty probability density function is related to the attribute value o_i of the uncertain data; for example, it is the probability density function of the attribute value o_i at time t, for i = 1 to n. The uncertainty probability density function may take the form of a uniform density function or a Gaussian distribution function.
102: For any data set: calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determine this sum as the expected sum of squared errors of the uncertain data relative to the data set.
That is, when determining the expected sum of squared errors of the uncertain data relative to the j-th of all the data sets, the preset initial centroid of the j-th data set is the recalculated one, while the preset initial centroids of the other data sets are the ones set in advance.
Continuing with data set 1, data set 2, and data set 3: when determining the expected sum of squared errors of the uncertain data relative to data set 1, the preset initial centroid of data set 1 is the recalculated one, and the preset initial centroids of data set 2 and data set 3 are the ones set in advance.
In this embodiment, for any data set, one way to calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets is:
based on the formula:
obtain the sum of the expected squared error from the uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
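For orientation, the usual expected sum of squared errors over K data sets can be written as follows. This is the standard expected-SSE objective expressed with the symbols defined here, offered as an assumption about the intended quantity; the original formula may differ in detail.

```latex
E(\mathrm{SSE})
  = \sum_{j=1}^{K} \sum_{x_i \in C_j} E\left( \| c_j - x_i \|^2 \right)
  = \sum_{j=1}^{K} \sum_{x_i \in C_j} \int \| c_j - x_i \|^2 \, f(x_i) \, dx_i
```

Under this reading, the term for the j-th data set uses the recalculated centroid c_j, and the terms for the other data sets use their unchanged preset initial centroids.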
103: Determine the data set with the smallest expected sum of squared errors as the target data set, and assign the uncertain data to the target data set. Clustering the uncertain data is thus treated as the problem of minimizing the expected sum of squared errors E(SSE): the target data set of the uncertain data is the one with the smallest expected sum of squared errors, and the uncertain data is assigned to it.
As can be seen from the above technical solution, when uncertain data to be clustered is obtained, for any data set the uncertain data is assigned to the data set and the preset initial centroid of the data set is recalculated based on the uncertainty probability density function of the uncertain data. For any data set, the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets is calculated and determined as the expected sum of squared errors of the uncertain data relative to the data set. The data set with the smallest expected sum of squared errors is determined as the target data set, and the uncertain data is assigned to it. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of uncertain-data clustering.
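Steps 101 to 103 can be illustrated with the following Python sketch. It rests on several assumptions that are not in the original text: each uncertain datum is summarized by its expected position and total variance, a `(mean, total_var)` pair; the expected squared error uses the identity E(||x - c||^2) = ||E(x) - c||^2 + tr(Cov(x)); and the expected sum of squared errors is taken over all data in all data sets. The function names are hypothetical.

```python
from math import dist

def expected_sq_error(mean, total_var, centroid):
    """E||x - c||^2 = ||E[x] - c||^2 + tr(Cov[x]), valid for any distribution."""
    return dist(mean, centroid) ** 2 + total_var

def centroid_of(members):
    """Centroid as the mean of the members' expected positions (assumed form)."""
    means = [m for m, _ in members]
    return tuple(sum(c) / len(means) for c in zip(*means))

def expected_sse(members, centroid):
    """Expected sum of squared errors of one data set around a centroid."""
    return sum(expected_sq_error(m, v, centroid) for m, v in members)

def pick_target(datum, clusters, centroids):
    """Return the index of the target data set for one uncertain datum.

    datum        -- (mean, total_var) of the datum to be clustered
    clusters[j]  -- list of (mean, total_var) pairs already in data set j
    centroids[j] -- preset initial centroid of data set j
    """
    totals = []
    for j in range(len(clusters)):
        # Step 101: tentatively place the datum in data set j, recompute c_j.
        trial = clusters[j] + [datum]
        total = expected_sse(trial, centroid_of(trial))
        # Step 102: the other data sets keep their preset initial centroids.
        total += sum(expected_sse(clusters[m], centroids[m])
                     for m in range(len(clusters)) if m != j)
        totals.append(total)
    # Step 103: the smallest expected sum of squared errors wins.
    return min(range(len(totals)), key=totals.__getitem__)
```

A datum whose expected position lies near the members of one data set is assigned to that data set, because placing it elsewhere would drag a distant centroid and inflate the total.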
Fig. 2 shows another flowchart of a data clustering method provided by an embodiment of the present invention. This method likewise targets uncertain data and is used to improve the accuracy of uncertain-data clustering. It may include the following steps:
201: When uncertain data to be clustered is obtained, determine, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set.
In this embodiment, the expected distance from the uncertain data to the preset initial centroid of each data set can be written E(||c_j - x_i||). In particular, for uncertainty regions of various geometric shapes (e.g., lines or circles) and for different uncertainty probability density functions, this expectation generally has to be evaluated by numerical integration; in view of this, E(||c_j - x_i||^2) can be used in place of E(||c_j - x_i||).
The expected distance from the uncertain data to the preset initial centroid of each data set is therefore calculated as:
which gives the expected distance from the uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function.
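Written out with the symbols defined here, the expected squared distance is the integral of the squared distance weighted by the probability density function. The first line below is simply the definition of an expectation; the second line is a standard identity, added as a sketch, that separates the result into the squared distance to the expected position plus the total variance of x_i.

```latex
E\left( \| c_j - x_i \|^2 \right) = \int \| c_j - x_i \|^2 \, f(x_i) \, dx_i

E\left( \| c_j - x_i \|^2 \right)
  = \| c_j - E(x_i) \|^2 + \operatorname{tr}\left( \operatorname{Cov}(x_i) \right)
```

The second form explains why the squared version is convenient: it needs only the mean and variance of each datum rather than a numerical integral for every centroid.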
202: Determine the data set with the smallest expected distance as the target data set of the uncertain data, and assign the uncertain data to the target data set.
203: Based on the uncertainty probability density function of the uncertain data, recalculate the preset initial centroid of the target data set, and iteratively perform step 201 and step 202 until a preset condition is met.
In this embodiment, one way to recalculate the preset initial centroid of the target data set is, based on the formula:
obtain the preset initial centroid c_j of the target data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
It should also be noted that, when step 201 is executed iteratively to determine the expected distance from the uncertain data to the preset initial centroid of each data set, if the preset initial centroid of a data set has been recalculated, step 201 uses the recalculated centroid for that data set; that is, c_j in the expression E(||c_j - x_i||) above is the recalculated preset initial centroid.
The preset condition depends on the application. For example, the preset condition may be: (1) the expected distance is less than a preset distance (which depends on the application); (2) in an iteration, the uncertain data to be clustered is reassigned to the same target data set as before; or (3) the number of iterations reaches a preset number (which depends on the application).
The flow shown in Fig. 2 can be expressed as the following loop:
Assign initial values for cluster means c_1 to c_K (c_1 to c_K are the preset initial centroids of the data sets)
repeat
    for i = 1 to n do
        Assign each data point x_i (uncertain data) to cluster C_j (the j-th data set) where E(||c_j - x_i||) (the expected distance) is the minimum
    end for
    for j = 1 to K do
        Recalculate cluster mean c_j of cluster C_j
    end for
until convergence
return C (the target data sets)
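The loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: each uncertain datum is represented by a list of sample points drawn from its probability density function, the expected squared distance is estimated by averaging over those samples, each centroid is recomputed as the mean of its members' expected positions, and the loop stops when no assignment changes or `max_iter` is reached. The function names are hypothetical.

```python
import random
from math import dist

def expected_sq_dist(samples, centroid):
    """Monte Carlo estimate of E(||c - x||^2) from draws of the datum's pdf."""
    return sum(dist(s, centroid) ** 2 for s in samples) / len(samples)

def expected_position(samples):
    """Sample mean, used as an estimate of the expected position E(x)."""
    return tuple(sum(c) / len(samples) for c in zip(*samples))

def fk_means(objects, init_centroids, max_iter=100):
    """objects[i] is a list of sample points drawn from the pdf of datum i."""
    centroids = list(init_centroids)
    k = len(centroids)
    means = [expected_position(s) for s in objects]
    labels = [None] * len(objects)
    for _ in range(max_iter):
        # Steps 201-202: assign each datum to the data set whose centroid
        # has the smallest expected squared distance.
        new_labels = [min(range(k), key=lambda j: expected_sq_dist(s, centroids[j]))
                      for s in objects]
        if new_labels == labels:  # preset condition: assignments unchanged
            break
        labels = new_labels
        # Step 203: recompute each centroid from its members' expected positions.
        for j in range(k):
            members = [m for m, lab in zip(means, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

Sampling is chosen here only because it works for any shape of uncertainty region; where a closed form for the expected squared distance is available, it can replace `expected_sq_dist` directly.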
As can be seen from the above technical solution, when uncertain data to be clustered is obtained, the expected distance from the uncertain data to the preset initial centroid of each data set is determined based on the uncertainty probability density function of the uncertain data. The data set with the smallest expected distance is determined as the target data set of the uncertain data, and the uncertain data is assigned to it. The preset initial centroid of the target data set is then recalculated based on the uncertainty probability density function of the uncertain data, and the above steps are iterated until a preset condition is met. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of uncertain-data clustering.
To demonstrate the feasibility of the above data clustering method, the clustering approach of this embodiment is applied to a scenario with uncertain data corresponding to targets moving in a plane. In this scenario, each uncertain datum moves in a single direction, and its position is uniformly distributed on a straight line segment.
Assume a preset initial centroid c = (p, q) and an uncertain datum x located on an uncertain line segment whose end points are (a, b) and (c, d). The line segment can be written parametrically as x(t) = (a + t(c - a), b + t(d - b)), where t belongs to [0, 1]. Let f(t) denote the uncertainty probability density function, and let the length of the uncertain line segment be

$$D = \sqrt{(c - a)^2 + (d - b)^2}$$

Expanding the squared distance from x(t) to the centroid (p, q) as a quadratic in t then gives:

$$E\left( \| x - (p, q) \|^2 \right) = \int_0^1 f(t) \left( D^2 t^2 + B t + C \right) dt$$

where B = 2[(c - a)(a - p) + (d - b)(b - q)] and C = (p - a)^2 + (q - b)^2.
If the uncertainty probability density function f(t) is uniform, that is, f(t) = 1, the above formula becomes:

$$E\left( \| x - (p, q) \|^2 \right) = \frac{D^2}{3} + \frac{B}{2} + C$$

The expected distance can thus be calculated for uniformly distributed uncertainty, which makes it possible to cluster the uncertain data. Note that the uniform distribution is a special case; when the distribution is not uniform, the uncertainty probability density function can be represented by, for example, a Gaussian function.
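The uniform case has a simple closed form, D^2/3 + B/2 + C, obtained by integrating the quadratic D^2 t^2 + B t + C over t in [0, 1], where D is the segment length. The sketch below computes it and cross-checks it against a numerical integral; the function names and the midpoint-rule integrator are choices made for this example, not part of the original text.

```python
def expected_sq_dist_uniform_segment(p, q, a, b, c, d):
    """Closed form of E(||x - (p, q)||^2) for x uniform on the segment
    from (a, b) to (c, d): D^2/3 + B/2 + C."""
    d_sq = (c - a) ** 2 + (d - b) ** 2                      # D^2
    big_b = 2 * ((c - a) * (a - p) + (d - b) * (b - q))     # B
    big_c = (p - a) ** 2 + (q - b) ** 2                     # C
    return d_sq / 3 + big_b / 2 + big_c

def expected_sq_dist_numeric(p, q, a, b, c, d, f=lambda t: 1.0, steps=10_000):
    """Midpoint-rule integral of f(t) * ||x(t) - (p, q)||^2 over t in [0, 1]."""
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps
        x, y = a + t * (c - a), b + t * (d - b)
        total += f(t) * ((x - p) ** 2 + (y - q) ** 2)
    return total / steps
```

The numerical version accepts any density `f(t)` on [0, 1], so it can also serve for non-uniform cases where no closed form is at hand.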
To verify whether the data clustering method of this embodiment (called the FK-means method for convenience) improves clustering accuracy, the following scenario is simulated: a system that tracks the positions of a group of moving targets has taken a snapshot reflecting the positions of those targets. The position data is stored in a record set, and each position datum carries some uncertainty, which is captured by an uncertainty factor. The FK-means method is then compared with the K-means method: (1) the K-means method is applied to the records, and (2) the FK-means method is applied to the records together with the data uncertainty. More specifically, a group of random data is first generated in a 100 × 100 two-dimensional space as the records. For each datum, the uncertainty consists of the type of uncertainty, the minimum distance D the datum can move, and the direction in which the datum can move.
Next, the actual positions of the data are simulated from the records and the uncertainty, as offsets from the original positions stored in the records. Specifically, for each datum, the collected position is kept in the record, and a random number is then generated to determine how far the datum may have moved. If the uncertainty is free-moving (multidirectional) or bidirectional, another random number is generated to determine the direction in which the datum may have moved. The resulting position data is referred to as the actual values.
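The simulation step can be sketched as below. The details are assumptions made for illustration: the uncertainty types are labeled "unidirectional", "bidirectional", and "free"; the distance parameter D is treated as the bound on how far a datum may have moved; and offsets are drawn uniformly. The function name and signature are hypothetical.

```python
import math
import random

def simulate_true_position(recorded, kind, max_move, heading, rng):
    """Offset a recorded (x, y) position according to its uncertainty.

    kind     -- "unidirectional", "bidirectional" or "free" (assumed labels)
    max_move -- the distance parameter D, used here as the largest offset
    heading  -- direction of travel in radians (ignored for "free")
    rng      -- a random.Random instance
    """
    distance = rng.uniform(0.0, max_move)            # first draw: how far
    if kind == "free":
        angle = rng.uniform(0.0, 2 * math.pi)        # second draw: any direction
    elif kind == "bidirectional":
        angle = heading + rng.choice((0.0, math.pi)) # second draw: forward or back
    else:
        angle = heading                              # single fixed direction
    x, y = recorded
    return (x + distance * math.cos(angle), y + distance * math.sin(angle))
```

Generating the records themselves is then a matter of drawing each coordinate uniformly from [0, 100], for example with `rng.uniform(0, 100)`.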
In this embodiment, the FK-means method and the K-means method are applied to the following data:
(1) records (using traditional K-means)
(2) records and uncertainty (using FK-means)
(3) actual values (using traditional K-means)
To verify how close the data sets produced by the FK-means method are to the data sets generated from the actual data, the widely used adjusted Rand index (ARI), which measures the similarity between clustering results, is adopted. The higher the ARI, the more similar the two clustering results. The applicant compares the ARI between the data sets produced by (2) and (3) with the ARI between the data sets produced by (1) and (3).
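The adjusted Rand index can be computed from the contingency counts of two labelings. The sketch below uses the standard pair-counting formula and only the Python standard library; the function name is chosen for this example, and both labelings are assumed to cover the same data, with at least two data in total.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same data."""
    n = len(labels_a)
    # Pairs of data placed together in both labelings.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each labeling separately.
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 when the two labelings describe the same partition, regardless of how the labels are named, and is close to 0 for unrelated partitions.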
The number (n) of uncertain data to be clustered, the number of data set (K) and can mobile minimum range (D) value of these three parameters will change in an experiment.Table 1, which is presented, to be changed obtained by the value of D when keeping n=1000 and K=20 The different experiments result arrived.In different parameter combinations, 500 experiments have been done.Note is generated in advance in experiment each time Record, the combination of uncertainty degree, actual value.The combination of these data is to be used in three kinds of cluster process simultaneously.It is identical pre- If initial mass center set is also used simultaneously into three kinds of cluster process, in this way can be to avoid by K-means method and FK- Deviation caused by initial mass center is preset in means method.Test each time, allow K-means method ((1) neutralize (3) in) and FK-means method (in (2)) run to always ought all uncertain datas to be clustered in the cluster it is continuous twice Just terminate when not changing in iteration or when the number of iterations reaches 10000 times.Blue moral index and time interval are adjusted by difference FK-means method and K-means method 500 times experiments are averaged to obtain.
As Table 1 shows, the Adjusted Rand Index of the FK-means method is consistently higher than that of the traditional K-means method applied to the record data. Paired tests show that the difference between the ARI values of the two methods is significant under all settings (p < 0.000001 in each case). This result indicates that the clustering obtained by the FK-means method is closer to the clustering obtained from the real-world data. In other words, the FK-means method yields a clustering that is a better prediction of the clustering that would be obtained if real-world data were available.
Table 1. Experimental results
D                        2.5     5       7.5     10      20      50
ARI (FK-means)           0.733   0.689   0.652   0.632   0.506   0.311
ARI (K-means)            0.700   0.626   0.573   0.523   0.351   0.121
Improvement              0.033   0.063   0.079   0.109   0.155   0.189
Improvement percentage   4.77%   10.03%  13.84%  20.82%  44.34%  155.75%
The applicant conducted further experiments by assigning different values to n, K and D while keeping the other variables constant. In all cases, the applicant found that the FK-means method improves on the traditional K-means method, and the results show that the improvement grows as the degree of uncertainty increases. On the other hand, except when the number of data sets is very small, the number of uncertain data to be clustered and the number of data sets have no great influence on the effectiveness of the FK-means method. In terms of efficiency, the applicant found that the FK-means method needs more computation time than the K-means method, but the extra time is usually moderate. This is acceptable because the FK-means method takes uncertainty into account and yields better clustering quality, that is, higher clustering accuracy.
For simplicity of description, each of the foregoing method embodiments is stated as a series of action combinations. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Corresponding to the above method embodiments, an embodiment of the present invention also provides a data clustering device. Its structure is shown in Fig. 3 and may include: a first computing unit 11, a second computing unit 12, a determination unit 13 and a division unit 14.
The first computing unit 11 is configured to, when uncertain data to be clustered is obtained, for any data set: divide the uncertain data into the data set, and recalculate the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data.
It should be understood that a preset initial centroid is set in advance for each data set. When uncertain data to be clustered is obtained, the uncertain data is divided into any one data set, and the preset initial centroid of that data set is then recalculated based on the uncertainty probability density function of the uncertain data; that is, the preset initial centroid of the data set is changed through the uncertainty probability density function of the uncertain data.
In the present embodiment, one way in which the first computing unit 11 recalculates the preset initial centroid of the data set is:
Based on the formula:
the preset initial centroid c_j of the j-th data set C_j is obtained, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
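The formula itself is not reproduced in this text. As a hedged illustration only, the sketch below assumes one common form for this kind of method, in which the centroid of a data set is the average of the expected positions of its members, and each expected position is the integral of x_i f(x_i) over the uncertainty region. The integral is approximated by a weighted sum over sample points; this sample-based representation and the function names are assumptions made for the example, not the patent's stated implementation.

```python
def expected_position(samples, weights):
    """Approximate the integral of x * f(x) dx by a weighted sum over sample points."""
    total = sum(weights)
    dim = len(samples[0])
    return tuple(
        sum(w * s[d] for s, w in zip(samples, weights)) / total
        for d in range(dim)
    )

def recompute_centroid(members):
    """Centroid of a data set as the mean of its members' expected positions.

    `members` is a list of (samples, weights) pairs, one per uncertain datum.
    """
    means = [expected_position(s, w) for s, w in members]
    dim = len(means[0])
    return tuple(sum(m[d] for m in means) / len(means) for d in range(dim))

# Two uncertain data: one spread over two points, one concentrated at a single point.
members = [([(0.0, 0.0), (2.0, 0.0)], [1.0, 1.0]), ([(4.0, 4.0)], [1.0])]
centroid = recompute_centroid(members)
```

Representing each probability density function by weighted samples keeps the example independent of any particular region shape (line, circle and so on); a closed-form mean could replace `expected_position` where one is available.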
The second computing unit 12 is configured to, for any data set: calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determine this sum as the expected squared error sum of the uncertain data relative to the data set.
That is, when determining the expected squared error sum of the uncertain data relative to the j-th of all the data sets, the preset initial centroid of the j-th data set is the recalculated preset initial centroid, while the preset initial centroids of the other data sets are the ones set in advance.
In the present embodiment, one way in which the second computing unit 12 obtains the expected squared error sum is: based on the formula:
the sum of the expected squared error from the uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets is obtained, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
The determination unit 13 is configured to determine the data set with the smallest expected squared error sum as the target data set. Clustering the uncertain data is thereby treated as the problem of minimizing the expected sum of squared errors E(SSE), so the target data set of the uncertain data can be determined by the minimum expected squared error sum.
The division unit 14 is configured to divide the uncertain data into the target data set.
As can be seen from the above technical solution, when uncertain data to be clustered is obtained, for any data set, the uncertain data is divided into the data set and the preset initial centroid of the data set is recalculated based on the uncertainty probability density function of the uncertain data. Also for any data set, the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets is calculated and determined as the expected squared error sum of the uncertain data relative to the data set. The data set with the smallest expected squared error sum is determined as the target data set, and the uncertain data is divided into the target data set. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of uncertain data clustering.
Referring to Fig. 4, which shows another structure of the data clustering device provided by an embodiment of the present invention, the device may include: a determination unit 21, a division unit 22 and a computing unit 23.
The determination unit 21 is configured to, when uncertain data to be clustered is obtained, determine the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data.
In the present embodiment, the expected distance from the uncertain data to the preset initial centroid of each data set can be denoted as E(||c_j - x_i||). In particular, uncertainty regions of various geometric shapes (e.g., a line or a circle) and different uncertainty probability density functions generally require numerical integration; in view of this, E(||c_j - x_i||^2) can be used in place of E(||c_j - x_i||).
Thus the expected distance from the uncertain data to the preset initial centroid of each data set is calculated by the formula:
which gives the expected distance from the uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function.
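As an illustration of why the squared form is convenient, the sketch below evaluates E(||c_j - x_i||^2) in two ways: directly as a weighted average of squared distances, and through the standard identity that the expected squared distance equals the squared distance from the centroid to the mean of x_i plus the spread (total variance) of x_i. The weighted-sample representation of the probability density function and the function names are assumptions made for this example.

```python
def expected_squared_distance(centroid, samples, weights):
    """Direct evaluation: weighted average of squared distances to the centroid."""
    total = sum(weights)
    return sum(
        w * sum((c - v) ** 2 for c, v in zip(centroid, s))
        for s, w in zip(samples, weights)
    ) / total

def expected_squared_distance_via_mean(centroid, samples, weights):
    """Same quantity as ||centroid - mean||^2 plus the spread of the samples."""
    total = sum(weights)
    dim = len(centroid)
    mean = [sum(w * s[d] for s, w in zip(samples, weights)) / total for d in range(dim)]
    # The spread does not depend on the centroid, so it can be computed once per datum.
    spread = sum(
        w * sum((s[d] - mean[d]) ** 2 for d in range(dim))
        for s, w in zip(samples, weights)
    ) / total
    return sum((centroid[d] - mean[d]) ** 2 for d in range(dim)) + spread

samples, weights = [(0.0, 0.0), (2.0, 0.0)], [1.0, 1.0]
direct = expected_squared_distance((1.0, 1.0), samples, weights)
via_mean = expected_squared_distance_via_mean((1.0, 1.0), samples, weights)
```

Because the spread term is the same for every candidate centroid, comparing data sets by the squared form only requires the mean of each uncertain datum, which avoids repeated numerical integration.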
The division unit 22 is configured to determine the data set with the smallest expected distance as the target data set of the uncertain data, and to divide the uncertain data into the target data set.
The computing unit 23 is configured to recalculate the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and to trigger the determination unit 21 and the division unit 22 to iteratively execute the steps of determining, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set, and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
In the present embodiment, one way in which the computing unit 23 recalculates the preset initial centroid of the target data set is: based on the formula:
the preset initial centroid c_j of the target data set C_j is obtained, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
It should further be noted that, when the determination unit is triggered to iteratively determine the expected distance from the uncertain data to the preset initial centroid of each data set, if the preset initial centroid of a data set has been recalculated, then for that data set the determination unit 21 determines the expected distance to the recalculated preset initial centroid; that is, c_j in the above expression E(||c_j - x_i||) is the recalculated preset initial centroid.
The preset condition may depend on the practical application. For example, the preset condition may be: (1) the expected distance is less than a preset distance (depending on the practical application); (2) the uncertain data to be clustered is reassigned in an iteration to the same target data set as before; or (3) the number of iterations reaches a preset number of iterations (depending on the practical application).
As can be seen from the above technical solution, when uncertain data to be clustered is obtained, the expected distance from the uncertain data to the preset initial centroid of each data set is determined based on the uncertainty probability density function of the uncertain data; the data set with the smallest expected distance is determined as the target data set of the uncertain data, and the uncertain data is divided into the target data set; the preset initial centroid of the target data set is recalculated based on the uncertainty probability density function of the uncertain data; and the above steps are iterated until the preset condition is met. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of uncertain data clustering.
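The iterative procedure summarized above can be sketched as a batch loop over all uncertain data. Several points are assumptions made for illustration: each probability density function is represented by weighted sample points, the squared form E(||c_j - x_i||^2) is used as the expected distance, the centroid is taken as the average of the members' expected positions, and only two of the example preset conditions are implemented (no datum changes its data set, or an iteration limit is reached). The function names are hypothetical.

```python
def mean_of(samples, weights):
    """Expected position of one uncertain datum from its weighted samples."""
    total = sum(weights)
    return tuple(sum(w * s[d] for s, w in zip(samples, weights)) / total
                 for d in range(len(samples[0])))

def exp_sq_dist(centroid, samples, weights):
    """Expected squared distance from one uncertain datum to a centroid."""
    total = sum(weights)
    return sum(w * sum((c - v) ** 2 for c, v in zip(centroid, s))
               for s, w in zip(samples, weights)) / total

def cluster_uncertain(objects, centroids, max_iter=10000):
    """objects: list of (samples, weights); centroids: preset initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = [None] * len(objects)
    for _ in range(max_iter):
        # Assign each datum to the data set with the smallest expected distance.
        new_labels = [
            min(range(len(centroids)), key=lambda j: exp_sq_dist(centroids[j], s, w))
            for s, w in objects
        ]
        if new_labels == labels:
            break  # no datum changed its data set in this iteration
        labels = new_labels
        # Recalculate the centroid of every data set that has members.
        for j in range(len(centroids)):
            means = [mean_of(s, w) for (s, w), lab in zip(objects, labels) if lab == j]
            if means:  # an empty data set keeps its previous centroid
                centroids[j] = tuple(sum(m[d] for m in means) / len(means)
                                     for d in range(len(means[0])))
    return labels, centroids

objects = [
    ([(0.0, 0.0), (1.0, 0.0)], [1.0, 1.0]),
    ([(0.0, 1.0)], [1.0]),
    ([(10.0, 10.0), (11.0, 10.0)], [1.0, 1.0]),
    ([(10.0, 11.0)], [1.0]),
]
labels, final_centroids = cluster_uncertain(objects, [(0.0, 0.0), (10.0, 10.0)])
```

The default `max_iter` of 10000 mirrors the iteration limit used in the experiments described earlier; a distance threshold could be added as a further stopping condition if the application calls for it.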
In addition, the present embodiment also provides a storage medium on which a computer program is stored, the computer program being used to implement the above data clustering method.
It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also fall within the protection scope of the present invention.

Claims (10)

1. A data clustering method, characterized in that the method comprises:
when uncertain data to be clustered is obtained, for any data set: dividing the uncertain data into the data set, and recalculating the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data;
for any data set: calculating the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determining the sum as the expected squared error sum of the uncertain data relative to the data set;
determining the data set with the smallest expected squared error sum as the target data set;
dividing the uncertain data into the target data set.
2. The method according to claim 1, characterized in that, for any data set, dividing the uncertain data into the data set and recalculating the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data comprises:
based on the formula:
obtaining the preset initial centroid c_j of the j-th data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
3. The method according to claim 1, characterized in that, for any data set, calculating the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets comprises:
based on the formula:
obtaining the sum of the expected squared error from the uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
4. A data clustering method, characterized in that the method comprises:
when uncertain data to be clustered is obtained, determining the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data;
determining the data set with the smallest expected distance as the target data set of the uncertain data, and dividing the uncertain data into the target data set;
recalculating the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and iteratively executing the steps of determining, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set, and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
5. The method according to claim 4, characterized in that determining the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data comprises:
based on the formula:
obtaining the expected distance from the uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function.
6. The method according to claim 4, characterized in that recalculating the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data comprises:
based on the formula:
obtaining the preset initial centroid c_j of the target data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
7. A data clustering device, characterized in that the device comprises:
a first computing unit, configured to, when uncertain data to be clustered is obtained, for any data set: divide the uncertain data into the data set, and recalculate the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data;
a second computing unit, configured to, for any data set: calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determine the sum as the expected squared error sum of the uncertain data relative to the data set;
a determination unit, configured to determine the data set with the smallest expected squared error sum as the target data set;
a division unit, configured to divide the uncertain data into the target data set.
8. The device according to claim 7, characterized in that the first computing unit is configured to, based on the formula:
obtain the preset initial centroid c_j of the j-th data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function;
or
the second computing unit is configured to, based on the formula:
obtain the sum of the expected squared error from the uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
9. A data clustering device, characterized in that the device comprises:
a determination unit, configured to, when uncertain data to be clustered is obtained, determine the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data;
a division unit, configured to determine the data set with the smallest expected distance as the target data set of the uncertain data, and to divide the uncertain data into the target data set;
a computing unit, configured to recalculate the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and to trigger the determination unit and the division unit to iteratively execute the steps of determining, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set, and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
10. The device according to claim 9, characterized in that the determination unit is configured to, based on the formula:
obtain the expected distance from the uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function;
or
the computing unit is configured to, based on the formula:
obtain the preset initial centroid c_j of the target data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
CN201810723419.8A 2018-07-04 2018-07-04 Data clustering method and device Active CN109002513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810723419.8A CN109002513B (en) 2018-07-04 2018-07-04 Data clustering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810723419.8A CN109002513B (en) 2018-07-04 2018-07-04 Data clustering method and device

Publications (2)

Publication Number Publication Date
CN109002513A true CN109002513A (en) 2018-12-14
CN109002513B CN109002513B (en) 2022-07-19

Family

ID=64598536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810723419.8A Active CN109002513B (en) 2018-07-04 2018-07-04 Data clustering method and device

Country Status (1)

Country Link
CN (1) CN109002513B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689180A (en) * 2019-09-18 2020-01-14 科大国创软件股份有限公司 Intelligent route planning method and system based on geographic position
CN112989221A (en) * 2021-02-18 2021-06-18 支付宝(杭州)信息技术有限公司 Position-based family relation analysis method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030046253A1 (en) * 2001-05-17 2003-03-06 Honeywell International Inc. Neuro/fuzzy hybrid approach to clustering data
US20090222472A1 (en) * 2008-02-28 2009-09-03 Aggarwal Charu C Method and Apparatus for Aggregation in Uncertain Data
CN103177088A (en) * 2013-03-08 2013-06-26 北京理工大学 Biomedicine missing data compensation method
CN104731916A (en) * 2015-03-24 2015-06-24 无锡中科泛在信息技术研发中心有限公司 Optimizing initial center K-means clustering method based on density in data mining
CN105260748A (en) * 2015-10-16 2016-01-20 吉林大学 Method for clustering uncertain data
CN106684905A (en) * 2016-11-21 2017-05-17 国网四川省电力公司经济技术研究院 Wind power plant dynamic equivalence method with wind power forecast uncertainty considered
CN107316081A (en) * 2017-06-12 2017-11-03 大连理工大学 A kind of uncertain data sorting technique based on extreme learning machine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
肖宇鹏 等: ""基于模糊c-均值的空间不确定数据聚类"", 《计算机工程》 *

Also Published As

Publication number Publication date
CN109002513B (en) 2022-07-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant