CN109002513A - A data clustering method and device
- Publication number
- CN109002513A CN201810723419.8A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The present invention provides a data clustering method and device. When uncertain data to be clustered is obtained, the information needed to cluster it is calculated from the uncertainty probability density function of that data. For example, for any data set, the preset initial centroid of the data set is recalculated based on the uncertainty probability density function of the uncertain data; the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of that data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets is taken as the expected sum of squared errors of the uncertain data relative to that data set; the data set with the smallest expected sum of squared errors is determined as the target data set; and the uncertain data is assigned to the target data set. Clustering the uncertain data based on its uncertainty probability density function in this way improves the accuracy of uncertain-data clustering.
Description
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a data clustering method and device.
Background
Data is often uncertain (such data is referred to as uncertain data for short) because of inaccurate measurement, sampling error, outdated data sources, or other causes. This is especially true in applications that interact with the real environment, such as location-based services and sensor monitoring. For example, a location-based service that tracks moving targets (such as vehicles or people) cannot track the exact instantaneous position of every moving target, so the position changes of each moving target are uncertain. This uncertainty affects data management, including data querying and data clustering.
Data uncertainty currently falls into two types: existential uncertainty and value uncertainty. In the first type, it is uncertain whether a target or data tuple exists at all; for example, a data tuple in a relational database may be associated with a probability value that expresses the degree of belief in its existence. In the second type, a data item is modeled as a closed region, and the probability density function (PDF) of the data item constrains its value. For both types, existing data clustering takes the following two forms:
One approach fits mixture densities with the EM (Expectation Maximization) algorithm, and the other clusters uncertain data with the fuzzy C-means clustering algorithm. Neither of these data clustering methods considers the effect of uncertainty on clustering, which reduces clustering accuracy.
Summary of the invention
In view of this, the purpose of the present invention is to provide a data clustering method and device for improving the accuracy of uncertain-data clustering. The technical solution is as follows:
The present invention provides a data clustering method, the method comprising:
when uncertain data to be clustered is obtained, for any data set: assigning the uncertain data to the data set, and recalculating the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data;
for any data set: calculating the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determining this sum as the expected sum of squared errors of the uncertain data relative to the data set;
determining the data set with the smallest expected sum of squared errors as the target data set;
assigning the uncertain data to the target data set.
Preferably, for any data set, assigning the uncertain data to the data set and recalculating the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data comprises:
obtaining the preset initial centroid c_j of the j-th data set C_j based on the formula:
$$c_j = E\left(\frac{1}{|C_j|}\sum_{x_i \in C_j} x_i\right) = \frac{1}{|C_j|}\sum_{x_i \in C_j} \int x_i\, f(x_i)\, dx_i$$
where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
Preferably, for any data set, calculating the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets comprises:
obtaining, based on the formula:
$$E(SSE) = \sum_{j=1}^{K}\sum_{x_i \in C_j} E\left(\|c_j - x_i\|^2\right) = \sum_{j=1}^{K}\sum_{x_i \in C_j} \int \|c_j - x_i\|^2 f(x_i)\, dx_i$$
the sum of the expected squared error from uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
The present invention also provides a data clustering method, the method comprising:
when uncertain data to be clustered is obtained, determining, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set;
determining the data set with the smallest expected distance as the target data set of the uncertain data, and assigning the uncertain data to the target data set;
recalculating the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and iterating the steps of determining the expected distance from the uncertain data to the preset initial centroid of each data set and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
Preferably, determining, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set comprises:
obtaining, based on the formula:
$$E\left(\|c_j - x_i\|^2\right) = \int \|c_j - x_i\|^2 f(x_i)\, dx_i$$
the expected distance from uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function.
Preferably, recalculating the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data comprises:
obtaining the preset initial centroid c_j of the target data set C_j based on the formula:
$$c_j = E\left(\frac{1}{|C_j|}\sum_{x_i \in C_j} x_i\right) = \frac{1}{|C_j|}\sum_{x_i \in C_j} \int x_i\, f(x_i)\, dx_i$$
where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
The present invention also provides a data clustering device, the device comprising:
a first computing unit, configured to, when uncertain data to be clustered is obtained, for any data set: assign the uncertain data to the data set, and recalculate the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data;
a second computing unit, configured to, for any data set: calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determine this sum as the expected sum of squared errors of the uncertain data relative to the data set;
a determination unit, configured to determine the data set with the smallest expected sum of squared errors as the target data set;
a division unit, configured to assign the uncertain data to the target data set.
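Preferably, the first computing unit is configured to obtain the preset initial centroid c_j of the j-th data set C_j based on the formula:
$$c_j = E\left(\frac{1}{|C_j|}\sum_{x_i \in C_j} x_i\right) = \frac{1}{|C_j|}\sum_{x_i \in C_j} \int x_i\, f(x_i)\, dx_i$$
where x_i is the uncertain data and f(x_i) is the uncertainty probability density function;
or
the second computing unit is configured to obtain, based on the formula:
$$E(SSE) = \sum_{j=1}^{K}\sum_{x_i \in C_j} E\left(\|c_j - x_i\|^2\right) = \sum_{j=1}^{K}\sum_{x_i \in C_j} \int \|c_j - x_i\|^2 f(x_i)\, dx_i$$
the sum of the expected squared error from uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.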
The present invention also provides a data clustering device, the device comprising:
a determination unit, configured to, when uncertain data to be clustered is obtained, determine, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set;
a division unit, configured to determine the data set with the smallest expected distance as the target data set of the uncertain data, and assign the uncertain data to the target data set;
a computing unit, configured to recalculate the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and to trigger the determination unit and the division unit to iterate the steps of determining the expected distance from the uncertain data to the preset initial centroid of each data set and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
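Preferably, the determination unit is configured to obtain, based on the formula:
$$E\left(\|c_j - x_i\|^2\right) = \int \|c_j - x_i\|^2 f(x_i)\, dx_i$$
the expected distance from uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function;
or
the computing unit is configured to obtain the preset initial centroid c_j of the target data set C_j based on the formula:
$$c_j = E\left(\frac{1}{|C_j|}\sum_{x_i \in C_j} x_i\right) = \frac{1}{|C_j|}\sum_{x_i \in C_j} \int x_i\, f(x_i)\, dx_i$$
where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.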
As the above technical solution shows, when uncertain data to be clustered is obtained, the information needed to cluster it is calculated from the uncertainty probability density function of the uncertain data. For example, the preset initial centroid of a data set is recalculated based on the uncertainty probability density function of the uncertain data; the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of that data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets is taken as the expected sum of squared errors of the uncertain data relative to that data set; the data set with the smallest expected sum of squared errors is determined as the target data set; and the uncertain data is assigned to the target data set. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of uncertain-data clustering.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a data clustering method provided by an embodiment of the present invention;
Fig. 2 is another flowchart of the data clustering method provided by an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of a data clustering device provided by an embodiment of the present invention;
Fig. 4 is another structural schematic diagram of the data clustering device provided by an embodiment of the present invention.
Detailed description
Currently, the data clustering problem is to find a set C of data sets C_j (j from 1 to K), where each data set C_j is formed around a similarity-based mean c_j (regarded as the preset initial centroid of data set C_j). Different data clustering algorithms may use different objective functions, but the main idea is to minimize the distances between data in the same data set and to maximize the distances between data in different data sets. Minimizing the distances between data in the same data set can also be viewed as minimizing the distance between every pair of data items in the data set and minimizing the distance between every data item and the preset initial centroid of the data set.
Starting from a hard clustering algorithm, the K-means algorithm, the applicant studied a clustering algorithm suitable for uncertain data. The purpose of the K-means algorithm is to find, for K data sets, a set C that minimizes the sum of squared errors (SSE). The sum of squared errors is calculated as follows:
$$SSE = \sum_{j=1}^{K}\sum_{x_i \in C_j} \|c_j - x_i\|^2$$
where ||·|| denotes the distance between a data item x_i and the preset initial centroid c_j of a data set. For example, the Euclidean distance is defined as:
$$\|x - y\| = \sqrt{\sum_{k} (x_k - y_k)^2}$$
The preset initial centroid of a data set C_j is defined by the following vector form:
$$c_j = \frac{1}{|C_j|}\sum_{x_i \in C_j} x_i$$
Correspondingly, the K-means algorithm proceeds as follows:
Assign initial values for cluster means c_1 to c_K
repeat
    for i = 1 to n do
        Assign each data point x_i to cluster C_j where ||c_j - x_i|| is the minimum
    end for
    for j = 1 to K do
        Recalculate cluster mean c_j of cluster C_j
    end for
until convergence
return C
Briefly, the process is: 1) set a preset initial centroid for each data set; 2) calculate the distance ||c_j - x_i|| from each data item to be clustered to the preset initial centroid of each data set, and assign the data item to the data set with the smallest distance; 3) recalculate the preset initial centroid of the data set with the smallest distance; 4) iterate steps 2) and 3) until a preset condition is met.
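The four steps above can be sketched in a few lines of Python. This is a minimal illustration for two-dimensional points, not the patent's implementation: the function name `kmeans`, the `max_iter` cap, and the choice to stop when no assignment changes are assumptions made for the example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain K-means on certain 2-D points; returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]   # step 1: preset initial centroids
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to the data set with the nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:   # step 4: stop once no assignment changes
            break
        labels = new_labels
        # Step 3: each centroid becomes the mean of its current members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:            # an empty data set keeps its old centroid
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Every point here is treated as exact, which is the limitation the following paragraphs address.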
As the K-means algorithm above shows, clustering with K-means does not consider the effect of uncertainty. The applicant therefore concluded that, when clustering uncertain data, the information needed for clustering should be calculated from the uncertainty probability density function of the uncertain data. For example, the distance ||c_j - x_i|| from each data item to be clustered to the preset initial centroid of each data set is replaced by the expected distance E(||c_j - x_i||), and the preset initial centroid is calculated from the uncertainty probability density function of the uncertain data; alternatively, the goal of clustering is regarded as minimizing the expected sum of squared errors. Either way, the accuracy of uncertain-data clustering improves.
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 shows a flowchart of a data clustering method provided by an embodiment of the present invention. The data clustering method targets uncertain data and improves the accuracy of uncertain-data clustering. It may specifically include the following steps:
101: when uncertain data to be clustered is obtained, for any data set: assign the uncertain data to the data set, and recalculate the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data.
It should be understood that a preset initial centroid is set in advance for each data set. When uncertain data to be clustered is obtained, the uncertain data is assigned to any one of the data sets, and the preset initial centroid of that data set is then recalculated based on the uncertainty probability density function of the uncertain data; that is, the uncertainty probability density function of the uncertain data changes the preset initial centroid of the data set.
In this embodiment, one way to recalculate the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data is:
to obtain the preset initial centroid c_j of the j-th data set C_j based on the formula:
$$c_j = E\left(\frac{1}{|C_j|}\sum_{x_i \in C_j} x_i\right) = \frac{1}{|C_j|}\sum_{x_i \in C_j} \int x_i\, f(x_i)\, dx_i$$
where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
Take three data sets as an example: data set 1, data set 2, and data set 3. When uncertain data x_i is assigned to data set 1, the preset initial centroid of data set 1 can be recalculated based on the above formula, so that the uncertainty probability density function of the uncertain data changes the preset initial centroid of the data set.
The uncertainty probability density function is related to the attribute value o_i of the uncertain data; for example, the uncertainty probability density function is the probability density function of attribute value o_i at time t, with i = 1 to n. The uncertainty probability density function may take the form of a uniform density function or a Gaussian distribution function.
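As an illustration of recalculating a centroid from the uncertainty probability density function, the sketch below assumes that each uncertain data item is uniformly distributed on a line segment, so that its expected position is the segment midpoint, and takes the centroid as the mean of the members' expected positions. The function names and the sample segments are hypothetical.

```python
def expected_position(segment):
    """E[x] for a point uniformly distributed on a line segment: its midpoint."""
    (a, b), (c, d) = segment
    return ((a + c) / 2.0, (b + d) / 2.0)

def recompute_centroid(segments):
    """Centroid of a data set as the mean of its members' expected positions."""
    expected = [expected_position(s) for s in segments]
    n = len(expected)
    return (sum(e[0] for e in expected) / n, sum(e[1] for e in expected) / n)

# Hypothetical data set 1 with two members, plus a newly assigned item x_i.
data_set_1 = [((0.0, 0.0), (2.0, 0.0)), ((4.0, 2.0), (4.0, 6.0))]
x_i = ((6.0, 2.0), (8.0, 2.0))
new_centroid = recompute_centroid(data_set_1 + [x_i])
```

For a Gaussian density the same function would apply with `expected_position` returning the Gaussian mean instead of a midpoint.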
102: for any data set: calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determine this sum as the expected sum of squared errors of the uncertain data relative to the data set.
That is, when determining the expected sum of squared errors of the uncertain data relative to the j-th of all the data sets, the preset initial centroid of the j-th data set is the recalculated one, while the preset initial centroids of the other data sets are the ones set in advance.
Continuing with the three data sets above (data set 1, data set 2, and data set 3): when determining the expected sum of squared errors of the uncertain data relative to data set 1, the preset initial centroid of data set 1 is the recalculated one, while the preset initial centroids of data set 2 and data set 3 are the ones set in advance.
In this embodiment, for any data set, one way to calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets is:
to obtain, based on the formula:
$$E(SSE) = \sum_{j=1}^{K}\sum_{x_i \in C_j} E\left(\|c_j - x_i\|^2\right) = \sum_{j=1}^{K}\sum_{x_i \in C_j} \int \|c_j - x_i\|^2 f(x_i)\, dx_i$$
the sum of the expected squared error from uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
103: determine the data set with the smallest expected sum of squared errors as the target data set, and assign the uncertain data to the target data set. Clustering the uncertain data is thereby treated as a problem of minimizing the expected sum of squared errors E(SSE): the target data set of the uncertain data is the one with the smallest expected sum of squared errors, and the uncertain data can then be assigned to it.
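Steps 101 to 103 can be sketched as follows. The sketch takes one reading of the text, which is an assumption: for each candidate data set j, the item is tentatively added, only the centroid of data set j is recalculated, and the quantity compared is the total E(SSE) over all data sets. It also assumes uniform uncertainty on line segments, for which the expected squared distance to a centroid has a closed form. All names are hypothetical.

```python
def expected_sq_dist(segment, centroid):
    """E(||c - x||^2) for x uniform on a segment: D^2/3 + B/2 + C."""
    (a, b), (c, d) = segment
    p, q = centroid
    d2 = (c - a) ** 2 + (d - b) ** 2
    b_coef = 2.0 * ((c - a) * (a - p) + (d - b) * (b - q))
    c_coef = (p - a) ** 2 + (q - b) ** 2
    return d2 / 3.0 + b_coef / 2.0 + c_coef

def centroid_of(segments):
    """Mean of the segment midpoints (expected positions under uniformity)."""
    n = len(segments)
    return (sum((s[0][0] + s[1][0]) / 2.0 for s in segments) / n,
            sum((s[0][1] + s[1][1]) / 2.0 for s in segments) / n)

def assign_by_expected_sse(x, data_sets, centroids):
    """Return (target index, per-candidate totals) minimising E(SSE)."""
    totals = []
    for j in range(len(data_sets)):
        trial_sets = [list(ds) for ds in data_sets]
        trial_sets[j].append(x)                           # step 101: tentative assignment
        trial_centroids = list(centroids)
        trial_centroids[j] = centroid_of(trial_sets[j])   # only centroid j is recalculated
        totals.append(sum(expected_sq_dist(s, trial_centroids[k])   # step 102
                          for k, ds in enumerate(trial_sets) for s in ds))
    target = min(range(len(totals)), key=totals.__getitem__)        # step 103
    return target, totals

data_sets = [[((0.0, 0.0), (2.0, 0.0))], [((10.0, 10.0), (12.0, 10.0))]]
centroids = [(1.0, 0.0), (11.0, 10.0)]
target, totals = assign_by_expected_sse(((1.0, 1.0), (3.0, 1.0)), data_sets, centroids)
```

In this example the new item lies next to the first data set, so the first candidate yields the smaller total and becomes the target data set.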
As the above technical solution shows, when uncertain data to be clustered is obtained, for any data set: the uncertain data is assigned to the data set, and the preset initial centroid of the data set is recalculated based on the uncertainty probability density function of the uncertain data; and for any data set: the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets is calculated and determined as the expected sum of squared errors of the uncertain data relative to the data set. The data set with the smallest expected sum of squared errors is determined as the target data set, and the uncertain data is assigned to it. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of uncertain-data clustering.
Fig. 2 shows another flowchart of the data clustering method provided by an embodiment of the present invention. This data clustering method likewise targets uncertain data and improves the accuracy of uncertain-data clustering. It may specifically include the following steps:
201: when uncertain data to be clustered is obtained, determine, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set.
In this embodiment, the expected distance from the uncertain data to the preset initial centroid of each data set can be denoted E(||c_j - x_i||). In particular, the various geometric shapes of the uncertainty region (e.g., line, circle) and the different uncertainty probability density functions would require numerical integration, so E(||c_j - x_i||^2) can be used in place of E(||c_j - x_i||). The expected distance from the uncertain data to the preset initial centroid of each data set is then calculated as:
$$E\left(\|c_j - x_i\|^2\right) = \int \|c_j - x_i\|^2 f(x_i)\, dx_i$$
which gives the expected distance from uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function.
202: determine the data set with the smallest expected distance as the target data set of the uncertain data, and assign the uncertain data to the target data set.
203: recalculate the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and iterate steps 201 and 202 until a preset condition is met.
In this embodiment, one way to recalculate the preset initial centroid of the target data set is to obtain the preset initial centroid c_j of the target data set C_j based on the formula:
$$c_j = E\left(\frac{1}{|C_j|}\sum_{x_i \in C_j} x_i\right) = \frac{1}{|C_j|}\sum_{x_i \in C_j} \int x_i\, f(x_i)\, dx_i$$
where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
It should also be noted that when step 201 is executed iteratively to determine the expected distance from the uncertain data to the preset initial centroid of each data set, if the preset initial centroid of a data set has been recalculated, then for that data set step 201 determines the expected distance from the uncertain data to the recalculated preset initial centroid; that is, c_j in the above E(||c_j - x_i||) is the recalculated preset initial centroid.
The preset condition depends on the practical application. For example, the preset condition may be: (1) the expected distance is less than a preset distance (depending on the practical application); (2) the uncertain data to be clustered is reassigned to the same target data set as in the previous iteration; or (3) the number of iterations reaches a preset number (depending on the practical application).
The process shown in Fig. 2 above can be expressed as a loop as follows:
Assign initial values for cluster means c_1 to c_K (c_1 to c_K are the preset initial centroids of the data sets)
repeat
    for i = 1 to n do
        Assign each data point x_i (uncertain data) to cluster C_j (the j-th data set) where E(||c_j - x_i||) (expected distance) is the minimum
    end for
    for j = 1 to K do
        Recalculate cluster mean c_j of cluster C_j
    end for
until convergence
return C (target data sets)
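The loop above can be sketched in Python under stated assumptions: each uncertain data item is uniformly distributed on a line segment, the expected squared distance E(||c_j - x_i||^2) stands in for the expected distance, and iteration stops when no assignment changes or a hypothetical `max_iter` cap is reached. The function names are illustrative only.

```python
def expected_sq_dist(segment, centroid):
    """E(||c - x||^2) for x uniform on a segment: D^2/3 + B/2 + C."""
    (a, b), (c, d) = segment
    p, q = centroid
    d2 = (c - a) ** 2 + (d - b) ** 2
    b_coef = 2.0 * ((c - a) * (a - p) + (d - b) * (b - q))
    c_coef = (p - a) ** 2 + (q - b) ** 2
    return d2 / 3.0 + b_coef / 2.0 + c_coef

def fk_means(segments, centroids, max_iter=100):
    """K-means-style loop on uncertain segments; returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = [None] * len(segments)
    for _ in range(max_iter):
        # Steps 201-202: assign each item to the smallest expected squared distance.
        new_labels = [
            min(range(len(centroids)), key=lambda j: expected_sq_dist(s, centroids[j]))
            for s in segments
        ]
        if new_labels == labels:   # preset condition: assignments unchanged
            break
        labels = new_labels
        # Step 203: recalculate centroids from the members' expected positions.
        for j in range(len(centroids)):
            mids = [((s[0][0] + s[1][0]) / 2.0, (s[0][1] + s[1][1]) / 2.0)
                    for s, lab in zip(segments, labels) if lab == j]
            if mids:               # an empty data set keeps its old centroid
                centroids[j] = (sum(m[0] for m in mids) / len(mids),
                                sum(m[1] for m in mids) / len(mids))
    return labels, centroids

segments = [((0.0, 0.0), (2.0, 0.0)), ((0.0, 1.0), (2.0, 1.0)),
            ((10.0, 10.0), (12.0, 10.0)), ((10.0, 11.0), (12.0, 11.0))]
labels, centroids = fk_means(segments, [(0.0, 0.0), (10.0, 10.0)])
```

The structure mirrors plain K-means; only the distance and the centroid update consult the uncertainty model.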
As the above technical solution shows, when uncertain data to be clustered is obtained, the expected distance from the uncertain data to the preset initial centroid of each data set is determined based on the uncertainty probability density function of the uncertain data; the data set with the smallest expected distance is determined as the target data set of the uncertain data, and the uncertain data is assigned to it; the preset initial centroid of the target data set is recalculated based on the uncertainty probability density function of the uncertain data; and the above steps are iterated until a preset condition is met. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of uncertain-data clustering.
To demonstrate the feasibility of the above data clustering method, the data clustering approach provided in this embodiment is applied to a scenario with uncertain data corresponding to targets moving in a plane. In this scenario, each uncertain data item moves in one direction and its position is uniformly distributed along a straight line segment.
Suppose the preset initial centroid is c = (p, q) and an uncertain data item x lies on an uncertain line segment whose end points are (a, b) and (c, d). The line segment can then be expressed parametrically as (a + t(c-a), b + t(d-b)), where t belongs to [0, 1]. Let f(t) denote the uncertainty probability density function. The length of the uncertain line segment is:
$$D = \sqrt{(c-a)^2 + (d-b)^2}$$
Further:
$$E\left(\|c - x\|^2\right) = \int_0^1 f(t)\left(D^2 t^2 + B t + C\right) dt$$
where B = 2[(c-a)(a-p) + (d-b)(b-q)] and C = (p-a)^2 + (q-b)^2.
If the uncertainty probability density function f(t) is uniform, that is, f(t) = 1, the above formula becomes:
$$E\left(\|c - x\|^2\right) = \frac{D^2}{3} + \frac{B}{2} + C$$
The expected distance can thus be computed for uniformly distributed uncertainty, which enables the clustering of uncertain data. Note that the uniform distribution is a special case; when the distribution is not uniform, the uncertainty probability density function can be represented by, for example, a Gaussian function.
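The uniform case can be checked numerically. For f(t) = 1, integrating the squared distance along the segment gives D^2/3 + B/2 + C, where D is the segment length and B and C are as defined above. The sketch below compares that closed form with a midpoint-rule integration; the function names, the `pdf` parameter, and the step count are assumptions for illustration.

```python
def expected_sq_dist_uniform(p, q, a, b, c, d):
    """Closed form for f(t) = 1: D^2/3 + B/2 + C."""
    d2 = (c - a) ** 2 + (d - b) ** 2                        # D^2, squared segment length
    b_coef = 2.0 * ((c - a) * (a - p) + (d - b) * (b - q))  # B
    c_coef = (p - a) ** 2 + (q - b) ** 2                    # C
    return d2 / 3.0 + b_coef / 2.0 + c_coef

def expected_sq_dist_numeric(p, q, a, b, c, d, pdf=lambda t: 1.0, steps=10000):
    """Midpoint-rule integral of f(t) * ||c - x(t)||^2 over t in [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        x = a + t * (c - a)
        y = b + t * (d - b)
        total += pdf(t) * ((x - p) ** 2 + (y - q) ** 2) * h
    return total

# Centroid (0, 0); segment from (1, 0) to (4, 4): D^2 = 25, B = 6, C = 1.
closed = expected_sq_dist_uniform(0.0, 0.0, 1.0, 0.0, 4.0, 4.0)
numeric = expected_sq_dist_numeric(0.0, 0.0, 1.0, 0.0, 4.0, 4.0)
```

The numeric routine accepts any density on [0, 1] through `pdf`, which is one way a non-uniform case could be handled when no closed form is at hand.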
To verify whether the data clustering method provided in this embodiment (referred to as the FK-means method for ease of description) improves clustering accuracy, the following scenario is simulated in this embodiment: a system that tracks the positions of a group of moving targets has taken a snapshot reflecting the positions of these targets. The position data is stored in a record set, and each position datum carries some uncertainty, so uncertainty factors are used to capture the uncertainty information. The FK-means method and the K-means method are then compared: (1) the K-means method is applied to the records, and (2) the FK-means method is applied to the records together with the data uncertainty. More specifically, a set of random data is first generated in a 100 × 100 two-dimensional space as the records. For each data item, the uncertainty includes the type of uncertainty, the minimum distance D the data item can move, and the direction in which the data item can move.
Next, the actual positions of these data items are simulated from the records and the uncertainty, as offsets from the original positions stored in the records. Specifically, for each data item, the collected position is stored in the record, and a random number is then generated to determine how far the item may have moved. If the item has free-moving (multidirectional) or bidirectional uncertainty, another random number is generated to determine its possible direction of movement. The resulting positions are referred to as the actual values.
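The data generation described above might look like the following sketch. The text does not give the exact sampling scheme, so several choices here are assumptions: positions and directions are drawn uniformly, each item has unidirectional uncertainty along one random direction, the uncertainty region is a segment of length D starting at the recorded position, and the distance actually moved is uniform on [0, D]. The function name and the value D = 5 are hypothetical.

```python
import math
import random

def simulate_object(rng, d, size=100.0):
    """Return (record, uncertainty segment, simulated actual position)."""
    record = (rng.uniform(0.0, size), rng.uniform(0.0, size))   # recorded position
    angle = rng.uniform(0.0, 2.0 * math.pi)                     # random direction
    direction = (math.cos(angle), math.sin(angle))
    # Uncertainty region: a segment of length d starting at the record.
    segment = (record, (record[0] + d * direction[0], record[1] + d * direction[1]))
    moved = rng.uniform(0.0, d)                                 # distance actually moved
    actual = (record[0] + moved * direction[0], record[1] + moved * direction[1])
    return record, segment, actual

rng = random.Random(0)   # fixed seed so the run is repeatable
objects = [simulate_object(rng, d=5.0) for _ in range(1000)]
```

Clustering the records, the segments, and the actual positions separately would then give the three results that the comparison relies on.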
In this embodiment, the FK-means method and the K-means method are applied to the following data:
(1) records (using traditional K-means)
(2) records and uncertainty (using FK-means)
(3) actual values (using traditional K-means)
To verify how closely the data sets generated by the FK-means method approximate the data sets generated from the actual data, the widely used adjusted Rand index (ARI), which measures the similarity between clustering results, is used. The higher the ARI value, the more similar the two clustering results. The applicant compares the ARI between the data sets generated by (2) and (3) with the ARI between the data sets generated by (1) and (3).
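For reference, the adjusted Rand index can be computed from the contingency counts of two labelings with the standard pair-counting formula. The sketch below is a plain-Python version for illustration; the function name is hypothetical, and it assumes both label lists have the same length with at least two items.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    # Pairs of items placed together in both clusterings.
    sum_ij = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:               # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # same partition, relabeled
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that disagree
```

The index depends only on which items are grouped together, not on the label values, so relabeling a partition leaves it unchanged.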
The number of uncertain data items to be clustered (n), the number of data sets (K), and the minimum distance that a data item can move (D) are varied in the experiments. Table 1 presents the results obtained by varying D while keeping n = 1000 and K = 20. For each parameter combination, 500 experiments were run. In each experiment, a combination of records, uncertainty, and actual values was generated in advance and used in all three clustering processes. The same set of preset initial centroids was also used in all three clustering processes, which avoids bias caused by the preset initial centroids in the K-means and FK-means methods. In each experiment, the K-means method (in (1) and (3)) and the FK-means method (in (2)) were run until no uncertain data item to be clustered changed its data set in two consecutive iterations, or until the number of iterations reached 10000. The adjusted Rand index and the elapsed time were obtained by averaging over the 500 experiments for the FK-means method and the K-means method respectively.
As can be seen from Table 1, when applied to the recorded data, the adjusted Rand index of the FK-means method is always higher than that of the traditional K-means method. Paired tests show that, under all settings, the difference between the adjusted Rand index values of the two methods is significant (p < 0.000001 in each case). This result indicates that the data sets obtained by the FK-means method are closer to the data sets obtained from the real world. In other words, the FK-means method yields data sets that better predict the data sets that would be obtained from the true real-world data.
Table 1. Experimental results (n=1000, K=20)
D | 2.5 | 5 | 7.5 | 10 | 20 | 50 |
---|---|---|---|---|---|---|
ARI (FK-means) | 0.733 | 0.689 | 0.652 | 0.632 | 0.506 | 0.311 |
ARI (K-means) | 0.700 | 0.626 | 0.573 | 0.523 | 0.351 | 0.121 |
Improvement | 0.033 | 0.063 | 0.079 | 0.109 | 0.155 | 0.189 |
Relative improvement | 4.77% | 10.03% | 13.84% | 20.82% | 44.34% | 155.75% |
The applicant ran further experiments by assigning different values to n, K and D while keeping the other variables constant. In all cases, the applicant found that the FK-means method improves on the traditional K-means method, and the results show that the improvement grows as the degree of uncertainty increases. On the other hand, except when the number of data sets is very small, the number of uncertain data items to be clustered and the number of data sets have no great influence on the effect of the FK-means method. In terms of efficiency, the applicant found that the FK-means method needs more computation time than the K-means method, but the additional time is usually modest. This is reasonable, because the FK-means method takes uncertainty into account and produces better clustering quality, that is, the clustering accuracy improves.
For simplicity of description, each of the foregoing method embodiments is stated as a series of action combinations. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Corresponding to the above method embodiments, an embodiment of the present invention also provides a data clustering device whose structure is shown in Fig. 3. The device may include a first computing unit 11, a second computing unit 12, a determination unit 13 and a division unit 14.
The first computing unit 11 is configured to, when uncertain data to be clustered is obtained, for any data set: assign the uncertain data to the data set, and recalculate the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data.
It should be understood that a preset initial centroid is set in advance for each data set. When uncertain data to be clustered is obtained, the uncertain data is assigned to any one data set, and the preset initial centroid of that data set is then recalculated based on the uncertainty probability density function of the uncertain data. That is, the preset initial centroid of the data set is changed through the uncertainty probability density function of the uncertain data.
In this embodiment, one way in which the first computing unit 11 recalculates the preset initial centroid of the data set is based on the formula:
which gives the preset initial centroid c_j of the j-th data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
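The centroid recalculation can be illustrated with a small sketch. It assumes, as is common for clustering uncertain data, that the centroid of a data set is the average of the expected values of its members, and that each uncertainty probability density function is approximated by a discrete list of (point, probability) samples. Both assumptions, and the names `expected_value` and `expected_centroid`, are introduced here for illustration and are not taken from the patent.

```python
def expected_value(samples):
    """Expectation of one uncertain item.

    samples: list of (point, probability) pairs approximating the item's
    uncertainty probability density function (hypothetical representation).
    """
    dim = len(samples[0][0])
    total_p = sum(p for _, p in samples)  # normalise in case weights do not sum to 1
    return tuple(sum(pt[d] * p for pt, p in samples) / total_p for d in range(dim))


def expected_centroid(members):
    """Centroid of a data set, taken as the mean of its members' expectations."""
    means = [expected_value(m) for m in members]
    dim = len(means[0])
    return tuple(sum(m[d] for m in means) / len(means) for d in range(dim))
```

For example, an item that is equally likely to be at (0, 0) or (2, 0) has expectation (1, 0); grouped with an item known to be at (3, 4), the sketch gives the centroid (2, 2).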
The second computing unit 12 is configured to, for any data set: calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determine this sum as the expected sum of squared errors of the uncertain data relative to the data set.
That is, when determining the expected sum of squared errors of the uncertain data relative to the j-th of all the data sets, the preset initial centroid of the j-th data set is the recalculated one, while the preset initial centroids of the other data sets are the ones set in advance.
In this embodiment, one way in which the second computing unit 12 obtains the expected sum of squared errors is based on the formula:
which gives the sum of the expected squared error from uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
The determination unit 13 is configured to determine the data set with the smallest expected sum of squared errors as the target data set. The clustering of uncertain data is thereby treated as the problem of minimizing the expected sum of squared errors E(SSE), so the target data set of the uncertain data can be determined by finding the minimum value of the expected sum of squared errors.
The division unit 14 is configured to assign the uncertain data to the target data set.
As can be seen from the above technical solution, when uncertain data to be clustered is obtained, for any data set, the uncertain data is assigned to the data set and the preset initial centroid of the data set is recalculated based on the uncertainty probability density function of the uncertain data. For any data set, the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets is calculated and determined as the expected sum of squared errors of the uncertain data relative to the data set. The data set with the smallest expected sum of squared errors is determined as the target data set, and the uncertain data is assigned to the target data set. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of clustering uncertain data.
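The selection of the target data set by minimum E(SSE) can be sketched as follows. This is an interpretation under stated assumptions rather than the patent's own procedure: E(SSE) is read as the total expected squared error over all data sets and their members, each uncertain item is summarized by a hypothetical pair (mean, variance) where the variance is the total variance of its probability density function, and the expected squared error of one item is computed as the squared distance from the centroid to the mean plus that variance. The function names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise average of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def expected_sse(clusters, centroids):
    """Total expected squared error; each member is a (mean, variance) pair."""
    return sum(
        sq_dist(c, mean) + var
        for members, c in zip(clusters, centroids)
        for mean, var in members
    )


def choose_target(clusters, centroids, item):
    """Index of the data set that minimises E(SSE) when `item` is added to it."""
    best_j, best_cost = None, float("inf")
    for j in range(len(clusters)):
        trial_clusters = [list(m) for m in clusters]
        trial_clusters[j].append(item)           # tentatively assign item to set j
        trial_centroids = list(centroids)
        trial_centroids[j] = mean_point(          # only set j gets a new centroid
            [mean for mean, _ in trial_clusters[j]]
        )
        cost = expected_sse(trial_clusters, trial_centroids)
        if cost < best_cost:
            best_j, best_cost = j, cost
    return best_j
```

In this sketch only the centroid of the tentatively extended data set is recalculated, while the other centroids keep their preset values, which mirrors the distinction drawn in the description above.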
Fig. 4 shows another structure of the data clustering device provided by an embodiment of the present invention, which may include a determination unit 21, a division unit 22 and a computing unit 23.
The determination unit 21 is configured to, when uncertain data to be clustered is obtained, determine the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data.
In this embodiment, the expected distance from the uncertain data to the preset initial centroid of each data set can be denoted E(||c_j - x_i||). In particular, the various geometric shapes of the uncertainty region (e.g., a line or a circle) and the different uncertainty probability density functions tend to require numerical integration. In view of this, E(||c_j - x_i||^2) can be used in place of E(||c_j - x_i||).
The expected distance from the uncertain data to the preset initial centroid of each data set is therefore calculated by the formula:
which gives the expected distance from uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function.
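One reason the squared form is convenient can be made explicit. Assuming the expected squared distance is the ordinary expectation of ||c_j - x_i||^2 under the density f(x_i), it expands into a term that depends only on the mean of x_i and a term that depends only on its spread. The symbols \mu_i (mean) and \Sigma_i (covariance matrix) are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
E\left(\lVert c_j - x_i \rVert^2\right)
  &= \int \lVert c_j - x_i \rVert^2 \, f(x_i)\, dx_i \\
  &= \lVert c_j - \mu_i \rVert^2 + \operatorname{tr}(\Sigma_i),
\qquad
\mu_i = \int x_i \, f(x_i)\, dx_i,\quad
\Sigma_i = \int (x_i - \mu_i)(x_i - \mu_i)^{\top} f(x_i)\, dx_i .
\end{aligned}
```

Under this assumption, once the mean and the total variance of each uncertain item are known, the expected squared distance to any centroid follows without a further integration, whereas E(||c_j - x_i||) generally has no such closed form.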
The division unit 22 is configured to determine the data set with the smallest expected distance as the target data set of the uncertain data, and to assign the uncertain data to the target data set.
The computing unit 23 is configured to recalculate the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and to trigger the determination unit 21 and the division unit 22 to iteratively perform the steps of determining, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
In this embodiment, one way in which the computing unit 23 recalculates the preset initial centroid of the target data set is based on the formula:
which gives the preset initial centroid c_j of the target data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
It should also be noted that, when the determination unit is triggered to iteratively determine the expected distance from the uncertain data to the preset initial centroid of each data set, if the preset initial centroid of a data set has been recalculated, then for that data set the determination unit 21 determines the expected distance to the recalculated preset initial centroid. That is, c_j in the above expression E(||c_j - x_i||) is the recalculated preset initial centroid.
The preset condition depends on the practical application. For example, the preset condition may be: (1) the expected distance is less than a preset distance (which depends on the practical application); (2) in an iteration, the uncertain data to be clustered is reassigned to its previous target data set; or (3) the number of iterations reaches a preset number of iterations (which depends on the practical application).
As can be seen from the above technical solution, when uncertain data to be clustered is obtained, the expected distance from the uncertain data to the preset initial centroid of each data set is determined based on the uncertainty probability density function of the uncertain data. The data set with the smallest expected distance is determined as the target data set of the uncertain data, and the uncertain data is assigned to the target data set. The preset initial centroid of the target data set is recalculated based on the uncertainty probability density function of the uncertain data, and the above steps are iterated until the preset condition is met. The uncertain data is thus clustered based on its uncertainty probability density function, which improves the accuracy of clustering uncertain data.
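The iterative scheme summarized above can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the patented implementation: each uncertain item is summarized by a hypothetical (mean, variance) pair, the expected squared distance is used in place of the expected distance, the centroid of a data set is the average of its members' means, and the stopping rule combines preset conditions (2) and (3). The name `fk_means_sketch` and its parameters are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fk_means_sketch(items, init_centroids, max_iter=10000):
    """Iteratively assign uncertain items and recalculate centroids.

    items: list of (mean, variance) pairs summarising each uncertain item.
    init_centroids: preset initial centroids, one per data set.
    Returns (assignment, centroids).
    """
    centroids = [tuple(c) for c in init_centroids]
    assignment = [None] * len(items)
    for _ in range(max_iter):  # condition (3): iteration limit
        # Expected squared distance = ||c - mean||^2 + variance. The variance
        # term is the same for every data set, so it changes the value of the
        # distance but not which data set is closest.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: sq_dist(centroids[j], mean) + var)
            for mean, var in items
        ]
        if new_assignment == assignment:  # condition (2): nothing moved
            break
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [items[i][0] for i, a in enumerate(assignment) if a == j]
            if members:  # an empty data set keeps its previous centroid
                dim = len(members[0])
                centroids[j] = tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                )
    return assignment, centroids
```

Keeping the previous centroid for an empty data set is a design choice of this sketch; the source does not say how that case is handled.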
In addition, this embodiment also provides a storage medium on which a computer program is stored, the computer program being used to implement the above data clustering method.
It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to each other. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, see the corresponding description of the method embodiments.
Finally, it should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any of their variants are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements that are inherent to the process, method, article or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A data clustering method, characterized in that the method comprises:
when uncertain data to be clustered is obtained, for any data set: assigning the uncertain data to the data set, and recalculating the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data;
for any data set: calculating the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determining the sum as the expected sum of squared errors of the uncertain data relative to the data set;
determining the data set with the smallest expected sum of squared errors as the target data set;
assigning the uncertain data to the target data set.
2. The method according to claim 1, characterized in that, for any data set, assigning the uncertain data to the data set and recalculating the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data comprises:
based on the formula:
obtaining the preset initial centroid c_j of the j-th data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
3. The method according to claim 1, characterized in that, for any data set, calculating the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets comprises:
based on the formula:
obtaining the sum of the expected squared error from uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
4. A data clustering method, characterized in that the method comprises:
when uncertain data to be clustered is obtained, determining the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data;
determining the data set with the smallest expected distance as the target data set of the uncertain data, and assigning the uncertain data to the target data set;
recalculating the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and iteratively performing the steps of determining, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
5. The method according to claim 4, characterized in that determining the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data comprises:
based on the formula:
obtaining the expected distance from uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function.
6. The method according to claim 4, characterized in that recalculating the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data comprises:
based on the formula:
obtaining the preset initial centroid c_j of the target data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
7. A data clustering device, characterized in that the device comprises:
a first computing unit, configured to, when uncertain data to be clustered is obtained, for any data set: assign the uncertain data to the data set, and recalculate the preset initial centroid of the data set based on the uncertainty probability density function of the uncertain data;
a second computing unit, configured to, for any data set: calculate the sum of the expected squared error from the uncertain data to the recalculated preset initial centroid of the data set and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, and determine the sum as the expected sum of squared errors of the uncertain data relative to the data set;
a determination unit, configured to determine the data set with the smallest expected sum of squared errors as the target data set;
a division unit, configured to assign the uncertain data to the target data set.
8. The device according to claim 7, characterized in that the first computing unit is configured to, based on the formula:
obtain the preset initial centroid c_j of the j-th data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function;
or
the second computing unit is configured to, based on the formula:
obtain the sum of the expected squared error from uncertain data x_i to the recalculated preset initial centroid c_j of the j-th data set C_j and the expected squared errors from the uncertain data to the preset initial centroids of the other data sets, where f(x_i) is the uncertainty probability density function and K is the total number of data sets.
9. A data clustering device, characterized in that the device comprises:
a determination unit, configured to, when uncertain data to be clustered is obtained, determine the expected distance from the uncertain data to the preset initial centroid of each data set based on the uncertainty probability density function of the uncertain data;
a division unit, configured to determine the data set with the smallest expected distance as the target data set of the uncertain data, and to assign the uncertain data to the target data set;
a computing unit, configured to recalculate the preset initial centroid of the target data set based on the uncertainty probability density function of the uncertain data, and to trigger the determination unit and the division unit to iteratively perform the steps of determining, based on the uncertainty probability density function of the uncertain data, the expected distance from the uncertain data to the preset initial centroid of each data set and determining the data set with the smallest expected distance as the target data set of the uncertain data, until a preset condition is met.
10. The device according to claim 9, characterized in that the determination unit is configured to, based on the formula:
obtain the expected distance from uncertain data x_i to the preset initial centroid c_j of the j-th data set C_j, where f(x_i) is the uncertainty probability density function;
or
the computing unit is configured to, based on the formula:
obtain the preset initial centroid c_j of the target data set C_j, where x_i is the uncertain data and f(x_i) is the uncertainty probability density function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810723419.8A CN109002513B (en) | 2018-07-04 | 2018-07-04 | Data clustering method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810723419.8A CN109002513B (en) | 2018-07-04 | 2018-07-04 | Data clustering method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109002513A true CN109002513A (en) | 2018-12-14 |
CN109002513B CN109002513B (en) | 2022-07-19 |
Family
ID=64598536
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810723419.8A Active CN109002513B (en) | 2018-07-04 | 2018-07-04 | Data clustering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109002513B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110689180A (en) * | 2019-09-18 | 2020-01-14 | 科大国创软件股份有限公司 | Intelligent route planning method and system based on geographic position |
CN112989221A (en) * | 2021-02-18 | 2021-06-18 | 支付宝(杭州)信息技术有限公司 | Position-based family relation analysis method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030046253A1 (en) * | 2001-05-17 | 2003-03-06 | Honeywell International Inc. | Neuro/fuzzy hybrid approach to clustering data |
US20090222472A1 (en) * | 2008-02-28 | 2009-09-03 | Aggarwal Charu C | Method and Apparatus for Aggregation in Uncertain Data |
CN103177088A (en) * | 2013-03-08 | 2013-06-26 | 北京理工大学 | Biomedicine missing data compensation method |
CN104731916A (en) * | 2015-03-24 | 2015-06-24 | 无锡中科泛在信息技术研发中心有限公司 | Optimizing initial center K-means clustering method based on density in data mining |
CN105260748A (en) * | 2015-10-16 | 2016-01-20 | 吉林大学 | Method for clustering uncertain data |
CN106684905A (en) * | 2016-11-21 | 2017-05-17 | 国网四川省电力公司经济技术研究院 | Wind power plant dynamic equivalence method with wind power forecast uncertainty considered |
CN107316081A (en) * | 2017-06-12 | 2017-11-03 | 大连理工大学 | A kind of uncertain data sorting technique based on extreme learning machine |
Non-Patent Citations (1)
Title |
---|
肖宇鹏 等: ""基于模糊c-均值的空间不确定数据聚类"", 《计算机工程》 * |
Also Published As
Publication number | Publication date |
---|---|
CN109002513B (en) | 2022-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
He et al. | Indoor localization and automatic fingerprint update with altered AP signals | |
Oh et al. | Markov chain Monte Carlo data association for multi-target tracking | |
CN108536851B (en) | User identity recognition method based on moving track similarity comparison | |
CN109951807A (en) | Fusion RSS and CSI indoor orientation method based on WiFi signal | |
Sharifzadeh et al. | Supporting spatial aggregation in sensor network databases | |
He et al. | Tilejunction: Mitigating signal noise for fingerprint-based indoor localization | |
CN108650626A (en) | A kind of fingerprinting localization algorithm based on Thiessen polygon | |
CN109374986B (en) | Thunder and lightning positioning method and system based on cluster analysis and grid search | |
CN105554873B (en) | A kind of Wireless Sensor Network Located Algorithm based on PSO-GA-RBF-HOP | |
CN110049549B (en) | WiFi fingerprint-based multi-fusion indoor positioning method and system | |
Zhang et al. | Hybrid fuzzy clustering method based on FCM and enhanced logarithmical PSO (ELPSO) | |
CN111460508B (en) | Track data protection method based on differential privacy technology | |
CN104066178B (en) | A kind of indoor wireless location fingerprint generation method based on artificial neural network | |
CN109002513A (en) | A kind of data clustering method and device | |
CN109195110B (en) | Indoor positioning method based on hierarchical clustering technology and online extreme learning machine | |
CN109460539B (en) | Target positioning method based on simplified volume particle filtering | |
Wang et al. | Neural subgraph counting with Wasserstein estimator | |
Kim et al. | LinkBlackHole $^{*} $*: Robust Overlapping Community Detection Using Link Embedding | |
CN113468382B (en) | Knowledge federation-based multiparty loop detection method, device and related equipment | |
Murphey et al. | A parallel GRASP for the data association multidimensional assignment problem | |
CN108152789A (en) | Utilize the passive track-corelation data correlation and localization method of RSS information | |
KR20180089479A (en) | User data sharing method and device | |
CN112560878A (en) | Service classification method and device and Internet system | |
Kumari et al. | Baybfed: Bayesian backdoor defense for federated learning | |
Kesavareddigari et al. | Identification and asymptotic localization of rumor sources using the method of types |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||