CN104615722B - Mixed data clustering method based on density search and rapid division - Google Patents


Info

Publication number
CN104615722B
CN104615722B (application CN201510063814.4A)
Authority
CN
China
Prior art keywords
data
mixed
distance
mixed data
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510063814.4A
Other languages
Chinese (zh)
Other versions
CN104615722A (en)
Inventor
陈晋音
何辉豪
杨东勇
陈军敢
卢瑾
顾东袁
张健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201510063814.4A priority Critical patent/CN104615722B/en
Publication of CN104615722A publication Critical patent/CN104615722A/en
Application granted granted Critical
Publication of CN104615722B publication Critical patent/CN104615722B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

The invention discloses a mixed data clustering method based on density search and rapid division, comprising the following steps: determine the dominant type of the mixed data in a mixed-attribute data set; calculate the distance between any two mixed data in the set according to that dominant type; and, based on a density search algorithm, optimize the clustering radius within a preset value range using the pairwise distances, taking the clustering result corresponding to the optimal clustering radius as the final clustering result. The dominance analysis determines the specific type of the mixed data, and different distance calculation methods are applied to different types of mixed data, so the dimension information of the dominant attributes can play its proper role in the overall data information and distances are calculated accurately. The data clustering algorithm based on density search and rapid division is both fast and highly accurate.

Description

Mixed data clustering method based on density search and rapid division
Technical Field
The invention relates to the technical field of data clustering, in particular to a mixed data clustering method based on density search and rapid division.
Background
With the continuous development of communication technology and hardware equipment, data mining has huge application prospects in real-time monitoring systems, meteorological satellite remote sensing, network traffic monitoring and the like. Because such data arrives rapidly and continuously and keeps growing, traditional clustering algorithms are not suitable for these data objects, and the data impose new requirements on a clustering algorithm: 1. no assumption about the number of natural clusters is needed; 2. clusters of arbitrary shape can be found; 3. it has the ability to handle outliers. Moreover, since most real data is mixed-attribute data, containing both numerical-attribute values and categorical-attribute values, it is important to effectively mine valuable information from such mixed-attribute data.
In recent years, data clustering research has developed widely, but most existing clustering algorithms are limited to processing data with only numerical attributes; a small number are limited to data with only categorical attributes, and algorithms for mixed-attribute data are few. Aggarwal et al. proposed the evolving-data clustering framework CluStream, which for the first time employed a two-stage processing framework: online micro-clustering and offline macro-clustering. The online-phase algorithm provides a micro-cluster structure and continuously maintains arriving data points to generate summary information; the offline-phase algorithm answers user requests and generates the final clustering result from the summary data. However, the CluStream algorithm has several disadvantages: first, it cannot handle clusters of arbitrary shape; second, its adaptability to noise is poor; finally, the number of micro-clusters must be specified manually, which seriously distorts the shape distribution of the original data clusters. Cao et al. proposed the Den-Stream algorithm, which retains the two-stage framework of CluStream, divides micro-clusters into potential core micro-clusters and outlier micro-cluster structures, and can support clusters of arbitrary shape. But since Den-Stream uses a globally consistent absolute density as a parameter, its clustering result is very sensitive to the choice of that parameter. Addressing this problem of Den-Stream, Muhammed Z. R. et al. proposed the HECES algorithm, which adopts ellipsoidal clusters and can therefore handle variable-density data.
The StrDenAP algorithm was proposed by Zhang et al.; building on the StrAP algorithm, it borrows the two-stage framework of CluStream and adopts an affinity propagation (neighbor propagation) algorithm, obtaining a good clustering effect.
Since most real data is mixed-attribute data, researchers have also proposed algorithms that process mixed-attribute data directly. Yang et al. proposed the HCluStream algorithm, which, building on CluStream, introduces a histogram representation of micro-clusters for the categorical part of the mixed attributes and models sample arrival times with a Poisson process. The problem with this algorithm is that it cannot process clusters of arbitrary shape. The MCStream algorithm, proposed on the basis of HCluStream, uses a two-stage framework, measures the distance between objects with a dimension-oriented distance in online micro-clustering, and performs the final clustering with a modified M-DBSCAN density clustering algorithm in the macro-clustering stage. It can process clusters of arbitrary shape, but the parameters of the dimension-oriented distance must be supplied by the user, and there are many of them.
Disclosure of Invention
Existing clustering methods face the following problems when processing mixed-attribute data: (1) there is no distance calculation method that can directly and effectively handle mixed-type data; (2) there is no corresponding evaluation method to decide whether a distance calculation method is reasonable; (3) traditional density-based methods have high computational complexity and unstable accuracy. The invention therefore provides a mixed data clustering method based on density search and rapid division.
A mixed data clustering method based on density search and rapid partitioning comprises the following steps:
s1: determining the dominant type of the mixed data in the mixed-attribute data set, confirmed according to the following principle (let d be the total attribute dimension of the mixed data, m the number of numerical-attribute dimensions and n the number of categorical-attribute dimensions):
if the numerical ratio m/d exceeds the preset dominance threshold, the mixed data in the mixed-attribute data set are considered numerically dominant data;
if the categorical ratio n/d exceeds the threshold, the mixed data in the mixed-attribute data set are considered categorically dominant data;
otherwise (neither ratio exceeds the threshold), the mixed data in the mixed-attribute data set are considered balanced mixed-attribute data.
S2: and calculating the distance between any two mixed data in the mixed data set according to the dominant types of the mixed data.
When the mixed data in the mixed-attribute data set D are numerically dominant data, the distance between any two mixed data is calculated as follows:
(a1) compute the distance d(X_i, X_j)_n of the numerical-attribute part of any two mixed data X_i, X_j;
compute, by binarization, the per-dimension distance of the categorical-attribute part of any two mixed data X_i, X_j; for example, the distance of X_i, X_j in the p-th categorical dimension is 0 if the two values are equal and 1 otherwise;
the distance d(X_i, X_j)_c of the categorical-attribute part of the mixed data X_i, X_j is the sum of these per-dimension distances;
(a2) compute the distance d(X_i, X_j) of the mixed data X_i, X_j from the numerical-part and categorical-part distances:
d(X_i, X_j) = d(X_i, X_j)_n + d(X_i, X_j)_c
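A minimal sketch of case (a) in Python. The patent's formula image for the numerical part is not reproduced in this text, so Euclidean distance is an assumption here, as is the summing of the 0/1 categorical mismatches; the function name is illustrative:

```python
import math

def mixed_distance_numeric_dominant(xi, xj, num_idx, cat_idx):
    """Distance for numerically dominant mixed data.

    num_idx -- indices of the numerical-attribute dimensions
    cat_idx -- indices of the categorical-attribute dimensions
    Numerical part: Euclidean distance (an assumption); categorical part:
    0/1 mismatch per dimension, summed, as in step (a1).
    """
    d_n = math.sqrt(sum((xi[p] - xj[p]) ** 2 for p in num_idx))
    d_c = sum(0 if xi[p] == xj[p] else 1 for p in cat_idx)
    return d_n + d_c
```

Calling it on two records whose numerical parts differ by a 3-4-5 triangle and whose one categorical value differs yields 5 + 1.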
When the mixed data in the mixed-attribute data set D are categorically dominant data, the distance between any two mixed data is calculated as follows:
(b1) normalize each dimension of the numerical-attribute part of every mixed datum to obtain the standard value of each numerical attribute; the standard value of the p-th numerical attribute of the mixed data X_i is
x'_ip = (x_ip − min_p) / (max_p − min_p),
where x_ip is the value of the p-th numerical attribute of the mixed data X_i, max_p is the maximum value of this dimension over all mixed data, and min_p is its minimum value;
the distance of the numerical-attribute part is then computed on these standard values;
the per-dimension distance of the categorical-attribute part of any two objects X_i, X_j is binary, i.e. the distance between the p-th dimensions of X_i and X_j is 0 if the values are equal and 1 otherwise;
the distance of the categorical-attribute part is the sum of these per-dimension distances;
(b2) compute D(X_i, X_j) from the numerical-part and categorical-part distances:
D(X_i, X_j) = d(X_i, X_j)_n + d(X_i, X_j)_c
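Step (b1)'s min-max standardization can be sketched as follows; the function name and the list-of-rows data layout are illustrative assumptions:

```python
def minmax_normalize(data, num_idx):
    """Min-max normalize each numerical dimension across the whole data set,
    as in step (b1): x' = (x - min) / (max - min).

    data    -- list of records (lists); categorical columns are left untouched
    num_idx -- indices of the numerical-attribute dimensions
    """
    normed = [list(row) for row in data]
    for p in num_idx:
        col = [row[p] for row in data]
        lo, hi = min(col), max(col)
        for row in normed:
            # A constant column carries no information; map it to 0.0.
            row[p] = (row[p] - lo) / (hi - lo) if hi > lo else 0.0
    return normed
```

After this step every numerical standard value lies in [0, 1], so it is comparable in scale to the 0/1 categorical mismatch distances.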
Most clustering algorithms treat each attribute independently in the distance measure: when computing distances they process each attribute separately and compare only values of the same attribute. In reality, for a sample set (a mixed-attribute data set) the attributes are related, and that relationship reflects the inherent class structure contained in the sample set. Therefore, as a preferred mode, when the mixed data in the mixed-attribute data set D are balanced mixed-attribute data, the distance between any two mixed data is calculated by combining the per-dimension distances d_p(X_i, X_j) over all dimensions p:
where d_p(X_i, X_j), the distance between the mixed data X_i and X_j in the p-th dimension, is obtained by accumulating d_pq(X_i, X_j) over the other dimensions q.
d_pq(X_i, X_j) represents the distance between X_i and X_j in the p-th dimension relative to the q-th dimension (in fact a conditional probability) and is calculated as follows. Let x_ip be the value of X_i in the p-th dimension and x_jp the value of X_j in the p-th dimension; take the set of all possible values of the mixed data in the q-th dimension as the complete set, let Z be a subset of it and Z^c the complement of Z; let P(Z | x_ip) be the probability that the q-th-dimension value belongs to Z when the p-th-dimension value is x_ip, and P(Z^c | x_jp) the probability that the q-th-dimension value belongs to Z^c when the p-th-dimension value is x_jp. Z is chosen to maximize P(Z | x_ip) + P(Z^c | x_jp).
Since P(Z | x_ip) and P(Z^c | x_jp) both take values in [0, 1], the maximized sum lies in [1, 2]; the calculation of d_pq is therefore further corrected as
d_pq(X_i, X_j) = max_Z [ P(Z | x_ip) + P(Z^c | x_jp) ] − 1,
so that the value of d_pq(X_i, X_j) lies in [0, 1].
In this way, for two values x_ip and x_jp of attribute p, the distance between the two values is expressed through the co-occurrence probability of those values with sets of attribute values of another attribute q. When several categorical attributes are present, the distances of the attribute values x_ip and x_jp relative to these attributes can be accumulated in the same way. When a numerical attribute is present, it is discretized, and the distance of x_ip and x_jp relative to the numerical attribute can be calculated in a similar manner.
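The co-occurrence distance just described can be sketched in Python. This assumes the corrected form max_Z [P(Z|x_ip) + P(Z^c|x_jp)] − 1 and uses the fact that the maximizing Z keeps exactly the q-values v with P(v|x_ip) ≥ P(v|x_jp), so the maximum equals the sum over v of max(P(v|x_ip), P(v|x_jp)); the function name is illustrative:

```python
from collections import Counter, defaultdict

def d_pq(data, p, q, a, b):
    """Distance between values a and b of attribute p, relative to attribute q.

    Estimates the conditional probabilities from counts in `data` (a list of
    records), then returns max over subsets Z of P(q in Z | p=a) +
    P(q in Z^c | p=b), minus 1, which lies in [0, 1].  Both a and b must
    occur as p-values in the data.
    """
    cond = defaultdict(Counter)  # cond[x][v]: count of q-value v when p-value is x
    totals = Counter()           # totals[x]: count of records with p-value x
    for row in data:
        cond[row[p]][row[q]] += 1
        totals[row[p]] += 1
    values = set(cond[a]) | set(cond[b])
    # Optimal Z keeps the q-values where P(v|a) >= P(v|b), so the maximum
    # sum collapses to sum_v max(P(v|a), P(v|b)).
    best = sum(max(cond[a][v] / totals[a], cond[b][v] / totals[b]) for v in values)
    return best - 1.0
```

The distance is 0 when the two p-values induce identical conditional distributions over q, and 1 when those distributions have disjoint support.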
According to the invention, the distance between any two mixed data is calculated by adopting different methods according to the dominance type of the mixed data, so that the influence of non-dominance attributes on the overall similarity of data objects is reduced, and the importance of each dimension attribute needs to be comprehensively considered in the balanced mixed attribute data.
S3: optimizing the clustering radius within a preset value range with a density search algorithm, using the distance between any two mixed data, and taking the clustering result corresponding to the optimal clustering radius as the final clustering result, which specifically comprises the following steps:
s3-1: setting the number of particles and the maximum iteration number of the particle swarm algorithm, and initializing the particle swarm according to a preset clustering radius to endow each particle with a speed and a position;
s3-2: calculating the density of each mixed data under the current density radius, and determining the distance δ of each mixed data from the densities of all mixed data and the distance between any two mixed data;
in the invention, the density ρ_i of the i-th mixed data is calculated according to the following formula (the standard density-search form):
ρ_i = Σ_{j≠i} χ(d_ij − d_c), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise,
where d_ij is the distance between the i-th and j-th mixed data and d_c is the current density radius.
In the invention, the distance δ_i of the i-th mixed data is calculated according to the following formula (again the standard density-search form):
δ_i = min_{j: ρ_j > ρ_i} d_ij, and for the point of highest density δ_i = max_j d_ij,
where ρ_i is the density of the i-th mixed data and ρ_j the density of the j-th mixed data.
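The two quantities of step S3-2 can be sketched for a precomputed distance matrix. The patent presents its formulas as images, so the definitions below follow the standard density-search form and are an assumption to that extent:

```python
def density_and_delta(dist, d_c):
    """Density-search quantities for a precomputed distance matrix `dist`:
      rho_i   -- number of points within radius d_c of point i
      delta_i -- distance to the nearest point of higher density; for points
                 with no higher-density neighbour, distance to the farthest
                 point instead.
    """
    n = len(dist)
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta
```

Points with both large ρ and large δ are the density peaks that later become cluster centers.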
S3-3: fitting the densities and distances of all mixed data to obtain the set of cluster centers under the current density radius;
the distribution of the density ρ and the distance δ of all data is found, and the relationship is fitted with the function δ = f(ρ).
Preferably, in the invention, the functional relation δ = f(ρ) is obtained by linear fitting: following the inverse-function model y = b_0 + b_1/x of regression analysis, substitute x' = 1/x to get y = b_0 + b_1·x', after which the fitted δ = f(ρ) curve can be obtained with a linear regression model.
The number of singular mixed data relative to δ = f(ρ), and the singular mixed-data set itself, are obtained by calculation with the F distribution and the t distribution; the singular mixed-data set is the set of cluster centers, and the number of singular mixed data is the number of clusters.
Concretely, the residual distribution characteristics of the fitting function are computed with the F and t distributions of regression analysis, yielding the singular mixed-data set (c_1, c_2, …, c_k); this set gives the cluster centers, where k is the total number of clusters and the value of k depends on the mixed-attribute data set.
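The fit-and-flag procedure of step S3-3 can be sketched as follows. The fixed residual cut-off k_sigma stands in for the patent's F- and t-distribution tests and is a simplifying assumption, as is the requirement that all ρ values be positive:

```python
def fit_inverse_and_outliers(rho, delta, k_sigma=1.5):
    """Fit delta = b0 + b1 / rho by substituting x' = 1/rho into ordinary
    least squares, then flag points whose positive residual exceeds
    k_sigma standard deviations as singular (candidate cluster centers).
    All rho values must be > 0.
    """
    xs = [1.0 / r for r in rho]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(delta) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, delta))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, delta)]
    sd = (sum(r * r for r in resid) / n) ** 0.5
    # Centers sit far ABOVE the fitted curve, hence the one-sided test.
    return [i for i, r in enumerate(resid) if r > k_sigma * sd]
```

In the example below four points lie near δ = 1 + 2/ρ while the last one has an anomalously large δ for its density, so only it is flagged as a center candidate.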
S3-4: according to the clustering centers, rapidly dividing the mixed attribute data set based on the distance to obtain a plurality of clusters, wherein the number of the clusters is the same as that of the clustering centers in the clustering center set;
s3-5: calculating the fitness Fitness of the current rapid-division result according to the following formula:
Fitness = Σ_{l=1}^{k} Σ_{x_i ∈ cluster l} d(x_i, c_l),
where k is the total number of cluster centers, n_l denotes the total number of mixed data in the l-th cluster, i indexes the mixed data, c_l is the l-th cluster center, and d(x_i, c_l) is the distance from the mixed data x_i to the cluster center c_l;
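Because each datum is assigned to its nearest center in step S3-4, the double sum above equals the sum, over all data, of the distance to the nearest center; a minimal sketch (names illustrative):

```python
def fitness(data, centers, dist_fn):
    """Sum of within-cluster distances after nearest-center partitioning:
    each datum contributes its distance to the closest cluster center."""
    return sum(min(dist_fn(x, c) for c in centers) for x in data)
```

Smaller fitness values indicate tighter clusters, so the particle swarm below minimizes this quantity over candidate density radii.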
s3-6: for each particle, if the current fitness improves on the particle's individual extreme value, taking the current fitness as the particle's individual extreme value and the current position as its optimal position; determining the global optimal extreme value and global optimal position from the individual extreme values of all particles; and updating the iteration count iter = iter + 1;
s3-7: while the iteration count iter <= Maxiter, updating the position and velocity of each particle according to the following formulas and then returning to step S3-3; otherwise going to step S3-8, where Maxiter is the maximum number of iterations and the position and velocity of the m-th particle are updated according to:
v_m(t+1) = w·v_m(t) + α1·β1·(pbestd − d_cm(t)) + α2·β2·(gbestd − d_cm(t)),
d_cm(t+1) = d_cm(t) + v_m(t+1),
where v_m(t) is the flight velocity of the m-th particle in generation t (i.e., the t-th iteration), v_m(t+1) its velocity in generation t+1, w the inertia weight, α1 and α2 constant coefficients, pbestd the optimal position found by the m-th particle up to the t-th iteration, gbestd the global optimal position found up to the t-th iteration, β1 and β2 random numbers in [0, 1], d_cm(t) the position of the m-th particle in generation t, and d_cm(t+1) its position in generation t+1;
s3-8: outputting the global extreme value and the global optimal position; taking the output global optimal position as the optimal density radius, and taking the clustering result corresponding to that optimal density radius as the final clustering result.
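One velocity/position update of step S3-7 for a single particle, as a hedged sketch: the patent states only the two update formulas, so the clamping of the new position to the preset radius range [lo, hi] is an added assumption:

```python
import random

def pso_step(pos, vel, pbest, gbest, w=0.9, a1=2.0, a2=2.0, lo=0.0, hi=1.0):
    """One particle-swarm update of a candidate density radius d_c.

    pos/vel  -- current position d_cm(t) and velocity v_m(t) of the particle
    pbest    -- best position this particle has found (pbestd)
    gbest    -- best position the whole swarm has found (gbestd)
    """
    b1, b2 = random.random(), random.random()   # beta1, beta2 in [0, 1]
    new_vel = w * vel + a1 * b1 * (pbest - pos) + a2 * b2 * (gbest - pos)
    new_pos = min(hi, max(lo, pos + new_vel))   # clamp to the radius range
    return new_pos, new_vel
```

When pbest and gbest coincide with the current position, the update reduces to pure inertia (new velocity w·v), which is a quick sanity check on the formula.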
Because mixed data comprise multiple data types, the invention designs a new data distance measure to handle both numerical and categorical data, and finally realizes a data clustering method based on density search and rapid division, PSO-PD_HDC, to complete efficient data clustering. It converts mixed data clustering into an optimization problem over the optimal density radius and completes the clustering automatically during the optimization search, greatly improving the speed and accuracy of clustering.
The method designs a dominance analysis to determine the characteristics of the mixed-attribute data and sets up three distance calculation methods for mixed-attribute data, so the dimension information of the dominant attributes can play its proper role in the overall data information and distances are calculated accurately. Using the data clustering algorithm based on density search and rapid division, the method clusters data both quickly and accurately. Mixed data are widespread in practical applications, and their clustering is key to further data analysis and knowledge mining, so the method has research and application value.
Drawings
FIG. 1 is a flow chart of a data clustering algorithm based on density search and fast partitioning;
fig. 2 is a flow chart for determining the dominant type of mixed data in the mixed attribute data set D.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1
In this embodiment, the research object is "catalog marketing": the mixed data to be clustered are customer records, i.e., the set of all customer records is the data set to be clustered. Each customer record contains numerical-attribute information such as age, income and time spent online, and categorical-attribute information such as gender, zodiac sign and consumption categories. All customer records are clustered with the mixed data clustering method based on density search and rapid division, and marketing strategies, such as recommending specific products or periodically pushing items bought by similar customers, are then applied to the users of the different resulting categories.
As shown in fig. 1, the hybrid data clustering method based on density search and fast partitioning in this embodiment includes:
s1: determining a dominant type of mixed data in the mixed attribute data set D (i.e., data set D);
in this embodiment, as shown in fig. 2, according to a ratio of m to d and a ratio of n to d, where d is an attribute dimension (a dimension of the client information in this embodiment) of the mixed data, m is a dimension of a numerical attribute in the mixed data, and n is a dimension of a classification attribute.
In this embodiment, the dominant type of the mixed data is determined as follows:
if the numerical ratio m/d exceeds the preset dominance threshold, the mixed data in the mixed-attribute data set D are considered numerically dominant data;
if the categorical ratio n/d exceeds the threshold, the mixed data in the mixed-attribute data set D are considered categorically dominant data;
otherwise (neither ratio exceeds the threshold), the mixed data in the mixed-attribute data set D are regarded as balanced mixed-attribute data.
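The determination above can be sketched in Python. The 0.6 threshold is purely an assumption, since the patent's exact threshold conditions appear as formula images that are not reproduced in this text:

```python
THRESHOLD = 0.6  # assumed dominance threshold, not taken from the patent

def dominant_type(m: int, n: int) -> str:
    """Classify a mixed-attribute data set by its dominant attribute kind.

    m -- number of numerical-attribute dimensions
    n -- number of categorical-attribute dimensions
    """
    d = m + n  # total attribute dimension
    if m / d >= THRESHOLD:
        return "numerical-dominant"
    if n / d >= THRESHOLD:
        return "categorical-dominant"
    return "balanced"
```

For a customer record with eight numerical and two categorical attributes, for instance, the record would be treated as numerically dominant under this assumed threshold.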
S2: calculating the distance between any two mixed data in the mixed data set according to the dominant types of the mixed data;
(a) For numerically dominant data, the distance between any two mixed data is calculated as follows:
(a1) compute the distance d(X_i, X_j)_n of the numerical-attribute part of any two mixed data X_i, X_j;
compute, by binarization, the per-dimension distance of the categorical-attribute part of any two mixed data X_i, X_j; for example, the distance of X_i, X_j in the p-th categorical dimension is 0 if the two values are equal and 1 otherwise;
the distance d(X_i, X_j)_c of the categorical-attribute part of the mixed data X_i, X_j is the sum of these per-dimension distances;
(a2) compute the distance d(X_i, X_j) of the mixed data X_i, X_j from the numerical-part and categorical-part distances:
d(X_i, X_j) = d(X_i, X_j)_n + d(X_i, X_j)_c
(b) For categorically dominant data, the distance between any two mixed data is calculated as follows:
(b1) normalize each dimension of the numerical-attribute part of every mixed datum to obtain the standard value of each numerical attribute; the standard value of the p-th numerical attribute of the mixed data X_i is
x'_ip = (x_ip − min_p) / (max_p − min_p),
where x_ip is the value of the p-th numerical attribute of the mixed data X_i, max_p is the maximum value of this dimension over all mixed data, and min_p is its minimum value;
the distance of the numerical-attribute part is then computed on these standard values;
the per-dimension distance of the categorical-attribute part of any two objects X_i, X_j is binary, i.e. the distance between the p-th dimensions of X_i and X_j is 0 if the values are equal and 1 otherwise;
the distance of the categorical-attribute part is the sum of these per-dimension distances;
(b2) compute D(X_i, X_j) from the numerical-part and categorical-part distances:
D(X_i, X_j) = d(X_i, X_j)_n + d(X_i, X_j)_c
(c) For balanced mixed-attribute data, the distance between any two mixed data is calculated by combining per-dimension distances,
where d_p(X_i, X_j), the distance between the mixed data X_i and X_j in the p-th dimension, is obtained by accumulating d_pq(X_i, X_j) over the other dimensions q;
d_pq(X_i, X_j) represents the distance between X_i and X_j in the p-th dimension relative to the q-th dimension (in fact a conditional probability), calculated as described in step S2 of the disclosure;
here x_ip is the value of X_i in the p-th dimension, x_jp the value of X_j in the p-th dimension, the complete set is the set of all possible values of the mixed data in the q-th dimension, and z is a subset of that set.
An example of the calculation: assume the mixed-attribute data set D contains five data (mixed data), D = {X_1, X_2, X_3, X_4, X_5}, each of dimension d = 2.
When the values of the mixed data in the p-th (1st) and q-th (2nd) dimensions are as shown in Table 1, the p-dimension value of X_1 is A1, and among the 5 data only X_1 has p-value A1; then, when the p-dimension value is A1, the set of all possible q-dimension values of the mixed data is {B1}, and z is a subset of {B1}, i.e. z may be the empty set or {B1}. P(z | A1), the probability that the q-dimension value belongs to z when the p-dimension value is A1, is 0 for the empty set and 1 for z = {B1}.
In the same way one calculates P(z^c | A2), the probability that the q-dimension value of the mixed data X_2 belongs to the complement z^c of z when its p-dimension value is A2; the appropriate z is chosen to maximize the sum of the two probabilities, and that maximum sum gives the distance d_pq(X_1, X_2) of X_1 and X_2 in the p-th dimension relative to the q-th dimension.
When the values of the mixed data in the p-th (1st) and q-th (2nd) dimensions are as shown in Table 2, the p-dimension value of X_1 is again A1, but besides X_1 the p-dimension value of X_4 is also A1; then, when the p-dimension value is A1, the set of all possible q-dimension values is {B1, B4}, and z is a subset of {B1, B4}, i.e. z may be the empty set, {B1}, {B4} or {B1, B4}. P(z | A1), the probability that the q-dimension value belongs to z when the p-dimension value is A1, is 0 when z is the empty set, 1/2 when z = {B1}, 1/2 when z = {B4}, and 1 when z = {B1, B4}.
Likewise one calculates P(z^c | A2), the probability that the q-dimension value of the mixed data X_2 belongs to z^c when its p-dimension value is A2; the appropriate z is chosen to maximize the sum of the two probabilities, and that maximum sum gives the distance d_pq(X_1, X_2) of X_1 and X_2 in the p-th dimension relative to the q-th dimension.
TABLE 1
Mixed data    p-dimension value    q-dimension value
X_1           A1                   B1
X_2           A2                   B2
X_3           A3                   B3
X_4           A4                   B4
X_5           A5                   B5
TABLE 2
Mixed data    p-dimension value    q-dimension value
X_1           A1                   B1
X_2           A2                   B2
X_3           A3                   B3
X_4           A1                   B4
X_5           A5                   B5
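The subset probabilities worked out above for Table 2 can be checked with a short sketch (the variable and function names are illustrative):

```python
from itertools import combinations

# Table 2 of the embodiment: (p-dimension value, q-dimension value) per mixed datum
table2 = [("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A1", "B4"), ("A5", "B5")]

def p_z_given(table, a, z):
    """P(q-value in z | p-value == a), estimated from the table's counts."""
    q_vals = [q for p, q in table if p == a]
    return sum(1 for q in q_vals if q in z) / len(q_vals)

def all_subsets(values):
    vs = sorted(values)
    return [set(c) for r in range(len(vs) + 1) for c in combinations(vs, r)]

# q-values co-occurring with p-value A1 in Table 2: {B1, B4}; z ranges over subsets
q_for_A1 = {q for p, q in table2 if p == "A1"}
probs = {frozenset(z): p_z_given(table2, "A1", z) for z in all_subsets(q_for_A1)}
```

Enumerating the four subsets reproduces the probabilities 0, 1/2, 1/2 and 1 stated in the example.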
S3: compute and optimize the clustering radius d_c based on particle swarm optimization: obtain the functional relation between the density ρ and distance δ values of all mixed data using the dominance-based mixed-attribute distance calculation, fit the ρ–δ relation with a linear function, calculate the optimal cluster center set (c_1, c_2, …, c_k) through the F distribution and t distribution, and complete the clustering of the data set by rapid division.
S3-1: determining the optimal density radius d_c within a preset value range based on particle swarm optimization;
s3-11: determining the value range [d_c_low, d_c_high] of the cluster radius d_c, setting upper and lower limits on particle velocity, and initializing the particle swarm P, specifically: setting the particle number, the maximum iteration count Maxiter and the initial flight velocities.
Using the random function rand(), a particles (a = 10 in this embodiment) are generated at random within [d_c_low, d_c_high], and the current generation is initialized to iter = 0.
Suppose the mixed-attribute data set D contains S mixed data, i.e., D = {X_1, X_2, …, X_i, …, X_S}; then the value range of the cluster radius d_c is d_c_low = S·1%, d_c_high = S·20%.
In this embodiment, the maximum iteration count Maxiter = 5 is set.
S3-12: for the current d_c of each particle, obtaining the corresponding cluster center set (c_1, c_2, …, c_k) by the density-search cluster center determination method, specifically as follows:
(a) The density ρ of each mixed datum (i.e., object) is calculated under the current density radius d_c, and the distance δ of each mixed datum is determined from the densities of all mixed data and the distance between any two mixed data.
In this embodiment, the density ρ_i of the i-th mixed data is calculated as
ρ_i = Σ_{j≠i} χ(d_ij − d_c), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise,
where d_ij is the distance between the i-th and j-th mixed data and d_c is the current density radius.
The distance δ_i of the i-th mixed data is calculated as
δ_i = min_{j: ρ_j > ρ_i} d_ij, and for the point of highest density δ_i = max_j d_ij,
where ρ_i is the density of the i-th mixed data and ρ_j the density of the j-th mixed data.
(b) Fitting the density ρ and the distance δ of all mixed data to obtain the cluster center set (generally comprising several centers; the exact number depends on the mixed-attribute data set D);
the distribution of the density ρ and the distance δ of all data is found, and the relationship is fitted with the function δ = f(ρ).
Preferably, in the invention, the functional relation δ = f(ρ) is obtained by linear fitting: following the inverse-function model y = b_0 + b_1/x of regression analysis, substitute x' = 1/x to get y = b_0 + b_1·x', after which the fitted δ = f(ρ) curve can be obtained with a linear regression model.
The number of singular mixed data relative to δ = f(ρ), and the singular mixed-data set itself, are obtained by calculation with the F distribution and the t distribution; the singular mixed-data set is the set of cluster centers, and the number of singular mixed data is the number of clusters.
Concretely, the residual distribution characteristics of the fitting function are computed with the F and t distributions of regression analysis, yielding the singular mixed-data set (c_1, c_2, …, c_k); this set gives the cluster centers, where k is the total number of clusters and the value of k depends on the mixed-attribute data set.
(c) According to the clustering center, all data in the mixed attribute data set D are divided based on the distance (in the embodiment, a distance nearest principle is adopted), and the mixed data in all the mixed attribute data sets D are subjected to clustering division.
(d) And (3) according to the sum of the distances in the clusters as a Fitness function Fitness, obtaining a corresponding Fitness function value of each particle:
where k is the total number of cluster centers, n k Denotes the total number of mixed data in the k-th cluster, i is the index of the mixed data, c l Is the l cluster center, d (x) i ,c l ) Representing mixed data x i To the cluster center c l The distance of (d);
(e) Taking the current fitness value of each particle as its individual best value pbestf and its current position as its best position pbestd, determining the global best value gbestf and the global best position gbestd from the individual best values of all particles, and updating the evolution generation (i.e., the iteration count): iter = iter + 1.
(f) When the evolution generation iter <= maxiter, updating the position and the velocity (namely the position information and the velocity information) of each particle, and then returning to step (b); otherwise, proceeding to step (g),
wherein the position and velocity of the m-th particle are updated according to the following formulas,
v_m(t+1) = w*v_m(t) + α1*β1*(pbestd - d_cm(t)) + α2*β2*(gbestd - d_cm(t)),
d_cm(t+1) = d_cm(t) + v_m(t+1),
wherein v_m(t) represents the flight velocity of the m-th particle in the t-th generation (i.e., the t-th iteration), v_m(t+1) represents the flight velocity of the m-th particle in the (t+1)-th generation, w is the inertia weight (w = 0.9 in this embodiment), α1 and α2 are constant coefficients (α1 = α2 = 2 in this embodiment), pbestd is the best position obtained by the m-th particle up to the t-th evolutionary iteration, gbestd is the global best position obtained up to the t-th evolutionary iteration, β1 and β2 are random numbers in [0,1], d_cm(t) denotes the position of the m-th particle in the t-th generation (i.e., its d_c value in the t-th generation), and d_cm(t+1) denotes its position in the (t+1)-th generation (i.e., its d_c value in the (t+1)-th generation).
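The two update formulas can be sketched as a single swarm step; the per-particle dict representation and the clamping of d_c to its value range [d_c_low, d_c_high] are illustrative choices, not details given in the patent:

```python
import random

def pso_step(particles, gbestd, w=0.9, a1=2.0, a2=2.0,
             dc_low=1.0, dc_high=20.0):
    """One velocity/position update of the swarm for the radius d_c.

    particles: list of dicts with keys 'dc' (position), 'v' (velocity) and
               'pbestd' (personal best position); gbestd is the global best
               position. Field names are illustrative, not from the patent.
    """
    for p in particles:
        b1, b2 = random.random(), random.random()   # beta1, beta2 in [0, 1]
        p['v'] = (w * p['v']
                  + a1 * b1 * (p['pbestd'] - p['dc'])
                  + a2 * b2 * (gbestd - p['dc']))
        p['dc'] += p['v']
        p['dc'] = min(max(p['dc'], dc_low), dc_high)  # keep d_c in its range
```

With v = 0 and a particle already sitting at both its personal and the global best position, the update leaves it in place, as expected.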
(g) Outputting the global best value gbestf and the global best position gbestd, taking the output gbestd as the current optimal d_c, and taking the clustering result corresponding to the optimal d_c as the final clustering result.
Example 2
The clustering method of this embodiment was run on the following experimental platform: a PC with the Windows 7 operating system and the Microsoft Visual C++ 2010 integrated development environment. The hardware configuration: an Intel Core i5 2.6 GHz CPU and 4 GB of memory.
To verify the performance of the new algorithm PSO-PD_HDC (i.e., the mixed attribute data clustering algorithm based on density search and fast partitioning), five real datasets were used, all from the UCI Machine Learning Repository, with the specific information shown in Table 3.
TABLE 3
The clustering method of this embodiment (PSO-PD_HDC clustering), the IWKM algorithm, the SBAC algorithm, the K-prototypes algorithm and the KL-FCM-GM algorithm were used respectively to cluster the data sets.
In the experiments, unless otherwise specified, the learning factors in the particle swarm optimization (PSO) were set to α1 = α2 = 1.8, the inertia weight w = 0.9, the particle number a = 10, and the maximum iteration number maxiter = 5.
In this embodiment, the clustering accuracy proposed by Huang and Ng is used as the evaluation criterion of the clustering effect; the clustering accuracy r is defined as:

r = (Σ_{i=1}^{k} a_i) / n,

wherein a_i denotes the number of samples in the i-th cluster that are ultimately correctly classified, k denotes the number of clusters, and n denotes the number of samples in the mixed attribute dataset (i.e., the number of mixed data). The higher the clustering accuracy, the better the clustering effect of the clustering method. When r equals 1, the clustering result of the algorithm on the data set is completely correct.
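A sketch of this accuracy measure, assuming (as is standard for the Huang/Ng criterion) that a_i counts the samples of the majority true class within the i-th predicted cluster; the function name and the label encoding are illustrative:

```python
from collections import Counter

def cluster_accuracy(labels_pred, labels_true):
    """Clustering accuracy r = (sum_i a_i) / n.

    a_i is the number of samples in predicted cluster i whose true class is
    that cluster's majority class; a perfect clustering gives r = 1.
    """
    n = len(labels_true)
    correct = 0
    for c in set(labels_pred):
        members = [t for p, t in zip(labels_pred, labels_true) if p == c]
        correct += Counter(members).most_common(1)[0][1]  # majority count a_i
    return correct / n
```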
The Iris dataset contains 150 data objects (i.e., mixed data), each described by 4 numerical attributes (i.e., the data in the Iris dataset is numerical-dominance data). The Iris dataset has three class attribute values: Iris-Setosa, Iris-Versicolour and Iris-Virginica. In all data sets, the class attributes do not participate in the clustering process and are only used to evaluate the clustering results of the algorithms.
The accuracy of the clustering results obtained by clustering the Iris data set with the PSO-PD_HDC, IWKM, SBAC, K-prototypes and KL-FCM-GM algorithms is shown in Table 4.
TABLE 4
Algorithm Cluster accuracy rate (r)
K-prototypes 0.819
SBAC 0.426
KL-FCM-GM 0.335(α=1.1)
IWKM 0.822
PSO-PD_HDC 0.900
As can be seen from Table 4, the clustering accuracies of the algorithms PSO-PD_HDC, K-prototypes, SBAC and IWKM are 0.900, 0.819, 0.426 and 0.822 respectively, and the highest clustering accuracy of KL-FCM-GM is 0.335 when the fuzzy coefficient is 1.1. The clustering results in Table 4 show that the clustering accuracy of the PSO-PD_HDC algorithm is 8.1%, 47.4%, 56.5% and 7.8% higher than that of the K-prototypes, SBAC, KL-FCM-GM and IWKM algorithms respectively. The performance of the PSO-PD_HDC algorithm is better.
The Soybean dataset is a categorical attribute dataset consisting of 47 data objects, each described by 35 categorical attributes (i.e., the data in the Soybean dataset is classification-dominance data). The Soybean dataset has four class attribute values: Diaporthe Stem Canker, Charcoal Rot, Rhizoctonia Root Rot, and Phytophthora Rot.
The accuracy of the clustering results obtained by clustering the Soybean data set with the PSO-PD_HDC, IWKM, SBAC, K-prototypes and KL-FCM-GM algorithms is shown in Table 5.
TABLE 5
Algorithm Cluster accuracy rate (r)
K-prototypes 0.856
SBAC 0.617
KL-FCM-GM 0.903(α=1.8)
IWKM 0.908
PSO-PD_HDC 0.957
As can be seen from Table 5, the clustering accuracies of the algorithms PSO-PD_HDC, K-prototypes, SBAC and IWKM are 0.957, 0.856, 0.617 and 0.908 respectively, and the highest clustering accuracy of KL-FCM-GM is 0.903 when the fuzzy coefficient is 1.8. The clustering results in Table 5 show that the clustering accuracy of PSO-PD_HDC is 10.1%, 34%, 5.4% and 4.9% higher than that of the K-prototypes, SBAC, KL-FCM-GM and IWKM algorithms respectively. The performance of the PSO-PD_HDC algorithm is better.
The Zoo dataset contains 101 data objects, each described by a numerical attribute and 15 classification attributes. The Zoo dataset has 7 class attribute values.
The clustering accuracies of the PSO-PD_HDC, WFK-prototypes, EKP, SBAC, K-prototypes, and KL-FCM-GM algorithms are listed in Table 6.
TABLE 6
Algorithm Cluster accuracy (r)
K-prototypes 0.806
SBAC 0.426
KL-FCM-GM 0.864(α=1.3)
EKP 0.629
WFK-prototypes 0.908(α=2.1)
PSO-PD_HDC 0.891
As can be seen from Table 6, the clustering accuracies of the algorithms PSO-PD_HDC, K-prototypes, SBAC and EKP are 0.891, 0.806, 0.426 and 0.629 respectively; the highest clustering accuracy of KL-FCM-GM is 0.864 when the fuzzy coefficient is 1.3, and the highest clustering accuracy of WFK-prototypes is 0.908 when the fuzzy coefficient is 2.1. The clustering results in Table 6 show that the clustering accuracy of the PSO-PD_HDC algorithm is 8.5%, 46.5%, 2.7% and 26.2% higher than that of the K-prototypes, SBAC, KL-FCM-GM and EKP algorithms respectively, and 1.7% lower than that of the WFK-prototypes algorithm. The performance of the PSO-PD_HDC algorithm is relatively good.
The Acute Inflammations dataset contains 120 data objects, each described by 1 numerical attribute and 6 categorical attributes. The Acute dataset has 2 class attribute values.
The clustering accuracies of the PSO-PD_HDC, WFK-prototypes, EKP, SBAC, K-prototypes, and KL-FCM-GM algorithms are listed in Table 7.
TABLE 7
Algorithm Cluster accuracy (r)
K-prototypes 0.610
SBAC 0.508
KL-FCM-GM 0.682(α=1.1)
EKP 0.508
WFK-prototypes 0.710(α=1.1)
PSO-PD_HDC 0.917
As can be seen from Table 7, the clustering accuracies of the algorithms PSO-PD_HDC, K-prototypes, SBAC and EKP are 0.917, 0.610, 0.508 and 0.508 respectively; the highest clustering accuracy of KL-FCM-GM is 0.682 when the fuzzy coefficient is 1.1, and the highest clustering accuracy of WFK-prototypes is 0.710 when the fuzzy coefficient is 1.1. The clustering results in Table 7 show that the clustering accuracy of the PSO-PD_HDC algorithm is 30.7%, 40.9%, 23.5%, 40.9% and 20.7% higher than that of the K-prototypes, SBAC, KL-FCM-GM, EKP and WFK-prototypes algorithms respectively. The performance of the PSO-PD_HDC algorithm is better.
The Statlog Heart dataset contains 270 data objects, each described by 5 numerical attributes and 9 categorical attributes. The Statlog Heart dataset has 2 class attribute values (i.e., the data in the Statlog Heart dataset is balanced mixed attribute data).
The clustering accuracies of the PSO-PD_HDC, WFK-prototypes, EKP, SBAC, K-prototypes, and KL-FCM-GM algorithms are listed in Table 8.
TABLE 8
Algorithm Cluster accuracy (r)
K-prototypes 0.577
SBAC 0.752
KL-FCM-GM 0.758(α=1.7)
EKP 0.545
WFK-prototypes 0.835(α=1.3)
PSO-PD_HDC 0.848
The above-described embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit it; any modifications, additions, equivalents and the like made within the scope of the principles of the present invention shall be included within the scope of protection of the present invention.

Claims (4)

1. A mixed data clustering method based on density search and fast partitioning, used for clustering customer information and then making targeted product recommendations to different types of users according to the clustering results, characterized by comprising the following steps:
s1: determining the dominant type of mixed data in the mixed attribute data set D, wherein the mixed data is customer information:
if [the first condition] holds, the mixed data in the mixed attribute data set D is considered numerical-dominance data;
if [the second condition] holds, the mixed data in the mixed attribute data set D is considered classification-dominance data;
otherwise, the mixed data in the mixed attribute data set D is considered balanced mixed attribute data;
wherein d is the dimension of the mixed data, m is the dimension of the numerical attribute part of the mixed data, and n is the dimension of the classification attribute part;
s2: and calculating the distance between any two mixed data in the mixed attribute data set D according to the dominant type of the mixed data:
(a) For numerical-dominance data, the distance between any two mixed data is calculated by the following steps:
(a1) Computing the distance d(X_i, X_j)_n of the numerical attribute part of any two mixed data X_i, X_j as:
calculating, by the binarization method, the distance in each dimension of the classification attribute part of any two mixed data X_i, X_j; for example, the distance of the mixed data X_i, X_j in the p-th dimension is:
the distance d(X_i, X_j)_c of the classification attribute part of the mixed data X_i, X_j is:
(a2) Calculating the distance d(X_i, X_j) of the mixed data X_i, X_j using the distance of the numerical attribute part and the distance of the classification attribute part:
d(X_i, X_j) = d(X_i, X_j)_n + d(X_i, X_j)_c;
(b) For classification-dominance data, the distance between any two mixed data is calculated by the following steps:
(b1) Normalizing each dimension of the numerical attribute part of every mixed datum to obtain the standard value of each numerical attribute; the standard value of the p-th numerical attribute of the mixed data X_i is:
x'_i^p = (x_i^p - min^p) / (max^p - min^p),
wherein x_i^p is the value of the p-th numerical attribute of the mixed data X_i, max^p is the maximum value of this dimension over all the mixed data, and min^p is the minimum value of this dimension over all the mixed data;
the distance of the numerical attribute part is:
the distance in each dimension of the classification attribute part of any two objects X_i, X_j is binary, that is, the distance between X_i and X_j in the p-th dimension is:
the distance of the classification attribute part is:
(b2) Calculating d(X_i, X_j) using the distance of the numerical attribute part and the distance of the classification attribute part:
d(X_i, X_j) = d(X_i, X_j)_n + d(X_i, X_j)_c;
(c) For balanced mixed attribute data, the distance between any two mixed data is calculated as:
wherein d_p(X_i, X_j) represents the distance of the mixed data X_i and X_j in the p-th dimension, calculated according to the following formula:
d_pq(X_i, X_j) represents the distance of X_i and X_j in the p-th dimension relative to the q-th dimension, calculated according to the following formula:
wherein x_i^p is the value of the mixed data X_i in the p-th dimension, x_j^p is the value of the mixed data X_j in the p-th dimension, a set is formed of all possible values of the mixed data in the q-th dimension, and z is a subset of that set;
s3: the method comprises the steps of calculating and optimizing clustering radius dc based on particle swarm optimization, obtaining a functional relation between density rho and distance delta values of all mixed data by using a mixed attribute data distance calculation method based on occupation analysis, fitting rho and delta functional relations by using a linear function, calculating an optimal clustering center set (c 1, c2, \ 8230;, ck) through F distribution and t distribution, and finishing data set clustering through rapid division:
s3-1: determining a cluster radius d c Value range of [ d ] c_low ,d c_high ]Setting upper and lower particle speed limits and initializing a particle swarm P, wherein the particle swarm P specifically comprises the set particle number, maximum iteration number maximum and initial flight speed;
s3-2: according to the current respective particle d c The corresponding cluster center set (c) is obtained by a cluster center determination method based on density search 1 ,c 2 ,…,c k ) The method specifically comprises the following steps:
(a) Calculating the density ρ value of each mixed datum under the current density radius d_c, and determining the distance δ of each mixed datum in the data set according to the density ρ values of all the mixed data and the distance between any two mixed data;
(b) Fitting the density rho and the distance delta of all mixed data to obtain a clustering center set;
(c) According to the cluster centers, all mixed data in the mixed attribute data set D are partitioned based on distance, completing the cluster partition of all mixed data in D;
(d) Taking the sum of within-cluster distances as the Fitness function, the fitness value corresponding to each particle is obtained:

Fitness = Σ_{l=1}^{k} Σ_{i=1}^{n_l} d(x_i, c_l),

where k is the total number of cluster centers, n_l denotes the total number of mixed data in the l-th cluster, i is the index of the mixed data, c_l is the l-th cluster center, and d(x_i, c_l) represents the distance from the mixed data x_i to the cluster center c_l;
(e) Taking the current fitness value of each particle as its individual best value pbestf and its current position as its best position pbestd, determining the global best value gbestf and the global best position gbestd from the individual best values of all particles, and updating the evolution generation: iter = iter + 1;
(f) When the evolution generation iter <= maxiter, updating the position and velocity of each particle, and then returning to step (b); otherwise, proceeding to step (g);
wherein the position and velocity of the m-th particle are updated according to the following formulas,
v_m(t+1) = w*v_m(t) + α1*β1*(pbestd - d_cm(t)) + α2*β2*(gbestd - d_cm(t)),
d_cm(t+1) = d_cm(t) + v_m(t+1),
wherein v_m(t) represents the flight velocity of the m-th particle in the t-th generation, v_m(t+1) represents the flight velocity of the m-th particle in the (t+1)-th generation, w is the inertia weight and equals 0.9, α1 and α2 are constant coefficients with α1 = α2 = 2, pbestd is the best position obtained by the m-th particle up to the t-th evolutionary iteration, gbestd is the global best position obtained up to the t-th evolutionary iteration, β1 and β2 are random numbers in [0,1], d_cm(t) represents the position of the m-th particle in the t-th generation, and d_cm(t+1) represents its position in the (t+1)-th generation;
(g) Outputting the global best value gbestf and the global best position gbestd, taking the output gbestd as the current optimal d_c, and taking the clustering result corresponding to the optimal d_c as the final clustering result.
2. The mixed data clustering method based on density search and fast partitioning according to claim 1, wherein in the step (a), the density ρ_i of the i-th mixed data is calculated according to the following formula:

ρ_i = Σ_{j≠i} χ(d(X_i, X_j) - d_c), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise,

wherein d_c is the current density radius;
the distance δ_i of the i-th mixed data is calculated according to the following formula:

δ_i = min_{j: ρ_j > ρ_i} d(X_i, X_j),

where ρ_i is the density of the i-th mixed data and ρ_j is the density of the j-th mixed data.
3. The mixed data clustering method based on density search and fast partitioning according to claim 1, wherein in the step (b), the functional relationship δ = f(ρ) between δ and ρ is obtained by linear fitting; the relationship is fitted according to the inverse function y = b0 + b1/x in regression analysis: letting x' = 1/x gives y = b0 + b1*x', and a fitted δ = f(ρ) curve is obtained using a linear regression model;
the number of singular mixed data of δ = f(ρ) and the singular mixed data set are obtained by calculation using the F distribution and the t distribution; the singular mixed data set is the set of cluster centers, and the number of singular mixed data is the number of clusters;
the residual distribution characteristics of the fitting function are calculated using the F distribution and the t distribution of regression analysis, yielding a singular mixed data set (c_1, c_2, ..., c_k); the singular mixed data set is the set of cluster centers, where k is the total number of clusters and the value of k depends on the mixed attribute data set D.
4. The mixed data clustering method based on density search and fast partitioning according to claim 1, wherein in step S3-1, a random function rand() is used to randomly generate 10 particles in the range [d_c_low, d_c_high], and the current evolution generation is initialized as iter = 0;
assuming that the mixed attribute data set D contains S mixed data, D = {X_1, X_2, ..., X_i, ..., X_S}, the value range of the clustering radius d_c is d_c_low = S*1%, d_c_high = S*20%;
the maximum iteration number maxiter = 5.
CN201510063814.4A 2015-02-06 2015-02-06 Blended data clustering method with quickly dividing is searched for based on density Active CN104615722B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510063814.4A CN104615722B (en) 2015-02-06 2015-02-06 Blended data clustering method with quickly dividing is searched for based on density

Publications (2)

Publication Number Publication Date
CN104615722A CN104615722A (en) 2015-05-13
CN104615722B true CN104615722B (en) 2018-04-27

Family

ID=53150164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510063814.4A Active CN104615722B (en) 2015-02-06 2015-02-06 Blended data clustering method with quickly dividing is searched for based on density

Country Status (1)

Country Link
CN (1) CN104615722B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170169447A1 (en) * 2015-12-09 2017-06-15 Oracle International Corporation System and method for segmenting customers with mixed attribute types using a targeted clustering approach
CN107515892A (en) * 2017-07-07 2017-12-26 国网浙江省电力公司 A kind of electrical network low voltage cause diagnosis method excavated based on big data
CN109492094A (en) * 2018-10-15 2019-03-19 上海电力学院 A kind of mixing multidimensional property data processing method based on density
CN111209347B (en) * 2018-11-02 2024-04-16 北京京东振世信息技术有限公司 Method and device for clustering mixed attribute data
CN111489440B (en) * 2020-04-16 2023-08-29 无锡荣恩科技有限公司 Three-dimensional scanning modeling method for nonstandard parts
CN114648711B (en) * 2022-04-11 2023-03-10 成都信息工程大学 Clustering-based cloud particle sub-image false target filtering method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254020A (en) * 2011-07-22 2011-11-23 西安电子科技大学 Global K-means clustering method based on feature weight
CN102523167A (en) * 2011-12-23 2012-06-27 中山大学 Optimal segmentation method of unknown application layer protocol message format
CN103020122A (en) * 2012-11-16 2013-04-03 哈尔滨工程大学 Transfer learning method based on semi-supervised clustering
CN103049651A (en) * 2012-12-13 2013-04-17 航天科工深圳(集团)有限公司 Method and device used for power load aggregation
KR20130076348A (en) * 2011-12-28 2013-07-08 고려대학교 산학협력단 Method and apparatus for managing foaf data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant