CN108763590B - Data clustering method based on double-variant weighted kernel FCM algorithm - Google Patents

Data clustering method based on double-variant weighted kernel FCM algorithm

Info

Publication number
CN108763590B
CN108763590B (application CN201810636707.XA)
Authority
CN
China
Prior art keywords
formula
clustering
algorithm
data point
kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810636707.XA
Other languages
Chinese (zh)
Other versions
CN108763590A (en)
Inventor
唐益明
胡相慧
丰刚永
华丹阳
任福继
张有成
宋小成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201810636707.XA priority Critical patent/CN108763590B/en
Publication of CN108763590A publication Critical patent/CN108763590A/en
Application granted granted Critical
Publication of CN108763590B publication Critical patent/CN108763590B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data clustering method based on a double-variant weighted kernel FCM algorithm. The method first optimally partitions a data set so as to minimize an objective function; obtains an initial membership matrix, a typical-value matrix, and initial cluster centers; computes the distances between data points and cluster centers in the multi-kernel high-dimensional space; and iteratively updates the membership values and likelihood typical values, taking the partition at which the objective function reaches its minimum as the final clustering result. The invention replaces the ordinary Euclidean distance function with a distance induced by a combined kernel, so that both linear and nonlinear data can be partitioned well; the typical-value matrix strengthens the noise immunity of the algorithm and improves clustering accuracy; and the proportions of the individual kernels within the combined kernel are adjusted automatically to meet the requirements of different data sets, resolving the uncertainty that ordinary kernel algorithms face when selecting a kernel function.

Description

Data clustering method based on double-variant weighted kernel FCM algorithm
Technical Field
The invention relates to the technical field of data clustering, in particular to a data clustering method based on a double-variant weighted kernel FCM algorithm.
Background
Clustering is an important research topic in the fields of data mining and artificial intelligence, and plays a major role in areas such as big data, pattern recognition, image segmentation, and machine learning. Clustering is the process of partitioning data according to similarity rules; the partition is determined by those rules, and the resulting groups or sets are commonly called clusters. The fuzzy c-means clustering algorithm (FCM) is a basic method of fuzzy clustering and laid the foundation for objective-function-based clustering algorithms, but it is sensitive to the initialization of the cluster centers and easily affected by noise points. To improve the noise immunity of the algorithm, Krishnapuram and Keller proposed the PCM algorithm. The PCM algorithm adopts a possibilistic partition matrix in which the possibilistic membership reflects how typical a data point is of a given cluster center. In addition, it relaxes the FCM constraint that each column of the partition matrix sums to 1, so PCM can handle noise points better than FCM. However, the PCM algorithm easily produces fewer cluster centers than the predetermined number of clusters, or coincident clusters.
For this reason, Zhang proposed in the literature an improved possibilistic clustering scheme that adds a new parameter η_i to reduce the error of the algorithm. Although possibilistic clustering can overcome the problem of coincident clusters, it is exceptionally sensitive to the selection of the parameter m_p: different m_p values yield two distinct sets of cluster centers even when the values differ only slightly. The improved possibilistic fuzzy c-means clustering algorithm PFCM proposed by Nikhil has good noise robustness and does not produce coincident clusters; however, the parameters a and b of PFCM usually have to be specified manually, lack a theoretical basis, and introduce a strong dependency.
The above algorithms cluster linear data well, but their effect on nonlinear data is not ideal. By introducing a kernel function satisfying the Mercer condition, the sample data X = {x_1, …, x_n} is mapped into a high-dimensional feature space F as φ(x_1), …, φ(x_n), where φ is the mapping function, and the samples are clustered in the space F, which gives a kernel-based fuzzy clustering algorithm. Yang proposed the kernel-based fuzzy clustering algorithm KFCM, and Genton presented kernel machine learning from the statistical angle. These algorithms map the data points into a high-dimensional feature space and, by means of the kernel function and optimization of the clustering error, give kernel-based fuzzy clustering good robustness against noise and outliers, solving the PFCM algorithm's sensitivity to parameter settings. However, kernel-based fuzzy clustering works well on spherical data and does not obtain ideal results on non-spherical data.
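A point worth making explicit, since all of these kernel methods rely on it, is that the map φ never has to be evaluated: squared distances in F follow from the kernel alone via ||φ(x_j) - φ(x_q)||^2 = k(x_j, x_j) - 2k(x_j, x_q) + k(x_q, x_q), and it is this identity that makes the kernel-based distance computations in what follows tractable.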
The maximum-margin multiple-kernel clustering previously proposed by Zhao et al. in the literature is mostly concerned with supervised and semi-supervised cluster learning; it is based on maximum-margin clustering and is clearly used mostly for hard clustering. The multiple-kernel method proposed by Hsin-Chien provides great flexibility in the selection and combination of basic kernels, adds information sources from different angles, and improves the ability to encode domain knowledge.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a data clustering method based on a double-variant weighted kernel FCM algorithm, which avoids both the sensitivity of FCM to noise points and the tendency of PCM to produce coincident clusters, improves the accuracy of the algorithm, and mines the structural information present in a data set more accurately. At the same time, the most suitable weight values and the current membership values are found automatically, which improves the reliability and convergence of the algorithm.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a data clustering method based on a bivariate weighted kernel FCM algorithm, which is used for clustering customer information and recommending products to customers according to clustering results and comprises the following steps:
step 1, the data set X is the customer information, X = {x_1, x_2, …, x_n}, where x_j is the jth data point, j = 1, 2, …, n, and n is the total number of data points; optimally partition the data set X so that the value J of the objective function in formula (1) is minimized:

[formula (1): objective function J; equation image not reproduced]

in formula (1), i denotes the ith class, c denotes the number of classes, 1 ≤ i ≤ c, 2 ≤ c ≤ n; u_ij denotes the membership value of the jth data point x_j in the ith class, u_ij^m is its mth power, and m is a weighting exponent representing the degree of clustering fuzziness; t_ij denotes the likelihood typical value of the jth data point x_j with respect to the ith class, t_ij^η is its ηth power, and η is a weighting exponent that controls the proportion between membership and typical values to realize the bivariation; D_ij denotes the distance in the multi-kernel high-dimensional feature space between the jth data point x_j and the cluster center v_i of the ith class, and satisfies:

D_ij^2 = Σ_{l=1}^{M} w_l^2 α_ijl    (2)

in formula (2), w_l is the weight parameter of the lth kernel, M is the total number of kernels, Σ_{l=1}^{M} w_l = 1, and α_ijl is expressed by formula (3):

[formula (3): α_ijl; equation image not reproduced]

in formula (3), k_l(x_j, x_q) is the lth Gaussian kernel function, expressed by formula (4):

k_l(x_j, x_q) = exp(-(x_jl - x_ql)^2 / σ_l^2)    (4)

in formula (4), σ_l is the width parameter of the function, and x_jl, x_ql denote the lth-dimension feature values of the jth and qth data points, respectively;

in formula (1), a parameter of the jth data point x_j satisfies formula (5):

[formula (5): equation image not reproduced]

in formula (5), θ denotes a constant; x_z, 1 ≤ z ≤ n, denotes the zth data point; ||x_j - x_z|| denotes the Euclidean distance between the jth data point x_j and the zth data point x_z;

in formula (1), r_i denotes a penalty factor and satisfies formula (6):

[formula (6): r_i; equation image not reproduced]

in formula (6), ||x_j - v_i|| denotes the Euclidean distance between the jth data point and the ith cluster center;
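To make the kernel machinery of step 1 concrete, the following Python sketch implements per-dimension Gaussian kernels in the shape of formula (4) and a combined-kernel squared distance in the shape of formula (2). Since formula (3) itself survives only as an equation image, the α_ijl term below uses the standard multiple-kernel FCM expansion of ||φ_l(x_j) - v_i||^2 with the centers expressed through the memberships; treat it as an assumed stand-in, not the patent's exact expression.

```python
import numpy as np

def gaussian_kernels(X, sigma):
    """Per-dimension Gaussian kernels k_l(x_j, x_q) = exp(-(x_jl - x_ql)^2 / sigma_l^2).

    X: (n, M) data matrix; sigma: (M,) width parameters.
    Returns K with shape (M, n, n), one kernel matrix per feature dimension.
    """
    n, M = X.shape
    K = np.empty((M, n, n))
    for l in range(M):
        d = X[:, l][:, None] - X[:, l][None, :]      # pairwise 1-D differences
        K[l] = np.exp(-(d ** 2) / sigma[l] ** 2)
    return K

def combined_distance_sq(K, U, m, w):
    """Squared distances D_ij^2 = sum_l w_l^2 * alpha_ijl (shape of formula (2)).

    alpha_ijl here is the usual multiple-kernel FCM expansion of
    ||phi_l(x_j) - v_i^(l)||^2 with centers implicit in the memberships;
    this is an assumption, since the patent's formula (3) is an image.
    K: (M, n, n) kernel matrices; U: (c, n) memberships; w: (M,) weights.
    """
    Um = U ** m                                      # u_ij^m
    s = Um.sum(axis=1, keepdims=True)                # per-class normalizers
    D2 = np.zeros(U.shape)
    for l in range(K.shape[0]):
        Kl = K[l]
        self_term = np.diag(Kl)[None, :]             # k_l(x_j, x_j)
        cross = (Um @ Kl) / s                        # sum_q u_iq^m k_l(x_j, x_q) / sum_q u_iq^m
        center = (Um @ Kl @ Um.T).diagonal()[:, None] / s ** 2
        D2 += w[l] ** 2 * (self_term - 2 * cross + center)
    return np.maximum(D2, 0.0)                       # guard tiny negatives from round-off
```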
step 2, process the data set X with the fuzzy c-means clustering algorithm FCM to obtain a membership matrix U and cluster centers V:

[membership matrix U and cluster center matrix V: image not reproduced]

calculate the penalty factors r_i according to formula (6), and take the membership matrix U obtained by the fuzzy c-means clustering algorithm FCM as the initial membership matrix U^(0) of the double-variant weighted kernel FCM algorithm;
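Step 2 only requires a standard FCM pass to seed U^(0), plus the r_i penalties. A minimal NumPy sketch follows; the r_i expression shown is the usual PCM-style scale estimate and is an assumption, because the patent's formula (6) is available only as an equation image.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Plain fuzzy c-means; returns membership matrix U (c, n) and centers V (c, d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)                # columns sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True) # weighted class means
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # (c, n)
        d2 = np.maximum(d2, 1e-12)
        inv = d2 ** (-1.0 / (m - 1))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:
            break
    return U, V

def pcm_penalty(X, U, V, m=2.0, K=1.0):
    """PCM-style penalty r_i = K * sum_j u_ij^m ||x_j - v_i||^2 / sum_j u_ij^m.

    This particular form is an assumed stand-in for the patent's
    formula (6), which is published only as an equation image.
    """
    Um = U ** m
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return K * (Um * d2).sum(axis=1) / Um.sum(axis=1)
```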
Step 3, randomly initialize the typical value t_ij^(0) of the jth data point x_j belonging to the ith class; define the iteration count as iter and the maximum iteration count as iterMax, and initialize iter = 1; the membership matrix of the iter-th iteration is U^(iter), and the typical value matrix of the iter-th iteration is T^(iter).
Step 4, obtain the parameters α_ijl from formula (3).

Step 5, obtain the parameters β_l from formula (7):

[formula (7): β_l; equation image not reproduced]

Step 6, obtain the kernel weight parameters w_l^(iter) of the iter-th iteration from formula (8):

[formula (8): w_l^(iter); equation image not reproduced]

Step 7, obtain from formula (2) the square of the distance D_ij between the jth data point x_j of the multi-kernel high-dimensional feature space and the cluster center v_i of the ith class of the multi-kernel high-dimensional feature space.

Step 8, obtain the typical value t_ij^(iter) of the jth data point x_j belonging to the ith class in the iter-th iteration from formula (9):

[formula (9): t_ij^(iter); equation image not reproduced]

Step 9, obtain the membership value u_ij^(iter) of the jth data point x_j in the ith class in the iter-th iteration from formula (10):

[formula (10): u_ij^(iter); equation image not reproduced]
Step 10, set a threshold ε and judge whether the algorithm stopping condition ||U^(iter) - U^(iter-1)|| < ε or iter > iterMax holds. If it holds, u_ij^(iter) is the optimal membership value: for the jth data point x_j, take the value of i at which u_ij^(iter) is maximal, so that the data point x_j belongs to the ith class; according to the obtained membership matrix U^(iter), obtain the set Y of all data points belonging to the ith class and calculate

v_i = (1/n') Σ_{j=1}^{n'} y_j

where y_j is the jth data point in the ith-class data point set Y, the formula is the mean value of the set Y, and n' is the total number of data points belonging to the ith class; v_i is then the cluster center of the ith class, and the cluster centers of the other classes are obtained in the same way. Products are recommended to new customers according to the cluster center matrix. If the stopping condition is not satisfied, increase the value of iter by 1 and repeat steps 4 to 10 until the condition is satisfied.
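Steps 3 to 10 alternate four updates until U stabilizes. Because formulas (7) to (10) are published only as equation images, the sketch below leaves those updates as named callback hooks with hypothetical signatures, and fixes only the control flow, stopping test, and final center extraction described in the text; it reuses fcm, pcm_penalty, gaussian_kernels, and combined_distance_sq from the earlier sketches.

```python
import numpy as np

def dwkfcm(X, c, m=2.0, eta=2.0, iter_max=100, eps=1e-5,
           update_weights=None, update_typicality=None, update_membership=None):
    """Control-flow sketch of the DWKFCM loop (steps 3-10).

    The three update callbacks stand in for the patent's formulas (7)-(10),
    which survive only as equation images; everything else mirrors the
    described procedure. Assumes every class ends up non-empty.
    """
    n = X.shape[0]
    U, V = fcm(X, c, m)                            # step 2: FCM seeds U^(0)
    r = pcm_penalty(X, U, V, m)                    # penalty factors r_i (formula (6) stand-in)
    K = gaussian_kernels(X, X.std(axis=0))         # widths sigma_l: assumed heuristic
    T = np.random.default_rng(1).random((c, n))    # step 3: random typical values T^(0)
    for it in range(1, iter_max + 1):
        w = update_weights(K, U, T)                # steps 4-6: alpha_ijl, beta_l, w_l^(iter)
        D2 = combined_distance_sq(K, U, m, w)      # step 7: D_ij^2 in the shape of formula (2)
        T = update_typicality(D2, r, eta)          # step 8: formula (9) hook
        U_new = update_membership(D2, T, m)        # step 9: formula (10) hook
        converged = np.abs(U_new - U).max() < eps  # step 10: ||U^(iter) - U^(iter-1)|| < eps
        U = U_new
        if converged:
            break
    labels = U.argmax(axis=0)                      # assign each point to its max-membership class
    V = np.stack([X[labels == i].mean(axis=0) for i in range(c)])  # v_i = class mean
    return U, T, labels, V
```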
Compared with the prior art, the invention has the beneficial effects that:
1. The double-variant weighted kernel FCM clustering algorithm DWKFCM (Weighted Kernel Fuzzy C-Means with Double Variables) of the invention adopts a multi-kernel method that integrates the advantages of the fuzzy clustering method FCM and of the possibilistic clustering algorithm PCM; it reduces the influence on the experimental results of kernel function selection, to which single-kernel algorithms are sensitive, and it adds the possibilistic concept on top of the multiple kernels, so the algorithm has stronger noise immunity and the obtained clustering results are more accurate.
2. The DWKFCM of the invention adopts a kernel-based algorithm capable of operating on nonlinear data, mapping the nonlinear operations on ordinary data into a high-dimensional data space, which increases the robustness of the algorithm.
3. The DWKFCM of the invention extends multi-kernel clustering to soft clustering: the attribution of the data points is fuzzified while their spatial distribution characteristics are taken into account, and the influence of a data point on a cluster center is judged by calculating the distance between them, the influence becoming smaller the farther the data point lies from the cluster center, so the method has stronger flexibility.
4. Data information grows explosively nowadays, and clustering data is the key to further data analysis and knowledge mining, so the method has good application value.
Drawings
FIG. 1 is a flow chart of the data clustering method based on the double-variant weighted kernel FCM algorithm;
FIG. 2 is a Sammon map of the iris data set.
Detailed Description
Referring to FIG. 1, the data clustering method based on the double-variant weighted kernel FCM algorithm in this embodiment is used for clustering customer information and recommending products to customers according to the clustering results, and proceeds according to the following steps:
step 1, the data set X is the customer information, X = {x_1, x_2, …, x_n}, where x_j is the jth data point, j = 1, 2, …, n, and n is the total number of data points; for example, with 10000 pieces of customer information, n = 10000. Optimally partition the data set X so that the value J of the objective function in formula (1) is minimized:

[formula (1): objective function J; equation image not reproduced]

in formula (1), i denotes the ith class and c denotes the number of classes, i.e. the number of product categories, for example c = 10; 1 ≤ i ≤ c, 2 ≤ c ≤ n; u_ij denotes the membership value of the jth data point x_j in the ith class, u_ij^m is its mth power, and m is a weighting exponent representing the degree of clustering fuzziness, which can take the value 2; t_ij denotes the likelihood typical value of the jth data point x_j with respect to the ith class, t_ij^η is its ηth power, and η is a weighting exponent that controls the proportion between membership and typical values to realize the bivariation; η can take the value 2. D_ij denotes the distance in the multi-kernel high-dimensional feature space between the jth data point x_j and the cluster center v_i of the ith class, and satisfies:

D_ij^2 = Σ_{l=1}^{M} w_l^2 α_ijl    (2)

in formula (2), w_l is the weight parameter of the lth kernel, M is the total number of kernels and takes the number of attributes of the customer information as its value, for example M = 6; Σ_{l=1}^{M} w_l = 1, and α_ijl is expressed by formula (3):

[formula (3): α_ijl; equation image not reproduced]

in formula (3), k_l(x_j, x_q) is the lth Gaussian kernel function, expressed by formula (4):

k_l(x_j, x_q) = exp(-(x_jl - x_ql)^2 / σ_l^2)    (4)

in formula (4), σ_l is the width parameter of the function, calculated from the data by the formula

[width formula for σ_l: equation image not reproduced]

and x_jl, x_ql denote the lth-dimension feature values of the jth and qth data points, respectively;

in formula (1), a parameter of the jth data point x_j satisfies formula (5):

[formula (5): equation image not reproduced]

in formula (5), θ denotes a constant, which can take the value 2; x_z, 1 ≤ z ≤ n, denotes the zth data point; ||x_j - x_z|| denotes the Euclidean distance between the jth data point x_j and the zth data point x_z;

in formula (1), r_i denotes a penalty factor and satisfies formula (6):

[formula (6): r_i; equation image not reproduced]

in formula (6), ||x_j - v_i|| denotes the Euclidean distance between the jth data point and the ith cluster center.
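The embodiment computes each width σ_l from the data by a formula that survives only as an equation image. A common stand-in, offered purely as an assumption, ties σ_l to the spread of the lth feature:

```python
import numpy as np

def kernel_widths(X):
    """Heuristic width parameters: sigma_l proportional to the std of feature l.

    This is an assumed placeholder for the patent's (image-only) width
    formula; any positive per-feature scale works structurally.
    """
    sigma = X.std(axis=0)
    return np.where(sigma > 0, sigma, 1.0)   # guard constant features
```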
Step 2, process the data set X with the fuzzy c-means clustering algorithm FCM to obtain a membership matrix U and cluster centers V:

[membership matrix U and cluster center matrix V: image not reproduced]

Calculate the penalty factors r_i according to formula (6), and take the membership matrix U obtained by the fuzzy c-means clustering algorithm FCM as the initial membership matrix U^(0) of the double-variant weighted kernel FCM algorithm.
Step 3, randomly initialize the typical value t_ij^(0) of the jth data point x_j belonging to the ith class; define the iteration count as iter and the maximum iteration count as iterMax, setting iterMax = 100; initialize iter = 1; the membership matrix of the iter-th iteration is U^(iter), and the typical value matrix of the iter-th iteration is T^(iter).
Step 4, obtain the parameters α_ijl from formula (3).

Step 5, obtain the parameters β_l from formula (7):

[formula (7): β_l; equation image not reproduced]

Step 6, obtain the kernel weight parameters w_l^(iter) of the iter-th iteration from formula (8):

[formula (8): w_l^(iter); equation image not reproduced]

Step 7, obtain from formula (2) the square of the distance D_ij between the jth data point x_j of the multi-kernel high-dimensional feature space and the cluster center v_i of the ith class of the multi-kernel high-dimensional feature space.

Step 8, obtain the typical value t_ij^(iter) of the jth data point x_j belonging to the ith class in the iter-th iteration from formula (9):

[formula (9): t_ij^(iter); equation image not reproduced]

Step 9, obtain the membership value u_ij^(iter) of the jth data point x_j in the ith class in the iter-th iteration from formula (10):

[formula (10): u_ij^(iter); equation image not reproduced]
Step 10, set the threshold ε = 0.00001 and judge whether the algorithm stopping condition ||U^(iter) - U^(iter-1)|| < ε or iter > iterMax holds. If it holds, u_ij^(iter) is the optimal membership value: for the jth data point x_j, take the value of i at which u_ij^(iter) is maximal, so that the data point x_j belongs to the ith class; according to the obtained membership matrix U^(iter), obtain the set Y of all data points belonging to the ith class and calculate

v_i = (1/n') Σ_{j=1}^{n'} y_j

where y_j is the jth data point in the ith-class data point set Y, the formula is the mean value of the set Y, and n' is the total number of data points belonging to the ith class, obtainable from the number of elements in the set Y; v_i is then the cluster center of the ith class, and the cluster centers of the other classes are obtained in the same way. Products are recommended to new customers according to the cluster center matrix. If the stopping condition is not satisfied, increase the value of iter by 1 and repeat steps 4 to 10 until the condition is satisfied.
In this embodiment, catalog marketing is taken as the research object, and the data to be clustered is customer information: the set of all customer information forms the data set to be clustered, and each piece of customer information contains attribute information such as age, income, internet surfing time, gender, constellation, and consumption categories. All customer information is clustered with the data clustering method of this embodiment, and marketing strategies such as recommending specific products and regularly pushing articles purchased by similar customers are then applied to different customers according to the clustering results.
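Before clustering, each customer record must become a numeric vector. The sketch below shows one plausible encoding of the attributes named above; the column names and the one-hot treatment of gender and constellation are illustrative assumptions, not part of the patent:

```python
import pandas as pd

def encode_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Turn raw customer records into a purely numeric feature matrix.

    Numeric columns pass through; categorical ones (gender, constellation)
    are one-hot encoded. All column names are illustrative assumptions.
    """
    numeric = df[["age", "income", "online_hours", "purchase_categories"]]
    categorical = pd.get_dummies(df[["gender", "constellation"]], dtype=float)
    X = pd.concat([numeric, categorical], axis=1)
    return (X - X.mean()) / X.std()          # z-score so no attribute dominates
```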
The clustering method in this embodiment was completed on the following experimental platform: a PC running the Windows 7 operating system, with MATLAB R2015b as the integrated development environment. The hardware conditions were: an Intel Core 3.20 GHz CPU and 8 GB of memory. The parameters m and η of the algorithm were both set to 2.0, and the maximum iteration count iterMax to 100.
In order to verify the performance of the data clustering method DWKFCM based on the double-variant weighted kernel FCM algorithm, four real data sets were used: the iris data set, the wine data set, the glass data set, and the Pima Indian diabetes data set. The specific information of these four data sets is shown in Table 1.
Table 1 summary of the actual data set information used in the experiment
[Table 1: table image not reproduced]
The clustering method DWKFCM of this embodiment, the fuzzy c-means clustering algorithm FCM, the possibilistic clustering algorithm PCM, the possibilistic fuzzy clustering algorithm PFCM, the kernel-based fuzzy clustering algorithm KFCM, and the multi-kernel fuzzy clustering algorithm MKFCM were each used to cluster these data sets. The clustering accuracy CR is taken as the evaluation standard of the clustering effect and is defined as:

CR = (1/n) Σ_{i=1}^{c} a_i

where a_i denotes the number of samples of the ith cluster that are ultimately correctly classified, c denotes the number of clusters, and n denotes the number of samples in the data set. The higher the clustering accuracy, the better the clustering effect of the method. When CR equals 1, the clustering result of the algorithm on the data set is completely correct.
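Because cluster indices are arbitrary, computing the a_i requires matching clusters to true classes first. One standard way to do this (an implementation choice, not something the patent specifies) is a confusion matrix plus the Hungarian assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_true, labels_pred):
    """CR = (1/n) * sum_i a_i after optimally matching clusters to classes.

    The optimal matching step is our implementation choice for recovering a_i.
    """
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    # confusion[i, k] = how many points of class i landed in cluster k
    confusion = np.array([[np.sum((labels_true == ci) & (labels_pred == ck))
                           for ck in clusters] for ci in classes])
    row, col = linear_sum_assignment(-confusion)   # maximize matched counts
    return confusion[row, col].sum() / len(labels_true)
```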
The iris data set contains 150 data objects, each described by 4 attributes: calyx length, calyx width, petal length, and petal width; from these 4 attributes the algorithm predicts to which of three categories (Setosa, Versicolour, Virginica) an iris belongs. As can be seen from the Sammon map of the iris data set in FIG. 2, two of the classes, marked by "Δ" and "o" in the figure, overlap each other and are not easy to separate, which poses a great challenge to a clustering method. The accuracy of the clustering results obtained on the iris data set with the DWKFCM, FCM, PCM, PFCM, KFCM, and MKFCM algorithms is shown in Table 2.
TABLE 2 clustering accuracy of algorithms on Iris data set
Algorithm Cluster accuracy (CR)
FCM 0.877
PCM 0.668
PFCM 0.808
KFCM 0.914
MKFCM 0.932
DWKFCM 0.959
It can be seen from table 2 that the accuracy of the DWKFCM algorithm is improved by 8.2%, 29.1%, 15.1%, 4.5% and 2.7% over FCM, PCM, PFCM, KFCM and MKFCM, respectively. The performance of the DWKFCM algorithm is therefore best.
The glass data set is characterized by glass material and contains 214 data objects; each object is represented by 9 attributes and can be divided into 6 classes of different sizes. The accuracy of the clustering results obtained on the glass data set with the DWKFCM, FCM, PCM, PFCM, KFCM, and MKFCM algorithms is shown in Table 3.
TABLE 3 clustering accuracy of algorithms on glass datasets
Algorithm Cluster accuracy (CR)
FCM 0.813
PCM 0.739
PFCM 0.846
KFCM 0.888
MKFCM 0.901
DWKFCM 0.953
The clustering results in table 3 show that the clustering accuracy of the DWKFCM algorithm is improved by 14%, 21.4%, 10.7%, 6.5% and 5.2% respectively over FCM, PCM, PFCM, KFCM and MKFCM. The performance of the DWKFCM algorithm is better.
The wine data set is characterized by analyses of wine composition and contains 178 data objects; each object contains 13 attributes and is classified into 3 types, the largest number of attributes among the iris, glass, and wine data sets. The accuracy of the clustering results obtained on the wine data set with the DWKFCM, FCM, PCM, PFCM, KFCM, and MKFCM algorithms is shown in Table 4.
TABLE 4 clustering accuracy of algorithms on wine datasets
Algorithm Cluster accuracy (CR)
FCM 0.708
PCM 0.409
PFCM 0.688
KFCM 0.777
MKFCM 0.841
DWKFCM 0.925
The clustering results in table 4 show that the clustering accuracy of the DWKFCM algorithm is improved by 21.7%, 51.6%, 23.7%, 14.8% and 8.4% respectively over FCM, PCM, PFCM, KFCM and MKFCM. The performance of the DWKFCM algorithm is better.
The diabetes data set is a diabetes diagnosis data set from the field of medicine. The data set contains 768 instances, each containing 8 attributes: number of pregnancies (pregnant times), plasma glucose concentration 2 h after a meal, diastolic blood pressure (mmHg), triceps skin fold thickness, serum insulin (μU/ml), body mass index, diabetes pedigree function (family history), and age. From these 8 attributes the algorithm determines whether a subject is diabetic. The data set consists of statistics for 500 subjects without the disease and 268 patients.
The accuracy of the clustering results obtained by using the DWKFCM algorithm, the FCM algorithm, the PCM algorithm, the PFCM algorithm, the KFCM algorithm, and the MKFCM algorithm on the diabetes data set is shown in table 5.
TABLE 5 clustering accuracy of algorithms on diabetes data sets
Algorithm Cluster accuracy (CR)
FCM 0.582
PCM 0.355
PFCM 0.754
KFCM 0.773
MKFCM 0.831
DWKFCM 0.934
As can be seen from Table 5, the DWKFCM of the present invention achieves 93.4% accuracy in diagnosing diabetes on the diabetes data set, while the accuracy of the other algorithms is below 90%; the PCM algorithm even has a clustering accuracy of only 35.5%. The clustering results in Table 5 show that the clustering accuracy of the DWKFCM algorithm is improved by 35.2%, 57.9%, 18%, 16.1% and 10.3% over FCM, PCM, PFCM, KFCM and MKFCM, respectively. The performance of the DWKFCM algorithm is better.
In conclusion, the double-variant weighted kernel fuzzy c-means clustering algorithm DWKFCM can be applied well to fields such as marketing, iris classification, wine classification, glass classification, and diabetes diagnosis; it reliably mines the information in data and has high practicability.

Claims (1)

1. A data clustering method based on a double-variant weighted kernel FCM algorithm, used for clustering customer information and recommending products to customers according to the clustering results, characterized in that the method comprises the following steps:
step 1, the data set X is the customer information, X = {x_1, x_2, …, x_n}, where x_j is the jth data point, j = 1, 2, …, n, and n is the total number of data points; optimally partition the data set X so that the value J of the objective function in formula (1) is minimized:

[formula (1): objective function J; equation image not reproduced]

in formula (1), i denotes the ith class, c denotes the number of classes, 1 ≤ i ≤ c, 2 ≤ c ≤ n; u_ij denotes the membership value of the jth data point x_j in the ith class, u_ij^m is its mth power, and m is a weighting exponent representing the degree of clustering fuzziness; t_ij denotes the likelihood typical value of the jth data point x_j with respect to the ith class, t_ij^η is its ηth power, and η is a weighting exponent that controls the proportion between membership and typical values to realize the bivariation; D_ij denotes the distance in the multi-kernel high-dimensional feature space between the jth data point x_j and the cluster center v_i of the ith class, and satisfies:

D_ij^2 = Σ_{l=1}^{M} w_l^2 α_ijl    (2)

in formula (2), w_l is the weight parameter of the lth kernel, M is the total number of kernels, Σ_{l=1}^{M} w_l = 1, and α_ijl is expressed by formula (3):

[formula (3): α_ijl; equation image not reproduced]

in formula (3), k_l(x_j, x_q) is the lth Gaussian kernel function, expressed by formula (4):

k_l(x_j, x_q) = exp(-(x_jl - x_ql)^2 / σ_l^2)    (4)

in formula (4), σ_l is the width parameter of the function, and x_jl, x_ql denote the lth-dimension feature values of the jth and qth data points, respectively;

in formula (1), a parameter of the jth data point x_j satisfies formula (5):

[formula (5): equation image not reproduced]

in formula (5), θ represents a constant; x_z, 1 ≤ z ≤ n, denotes the zth data point; ||x_j - x_z|| denotes the Euclidean distance between the jth data point x_j and the zth data point x_z;

in formula (1), r_i denotes a penalty factor and satisfies formula (6):

[formula (6): r_i; equation image not reproduced]

in formula (6), ||x_j - v_i|| denotes the Euclidean distance between the jth data point and the ith cluster center;
step 2, process the data set X with the fuzzy c-means clustering algorithm FCM to obtain a membership matrix U and cluster centers V:

[membership matrix U and cluster center matrix V: image not reproduced]

calculate the penalty factors r_i according to formula (6), and take the membership matrix U obtained by the fuzzy c-means clustering algorithm FCM as the initial membership matrix U^(0) of the double-variant weighted kernel FCM algorithm;
Step 3, randomly initialize the typical value t_ij^(0) of the jth data point x_j belonging to the ith class; define the iteration count as iter and the maximum iteration count as iterMax, and initialize iter = 1; the membership matrix of the iter-th iteration is U^(iter), and the typical value matrix of the iter-th iteration is T^(iter);
Step 4, obtain the parameters α_ijl from formula (3);

Step 5, obtain the parameters β_l from formula (7):

[formula (7): β_l; equation image not reproduced]

Step 6, obtain the kernel weight parameters w_l^(iter) of the iter-th iteration from formula (8):

[formula (8): w_l^(iter); equation image not reproduced]

Step 7, obtain from formula (2) the square of the distance D_ij between the jth data point x_j of the multi-kernel high-dimensional feature space and the cluster center v_i of the ith class of the multi-kernel high-dimensional feature space;

Step 8, obtain the typical value t_ij^(iter) of the jth data point x_j belonging to the ith class in the iter-th iteration from formula (9):

[formula (9): t_ij^(iter); equation image not reproduced]

Step 9, obtain the membership value u_ij^(iter) of the jth data point x_j in the ith class in the iter-th iteration from formula (10):

[formula (10): u_ij^(iter); equation image not reproduced]
Step 10, set a threshold ε and judge whether the algorithm stopping condition ||U^(iter) - U^(iter-1)|| < ε or iter > iterMax holds; if it holds, u_ij^(iter) is the optimal membership value: for the jth data point x_j, take the value of i at which u_ij^(iter) is maximal, so that the data point x_j belongs to the ith class; according to the obtained membership matrix U^(iter), obtain the set Y of all data points belonging to the ith class and calculate

v_i = (1/n') Σ_{j=1}^{n'} y_j

where y_j is the jth data point in the ith-class data point set Y, the formula is the mean value of the set Y, and n' is the total number of data points belonging to the ith class; v_i is then the cluster center of the ith class, and the cluster centers of the other classes are obtained in the same way; recommend products to new customers according to the cluster center matrix; if the stopping condition is not satisfied, increase the value of iter by 1 and repeat steps 4 to 10 until the condition is satisfied.
CN201810636707.XA 2018-06-20 2018-06-20 Data clustering method based on double-variant weighted kernel FCM algorithm Active CN108763590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810636707.XA CN108763590B (en) 2018-06-20 2018-06-20 Data clustering method based on double-variant weighted kernel FCM algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810636707.XA CN108763590B (en) 2018-06-20 2018-06-20 Data clustering method based on double-variant weighted kernel FCM algorithm

Publications (2)

Publication Number Publication Date
CN108763590A CN108763590A (en) 2018-11-06
CN108763590B true CN108763590B (en) 2021-07-27

Family

ID=63979218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810636707.XA Active CN108763590B (en) 2018-06-20 2018-06-20 Data clustering method based on double-variant weighted kernel FCM algorithm

Country Status (1)

Country Link
CN (1) CN108763590B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670537A (en) * 2018-12-03 2019-04-23 济南大学 The full attribute weight fuzzy clustering method of multicore based on quasi- Monte Carlo feature
CN111367901B (en) * 2020-02-27 2024-04-02 智慧航海(青岛)科技有限公司 Ship data denoising method
CN111476236B (en) * 2020-04-09 2023-07-21 湖南城市学院 Self-adaptive FCM license plate positioning method and system
CN112541407B (en) * 2020-08-20 2022-05-13 同济大学 Visual service recommendation method based on user service operation flow
CN113283242B (en) * 2021-05-31 2024-04-26 西安理工大学 Named entity recognition method based on combination of clustering and pre-training model
CN113268333B (en) * 2021-06-21 2024-03-19 成都锋卫科技有限公司 Hierarchical clustering algorithm optimization method based on multi-core computing
CN116723136B (en) * 2023-08-09 2023-11-03 南京华飞数据技术有限公司 Network data detection method applying FCM clustering algorithm
CN117112871B (en) * 2023-10-19 2024-01-05 南京华飞数据技术有限公司 Data real-time efficient fusion processing method based on FCM clustering algorithm model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195734B1 (en) * 2006-11-27 2012-06-05 The Research Foundation Of State University Of New York Combining multiple clusterings by soft correspondence
CN107203785A (en) * 2017-06-02 2017-09-26 常州工学院 Multipath Gaussian kernel Fuzzy c-Means Clustering Algorithm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049636A (en) * 2012-09-12 2013-04-17 江苏大学 Method and system for possibly fuzzy K-harmonic means clustering
CN105894024A (en) * 2016-03-29 2016-08-24 合肥工业大学 Possibility fuzzy c mean clustering algorithm based on multiple kernels
CN106846326A (en) * 2017-01-17 2017-06-13 合肥工业大学 Image partition method based on multinuclear local message FCM algorithms
CN107220977B (en) * 2017-06-06 2019-08-30 合肥工业大学 The image partition method of Validity Index based on fuzzy clustering

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195734B1 (en) * 2006-11-27 2012-06-05 The Research Foundation Of State University Of New York Combining multiple clusterings by soft correspondence
CN107203785A (en) * 2017-06-02 2017-09-26 常州工学院 Multipath Gaussian kernel Fuzzy c-Means Clustering Algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fuzzy clustering with optimized-parameters multiple Gaussian Kernels; Issam Dagher; IEEE; 2015-11-30; full text *
Research on intuitionistic fuzzy kernel clustering algorithm based on particle swarm optimization; Yu Xiaodong; Journal on Communications; 2015-05-25; Vol. 36, No. 05; pp. 78-84 *

Also Published As

Publication number Publication date
CN108763590A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108763590B (en) Data clustering method based on double-variant weighted kernel FCM algorithm
Dudoit et al. Classification in microarray experiments
DeSarbo et al. Synthesized clustering: A method for amalgamating alternative clustering bases with differential weighting of variables
Wang et al. On fuzzy cluster validity indices
Steinley et al. Evaluating mixture modeling for clustering: recommendations and cautions.
Patil et al. Hybrid prediction model for type-2 diabetic patients
Fonseca et al. Mixture-model cluster analysis using information theoretical criteria
Hunt et al. Theory & Methods: Mixture model clustering using the MULTIMIX program
Yang et al. Unsupervised fuzzy model-based Gaussian clustering
Albergante et al. Estimating the effective dimension of large biological datasets using Fisher separability analysis
Witten et al. Supervised multidimensional scaling for visualization, classification, and bipartite ranking
Long et al. Boosting and microarray data
CN113889192B (en) Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder
Mukhopadhyay Large-scale mode identification and data-driven sciences
CN111652303A (en) Outlier detection method based on spectral clustering under non-independent same distribution
CN110400610B (en) Small sample clinical data classification method and system based on multichannel random forest
Vengatesan et al. The performance analysis of microarray data using occurrence clustering
Jena et al. An integrated novel framework for coping missing values imputation and classification
Miller et al. Emergent unsupervised clustering paradigms with potential application to bioinformatics
CN111582370B (en) Brain metastasis tumor prognostic index reduction and classification method based on rough set optimization
CN115985503B (en) Cancer prediction system based on ensemble learning
CN117195027A (en) Cluster weighted clustering integration method based on member selection
CN110991517A (en) Classification method and system for unbalanced data set in stroke
Ragunthar et al. Classification of gene expression data with optimized feature selection
CN114358191A (en) Gene expression data clustering method based on depth automatic encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant