CN108763590B - Data clustering method based on double-variant weighted kernel FCM algorithm - Google Patents

Data clustering method based on double-variant weighted kernel FCM algorithm

Info

Publication number
CN108763590B
CN108763590B (application CN201810636707.XA)
Authority
CN
China
Prior art keywords
formula
clustering
algorithm
data point
kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810636707.XA
Other languages
Chinese (zh)
Other versions
CN108763590A (en)
Inventor
唐益明
胡相慧
丰刚永
华丹阳
任福继
张有成
宋小成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201810636707.XA priority Critical patent/CN108763590B/en
Publication of CN108763590A publication Critical patent/CN108763590A/en
Application granted granted Critical
Publication of CN108763590B publication Critical patent/CN108763590B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data clustering method based on a double-variant weighted kernel FCM algorithm. The method first optimally partitions a data set so as to minimize an objective function; obtains an initial membership matrix, a typical-value matrix, and initial cluster centers; computes the distances between data points and cluster centers in the multi-kernel high-dimensional space; and iteratively updates the membership values and likelihood typical values, taking the partition at which the objective function reaches its minimum as the final clustering result. The invention replaces the ordinary Euclidean distance function with a distance induced by a combined kernel, so that both linear and nonlinear data can be partitioned well; the typical-value matrix strengthens the noise immunity of the algorithm and improves clustering accuracy; and the proportions of the individual kernels within the combined kernel are adjusted automatically to meet the requirements of different data sets, resolving the uncertainty that ordinary kernel algorithms face when selecting a kernel function.

Description

Data clustering method based on double-variant weighted kernel FCM algorithm
Technical Field
The invention relates to the technical field of data clustering, in particular to a data clustering method based on a double-variant weighted kernel FCM algorithm.
Background
Clustering is an important research topic in the fields of data mining and artificial intelligence, and plays a major role in areas such as big data, pattern recognition, image segmentation, and machine learning. Clustering is the process of partitioning data according to similarity rules; the partition is determined by those rules, and the resulting groups or sets are commonly called clusters. The fuzzy c-means clustering algorithm (FCM) is a basic method of fuzzy clustering and laid the foundation for objective-function-based clustering algorithms, but it is sensitive to the initialization of the cluster centers and easily affected by noise points. To improve the noise immunity of the algorithm, Krishnapuram and Keller proposed the PCM algorithm. The PCM algorithm adopts a possibilistic partition matrix in which the possibilistic membership reflects how typical a data point is of a given cluster center. In addition, it relaxes the FCM constraint that each column of the partition matrix sums to 1, so PCM can handle noise points better than FCM. However, the PCM algorithm easily produces fewer cluster centers than the predetermined number of clusters, or coincident clusters.
For this reason, Zhang proposed in the literature an improved possibilistic clustering scheme that adds a new parameter η_i to reduce the error of the algorithm. Although possibilistic clustering can overcome the problem of coincident clusters, it is exceptionally sensitive to the selection of the parameter m_p: different m_p values yield two distinct sets of cluster centers even when the values differ only slightly. The improved possibilistic fuzzy c-means clustering algorithm PFCM proposed by Nikhil has good noise robustness and does not produce coincident clusters; however, the parameters a and b of PFCM usually have to be specified manually, lack a theoretical basis, and introduce a strong dependency.
The above algorithms cluster linear data well, but their effect on nonlinear data is not ideal. By introducing a kernel function satisfying the Mercer condition, the sample data X = {x_1, …, x_n} is mapped into a high-dimensional feature space F as φ(x_1), …, φ(x_n), where φ is the mapping function, and the samples are clustered in the space F, which gives a kernel-based fuzzy clustering algorithm. Yang proposed the kernel-based fuzzy clustering algorithm KFCM, and Genton presented kernel machine learning from the statistical angle. These algorithms map the data points into a high-dimensional feature space and, by means of the kernel function and optimization of the clustering error, give kernel-based fuzzy clustering good robustness against noise and outliers, solving the PFCM algorithm's sensitivity to parameter settings. However, kernel-based fuzzy clustering works well on spherical data and does not obtain ideal results on non-spherical data.
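A point worth making explicit, since all of these kernel methods rely on it, is that the map φ never has to be evaluated: squared distances in F follow from the kernel alone via ||φ(x_j) - φ(x_q)||^2 = k(x_j, x_j) - 2k(x_j, x_q) + k(x_q, x_q), and it is this identity that makes the kernel-based distance computations in what follows tractable.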
The maximum-margin multiple-kernel clustering previously proposed by Zhao et al. in the literature is mostly concerned with supervised and semi-supervised cluster learning; it is based on maximum-margin clustering and is clearly used mostly for hard clustering. The multiple-kernel method proposed by Hsin-Chien provides great flexibility in the selection and combination of basic kernels, adds information sources from different angles, and improves the ability to encode domain knowledge.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a data clustering method based on a double-variant weighted kernel FCM algorithm, which avoids both the sensitivity of FCM to noise points and the tendency of PCM to produce coincident clusters, improves the accuracy of the algorithm, and mines the structural information present in a data set more accurately. At the same time, the most suitable weight values and the current membership values are found automatically, which improves the reliability and convergence of the algorithm.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a data clustering method based on a bivariate weighted kernel FCM algorithm, which is used for clustering customer information and recommending products to customers according to clustering results and comprises the following steps:
step 1, the data set X is the customer information, X = {x_1, x_2, …, x_n}, where x_j is the jth data point, j = 1, 2, …, n, and n is the total number of data points; optimally partition the data set X so that the value J of the objective function in formula (1) is minimized:

[formula (1): objective function J; equation image not reproduced]

in formula (1), i denotes the ith class, c denotes the number of classes, 1 ≤ i ≤ c, 2 ≤ c ≤ n; u_ij denotes the membership value of the jth data point x_j in the ith class, u_ij^m is its mth power, and m is a weighting exponent representing the degree of clustering fuzziness; t_ij denotes the likelihood typical value of the jth data point x_j with respect to the ith class, t_ij^η is its ηth power, and η is a weighting exponent that controls the proportion between membership and typical values to realize the bivariation; D_ij denotes the distance in the multi-kernel high-dimensional feature space between the jth data point x_j and the cluster center v_i of the ith class, and satisfies:

D_ij^2 = Σ_{l=1}^{M} w_l^2 α_ijl    (2)

in formula (2), w_l is the weight parameter of the lth kernel, M is the total number of kernels, Σ_{l=1}^{M} w_l = 1, and α_ijl is expressed by formula (3):

[formula (3): α_ijl; equation image not reproduced]

in formula (3), k_l(x_j, x_q) is the lth Gaussian kernel function, expressed by formula (4):

k_l(x_j, x_q) = exp(-(x_jl - x_ql)^2 / σ_l^2)    (4)

in formula (4), σ_l is the width parameter of the function, and x_jl, x_ql denote the lth-dimension feature values of the jth and qth data points, respectively;

in formula (1), a parameter of the jth data point x_j satisfies formula (5):

[formula (5): equation image not reproduced]

in formula (5), θ denotes a constant; x_z, 1 ≤ z ≤ n, denotes the zth data point; ||x_j - x_z|| denotes the Euclidean distance between the jth data point x_j and the zth data point x_z;

in formula (1), r_i denotes a penalty factor and satisfies formula (6):

[formula (6): r_i; equation image not reproduced]

in formula (6), ||x_j - v_i|| denotes the Euclidean distance between the jth data point and the ith cluster center;
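To make the kernel machinery of step 1 concrete, the following Python sketch implements per-dimension Gaussian kernels in the shape of formula (4) and a combined-kernel squared distance in the shape of formula (2). Since formula (3) itself survives only as an equation image, the α_ijl term below uses the standard multiple-kernel FCM expansion of ||φ_l(x_j) - v_i||^2 with the centers expressed through the memberships; treat it as an assumed stand-in, not the patent's exact expression.

```python
import numpy as np

def gaussian_kernels(X, sigma):
    """Per-dimension Gaussian kernels k_l(x_j, x_q) = exp(-(x_jl - x_ql)^2 / sigma_l^2).

    X: (n, M) data matrix; sigma: (M,) width parameters.
    Returns K with shape (M, n, n), one kernel matrix per feature dimension.
    """
    n, M = X.shape
    K = np.empty((M, n, n))
    for l in range(M):
        d = X[:, l][:, None] - X[:, l][None, :]      # pairwise 1-D differences
        K[l] = np.exp(-(d ** 2) / sigma[l] ** 2)
    return K

def combined_distance_sq(K, U, m, w):
    """Squared distances D_ij^2 = sum_l w_l^2 * alpha_ijl (shape of formula (2)).

    alpha_ijl here is the usual multiple-kernel FCM expansion of
    ||phi_l(x_j) - v_i^(l)||^2 with centers implicit in the memberships;
    this is an assumption, since the patent's formula (3) is an image.
    K: (M, n, n) kernel matrices; U: (c, n) memberships; w: (M,) weights.
    """
    Um = U ** m                                      # u_ij^m
    s = Um.sum(axis=1, keepdims=True)                # per-class normalizers
    D2 = np.zeros(U.shape)
    for l in range(K.shape[0]):
        Kl = K[l]
        self_term = np.diag(Kl)[None, :]             # k_l(x_j, x_j)
        cross = (Um @ Kl) / s                        # sum_q u_iq^m k_l(x_j, x_q) / sum_q u_iq^m
        center = (Um @ Kl @ Um.T).diagonal()[:, None] / s ** 2
        D2 += w[l] ** 2 * (self_term - 2 * cross + center)
    return np.maximum(D2, 0.0)                       # guard tiny negatives from round-off
```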
step 2, process the data set X with the fuzzy c-means clustering algorithm FCM to obtain a membership matrix U and cluster centers V:

[membership matrix U and cluster center matrix V: image not reproduced]

calculate the penalty factors r_i according to formula (6), and take the membership matrix U obtained by the fuzzy c-means clustering algorithm FCM as the initial membership matrix U^(0) of the double-variant weighted kernel FCM algorithm;
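Step 2 only requires a standard FCM pass to seed U^(0), plus the r_i penalties. A minimal NumPy sketch follows; the r_i expression shown is the usual PCM-style scale estimate and is an assumption, because the patent's formula (6) is available only as an equation image.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Plain fuzzy c-means; returns membership matrix U (c, n) and centers V (c, d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)                # columns sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True) # weighted class means
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # (c, n)
        d2 = np.maximum(d2, 1e-12)
        inv = d2 ** (-1.0 / (m - 1))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:
            break
    return U, V

def pcm_penalty(X, U, V, m=2.0, K=1.0):
    """PCM-style penalty r_i = K * sum_j u_ij^m ||x_j - v_i||^2 / sum_j u_ij^m.

    This particular form is an assumed stand-in for the patent's
    formula (6), which is published only as an equation image.
    """
    Um = U ** m
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return K * (Um * d2).sum(axis=1) / Um.sum(axis=1)
```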
Step 3, randomly initialize the typical value t_ij^(0) of the jth data point x_j belonging to the ith class; define the iteration count as iter and the maximum iteration count as iterMax, and initialize iter = 1; the membership matrix of the iter-th iteration is U^(iter), and the typical value matrix of the iter-th iteration is T^(iter).
Step 4, obtain the parameters α_ijl from formula (3).

Step 5, obtain the parameters β_l from formula (7):

[formula (7): β_l; equation image not reproduced]

Step 6, obtain the kernel weight parameters w_l^(iter) of the iter-th iteration from formula (8):

[formula (8): w_l^(iter); equation image not reproduced]

Step 7, obtain from formula (2) the square of the distance D_ij between the jth data point x_j of the multi-kernel high-dimensional feature space and the cluster center v_i of the ith class of the multi-kernel high-dimensional feature space.

Step 8, obtain the typical value t_ij^(iter) of the jth data point x_j belonging to the ith class in the iter-th iteration from formula (9):

[formula (9): t_ij^(iter); equation image not reproduced]

Step 9, obtain the membership value u_ij^(iter) of the jth data point x_j in the ith class in the iter-th iteration from formula (10):

[formula (10): u_ij^(iter); equation image not reproduced]
Step 10, set a threshold ε and judge whether the algorithm stopping condition ||U^(iter) - U^(iter-1)|| < ε or iter > iterMax holds. If it holds, u_ij^(iter) is the optimal membership value: for the jth data point x_j, take the value of i at which u_ij^(iter) is maximal, so that the data point x_j belongs to the ith class; according to the obtained membership matrix U^(iter), obtain the set Y of all data points belonging to the ith class and calculate

v_i = (1/n') Σ_{j=1}^{n'} y_j

where y_j is the jth data point in the ith-class data point set Y, the formula is the mean value of the set Y, and n' is the total number of data points belonging to the ith class; v_i is then the cluster center of the ith class, and the cluster centers of the other classes are obtained in the same way. Products are recommended to new customers according to the cluster center matrix. If the stopping condition is not satisfied, increase the value of iter by 1 and repeat steps 4 to 10 until the condition is satisfied.
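Steps 3 to 10 alternate four updates until U stabilizes. Because formulas (7) to (10) are published only as equation images, the sketch below leaves those updates as named callback hooks with hypothetical signatures, and fixes only the control flow, stopping test, and final center extraction described in the text; it reuses fcm, pcm_penalty, gaussian_kernels, and combined_distance_sq from the earlier sketches.

```python
import numpy as np

def dwkfcm(X, c, m=2.0, eta=2.0, iter_max=100, eps=1e-5,
           update_weights=None, update_typicality=None, update_membership=None):
    """Control-flow sketch of the DWKFCM loop (steps 3-10).

    The three update callbacks stand in for the patent's formulas (7)-(10),
    which survive only as equation images; everything else mirrors the
    described procedure. Assumes every class ends up non-empty.
    """
    n = X.shape[0]
    U, V = fcm(X, c, m)                            # step 2: FCM seeds U^(0)
    r = pcm_penalty(X, U, V, m)                    # penalty factors r_i (formula (6) stand-in)
    K = gaussian_kernels(X, X.std(axis=0))         # widths sigma_l: assumed heuristic
    T = np.random.default_rng(1).random((c, n))    # step 3: random typical values T^(0)
    for it in range(1, iter_max + 1):
        w = update_weights(K, U, T)                # steps 4-6: alpha_ijl, beta_l, w_l^(iter)
        D2 = combined_distance_sq(K, U, m, w)      # step 7: D_ij^2 in the shape of formula (2)
        T = update_typicality(D2, r, eta)          # step 8: formula (9) hook
        U_new = update_membership(D2, T, m)        # step 9: formula (10) hook
        converged = np.abs(U_new - U).max() < eps  # step 10: ||U^(iter) - U^(iter-1)|| < eps
        U = U_new
        if converged:
            break
    labels = U.argmax(axis=0)                      # assign each point to its max-membership class
    V = np.stack([X[labels == i].mean(axis=0) for i in range(c)])  # v_i = class mean
    return U, T, labels, V
```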
Compared with the prior art, the invention has the beneficial effects that:
1. The double-variant weighted kernel FCM clustering algorithm DWKFCM (Weighted Kernel Fuzzy C-Means with Double Variables) of the invention adopts a multi-kernel method that integrates the advantages of the fuzzy clustering method FCM and of the possibilistic clustering algorithm PCM; it reduces the influence on the experimental results of kernel function selection, to which single-kernel algorithms are sensitive, and it adds the possibilistic concept on top of the multiple kernels, so the algorithm has stronger noise immunity and the obtained clustering results are more accurate.
2. The DWKFCM of the invention adopts a kernel-based algorithm capable of operating on nonlinear data, mapping the nonlinear operations on ordinary data into a high-dimensional data space, which increases the robustness of the algorithm.
3. The DWKFCM of the invention extends multi-kernel clustering to soft clustering: the attribution of the data points is fuzzified while their spatial distribution characteristics are taken into account, and the influence of a data point on a cluster center is judged by calculating the distance between them, the influence becoming smaller the farther the data point lies from the cluster center, so the method has stronger flexibility.
4. Data information grows explosively nowadays, and clustering data is the key to further data analysis and knowledge mining, so the method has good application value.
Drawings
FIG. 1 is a flow chart of the data clustering method based on the double-variant weighted kernel FCM algorithm;
FIG. 2 is a Sammon map of the iris data set.
Detailed Description
Referring to FIG. 1, the data clustering method based on the double-variant weighted kernel FCM algorithm in this embodiment is used for clustering customer information and recommending products to customers according to the clustering results, and proceeds according to the following steps:
step 1, the data set X is the customer information, X = {x_1, x_2, …, x_n}, where x_j is the jth data point, j = 1, 2, …, n, and n is the total number of data points; for example, with 10000 pieces of customer information, n = 10000. Optimally partition the data set X so that the value J of the objective function in formula (1) is minimized:

[formula (1): objective function J; equation image not reproduced]

in formula (1), i denotes the ith class and c denotes the number of classes, i.e. the number of product categories, for example c = 10; 1 ≤ i ≤ c, 2 ≤ c ≤ n; u_ij denotes the membership value of the jth data point x_j in the ith class, u_ij^m is its mth power, and m is a weighting exponent representing the degree of clustering fuzziness, which can take the value 2; t_ij denotes the likelihood typical value of the jth data point x_j with respect to the ith class, t_ij^η is its ηth power, and η is a weighting exponent that controls the proportion between membership and typical values to realize the bivariation; η can take the value 2. D_ij denotes the distance in the multi-kernel high-dimensional feature space between the jth data point x_j and the cluster center v_i of the ith class, and satisfies:

D_ij^2 = Σ_{l=1}^{M} w_l^2 α_ijl    (2)

in formula (2), w_l is the weight parameter of the lth kernel, M is the total number of kernels and takes the number of attributes of the customer information as its value, for example M = 6; Σ_{l=1}^{M} w_l = 1, and α_ijl is expressed by formula (3):

[formula (3): α_ijl; equation image not reproduced]

in formula (3), k_l(x_j, x_q) is the lth Gaussian kernel function, expressed by formula (4):

k_l(x_j, x_q) = exp(-(x_jl - x_ql)^2 / σ_l^2)    (4)

in formula (4), σ_l is the width parameter of the function, calculated from the data by the formula

[width formula for σ_l: equation image not reproduced]

and x_jl, x_ql denote the lth-dimension feature values of the jth and qth data points, respectively;

in formula (1), a parameter of the jth data point x_j satisfies formula (5):

[formula (5): equation image not reproduced]

in formula (5), θ denotes a constant, which can take the value 2; x_z, 1 ≤ z ≤ n, denotes the zth data point; ||x_j - x_z|| denotes the Euclidean distance between the jth data point x_j and the zth data point x_z;

in formula (1), r_i denotes a penalty factor and satisfies formula (6):

[formula (6): r_i; equation image not reproduced]

in formula (6), ||x_j - v_i|| denotes the Euclidean distance between the jth data point and the ith cluster center.
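The embodiment computes each width σ_l from the data by a formula that survives only as an equation image. A common stand-in, offered purely as an assumption, ties σ_l to the spread of the lth feature:

```python
import numpy as np

def kernel_widths(X):
    """Heuristic width parameters: sigma_l proportional to the std of feature l.

    This is an assumed placeholder for the patent's (image-only) width
    formula; any positive per-feature scale works structurally.
    """
    sigma = X.std(axis=0)
    return np.where(sigma > 0, sigma, 1.0)   # guard constant features
```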
Step 2, process the data set X with the fuzzy c-means clustering algorithm FCM to obtain a membership matrix U and cluster centers V:

[membership matrix U and cluster center matrix V: image not reproduced]

Calculate the penalty factors r_i according to formula (6), and take the membership matrix U obtained by the fuzzy c-means clustering algorithm FCM as the initial membership matrix U^(0) of the double-variant weighted kernel FCM algorithm.
Step 3, randomly initialize the typical value t_ij^(0) of the jth data point x_j belonging to the ith class; define the iteration count as iter and the maximum iteration count as iterMax, setting iterMax = 100; initialize iter = 1; the membership matrix of the iter-th iteration is U^(iter), and the typical value matrix of the iter-th iteration is T^(iter).
Step 4, obtain the parameters α_ijl from formula (3).

Step 5, obtain the parameters β_l from formula (7):

[formula (7): β_l; equation image not reproduced]

Step 6, obtain the kernel weight parameters w_l^(iter) of the iter-th iteration from formula (8):

[formula (8): w_l^(iter); equation image not reproduced]

Step 7, obtain from formula (2) the square of the distance D_ij between the jth data point x_j of the multi-kernel high-dimensional feature space and the cluster center v_i of the ith class of the multi-kernel high-dimensional feature space.

Step 8, obtain the typical value t_ij^(iter) of the jth data point x_j belonging to the ith class in the iter-th iteration from formula (9):

[formula (9): t_ij^(iter); equation image not reproduced]

Step 9, obtain the membership value u_ij^(iter) of the jth data point x_j in the ith class in the iter-th iteration from formula (10):

[formula (10): u_ij^(iter); equation image not reproduced]
Step 10, set the threshold ε = 0.00001 and judge whether the algorithm stopping condition ||U^(iter) - U^(iter-1)|| < ε or iter > iterMax holds. If it holds, u_ij^(iter) is the optimal membership value: for the jth data point x_j, take the value of i at which u_ij^(iter) is maximal, so that the data point x_j belongs to the ith class; according to the obtained membership matrix U^(iter), obtain the set Y of all data points belonging to the ith class and calculate

v_i = (1/n') Σ_{j=1}^{n'} y_j

where y_j is the jth data point in the ith-class data point set Y, the formula is the mean value of the set Y, and n' is the total number of data points belonging to the ith class, obtainable from the number of elements in the set Y; v_i is then the cluster center of the ith class, and the cluster centers of the other classes are obtained in the same way. Products are recommended to new customers according to the cluster center matrix. If the stopping condition is not satisfied, increase the value of iter by 1 and repeat steps 4 to 10 until the condition is satisfied.
In this embodiment, catalog marketing is taken as the research object, and the data to be clustered is customer information: the set of all customer information forms the data set to be clustered, and each piece of customer information contains attribute information such as age, income, internet surfing time, gender, constellation, and consumption categories. All customer information is clustered with the data clustering method of this embodiment, and marketing strategies such as recommending specific products and regularly pushing articles purchased by similar customers are then applied to different customers according to the clustering results.
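Before clustering, each customer record must become a numeric vector. The sketch below shows one plausible encoding of the attributes named above; the column names and the one-hot treatment of gender and constellation are illustrative assumptions, not part of the patent:

```python
import pandas as pd

def encode_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Turn raw customer records into a purely numeric feature matrix.

    Numeric columns pass through; categorical ones (gender, constellation)
    are one-hot encoded. All column names are illustrative assumptions.
    """
    numeric = df[["age", "income", "online_hours", "purchase_categories"]]
    categorical = pd.get_dummies(df[["gender", "constellation"]], dtype=float)
    X = pd.concat([numeric, categorical], axis=1)
    return (X - X.mean()) / X.std()          # z-score so no attribute dominates
```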
The clustering method in this embodiment was completed on the following experimental platform: a PC running the Windows 7 operating system, with MATLAB R2015b as the integrated development environment. The hardware conditions were: an Intel Core 3.20 GHz CPU and 8 GB of memory. The parameters m and η of the algorithm were both set to 2.0, and the maximum iteration count iterMax to 100.
In order to verify the performance of the data clustering method DWKFCM based on the double-variant weighted kernel FCM algorithm, four real data sets were used: the iris data set, the wine data set, the glass data set, and the Pima Indian diabetes data set. The specific information of these four data sets is shown in Table 1.
Table 1 summary of the actual data set information used in the experiment
[Table 1: table image not reproduced]
The clustering method DWKFCM of this embodiment, the fuzzy c-means clustering algorithm FCM, the possibilistic clustering algorithm PCM, the possibilistic fuzzy clustering algorithm PFCM, the kernel-based fuzzy clustering algorithm KFCM, and the multi-kernel fuzzy clustering algorithm MKFCM were each used to cluster these data sets. The clustering accuracy CR is taken as the evaluation standard of the clustering effect and is defined as:

CR = (1/n) Σ_{i=1}^{c} a_i

where a_i denotes the number of samples of the ith cluster that are ultimately correctly classified, c denotes the number of clusters, and n denotes the number of samples in the data set. The higher the clustering accuracy, the better the clustering effect of the method. When CR equals 1, the clustering result of the algorithm on the data set is completely correct.
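Because cluster indices are arbitrary, computing the a_i requires matching clusters to true classes first. One standard way to do this (an implementation choice, not something the patent specifies) is a confusion matrix plus the Hungarian assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_true, labels_pred):
    """CR = (1/n) * sum_i a_i after optimally matching clusters to classes.

    The optimal matching step is our implementation choice for recovering a_i.
    """
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    # confusion[i, k] = how many points of class i landed in cluster k
    confusion = np.array([[np.sum((labels_true == ci) & (labels_pred == ck))
                           for ck in clusters] for ci in classes])
    row, col = linear_sum_assignment(-confusion)   # maximize matched counts
    return confusion[row, col].sum() / len(labels_true)
```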
The iris data set contains 150 data objects, each described by 4 attributes: calyx length, calyx width, petal length, and petal width; from these 4 attributes the algorithm predicts to which of three categories (Setosa, Versicolour, Virginica) an iris belongs. As can be seen from the Sammon map of the iris data set in FIG. 2, two of the classes, marked by "Δ" and "o" in the figure, overlap each other and are not easy to separate, which poses a great challenge to a clustering method. The accuracy of the clustering results obtained on the iris data set with the DWKFCM, FCM, PCM, PFCM, KFCM, and MKFCM algorithms is shown in Table 2.
TABLE 2 clustering accuracy of algorithms on Iris data set
Algorithm Cluster accuracy (CR)
FCM 0.877
PCM 0.668
PFCM 0.808
KFCM 0.914
MKFCM 0.932
DWKFCM 0.959
It can be seen from table 2 that the accuracy of the DWKFCM algorithm is improved by 8.2%, 29.1%, 15.1%, 4.5% and 2.7% over FCM, PCM, PFCM, KFCM and MKFCM, respectively. The performance of the DWKFCM algorithm is therefore best.
The glass data set is characterized by glass material and contains 214 data objects; each object is represented by 9 attributes and can be divided into 6 classes of different sizes. The accuracy of the clustering results obtained on the glass data set with the DWKFCM, FCM, PCM, PFCM, KFCM, and MKFCM algorithms is shown in Table 3.
TABLE 3 clustering accuracy of algorithms on glass datasets
Algorithm Cluster accuracy (CR)
FCM 0.813
PCM 0.739
PFCM 0.846
KFCM 0.888
MKFCM 0.901
DWKFCM 0.953
The clustering results in table 3 show that the clustering accuracy of the DWKFCM algorithm is improved by 14%, 21.4%, 10.7%, 6.5% and 5.2% respectively over FCM, PCM, PFCM, KFCM and MKFCM. The performance of the DWKFCM algorithm is better.
The wine data set is characterized by analyses of wine composition and contains 178 data objects; each object contains 13 attributes and is classified into 3 types, the largest number of attributes among the iris, glass, and wine data sets. The accuracy of the clustering results obtained on the wine data set with the DWKFCM, FCM, PCM, PFCM, KFCM, and MKFCM algorithms is shown in Table 4.
TABLE 4 clustering accuracy of algorithms on wine datasets
Algorithm Cluster accuracy (CR)
FCM 0.708
PCM 0.409
PFCM 0.688
KFCM 0.777
MKFCM 0.841
DWKFCM 0.925
The clustering results in table 4 show that the clustering accuracy of the DWKFCM algorithm is improved by 21.7%, 51.6%, 23.7%, 14.8% and 8.4% respectively over FCM, PCM, PFCM, KFCM and MKFCM. The performance of the DWKFCM algorithm is better.
The diabetes data set is a diabetes diagnosis data set from the field of medicine. The data set contains 768 instances, each containing 8 attributes: number of pregnancies (pregnant times), plasma glucose concentration 2 h after a meal, diastolic blood pressure (mmHg), triceps skin fold thickness, serum insulin (μU/ml), body mass index, diabetes pedigree function (family history), and age. From these 8 attributes the algorithm determines whether a subject is diabetic. The data set consists of statistics for 500 subjects without the disease and 268 patients.
The accuracy of the clustering results obtained by using the DWKFCM algorithm, the FCM algorithm, the PCM algorithm, the PFCM algorithm, the KFCM algorithm, and the MKFCM algorithm on the diabetes data set is shown in table 5.
TABLE 5 clustering accuracy of algorithms on diabetes data sets
Algorithm Cluster accuracy (CR)
FCM 0.582
PCM 0.355
PFCM 0.754
KFCM 0.773
MKFCM 0.831
DWKFCM 0.934
As can be seen from Table 5, the DWKFCM of the present invention achieves 93.4% accuracy in diagnosing diabetes on the diabetes data set, while the accuracy of the other algorithms is below 90%; the PCM algorithm even has a clustering accuracy of only 35.5%. The clustering results in Table 5 show that the clustering accuracy of the DWKFCM algorithm is improved by 35.2%, 57.9%, 18%, 16.1% and 10.3% over FCM, PCM, PFCM, KFCM and MKFCM, respectively. The performance of the DWKFCM algorithm is better.
In conclusion, the double-variant weighted kernel fuzzy c-means clustering algorithm DWKFCM can be applied well to fields such as marketing, iris classification, wine classification, glass classification, and diabetes diagnosis; it reliably mines the information in data and has high practicability.

Claims (1)

1. A data clustering method based on a double-variant weighted kernel FCM algorithm, used for clustering customer information and recommending products to customers according to the clustering results, characterized in that the method comprises the following steps:
step 1, the data set X is the customer information, X = {x_1, x_2, …, x_n}, where x_j is the jth data point, j = 1, 2, …, n, and n is the total number of data points; optimally partition the data set X so that the value J of the objective function in formula (1) is minimized:

[formula (1): objective function J; equation image not reproduced]

in formula (1), i denotes the ith class, c denotes the number of classes, 1 ≤ i ≤ c, 2 ≤ c ≤ n; u_ij denotes the membership value of the jth data point x_j in the ith class, u_ij^m is its mth power, and m is a weighting exponent representing the degree of clustering fuzziness; t_ij denotes the likelihood typical value of the jth data point x_j with respect to the ith class, t_ij^η is its ηth power, and η is a weighting exponent that controls the proportion between membership and typical values to realize the bivariation; D_ij denotes the distance in the multi-kernel high-dimensional feature space between the jth data point x_j and the cluster center v_i of the ith class, and satisfies:

D_ij^2 = Σ_{l=1}^{M} w_l^2 α_ijl    (2)

in formula (2), w_l is the weight parameter of the lth kernel, M is the total number of kernels, Σ_{l=1}^{M} w_l = 1, and α_ijl is expressed by formula (3):

[formula (3): α_ijl; equation image not reproduced]

in formula (3), k_l(x_j, x_q) is the lth Gaussian kernel function, expressed by formula (4):

k_l(x_j, x_q) = exp(-(x_jl - x_ql)^2 / σ_l^2)    (4)

in formula (4), σ_l is the width parameter of the function, and x_jl, x_ql denote the lth-dimension feature values of the jth and qth data points, respectively;

in formula (1), a parameter of the jth data point x_j satisfies formula (5):

[formula (5): equation image not reproduced]

in formula (5), θ represents a constant; x_z, 1 ≤ z ≤ n, denotes the zth data point; ||x_j - x_z|| denotes the Euclidean distance between the jth data point x_j and the zth data point x_z;

in formula (1), r_i denotes a penalty factor and satisfies formula (6):

[formula (6): r_i; equation image not reproduced]

in formula (6), ||x_j - v_i|| denotes the Euclidean distance between the jth data point and the ith cluster center;
step 2, process the data set X with the fuzzy c-means clustering algorithm FCM to obtain a membership matrix U and cluster centers V:

[membership matrix U and cluster center matrix V: image not reproduced]

calculate the penalty factors r_i according to formula (6), and take the membership matrix U obtained by the fuzzy c-means clustering algorithm FCM as the initial membership matrix U^(0) of the double-variant weighted kernel FCM algorithm;
Step 3, randomly initialize the typical value t_ij^(0) of the jth data point x_j belonging to the ith class; define the iteration count as iter and the maximum iteration count as iterMax, and initialize iter = 1; the membership matrix of the iter-th iteration is U^(iter), and the typical value matrix of the iter-th iteration is T^(iter);
Step 4, obtain the parameters α_ijl from formula (3);

Step 5, obtain the parameters β_l from formula (7):

[formula (7): β_l; equation image not reproduced]

Step 6, obtain the kernel weight parameters w_l^(iter) of the iter-th iteration from formula (8):

[formula (8): w_l^(iter); equation image not reproduced]

Step 7, obtain from formula (2) the square of the distance D_ij between the jth data point x_j of the multi-kernel high-dimensional feature space and the cluster center v_i of the ith class of the multi-kernel high-dimensional feature space;

Step 8, obtain the typical value t_ij^(iter) of the jth data point x_j belonging to the ith class in the iter-th iteration from formula (9):

[formula (9): t_ij^(iter); equation image not reproduced]

Step 9, obtain the membership value u_ij^(iter) of the jth data point x_j in the ith class in the iter-th iteration from formula (10):

[formula (10): u_ij^(iter); equation image not reproduced]
Step 10, set a threshold ε and judge whether the algorithm stopping condition ||U^(iter) - U^(iter-1)|| < ε or iter > iterMax holds; if it holds, u_ij^(iter) is the optimal membership value: for the jth data point x_j, take the value of i at which u_ij^(iter) is maximal, so that the data point x_j belongs to the ith class; according to the obtained membership matrix U^(iter), obtain the set Y of all data points belonging to the ith class and calculate

v_i = (1/n') Σ_{j=1}^{n'} y_j

where y_j is the jth data point in the ith-class data point set Y, the formula is the mean value of the set Y, and n' is the total number of data points belonging to the ith class; v_i is then the cluster center of the ith class, and the cluster centers of the other classes are obtained in the same way; recommend products to new customers according to the cluster center matrix; if the stopping condition is not satisfied, increase the value of iter by 1 and repeat steps 4 to 10 until the condition is satisfied.
CN201810636707.XA 2018-06-20 2018-06-20 Data clustering method based on double-variant weighted kernel FCM algorithm Active CN108763590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810636707.XA CN108763590B (en) 2018-06-20 2018-06-20 Data clustering method based on double-variant weighted kernel FCM algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810636707.XA CN108763590B (en) 2018-06-20 2018-06-20 Data clustering method based on double-variant weighted kernel FCM algorithm

Publications (2)

Publication Number Publication Date
CN108763590A CN108763590A (en) 2018-11-06
CN108763590B true CN108763590B (en) 2021-07-27

Family

ID=63979218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810636707.XA Active CN108763590B (en) 2018-06-20 2018-06-20 Data clustering method based on double-variant weighted kernel FCM algorithm

Country Status (1)

Country Link
CN (1) CN108763590B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670537A (en) * 2018-12-03 2019-04-23 济南大学 The full attribute weight fuzzy clustering method of multicore based on quasi- Monte Carlo feature
CN111367901B (en) * 2020-02-27 2024-04-02 智慧航海(青岛)科技有限公司 Ship data denoising method
CN111476236B (en) * 2020-04-09 2023-07-21 湖南城市学院 Self-adaptive FCM license plate positioning method and system
CN112541407B (en) * 2020-08-20 2022-05-13 同济大学 Visual service recommendation method based on user service operation flow
CN113283242B (en) * 2021-05-31 2024-04-26 西安理工大学 Named entity recognition method based on combination of clustering and pre-training model
CN113268333B (en) * 2021-06-21 2024-03-19 成都锋卫科技有限公司 Hierarchical clustering algorithm optimization method based on multi-core computing
CN116723136B (en) * 2023-08-09 2023-11-03 南京华飞数据技术有限公司 Network data detection method applying FCM clustering algorithm
CN117112871B (en) * 2023-10-19 2024-01-05 南京华飞数据技术有限公司 Data real-time efficient fusion processing method based on FCM clustering algorithm model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195734B1 (en) * 2006-11-27 2012-06-05 The Research Foundation Of State University Of New York Combining multiple clusterings by soft correspondence
CN107203785A (en) * 2017-06-02 2017-09-26 常州工学院 Multipath Gaussian kernel Fuzzy c-Means Clustering Algorithm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049636A (en) * 2012-09-12 2013-04-17 江苏大学 Method and system for possibly fuzzy K-harmonic means clustering
CN105894024A (en) * 2016-03-29 2016-08-24 合肥工业大学 Possibility fuzzy c mean clustering algorithm based on multiple kernels
CN106846326A (en) * 2017-01-17 2017-06-13 合肥工业大学 Image partition method based on multinuclear local message FCM algorithms
CN107220977B (en) * 2017-06-06 2019-08-30 合肥工业大学 The image partition method of Validity Index based on fuzzy clustering

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195734B1 (en) * 2006-11-27 2012-06-05 The Research Foundation Of State University Of New York Combining multiple clusterings by soft correspondence
CN107203785A (en) * 2017-06-02 2017-09-26 常州工学院 Multipath Gaussian kernel Fuzzy c-Means Clustering Algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fuzzy clustering with optimized-parameters multiple Gaussian Kernels; Issam Dagher; IEEE; 2015-11-30; full text *
Research on intuitionistic fuzzy kernel clustering algorithm based on particle swarm optimization; Yu Xiaodong; Journal on Communications; 2015-05-25; Vol. 36, No. 05; pp. 78-84 *

Also Published As

Publication number Publication date
CN108763590A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108763590B (en) Data clustering method based on double-variant weighted kernel FCM algorithm
Dudoit et al. Classification in microarray experiments
DeSarbo et al. Synthesized clustering: A method for amalgamating alternative clustering bases with differential weighting of variables
Wang et al. On fuzzy cluster validity indices
Steinley et al. Evaluating mixture modeling for clustering: recommendations and cautions.
Patil et al. Hybrid prediction model for type-2 diabetic patients
Fonseca et al. Mixture-model cluster analysis using information theoretical criteria
Hunt et al. Theory & Methods: Mixture model clustering using the MULTIMIX program
Yang et al. Unsupervised fuzzy model-based Gaussian clustering
Albergante et al. Estimating the effective dimension of large biological datasets using Fisher separability analysis
Witten et al. Supervised multidimensional scaling for visualization, classification, and bipartite ranking
Long et al. Boosting and microarray data
CN113889192B (en) Single-cell RNA-seq data clustering method based on deep noise reduction self-encoder
Mukhopadhyay Large-scale mode identification and data-driven sciences
CN111652303A (en) Outlier detection method based on spectral clustering under non-independent same distribution
CN110400610B (en) Small sample clinical data classification method and system based on multichannel random forest
Vengatesan et al. The performance analysis of microarray data using occurrence clustering
Jena et al. An integrated novel framework for coping missing values imputation and classification
Miller et al. Emergent unsupervised clustering paradigms with potential application to bioinformatics
CN111582370B (en) Brain metastasis tumor prognostic index reduction and classification method based on rough set optimization
CN115985503B (en) Cancer prediction system based on ensemble learning
CN117195027A (en) Cluster weighted clustering integration method based on member selection
CN110991517A (en) Classification method and system for unbalanced data set in stroke
Ragunthar et al. Classification of gene expression data with optimized feature selection
CN114358191A (en) Gene expression data clustering method based on depth automatic encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant