CN110880015B - Distributed integrated clustering analysis method based on fuzzy C-means - Google Patents

Distributed integrated clustering analysis method based on fuzzy C-means

Info

Publication number
CN110880015B
CN110880015B (application CN201910981453.XA)
Authority
CN
China
Prior art keywords
data
clustering
data set
cluster
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910981453.XA
Other languages
Chinese (zh)
Other versions
CN110880015A (en)
Inventor
母亚双 (Mu Yashuang)
王利东 (Wang Lidong)
刘晓东 (Liu Xiaodong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology
Priority to CN201910981453.XA priority Critical patent/CN110880015B/en
Publication of CN110880015A publication Critical patent/CN110880015A/en
Application granted
Publication of CN110880015B publication Critical patent/CN110880015B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a distributed integrated clustering analysis method based on fuzzy C-means, and belongs to the technical field of machine learning and big data analysis. Based on fuzzy C-means theory, and aiming at the bottlenecks that traditional clustering analysis methods face when processing large-scale data sets, the data are randomly partitioned under a Map-Reduce distributed computing model, the cluster center of each data block is extracted, and the cluster centers of the blocks are integrated and fused, finally completing the clustering analysis of the large-scale data. The invention performs distributed integrated analysis of the large-scale data clustering problem, ensuring clustering accuracy while keeping clustering time low.

Description

Distributed integrated clustering analysis method based on fuzzy C-means
Technical Field
The invention relates to a distributed integrated clustering analysis method based on fuzzy C-means, and belongs to the technical field of machine learning and big data analysis.
Background
With the continuous progress of science and technology, the rapid development of the Internet and increasingly mature database technology, new data is generated continuously in every industry of human society; for example, more and more data accumulate in shopping, dining, tourism, medical care and the like, and the big data era has arrived. Big data is now closely tied to our lives. Traditional data analysis methods work well on limited or small amounts of data, but faced with such massive data they encounter an extremely serious data-explosion problem. At the hardware level, although computer hardware has developed rapidly, the storage, management and analysis capability of a single computer still cannot meet the demands of big data; at the software level, the size of big data has surpassed the ability of traditional data analysis methods to capture, manage and process it within a tolerable time.
Since massive data contains valuable information and has an important influence on social production and daily life, scientific research on its analysis and processing is receiving increasing attention from all sectors of society. Many scholars describe the complexity of big data in three dimensions (the 3Vs: Volume, data capacity; Velocity, data input-output speed; Variety, data types and sources) and point to the increasingly serious challenges and opportunities these bring. Big data is a new kind of information asset that requires high capacity, high speed and diversity and that can enhance decision making, discovery and process optimization. The ordinary data studied in classical machine learning differs essentially from big data in the 3Vs; moreover, classical algorithms generally need to traverse all samples many times and to read the data into memory, so classical machine learning algorithms can be expected to face the 3Vs disaster of big data. How to combine hardware and software technology to analyze and process big data effectively, that is, large-scale data mining based on both hardware and software, has become a research hotspot of the current era and is one of the problems that must be solved urgently to advance the discipline.
Fuzzy C-means (FCM) clustering is one of the best-known algorithms in the field of data analysis. The FCM algorithm has the advantages of fast clustering, high accuracy and few parameters, is currently one of the most popular clustering algorithms, and is widely applied in practical fields such as credit assessment, healthcare and traffic management. With the rapid growth of data generated in daily life, using the FCM model to analyze and process big data is inevitable. The traditional FCM algorithm achieves good results on small- and medium-scale data sets, but it cannot directly handle the large-scale data clustering problem, mainly for the following reasons:
Memory limitation: the memory of each computer is fixed, and for a large-scale data set it is very difficult to store all, or even most, of the training samples in the memory of a single computer.
Time complexity: for big data, the analysis and processing performed by the algorithm is very time-consuming, and it is very difficult to complete the experimental process within an acceptable time frame.
Data complexity: the high dimensionality and multi-modal nature of large-scale data sets make experiment design very difficult and can also strongly affect the performance of the experimental results.
For the above reasons, parallel or distributed computing is a common and reliable choice for many machine learning algorithms that analyze and process big data. First, parallel computing solves the data storage problem by spreading the data over multiple computing resources; in addition, it lets programs run on multiple machines simultaneously, which greatly improves the execution efficiency of algorithms. Although some existing parallel implementations of FCM can handle big data analysis, they share a common trait: parallel computing is applied locally, that is, to each iteration of the FCM algorithm. Owing to differences in hardware configuration, delays in network communication and imbalance in data distribution, a large amount of communication overhead arises among the computers in a cluster, so the clustering efficiency of these algorithms is not high.
In order to more accurately mine the internal behavior of the traditional FCM method during parallel or distributed clustering, to meet the growing demands of big data analysis and processing, and to guarantee clustering accuracy while keeping clustering time low, the invention proposes a novel distributed integrated clustering algorithm.
Disclosure of Invention
The invention aims to provide a distributed integrated clustering analysis method based on fuzzy C-means that solves the bottlenecks of traditional clustering analysis methods on large-scale data, ensuring clustering accuracy while shortening clustering time in the data analysis process.
The invention adopts the Map-Reduce distributed framework: the original data are first randomly divided into several sub-data sets across a computer cluster; the fuzzy C-means algorithm is then applied to cluster each sub-data set; finally, fuzzy C-means clustering is applied once more to the fused cluster centers of all sub-data sets to form the final cluster centers. This completes the clustering analysis of large-scale data while ensuring clustering accuracy and low clustering time.
To achieve this aim, based on real labeled data sets commonly used in the field of data analysis, the data are randomly divided under a distributed computing framework, and the divided data are clustered and fused by the classical FCM algorithm. On this basis the invention provides a distributed integrated clustering analysis method based on fuzzy C-means.
The technical scheme of the invention is as follows: under a Map-Reduce distributed computing framework, and targeting the factors that limit the traditional FCM algorithm in big-data clustering analysis, the original data are randomly divided in a distributed manner; the traditional FCM algorithm is applied to cluster each partition and determine its cluster centers; the cluster centers are then integrated and fused, and the traditional FCM algorithm is applied once more to extract the final cluster centers, completing the distributed clustering analysis process for large-scale data.
The integrated clustering analysis method comprises three layers under the distributed framework Map-Reduce; the 1st layer is introduced in two steps. The specific implementation steps are as follows:
Step1, carrying out randomized ordering on the data set X (Layer 1);
the Step1 mainly has the function of randomly ordering a data set in a distributed mode under a Map-Reduce frame, the method can overcome the problems of insufficient memory, high time complexity and the like in the random arrangement of the traditional serial algorithm, and the specific implementation steps under the Map-Reduce model are,
Step1.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step1.2, in the jth Mapper function, for each sample x i ∈X j Randomly generating a random integer, using the random integer as key of the Mapper function, and sampling x i Value as the Mapper function;
Step1.3, in the Reducer function, aggregating the samples from the Mapper functions according to their keys and storing the aggregated sample data sequentially into a data set, finally forming a randomized sample data set.
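As an illustration only, Step1 could be realized with a Hadoop-style Mapper/Reducer pair like the following minimal sketch (not the patent's literal implementation: class names are hypothetical, samples are assumed to arrive as text lines, and the job driver is omitted):

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step1 sketch: key every sample with a random integer so that the
// Map-Reduce shuffle phase delivers the samples in randomized order.
public class RandomShuffle {

  public static class ShuffleMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random rng = new Random();

    @Override
    protected void map(LongWritable offset, Text sample, Context ctx)
        throws IOException, InterruptedException {
      // Random key: samples reach the Reducer sorted by this key,
      // i.e. in random order.
      ctx.write(new IntWritable(rng.nextInt(Integer.MAX_VALUE)), sample);
    }
  }

  public static class ShuffleReducer
      extends Reducer<IntWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> samples, Context ctx)
        throws IOException, InterruptedException {
      for (Text s : samples) {
        ctx.write(s, NullWritable.get()); // store the samples sequentially
      }
    }
  }
}
```

With a single Reducer this yields one globally randomized data set; with several Reducers the randomization holds within each Reducer's output.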
Step2, dividing the data set X into m′ sub-data sets X′_1, X′_2, ..., X′_{m′} (Layer 1);
The main function of Step2 is to divide the data set into several sub-data sets in a distributed manner under the Map-Reduce framework. This avoids problems such as insufficient memory and high time complexity caused by dividing the data directly; used together with Step1, it achieves a distributed random division of large-scale data. The specific implementation steps under the Map-Reduce model are as follows (an illustrative code sketch is given after Step2.3):
Step2.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system, and setting the final data set X to be partitioned into m′ sub-data sets;
Step2.2, in the j-th Mapper function, for each sample x_i ∈ X_j, computing the remainder r_i = i mod m′, using the remainder r_i as the key of the Mapper function, and using the sample x_i as the value of the Mapper function;
Step2.3, in the Reducer function, aggregating the samples from the Mapper functions according to their keys and storing the aggregated samples sequentially into different sub-data sets, finally forming m′ sample data sets X′_1, X′_2, ..., X′_{m′}.
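A matching sketch of Step2, again hypothetical rather than the patent's literal code, keys every sample with its sequence number modulo the number of sub-data sets m′ (a per-Mapper running counter stands in for the global sample index of the randomized data set):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step2 sketch: route sample i to sub-data set (i mod mPrime).
public class ModuloPartition {
  static final int M_PRIME = 3; // illustrative number of sub-data sets

  public static class PartitionMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private long i = 0; // running sample index, local to this Mapper

    @Override
    protected void map(LongWritable offset, Text sample, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new IntWritable((int) (i++ % M_PRIME)), sample);
    }
  }

  public static class PartitionReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable subset, Iterable<Text> samples, Context ctx)
        throws IOException, InterruptedException {
      // All samples sharing a key form one sub-data set X'_subset;
      // a real job could write each group to its own output file.
      for (Text s : samples) {
        ctx.write(subset, s);
      }
    }
  }
}
```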
Step3, extracting the final cluster centers c_1, c_2, ..., c_c (Layer 2);
The main function of Step3 is, under the Map-Reduce framework, to extract the cluster centers of each sub-data set, then aggregate these centers and extract the final cluster centers. This overcomes the insufficient-memory problem of traditional clustering algorithms on large-scale data. The specific implementation steps under the Map-Reduce model are as follows (an illustrative sketch of the FCM routine is given after Step3.3):
Step3.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step3.2 in the jth Mapper function, for dataset X j Determining a data set X by applying a classical FCM clustering algorithm j Cluster-like center of (c) 1 (X j ),c 2 (X j ),...,c c (X j ) Here, the number of class clusters is set as the number of classes in the data set X, the key of the Mapper function is designated as null, and the class cluster center c is set as 1 (X j ),c 2 (X j ),...,c c (X j ) Value as the Mapper function;
Step3.3, in the Reducer function, aggregating the centers from the Mapper functions to form a new data set, applying the classical FCM algorithm again to determine the cluster centers of this new data set, and finally forming the cluster centers c_1, c_2, ..., c_c of the sample data X.
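The classical FCM routine that Step3 invokes twice (once per sub-data set in the Mappers, and once more on the pooled centers in the Reducer) alternates the standard updates c_j = Σ_i u_ij^m x_i / Σ_i u_ij^m and u_ij = 1 / Σ_l (d_ij / d_il)^(2/(m-1)). A minimal serial sketch, assuming Euclidean distance, random membership initialization and a user-chosen fuzzifier m (commonly 2):

```java
import java.util.Random;

// Plain FCM, used twice in Step3: once per sub-data set in the Mappers,
// and once more in the Reducer on the pooled centers.
public class Fcm {
  /** Returns c cluster centers of data (n x d) with fuzzifier m. */
  public static double[][] cluster(double[][] data, int c, double m,
                                   int maxIter, double eps) {
    int n = data.length, d = data[0].length;
    double[][] u = new double[n][c];
    Random rng = new Random(42);
    for (int i = 0; i < n; i++) {            // random initial memberships
      double sum = 0;
      for (int j = 0; j < c; j++) { u[i][j] = rng.nextDouble(); sum += u[i][j]; }
      for (int j = 0; j < c; j++) u[i][j] /= sum;
    }
    double[][] centers = new double[c][d];
    for (int it = 0; it < maxIter; it++) {
      // Center update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m
      for (int j = 0; j < c; j++) {
        double denom = 0;
        double[] num = new double[d];
        for (int i = 0; i < n; i++) {
          double w = Math.pow(u[i][j], m);
          denom += w;
          for (int k = 0; k < d; k++) num[k] += w * data[i][k];
        }
        for (int k = 0; k < d; k++) centers[j][k] = num[k] / denom;
      }
      // Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
      double shift = 0;
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < c; j++) {
          double dij = dist(data[i], centers[j]), s = 0;
          for (int l = 0; l < c; l++)
            s += Math.pow(dij / dist(data[i], centers[l]), 2.0 / (m - 1));
          double nu = 1.0 / s;
          shift = Math.max(shift, Math.abs(nu - u[i][j]));
          u[i][j] = nu;
        }
      }
      if (shift < eps) break;                // memberships converged
    }
    return centers;
  }

  private static double dist(double[] a, double[] b) {
    double s = 0;
    for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
    return Math.sqrt(s) + 1e-12;             // guard against division by zero
  }
}
```

Because each Mapper forwards only its c centers, the Reducer receives just c center vectors per sub-data set regardless of the sample count, which is where the method saves communication compared with parallelizing every FCM iteration.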
Step4, dividing the samples in X into different clusters X_1, X_2, ..., X_c according to the cluster centers (Layer 3).
The main function of Step4 is to compute, under the distributed framework, the distance from each sample point to the nearest cluster center; through this computation the method assigns different sample points to different clusters. The specific implementation steps under the Map-Reduce model are as follows (an illustrative sketch follows Step4.2):
Step4.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step4.2 in the jth Mapper function, for sample x i ∈X j Compute to cluster center c 1 ,c 2 ,...,c c The subscript of the cluster center closest to the cluster center is used as a key, and the sample x is used as i Value as a function of the Mapper. Inputting sample data into different class cluster sets X according to different keys 1 ,X 2 ,...,X c In (1).
The invention has the beneficial effects that:
the invention carries out distributed integrated analysis on the clustering problem of large-scale data, realizes the aim of ensuring the clustering precision and simultaneously ensuring lower clustering time consumption in the data clustering analysis process, can overcome the problems of memory limitation, time complexity limitation, data complexity limitation and the like in the traditional clustering analysis algorithm along with the increase of data quantity in actual production, provides a powerful clustering analysis tool for the field of machine learning and data analysis facing to big data, and further provides decision support for actual application.
Drawings
FIG. 1 is an overall hierarchical diagram of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: as shown in Fig. 1, this embodiment takes the Iris data set from the UCI database, which contains 150 samples in 3 classes, as an example of the fuzzy C-means distributed integrated clustering analysis method. The specific steps of the distributed integrated clustering method are as follows:
Step1, carrying out randomized ordering on the data set X (Layer 1);
Step1.1, assuming that the data set X can be partitioned into 2 sub-data sets X_1 and X_2 in the distributed system, with 75 samples each, so the system generates two Mapper functions for processing;
Step1.2, in the 1st Mapper function, for each sample x_i ∈ X_1, generating a random integer, using it as the key of the Mapper function, and using the sample x_i as the value of the Mapper function; the same operation applies to the 2nd Mapper function;
Step1.3, in the Reducer function, aggregating the samples from the 1st and 2nd Mapper functions according to their keys and storing the aggregated sample data sequentially into a data set, finally forming a randomized data set of 150 samples.
Step2, dividing the data set X into 3 sub-data sets X_1, X_2, X_3 (Layer 1);
Step2.1, assume that dataset X can be partitioned into 2 sub-datasets X in a distributed system 1 And X 2 The number of samples is 75 and 75 respectively, so that the system generates two Mapper functions to process, and the final data set X is set to be divided into
Figure BDA0002235319740000053
A sub-data set;
Step2.2, in the 1st Mapper function, for each sample x_i ∈ X_1, computing the remainder r_i = i mod 3 (so samples whose indices leave the same remainder modulo 3 are routed to the same sub-data set), using the remainder r_i as the key of the Mapper function, and using the sample x_i as the value of the Mapper function; the same operation applies to the 2nd Mapper function;
Step2.3, in the Reducer function, aggregating the samples from the 1st and 2nd Mapper functions according to their keys and storing the aggregated samples sequentially into different sub-data sets, finally forming 3 sample data sets X_1, X_2, X_3 of 50 samples each.
Step3, extracting the final cluster centers c_1, c_2, c_3 (Layer 2);
Step3.1, assuming that the data set X has been divided into 3 sample data sets X_1, X_2, X_3 by Step1 and Step2 in the distributed system; the system then generates 3 Mapper functions for processing;
Step3.2, in the 1st Mapper function, applying the classical FCM clustering algorithm to the data set X_1 to determine its cluster centers c_1(X_1), c_2(X_1), c_3(X_1); here the number of clusters is set to the number of classes in the data set X; the key of the Mapper function is designated as null, and the cluster centers c_1(X_1), c_2(X_1), c_3(X_1) are used as the value of the Mapper function; the same operation applies to the 2nd and 3rd Mapper functions;
Step3.3, in the Reducer function, aggregating the centers from the Mapper functions to form a new data set, applying the classical FCM algorithm again to determine the cluster centers of this new data set, and finally forming the cluster centers c_1, c_2, c_3 of the sample data X.
Step4, dividing the samples in X into different clusters X_1, X_2, X_3 according to the cluster centers (Layer 3).
Step4.1, assuming that the data set X can be partitioned into 2 sub-data sets X_1 and X_2 in the distributed system, with 75 samples each, so the system generates two Mapper functions for processing;
Step4.2, in the 1st Mapper function, for each sample x_i ∈ X_1, computing its distances to the cluster centers c_1, c_2, c_3, using the subscript of the nearest cluster center as the key and the sample x_i as the value of the Mapper function; the sample data are then placed into the different cluster sets X_1, X_2, X_3 according to their keys. The same applies to the 2nd Mapper function.
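To make Layer 2 of this example concrete, the following toy driver (assuming the Fcm.cluster sketch given under Step3, with small 2-D stand-in data rather than the real Iris samples) pools the per-subset centers and clusters them once more:

```java
public class Layer2Demo {
  public static void main(String[] args) {
    // Toy stand-ins for the three sub-data sets of Example 1 (2-D, not Iris).
    double[][][] subsets = {
      {{1.0, 1.1}, {0.9, 1.0}, {5.0, 5.1}, {9.0, 9.2}},
      {{1.1, 0.9}, {5.1, 4.9}, {5.0, 5.2}, {9.1, 8.9}},
      {{0.8, 1.2}, {4.9, 5.0}, {9.2, 9.0}, {8.8, 9.1}}
    };
    int c = 3;        // number of clusters = number of classes
    double m = 2.0;   // fuzzifier

    // Mapper side: extract c centers from every sub-data set.
    double[][] pooled = new double[subsets.length * c][];
    int p = 0;
    for (double[][] s : subsets)
      for (double[] center : Fcm.cluster(s, c, m, 100, 1e-6))
        pooled[p++] = center;

    // Reducer side: cluster the pooled centers to get the final centers.
    double[][] finals = Fcm.cluster(pooled, c, m, 100, 1e-6);
    for (double[] f : finals)
      System.out.printf("final center: (%.2f, %.2f)%n", f[0], f[1]);
  }
}
```

Run as-is, this should print three final centers roughly near (1, 1), (5, 5) and (9, 9), mirroring how the Reducer of Layer 2 condenses the Mappers' centers into c final ones.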
Example 2: to compare the performance of the proposed fuzzy C-means based distributed integrated clustering analysis method (LP-FCM) with traditional clustering algorithms, this embodiment compares the LP-FCM algorithm against the traditional K-means, K-medoids and FCM algorithms. The traditional algorithms are implemented with the MATLAB toolbox (software version R2015b), with the parameter 'Start' (where present) set to 'sample' and the other parameters left at the toolbox defaults. The data sets used in the comparison are 20 frequently used data sets from the UCI database; their details are shown in Table 1 below:
Table 1: detailed information of the 20 data sets (the table is provided as an image in the original publication)
The K-means, K-medoids, FCM and LP-FCM algorithms were applied to cluster the data sets of Table 1; after 50 clustering experiments, the average clustering accuracy for each data set is shown in Table 2 (bold results indicate that the corresponding algorithm achieves the best clustering effect on that data set):
Table 2: comparative analysis on the 20 data sets (the table is provided as an image in the original publication)
As the results in Table 2 show, the K-means, K-medoids and FCM algorithms obtain the best clustering results on 5, 5 and 1 data sets respectively, while the proposed LP-FCM algorithm obtains the best results on 8 data sets. Meanwhile, in average accuracy over these data sets, the LP-FCM algorithm is superior to the other clustering algorithms.
Example 3: to examine the execution-time characteristics of the proposed fuzzy C-means based distributed integrated clustering analysis method (LP-FCM), this embodiment studies clustering time on the Covertype data set. The Covertype data set has 54 condition attributes and a decision attribute with 7 classes; the original data set contains 581,012 samples in total. To show the parallelism of the proposed algorithm clearly, this embodiment samples the original data set with the bootstrap technique to generate 6 data sets of different sizes; their details are shown in Table 3 below:
Table 3: detailed information of the 6 data sets (the table is provided as an image in the original publication)
The LP-FCM algorithm of the invention is implemented in Java, and the program is executed on a small cluster comprising 1 master computer (Intel Core i5-4590 3.30 GHz, 8 GB RAM, Ubuntu 14.04.1 LTS (64-bit) OS) and 5 worker computers (Intel Core i3 3.93 GHz, 4 GB RAM, Ubuntu 14.04.1 LTS (64-bit) OS). Clustering analysis was performed on the data sets of Table 3 with different numbers of processors (Mapper numbers); the running times are shown in Table 4 below, where the mark "-" indicates that an out-of-memory warning occurred while the program ran.
Table 4: execution time (seconds) of the proposed algorithm on the different data sets (the table is provided as an image in the original publication)
From the results in Table 4 it can be seen that: (1) in most cases the running time of the clustering analysis decreases gradually as the number of processors increases, but for the Covertype (25 MB) data set the clustering time with 6 processors exceeds that with 5 processors, because communication overhead among the processors takes up a large share of the total clustering time; (2) as the data volume grows, smaller numbers of processors cannot process the data sets directly because of memory limitations, which can be resolved by increasing the number of processors.
The above embodiments first describe specific implementations of the invention in detail with reference to the drawings, and then analyze its practical effect in terms of both clustering accuracy and clustering time. The invention is not limited to the above embodiments and effect analyses; various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (3)

1. A distributed integrated clustering analysis method based on fuzzy C-means, characterized in that: under a Map-Reduce distributed computing model, the data are first randomly partitioned in a distributed manner, the cluster center of each block of data is then extracted, and the cluster centers of the blocks are integrated and fused, finally completing the clustering analysis of large-scale data, wherein the integrated clustering analysis method comprises three layers under the distributed framework Map-Reduce, and the specific processing steps are as follows:
Step1, carrying out randomized ordering on the data set X, Layer 1;
Step2, dividing the data set X into m′ sub-data sets X′_1, X′_2, ..., X′_{m′}, Layer 1;
Step3, extracting the final cluster centers c_1, c_2, ..., c_c, Layer 2;
Step4, dividing the samples in X into different clusters X_1, X_2, ..., X_c according to the cluster centers, Layer 3;
The specific steps of Step3 under the Map-Reduce model are as follows:
Step3.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step3.2 in the jth Mapper function, on data set X j Determining a data set X using FCM clustering algorithm j Cluster-like center of (c) 1 (X j ),c 2 (X j ),...,c c (X j ) Here, the number of the class clusters is set as the number of classes in the data set X, the key of the Mapper function is designated as null, and the class cluster center c is set as 1 (X j ),c 2 (X j ),...,c c (X j ) Value as the Mapper function;
Step3.3, in the Reducer function, aggregating the centers from the Mapper functions to form a new data set, applying the FCM algorithm again to determine the cluster centers of the new data set, and finally forming the cluster centers c_1, c_2, ..., c_c of the sample data X;
The specific steps of Step4 under the Map-Reduce model are as follows:
Step4.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step4.2 in the jth Mapper function, for sample x i ∈X j Compute to cluster center c 1 ,c 2 ,...,c c The subscript of the cluster center closest to the cluster center is used as a key, and the sample x is used as i Inputting sample data into different class cluster sets X according to the key as the value of the Mapper function 1 ,X 2 ,...,X c In (1).
2. The fuzzy C-means based distributed integrated cluster analysis method according to claim 1, wherein the specific steps of Step1 under the Map-Reduce model are as follows:
Step1.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step1.2, in the j-th Mapper function, for each sample x_i ∈ X_j, generating a random integer, using it as the key of the Mapper function, and using the sample x_i as the value of the Mapper function;
Step1.3, in the Reducer function, aggregating the samples from the Mapper functions according to their keys and storing the aggregated sample data sequentially into a data set, finally forming a randomized sample data set.
3. The fuzzy C-means based distributed integrated cluster analysis method according to claim 1, wherein Step2 under the Map-Reduce model comprises the following specific steps:
Step2.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system, and setting the final data set X to be partitioned into m′ sub-data sets;
Step2.2, in the j-th Mapper function, for each sample x_i ∈ X_j, computing the remainder r_i = i mod m′, using the remainder r_i as the key of the Mapper function, and using the sample x_i as the value of the Mapper function;
Step2.3, in the Reducer function, aggregating the samples from the Mapper functions according to their keys and storing the aggregated samples sequentially into different sub-data sets, finally forming m′ sample data sets X′_1, X′_2, ..., X′_{m′}.
CN201910981453.XA 2019-10-16 2019-10-16 Distributed integrated clustering analysis method based on fuzzy C-means Active CN110880015B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910981453.XA CN110880015B (en) 2019-10-16 2019-10-16 Distributed integrated clustering analysis method based on fuzzy C-means

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910981453.XA CN110880015B (en) 2019-10-16 2019-10-16 Distributed integrated clustering analysis method based on fuzzy C-means

Publications (2)

Publication Number Publication Date
CN110880015A CN110880015A (en) 2020-03-13
CN110880015B true CN110880015B (en) 2023-04-07

Family

ID=69728467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910981453.XA Active CN110880015B (en) 2019-10-16 2019-10-16 Distributed integrated clustering analysis method based on fuzzy C-means

Country Status (1)

Country Link
CN (1) CN110880015B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814979B (en) * 2020-07-06 2024-02-23 河南工业大学 Fuzzy set automatic dividing method based on dynamic programming
CN114362973B (en) * 2020-09-27 2023-02-28 中国科学院软件研究所 K-means and FCM clustering combined flow detection method and electronic device


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550744A (en) * 2015-12-06 2016-05-04 北京工业大学 Nerve network clustering method based on iteration

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844294A (en) * 2016-03-21 2016-08-10 全球能源互联网研究院 Electricity usage behavior analysis method based on FCM cluster algorithm
CN107330458A (en) * 2017-06-27 2017-11-07 常州信息职业技术学院 A kind of fuzzy C-means clustering method of minimum variance clustering of optimizing initial centers
CN107480694A (en) * 2017-07-06 2017-12-15 重庆邮电大学 Three clustering methods are integrated using the weighting selection evaluated twice based on Spark platforms

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张海建 (Zhang Haijian). Application of a cloud-platform-based hierarchical clustering algorithm in the coal industry. Coal Technology, 2013(12). *
李兰英 (Li Lanying); 董义明 (Dong Yiming); 孔银 (Kong Yin); 周秋丽 (Zhou Qiuli). Research on MapReduce parallelization of an improved K-means algorithm. Journal of Harbin University of Science and Technology, 2016(01). *

Also Published As

Publication number Publication date
CN110880015A (en) 2020-03-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant