CN110880015B - Distributed integrated clustering analysis method based on fuzzy C-means - Google Patents

Distributed integrated clustering analysis method based on fuzzy C-means

Info

Publication number
CN110880015B
CN110880015B (application CN201910981453.XA)
Authority
CN
China
Prior art keywords
data
clustering
data set
cluster
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910981453.XA
Other languages
Chinese (zh)
Other versions
CN110880015A (en)
Inventor
母亚双 (Mu Yashuang)
王利东 (Wang Lidong)
刘晓东 (Liu Xiaodong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology
Priority to CN201910981453.XA priority Critical patent/CN110880015B/en
Publication of CN110880015A publication Critical patent/CN110880015A/en
Application granted
Publication of CN110880015B publication Critical patent/CN110880015B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a distributed integrated clustering analysis method based on fuzzy C-means, and belongs to the technical field of machine learning and big data analysis. Based on fuzzy C-means theory, and aiming at the bottlenecks that traditional clustering analysis methods face when processing large-scale data sets, the data are randomly partitioned under a Map-Reduce distributed computing model, the cluster center of each data block is extracted, and the cluster centers of the blocks are integrated and fused, finally completing the clustering analysis of the large-scale data. The invention performs distributed integrated analysis of the large-scale data clustering problem, ensuring clustering accuracy while keeping clustering time low.

Description

Distributed integrated clustering analysis method based on fuzzy C-means
Technical Field
The invention relates to a distributed integrated clustering analysis method based on fuzzy C-means, and belongs to the technical field of machine learning and big data analysis.
Background
With the continuous progress of science and technology, the rapid development of the Internet and increasingly mature database technology, new data is generated continuously in every industry of human society; for example, more and more data accumulate in shopping, dining, tourism, medical care and the like, and the big data era has arrived. Big data is now closely tied to our lives. Traditional data analysis methods work well on limited or small amounts of data, but faced with such massive data they encounter an extremely serious data-explosion problem. At the hardware level, although computer hardware has developed rapidly, the storage, management and analysis capability of a single computer still cannot meet the demands of big data; at the software level, the size of big data has surpassed the ability of traditional data analysis methods to capture, manage and process it within a tolerable time.
Since massive data contains valuable information and has an important influence on social production and daily life, scientific research on its analysis and processing is receiving increasing attention from all sectors of society. Many scholars describe the complexity of big data in three dimensions (the 3Vs: Volume, data capacity; Velocity, data input-output speed; Variety, data types and sources) and point to the increasingly serious challenges and opportunities these bring. Big data is a new kind of information asset that requires high capacity, high speed and diversity and that can enhance decision making, discovery and process optimization. The ordinary data studied in classical machine learning differs essentially from big data in the 3Vs; moreover, classical algorithms generally need to traverse all samples many times and to read the data into memory, so classical machine learning algorithms can be expected to face the 3Vs disaster of big data. How to combine hardware and software technology to analyze and process big data effectively, that is, large-scale data mining based on both hardware and software, has become a research hotspot of the current era and is one of the problems that must be solved urgently to advance the discipline.
Fuzzy C-means (FCM) clustering is one of the best-known algorithms in the field of data analysis. The FCM algorithm has the advantages of fast clustering, high accuracy and few parameters, is currently one of the most popular clustering algorithms, and is widely applied in practical fields such as credit assessment, healthcare and traffic management. With the rapid growth of data generated in daily life, using the FCM model to analyze and process big data is inevitable. The traditional FCM algorithm achieves good results on small- and medium-scale data sets, but it cannot directly handle the large-scale data clustering problem, mainly for the following reasons:
Memory limitation: the memory of each computer is fixed, and for a large-scale data set it is very difficult to store all, or even most, of the training samples in the memory of a single computer.
Time complexity: for big data, the analysis and processing performed by the algorithm is very time-consuming, and it is very difficult to complete the experimental process within an acceptable time frame.
Data complexity: the high dimensionality and multi-modal nature of large-scale data sets make experiment design very difficult and can also strongly affect the performance of the experimental results.
For the above reasons, parallel or distributed computing is a common and reliable choice for many machine learning algorithms that analyze and process big data. First, parallel computing solves the data storage problem by spreading the data over multiple computing resources; in addition, it lets programs run on multiple machines simultaneously, which greatly improves the execution efficiency of algorithms. Although some existing parallel implementations of FCM can handle big data analysis, they share a common trait: parallel computing is applied locally, that is, to each iteration of the FCM algorithm. Owing to differences in hardware configuration, delays in network communication and imbalance in data distribution, a large amount of communication overhead arises among the computers in a cluster, so the clustering efficiency of these algorithms is not high.
In order to more accurately mine the internal behavior of the traditional FCM method during parallel or distributed clustering, to meet the growing demands of big data analysis and processing, and to guarantee clustering accuracy while keeping clustering time low, the invention proposes a novel distributed integrated clustering algorithm.
Disclosure of Invention
The invention aims to provide a distributed integrated clustering analysis method based on fuzzy C-means that solves the bottlenecks of traditional clustering analysis methods on large-scale data, ensuring clustering accuracy while shortening clustering time in the data analysis process.
The invention adopts the Map-Reduce distributed framework: the original data are first randomly divided into several sub-data sets across a computer cluster; the fuzzy C-means algorithm is then applied to cluster each sub-data set; finally, fuzzy C-means clustering is applied once more to the fused cluster centers of all sub-data sets to form the final cluster centers. This completes the clustering analysis of large-scale data while ensuring clustering accuracy and low clustering time.
To achieve this aim, based on real labeled data sets commonly used in the field of data analysis, the data are randomly divided under a distributed computing framework, and the divided data are clustered and fused by the classical FCM algorithm. On this basis the invention provides a distributed integrated clustering analysis method based on fuzzy C-means.
The technical scheme of the invention is as follows: under a Map-Reduce distributed computing framework, and targeting the factors that limit the traditional FCM algorithm in big-data clustering analysis, the original data are randomly divided in a distributed manner; the traditional FCM algorithm is applied to cluster each partition and determine its cluster centers; the cluster centers are then integrated and fused, and the traditional FCM algorithm is applied once more to extract the final cluster centers, completing the distributed clustering analysis process for large-scale data.
The integrated clustering analysis method comprises three layers under the distributed framework Map-Reduce; the 1st layer is introduced in two steps. The specific implementation steps are as follows:
Step1, carrying out randomized ordering on the data set X (Layer 1);
the Step1 mainly has the function of randomly ordering a data set in a distributed mode under a Map-Reduce frame, the method can overcome the problems of insufficient memory, high time complexity and the like in the random arrangement of the traditional serial algorithm, and the specific implementation steps under the Map-Reduce model are,
Step1.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step1.2, in the jth Mapper function, for each sample x i ∈X j Randomly generating a random integer, using the random integer as key of the Mapper function, and sampling x i Value as the Mapper function;
Step1.3, in the Reducer function, aggregating the samples from the Mapper functions according to their keys and storing the aggregated sample data sequentially into a data set, finally forming a randomized sample data set.
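As an illustration only, Step1 could be realized with a Hadoop-style Mapper/Reducer pair like the following minimal sketch (not the patent's literal implementation: class names are hypothetical, samples are assumed to arrive as text lines, and the job driver is omitted):

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step1 sketch: key every sample with a random integer so that the
// Map-Reduce shuffle phase delivers the samples in randomized order.
public class RandomShuffle {

  public static class ShuffleMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random rng = new Random();

    @Override
    protected void map(LongWritable offset, Text sample, Context ctx)
        throws IOException, InterruptedException {
      // Random key: samples reach the Reducer sorted by this key,
      // i.e. in random order.
      ctx.write(new IntWritable(rng.nextInt(Integer.MAX_VALUE)), sample);
    }
  }

  public static class ShuffleReducer
      extends Reducer<IntWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> samples, Context ctx)
        throws IOException, InterruptedException {
      for (Text s : samples) {
        ctx.write(s, NullWritable.get()); // store the samples sequentially
      }
    }
  }
}
```

With a single Reducer this yields one globally randomized data set; with several Reducers the randomization holds within each Reducer's output.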
Step2, dividing the data set X into m′ sub-data sets X′_1, X′_2, ..., X′_{m′} (Layer 1);
The main function of Step2 is to divide the data set into several sub-data sets in a distributed manner under the Map-Reduce framework. This avoids problems such as insufficient memory and high time complexity caused by dividing the data directly; used together with Step1, it achieves a distributed random division of large-scale data. The specific implementation steps under the Map-Reduce model are as follows (an illustrative code sketch is given after Step2.3):
Step2.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system, and setting the final data set X to be partitioned into m′ sub-data sets;
Step2.2, in the j-th Mapper function, for each sample x_i ∈ X_j, computing the remainder r_i = i mod m′, using the remainder r_i as the key of the Mapper function, and using the sample x_i as the value of the Mapper function;
Step2.3, in the Reducer function, aggregating the samples from the Mapper functions according to their keys and storing the aggregated samples sequentially into different sub-data sets, finally forming m′ sample data sets X′_1, X′_2, ..., X′_{m′}.
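A matching sketch of Step2, again hypothetical rather than the patent's literal code, keys every sample with its sequence number modulo the number of sub-data sets m′ (a per-Mapper running counter stands in for the global sample index of the randomized data set):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step2 sketch: route sample i to sub-data set (i mod mPrime).
public class ModuloPartition {
  static final int M_PRIME = 3; // illustrative number of sub-data sets

  public static class PartitionMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private long i = 0; // running sample index, local to this Mapper

    @Override
    protected void map(LongWritable offset, Text sample, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new IntWritable((int) (i++ % M_PRIME)), sample);
    }
  }

  public static class PartitionReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable subset, Iterable<Text> samples, Context ctx)
        throws IOException, InterruptedException {
      // All samples sharing a key form one sub-data set X'_subset;
      // a real job could write each group to its own output file.
      for (Text s : samples) {
        ctx.write(subset, s);
      }
    }
  }
}
```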
Step3, extracting the final cluster centers c_1, c_2, ..., c_c (Layer 2);
The main function of Step3 is, under the Map-Reduce framework, to extract the cluster centers of each sub-data set, then aggregate these centers and extract the final cluster centers. This overcomes the insufficient-memory problem of traditional clustering algorithms on large-scale data. The specific implementation steps under the Map-Reduce model are as follows (an illustrative sketch of the FCM routine is given after Step3.3):
Step3.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step3.2 in the jth Mapper function, for dataset X j Determining a data set X by applying a classical FCM clustering algorithm j Cluster-like center of (c) 1 (X j ),c 2 (X j ),...,c c (X j ) Here, the number of class clusters is set as the number of classes in the data set X, the key of the Mapper function is designated as null, and the class cluster center c is set as 1 (X j ),c 2 (X j ),...,c c (X j ) Value as the Mapper function;
Step3.3, in the Reducer function, aggregating the centers from the Mapper functions to form a new data set, applying the classical FCM algorithm again to determine the cluster centers of this new data set, and finally forming the cluster centers c_1, c_2, ..., c_c of the sample data X.
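The classical FCM routine that Step3 invokes twice (once per sub-data set in the Mappers, and once more on the pooled centers in the Reducer) alternates the standard updates c_j = Σ_i u_ij^m x_i / Σ_i u_ij^m and u_ij = 1 / Σ_l (d_ij / d_il)^(2/(m-1)). A minimal serial sketch, assuming Euclidean distance, random membership initialization and a user-chosen fuzzifier m (commonly 2):

```java
import java.util.Random;

// Plain FCM, used twice in Step3: once per sub-data set in the Mappers,
// and once more in the Reducer on the pooled centers.
public class Fcm {
  /** Returns c cluster centers of data (n x d) with fuzzifier m. */
  public static double[][] cluster(double[][] data, int c, double m,
                                   int maxIter, double eps) {
    int n = data.length, d = data[0].length;
    double[][] u = new double[n][c];
    Random rng = new Random(42);
    for (int i = 0; i < n; i++) {            // random initial memberships
      double sum = 0;
      for (int j = 0; j < c; j++) { u[i][j] = rng.nextDouble(); sum += u[i][j]; }
      for (int j = 0; j < c; j++) u[i][j] /= sum;
    }
    double[][] centers = new double[c][d];
    for (int it = 0; it < maxIter; it++) {
      // Center update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m
      for (int j = 0; j < c; j++) {
        double denom = 0;
        double[] num = new double[d];
        for (int i = 0; i < n; i++) {
          double w = Math.pow(u[i][j], m);
          denom += w;
          for (int k = 0; k < d; k++) num[k] += w * data[i][k];
        }
        for (int k = 0; k < d; k++) centers[j][k] = num[k] / denom;
      }
      // Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
      double shift = 0;
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < c; j++) {
          double dij = dist(data[i], centers[j]), s = 0;
          for (int l = 0; l < c; l++)
            s += Math.pow(dij / dist(data[i], centers[l]), 2.0 / (m - 1));
          double nu = 1.0 / s;
          shift = Math.max(shift, Math.abs(nu - u[i][j]));
          u[i][j] = nu;
        }
      }
      if (shift < eps) break;                // memberships converged
    }
    return centers;
  }

  private static double dist(double[] a, double[] b) {
    double s = 0;
    for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
    return Math.sqrt(s) + 1e-12;             // guard against division by zero
  }
}
```

Because each Mapper forwards only its c centers, the Reducer receives just c center vectors per sub-data set regardless of the sample count, which is where the method saves communication compared with parallelizing every FCM iteration.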
Step4, dividing the samples in X into different clusters X_1, X_2, ..., X_c according to the cluster centers (Layer 3).
The main function of Step4 is to compute, under the distributed framework, the distance from each sample point to the nearest cluster center; through this computation the method assigns different sample points to different clusters. The specific implementation steps under the Map-Reduce model are as follows (an illustrative sketch follows Step4.2):
Step4.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step4.2 in the jth Mapper function, for sample x i ∈X j Compute to cluster center c 1 ,c 2 ,...,c c The subscript of the cluster center closest to the cluster center is used as a key, and the sample x is used as i Value as a function of the Mapper. Inputting sample data into different class cluster sets X according to different keys 1 ,X 2 ,...,X c In (1).
The invention has the beneficial effects that:
the invention carries out distributed integrated analysis on the clustering problem of large-scale data, realizes the aim of ensuring the clustering precision and simultaneously ensuring lower clustering time consumption in the data clustering analysis process, can overcome the problems of memory limitation, time complexity limitation, data complexity limitation and the like in the traditional clustering analysis algorithm along with the increase of data quantity in actual production, provides a powerful clustering analysis tool for the field of machine learning and data analysis facing to big data, and further provides decision support for actual application.
Drawings
FIG. 1 is an overall hierarchical diagram of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: as shown in Fig. 1, this embodiment takes the Iris data set from the UCI database, which contains 150 samples in 3 classes, as an example of the fuzzy C-means distributed integrated clustering analysis method. The specific steps of the distributed integrated clustering method are as follows:
Step1, carrying out randomized ordering on the data set X (Layer 1);
Step1.1, assuming that the data set X can be partitioned into 2 sub-data sets X_1 and X_2 in the distributed system, with 75 samples each, so the system generates two Mapper functions for processing;
Step1.2, in the 1st Mapper function, for each sample x_i ∈ X_1, generating a random integer, using it as the key of the Mapper function, and using the sample x_i as the value of the Mapper function; the same operation applies to the 2nd Mapper function;
Step1.3, in the Reducer function, aggregating the samples from the 1st and 2nd Mapper functions according to their keys and storing the aggregated sample data sequentially into a data set, finally forming a randomized data set of 150 samples.
Step2, dividing the data set X into 3 sub-data sets X_1, X_2, X_3 (Layer 1);
Step2.1, assume that dataset X can be partitioned into 2 sub-datasets X in a distributed system 1 And X 2 The number of samples is 75 and 75 respectively, so that the system generates two Mapper functions to process, and the final data set X is set to be divided into
Figure BDA0002235319740000053
A sub-data set;
Step2.2, in the 1st Mapper function, for each sample x_i ∈ X_1, computing the remainder r_i = i mod 3 (so samples whose indices leave the same remainder modulo 3 are routed to the same sub-data set), using the remainder r_i as the key of the Mapper function, and using the sample x_i as the value of the Mapper function; the same operation applies to the 2nd Mapper function;
Step2.3, in the Reducer function, aggregating the samples from the 1st and 2nd Mapper functions according to their keys and storing the aggregated samples sequentially into different sub-data sets, finally forming 3 sample data sets X_1, X_2, X_3 of 50 samples each.
Step3, extracting the final cluster centers c_1, c_2, c_3 (Layer 2);
Step3.1, assuming that the data set X has been divided into 3 sample data sets X_1, X_2, X_3 by Step1 and Step2 in the distributed system; the system then generates 3 Mapper functions for processing;
Step3.2, in the 1st Mapper function, applying the classical FCM clustering algorithm to the data set X_1 to determine its cluster centers c_1(X_1), c_2(X_1), c_3(X_1); here the number of clusters is set to the number of classes in the data set X; the key of the Mapper function is designated as null, and the cluster centers c_1(X_1), c_2(X_1), c_3(X_1) are used as the value of the Mapper function; the same operation applies to the 2nd and 3rd Mapper functions;
Step3.3, in the Reducer function, aggregating the centers from the Mapper functions to form a new data set, applying the classical FCM algorithm again to determine the cluster centers of this new data set, and finally forming the cluster centers c_1, c_2, c_3 of the sample data X.
Step4, dividing the samples in X into different clusters X_1, X_2, X_3 according to the cluster centers (Layer 3).
Step4.1, assuming that the data set X can be partitioned into 2 sub-data sets X_1 and X_2 in the distributed system, with 75 samples each, so the system generates two Mapper functions for processing;
Step4.2, in the 1st Mapper function, for each sample x_i ∈ X_1, computing its distances to the cluster centers c_1, c_2, c_3, using the subscript of the nearest cluster center as the key and the sample x_i as the value of the Mapper function; the sample data are then placed into the different cluster sets X_1, X_2, X_3 according to their keys. The same applies to the 2nd Mapper function.
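To make Layer 2 of this example concrete, the following toy driver (assuming the Fcm.cluster sketch given under Step3, with small 2-D stand-in data rather than the real Iris samples) pools the per-subset centers and clusters them once more:

```java
public class Layer2Demo {
  public static void main(String[] args) {
    // Toy stand-ins for the three sub-data sets of Example 1 (2-D, not Iris).
    double[][][] subsets = {
      {{1.0, 1.1}, {0.9, 1.0}, {5.0, 5.1}, {9.0, 9.2}},
      {{1.1, 0.9}, {5.1, 4.9}, {5.0, 5.2}, {9.1, 8.9}},
      {{0.8, 1.2}, {4.9, 5.0}, {9.2, 9.0}, {8.8, 9.1}}
    };
    int c = 3;        // number of clusters = number of classes
    double m = 2.0;   // fuzzifier

    // Mapper side: extract c centers from every sub-data set.
    double[][] pooled = new double[subsets.length * c][];
    int p = 0;
    for (double[][] s : subsets)
      for (double[] center : Fcm.cluster(s, c, m, 100, 1e-6))
        pooled[p++] = center;

    // Reducer side: cluster the pooled centers to get the final centers.
    double[][] finals = Fcm.cluster(pooled, c, m, 100, 1e-6);
    for (double[] f : finals)
      System.out.printf("final center: (%.2f, %.2f)%n", f[0], f[1]);
  }
}
```

Run as-is, this should print three final centers roughly near (1, 1), (5, 5) and (9, 9), mirroring how the Reducer of Layer 2 condenses the Mappers' centers into c final ones.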
Example 2: to compare the performance of the proposed fuzzy C-means based distributed integrated clustering analysis method (LP-FCM) with traditional clustering algorithms, this embodiment compares the LP-FCM algorithm against the traditional K-means, K-medoids and FCM algorithms. The traditional algorithms are implemented with the MATLAB toolbox (software version R2015b), with the parameter 'Start' (where present) set to 'sample' and the other parameters left at the toolbox defaults. The data sets used in the comparison are 20 frequently used data sets from the UCI database; their details are shown in Table 1 below:
Table 1: detailed information of the 20 data sets (the table is provided as an image in the original publication)
The K-means, K-medoids, FCM and LP-FCM algorithms were applied to cluster the data sets of Table 1; after 50 clustering experiments, the average clustering accuracy for each data set is shown in Table 2 (bold results indicate that the corresponding algorithm achieves the best clustering effect on that data set):
Table 2: comparative analysis on the 20 data sets (the table is provided as an image in the original publication)
As the results in Table 2 show, the K-means, K-medoids and FCM algorithms obtain the best clustering results on 5, 5 and 1 data sets respectively, while the proposed LP-FCM algorithm obtains the best results on 8 data sets. Meanwhile, in average accuracy over these data sets, the LP-FCM algorithm is superior to the other clustering algorithms.
Example 3: to examine the execution-time characteristics of the proposed fuzzy C-means based distributed integrated clustering analysis method (LP-FCM), this embodiment studies clustering time on the Covertype data set. The Covertype data set has 54 condition attributes and a decision attribute with 7 classes; the original data set contains 581,012 samples in total. To show the parallelism of the proposed algorithm clearly, this embodiment samples the original data set with the bootstrap technique to generate 6 data sets of different sizes; their details are shown in Table 3 below:
Table 3: detailed information of the 6 data sets (the table is provided as an image in the original publication)
The LP-FCM algorithm of the invention is implemented in Java, and the program is executed on a small cluster comprising 1 master computer (Intel Core i5-4590 3.30 GHz, 8 GB RAM, Ubuntu 14.04.1 LTS (64-bit) OS) and 5 worker computers (Intel Core i3 3.93 GHz, 4 GB RAM, Ubuntu 14.04.1 LTS (64-bit) OS). Clustering analysis was performed on the data sets of Table 3 with different numbers of processors (Mapper numbers); the running times are shown in Table 4 below, where the mark "-" indicates that an out-of-memory warning occurred while the program ran.
Table 4: execution time (seconds) of the proposed algorithm on the different data sets (the table is provided as an image in the original publication)
From the results in Table 4 it can be seen that: (1) in most cases the running time of the clustering analysis decreases gradually as the number of processors increases, but for the Covertype (25 MB) data set the clustering time with 6 processors exceeds that with 5 processors, because communication overhead among the processors takes up a large share of the total clustering time; (2) as the data volume grows, smaller numbers of processors cannot process the data sets directly because of memory limitations, which can be resolved by increasing the number of processors.
The above embodiments first describe specific implementations of the invention in detail with reference to the drawings, and then analyze its practical effect in terms of both clustering accuracy and clustering time. The invention is not limited to the above embodiments and effect analyses; various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (3)

1. A distributed integrated clustering analysis method based on fuzzy C-means, characterized in that: under a Map-Reduce distributed computing model, the data are first randomly partitioned in a distributed manner, the cluster center of each block of data is then extracted, and the cluster centers of the blocks are integrated and fused, finally completing the clustering analysis of large-scale data, wherein the integrated clustering analysis method comprises three layers under the distributed framework Map-Reduce, and the specific processing steps are as follows:
Step1, carrying out randomized ordering on the data set X, Layer 1;
Step2, dividing the data set X into m′ sub-data sets X′_1, X′_2, ..., X′_{m′}, Layer 1;
Step3, extracting the final cluster centers c_1, c_2, ..., c_c, Layer 2;
Step4, dividing the samples in X into different clusters X_1, X_2, ..., X_c according to the cluster centers, Layer 3;
The specific steps of Step3 under the Map-Reduce model are as follows:
Step3.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step3.2 in the jth Mapper function, on data set X j Determining a data set X using FCM clustering algorithm j Cluster-like center of (c) 1 (X j ),c 2 (X j ),...,c c (X j ) Here, the number of the class clusters is set as the number of classes in the data set X, the key of the Mapper function is designated as null, and the class cluster center c is set as 1 (X j ),c 2 (X j ),...,c c (X j ) Value as the Mapper function;
Step3.3, in the Reducer function, aggregating the centers from the Mapper functions to form a new data set, applying the FCM algorithm again to determine the cluster centers of the new data set, and finally forming the cluster centers c_1, c_2, ..., c_c of the sample data X;
The specific steps of Step4 under the Map-Reduce model are as follows:
Step4.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step4.2 in the jth Mapper function, for sample x i ∈X j Compute to cluster center c 1 ,c 2 ,...,c c The subscript of the cluster center closest to the cluster center is used as a key, and the sample x is used as i Inputting sample data into different class cluster sets X according to the key as the value of the Mapper function 1 ,X 2 ,...,X c In (1).
2. The fuzzy C-means based distributed integrated cluster analysis method according to claim 1, wherein the specific steps of Step1 under the Map-Reduce model are as follows:
Step1.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system;
Step1.2, in the j-th Mapper function, for each sample x_i ∈ X_j, generating a random integer, using it as the key of the Mapper function, and using the sample x_i as the value of the Mapper function;
Step1.3, in the Reducer function, aggregating the samples from the Mapper functions according to their keys and storing the aggregated sample data sequentially into a data set, finally forming a randomized sample data set.
3. The fuzzy C-means based distributed integrated cluster analysis method according to claim 1, wherein Step2 under the Map-Reduce model comprises the following specific steps:
Step2.1, assuming that the data set X can be partitioned into m sub-data sets X_1, X_2, ..., X_m in the distributed system, and setting the final data set X to be partitioned into m′ sub-data sets;
Step2.2, in the j-th Mapper function, for each sample x_i ∈ X_j, computing the remainder r_i = i mod m′, using the remainder r_i as the key of the Mapper function, and using the sample x_i as the value of the Mapper function;
Step2.3, in the Reducer function, aggregating the samples from the Mapper functions according to their keys and storing the aggregated samples sequentially into different sub-data sets, finally forming m′ sample data sets X′_1, X′_2, ..., X′_{m′}.
CN201910981453.XA 2019-10-16 2019-10-16 Distributed integrated clustering analysis method based on fuzzy C-means Active CN110880015B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910981453.XA CN110880015B (en) 2019-10-16 2019-10-16 Distributed integrated clustering analysis method based on fuzzy C-means

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910981453.XA CN110880015B (en) 2019-10-16 2019-10-16 Distributed integrated clustering analysis method based on fuzzy C-means

Publications (2)

Publication Number Publication Date
CN110880015A CN110880015A (en) 2020-03-13
CN110880015B true CN110880015B (en) 2023-04-07

Family

ID=69728467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910981453.XA Active CN110880015B (en) 2019-10-16 2019-10-16 Distributed integrated clustering analysis method based on fuzzy C-means

Country Status (1)

Country Link
CN (1) CN110880015B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814979B (en) * 2020-07-06 2024-02-23 河南工业大学 Fuzzy set automatic dividing method based on dynamic programming
CN114362973B (en) * 2020-09-27 2023-02-28 中国科学院软件研究所 K-means and FCM clustering combined flow detection method and electronic device


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550744A (en) * 2015-12-06 2016-05-04 北京工业大学 Nerve network clustering method based on iteration

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844294A (en) * 2016-03-21 2016-08-10 全球能源互联网研究院 Electricity usage behavior analysis method based on FCM cluster algorithm
CN107330458A (en) * 2017-06-27 2017-11-07 常州信息职业技术学院 A kind of fuzzy C-means clustering method of minimum variance clustering of optimizing initial centers
CN107480694A (en) * 2017-07-06 2017-12-15 重庆邮电大学 Three clustering methods are integrated using the weighting selection evaluated twice based on Spark platforms

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张海建 (Zhang Haijian). Application of a cloud-platform-based hierarchical clustering algorithm in the coal industry. Coal Technology, 2013(12). *
李兰英 (Li Lanying); 董义明 (Dong Yiming); 孔银 (Kong Yin); 周秋丽 (Zhou Qiuli). Research on MapReduce parallelization of an improved K-means algorithm. Journal of Harbin University of Science and Technology, 2016(01). *

Also Published As

Publication number Publication date
CN110880015A (en) 2020-03-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant