CN107423764A

CN107423764A - K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce

Info

Publication number: CN107423764A
Application number: CN201710619794.3A
Authority: CN
Inventors: 王霞; 康春阳
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2017-07-26
Filing date: 2017-07-26
Publication date: 2017-12-01

Abstract

The invention discloses a K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce. With the number of clusters unknown, the method analyzes a large data set block by block using a MapReduce-based improvement of NSS-AKmeans, obtaining the number of clusters and the cluster centers of each subset. The subset results are then merged to obtain the number of clusters of the data set and initial cluster centers close to the true values. Finally, starting from these initial cluster centers, the standard K-Means algorithm completes the cluster analysis of the large data set. The invention removes the requirement of Hadoop-based K-Means algorithms that the number of clusters be known in advance, and because the computed initial cluster centers are more accurate, it reduces the number of K-Means iterations in the third MapReduce job.

Description

K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce

Technical field

The present invention relates to cluster analysis in machine learning, and in particular to a K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce.

Background art

With the arrival of the big data era, the sharp growth of data poses a great challenge to data analysis methods. Applying traditional machine learning methods directly to large data sets runs into various problems.

K-Means, one of the ten most commonly used machine learning algorithms, is widely applied. It can be used for data analysis on its own or as a component of other learning tasks. K-Means requires initial cluster centers to be chosen, and the quality of the chosen centers strongly affects the clustering result. Hadoop is a distributed system architecture that uses a cluster of machines for high-speed computation and storage, which makes it important for the analysis and processing of big data. The core of the Hadoop framework consists of HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it. The MapReduce parallel model can greatly improve the running efficiency of K-Means and makes processing big data much more convenient.

Many research results have been proposed on implementing and improving the parallelization of K-Means on MapReduce. However, existing MapReduce implementations of K-Means retain the inherent shortcomings of K-Means. The quality of the initial cluster centers supplied as input strongly affects the final clustering result, yet existing MapReduce-based K-Means methods offer only limited improvement in selecting initial cluster centers, so the number of K-Means iterations remains high, and the number of clusters must also be known in advance. The following documents make some improvements to the implementation of K-Means on MapReduce.

Document 1. Chaturbhuj, Kaustubh S., and Gauri Chaudhary. "Parallel clustering of large data set on Hadoop using data mining techniques." Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), World Conference on. IEEE, 2016.

Document 2. Moertini, Veronica S., and Liptia Venica. "Enhancing parallel k-means using map reduce for discovering knowledge from big data." Cloud Computing and Big Data Analysis (ICCCBDA), 2016 IEEE International Conference on. IEEE, 2016.

Document 1 determines the initial cluster centers with a PSO search algorithm, thereby reducing the number of K-Means iterations. Although the PSO search finds good initial cluster centers, the PSO algorithm is implemented outside the Hadoop platform, and the number of clusters must be known.

Document 2 completes the cluster analysis with two MapReduce jobs. In the first MapReduce job, the data set is sampled to obtain a subset, the subset is clustered with the K-Means algorithm, and the resulting cluster centers of the subset are used as the initial cluster centers. In the second MapReduce job, the K-Means algorithm completes the cluster analysis starting from these initial cluster centers. The drawback of this algorithm is that, although the initial cluster centers are closer to the true cluster centers than randomly selected ones, the difference is still fairly large. As a result, the number of K-Means iterations in the second MapReduce job remains high. This algorithm also requires the number of clusters to be known.

The main problem of the algorithms proposed in the above documents is that the number of clusters of the data set must be known and cannot be obtained by the algorithm itself. In addition, because the initial cluster centers they produce are far from the true cluster centers, the number of K-Means iterations needed to compute the final cluster centers of the data set remains high.

Summary of the invention

The object of the invention is to provide a K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce, in order to solve the problems noted in the background art that the number of clusters must be known and that the initial cluster centers are not accurate enough. Compared with existing methods, this method, implemented on MapReduce, automatically selects the number of clusters and obtains more accurate initial cluster centers when clustering large data sets.

To achieve the above object, the invention adopts the following technical solution:

A K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce, comprising the following steps:

(1) In the first MapReduce job, the numeric data set is preprocessed, including data cleaning, normalization, and rearrangement;

(2) The output of the first MapReduce job is fed into the second MapReduce job. In the second MapReduce job, each block of the input data set is sampled to obtain a subset, each subset is clustered with the NSS-AKmeans algorithm to obtain its cluster centers, and these cluster centers are then analyzed and merged to obtain the initial cluster centers;

(3) In the third MapReduce job, starting from the initial cluster centers already obtained, the parallelized standard K-Means algorithm on MapReduce completes the cluster analysis of the data set.

A further improvement of the invention is that, in step (1), the data set is randomly shuffled so that the data entries are randomly distributed.

A further improvement of the invention is that, in step (2), in the second MapReduce job, each block of the input data set is sampled to obtain a subset of between 5,000 and 10,000 entries.

A further improvement of the invention is that step (2) is implemented as follows:

1) Each subset is clustered with the NSS-AKmeans algorithm to obtain its number of clusters and cluster centers. Let (n_1, n_2, …, n_n) be the numbers of clusters of the subsets and let K be their mode; K is taken as the number of clusters of the data set. Results from subsets whose number of clusters differs from K are deleted, so that each remaining subset j contributes K cluster centers p_{j1}, p_{j2}, …, p_{jK};

2) Using the cluster centers of the retained subsets, the initial cluster centers of the data set are computed as follows:

$$p_i = \frac{1}{n} \sum_{j=1}^{n} p_{ji}, \quad i = 1, 2, \ldots, K.$$

The invention has the following beneficial effects:

The K-Means clustering method of the invention for processing big data based on NSS-AKmeans and MapReduce can cluster big data when the number of clusters is unknown, obtain accurate initial cluster centers, and then complete a more accurate clustering with K-Means. The method removes the requirement of Hadoop-based K-Means algorithms that the number of clusters be known, and because the computed initial cluster centers are more accurate, it reduces the number of K-Means iterations in the third MapReduce job.

Further, the data set is randomly shuffled so that the data entries are randomly distributed, which reduces the chance of unrepresentative sampled subsets in the second MapReduce job. Specifically, data cleaning deletes entries with missing data, and normalization makes computation on the data more convenient and accurate.

Further, in step (2), in the second MapReduce job, each block of the input data set is sampled to obtain a subset of between 5,000 and 10,000 entries. Sampling a small subset overcomes the inability of the NSS-AKmeans algorithm to handle large data volumes.
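
The sampling of one block can be sketched as follows. This is only an illustration: the function name `sample_subset`, the default target size of 8,000, the fixed seed, and the fallback for blocks smaller than the target are assumptions, since the text only states that each subset holds between 5,000 and 10,000 entries.

```python
import random

def sample_subset(block, size=8000, seed=0):
    """Draw a random subset of one input block (hypothetical helper)."""
    # The text only requires a subset size between 5,000 and 10,000.
    if not 5000 <= size <= 10000:
        raise ValueError("size should lie between 5000 and 10000")
    # Assumed fallback: a block smaller than the target is used as a whole.
    if len(block) <= size:
        return list(block)
    # Sampling without replacement, so no entry appears twice.
    return random.Random(seed).sample(block, size)
```

Because the data were shuffled in the first job, a uniform sample of each block is a plausible stand-in for the whole block.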

In addition, in step (2), improving the existing clustering method NSS-AKmeans on the basis of MapReduce so that it can process big data yields the number of clusters and accurate initial cluster centers. This step obtains the number of clusters of the data set, which removes the requirement of existing Hadoop-based K-Means algorithms that the number of clusters be known. The initial cluster centers obtained are also much more accurate than randomly selected ones.

Brief description of the drawings

Fig. 1 is a flow chart of the K-Means clustering method of the invention for processing big data based on NSS-AKmeans and MapReduce;

Fig. 2 compares the reduction in K-Means iterations achieved by the invention with that of other algorithms.

Detailed description

The invention is further described below with reference to the accompanying drawings and a specific embodiment.

As shown in Fig. 1, the K-Means clustering method provided by the invention for processing big data based on NSS-AKmeans and MapReduce comprises the following steps:

(1) In the first MapReduce job, the data are cleaned, normalized, and randomly shuffled. Data cleaning deletes entries of the data set that have missing data. Random shuffling makes the data entries randomly distributed, which reduces the chance of unrepresentative subsets in the later sampling.
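
The preprocessing in step (1) can be sketched on a single machine as follows. This is a minimal sketch, not the MapReduce job itself: the function name `preprocess`, the use of min-max scaling as the normalization, and the fixed seed are assumptions, because the text does not say which normalization is used.

```python
import math
import random

def preprocess(records, seed=0):
    """Clean, normalize, and shuffle numeric records (hypothetical helper)."""
    # Cleaning: drop entries that have a missing (None) or NaN field.
    clean = [r for r in records
             if all(v is not None and not math.isnan(v) for v in r)]
    if not clean:
        return []
    dims = len(clean[0])
    lo = [min(r[d] for r in clean) for d in range(dims)]
    hi = [max(r[d] for r in clean) for d in range(dims)]
    # Normalization: min-max scaling to [0, 1] (an assumed choice).
    scaled = [[(r[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
               for d in range(dims)] for r in clean]
    # Random rearrangement so that entries are randomly distributed.
    random.Random(seed).shuffle(scaled)
    return scaled
```

For example, `preprocess([[0.0, 10.0], [None, 5.0], [2.0, 20.0]])` drops the second entry and scales the remaining two to the unit square.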

(2) The second MapReduce job takes the output of the first MapReduce job as its input.

a) In the Map function, each block of the input data set is sampled to obtain a subset of between 5,000 and 10,000 entries. Each subset is clustered with the NSS-AKmeans algorithm to obtain its cluster centers. The specific steps are as follows:

i. For each point p, compute its t nearest neighbors and define a set tNN whose members are these t neighbors. Then compute the aggregation energy of each point according to the following formula.

Here d_{ip} is the Euclidean distance between point p and a point in the set tNN.

ii. Take the tNN set of the point with the largest aggregation energy as a cluster, and add to this cluster the points in the tNN sets of the points it contains. If more than half of the points in the cluster occur more than t/2 times, the cluster is retained and its points are deleted from the data set. Otherwise the cluster is not retained, and its points are still deleted from the data set. Continue searching the remaining data for qualifying clusters in the same way until the data set is empty. This operation partitions the subset roughly and identifies its high-density regions.

iii. The subset is then clustered more accurately with the agglomerative fuzzy K-Means algorithm, which yields the exact number of clusters and the cluster centers of each subset. The algorithm takes the centers of the high-density regions of the subset as its initial cluster centers. Let X = {X_1, X_2, …, X_n} be the points in the subset, where each X_i can be written as {x_{i,1}; x_{i,2}; …; x_{i,m}} and m is the dimension of each point. The agglomerative fuzzy K-Means algorithm clusters by minimizing the following function:

Here u_{i,j} represents the relation between X_i and the j-th cluster z_j, and D_{i,j} is the Euclidean distance between the j-th cluster center and the i-th point X_i.

To minimize the function P, first fix U and minimize P with Z as the variable, then fix Z and minimize P with U as the variable, and repeat these two operations until P no longer changes. U and Z are updated according to the following formulas:
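
The neighbor search in step i can be sketched as a brute-force computation. This is only an illustration under stated assumptions: the function name `t_nearest_neighbors` is hypothetical, ties are broken by point index, and the aggregation energy itself is left out because its formula is not available in this text.

```python
import math

def t_nearest_neighbors(points, t):
    """Return, for each point index, the indices of its t nearest neighbors."""
    tnn = []
    for i, p in enumerate(points):
        # Euclidean distance from p to every other point in the subset.
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        # Sort by distance; equal distances are ordered by index (assumption).
        others.sort()
        tnn.append([j for _, j in others[:t]])
    return tnn
```

The quadratic cost of this search is one reason why a subset of at most 10,000 entries per block is practical where the full data set would not be.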

b) The cluster centers of the subsets obtained above are analyzed and merged. Let (n_1, n_2, …, n_n) be the numbers of clusters of the subsets and let K be their mode; K is taken as the number of clusters of the data set. Results from subsets whose number of clusters differs from K are deleted. The remaining subset clustering results are as follows:

Subset	Number of clusters	Cluster centers
Subset 1	K	p_{11}, p_{12}, …, p_{1K}
Subset 2	K	p_{21}, p_{22}, …, p_{2K}
…	…	…
Subset n	K	p_{n1}, p_{n2}, …, p_{nK}

c) Using the cluster centers of the retained subsets, the initial cluster centers of the data set are computed as follows:

$$p_i = \frac{1}{n} \sum_{j=1}^{n} p_{ji}, \quad i = 1, 2, \ldots, K.$$
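
The merging in steps b) and c) can be sketched as follows. The function name `merge_subset_results` is hypothetical, and the sketch assumes that the i-th center of every retained subset refers to the same cluster, since the text averages p_{ji} over subsets j without describing how centers are matched across subsets.

```python
from collections import Counter

def merge_subset_results(subset_centers):
    """subset_centers: one list of cluster centers (lists of floats) per subset."""
    counts = [len(c) for c in subset_centers]
    # K is the mode of the per-subset numbers of clusters.
    k = Counter(counts).most_common(1)[0][0]
    # Delete results whose number of clusters differs from K.
    kept = [c for c in subset_centers if len(c) == k]
    n = len(kept)
    dims = len(kept[0][0])
    # p_i = (1/n) * sum over retained subsets j of p_ji.
    # Assumption: center i corresponds to the same cluster in every subset.
    initial = [[sum(c[i][d] for c in kept) / n for d in range(dims)]
               for i in range(k)]
    return k, initial
```

With three subsets reporting 2, 2, and 3 clusters, K is 2, the third result is dropped, and each initial center is the mean of the two remaining centers with the same index.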

(3) In the third MapReduce job, starting from the initial cluster centers already obtained, the parallelized implementation of the standard K-Means algorithm on MapReduce completes the cluster analysis of the original data, which yields the final cluster centers of the data set. The main steps of the K-Means algorithm are as follows:

a) Partition the data set initially according to the obtained initial cluster centers, giving K clusters;

b) Compute the distance from each point to each cluster center and assign the point to the nearest cluster;

c) Recompute the center of each cluster;

d) Repeat b) and c) until the center of each cluster no longer changes within a given tolerance or the maximum number of iterations is reached.
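
Steps a) to d) can be sketched in map/reduce style on a single machine. This is a plain-Python illustration rather than Hadoop code: the names `kmeans_map`, `kmeans_reduce`, and `kmeans`, the default tolerance and iteration limit, and the handling of empty clusters are all assumptions.

```python
import math
from collections import defaultdict

def kmeans_map(point, centers):
    """Map: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point

def kmeans_reduce(points):
    """Reduce: the new center is the mean of the points in one cluster."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def kmeans(data, centers, tol=1e-6, max_iter=100):
    """Iterate map and reduce until the centers move by at most tol."""
    for iteration in range(1, max_iter + 1):
        groups = defaultdict(list)
        for p in data:
            idx, point = kmeans_map(p, centers)
            groups[idx].append(point)
        # A center with no assigned points is kept unchanged (assumed handling).
        new_centers = [kmeans_reduce(groups[i]) if groups[i] else list(centers[i])
                       for i in range(len(centers))]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, iteration
```

When the initial centers already coincide with the cluster means, the loop stops after its first pass, which is the effect that accurate initial cluster centers are meant to achieve.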

Experiment and effect analysis

Table 1 compares the initial cluster centers of the data set obtained by this algorithm, the final cluster centers, and the true cluster centers. The data show that this algorithm not only selects the number of clusters of the data set automatically, but that the initial cluster centers obtained by the MapReduce-based improved NSS-AKmeans algorithm are also very close to the final cluster centers and, compared with the final cluster centers, closer to the true cluster centers. The algorithm therefore overcomes the problem mentioned in the background art that the number of clusters cannot be selected automatically and must be known. At the same time, the initial cluster centers obtained by this algorithm are very close to the true cluster centers.

Fig. 2 compares the reduction in K-Means iterations achieved by the invention with that of other algorithms. Because this algorithm obtains more accurate initial cluster centers, it reduces the number of K-Means iterations more than the other algorithms do. The figure shows that, with this method, the K-Means algorithm reaches the convergence condition after a single iteration, whereas the other methods have still not reached it after several iterations.

Table 1 compares the cluster centers obtained by the clustering of the invention with the true cluster centers:

Claims

1. A K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce, characterized by comprising the following steps:

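(1) In the first MapReduce job, the numeric data set is preprocessed, including data cleaning, normalization, and rearrangement;

(2) The output of the first MapReduce job is fed into the second MapReduce job. In the second MapReduce job, each block of the input data set is sampled to obtain a subset, each subset is clustered with the NSS-AKmeans algorithm to obtain its cluster centers, and these cluster centers are then analyzed and merged to obtain the initial cluster centers;

(3) In the third MapReduce job, starting from the initial cluster centers already obtained, the parallelized standard K-Means algorithm on MapReduce completes the cluster analysis of the data set.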

2. The K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce according to claim 1, characterized in that, in step (1), the data set is randomly shuffled so that the data entries are randomly distributed.

3. The K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce according to claim 1, characterized in that, in step (2), in the second MapReduce job, each block of the input data set is sampled to obtain a subset of between 5,000 and 10,000 entries.

4. The K-Means clustering method for processing big data based on NSS-AKmeans and MapReduce according to claim 3, characterized in that step (2) is implemented as follows:

1) Each subset is clustered with the NSS-AKmeans algorithm to obtain its number of clusters and cluster centers. Let (n_1, n_2, …, n_n) be the numbers of clusters of the subsets and let K be their mode; K is taken as the number of clusters of the data set. Results from subsets whose number of clusters differs from K are deleted, so that each remaining subset j contributes K cluster centers p_{j1}, p_{j2}, …, p_{jK};

2) Using the cluster centers of the retained subsets, the initial cluster centers of the data set are computed as follows:

$$p_i = \frac{1}{n} \sum_{j=1}^{n} p_{ji}, \quad i = 1, 2, \ldots, K.$$