CN108520284A - An improved spectral clustering and parallelization method - Google Patents

An improved spectral clustering and parallelization method

Info

Publication number
CN108520284A
CN108520284A (application CN201810344423.3A)
Authority
CN
China
Prior art keywords
cluster
algorithm
bird
nest
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810344423.3A
Other languages
Chinese (zh)
Inventor
强保华
孙颢宁
王玉峰
谢武
韦二龙
史喜娜
赵兴朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
CETC 54 Research Institute
Original Assignee
Guilin University of Electronic Technology
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology, CETC 54 Research Institute filed Critical Guilin University of Electronic Technology
Priority to CN201810344423.3A priority Critical patent/CN108520284A/en
Publication of CN108520284A publication Critical patent/CN108520284A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2323Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Discrete Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an improved spectral clustering method based on a swarm intelligence algorithm. The eigenvectors corresponding to the first 2k largest eigenvalues of the Laplacian matrix are chosen as the source data for clustering, and a swarm intelligence algorithm then selects good initialization center points before the clustering operation, which improves both the highest clustering accuracy and the stability of repeated clustering results. The invention introduces the cuckoo search algorithm to find the initialization center points: the fitness function during cuckoo search is the sum-of-squared-errors function and, applied to spectral clustering, the data points with the minimum sum of squared errors found by the search serve as the initialization center points. The invention further introduces the Lévy flight strategy of the cuckoo search algorithm into the particle swarm algorithm: when the convergence rate of the particle swarm algorithm slows down, the Lévy flight strategy generates frequent small steps and occasional large steps, and velocity update formulas with different emphasis are introduced for the different step sizes.

Description

An improved spectral clustering and parallelization method
Technical field
The invention belongs to unsupervised learning methods in machine learning and relates to clustering methods and swarm intelligence algorithms.
Background art
The purpose of clustering is to partition data so that similar data are assigned to the same cluster and dissimilar data to different clusters. With the development of information technology, data have become increasingly diverse and many data sets contain dimensions that are irrelevant to one another; traditional clustering algorithms have difficulty handling such data. Spectral clustering is a newer clustering algorithm that can cluster in sample spaces of arbitrary shape and converge to the globally optimal solution, and it is widely used in fields such as computer vision, text mining and bioinformatics. At this stage, research on spectral clustering mainly covers the construction of the similarity matrix, the selection of eigenvectors, the determination of the number of clusters, and applications of the algorithm. Among these areas, the selection of eigenvectors is of crucial importance to the partitioning of clusters. The classical NJW algorithm clusters on the eigenvectors corresponding to the first k largest eigenvalues (k being the number of clusters). Sun et al. proposed feature selection based on principal component analysis (PCA): rather than using the eigenvectors with the largest eigenvalues, a genetic algorithm searches the PCA space for a subset of eigenvectors that reflects the target concept and uses it as the principal component directions. Zhao et al. proposed an eigenvector selection algorithm based on entropy ranking, which first computes the entropy of the eigenvectors, ranks them and selects the better ones; they point out that a group of eigenvectors better than the k eigenvectors with the largest eigenvalues can exist, and that the number of selected eigenvectors is not necessarily k. Rebagliati et al. proposed that the gaps between eigenvalues can assist in choosing the number of eigenvectors, and that the eigenvectors corresponding to larger eigenvalues are more helpful for clustering. The paper "Machine-learning research: Four current directions" analyses that a single group of eigenvectors suitable for the whole data set does not necessarily exist, and even if it exists it may not be obtainable from the small amount of prior information available.
Summary of the invention
The present invention provides an improved spectral clustering method based on a swarm intelligence algorithm. The eigenvectors corresponding to the first 2k largest eigenvalues of the Laplacian matrix are chosen as the source data for clustering, and a swarm intelligence algorithm then selects good initialization center points before the clustering operation. This improves the highest clustering accuracy and the stability of repeated clustering results, i.e. the clustering accuracy is improved while the stability over repeated clustering runs is preserved.
To keep the results stable, better initialization center points must be selected, so the invention introduces the cuckoo search (CS) algorithm to find the initialization center points. The fitness function during cuckoo search is the sum-of-squared-errors function; applied to spectral clustering, the data points with the minimum sum of squared errors found by the search serve as the initialization center points, which reduces the accuracy instability caused by randomly chosen center points.
To make the search converge faster, the present invention also introduces the Lévy flight strategy of the cuckoo search algorithm into the particle swarm algorithm. When the convergence rate of the particle swarm algorithm slows down, the Lévy flight strategy is used to generate frequent small steps and occasional large steps, and velocity update formulas with different emphasis are introduced for the different step sizes: with small steps the influence of the global optimal solution is weakened, and with large steps the influence of the particle's own historical optimal solution is weakened. Using the improved algorithm for optimization yields a smaller convergence value and a faster convergence rate.
Description of the drawings
Fig. 1 is the flow chart of the spectral clustering method based on feature extension and the cuckoo search algorithm.
Fig. 2 is the flow chart of the particle swarm algorithm fused with cuckoo search.
Fig. 3 is the DAG of the parallelized construction of the Laplacian matrix.
Fig. 4 is the DAG of the parallelized K-means algorithm.
Fig. 5 is the comparison of the improved spectral clustering method with other clustering algorithms.
Fig. 6 is the comparison of the particle swarm algorithm fused with cuckoo search with other improved algorithms.
Fig. 7 is the comparison of single-machine and cluster times for the parallelized construction of the Laplacian matrix.
Fig. 8 is the comparison of single-machine and cluster times for the parallelized K-means algorithm.
Detailed description of the embodiments
Choosing the eigenvectors corresponding to the first 2k largest eigenvalues of the Laplacian matrix as the source data for clustering can produce results whose accuracy is higher than that of the ordinary NJW algorithm, but it can also produce results whose accuracy is lower, so the results fluctuate over repeated runs, although the best of the repeated results is higher than that of the ordinary NJW algorithm. The fluctuation in accuracy over repeated runs arises because, in the feature sample space of spectral clustering, the samples are more compact in lower dimensions; expanding from k to 2k dimensions introduces dimensions with poor ability to separate the clusters and increases the independence between data, so the data points are more scattered than in the low-dimensional space. Moreover, since spectral clustering still uses randomly initialized center points, the probability of falling into a locally optimal solution during clustering increases, which makes the accuracy of repeated clustering unstable. To keep the results stable, better initialization center points must be selected.
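As a concrete illustration of the feature extension described above, the following is a minimal NumPy sketch of building the embedding from the first 2k eigenvectors. It assumes the Gaussian affinity A_ij = exp(−|d_ij|²/(2σ²)) with Euclidean distances (as given in claim 1) and the normalized Laplacian L = D^{-1/2} A D^{-1/2}; the function name `spectral_embedding_2k` and the row normalization of Y are illustrative choices, not mandated by the text.

```python
import numpy as np

def spectral_embedding_2k(X, k, sigma=1.0):
    """Build the spectral embedding from the first 2k eigenvectors (feature extension).

    Assumes the Gaussian affinity A_ij = exp(-|d_ij|^2 / (2 sigma^2)), A_ii = 0,
    with Euclidean distances, and L = D^{-1/2} A D^{-1/2} as in the description.
    """
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                          # A_ii = 0

    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))         # diagonal of D^{-1/2}
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    # Eigenvectors of the symmetric matrix L, taking the 2k largest eigenvalues
    w, V = np.linalg.eigh(L)
    Y = V[:, np.argsort(w)[::-1][: 2 * k]]

    # Row-normalize the embedding (one reading of the "standardized matrix" Y)
    return Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
```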
The present invention introduces the cuckoo search (CS) algorithm to find the initialization center points. Cuckoo search is a swarm intelligence algorithm: an optimization scheme based on the brood-parasitic nesting behaviour of cuckoos and the Lévy flight mechanism, which finds good positions by evaluating a fitness function. In the cuckoo search procedure here, the fitness function is the sum-of-squared-errors function. By applying the cuckoo search algorithm to spectral clustering and taking the data points with the minimum sum of squared errors found by the search as the initialization center points, the accuracy instability caused by randomly chosen center points can be reduced.
To make the search converge faster, the present invention introduces the Lévy flight strategy of the cuckoo search algorithm into the particle swarm algorithm. The particle swarm algorithm is built on observations of collective animal behaviour: individuals in the swarm share information so that the movement of the whole swarm evolves from disorder to order in the solution space, thereby obtaining the optimal solution. When the convergence rate of the particle swarm algorithm slows down, the Lévy flight strategy is used to generate frequent small steps and occasional large steps, and velocity update formulas with different emphasis are introduced for the different step sizes: with small steps the influence of the global optimal solution is weakened, and with large steps the influence of the particle's own historical optimal solution is weakened. Using the improved algorithm for optimization yields a smaller convergence value and a faster convergence rate.
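The description relies on Lévy flights producing frequent small steps and occasional large ones, but does not spell out how Lévy(λ) is sampled. The sketch below uses Mantegna's algorithm, a common choice, as an assumed sampler; the name `levy_step` and the default λ = 1.5 are illustrative.

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(dim, lam=1.5, rng=None):
    """Sample one Levy-flight step via Mantegna's algorithm (assumed sampler).

    Yields mostly small components with occasional large jumps, i.e. the
    "frequent small steps, occasional large steps" behaviour described above.
    """
    rng = rng or np.random.default_rng()
    sigma_u = (gamma(1 + lam) * sin(pi * lam / 2)
               / (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma_u, size=dim)
    v = rng.normal(0.0, 1.0, size=dim)
    return u / np.abs(v) ** (1 / lam)
```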
To improve the efficiency of processing massive data, the invention introduces the Spark distributed computing framework. Using the Spark RDD programming model, the construction of the Laplacian matrix in spectral clustering and the K-means algorithm are parallelized, so that the storage and processing of data move from a single machine to a cluster, which improves the processing speed and the capacity to store and process massive data.
Referring to Fig. 1, the spectral clustering method based on feature extension and the cuckoo search algorithm comprises the following steps (a sketch of steps (4)–(10) follows the list):
(1) Give the original data X = [x_1, x_2, x_3, …, x_n] ∈ R^d and the number of clusters k.
(2) Calculate the Laplacian matrix L according to the following formulas: A_ij = exp(−|d_ij|²/(2σ²)) for i ≠ j, A_ii = 0, where d_ij is the distance between x_i and x_j; D is the diagonal degree matrix with D_ii = Σ_j A_ij; L = D^{−1/2} A D^{−1/2}.
(3) Calculate the matrix Y of normalized eigenvectors corresponding to the first 2k largest eigenvalues of the Laplacian matrix L.
(4) Randomly initialize the positions of n bird's nests among the rows of matrix Y.
(5) Using the positions of the n bird's nests as cluster centers, perform clustering for each nest, calculate the fitness value F of each nest, and keep the nest position with the smallest F value: F_best = min{F_1, F_2, …, F_n}.
(6) Update the nest positions according to the Lévy flight strategy of the cuckoo search algorithm.
(7) Perform clustering again with the updated nest positions, calculate the fitness of each nest, compare the old and new generations of nests by fitness together with the previous generation's minimum F value, and keep the nest positions with the smaller F values.
(8) Compare a random number r ∈ [0, 1] with the detection probability P_a: if r < P_a, keep the nest position; if r > P_a, update the nest position by formula and keep whichever of the old and new positions has the smaller F value.
(9) If the maximum number of iterations or the preset stop condition has not been reached, return to step (5) and continue; otherwise keep the solution with the minimum fitness value and proceed to the next step.
(10) Run K-means clustering from the obtained optimal nest positions and finally output the cluster center points and the clustering result.
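The following is a compact sketch of steps (4)–(10), reusing `spectral_embedding_2k` and `levy_step` from the sketches above. The sum-of-squared-errors fitness follows the description; the nest update x' = x + α·Lévy(λ), the parameter values (n_nests, alpha, pa, iters) and the way rebuilt nests are accepted are assumptions for illustration, not the patent's exact formulas.

```python
import numpy as np

def sse_fitness(centers, Y):
    """Fitness F: sum of squared errors of the rows of Y to their nearest center."""
    d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def cs_init_centers(Y, k, n_nests=20, alpha=0.01, pa=0.25, iters=100, rng=None):
    """Cuckoo-search selection of k initialization centers in the embedding Y (steps (4)-(9))."""
    rng = rng or np.random.default_rng()
    n, dim = Y.shape
    # Step (4): each nest holds k candidate centers drawn from the rows of Y
    nests = np.stack([Y[rng.choice(n, k, replace=False)] for _ in range(n_nests)])
    fit = np.array([sse_fitness(c, Y) for c in nests])

    for _ in range(iters):
        # Step (6): Levy-flight update of every nest (standard CS form, assumed here)
        new = nests + alpha * np.stack(
            [levy_step(k * dim).reshape(k, dim) for _ in range(n_nests)])
        new_fit = np.array([sse_fitness(c, Y) for c in new])
        keep_new = new_fit < fit                     # step (7): keep the better generation
        nests[keep_new], fit[keep_new] = new[keep_new], new_fit[keep_new]

        # Step (8): when r > Pa, rebuild a nest at a random position and keep it only
        # if its F value is smaller (one reading of the r-vs-Pa test in the text)
        for i in np.where(rng.random(n_nests) > pa)[0]:
            cand = Y[rng.choice(n, k, replace=False)]
            f = sse_fitness(cand, Y)
            if f < fit[i]:
                nests[i], fit[i] = cand, f

    return nests[int(np.argmin(fit))]                # step (9): best nest = initial centers

def improved_spectral_clustering(X, k, sigma=1.0):
    """Steps (1)-(10): embedding, CS initialization, then a plain K-means refinement."""
    Y = spectral_embedding_2k(X, k, sigma)
    centers = cs_init_centers(Y, k)
    for _ in range(100):                             # step (10): K-means from the CS centers
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([Y[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

In this reading each "nest" is a full set of k candidate centers, so minimizing F directly minimizes the sum of squared errors that the final K-means step starts from.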
Referring to Fig. 2, the particle swarm algorithm fused with cuckoo search comprises the following steps (a sketch follows the list):
(1) Initialization: set the size of the swarm, randomly initialize each particle's position x_i = (x_i1, x_i2, …, x_iD), and set the initial velocity v_i = (v_i1, v_i2, …, v_iD).
(2) Compute the fitness value F of each particle; record each particle's historical best position p_i = (p_i1, p_i2, …, p_iD) and the best position experienced by the swarm p_g = (p_g1, p_g2, …, p_gD).
(3) Each particle updates its own velocity and position according to the following formulas: v_id = v_id + c_1 r_1 (p_id − x_id) + c_2 r_2 (p_gd − x_id), x_id = x_id + v_id, where c_1 and c_2 are learning factors (acceleration coefficients), normally positive constants and usually equal to 2, and r_1 and r_2 are pseudo-random numbers uniformly distributed in the interval [0, 1].
(4) During the iterations, monitor the rate at which the fitness value has decreased over the last 10 iterations. When the decrease rate falls below a threshold, introduce the Lévy flight strategy: let step = Lévy(λ), so that step realizes frequent small steps and occasional large steps, and weight the influence factors differently for small and large steps. With small steps the influence of the global optimal solution is weakened; with large steps the influence of the particle's own historical optimal solution is weakened. Here the two weighting coefficients denote the global influence factor and the local influence factor respectively.
(5) If the maximum number of iterations or the preset stop condition has not been reached, return to step (2) and continue; otherwise keep the position with the minimum fitness value.
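The exact re-weighting formulas of step (4) did not survive in the text above, so the sketch below (reusing `levy_step`) shows one plausible reading: a stalled global best triggers a Lévy step, small steps down-weight the global-best term, and large steps down-weight the personal-best term. The inertia weight w, the 0.5 weakening factors and the stall threshold are assumptions, not the patent's values.

```python
import numpy as np

def cs_pso(fitness, dim, n_particles=30, iters=1000, w=0.7, c1=2.0, c2=2.0,
           stall_tol=1e-3, rng=None):
    """PSO with a Levy-flight step injected when convergence stalls (a sketch)."""
    rng = rng or np.random.default_rng()
    x = rng.uniform(-1.0, 1.0, (n_particles, dim))
    v = np.zeros((n_particles, dim))
    f = np.apply_along_axis(fitness, 1, x)
    pbest, pbest_f = x.copy(), f.copy()
    g = int(np.argmin(f))
    gbest, gbest_f = x[g].copy(), float(f[g])
    history = []

    for _ in range(iters):
        w1 = w2 = 1.0                                # default: both attraction terms kept
        # Step (4): if the best fitness barely dropped over the last 10 iterations,
        # draw a Levy step and re-weight the two attraction terms.
        if len(history) >= 10 and history[-10] - gbest_f < stall_tol:
            step = levy_step(dim)
            if np.linalg.norm(step) < 1.0:           # frequent small step
                w2 = 0.5                             # weaken the global-best pull (assumed factor)
            else:                                    # occasional large step
                w1 = 0.5                             # weaken the personal-best pull (assumed factor)
            x[rng.integers(n_particles)] += step     # let one particle take the Levy jump

        r1 = rng.random((n_particles, 1))
        r2 = rng.random((n_particles, 1))
        v = w * v + w1 * c1 * r1 * (pbest - x) + w2 * c2 * r2 * (gbest - x)
        x = x + v

        f = np.apply_along_axis(fitness, 1, x)
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        g = int(np.argmin(pbest_f))
        if pbest_f[g] < gbest_f:
            gbest, gbest_f = pbest[g].copy(), float(pbest_f[g])
        history.append(gbest_f)

    return gbest, gbest_f
```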
Fig. 3 is the stage flow chart of building the Laplacian matrix in parallel with Spark RDDs; the specific steps are as follows (a PySpark-style sketch follows the list):
1) Create dataRDD from the input data in the format Tuple<index, data>, where index is the row number and data is the data row;
2) Copy the created dataRDD as cloneRDD;
3) Form the Cartesian product of the two RDDs with the cartesian operator and, to reduce repeated computation, use the filter operator to discard the duplicated half of the pairs in the set, obtaining groupRDD;
4) Use the flatMap operator to compute the similarity of each pair of data in groupRDD;
5) Use reduceByKey to compute the sum of each row of the matrix, obtaining the diagonal matrix matrixD_RDD;
6) Apply the map operator to matrixD_RDD to obtain the matrix D^{-1/2};
7) Use the map operator to compute A·D^{-1/2}, obtaining AD_RDD;
8) Use the map operator to compute D^{-1/2}·A·D^{-1/2}, obtaining L_RDD.
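The steps above reference the Java/Scala RDD API; the following is a PySpark adaptation of the same staged flow (cartesian → filter → flatMap → reduceByKey → map), assuming a Gaussian similarity on dense input points. Variable names mirror the RDD names in the list where possible, and collecting D^{-1/2} to the driver is an illustrative simplification.

```python
import numpy as np
from pyspark import SparkContext

def build_laplacian_rdd(sc: SparkContext, points, sigma=1.0):
    """Parallel construction of L = D^{-1/2} A D^{-1/2} with Spark RDDs (sketch of Fig. 3)."""
    # 1) dataRDD of (index, data) pairs
    data_rdd = sc.parallelize([(i, np.asarray(p, dtype=float)) for i, p in enumerate(points)])

    # 2)-3) Cartesian product of the RDD with itself, keeping only pairs with i < j
    #        so each similarity is computed once
    group_rdd = data_rdd.cartesian(data_rdd).filter(lambda p: p[0][0] < p[1][0])

    # 4) Gaussian similarity for every remaining pair, emitted for both rows
    def similarities(pair):
        (i, xi), (j, xj) = pair
        a = float(np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2)))
        return [(i, (j, a)), (j, (i, a))]

    sim_rdd = group_rdd.flatMap(similarities)

    # 5) Row sums give the diagonal degree matrix D (matrixD_RDD)
    row_sums = sim_rdd.mapValues(lambda ja: ja[1]).reduceByKey(lambda a, b: a + b)
    # 6) D^{-1/2}, broadcast so the final map can look it up
    d_inv_sqrt = sc.broadcast(dict(row_sums.mapValues(lambda s: 1.0 / np.sqrt(s)).collect()))

    # 7)-8) L_ij = D_ii^{-1/2} * A_ij * D_jj^{-1/2}  (the off-diagonal entries of L_RDD)
    return sim_rdd.map(lambda kv: ((kv[0], kv[1][0]),
                                   d_inv_sqrt.value[kv[0]] * kv[1][1]
                                   * d_inv_sqrt.value[kv[1][0]]))
```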
Fig. 4 is the main-loop flow chart of the K-means algorithm parallelized with Spark RDDs; the specific steps are as follows (a sketch follows the list):
(1) Initialize the k center points kClusters according to the value of k;
(2) Turn kClusters into a broadcast variable;
(3) Use the mapToPair method to compute the cluster to which each data point belongs;
(4) Use the reduceByKey method to recompute the center point from the data points of each cluster;
(5) Repeat steps (2)–(4) until the convergence condition is reached.
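A PySpark sketch of this main loop follows, with the same caveat: the original names Java API methods (mapToPair, a broadcast setter), which are approximated here with map/reduceByKey and an ordinary broadcast variable; parameter names such as max_iter and tol are illustrative.

```python
import numpy as np
from pyspark import SparkContext

def parallel_kmeans(sc: SparkContext, points, init_centers, max_iter=50, tol=1e-4):
    """Main loop of the parallelized K-means stage (Fig. 4), as a PySpark sketch."""
    data_rdd = sc.parallelize([np.asarray(p, dtype=float) for p in points]).cache()
    centers = [np.asarray(c, dtype=float) for c in init_centers]

    for _ in range(max_iter):
        # (2) ship the current centers to every executor as a broadcast variable
        centers_bc = sc.broadcast(centers)

        # (3) assign each point to its nearest center (cluster id as the key)
        def assign(p):
            c = centers_bc.value
            j = int(np.argmin([np.sum((p - ck) ** 2) for ck in c]))
            return (j, (p, 1))

        # (4) recompute each cluster's center as the mean of its points
        sums = (data_rdd.map(assign)
                        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                        .collectAsMap())
        new_centers = [sums[j][0] / sums[j][1] if j in sums else centers[j]
                       for j in range(len(centers))]

        # (5) stop once the centers stop moving
        shift = max(float(np.linalg.norm(nc - oc)) for nc, oc in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:
            break
    return centers
```

Re-broadcasting the small center list each round keeps the shuffled data limited to per-cluster partial sums, which mirrors the reduceByKey step in the list above.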
Fig. 5 compares the accuracy of the improved spectral clustering method with other clustering algorithms. The five algorithms compared are the K-means algorithm, the ordinary NJW algorithm, the entropy-ranking-based ESBER_D algorithm, the feature-extension 2K_NJW algorithm, and the CS_2K_NJW algorithm that combines the cuckoo search algorithm with feature extension.
The highest clustering accuracy of the 2K_NJW algorithm on the four groups of data is higher than that of the NJW and ESBER_D algorithms, so the number of eigenvectors that yields high-accuracy clustering results is not necessarily k; every eigenvector dimension has some ability to separate the clusters. It can also be seen that the results of 20 clustering runs of the 2K_NJW algorithm on each data set fluctuate. This is because, among the additional k eigenvectors introduced, some eigenvectors provide useful information while also interfering with the clustering operation, increasing the scatter of the data points in the eigenvector space; the clustering process then becomes more sensitive to the initialization center points and is more easily trapped in locally optimal solutions, leading to poor overall results.
For the clustering results of the CS_2K_NJW algorithm, it can be clearly seen that good stability is maintained while the high accuracy of the 2K_NJW algorithm is preserved, so its clustering results are better than those of the other three algorithms. On top of clustering with 2k eigenvectors, this algorithm adds the cuckoo search algorithm, which searches for good initialization center points with the objective of minimizing the sum of the within-cluster variances of all clusters, thus making full use of the ability of every eigenvector dimension to separate the clusters and ensuring the stability of repeated clustering results. To compute the high-quality initialization center points, the cuckoo search algorithm likewise adds computation to the original algorithm.
Fig. 6 compares the convergence time and the minimum convergence value of the improved search algorithm with other swarm intelligence algorithms. The algorithms compared are the cuckoo search (CS) algorithm, particle swarm optimization (PSO), the particle swarm algorithm with cuckoo search introduced globally (CS_PSO(WL)), and the particle swarm algorithm with cuckoo search introduced partially (CS_PSO(PL)).
As can be seen from Fig. 6, the basic PSO algorithm converges quickly and its degree of convergence is clearly better than that of the CS algorithm; over 1000 iterations it takes roughly 1%–3% more time than the CS algorithm, so its overall efficiency is still better than that of CS. The CS_PSO(WL) algorithm converges more slowly than PSO and also takes more time than PSO over 1000 iterations on most data sets, but its degree of convergence is better than that of PSO. The degree of convergence of the CS_PSO(PL) algorithm is on a par with CS_PSO(WL) and better than PSO; CS_PSO(PL) needs fewer iterations than PSO to reach PSO's degree of convergence, its time for 1000 iterations is less than that of CS_PSO(WL), and it takes roughly 1%–4% more time than PSO. Therefore CS_PSO(PL) reaches the degree of convergence of PSO in roughly the same time as PSO, and shortly afterwards reaches a degree of convergence better than PSO. Taken together, the CS_PSO(PL) algorithm is the more efficient.
Fig. 7 and Fig. 8 respectively compare the running times of the parallel methods for building the Laplacian matrix and for the K-means algorithm on a cluster with the running times on a single machine. As the figures show, the cluster running time of both operations is lower than the single-machine running time, and the larger the amount of data processed in parallel, the higher the efficiency of the cluster.
The experiments show that extending the eigenvector space of spectral clustering and determining the initialization center points of the clustering with the cuckoo search algorithm improve the accuracy and stability of spectral clustering. Incorporating the Lévy flight strategy of the cuckoo search algorithm into the particle swarm algorithm and improving the step-size weights in the update formulas yields a swarm intelligence algorithm with a low convergence value and a fast convergence rate. Parallelizing the computation of spectral clustering with the Spark RDD programming model improves the processing speed and the capacity to store and process massive data.

Claims (2)

1. An improved spectral clustering and parallelization method, the method comprising:
selecting the eigenvectors corresponding to the first 2k largest eigenvalues of the Laplacian matrix L as the source data for clustering, k being the number of clusters; and
finding the initialization center points using the cuckoo search algorithm;
wherein finding the initialization center points using the cuckoo search algorithm comprises the following steps:
(1) giving the original data X = [x_1, x_2, x_3, …, x_n] ∈ R^d and the number of clusters k;
(2) calculating the Laplacian matrix L according to the following formulas: A_ij = exp(−|d_ij|²/(2σ²)) for i ≠ j, A_ii = 0; D_ii = Σ_j A_ij; L = D^{−1/2} A D^{−1/2};
(3) calculating the matrix Y of normalized eigenvectors corresponding to the first 2k largest eigenvalues of the Laplacian matrix L;
(4) randomly initializing the positions of n bird's nests among the rows of matrix Y;
(5) using the positions of the n bird's nests as cluster centers, performing clustering for each nest, calculating the fitness value F of each nest, and keeping the nest position with the smallest F value: F_best = min{F_1, F_2, …, F_n};
(6) updating the nest positions according to the Lévy flight strategy of the cuckoo search algorithm;
(7) performing clustering again with the updated nest positions, calculating the fitness of each nest, comparing the old and new generations of nests by fitness together with the previous generation's minimum F value, and keeping the nest positions with the smaller F values;
(8) comparing a random number r ∈ [0, 1] with the detection probability P_a: if r < P_a, keeping the nest position; if r > P_a, updating the nest position by formula and keeping whichever of the old and new positions has the smaller F value;
(9) if the maximum number of iterations or the preset stop condition has not been reached, returning to step (5) and continuing; otherwise keeping the solution with the minimum fitness value and proceeding to the following step;
(10) performing K-means clustering from the obtained optimal nest positions, and finally outputting the cluster center points and the clustering result.
2. The method according to claim 1, wherein the Lévy flight strategy of the cuckoo search algorithm is introduced into a particle swarm algorithm, the particle swarm algorithm comprising the following steps:
(1) initializing: setting the size of the swarm, randomly initializing each particle's position x_i = (x_i1, x_i2, …, x_iD), and setting the initial velocity v_i = (v_i1, v_i2, …, v_iD);
(2) calculating the fitness value F of each particle, and recording each particle's historical best position p_i = (p_i1, p_i2, …, p_iD) and the best position experienced by the swarm p_g = (p_g1, p_g2, …, p_gD);
(3) each particle updating its own velocity and position according to the following formulas: v_id = v_id + c_1 r_1 (p_id − x_id) + c_2 r_2 (p_gd − x_id), x_id = x_id + v_id, where c_1 and c_2 are learning factors (acceleration coefficients), and r_1 and r_2 are pseudo-random numbers uniformly distributed in the interval [0, 1];
(4) during the iterations, monitoring the rate at which the fitness value F has decreased over the last 10 iterations; when the decrease rate falls below a threshold, introducing the Lévy flight strategy by letting step = Lévy(λ), so that step realizes frequent small steps and occasional large steps, and weighting the influence factors differently for small and large steps: with small steps, weakening the influence of the global optimal solution; with large steps, weakening the influence of the particle's own historical optimal solution; the two weighting coefficients denoting the global influence factor and the local influence factor, respectively;
(5) if the maximum number of iterations or the preset stop condition has not been reached, returning to step (2) and continuing; otherwise keeping the position with the minimum fitness value F.
CN201810344423.3A 2018-04-17 2018-04-17 A kind of improved spectral clustering and parallel method Pending CN108520284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810344423.3A CN108520284A (en) 2018-04-17 2018-04-17 A kind of improved spectral clustering and parallel method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810344423.3A CN108520284A (en) 2018-04-17 2018-04-17 A kind of improved spectral clustering and parallel method

Publications (1)

Publication Number Publication Date
CN108520284A true CN108520284A (en) 2018-09-11

Family

ID=63428816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810344423.3A Pending CN108520284A (en) 2018-04-17 2018-04-17 A kind of improved spectral clustering and parallel method

Country Status (1)

Country Link
CN (1) CN108520284A (en)


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739585B (en) * 2018-12-29 2022-02-18 广西交通科学研究院有限公司 Spark cluster parallelization calculation-based traffic congestion point discovery method
CN109739585A (en) * 2018-12-29 2019-05-10 广西交通科学研究院有限公司 The traffic congestion point discovery method calculated based on spark cluster parallelization
CN110334026A (en) * 2019-07-03 2019-10-15 浙江理工大学 Combined test case generation method based on CS-SPSO algorithm
CN110334026B (en) * 2019-07-03 2023-03-24 浙江理工大学 CS-SPSO algorithm-based combined test case generation method
CN110580077A (en) * 2019-08-20 2019-12-17 广东工业大学 maximum power extraction method of photovoltaic power generation system and related device
CN110516068A (en) * 2019-08-23 2019-11-29 贵州大学 A kind of various dimensions Text Clustering Method based on metric learning
CN110516068B (en) * 2019-08-23 2023-05-26 贵州大学 Multi-dimensional text clustering method based on metric learning
CN112149898A (en) * 2020-09-21 2020-12-29 广东电网有限责任公司清远供电局 Fault rate prediction model training method, fault rate prediction method and related device
CN112149898B (en) * 2020-09-21 2023-10-31 广东电网有限责任公司清远供电局 Training of failure rate prediction model, failure rate prediction method and related device
CN112836786A (en) * 2021-02-07 2021-05-25 中国科学院长春光学精密机械与物理研究所 Cuckoo search method
CN112836786B (en) * 2021-02-07 2024-03-12 中国科学院长春光学精密机械与物理研究所 Cuckoo searching method
CN113141317A (en) * 2021-03-05 2021-07-20 西安电子科技大学 Streaming media server load balancing method, system, computer equipment and terminal
CN112988693A (en) * 2021-03-26 2021-06-18 武汉大学 Spectral clustering algorithm parallelization method and system in abnormal data detection
WO2024074023A1 (en) * 2022-10-07 2024-04-11 南京邮电大学 Task scheduling method based on improved particle swarm optimization algorithm
CN116360505A (en) * 2023-06-02 2023-06-30 北京航空航天大学 Integrated automatic control method and system for stratospheric airship and electronic equipment
CN116360505B (en) * 2023-06-02 2023-08-22 北京航空航天大学 Integrated automatic control method and system for stratospheric airship and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180911

RJ01 Rejection of invention patent application after publication