CN108595499A

CN108595499A - A clonal-optimization particle swarm clustering method for high-dimensional data analysis

Info

Publication number: CN108595499A
Application number: CN201810221722.8A
Authority: CN
Inventors: 罗养霞
Original assignee: XI'AN UNIVERSITY OF FINANCE AND ECONOMICS
Current assignee: XI'AN UNIVERSITY OF FINANCE AND ECONOMICS
Priority date: 2018-03-18
Filing date: 2018-03-18
Publication date: 2018-09-28

Abstract

The invention belongs to the technical field of high-dimensional data analysis and discloses a clonal-optimization particle swarm clustering method for high-dimensional data analysis. The method uses a particle swarm clustering technique based on dynamic clonal selection, together with a constrained joint encoding mechanism and an evaluation measure based on the contribution rate of feature dimensions. Particle swarm theory is applied to high-dimensional data clustering: the optimizing search mechanism of the particle swarm algorithm performs a guided random search for cluster center vectors in the data set. Each particle is regarded as an antibody and as one way of partitioning the data set to be clustered into clusters, and particles undergo optimization and immune evolution at the same time. During dynamic evolution, particles are cloned in proportion to their affinity, clone suppression is applied in inverse proportion to antibody concentration, and local mutation is applied in inverse proportion to affinity. The invention effectively avoids being trapped in local optima, improves the stability and reliability of the clustering algorithm, and accelerates the search over high-dimensional data.

Description

A clonal-optimization particle swarm clustering method for high-dimensional data analysis

Technical field

The invention belongs to the technical field of high-dimensional data analysis, and in particular relates to a clonal-optimization particle swarm clustering method for high-dimensional data analysis.

Background art

In recent years, data mining has attracted great attention from the information industry and society as a whole. Large amounts of widely usable data exist in daily life, high-dimensional data are increasingly common in practice, and there is an urgent need to convert these data into useful information and knowledge. Clustering algorithms for low-dimensional data are relatively mature, but high-dimensional data, such as financial data, retail data, telecommunications data, and biological data, are ubiquitous in practical applications. Because of the curse of dimensionality, many traditional clustering algorithms fail when applied to high-dimensional data: they are sensitive to initial values, are easily trapped in local optima, scale poorly, and cannot handle large-scale data. Research on high-dimensional data clustering therefore has significant theoretical and practical value.

Clustering high-dimensional data is a very important task in cluster analysis, since many applications need to analyze objects that have a large number of features or dimensions. In such data, many dimensions may be irrelevant to one another, and as the dimensionality increases the data become increasingly sparse, so that pairwise distances between points lose their meaning and the average density of the data becomes very low. When traditional clustering methods are applied to high-dimensional data, two problems arise: (1) high-dimensional data sets contain a large number of irrelevant attributes, so the likelihood that clusters exist across all dimensions is almost zero; (2) data in a high-dimensional space are distributed more sparsely than in a low-dimensional space and the distances between data points are almost equal, so distance is a poor measure.

In data mining, to meet the needs of users in different application fields, researchers have proposed many clustering methods for high-dimensional data, chiefly (1) clustering based on dimensionality reduction; (2) subspace clustering; (3) hypergraph-based clustering; and (4) joint clustering.

Dimensionality reduction maps data points into a lower-dimensional space to obtain a compact representation of the data, which benefits further processing. Different dimensionality reduction methods find low-dimensional representations in different ways, the reduced data approximate the original data to different degrees, and the resulting clustering performance differs accordingly. The main drawback is that no definite criterion is provided for evaluating the quality of the conversion from high to low dimensions; moreover, for very high-dimensional data, the training process of clustering can converge very slowly.

Subspace clustering, also referred to as feature selection, divides the original data space into different subspaces and looks for clusters only in the relevant subspaces. Such algorithms can find clusters of any type and shape in any number of dimensions; in theory the result consists of a set of clusters in different subspaces, can be represented by a disjunctive expression, and does not require the number of dimensions to be determined in advance. The drawback is that, with improper parameter settings, some important clusters may be discarded in the pruning stage, and determining these parameters for a given data set is very difficult.

Hypergraph-based clustering maps the relationships among high-dimensional data onto a hypergraph: each hyperedge expresses a relationship among certain data, and the edge weight indicates the closeness of that relationship. The greatest advantage of this method is that it does not need to compute similarities between high-dimensional data during clustering, so its time complexity is relatively low. Its shortcoming is that the types of data that can be clustered are restricted.

The idea of joint clustering is to first divide the attributes of the data set into several groups, then propose one new attribute to represent each group, and finally cluster the high-dimensional data on these derived attributes. The shortcoming is that the quality of the data clustering depends on the clustering of the attributes, which in turn depends on the data set itself.

Each method has its advantages and drawbacks, and no single algorithm suits all situations; in practice a suitable algorithm should be selected according to the characteristics of the specific problem.

The clonal-optimization particle swarm clustering method for high-dimensional data proposed in this scheme is a search method designed to combine the advantages of dimensionality reduction and subspace clustering. Dimensionality reduction typically uses feature selection or feature transformation to reduce the original high-dimensional data space to a lower-dimensional space, in which traditional clustering methods can then complete the clustering. Feature selection chooses a set of important attributes from all attributes according to the clustering objective or the characteristics of the data set. In general it has two parts: searching over feature subsets, and evaluating each feature subset by some criterion.

Subspace clustering assumes that different clusters exist in different subspaces and aims to extract those clusters effectively. Unlike full-space dimensionality reduction, subspace clustering searches for a corresponding subspace for each cluster. According to the search direction, subspace clustering methods fall into two broad categories: bottom-up subspace search and top-down subspace search.

Bottom-up search methods use the Apriori property from association rule mining and merge adjacent dense cells to form clusters. The CLIQUE algorithm first uses an Apriori-style strategy to search for and merge grid cells whose density exceeds a given threshold, forming candidate subspaces; it then sorts these candidate subspaces by the number of points they cover and uses the minimum description length criterion to prune the subspaces with lower coverage.

Top-down search methods search the subspaces from top to bottom. PROCLUS is the earliest projected clustering algorithm to use a top-down search strategy. It is an algorithm based on center points (medoids) that combines random sampling with a greedy method to select cluster center points, computes a weight for each dimension of each cluster with a discriminant function, adjusts the dimension weights during iteration, and finally finds the classes around these center points. The DOC algorithm uses both a bottom-up grid strategy and a top-down strategy of iteratively improving cluster quality, and proposes a definition of an optimal projective cluster, but its accuracy and running efficiency still need improvement.

The algorithms above represent the main ideas of high-dimensional clustering relevant to this scheme. Feature selection and feature transformation look for all clusters within the same feature subspace, ignoring the fact that in a high-dimensional data space different clusters may have different feature subspaces. Subspace clustering allows different clusters to lie in different subspaces, but its computational complexity is higher.

In conclusion problem of the existing technology is：When traditional clustering method is to high dimensional data clustering, due to higher-dimension There are a large amount of unrelated attributes in data set, it is sparse compared with the data distribution in lower dimensional space so that there are clusters in all dimensions Possibility it is almost nil, cluster when, it is difficult to accomplish Fast Convergent, and ensure that global search is optimal.

The particle swarm algorithm is an optimization algorithm based on swarm intelligence theory. It focuses on searching the full-dimensional space for good class center points and on finding subspaces in which the data are dense and cluster well. The swarm intelligence produced by cooperation and competition among the particles guides the optimizing search, and convergence is fast. Evolutionary theory offers strong recognition, learning, memory, and adaptive capabilities; the cloning operation expands the antibody population space and provides a basis for generating new antibodies. This work, on the one hand, uses the particle swarm algorithm to guide the search direction and achieve fast, effective clustering convergence. On the other hand, it clones the result of each iteration of the particle swarm algorithm, expanding the search result into a larger population space; applies mutations of different degrees to some genes to achieve a finer local search; and then compresses the result back to the original population size through selection, so that the clustering has good global and local search performance.

The main work of this patent is to combine clonal selection with the particle swarm optimization algorithm in cluster analysis. A high-dimensional data clustering model based on the particle swarm algorithm is established and, building on an improved objective function, combined with the clonal selection mechanism to construct a particle swarm dynamic clustering algorithm for data cluster analysis. Unlike existing research, improvements are made to particle mutation and evolution, to the particle swarm encoding, and to the evaluation measure for the feature dimensions of high-dimensional data. This overcomes the sensitivity of traditional clustering algorithms to initial values, improves the stability of high-dimensional data clustering, and provides a technical reference for research on and applications of high-dimensional data clustering.

Summary of the invention

In view of the problems of the prior art, the present invention provides a clonal-optimization particle swarm clustering method for high-dimensional data analysis.

The invention is realized as follows. The clonal-optimization particle swarm clustering method for high-dimensional data analysis generates N particles, adjusts the positions of these N particles, and calculates the corresponding fitness. Each of the N particles is cloned a different number of times according to its antibody-antigen affinity and antibody-antibody similarity. After gene mutation, the cloned antibodies undergo clonal selection: their antibody-antigen affinities are compared with that of the original antibody, and the particle with the highest affinity is retained for the next iteration. This continues until an optimal antibody that captures the antigen is produced or the specified number of iterations is reached.

Further, the clonal-optimization particle swarm clustering method for high-dimensional data analysis includes the following steps:

Step 1, initialize the particle swarm: each sample is randomly assigned to a class as the initial clustering, the cluster center of each class is calculated and used as the position encoding of the initial particle, and the velocity of the particle is initialized; this is repeated N times to generate an initial swarm of N particles;

Step 2, for each particle, calculate the contribution rate of every dimension of each class to that class, take the indices of the s dimensions with the highest contribution rate as the feature dimensions of that class, and calculate the fitness of the particle;

Step 3, for each particle, compare its fitness value with the fitness value of the best position Best_id it has experienced, and update Best_id if the new value is better;

Step 4, for each particle, compare its fitness value with the fitness value of the best position Best_Value experienced by the swarm, and update Best_Value if the new value is better;

Step 5, adjust the velocity and position of the particle;

Step 6, apply k-means clustering to the new individuals;

Step 7, terminate if the termination condition of the algorithm is reached; otherwise go to step 2.

Further, particle initialization and encoding in step 1 include the following: the encoding space is designed to consist of three parts (SUP, CEP, CPV), where SUP denotes the real-coded string of the feature subspace, CEP denotes the real-coded string of the class centers, and CPV denotes the change of the class centers (it records the updated position and is used to adjust global and local consistency). The initial population is generated randomly: SUP_maxnumber (the maximum number of feature dimensions) feature dimensions and CEP_maxnnumber (the maximum number of classes) data objects are selected at random and encoded to form an individual, and this is repeated N_size (the preset size of the initial population) times to complete the generation of the initial population.
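As an illustration of the (SUP, CEP, CPV) encoding, the following Python sketch builds a random initial population. It assumes the data set is a list of equal-length numeric rows; the `Particle` class, the function names, the zero-initialized CPV, and the `seed` argument are hypothetical choices made for this example and are not specified by the method.

```python
import random
from dataclasses import dataclass


@dataclass
class Particle:
    sup: list  # SUP: indices of the selected feature dimensions
    cep: list  # CEP: class centers, one real-valued vector per class
    cpv: list  # CPV: change of each class center (assumed to start at zero)


def init_particle(data, sup_maxnumber, cep_maxnumber, rng):
    """Build one random individual from a list of equal-length rows."""
    n_dims = len(data[0])
    # Randomly pick SUP_maxnumber distinct feature dimensions.
    sup = sorted(rng.sample(range(n_dims), sup_maxnumber))
    # Randomly pick CEP_maxnnumber data objects as initial class centers.
    cep = [list(row) for row in rng.sample(data, cep_maxnumber)]
    cpv = [[0.0] * n_dims for _ in cep]
    return Particle(sup, cep, cpv)


def init_population(data, n_size, sup_maxnumber, cep_maxnumber, seed=0):
    """Repeat the individual construction N_size times."""
    rng = random.Random(seed)
    return [init_particle(data, sup_maxnumber, cep_maxnumber, rng)
            for _ in range(n_size)]
```

Keeping the three parts in one record mirrors the joint encoding: the feature subspace, the center positions, and the center changes travel together through cloning and selection.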

Further, the fitness function in step 2 is calculated from the contribution rate of the feature dimensions to the subspace clustering;

For K subspace classes {A₁,A₂,…,A_k} centered on {C₁,C₂,…,C_k}, each class A_i (i = 1, 2, …, k) is measured, and the contribution rate evaluation function is as follows:

J denotes the feature dimensions contained in the subspace, and the distance term denotes the distance between the j-th dimension of the data points in class A_i and the j-th dimension of the center point. The smaller this value, the more compact class A_i is on feature dimension j; dimension j is then said to contribute more to class A_i, and the value of F_ij is larger. Conversely, dimension j is said to contribute less to class A_i. The sum of the contributions of all feature dimensions of class A_i to A_i is denoted μ_i.

The fitness of a particle is the sum of the fitness values of all K classes.
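A minimal Python sketch of this fitness calculation is given below. The exact form of the contribution rate F_ij is not reproduced in this text, so the sketch assumes F_ij = 1 / (1 + mean |x_j - c_j|), which only preserves the stated property that a smaller per-dimension distance gives a larger contribution. The function names are hypothetical.

```python
def contribution(points, center, j):
    """Assumed contribution rate F_ij: 1 / (1 + mean |x_j - c_j|)."""
    mean_dist = sum(abs(p[j] - center[j]) for p in points) / len(points)
    return 1.0 / (1.0 + mean_dist)


def top_s_dims(points, center, s):
    """Indices of the s dimensions with the highest contribution rate."""
    ranked = sorted(range(len(center)),
                    key=lambda j: contribution(points, center, j),
                    reverse=True)
    return sorted(ranked[:s])


def class_fitness(points, center, feature_dims):
    """mu_i: sum of the contributions of the class's feature dimensions."""
    return sum(contribution(points, center, j) for j in feature_dims)


def particle_fitness(classes, centers, feature_dims):
    """Particle fitness: sum of mu_i over all non-empty classes."""
    return sum(class_fitness(points, center, feature_dims)
               for points, center in zip(classes, centers) if points)
```

Under this assumed form a class that is perfectly compact on a dimension contributes 1 for that dimension, and the contribution decays toward 0 as the spread grows.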

Further, the clonal selection dynamic clustering in step 1 specifically includes:

● initialize the position Z_i = {Z_i1, Z_i2, …, Z_ik} and velocity V_i = {V_i1, V_i2, …, V_ik} of each particle in the swarm;

● While (current iteration number t < T) // loop condition

● For i = 1 to swarm size N // clustering starts

● assign every vector X′_j in X′ to the cluster represented by a cluster center Z_ij according to the minimum-distance principle;

● calculate the fitness value of each particle;

● calculate the number of clones and clone the particle;

● perform clonal selection;

● update the current best solution of each particle;

● update the current best solution of the swarm;

● update the velocity and position of the particle;

● End for

● End while // loop ends

● calculate the clustering effect index;

● output the clustering result;

● end

Further, the clonal-optimization particle swarm clustering method for high-dimensional data analysis includes: initializing the swarm size N, the maximum number of iterations T, the mutation amplitude coefficient λ, the antibody similarity coefficient η, and the data set X to be clustered, as follows:

Further, in step 5 the particles are evaluated and measured according to the formula;

For K subspace classes {A₁,A₂,…,A_k} centered on {C₁,C₂,…,C_k}, each class A_i (i = 1, 2, …, k) is measured, and the contribution rate evaluation function is as follows:

The fitness of a particle is the sum of the fitness values of all K classes.

Further, in step 6 the new generation of particles is clustered with the following k-means procedure:

(1) according to the cluster center encoding of the particle, determine the cluster partition corresponding to the particle by the nearest-neighbor rule;

(2) according to the cluster partition, calculate the new cluster centers, update the fitness value of the particle, and update the original encoded values.
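The two sub-steps can be sketched in Python as a single refinement pass. This is an illustrative sketch that assumes squared Euclidean distance over all dimensions and keeps the old center of a class that receives no members; `kmeans_step` and `squared_distance` are hypothetical names.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_step(data, centers):
    """Nearest-center assignment, then recompute each center as a mean."""
    # (1) nearest-neighbor rule: label each row with its closest center.
    labels = [min(range(len(centers)),
                  key=lambda c: squared_distance(row, centers[c]))
              for row in data]
    # (2) new cluster centers from the resulting partition.
    new_centers = []
    for c, old in enumerate(centers):
        members = [row for row, lab in zip(data, labels) if lab == c]
        if members:
            new_centers.append([sum(col) / len(members)
                                for col in zip(*members)])
        else:
            new_centers.append(list(old))  # empty class: center unchanged
    return labels, new_centers
```

The returned centers would then replace the CEP part of the particle encoding before its fitness is recomputed.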

Further, particles are cloned following the principle that the number of clones is positively correlated with affinity and inversely correlated with concentration. The formula is defined as follows:

Here a is the upper limit on the number of clones, F_{i_Affinity} denotes the affinity, F_similarity denotes the similarity, and β denotes the size of the antibody population; the ratio term denotes the proportion of similar antibodies in the total antibody population. The higher this ratio, the greater the concentration and the smaller the number of clones.
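Since the clone-count formula itself is not reproduced in this text, the following Python sketch only illustrates the stated monotonic behavior. It assumes that concentration is the fraction of antibodies whose similarity reaches the threshold η, and that the clone count scales the upper limit a by relative affinity and by (1 - concentration). Both rules and all function names are assumptions, not the patented formula.

```python
import math


def concentration(similarities, eta):
    """Fraction of the population whose similarity to this antibody >= eta."""
    return sum(1 for s in similarities if s >= eta) / len(similarities)


def clone_count(affinity, max_affinity, conc, a):
    """Assumed rule: a * (relative affinity) * (1 - concentration)."""
    raw = a * (affinity / max_affinity) * (1.0 - conc)
    # Keep at least one clone and never exceed the upper limit a.
    return max(1, min(a, math.ceil(raw)))
```

With a = 10, an antibody of maximal affinity and zero concentration receives 10 clones, while one with half the affinity and a concentration of 0.5 receives 3.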

Further, the particle position and velocity of the clonal selection dynamic clustering algorithm are updated as:

V′_id=wV_id+n₁rand1(P_id-X_id)+n₂rand2(P_gd-X_id)；

X′_id=X_id+V_id；

Parameter selection covers the three coefficients w, n₁, n₂, the maximum velocity V_max, the maximum position X_max, and the swarm size: w takes 0.4 to 0.8; n₁ and n₂ take 1.0 to 2.0; the maximum velocity is V_max = 0.2*X_max; the maximum position is X_max = max(X_i) (the maximum value in each dimension); and the swarm size is N = 20, 30, 40, 50.
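The update rule and parameter ranges can be illustrated with the Python sketch below. It makes three assumptions beyond the text: the velocity is clamped symmetrically to ±V_max, the position is advanced with the freshly updated velocity (as in standard PSO), and the defaults w = 0.6 and n₁ = n₂ = 1.5 are arbitrary picks from the stated ranges. `update_particle` is a hypothetical name.

```python
import random


def update_particle(x, v, p_best, g_best, x_max,
                    w=0.6, n1=1.5, n2=1.5, rng=random):
    """One per-dimension velocity and position update."""
    new_x, new_v = [], []
    for d in range(len(x)):
        v_max = 0.2 * abs(x_max[d])  # V_max = 0.2 * X_max
        vel = (w * v[d]
               + n1 * rng.random() * (p_best[d] - x[d])
               + n2 * rng.random() * (g_best[d] - x[d]))
        vel = max(-v_max, min(v_max, vel))  # assumed symmetric clamp
        new_v.append(vel)
        new_x.append(x[d] + vel)
    return new_x, new_v
```

Here `p_best` plays the role of P_id (the particle's own best position) and `g_best` the role of P_gd (the swarm's best position).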

The data are normalized to X′ = {x′₁, x′₂, …, x′_n}. The value of k at which the clustering effect index I(k) reaches its maximum is taken as the number of clusters, and the corresponding best clustering result for the data set X must also be determined.

The present invention applies particle swarm optimization (PSO) theory to high-dimensional data clustering and uses the optimizing search mechanism of the particle swarm algorithm to perform a guided random search for cluster center vectors in the data set. A swarm of random particles is initialized and the optimal solution is found by iteration. In each iteration, a particle updates its position by tracking two extreme values: one is the best solution found by the particle itself, the individual extremum (p_best); the other is the best solution reached by all particles in the swarm over the generations searched so far, the global extremum (g_best). The approach is distributed and relatively simple, relies on direct or indirect interaction between individuals, and has strong adaptability and robustness.

The present invention improves the particle swarm encoding and the subspace evaluation function. Conventional encoding methods encode only the class center space, whereas this work uses a joint encoding in which the feature selection space, the class center space (corresponding to the particle position in PSO), and the change of the center points (corresponding to the particle velocity in PSO) together form the encoding space. The subspace evaluation is also improved: a fitness function based on the contribution rate of feature dimensions to the subspace clustering is proposed as the evaluation function for subspace clustering, so that, when the clustering results of different subspaces are compared, each clustering result is evaluated together with the feature dimensions its subspace contains.

The present invention applies evolutionary theory to the clustering problem. Building on the improved objective function and combining it with the clonal selection mechanism, each particle is regarded as an antibody and as one way of partitioning the data set to be clustered into clusters, and particles undergo optimization and immune evolution at the same time. During evolution, particles are cloned in proportion to their affinity, clone suppression is applied in inverse proportion to antibody concentration, and local mutation is applied in inverse proportion to affinity.

Data mining and data analysis currently have broad application prospects. Through cloning, the present invention performs global or local search in multiple directions around the same particle and promotes rapid evolution of the particles in the swarm. When solving the clustering problem for high-dimensional data, it not only overcomes the sensitivity of traditional clustering algorithms to initial values but also effectively avoids being trapped in local optima, improving the stability and reliability of the clustering algorithm and accelerating the search over high-dimensional data. Much of the data encountered in daily life and in practice is also high-dimensional, such as biological, image, network, economic, and medical data, and the invention provides a technical reference for the use and analysis of such data. It also has theoretical significance and a positive effect on research into clustering Web data and text and into clustering problems whose within-class patterns have a non-spherical distribution, especially with respect to faster convergence and global optimality.

Description of the drawings

Fig. 1 is a flow chart of the clonal-optimization particle swarm clustering method for high-dimensional data analysis provided by an embodiment of the present invention.

Fig. 2 is a schematic diagram of the clustering results of each algorithm on the wine data set, provided by an embodiment of the present invention.

Fig. 3 is a schematic diagram of the clustering results of each algorithm on the Ionosphere data set, provided by an embodiment of the present invention.

Fig. 4 is a schematic diagram of the clustering results of each algorithm on the spambase data set, provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only illustrate the present invention and are not intended to limit it.

The present invention studies and improves particle mutation and evolution, the particle swarm encoding, and the evaluation measure for the feature dimensions of high-dimensional data. It overcomes the sensitivity of traditional clustering algorithms to initial values, improves the stability of high-dimensional data clustering, and provides a practical theoretical and technical reference for research on high-dimensional data clustering.

The application principle of the present invention is explained in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the clonal-optimization particle swarm clustering method for high-dimensional data analysis provided by an embodiment of the present invention includes the following steps:

S101: Initialize the particle swarm. When a particle is initialized, each sample is randomly assigned to a class as the initial clustering, the cluster center of each class is calculated and used as the position encoding of the initial particle, and the velocity of the particle is initialized; this is repeated N times to generate an initial swarm of N particles;

S102: For each particle, calculate the contribution rate of every dimension of each class to that class, take the indices of the s dimensions with the highest contribution rate as the feature dimensions of that class, and calculate the fitness of the particle;

S103: For each particle, compare its fitness value with the fitness value of the best position Best_id it has experienced, and update Best_id if the new value is better;

S104: For each particle, compare its fitness value with the fitness value of the best position Best_Value experienced by the swarm, and update Best_Value if the new value is better;

S105: Adjust the velocity and position of the particle;

S106: Apply k-means clustering to the new individuals;

S107: Terminate if the termination condition of the algorithm is reached; otherwise go to step S102. When the particle swarm algorithm is used and the new generation of individuals is re-partitioned into classes in S106, empty classes may appear. If an empty cluster appears, another non-empty cluster is picked at random, the pattern vector farthest from its cluster center is taken out and placed into the empty cluster, and this process is repeated until the partition contains no empty cluster.
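The empty-cluster repair described in S107 can be sketched in Python as follows. The sketch assumes that "at random" refers to the choice of the donor cluster, uses squared Euclidean distance, and only draws donors with more than one member so that the move cannot create a new empty cluster; `repair_empty_clusters` is a hypothetical name.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def repair_empty_clusters(data, labels, centers, rng=random):
    """Move the farthest member of a random donor into each empty cluster."""
    labels = list(labels)
    k = len(centers)
    while True:
        sizes = [labels.count(c) for c in range(k)]
        empty = [c for c in range(k) if sizes[c] == 0]
        if not empty:
            return labels
        # Donors need at least two members (an assumption of this sketch).
        donors = [c for c in range(k) if sizes[c] > 1]
        if not donors:
            raise ValueError("fewer data points than clusters")
        donor = rng.choice(donors)
        members = [i for i, lab in enumerate(labels) if lab == donor]
        farthest = max(members,
                       key=lambda i: squared_distance(data[i], centers[donor]))
        labels[farthest] = empty[0]
```

Each pass fills exactly one empty cluster, so the loop ends after at most k - 1 moves.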

In step S105, the velocity and position of the particle are adjusted according to the formula;

The fitness of a particle is expressed as the sum of the fitness values of all K classes:

The fitness of the whole particle (feature subspace) is thus the sum of the fitness values of the k classes.

In step S106, the new generation of particles is clustered with the following k-means procedure: a, according to the cluster center encoding of the particle, determine the cluster partition corresponding to the particle by the nearest-neighbor rule; b, according to the cluster partition, calculate the new cluster centers, update the fitness value of the particle, and update the original encoded values. Because k-means has a strong local search ability, the convergence speed of the particle swarm improves greatly once the k-means clustering idea is introduced.

For the data to be clustered, the clonal-optimization particle swarm clustering method provided by an embodiment of the present invention starts with the number of clusters set to k = 2 and increments it up to an upper limit, finds the optimal cluster centers for each value of k, and finally determines the number of clusters and the corresponding cluster centers from the clustering effect index I(k) under each k value.

First, initialize the swarm size N, the maximum number of iterations T, the mutation amplitude coefficient λ, the antibody similarity coefficient η, and the data set X to be clustered, as follows:

The application principle of the present invention is further described below with reference to a specific embodiment.

The clonal-optimization particle swarm clustering method for high-dimensional data analysis provided by an embodiment of the present invention includes the following steps:

1. Particle initialization and encoding

The encoding space is designed to consist of three parts (SUP, CEP, CPV), where SUP denotes the real-coded string of the feature subspace, CEP denotes the real-coded string of the class centers, and CPV denotes the change of the class centers (it records the updated position and is used to adjust global and local consistency). The initial population is generated randomly: SUP_maxnumber (the maximum number of feature dimensions) feature dimensions and CEP_maxnnumber (the maximum number of classes) data objects are selected at random and encoded to form an individual, and this is repeated N_size (the preset size of the initial population) times to complete the generation of the initial population.

2. Fitness function calculation, expressed through the contribution rate of the feature dimensions to the subspace clustering;

J denotes the feature dimensions contained in the subspace, and the distance term denotes the distance between the j-th dimension of the data points in class A_i and the j-th dimension of the center point. The smaller this value, the more compact class A_i is on feature dimension j; dimension j is then said to contribute more to class A_i, and the value of F_ij is larger. Conversely, dimension j is said to contribute less to class A_i.

The sum of the contributions of all feature dimensions of class A_i to A_i is denoted μ_i.

The fitness of a particle is the sum of the fitness values of all K classes.

3. Clonal selection. When the particle swarm algorithm updates a particle position, it updates every component of the position vector, so it can easily skip over a better or optimal position. Therefore, in clonal selection dynamic clustering, after each iteration of the particle swarm algorithm a gene mutation operation is applied, according to the formula, to some of the gene fragments of the new particle (that is, the antibody), where each gene fragment corresponds to one cluster center. This strengthens the dynamic local search around the current position of the particle.
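The mutation formula is not reproduced in this text, so the Python sketch below only illustrates the idea of perturbing a subset of gene fragments with an amplitude that shrinks as affinity grows. It assumes Gaussian noise scaled by λ · (1 - normalized affinity) and a per-fragment mutation probability `rate`; both choices, and the name `mutate_centers`, are assumptions of this example.

```python
import random


def mutate_centers(centers, norm_affinity, lam, rate=0.5, rng=random):
    """Perturb some cluster centers; higher affinity gives smaller steps."""
    # Assumed amplitude, with norm_affinity expected in [0, 1].
    scale = lam * (1.0 - norm_affinity)
    mutated = []
    for center in centers:
        if rng.random() < rate:
            # This gene fragment (one cluster center) is mutated.
            mutated.append([c + rng.gauss(0.0, 1.0) * scale for c in center])
        else:
            mutated.append(list(center))
    return mutated
```

Because only some fragments move, a clone can improve one cluster center while leaving the others where the swarm update placed them.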

Cloning is defined as follows:

The number of clones of a particle is determined by affinity and concentration: the higher the affinity, the larger the number of clones, and the higher the antibody concentration, the smaller the number of clones; that is, cloning is positively correlated with affinity and inversely correlated with concentration. The formula is defined as follows:

4. Particle position and velocity

V′_id=wV_id+n₁rand1(P_id-X_id)+n₂rand2(P_gd-X_id)；

X′_id=X_id+V_id；

Parameter selection covers the three coefficients w, n₁, n₂, the maximum velocity V_max, the maximum position X_max, and the swarm size: in the planned experiments w takes 0.4 to 0.8 and n₁, n₂ take 1.0 to 2.0; the maximum velocity is V_max = 0.2*X_max; the maximum position is X_max = max(X_i) (the maximum value in each dimension); and the planned swarm sizes are N = 20, 30, 40, 50.

5. Determining the iteration termination criterion

The termination condition of the algorithm is usually chosen from the following three criteria:

(1) The fitness of the best individual reaches a given threshold.

(2) The number of iterations reaches a preset maximum.

(3) The fitness of the solutions found during the search no longer changes significantly over several consecutive generations.
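The three criteria can be combined into one check, sketched below in Python. The sketch assumes that larger fitness is better and that "no significant change" means the best fitness varied by at most `tol` over the last `patience` generations; the function name and both parameters are hypothetical.

```python
def should_stop(history, iteration, max_iterations,
                target=None, patience=10, tol=1e-6):
    """history: best fitness per generation, most recent last."""
    if target is not None and history and history[-1] >= target:
        return True  # (1) fitness threshold reached
    if iteration >= max_iterations:
        return True  # (2) iteration budget exhausted
    if len(history) > patience:
        recent = history[-(patience + 1):]
        if max(recent) - min(recent) <= tol:
            return True  # (3) fitness has stalled
    return False
```

For example, `should_stop([0.5] * 4, iteration=4, max_iterations=100, patience=3)` returns `True` because the best fitness did not move over the last three generations.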

6. The clonal selection dynamic clustering algorithm is summarized as follows:

The clonal selection dynamic clustering algorithm has parameters such as velocity and position, and every update of every particle is carried out through its velocity and position.

The application effect of the present invention is explained in detail below with reference to experiments.

To verify the feasibility and effectiveness of the present invention, it is analysed and compared experimentally. The comparison applies the classical k-means algorithm and the classical subspace clustering algorithm PROCLUS, and compares them with the particle swarm high-dimensional clustering algorithm with cloning developed in this project (referred to in the experiments as Clone_POS_Cluster).

Data set selection: to verify whether the algorithm is effective for clustering high-dimensional data, and to ensure its practicality, two groups of data were chosen: real application data and classical machine learning data. The real application data come from the official Shibor data on the interbank short-term lending rate (official website http://www.shibor.org); 1500 trading days were chosen, giving 9 groups of data in total. The classical machine learning data come from the UCI data sets (http://archive.ics.uci.edu/ml/), from which three data sets were selected: the wine data set, the Ionosphere data set and the spambase data set.

The clustering results are compared using the following three indices:

1) Purity: another measure of the extent to which a cluster obtained by running the algorithm contains objects of a single original class:

The larger the purity, the closer the clustering result obtained by the algorithm is to the known ground truth, and the better the clustering effect.
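A sketch of how purity could be computed is given below. It assumes the usual definition, in which each cluster is credited with the size of its largest single-class group and the total is divided by the number of objects; the exact formula used in the experiments is not reproduced in the text, and the function name `purity` is illustrative.

```python
from collections import Counter

def purity(true_classes, cluster_labels):
    """Fraction of objects that fall in the majority class of their cluster."""
    members = {}
    for cls, clu in zip(true_classes, cluster_labels):
        members.setdefault(clu, []).append(cls)
    # For each cluster, count the objects of its most frequent class.
    majority_total = sum(Counter(classes).most_common(1)[0][1]
                         for classes in members.values())
    return majority_total / len(true_classes)
```

For example, four objects of classes a, a, b, b placed in clusters 0, 0, 0, 1 give a purity of 3/4.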

2) RI: the Rand statistic is a measure of cluster validity that uses the agreement between the ideal cluster similarity matrix and the ideal class similarity matrix. In the ideal cluster similarity matrix, entry ij is 1 if objects i and j are in the same cluster and 0 otherwise. In the ideal class similarity matrix, entry ij is 1 if objects i and j are in the same class and 0 otherwise. The Rand statistic can be calculated as follows:

Rand = (f00 + f11) / (f00 + f01 + f10 + f11)

where

f00 = the number of object pairs with different classes and different clusters;

f01 = the number of object pairs with different classes and the same cluster;

f10 = the number of object pairs with the same class and different clusters;

f11 = the number of object pairs with the same class and the same cluster.

As the formula shows, the larger the Rand statistic, the closer the clustering result obtained by the algorithm is to the known ground truth, and the better the clustering effect.
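The pair counts f00, f01, f10 and f11 can be obtained by enumerating all object pairs, as in the sketch below. It assumes the usual form of the Rand statistic, (f00 + f11) divided by the total number of pairs; the function name is illustrative, and at least two objects are required.

```python
from itertools import combinations

def rand_statistic(true_classes, cluster_labels):
    """Rand statistic from the four pair counts defined in the text."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(true_classes)), 2):
        same_class = true_classes[i] == true_classes[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            f11 += 1      # same class, same cluster
        elif same_class:
            f10 += 1      # same class, different clusters
        elif same_cluster:
            f01 += 1      # different classes, same cluster
        else:
            f00 += 1      # different classes, different clusters
    return (f00 + f11) / (f00 + f01 + f10 + f11)
```

The enumeration is quadratic in the number of objects, which is acceptable for data sets of the size used here.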

3) Error_degree (error rate): let the number of data points in the original data be T and the number of data points in class i be T_i. After clustering, class i₁ is obtained, corresponding to class i of the original data, and the number of data points in class i₁ that belong to the original class i is counted. The error rate of class i is then:

Let the number of data points misassigned in each class after clustering (those that belong to class i₁ but not to class i) be T₁'. The total error rate is then:

The k-means algorithm, the PROCLUS_clustering algorithm and the clone particle swarm high-dimensional clustering algorithm of this project (Clone_POS_Cluster) are each run on the two groups of data sets, and the three cluster validity indices above are computed. The specific experimental parameters, results and analysis are as follows:

When running the PROCLUS_clustering algorithm, the relevant parameters must be set: crossover probability P_c, mutation probability P_m, number of selected feature dimensions max_fnum, number of clusters max_cnum, population size popsize and maximum number of iterations max_gen. The settings depend on the specific data set; the specific experimental parameters are given in Table 1:

Table 1. Experimental parameter settings of the PROCLUS_clustering algorithm

When running the particle swarm high-dimensional clustering algorithm, the relevant parameters must also be set: w generally takes 0.4 to 0.8; n1 and n2 generally take 1.0 to 2.0; the maximum velocity is Vmax = 0.2*Xmax; the maximum position is Xmax = max(Xi), the maximum value in each dimension; and the population size is N = 20, 30, 40. These settings also depend on the specific data; the specific parameters are given in Table 2:

Table 2. Parameters of each PROCLUS_clustering experimental setting

(1) Table 3 gives the feature subspaces on which the three algorithms, k-means, PROCLUS_clustering and the clone particle swarm high-dimensional clustering algorithm (Clone_POS_Cluster), cluster the wine data set. Fig. 2 shows the clustering results of the three algorithms on the wine data set.

Table 3. Feature subspaces on which each algorithm clusters the wine data set

As can be seen from Fig. 2, on the wine data set the error_degree value of the particle swarm high-dimensional clustering algorithm is the smallest, while its purity and RI both rank second. The PROCLUS_clustering algorithm ranks second in error_degree and has the largest purity and RI values. The k-means algorithm performs worst. As can be seen from Table 3, the optimal solutions found by the PROCLUS_clustering algorithm and the particle swarm high-dimensional clustering algorithm (Clone_POS_Cluster) both have 8 dimensions, except that the dimensions selected by the particle swarm algorithm differ from class to class. The dimensions selected in common by the two algorithms are 3, 4, 5 and 11, so these dimensions can be regarded as important among all dimensions. The optimal solution found by the k-means algorithm uses the full dimensionality, i.e. 13 dimensions. The table and figure above show that the clustering result of the clone particle swarm high-dimensional clustering algorithm has the lowest error rate, a good clustering effect and the lowest feature subspace dimension. Among the three algorithms, the solutions obtained by the PROCLUS_clustering algorithm and the particle swarm high-dimensional clustering algorithm (Clone_POS_Cluster) are the best. This also shows that the particle swarm high-dimensional algorithm can reduce the influence of the "curse of dimensionality" to a certain extent and is effective for clustering high-dimensional data.

(2) Table 4 gives the feature subspaces on which the k-means algorithm, the PROCLUS_clustering algorithm and the clone particle swarm high-dimensional clustering algorithm (Clone_POS_Cluster) cluster the Ionosphere data set. Fig. 3 shows the clustering results of the three algorithms on the Ionosphere data set.

Table 4. Feature subspaces on which each algorithm clusters the Ionosphere data set

Table 4 and Fig. 3 show that, on the Ionosphere data set, the solution obtained by the clone particle swarm high-dimensional clustering algorithm (Clone_POS_Cluster) has the largest purity value, the largest RI value and the smallest error_degree value. The solution obtained by the k-means algorithm comes next, and the solution obtained by the PROCLUS_clustering algorithm is the worst. Both the PROCLUS_clustering algorithm and the clone particle swarm high-dimensional clustering algorithm reduce the dimensionality substantially, but the Clone_POS_Cluster algorithm has a lower error rate and higher purity and RI, so its effect is better. This particle swarm high-dimensional data clustering algorithm is therefore effective for dimensionality reduction and can be used for clustering high-dimensional data.

(3) Table 5 gives the feature subspaces on which the k-means algorithm, the PROCLUS_clustering algorithm and the clone particle swarm high-dimensional clustering algorithm (Clone_POS_Cluster) cluster the spambase data set. Fig. 4 shows the clustering results of the three algorithms on the spambase data set.

Table 5. Feature subspaces on which each algorithm clusters the spambase data set

Table 5 and Fig. 4 show that, on the spambase data set, the solution obtained by the particle swarm high-dimensional clustering algorithm has the largest purity value, the largest RI value and the smallest error_degree value, so its effect is the best. The solution obtained by the PROCLUS_clustering algorithm comes next, and the solution obtained by the k-means algorithm is the worst. The PROCLUS_clustering algorithm and the particle swarm high-dimensional clustering algorithm both show their advantage on high-dimensional data. Because the subspace feature dimensions of each class can differ in the particle swarm high-dimensional clustering algorithm, it can cluster more accurately, and its clustering effect is better than that of the PROCLUS_clustering algorithm.

The cluster feature subspace of the k-means algorithm has the full 57 dimensions. The solution obtained by the PROCLUS_clustering algorithm has the same 13 dimensions for every class, while the solution of the clone particle swarm high-dimensional clustering algorithm has 13 dimensions that differ from class to class. By comparison, the Clone_POS_Cluster and PROCLUS_clustering algorithms both greatly reduce the dimensionality of the data set, yet the Clone_POS_Cluster algorithm retains a better clustering effect. From this point of view, the solution obtained by the particle swarm high-dimensional clustering algorithm is better than the solutions obtained by the k-means and PROCLUS_clustering algorithms. The "curse of dimensionality" is caused by high data dimensionality, so, provided the clustering effect is maintained, the lower the dimensionality of the data the better. This experiment further demonstrates the feasibility and effectiveness of the Clone_POS_Cluster algorithm for clustering high-dimensional data.

The experimental results of the three algorithms above are therefore summarized in the following table:

Table 6. Summary and comparison of the experimental results of the different algorithms

The two figures above show that the studied clone particle swarm high-dimensional clustering algorithm (Clone_POS_Cluster) does not perform as well as the other two algorithms on the 13-dimensional wine data set, but achieves good results on the 34-dimensional Ionosphere data set and the 57-dimensional spambase data set. The Clone_POS_Cluster algorithm greatly reduces the dimensionality of the data set while maintaining the same clustering effect. From this point of view, the solution obtained by the particle swarm high-dimensional clustering algorithm is better than that obtained by the k-means algorithm and slightly better than that of the PROCLUS_clustering algorithm.

In conclusion, on both artificial data and real data, the evaluation of the experimental results by the three indices Purity, RI and error_degree shows that the proposed algorithm is effective for clustering high-dimensional data and can reduce the influence of the "curse of dimensionality" to a certain extent.

The foregoing is merely illustrative of the preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A clone-optimized particle swarm clustering method for high-dimensional data analysis, characterized in that the method generates N particles, adjusts the positions of the particles, and performs dynamic selection; for the N particles, the "antibody-antibody" similarity and the "antibody-antigen" affinity are measured, and clonal selection among the different particles is carried out according to the measurement results; a specific mutation operation is applied to the particles, the antibody-antigen affinity is measured and compared with that of the original antibody, the particle with the highest affinity is retained through clonal selection, and the velocity and position of each particle in the swarm are dynamically updated before the next iteration begins; the method ends when the swarm of optimal antibodies is finally output or when the given number of iterations is reached;

The "antibody-antibody" similarity function F_similarity represents the distance between i and j in n-dimensional space; the smaller the distance, the greater the similarity:

The "antibody-antigen" affinity function F_{i_Affinity} of the i-th clustering partition is expressed as follows:

where, for the given data set, M is a constant denoting the affinity coefficient, D(x_i, y_i) denotes the distance from x_i in the data set to be clustered to the centre point of its data set, and w_i denotes the weighting factor of the i-th feature attribute, with the weights of all feature attributes summing to 1.

2. The clone dynamic optimization particle swarm clustering method for high-dimensional data analysis according to claim 1, characterized in that the method comprises the following steps:

Step 1: initialize the particle swarm sample; classes are assigned at random, the cluster subgroups are initialized, and the particle velocities are initialized; each swarm is measured and taken as the position encoding of an initial particle; this is repeated N times to generate a swarm of N initial particles;

Step 2: evaluate the swarm based on contribution rate; specifically, the contribution rate of each dimension in each class to that class is calculated; the indices of the m features with the highest contribution rate are taken as the selected feature dimensions, which also gives the fitness of the particle;

Step 3: for each particle, compare its fitness value with the fitness value of the best position Best_id it has experienced; if it is better, update Best_id;

Step 4: for each particle, compare its fitness value with the fitness value Best_Value of the best position experienced by the swarm; if it is better, update Best_Value;

Step 5: adjust the velocity and position of the particles;

Step 6: perform k-means clustering on the new individuals.
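Steps 3 and 4 above amount to the usual personal-best and swarm-best bookkeeping, sketched below. The function name `update_bests` and the list-based layout are hypothetical, and the sketch assumes that a larger fitness value is better.

```python
def update_bests(positions, fitnesses, best_pos, best_fit, g_pos, g_fit):
    """Update per-particle bests in place and return the new swarm best."""
    for i, (pos, fit) in enumerate(zip(positions, fitnesses)):
        if fit > best_fit[i]:      # Step 3: best position of particle i
            best_fit[i] = fit
            best_pos[i] = list(pos)
        if fit > g_fit:            # Step 4: best position of the swarm
            g_fit = fit
            g_pos = list(pos)
    return g_pos, g_fit
```

Copying positions with `list(pos)` keeps the stored bests from changing when the particles move in Step 5.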

3. The clone-optimized particle swarm clustering method for high-dimensional data analysis according to claim 2, characterized in that the particle encoding in step 1 uses a "constraint-based joint encoding mechanism"; general encoding methods mostly focus on the space of class centre points and use a real-number encoding scheme based on cluster centres; the encoding space consists of three parts, SUP, CEP and CPV, where SUP denotes the real-coded string of the feature subspace, CEP denotes the real-coded string of the class centres, and CPV denotes the degree of change of the class centres; the quantized values are encoded into strings according to their respective value ranges, and under the constraints the code length is effectively shortened, which prevents a sharp increase in particle length from severely degrading running performance; the initial population is generated randomly: SUP_maxnumber (the maximum number of feature dimensions) feature dimensions and CEP_maxnnumber (the maximum number of classes) data objects are selected at random and encoded to form an individual, and this is repeated N_size times, N_size being the preset size of the initial population, which completes the generation of the initial population.
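The random generation of the initial population can be sketched as follows. This is only an illustration: the dictionary layout, the function names and the parameter names `sup_maxnumber`, `cep_maxnumber` and `n_size` are assumptions, the CPV part is omitted because its construction is not specified, and the constraints that shorten the code length are not modelled.

```python
import random

def random_individual(data, sup_maxnumber, cep_maxnumber, rng=random):
    """Pick feature dimensions (SUP) and data objects as class centres (CEP)."""
    n_features = len(data[0])
    # SUP: indices of the randomly selected feature dimensions.
    sup = sorted(rng.sample(range(n_features), sup_maxnumber))
    # CEP: randomly selected data objects used as initial class centres.
    chosen = rng.sample(range(len(data)), cep_maxnumber)
    cep = [list(data[i]) for i in chosen]
    return {"SUP": sup, "CEP": cep}

def initial_population(data, sup_maxnumber, cep_maxnumber, n_size, rng=random):
    """Repeat the random construction n_size times."""
    return [random_individual(data, sup_maxnumber, cep_maxnumber, rng)
            for _ in range(n_size)]
```

Sampling without replacement ensures that no feature dimension or class centre is chosen twice within one individual.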

4. The clone-optimized particle swarm clustering method for high-dimensional data analysis according to claim 2, characterized in that the fitness function in step 2 is calculated by expressing the contribution rate of the feature dimensions to the subspace clustering;

For the K subspace classes {A₁,A₂,…A_k} centred on {C₁,C₂,…C_k}, each subclass A_i (i = 1, 2, ..., k) is measured; the contribution rate evaluation function is as follows:

where J denotes the number of feature dimensions contained in the subspace, and the distance term denotes the distance between the j-th dimension of the data points of class A_i and the j-th dimension of the centre point; the smaller this value, the more compact class A_i is on feature dimension j, that is, the larger the contribution of dimension j to class A_i and the larger the value of F_ij; conversely, the contribution of dimension j to class A_i is said to be small; the sum of the contributions of all feature dimensions in class A_i to A_i is denoted μ_i:

The sum of the fitness values of all K classes gives the fitness of the whole particle:

5. The clone-optimized particle swarm clustering method for high-dimensional data analysis according to claim 2, characterized in that the clonal selection dynamic clustering process in step 1 specifically comprises:

Initialize the position Z_i = {Z_i1, Z_i2, ..., Z_ik} and velocity V_i = {V_i1, V_i2, ..., V_ik} of each particle in the swarm;

While (current iteration number t < T) // loop condition

For i = 1 to number of particles N // clustering starts

By the minimum distance principle, assign every vector X'_j in X' to the class represented by a cluster centre Z_ij;

Calculate the fitness value of each particle;

Calculate the number of clones and clone the particle;

Perform clonal selection on the data;

Update the current best solution of each particle;

Update the current best solution of the swarm;

Update the velocity and position of the particle;

End for

End while // loop ends

Calculate the clustering effect indices;

Output the clustering result;

End.
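The assignment step in the loop above (the minimum distance principle) can be sketched as follows. Euclidean distance is assumed here, since the claim does not name the distance; the function name is illustrative.

```python
def assign_to_nearest(vectors, centres):
    """Return, for each vector, the index of its closest cluster centre."""
    def sq_dist(a, b):
        # Squared Euclidean distance; the square root does not change
        # which centre is closest.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(range(len(centres)), key=lambda k: sq_dist(x, centres[k]))
            for x in vectors]
```

When a vector is equally close to two centres, this sketch assigns it to the one with the lower index.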

6. The clone-optimized particle swarm clustering method for high-dimensional data analysis according to claim 5, characterized in that the particle attributes updated by the clonal selection dynamic clustering algorithm are position and velocity:

V'_id=wV_id+n₁rand1(P_id-X_id)+n₂rand2(P_gd-X_id);

X'_id=X_id+V_id;

The dynamic update involves three parameters: w, n₁ and n₂; with regard to particle velocity and position, V_max denotes the maximum velocity and X_max denotes the maximum position;

The reference value range of w is 0.5 to 0.9;

n₁ and n₂ may take reference values from 1.0 to 2.0;

Reference for the dynamic mapping principle: maximum velocity V_max = 0.3*X_max; maximum position X_max = max(X_i), the maximum value in each dimension;

Reference population sizes: N = 10, 20, 30, 40, 50.

7. The clonal variation optimization particle swarm clustering method for high-dimensional data analysis according to claim 2, characterized in that cloning is defined as follows:

The number of clones of a particle is determined by its affinity and concentration: the higher the affinity, the larger the clone number; the higher the antibody concentration, the smaller the clone number; cloning is thus positively correlated with affinity and inversely correlated with concentration; the formula is defined as follows:

where a is the upper limit on the number of clones, F_{i_Affinity} denotes the affinity, F_similarity denotes the similarity, β denotes the size of the antibody population, and the ratio term denotes the proportion of the total antibody population accounted for by a group of similar antibodies; the higher this proportion, the greater the concentration and the smaller the clone number.

8. The clone-optimized particle swarm clustering method for high-dimensional data analysis according to claim 2, characterized in that the new generation of particles in step 6 is clustered according to the following k-means algorithm:

9. The clone-optimized particle swarm clustering method for high-dimensional data analysis according to claim 1, characterized in that the method comprises: initializing the population size N, the maximum number of iterations T, the mutation amplitude coefficient λ, the antibody similarity coefficient η, and the clustering data set X, as follows:

X is normalized to X'={x'₁,x'₂,…x'_n}; the value of k at which the clustering effect index I(k) reaches its maximum is taken as the number of clusters; the best clustering result corresponding to the classified data set X must also be determined.