CN104991974A - Particle swarm algorithm-based multi-label classification method - Google Patents

Particle swarm algorithm-based multi-label classification method

Info

Publication number
CN104991974A
CN104991974A (application CN201510464344.2A)
Authority
CN
China
Prior art keywords
particle
sample
algorithm
distance
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510464344.2A
Other languages
Chinese (zh)
Inventor
梁庆中
樊媛媛
姚宏
颜雪松
胡成玉
曾德泽
刘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN201510464344.2A priority Critical patent/CN104991974A/en
Publication of CN104991974A publication Critical patent/CN104991974A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention provides a particle swarm algorithm-based multi-label classification method comprising an optimization stage and a classification stage. In the optimization stage, a particle swarm algorithm optimizes the feature weights of a feature-weighted KNN algorithm. In the classification stage, the feature weights obtained in the optimization stage are applied in the feature-weighted KNN algorithm to classify a test sample X; finally, the labels of all samples in the test set are output, completing the classification. The method finds an optimal set of feature weights that eliminates redundant or irrelevant features (the attribute values used when computing distances) in the data set, thereby reducing distance deviation and improving classification accuracy.

Description

A multi-label classification method based on a particle swarm algorithm
Technical field
The invention belongs to the technical field of multi-label classification, and specifically relates to a multi-label classification method based on a particle swarm algorithm.
Background technology
Research on multi-label classification was originally driven by text classification, and today many practical applications are multi-label problems, for example scene classification, protein function analysis, and film and music categorization. Each sample in a multi-label data set carries multiple labels, and how to formulate and solve such an optimization problem is the central issue. Although implementing such algorithms involves a certain difficulty, their advantage is that they neither change the structure of the data set nor destroy the associations between classes, reflecting the special nature of multi-label classification. Depending on how the optimization problem is formulated, the algorithm takes several different forms, e.g. multi-label algorithms based on AdaBoost, multi-label extensions of decision trees, multi-label support vector machines, the multi-label k-nearest-neighbor (KNN) algorithm, and multi-label maximum-entropy algorithms. However, all of these algorithms can suffer from calculation errors caused by redundant or irrelevant feature values.
Summary of the invention
One object of the invention is to overcome the defects of the prior art by providing a high-accuracy multi-label classification method based on a particle swarm algorithm.
The multi-label classification method based on a particle swarm algorithm provided by the invention comprises an optimization stage and a classification stage:
S10: in the optimization stage, a particle swarm algorithm is used to optimize the feature weights of a feature-weighted KNN algorithm, specifically comprising the following steps:
S11: initialize the particle swarm with a random method; the position and velocity of each particle have dimension n, and a particle's position corresponds to the feature weight vector w = (w_1, w_2, ..., w_n) of the records in the data set, where
Σ_{i=1}^{n} w_i = 1;
S12: compute the fitness value, and from it obtain the local optima and the global optimum:
When computing the fitness, the position of a particle is applied as the feature weights of the feature-weighted KNN algorithm. The first 70% of the original training set serves as a new training set and the last 30% as a new prediction set; the prediction set is classified and the classification accuracy is computed. The higher the accuracy, the better the fitness;
Let the original labels of each record in the prediction set be l_i = (l_i1, l_i2, ..., l_in) and the predicted labels after classification be l_j = (l_j1, l_j2, ..., l_jn); let sum be the number of positions in which l_i and l_j coincide; then the accuracy is Accuracy = sum/n;
S20: classification stage:
The feature weights obtained in the optimization stage are applied in the feature-weighted KNN algorithm to classify a test sample X, and finally the labels of all samples in the test set are output, completing the classification.
Further, the particle swarm algorithm comprises the following steps:
SA1: initialize the particle swarm, including the position x_i = (x_i1, x_i2, ..., x_id)^T and velocity v_i = (v_i1, v_i2, ..., v_id)^T of the whole swarm as well as the local optima and the global optimum, where the subscript id denotes the d-th dimension of the i-th particle.
SA2: compute the fitness value fitness_i = f(x_i) of each particle at its current position. Then, according to the fitness values, initialize the local optima pbest_i = fitness_i and the global optimum gbest = min(fitness_1, fitness_2, ..., fitness_N), i = 1, 2, ..., N;
SA3: in each iteration, every particle updates its position and velocity according to the following rule:
v_id(t+1) = w·v_id(t) + c1·r1·(p_ld - x_id(t)) + c2·r2·(p_gd - x_id(t))
x_id(t+1) = x_id(t) + v_id(t+1)
where v_id is the velocity of the particle, x_id its position, w the inertia weight, c1 and c2 the acceleration factors, r1 and r2 random numbers, p_ld the local optimum and p_gd the global optimum;
SA4: update the local optima pbest_i and the global optimum gbest;
SA5: if the global optimum gbest reaches the set threshold or the maximum number of iterations has been reached, the algorithm stops; otherwise jump to step SA3.
Further, the feature-weighted KNN algorithm specifically comprises the following steps:
SB1: input m training samples and set the value of k;
SB2: randomly select samples A[1] to A[k] from the training set as the initial k nearest neighbors of the sample X to be predicted;
SB3: compute the weighted Euclidean distance wd(X, A[i]) (i = 1, 2, ..., k) between the sample X to be predicted and each of the initial k nearest neighbors, the distance formula being:
wd(X, A[i]) = sqrt( Σ_{l=1}^{n} w_l·(X_l - A[i]_l)^2 ),
where n is the number of attributes of sample A[i], i.e. A[i] = (A[i]_1, A[i]_2, A[i]_3, ..., A[i]_n);
SB4: sort the distances wd(X, A[i]) obtained in step SB3 in ascending order, and find the maximum distance maxD = max{wd(X, A[i]) | i = 1, 2, ..., k};
SB5: compute in turn the distance between each remaining record in the training set and the sample X to be tested and compare it with the maximum distance maxD found in step SB4; if it is smaller than maxD, update maxD to the distance between that record and X, and re-sort the distances wd(X, A[i]) in ascending order;
SB6: count the occurrences of the labels of every record in the current distance ranking wd(X, A[i]), and sort the labels by occurrence count;
SB7: take the top L labels of the ranking obtained in step SB6 as the labels of sample X.
The beneficial effect of the invention is that the method finds optimal feature weights to eliminate redundant or irrelevant features (the attribute values used when computing distances) in the data set, thereby reducing distance deviation and improving classification accuracy.
Accompanying drawing explanation
Figure 1 is a flow chart of the particle swarm algorithm-based multi-label classification method of the invention.
Embodiment
The invention will now be described in detail with reference to specific embodiments. Note that the technical features, and combinations of technical features, described in the following embodiments should not be considered in isolation; they can be combined with one another to achieve better technical effects.
As shown in Figure 1, the multi-label classification method based on a particle swarm algorithm provided by the invention comprises an optimization stage and a classification stage:
In the optimization stage, the Particle Swarm Optimization (PSO) algorithm is used to optimize the feature weights of the feature-weighted KNN algorithm. The specific steps are as follows:
S10: in the optimization stage, a particle swarm algorithm is used to optimize the feature weights of a feature-weighted KNN algorithm, specifically comprising the following steps:
S11: initialize the particle swarm with a random method; the position and velocity of each particle have dimension n, and a particle's position corresponds to the feature weight vector w = (w_1, w_2, ..., w_n) of the records in the data set, where
Σ_{i=1}^{n} w_i = 1;
S12: compute the fitness value, and from it obtain the local optima and the global optimum:
When computing the fitness, the position of a particle is applied as the feature weights of the feature-weighted KNN algorithm. The first 70% of the original training set serves as a new training set and the last 30% as a new prediction set; the prediction set is classified and the classification accuracy is computed. The higher the accuracy, the better the fitness;
Let the original labels of each record in the prediction set be l_i = (l_i1, l_i2, ..., l_in) and the predicted labels after classification be l_j = (l_j1, l_j2, ..., l_jn); let sum be the number of positions in which l_i and l_j coincide; then the accuracy is Accuracy = sum/n;
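The accuracy-based fitness of step S12 can be sketched as follows. This is a minimal illustration rather than the patented implementation; it assumes each record's labels are given as equal-length vectors, and the helper names are our own:

```python
def record_accuracy(original, predicted):
    """Accuracy of one record: the number of label positions where the
    original vector l_i and the predicted vector l_j coincide (sum),
    divided by the number of labels n (Accuracy = sum/n)."""
    n = len(original)
    overlap = sum(1 for a, b in zip(original, predicted) if a == b)
    return overlap / n

def fitness(original_set, predicted_set):
    """Fitness of a particle: mean accuracy over all records of the
    prediction set; a higher value means a better feature-weight vector."""
    accuracies = [record_accuracy(o, p) for o, p in zip(original_set, predicted_set)]
    return sum(accuracies) / len(accuracies)
```

For example, with original labels (1, 0, 1) and prediction (1, 0, 0), two of the three positions coincide, so the record accuracy is 2/3.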
S20: classification stage:
The feature weights obtained in the optimization stage are applied in the feature-weighted KNN algorithm to classify a test sample X, and finally the labels of all samples in the test set are output, completing the classification.
The particle swarm algorithm is a kind of evolutionary algorithm and an iteration-based optimization algorithm: the system is initialized with a set of random solutions and searches for the optimum through iteration. Unlike other evolutionary algorithms it uses neither crossover nor mutation; instead, the particles search the solution space by following the current optimal particle. The advantages of PSO are that it is simple to implement and has few parameters to tune. Each generation of the swarm contains many particles, and each particle has two variables, a position x and a velocity v. In every new generation one particle has the best position; that particle is the local optimum pbest_i of that generation, and from the local optima the global optimum gbest is derived.
The particle swarm algorithm comprises the following steps:
SA1: initialize the particle swarm, including the position x_i = (x_i1, x_i2, ..., x_id)^T and velocity v_i = (v_i1, v_i2, ..., v_id)^T of the whole swarm as well as the local optima and the global optimum, where the subscript id denotes the d-th dimension of the i-th particle.
SA2: compute the fitness value fitness_i = f(x_i) of each particle at its current position. Then, according to the fitness values, initialize the local optima pbest_i = fitness_i and the global optimum gbest = min(fitness_1, fitness_2, ..., fitness_N), i = 1, 2, ..., N.
SA3: in each iteration, every particle updates its position and velocity according to the following rule:
v_id(t+1) = w·v_id(t) + c1·r1·(p_ld - x_id(t)) + c2·r2·(p_gd - x_id(t))
x_id(t+1) = x_id(t) + v_id(t+1)
where v_id is the velocity of the particle, x_id its position, w the inertia weight, c1 and c2 the acceleration factors, r1 and r2 random numbers, p_ld the local optimum and p_gd the global optimum;
SA4: update the local optima pbest_i and the global optimum gbest;
SA5: if the global optimum gbest reaches the set threshold or the maximum number of iterations has been reached, the algorithm stops; otherwise jump to step SA3.
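The velocity and position updates of step SA3 can be sketched in code as follows; this is a minimal single-particle illustration under the update rule above, with our own function and variable names, not the patented implementation:

```python
import random

def pso_update(x, v, pbest, gbest, w=1.0, c1=2.0, c2=2.0):
    """One SA3 update for a single particle. For every dimension d:
    v_d(t+1) = w*v_d(t) + c1*r1*(pbest_d - x_d) + c2*r2*(gbest_d - x_d)
    x_d(t+1) = x_d(t) + v_d(t+1)
    r1 and r2 are fresh random numbers in [0, 1) per dimension."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = random.random(), random.random()
        vd = w * v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v
```

Note that when a particle already sits at its personal and global best with zero velocity, both attraction terms vanish and the particle stays put.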
The feature-weighted KNN algorithm specifically comprises the following steps:
SB1: input m training samples and set the value of k;
SB2: randomly select samples A[1] to A[k] from the training set as the initial k nearest neighbors of the sample X to be predicted;
SB3: compute the weighted Euclidean distance wd(X, A[i]) (i = 1, 2, ..., k) between the sample X to be predicted and each of the initial k nearest neighbors, the distance formula being:
wd(X, A[i]) = sqrt( Σ_{l=1}^{n} w_l·(X_l - A[i]_l)^2 ),
where n is the number of attributes of sample A[i], i.e. A[i] = (A[i]_1, A[i]_2, A[i]_3, ..., A[i]_n);
SB4: sort the distances wd(X, A[i]) obtained in step SB3 in ascending order, and find the maximum distance maxD = max{wd(X, A[i]) | i = 1, 2, ..., k};
SB5: compute in turn the distance between each remaining record in the training set and the sample X to be tested and compare it with the maximum distance maxD found in step SB4; if it is smaller than maxD, update maxD to the distance between that record and X, and re-sort the distances wd(X, A[i]) in ascending order;
SB6: count the occurrences of the labels of every record in the current distance ranking wd(X, A[i]), and sort the labels by occurrence count;
SB7: take the top L labels of the ranking obtained in step SB6 as the labels of sample X.
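The core of steps SB3 to SB7 (weighted distance, neighbor ranking, and label voting) can be sketched as follows. This is a simplified illustration: it ranks all training records directly instead of maintaining maxD incrementally as in SB4 and SB5, and it assumes records are represented as (attributes, labels) pairs, which is our own choice:

```python
import math
from collections import Counter

def weighted_distance(x, a, w):
    """Weighted Euclidean distance of step SB3:
    wd(X, A[i]) = sqrt(sum_l w_l * (X_l - A[i]_l)^2)."""
    return math.sqrt(sum(wl * (xl - al) ** 2 for wl, xl, al in zip(w, x, a)))

def predict_labels(x, training_set, weights, k, L):
    """Rank training records by weighted distance to x, count label
    occurrences among the k nearest, and return the L most frequent
    labels as the predicted labels of x (steps SB6 and SB7)."""
    nearest = sorted(training_set,
                     key=lambda record: weighted_distance(x, record[0], weights))[:k]
    counts = Counter(label for _, labels in nearest for label in labels)
    return [label for label, _ in counts.most_common(L)]
```

With unit weights, weighted_distance reduces to the ordinary Euclidean distance; e.g. weighted_distance([0, 0], [3, 4], [1, 1]) is 5.0.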
To verify the validity of the invention, the following test was carried out:
The test was run independently 10 times with a swarm size of 50 and 100 iterations, inertia weight w = 1 and learning factors c1 = c2 = 2; to save time, the number of nearest neighbors was set to K = 1.
Table 1 lists the four data sets used in this test; all are data sets commonly used in machine learning. To make the data characteristics comparable, the data sets were standardized. Training examples and test examples account for 70% and 30% of the total examples, respectively:
Table 1: data sets

Data set   Attributes   Classes   Training examples   Test examples
CAL500     68           174       351                 151
Emotions   72           6         391                 202
Scene      294          6         1211                1196
Yeast      8            10        1039                445
Experimental results and analysis:
Table 2 compares, on the different test sets, the performance of the unweighted KNN algorithm, of WKNN-DIS, and of the particle swarm-based multi-label PSOKNN algorithm provided by the invention. WKNN-DIS is a distance-based KNN method. The feature-weighted KNN algorithm starts from the observation that different features influence the labels to different degrees, which can cause classification errors; WKNN-DIS, by contrast, is based on the Euclidean distance, where differences in distance also influence the labels differently: in general, a training sample closer to the sample to be classified has a greater influence and therefore a larger weight. WKNN-DIS is similar to PSOKNN in that its distance weights are likewise tuned by an optimization algorithm; in this test the same particle swarm algorithm as in PSOKNN was chosen.
Table 2: comparison of the accuracy of the algorithms on the test sets
Here, Average Accuracy is the mean of the 10 experimental results of the PSOKNN (or WKNN-DIS) algorithm. The top-10, top-20 and top-30 accuracies are obtained by sorting the individuals of the last generation by fitness, applying the weight combinations of the 10, 20 and 30 best-ranked individuals to the classification, and averaging the resulting accuracies; the averages of the 10 runs are then averaged again. This is the origin of the top-10, top-20 and top-30 accuracies.
The experimental results show that WKNN-DIS generally achieves higher accuracy than the original KNN method on each data set, and that PSOKNN is generally better than WKNN-DIS. This is because PSOKNN primarily searches for optimal feature weights that eliminate, as far as possible, the redundant or irrelevant features (i.e. the attribute values used when computing distances) in the data set. The other two methods lack this ability, so they incur errors when computing the nearest distances: since the nearest distance is computed from the attributes through the distance formula, it depends strongly on the attribute values, and when redundancy is high the resulting deviation affects classification accuracy. The evolutionary-computation method proposed here therefore derives optimized feature weights, reduces the distance deviation, improves classification accuracy, and also yields a feasible and effective adaptive classifier.
Although some embodiments of the invention have been presented, those skilled in the art will understand that the embodiments herein can be changed without departing from the spirit of the invention. The above embodiments are exemplary and should not be taken as limiting the scope of the invention.

Claims (3)

1. A multi-label classification method based on a particle swarm algorithm, characterized in that it comprises an optimization stage and a classification stage:
S10: in the optimization stage, a particle swarm algorithm is used to optimize the feature weights of a feature-weighted KNN algorithm, specifically comprising the following steps:
S11: initialize the particle swarm with a random method; the position and velocity of each particle have dimension n, and the position of each particle corresponds to the feature weight vector w = (w_1, w_2, ..., w_n) of a record in the data set, where
Σ_{i=1}^{n} w_i = 1;
S12: compute the fitness value, and from it obtain the local optima and the global optimum:
When computing the fitness, the position of a particle is applied as the feature weights of the feature-weighted KNN algorithm. The first 70% of the original training set serves as a new training set and the last 30% as a new prediction set; the prediction set is classified and the classification accuracy is computed. The higher the accuracy, the better the fitness;
Let the original labels of each record in the prediction set be l_i = (l_i1, l_i2, ..., l_in) and the predicted labels after classification be l_j = (l_j1, l_j2, ..., l_jn); let sum be the number of positions in which l_i and l_j coincide; then the accuracy is Accuracy = sum/n;
S20: classification stage:
The feature weights obtained in the optimization stage are applied in the feature-weighted KNN algorithm to classify a test sample X, and finally the labels of all samples in the test set are output, completing the classification.
2. The multi-label classification method based on a particle swarm algorithm of claim 1, characterized in that the particle swarm algorithm comprises the following steps:
SA1: initialize the particle swarm, including the position x_i = (x_i1, x_i2, ..., x_id)^T and velocity v_i = (v_i1, v_i2, ..., v_id)^T of the whole swarm as well as the local optima and the global optimum, where the subscript id denotes the d-th dimension of the i-th particle;
SA2: compute the fitness value fitness_i = f(x_i) of each particle at its current position, then, according to the fitness values, initialize the local optima pbest_i = fitness_i and the global optimum gbest = min(fitness_1, fitness_2, ..., fitness_N), i = 1, 2, ..., N;
SA3: in each iteration, every particle updates its position and velocity according to the following rule:
v_id(t+1) = w·v_id(t) + c1·r1·(p_ld - x_id(t)) + c2·r2·(p_gd - x_id(t))
x_id(t+1) = x_id(t) + v_id(t+1)
where v_id is the velocity of the particle, x_id its position, w the inertia weight, c1 and c2 the acceleration factors, r1 and r2 random numbers, p_ld the local optimum and p_gd the global optimum;
SA4: update the local optima pbest_i and the global optimum gbest;
SA5: if the global optimum gbest reaches the set threshold or the maximum number of iterations has been reached, the algorithm stops; otherwise jump to step SA3.
3. The multi-label classification method based on a particle swarm algorithm of claim 1, characterized in that the feature-weighted KNN algorithm specifically comprises the following steps:
SB1: input m training samples and set the value of k;
SB2: randomly select samples A[1] to A[k] from the training set as the initial k nearest neighbors of the sample X to be predicted;
SB3: compute the weighted Euclidean distance wd(X, A[i]) (i = 1, 2, ..., k) between the sample X to be predicted and each of the initial k nearest neighbors, the distance formula being:
wd(X, A[i]) = sqrt( Σ_{l=1}^{n} w_l·(X_l - A[i]_l)^2 ),
where n is the number of attributes of sample A[i], i.e. A[i] = (A[i]_1, A[i]_2, A[i]_3, ..., A[i]_n);
SB4: sort the distances wd(X, A[i]) obtained in step SB3 in ascending order, and find the maximum distance maxD = max{wd(X, A[i]) | i = 1, 2, ..., k};
SB5: compute in turn the distance between each remaining record in the training set and the sample X to be tested and compare it with the maximum distance maxD found in step SB4; if it is smaller than maxD, update maxD to the distance between that record and X, and re-sort the distances wd(X, A[i]) in ascending order;
SB6: count the occurrences of the labels of every record in the current distance ranking wd(X, A[i]), and sort the labels by occurrence count;
SB7: take the top L labels of the ranking obtained in step SB6 as the labels of sample X.
CN201510464344.2A 2015-07-31 2015-07-31 Particle swarm algorithm-based multi-label classification method Pending CN104991974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510464344.2A CN104991974A (en) 2015-07-31 2015-07-31 Particle swarm algorithm-based multi-label classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510464344.2A CN104991974A (en) 2015-07-31 2015-07-31 Particle swarm algorithm-based multi-label classification method

Publications (1)

Publication Number Publication Date
CN104991974A true CN104991974A (en) 2015-10-21

Family

ID=54303789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510464344.2A Pending CN104991974A (en) 2015-07-31 2015-07-31 Particle swarm algorithm-based multi-label classification method

Country Status (1)

Country Link
CN (1) CN104991974A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824961A (en) * 2016-03-31 2016-08-03 北京奇艺世纪科技有限公司 Tag determining method and device
CN106959608A (en) * 2017-02-27 2017-07-18 同济大学 A kind of water supply network seepage optimal control method based on cluster particle cluster algorithm
CN108399267A (en) * 2018-03-27 2018-08-14 东北大学 A kind of reaction type clustering method based on cluster analysis of semantic characteristics
CN108445537A (en) * 2018-02-07 2018-08-24 中国地质大学(武汉) Earthquake data before superposition AVO elastic parameter inversion methods based on Spark and system
CN109032889A (en) * 2018-07-11 2018-12-18 广东水利电力职业技术学院(广东省水利电力技工学校) A kind of New cold type server system and management method, computer program
CN109062290A (en) * 2018-07-13 2018-12-21 山东工业职业学院 A kind of reading intelligent agriculture environmental monitoring system and monitoring method based on big data
CN109559797A (en) * 2018-10-31 2019-04-02 青岛大学附属医院 A kind of haemodialysis rehabilitation exercise training method and system, terminal
CN109581987A (en) * 2018-12-29 2019-04-05 广东飞库科技有限公司 A kind of AGV scheduling paths planning method and system based on particle swarm algorithm
CN110493718A (en) * 2019-08-28 2019-11-22 奇点新源国际技术开发(北京)有限公司 A kind of localization method and device
CN110575530A (en) * 2019-09-26 2019-12-17 杜运升 Medicine for promoting sow to enhance maternal performance and preparation method thereof
CN110704624A (en) * 2019-09-30 2020-01-17 武汉大学 Geographic information service metadata text multi-level multi-label classification method
CN114836823A (en) * 2022-06-08 2022-08-02 连城凯克斯科技有限公司 Method for predicting crystal growth diameter of monocrystalline silicon smelting furnace
CN116451099A (en) * 2023-06-19 2023-07-18 浪潮通用软件有限公司 High-entropy KNN clustering method, equipment and medium based on random traversal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880872A (en) * 2012-08-28 2013-01-16 中国科学院东北地理与农业生态研究所 Classification and construction method for semi-supervised support vector machine (SVM) remote sensing image
CN103714354A (en) * 2014-01-16 2014-04-09 西安电子科技大学 Hyperspectral image wave band selection method based on quantum-behaved particle swarm optimization algorithm
CN104361393A (en) * 2014-09-06 2015-02-18 华北电力大学 Method for using improved neural network model based on particle swarm optimization for data prediction

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880872A (en) * 2012-08-28 2013-01-16 中国科学院东北地理与农业生态研究所 Classification and construction method for semi-supervised support vector machine (SVM) remote sensing image
CN103714354A (en) * 2014-01-16 2014-04-09 西安电子科技大学 Hyperspectral image wave band selection method based on quantum-behaved particle swarm optimization algorithm
CN104361393A (en) * 2014-09-06 2015-02-18 华北电力大学 Method for using improved neural network model based on particle swarm optimization for data prediction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
REN Jiangtao et al.: "A feature weight learning algorithm for K-nearest-neighbor classification based on PSO" (基于PSO面向K近邻分类的特征权重学习算法), Computer Science (《计算机科学》) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824961B (en) * 2016-03-31 2019-06-14 北京奇艺世纪科技有限公司 A kind of label determines method and device
CN105824961A (en) * 2016-03-31 2016-08-03 北京奇艺世纪科技有限公司 Tag determining method and device
CN106959608A (en) * 2017-02-27 2017-07-18 同济大学 A kind of water supply network seepage optimal control method based on cluster particle cluster algorithm
CN106959608B (en) * 2017-02-27 2019-10-01 同济大学 A kind of water supply network leakage optimal control method based on cluster particle swarm algorithm
CN108445537A (en) * 2018-02-07 2018-08-24 中国地质大学(武汉) Earthquake data before superposition AVO elastic parameter inversion methods based on Spark and system
CN108445537B (en) * 2018-02-07 2019-05-31 中国地质大学(武汉) Earthquake data before superposition AVO elastic parameter inversion method and system based on Spark
CN108399267A (en) * 2018-03-27 2018-08-14 东北大学 A kind of reaction type clustering method based on cluster analysis of semantic characteristics
CN108399267B (en) * 2018-03-27 2020-04-14 东北大学 Feedback clustering method based on cluster semantic feature analysis
CN109032889A (en) * 2018-07-11 2018-12-18 广东水利电力职业技术学院(广东省水利电力技工学校) A kind of New cold type server system and management method, computer program
CN109062290A (en) * 2018-07-13 2018-12-21 山东工业职业学院 A kind of reading intelligent agriculture environmental monitoring system and monitoring method based on big data
CN109559797A (en) * 2018-10-31 2019-04-02 青岛大学附属医院 A kind of haemodialysis rehabilitation exercise training method and system, terminal
CN109581987A (en) * 2018-12-29 2019-04-05 广东飞库科技有限公司 A kind of AGV scheduling paths planning method and system based on particle swarm algorithm
CN110493718A (en) * 2019-08-28 2019-11-22 奇点新源国际技术开发(北京)有限公司 A kind of localization method and device
CN110493718B (en) * 2019-08-28 2021-02-26 奇点新源国际技术开发(北京)有限公司 Positioning method and device
CN110575530A (en) * 2019-09-26 2019-12-17 杜运升 Medicine for promoting sow to enhance maternal performance and preparation method thereof
CN110704624A (en) * 2019-09-30 2020-01-17 武汉大学 Geographic information service metadata text multi-level multi-label classification method
CN110704624B (en) * 2019-09-30 2021-08-10 武汉大学 Geographic information service metadata text multi-level multi-label classification method
CN114836823A (en) * 2022-06-08 2022-08-02 连城凯克斯科技有限公司 Method for predicting crystal growth diameter of monocrystalline silicon smelting furnace
CN114836823B (en) * 2022-06-08 2024-03-19 连城凯克斯科技有限公司 Crystal growth diameter prediction method of monocrystalline silicon melting furnace
CN116451099A (en) * 2023-06-19 2023-07-18 浪潮通用软件有限公司 High-entropy KNN clustering method, equipment and medium based on random traversal
CN116451099B (en) * 2023-06-19 2023-09-01 浪潮通用软件有限公司 High-entropy KNN clustering method, equipment and medium based on random traversal

Similar Documents

Publication Publication Date Title
CN104991974A (en) Particle swarm algorithm-based multi-label classification method
CN107256245B (en) Offline model improvement and selection method for spam message classification
CN106845717B (en) Energy efficiency evaluation method based on multi-model fusion strategy
CN106971091B (en) Tumor identification method based on deterministic particle swarm optimization and support vector machine
Ghanem et al. Multi-class pattern classification in imbalanced data
CN103425996B (en) A kind of large-scale image recognition methods of parallel distributed
CN104766098A (en) Construction method for classifier
CN113378913B (en) Semi-supervised node classification method based on self-supervised learning
CN105389583A (en) Image classifier generation method, and image classification method and device
CN110942091A (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN113887643B (en) New dialogue intention recognition method based on pseudo tag self-training and source domain retraining
CN113554100B (en) Web service classification method for enhancing attention network of special composition picture
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN110738232A (en) grid voltage out-of-limit cause diagnosis method based on data mining technology
CN111695011B (en) Tensor expression-based dynamic hypergraph structure learning classification method and system
CN105512675A (en) Memory multi-point crossover gravitational search-based feature selection method
CN109670687A (en) A kind of mass analysis method based on particle group optimizing support vector machines
CN111652478A (en) Electric power system voltage stability evaluation misclassification constraint method based on umbrella algorithm
CN105069485A (en) Extreme-learning-machine-based mode identification method in tensor mode
CN102929977B (en) Event tracing method aiming at news website
Guo et al. Reducing evaluation cost for circuit synthesis using active learning
CN117076871B (en) Battery fault classification method based on unbalanced semi-supervised countermeasure training framework
Jingbiao et al. Research and improvement of clustering algorithm in data mining
CN104881688A (en) Two-stage clustering algorithm based on difference evolution and fuzzy C-means
Liu et al. A weight-incorporated similarity-based clustering ensemble method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151021

RJ01 Rejection of invention patent application after publication