CN110210529A

CN110210529A - A feature selection method based on binary quantum particle swarm optimization

Info

Publication number: CN110210529A
Application number: CN201910400448.5A
Authority: CN
Inventors: 葛瑞泉; 刘勇; 吴卿; 沈渊锋; 严义; 高政; 郑小芳
Original assignee: Hangzhou Kongtrolink Information Technology Co Ltd; Zhejiang University ZJU
Current assignee: Hangzhou Kongtrolink Information Technology Co Ltd; Zhejiang University ZJU
Priority date: 2019-05-14
Filing date: 2019-05-14
Publication date: 2019-09-06

Abstract

The invention discloses a feature selection method based on binary quantum particle swarm optimization. Feature correlation analysis is first carried out using the maximal information coefficient, feature selection is then performed by an improved BQPSO algorithm, and the classification accuracy is finally verified with an SVM. Experimental results on gene expression profiles show that feature selection based on the improved BQPSO algorithm is a practicable method. The invention mainly improves the standard binary quantum particle swarm optimization algorithm: the local attractor is calculated using a complete learning strategy, and the mutation idea of the genetic algorithm is introduced to increase the diversity of the population. Experiments show that feature selection with the improved BQPSO algorithm achieves better classification accuracy.

Description

A Feature Selection Method Based on Binary Quantum Particle Swarm Algorithm

Technical field

The invention belongs to the technical field of data mining and relates to a feature selection method based on a binary quantum particle swarm algorithm.

Background art

In classification problems, data sets often contain thousands or tens of thousands of features, including relevant, irrelevant and redundant ones. Such excessive dimensionality can even degrade classification performance, a problem known as the "curse of dimensionality". Feature selection is one way of reducing the dimensionality of a data set.

Feature selection occupies a very important position in the field of pattern recognition and has high research value. On the one hand, it effectively reduces the amount of data to be processed and therefore the computing overhead; on the other hand, it eliminates non-critical interfering features, reduces the correlation between features, and makes the remaining features more effective.

Currently, feature selection methods fall into filter, wrapper and embedded methods. Wrapper methods use a classifier to evaluate candidate feature subsets, whereas filter methods evaluate feature subsets by their information content and statistical measures. Wrapper methods generally give better results than filter methods but are more computationally expensive. In embedded methods, feature selection takes place as part of constructing the classifier. How to design an effective feature selection method is an important open problem for high-dimensional data.

Summary of the invention

The purpose of the present invention is to provide a feature selection method based on a binary quantum particle swarm algorithm, addressing the existing need for feature selection on high-dimensional, small-sample data. The method uses the maximal information coefficient (MIC) (see Reshef, D. N., et al., "Detecting novel associations in large data sets", Science (New York, N.Y.), 2011, 334(6062)) for data preprocessing and deletes weakly correlated features. Feature selection is then performed by an improved Binary Quantum Particle Swarm Optimization (BQPSO) algorithm, and the classification accuracy of the selected features is verified using an SVM, so that a high accuracy is maintained.

The specific steps of the present invention are as follows:

Step 1: Input a public data set;

Step 2: Use the maximal information coefficient MIC to calculate the correlation between each feature (data field) and the class label, treat features whose correlation is below a threshold as weakly correlated, and delete the weakly correlated features;

The correlation between each feature and the class label is calculated using the maximal information coefficient MIC, specifically:

where X is the sample feature, Y is the class label, and B is the total number of data points raised to the power 0.6 or 0.55.
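The formula itself is not reproduced in this text. As a point of reference, the maximal information coefficient is commonly defined (following Reshef et al.) as shown below. The symbols n, n_x and n_y are notation assumed here and may not match the original formula images.

```latex
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}

\mathrm{MIC}(X;Y) = \max_{n_x n_y < B} \frac{I(X;Y)}{\log_2 \min(n_x, n_y)},
\qquad B = n^{0.6} \ \text{or}\ n^{0.55}
```

Here n is the number of samples, n_x and n_y are the numbers of bins that a grid places on the X and Y axes, and I(X;Y) is the largest mutual information achievable by any grid of that resolution.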

Step 3: For the strongly correlated features, select the optimal feature subset using the binary quantum particle swarm algorithm combined with the mutation idea of the genetic algorithm;

Specifically:

1) Initialize the population;

2) Calculate the fitness value of each particle in the population according to the fitness function and compare it with the particle's previous local best value; if f(x_i) < f(pbest_i), set pbest_i = x_i, otherwise do not update;

3) Calculate the population best value gbest and the mean best value mbest;

4) Calculate the local attractor p_i and the update probability pr of the particle's new position;

5) Update the value of x_i according to the function Transf(p_i, pr);

6) Screen out the particles with poor fitness values and, following the mutation idea of the genetic algorithm, mutate them with probability Pm, thereby increasing the diversity of the particle swarm;

7) Determine whether the termination condition is met; if not, return to step 4); otherwise proceed to the next step;

8) Output the optimal feature subset;

where the fitness function is:

where w_A is the weight of the SVM classification accuracy, w_F is the weight of the number of features strongly correlated with the class label, sum(chrom) is the number of features strongly correlated with the class label, Acc is the classification accuracy obtained with the selected features, mic_c is the correlation between a feature and the class label calculated by the maximal information coefficient MIC, and mic_f is the correlation between features calculated by MIC;

Step 4: Use the support vector machine algorithm to verify and evaluate the validity of the selected feature subset.

Beneficial effects of the present invention: the invention mainly improves the standard binary quantum particle swarm optimization algorithm. The local attractor is calculated using a complete learning strategy, and the mutation idea of the genetic algorithm is introduced to increase the diversity of the particle swarm. Experiments show that feature selection with the improved BQPSO algorithm achieves better classification accuracy.

Brief description of the drawings

Fig. 1 is the overall flowchart of the algorithm of the present invention;

Fig. 2 is the flowchart of the binary quantum particle swarm algorithm of the present invention;

Fig. 3 shows the classification accuracy obtained by a Support Vector Machine (SVM) on the feature subset selected by the present invention for the Lymphoma data set.

Detailed description of the embodiments

As shown in Fig. 1, a feature selection method based on a binary quantum particle swarm algorithm comprises the following steps:

Step 1. Input the public data set Lymphoma, which contains 45 samples and 4026 features; 22 samples are negative and 23 are positive.

Step 2. Use the maximal information coefficient (MIC) to calculate the correlation between every feature and the class label. The MIC is calculated as shown in formulas (1) and (2).

Step 3. Rank the features by their MIC values and delete the weakly correlated features according to the set threshold.
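Steps 2 and 3 can be sketched as follows. This is a simplified illustration, not the full MIC algorithm: the label axis is fixed to its classes and the feature axis uses equal-frequency bins instead of an optimised grid. The function names and the threshold value are assumptions made for the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def approx_mic(feature, labels, exponent=0.6):
    """Approximate MIC between a numeric feature and a class label.

    Assumption: only equal-frequency grids on the feature axis are tried,
    so the result is a lower bound on the true MIC. Needs >= 2 classes.
    """
    n_classes = len(set(labels))
    budget = len(feature) ** exponent  # B = n^0.6 (or n^0.55)
    best, k = 0.0, 2
    while k * n_classes < budget:
        mi = mutual_information(equal_frequency_bins(feature, k), labels)
        best = max(best, mi / math.log2(min(k, n_classes)))
        k += 1
    return best

def filter_features(columns, labels, threshold):
    """Indices of features scoring at least threshold, strongest first."""
    scored = [(approx_mic(col, labels), j) for j, col in enumerate(columns)]
    kept = [(s, j) for s, j in scored if s >= threshold]
    return [j for s, j in sorted(kept, key=lambda t: -t[0])]

labels = [0] * 10 + [1] * 10
informative = [float(i) for i in range(20)]  # separates the two classes
noise = [i % 2 for i in range(20)]           # unrelated to the label
print(filter_features([informative, noise], labels, threshold=0.5))
```

For real data, a dedicated MIC implementation would normally replace `approx_mic`; the ranking and thresholding logic stays the same.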

Step 4. Apply the binary quantum particle swarm algorithm to the remaining features to search for the optimal feature subset. The algorithm flowchart is shown in Fig. 2.

The BQPSO algorithm has no concept of velocity or trajectory, only of particle positions and the distance between particles. The distance between two particles is measured by the Hamming distance. In QPSO, p_i is the local attractor of particle i. Each component p_id lies between pbest_id and gbest_d, so p_i = (p_i1, p_i2, ..., p_iD) lies in the hyper-rectangle whose diagonal has pbest_i and gbest as its end points. The distance from p_i to pbest_i or to gbest cannot exceed the length of this diagonal, that is, the following inequalities must hold:

|p_i - pbest_i| ≤ |pbest_i - gbest| (3)

|p_i - gbest| ≤ |pbest_i - gbest| (4)

Calculating the local attractor p_i introduces diversity into the population and lets a particle escape its local search region. In the BQPSO algorithm, p_i is generated differently from the QPSO algorithm: a new offspring is produced by random bitwise crossover of the parents pbest_i and gbest.
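The crossover described above can be sketched as follows. The code takes each bit of p_i at random from pbest_i or gbest and then checks inequalities (3) and (4) under the Hamming distance. The function names and the example bit strings are assumptions for illustration; the improved attractor based on the complete learning strategy is not specified in this text and is not shown.

```python
import random

def hamming(a, b):
    """Number of bit positions in which two equal-length 0/1 lists differ."""
    return sum(x != y for x, y in zip(a, b))

def local_attractor(pbest_i, gbest, rng):
    """Build p_i by taking each bit at random from pbest_i or gbest."""
    return [rng.choice((p, g)) for p, g in zip(pbest_i, gbest)]

rng = random.Random(0)
pbest_i = [1, 0, 1, 1, 0, 0, 1, 0]
gbest = [1, 1, 0, 1, 0, 1, 1, 0]
p_i = local_attractor(pbest_i, gbest, rng)

diagonal = hamming(pbest_i, gbest)
# p_i can differ from a parent only where the two parents disagree,
# so both distances are bounded by the parent-to-parent distance.
assert hamming(p_i, pbest_i) <= diagonal  # inequality (3)
assert hamming(p_i, gbest) <= diagonal    # inequality (4)
print(p_i)
```

Because every bit of p_i equals exactly one parent wherever the parents disagree, the two distances always add up to the parent-to-parent distance, so the bounds hold for any random draw.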

As the PSO iterations proceed, the particles tend to converge prematurely and become trapped in a local optimum. To address this, the mutation idea of the genetic algorithm is introduced: particles with poor fitness are mutated with probability Pm, which increases the diversity of the particle swarm and prevents the particles from settling into a local optimum too early.
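A minimal sketch of this mutation step follows. Two points are assumptions, since the text does not fix them: Pm is applied independently to each bit, and "poor fitness" is taken to mean the worst `fraction` of the swarm (a hypothetical parameter), with lower fitness values treated as better.

```python
import random

def mutate_worst(swarm, fitness_values, pm, fraction, rng):
    """Flip bits with probability pm in the worst `fraction` of particles.

    Lower fitness is better here. The swarm is modified in place and the
    indices of the mutated particles are returned.
    """
    n_worst = int(len(swarm) * fraction)
    ranked = sorted(range(len(swarm)),
                    key=lambda i: fitness_values[i], reverse=True)
    worst = ranked[:n_worst]
    for i in worst:
        swarm[i] = [bit ^ 1 if rng.random() < pm else bit for bit in swarm[i]]
    return worst

rng = random.Random(0)
swarm = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
fitness_values = [0.1, 0.9, 0.2, 0.8]
print(mutate_worst(swarm, fitness_values, pm=0.25, fraction=0.5, rng=rng))
print(swarm)
```

Only the worst-ranked particles are touched, so the best solutions found so far are never disturbed by the mutation.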

The method aims to achieve high classification accuracy while selecting as few features as possible. The fitness function of the algorithm is therefore designed as follows:

where sum(chrom) is the number of features selected by each individual, and Acc is the accuracy obtained by classifying with the selected features. The method uses a binary SVM classifier to build a classification model on the samples from each individual's feature subset, and uses the fitness value to evaluate the result. The fitness function keeps the number of selected features as small as possible while keeping the classification error rate as low as possible. The binary quantum particle swarm algorithm selects features as follows:

1) Initialize the population.

2) Calculate the fitness value of each particle in the population according to the fitness function and compare it with the particle's previous local best value; if f(x_i) < f(pbest_i), set pbest_i = x_i, otherwise do not update.

3) Calculate the population best value gbest and the mean best value mbest.

4) Calculate the local attractor p_i and the update probability pr of the particle's new position.

5) Update the value of x_i according to the function Transf(p_i, pr).

6) Screen out the particles with poor fitness values and, following the mutation idea of the genetic algorithm, mutate them with probability Pm, thereby increasing the diversity of the particle swarm.

7) Determine whether the termination condition is met; if not, return to step 4; otherwise proceed to the next step.

8) Output the optimal chromosome, that is, the optimal 0/1 string, where 0 indicates that the feature is not selected and 1 indicates that it is selected.
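Steps 1) to 8) can be sketched as the loop below. Several details are assumptions, because the text does not spell them out: mbest is a bitwise majority vote over the personal bests, pr follows the commonly published BQPSO form min(1, alpha * d_H(x_i, mbest) * ln(1/u) / D), Transf flips each bit of p_i with probability pr, the loop re-evaluates fitness on every pass, and termination is a fixed iteration count. `toy_fitness` is a hypothetical stand-in for the fitness function, whose formula is not reproduced here, and it uses a fake accuracy instead of an SVM.

```python
import math
import random

def hamming(a, b):
    """Number of bit positions in which two equal-length 0/1 lists differ."""
    return sum(x != y for x, y in zip(a, b))

def flip_bits(bits, prob, rng):
    """Flip each bit independently with probability prob."""
    return [b ^ 1 if rng.random() < prob else b for b in bits]

def bqpso(fitness, dim, n_particles=20, n_iter=50, alpha=0.75,
          pm=0.05, worst_fraction=0.2, seed=0):
    """Minimise fitness over 0/1 lists of length dim; returns (best, best_fit)."""
    rng = random.Random(seed)
    # 1) Initialise the population with random bit strings.
    swarm = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [x[:] for x in swarm]
    pbest_fit = [math.inf] * n_particles
    for _ in range(n_iter):
        # 2) Evaluate each particle and update its personal best.
        fits = [fitness(x) for x in swarm]
        for i, f in enumerate(fits):
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = swarm[i][:], f
        # 3) gbest is the best personal best; mbest is a bitwise majority vote.
        gbest = pbest[min(range(n_particles), key=pbest_fit.__getitem__)]
        mbest = [int(2 * sum(p[d] for p in pbest) >= n_particles)
                 for d in range(dim)]
        for i in range(n_particles):
            # 4) Local attractor by bitwise random crossover, then pr.
            p_i = [rng.choice(pair) for pair in zip(pbest[i], gbest)]
            u = 1.0 - rng.random()  # u in (0, 1], so log(1/u) is finite
            spread = alpha * hamming(swarm[i], mbest) * math.log(1.0 / u)
            pr = min(1.0, spread / dim)
            # 5) Transf(p_i, pr): flip each bit of p_i with probability pr.
            swarm[i] = flip_bits(p_i, pr, rng)
        # 6) Mutate the particles that scored worst in step 2.
        ranked = sorted(range(n_particles), key=fits.__getitem__, reverse=True)
        for i in ranked[:int(n_particles * worst_fraction)]:
            swarm[i] = flip_bits(swarm[i], pm, rng)
        # 7) Termination: a fixed iteration budget in this sketch.
    # 8) Output the best bit string found (1 = feature selected).
    best = min(range(n_particles), key=pbest_fit.__getitem__)
    return pbest[best], pbest_fit[best]

INFORMATIVE = {0, 3, 5}  # hypothetical "useful" features for the toy problem

def toy_fitness(chrom, w_a=0.9, w_f=0.1):
    """Hypothetical fitness: weighted error plus weighted feature fraction."""
    acc = sum(chrom[j] for j in INFORMATIVE) / len(INFORMATIVE)
    return w_a * (1.0 - acc) + w_f * sum(chrom) / len(chrom)

best, best_fit = bqpso(toy_fitness, dim=12)
print(best, best_fit)
```

In a real run, `toy_fitness` would be replaced by a function that trains and scores an SVM on the features where the bit string is 1.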

Step 5. The above four steps are repeated multiple times to obtain the selected feature subsets. Each resulting feature subset is validated using ten-fold cross-validation. Fig. 3 compares the classification accuracy of the improved BQPSO algorithm with that of the BQPSO algorithm, as obtained by support vector machine classification modeling.
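The ten-fold validation of a selected feature subset can be sketched as below. To keep the example self-contained, a nearest-centroid rule stands in for the SVM; this is an assumption of the sketch, and in practice an SVM implementation such as scikit-learn's `sklearn.svm.SVC` would be plugged in through the `train_and_predict` callback. All function names and the synthetic data are illustrative.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid(train_x, train_y, test_x):
    """Stand-in classifier (NOT an SVM): pick the class with the closest mean."""
    centroids = {}
    for c in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: dist(x, centroids[c])) for x in test_x]

def cross_val_accuracy(rows, labels, mask, train_and_predict, k=10, seed=0):
    """Pooled accuracy over k folds, using only features where mask == 1."""
    reduced = [[v for v, m in zip(row, mask) if m] for row in rows]
    correct = 0
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        predictions = train_and_predict(
            [reduced[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [reduced[i] for i in test_idx])
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(rows)

# Synthetic data with the same shape of class split as the Lymphoma set
# (22 negative, 23 positive); feature 0 is informative, feature 1 is not.
labels = [0] * 22 + [1] * 23
rows = [[10.0 * y + (i % 3), float((i * 7) % 5)] for i, y in enumerate(labels)]
print(cross_val_accuracy(rows, labels, [1, 0], nearest_centroid))
```

The `mask` argument is the 0/1 string output by the search, so the same routine can score any candidate feature subset.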

Claims

1. A feature selection method based on a binary quantum particle swarm algorithm, characterized in that the method comprises the following steps:

Step 1: Input a public data set;

Step 2: Use the maximal information coefficient MIC to calculate the correlation between each feature (data field) and the class label, treat features whose correlation is below a threshold as weakly correlated, and delete the weakly correlated features;

Step 3: For the strongly correlated features, select the optimal feature subset using the binary quantum particle swarm algorithm combined with the mutation idea of the genetic algorithm;

Specifically:

1) Initialize the population;

2) Calculate the fitness value of each particle in the population according to the fitness function and compare it with the particle's previous local best value; if f(x_i) < f(pbest_i), set pbest_i = x_i, otherwise do not update;

3) Calculate the population best value gbest and the mean best value mbest;

4) Calculate the local attractor p_i and the update probability pr of the particle's new position;

5) Update the value of x_i according to the function Transf(p_i, pr);

6) Screen out the particles with poor fitness values and, following the mutation idea of the genetic algorithm, mutate them with probability Pm, thereby increasing the diversity of the particle swarm;

7) Determine whether the termination condition is met; if not, return to step 4); otherwise proceed to the next step;

8) Output the optimal feature subset;

where the fitness function is:

where w_A is the weight of the SVM classification accuracy, w_F is the weight of the number of features strongly correlated with the class label, sum(chrom) is the number of features strongly correlated with the class label, Acc is the classification accuracy obtained with the selected features, mic_c is the correlation between a feature and the class label calculated by the maximal information coefficient MIC, and mic_f is the correlation between features calculated by MIC;

Step 4: Use the support vector machine algorithm to verify and evaluate the validity of the selected feature subset.

2. The feature selection method based on a binary quantum particle swarm algorithm according to claim 1, characterized in that the correlation between each feature and the class label is calculated using the maximal information coefficient MIC,

Specifically:

where X is the sample feature, Y is the class label, and B is the total number of data points raised to the power 0.6 or 0.55.