CN110210529A - A feature selection method based on binary quantum particle swarm optimization - Google Patents

A feature selection method based on binary quantum particle swarm optimization

Info

Publication number
CN110210529A
Authority
CN
China
Prior art keywords
feature
calculated
algorithm
correlation
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910400448.5A
Other languages
Chinese (zh)
Inventor
葛瑞泉
刘勇
吴卿
沈渊锋
严义
高政
郑小芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Kongtrolink Information Technology Co Ltd
Zhejiang University ZJU
Original Assignee
Hangzhou Kongtrolink Information Technology Co Ltd
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Kongtrolink Information Technology Co Ltd and Zhejiang University ZJU
Priority to CN201910400448.5A
Publication of CN110210529A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2111 Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Physiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature selection method based on binary quantum particle swarm optimization (BQPSO). Feature correlation analysis is first performed using the maximal information coefficient; feature selection is then carried out by an improved BQPSO algorithm; finally, classification accuracy is verified using an SVM. Experimental results on gene expression profiles show that feature selection based on the improved BQPSO algorithm is a practicable method. The invention mainly improves the standard binary quantum particle swarm optimization algorithm: the local attractor is computed using a comprehensive learning strategy, and the mutation idea of genetic algorithms is introduced to increase the diversity of the population. Experiments show that feature selection with the improved BQPSO algorithm yields better classification accuracy.

Description

A feature selection method based on binary quantum particle swarm optimization
Technical field
The invention belongs to the field of data mining technology and relates to a feature selection method based on binary quantum particle swarm optimization.
Background
In classification problems, a data set often contains thousands of features, including relevant, irrelevant, and redundant ones. An excessively large feature set can even degrade classification performance, a phenomenon known as the "curse of dimensionality". Reducing the dimensionality of the data set through feature selection is one approach to dimensionality reduction.
Feature selection occupies an important position in the field of pattern recognition and has high research value. On the one hand, it effectively reduces the amount of data to be processed and hence the computational cost; on the other hand, it eliminates non-critical, interfering features, reduces the correlation between features, and improves their effectiveness.
Current feature selection methods fall into filter, wrapper, and embedded methods. Wrapper methods use a classifier to evaluate the generated feature subsets, whereas filter methods evaluate feature subsets by their information content and statistical measures. In general, wrapper methods achieve better results than filter methods but are more computationally expensive. In embedded methods, building the classifier is itself a feature selection process. Designing effective feature selection methods is an important open problem for high-dimensional data.
Summary of the invention
The purpose of the invention is to meet the existing need for feature selection on high-dimensional, small-sample data by proposing a feature selection method based on binary quantum particle swarm optimization. The method uses the maximal information coefficient (MIC) (see Reshef, D. N., et al., "Detecting novel associations in large data sets", Science (New York, N.Y.), 2011, 334(6062)) for data preprocessing and deletes weakly correlated features. It then performs feature selection with an improved binary quantum particle swarm optimization (BQPSO) algorithm. Finally, the classification accuracy of the selected features is verified with an SVM, and a high accuracy is maintained.
The specific steps of the invention are as follows:
Step 1: input a public data set;
Step 2: calculate the correlation between each feature and the class label using the maximal information coefficient MIC, treat features whose correlation is below a set threshold as weakly correlated, and delete them;
The correlation between each feature and the class label is calculated using the maximal information coefficient MIC, specifically:
where X is the sample feature, Y is the class label, and B is taken as the 0.6 or 0.55 power of the total number of data points.
Step 3: for the strongly correlated features, select the optimal feature subset using the mutation idea of genetic algorithms together with the binary quantum particle swarm optimization algorithm;
Specifically:
1) initialize the population;
2) calculate the fitness value of each particle in the swarm according to the fitness function and compare it with the particle's previous local optimum; if f(x_i) < f(pbest_i), set pbest_i = x_i, otherwise do not update;
3) calculate the swarm's global best gbest and the mean best position mbest;
4) calculate the local attractor p_i and the new position-update probability pr of the particle;
5) update the value of x_i according to the function Transf(p_i, pr);
6) screen out the particles with poor fitness values and, following the mutation idea of genetic algorithms, mutate them with probability Pm to increase the diversity of the population;
7) judge whether the termination condition is met; if not, return to step 4), otherwise proceed to the next step;
8) output the optimal feature subset;
where the fitness function is:
where w_A is the weight of the SVM classification accuracy, w_F is the weight of the number of features strongly correlated with the class label, sum(chrom) is the number of features strongly correlated with the class label, Acc is the classification accuracy obtained with the selected features, mic_c is the correlation between features and the class label calculated by the maximal information coefficient MIC, and mic_f is the correlation between features calculated by the maximal information coefficient MIC;
Step 4: evaluate the validity of the selected feature subset using the support vector machine algorithm.
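The exact fitness formula is not recoverable from the text, so the following is only a hypothetical sketch of how the named quantities w_A, w_F, Acc, and sum(chrom) might combine. It assumes the fitness is minimized, which is consistent with the update rule f(x_i) < f(pbest_i) in step 2). The default weights, the normalization by the string length, and the `evaluate_accuracy` callback (a stand-in for training and scoring an SVM) are all assumptions, and the mic_c and mic_f terms are left out because their role in the formula is not stated.

```python
from typing import Callable, Sequence


def fitness(chrom: Sequence[int],
            evaluate_accuracy: Callable[[Sequence[int]], float],
            w_a: float = 0.9,
            w_f: float = 0.1) -> float:
    """Hypothetical fitness to be minimized: weighted classification error
    plus weighted fraction of selected features. Not the patent's formula."""
    n_selected = sum(chrom)
    if n_selected == 0:
        return float("inf")  # an empty feature subset cannot be classified
    acc = evaluate_accuracy(chrom)
    return w_a * (1.0 - acc) + w_f * n_selected / len(chrom)


# Stand-in for the SVM: pretend the accuracy is 0.9 for every subset.
def dummy_acc(chrom: Sequence[int]) -> float:
    return 0.9


print(fitness([1, 0, 1, 0], dummy_acc))
```

With equal accuracy, a subset that selects fewer features receives a lower (better) value, which is the trade-off the text describes.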
Beneficial effects of the invention: the invention mainly improves the standard binary quantum particle swarm optimization algorithm. The local attractor is calculated using a comprehensive learning strategy, and the mutation idea of genetic algorithms is introduced to increase the diversity of the population. Experiments show that feature selection with the improved BQPSO algorithm achieves better classification accuracy.
Brief description of the drawings
Fig. 1 is the overall flowchart of the algorithm of the invention;
Fig. 2 is the flowchart of the binary quantum particle swarm optimization algorithm of the invention;
Fig. 3 shows the classification accuracy obtained by a support vector machine (SVM) on the feature subset that the invention selects from the Lymphoma data set.
Detailed description of the embodiments
As shown in Fig. 1, a feature selection method based on binary quantum particle swarm optimization comprises the following steps:
Step 1: input the public data set Lymphoma, which contains 45 samples and 4026 features; 22 samples are negative and 23 are positive.
Step 2: calculate the correlation between every feature and the class label using the maximal information coefficient (MIC). The MIC calculation is given by formulas (1) and (2).
Step 3: rank the features by MIC value and delete the weakly correlated features according to the set threshold.
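The ranking and thresholding in Steps 2 and 3 can be sketched as follows. The function name, the example scores, and the threshold of 0.3 are all made up for illustration; in practice the scores would come from an MIC estimator applied to each feature and the class label.

```python
from typing import Dict, List


def filter_by_relevance(scores: Dict[int, float], threshold: float) -> List[int]:
    """Return the indices of features whose relevance score is at least
    `threshold`, ordered from most to least relevant."""
    ranked = sorted(scores, key=lambda j: scores[j], reverse=True)
    return [j for j in ranked if scores[j] >= threshold]


# Hypothetical MIC values for five features (illustrative numbers only).
mic_scores = {0: 0.12, 1: 0.58, 2: 0.33, 3: 0.91, 4: 0.05}
kept = filter_by_relevance(mic_scores, threshold=0.3)
print(kept)  # [3, 1, 2]
```

Only the surviving indices are passed on to the search in Step 4, which shortens the bit string each particle has to encode.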
Step 4: search the remaining features with the binary quantum particle swarm optimization algorithm to obtain the optimal feature subset. The detailed algorithm flowchart is shown in Fig. 2.
In the BQPSO algorithm there is no concept of velocity or trajectory; there is only the concept of particle positions and the distances between particles. The distance between two particles is expressed as the Hamming distance. In QPSO, p_i is the local attractor of a particle, and each component p_id lies between pbest_id and gbest_d, so p_i = (p_i1, p_i2, ..., p_iD) lies in the hyperrectangle whose diagonal has pbest_i and gbest as its end points. The distance from p_i to pbest_i or to gbest therefore cannot exceed the length of that diagonal, that is, the following inequalities must hold:
|p_i - pbest_i| ≤ |pbest_i - gbest| (3)
|p_i - gbest| ≤ |pbest_i - gbest| (4)
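A small worked check of inequalities (3) and (4) with Hamming distance is sketched below. The three bit strings are made-up examples; p_i is chosen so that every bit equals the corresponding bit of pbest_i or of gbest, which is the situation the inequalities describe.

```python
from typing import Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


pbest_i = [1, 0, 1, 1, 0, 0]
gbest = [1, 1, 0, 1, 0, 1]
p_i = [1, 1, 1, 1, 0, 0]  # each bit copied from pbest_i or from gbest

diagonal = hamming(pbest_i, gbest)        # 3 differing bits
assert hamming(p_i, pbest_i) <= diagonal  # inequality (3): 1 <= 3
assert hamming(p_i, gbest) <= diagonal    # inequality (4): 2 <= 3
```

The two distances add up to the diagonal here (1 + 2 = 3) because p_i can only differ from one parent at a position where the parents themselves differ.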
Computing the local attractor p_i gives the swarm diversity and lets particles escape their local search region. In the BQPSO algorithm, p_i is generated differently from the QPSO algorithm: the parents pbest_i and gbest are crossed over at random, bit by bit, to produce a new offspring.
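The bitwise random crossover described above might look like the sketch below. Taking each bit from either parent with equal probability is an assumption, since the text does not state the crossover probability, and the function name is illustrative.

```python
import random
from typing import List, Optional, Sequence


def local_attractor(pbest_i: Sequence[int], gbest: Sequence[int],
                    rng: Optional[random.Random] = None) -> List[int]:
    """Uniform crossover: each bit is taken from pbest_i or gbest with
    equal chance (the 0.5 probability is an assumption)."""
    rng = rng or random.Random()
    return [a if rng.random() < 0.5 else b for a, b in zip(pbest_i, gbest)]


rng = random.Random(0)
pbest_i = [1, 0, 1, 1, 0, 0]
gbest = [1, 1, 0, 1, 0, 1]
p_i = local_attractor(pbest_i, gbest, rng)
# Wherever the two parents agree, the offspring must carry the shared bit.
assert all(c == a for a, b, c in zip(pbest_i, gbest, p_i) if a == b)
```

Because the offspring only ever differs from a parent where the parents disagree, it stays within the Hamming-distance bounds relative to pbest_i and gbest.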
As the PSO iterations proceed, particles tend to converge prematurely and fall into a local optimum. To address this, the mutation idea of genetic algorithms is introduced: some particles with poor fitness are mutated with probability Pm, which increases the diversity of the population and prevents particles from falling into a local optimum too early.
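One possible reading of this mutation step is sketched below. The text does not say how many particles count as "poor" or whether Pm applies per particle or per bit, so the `worst_fraction` parameter and the per-bit flip are assumptions; lower fitness is treated as better.

```python
import random
from typing import Callable, List, Optional, Sequence


def mutate_worst(swarm: List[List[int]],
                 fitness: Callable[[Sequence[int]], float],
                 pm: float,
                 worst_fraction: float = 0.2,
                 rng: Optional[random.Random] = None) -> None:
    """Flip each bit of the worst-fitness particles with probability pm,
    modifying the swarm in place. Lower fitness is treated as better."""
    rng = rng or random.Random()
    n_worst = max(1, int(len(swarm) * worst_fraction))
    worst = sorted(swarm, key=fitness, reverse=True)[:n_worst]
    for particle in worst:
        for d in range(len(particle)):
            if rng.random() < pm:
                particle[d] ^= 1


# Toy fitness: the number of selected bits (fewer is better).
swarm = [[0, 0, 0], [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
mutate_worst(swarm, fitness=sum, pm=1.0, rng=random.Random(0))
print(swarm[1])  # [0, 0, 0]
```

Setting pm to 1.0 in the example makes the effect deterministic; a realistic run would use a small value so that only a few bits change.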
The method aims to obtain higher classification accuracy while selecting fewer features. The fitness function of the algorithm is therefore designed as formula (3):
where sum(chrom) is the number of features selected by each particle and Acc is the accuracy obtained by classifying with the selected features. The method uses the binary classifier SVM: a classification model is built on the samples from the feature subset of each particle, and the result serves as the fitness evaluation. The fitness function keeps the number of selected features as small as possible while keeping the classification error rate as low as possible. The binary quantum particle swarm optimization algorithm selects features as follows:
1) Initialize the population.
2) Calculate the fitness value of each particle in the swarm according to the fitness function and compare it with the particle's previous local optimum; if f(x_i) < f(pbest_i), set pbest_i = x_i, otherwise do not update.
3) Calculate the swarm's global best gbest and the mean best position mbest.
4) Calculate the local attractor p_i and the new position-update probability pr of the particle.
5) Update the value of x_i according to the function Transf(p_i, pr).
6) Screen out the particles with poor fitness values and, following the mutation idea of genetic algorithms, mutate them with probability Pm to increase the diversity of the population.
7) Judge whether the termination condition is met; if not, return to Step 4, otherwise proceed to the next step.
8) Output the optimal chromosome, that is, the optimal 0/1 string, in which 0 means the feature is not selected and 1 means the feature is selected.
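The eight steps above can be sketched end to end as follows. Several details are assumptions, because the text does not define them: mbest is taken as a per-bit majority vote over all pbest, pr is computed as alpha * d_H(x_i, mbest) * ln(1/u) / D capped at 1 (a commonly described BQPSO form), Transf flips each bit of p_i with probability pr, termination is a fixed iteration budget, and only the single worst particle is mutated. All parameter names and defaults are illustrative.

```python
import math
import random
from typing import Callable, List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of differing bits between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def bqpso(fitness: Callable[[List[int]], float], dim: int,
          n_particles: int = 20, n_iter: int = 50,
          alpha: float = 0.8, pm: float = 0.05, seed: int = 0) -> List[int]:
    """Minimize `fitness` over 0/1 strings of length `dim` (illustrative sketch)."""
    rng = random.Random(seed)
    # 1) initialize the population with random bit strings
    swarm = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [x[:] for x in swarm]
    pbest_fit = [math.inf] * n_particles

    def evaluate() -> List[float]:
        # 2) evaluate fitness and update each particle's personal best
        fits = [fitness(x) for x in swarm]
        for i, f in enumerate(fits):
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = swarm[i][:], f
        return fits

    for _ in range(n_iter):  # 7) assumed termination: fixed iteration budget
        fits = evaluate()
        # 3) gbest, and mbest as a per-bit majority vote over all pbest
        gbest = pbest[min(range(n_particles), key=pbest_fit.__getitem__)]
        mbest = [1 if 2 * sum(p[d] for p in pbest) >= n_particles else 0
                 for d in range(dim)]
        for i in range(n_particles):
            # 4) local attractor by bitwise crossover, then update probability
            p_i = [a if rng.random() < 0.5 else b
                   for a, b in zip(pbest[i], gbest)]
            u = 1.0 - rng.random()  # in (0, 1], so log(1/u) is finite
            pr = min(1.0, alpha * hamming(swarm[i], mbest) * math.log(1.0 / u) / dim)
            # 5) Transf(p_i, pr): flip each bit of p_i with probability pr
            swarm[i] = [bit ^ 1 if rng.random() < pr else bit for bit in p_i]
        # 6) mutate the particle that had the worst fitness, bit by bit
        worst = max(range(n_particles), key=fits.__getitem__)
        swarm[worst] = [bit ^ 1 if rng.random() < pm else bit
                        for bit in swarm[worst]]

    evaluate()  # count the final positions before reporting
    # 8) output the best 0/1 string found
    return pbest[min(range(n_particles), key=pbest_fit.__getitem__)][:]


target = [1, 0, 1, 0, 0, 1, 0, 0]  # hidden "ideal" subset (toy problem)
best = bqpso(lambda x: hamming(x, target), dim=len(target))
selected = [j for j, bit in enumerate(best) if bit == 1]  # decode the 0/1 string
print(best, selected)
```

The toy fitness replaces the SVM-based evaluation so the sketch runs without external libraries; in the method described here, the fitness call would train and score a classifier on the features marked 1.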
Step 5: repeat the above four steps several times to obtain the selected feature subsets. Each resulting feature subset is verified with ten-fold cross-validation. Fig. 3 compares the classification accuracy of the improved BQPSO algorithm and the BQPSO algorithm, both modeled with support vector machine classification.
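A minimal sketch of the ten-fold split is given below. The `train_and_score` callback is hypothetical and stands in for training an SVM on the training indices (restricted to the selected features) and returning its accuracy on the test indices. The split is not stratified by class, which is a simplification.

```python
import random
from typing import Callable, List, Sequence


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(n_samples: int,
                   train_and_score: Callable[[Sequence[int], Sequence[int]], float],
                   k: int = 10) -> float:
    """Average the score returned for each (train, test) split."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


folds = k_fold_indices(45)      # the Lymphoma data set has 45 samples
print([len(f) for f in folds])  # [5, 5, 5, 5, 5, 4, 4, 4, 4, 4]
```

With only 45 samples each test fold holds four or five samples, so averaging over several repeated runs, as Step 5 describes, helps reduce the variance of the estimate.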

Claims (2)

1. A feature selection method based on binary quantum particle swarm optimization, characterized in that the method comprises the following steps:
Step 1: input a public data set;
Step 2: calculate the correlation between each feature and the class label using the maximal information coefficient MIC, treat features whose correlation is below a set threshold as weakly correlated, and delete them;
Step 3: for the strongly correlated features, select the optimal feature subset using the mutation idea of genetic algorithms together with the binary quantum particle swarm optimization algorithm;
Specifically:
1) initialize the population;
2) calculate the fitness value of each particle in the swarm according to the fitness function and compare it with the particle's previous local optimum; if f(x_i) < f(pbest_i), set pbest_i = x_i, otherwise do not update;
3) calculate the swarm's global best gbest and the mean best position mbest;
4) calculate the local attractor p_i and the new position-update probability pr of the particle;
5) update the value of x_i according to the function Transf(p_i, pr);
6) screen out the particles with poor fitness values and, following the mutation idea of genetic algorithms, mutate them with probability Pm to increase the diversity of the population;
7) judge whether the termination condition is met; if not, return to step 4), otherwise proceed to the next step;
8) output the optimal feature subset;
where the fitness function is:
where w_A is the weight of the SVM classification accuracy, w_F is the weight of the number of features strongly correlated with the class label, sum(chrom) is the number of features strongly correlated with the class label, Acc is the classification accuracy obtained with the selected features, mic_c is the correlation between features and the class label calculated by the maximal information coefficient MIC, and mic_f is the correlation between features calculated by the maximal information coefficient MIC;
Step 4: evaluate the validity of the selected feature subset using the support vector machine algorithm.
2. The feature selection method based on binary quantum particle swarm optimization according to claim 1, characterized in that the correlation between each feature and the class label is calculated using the maximal information coefficient MIC,
specifically:
where X is the sample feature, Y is the class label, and B is taken as the 0.6 or 0.55 power of the total number of data points.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910400448.5A CN110210529A (en) 2019-05-14 2019-05-14 A feature selection method based on binary quantum particle swarm optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910400448.5A CN110210529A (en) 2019-05-14 2019-05-14 A feature selection method based on binary quantum particle swarm optimization

Publications (1)

Publication Number Publication Date
CN110210529A (en) 2019-09-06

Family

ID=67787230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910400448.5A Pending CN110210529A (en) 2019-05-14 2019-05-14 A feature selection method based on binary quantum particle swarm optimization

Country Status (1)

Country Link
CN (1) CN110210529A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140257767A1 (en) * 2013-03-09 2014-09-11 Bigwood Technology, Inc. PSO-Guided Trust-Tech Methods for Global Unconstrained Optimization
US20150242759A1 (en) * 2014-02-21 2015-08-27 Battelle Memorial Institute Method of generating features optimal to a dataset and classifier
CN108140145A (en) * 2015-08-13 2018-06-08 D-波系统公司 For the system and method for creating and being interacted using the higher degree between quantum device
CN105718943A (en) * 2016-01-19 2016-06-29 南京邮电大学 Character selection method based on particle swarm optimization algorithm
CN107657098A (en) * 2017-09-15 2018-02-02 哈尔滨工程大学 Perimeter antenna array Sparse methods based on quantum chicken group's mechanism of Evolution
CN108805159A (en) * 2018-04-17 2018-11-13 杭州电子科技大学 A kind of high dimensional data feature selection approach based on filtration method and genetic algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
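沈渊锋: "Research on feature selection methods based on improved particle swarm optimization algorithms", China Master's Theses Full-text Database, Information Science and Technology, no. 1, pages 18 - 49 *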

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659719A (en) * 2019-09-19 2020-01-07 江南大学 Aluminum profile flaw detection method
CN110659719B (en) * 2019-09-19 2022-02-08 江南大学 Aluminum profile flaw detection method
CN111191764A (en) * 2019-12-30 2020-05-22 内蒙古工业大学 Bus passenger flow volume test method and system based on SPGAPSO-SVM algorithm
CN112819062A (en) * 2021-01-26 2021-05-18 淮阴工学院 Fluorescence spectrum quadratic characteristic selection method based on mixed particle swarm and continuous projection
CN113408731A (en) * 2021-06-21 2021-09-17 北京计算机技术及应用研究所 K-near quantum circuit realizing method

Similar Documents

Publication Publication Date Title
CN110210529A (en) A kind of feature selection approach based on binary quanta particle swarm optimization
US11977634B2 (en) Method and system for detecting intrusion in parallel based on unbalanced data Deep Belief Network
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
Cheng et al. Label ranking methods based on the Plackett-Luce model
Bandyopadhyay et al. Multiobjective GAs, quantitative indices, and pattern classification
CN102663100B (en) Two-stage hybrid particle swarm optimization clustering method
CN110147321A (en) A kind of recognition methods of the defect high risk module based on software network
CN108363810A (en) Text classification method and device
CN110188785A (en) A kind of data clusters analysis method based on genetic algorithm
Parrott et al. Multi-objective techniques in genetic programming for evolving classifiers
CN112906890A (en) User attribute feature selection method based on mutual information and improved genetic algorithm
CN110837884B (en) Effective mixed characteristic selection method based on improved binary krill swarm algorithm and information gain algorithm
CN109670687A (en) A kind of mass analysis method based on particle group optimizing support vector machines
CN108805159A (en) A kind of high dimensional data feature selection approach based on filtration method and genetic algorithm
CN112633346A (en) Feature selection method based on feature interactivity
CN114625868A (en) Electric power data text classification algorithm based on selective ensemble learning
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
Ahlawat et al. A genetic algorithm based feature selection for handwritten digit recognition
CN104636814A (en) Method and system for optimizing random forest models
CN111275206A (en) Integrated learning method based on heuristic sampling
CN117978661A (en) Influence maximization method based on refused neighborhood
CN114169406A (en) Feature selection method based on symmetry uncertainty joint condition entropy
Lin et al. A new density-based scheme for clustering based on genetic algorithm
Lingras et al. Statistical, evolutionary, and neurocomputing clustering techniques: cluster-based vs object-based approaches
CN105654498A (en) Image segmentation method based on dynamic local search and immune clone automatic clustering

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination