CN108664562B - Text feature selection method based on particle swarm optimization - Google Patents

Text feature selection method based on particle swarm optimization

Info

Publication number
CN108664562B
CN108664562B (Application CN201810315024.4A)
Authority
CN
China
Prior art keywords
feature
particle
text
indicate
dimension
Prior art date
Legal status
Expired - Fee Related
Application number
CN201810315024.4A
Other languages
Chinese (zh)
Other versions
CN108664562A (en
Inventor
琚小明
王锋华
钱仲文
毛大鹏
吴翔
邢雅菲
张全
于晓蝶
夏洪涛
成敬周
王政
孙晨
王仲锋
吕旭芬
张旭东
张建松
Current Assignee
State Grid Zhejiang Electric Power Co Ltd
East China Normal University
Zhejiang Huayun Information Technology Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
East China Normal University
Zhejiang Huayun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd, East China Normal University, Zhejiang Huayun Information Technology Co Ltd filed Critical State Grid Zhejiang Electric Power Co Ltd
Priority to CN201810315024.4A priority Critical patent/CN108664562B/en
Publication of CN108664562A publication Critical patent/CN108664562A/en
Application granted granted Critical
Publication of CN108664562B publication Critical patent/CN108664562B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2111Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms

Abstract

The invention discloses a text feature selection method based on particle swarm optimization. The method addresses the high dimensionality and sparsity of text feature vectors that arise when text is represented with a vector space model. A local search strategy is embedded into the particle swarm optimization algorithm to select uncorrelated and informative feature subsets, and correlation information about the population guides the swarm to explore different features during the search, so that features more conducive to classification accuracy are selected from the original features. The invention can select, from a huge text word set, the feature subset most beneficial for text representation, laying a foundation for subsequent text classification and other text processing.

Description

Text feature selection method based on particle swarm optimization
Technical field
The present invention relates to the field of natural language processing, and in particular to a feature selection method based on particle swarm optimization (PSO-FS), applied to the feature selection of text so that effective features are selected to better represent the text.
Background technique
In the big data era, the volume of generated data grows ever larger, and extracting useful information from massive data becomes increasingly complex. Processing such data manually is no longer feasible, so it is natural to process the data by machine.
Text classification refers to the process of performing feature selection and analysis on text and grouping texts with the most similar feature attributes into one class. Text classification comprises the following steps: word segmentation, stop-word removal, feature selection, vector space model representation, classifier training, and classification. Most text content is expressed in natural language, which differs from machine language, so the original text must first be converted. The vector space model (VSM) represents text as vectors; if every word obtained by segmentation were used as a feature item, the vector dimension would be enormous, which not only complicates computation but also introduces a large amount of useless information that interferes with classification. Effectively selecting feature items and controlling their number is therefore a critical step.
Feature selection refers to selecting from the total feature set, via some feature scoring method, the features with strong discriminative power as feature items. Feature selection benefits text processing in several ways: (1) it can improve a model's predictive performance and effectively raise accuracy; (2) it reduces both training and prediction time, improving overall efficiency; (3) it reveals the meaning contained in the data and the process by which the data were generated. Simply put, feature selection picks the most effective features from the data set and thereby gives a better understanding of the data. The effective feature set produced by feature selection is smaller, so the representation is reduced in dimension and the learning cost of the model decreases. Common feature selection methods such as document frequency (DF), chi-square (CHI) statistics, information gain (IG), and mutual information (MI) have been analyzed and compared in the literature; the results show that each method has strengths and weaknesses depending on the classifier and data set.
Particle swarm optimization (Particle Swarm Optimization, PSO) is an important swarm intelligence algorithm derived from simulating the foraging behavior of bird flocks. PSO starts from a group of random particles and finds the optimal solution through continuous iteration that mimics the flock's behavior. In each iteration, every particle records the current best solution, updates the historical best solution, and adjusts its own position and velocity. The algorithm has strong global search ability and is easy to understand and implement; as an optimization tool it has been applied effectively in many fields. However, it also has shortcomings: when a local extremum is encountered, particle velocities drop rapidly until the swarm stagnates, making it difficult to escape the local extremum, and premature convergence occurs. The inertia weight is an important PSO parameter used to adjust the swarm's search ability.
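As a rough sketch of the iteration just described (a generic, minimal PSO for continuous minimization, not the patent's method; the inertia weight w, the bounds, and the test function are illustrative assumptions):

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, w=0.7, c1=2.0, c2=2.0):
    """Minimal PSO: each particle records its own best and the swarm's best,
    and every iteration updates velocity and position toward both."""
    xs = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                     # per-particle historical best
    pbest_f = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]       # swarm historical best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # inertia + pull toward own best + pull toward swarm best
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            fi = f(xs[i])
            if fi < pbest_f[i]:
                pbest[i], pbest_f[i] = xs[i][:], fi
                if fi < gbest_f:
                    gbest, gbest_f = xs[i][:], fi
    return gbest, gbest_f

random.seed(0)
best, best_f = pso_minimize(lambda x: sum(t * t for t in x), dim=3)
```

On a simple convex function the swarm quickly collapses onto the optimum; without inertia damping it can instead oscillate or stagnate at local extrema, which is the premature-convergence issue noted above.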
Summary of the invention
The object of the present invention is to provide a text feature selection method based on particle swarm optimization. The method uses the strong local search ability of particles to select a feature set that discriminates well between classes and carries a large amount of textual information, effectively reducing the dimensionality of the text vectors.
A text feature selection method based on particle swarm optimization comprises the following specific steps:
1) Segment the text set with a word segmentation tool and form the segmented words into a word set, which serves as the original features of the text set. Denote the feature set by T and the number of features in T by n, i.e., T = {t1, t2, ..., tn};
2) First, compute with formula (1) the average relationship distance Ri between feature ti and the other features,
where p(ti, tj) is the co-occurrence probability of ti and tj, i.e., the number of times ti and tj appear in the same sentence divided by the number of all words in the text set; p(ti) is the occurrence probability of feature ti, i.e., the number of occurrences of ti divided by the number of all words in the text set; p(tj) is the occurrence probability of feature tj, defined analogously. A higher Ri indicates that the feature is more strongly related to the other features; a lower Ri indicates that the feature is more distinct from them. After the R values of all features have been computed, sort all features by R value in ascending order, put the first half of the sorted features into the dissimilar group D, and put the second half into the similar group S;
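A minimal sketch of step 2). Since formula (1) itself is not reproduced in this text, the score below assumes a pointwise-mutual-information-style average, p(ti, tj) / (p(ti)·p(tj)) averaged over the other features; the tokenized `sentences` input and the ranking-and-halving follow the description above.

```python
def split_similar_dissimilar(sentences):
    """Rank features by an (assumed) average relationship score R and split:
    lower half -> dissimilar group D, upper half -> similar group S."""
    words = [w for s in sentences for w in s]
    total = len(words)                              # all words in the text set
    vocab = sorted(set(words))
    p = {t: words.count(t) / total for t in vocab}  # p(t): occurrences / all words
    def p_co(ti, tj):
        # p(ti, tj): sentences containing both words, over all words (per the text)
        return sum(1 for s in sentences if ti in s and tj in s) / total
    R = {}
    for ti in vocab:
        others = [tj for tj in vocab if tj != ti]
        R[ti] = sum(p_co(ti, tj) / (p[ti] * p[tj]) for tj in others) / len(others)
    ranked = sorted(vocab, key=lambda t: R[t])      # ascending by R value
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]             # D (dissimilar), S (similar)

D, S = split_similar_dissimilar([["power", "grid"], ["power", "text"],
                                 ["grid", "text", "feature"]])
```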
3) Set the total iteration count (iterations) and record the current iteration count with k. Randomly generate several binary particles x (i.e., particle positions) and initialize each particle's velocity vi. Let xi and vi denote the position and velocity of the i-th particle; both are m-dimensional vectors, and the value of each dimension is a random number in (0, 1);
4) Update each particle's velocity according to formula (2), and clamp each dimension of the updated velocity to (a, b), where a and b are user-defined parameters (here set to a = -4, b = 4). Concretely, with vid denoting the d-th dimension of vi: if vid > b, set vid = b; if vid < a, set vid = a; otherwise leave vid unchanged;

vid = vid + c1·r1·(pbest_i,d - xid) + c2·r2·(gbest_d - xid) (2)
where pbest_i denotes the best position that particle i itself has experienced, gbest denotes the best position that the swarm has experienced, c1 and c2 are learning factors, usually c1 = c2 = 2, and r1 and r2 are random numbers in [0, 1];
Update each particle's position by formula (3), changing each dimension value of the particle. Concretely, first compute s(vid) with formula (4), where s(vid) = 1/(1 + e^(-vid)) is the sigmoid function and e is the natural constant; then compare s(vid) with a random number rand: if s(vid) > rand, set xid = 1, otherwise set xid = 0, where xid denotes the d-th dimension value of particle xi and rand is a randomly initialized value;
xi=xi+vi (3)
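Step 4) and the sigmoid position rule can be sketched as follows (the velocity update is the standard PSO form reconstructed from the parameters described above, since formula (2)'s image is not reproduced in this text; the inertia coefficient is left implicit at 1):

```python
import math
import random

def update_particle(x, v, pbest, gbest, c1=2.0, c2=2.0, a=-4.0, b=4.0):
    """One binary-PSO step: velocity update clamped to (a, b), then each
    position bit set by comparing the sigmoid s(v_id) with a random number."""
    for d in range(len(x)):
        r1, r2 = random.random(), random.random()
        v[d] = v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
        v[d] = max(a, min(b, v[d]))               # clamp v_id into (a, b)
        s = 1.0 / (1.0 + math.exp(-v[d]))         # s(v_id), formula (4)
        x[d] = 1 if s > random.random() else 0    # sigmoid position rule
    return x, v
```

Clamping to (-4, 4) keeps s(vid) roughly within (0.018, 0.982), so every bit always retains some probability of flipping.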
5) After step 4), each dimension value of particle xi's position is 0 or 1. Since each dimension of xi corresponds to a feature of the feature set T, the positions where the dimension value of xi is 1 yield the feature subset T′. Using the similar feature set S and the dissimilar feature set D, split T′ into a dissimilar feature subset D′ and a similar feature subset S′. The numbers of similar and dissimilar features in a particle are then controlled: define a parameter α and let nD′ = αn and nS′ = (1 - α)n, where nD′ is the lower limit on the number of features in the dissimilar subset D′ and nS′ is the upper limit on the number of features in the similar subset S′. When the number of features in D′ is less than nD′, randomly select features from D into D′ until the count in D′ reaches nD′, and update xi by setting the corresponding dimension values to 1. Similarly, when the number of features in S′ is greater than nS′, randomly remove features from S′ until the count in S′ reaches nS′, and update xi by setting the corresponding dimension values to 0;
Through the above operations, the updated xi and the updated T′ are obtained;
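Step 5) can be sketched as follows (a sketch under the description above; the data structures and random choices are illustrative):

```python
import random

def repair_particle(x, features, D, S, alpha):
    """Enforce step 5)'s limits: at least nD' = alpha*n dissimilar features
    and at most nS' = (1 - alpha)*n similar features in the subset T'."""
    n = len(features)
    nD, nS = int(alpha * n), int((1 - alpha) * n)
    idx = {t: d for d, t in enumerate(features)}
    selected = {features[d] for d in range(n) if x[d] == 1}
    D_sel = [t for t in D if t in selected]     # dissimilar subset D'
    S_sel = [t for t in S if t in selected]     # similar subset S'
    while len(D_sel) < nD:                      # too few dissimilar features
        t = random.choice([u for u in D if u not in D_sel])
        D_sel.append(t)
        x[idx[t]] = 1                           # set the matching bit to 1
    while len(S_sel) > nS:                      # too many similar features
        t = S_sel.pop(random.randrange(len(S_sel)))
        x[idx[t]] = 0                           # clear the matching bit
    return x, D_sel + S_sel                     # updated x_i and subset T'
```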
6) Use the feature subset T′ represented by xi to represent the texts with the vector space model, train a KNN classifier, and compute the classification accuracy fi. The fitness function of feature selection is defined as the text classification accuracy: fi = nacc / N, where N is the total number of samples in the test text set and nacc is the number of test texts classified correctly;
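Step 6) can be sketched as follows, with a tiny term-frequency vector space model and a plain KNN classifier standing in for the ones used in the patent (the patent only fixes the fitness itself, f = nacc / N):

```python
from collections import Counter

def fitness(subset, train_docs, train_labels, test_docs, test_labels, k=3):
    """Fitness of a feature subset: KNN accuracy nacc / N on the test set,
    with each document a term-frequency vector over the selected features."""
    feats = sorted(subset)
    def vec(doc):
        counts = Counter(doc)
        return [counts[t] for t in feats]
    def dist2(u, w):
        return sum((ui - wi) ** 2 for ui, wi in zip(u, w))
    train = [(vec(d), y) for d, y in zip(train_docs, train_labels)]
    nacc = 0
    for doc, y in zip(test_docs, test_labels):
        q = vec(doc)
        nearest = sorted(train, key=lambda pair: dist2(pair[0], q))[:k]
        pred = Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
        nacc += (pred == y)
    return nacc / len(test_labels)              # f = nacc / N
```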
7) Using step 6), compute from pbest_i and gbest the classification accuracies fi^best and fg^best of the feature subsets they represent;
8) Update pbest_i and gbest: if fi > fi^best, then pbest_i = xi; if fi > fg^best, then gbest = xi;
9) Check whether the current iteration count k is less than iterations; if so, jump to step 4), otherwise terminate and output gbest;
10) Obtain the optimal feature subset from gbest.
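Putting steps 3) through 9) together (a schematic driver with a toy fitness standing in for the KNN accuracy of step 6); the step-5) repair is omitted here for brevity:

```python
import math
import random

def pso_feature_select(n_features, evaluate, n_particles=10, iterations=30):
    """Schematic of steps 3)-9): binary particles, sigmoid position update,
    and pbest/gbest bookkeeping driven by the fitness `evaluate(bits)`."""
    random.seed(1)
    xs = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vs = [[random.uniform(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_f = [evaluate(x) for x in xs]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iterations):                  # step 9): loop until k = iterations
        for i in range(n_particles):
            for d in range(n_features):          # step 4): velocity, clamped to (-4, 4)
                r1, r2 = random.random(), random.random()
                vs[i][d] += 2 * r1 * (pbest[i][d] - xs[i][d]) + 2 * r2 * (gbest[d] - xs[i][d])
                vs[i][d] = max(-4.0, min(4.0, vs[i][d]))
                s = 1.0 / (1.0 + math.exp(-vs[i][d]))
                xs[i][d] = 1 if s > random.random() else 0
            f = evaluate(xs[i])                  # step 6): fitness = accuracy
            if f > pbest_f[i]:                   # steps 7)-8): update pbest / gbest
                pbest[i], pbest_f[i] = xs[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = xs[i][:], f
    return gbest                                 # step 10): features where the bit is 1

# Toy fitness: reward selecting the first four features, penalize the rest.
best = pso_feature_select(8, lambda bits: sum(bits[:4]) - 0.1 * sum(bits[4:]))
```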
Beneficial effects of the present invention: the present invention can select, from a huge text word set, the feature subset most beneficial for text representation, laying a foundation for subsequent text classification and other text processing.
Detailed description of the invention
Fig. 1 is an example of text word segmentation according to the present invention;
Fig. 2 is a flow chart of computing the similar and dissimilar feature sets;
Fig. 3 illustrates the limits on similar and dissimilar features in a particle according to the present invention.
Specific embodiment
The present invention is a text feature selection method based on particle swarm optimization. The feature selection in this method can effectively select information-rich features and thereby achieve a better text representation.
A text feature selection method based on particle swarm optimization comprises the following specific steps:
1) Segment the text, as in Fig. 1, with a word segmentation tool and form the segmented words into a word set, which serves as the original features of the text. Denote the feature set by T and the number of features in T by n, i.e., T = {t1, t2, ..., tn};
2) First, compute with formula (1) the average relationship distance Ri between feature ti and the other features,
where p(ti, tj) is the co-occurrence probability of ti and tj, i.e., the number of times ti and tj appear in the same sentence divided by the number of all words in the text set; p(ti) is the occurrence probability of feature ti, i.e., the number of occurrences of ti divided by the number of all words in the text set; p(tj) is the occurrence probability of feature tj, defined analogously. A higher Ri indicates that the feature is more strongly related to the other features; a lower Ri indicates that the feature is more distinct from them. After the R values of all features have been computed, sort all features by R value in ascending order, put the first half of the sorted features into the dissimilar group D, and put the second half into the similar group S. This process is shown in Fig. 2.
3) Set the total iteration count (iterations) and record the current iteration count with k. Randomly generate several binary particles x (i.e., particle positions) and initialize each particle's velocity vi. Let xi and vi denote the position and velocity of the i-th particle; both are m-dimensional vectors, and the value of each dimension is a random number in (0, 1).
4) Update each particle's velocity according to formula (2), and clamp each dimension of the updated velocity to (a, b), where a and b are user-defined parameters (here set to a = -4, b = 4). Concretely, with vid denoting the d-th dimension of vi: if vid > b, set vid = b; if vid < a, set vid = a; otherwise leave vid unchanged.
where pbest_i denotes the best position that particle i itself has experienced, gbest denotes the best position that the swarm has experienced, c1 and c2 are learning factors, usually c1 = c2 = 2, and r1 and r2 are random numbers in [0, 1].
Update each particle's position by formula (3), changing each dimension value of the particle. Concretely, first compute s(vid) with formula (4), where s(vid) = 1/(1 + e^(-vid)) is the sigmoid function and e is the natural constant; then compare s(vid) with a random number rand: if s(vid) > rand, set xid = 1, otherwise set xid = 0, where xid denotes the d-th dimension value of particle xi and rand is a randomly initialized value.
xi=xi+vi (3)
5) After step 4), each dimension value of particle xi's position is 0 or 1. Since each dimension of xi corresponds to a feature of the feature set T, the positions where the dimension value of xi is 1 yield the feature subset T′. Using the similar feature set S and the dissimilar feature set D, split T′ into a dissimilar feature subset D′ and a similar feature subset S′. The numbers of similar and dissimilar features in the particle are then controlled: define a parameter α and let nD′ = αn and nS′ = (1 - α)n, where nD′ is the lower limit on the number of features in the dissimilar subset D′ and nS′ is the upper limit on the number of features in the similar subset S′. When the number of features in D′ is less than nD′, randomly select features from D into D′ until the count in D′ reaches nD′, and update xi by setting the corresponding dimension values to 1. Similarly, when the number of features in S′ is greater than nS′, randomly remove features from S′ until the count in S′ reaches nS′, and update xi by setting the corresponding dimension values to 0. Through these operations, the updated xi and the updated T′ are obtained. A concrete example is shown in Fig. 3.
6) Use the feature subset T′ represented by xi to represent the texts with the vector space model, train a KNN classifier, and compute the classification accuracy fi. The fitness function of feature selection is defined as the text classification accuracy: fi = nacc / N, where N is the total number of samples in the test text set and nacc is the number of test texts classified correctly;
7) Using the method of step 6), compute from pbest_i and gbest the classification accuracies fi^best and fg^best of the feature subsets they represent;
8) Update pbest_i and gbest: if fi > fi^best, then pbest_i = xi; if fi > fg^best, then gbest = xi;
9) Check whether the current iteration count k is less than iterations; if so, jump to step 4), otherwise terminate and output gbest;
10) Obtain the optimal feature subset from gbest. For example, if the positions of the feature set X whose value in gbest is 1 are positions 2, 3, 7, and 8, then the set composed of features 2, 3, 7, and 8 is the optimal feature set.

Claims (1)

1. A text feature selection method based on particle swarm optimization, characterized in that the method comprises the following specific steps:
1) Segment the text set with a word segmentation tool and form the segmented words into a word set, which serves as the original features of the text set. Denote the feature set by T and the number of features in T by n, i.e., T = {t1, t2, ..., tn};
2) First, compute with formula (1) the average relationship distance Ri between feature ti and the other features,
where p(ti, tj) is the co-occurrence probability of ti and tj, i.e., the number of times ti and tj appear in the same sentence divided by the number of all words in the text set; p(ti) is the occurrence probability of feature ti, i.e., the number of occurrences of ti divided by the number of all words in the text set; p(tj) is the occurrence probability of feature tj, defined analogously. A higher Ri indicates that the feature is more strongly related to the other features; a lower Ri indicates that the feature is more distinct from them. After the R values of all features have been computed, sort all features by R value in ascending order, put the first half of the sorted features into the dissimilar group D, and put the second half into the similar group S;
3) Set the total iteration count (iterations) and record the current iteration count with k. Randomly generate several binary particles x, i.e., particle positions, and initialize each particle's velocity vi. Let xi and vi denote the position and velocity of the i-th particle; both are m-dimensional vectors, and the value of each dimension is a random number in (0, 1);
4) Update each particle's velocity according to formula (2), and clamp each dimension of the updated velocity to (a, b), where a and b are user-defined parameters. Concretely, with vid denoting the d-th dimension of vi: if vid > b, set vid = b; if vid < a, set vid = a; otherwise leave vid unchanged;
where pbest_i denotes the best position that particle i itself has experienced, gbest denotes the best position that the swarm has experienced, c1 and c2 are learning factors, and r1 and r2 are random numbers in [0, 1];
Update each particle's position by formula (3), changing each dimension value of the particle. Concretely, first compute s(vid) with formula (4), where s(vid) = 1/(1 + e^(-vid)) and e is the natural constant; then compare s(vid) with a random number rand: if s(vid) > rand, set xid = 1, otherwise set xid = 0, where xid denotes the d-th dimension value of particle xi and rand is a randomly initialized value;
xi=xi+vi (3)
5) After step 4), each dimension value of particle xi's position is 0 or 1. Since each dimension of xi corresponds to a feature of the feature set T, the positions where the dimension value of xi is 1 yield the feature subset T′. Using the similar feature set S and the dissimilar feature set D, split T′ into a dissimilar feature subset D′ and a similar feature subset S′. The numbers of similar and dissimilar features in the particle are then controlled: define a parameter α and let nD′ = αn and nS′ = (1 - α)n, where nD′ is the lower limit on the number of features in the dissimilar subset D′ and nS′ is the upper limit on the number of features in the similar subset S′. When the number of features in D′ is less than nD′, randomly select features from D into D′ until the count in D′ reaches nD′, and update xi by setting the corresponding dimension values to 1. Similarly, when the number of features in S′ is greater than nS′, randomly remove features from S′ until the count in S′ reaches nS′, and update xi by setting the corresponding dimension values to 0;
Through the above operations, the updated xi and the updated T′ are obtained;
6) Use the feature subset T′ represented by xi to represent the texts with the vector space model, train a KNN classifier, and compute the classification accuracy fi. The fitness function of feature selection is defined as the text classification accuracy: fi = nacc / N, where N is the total number of samples in the test text set and nacc is the number of test texts classified correctly;
7) Using step 6), compute from pbest_i and gbest the classification accuracies fi^best and fg^best of the feature subsets they represent;
8) Update pbest_i and gbest: if fi > fi^best, then pbest_i = xi; if fi > fg^best, then gbest = xi;
9) Check whether the current iteration count k is less than iterations; if so, jump to step 4), otherwise terminate and output gbest;
10) Obtain the optimal feature subset from gbest, i.e., the set composed of the features at the positions where gbest is 1.
CN201810315024.4A 2018-04-10 2018-04-10 Text feature selection method based on particle swarm optimization Expired - Fee Related CN108664562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810315024.4A CN108664562B (en) 2018-04-10 2018-04-10 Text feature selection method based on particle swarm optimization


Publications (2)

Publication Number Publication Date
CN108664562A CN108664562A (en) 2018-10-16
CN108664562B true CN108664562B (en) 2019-10-01

Family

ID=63783195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810315024.4A Expired - Fee Related CN108664562B (en) 2018-04-10 2018-04-10 Text feature selection method based on particle swarm optimization

Country Status (1)

Country Link
CN (1) CN108664562B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110336637A * 2019-07-15 2019-10-15 北京航空航天大学 A UAV interference signal feature selection method
CN112365117A (en) * 2020-09-03 2021-02-12 中交西安筑路机械有限公司 Pavement structure performance calculation method based on optimized support vector machine
CN112613595A (en) * 2020-12-25 2021-04-06 煤炭科学研究总院 Ultra-wideband radar echo signal preprocessing method based on variational modal decomposition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5070591B2 (en) * 2007-05-25 2012-11-14 株式会社国際電気通信基礎技術研究所 Noise suppression device, computer program, and speech recognition system
CN105095494B * 2015-08-21 2019-03-26 中国地质大学(武汉) A method for testing classification data sets
CN107506821A * 2017-10-13 2017-12-22 集美大学 An improved particle swarm optimization method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191001