CN108664562B - Text feature selection method based on particle swarm optimization - Google Patents
Text feature selection method based on particle swarm optimization
- Publication number
- CN108664562B (granted publication of application CN201810315024.4A)
- Authority
- CN
- China
- Prior art keywords
- feature
- particle
- text
- indicate
- dimension
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2111—Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
Abstract
The invention discloses a text feature selection method based on particle swarm optimization (PSO). The method addresses the high dimensionality and sparsity of the text feature vectors produced when text is represented with the vector space model. A local search strategy is embedded into the particle swarm optimization algorithm to select an uncorrelated and significant feature subset: by exploiting relational information within the swarm, the algorithm is guided to select different features during the search, so that the features most beneficial to classification accuracy are chosen from the original features. The invention can select, from a huge set of text words, the feature subset best suited to representing text, laying a foundation for text classification and text processing.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a feature selection method based on the particle swarm optimization algorithm (PSO-FS). It is applied to feature selection for text: effective features are selected so that text can be represented better.
Background art
In the era of big data, ever larger volumes of data are generated, and obtaining useful information from such large amounts of data has become increasingly complex. Processing the data by manual methods is therefore difficult, so it is natural to turn to machines to process the data.
Text classification is the process of performing feature selection and analysis on texts and grouping the texts whose feature attributes are most similar into the same class. It comprises the following steps: word segmentation, stop-word removal, feature selection, vector space model representation, classifier training, and classification. Most text content is expressed in natural language, which differs from machine representations, so the raw text must first be converted. The vector space model (VSM) represents a text as a vector; if every word obtained by segmentation were used as a feature term, the vector dimension would be enormous, which not only makes computation complex but also lets the large amount of useless information among the segmented words interfere with classification. Effectively selecting feature terms and controlling their number is therefore a critical step.
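The VSM representation described above can be sketched minimally. The helper below is illustrative only (it is not from the patent) and represents a tokenized document as a term-frequency vector over a fixed vocabulary:

```python
from collections import Counter

def vsm_vector(tokens, vocabulary):
    # Count term occurrences, then project onto the fixed vocabulary order
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]

vocabulary = ["particle", "swarm", "text", "feature"]
doc = ["text", "feature", "selection", "text"]
print(vsm_vector(doc, vocabulary))  # [0, 0, 2, 1]
```

Note how any word outside the chosen vocabulary ("selection" here) is simply dropped, which is why controlling the feature set matters.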
Feature selection refers to using some feature-scoring method to select, from the full feature set, the terms with the strongest discriminative power for texts as feature items. Feature selection benefits text processing in several ways: (1) it can improve the predictive performance of a model and effectively raise accuracy; (2) it reduces both the training time and the prediction time of the model, improving overall efficiency; (3) it reveals the meaning implicit in the data and the process that generated it. In short, feature selection picks the most effective features from a data set and leads to a better understanding of the data. The smaller the effective feature set it produces, the lower the representation dimension and the lower the learning cost of the model. The literature has analyzed and compared common feature selection methods such as document frequency (DF), chi-square (CHI) statistics, information gain (IG), and mutual information (MI); the results show that, for different classifiers and data sets, each method has strengths and weaknesses.
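Of the four methods named, document frequency is the simplest to state. A minimal sketch (the function name is ours, and each document is assumed to be a set of terms):

```python
def document_frequency(corpus, term):
    # DF: the number of documents containing the term at least once
    return sum(1 for doc in corpus if term in doc)

corpus = [{"particle", "swarm"}, {"swarm", "text"}, {"feature"}]
print(document_frequency(corpus, "swarm"))  # 2
```

Terms whose DF falls below a threshold are typically discarded as uninformative.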
Particle swarm optimization (PSO) is an important swarm intelligence algorithm derived from simulating the foraging behavior of bird flocks. PSO starts from a randomly initialized group of particles and finds the optimal solution through continuous iteration that mimics the flock's behavior. In each iteration, every particle records the current best solution, updates the historical best solution, and adjusts its own position and velocity. The algorithm has strong global search ability and is easy to understand and implement; as an optimization tool it has been applied effectively in many fields. It has drawbacks of its own, however: when a local extremum is encountered, the particle velocities shrink rapidly until the swarm stagnates, making it hard to escape the local optimum, and premature convergence occurs. The inertia weight is an important PSO parameter used to adjust the swarm's search ability.
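The standard (real-valued) PSO update just described can be sketched as follows. The inertia weight w and learning factors c1, c2 follow common PSO conventions rather than any formula printed in this patent, and all names are illustrative:

```python
import random

def pso_step(x, v, p_best, g_best, w=0.9, c1=2.0, c2=2.0):
    # Velocity: inertia term + cognitive pull toward the particle's own best
    # position + social pull toward the swarm's global best position
    r1, r2 = random.random(), random.random()
    v = [w * vd + c1 * r1 * (pb - xd) + c2 * r2 * (gb - xd)
         for vd, xd, pb, gb in zip(v, x, p_best, g_best)]
    x = [xd + vd for xd, vd in zip(x, v)]  # position update
    return x, v
```

Setting w low damps exploration, which is exactly the stagnation risk near local extrema mentioned above.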
Summary of the invention
The object of the present invention is to provide a text feature selection method based on particle swarm optimization. The method uses the strong local search ability of the particles to select a feature set that is significant for class discrimination and rich in text information, effectively reducing the dimensionality of the text vectors.
A text feature selection method based on particle swarm optimization comprises the following specific steps:
1) Segment the text set with a word segmentation tool and collect the words after segmentation into a word set, which serves as the original feature set of the text set. Denote the feature set by T, with n features, i.e. T = {t1, t2, ..., tn};
2) First, calculate with formula (1) the average relationship distance Ri between feature ti and the other features, where p(ti, tj) denotes the co-occurrence probability of ti and tj, i.e. the number of times ti and tj appear together in a sentence divided by the total number of words in the text set; p(ti) denotes the probability of feature ti, i.e. the number of occurrences of ti divided by the total number of words in the text set; and p(tj) denotes the probability of feature tj, defined likewise. A higher Ri indicates that the feature is more strongly related to the other features; a lower Ri indicates that it is more distinct from them. After Ri has been computed for all features, sort the features in ascending order of R; put the first half of the sorted features into the dissimilar group D and the second half into the similar group S;
3) Set the total number of iterations, iterations, and record the current iteration count with k. Randomly generate several binary particles x (the particle positions) and initialize each initial velocity vi. Let xi and vi denote the position and velocity of the i-th particle; both are m-dimensional vectors, and the value of each velocity dimension is a random number in (0, 1);
4) Update the particle velocities according to formula (2), and limit each dimension of the updated velocity to (a, b), where a and b are user-defined parameters (set here to a = -4, b = 4). Concretely, let vid denote the d-th dimension of vi: if vid > b, set vid = b; if vid < a, set vid = a; otherwise vid is unchanged;
vid = w·vid + c1·r1·(pbest_id - xid) + c2·r2·(gbest_d - xid)    (2)
where pbest_i denotes the best position particle i has experienced itself and gbest denotes the best position the whole swarm has experienced; c1 and c2 are learning factors, usually c1 = c2 = 2; r1 and r2 are random numbers in [0, 1].
Update the particle positions by formula (3) and change each dimension of the particle as follows: first compute s(vid) with formula (4), where e is the natural constant and e^(-vid) is e raised to the power -vid; then compare s(vid) with a random number rand: if s(vid) > rand, set xid = 1, otherwise set it to 0, where xid denotes the value of the d-th dimension of particle xi and rand is a randomly initialized value;
s(vid) = 1 / (1 + e^(-vid))    (4)
xi = xi + vi    (3)
5) After step 4), each dimension of particle xi's position is 0 or 1. Since the dimensions of xi correspond one-to-one with the features of set T, the positions where xi has value 1 determine a feature subset T′. Using the similar feature set S and the dissimilar feature set D, split T′ into a dissimilar feature subset D′ and a similar feature subset S′. To control the numbers of similar and dissimilar features in a particle, define a parameter α and let nD′ = α·n and nS′ = (1 - α)·n, where nD′ is the lower bound on the number of features in D′ and nS′ is the upper bound on the number of features in S′. When the number of features in D′ is below nD′, randomly select features from D into D′ until D′ reaches nD′, and update xi by setting the corresponding dimension values to 1. Similarly, when the number of features in S′ exceeds nS′, randomly remove features from S′ until S′ falls to nS′, and update xi by setting the corresponding dimension values to 0.
These operations yield the updated xi and the updated T′;
6) Using the feature subset T′ represented by xi, represent the texts with the vector space model, train a KNN classifier, and compute the classification accuracy fi. The fitness function of the feature selection is defined as the text classification accuracy: fi = nacc / N, where N is the total number of samples in the test text set and nacc is the number of test texts classified correctly;
7) Using step 6), compute the classification accuracies fbest_i and fbest_g of the feature subsets represented by pbest_i and gbest;
8) Update pbest_i and gbest: if fi > fbest_i, then pbest_i = xi and fbest_i = fi; if fi > fbest_g, then gbest = xi and fbest_g = fi;
9) Judge whether the current iteration count k is less than iterations; if so, jump back to step 4), otherwise terminate and output gbest;
10) Obtain the optimal feature subset from gbest.
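Steps 3) to 10) above can be sketched as runnable code under stated assumptions: the KNN fitness evaluation is abstracted behind a `fitness` callback, formula (2) is taken to be the standard inertia-weighted PSO velocity update and formula (4) the sigmoid transfer function (both appear only as images in the source), and every identifier is ours, since the patent publishes no code:

```python
import math
import random

def sigmoid(v):
    # Assumed formula (4): s(v) = 1 / (1 + e^(-v))
    return 1.0 / (1.0 + math.exp(-v))

def repair(x, dissim, sim, alpha):
    # Step 5: keep at least alpha*n dissimilar features selected and at most
    # (1 - alpha)*n similar features selected (assumes len(dissim) >= alpha*n)
    n = len(x)
    n_d, n_s = int(alpha * n), int((1 - alpha) * n)
    chosen_d = [i for i in dissim if x[i] == 1]
    chosen_s = [i for i in sim if x[i] == 1]
    while len(chosen_d) < n_d:
        i = random.choice([j for j in dissim if x[j] == 0])
        x[i] = 1
        chosen_d.append(i)
    while len(chosen_s) > n_s:
        x[chosen_s.pop(random.randrange(len(chosen_s)))] = 0
    return x

def pso_feature_selection(fitness, n, dissim, sim, n_particles=10,
                          iterations=20, alpha=0.5, w=0.9,
                          c1=2.0, c2=2.0, a=-4.0, b=4.0):
    # Step 3: random binary positions, velocities in (0, 1)
    X = [[random.randint(0, 1) for _ in range(n)] for _ in range(n_particles)]
    V = [[random.random() for _ in range(n)] for _ in range(n_particles)]
    p_best = [x[:] for x in X]
    p_fit = [fitness(x) for x in X]
    g = max(range(n_particles), key=p_fit.__getitem__)
    g_best, g_fit = p_best[g][:], p_fit[g]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(n):
                r1, r2 = random.random(), random.random()
                # Assumed formula (2), clamped to (a, b) as in step 4
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (p_best[i][d] - X[i][d])
                           + c2 * r2 * (g_best[d] - X[i][d]))
                V[i][d] = max(a, min(b, V[i][d]))
                X[i][d] = 1 if sigmoid(V[i][d]) > random.random() else 0
            X[i] = repair(X[i], dissim, sim, alpha)        # step 5
            f = fitness(X[i])                              # steps 6-7
            if f > p_fit[i]:                               # step 8
                p_fit[i], p_best[i] = f, X[i][:]
            if f > g_fit:
                g_fit, g_best = f, X[i][:]
    return g_best, g_fit                                   # steps 9-10
```

With α = 0.5 and D and S each holding half the features, at least half of every particle's selected features come from the dissimilar group, which is the mechanism the abstract describes as guiding the swarm to select different features.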
Beneficial effects of the present invention: the invention can select, from a huge set of text words, the feature subset most beneficial to text representation, laying a foundation for text classification and text processing.
Detailed description of the invention
Fig. 1 is an example of word segmentation of a text according to the present invention;
Fig. 2 is a flow chart of computing the similar and dissimilar feature sets;
Fig. 3 illustrates the limits on similar and dissimilar features within a particle according to the present invention.
Specific embodiment
The present invention is a text feature selection method based on particle swarm optimization. The feature selection in this method can effectively select information-rich features and thereby achieve a better text representation.
A text feature selection method based on particle swarm optimization comprises the following specific steps:
1) Segment the text (as in Fig. 1) with a word segmentation tool and collect the words after segmentation into a word set, which serves as the original feature set of the text. Denote the feature set by T, with n features, i.e. T = {t1, t2, ..., tn};
2) First, calculate with formula (1) the average relationship distance Ri between feature ti and the other features, where p(ti, tj) denotes the co-occurrence probability of ti and tj, i.e. the number of times ti and tj appear together in a sentence divided by the total number of words in the text set; p(ti) denotes the probability of feature ti, i.e. the number of occurrences of ti divided by the total number of words in the text set; and p(tj) denotes the probability of feature tj, defined likewise. A higher Ri indicates that the feature is more strongly related to the other features; a lower Ri indicates that it is more distinct from them. After Ri has been computed for all features, sort the features in ascending order of R; put the first half of the sorted features into the dissimilar group D and the second half into the similar group S, as shown in Fig. 2.
3) Set the total number of iterations, iterations, and record the current iteration count with k. Randomly generate several binary particles x (the particle positions) and initialize each initial velocity vi. Let xi and vi denote the position and velocity of the i-th particle; both are m-dimensional vectors, and the value of each velocity dimension is a random number in (0, 1).
4) Update the particle velocities according to formula (2), and limit each dimension of the updated velocity to (a, b), where a and b are user-defined parameters (set here to a = -4, b = 4): let vid denote the d-th dimension of vi; if vid > b, set vid = b; if vid < a, set vid = a; otherwise vid is unchanged;
vid = w·vid + c1·r1·(pbest_id - xid) + c2·r2·(gbest_d - xid)    (2)
where pbest_i denotes the best position particle i has experienced itself and gbest the best position the swarm has experienced; c1 and c2 are learning factors, usually c1 = c2 = 2; r1 and r2 are random numbers in [0, 1].
Update the particle positions by formula (3) and change each dimension of the particle as follows: first compute s(vid) with formula (4), where e is the natural constant and e^(-vid) is e raised to the power -vid; then compare s(vid) with a random number rand: if s(vid) > rand, set xid = 1, otherwise set it to 0, where xid is the d-th dimension of particle xi and rand is a randomly initialized value.
s(vid) = 1 / (1 + e^(-vid))    (4)
xi = xi + vi    (3)
5) After step 4), each dimension of particle xi's position is 0 or 1. Since the dimensions of xi correspond one-to-one with the features of T, the positions where xi has value 1 determine a feature subset T′. Using the similar feature set S and the dissimilar feature set D, split T′ into a dissimilar feature subset D′ and a similar feature subset S′. To control the numbers of similar and dissimilar features in a particle, define a parameter α and let nD′ = α·n and nS′ = (1 - α)·n, where nD′ is the lower bound on the number of features in D′ and nS′ is the upper bound on the number of features in S′. When D′ has fewer than nD′ features, randomly select features from D into D′ until D′ reaches nD′, and update xi by setting the corresponding dimension values to 1. Similarly, when S′ has more than nS′ features, randomly remove features from S′ until it falls to nS′, and update xi by setting the corresponding dimension values to 0. These operations yield the updated xi and the updated T′. A concrete example is shown in Fig. 3.
6) Using the feature subset T′ represented by xi, represent the texts with the vector space model, train a KNN classifier, and compute the classification accuracy fi. The fitness function of the feature selection is defined as the text classification accuracy: fi = nacc / N, where N is the total number of samples in the test text set and nacc is the number of test texts classified correctly;
7) Using the method of step 6), compute the classification accuracies fbest_i and fbest_g of the feature subsets represented by pbest_i and gbest;
8) Update pbest_i and gbest: if fi > fbest_i, then pbest_i = xi and fbest_i = fi; if fi > fbest_g, then gbest = xi and fbest_g = fi;
9) Judge whether the current iteration count k is less than iterations; if so, jump back to step 4), otherwise terminate and output gbest.
10) Obtain the optimal feature subset from gbest. For example, if the dimensions of gbest equal to 1 are at positions 1, 2, 3, 7 and 8, then the set composed of the 1st, 2nd, 3rd, 7th and 8th features of the feature set is the optimal feature set.
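The similarity partition of step 2) can also be sketched. Formula (1) appears only as an image in the source, so the expression used below, the mean of p(ti, tj) / (p(ti)·p(tj)) over all j ≠ i, is an assumption consistent with the textual definitions of the probabilities; all names are illustrative:

```python
def split_by_relation(sentences, features):
    # sentences: list of tokenized sentences; features: terms that occur in them
    total = sum(len(s) for s in sentences)  # all words in the text set
    p = {t: sum(s.count(t) for s in sentences) / total for t in features}

    def p_joint(a, b):
        # co-occurrence: sentences containing both terms, over total word count
        return sum(1 for s in sentences if a in s and b in s) / total

    R = {}
    for ti in features:
        others = [tj for tj in features if tj != ti]
        # Assumed formula (1): average relation of ti to every other feature
        R[ti] = sum(p_joint(ti, tj) / (p[ti] * p[tj]) for tj in others) / len(others)
    ranked = sorted(features, key=R.get)    # ascending by R, as in step 2
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]     # D (dissimilar), S (similar)
```

Features with low average relation to the rest land in D and are the ones the repair step of 5) preferentially forces into each particle.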
Claims (1)
1. A text feature selection method based on particle swarm optimization, characterized in that the method comprises the following specific steps:
1) segmenting the text set with a word segmentation tool and collecting the words after segmentation into a word set as the original feature set of the text set; denoting the feature set by T, with n features, i.e. T = {t1, t2, ..., tn};
2) first calculating with formula (1) the average relationship distance Ri between feature ti and the other features, wherein p(ti, tj) denotes the co-occurrence probability of ti and tj, i.e. the number of times ti and tj appear together in a sentence divided by the total number of words in the text set; p(ti) denotes the probability of feature ti, i.e. the number of occurrences of ti divided by the total number of words in the text set; and p(tj) denotes the probability of feature tj, defined likewise; a higher Ri indicating that the feature is more strongly related to the other features and a lower Ri indicating that it is more distinct from them; after Ri has been computed for all features, sorting the features in ascending order of R, putting the first half of the sorted features into a dissimilar group D and the second half into a similar group S;
3) setting the total number of iterations, iterations, and recording the current iteration count with k; randomly generating several binary particles x, i.e. particle positions, and initializing each initial velocity vi; denoting by xi and vi the position and velocity of the i-th particle, both being m-dimensional vectors, the value of each velocity dimension being a random number in (0, 1);
4) updating the particle velocities according to formula (2) and limiting each dimension of the updated velocity to (a, b), a and b being user-defined parameters; concretely, letting vid denote the d-th dimension of vi: if vid > b, setting vid = b; if vid < a, setting vid = a; otherwise leaving vid unchanged;
vid = w·vid + c1·r1·(pbest_id - xid) + c2·r2·(gbest_d - xid)    (2)
wherein pbest_i denotes the best position particle i has experienced itself and gbest the best position the swarm has experienced; c1 and c2 are learning factors; r1 and r2 are random numbers in [0, 1];
updating the particle positions by formula (3) and changing each dimension of the particle as follows: first computing s(vid) with formula (4), wherein e is the natural constant and e^(-vid) is e raised to the power -vid; then comparing s(vid) with a random number rand: if s(vid) > rand, setting xid = 1, otherwise setting it to 0, xid denoting the d-th dimension of particle xi and rand being a randomly initialized value;
s(vid) = 1 / (1 + e^(-vid))    (4)
xi = xi + vi    (3)
5) after step 4), each dimension of particle xi's position being 0 or 1; since the dimensions of xi correspond one-to-one with the features of set T, obtaining a feature subset T′ from the positions where xi has value 1; splitting T′, by means of the similar feature set S and the dissimilar feature set D, into a dissimilar feature subset D′ and a similar feature subset S′; controlling the numbers of similar and dissimilar features in the particle by defining a parameter α and letting nD′ = α·n and nS′ = (1 - α)·n, nD′ being the lower bound on the number of features in D′ and nS′ the upper bound on the number of features in S′; when D′ has fewer than nD′ features, randomly selecting features from D into D′ until D′ reaches nD′ and updating xi by setting the corresponding dimension values to 1; similarly, when S′ has more than nS′ features, randomly removing features from S′ until S′ falls to nS′ and updating xi by setting the corresponding dimension values to 0;
these operations yielding the updated xi and the updated T′;
6) using the feature subset T′ represented by xi, representing the texts with the vector space model, training a KNN classifier, and computing the classification accuracy fi; defining the fitness function of the feature selection as the text classification accuracy fi = nacc / N, wherein N is the total number of samples in the test text set and nacc is the number of test texts classified correctly;
7) using step 6), computing the classification accuracies fbest_i and fbest_g of the feature subsets represented by pbest_i and gbest;
8) updating pbest_i and gbest: if fi > fbest_i, then pbest_i = xi and fbest_i = fi; if fi > fbest_g, then gbest = xi and fbest_g = fi;
9) judging whether the current iteration count k is less than iterations; if so, jumping back to step 4), otherwise terminating and outputting gbest;
10) obtaining the optimal feature subset as the set composed of the features at the positions where gbest has value 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810315024.4A CN108664562B (en) | 2018-04-10 | 2018-04-10 | The text feature selection method of particle group optimizing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108664562A CN108664562A (en) | 2018-10-16 |
CN108664562B true CN108664562B (en) | 2019-10-01 |
Family
ID=63783195
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810315024.4A Expired - Fee Related CN108664562B (en) | 2018-04-10 | 2018-04-10 | The text feature selection method of particle group optimizing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108664562B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110336637A (en) * | 2019-07-15 | 2019-10-15 | 北京航空航天大学 | A kind of unmanned plane interference signal feature selection approach |
CN112365117A (en) * | 2020-09-03 | 2021-02-12 | 中交西安筑路机械有限公司 | Pavement structure performance calculation method based on optimized support vector machine |
CN112613595A (en) * | 2020-12-25 | 2021-04-06 | 煤炭科学研究总院 | Ultra-wideband radar echo signal preprocessing method based on variational modal decomposition |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5070591B2 (en) * | 2007-05-25 | 2012-11-14 | 株式会社国際電気通信基礎技術研究所 | Noise suppression device, computer program, and speech recognition system |
CN105095494B (en) * | 2015-08-21 | 2019-03-26 | 中国地质大学(武汉) | The method that a kind of pair of categorized data set is tested |
CN107506821A (en) * | 2017-10-13 | 2017-12-22 | 集美大学 | A kind of improved particle group optimizing method |
- 2018-04-10: application CN201810315024.4A filed in China, granted as patent CN108664562B (status: not active, Expired - Fee Related)
Also Published As
Publication number | Publication date |
---|---|
CN108664562A (en) | 2018-10-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108304316B (en) | Software defect prediction method based on collaborative migration | |
CN110633745A (en) | Image classification training method and device based on artificial intelligence and storage medium | |
CN108664562B (en) | The text feature selection method of particle group optimizing | |
CN111428733B (en) | Zero sample target detection method and system based on semantic feature space conversion | |
CN114841257B (en) | Small sample target detection method based on self-supervision comparison constraint | |
CN107992895A (en) | A kind of Boosting support vector machines learning method | |
CN104573669A (en) | Image object detection method | |
EP2936392B1 (en) | Image pattern recognition system and method | |
CN105389583A (en) | Image classifier generation method, and image classification method and device | |
Chen et al. | An Enhanced Region Proposal Network for object detection using deep learning method | |
CN101196905A (en) | Intelligent pattern searching method | |
CN110879881B (en) | Mouse track recognition method based on feature component hierarchy and semi-supervised random forest | |
CN109522544A (en) | Sentence vector calculation, file classification method and system based on Chi-square Test | |
CN105930792A (en) | Human action classification method based on video local feature dictionary | |
CN110991518A (en) | Two-stage feature selection method and system based on evolution multitask | |
CN113704522A (en) | Artificial intelligence-based target image rapid retrieval method and system | |
CN112183671A (en) | Target attack counterattack sample generation method for deep learning model | |
CN110097067B (en) | Weak supervision fine-grained image classification method based on layer-feed feature transformation | |
CN115331752A (en) | Method capable of adaptively predicting quartz forming environment | |
CN111797935A (en) | Semi-supervised deep network picture classification method based on group intelligence | |
US20230252282A1 (en) | Method, server, and system for deep metric learning per hierarchical steps of multi-labels and few-shot inference using the same | |
CN110046255A (en) | A kind of file classification method based on anti-noise traveling time potential energy cluster | |
Li et al. | Wheat cultivar classifications based on tabu search and fuzzy c-means clustering algorithm | |
Zhang et al. | Improved deep learning model text classification | |
CN113971442A (en) | Method and system for generating universal countermeasure disturbance based on self-walking learning |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20191001 |