CN108664562B - Text feature selection method based on particle swarm optimization - Google Patents

Text feature selection method based on particle swarm optimization

Info

Publication number
CN108664562B
CN108664562B (Application CN201810315024.4A)
Authority
CN
China
Prior art keywords
feature
particle
text
indicate
dimension
Prior art date
Legal status
Expired - Fee Related
Application number
CN201810315024.4A
Other languages
Chinese (zh)
Other versions
CN108664562A (en
Inventor
琚小明
王锋华
钱仲文
毛大鹏
吴翔
邢雅菲
张全
于晓蝶
夏洪涛
成敬周
王政
孙晨
王仲锋
吕旭芬
张旭东
张建松
Current Assignee
State Grid Zhejiang Electric Power Co Ltd
East China Normal University
Zhejiang Huayun Information Technology Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
East China Normal University
Zhejiang Huayun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd, East China Normal University, Zhejiang Huayun Information Technology Co Ltd filed Critical State Grid Zhejiang Electric Power Co Ltd
Priority to CN201810315024.4A priority Critical patent/CN108664562B/en
Publication of CN108664562A publication Critical patent/CN108664562A/en
Application granted granted Critical
Publication of CN108664562B publication Critical patent/CN108664562B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2111Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms

Abstract

The invention discloses a text feature selection method based on particle swarm optimization. The method addresses the high dimensionality and sparsity of text feature vectors that arise when text is represented with a vector space model. A local search strategy is embedded into the particle swarm optimization algorithm to select uncorrelated and informative feature subsets, and correlation information about the population guides the swarm to explore different features during the search, so that features more conducive to classification accuracy are selected from the original features. The invention can select, from a huge text word set, the feature subset most beneficial for text representation, laying a foundation for subsequent text classification and other text processing.

Description

Text feature selection method based on particle swarm optimization
Technical field
The present invention relates to the field of natural language processing, and in particular to a feature selection method based on particle swarm optimization (PSO-FS), applied to the feature selection of text so that effective features are selected to better represent the text.
Background technique
In the big data era, the volume of generated data grows ever larger, and extracting useful information from massive data becomes increasingly complex. Processing such data manually is no longer feasible, so it is natural to process the data by machine.
Text classification refers to the process of performing feature selection and analysis on text and grouping texts with the most similar feature attributes into one class. Text classification comprises the following steps: word segmentation, stop-word removal, feature selection, vector space model representation, classifier training, and classification. Most text content is expressed in natural language, which differs from machine language, so the original text must first be converted. The vector space model (VSM) represents text as vectors; if every word obtained by segmentation were used as a feature item, the vector dimension would be enormous, which not only complicates computation but also introduces a large amount of useless information that interferes with classification. Effectively selecting feature items and controlling their number is therefore a critical step.
Feature selection refers to selecting from the total feature set, via some feature scoring method, the features with strong discriminative power as feature items. Feature selection benefits text processing in several ways: (1) it can improve a model's predictive performance and effectively raise accuracy; (2) it reduces both training and prediction time, improving overall efficiency; (3) it reveals the meaning contained in the data and the process by which the data were generated. Simply put, feature selection picks the most effective features from the data set and thereby gives a better understanding of the data. The effective feature set produced by feature selection is smaller, so the representation is reduced in dimension and the learning cost of the model decreases. Common feature selection methods such as document frequency (DF), chi-square (CHI) statistics, information gain (IG), and mutual information (MI) have been analyzed and compared in the literature; the results show that each method has strengths and weaknesses depending on the classifier and data set.
Particle swarm optimization (Particle Swarm Optimization, PSO) is an important swarm intelligence algorithm derived from simulating the foraging behavior of bird flocks. PSO starts from a group of random particles and finds the optimal solution through continuous iteration that mimics the flock's behavior. In each iteration, every particle records the current best solution, updates the historical best solution, and adjusts its own position and velocity. The algorithm has strong global search ability and is easy to understand and implement; as an optimization tool it has been applied effectively in many fields. However, it also has shortcomings: when a local extremum is encountered, particle velocities drop rapidly until the swarm stagnates, making it difficult to escape the local extremum, and premature convergence occurs. The inertia weight is an important PSO parameter used to adjust the swarm's search ability.
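As a rough sketch of the iteration just described (a generic, minimal PSO for continuous minimization, not the patent's method; the inertia weight w, the bounds, and the test function are illustrative assumptions):

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, w=0.7, c1=2.0, c2=2.0):
    """Minimal PSO: each particle records its own best and the swarm's best,
    and every iteration updates velocity and position toward both."""
    xs = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                     # per-particle historical best
    pbest_f = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]       # swarm historical best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # inertia + pull toward own best + pull toward swarm best
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            fi = f(xs[i])
            if fi < pbest_f[i]:
                pbest[i], pbest_f[i] = xs[i][:], fi
                if fi < gbest_f:
                    gbest, gbest_f = xs[i][:], fi
    return gbest, gbest_f

random.seed(0)
best, best_f = pso_minimize(lambda x: sum(t * t for t in x), dim=3)
```

On a simple convex function the swarm quickly collapses onto the optimum; without inertia damping it can instead oscillate or stagnate at local extrema, which is the premature-convergence issue noted above.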
Summary of the invention
The object of the present invention is to provide a text feature selection method based on particle swarm optimization. The method uses the strong local search ability of particles to select a feature set that discriminates well between classes and carries a large amount of textual information, effectively reducing the dimensionality of the text vectors.
A text feature selection method based on particle swarm optimization comprises the following specific steps:
1) Segment the text set with a word segmentation tool and form the segmented words into a word set, which serves as the original features of the text set. Denote the feature set by T and the number of features in T by n, i.e., T = {t1, t2, ..., tn};
2) First, compute with formula (1) the average relationship distance Ri between feature ti and the other features,
where p(ti, tj) is the co-occurrence probability of ti and tj, i.e., the number of times ti and tj appear in the same sentence divided by the number of all words in the text set; p(ti) is the occurrence probability of feature ti, i.e., the number of occurrences of ti divided by the number of all words in the text set; p(tj) is the occurrence probability of feature tj, defined analogously. A higher Ri indicates that the feature is more strongly related to the other features; a lower Ri indicates that the feature is more distinct from them. After the R values of all features have been computed, sort all features by R value in ascending order, put the first half of the sorted features into the dissimilar group D, and put the second half into the similar group S;
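A minimal sketch of step 2). Since formula (1) itself is not reproduced in this text, the score below assumes a pointwise-mutual-information-style average, p(ti, tj) / (p(ti)·p(tj)) averaged over the other features; the tokenized `sentences` input and the ranking-and-halving follow the description above.

```python
def split_similar_dissimilar(sentences):
    """Rank features by an (assumed) average relationship score R and split:
    lower half -> dissimilar group D, upper half -> similar group S."""
    words = [w for s in sentences for w in s]
    total = len(words)                              # all words in the text set
    vocab = sorted(set(words))
    p = {t: words.count(t) / total for t in vocab}  # p(t): occurrences / all words
    def p_co(ti, tj):
        # p(ti, tj): sentences containing both words, over all words (per the text)
        return sum(1 for s in sentences if ti in s and tj in s) / total
    R = {}
    for ti in vocab:
        others = [tj for tj in vocab if tj != ti]
        R[ti] = sum(p_co(ti, tj) / (p[ti] * p[tj]) for tj in others) / len(others)
    ranked = sorted(vocab, key=lambda t: R[t])      # ascending by R value
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]             # D (dissimilar), S (similar)

D, S = split_similar_dissimilar([["power", "grid"], ["power", "text"],
                                 ["grid", "text", "feature"]])
```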
3) Set the total iteration count (iterations) and record the current iteration count with k. Randomly generate several binary particles x (i.e., particle positions) and initialize each particle's velocity vi. Let xi and vi denote the position and velocity of the i-th particle; both are m-dimensional vectors, and the value of each dimension is a random number in (0, 1);
4) Update each particle's velocity according to formula (2), and clamp each dimension of the updated velocity to (a, b), where a and b are user-defined parameters (here set to a = -4, b = 4). Concretely, with vid denoting the d-th dimension of vi: if vid > b, set vid = b; if vid < a, set vid = a; otherwise leave vid unchanged;

vid = vid + c1·r1·(pbest_i,d - xid) + c2·r2·(gbest_d - xid) (2)
where pbest_i denotes the best position that particle i itself has experienced, gbest denotes the best position that the swarm has experienced, c1 and c2 are learning factors, usually c1 = c2 = 2, and r1 and r2 are random numbers in [0, 1];
Update each particle's position by formula (3), changing each dimension value of the particle. Concretely, first compute s(vid) with formula (4), where s(vid) = 1/(1 + e^(-vid)) is the sigmoid function and e is the natural constant; then compare s(vid) with a random number rand: if s(vid) > rand, set xid = 1, otherwise set xid = 0, where xid denotes the d-th dimension value of particle xi and rand is a randomly initialized value;
xi=xi+vi (3)
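Step 4) and the sigmoid position rule can be sketched as follows (the velocity update is the standard PSO form reconstructed from the parameters described above, since formula (2)'s image is not reproduced in this text; the inertia coefficient is left implicit at 1):

```python
import math
import random

def update_particle(x, v, pbest, gbest, c1=2.0, c2=2.0, a=-4.0, b=4.0):
    """One binary-PSO step: velocity update clamped to (a, b), then each
    position bit set by comparing the sigmoid s(v_id) with a random number."""
    for d in range(len(x)):
        r1, r2 = random.random(), random.random()
        v[d] = v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
        v[d] = max(a, min(b, v[d]))               # clamp v_id into (a, b)
        s = 1.0 / (1.0 + math.exp(-v[d]))         # s(v_id), formula (4)
        x[d] = 1 if s > random.random() else 0    # sigmoid position rule
    return x, v
```

Clamping to (-4, 4) keeps s(vid) roughly within (0.018, 0.982), so every bit always retains some probability of flipping.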
5) After step 4), each dimension value of particle xi's position is 0 or 1. Since each dimension of xi corresponds to a feature of the feature set T, the positions where the dimension value of xi is 1 yield the feature subset T′. Using the similar feature set S and the dissimilar feature set D, split T′ into a dissimilar feature subset D′ and a similar feature subset S′. The numbers of similar and dissimilar features in a particle are then controlled: define a parameter α and let nD′ = αn and nS′ = (1 - α)n, where nD′ is the lower limit on the number of features in the dissimilar subset D′ and nS′ is the upper limit on the number of features in the similar subset S′. When the number of features in D′ is less than nD′, randomly select features from D into D′ until the count in D′ reaches nD′, and update xi by setting the corresponding dimension values to 1. Similarly, when the number of features in S′ is greater than nS′, randomly remove features from S′ until the count in S′ reaches nS′, and update xi by setting the corresponding dimension values to 0;
Through the above operations, the updated xi and the updated T′ are obtained;
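Step 5) can be sketched as follows (a sketch under the description above; the data structures and random choices are illustrative):

```python
import random

def repair_particle(x, features, D, S, alpha):
    """Enforce step 5)'s limits: at least nD' = alpha*n dissimilar features
    and at most nS' = (1 - alpha)*n similar features in the subset T'."""
    n = len(features)
    nD, nS = int(alpha * n), int((1 - alpha) * n)
    idx = {t: d for d, t in enumerate(features)}
    selected = {features[d] for d in range(n) if x[d] == 1}
    D_sel = [t for t in D if t in selected]     # dissimilar subset D'
    S_sel = [t for t in S if t in selected]     # similar subset S'
    while len(D_sel) < nD:                      # too few dissimilar features
        t = random.choice([u for u in D if u not in D_sel])
        D_sel.append(t)
        x[idx[t]] = 1                           # set the matching bit to 1
    while len(S_sel) > nS:                      # too many similar features
        t = S_sel.pop(random.randrange(len(S_sel)))
        x[idx[t]] = 0                           # clear the matching bit
    return x, D_sel + S_sel                     # updated x_i and subset T'
```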
6) Use the feature subset T′ represented by xi to represent the texts with the vector space model, train a KNN classifier, and compute the classification accuracy fi. The fitness function of feature selection is defined as the text classification accuracy: fi = nacc / N, where N is the total number of samples in the test text set and nacc is the number of test texts classified correctly;
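Step 6) can be sketched as follows, with a tiny term-frequency vector space model and a plain KNN classifier standing in for the ones used in the patent (the patent only fixes the fitness itself, f = nacc / N):

```python
from collections import Counter

def fitness(subset, train_docs, train_labels, test_docs, test_labels, k=3):
    """Fitness of a feature subset: KNN accuracy nacc / N on the test set,
    with each document a term-frequency vector over the selected features."""
    feats = sorted(subset)
    def vec(doc):
        counts = Counter(doc)
        return [counts[t] for t in feats]
    def dist2(u, w):
        return sum((ui - wi) ** 2 for ui, wi in zip(u, w))
    train = [(vec(d), y) for d, y in zip(train_docs, train_labels)]
    nacc = 0
    for doc, y in zip(test_docs, test_labels):
        q = vec(doc)
        nearest = sorted(train, key=lambda pair: dist2(pair[0], q))[:k]
        pred = Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
        nacc += (pred == y)
    return nacc / len(test_labels)              # f = nacc / N
```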
7) Using step 6), compute from pbest_i and gbest the classification accuracies fi^best and fg^best of the feature subsets they represent;
8) Update pbest_i and gbest: if fi > fi^best, then pbest_i = xi; if fi > fg^best, then gbest = xi;
9) Check whether the current iteration count k is less than iterations; if so, jump to step 4), otherwise terminate and output gbest;
10) Obtain the optimal feature subset from gbest.
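Putting steps 3) through 9) together (a schematic driver with a toy fitness standing in for the KNN accuracy of step 6); the step-5) repair is omitted here for brevity:

```python
import math
import random

def pso_feature_select(n_features, evaluate, n_particles=10, iterations=30):
    """Schematic of steps 3)-9): binary particles, sigmoid position update,
    and pbest/gbest bookkeeping driven by the fitness `evaluate(bits)`."""
    random.seed(1)
    xs = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vs = [[random.uniform(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_f = [evaluate(x) for x in xs]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iterations):                  # step 9): loop until k = iterations
        for i in range(n_particles):
            for d in range(n_features):          # step 4): velocity, clamped to (-4, 4)
                r1, r2 = random.random(), random.random()
                vs[i][d] += 2 * r1 * (pbest[i][d] - xs[i][d]) + 2 * r2 * (gbest[d] - xs[i][d])
                vs[i][d] = max(-4.0, min(4.0, vs[i][d]))
                s = 1.0 / (1.0 + math.exp(-vs[i][d]))
                xs[i][d] = 1 if s > random.random() else 0
            f = evaluate(xs[i])                  # step 6): fitness = accuracy
            if f > pbest_f[i]:                   # steps 7)-8): update pbest / gbest
                pbest[i], pbest_f[i] = xs[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = xs[i][:], f
    return gbest                                 # step 10): features where the bit is 1

# Toy fitness: reward selecting the first four features, penalize the rest.
best = pso_feature_select(8, lambda bits: sum(bits[:4]) - 0.1 * sum(bits[4:]))
```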
Beneficial effects of the present invention: the present invention can select, from a huge text word set, the feature subset most beneficial for text representation, laying a foundation for subsequent text classification and other text processing.
Detailed description of the invention
Fig. 1 is an example of text word segmentation according to the present invention;
Fig. 2 is a flow chart of computing the similar and dissimilar feature sets;
Fig. 3 illustrates the limits on similar and dissimilar features in a particle according to the present invention.
Specific embodiment
The present invention is a text feature selection method based on particle swarm optimization. The feature selection in this method can effectively select information-rich features and thereby achieve a better text representation.
A text feature selection method based on particle swarm optimization comprises the following specific steps:
1) Segment the text, as in Fig. 1, with a word segmentation tool and form the segmented words into a word set, which serves as the original features of the text. Denote the feature set by T and the number of features in T by n, i.e., T = {t1, t2, ..., tn};
2) First, compute with formula (1) the average relationship distance Ri between feature ti and the other features,
where p(ti, tj) is the co-occurrence probability of ti and tj, i.e., the number of times ti and tj appear in the same sentence divided by the number of all words in the text set; p(ti) is the occurrence probability of feature ti, i.e., the number of occurrences of ti divided by the number of all words in the text set; p(tj) is the occurrence probability of feature tj, defined analogously. A higher Ri indicates that the feature is more strongly related to the other features; a lower Ri indicates that the feature is more distinct from them. After the R values of all features have been computed, sort all features by R value in ascending order, put the first half of the sorted features into the dissimilar group D, and put the second half into the similar group S. This process is shown in Fig. 2.
3) Set the total iteration count (iterations) and record the current iteration count with k. Randomly generate several binary particles x (i.e., particle positions) and initialize each particle's velocity vi. Let xi and vi denote the position and velocity of the i-th particle; both are m-dimensional vectors, and the value of each dimension is a random number in (0, 1).
4) Update each particle's velocity according to formula (2), and clamp each dimension of the updated velocity to (a, b), where a and b are user-defined parameters (here set to a = -4, b = 4). Concretely, with vid denoting the d-th dimension of vi: if vid > b, set vid = b; if vid < a, set vid = a; otherwise leave vid unchanged.
where pbest_i denotes the best position that particle i itself has experienced, gbest denotes the best position that the swarm has experienced, c1 and c2 are learning factors, usually c1 = c2 = 2, and r1 and r2 are random numbers in [0, 1].
Update each particle's position by formula (3), changing each dimension value of the particle. Concretely, first compute s(vid) with formula (4), where s(vid) = 1/(1 + e^(-vid)) is the sigmoid function and e is the natural constant; then compare s(vid) with a random number rand: if s(vid) > rand, set xid = 1, otherwise set xid = 0, where xid denotes the d-th dimension value of particle xi and rand is a randomly initialized value.
xi=xi+vi (3)
5) After step 4), each dimension value of particle xi's position is 0 or 1. Since each dimension of xi corresponds to a feature of the feature set T, the positions where the dimension value of xi is 1 yield the feature subset T′. Using the similar feature set S and the dissimilar feature set D, split T′ into a dissimilar feature subset D′ and a similar feature subset S′. The numbers of similar and dissimilar features in the particle are then controlled: define a parameter α and let nD′ = αn and nS′ = (1 - α)n, where nD′ is the lower limit on the number of features in the dissimilar subset D′ and nS′ is the upper limit on the number of features in the similar subset S′. When the number of features in D′ is less than nD′, randomly select features from D into D′ until the count in D′ reaches nD′, and update xi by setting the corresponding dimension values to 1. Similarly, when the number of features in S′ is greater than nS′, randomly remove features from S′ until the count in S′ reaches nS′, and update xi by setting the corresponding dimension values to 0. Through these operations, the updated xi and the updated T′ are obtained. A concrete example is shown in Fig. 3.
6) Use the feature subset T′ represented by xi to represent the texts with the vector space model, train a KNN classifier, and compute the classification accuracy fi. The fitness function of feature selection is defined as the text classification accuracy: fi = nacc / N, where N is the total number of samples in the test text set and nacc is the number of test texts classified correctly;
7) Using the method of step 6), compute from pbest_i and gbest the classification accuracies fi^best and fg^best of the feature subsets they represent;
8) Update pbest_i and gbest: if fi > fi^best, then pbest_i = xi; if fi > fg^best, then gbest = xi;
9) Check whether the current iteration count k is less than iterations; if so, jump to step 4), otherwise terminate and output gbest;
10) Obtain the optimal feature subset from gbest. For example, if the positions of the feature set X whose value in gbest is 1 are positions 2, 3, 7, and 8, then the set composed of features 2, 3, 7, and 8 is the optimal feature set.

Claims (1)

1. A text feature selection method based on particle swarm optimization, characterized in that the method comprises the following specific steps:
1) Segment the text set with a word segmentation tool and form the segmented words into a word set, which serves as the original features of the text set. Denote the feature set by T and the number of features in T by n, i.e., T = {t1, t2, ..., tn};
2) First, compute with formula (1) the average relationship distance Ri between feature ti and the other features,
where p(ti, tj) is the co-occurrence probability of ti and tj, i.e., the number of times ti and tj appear in the same sentence divided by the number of all words in the text set; p(ti) is the occurrence probability of feature ti, i.e., the number of occurrences of ti divided by the number of all words in the text set; p(tj) is the occurrence probability of feature tj, defined analogously. A higher Ri indicates that the feature is more strongly related to the other features; a lower Ri indicates that the feature is more distinct from them. After the R values of all features have been computed, sort all features by R value in ascending order, put the first half of the sorted features into the dissimilar group D, and put the second half into the similar group S;
3) Set the total iteration count (iterations) and record the current iteration count with k. Randomly generate several binary particles x, i.e., particle positions, and initialize each particle's velocity vi. Let xi and vi denote the position and velocity of the i-th particle; both are m-dimensional vectors, and the value of each dimension is a random number in (0, 1);
4) Update each particle's velocity according to formula (2), and clamp each dimension of the updated velocity to (a, b), where a and b are user-defined parameters. Concretely, with vid denoting the d-th dimension of vi: if vid > b, set vid = b; if vid < a, set vid = a; otherwise leave vid unchanged;
where pbest_i denotes the best position that particle i itself has experienced, gbest denotes the best position that the swarm has experienced, c1 and c2 are learning factors, and r1 and r2 are random numbers in [0, 1];
Update each particle's position by formula (3), changing each dimension value of the particle. Concretely, first compute s(vid) with formula (4), where s(vid) = 1/(1 + e^(-vid)) and e is the natural constant; then compare s(vid) with a random number rand: if s(vid) > rand, set xid = 1, otherwise set xid = 0, where xid denotes the d-th dimension value of particle xi and rand is a randomly initialized value;
xi=xi+vi (3)
5) After step 4), each dimension value of particle xi's position is 0 or 1. Since each dimension of xi corresponds to a feature of the feature set T, the positions where the dimension value of xi is 1 yield the feature subset T′. Using the similar feature set S and the dissimilar feature set D, split T′ into a dissimilar feature subset D′ and a similar feature subset S′. The numbers of similar and dissimilar features in the particle are then controlled: define a parameter α and let nD′ = αn and nS′ = (1 - α)n, where nD′ is the lower limit on the number of features in the dissimilar subset D′ and nS′ is the upper limit on the number of features in the similar subset S′. When the number of features in D′ is less than nD′, randomly select features from D into D′ until the count in D′ reaches nD′, and update xi by setting the corresponding dimension values to 1. Similarly, when the number of features in S′ is greater than nS′, randomly remove features from S′ until the count in S′ reaches nS′, and update xi by setting the corresponding dimension values to 0;
Through the above operations, the updated xi and the updated T′ are obtained;
6) Use the feature subset T′ represented by xi to represent the texts with the vector space model, train a KNN classifier, and compute the classification accuracy fi. The fitness function of feature selection is defined as the text classification accuracy: fi = nacc / N, where N is the total number of samples in the test text set and nacc is the number of test texts classified correctly;
7) Using step 6), compute from pbest_i and gbest the classification accuracies fi^best and fg^best of the feature subsets they represent;
8) Update pbest_i and gbest: if fi > fi^best, then pbest_i = xi; if fi > fg^best, then gbest = xi;
9) Check whether the current iteration count k is less than iterations; if so, jump to step 4), otherwise terminate and output gbest;
10) Obtain the optimal feature subset from gbest, i.e., the set composed of the features at the positions where gbest is 1.
CN201810315024.4A 2018-04-10 2018-04-10 Text feature selection method based on particle swarm optimization Expired - Fee Related CN108664562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810315024.4A CN108664562B (en) 2018-04-10 2018-04-10 Text feature selection method based on particle swarm optimization


Publications (2)

Publication Number Publication Date
CN108664562A CN108664562A (en) 2018-10-16
CN108664562B true CN108664562B (en) 2019-10-01

Family

ID=63783195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810315024.4A Expired - Fee Related CN108664562B (en) 2018-04-10 2018-04-10 Text feature selection method based on particle swarm optimization

Country Status (1)

Country Link
CN (1) CN108664562B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110336637A * 2019-07-15 2019-10-15 北京航空航天大学 A UAV interference signal feature selection method
CN112365117A (en) * 2020-09-03 2021-02-12 中交西安筑路机械有限公司 Pavement structure performance calculation method based on optimized support vector machine
CN112613595A (en) * 2020-12-25 2021-04-06 煤炭科学研究总院 Ultra-wideband radar echo signal preprocessing method based on variational modal decomposition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5070591B2 (en) * 2007-05-25 2012-11-14 株式会社国際電気通信基礎技術研究所 Noise suppression device, computer program, and speech recognition system
CN105095494B * 2015-08-21 2019-03-26 中国地质大学(武汉) A method for testing classification data sets
CN107506821A * 2017-10-13 2017-12-22 集美大学 An improved particle swarm optimization method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191001