CN109325511A - Method for improving feature selection - Google Patents

Method for improving feature selection

Info

Publication number
CN109325511A
Authority
CN
China
Prior art keywords
feature
correlation
document
value
rdc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810859899.0A
Other languages
Chinese (zh)
Other versions
CN109325511B (en)
Inventor
Wang Haitao (汪海涛)
Tang Kang (唐康)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201810859899.0A
Publication of CN109325511A
Application granted
Publication of CN109325511B
Status: Active (granted)

Classifications

    • G06F18/211 Pattern recognition - Selection of the most significant subset of features
    • G06F18/213 Pattern recognition - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F40/289 Handling natural language data - Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a method for improving feature selection, belonging to the technical field of feature selection in high-dimensional feature spaces. The method first uses the RDC (relative discrimination criterion) measure to compute the relevance of each feature, then uses the Pearson correlation coefficient to compute the correlation between features, and finally selects the best features one by one by computing an M value defined by the invention. The method not only selects the most relevant features in the feature space but also uses the correlation measure to account for the redundancy among them; it can filter redundant and irrelevant features out of the feature space, select an optimal feature subset, and reduce the dimensionality of the feature space, thereby improving the performance of text classification.

Description

Method for improving feature selection
Technical field
The present invention relates to a method for improving feature selection, and belongs to the technical field of feature selection in high-dimensional feature spaces.
Background art
The internet is the creator of the big data era, and its rapid development has made data volumes grow explosively. Such enormous volumes of data bring both a rare opportunity and a great challenge. Much valuable information is drowned in a flood of useless content, making it difficult for people to obtain the information they need; how to mine the information people need from massive data has therefore become a key research direction. Text classification has become an important research topic and is widely studied and applied in machine learning, information retrieval, and spam filtering. Applying text classification in these fields has many advantages. For the classification management of digital libraries, it greatly shortens the time needed to classify and organize documents compared with manual methods. In information retrieval, text classification divides text into relevant and irrelevant categories and filters out useless search results, significantly improving retrieval accuracy and speed. Current text classification techniques and theory are relatively mature and have achieved good results. With the development of the mobile internet, however, text data shows many new characteristics. Social networks based on Weibo, WeChat, communities, and forums are popular, and short-text data is steadily increasing. In addition, new changes such as the growing number of text categories, imbalanced category distributions, and difficult category labeling pose enormous challenges to text classification. Considerable room for improvement remains, and further study is needed to improve its effectiveness. During text classification, a document is usually modeled as a vector space in which each word is treated as a feature. In the vector model of a document, the value of a feature can be its term frequency or its term frequency-inverse document frequency (tf-idf).
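As a concrete illustration of this vector model, the following is a minimal sketch using scikit-learn (the library choice and the toy corpus are ours, not the patent's), building both the raw term-frequency matrix and its tf-idf counterpart:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["cat fish", "cat mouse fish", "dog mouse"]  # toy corpus

    cv = CountVectorizer()
    tf = cv.fit_transform(docs)                    # raw term frequencies
    tfidf = TfidfVectorizer().fit_transform(docs)  # tf-idf weighting

    print(cv.get_feature_names_out())  # the feature words (one per column)
    print(tf.toarray())                # one row per document
    print(tfidf.toarray().round(2))

Each row of either matrix is the vector model of one document, with one column per feature word.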
One of the most important problems in text classification is handling the high dimensionality of the feature space. Especially in text categorization tasks involving large vocabularies, high dimensionality leads to increased computational cost and reduced classification performance. Feature selection and feature extraction are the two main methods for reducing the dimensionality of the text feature space. Feature selection has received much attention in recent years; it aims to use some strategy to select an optimal subset of the original feature set from the data, so as to aid the learning of subsequent target tasks. The goals of feature selection cover three aspects: (1) improving the predictive performance of the target model; (2) reducing the training and prediction time of the target model and improving efficiency; and (3) revealing the implications in the data and the process by which the data were generated. In short, feature selection makes the data more concise and effective while helping us understand the data better. As a primary step of data processing, feature selection can reduce the scale of big data and lower the difficulty of learning the target model; for high-dimensional data it reduces the dimensionality to overcome the "curse of dimensionality" and prevent model overfitting. Especially when learning from high-dimensional data, the difficulty and cost of analysis and learning grow exponentially with the data dimension: more complex models must be learned to improve representational power, which in turn requires exponentially more data to support their training. If the data volume is too small, the model overfits and generalizes poorly. Performing feature selection on the data is therefore very necessary, but finding an optimal feature set as a representation of the data within the enormous subset space of the original feature set is very difficult. Feature extraction refers to the process of generating a small group of new features by merging or transforming the original form, whereas feature selection reduces the spatial dimension by selecting the most significant features. Feature selection methods can be divided into four classes: filter, wrapper, embedded, and hybrid methods. Filter methods perform statistical analysis on the feature space to select a discriminative subset of features. A feature selection method should identify and remove as many irrelevant and redundant features as possible. Most feature selection methods can effectively remove irrelevant features but cannot handle redundant ones.
Summary of the invention
The technical problem to be solved by the present invention is to provide an improved feature selection algorithm that overcomes the above deficiencies of the prior art. The algorithm can filter redundant and irrelevant features out of the feature space and select an optimal feature subset, thereby achieving dimensionality reduction and further improving the effectiveness of text classification.
The technical solution adopted by the present invention is as follows: a method for improving feature selection comprises the following steps:
Step1: input the number of features k that the final feature space will contain; create a new empty set S; let F be the set of all features of document set D;
Step2: traverse each feature f_s in F and compute its relevance value RDC(f_s) using the following equation group:

RDC(w_i) = AUC(w_i, tc_m),

where w_i is the feature word; df_pos(w_i) and df_neg(w_i) are the numbers of positive-class and negative-class documents containing the word w_i; tc_j(w_i) is the count of word w_i in document j; AUC(w_i, tc_j) is the area under the ROC curve of feature word w_i at term count tc_j; tc_{j-1} and tc_{j+1} are the counts of the feature word in documents j-1 and j+1; and tc_m is its count in the last document m;
Step3: sort the features by the RDC values computed in Step2;
Step4: select the feature f_max with the largest RDC value;
Step5: add f_max to set S;
Step6: remove f_max from set F;
Step7: traverse set F and initialize sum(f_i) = 0 for each feature;
Step8: traverse set F; for each feature f_i, compute its correlation Correlation(f_i, f_s) with each feature f_s in S and accumulate sum(f_i) = sum(f_i) + Correlation(f_i, f_s);
Step9: for each feature f_i in set F, compute its value M(f_i) using the following formula:

M(f_i) = RDC(f_i) - sum(f_i),
where RDC(f_i) is the relevance of feature f_i, and Correlation(f_i, f_j) denotes the correlation between two features f_i and f_j defined by their similarity, computed with the Pearson correlation coefficient:

Correlation(f_i, f_j) = Σ_d (f_{i,d} - mean(f_i)) (f_{j,d} - mean(f_j)) / sqrt( Σ_d (f_{i,d} - mean(f_i))² · Σ_d (f_{j,d} - mean(f_j))² ),

where f_{i,d} and f_{j,d} are the term frequencies of feature words i and j in the d-th document, and mean(f_i) and mean(f_j) are the average term frequencies of f_i and f_j over the document set; Correlation(f_i, f_j) = 1 indicates maximum positive correlation, Correlation(f_i, f_j) = -1 indicates maximum negative correlation, and its value lies between -1 and 1;
Step10: select the feature f_max with the largest M value;
Step11: add f_max to set S;
Step12: remove f_max from set F;
Step13: repeat Step8-Step12 until the number of features in set S equals k;
Step14: set S is the final selected feature set.
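For concreteness, the following is a minimal Python sketch of Step1-Step14; it is our illustration, not part of the patent text. It assumes the RDC relevance scores are supplied as a precomputed array (the patent's RDC equation group is reproduced only in part above), and it accumulates each feature's correlation only against the most recently selected feature, which charges every pair in S exactly once and is equivalent to re-traversing all of S on every round:

    import numpy as np

    def pearson(x, y):
        # Pearson correlation of two term-frequency columns (Step9's formula).
        xc, yc = x - x.mean(), y - y.mean()
        denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
        return 0.0 if denom == 0 else float((xc * yc).sum() / denom)

    def select_features(tf, rdc, k):
        # tf:  (n_docs, n_features) term-frequency matrix
        # rdc: precomputed RDC relevance score of each feature (Step2)
        # k:   number of features to keep (Step1)
        # Returns the indices of the k selected features, in selection order.
        F = set(range(len(rdc)))
        best = max(sorted(F), key=lambda i: rdc[i])   # Step3-Step4: largest RDC
        S = [best]                                    # Step5
        F.remove(best)                                # Step6
        sums = {i: 0.0 for i in F}                    # Step7
        while F and len(S) < k:
            for i in F:                               # Step8: accumulate redundancy
                sums[i] += pearson(tf[:, i], tf[:, S[-1]])
            # Step9-Step10: M(f_i) = RDC(f_i) - sum(f_i); take the largest,
            # breaking ties by the lower feature index (unspecified in the patent).
            best = max(sorted(F), key=lambda i: rdc[i] - sums[i])
            S.append(best)                            # Step11
            F.remove(best)                            # Step12
        return S                                      # Step13-Step14

A call such as select_features(tf, rdc, k=4) on a term-frequency matrix like the one in Table 2 below returns four column indices in the order they were selected.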
The beneficial effects of the present invention are:
1. The precision of the invention is higher than that of the conventional RDC method;
2. The invention removes redundant and irrelevant features from the feature space, achieving further feature-space dimensionality reduction.
Detailed description of the invention
Fig. 1 is the flow chart of the method of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1: as shown in Fig. 1, a method for improving feature selection comprises the following steps.
The present invention first uses the RDC (relative discrimination criterion) measure to compute the relevance of each feature, then uses the Pearson correlation coefficient to compute the correlation between features, and finally selects the best features one by one by computing the M value defined by the invention.
The details are as follows:
Step1: input the number of features k that the final feature space will contain (the value of k is set according to the actual situation and is not specifically limited here); create a new empty set S; let F be the set of all features of document set D;
Step2: traverse each feature f_s in F and compute its relevance value RDC(f_s) using the following equation group:

RDC(w_i) = AUC(w_i, tc_m),

where w_i is the feature word; df_pos(w_i) and df_neg(w_i) are the numbers of positive-class and negative-class documents containing the word w_i; tc_j(w_i) is the count of word w_i in document j; AUC(w_i, tc_j) is the area under the ROC curve of feature word w_i at term count tc_j; tc_{j-1} and tc_{j+1} are the counts of the feature word in documents j-1 and j+1; and tc_m is its count in the last document m;
Step3: sort the features by the RDC values computed in Step2;
Step4: select the feature f_max with the largest RDC value;
Step5: add f_max to set S;
Step6: remove f_max from set F;
Step7: traverse set F and initialize sum(f_i) = 0 for each feature;
Step8: traverse set F; for each feature f_i, compute its correlation Correlation(f_i, f_s) with each feature f_s in S and accumulate sum(f_i) = sum(f_i) + Correlation(f_i, f_s);
Step9: for each feature f_i in set F, compute its value M(f_i) using the following formula:

M(f_i) = RDC(f_i) - sum(f_i),
where RDC(f_i) is the relevance of feature f_i, and Correlation(f_i, f_j) denotes the correlation between two features f_i and f_j defined by their similarity, computed with the Pearson correlation coefficient:

Correlation(f_i, f_j) = Σ_d (f_{i,d} - mean(f_i)) (f_{j,d} - mean(f_j)) / sqrt( Σ_d (f_{i,d} - mean(f_i))² · Σ_d (f_{j,d} - mean(f_j))² ),

where f_{i,d} and f_{j,d} are the term frequencies of feature words i and j in the d-th document, and mean(f_i) and mean(f_j) are the average term frequencies of f_i and f_j over the document set; Correlation(f_i, f_j) = 1 indicates maximum positive correlation, Correlation(f_i, f_j) = -1 indicates maximum negative correlation, and its value lies between -1 and 1;
Step10: select the feature f_max with the largest M value;
Step11: add f_max to set S;
Step12: remove f_max from set F;
Step13: repeat Step8-Step12 until the number of features in set S equals k;
Step14: set S is the final selected feature set.
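Before the worked example, a quick sanity check (ours, not the patent's) that the Pearson formula of Step9 agrees with numpy's built-in corrcoef on a pair of small term-frequency vectors:

    import numpy as np

    fi = np.array([1, 1, 0, 1, 2, 0], dtype=float)  # arbitrary term counts
    fj = np.array([1, 1, 1, 2, 2, 1], dtype=float)

    num = ((fi - fi.mean()) * (fj - fj.mean())).sum()
    den = np.sqrt(((fi - fi.mean()) ** 2).sum() * ((fj - fj.mean()) ** 2).sum())
    manual = num / den

    assert np.isclose(manual, np.corrcoef(fi, fj)[0, 1])
    print(round(manual, 3))  # a value between -1 and 1, as Step9 requires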
The present invention is described in detail below with a specific example:
Table 1. A simple data set (with only two classes)
Document    Class     Content
D1          Positive  cat, fish
D2          Positive  cat, mouse, fish
D3          Positive  mouse, fish
D4          Positive  mouse, cat, fish, mouse, fish
D5          Positive  fish, cat, fish, cat
D6          Positive  fish, mouse
D7          Negative  dog, mouse
D8          Negative  dog, dog
D9          Negative  fish, fish, mouse
D10         Negative  mouse
D11         Negative  cat, fish
D12         Negative  dog, fish
Table 1 gives a simple synthetic data set. It consists of 12 documents and contains 4 words: 'cat', 'dog', 'mouse', and 'fish'. Each document in this data set belongs to either the positive or the negative class.
Table 2. Term frequencies of the data set
Document    f1 (cat)  f2 (fish)  f3 (mouse)  f4 (dog)  f5 (fish)
D1          1         1          0           0         1
D2          1         1          1           0         1
D3          0         1          1           0         1
D4          1         2          2           0         2
D5          2         2          0           0         2
D6          0         1          1           0         1
D7          0         0          1           1         0
D8          0         0          0           2         0
D9          0         2          1           0         2
D10         0         0          1           0         0
D11         1         1          0           0         0
D12         0         1          0           1         1
Table 2 shows the matrix form (i.e., the vector model) of the data set. First, the term frequency of each word in each document is computed. To highlight the effectiveness of the invention, one of the features, f2, is duplicated and added to the data set as a new feature f5, so that features f2 and f5 are perfectly correlated. The purpose of feature selection is to choose highly relevant features that have minimal correlation with one another. f2 and f5 contain the same information, so one of them is superfluous: one of the two gets a higher M value, while the redundant one gets a relatively low M value. The calculations below show that the RDC values of f2 and f5 are the same, while their M values differ.
RDC(f1) = (2+5)/2 + (5+0)/2 = 6
RDC(f2) = RDC(f5) = (1+0.5)/2 + (0.5+0)/2 = 1
RDC(f3) = (0+5)/2 + (5+0)/2 = 5
RDC(f4) = (20+5)/2 + (5+0)/2 = 15
Repeating these calculations with the proposed M formula gives the results below, where the M values of f2 and f5 are no longer identical. According to the formula M(f_i) = RDC(f_i) - sum(f_i), the final M value of a feature is determined by its relevance (the first term on the right of the equation) and its redundancy (the second term). Since the two features f2 and f5 are almost identical, the correlation between them is close to 1; consequently, if f_j is selected before f_i, then f_i is charged the redundancy between them, and its M value ends up lower than f_j's.
Table 3. RDC values and the M values proposed by the invention
Method    f1 (cat)  f2 (fish)  f3 (mouse)  f4 (dog)  f5 (fish)
RDC       6         1          5           15        1
M         5.902     -0.193     4.63        15        -0.84
Table 3 compares the RDC and M values of these features. It can be seen that f2 and f5 have the same RDC value, while the two features have different M values. In this example, f2 and f5 are identical features, but the M value of f5 is lower than that of f2 even though their RDC values are the same. Using the M value, whether f2 and f5 are selected or discarded depends on the threshold k (the predefined size of the final feature subset): if k = 3, neither f2 nor f5 is selected; if k = 4, f2 is selected and f5 is rejected; and if k = 5, both f2 and f5 are selected.
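To see this threshold behaviour end to end, the greedy loop can be rerun with the RDC values of Table 3. The correlation matrix below is an illustrative assumption, since only Correlation(f2, f5) = 1 is forced by f5 duplicating f2; the intermediate M values therefore do not reproduce Table 3 exactly, but the selections for k = 3, 4 and 5 match the discussion above:

    import numpy as np

    # RDC values from Table 3 for f1..f5; all cross-correlations are assumed
    # to be 0 for illustration, except Correlation(f2, f5) = 1.0.
    rdc = np.array([6.0, 1.0, 5.0, 15.0, 1.0])
    corr = np.eye(5)
    corr[1, 4] = corr[4, 1] = 1.0

    def select(rdc, corr, k):
        F = set(range(len(rdc)))
        best = max(sorted(F), key=lambda i: rdc[i])
        S = [best]; F.remove(best)
        sums = {i: 0.0 for i in F}
        while F and len(S) < k:
            for i in F:                   # Step8: accumulate redundancy
                sums[i] += corr[i, S[-1]]
            best = max(sorted(F), key=lambda i: rdc[i] - sums[i])
            S.append(best); F.remove(best)
        return S

    for k in (3, 4, 5):
        print(k, [f"f{i + 1}" for i in select(rdc, corr, k)])
    # k=3 -> ['f4', 'f1', 'f3']: neither fish feature is kept.
    # k=4 additionally keeps f2 (the tie with f5 falls to the lower index).
    # k=5 finally admits the redundant f5 as the last remaining feature.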
The overall flow of this example is as follows.
For document D4, with k = 4:
Step 1: initialize the empty set S; set F contains all the features, i.e., {cat, fish, mouse, dog, fish};
Step 2: for each feature in set F, compute its RDC relevance value using the formula: f(cat) = 6, f(fish) = 1, f(mouse) = 5, f(dog) = 15, f(fish) = 1;
Step 3: sort by RDC value: {f(dog) = 15, f(cat) = 6, f(mouse) = 5, f(fish) = 1, f(fish) = 1};
Step 4: select the feature with the largest RDC value, f(dog);
Step 5: add f(dog) to set S;
Step 6: remove f(dog) from set F;
Step 7: traverse set F and set sum(f_i) = 0 for each feature;
Step 8: for each feature remaining in set F, compute its correlation with the feature f(dog) already in S and accumulate it, i.e.:
sum(f(cat)) = sum(f(cat)) + Correlation(f(cat), f(dog)),
sum(f(mouse)) = sum(f(mouse)) + Correlation(f(mouse), f(dog)),
sum(f(fish)) = sum(f(fish)) + Correlation(f(fish), f(dog)), and likewise for the duplicated fish feature f5;
Step 9: compute the M value of each feature in set F: M(cat) = 5.902, M(fish) = -0.193, M(mouse) = 4.63, and M = -0.84 for the duplicated fish feature f5;
Step 10: select the feature with the largest M value, i.e., M(cat) = 5.902, and put f(cat) into set S;
Step 11: remove f(cat) from set F;
Step 12: repeat Steps 8-11 until the number of features in set S equals 4.
The final selected feature space is {dog, mouse, cat, fish}.
The present invention not only selects the most relevant features in the feature space but also uses the correlation measure to account for the redundancy among them; it can filter redundant and irrelevant features out of the feature space, select an optimal feature subset, and reduce the dimensionality of the feature space, thereby improving the performance of text classification.
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments; various changes may also be made within the knowledge of a person skilled in the art without departing from the concept of the invention.

Claims (1)

1. A method for improving feature selection, characterized by comprising the following steps:
Step1: input the number of features k that the final feature space will contain; create a new empty set S; let F be the set of all features of document set D;
Step2: traverse each feature f_s in F and compute its relevance value RDC(f_s) using the following equation group:

RDC(w_i) = AUC(w_i, tc_m),

where w_i is the feature word; df_pos(w_i) and df_neg(w_i) are the numbers of positive-class and negative-class documents containing the word w_i; tc_j(w_i) is the count of word w_i in document j; AUC(w_i, tc_j) is the area under the ROC curve of feature word w_i at term count tc_j; tc_{j-1} and tc_{j+1} are the counts of the feature word in documents j-1 and j+1; and tc_m is its count in the last document m;
Step3: sort the features by the RDC values computed in Step2;
Step4: select the feature f_max with the largest RDC value;
Step5: add f_max to set S;
Step6: remove f_max from set F;
Step7: traverse set F and initialize sum(f_i) = 0 for each feature;
Step8: traverse set F; for each feature f_i, compute its correlation Correlation(f_i, f_s) with each feature f_s in S and accumulate sum(f_i) = sum(f_i) + Correlation(f_i, f_s);
Step9: for each feature f_i in set F, compute its value M(f_i) using the following formula:

M(f_i) = RDC(f_i) - sum(f_i),
where RDC(f_i) is the relevance of feature f_i, and Correlation(f_i, f_j) denotes the correlation between two features f_i and f_j defined by their similarity, computed with the Pearson correlation coefficient:

Correlation(f_i, f_j) = Σ_d (f_{i,d} - mean(f_i)) (f_{j,d} - mean(f_j)) / sqrt( Σ_d (f_{i,d} - mean(f_i))² · Σ_d (f_{j,d} - mean(f_j))² ),

where f_{i,d} and f_{j,d} are the term frequencies of feature words i and j in the d-th document, and mean(f_i) and mean(f_j) are the average term frequencies of f_i and f_j over the document set; Correlation(f_i, f_j) = 1 indicates maximum positive correlation, Correlation(f_i, f_j) = -1 indicates maximum negative correlation, and its value lies between -1 and 1;
Step10: select the feature f_max with the largest M value;
Step11: add f_max to set S;
Step12: remove f_max from set F;
Step13: repeat Step8-Step12 until the number of features in set S equals k;
Step14: set S is the final selected feature set.
CN201810859899.0A 2018-08-01 2018-08-01 Method for improving feature selection Active CN109325511B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810859899.0A CN109325511B (en) 2018-08-01 2018-08-01 Method for improving feature selection


Publications (2)

Publication Number Publication Date
CN109325511A true CN109325511A (en) 2019-02-12
CN109325511B CN109325511B (en) 2020-07-31

Family

ID=65264054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810859899.0A Active CN109325511B (en) 2018-08-01 2018-08-01 Method for improving feature selection

Country Status (1)

Country Link
CN (1) CN109325511B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188143A * 2019-04-04 2019-08-30 Shanghai Power Equipment Research Institute Co., Ltd. Method for diagnosing induced-draft-fan vibration faults in power plants
CN110426612A * 2019-08-17 2019-11-08 Fuzhou University Two-stage method for selecting time-domain dielectric response characteristic quantities of transformer oil-paper insulation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005081158A3 (en) * 2004-02-23 2006-03-02 Novartis Ag Use of feature point pharmacophores (fepops)
CN102200981A (en) * 2010-03-25 2011-09-28 三星电子(中国)研发中心 Feature selection method and feature selection device for hierarchical text classification
CN103177121A * 2013-04-12 2013-06-26 Tianjin University Locality preserving projection method incorporating the Pearson correlation coefficient
CN105512311A * 2015-12-14 2016-04-20 Beijing University of Technology Adaptive feature selection method based on the chi-square statistic

Also Published As

Publication number Publication date
CN109325511B (en) 2020-07-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
    Inventor after: Wang Haitao
    Inventor before: Wang Haitao
    Inventor before: Tang Kang
GR01 Patent grant