CN103886108A - Feature selection and weight calculation method of imbalance text set - Google Patents

Feature selection and weight calculation method of imbalance text set

Info

Publication number
CN103886108A
CN103886108A (application CN201410149441.8A)
Authority
CN
China
Prior art keywords
feature
classification
text
chi
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410149441.8A
Other languages
Chinese (zh)
Other versions
CN103886108B (en)
Inventor
刘磊 (Liu Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Goonie International Software (Beijing) Co.,Ltd.
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201410149441.8A priority Critical patent/CN103886108B/en
Publication of CN103886108A publication Critical patent/CN103886108A/en
Application granted granted Critical
Publication of CN103886108B publication Critical patent/CN103886108B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing

Abstract

The invention provides a feature selection and weight calculation method for an imbalanced text set, belonging to the field of text information processing. To address the classification of imbalanced text data, a feature selection and weight calculation method and system are proposed. The category discrimination degree is combined with an average word frequency factor to improve the chi-square statistic for feature selection; a commonly used feature weight calculation method is also improved, and a TF-IDF weight calculation method is proposed on that basis. The method outperforms traditional feature selection on imbalanced data sets and effectively improves classification accuracy.

Description

Feature selection and weight calculation method for an unbalanced text set
Technical field
The invention belongs to the field of text information processing, and specifically relates to feature selection and weight calculation for unbalanced text sets.
Background technology
With the rapid development of information technology and the spread of the internet, text information resources have expanded rapidly. While these resources enrich people's knowledge and provide convenience, they also contain a large amount of junk information. As one of the major techniques of information retrieval, text classification has high practical value for improving the performance of information retrieval and filtering systems.
Under normal circumstances, texts come not only from web pages and mail but also from short messages, microblogs, forum posts, and so on. In text classification, once texts are represented as vectors, the training set may have tens of thousands of features. Many of them are irrelevant or redundant and need to be removed, as do the noise features that disturb classification accuracy. A huge feature space reduces the performance and generalization ability of the classifier, and processing high-dimensional vectors carries high time complexity. Feature selection, an important step in text classification, improves the efficiency and precision of the classifier by reducing the dimensionality of the features. Because category information is an important component of text classification, problems such as complex category relations, uneven distribution, and category uncertainty arise in text classification and pose many challenges to feature selection research.
Many traditional machine learning methods assume a balanced data set, but in real applications most data are unbalanced, and conventional machine learning methods usually handle unbalanced data sets poorly. How to process unbalanced data sets effectively is a research hotspot in data mining, with broad prospects and practical significance in fields such as medical diagnosis, financial credit management, and mail filtering. Approaches to the imbalance problem fall into two aspects: sampling and algorithms. The present invention addresses the feature selection aspect on unbalanced data sets.
By studying feature selection algorithms for unbalanced data sets, the inventor provides a feature selection and weight calculation method for unbalanced text sets that overcomes the limitations of traditional classification methods in the face of unbalanced data sets.
Summary of the invention
The object of the invention is to propose a feature selection and weight calculation method and system for the classification of unbalanced text data. Combining the category discrimination degree with an average word frequency factor, the present invention improves the chi-square statistic for feature selection; it also improves the commonly used feature weight calculation and, on that basis, proposes a TF-IDF weight calculation method. Experiments show that the improved methods outperform traditional feature selection on unbalanced data sets and are effective and feasible for improving classification accuracy.
The present invention is realized by the following technical means:
Step 1: preprocess the text set and extract semantic information, as follows:
Step 1.1: use Chinese lexical processing software to perform word segmentation and part-of-speech tagging on the file set.
Step 1.2: filter out the stop words after segmentation, including modal auxiliaries, prepositions and adverbs.
Step 2: perform the feature selection calculation for the text set, as follows:
Each preprocessed text data set is processed as follows.
Step 2.1: compute the CHI statistic of feature t and category c. Four document counts are used:
the number of texts containing feature t and belonging to category $c_i$, denoted A;
the number of texts containing feature t but not belonging to $c_i$, denoted B;
the number of texts not containing feature t but belonging to $c_i$, denoted C;
the number of texts neither containing feature t nor belonging to $c_i$, denoted D.
The CHI statistic of feature t and category c is:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}\qquad(1)$$
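A minimal sketch of this computation, assuming the four counts have already been tallied over the training set (function and variable names are illustrative, not from the patent):

```python
def chi_square(A, B, C, D):
    """One-sided CHI statistic of feature t and category c (formula 1)."""
    N = A + B + C + D                      # total number of texts
    if A * D - B * C <= 0:                 # negatively correlated features score 0
        return 0.0
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0
```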
Step 2.2: compute the inverse category frequency ICF, where M is the total number of categories in the text set and $m_t$ is the number of categories in the document set in which feature t occurs:

$$ICF_{t,C}=\ln\!\left(\frac{M}{m_t}+1\right),\qquad M>0,\ 0\le m_t\le M$$
Step 2.3: compute the improved chi-square statistic:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}\times ICF_{t,C}\times\dfrac{TC_i}{\overline{TC_i}}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}\qquad(2)$$

where $TC_i$ is the average word frequency of feature t in the positive class and $\overline{TC_i}$ is its average word frequency in the negative class; their ratio measures the correlation between the feature and the category, and the larger the value, the stronger the correlation of t with the positive class. Here $\chi^2(t,c)$ ranges over $[0,+\infty)$.
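Continuing the sketch, formula (2) scales the basic statistic by ICF and by the average-word-frequency ratio; `chi_square` is the helper above, and the placement of the +1 in ICF and the zero-frequency guard are reconstructions the patent does not spell out:

```python
import math

def icf(M, m_t):
    """Inverse category frequency: M categories in total, m_t of them
    contain feature t; the +1 keeps the factor strictly positive."""
    return math.log(M / m_t + 1)

def improved_chi(A, B, C, D, M, m_t, tc_pos, tc_neg):
    """Improved chi-square of formula (2): the basic statistic scaled by
    ICF and by the ratio of the feature's average word frequency in the
    positive class (tc_pos) to that in the negative class (tc_neg)."""
    ratio = tc_pos / tc_neg if tc_neg > 0 else tc_pos  # assumed guard: t absent from negative class
    return chi_square(A, B, C, D) * icf(M, m_t) * ratio
```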
Step 3: compute term weights, as follows:
Weight calculation is performed for each feature word in each text.
Step 3.1: compute the lambda factor, as follows:

$$\lambda(t,c_i)=\frac{DF(t,c_i)}{D(c_i)}\qquad(3)$$

where $DF(t,c_i)$ is the number of texts in class $c_i$ that contain feature item t and $D(c_i)$ is the total number of texts in class $c_i$; λ is thus the fraction of texts in a given class that contain feature word t, and $\lambda(t,c_i)$ ranges over [0,1];
Step 3.2: compute the TF-IDF*λIG value

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG\right]^2}}\qquad(4)$$

Step 3.3: compute TF-IDF*λCHI

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI\right]^2}}\qquad(5)$$

In the formulas of steps 3.2 and 3.3, t denotes a feature item, N is the total number of texts in the text set, and $n_i$ is the number of texts in which feature $t_i$ occurs; $tf_{ij}$ is the number of times feature word $t_i$ occurs in a text $d_j$; $w(t_i,d_j)$ ranges over [0,1].
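A sketch of the step-3 weighting, assuming per-text term frequencies, document frequencies, and the λ·CHI products are precomputed; the TF-IDF*λIG variant of formula (4) differs only in using λ·IG and omitting L:

```python
import math

def lambda_factor(df_t_ci, d_ci):
    """Formula (3): fraction of texts in class c_i containing feature t."""
    return df_t_ci / d_ci

def weight_vector(tfs, ns, N, lam_chi, L=1.0):
    """Formula (5): cosine-normalized TF-IDF * lambda * CHI weights for
    the features of one text.

    tfs[i]     -- term frequency of feature t_i in the text
    ns[i]      -- number of texts containing t_i
    lam_chi[i] -- lambda(t_i, c) * improved chi of t_i
    N, L       -- total text count and the experimental constant L
    """
    raw = [tf * math.log(N / (n + L)) * lc for tf, n, lc in zip(tfs, ns, lam_chi)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm > 0 else raw
```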
Step 4: output the classification results.
Compared with the prior art, the present invention has the following obvious advantages and beneficial effects:
The method of the invention considers the distribution of features across the positive and negative classes and can comprehensively select features with strong representativeness and discriminative power, avoiding the unsuitability of traditional feature selection methods on unbalanced data sets. The weight calculation method based on feature combination better solves the problems of high vector-space dimensionality and the extraction of related feature words, improving both the efficiency of the classification program and the precision of classification.
Brief description of the drawings
Fig. 1 is the flow chart of the feature selection and weight calculation method and system for unbalanced text data sets;
Fig. 2 is the broken-line graph of positive-class F1 values under different imbalance ratios;
Fig. 3 shows the experimental results of the improved TF-IDF weight calculation under information gain feature selection;
Fig. 4 shows the comparison of the improved TF-IDF weight calculation under chi-square feature selection.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. As shown in Fig. 1, the proposed method proceeds through the following steps:
Step 1: preprocess the unbalanced text set and extract the words carrying semantic information.
Step 1.1: use Chinese lexical processing software to perform word segmentation and part-of-speech tagging on the file set.
The word segmentation in the experiments uses the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System).
Step 1.2: filter out the stop words after segmentation, such as modal auxiliaries, prepositions and adverbs.
Stop words occurring in large numbers in a text introduce noise that interferes with its effective information. Deleting them achieves a rough dimensionality reduction, the aim being to improve both the efficiency of the classification program and the precision of classification.
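The patent itself uses ICTCLAS; purely as an illustration, the same preprocessing could be sketched with the open-source jieba tagger (the tag names are jieba's and are assumed roughly comparable to ICTCLAS for these word classes):

```python
import jieba.posseg as pseg  # open-source stand-in for ICTCLAS

# POS classes filtered out: u/y (auxiliaries and modal particles),
# p (prepositions), d (adverbs).
STOP_POS = ("u", "y", "p", "d")

def preprocess(text, stopwords=frozenset()):
    """Segment, POS-tag, and filter a Chinese text (steps 1.1-1.2)."""
    return [tok.word for tok in pseg.cut(text)
            if not tok.flag.startswith(STOP_POS) and tok.word not in stopwords]
```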
Step 2: perform the feature selection calculation for the text set
Each preprocessed unbalanced text data set is processed as follows:
Step 2.1: compute the CHI statistic of feature t and category c, where:
(t, c_i): the number of texts containing feature t and belonging to category $c_i$, denoted A;
the number of texts containing feature t but not belonging to $c_i$, denoted B;
the number of texts not containing feature t but belonging to $c_i$, denoted C;
the number of texts neither containing feature t nor belonging to $c_i$, denoted D.
A and D reflect a positive dependence between feature t and category $c_i$, while B and C reflect a negative dependence. In the CHI statistical feature selection method, the CHI statistic of feature t and category c is:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}\qquad(1)$$
Step 2.2: compute the inverse category frequency ICF of the unbalanced text collection.
Different features discriminate between categories to different degrees; features prominent in the positive class clearly have good category discrimination. The inverse category frequency ICF (Inverse Category Frequency) is computed as:

$$ICF_{t,C}=\ln\!\left(\frac{M}{m_t}+1\right)\qquad(2)$$

where M is the total number of categories in text set C and $m_t$ is the number of categories in C in which feature t occurs. The added 1 prevents ICF from being 0.
Step 2.3: compute the improved chi-square statistic:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}\times ICF_{t,C}\times\dfrac{TC_i}{\overline{TC_i}}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}\qquad(3)$$

where $TC_i$ is the average word frequency of feature t in the positive class and $\overline{TC_i}$ is its average word frequency in the negative class; their ratio measures the correlation between the feature and the category, and the larger the value, the stronger the correlation of t with the positive class.
Step 3: compute term weights over the unbalanced text set
Feature word weights are determined by the frequency with which a feature word occurs in a text and by the number of texts in which it occurs. The present invention uses the TF-IDF function to compute feature weights.
Term frequency, TF, is the number of times a feature word occurs in a text; the larger the TF value of a feature word, the stronger its ability to represent the category. Inverse document frequency, IDF, captures the idea that the fewer texts contain a feature word, the better that word distinguishes a class of texts, and the larger its weight.
The TF-IDF formula multiplies term frequency by inverse document frequency; the normalized TF-IDF function is:

$$TF_j\times IDF_j=\frac{tf_j\times\log\!\left(\frac{N}{n_j+L}\right)}{\sqrt{\sum_{t\in d_k}\left[tf_j\times\log\!\left(\frac{N}{n_j+L}\right)\right]^2}}\qquad(4)$$

where L is a constant determined experimentally, N is the total number of texts, and $n_j$ is the number of texts containing feature word $t_j$.
The inventors improve the term weight calculation in each text: building on TF-IDF, the ability of a feature word to discriminate text categories is added. TF-IDF captures how frequently a feature item occurs in a text, while the feature selection function captures the relation between the feature item and the text category.
Step 3.1: compute the lambda factor
When the data are unbalanced, even if the share of texts containing a feature word in the "large class" is small, their number may still exceed the number of texts containing that word in the "small class". The lambda factor is introduced as an adjustment:

$$\lambda(t,c_i)=\frac{DF(t,c_i)}{D(c_i)}\qquad(5)$$

where $DF(t,c_i)$ is the number of texts in class $c_i$ that contain feature item t, $D(c_i)$ is the total number of texts in class $c_i$, and λ is the fraction of texts in a given class that contain feature word t;
Step 3.2: add information gain and compute the TF-IDF*λIG value
Information gain (Information Gain) measures the amount of information a feature provides for classification; for each feature t, the larger the gain, the more important the feature is for classification. The information gain of feature t is:

$$IG(t)=-\sum_{i=1}^{n}P(c_i)\log P(c_i)+P(t)\sum_{i=1}^{n}P(c_i|t)\log P(c_i|t)+P(\bar{t})\sum_{i=1}^{n}P(c_i|\bar{t})\log P(c_i|\bar{t})\qquad(6)$$

where $P(c_i)$ is the probability that a text belongs to category $c_i$, $P(t)$ is the probability that feature t appears in the text set, $P(c_i|t)$ is the probability that a text containing feature t belongs to $c_i$, $P(\bar{t})$ is the probability that a text in the text set does not contain feature t, $P(c_i|\bar{t})$ is the probability that a text not containing feature t belongs to $c_i$, and n is the number of categories.
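A sketch of formula (6), assuming the probabilities have been estimated by counting over the training set (P(t̄) is taken as 1 − P(t)):

```python
import math

def information_gain(p_c, p_t, p_c_given_t, p_c_given_not_t):
    """Information gain of feature t per formula (6); the probability
    lists are indexed by category. Zero-probability terms are skipped,
    since p*log(p) tends to 0 as p tends to 0."""
    h = lambda ps: sum(p * math.log(p) for p in ps if p > 0)
    return -h(p_c) + p_t * h(p_c_given_t) + (1 - p_t) * h(p_c_given_not_t)
```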
TF-IDF first selects feature words that occur frequently in a single text but rarely in other texts. Information gain is then used to find words that may not occur in a given sample but express the text's meaning and contribute greatly to distinguishing text categories. Finally the lambda factor is introduced to combine them; the improved formula is:

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG\right]^2}}\qquad(7)$$
Step 3.3: introduce the improved chi-square statistic and compute TF-IDF*λCHI
CHI captures the relation between feature words and categories; the lambda factor is introduced and the result is combined with TF-IDF. The improved algorithm favors feature words that occur frequently and carry a large amount of category information. The improved formula is:

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI\right]^2}}\qquad(8)$$
Step 4: carry out classification comparison experiments using the improved feature selection and weight calculation methods.
To verify that the method of the present invention improves on traditional methods, the following experiments were carried out.
Step 4.1: feature selection experiment for unbalanced text classification
The experimental data come from the Fudan University Chinese corpus obtained from the scientific research data sharing platform website, and an open testing method is adopted. The corpus comprises 20 categories and is divided into a training set and a test set; the two parts have roughly equal, non-overlapping sample counts, and all texts are in txt format. The category distribution of the training and test sets is shown in Table 1:
Table 1: category distribution of the training set and test set
The category names correspond as follows:
C3-art, C4-literature, C5-education, C6-philosophy, C7-history, C11-space, C15-energy, C16-electronics, C17-communication, C19-computers, C23-mining, C29-transportation, C31-environment, C32-agriculture, C34-economy, C35-law, C36-medicine, C37-military, C38-politics, C39-sports.
In the text classification experiments, the two parts are merged and samples are chosen according to the practical application. C5 and C34, whose sample sizes differ considerably in the Fudan Chinese corpus, are chosen here as the unbalanced data set: 60 texts are drawn at random for the positive class C5, and six groups are drawn at random from the negative class C34 at specific ratios. The experimental data of the unbalanced data set are shown in Table 2:
Table 2: experimental data of the unbalanced data set
Here 3-fold cross-validation is used: the sample set chosen above is divided into 3 groups, 2 of which serve as the training set and 1 as the test set; the process is repeated three times and the mean of the three experimental results is taken.
Word segmentation again uses the Chinese lexical analysis system ICTCLAS, and 1000-dimensional features are selected. The classification algorithm is the support vector machine. Performance is evaluated with the F1 value, the combined measure of precision and recall:

$$F1=\frac{2\times precision\times recall}{precision+recall}\qquad(9)$$
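The evaluation protocol could be reproduced along these lines (a sketch with scikit-learn and random stand-in data; the patent does not name a library):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((120, 1000))        # stand-in for the 1000-dim feature vectors
y = np.array([1] * 60 + [0] * 60)  # positive class C5 vs. one negative group of C34

# 3-fold cross-validation scored by the positive-class F1 of formula (9);
# the reported figure is the mean over the three folds.
scores = cross_val_score(LinearSVC(), X, y, cv=3, scoring="f1")
print("mean positive-class F1: %.3f" % scores.mean())
```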
The experimental results of CHI, IG, and the improved CHI feature selection methods are compared below under different imbalance ratios; the weighting method in this experiment is TF-IDF. The experimental results are as follows:
Table 3: experimental results with TF-IDF feature weighting
Since the classification quality of the positive class in the unbalanced data set is of most concern, and to ease comparative analysis of the experimental data, the positive-class F1 values under different imbalance ratios are plotted as a broken-line graph, shown in Fig. 2. It can be observed that, as the imbalance ratio of the two classes keeps increasing, the negative-class F1 value grows slightly under all three feature selection methods, and the negative-class F1 of the improved CHI method is better than that of CHI and IG.
The positive-class F1 curves show that the variation of positive-class F1 differs considerably across feature selection methods. As the imbalance ratio keeps increasing, the improved CHI method obtains better positive-class F1 values than the other methods and reaches a fairly stable value beyond 1:10; without reducing the negative-class classification performance, the improved CHI method gives due attention to the positive-class samples and achieves satisfactory results.
The improved CHI method comprehensively considers the distribution of features across the positive and negative classes and can select features with strong representativeness and discriminative power. The experimental data also show that the improved method is affected very little by the degree of imbalance of the data set: under different imbalance ratios, the improved CHI method keeps the positive-class performance in a fairly ideal state without reducing the negative-class performance.
In summary, the improved CHI method avoids the unsuitability of traditional feature selection methods on unbalanced data sets and, without reducing negative-class performance, improves positive-class performance by a considerable margin.
Step 4.2: weight calculation experiment for unbalanced text classification
The experimental data again come from the Fudan University Chinese corpus obtained from the scientific research data sharing platform website, with an open testing method. The corpus comprises 20 categories divided into a training set and a test set with roughly equal, non-overlapping sample counts, all in txt format. Ten categories are chosen from it; the sample distribution for training and testing is shown in Table 4.
Table 4: sample distribution for training and testing
The KNN classification algorithm is chosen for model training. With the same feature selection function, classification performance is tested when the weight formula is TF-IDF versus TF-IDF*λ(feature selection function). K is set to 10.
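A sketch of this comparison with scikit-learn's KNN on random stand-in data (the real experiment would plug in the two weighting schemes):

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((400, 1000))    # stand-in for TF-IDF / TF-IDF*lambda vectors
y = rng.integers(0, 10, 400)   # ten categories, as in Table 4

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN with k = 10, as in the experiment; the report yields the per-class
# accuracy and the macro/micro averages compared in Tables 5 and 6.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```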
(1) With information gain IG as the feature selection method, the feature weights are computed with TF-IDF and TF-IDF*λIG respectively. The experimental results are given in Table 5, and the overall comparison is shown in Fig. 3.
Table 5: experimental results of the improved TF-IDF weight calculation under information gain feature selection
It can be seen that the improved TF-IDF*λIG method gives clear improvements in macro-average recall, macro-average precision, and micro-average precision. In terms of per-category accuracy, the improved method gains most on categories C7 and C11, where C7 is a category with relatively few samples; the remaining categories also improve, though to a limited extent.
(2) With the chi-square statistic CHI as the feature selection method, the feature weights are computed with TF-IDF and TF-IDF*λCHI respectively. The experimental results are given in Table 6, and the overall comparison is shown in Fig. 4.
Table 6: experimental results of the improved TF-IDF weight calculation under chi-square feature selection
It can be seen that, although the improved TF-IDF*λCHI method declines slightly in macro-average recall, it improves markedly in macro-average and micro-average precision. The accuracy of most categories improves to some extent, with C39 and C7 improving most noticeably.
The above experiments show that, with the KNN classification model and the weighting improvements based on feature combination, the improved TF-IDF achieves markedly better classification than the traditional TF-IDF method, and shows good classification performance even for individual categories with few samples. This weight calculation method based on feature combination better solves the problems of high vector-space dimensionality and the extraction of related feature words.
The experimental results show that the feature-combination weight improvements proposed by the present invention clearly improve on traditional methods.
Finally it should be noted that the above example merely illustrates, and does not restrict, the technical scheme of the invention; although this specification has described the invention in detail with reference to the above example, those of ordinary skill in the art should appreciate that the invention may still be modified or equivalently substituted, and all technical schemes and improvements that do not depart from the spirit and scope of the invention should be encompassed within the scope of the claims of the present invention.

Claims (1)

1. A feature selection and weight calculation method and system for an unbalanced text set, realized according to the following steps:
Step 1: preprocess the text set and extract semantic information, as follows:
Step 1.1: use Chinese lexical processing software to perform word segmentation and part-of-speech tagging on the file set;
Step 1.2: filter out the stop words after segmentation: modal auxiliaries, prepositions, adverbs;
Step 2: perform the feature selection calculation for the text set, as follows:
Each preprocessed text data set is processed as follows
Step 2.1: compute the CHI statistic of feature t and category c
the number of texts containing feature t and belonging to category $c_i$ is denoted A;
the number of texts containing feature t but not belonging to $c_i$ is denoted B;
the number of texts not containing feature t but belonging to $c_i$ is denoted C;
the number of texts neither containing feature t nor belonging to $c_i$ is denoted D;
The CHI statistic of feature t and category c is:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}$$
Step 2.2: compute the inverse category frequency ICF, where M is the total number of categories in text set C and $m_t$ is the number of categories in C in which feature t occurs:

$$ICF_{t,C}=\ln\!\left(\frac{M}{m_t}+1\right)$$
Step 2.3: compute the improved chi-square statistic, as follows:

$$\chi^2(t,c)=\begin{cases}\dfrac{N\times(AD-CB)^2}{(A+C)\times(B+D)\times(A+B)\times(C+D)}\times ICF_{t,C}\times\dfrac{TC_i}{\overline{TC_i}}, & AD-BC>0\\[4pt] 0, & AD-BC\le 0\end{cases}$$

where $TC_i$ is the average word frequency of feature t in the positive class and $\overline{TC_i}$ is its average word frequency in the negative class; their ratio measures the correlation between the feature and the category, and the larger the value, the stronger the correlation of t with the positive class;
Step 3: compute term weights
Weight calculation is performed for each feature word in each text
Step 3.1: compute the lambda factor, as follows:

$$\lambda(t,c_i)=\frac{DF(t,c_i)}{D(c_i)}$$

where $DF(t,c_i)$ is the number of texts in class $c_i$ that contain feature item t, $D(c_i)$ is the total number of texts in class $c_i$, and λ is the fraction of texts in a given class that contain feature word t;
Step 3.2: compute the TF-IDF*λIG value, as follows:

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i}\right)\times\lambda IG\right]^2}}$$

Step 3.3: compute TF-IDF*λCHI, as follows:

$$w(t_i,d_j)=\frac{tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI}{\sqrt{\sum_{i\in d_j}\left[tf_{ij}\times\log\!\left(\frac{N}{n_i+L}\right)\times\lambda CHI\right]^2}}$$
Step 4: output the classification results.
CN201410149441.8A 2014-04-13 2014-04-13 Feature selection and weight calculation method for an unbalanced text set Active CN103886108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410149441.8A CN103886108B (en) 2014-04-13 2014-04-13 Feature selection and weight calculation method for an unbalanced text set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410149441.8A CN103886108B (en) 2014-04-13 2014-04-13 Feature selection and weight calculation method for an unbalanced text set

Publications (2)

Publication Number Publication Date
CN103886108A true CN103886108A (en) 2014-06-25
CN103886108B CN103886108B (en) 2017-09-01

Family

ID=50955000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410149441.8A Active CN103886108B (en) Feature selection and weight calculation method for an unbalanced text set

Country Status (1)

Country Link
CN (1) CN103886108B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN105808718A (en) * 2016-03-07 2016-07-27 浙江工业大学 Text feature selection method based on unbalanced data sets
CN106502990A (en) * 2016-10-27 2017-03-15 广东工业大学 A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN109471942A (en) * 2018-11-07 2019-03-15 合肥工业大学 Chinese comment sensibility classification method and device based on evidential reasoning rule
CN109492219A (en) * 2018-10-25 2019-03-19 山东省通信管理局 A kind of swindle website identification method analyzed based on tagsort and emotional semantic
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN110019654A (en) * 2017-07-20 2019-07-16 南方电网传媒有限公司 A kind of unbalance network text classification optimization system
CN110347833A (en) * 2019-07-09 2019-10-18 浙江工业大学 A kind of classification method of more wheel dialogues
CN110705247A (en) * 2019-08-30 2020-01-17 山东科技大学 Text similarity calculation method based on χ²-C

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
US9715493B2 (en) * 2012-09-28 2017-07-25 Semeon Analytics Inc. Method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model
CN103049435B (en) * 2013-01-04 2015-10-14 浙江工商大学 Text fine granularity sentiment analysis method and device
CN103218444B (en) * 2013-04-22 2016-12-28 中央民族大学 Based on semantic method of Tibetan language webpage text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
熊忠阳, 张鹏招, 张玉芳: "基于χ²统计的文本分类特征选择方法的研究" (Research on feature selection methods for text classification based on the χ² statistic), 《计算机应用》 (Journal of Computer Applications) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN105512311B (en) * 2015-12-14 2019-02-26 北京工业大学 A kind of adaptive features select method based on chi-square statistics
CN105808718B (en) * 2016-03-07 2019-02-01 浙江工业大学 A kind of text feature selection method based on unbalanced dataset
CN105808718A (en) * 2016-03-07 2016-07-27 浙江工业大学 Text feature selection method based on unbalanced data sets
CN106502990A (en) * 2016-10-27 2017-03-15 广东工业大学 A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing
CN108090088A (en) * 2016-11-23 2018-05-29 北京国双科技有限公司 Feature extracting method and device
CN110019654A (en) * 2017-07-20 2019-07-16 南方电网传媒有限公司 A kind of unbalance network text classification optimization system
CN108376130A (en) * 2018-03-09 2018-08-07 长安大学 A kind of objectionable text information filtering feature selection approach
CN109492219A (en) * 2018-10-25 2019-03-19 山东省通信管理局 A kind of swindle website identification method analyzed based on tagsort and emotional semantic
CN109471942A (en) * 2018-11-07 2019-03-15 合肥工业大学 Chinese comment sensibility classification method and device based on evidential reasoning rule
CN109471942B (en) * 2018-11-07 2021-09-07 合肥工业大学 Chinese comment emotion classification method and device based on evidence reasoning rule
CN109543037A (en) * 2018-11-21 2019-03-29 南京安讯科技有限责任公司 A kind of article classification method based on improved TF-IDF
CN110347833A (en) * 2019-07-09 2019-10-18 浙江工业大学 A kind of classification method of more wheel dialogues
CN110705247A (en) * 2019-08-30 2020-01-17 山东科技大学 Text similarity calculation method based on χ²-C

Also Published As

Publication number Publication date
CN103886108B (en) 2017-09-01

Similar Documents

Publication Publication Date Title
CN103886108A (en) Feature selection and weight calculation method of imbalance text set
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN100583101C (en) Text categorization feature selection and weight computation method based on field knowledge
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN104391835A (en) Method and device for selecting feature words in texts
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN104239436A (en) Network hot event detection method based on text classification and clustering analysis
CN104298715B (en) A kind of more indexed results ordering by merging methods based on TF IDF
CN103390051A (en) Topic detection and tracking method based on microblog data
CN105912716A (en) Short text classification method and apparatus
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
CN101944099A (en) Method for automatically classifying text documents by utilizing body
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN102629272A (en) Clustering based optimization method for examination system database
CN104239512A (en) Text recommendation method
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
CN106021578A (en) Improved text classification algorithm based on integration of cluster and membership degree
CN108090178A (en) A kind of text data analysis method, device, server and storage medium
CN103914551A (en) Method for extending semantic information of microblogs and selecting features thereof
CN102999538B (en) Personage's searching method and equipment
CN104361059A (en) Harmful information identification and web page classification method based on multi-instance learning
CN102929977B (en) Event tracing method aiming at news website
Huang et al. Topic detection from microblog based on text clustering and topic model analysis
CN105224689A (en) A kind of Dongba document sorting technique
Yang et al. Research on Chinese text classification based on Word2vec

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200813

Address after: A5, block D, Xisanqi cultural science and Technology Park, yard 27, xixiaokou Road, Haidian District, Beijing 100085

Patentee after: Goonie International Software (Beijing) Co.,Ltd.

Address before: 100124 Chaoyang District, Beijing Ping Park, No. 100

Patentee before: Beijing University of Technology

TR01 Transfer of patent right