CN106844596A - A Chinese text classification method based on an improved SVM - Google Patents

A Chinese text classification method based on an improved SVM

Info

Publication number
CN106844596A
CN106844596A (application CN201710026144.8A)
Authority
CN
China
Prior art keywords
text
characteristic item
classification
vector
Chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710026144.8A
Other languages
Chinese (zh)
Inventor
邱志斌
向靓
涂高元
郭永兴
陆云燕
陈雅贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XIAMEN TIPRAY TECHNOLOGY Co Ltd
Original Assignee
XIAMEN TIPRAY TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XIAMEN TIPRAY TECHNOLOGY Co Ltd filed Critical XIAMEN TIPRAY TECHNOLOGY Co Ltd
Priority to CN201710026144.8A priority Critical patent/CN106844596A/en
Publication of CN106844596A publication Critical patent/CN106844596A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a Chinese text classification method based on an improved SVM, comprising the following steps. Step 1: preprocess the Chinese text to obtain a feature-item set. Step 2: perform feature selection on the feature-item set to obtain a reduced feature-item set. Step 3: compute a weight for each item in the reduced feature-item set. Step 4: build a text vector, taking each keyword of the text as one dimension of the vector space, with the value on that dimension being the keyword's weight. Step 5: build a classifier using a weighted support vector machine. Step 6: process the text to be classified with steps 1-4 to obtain its text vector, and feed that vector into the classifier built in step 5 to obtain the classification result. This classification method can improve text classification accuracy.

Description

A Chinese text classification method based on an improved SVM
Technical field
The invention belongs to the field of data mining technology, and in particular relates to a Chinese text classification method based on an improved SVM.
Background art
A text classification method is a supervised classification technique: a classifier is trained on a text dataset whose categories have already been labeled (the training set), and texts of unknown category are then classified with the trained classifier. Existing classification methods and their defects are:
(1) Traditional machine learning methods such as the Bayes method and the K-nearest-neighbor algorithm are all built on empirical risk minimization, and their generalization performance is not ideal;
(2) The traditional support vector machine (SVM, Support Vector Machine) method is a newer pattern recognition method based on the structural risk minimization principle, characterized by small-sample learning, good generalization and a globally optimal solution. In practice, however, class-imbalanced classification problems are pervasive, and in such cases the traditional SVM method suffers a high error rate and needs improvement.
Summary of the invention
The purpose of the present invention is to provide a Chinese text classification method based on an improved SVM that can improve text classification accuracy.
To achieve the above purpose, the solution of the invention is:
A Chinese text classification method based on an improved SVM comprises the following steps:
Step 1: preprocess the Chinese text to obtain a feature-item set;
Step 2: perform feature selection on the feature-item set to obtain a reduced feature-item set;
Step 3: compute a weight for each item in the reduced feature-item set;
Step 4: build a text vector, taking each keyword of the text as one dimension of the vector space, with the value on that dimension being the keyword's weight;
Step 5: build a classifier using a weighted support vector machine;
Step 6: process the text to be classified with steps 1-4 to obtain its text vector, and feed that vector into the classifier built in step 5 to obtain the classification result.
In step 1, preprocessing the Chinese text comprises two processes: Chinese word segmentation and stop-word removal.
The specific content of step 2 is: construct an evaluation function to score all feature items in the feature-item set, sort them by score in descending order, and select the top feature items according to a preset threshold or a preset number of feature items, yielding the reduced feature-item set.
The evaluation function uses the chi-square test. Assuming that feature item t and category C_i obey a χ² distribution with one degree of freedom, the statistic is computed as:
χ²(t, C_i) = N(AD − BC)² / [(A + C)(B + D)(A + B)(C + D)]
where N is the total number of texts, A is the number of texts that contain feature item t and belong to category C_i, B is the number that contain t but do not belong to C_i, C is the number that do not contain t but belong to C_i, and D is the number that neither contain t nor belong to C_i;
the χ² statistics of all feature items t are then sorted from largest to smallest, and the top few are taken as the reduced feature-item set.
In step 3, weights are computed with the inverse document frequency; the weight IDF is computed as:
IDF = log(D_all / D_t)
where D_all is the total number of documents and D_t is the number of documents in which the word occurs.
The detailed content of step 5 is:
Let the training sample set be {(x̄_i, y_i)}, i = 1, 2, …, m, y_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, where x̄_i denotes the vector of the i-th text and y_i is its category label. The text classification model based on the weighted support vector machine is:
min (1/2)‖W̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i
s.t. y_i(W̄ᵀΦ(x̄_i) + b) ≥ 1 − ζ_i
where ζ_i ≥ 0, i = 1, 2, …, l, l is the number of samples, and Φ(·) is the kernel mapping; S_i > 0 is the sample-importance weight: 0 < S_i < 1 means sample x̄_i is unimportant, S_i = 1 means it is of ordinary importance, and S_i > 1 means it is very important; the class weight is σ ≥ 1, and samples belonging to the same category share the same class weight;
the Lagrangian function is constructed as:
L(w̄, b, α) = (1/2)‖w̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i − Σ_{i=1}^{l} α_i [y_i(w̄ᵀΦ(x̄_i) + b) − 1 + ζ_i] − Σ_{i=1}^{l} β_i ζ_i
where α_i and β_i are Lagrange multipliers, i = 1, 2, …, l;
finally, the optimal classifier is obtained:
f(x̄_j) = sgn(Σ_{i=1}^{l} y_i α_i* K(x̄_i, x̄_j) + b*)
where K(x̄_i, x̄_j) is the radial basis kernel function.
With the above scheme, the present invention adds a weighting step to the traditional SVM-based text classification method, which effectively alleviates the sample-imbalance situation common in multi-class Chinese text classification. Applying the improved weighted-SVM text classification method to the daily document classification of enterprises and institutions improves classification accuracy and ensures that files of certain important categories (such as financial files) are not missed, safeguarding data security to a certain extent.
Brief description of the drawings
Fig. 1 is a flow chart of the training stage of the present invention;
Fig. 2 is a flow chart of the classification stage of the present invention.
Specific embodiments
The technical scheme of the present invention is described in detail below with reference to the accompanying drawings.
The present invention provides a Chinese text classification method based on an improved SVM. Text classification is the process of matching a text document against a set of categories, and comprises two stages, training and classification; the flow chart of the training stage is shown in Fig. 1 and that of the classification stage in Fig. 2. The processing of the two stages is identical except for the last step: in the training stage, the last step feeds the input data to the classification algorithm to build the classifier, while in the classification stage it classifies with the trained classifier. The classification method comprises the following steps:
(1) Training stage
Step 1: Chinese text preprocessing, comprising two processes, Chinese word segmentation and stop-word removal.
Chinese word segmentation analyzes a sentence expressed in Chinese to find the meaningful words or phrases it contains and extracts them, so that the original Chinese sentence becomes a sequence of individual words.
Stop-word removal generally removes words that occur very frequently in text but carry little practical meaning, such as common function words like 和 ("and") and 则 ("then"), overly frequent words such as 我 ("I") and 如果 ("if"), and the various punctuation marks, so as to avoid excessive noise after segmentation.
This step can use the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) segmenter of the Chinese Academy of Sciences. ICTCLAS combines the advantages of both dictionary-matching and statistical-analysis segmentation methods: it retains the fast, efficient segmentation of dictionary matching while using statistical analysis of the context to recognize new words and resolve ambiguities.
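As an illustrative sketch only (not the patent's implementation), the stop-word and punctuation removal applied after segmentation can be written as follows. The stop-word list here is a tiny sample, and the token list stands in for the output of a segmenter such as ICTCLAS:

```python
import re

# Tiny illustrative stop-word list; a production system would load a
# full Chinese stop-word lexicon.
STOP_WORDS = {"和", "则", "我", "如果"}

def remove_stop_words(tokens):
    """Drop stop words and pure-punctuation tokens from a segmented text."""
    kept = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        # Keep only tokens containing at least one word character,
        # discarding punctuation marks left over from segmentation.
        if not re.search(r"\w", tok):
            continue
        kept.append(tok)
    return kept

# Tokens as they might come out of a segmenter such as ICTCLAS.
print(remove_stop_words(["我", "爱", "自然", "语言", "处理", "。"]))
# → ['爱', '自然', '语言', '处理']
```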
Step 2: Feature selection
After text preprocessing the text exists as a feature-item set, and the number of feature items in this set is very large, so dimensionality reduction, i.e. feature selection, must be applied to it. An evaluation function (this embodiment uses the chi-square test) scores all feature items in the set; the items are then sorted by score in descending order, and the top items are selected according to a preset threshold or a preset number of feature items.
Chi-square test: assume that feature item t and category C_i obey a χ² distribution with one degree of freedom. The higher the χ² statistic of feature item t for category C_i, the stronger the correlation between t and C_i and the greater its power to discriminate the category; conversely, the lower the statistic, the weaker the discrimination. It is computed as:
χ²(t, C_i) = N(AD − BC)² / [(A + C)(B + D)(A + B)(C + D)]
where N is the total number of texts, A is the number of texts that contain feature item t and belong to category C_i, B is the number that contain t but do not belong to C_i, C is the number that do not contain t but belong to C_i, and D is the number that neither contain t nor belong to C_i.
The χ² statistics of all feature items t are then sorted from largest to smallest, and the top few are taken as the reduced feature-item set.
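This selection step can be sketched directly from the contingency counts above (function names are illustrative, not from the patent):

```python
def chi_square(A, B, C, D):
    """chi2(t, Ci) = N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)).

    A: texts containing t and in Ci      B: texts containing t, not in Ci
    C: texts without t and in Ci         D: texts without t, not in Ci
    N = A + B + C + D is the total number of texts.
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:          # degenerate contingency table
        return 0.0
    return N * (A * D - B * C) ** 2 / denom

def select_features(counts, k):
    """Keep the k feature items with the largest chi-square statistics.

    counts maps each feature item t to its (A, B, C, D) contingency counts.
    """
    ranked = sorted(counts, key=lambda t: chi_square(*counts[t]), reverse=True)
    return ranked[:k]

# A class-correlated feature outscores a feature spread evenly over classes.
counts = {"股票": (40, 5, 10, 45), "今天": (25, 25, 25, 25)}
print(select_features(counts, 1))  # → ['股票']
```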
Step 3: Weight calculation
The present invention computes weights with the inverse document frequency (IDF). The IDF value of a particular word is a measure of the word's general importance: divide the total number of documents by the number of documents containing the word, then take the logarithm (log) of the quotient. The IDF value is computed as:
IDF = log(D_all / D_t)
where D_all is the total number of documents and D_t is the number of documents in which the word occurs.
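The IDF computation can be sketched as follows. The base of the logarithm is not fixed by the formula above; the natural log is used here, which only rescales all weights by a constant:

```python
import math

def idf(d_all, d_t):
    """IDF = log(D_all / D_t): total documents over documents containing the word."""
    return math.log(d_all / d_t)

def idf_table(documents):
    """Compute the IDF weight of every term in a corpus of token lists."""
    d_all = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):               # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: idf(d_all, d_t) for term, d_t in doc_freq.items()}

corpus = [["财务", "报表"], ["财务", "审计"], ["比赛", "报表"]]
weights = idf_table(corpus)
print(round(weights["审计"], 3))  # occurs in 1 of 3 documents → 1.099
```

A term that appears in every document gets IDF = log(1) = 0, matching the intuition that such a term carries no discriminative weight.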
Step 4: Text representation
To make texts easy for a computer to process, they are represented with the vector space model. In the text vector space, each keyword is one dimension of the vector space, and the value on that dimension is the keyword's weight, which expresses the keyword's degree of importance.
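Under this representation, a text maps to one vector component per selected keyword, holding the keyword's weight when it occurs in the text. A minimal sketch with illustrative names:

```python
def text_to_vector(tokens, vocabulary, weights):
    """Vector-space representation of one text.

    vocabulary: ordered list of feature items kept after feature selection
    weights:    feature item -> weight (e.g. its IDF value)
    The i-th component is the i-th keyword's weight if that keyword occurs
    in the text, and 0.0 otherwise.
    """
    present = set(tokens)
    return [weights[term] if term in present else 0.0 for term in vocabulary]

vocab = ["财务", "报表", "比赛"]
w = {"财务": 1.1, "报表": 0.41, "比赛": 1.1}
print(text_to_vector(["财务", "报表"], vocab, w))  # → [1.1, 0.41, 0.0]
```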
Step 5: Build the classifier
The present invention uses the improved support vector machine method — the weighted support vector machine — to build the classifier, in order to solve classification under sample imbalance. Besides large disparities in per-class sample counts, differences in category importance also cause imbalance. For example, when classifying the texts of a company, the importance of "financial files" is clearly higher than that of "sports-meeting files". While ensuring classification accuracy, misjudging important categories should be avoided as far as possible.
The steps are as follows:
1. The weighted support vector machine assigns class weights to the training samples to reflect the importance of different categories. Increasing the weight of an important document class effectively reduces the number of misclassified samples in that class.
2. Furthermore, since the importance of individual texts also varies — i.e. their contributions to classification differ — each text is given a sample weight, which raises every text's chance of being classified correctly and lowers the chance that an important text is misclassified, thereby improving classification accuracy.
The specific algorithm and derivation are as follows:
Let the training sample set be {(x̄_i, y_i)}, i = 1, 2, …, m, y_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, where x̄_i denotes the vector of the i-th text and y_i is its category label (in this embodiment the number of categories is 10; e.g. y_i = 1 means the i-th text belongs to the 2nd category). The text classification model based on the weighted support vector machine is:
min (1/2)‖W̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i
s.t. y_i(W̄ᵀΦ(x̄_i) + b) ≥ 1 − ζ_i
where ζ_i ≥ 0, i = 1, 2, …, l, l is the number of samples, and Φ(·) is the kernel mapping. S_i > 0 is the sample-importance weight: 0 < S_i < 1 means sample x̄_i is unimportant, S_i = 1 means it is of ordinary importance, and S_i > 1 means it is very important. The class weight is σ ≥ 1, and samples belonging to the same category share the same class weight. Compared with the standard support vector machine, the most prominent advantage of the weighted support vector machine is that it refines the penalty for misclassified samples: the slack variable of each sample is multiplied by the sample's importance weight and its class weight.
The Lagrangian function is constructed as:
L(w̄, b, α) = (1/2)‖w̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i − Σ_{i=1}^{l} α_i [y_i(w̄ᵀΦ(x̄_i) + b) − 1 + ζ_i] − Σ_{i=1}^{l} β_i ζ_i
where α_i and β_i are Lagrange multipliers, i = 1, 2, …, l.
Finally, the optimal classifier is obtained:
f(x̄_j) = sgn(Σ_{i=1}^{l} y_i α_i* K(x̄_i, x̄_j) + b*)
where K(x̄_i, x̄_j) is the radial basis kernel function.
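The core idea of the model — scaling each sample's slack penalty by its importance weight S_i and class weight σ — can be illustrated with a simple Pegasos-style linear subgradient solver. This is a sketch under stated assumptions, not the patent's method: the patent solves the dual with an RBF kernel, whereas this toy solver is linear and all names are illustrative.

```python
import random

def train_weighted_linear_svm(X, y, sample_w, class_w,
                              lam=0.01, epochs=200, seed=0):
    """Stochastic subgradient training of a linear soft-margin SVM whose
    hinge penalty is scaled per sample (importance weight S_i) and per
    class (class weight sigma), mirroring the C*sigma*S_i*zeta_i term in
    the model above. Labels y are +1/-1; the bias b is folded in as an
    extra constant feature.
    """
    rng = random.Random(seed)
    dim = len(X[0]) + 1                      # +1 for the bias feature
    w = [0.0] * dim
    n = len(X)
    for t in range(1, epochs * n + 1):
        i = rng.randrange(n)
        xi = X[i] + [1.0]
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
        eta = 1.0 / (lam * t)                # decaying Pegasos step size
        cost = sample_w[i] * class_w[y[i]]   # S_i * sigma for this sample
        for j in range(dim):
            grad = lam * w[j]
            if margin < 1.0:                 # sample violates the margin
                grad -= cost * y[i] * xi[j]
            w[j] -= eta * grad
    return w

def predict(w, x):
    s = sum(wj * xj for wj, xj in zip(w, x + [1.0]))
    return 1 if s >= 0 else -1

# Toy linearly separable data with uniform weights.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]
w = train_weighted_linear_svm(X, y, sample_w=[1.0] * 4,
                              class_w={1: 1.0, -1: 1.0})
print(predict(w, [2.5, 2.5]), predict(w, [-2.5, -2.5]))  # → 1 -1
```

Raising class_w for a label, or a text's entry in sample_w, increases the penalty for misclassifying it — the same mechanism the patent uses to protect important document categories such as financial files.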
(2) Classification stage
For a text to be classified, first process the text with steps 1-4 of the training stage to obtain its corresponding text vector X; then input X into the classifier f(·) built in step 5 to obtain the classification result f(X) corresponding to X, i.e. the category of the text.
The above embodiment only illustrates the technical idea of the present invention and does not limit its scope of protection; any change made on the basis of the technical scheme according to the technical idea proposed by the present invention falls within the scope of protection of the present invention.

Claims (6)

1. A Chinese text classification method based on an improved SVM, characterized by comprising the following steps:
Step 1: preprocess the Chinese text to obtain a feature-item set;
Step 2: perform feature selection on the feature-item set to obtain a reduced feature-item set;
Step 3: compute a weight for each item in the reduced feature-item set;
Step 4: build a text vector, taking each keyword of the text as one dimension of the vector space, with the value on that dimension being the keyword's weight;
Step 5: build a classifier using a weighted support vector machine;
Step 6: process the text to be classified with steps 1-4 to obtain its text vector, and feed that vector into the classifier built in step 5 to obtain the classification result.
2. The Chinese text classification method based on an improved SVM according to claim 1, characterized in that: in step 1, preprocessing the Chinese text comprises two processes, Chinese word segmentation and stop-word removal.
3. The Chinese text classification method based on an improved SVM according to claim 1, characterized in that: the specific content of step 2 is: construct an evaluation function to score all feature items in the feature-item set, sort them by score in descending order, and select the top feature items according to a preset threshold or a preset number of feature items, yielding the reduced feature-item set.
4. The Chinese text classification method based on an improved SVM according to claim 3, characterized in that: the evaluation function uses the chi-square test, assuming that feature item t and category C_i obey a χ² distribution with one degree of freedom, computed as:
χ²(t, C_i) = N(AD − BC)² / [(A + C)(B + D)(A + B)(C + D)]
where N is the total number of texts, A is the number of texts that contain feature item t and belong to category C_i, B is the number that contain t but do not belong to C_i, C is the number that do not contain t but belong to C_i, and D is the number that neither contain t nor belong to C_i;
the χ² statistics of all feature items t are then sorted from largest to smallest, and the top few are taken as the reduced feature-item set.
5. The Chinese text classification method based on an improved SVM according to claim 1, characterized in that: in step 3, weights are computed with the inverse document frequency; the weight IDF is computed as:
IDF = log(D_all / D_t)
where D_all is the total number of documents and D_t is the number of documents in which the word occurs.
6. The Chinese text classification method based on an improved SVM according to claim 1, characterized in that: the detailed content of step 5 is:
Let the training sample set be {(x̄_i, y_i)}, i = 1, 2, …, m, y_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, where x̄_i denotes the vector of the i-th text and y_i is its category label; the text classification model based on the weighted support vector machine is:
min (1/2)‖W̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i
s.t. y_i(W̄ᵀΦ(x̄_i) + b) ≥ 1 − ζ_i
where ζ_i ≥ 0, i = 1, 2, …, l, l is the number of samples, and Φ(·) is the kernel mapping; S_i > 0 is the sample-importance weight: 0 < S_i < 1 means sample x̄_i is unimportant, S_i = 1 means it is of ordinary importance, and S_i > 1 means it is very important; the class weight is σ ≥ 1, and samples belonging to the same category share the same class weight;
the Lagrangian function is constructed as:
L(w̄, b, α) = (1/2)‖w̄‖² + Cσ Σ_{i=1}^{l} S_i ζ_i − Σ_{i=1}^{l} α_i [y_i(w̄ᵀΦ(x̄_i) + b) − 1 + ζ_i] − Σ_{i=1}^{l} β_i ζ_i
where α_i and β_i are Lagrange multipliers, i = 1, 2, …, l;
finally, the optimal classifier is obtained:
f(x̄_j) = sgn(Σ_{i=1}^{l} y_i α_i* K(x̄_i, x̄_j) + b*)
where K(x̄_i, x̄_j) is the radial basis kernel function.
CN201710026144.8A 2017-01-13 2017-01-13 A Chinese text classification method based on an improved SVM Pending CN106844596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710026144.8A CN106844596A (en) 2017-01-13 2017-01-13 A Chinese text classification method based on an improved SVM


Publications (1)

Publication Number Publication Date
CN106844596A true CN106844596A (en) 2017-06-13

Family

ID=59124204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710026144.8A Pending CN106844596A (en) 2017-01-13 2017-01-13 One kind is based on improved SVM Chinese Text Categorizations

Country Status (1)

Country Link
CN (1) CN106844596A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409537A * 2018-09-29 2019-03-01 深圳市元征科技股份有限公司 Maintenance case classification method and device
CN109815334A * 2019-01-25 2019-05-28 武汉斗鱼鱼乐网络科技有限公司 Bullet-screen text classification method, storage medium, device and system
CN109947941A * 2019-03-05 2019-06-28 永大电梯设备(中国)有限公司 Method and system for classifying elevator customer-service texts
CN110377734A * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 Text classification method based on support vector machines

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101127086A * 2007-09-12 2008-02-20 哈尔滨工程大学 Hyperspectral image repeated-selection weighted classification method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHUANG, DONG et al.: "Efficient Text Classification by Weighted Proximal SVM", Fifth IEEE International Conference on Data Mining *
姜嘉琪: "Class- and sample-weighted support vector machines and their application in intrusion detection", China Master's Theses Full-text Database, Information Science and Technology Series *
熊忠阳 et al.: "Research on χ²-statistic-based feature selection methods for text classification", Journal of Computer Applications *


Similar Documents

Publication Publication Date Title
Nguyen et al. Comparative study of sentiment analysis with product reviews using machine learning and lexicon-based approaches
US7711673B1 (en) Automatic charset detection using SIM algorithm with charset grouping
CN104750844A (en) Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
CN105912716A (en) Short text classification method and apparatus
CN106844596A (en) One kind is based on improved SVM Chinese Text Categorizations
BaygIn Classification of text documents based on Naive Bayes using N-Gram features
US8560466B2 (en) Method and arrangement for automatic charset detection
CN104820703A (en) Text fine classification method
CN110019790A (en) Text identification, text monitoring, data object identification, data processing method
CN109670014A (en) A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN108090178A (en) A kind of text data analysis method, device, server and storage medium
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
Budhiraja et al. A supervised learning approach for heading detection
Uslu et al. Towards a DDC-based topic network model of wikipedia
CN110968693A (en) Multi-label text classification calculation method based on ensemble learning
Ferreira et al. Using a genetic algorithm approach to study the impact of imbalanced corpora in sentiment analysis
Zobeidi et al. Effective text classification using multi-level fuzzy neural network
CN114996446B (en) Text classification method, device and storage medium
Diwakar et al. Proposed machine learning classifier algorithm for sentiment analysis
CN115358340A (en) Credit credit collection short message distinguishing method, system, equipment and storage medium
Islam et al. Performance measurement of multiple supervised learning algorithms for Bengali news headline sentiment classification
Abdulla et al. Fake News Detection: A Graph Mining Approach
Hirsch et al. Evolving rules for document classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170613