CN108376130A - A kind of objectionable text information filtering feature selection approach - Google Patents
- Publication number: CN108376130A (application CN201810196195.XA)
- Authority: CN (China)
- Prior art keywords: classification, item, text information, characteristic item, inverse
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The invention discloses a feature selection method for objectionable text information filtering. All feature terms are first extracted from a categorized corpus to build an initial feature term set. Then, for a feature term tj and any category Ci among the objectionable categories, a category term weight CTW value is calculated from the χ2 statistic χ2(tj,Ci), the improved inverse document frequency IDF, the inverse category frequency ICF, and the inverse harmful document frequency IHDF, and feature terms are screened using the CTW value as the basis for feature selection. Finally, the feature terms of the initial feature term set screened in step S2 are sorted from high to low by CTW value, and the top a feature terms are chosen to form the final feature term set. The invention solves the problem that the χ2 statistic feature selection method does not consider the within-class and between-class distribution of feature terms, and at the same time compensates for skew among the category data sets, thereby improving the effect of objectionable text information filtering.
Description
Technical field
The invention belongs to the field of natural language processing, in particular to text content filtering, and specifically relates to a feature selection method for objectionable text information filtering.
Background technology
In objectionable text information filtering, the "curse of dimensionality" is a major problem that must be solved. Text that has passed through Chinese word segmentation contains an enormous number of feature terms; because the corpus is huge, the dimensionality of the training text set ranges from tens of thousands up to hundreds of thousands of dimensions. Such a huge dimensionality places a serious computational burden on the computer and undoubtedly raises the difficulty of computation, and the increased computing time directly reduces the effectiveness of objectionable text information filtering. At the same time, a feature term set of such high dimension inevitably contains information noise, i.e. feature terms that have a negative effect on classification. Feature dimensionality reduction therefore becomes an essential processing step, and the χ2 statistic method has become one of the most widely used feature selection methods.
The χ2 statistic method is commonly used to test whether two variables are independent. Under the null hypothesis that the two variables are independent, the larger the calculated χ2 statistic, the more reality deviates from the null hypothesis, the smaller the possibility that the null hypothesis holds, and the stronger the association between the two variables. In the field of text classification, the null hypothesis H0 is that the feature term and the category are mutually independent and unrelated; the alternative hypothesis H1 is that the feature term and the category are related. The larger the χ2 statistic, i.e. the larger the deviation, the higher the degree of association between the feature term and the category. If the feature term and the category are independent of each other, the χ2 statistic is 0.
Although the χ2 statistic method is the feature selection method with the best application effect in current text classification, it inevitably has defects, mainly the following two:
(1) It lowers the weight of low-frequency words that have clear category significance
Some low-frequency words have a low document frequency yet appear in large numbers in a specific small set of documents of a certain class. Because the number of documents in which such a word occurs is small, its word frequency is relatively low, but the word is highly representative: it represents the category of those few documents and contributes greatly to classification. Since the result computed by the χ2 statistic formula is small, such a word is easily filtered out in the screening stage, so that highly representative feature terms are mistakenly deleted.
(2) It raises the weight of high-frequency words that occur often in other categories but rarely in the specified category
Such high-frequency words occur often in the other classes of the training document set but rarely in the specified class, i.e. the value of A is small; clearly such words do not represent the specified class well. Since in the calculation BC is much larger than AD, the computed χ2 statistic is high, and the word is not easily filtered out in the screening process, so that feature terms without strong representativeness are mistakenly retained.
Invention content
In view of the above deficiencies in the prior art, the technical problem to be solved by the present invention is to provide a feature selection method for objectionable text information filtering that addresses the particularities of objectionable text information filtering and improves the traditional χ2 statistic feature selection method.
The present invention uses following technical scheme:
A feature selection method for objectionable text information filtering: first, all feature terms are extracted from the categorized corpus to build an initial feature term set; then, for a feature term tj and any category Ci among the objectionable categories, the category term weight CTW value is calculated from the χ2 statistic χ2(tj,Ci), the improved inverse document frequency IDF, the inverse category frequency ICF, and the inverse harmful document frequency IHDF, and feature terms are screened using the CTW value as the basis for feature selection; finally, the screened feature terms of the initial feature term set are sorted from high to low by CTW value, and the top a feature terms are chosen to form the final feature term set.
Specifically, the improved inverse document frequency IDF value balances the distribution of a feature term within and between all classes; the inverse category frequency ICF value compensates for the category skew of the training document set; the inverse harmful document frequency IHDF value balances the distribution of a feature term between the objectionable and normal categories. The category term weight CTW value is then calculated as follows:
CTW=χ2(tj,Ci)×IDF×ICF×IHDF.
Further, define N as the total number of training documents, Ci as any category among the objectionable categories, and tj as any feature term in the initial feature term set of class Ci. A is the document frequency of documents that both contain feature term tj and belong to category Ci; B is the document frequency of documents that contain tj but do not belong to Ci; C is the document frequency of documents in Ci that do not contain tj; D is the document frequency of documents that neither contain tj nor belong to Ci. The total number of training documents is then N=A+B+C+D.
Further, χ2(tj,Ci) is calculated as follows:
χ2(tj,Ci) = N×(AD−BC)² / [(A+B)×(C+D)×(A+C)×(B+D)]
Further, the improved inverse document frequency IDF is calculated as follows:
IDF = log[(N/n)×(m/(k+1))]
where n is the number of documents containing feature term tj; m is the number of documents in category Ci containing tj; k is the number of documents of categories other than Ci containing tj; and n=m+k.
Further, the inverse category frequency ICF is calculated as follows:
ICF = log(p/q)
where p is the total number of categories in the training document set and q is the number of categories containing feature term tj. The more categories contain tj, the closer the ICF value is to 0, i.e. the less representative tj is and the lower its weight.
Further, the inverse harmful document frequency IHDF is calculated as follows:
IHDF = log[(N/w)×(v/(l+1))]
where N is the total number of training documents; w is the number of documents containing feature term tj; v is the number of documents in all objectionable categories containing tj; l is the number of documents in the normal categories other than the objectionable categories containing tj; and w=v+l.
Specifically, let the total number of test documents be N=TP+FP+FN+TN. The precision Pi and recall Ri of the filtering effect are calculated separately, and from them the comprehensive evaluation index of the final filtering effect is obtained for verification. TP is the number of documents retrieved that are relevant to the target category; FP is the number of documents retrieved that are irrelevant to the target category; FN is the number of documents not retrieved that are relevant to the target category; TN is the number of documents not retrieved that are irrelevant to the target category.
Further, the precision Pi of the filtering effect is calculated as follows:
Pi = TP/(TP+FP)
The recall Ri of the filtering effect is calculated as follows:
Ri = TP/(TP+FN)
Further, the comprehensive evaluation index F0.5, taking the objectionable text information categories as the benchmark, is calculated as follows:
F0.5 = (1+0.5²)×P×R / (0.5²×P+R)
The comprehensive evaluation index F2, taking the normal text information categories as the benchmark, is calculated as follows:
F2 = (1+2²)×P×R / (2²×P+R)
where P is the precision under the corresponding benchmark and R is the recall under the corresponding benchmark.
Compared with the prior art, the present invention has at least the following beneficial effects:
The feature selection method for objectionable text information filtering of the present invention is applied to the classification of the objectionable categories in objectionable text information filtering. The quality of the filtering result depends only on distinguishing the objectionable categories from the normal categories; the classification effect among the specific categories contained within the objectionable categories is not counted in the filtering effect. The present invention therefore adds the improved inverse document frequency IDF value, the inverse category frequency ICF value, and the inverse harmful document frequency IHDF value as calculation factors to blur the boundaries between the specific categories contained within the objectionable categories, thereby improving the classification of objectionable versus normal categories. The present invention mainly takes the characteristics of this application environment into account and is more effective than other methods in feature selection for objectionable text information filtering.
Further, the category term weight CTW value expresses the importance of a feature term for the whole category: the larger the CTW value, the more important the feature term is for its class, and the better it represents the generic attributes of that category. The CTW value combines multiple factors: it not only considers the distribution of the feature term within and between classes, but also combines the distribution of the feature term between objectionable and normal text information, so it can weigh feature term weights more comprehensively.
Further, the statistic χ2(tj,Ci) is set as the first factor of the category term weight CTW value. It is the base value of the weight used by this feature selection method and serves as the basic selection criterion; the three later factors are all supplements to and refinements of χ2(tj,Ci). Choosing χ2(tj,Ci), with its underlying idea of hypothesis testing, makes the feature selection result more accurate.
Further, the improved inverse document frequency IDF is set as the second factor of the category term weight CTW value, to make up for the failure of the statistic χ2(tj,Ci) to consider the within-class and between-class distribution of a feature term. The larger the IDF value, the higher the frequency with which the feature term occurs in the specified documents and the less it occurs in other documents, which enhances the specificity of the selected feature term for the specified documents.
Further, the inverse category frequency ICF is set as the third factor of the category term weight CTW value, to further make up for the failure of χ2(tj,Ci) to consider the within-class and between-class distribution of a feature term. Unlike the IDF value, the larger the ICF value, the higher the frequency with which the feature term occurs in the specified category and the less it occurs in other categories, which enhances the specificity of the selected feature term for the specified category.
Further, the inverse harmful document frequency IHDF is set as the fourth factor of the category term weight CTW value, to make up for the failure of χ2(tj,Ci) to consider the distribution of a feature term between objectionable and normal text information. The larger the IHDF value, the higher the frequency with which the feature term occurs in objectionable text information and the less it occurs in normal text information, which enhances the specificity of the selected feature term for objectionable text information. By blurring the category boundaries within objectionable text information, feature terms that occur frequently across the objectionable text information as a whole have a greater chance of being selected.
In conclusion, the present invention solves the problem that the χ2 statistic feature selection method does not consider the within-class and between-class distribution of feature terms, and at the same time solves the problem of skew among the category data sets, thereby improving the effect of objectionable text information filtering.
The technical solution of the present invention is described in further detail below through the drawings and embodiments.
Description of the drawings
Fig. 1 is a flowchart of the present invention;
Fig. 2 is a detailed flowchart of the feature term category weight calculation of the present invention.
Specific implementation mode
The present invention provides a feature selection method for objectionable text information filtering, applied in the objectionable text information filtering process when classifying the objectionable categories: it is the feature selection method used to extract the feature terms of the objectionable categories. The method is based on the traditional χ2 statistic feature selection method and uses the category term weight CTW value as the basis of feature selection. The factors for calculating the CTW value include the traditional χ2 statistic plus three added factors: the improved inverse document frequency IDF value, the inverse category frequency ICF value, and the inverse harmful document frequency IHDF value. After the calculation is completed, the feature weight values (CTW) of the feature terms in the text information are sorted from large to small, and the optimal number of feature terms is chosen to form a new feature term set; this feature term set is the text information represented by the screened feature terms. The invention solves the problem that the χ2 statistic feature selection method does not consider the within-class and between-class distribution of feature terms, and at the same time solves the problem of skew among the category data sets, thereby enhancing the effect of objectionable text information filtering.
Referring to Fig. 1, the feature selection method for objectionable text information filtering of the present invention includes the following steps:
S1. Extract all feature terms from the categorized corpus and build the initial feature term set. The initial feature term set can reach tens of thousands or even hundreds of thousands of dimensions, so the next step is to reduce its dimensionality with a feature selection method, i.e. to screen the feature terms;
S2. Use the category term weight CTW (Category Term Weight) value as the basis of feature selection. It comprises χ2(tj,Ci), the χ2 statistic of feature term tj for category Ci; IDF (Inverse Document Frequency), the improved inverse document frequency; ICF (Inverse Category Frequency), the inverse category frequency; and IHDF (Inverse Harmful Document Frequency), the inverse harmful document frequency;
Referring to Fig. 2, the detailed flowchart of the weight calculation of the present invention comprises the following steps:
S201. Calculate the χ2(tj,Ci) value
Ci is any category among the objectionable categories, and tj is any feature term in the initial feature term set of class Ci;
Table 1 is the feature term and category relation table

| | Belongs to Ci | Does not belong to Ci |
| Contains tj | A | B |
| Does not contain tj | C | D |

As shown in Table 1, A is the document frequency of documents that both contain feature term tj and belong to category Ci; B is the document frequency of documents that contain tj but do not belong to Ci; C is the document frequency of documents in Ci that do not contain tj; D is the document frequency of documents that neither contain tj nor belong to Ci;
Define N as the total number of training documents, so N=A+B+C+D;
χ2(tj,Ci) = N×(AD−BC)² / [(A+B)×(C+D)×(A+C)×(B+D)]  (1)
In the feature selection process, the χ2 statistic is used to rank the feature terms within a category, so as to select the more representative feature terms with relatively large statistic values; the concrete numerical value of the χ2 statistic is therefore not important. For each category, the total number of training documents N, the number of documents A+C belonging to class Ci, and the number of documents B+D not belonging to class Ci are the same, so formula (1) can be simplified as shown in formula (2):
χ2(tj,Ci) = (AD−BC)² / [(A+B)×(C+D)]  (2)
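The A, B, C, D counts above plug into the statistic as sketched below. The formulas are the standard χ2 contingency forms, reconstructed here because the patent's formula images are not reproduced in this text:

```python
def chi_square(A, B, C, D):
    """Full chi-square statistic, formula (1).

    A: docs containing tj and belonging to Ci
    B: docs containing tj but not in Ci
    C: docs in Ci that do not contain tj
    D: docs neither containing tj nor in Ci
    """
    N = A + B + C + D
    denominator = (A + B) * (C + D) * (A + C) * (B + D)
    if denominator == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denominator


def chi_square_simplified(A, B, C, D):
    """Simplified form, formula (2): N, (A+C) and (B+D) are constant
    within one category, so dropping them preserves the ranking."""
    denominator = (A + B) * (C + D)
    if denominator == 0:
        return 0.0
    return (A * D - B * C) ** 2 / denominator
```

Since the two forms differ only by the constant factor N/[(A+C)(B+D)], sorting the feature terms of one category by either form yields the same order.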
S202. Calculate the IDF value
The traditional IDF value formula is shown in formula (3):
IDF = log(N/n)  (3)
From the IDF formula it can be seen that the more documents contain feature term tj, the closer the IDF value is to 0. Obviously this does not take into account the distribution of the feature term within and between classes; therefore the IDF formula is improved as shown in formula (4):
IDF = log[(N/n)×(m/(k+1))]  (4)
where N is the total number of training documents; n is the number of documents containing feature term tj; m is the number of documents in category Ci containing tj; k is the number of documents of categories other than Ci containing tj; and n=m+k.
Analyzing formula (4), let f(m)=m/(k+1); then:
if m1>m2, then f(m1)>f(m2). Hence f(m) is proportional to m and inversely proportional to k, achieving the intended improvement of considering the within-class and between-class distribution: this IDF value is high when feature term tj occurs frequently in category Ci and rarely in other categories.
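A minimal sketch of the improved IDF. The exact formula image is missing from this text, so the form log((N/n)·(m/(k+1))) is an assumption, chosen to match the stated behavior of f(m):

```python
import math

def improved_idf(N, m, k):
    """Improved inverse document frequency, formula (4): a reconstruction,
    since the formula image is missing from this text.  Assumes the form
    log((N/n) * (m/(k+1))), which matches the stated analysis: the value
    grows with m (docs of Ci containing tj) and shrinks with k (docs of
    other classes containing tj), with n = m + k."""
    n = m + k
    return math.log((N / n) * (m / (k + 1)))
```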
S203. Calculate the ICF value
In a training document set it can rarely be guaranteed that all categories have the same number of documents, so the distribution of documents over categories is skewed. When such imbalance occurs, for example when a certain category has few documents, IDF can hardly exert any suppressing effect, so the weight leans on the χ2 statistic and the resulting CTW value is too high;
Therefore the inverse category frequency ICF value is added to supply the missing suppression, as shown in formula (6):
ICF = log(p/q)  (6)
where p is the total number of categories in the training document set and q is the number of categories containing feature term tj. The more categories contain tj, the closer the ICF value is to 0, i.e. the less representative tj is and the lower its weight.
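A one-line sketch of the ICF factor, reconstructed as log(p/q) to match the described behavior (the formula image is not reproduced in this text):

```python
import math

def icf(p, q):
    """Inverse category frequency, formula (6), reconstructed as log(p/q):
    p = total categories, q = categories containing tj.  Equals 0 when tj
    appears in every category, i.e. the term carries no category signal."""
    return math.log(p / q)
```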
S204. Calculate the IHDF value
In the training document set, considering that the objectionable categories contain several rather similar classes, the classification process can easily scatter the objectionable feature terms, so that text information belonging to two objectionable categories cannot be accurately identified and filtered;
Therefore the inverse harmful document frequency IHDF value is added to supply the missing suppression, as shown in formula (7):
IHDF = log[(N/w)×(v/(l+1))]  (7)
where N is the total number of training documents; w is the number of documents containing feature term tj; v is the number of documents in all objectionable categories containing tj; l is the number of documents in the normal categories other than the objectionable categories containing tj; and w=v+l.
Analyzing formula (7), let f(v)=v/(l+1); then:
if v1>v2, then f(v1)>f(v2). Hence f(v) is proportional to v and inversely proportional to l. This IHDF value is high when feature term tj occurs frequently in the objectionable categories and rarely in the normal categories; it achieves the goal of blurring the boundaries between objectionable categories and improves the ability to distinguish objectionable from normal categories.
S205. Calculate the CTW value
The CTW value formula is shown in formula (9):
CTW=χ2(tj,Ci)×IDF×ICF×IHDF (9)
The CTW calculation formula includes the χ2 statistic plus three added factors: the improved inverse document frequency IDF (Inverse Document Frequency) value, the inverse category frequency ICF (Inverse Category Frequency) value, and the inverse harmful document frequency IHDF (Inverse Harmful Document Frequency) value.
Here χ2(tj,Ci) is the χ2 statistic of feature term tj in any objectionable category Ci; the IDF value balances the distribution of the feature term within and between all classes; the ICF value compensates for the category skew of the training document set; the IHDF value balances the distribution of the feature term between the objectionable and normal categories.
S3. Sort the feature terms of the initial feature term set screened in step S2 from high to low by CTW value, and choose the top a feature terms to form the final feature term set; this feature term set is the text information represented by the screened feature terms.
For a multi-category classification process, CTW values are simply calculated separately for all feature terms in each category; after sorting in descending order, the top a feature terms are chosen as the finally determined feature terms, where a can be set according to the specific situation.
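The S201 to S3 pipeline can be sketched end to end as follows. The χ2 term uses the simplified formula (2); the IDF, ICF, and IHDF forms are reconstructions (the formula images are missing from this text), and the per-term counts below are hypothetical inputs for illustration only:

```python
import math

def ctw(A, B, C, D, p, q, v):
    """Category term weight, formula (9): chi2 * IDF * ICF * IHDF.
    m = A (docs of Ci containing tj), k = B (docs of other classes
    containing tj), w = A + B (all docs containing tj), l = w - v."""
    N = A + B + C + D
    n = A + B
    l = n - v
    chi2 = (A * D - B * C) ** 2 / ((A + B) * (C + D))   # formula (2)
    idf = math.log((N / n) * (A / (B + 1)))             # improved IDF (reconstructed)
    icf = math.log(p / q)                               # inverse category frequency
    ihdf = math.log((N / n) * (v / (l + 1)))            # inverse harmful doc freq. (reconstructed)
    return chi2 * idf * icf * ihdf

def select_features(term_weights, a):
    """Step S3: sort terms by CTW descending and keep the top a."""
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:a]]

# Hypothetical counts for three candidate terms of one objectionable category.
weights = {
    "term_x": ctw(A=30, B=10, C=20, D=140, p=24, q=4, v=35),
    "term_y": ctw(A=5, B=50, C=45, D=100, p=24, q=20, v=20),
    "term_z": ctw(A=20, B=15, C=30, D=135, p=24, q=8, v=25),
}
top_terms = select_features(weights, a=2)
```

In this toy run, term_x (concentrated in Ci and in the objectionable categories) ranks highest, while term_y (spread over many categories and mostly in normal documents) ranks lowest and is dropped.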
In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention described and shown in the drawings can generally be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Taking a company's objectionable text information filtering system as an example, after analyzing the company's requirements, the text information before filtering is classified as follows. There are nineteen normal text information categories: electronic technology, communications, computer software, education, sports, culture, finance and economics, medical care, traffic, public security, military, energy, automobiles, tourism, the liquor industry, agriculture, forestry, fishery, and animal husbandry. There are five objectionable text information categories: drugs, gambling, pornography, reactionary content, and illegal marketing.
The feature selection method for objectionable text information filtering was tested in the above system. The test corpus contains 1920 documents: 80 in each of the 19 normal text information categories, 1520 in total, and 80 in each of the 5 objectionable text information categories, 400 in total.
The main evaluation indexes of this test are precision, recall, and the comprehensive evaluation index.
Table 2 is the assessment explanation table:

| | Relevant | Irrelevant |
| Retrieved | TP (True Positives) | FP (False Positives) |
| Not retrieved | FN (False Negatives) | TN (True Negatives) |

As shown in Table 2, TP denotes the number of documents retrieved that are relevant to the target category; FP denotes the number of documents retrieved that are irrelevant to the target category; FN denotes the number of documents not retrieved that are relevant to the target category; TN denotes the number of documents not retrieved that are irrelevant to the target category. The total number of documents in the test corpus is N=TP+FP+FN+TN.
The calculation formula of precision Pi is shown in formula (10):
Pi = TP/(TP+FP)  (10)
The calculation formula of recall Ri is shown in formula (11):
Ri = TP/(TP+FN)  (11)
The comprehensive evaluation index F0.5, taking the objectionable text information categories as the benchmark, is calculated as shown in formula (12):
F0.5 = (1+0.5²)×P×R / (0.5²×P+R)  (12)
where P is the precision under the corresponding benchmark and R is the recall under the corresponding benchmark.
The comprehensive evaluation index F2, taking the normal text information categories as the benchmark, is calculated as shown in formula (13):
F2 = (1+2²)×P×R / (2²×P+R)  (13)
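A minimal sketch of formulas (10) to (13), assuming the standard F-beta form; fed with the confusion counts from Table 3 below, it reproduces the reported 81.75% recall and, up to rounding, the 88.48% and 97.21% indexes:

```python
def precision(tp, fp):
    return tp / (tp + fp)          # formula (10)

def recall(tp, fn):
    return tp / (tp + fn)          # formula (11)

def f_beta(p, r, beta):
    """General F-measure; beta=0.5 favors precision, beta=2 favors recall."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# Objectionable benchmark (from Table 3 below): TP=327, FP=35, FN=73
f05 = f_beta(precision(327, 35), recall(327, 73), 0.5)   # formula (12)

# Normal benchmark: TP=1485, FP=73, FN=35
f2 = f_beta(precision(1485, 73), recall(1485, 35), 2.0)  # formula (13)
```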
Table 3 is the test result table

| | Objectionable text categories | Normal text categories |
| Judged as objectionable text category | 327 | 35 |
| Judged as normal text category | 73 | 1485 |
As can be seen from Table 3, of the 400 objectionable text information category test documents participating in the test, 73 were misjudged as normal text, i.e. 73 objectionable texts were not screened out; of the 1520 normal text information category test documents participating in the test, 35 were mistaken for objectionable text, i.e. 35 normal texts were wrongly deleted.
Table 4 is the misjudgment detail table
As shown in Table 4, in the objectionable text information categories, the "gambling" and "illegal marketing" classes had relatively many misjudged documents: 24 "gambling" class documents were not retrieved, and 14 normal documents were mistaken for the "gambling" class; 28 "illegal marketing" class documents were not retrieved, and 18 normal documents were mistaken for the "illegal marketing" class. Analyzing the reasons: some features of the "electronic technology", "communications", and "computer software" categories of normal text information are identical to feature terms of the "gambling" and "illegal marketing" categories of objectionable text information, which raises the confusion probability among these categories. For example, the "illegal marketing" category covers a wide range, including advertisements for online games, mobile games, and e-sports; when such advertisement documents serve as training documents, the selected feature terms may coincide with features of the "electronic technology" class. This can be adjusted by manually intervening in the class center vector feature terms, by increasing the amount of training corpus, or by improving the training process of the naive Bayes classifier.
Table 5 is the test evaluation index result table
As shown in Table 5, the system basically meets the high-precision requirement for retrieving objectionable text, and the recall also reaches 81.75%, which is within the acceptable range for filtering; the precision and recall of normal text retrieval both achieve fairly satisfactory results. The F0.5 and F2 values, which weight precision and recall differently, were calculated for objectionable text and normal text respectively, reaching 88.48% and 97.21%, which proves that the application effect of the present invention is good.
When filtering, the present invention comprehensively considers the distribution of feature terms within and between classes, making up for the shortcomings of the χ2 statistic method, so that the feature terms screened by this feature selection method are more representative. By combining the traditional feature selection method with the special circumstances of objectionable text filtering, a feature selection method with better effect is completed.
The above content merely illustrates the technical idea of the present invention and cannot limit the protection scope of the present invention. Any change made on the basis of the technical solution according to the technical idea proposed by the present invention falls within the protection scope of the claims of the present invention.
Claims (10)
1. A feature selection method for objectionable text information filtering, characterized in that: first, all characteristic items are extracted from the categorized corpus to build an initial characteristic item set; then, for each characteristic item t_j and any category C_i among the objectionable categories, a characteristic item classification weight value CTW is calculated from the χ² statistic χ²(t_j, C_i), the improved inverse document frequency IDF, the inverse classification frequency ICF and the inverse objectionable document frequency IHDF, and the characteristic items are screened using the CTW value as the basis for feature selection; finally, the characteristic items in the screened initial characteristic item set are sorted from high to low by CTW value, and a number of characteristic items are chosen to form the final characteristic item set.
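The overall flow of claim 1 — score, rank, truncate — can be sketched as follows. The terms, scores and cutoff k are invented for illustration; in the method itself each score would be the CTW value of claim 2.

```python
# Illustrative sketch of the claim-1 selection flow: rank all characteristic
# items by their CTW value and keep the top-k as the final set.
# The scores below are hypothetical stand-ins for real CTW values.

def select_features(ctw_scores, k):
    """Sort characteristic items by CTW, high to low, and return the top k."""
    ranked = sorted(ctw_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]

ctw_scores = {"casino": 9.1, "jackpot": 7.4, "weather": 0.3, "news": 0.1}
print(select_features(ctw_scores, 2))  # ['casino', 'jackpot']
```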
2. The feature selection method for objectionable text information filtering according to claim 1, characterized in that: the improved inverse document frequency IDF value is used to balance the distribution of a characteristic item among all the categories, including the normal ones; the inverse classification frequency ICF value is used to compensate for the category skew of the training document set; and the inverse objectionable document frequency IHDF value is used to balance the distribution of a characteristic item between the objectionable and normal categories; the characteristic item classification weight value CTW is then calculated as follows:
CTW = χ²(t_j, C_i) × IDF × ICF × IHDF.
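Claim 2's weight is a plain product of the four factors; a minimal sketch, with placeholder component values:

```python
def ctw(chi2, idf, icf, ihdf):
    # Claim 2: CTW = chi2(t_j, C_i) x IDF x ICF x IHDF
    return chi2 * idf * icf * ihdf

# Placeholder factor values, for illustration only
print(ctw(12.5, 2.0, 1.5, 1.2))  # 45.0
```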
3. The feature selection method for objectionable text information filtering according to claim 1 or 2, characterized in that: N is defined as the total number of training documents; C_i is any category among the objectionable categories; t_j is any characteristic item in the initial characteristic item set of category C_i; A is the frequency of documents that both contain the characteristic item t_j and belong to category C_i; B is the frequency of documents that contain the characteristic item t_j but do not belong to category C_i; C is the frequency of documents in category C_i that do not contain the characteristic item t_j; D is the frequency of documents that neither contain the characteristic item t_j nor belong to category C_i; the total number of training documents is then N = A + B + C + D.
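The four document frequencies of claim 3 form a 2×2 contingency table for a characteristic item t_j and a category C_i. A sketch over a toy corpus (documents as term sets, with labels invented for illustration):

```python
# Count the claim-3 document frequencies A, B, C, D for term t_j and
# category c_i over a list of (term_set, label) training documents.

def contingency(docs, t_j, c_i):
    """Return the claim-3 counts (A, B, C, D)."""
    A = B = C = D = 0
    for terms, label in docs:
        has_term = t_j in terms
        in_class = label == c_i
        if has_term and in_class:
            A += 1  # contains t_j and belongs to C_i
        elif has_term:
            B += 1  # contains t_j but outside C_i
        elif in_class:
            C += 1  # in C_i without t_j
        else:
            D += 1  # neither contains t_j nor belongs to C_i
    return A, B, C, D

docs = [({"casino", "win"}, "gambling"),
        ({"casino"}, "marketing"),
        ({"rain"}, "gambling"),
        ({"rain", "sun"}, "weather")]
A, B, C, D = contingency(docs, "casino", "gambling")
print(A, B, C, D, A + B + C + D == len(docs))  # 1 1 1 1 True
```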
4. The feature selection method for objectionable text information filtering according to claim 3, characterized in that χ²(t_j, C_i) is calculated as follows:
χ²(t_j, C_i) = N(AD − BC)² / [(A + C)(B + D)(A + B)(C + D)]
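The χ² statistic over the claim-3 counts is assumed here to take the standard textbook closed form N(AD − BC)² / ((A+C)(B+D)(A+B)(C+D)); the original formula image is not reproduced in this text.

```python
def chi_square(N, A, B, C, D):
    # Textbook chi-square over the claim-3 contingency counts (assumed form)
    numerator = N * (A * D - B * C) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator

# Strongly class-associated term: A=40, B=10, C=10, D=40 over N=100 docs
print(chi_square(100, 40, 10, 10, 40))  # 36.0
```

A term distributed independently of the class (AD = BC) scores zero, which is the behaviour that makes χ² useful for ranking class-indicative terms.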
5. The feature selection method for objectionable text information filtering according to claim 3, characterized in that the improved inverse document frequency IDF is calculated as follows:
where n is the number of documents containing the characteristic item t_j; m is the number of documents in category C_i containing the characteristic item t_j; k is the number of documents in categories other than C_i containing the characteristic item t_j; and n = m + k.
6. The feature selection method for objectionable text information filtering according to claim 3, characterized in that the inverse classification frequency ICF is calculated as follows:
where p is the total number of categories in the training document set and q is the number of categories containing the characteristic item t_j.
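The ICF formula itself is not reproduced in this extraction; the common textbook form of inverse class frequency, log(p/q), is sketched below purely as an assumed stand-in.

```python
import math

def icf(p, q):
    # Assumed common form: log(total categories / categories containing t_j).
    # The claimed formula may differ (e.g. it may add smoothing terms).
    return math.log(p / q)

# A term confined to 2 of 10 categories scores higher than one found in all 10
print(icf(10, 2) > icf(10, 10))  # True
```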
7. The feature selection method for objectionable text information filtering according to claim 3, characterized in that the inverse objectionable document frequency IHDF is calculated as follows:
where N is the total number of training documents; w is the number of documents containing the characteristic item t_j; v is the number of documents in all the objectionable categories containing the characteristic item t_j; l is the number of documents in the normal categories containing the characteristic item t_j; and w = v + l.
8. The feature selection method for objectionable text information filtering according to claim 1, characterized in that: the total number of test documents is N = TP + FP + FN + TN; the precision P_i and the recall R_i of the filtering effect are calculated separately, and from them a comprehensive evaluation index of the final filtering effect is obtained for verification; TP is the number of retrieved documents relevant to the target category; FP is the number of retrieved documents irrelevant to the target category; FN is the number of documents not retrieved but relevant to the target category; and TN is the number of documents neither retrieved nor relevant to the target category.
9. The feature selection method for objectionable text information filtering according to claim 8, characterized in that the precision P_i of the filtering effect is calculated as follows:
P_i = TP / (TP + FP)
and the recall R_i of the filtering effect is calculated as follows:
R_i = TP / (TP + FN).
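With the claim-8 counts, precision and recall take their standard forms, assumed here as P_i = TP/(TP+FP) and R_i = TP/(TP+FN):

```python
def precision(tp, fp):
    # P_i = TP / (TP + FP): share of retrieved documents that are relevant
    return tp / (tp + fp)

def recall(tp, fn):
    # R_i = TP / (TP + FN): share of relevant documents that were retrieved
    return tp / (tp + fn)

print(precision(80, 20), recall(80, 20))  # 0.8 0.8
```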
10. The feature selection method for objectionable text information filtering according to claim 9, characterized in that the comprehensive evaluation index F0.5, used when the objectionable text information categories are taken as the benchmark, is calculated as follows:
F0.5 = (1 + 0.5²) · P · R / (0.5² · P + R)
and the comprehensive evaluation index F2, used when the normal text information categories are taken as the benchmark, is calculated as follows:
F2 = (1 + 2²) · P · R / (2² · P + R)
where P is the precision under the corresponding benchmark and R is the recall under the corresponding benchmark.
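Both indices are instances of the general F_beta measure, (1 + β²)PR/(β²P + R), with β = 0.5 weighting precision more heavily and β = 2 weighting recall; a sketch with illustrative P and R values:

```python
def f_beta(p, r, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.9, 0.8
print(round(f_beta(p, r, 0.5), 4))  # 0.878  (leans toward precision)
print(round(f_beta(p, r, 2.0), 4))  # 0.8182 (leans toward recall)
```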
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810196195.XA CN108376130A (en) | 2018-03-09 | 2018-03-09 | A kind of objectionable text information filtering feature selection approach |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108376130A true CN108376130A (en) | 2018-08-07 |
Family
ID=63018434
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810196195.XA Pending CN108376130A (en) | 2018-03-09 | 2018-03-09 | A kind of objectionable text information filtering feature selection approach |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108376130A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033964A (en) * | 2011-01-13 | 2011-04-27 | 北京邮电大学 | Text classification method based on block partition and position weight |
CN102200981A (en) * | 2010-03-25 | 2011-09-28 | 三星电子(中国)研发中心 | Feature selection method and feature selection device for hierarchical text classification |
CN102567308A (en) * | 2011-12-20 | 2012-07-11 | 上海电机学院 | Information processing feature extracting method |
CN102622373A (en) * | 2011-01-31 | 2012-08-01 | 中国科学院声学研究所 | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm |
CN103886108A (en) * | 2014-04-13 | 2014-06-25 | 北京工业大学 | Feature selection and weight calculation method of imbalance text set |
KR101574027B1 (en) * | 2014-12-19 | 2015-12-03 | (주) 이비즈네트웍스 | System for blocking harmful program of smartphones |
CN105512311A (en) * | 2015-12-14 | 2016-04-20 | 北京工业大学 | Chi square statistic based self-adaption feature selection method |
CN105893380A (en) * | 2014-12-11 | 2016-08-24 | 成都网安科技发展有限公司 | Improved text classification characteristic selection method |
Non-Patent Citations (4)
Title |
---|
Zhang Yufang et al., "Improvement and Application of the TFIDF Method for Text Classification", Computer Engineering * |
Li Shuai et al., "A BPNN Short-Text Classification Method with an Improved Chi-Square Statistic", Journal of Guizhou University (Natural Science Edition) * |
Wang Meifang et al., "Feature Selection Method Based on TFDF", Computer Engineering and Design * |
Pei Yingbo et al., "Research on an Improved CHI Feature Selection Method in Text Classification", Computer Engineering and Applications * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105302911B (en) | A kind of data screening engine method for building up and data screening engine | |
CN108898479B (en) | Credit evaluation model construction method and device | |
CN103106275B (en) | The text classification Feature Selection method of feature based distributed intelligence | |
CN107563428A (en) | Classification of Polarimetric SAR Image method based on generation confrontation network | |
CN112102073A (en) | Credit risk control method and system, electronic device and readable storage medium | |
JP3888812B2 (en) | Fact data integration method and apparatus | |
CN108062478A (en) | The malicious code sorting technique that global characteristics visualization is combined with local feature | |
CN106709349B (en) | A kind of malicious code classification method based on various dimensions behavioural characteristic | |
CN106709513A (en) | Supervised machine learning-based security financing account identification method | |
CN108874927A (en) | Intrusion detection method based on hypergraph and random forest | |
CN108022146A (en) | Characteristic item processing method, device, the computer equipment of collage-credit data | |
CN105491444B (en) | A kind of data identifying processing method and device | |
CN104809393B (en) | A kind of support attack detecting algorithm based on popularity characteristic of division | |
CN111507385B (en) | Extensible network attack behavior classification method | |
CN115759640B (en) | Public service information processing system and method for smart city | |
CN110930218B (en) | Method and device for identifying fraudulent clients and electronic equipment | |
CN108197474A (en) | The classification of mobile terminal application and detection method | |
CN109635010A (en) | A kind of user characteristics and characterization factor extract, querying method and system | |
CN106780446A (en) | It is a kind of to mix distorted image quality evaluating method without reference | |
CN115174250B (en) | Network asset security assessment method and device, electronic equipment and storage medium | |
CN113626700A (en) | Lawyer recommendation method, system and equipment | |
CN109347719A (en) | A kind of image junk mail filtering method based on machine learning | |
CN110611655B (en) | Blacklist screening method and related product | |
CN111753299A (en) | Unbalanced malicious software detection method based on packet integration | |
CN113191407A (en) | Student economic condition grade classification method based on cost sensitivity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180807 |