CN108376130A - Feature selection method for objectionable text information filtering - Google Patents


Info

Publication number
CN108376130A
Authority
CN
China
Prior art keywords
classification
item
text information
characteristic item
inverse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810196195.XA
Other languages
Chinese (zh)
Inventor
闫茂德
赵文
柯伟
陈宇
李超飞
田野
林海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changan University
Original Assignee
Changan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changan University filed Critical Changan University
Priority to CN201810196195.XA priority Critical patent/CN108376130A/en
Publication of CN108376130A publication Critical patent/CN108376130A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature selection method for objectionable text information filtering. First, all feature terms are extracted from a categorized corpus to build an initial feature term set. Then, for each feature term t_j and each objectionable category C_i, a category term weight CTW is calculated from the χ² statistic χ²(t_j, C_i), the improved inverse document frequency IDF, the inverse category frequency ICF, and the inverse harmful document frequency IHDF, and the feature terms are screened using the CTW value as the criterion of feature selection. Finally, the feature terms of the initial feature term set screened in step S2 are sorted from high to low by CTW value, and the top a feature terms are chosen to form the final feature term set. The present invention solves the problem that the χ² statistic feature selection method does not consider the intra-class and inter-class distribution of feature terms, and at the same time solves the problem of class imbalance among the data sets of the various categories, thereby improving the effect of objectionable text information filtering.

Description

Feature selection method for objectionable text information filtering
Technical field
The invention belongs to the field of natural language processing, in particular to the field of text content filtering, and specifically relates to a feature selection method for objectionable text information filtering.
Background technology
In objectionable text information filtering, the "curse of dimensionality" is a major problem that has to be solved. Text processed by Chinese word segmentation yields an enormous number of feature terms; because the corpus is huge, the dimensionality of the training text set can reach tens of thousands or even hundreds of thousands. Such dimensionality imposes a serious computational burden: it raises the difficulty of computation, and the increased computing time directly degrades the effect of objectionable text information filtering. At the same time, a feature term set of such high dimension inevitably contains information noise, i.e., feature terms that actively harm classification. Feature dimensionality reduction is therefore an essential processing step, and the χ² statistic method has become one of the most widely used feature selection methods.
The χ² statistic method is commonly used to test whether two variables are independent. Under the null hypothesis that the two variables are independent, the larger the computed χ² value, the more the observation deviates from the null hypothesis, hence the smaller the probability that the null hypothesis holds and the stronger the association between the two variables. In text classification, the null hypothesis H0 is that the feature term and the category are mutually independent and unrelated; the alternative hypothesis H1 is that the feature term and the category are related. The larger the χ² statistic, i.e., the larger the deviation, the higher the degree of association between the feature term and the category; if the feature term and the category are independent of each other, the χ² statistic is 0.
Although the χ² statistic method is currently the best-performing feature selection method in text classification, it inevitably has defects, mainly the following two:
(1) It understates the weight of low-frequency words with clear category significance.
Some words have a low document frequency yet appear intensively in a small number of documents of one specific class. Because few documents contain such a word, its frequency is low, but it represents the class of those few documents very well and contributes greatly to classification. Since the χ² formula yields a small value for it, it is easily filtered out in the screening stage, so highly representative feature terms are mistakenly deleted.
(2) It overstates high-frequency words that appear often in other classes but rarely in the specified class.
Such high-frequency words appear often in the other classes of the training document set but seldom in the specified class, i.e., the value A is small, so they clearly do not represent the specified class well. In the calculation, BC is then much larger than AD, which directly yields a high χ² value; such words are hard to filter out during screening, so feature terms with weak representativeness are mistakenly retained.
Summary of the invention
In view of the above deficiencies in the prior art, the technical problem to be solved by the present invention is to provide a feature selection method for objectionable text information filtering that improves the traditional χ² statistic feature selection method in view of the particularity of objectionable text information filtering.
The present invention adopts the following technical scheme:
A feature selection method for objectionable text information filtering: first, all feature terms are extracted from a categorized corpus to build an initial feature term set; then, for each feature term t_j and each objectionable category C_i, the category term weight CTW is calculated from the χ² statistic χ²(t_j, C_i), the improved inverse document frequency IDF, the inverse category frequency ICF, and the inverse harmful document frequency IHDF, and the feature terms are screened using the CTW value as the criterion of feature selection; finally, the screened feature terms of the initial feature term set are sorted from high to low by CTW value, and the top a feature terms are chosen to form the final feature term set.
Specifically, the improved inverse document frequency IDF balances the intra-class and inter-class distribution of a feature term across all categories; the inverse category frequency ICF compensates for the class imbalance of the training document set; and the inverse harmful document frequency IHDF balances the distribution of a feature term between the objectionable and normal categories. The category term weight CTW is then calculated as follows:
CTW = χ²(t_j, C_i) × IDF × ICF × IHDF.
Further, N is defined as the total number of training documents; C_i is any category among the objectionable categories, and t_j is any feature term in the initial feature term set of class C_i. A is the frequency of documents that contain feature term t_j and belong to category C_i; B is the frequency of documents that contain t_j but do not belong to C_i; C is the frequency of documents in category C_i that do not contain t_j; D is the frequency of documents that neither contain t_j nor belong to C_i. Then the total number of training documents N = A + B + C + D.
Further, χ²(t_j, C_i) is calculated as follows:
χ²(t_j, C_i) = N(AD - BC)² / ((A + B)(C + D)(A + C)(B + D))
Further, the improved inverse document frequency IDF is calculated as follows:
Wherein, n is the number of documents containing feature term t_j; m is the number of documents in category C_i that contain t_j; k is the number of documents in classes other than C_i that contain t_j; and n = m + k.
Further, the inverse category frequency ICF is calculated as follows:
ICF = log(p / q)
Wherein, p is the total number of categories of the training document set, and q is the number of categories containing feature term t_j. The more categories contain t_j, the closer the ICF value is to 0, i.e., the less representative t_j is and the lower its weight.
Further, the inverse harmful document frequency IHDF is calculated as follows:
Wherein, N is the total number of training documents; w is the number of documents containing feature term t_j; v is the number of documents in all the objectionable categories that contain t_j; l is the number of documents in the normal categories other than the objectionable categories that contain t_j; and w = v + l.
Specifically, the total number of test documents is set as N = TP + FP + FN + TN; the precision P_i and the recall R_i of the filtering effect are calculated separately, and comprehensive evaluation indexes of the final filtering effect are then obtained for verification. TP is the number of documents retrieved and relevant to the target category; FP is the number of documents retrieved but irrelevant to the target category; FN is the number of documents not retrieved but relevant to the target category; TN is the number of documents not retrieved and irrelevant to the target category.
Further, the precision P_i of the filtering effect is calculated as follows:
P_i = TP / (TP + FP)
The recall R_i of the filtering effect is calculated as follows:
R_i = TP / (TP + FN)
Further, the comprehensive evaluation index F0.5, taking the categories of objectionable text information as the benchmark, is calculated as follows:
F0.5 = (1 + 0.5²) × P × R / (0.5² × P + R)
The comprehensive evaluation index F2, taking the categories of normal text information as the benchmark, is calculated as follows:
F2 = (1 + 2²) × P × R / (2² × P + R)
Wherein, P is the precision under the corresponding benchmark, and R is the recall under the corresponding benchmark.
Compared with the prior art, the present invention has at least the following beneficial effects:
The feature selection method for objectionable text information filtering of the present invention is applied to the classification of objectionable categories in objectionable text information filtering. The quality of the filtering result depends only on how well objectionable categories are distinguished from normal ones; the classification effect among the specific categories contained within the objectionable class does not enter into the filtering effect. The present invention therefore adds the improved inverse document frequency IDF, the inverse category frequency ICF, and the inverse harmful document frequency IHDF as calculation factors, which appropriately blur the demarcation lines between the specific categories contained in the objectionable class and so improve the classification of objectionable versus normal categories. Because the present invention takes the characteristics of this application environment into account, it is more effective than other methods for feature selection in objectionable text information filtering.
Further, the category term weight CTW is set to express how important a feature term is for the whole category: the larger the CTW value, the more important the term is for its class and the better it represents the generic attributes of that class as a category feature. The CTW value incorporates multiple factors: it considers not only the intra-class and inter-class distribution of the feature term but also its distribution between objectionable and normal text information, so it weighs feature terms more comprehensively.
Further, the statistic χ²(t_j, C_i) is set as the first factor of the category term weight CTW; it is the base value of the weight used by this feature selection method and serves as the basic selection criterion, while the three subsequent factors all supplement and refine it. Choosing the statistic χ²(t_j, C_i), which rests on the basic idea of hypothesis testing, makes the feature selection result more accurate.
Further, the improved inverse document frequency IDF is set as the second factor of the CTW value, to make up for the χ² statistic's insufficient consideration of the intra-class and inter-class distribution of feature terms. A larger IDF value indicates that the feature term appears frequently in the specified documents and rarely in other documents, which enhances the specificity of the selected feature term for the specified documents.
Further, the inverse category frequency ICF is set as the third factor of the CTW value, to further make up for the same insufficiency. Unlike the IDF value, a larger ICF value indicates that the feature term appears frequently in the specified category and rarely in other categories, which enhances the specificity of the selected feature term for the specified category.
Further, the inverse harmful document frequency IHDF is set as the fourth factor of the CTW value, to make up for the χ² statistic's insufficient consideration of the distribution of feature terms between objectionable and normal text information. A larger IHDF value indicates that the feature term appears frequently in objectionable text information and rarely in normal text information, which enhances the specificity of the selected feature term for objectionable text information; by blurring the category boundaries within objectionable text information, feature terms that occur frequently across the objectionable text information as a whole become more likely to be selected.
In conclusion, the present invention solves the problem that the χ² statistic feature selection method does not consider the intra-class and inter-class distribution of feature terms, and at the same time solves the problem of class imbalance among the data sets of the various categories, thereby improving the effect of objectionable text information filtering.
The technical scheme of the present invention is described in further detail below through the drawings and embodiments.
Description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the detailed flow chart of the calculation of the category term weight of a feature term of the present invention.
Detailed description of the embodiments
The present invention provides a feature selection method for objectionable text information filtering, applied in the objectionable text information filtering process as the feature selection method used to extract objectionable-category feature terms when classifying objectionable categories. The method is based on the traditional χ² statistic feature selection method and uses the category term weight CTW as the criterion of feature selection. Besides the traditional χ² statistic, the factors for calculating the CTW value include three additional factors: the improved inverse document frequency IDF, the inverse category frequency ICF, and the inverse harmful document frequency IHDF. After the calculation is completed, the feature weight values (CTW) of the feature terms in the text information are sorted from large to small, and the best number of feature terms is chosen to constitute a new feature term set; this feature term set is the text information represented by the screened feature terms. The invention solves the problem that the χ² statistic feature selection method does not consider the intra-class and inter-class distribution of feature terms, and at the same time solves the problem of class imbalance among the data sets of the various categories, thereby enhancing the effect of objectionable text information filtering.
Referring to Fig. 1, the feature selection method for objectionable text information filtering of the present invention includes the following steps:
S1. Extract all feature terms from the categorized corpus and build the initial feature term set. The initial feature term set can reach tens of thousands or even hundreds of thousands of dimensions, so it must next be reduced with a feature selection method, i.e., the feature terms must be screened.
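A minimal sketch of step S1 in Python, assuming the jieba tokenizer and a caller-supplied stop-word list (the patent names neither):

    # Step S1 sketch: build the initial feature term set from a categorized corpus.
    # jieba and the stop-word list are assumptions; the patent does not name them.
    import jieba

    def build_initial_feature_set(corpus, stopwords):
        # corpus: dict mapping category name -> list of document strings
        features = set()
        for documents in corpus.values():
            for doc in documents:
                for term in jieba.lcut(doc):
                    term = term.strip()
                    if term and term not in stopwords:
                        features.add(term)
        return features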
S2. Use the category term weight CTW (Category Term Weight) as the criterion of feature selection. It comprises χ²(t_j, C_i), the χ² statistic of feature term t_j for category C_i; IDF (Inverse Document Frequency), the improved inverse document frequency; ICF (Inverse Category Frequency), the inverse category frequency; and IHDF (Inverse Harmful Document Frequency), the inverse harmful document frequency.
Referring to Fig. 2, the detailed flow of the weight calculation of the present invention comprises the following steps:
S201. Calculate the χ²(t_j, C_i) value.
C_i is any category among the objectionable categories, and t_j is any feature term in the initial feature term set of class C_i.
Table 1: Relation between a feature term and a category

	Belongs to C_i	Does not belong to C_i
Contains t_j	A	B
Does not contain t_j	C	D

As shown in Table 1, A is the frequency of documents that contain feature term t_j and belong to category C_i; B is the frequency of documents that contain t_j but do not belong to C_i; C is the frequency of documents in category C_i that do not contain t_j; D is the frequency of documents that neither contain t_j nor belong to C_i.
Defining N as the total number of training documents, N = A + B + C + D.
In the feature selection process, the χ² statistic is used to rank the feature terms within a category, so as to select the more representative feature terms with relatively large statistics. The concrete numerical value of the χ² statistic is therefore unimportant: for each category, the total number of training documents N, the number of documents A + C belonging to class C_i, and the number of documents B + D not belonging to class C_i are the same for every feature term, so formula (1)
χ²(t_j, C_i) = N(AD - BC)² / ((A + B)(C + D)(A + C)(B + D))  (1)
can be simplified, as shown in formula (2):
χ²(t_j, C_i) = (AD - BC)² / ((A + B)(C + D))  (2)
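As an illustrative sketch, the simplified statistic of formula (2) computed directly from the document frequencies A, B, C and D:

    # Simplified chi-square of formula (2). Within one category the factors
    # N, (A + C) and (B + D) of formula (1) are the same for every term, so
    # dropping them does not change the ranking.
    def chi_square(a, b, c, d):
        denominator = (a + b) * (c + d)
        if denominator == 0:
            return 0.0
        return (a * d - b * c) ** 2 / denominator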
S202. Calculate the IDF value.
The traditional IDF value is shown in formula (3):
IDF = log(N / n)  (3)
From the IDF formula, the more documents contain feature term t_j, the closer the IDF value is to 0; obviously this does not consider the intra-class and inter-class distribution of the feature term. The IDF formula is therefore improved to the form shown in formula (4):
Wherein, N is the total number of training documents; n is the number of documents containing feature term t_j; m is the number of documents in category C_i that contain t_j; k is the number of documents in classes other than C_i that contain t_j; and n = m + k.
Analyzing formula (4) as a function f(m) of m: if m1 > m2, then f(m1) > f(m2). It follows that f(m) is proportional to m and inversely proportional to k, achieving the intended consideration of intra-class and inter-class distribution; that is, this IDF value is high exactly when feature term t_j appears frequently in category C_i and rarely in the other categories.
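The image carrying formula (4) is not reproduced in this text, so the sketch below implements only one plausible form satisfying the stated constraints (increasing in m, decreasing in k, with n = m + k); the exact published formula may differ, and the add-one smoothing is an assumption:

    import math

    # Hypothetical reconstruction of the improved IDF of formula (4):
    # grows with m (occurrences inside C_i), shrinks with k (occurrences
    # in other classes), with n = m + k. Not the verbatim patent formula.
    def improved_idf(total_docs, m, k):
        n = m + k                          # documents containing the term at all
        if n == 0:
            return 0.0
        return math.log(1 + total_docs * m / (n * (k + 1)))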
S203. Calculate the ICF value.
In a training document set, it can rarely be guaranteed that all categories contain the same number of documents, so the distribution of documents over the categories is skewed. When such imbalance occurs, for example when a certain category has few documents, the IDF can hardly exert any suppressing effect; the weight then leans too heavily on the χ² statistic, and the CTW value ends up too high.
The inverse category frequency ICF is therefore added to provide the missing suppression, as shown in formula (6):
ICF = log(p / q)  (6)
Wherein, p is the total number of categories of the training document set, and q is the number of categories containing feature term t_j. The more categories contain t_j, the closer the ICF value is to 0, i.e., the less representative t_j is and the lower its weight.
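A sketch of formula (6) as reconstructed above; the log(p/q) form is an inference from the stated limiting behavior (the value falls to 0 as q approaches p), not a verbatim copy of the missing image:

    import math

    # Inverse category frequency, assumed to be log(p / q): equals 0 when the
    # term occurs in every category (q = p) and grows as q shrinks.
    def icf(p, q):
        if q == 0:
            return 0.0
        return math.log(p / q)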
S204. Calculate the IHDF value.
In the training document set, the objectionable categories contain rather similar classes, so the classification process easily disperses the objectionable feature terms, and text information lying between two objectionable categories cannot be accurately identified and filtered.
The inverse harmful document frequency IHDF is therefore added to provide the missing suppression, as shown in formula (7):
Wherein, N is the total number of training documents; w is the number of documents containing feature term t_j; v is the number of documents in all the objectionable categories that contain t_j; l is the number of documents in the normal categories other than the objectionable categories that contain t_j; and w = v + l.
Analyzing formula (7) as a function f(v) of v: if v1 > v2, then f(v1) > f(v2). It follows that f(v) is proportional to v and inversely proportional to l; this IHDF value is high exactly when feature term t_j appears frequently in the objectionable categories and rarely in the normal categories. This achieves the purpose of blurring the boundaries between objectionable categories and improves the ability to distinguish objectionable categories from normal ones.
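The image carrying formula (7) is likewise missing; this sketch mirrors the improved-IDF reconstruction above, keeping only the stated monotonicity (increasing in v, decreasing in l, with w = v + l), and is an assumption rather than the verbatim formula:

    import math

    # Hypothetical reconstruction of the inverse harmful document frequency of
    # formula (7): grows with v (occurrences in objectionable classes), shrinks
    # with l (occurrences in normal classes), with w = v + l.
    def ihdf(total_docs, v, l):
        w = v + l                          # documents containing the term at all
        if w == 0:
            return 0.0
        return math.log(1 + total_docs * v / (w * (l + 1)))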
S205. Calculate the CTW value.
The CTW value is shown in formula (9):
CTW = χ²(t_j, C_i) × IDF × ICF × IHDF  (9)
Besides the χ² statistic, the CTW calculation formula adds three factors: the improved inverse document frequency IDF (Inverse Document Frequency), the inverse category frequency ICF (Inverse Category Frequency), and the inverse harmful document frequency IHDF (Inverse Harmful Document Frequency).
Here χ²(t_j, C_i) is the χ² statistic of feature term t_j in any objectionable category C_i; the IDF value balances the intra-class and inter-class distribution of the feature term across all categories; the ICF value compensates for the class imbalance of the training document set; and the IHDF value balances the distribution of the feature term between objectionable and normal categories.
S3. Sort the feature terms of the initial feature term set screened in step S2 from high to low by CTW value and choose the top a feature terms to form the final feature term set; this feature term set is the text information represented by the screened feature terms.
For a multi-category classification process, CTW values are calculated separately for all feature terms in each category; after sorting in descending order, the a foremost feature terms are the finally determined feature terms, where a can be set as the case requires.
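A sketch of steps S2 and S3 built on the helper functions above; the TermStats container and its field names are illustrative, not from the patent:

    from dataclasses import dataclass

    @dataclass
    class TermStats:
        # Counts defined in the description for one term t_j and category C_i.
        A: int    # docs containing t_j and belonging to C_i
        B: int    # docs containing t_j but outside C_i
        C: int    # docs in C_i not containing t_j
        D: int    # docs neither containing t_j nor in C_i
        m: int    # docs in C_i containing t_j
        k: int    # docs outside C_i containing t_j
        p: int    # total number of categories
        q: int    # categories containing t_j
        v: int    # objectionable-category docs containing t_j
        l: int    # normal-category docs containing t_j

    def select_features(stats, total_docs, a):
        # stats: dict mapping term -> TermStats. Returns the a top-CTW terms.
        scored = []
        for term, s in stats.items():
            ctw = (chi_square(s.A, s.B, s.C, s.D)
                   * improved_idf(total_docs, s.m, s.k)
                   * icf(s.p, s.q)
                   * ihdf(total_docs, s.v, s.l))
            scored.append((ctw, term))
        scored.sort(reverse=True)          # descending CTW, formula (9)
        return [term for _, term in scored[:a]]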
In order to make the objects, technical schemes and advantages of the embodiments of the present invention clearer, the technical schemes in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are a part of the embodiments of the present invention rather than all of them. The components of the embodiments of the present invention, as generally described and shown in the drawings here, can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Take a company's objectionable text information filtering system as an example. Analysis of the company's requirements gives the following pre-filtering text information categories. Normal text information: nineteen classes in total, namely electronic technology, communication, computer software, education, sports, culture, finance and economics, medical treatment, traffic, public security, military, energy, automobile, tourism, liquor industry, agriculture, forestry, fishery, and animal husbandry. Objectionable text information: five classes in total, namely drugs, gambling, pornography, reaction, and illegal marketing.
The feature selection method for objectionable text information filtering was tested on the above system. The test corpus contains 1920 documents: 80 for each of the 19 normal text information classes, 1520 in total, and 80 for each of the 5 objectionable text information classes, 400 in total.
The evaluation indexes of this test are mainly precision, recall, and the comprehensive evaluation indexes.
Table 2: Evaluation definitions

	Relevant	Irrelevant
Retrieved	TP (True Positives)	FP (False Positives)
Not retrieved	FN (False Negatives)	TN (True Negatives)

As shown in Table 2, TP denotes the number of documents retrieved and relevant to the target category; FP the number of documents retrieved but irrelevant to the target category; FN the number of documents not retrieved but relevant to the target category; TN the number of documents not retrieved and irrelevant to the target category. The total number of documents in the test corpus is N = TP + FP + FN + TN.
The precision P_i is calculated as shown in formula (10):
P_i = TP / (TP + FP)  (10)
The recall R_i is calculated as shown in formula (11):
R_i = TP / (TP + FN)  (11)
The comprehensive evaluation index F0.5, taking the categories of objectionable text information as the benchmark, is calculated as shown in formula (12):
F0.5 = (1 + 0.5²) × P × R / (0.5² × P + R)  (12)
Wherein, P is the precision under the corresponding benchmark, and R is the recall under the corresponding benchmark.
The comprehensive evaluation index F2, taking the categories of normal text information as the benchmark, is calculated as shown in formula (13):
F2 = (1 + 2²) × P × R / (2² × P + R)  (13)
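A sketch of formulas (10) through (13); F_beta with beta = 0.5 weights precision more heavily, and beta = 2 weights recall more heavily:

    # Formulas (10)-(13): precision, recall, and the weighted F-measure
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f_beta(p, r, beta):
        return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)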
Table 3: Test results

	Objectionable text category	Normal text category
Judged as objectionable text category	327	35
Judged as normal text category	73	1485

As can be seen from Table 3, of the 400 objectionable text information test documents, 73 were misjudged as normal text, i.e., 73 objectionable texts were not screened out; of the 1520 normal text information test documents, 35 were misjudged as objectionable text, i.e., 35 normal texts were wrongly deleted.
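Plugging the Table 3 counts into the helpers defined above reproduces the figures reported below:

    # Objectionable category as target: TP = 327, FP = 35, FN = 73 (Table 3).
    p_bad = precision(327, 35)             # ~0.9033
    r_bad = recall(327, 73)                # 0.8175
    f05 = f_beta(p_bad, r_bad, 0.5)        # ~0.8848

    # Normal category as target: TP = 1485, FP = 73, FN = 35.
    p_ok = precision(1485, 73)             # ~0.9531
    r_ok = recall(1485, 35)                # ~0.9770
    f2 = f_beta(p_ok, r_ok, 2)             # ~0.9721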
Table 4 gives the concrete misjudgment results.
As shown in Table 4, the "gambling" and "illegal marketing" classes of the objectionable text information categories account for relatively many misjudged documents: 24 "gambling" documents were not retrieved, and 14 normal documents were misjudged as "gambling"; 28 "illegal marketing" documents were not retrieved, and 18 normal documents were misjudged as "illegal marketing". Analyzing the reason, some feature terms of the normal categories "electronic technology", "communication" and "computer software" are identical to feature terms of the objectionable categories "gambling" and "illegal marketing", which raises the probability of confusing these categories. For example, the "illegal marketing" category covers a wide range, including advertisements for online games, mobile games and e-sports; when such advertisement documents serve as training documents, the feature terms selected from them may carry strong "electronic technology" characteristics. This can be adjusted by manually increasing or decreasing the feature terms of the class center vector, or by enlarging the training corpus and improving the training process of the naive Bayes classifier.
Table 5 gives the test evaluation index results.
As shown in Table 5, the system basically meets the high-precision requirement of objectionable text retrieval, and its recall reaches 81.75%, within the acceptable range for filtering; the precision and recall of normal text retrieval also achieve fairly ideal results. The F0.5 and F2 values, which place different emphasis on precision and recall, were calculated for objectionable text and normal text respectively, reaching 88.48% and 97.21%, which proves that the application effect of the present invention is good.
When filtering, the present invention takes full account of the intra-class and inter-class distribution of feature terms, remedying the shortcomings of the χ² statistic method, so that the feature terms screened by this feature selection method are more representative; by combining the traditional feature selection method with the special circumstances of objectionable text filtering, a more effective feature selection method is completed.
The above content merely illustrates the technical idea of the present invention and cannot limit the protection scope of the present invention; any change made on the basis of the technical scheme in accordance with the technical idea proposed by the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A feature selection method for objectionable text information filtering, characterized in that: first, all feature terms are extracted from a categorized corpus to build an initial feature term set; then, for each feature term t_j and each objectionable category C_i, a category term weight CTW is calculated from the χ² statistic χ²(t_j, C_i), the improved inverse document frequency IDF, the inverse category frequency ICF, and the inverse harmful document frequency IHDF, and the feature terms are screened using the CTW value as the criterion of feature selection; finally, the screened feature terms of the initial feature term set are sorted from high to low by CTW value, and the top a feature terms are chosen to form the final feature term set.
2. The feature selection method for objectionable text information filtering according to claim 1, characterized in that the improved inverse document frequency IDF balances the intra-class and inter-class distribution of a feature term across all categories, the inverse category frequency ICF compensates for the class imbalance of the training document set, and the inverse harmful document frequency IHDF balances the distribution of a feature term between the objectionable and normal categories; the category term weight CTW is then calculated as follows:
CTW = χ²(t_j, C_i) × IDF × ICF × IHDF.
3. The feature selection method for objectionable text information filtering according to claim 1 or 2, wherein N is defined as the total number of training documents, C_i is any category among the objectionable categories, t_j is any feature term in the initial feature term set of class C_i, A is the frequency of documents that contain feature term t_j and belong to category C_i, B is the frequency of documents that contain t_j but do not belong to C_i, C is the frequency of documents in category C_i that do not contain t_j, and D is the frequency of documents that neither contain t_j nor belong to C_i; then the total number of training documents N = A + B + C + D.
4. The feature selection method for objectionable text information filtering according to claim 3, characterized in that χ²(t_j, C_i) is calculated as follows:
χ²(t_j, C_i) = N(AD - BC)² / ((A + B)(C + D)(A + C)(B + D))
5. The feature selection method for objectionable text information filtering according to claim 3, characterized in that the improved inverse document frequency IDF is calculated as follows:
Wherein, n is the number of documents containing feature term t_j; m is the number of documents in category C_i that contain t_j; k is the number of documents in classes other than C_i that contain t_j; and n = m + k.
6. The feature selection method for objectionable text information filtering according to claim 3, characterized in that the inverse category frequency ICF is calculated as follows:
ICF = log(p / q)
Wherein, p is the total number of categories of the training document set, and q is the number of categories containing feature term t_j.
7. The feature selection method for objectionable text information filtering according to claim 3, characterized in that the inverse harmful document frequency IHDF is calculated as follows:
Wherein, N is the total number of training documents; w is the number of documents containing feature term t_j; v is the number of documents in all the objectionable categories that contain t_j; l is the number of documents in the normal categories other than the objectionable categories that contain t_j; and w = v + l.
8. The feature selection method for objectionable text information filtering according to claim 1, characterized in that the total number of test documents is set as N = TP + FP + FN + TN; the precision P_i and the recall R_i of the filtering effect are calculated separately, and comprehensive evaluation indexes of the final filtering effect are then obtained for verification; TP is the number of documents retrieved and relevant to the target category; FP is the number of documents retrieved but irrelevant to the target category; FN is the number of documents not retrieved but relevant to the target category; TN is the number of documents not retrieved and irrelevant to the target category.
9. The feature selection method for objectionable text information filtering according to claim 8, characterized in that the precision P_i of the filtering effect is calculated as follows:
P_i = TP / (TP + FP)
The recall R_i of the filtering effect is calculated as follows:
R_i = TP / (TP + FN)
10. The feature selection method for objectionable text information filtering according to claim 9, characterized in that the comprehensive evaluation index F0.5, taking the categories of objectionable text information as the benchmark, is calculated as follows:
F0.5 = (1 + 0.5²) × P × R / (0.5² × P + R)
The comprehensive evaluation index F2, taking the categories of normal text information as the benchmark, is calculated as follows:
F2 = (1 + 2²) × P × R / (2² × P + R)
Wherein, P is the precision under the corresponding benchmark, and R is the recall under the corresponding benchmark.
CN201810196195.XA 2018-03-09 2018-03-09 Feature selection method for objectionable text information filtering Pending CN108376130A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810196195.XA CN108376130A (en) 2018-03-09 2018-03-09 Feature selection method for objectionable text information filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810196195.XA CN108376130A (en) 2018-03-09 2018-03-09 Feature selection method for objectionable text information filtering

Publications (1)

Publication Number Publication Date
CN108376130A true CN108376130A (en) 2018-08-07

Family

ID=63018434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810196195.XA Pending CN108376130A (en) 2018-03-09 2018-03-09 A kind of objectionable text information filtering feature selection approach

Country Status (1)

Country Link
CN (1) CN108376130A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200981A (en) * 2010-03-25 2011-09-28 三星电子(中国)研发中心 Feature selection method and feature selection device for hierarchical text classification
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
CN102622373A (en) * 2011-01-31 2012-08-01 中国科学院声学研究所 Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN102567308A (en) * 2011-12-20 2012-07-11 上海电机学院 Information processing feature extracting method
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN105893380A (en) * 2014-12-11 2016-08-24 成都网安科技发展有限公司 Improved text classification characteristic selection method
KR101574027B1 (en) * 2014-12-19 2015-12-03 (주) 이비즈네트웍스 System for blocking harmful program of smartphones
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张玉芳 et al., "Improvement and application of the TFIDF method based on text classification", Computer Engineering *
李帅 et al., "BPNN short-text classification method with an improved chi-square statistic", Journal of Guizhou University (Natural Science Edition) *
王美方 et al., "Feature selection method based on TFDF", Computer Engineering and Design *
裴英博 et al., "Research on an improved CHI feature selection method in text classification", Computer Engineering and Applications *

Similar Documents

Publication Publication Date Title
CN105302911B (en) A kind of data screening engine method for building up and data screening engine
CN108898479B (en) Credit evaluation model construction method and device
CN103106275B (en) The text classification Feature Selection method of feature based distributed intelligence
CN107563428A (en) Classification of Polarimetric SAR Image method based on generation confrontation network
CN112102073A (en) Credit risk control method and system, electronic device and readable storage medium
JP3888812B2 (en) Fact data integration method and apparatus
CN108062478A (en) The malicious code sorting technique that global characteristics visualization is combined with local feature
CN106709349B (en) A kind of malicious code classification method based on various dimensions behavioural characteristic
CN106709513A (en) Supervised machine learning-based security financing account identification method
CN108874927A (en) Intrusion detection method based on hypergraph and random forest
CN108022146A (en) Characteristic item processing method, device, the computer equipment of collage-credit data
CN105491444B (en) A kind of data identifying processing method and device
CN104809393B (en) A kind of support attack detecting algorithm based on popularity characteristic of division
CN111507385B (en) Extensible network attack behavior classification method
CN115759640B (en) Public service information processing system and method for smart city
CN110930218B (en) Method and device for identifying fraudulent clients and electronic equipment
CN108197474A (en) The classification of mobile terminal application and detection method
CN109635010A (en) A kind of user characteristics and characterization factor extract, querying method and system
CN106780446A (en) It is a kind of to mix distorted image quality evaluating method without reference
CN115174250B (en) Network asset security assessment method and device, electronic equipment and storage medium
CN113626700A (en) Lawyer recommendation method, system and equipment
CN109347719A (en) A kind of image junk mail filtering method based on machine learning
CN110611655B (en) Blacklist screening method and related product
CN111753299A (en) Unbalanced malicious software detection method based on packet integration
CN113191407A (en) Student economic condition grade classification method based on cost sensitivity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180807)