CN108376130A - A kind of objectionable text information filtering feature selection approach - Google Patents
- Publication number: CN108376130A (application CN201810196195.XA)
- Authority: CN (China)
- Prior art keywords: classification, item, text information, characteristic item, inverse
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The invention discloses a feature selection method for objectionable text information filtering. All feature terms are first extracted from a categorized corpus to build an initial feature term set. Then, for a feature term tj and any category Ci among the objectionable categories, a category term weight CTW value is calculated from the χ2 statistic χ2(tj,Ci), the improved inverse document frequency IDF, the inverse category frequency ICF, and the inverse harmful document frequency IHDF, and feature terms are screened using the CTW value as the basis for feature selection. Finally, the feature terms of the initial feature term set screened in step S2 are sorted from high to low by CTW value, and the top a feature terms are chosen to form the final feature term set. The invention solves the problem that the χ2 statistic feature selection method does not consider the within-class and between-class distribution of feature terms, and at the same time compensates for skew among the category data sets, thereby improving the effect of objectionable text information filtering.
Description
Technical field
The invention belongs to the field of natural language processing, in particular to text content filtering, and specifically relates to a feature selection method for objectionable text information filtering.
Background technology
In objectionable text information filtering, the "curse of dimensionality" is a major problem that must be solved. Text that has passed through Chinese word segmentation contains an enormous number of feature terms; because the corpus is huge, the dimensionality of the training text set ranges from tens of thousands up to hundreds of thousands of dimensions. Such a huge dimensionality places a serious computational burden on the computer and undoubtedly raises the difficulty of computation, and the increased computing time directly reduces the effectiveness of objectionable text information filtering. At the same time, a feature term set of such high dimension inevitably contains information noise, i.e. feature terms that have a negative effect on classification. Feature dimensionality reduction therefore becomes an essential processing step, and the χ2 statistic method has become one of the most widely used feature selection methods.
The χ2 statistic method is commonly used to test whether two variables are independent. Under the null hypothesis that the two variables are independent, the larger the calculated χ2 statistic, the more reality deviates from the null hypothesis, the smaller the possibility that the null hypothesis holds, and the stronger the association between the two variables. In the field of text classification, the null hypothesis H0 is that the feature term and the category are mutually independent and unrelated; the alternative hypothesis H1 is that the feature term and the category are related. The larger the χ2 statistic, i.e. the larger the deviation, the higher the degree of association between the feature term and the category. If the feature term and the category are independent of each other, the χ2 statistic is 0.
Although the χ2 statistic method is the feature selection method with the best application effect in current text classification, it inevitably has defects, mainly the following two:
(1) It lowers the weight of low-frequency words that have clear category significance
Some low-frequency words have a low document frequency yet appear in large numbers in a specific small set of documents of a certain class. Because the number of documents in which such a word occurs is small, its word frequency is relatively low, but the word is highly representative: it represents the category of those few documents and contributes greatly to classification. Since the result computed by the χ2 statistic formula is small, such a word is easily filtered out in the screening stage, so that highly representative feature terms are mistakenly deleted.
(2) It raises the weight of high-frequency words that occur often in other categories but rarely in the specified category
Such high-frequency words occur often in the other classes of the training document set but rarely in the specified class, i.e. the value of A is small; clearly such words do not represent the specified class well. Since in the calculation BC is much larger than AD, the computed χ2 statistic is high, and the word is not easily filtered out in the screening process, so that feature terms without strong representativeness are mistakenly retained.
Invention content
In view of the above deficiencies in the prior art, the technical problem to be solved by the present invention is to provide a feature selection method for objectionable text information filtering that addresses the particularities of objectionable text information filtering and improves the traditional χ2 statistic feature selection method.
The present invention uses following technical scheme:
A feature selection method for objectionable text information filtering: first, all feature terms are extracted from the categorized corpus to build an initial feature term set; then, for a feature term tj and any category Ci among the objectionable categories, the category term weight CTW value is calculated from the χ2 statistic χ2(tj,Ci), the improved inverse document frequency IDF, the inverse category frequency ICF, and the inverse harmful document frequency IHDF, and feature terms are screened using the CTW value as the basis for feature selection; finally, the screened feature terms of the initial feature term set are sorted from high to low by CTW value, and the top a feature terms are chosen to form the final feature term set.
Specifically, the improved inverse document frequency IDF value balances the distribution of a feature term within and between all classes; the inverse category frequency ICF value compensates for the category skew of the training document set; the inverse harmful document frequency IHDF value balances the distribution of a feature term between the objectionable and normal categories. The category term weight CTW value is then calculated as follows:
CTW=χ2(tj,Ci)×IDF×ICF×IHDF.
Further, define N as the total number of training documents, Ci as any category among the objectionable categories, and tj as any feature term in the initial feature term set of class Ci. A is the document frequency of documents that both contain feature term tj and belong to category Ci; B is the document frequency of documents that contain tj but do not belong to Ci; C is the document frequency of documents in Ci that do not contain tj; D is the document frequency of documents that neither contain tj nor belong to Ci. The total number of training documents is then N=A+B+C+D.
Further, χ2(tj,Ci) is calculated as follows:
χ2(tj,Ci) = N×(AD−BC)² / [(A+B)×(C+D)×(A+C)×(B+D)]
Further, the improved inverse document frequency IDF is calculated as follows:
IDF = log[(N/n)×(m/(k+1))]
where n is the number of documents containing feature term tj; m is the number of documents in category Ci containing tj; k is the number of documents of categories other than Ci containing tj; and n=m+k.
Further, the inverse category frequency ICF is calculated as follows:
ICF = log(p/q)
where p is the total number of categories in the training document set and q is the number of categories containing feature term tj. The more categories contain tj, the closer the ICF value is to 0, i.e. the less representative tj is and the lower its weight.
Further, the inverse harmful document frequency IHDF is calculated as follows:
IHDF = log[(N/w)×(v/(l+1))]
where N is the total number of training documents; w is the number of documents containing feature term tj; v is the number of documents in all objectionable categories containing tj; l is the number of documents in the normal categories other than the objectionable categories containing tj; and w=v+l.
Specifically, let the total number of test documents be N=TP+FP+FN+TN. The precision Pi and recall Ri of the filtering effect are calculated separately, and from them the comprehensive evaluation index of the final filtering effect is obtained for verification. TP is the number of documents retrieved that are relevant to the target category; FP is the number of documents retrieved that are irrelevant to the target category; FN is the number of documents not retrieved that are relevant to the target category; TN is the number of documents not retrieved that are irrelevant to the target category.
Further, the precision Pi of the filtering effect is calculated as follows:
Pi = TP/(TP+FP)
The recall Ri of the filtering effect is calculated as follows:
Ri = TP/(TP+FN)
Further, the comprehensive evaluation index F0.5, taking the objectionable text information categories as the benchmark, is calculated as follows:
F0.5 = (1+0.5²)×P×R / (0.5²×P+R)
The comprehensive evaluation index F2, taking the normal text information categories as the benchmark, is calculated as follows:
F2 = (1+2²)×P×R / (2²×P+R)
where P is the precision under the corresponding benchmark and R is the recall under the corresponding benchmark.
Compared with the prior art, the present invention has at least the following beneficial effects:
The feature selection method for objectionable text information filtering of the present invention is applied to the classification of the objectionable categories in objectionable text information filtering. The quality of the filtering result depends only on distinguishing the objectionable categories from the normal categories; the classification effect among the specific categories contained within the objectionable categories is not counted in the filtering effect. The present invention therefore adds the improved inverse document frequency IDF value, the inverse category frequency ICF value, and the inverse harmful document frequency IHDF value as calculation factors to blur the boundaries between the specific categories contained within the objectionable categories, thereby improving the classification of objectionable versus normal categories. The present invention mainly takes the characteristics of this application environment into account and is more effective than other methods in feature selection for objectionable text information filtering.
Further, the category term weight CTW value expresses the importance of a feature term for the whole category: the larger the CTW value, the more important the feature term is for its class, and the better it represents the generic attributes of that category. The CTW value combines multiple factors: it not only considers the distribution of the feature term within and between classes, but also combines the distribution of the feature term between objectionable and normal text information, so it can weigh feature term weights more comprehensively.
Further, the statistic χ2(tj,Ci) is set as the first factor of the category term weight CTW value. It is the base value of the weight used by this feature selection method and serves as the basic selection criterion; the three later factors are all supplements to and refinements of χ2(tj,Ci). Choosing χ2(tj,Ci), with its underlying idea of hypothesis testing, makes the feature selection result more accurate.
Further, the improved inverse document frequency IDF is set as the second factor of the category term weight CTW value, to make up for the failure of the statistic χ2(tj,Ci) to consider the within-class and between-class distribution of a feature term. The larger the IDF value, the higher the frequency with which the feature term occurs in the specified documents and the less it occurs in other documents, which enhances the specificity of the selected feature term for the specified documents.
Further, the inverse category frequency ICF is set as the third factor of the category term weight CTW value, to further make up for the failure of χ2(tj,Ci) to consider the within-class and between-class distribution of a feature term. Unlike the IDF value, the larger the ICF value, the higher the frequency with which the feature term occurs in the specified category and the less it occurs in other categories, which enhances the specificity of the selected feature term for the specified category.
Further, the inverse harmful document frequency IHDF is set as the fourth factor of the category term weight CTW value, to make up for the failure of χ2(tj,Ci) to consider the distribution of a feature term between objectionable and normal text information. The larger the IHDF value, the higher the frequency with which the feature term occurs in objectionable text information and the less it occurs in normal text information, which enhances the specificity of the selected feature term for objectionable text information. By blurring the category boundaries within objectionable text information, feature terms that occur frequently across the objectionable text information as a whole have a greater chance of being selected.
In conclusion, the present invention solves the problem that the χ2 statistic feature selection method does not consider the within-class and between-class distribution of feature terms, and at the same time solves the problem of skew among the category data sets, thereby improving the effect of objectionable text information filtering.
The technical solution of the present invention is described in further detail below through the drawings and embodiments.
Description of the drawings
Fig. 1 is a flowchart of the present invention;
Fig. 2 is a detailed flowchart of the feature term category weight calculation of the present invention.
Specific implementation mode
The present invention provides a feature selection method for objectionable text information filtering, applied in the objectionable text information filtering process when classifying the objectionable categories: it is the feature selection method used to extract the feature terms of the objectionable categories. The method is based on the traditional χ2 statistic feature selection method and uses the category term weight CTW value as the basis of feature selection. The factors for calculating the CTW value include the traditional χ2 statistic plus three added factors: the improved inverse document frequency IDF value, the inverse category frequency ICF value, and the inverse harmful document frequency IHDF value. After the calculation is completed, the feature weight values (CTW) of the feature terms in the text information are sorted from large to small, and the optimal number of feature terms is chosen to form a new feature term set; this feature term set is the text information represented by the screened feature terms. The invention solves the problem that the χ2 statistic feature selection method does not consider the within-class and between-class distribution of feature terms, and at the same time solves the problem of skew among the category data sets, thereby enhancing the effect of objectionable text information filtering.
Referring to Fig. 1, the feature selection method for objectionable text information filtering of the present invention includes the following steps:
S1. Extract all feature terms from the categorized corpus and build the initial feature term set. The initial feature term set can reach tens of thousands or even hundreds of thousands of dimensions, so the next step is to reduce its dimensionality with a feature selection method, i.e. to screen the feature terms;
S2. Use the category term weight CTW (Category Term Weight) value as the basis of feature selection. It comprises χ2(tj,Ci), the χ2 statistic of feature term tj for category Ci; IDF (Inverse Document Frequency), the improved inverse document frequency; ICF (Inverse Category Frequency), the inverse category frequency; and IHDF (Inverse Harmful Document Frequency), the inverse harmful document frequency;
Referring to Fig. 2, the detailed flowchart of the weight calculation of the present invention comprises the following steps:
S201. Calculate the χ2(tj,Ci) value
Ci is any category among the objectionable categories, and tj is any feature term in the initial feature term set of class Ci;
Table 1 is the feature term and category relation table

| | Belongs to Ci | Does not belong to Ci |
| Contains tj | A | B |
| Does not contain tj | C | D |

As shown in Table 1, A is the document frequency of documents that both contain feature term tj and belong to category Ci; B is the document frequency of documents that contain tj but do not belong to Ci; C is the document frequency of documents in Ci that do not contain tj; D is the document frequency of documents that neither contain tj nor belong to Ci;
Define N as the total number of training documents, so N=A+B+C+D;
χ2(tj,Ci) = N×(AD−BC)² / [(A+B)×(C+D)×(A+C)×(B+D)]  (1)
In the feature selection process, the χ2 statistic is used to rank the feature terms within a category, so as to select the more representative feature terms with relatively large statistic values; the concrete numerical value of the χ2 statistic is therefore not important. For each category, the total number of training documents N, the number of documents A+C belonging to class Ci, and the number of documents B+D not belonging to class Ci are the same, so formula (1) can be simplified as shown in formula (2):
χ2(tj,Ci) = (AD−BC)² / [(A+B)×(C+D)]  (2)
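The A, B, C, D counts above plug into the statistic as sketched below. The formulas are the standard χ2 contingency forms, reconstructed here because the patent's formula images are not reproduced in this text:

```python
def chi_square(A, B, C, D):
    """Full chi-square statistic, formula (1).

    A: docs containing tj and belonging to Ci
    B: docs containing tj but not in Ci
    C: docs in Ci that do not contain tj
    D: docs neither containing tj nor in Ci
    """
    N = A + B + C + D
    denominator = (A + B) * (C + D) * (A + C) * (B + D)
    if denominator == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denominator


def chi_square_simplified(A, B, C, D):
    """Simplified form, formula (2): N, (A+C) and (B+D) are constant
    within one category, so dropping them preserves the ranking."""
    denominator = (A + B) * (C + D)
    if denominator == 0:
        return 0.0
    return (A * D - B * C) ** 2 / denominator
```

Since the two forms differ only by the constant factor N/[(A+C)(B+D)], sorting the feature terms of one category by either form yields the same order.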
S202. Calculate the IDF value
The traditional IDF value formula is shown in formula (3):
IDF = log(N/n)  (3)
From the IDF formula it can be seen that the more documents contain feature term tj, the closer the IDF value is to 0. Obviously this does not take into account the distribution of the feature term within and between classes; therefore the IDF formula is improved as shown in formula (4):
IDF = log[(N/n)×(m/(k+1))]  (4)
where N is the total number of training documents; n is the number of documents containing feature term tj; m is the number of documents in category Ci containing tj; k is the number of documents of categories other than Ci containing tj; and n=m+k.
Analyzing formula (4), let f(m)=m/(k+1); then:
if m1>m2, then f(m1)>f(m2). Hence f(m) is proportional to m and inversely proportional to k, achieving the intended improvement of considering the within-class and between-class distribution: this IDF value is high when feature term tj occurs frequently in category Ci and rarely in other categories.
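A minimal sketch of the improved IDF. The exact formula image is missing from this text, so the form log((N/n)·(m/(k+1))) is an assumption, chosen to match the stated behavior of f(m):

```python
import math

def improved_idf(N, m, k):
    """Improved inverse document frequency, formula (4): a reconstruction,
    since the formula image is missing from this text.  Assumes the form
    log((N/n) * (m/(k+1))), which matches the stated analysis: the value
    grows with m (docs of Ci containing tj) and shrinks with k (docs of
    other classes containing tj), with n = m + k."""
    n = m + k
    return math.log((N / n) * (m / (k + 1)))
```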
S203. Calculate the ICF value
In a training document set it can rarely be guaranteed that all categories have the same number of documents, so the distribution of documents over categories is skewed. When such imbalance occurs, for example when a certain category has few documents, IDF can hardly exert any suppressing effect, so the weight leans on the χ2 statistic and the resulting CTW value is too high;
Therefore the inverse category frequency ICF value is added to supply the missing suppression, as shown in formula (6):
ICF = log(p/q)  (6)
where p is the total number of categories in the training document set and q is the number of categories containing feature term tj. The more categories contain tj, the closer the ICF value is to 0, i.e. the less representative tj is and the lower its weight.
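A one-line sketch of the ICF factor, reconstructed as log(p/q) to match the described behavior (the formula image is not reproduced in this text):

```python
import math

def icf(p, q):
    """Inverse category frequency, formula (6), reconstructed as log(p/q):
    p = total categories, q = categories containing tj.  Equals 0 when tj
    appears in every category, i.e. the term carries no category signal."""
    return math.log(p / q)
```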
S204. Calculate the IHDF value
In the training document set, considering that the objectionable categories contain several rather similar classes, the classification process can easily scatter the objectionable feature terms, so that text information belonging to two objectionable categories cannot be accurately identified and filtered;
Therefore the inverse harmful document frequency IHDF value is added to supply the missing suppression, as shown in formula (7):
IHDF = log[(N/w)×(v/(l+1))]  (7)
where N is the total number of training documents; w is the number of documents containing feature term tj; v is the number of documents in all objectionable categories containing tj; l is the number of documents in the normal categories other than the objectionable categories containing tj; and w=v+l.
Analyzing formula (7), let f(v)=v/(l+1); then:
if v1>v2, then f(v1)>f(v2). Hence f(v) is proportional to v and inversely proportional to l. This IHDF value is high when feature term tj occurs frequently in the objectionable categories and rarely in the normal categories; it achieves the goal of blurring the boundaries between objectionable categories and improves the ability to distinguish objectionable from normal categories.
S205. Calculate the CTW value
The CTW value formula is shown in formula (9):
CTW=χ2(tj,Ci)×IDF×ICF×IHDF (9)
The CTW calculation formula includes the χ2 statistic plus three added factors: the improved inverse document frequency IDF (Inverse Document Frequency) value, the inverse category frequency ICF (Inverse Category Frequency) value, and the inverse harmful document frequency IHDF (Inverse Harmful Document Frequency) value.
Here χ2(tj,Ci) is the χ2 statistic of feature term tj in any objectionable category Ci; the IDF value balances the distribution of the feature term within and between all classes; the ICF value compensates for the category skew of the training document set; the IHDF value balances the distribution of the feature term between the objectionable and normal categories.
S3. Sort the feature terms of the initial feature term set screened in step S2 from high to low by CTW value, and choose the top a feature terms to form the final feature term set; this feature term set is the text information represented by the screened feature terms.
For a multi-category classification process, CTW values are simply calculated separately for all feature terms in each category; after sorting in descending order, the top a feature terms are chosen as the finally determined feature terms, where a can be set according to the specific situation.
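The S201 to S3 pipeline can be sketched end to end as follows. The χ2 term uses the simplified formula (2); the IDF, ICF, and IHDF forms are reconstructions (the formula images are missing from this text), and the per-term counts below are hypothetical inputs for illustration only:

```python
import math

def ctw(A, B, C, D, p, q, v):
    """Category term weight, formula (9): chi2 * IDF * ICF * IHDF.
    m = A (docs of Ci containing tj), k = B (docs of other classes
    containing tj), w = A + B (all docs containing tj), l = w - v."""
    N = A + B + C + D
    n = A + B
    l = n - v
    chi2 = (A * D - B * C) ** 2 / ((A + B) * (C + D))   # formula (2)
    idf = math.log((N / n) * (A / (B + 1)))             # improved IDF (reconstructed)
    icf = math.log(p / q)                               # inverse category frequency
    ihdf = math.log((N / n) * (v / (l + 1)))            # inverse harmful doc freq. (reconstructed)
    return chi2 * idf * icf * ihdf

def select_features(term_weights, a):
    """Step S3: sort terms by CTW descending and keep the top a."""
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:a]]

# Hypothetical counts for three candidate terms of one objectionable category.
weights = {
    "term_x": ctw(A=30, B=10, C=20, D=140, p=24, q=4, v=35),
    "term_y": ctw(A=5, B=50, C=45, D=100, p=24, q=20, v=20),
    "term_z": ctw(A=20, B=15, C=30, D=135, p=24, q=8, v=25),
}
top_terms = select_features(weights, a=2)
```

In this toy run, term_x (concentrated in Ci and in the objectionable categories) ranks highest, while term_y (spread over many categories and mostly in normal documents) ranks lowest and is dropped.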
In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention described and shown in the drawings can generally be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Taking a company's objectionable text information filtering system as an example, after analyzing the company's requirements, the text information before filtering is classified as follows. There are nineteen normal text information categories: electronic technology, communications, computer software, education, sports, culture, finance and economics, medical care, traffic, public security, military, energy, automobiles, tourism, the liquor industry, agriculture, forestry, fishery, and animal husbandry. There are five objectionable text information categories: drugs, gambling, pornography, reactionary content, and illegal marketing.
The feature selection method for objectionable text information filtering was tested in the above system. The test corpus contains 1920 documents: 80 in each of the 19 normal text information categories, 1520 in total, and 80 in each of the 5 objectionable text information categories, 400 in total.
The main evaluation indexes of this test are precision, recall, and the comprehensive evaluation index.
Table 2 is the assessment explanation table:

| | Relevant | Irrelevant |
| Retrieved | TP (True Positives) | FP (False Positives) |
| Not retrieved | FN (False Negatives) | TN (True Negatives) |

As shown in Table 2, TP denotes the number of documents retrieved that are relevant to the target category; FP denotes the number of documents retrieved that are irrelevant to the target category; FN denotes the number of documents not retrieved that are relevant to the target category; TN denotes the number of documents not retrieved that are irrelevant to the target category. The total number of documents in the test corpus is N=TP+FP+FN+TN.
The calculation formula of precision Pi is shown in formula (10):
Pi = TP/(TP+FP)  (10)
The calculation formula of recall Ri is shown in formula (11):
Ri = TP/(TP+FN)  (11)
The comprehensive evaluation index F0.5, taking the objectionable text information categories as the benchmark, is calculated as shown in formula (12):
F0.5 = (1+0.5²)×P×R / (0.5²×P+R)  (12)
where P is the precision under the corresponding benchmark and R is the recall under the corresponding benchmark.
The comprehensive evaluation index F2, taking the normal text information categories as the benchmark, is calculated as shown in formula (13):
F2 = (1+2²)×P×R / (2²×P+R)  (13)
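A minimal sketch of formulas (10) to (13), assuming the standard F-beta form; fed with the confusion counts from Table 3 below, it reproduces the reported 81.75% recall and, up to rounding, the 88.48% and 97.21% indexes:

```python
def precision(tp, fp):
    return tp / (tp + fp)          # formula (10)

def recall(tp, fn):
    return tp / (tp + fn)          # formula (11)

def f_beta(p, r, beta):
    """General F-measure; beta=0.5 favors precision, beta=2 favors recall."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# Objectionable benchmark (from Table 3 below): TP=327, FP=35, FN=73
f05 = f_beta(precision(327, 35), recall(327, 73), 0.5)   # formula (12)

# Normal benchmark: TP=1485, FP=73, FN=35
f2 = f_beta(precision(1485, 73), recall(1485, 35), 2.0)  # formula (13)
```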
Table 3 is the test result table

| | Objectionable text categories | Normal text categories |
| Judged as objectionable text category | 327 | 35 |
| Judged as normal text category | 73 | 1485 |
As can be seen from Table 3, of the 400 objectionable text information category test documents participating in the test, 73 were misjudged as normal text, i.e. 73 objectionable texts were not screened out; of the 1520 normal text information category test documents participating in the test, 35 were mistaken for objectionable text, i.e. 35 normal texts were wrongly deleted.
Table 4 is the misjudgment detail table
As shown in Table 4, in the objectionable text information categories, the "gambling" and "illegal marketing" classes had relatively many misjudged documents: 24 "gambling" class documents were not retrieved, and 14 normal documents were mistaken for the "gambling" class; 28 "illegal marketing" class documents were not retrieved, and 18 normal documents were mistaken for the "illegal marketing" class. Analyzing the reasons: some features of the "electronic technology", "communications", and "computer software" categories of normal text information are identical to feature terms of the "gambling" and "illegal marketing" categories of objectionable text information, which raises the confusion probability among these categories. For example, the "illegal marketing" category covers a wide range, including advertisements for online games, mobile games, and e-sports; when such advertisement documents serve as training documents, the selected feature terms may coincide with features of the "electronic technology" class. This can be adjusted by manually intervening in the class center vector feature terms, by increasing the amount of training corpus, or by improving the training process of the naive Bayes classifier.
Table 5 is the test evaluation index result table
As shown in Table 5, the system basically meets the high-precision requirement for retrieving objectionable text, and the recall also reaches 81.75%, which is within the acceptable range for filtering; the precision and recall of normal text retrieval both achieve fairly satisfactory results. The F0.5 and F2 values, which weight precision and recall differently, were calculated for objectionable text and normal text respectively, reaching 88.48% and 97.21%, which proves that the application effect of the present invention is good.
When filtering, the present invention comprehensively considers the distribution of feature terms within and between classes, making up for the shortcomings of the χ2 statistic method, so that the feature terms screened by this feature selection method are more representative. By combining the traditional feature selection method with the special circumstances of objectionable text filtering, a feature selection method with better effect is completed.
The above content merely illustrates the technical idea of the present invention and cannot limit the protection scope of the present invention. Any change made on the basis of the technical solution according to the technical idea proposed by the present invention falls within the protection scope of the claims of the present invention.
Claims (10)
1. A feature selection method for objectionable text information filtering, characterized in that: first, all characteristic items are extracted from the categorized corpus to build an initial characteristic item set; then, for each characteristic item t_j and any category C_i among the objectionable categories, a characteristic item classification weight value CTW is calculated from the χ² statistic χ²(t_j, C_i), the improved inverse document frequency IDF, the inverse classification frequency ICF and the inverse objectionable document frequency IHDF, and the characteristic items are screened using the CTW value as the basis for feature selection; finally, the characteristic items in the screened initial characteristic item set are sorted from high to low by CTW value, and a number of characteristic items are chosen to form the final characteristic item set.
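The overall flow of claim 1 — score, rank, truncate — can be sketched as follows. The terms, scores and cutoff k are invented for illustration; in the method itself each score would be the CTW value of claim 2.

```python
# Illustrative sketch of the claim-1 selection flow: rank all characteristic
# items by their CTW value and keep the top-k as the final set.
# The scores below are hypothetical stand-ins for real CTW values.

def select_features(ctw_scores, k):
    """Sort characteristic items by CTW, high to low, and return the top k."""
    ranked = sorted(ctw_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]

ctw_scores = {"casino": 9.1, "jackpot": 7.4, "weather": 0.3, "news": 0.1}
print(select_features(ctw_scores, 2))  # ['casino', 'jackpot']
```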
2. The feature selection method for objectionable text information filtering according to claim 1, characterized in that: the improved inverse document frequency IDF value is used to balance the distribution of a characteristic item among all the categories, including the normal ones; the inverse classification frequency ICF value is used to compensate for the category skew of the training document set; and the inverse objectionable document frequency IHDF value is used to balance the distribution of a characteristic item between the objectionable and normal categories; the characteristic item classification weight value CTW is then calculated as follows:
CTW = χ²(t_j, C_i) × IDF × ICF × IHDF.
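Claim 2's weight is a plain product of the four factors; a minimal sketch, with placeholder component values:

```python
def ctw(chi2, idf, icf, ihdf):
    # Claim 2: CTW = chi2(t_j, C_i) x IDF x ICF x IHDF
    return chi2 * idf * icf * ihdf

# Placeholder factor values, for illustration only
print(ctw(12.5, 2.0, 1.5, 1.2))  # 45.0
```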
3. The feature selection method for objectionable text information filtering according to claim 1 or 2, characterized in that: N is defined as the total number of training documents; C_i is any category among the objectionable categories; t_j is any characteristic item in the initial characteristic item set of category C_i; A is the frequency of documents that both contain the characteristic item t_j and belong to category C_i; B is the frequency of documents that contain the characteristic item t_j but do not belong to category C_i; C is the frequency of documents in category C_i that do not contain the characteristic item t_j; D is the frequency of documents that neither contain the characteristic item t_j nor belong to category C_i; the total number of training documents is then N = A + B + C + D.
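The four document frequencies of claim 3 form a 2×2 contingency table for a characteristic item t_j and a category C_i. A sketch over a toy corpus (documents as term sets, with labels invented for illustration):

```python
# Count the claim-3 document frequencies A, B, C, D for term t_j and
# category c_i over a list of (term_set, label) training documents.

def contingency(docs, t_j, c_i):
    """Return the claim-3 counts (A, B, C, D)."""
    A = B = C = D = 0
    for terms, label in docs:
        has_term = t_j in terms
        in_class = label == c_i
        if has_term and in_class:
            A += 1  # contains t_j and belongs to C_i
        elif has_term:
            B += 1  # contains t_j but outside C_i
        elif in_class:
            C += 1  # in C_i without t_j
        else:
            D += 1  # neither contains t_j nor belongs to C_i
    return A, B, C, D

docs = [({"casino", "win"}, "gambling"),
        ({"casino"}, "marketing"),
        ({"rain"}, "gambling"),
        ({"rain", "sun"}, "weather")]
A, B, C, D = contingency(docs, "casino", "gambling")
print(A, B, C, D, A + B + C + D == len(docs))  # 1 1 1 1 True
```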
4. The feature selection method for objectionable text information filtering according to claim 3, characterized in that χ²(t_j, C_i) is calculated as follows:
χ²(t_j, C_i) = N(AD − BC)² / [(A + C)(B + D)(A + B)(C + D)]
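The χ² statistic over the claim-3 counts is assumed here to take the standard textbook closed form N(AD − BC)² / ((A+C)(B+D)(A+B)(C+D)); the original formula image is not reproduced in this text.

```python
def chi_square(N, A, B, C, D):
    # Textbook chi-square over the claim-3 contingency counts (assumed form)
    numerator = N * (A * D - B * C) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator

# Strongly class-associated term: A=40, B=10, C=10, D=40 over N=100 docs
print(chi_square(100, 40, 10, 10, 40))  # 36.0
```

A term distributed independently of the class (AD = BC) scores zero, which is the behaviour that makes χ² useful for ranking class-indicative terms.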
5. The feature selection method for objectionable text information filtering according to claim 3, characterized in that the improved inverse document frequency IDF is calculated as follows:
where n is the number of documents containing the characteristic item t_j; m is the number of documents in category C_i containing the characteristic item t_j; k is the number of documents in categories other than C_i containing the characteristic item t_j; and n = m + k.
6. The feature selection method for objectionable text information filtering according to claim 3, characterized in that the inverse classification frequency ICF is calculated as follows:
where p is the total number of categories in the training document set and q is the number of categories containing the characteristic item t_j.
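The ICF formula itself is not reproduced in this extraction; the common textbook form of inverse class frequency, log(p/q), is sketched below purely as an assumed stand-in.

```python
import math

def icf(p, q):
    # Assumed common form: log(total categories / categories containing t_j).
    # The claimed formula may differ (e.g. it may add smoothing terms).
    return math.log(p / q)

# A term confined to 2 of 10 categories scores higher than one found in all 10
print(icf(10, 2) > icf(10, 10))  # True
```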
7. The feature selection method for objectionable text information filtering according to claim 3, characterized in that the inverse objectionable document frequency IHDF is calculated as follows:
where N is the total number of training documents; w is the number of documents containing the characteristic item t_j; v is the number of documents in all the objectionable categories containing the characteristic item t_j; l is the number of documents in the normal categories containing the characteristic item t_j; and w = v + l.
8. The feature selection method for objectionable text information filtering according to claim 1, characterized in that: the total number of test documents is N = TP + FP + FN + TN; the precision P_i and the recall R_i of the filtering effect are calculated separately, and from them a comprehensive evaluation index of the final filtering effect is obtained for verification; TP is the number of retrieved documents relevant to the target category; FP is the number of retrieved documents irrelevant to the target category; FN is the number of documents not retrieved but relevant to the target category; and TN is the number of documents neither retrieved nor relevant to the target category.
9. The feature selection method for objectionable text information filtering according to claim 8, characterized in that the precision P_i of the filtering effect is calculated as follows:
P_i = TP / (TP + FP)
and the recall R_i of the filtering effect is calculated as follows:
R_i = TP / (TP + FN).
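With the claim-8 counts, precision and recall take their standard forms, assumed here as P_i = TP/(TP+FP) and R_i = TP/(TP+FN):

```python
def precision(tp, fp):
    # P_i = TP / (TP + FP): share of retrieved documents that are relevant
    return tp / (tp + fp)

def recall(tp, fn):
    # R_i = TP / (TP + FN): share of relevant documents that were retrieved
    return tp / (tp + fn)

print(precision(80, 20), recall(80, 20))  # 0.8 0.8
```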
10. The feature selection method for objectionable text information filtering according to claim 9, characterized in that the comprehensive evaluation index F0.5, used when the objectionable text information categories are taken as the benchmark, is calculated as follows:
F0.5 = (1 + 0.5²) · P · R / (0.5² · P + R)
and the comprehensive evaluation index F2, used when the normal text information categories are taken as the benchmark, is calculated as follows:
F2 = (1 + 2²) · P · R / (2² · P + R)
where P is the precision under the corresponding benchmark and R is the recall under the corresponding benchmark.
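Both indices are instances of the general F_beta measure, (1 + β²)PR/(β²P + R), with β = 0.5 weighting precision more heavily and β = 2 weighting recall; a sketch with illustrative P and R values:

```python
def f_beta(p, r, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.9, 0.8
print(round(f_beta(p, r, 0.5), 4))  # 0.878  (leans toward precision)
print(round(f_beta(p, r, 2.0), 4))  # 0.8182 (leans toward recall)
```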
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810196195.XA CN108376130A (en) | 2018-03-09 | 2018-03-09 | A kind of objectionable text information filtering feature selection approach |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108376130A true CN108376130A (en) | 2018-08-07 |
Family
ID=63018434
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810196195.XA Pending CN108376130A (en) | 2018-03-09 | 2018-03-09 | A kind of objectionable text information filtering feature selection approach |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108376130A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033964A (en) * | 2011-01-13 | 2011-04-27 | 北京邮电大学 | Text classification method based on block partition and position weight |
CN102200981A (en) * | 2010-03-25 | 2011-09-28 | 三星电子(中国)研发中心 | Feature selection method and feature selection device for hierarchical text classification |
CN102567308A (en) * | 2011-12-20 | 2012-07-11 | 上海电机学院 | Information processing feature extracting method |
CN102622373A (en) * | 2011-01-31 | 2012-08-01 | 中国科学院声学研究所 | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm |
CN103886108A (en) * | 2014-04-13 | 2014-06-25 | 北京工业大学 | Feature selection and weight calculation method of imbalance text set |
KR101574027B1 (en) * | 2014-12-19 | 2015-12-03 | (주) 이비즈네트웍스 | System for blocking harmful program of smartphones |
CN105512311A (en) * | 2015-12-14 | 2016-04-20 | 北京工业大学 | Chi square statistic based self-adaption feature selection method |
CN105893380A (en) * | 2014-12-11 | 2016-08-24 | 成都网安科技发展有限公司 | Improved text classification characteristic selection method |
Non-Patent Citations (4)
Title |
---|
Zhang Yufang et al., "Improvement and Application of the TFIDF Method for Text Classification", Computer Engineering * |
Li Shuai et al., "A BPNN Short-Text Classification Method with an Improved Chi-Square Statistic", Journal of Guizhou University (Natural Science Edition) * |
Wang Meifang et al., "Feature Selection Method Based on TFDF", Computer Engineering and Design * |
Pei Yingbo et al., "Research on an Improved CHI Feature Selection Method in Text Classification", Computer Engineering and Applications * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105302911B (en) | A kind of data screening engine method for building up and data screening engine | |
CN108898479B (en) | Credit evaluation model construction method and device | |
CN103106275B (en) | The text classification Feature Selection method of feature based distributed intelligence | |
CN107563428A (en) | Classification of Polarimetric SAR Image method based on generation confrontation network | |
CN112102073A (en) | Credit risk control method and system, electronic device and readable storage medium | |
JP3888812B2 (en) | Fact data integration method and apparatus | |
CN108062478A (en) | The malicious code sorting technique that global characteristics visualization is combined with local feature | |
CN106709349B (en) | A kind of malicious code classification method based on various dimensions behavioural characteristic | |
CN106709513A (en) | Supervised machine learning-based security financing account identification method | |
CN108874927A (en) | Intrusion detection method based on hypergraph and random forest | |
CN108022146A (en) | Characteristic item processing method, device, the computer equipment of collage-credit data | |
CN105491444B (en) | A kind of data identifying processing method and device | |
CN104809393B (en) | A kind of support attack detecting algorithm based on popularity characteristic of division | |
CN111507385B (en) | Extensible network attack behavior classification method | |
CN115759640B (en) | Public service information processing system and method for smart city | |
CN110930218B (en) | Method and device for identifying fraudulent clients and electronic equipment | |
CN108197474A (en) | The classification of mobile terminal application and detection method | |
CN109635010A (en) | A kind of user characteristics and characterization factor extract, querying method and system | |
CN106780446A (en) | It is a kind of to mix distorted image quality evaluating method without reference | |
CN115174250B (en) | Network asset security assessment method and device, electronic equipment and storage medium | |
CN113626700A (en) | Lawyer recommendation method, system and equipment | |
CN109347719A (en) | A kind of image junk mail filtering method based on machine learning | |
CN110611655B (en) | Blacklist screening method and related product | |
CN111753299A (en) | Unbalanced malicious software detection method based on packet integration | |
CN113191407A (en) | Student economic condition grade classification method based on cost sensitivity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180807 |