CN103514174B - Text classification method and device - Google Patents
- Publication number
- CN103514174B (application CN201210206020.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- processed
- index
- rubbish
- dictionary
- Prior art date
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The present invention provides a text classification method and device. The method comprises: replacing each character in a text to be processed, other than word characters and digits, with a preset fixed character string; counting the total length of the replaced text and the length of the words it contains, and calculating the ratio of the word length to the total text length; using this ratio to calculate a cheating characteristic index of the text to be processed; and determining as spam any text whose cheating characteristic index exceeds a preset threshold. The present invention effectively compensates for the deficiencies of existing machine learning methods and improves classification accuracy.
Description
[technical field]
The present invention relates to the field of Internet information technology, and in particular to a text classification method and device.
[background art]
With the continuous development of the Internet, more and more users use it for information exchange and resource sharing, and the volume of online information keeps growing rapidly. However, the openness of the Internet also allows much undesirable content to circulate. Monitoring, filtering and classifying Internet information has therefore become a common requirement.
Comments (also called messages, replies, etc.) are a key function of Internet community products and an important channel for building an interactive atmosphere around a product. Because posting is cheap, reaches a wide audience and has a lasting effect, the comment function has been plagued by junk information from the start, including advertising links, promotional messages, and pornographic or other illegal content. Sending advertisements has even become an industry: posting has shifted from manual to automatic machine posting, and its techniques grow ever more sophisticated, continually breaking through anti-cheating measures.
Existing means of combating such junk information fall into two broad classes. One class is institutional, including manual review, user-level or user-group systems, strict user-admission rules, and so on. The other class is technical and can be divided into two approaches: mechanical approaches, including CAPTCHAs, sensitive-word filtering, frequency control, blacklists and similar-text strategies; and intelligent approaches, mainly machine learning methods such as naive Bayes, Fisher, support vector machines and neural networks.
Institutional measures mainly raise the cost of posting, but while they deter spam producers (spammers), they also make posting difficult for ordinary users, which is hard to accept in highly open communities. Mechanical methods target junk information with fixed characteristics and are easily bypassed once understood by a spammer. Intelligent methods have real recognition ability, but differences in learning mechanism, training corpus and the like make them difficult to implement; the main considerations are their accuracy and recall in separating junk information from normal information.
These existing approaches judge plain text fairly well, but all perform poorly on the following kinds of text. First, texts interleaved with many punctuation marks, blanks, tabs or newlines have a high misclassification rate. On the one hand, punctuation is generally filtered out during word segmentation, so such heavily punctuated spam texts cannot be judged from the segmentation result; on the other hand, punctuation marks and stop words carry no semantics and occur with similar frequency in normal and spam text, so they give no effective support to the posterior probability, hurting machine-classification accuracy. Second, texts consisting mainly of website links, QQ numbers or phone numbers also classify poorly, because segmentation cannot extract effective textual content. Third, meaningless replies are judged poorly; for example, a user cheating with an avatar advertisement may mass-post comments such as "good experience" or "works well, much praise". When such texts appear in large numbers in the spam training corpus, they also degrade the classification of normal comments, reducing accuracy.
[summary of the invention]
In view of this, the present invention provides a text classification method and device that can effectively and accurately recognize each class of text, improving classification accuracy.
Specific technical solution is as follows:
A text classification method, comprising the following steps:
S1, replacing each character in a text to be processed, other than word characters and digits, with a preset fixed character string;
S2, counting the total length of the replaced text and the length of the words it contains, and calculating the ratio of the word length to the total text length;
S3, using the ratio of the word length to the total text length to calculate a cheating characteristic index of the text to be processed;
S4, determining as spam any text to be processed whose cheating characteristic index exceeds a preset threshold.
According to a preferred embodiment of the present invention, before step S1 the method further includes: preprocessing the characters in the text to be processed other than word characters and digits, removing common punctuation marks; step S1 then replaces only the remaining characters with the preset fixed character string.
According to a preferred embodiment of the present invention, before step S3 the method further includes: finding the numbers of links, numbers and mailboxes contained in the text to be processed, obtaining a link weight and a number weight of the text; step S3 then weights the decreasing function of the ratio of word length to total text length with the obtained link weight and number weight to obtain the cheating characteristic index, such that the larger the link weight and number weight, the larger the cheating characteristic index.
According to a preferred embodiment of the present invention, the method further includes: determining the user name and IP address that submitted the text to be processed; looking up the submission data corresponding to that user name or IP address in a pre-built user-name dictionary or IP dictionary, and calculating a cheating user index from the numbers of normal and spam texts the user has submitted; step S4 then weights or multiplies the cheating user index with the cheating characteristic index, and determines as spam any text whose result exceeds the preset threshold.
According to a preferred embodiment of the present invention, the user-name dictionary and IP dictionary are established as follows: obtaining a sample corpus containing normal and spam texts; recording the user name and IP address that submitted each text in the sample corpus; and separately counting, for each user name and IP address, the numbers of uploaded texts marked as normal and as spam, generating the user-name dictionary and IP dictionary.
According to a preferred embodiment of the present invention, the method further includes: segmenting the text to be processed, looking up the normal probability and spam probability of each resulting term in a pre-built Bayes dictionary, and calculating the probability that the text is spam, as the Bayes index of the text; step S4 then multiplies or weights the cheating characteristic index with the Bayes index, and determines as spam any text whose result exceeds the preset threshold.
According to a preferred embodiment of the present invention, the method further includes: segmenting the text to be processed, looking up the normal probability and spam probability of each resulting term in a pre-built Fisher dictionary, and calculating the probability that the text is spam, as the Fisher index of the text; step S4 then multiplies or weights the cheating characteristic index with the Fisher index, and determines as spam any text whose result exceeds the preset threshold.
A text classification device, comprising:
a character replacement module, for replacing each character in a text to be processed, other than word characters and digits, with a preset fixed character string;
a text content calculation module, for counting the total length of the text after replacement by the character replacement module and the length of the words it contains, and calculating the ratio of the word length to the total text length;
a cheating characteristic index calculation module, for using the ratio of the word length to the total text length to calculate the cheating characteristic index of the text to be processed;
a classification module, for determining as spam any text to be processed whose cheating characteristic index exceeds a preset threshold.
According to a preferred embodiment of the present invention, the device further includes a preprocessing module for preprocessing the characters in the text to be processed other than word characters and digits, removing common punctuation marks; the character replacement module then replaces only the characters remaining after the preprocessing module's processing.
According to a preferred embodiment of the present invention, the device further includes: a numeric character statistics module, for finding the numbers of links, numbers and mailboxes contained in the text to be processed and obtaining the link weight and number weight of the text; the cheating characteristic index calculation module then weights the decreasing function of the ratio of word length to total text length with the obtained link weight and number weight to obtain the cheating characteristic index, such that the larger the link weight and number weight, the larger the cheating characteristic index.
According to a preferred embodiment of the present invention, the device further includes: a user information extraction module, for determining the user name and IP address that submitted the text to be processed; a cheating user index calculation module, for looking up the submission data corresponding to that user name or IP address in a pre-built user-name dictionary or IP dictionary, and calculating a cheating user index from the proportion of spam among the texts historically submitted by that user name or IP address; the classification module is further used to weight or multiply the cheating user index with the cheating characteristic index, and to determine as spam any text whose result exceeds the preset threshold.
According to a preferred embodiment of the present invention, the module that establishes the user-name dictionary and IP dictionary specifically includes: a corpus acquisition unit, for obtaining a sample corpus containing normal and spam texts; a user information recording unit, for recording the user name and IP address that submitted each text in the sample corpus; and a statistics unit, for separately counting, for each user name and IP address, the numbers of uploaded texts marked as normal and as spam, generating the user-name dictionary and IP dictionary.
According to a preferred embodiment of the present invention, the device further includes: a Bayes index calculation module, for segmenting the text to be processed, looking up the normal and spam probabilities of each resulting term in a pre-built Bayes dictionary, calculating the probability that the text is spam as its Bayes index, and supplying the Bayes index to the classification module; the classification module is further used to multiply or weight the cheating characteristic index with the Bayes index, and to determine as spam any text whose result exceeds the preset threshold.
According to a preferred embodiment of the present invention, the device further includes: a Fisher index calculation module, for segmenting the text to be processed, looking up the normal and spam probabilities of each resulting term in a pre-built Fisher dictionary, calculating the probability that the text is spam as its Fisher index, and supplying the Fisher index to the classification module; the classification module is further used to multiply or weight the cheating characteristic index with the Fisher index, and to determine as spam any text whose result exceeds the preset threshold.
As can be seen from the above technical solutions, the text classification method and device provided by the present invention use character replacement to amplify cheating features and use the user's submission behavior for auxiliary verification. They can efficiently identify texts interleaved with many special symbols, escape characters and links, as well as the meaningless texts mass-posted by avatar-advertising cheaters. Especially for short texts such as comments, replies and messages in communities or forums, recognition precision is improved; combined with machine learning methods, this effectively compensates for the deficiencies of existing machine learning methods and improves classification accuracy.
[description of the drawings]
Fig. 1 is a flow chart of the text classification method provided by embodiment one of the present invention;
Fig. 2 is a flow chart of the text classification method provided by embodiment two of the present invention;
Fig. 3a is a schematic diagram of a text and its user information;
Fig. 3b is a schematic diagram of the Bayes dictionary obtained by training with the Bayes classification method;
Fig. 3c is a schematic diagram of the Fisher dictionary obtained by training with the Fisher classification method;
Fig. 3d is a schematic diagram of the user-name dictionary obtained by counting;
Fig. 3e is a schematic diagram of the IP dictionary obtained by counting;
Fig. 4 is a flow chart of the text classification method provided by embodiment three of the present invention;
Fig. 5 is a schematic diagram of the result of processing the text of Fig. 3a in embodiment three of the present invention;
Fig. 6 is a schematic diagram of the text classification device provided by embodiment four of the present invention;
Fig. 7 is a schematic diagram of the text classification device provided by embodiment five of the present invention;
Fig. 8 is a schematic diagram of the text classification device provided by embodiment six of the present invention.
[specific embodiment]
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
Embodiment one
Fig. 1 is a flow chart of the text classification method provided in this embodiment. As shown in Fig. 1, the method comprises:
S101, replacing each character in the text to be processed, other than word characters and digits, with a preset fixed character string.
First, the special symbols in the text to be processed, such as English symbols (`<>-_`~@#$%^&*()+=|`), Chinese symbols, escape characters (\n, \t, \r) and spaces, are replaced with a fixed character string. The fixed character string can be, but is not limited to, the same character repeated to a length greater than 1, for example the string "$$$$" formed by four '$' characters. Each character in the text to be processed other than word characters and digits is replaced with "$$$$". For example, a text such as `<----method to make money---:""?":>>>/` becomes, after replacement with "$$$$", a string in which "method to make money" is surrounded by long runs of '$' characters, so the total length of the replaced text grows.
Since punctuation marks, stop words and the like are filtered out during word segmentation, this step first replaces these special symbols with the fixed character string, amplifying their features and increasing the influence of the special-symbol portion, before the effective text content is counted.
Of course, normal text may contain some common punctuation marks, such as "," and "。", which can legitimately appear in content and need not be replaced. Therefore, before the replacement in this step, the characters in the text to be processed other than word characters and digits can be preprocessed to first remove common punctuation, so that only the remaining characters are replaced with the preset fixed character string; this can improve efficiency and accuracy.
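As a minimal sketch of this replacement step, assuming "word characters" means Chinese characters (the Unicode ranges named in step S102) plus ASCII letters and digits — the function name and exact character classes are illustrative, not the patent's literal implementation:

```python
import re

# Replace every character that is not a Chinese character
# (U+4E00-U+9FA5, U+F900-U+FA2D), an ASCII letter, or a digit with the
# fixed string "$$$$", so special symbols inflate the replaced length.
FIXED_STRING = "$$$$"
WORD_OR_DIGIT = re.compile(r"[\u4e00-\u9fa5\uf900-\ufa2dA-Za-z0-9]")

def replace_special_chars(text: str) -> str:
    return "".join(
        ch if WORD_OR_DIGIT.match(ch) else FIXED_STRING
        for ch in text
    )
```

Under this sketch, `replace_special_chars("<a>")` yields `"$$$$a$$$$"`: each special symbol contributes four characters to the replaced text.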
S102, counting the total length of the replaced text and the length of the words it contains, and calculating the ratio of the word length to the total text length.
Because it may contain many special symbols such as English symbols, Chinese symbols or escape characters, the text to be processed grows under fixed-string replacement; the total length L_ORIG of the replaced text is counted.
The words contained in the text to be processed are found with a regular expression. For example, the encoding range of Chinese characters is 0x8140-0xFEFE in the extended national standard code (GBK), 0xA1A1-0xFEFE in the Chinese character set code GB2312, and \u4E00-\u9FA5 plus \uF900-\uFA2D in Unicode. A regular expression built from these encoding ranges finds the characters falling within them; the number of word characters found is counted as the word length L_CHAR.
The ratio K = L_CHAR / L_ORIG of the word length to the total text length is then calculated, i.e., the effective text content.
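The ratio computation can be sketched under the same assumptions, using the Unicode ranges the text names:

```python
import re

# L_ORIG is the length of the replaced text; L_CHAR counts only the
# Chinese characters (U+4E00-U+9FA5, U+F900-U+FA2D); K = L_CHAR/L_ORIG
# is the effective text-content ratio.
CHINESE = re.compile(r"[\u4e00-\u9fa5\uf900-\ufa2d]")

def effective_content_ratio(replaced_text: str) -> float:
    l_orig = len(replaced_text)
    l_char = len(CHINESE.findall(replaced_text))
    return l_char / l_orig if l_orig else 0.0
```

For a replaced text like `"$$$$赚钱$$$$"` (length 10 with 2 Chinese characters), K is 0.2, reflecting low effective content.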
S103, using the ratio of the word length to the total text length to calculate the cheating characteristic index of the text to be processed.
Spam texts interleaved with many punctuation marks, blanks, escape characters and the like generally contain few word characters and a large proportion of non-word symbols. In other words, the lower the effective text content, the more pronounced the text's cheating characteristics and the higher the probability that it is spam.
Therefore, a cheating characteristic index function is built in advance from the ratio K of word length to total text length; specifically, a decreasing function of K is used as the cheating characteristic index function. The decreasing function can be, but is not limited to:
Score_feature1 = 2 / (1 + K) (formula 1)
In formula 1, since the value range of the ratio K is [0, 1], the range of (1+K) is [1, 2] and the range of 1/(1+K) is [0.5, 1]; the score is thus distributed over [1, 2]. The lower the value of K, the higher the score, the more pronounced the text's cheating characteristics, and the higher the probability that the text is spam.
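One decreasing function consistent with the ranges described (1/(1+K) in [0.5, 1], score distributed over [1, 2]) is Score = 2/(1+K); a minimal sketch under that assumption:

```python
# Map the effective-content ratio K in [0, 1] onto a cheating score in
# [1, 2]: the lower K, the higher the score.
def cheating_feature_index(k: float) -> float:
    return 2.0 / (1.0 + k)
```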
S104, determining as spam any text to be processed whose cheating characteristic index exceeds the preset threshold.
The preset threshold is a classification threshold obtained by observing the data set to be processed. According to the needs of the practical application scenario and prior experience, a preset threshold for the cheating characteristic index is set; if the cheating characteristic index Score_feature1 calculated in step S103 exceeds the preset threshold, the text is identified as spam.
Thus, the method provided in this embodiment can effectively identify spam texts interleaved with many special characters such as punctuation, spaces and escape characters.
It should be noted that the remaining texts, i.e., those whose cheating characteristic index Score_feature1 does not exceed the preset threshold, can be further classified with existing methods such as Bayes classification or support vector machines.
Embodiment two
Fig. 2 is a flow chart of the text classification method provided in this embodiment. As shown in Fig. 2, the method comprises:
Step S201, replacing the characters in the text to be processed, other than word characters and digits, with a preset fixed character string.
This step is identical to step S101 in embodiment one and is not repeated here.
Step S202, counting the total length of the replaced text and the length of the words it contains, and calculating a text proportion score from the ratio of the word length to the total text length.
The ratio K of word length to total text length is calculated as in step S102 of embodiment one, i.e., K = L_CHAR / L_ORIG.
The text proportion score Score_char is calculated from the ratio of word length to total text length; one possible (but not the only) formula is:
(formula 2)
Step S203, finding the numbers of links, numbers and mailboxes contained in the text to be processed, and obtaining the link weight and number weight of the text.
The numbers of links, QQ numbers, phone numbers and mailboxes are found with regular expressions. For example, in Python, the regular expression re.compile(r"[0-9]{5,9}") can be used to find phone or QQ numbers, re.compile(r"\w+@\w+\.\w+") to find e-mail addresses, and a pattern along the lines of re.compile(r"[http:/]*\w+[\w.]+[comnedugvtn]{2,6}") to find website links.
The link weight Score_link can be, but is not limited to, the total number of links and mailboxes contained in the text; correspondingly, the number weight Score_digit can be, but is not limited to, the total number of codes such as QQ numbers and phone numbers contained in the text.
Step S204, weighting the decreasing function of the ratio of word length to total text length with the obtained link weight and number weight, to obtain the cheating characteristic index of the text to be processed.
A specific weighting formula can be, but is not limited to:
Score_feature2 = Score_char + 0.5·Score_link + 0.5·Score_digit (formula 3)
As the formula shows, the larger the link weight Score_link and the number weight Score_digit, the larger the cheating characteristic index; likewise, the smaller the ratio of word length to total text length, the larger the calculated text proportion score Score_char, and the larger the cheating characteristic index.
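Steps S203 and S204 can be sketched together; the regular expressions below are tidied approximations in the spirit of those shown, not the patent's exact patterns:

```python
import re

# Count digit strings (QQ/phone numbers), e-mail addresses and links,
# then combine them with the text proportion score per formula 3.
NUMBER = re.compile(r"[0-9]{5,9}")    # QQ or phone numbers
EMAIL = re.compile(r"\w+@\w+\.\w+")   # mailbox addresses
URL = re.compile(r"(?:https?://)?[\w-]+(?:\.[\w-]+)*\.(?:com|net|edu|gov|cn)\b")

def cheating_feature2(text: str, score_char: float) -> float:
    score_link = len(URL.findall(text)) + len(EMAIL.findall(text))
    score_digit = len(NUMBER.findall(text))
    # Formula 3: Score_char + 0.5*Score_link + 0.5*Score_digit
    return score_char + 0.5 * score_link + 0.5 * score_digit
```

A text containing one QQ number thus gains 0.5 over its bare text score, so contact-laden texts score higher.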
Step S205, determining the user name and IP address that submitted the text to be processed.
For the text to be processed, the user information is obtained to determine the submitting user name and IP address.
Fig. 3a is a schematic diagram of a text and its user information. As shown in Fig. 3a, the text was submitted by the user named sx1816 from user IP 114.228.210.130, with content "East China Zone One - sun, moon and stars 282zzd8010101060000700067b4t0zmcb50e0".
Step S206, looking up the submission data corresponding to the user name or IP address in the pre-built user-name dictionary or IP dictionary, and calculating the cheating user index from the numbers of normal and spam texts the user has submitted.
The user-name dictionary and IP dictionary are obtained in advance by classifying and counting historical data of a certain scale: for each submitting user name and IP address, the numbers of normal and spam texts among that user's submissions are counted and recorded, generating the user-name dictionary and IP dictionary respectively.
The cheating user index is calculated from the proportion of spam among the texts historically submitted by the user name or IP address of the text to be processed.
Using the user name and IP address determined in step S205, the corresponding submission data are looked up in the pre-built user-name dictionary and IP dictionary; the recorded numbers of normal and spam texts the user has submitted are denoted h_num and s_num respectively.
The cheating user index Score_user can be, but is not limited to, calculated with the following formula:
(formula 4)
Here T is a baseline for the spam-text quantity, the dividing line between the spam counts observed for normal users and junk users; its value can be chosen according to the actual situation, e.g. between 6 and 10.
As formula 4 shows, the cheating user index is driven mainly by the number of spam texts in the user's submission history and by spam's share of all submissions. For junk users, both indicators are usually high; even if some of a normal user's comments are marked as spam, the spam proportion stays low, and the resulting cheating user index stays low.
Of course, features of the user name itself can also be considered when calculating the cheating user index. Cheating users often register by machine, so their user names show certain patterns, for example letters and digits composed by fixed rules, or wordings such as "add Q", "QQ", "make friends", "contact", "add me". The cheating user index of users with such features can be further adjusted upward.
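The body of formula 4 is not reproduced here; purely as a hypothetical illustration of the stated behavior (the index grows with the user's spam count relative to the baseline T and with spam's share of all submissions), one possible shape is:

```python
# Illustrative only -- NOT the patent's actual formula 4. Both factors
# lie in [0, 1]: the spam count saturating at the baseline T, and the
# spam share s_num / (h_num + s_num) of all submissions.
def cheating_user_index(h_num: int, s_num: int, t: int = 8) -> float:
    total = h_num + s_num
    if total == 0:
        return 0.0
    return min(s_num / t, 1.0) * (s_num / total)
```

With this shape, a user with 100 normal texts and 2 spam marks scores near zero, while a user whose 20 submissions are all spam scores 1.0, matching the behavior described above.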
Step S207, weighting or multiplying the cheating user index with the cheating characteristic index, and determining as spam any text whose result exceeds the preset threshold.
The final score of the text to be processed is obtained by weighting or multiplying the cheating characteristic index from step S204 with the cheating user index from step S206; this embodiment calculates the final score by multiplication.
The preset threshold is again a classification threshold obtained by observing the data set to be processed; if the weighted or multiplied result exceeds the preset threshold, the text is identified as spam.
Embodiment three
In this embodiment, the Bayes dictionary, Fisher dictionary, user-name dictionary and IP dictionary are first built in advance by generating the dictionaries offline. The specific establishment method comprises:
Step S301, obtaining a sample corpus containing normal and spam texts.
The sample corpus can use existing historical data of a certain scale, formed from the texts, comments or replies submitted under the different user names or IP addresses accumulated on the network.
The classification into normal and spam texts can be obtained with an existing classification method, or by hand-marking, with administrators or other users marking the texts in the sample corpus that are spam; unmarked texts are normal.
Step S302, performing word segmentation on the texts in the sample corpus, counting the occurrences of each term, calculating for each term the probability of appearing in normal text and in spam text, and generating the classification dictionaries.
The machine learning method can be an existing Bayes classification method, Fisher classification method, or the like, each forming its corresponding classification dictionary. Fig. 3b is a schematic diagram of the Bayes dictionary obtained by training with the Bayes classification method, and Fig. 3c of the Fisher dictionary obtained by training with the Fisher classification method. As shown in Figs. 3b and 3c, each dictionary contains the terms together with each term's normal probability and spam probability.
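Dictionary construction (steps S301-S302) can be sketched in a naive-Bayes flavor; whitespace splitting stands in for the Chinese word segmenter the patent assumes, and the per-class document frequencies are one simple choice of "probability":

```python
from collections import defaultdict

# Count, per term, how many normal and spam sample texts contain it,
# and store each term's (normal probability, spam probability) pair,
# i.e. the structure of the dictionary in Fig. 3b.
def build_classification_dictionary(samples):
    # samples: list of (text, is_spam) pairs
    doc_counts = defaultdict(lambda: [0, 0])  # term -> [normal, spam]
    for text, is_spam in samples:
        for term in set(text.split()):
            doc_counts[term][1 if is_spam else 0] += 1
    n_normal = sum(1 for _, s in samples if not s) or 1
    n_spam = sum(1 for _, s in samples if s) or 1
    return {term: (c[0] / n_normal, c[1] / n_spam)
            for term, c in doc_counts.items()}
```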
Step S303, recording the user name and IP address that submitted each text in the sample corpus.
The user name and IP address of each text are extracted from the sample corpus. Fig. 3a is a schematic diagram of a text and its user information; as shown in Fig. 3a, the text was submitted by the user named sx1816 from user IP 114.228.210.130, with content "East China Zone One - sun, moon and stars 282zzd8010101060000700067b4t0zmcb50e0".
Step S304, separately counting, for each user name and IP address, the numbers of uploaded texts marked as normal and as spam, generating the user-name dictionary and IP dictionary.
Fig. 3d is a schematic diagram of the resulting user-name dictionary; as shown in Fig. 3d, it contains each user name with the corresponding numbers of normal and spam texts submitted.
Fig. 3e is a schematic diagram of the resulting IP dictionary; as shown in Fig. 3e, it contains each IP with the corresponding numbers of normal and spam texts submitted.
The offline learning that generates the dictionaries can run periodically during the service cycle, so that learning happens automatically and the system is self-learning. In addition, several dictionary versions can be backed up continuously, ensuring that when the server restarts, the previous data can be reloaded. When a newer dictionary becomes available, the service merges it with the dictionary already loaded: for an index (key) that already exists, the value is replaced; indexes that did not exist before are added.
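The key-based merge described above can be sketched as follows. This is a minimal illustration; the dictionary layout (term → probability pair) and the function name are assumptions, not taken from the patent:

```python
def merge_dictionaries(loaded, newer):
    """Merge a newer dictionary into the loaded one: replace the values of
    indexes (keys) that already exist, and add indexes that did not exist."""
    merged = dict(loaded)
    for key, value in newer.items():
        merged[key] = value  # replace if present, add if absent
    return merged

# Illustrative entries: term -> (normal probability, spam probability)
loaded = {"free": (0.2, 0.8), "hello": (0.9, 0.1)}
newer = {"free": (0.1, 0.9), "winner": (0.05, 0.95)}
merged = merge_dictionaries(loaded, newer)
```

The loaded dictionary itself is left untouched, so a reload from backup remains possible if the merge is interrupted.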
Fig. 4 is a flow chart of the text classification method provided in this embodiment. As shown in Fig. 4, the method comprises:
Step S401: replace the characters in the text to be processed other than ordinary text characters and digits with a preset fixed character string, calculate the ratio of text-character length to total text length and the numbers of links, digit strings, and mailbox addresses contained in the text, and compute the cheating feature index Score_feature of the text to be processed.
The processing of this step is identical to that of steps S201 to S204 in Embodiment 2 and is not repeated here.
Step S402: use the pre-built user-name dictionary or IP dictionary to look up the proportion of spam texts historically submitted by the user name or IP address of the text to be processed, and compute the cheating user index Score_user of the text to be processed.
The processing of this step is identical to that of steps S205 to S206 in Embodiment 2 and is not repeated here.
Step S403: segment the text to be processed into terms, look up the normal probability and spam probability of each term in the pre-built Bayes dictionary, and calculate the probability that the text to be processed is spam, as the Bayes index of the text to be processed.
Bayes classification rests on Bayes' theorem and the total probability formula. Bayes' theorem essentially computes a "conditional probability": the probability that event A occurs given that event B has occurred, calculated as P(A|B) = P(B|A)P(A)/P(B). The total probability formula computes the probability of a composite event from the probabilities of simple events; for example, if A and A' form a partition of the sample space, then the probability of event B is P(B) = P(B|A)P(A) + P(B|A')P(A').
Applying Bayes' theorem to text classification relies on the following reading of the theorem: P(A) is the "prior probability", the estimate of A before event B occurs; P(A|B) is the "posterior probability", the re-estimate of A after B has occurred; and P(B|A)/P(B) is the "likelihood function", an adjustment factor that moves the estimate closer to the true probability. Its value is obtained experimentally: if it is greater than 1, the prior probability is strengthened; if it equals 1, event B says nothing about the likelihood of event A; if it is less than 1, the prior probability is weakened.
The theorem is applied to text classification as follows. Let S denote spam text (Spam) and H denote normal text (Healthy), with P(S) = P(H) = 50% in the ordinary case, and let W denote a word (Word). The program computes the probability that a text is spam given that W occurs in it, denoted P(S|W). From the formula above, P(S|W) = P(W|S)P(S)/(P(W|S)P(S) + P(W|H)P(H)), where P(W|S) and P(W|H) are the probabilities that W occurs in spam text and in normal text respectively, both of which can be counted from the corpus. The probability that a text is spam is thus inferred from the frequencies of its words. Inference over the multiple words contained in a text uses the joint probability formula: writing P(S|W1) as P1 and P(S|W2) as P2, the final probability is P = P1P2/(P1P2 + (1-P1)(1-P2)).
The final probability so obtained is taken as the Bayes index Score_bayes of the text to be processed.
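The per-word probability and the pairwise joint formula above can be sketched as follows. The corpus-derived probabilities in the example call are illustrative values; combining more than two words simply folds the same joint rule over the list:

```python
from functools import reduce

def p_spam_given_word(p_w_s, p_w_h, p_s=0.5, p_h=0.5):
    """P(S|W) = P(W|S)P(S) / (P(W|S)P(S) + P(W|H)P(H))."""
    return p_w_s * p_s / (p_w_s * p_s + p_w_h * p_h)

def joint(p1, p2):
    """P = P1*P2 / (P1*P2 + (1-P1)*(1-P2))."""
    return p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))

def bayes_index(word_probs):
    """Combine P(S|Wi) over all words of a text into Score_bayes."""
    return reduce(joint, word_probs)

p1 = p_spam_given_word(0.8, 0.2)  # word seen far more often in spam
p2 = p_spam_given_word(0.6, 0.4)  # mildly spam-leaning word
score = bayes_index([p1, p2])
```

With equal priors, P(S|W) reduces to P(W|S)/(P(W|S)+P(W|H)), which is exactly the per-term lookup stored in the Bayes dictionary.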
Step S404: segment the text to be processed into terms, look up the normal probability and spam probability of each term in the pre-built Fisher dictionary, and calculate the probability that the text to be processed is spam, as the Fisher index of the text to be processed.
Similar to the Bayes classification method, Fisher classification is an alternative to Bayes, and the two can also be used in combination. Unlike Bayes, which works from word frequencies, the Fisher method works from document probabilities: given that W occurs, the probability that the document belongs to S or to H, with the category specified when the result is obtained. The calculation is as follows. Let C be the category a document belongs to (i.e., S or H above); then P(C|W), the probability that a text containing W belongs to category C, can be counted from the training texts. For an input text, the probability that it belongs to spam is P(S) = P(S|W)/(P(S|W) + P(H|W)); for multiple words, the joint probability multiplies the individual results, P(S) = P(S1)P(S2)..., and P(H) is obtained similarly. In the Fisher method, the resulting P(C) may be further processed by passing -2*log(P(C)) into the inverse chi-square function and returning its result.
The Fisher index is Score_fisher = P(S)/(P(S) + P(H)), i.e., the Fisher estimate of the probability of belonging to the spam category, normalized.
Note that for texts that Fisher classification cannot decide, i.e., the case P(S) = P(H), Score_fisher is set to 1, meaning the Fisher index is inoperative (a factor of 1 leaves the final product unchanged).
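The Fisher combination above, including the inverse chi-square step, can be sketched as follows. The `inv_chi2` body is the form commonly used in Fisher-style spam scoring; treating it as the patent's exact function is an assumption:

```python
import math

def inv_chi2(chi, df):
    """Inverse chi-square function as commonly used in Fisher spam scoring:
    maps chi with df degrees of freedom (df even) to a value in [0, 1]."""
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_prob(word_probs):
    """Combine per-word category probabilities P(C|Wi): pass
    -2*log(product) into the inverse chi-square function."""
    product = math.prod(word_probs)
    return inv_chi2(-2.0 * math.log(product), 2 * len(word_probs))

def fisher_index(p_s_words, p_h_words):
    """Score_fisher = P(S)/(P(S)+P(H)); set to 1 when Fisher cannot judge."""
    p_s = fisher_prob(p_s_words)
    p_h = fisher_prob(p_h_words)
    if p_s == p_h:       # undecidable case: neutral factor
        return 1.0
    return p_s / (p_s + p_h)
```

A spam-leaning word list drives P(S) up and P(H) down, so the normalized score exceeds 0.5; identical evidence for both categories falls into the neutral case.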
Step S405: weight and multiply the Bayes index, the Fisher index, the cheating user index, and the cheating feature index, and determine texts whose result exceeds the preset threshold to be spam.
The final score of the text to be processed can be the product of the four indices. Specifically:
Score = Score_feature * Score_user * Score_bayes * Score_fisher (Formula 5)
The preset threshold is, as before, a classification threshold observed on the data set to be processed; whether the weighted or multiplied result exceeds the preset threshold determines whether the text is identified as spam. In this embodiment, a boundary of 1.0 separates normal texts from spam texts well.
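Formula 5 and the 1.0 boundary can be sketched as follows; the index values plugged in are the ones reported for the Fig. 3a example:

```python
def final_score(score_feature, score_user, score_bayes, score_fisher):
    """Formula 5: Score = Score_feature * Score_user * Score_bayes * Score_fisher."""
    return score_feature * score_user * score_bayes * score_fisher

def is_spam(score, threshold=1.0):
    """A text whose combined score exceeds the threshold is classified as spam."""
    return score > threshold

# Index values reported for the text of Fig. 3a:
score = final_score(2.785714, 1.000000, 0.034608, 0.406481)
```

Because each index is a multiplicative factor, a neutral index of 1.0 (for example, a user with no submission history) leaves the decision to the remaining indices.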
For the text shown in Fig. 3a, "Sun Moon Stars" is followed in the text by a large number of spaces and line-break characters. After processing by this example, the result is as shown in Fig. 5: the Bayes index obtained is 0.034608 and the Fisher index is 0.406481 — neither of these two indices recognizes the space and symbol features — while the cheating feature index is 2.785714, showing that feature expansion identifies such texts more effectively. The cheating user index is 1.000000, indicating that the user has no prior submission history. The final score is 0.0391875824319, so the text is classified as normal.
This example exposes the weakness of Fisher and Bayes, and shows that feature expansion more effectively identifies texts interspersed with large numbers of punctuation marks, blanks, tabs, or line breaks. Once similar texts appear in volume and are deleted by moderators, a few rounds of dictionary learning make the Bayes and Fisher indices correspondingly accurate; at the same time, because cheating review records now exist, the cheating user index also rises accordingly. A classifier built this way can better recognize similar texts, and the avatar-advertising cheating pattern is likewise recognized well. In addition, since this method takes into account the numbers of links and digit strings contained in a text, it also classifies well texts consisting mainly of web links, QQ numbers, phone numbers, and the like. The present invention proposes feature expansion together with submission-behavior methods, combined with existing machine learning methods, to improve classification accuracy.
The above is a detailed description of the method provided by the present invention; the text classification apparatus provided by the present invention is described in detail below.
Embodiment 4
Fig. 6 is a schematic diagram of the text classification apparatus provided in this embodiment. As shown in Fig. 6, the apparatus comprises:
Character replacement module 601, configured to replace each character in the text to be processed other than ordinary text characters and digits with a preset fixed character string.
Character replacement module 601 first replaces the special symbols in the text to be processed — for example English symbols such as "<>-_`~@#$%^&*()+=|", Chinese (fullwidth) symbols such as quotation marks, brackets, and dashes, escape characters such as "\n", "\t", and "\r", and spaces — with the fixed character string.
The fixed character string may be, but is not limited to, a string of length greater than 1 formed by repeating the same character — for example, the fixed character string "$$$$" formed from four "$" characters. Each character in the text to be processed other than ordinary text characters and digits is replaced with the fixed character string "$$$$". For example, a text to be processed such as `<----money-making method---:"?":>>>/` becomes, after replacement with "$$$$", a long run of "$" characters surrounding the words "money-making method", so the total length of the replaced text is lengthened.
Since punctuation marks, stop words, and the like are filtered out during word segmentation, this module first replaces these special symbols with the fixed character string, expanding the features of these special symbols and increasing the influence of this special-symbol portion, and only then counts the effective text content.
Of course, normal texts may contain some common punctuation marks, such as "，" and "。", a portion that can legitimately appear in normal content and may be left unreplaced. Accordingly, character replacement module 601 may first preprocess the characters in the text to be processed other than ordinary text characters and digits, removing common punctuation marks, and replace only the remaining characters with the preset fixed character string, which improves efficiency and accuracy.
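The two-stage replacement described for module 601 — keep text characters, digits, and common punctuation, replace everything else with the fixed string — can be sketched as follows. The allowed-punctuation set, the treatment of Latin letters as ordinary text, and the "$$$$" string are choices taken from the examples above, not fixed by the patent:

```python
import re

FIXED = "$$$$"          # one character repeated to length > 1
COMMON_PUNCT = "，。,."  # common punctuation left in place (illustrative set)

def replace_special(text):
    """Replace every character that is not a CJK character, a Latin letter,
    a digit, or common punctuation with the fixed character string."""
    def repl(match):
        ch = match.group(0)
        return ch if ch in COMMON_PUNCT else FIXED
    return re.sub(r"[^0-9A-Za-z\u4e00-\u9fa5]", repl, text)

replaced = replace_special("<<<赚钱方法>>> \n !!")
```

Each of the 11 special characters (brackets, spaces, the line break, the exclamation marks) becomes "$$$$", so the replaced text is much longer than the original while the four text characters survive intact.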
Text content computing module 602, configured to count the total length of the text after replacement by character replacement module 601 and the text-character length contained in the text, and to calculate the ratio of the text-character length to the total text length.
A text to be processed containing many special symbols — English symbols, Chinese symbols, escape characters, and the like — grows under fixed-string replacement; the total length of the replaced text is counted as L_ORIG.
The text characters contained in the text to be processed are found using regular expressions. For example, the encoding range of Chinese characters in the Chinese Internal Code Extension Specification (GBK) is 0x8140-0xFEFE, in the Chinese Character Set Code for Information Interchange (GB2312) it is 0xA1A1-0xFEFE, and in Unicode it is \u4E00-\u9FA5 and \uF900-\uFA2D. A regular expression built from these encoding ranges finds the characters falling within them; the number of text characters found is counted to give the text-character length L_CHAR.
Text content computing module 602 calculates the ratio K = L_CHAR/L_ORIG of text-character length to total text length, i.e., the effective text content.
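The ratio K = L_CHAR/L_ORIG can be computed over the replaced text with a Unicode-range regular expression, a sketch using the \u4E00-\u9FA5 and \uF900-\uFA2D ranges quoted above:

```python
import re

# CJK ranges named in the text: Unicode \u4E00-\u9FA5 and \uF900-\uFA2D
CJK = re.compile(r"[\u4e00-\u9fa5\uf900-\ufa2d]")

def text_ratio(replaced_text):
    """K = L_CHAR / L_ORIG: share of CJK text characters in the
    total length of the replaced text (0.0 for an empty text)."""
    l_orig = len(replaced_text)
    l_char = len(CJK.findall(replaced_text))
    return l_char / l_orig if l_orig else 0.0

k = text_ratio("$$$$$$$$赚钱$$$$$$$$")  # 2 text characters in 18 total
```

Because fixed-string replacement inflates L_ORIG for symbol-heavy texts, their K falls toward 0, which is exactly the signal the cheating feature index exploits.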
Cheating feature index computing module 603, configured to calculate the cheating feature index of the text to be processed using the ratio of text-character length to total text length.
Spam texts interspersed with many punctuation marks, blanks, escape characters, and the like generally contain few text characters, and the non-text portion is large. That is, the lower the effective text content of a text, the more pronounced its cheating feature index, and the greater the probability that it is spam.
Accordingly, the ratio K of text-character length to total text length is used in advance to build a cheating-feature-index function, from which the cheating feature index is calculated. Specifically, a decreasing function of the ratio K serves as the cheating-feature-index function; the decreasing function may be, but is not limited to, Formula 1. In Formula 1, since the value range of the ratio K is [0,1], the value range of (1+K) is [1,2] and that of 1/(1+K) is [0.5,1], so the cheating feature index is distributed over [1,2]. The lower the value of K, the higher the Score, the more pronounced the cheating feature of the text, and the greater the probability that it is spam.
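Formula 1 itself is not reproduced in this excerpt. A decreasing function of K that matches every constraint stated for it — built from 1/(1+K) and distributed over exactly [1, 2] — is Score = 2/(1+K); treating this as the patent's formula is an assumption:

```python
def cheat_feature_index(k):
    """A decreasing function of K in [0,1] with values in [1,2].
    Score = 2/(1+K) satisfies the constraints stated for Formula 1;
    the exact formula is an assumption, not taken from the patent."""
    assert 0.0 <= k <= 1.0
    return 2.0 / (1.0 + k)
```

A symbol-only text (K = 0) gets the maximum score 2, a pure-text comment (K = 1) the neutral minimum 1, and the score falls monotonically in between.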
Categorization module 604, configured to determine texts whose cheating feature index exceeds the preset threshold to be spam.
The preset threshold is a classification threshold observed on the data set to be processed. According to the requirements of the practical application scenario and prior experience, a preset threshold for the cheating feature index is set; whether the cheating feature index Score_feature1 calculated by cheating feature index computing module 603 exceeds the preset threshold determines whether the text is identified as spam.
Thus, the apparatus provided in this embodiment effectively identifies spam texts interspersed with large amounts of punctuation, spaces, escape characters, and other special characters.
Note that the texts remaining after this identification, i.e., those whose cheating feature index Score_feature1 does not exceed the preset threshold, may additionally be judged again by existing classifiers such as Bayes classifiers or support vector machines.
Embodiment 5
Fig. 7 is a schematic diagram of the text classification apparatus provided in this embodiment. As shown in Fig. 7, the apparatus comprises:
Character replacement module 701, configured to replace the characters in the text to be processed other than ordinary text characters and digits with a preset fixed character string.
This module is identical to module 601 in Embodiment 4 and is not described again here.
Text content computing module 702, configured to count the total length of the text after replacement by character replacement module 701 and the text-character length contained in the text, and to calculate a text proportion score using the ratio of text-character length to total text length.
The calculation of the ratio K of text-character length to total text length used in this module is identical to that in module 602 of Embodiment 4, i.e., K = L_CHAR/L_ORIG.
Text content computing module 702 calculates the text proportion score Score_char from the ratio, which may be, but is not limited to, Formula 2.
Numeric character statistics module 703, configured to find the numbers of links, digit strings, and mailbox addresses contained in the text to be processed, obtaining the link weight and number weight of the text to be processed.
The numbers of links, QQ numbers, phone numbers, and mailbox addresses are found using regular expressions. For example, in the Python language, a regular expression such as re.compile(r"[0-9]{5,11}") may be used to find phone numbers or QQ numbers, one such as re.compile(r"\w+@\w+(?:\.\w+)+") to find email addresses, and one such as re.compile(r"(?:http://)?(?:\w+\.)+(?:com|net|edu|gov|cn)") to find web links.
The link weight Score_link may be, but is not limited to, the sum of the numbers of links and mailbox addresses contained in the text; correspondingly, the number weight Score_digit may be, but is not limited to, the sum of the numbers of digit strings such as QQ numbers and phone numbers contained in the text.
Cheating feature index computing module 704, configured to weight the decreasing function of the ratio of text-character length to total text length with the obtained link weight and number weight, obtaining the cheating feature index of the text to be processed.
The specific weighting formula used in cheating feature index computing module 704 may be, but is not limited to, Formula 3. From Formula 3 it can be seen that the larger the link weight Score_link and the number weight Score_digit, the larger the cheating feature index; likewise, the smaller the ratio of text-character length to total text length, the larger the resulting text proportion score Score_char, and the larger the cheating feature index.
User information extraction module 705, configured to determine the user name and IP address that submitted the text to be processed.
For the text to be processed, the user information is obtained to determine the submitting user name and IP address.
Fig. 3a is a schematic diagram of a text and its user information. As shown in Fig. 3a, the text was submitted by the user named sx1816 from IP 114.228.210.130, and its content is "East China Zone 1 - Sun Moon Stars 282zzd8010101060000700067b4t0zmcb50e0".
Cheating user index computing module 706, configured to look up the submission history data corresponding to the user name or IP address in the pre-built user-name dictionary or IP dictionary, and to calculate the cheating user index from the numbers of normal and spam texts the user has submitted.
The user-name dictionary and IP dictionary are obtained in advance by classified statistics over historical data of a certain scale: according to the submitting user name and IP address, the numbers of normal and spam texts among the texts submitted by each user are counted and recorded, generating the user-name dictionary and the IP dictionary.
The proportion of spam texts historically submitted by the user name or IP address of the text to be processed is calculated to give the cheating user index.
Using the user name and IP address determined by user information extraction module 705, the submission history data of the user is looked up in the pre-built user-name dictionary and IP dictionary, and the numbers of normal and spam texts the user has submitted are recorded as h_num and s_num respectively.
The cheating user index Score_user may be, but is not limited to, calculated by Formula 4.
From Formula 4 it can be seen that the cheating user index is driven mainly by the number of spam texts the user has historically submitted and by the proportion of spam texts among all submissions. For a spamming user, both of these are typically high; even if a normal user has a few comments marked as spam, the spam proportion is low, and the resulting cheating user index is also low.
Of course, when calculating the cheating user index, cheating user index computing module 706 may also take the features of the user name into account. Cheating users are registered by machine, so their user names often have certain features — for example, letters and digits composed according to a fixed rule, or containing phrases such as "add Q", "QQ", "make friends", "contact", "add me" — and the cheating user index of users with such features can be further adjusted in weight.
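Formula 4 is not reproduced in this excerpt. One function consistent with the behavior described — neutral (1.0) for a user with no history, as in the Fig. 3a example, and rising with both the spam count and the spam proportion — is sketched below; the exact formula is an assumption:

```python
import math

def cheat_user_index(h_num, s_num):
    """Hypothetical stand-in for Formula 4: neutral for users with no
    history, increasing in both the spam count s_num and the spam
    proportion s_num/(h_num+s_num) of past submissions."""
    total = h_num + s_num
    if total == 0:
        return 1.0  # no submission history: neutral in the product
    spam_ratio = s_num / total
    return 1.0 + spam_ratio * math.log1p(s_num)
```

A heavy spammer (few normal texts, many spam texts) scores far above 1, while a normal user with an occasional flagged comment stays close to 1, matching the qualitative description above.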
Categorization module 707, configured to weight or multiply the cheating user index with the cheating feature index, and to determine texts whose result exceeds the preset threshold to be spam.
The final score of the text to be processed is obtained by weighting or multiplying the cheating feature index obtained by cheating feature index computing module 704 with the cheating user index obtained by cheating user index computing module 706; in this embodiment, multiplication is used to calculate the final score.
The preset threshold is, as before, a classification threshold observed on the data set to be processed; whether the weighted or multiplied result exceeds the preset threshold determines whether the text is identified as spam.
Embodiment 6
In this embodiment, the Bayes dictionary, Fisher dictionary, user-name dictionary, and IP dictionary are first built in advance by offline dictionary generation. Specifically, the building module comprises:
Corpus acquisition unit, configured to obtain a sample corpus containing normal texts and spam texts.
The sample corpus may use existing historical data of a certain scale, composed of the texts, comments, or replies submitted under the different user names or IP addresses accumulated in the network.
The classification into normal and spam texts may be obtained with an existing classification method, or by manual labeling, distinguishing the texts labeled as spam by moderators or other users in the sample corpus from the unlabeled normal texts.
Machine classification unit, configured to perform word segmentation on the texts in the sample corpus, count the occurrences of each term, calculate the probability that each term appears in normal text and in spam text, and generate a classification dictionary.
The machine learning method may be an existing Bayes classification method, Fisher classification method, or the like, each producing a corresponding classification dictionary. Fig. 3b is a schematic diagram of a Bayes dictionary obtained by training with the Bayes classification method, and Fig. 3c is a schematic diagram of a Fisher dictionary obtained by training with the Fisher classification method. As shown in Figs. 3b and 3c, the dictionary contains each term together with that term's normal probability and spam probability.
User information recording unit, configured to record the user name and IP address that submitted each text in the sample corpus.
The user name and IP address of each text are extracted from the sample corpus. Fig. 3a is a schematic diagram of a text and its user information. As shown in Fig. 3a, the text was submitted by the user named sx1816 from IP 114.228.210.130, and its content is "East China Zone 1 - Sun Moon Stars 282zzd8010101060000700067b4t0zmcb50e0".
Statistics unit, configured to count, for each user name and each IP address, the numbers of uploaded texts marked as normal text and as spam text, and to generate the user-name dictionary and the IP dictionary.
Fig. 3d is a schematic diagram of the user-name dictionary so obtained. As shown in Fig. 3d, the dictionary contains each user name together with the numbers of normal and spam texts that user has submitted.
Fig. 3e is a schematic diagram of the IP dictionary so obtained. As shown in Fig. 3e, the dictionary contains each IP together with the numbers of normal and spam texts submitted from it.
The offline learning that generates the dictionaries can run periodically during the service cycle, so that learning happens automatically and the system is self-learning. In addition, several dictionary versions can be backed up continuously, ensuring that when the server restarts, the previous data can be reloaded. When a newer dictionary becomes available, the service merges it with the dictionary already loaded: for an index (key) that already exists, the value is replaced; indexes that did not exist before are added.
Fig. 8 is a schematic diagram of the text classification apparatus provided in this embodiment. As shown in Fig. 8, the apparatus comprises:
Cheating feature index processing module 801, configured to replace the characters in the text to be processed other than ordinary text characters and digits with a preset fixed character string, calculate the ratio of text-character length to total text length and the numbers of links, digit strings, and mailbox addresses contained in the text, and compute the cheating feature index Score_feature of the text to be processed.
The processing of this module is identical to that of modules 701 to 704 in Embodiment 5 and is not repeated here.
Cheating user index processing module 802, configured to use the pre-built user-name dictionary or IP dictionary to look up the proportion of spam texts historically submitted by the user name or IP address of the text to be processed, and to compute the cheating user index Score_user of the text to be processed.
The processing of this module is identical to that of modules 705 to 706 in Embodiment 5 and is not repeated here.
Bayes index computing module 803, configured to segment the text to be processed into terms, look up the normal probability and spam probability of each resulting term in the pre-built Bayes dictionary, and calculate the probability that the text to be processed is spam, as the Bayes index of the text to be processed.
Bayes classification rests on Bayes' theorem and the total probability formula. Bayes' theorem essentially computes a "conditional probability": the probability that event A occurs given that event B has occurred, calculated as P(A|B) = P(B|A)P(A)/P(B). The total probability formula computes the probability of a composite event from the probabilities of simple events; for example, if A and A' form a partition of the sample space, then the probability of event B is P(B) = P(B|A)P(A) + P(B|A')P(A').
Applying Bayes' theorem to text classification relies on the following reading of the theorem: P(A) is the "prior probability", the estimate of A before event B occurs; P(A|B) is the "posterior probability", the re-estimate of A after B has occurred; and P(B|A)/P(B) is the "likelihood function", an adjustment factor that moves the estimate closer to the true probability. Its value is obtained experimentally: if it is greater than 1, the prior probability is strengthened; if it equals 1, event B says nothing about the likelihood of event A; if it is less than 1, the prior probability is weakened.
The theorem is applied to text classification as follows. Let S denote spam text (Spam) and H denote normal text (Healthy), with P(S) = P(H) = 50% in the ordinary case, and let W denote a word (Word). The program computes the probability that a text is spam given that W occurs in it, denoted P(S|W). From the formula above, P(S|W) = P(W|S)P(S)/(P(W|S)P(S) + P(W|H)P(H)), where P(W|S) and P(W|H) are the probabilities that W occurs in spam text and in normal text respectively, both of which can be counted from the corpus. The probability that a text is spam is thus inferred from the frequencies of its words. Inference over the multiple words contained in a text uses the joint probability formula: writing P(S|W1) as P1 and P(S|W2) as P2, the final probability is P = P1P2/(P1P2 + (1-P1)(1-P2)).
Bayes index computing module 803 takes the final probability so obtained as the Bayes index Score_bayes of the text to be processed.
Fisher index computing module 804, configured to segment the text to be processed into terms, look up the normal probability and spam probability of each resulting term in the pre-built Fisher dictionary, and calculate the probability that the text to be processed is spam, as the Fisher index of the text to be processed.
Similar to Bayes classification, Fisher classification is an alternative to Bayes, and the two can also be used in combination. Unlike Bayes, which works from word frequencies, the Fisher method works from document probabilities: given that W occurs, the probability that the document belongs to S or to H, with the category specified when the result is obtained. The calculation is as follows. Let C be the category a document belongs to (i.e., S or H above); then P(C|W), the probability that a text containing W belongs to category C, can be counted from the training texts. For an input text, the probability that it belongs to spam is P(S) = P(S|W)/(P(S|W) + P(H|W)); for multiple words, the joint probability multiplies the individual results, P(S) = P(S1)P(S2)..., and P(H) is obtained similarly. In the Fisher method, the resulting P(C) may be further processed by passing -2*log(P(C)) into the inverse chi-square function and returning its result.
The Fisher index is Score_fisher = P(S)/(P(S) + P(H)), i.e., the Fisher estimate of the probability of belonging to the spam category, normalized.
Note that for texts that Fisher classification cannot decide, i.e., the case P(S) = P(H), Score_fisher is set to 1, meaning the Fisher index is inoperative (a factor of 1 leaves the final product unchanged).
Categorization module 805, configured to weight or multiply the Bayes index, the Fisher index, the cheating user index, and the cheating feature index, and to determine texts whose result exceeds the preset threshold to be spam.
The final score of the text to be processed may be the product of the four indices, as in Formula 5.
The preset threshold is, as before, a classification threshold observed on the data set to be processed; whether the weighted or multiplied result exceeds the preset threshold determines whether the text is identified as spam. In this embodiment, a boundary of 1.0 separates normal texts from spam texts well.
For the text shown in Fig. 3a, "Sun Moon Stars" is followed in the text by a large number of spaces and line-break characters. After processing by this example, the result is as shown in Fig. 5: the Bayes index obtained is 0.034608 and the Fisher index is 0.406481 — neither of these two indices recognizes the space and symbol features — while the cheating feature index is 2.785714, showing that feature expansion identifies such texts more effectively. The cheating user index is 1.000000, indicating that the user has no prior submission history. The final score is 0.0391875824319, so the text is classified as normal.
This example exposes the weakness of Fisher and Bayes, and shows that feature expansion more effectively identifies texts interspersed with large numbers of punctuation marks, blanks, tabs, or line breaks. Once similar texts appear in volume and are deleted by moderators, a few rounds of dictionary learning make the Bayes and Fisher indices correspondingly accurate; at the same time, because cheating review records now exist, the cheating user index also rises accordingly. A classifier built this way can better recognize similar texts, and the avatar-advertising cheating pattern is likewise recognized well. In addition, since the present invention takes into account the numbers of links and digit strings contained in a text, it also classifies well texts consisting mainly of web links, QQ numbers, phone numbers, and the like.
The text classification method and device provided by the present invention classify text using text features and user behavior combined with machine learning, and can classify each class of text accurately, in particular text mixed with large numbers of special symbols, escape characters and links, meaningless text posted in bulk, and avatar-advertisement cheating users. This effectively compensates for the deficiencies of existing machine learning methods and improves classification accuracy.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (14)
1. A text classification method, characterized by comprising the following steps:
S1, replacing each character in a text to be processed, other than text characters and digits, with a preset fixed character string;
S2, counting the total length of the replaced text and the length of the words contained in the text, and calculating the ratio of the word length to the total text length;
S3, calculating a cheating characteristic index of the text to be processed using the ratio of the word length to the total text length;
S4, determining a text to be processed whose cheating characteristic index exceeds a preset threshold to be rubbish text.
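Read literally, steps S1–S4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the single-character marker, the CJK-aware character class, the reciprocal as the index function, and the threshold value of 2.0 are all assumptions made for the sketch.

```python
import re

FIXED_MARK = "#"  # hypothetical one-character "fixed character string"

def cheat_feature_index(text):
    """Sketch of steps S1-S3: replace non-text, non-digit characters,
    then turn the word-length / total-length ratio into an index that
    grows as the share of real content shrinks (assumed form)."""
    # S1: replace every character that is not a letter, digit or CJK character
    replaced = re.sub(r"[^0-9A-Za-z\u4e00-\u9fff]", FIXED_MARK, text)
    # S2: total length after replacement vs. length of the word content
    total_len = len(replaced)
    word_len = sum(ch != FIXED_MARK for ch in replaced)
    ratio = word_len / total_len if total_len else 1.0
    # S3: assumed index - the reciprocal of the ratio (larger when the
    # text is mostly symbols, spaces or line breaks)
    return 1.0 / ratio if ratio else float("inf")

def is_rubbish(text, threshold=2.0):
    # S4: classify as rubbish text when the index exceeds the threshold
    return cheat_feature_index(text) > threshold
```

A text that is mostly symbols, such as "a!!!!!!!!!", scores 10.0 under this sketch and is flagged, while ordinary prose stays near 1.0.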
2. The method according to claim 1, characterized in that, before step S1, the method further comprises:
preprocessing the characters in the text to be processed other than text characters and digits to remove common punctuation marks;
wherein step S1 replaces only the remaining characters with the preset fixed character string.
3. The method according to claim 1, characterized in that, before step S3, the method further comprises:
counting the numbers of links, numbers and email addresses contained in the text to be processed to obtain a link weight and a number weight for the text to be processed;
wherein step S3 weights a decreasing function of the ratio of the word length to the total text length using the obtained link weight and number weight to obtain the cheating characteristic index of the text to be processed; the larger the link weight and the number weight, the larger the cheating characteristic index.
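One hypothetical reading of claim 3 is sketched below, with regular expressions standing in for the link, number and email detection. The weight factors `alpha` and `beta` and the use of the reciprocal as the decreasing function are assumptions, not taken from the patent:

```python
import re

def weighted_cheat_index(text, alpha=0.5, beta=0.3):
    """Claim 3 sketch: the more links, long numbers and email addresses
    a text contains, the larger its cheating characteristic index."""
    links = len(re.findall(r"https?://\S+|www\.\S+", text))
    numbers = len(re.findall(r"\d{5,}", text))      # e.g. QQ or phone numbers
    emails = len(re.findall(r"\S+@\S+\.\S+", text))
    link_weight = 1.0 + alpha * links               # assumed weight form
    number_weight = 1.0 + beta * (numbers + emails)
    word_len = sum(ch.isalnum() for ch in text)
    ratio = word_len / len(text) if text else 1.0
    decreasing = 1.0 / (ratio + 1e-9)               # decreasing in the ratio
    # larger link/number weights -> larger cheating characteristic index
    return link_weight * number_weight * decreasing
```

Under this sketch, plain prose scores close to 1, while a short message carrying a link and a phone-style number is pushed well above it by the two weights.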
4. The method according to claim 1, characterized in that the method further comprises:
determining the user name and IP address that submitted the text to be processed;
looking up the submission status data corresponding to the user name or IP address in a pre-built user name dictionary or IP dictionary, and calculating a cheating user index from the numbers of normal texts and rubbish texts the user has submitted;
wherein step S4 weights or multiplies the cheating user index with the cheating characteristic index, and determines a text to be processed whose calculated result exceeds the preset threshold as rubbish text.
5. The method according to claim 4, characterized in that establishing the user name dictionary and IP dictionary specifically comprises:
obtaining a sample corpus containing normal texts and rubbish texts;
recording the user name and IP address that submitted each text in the sample corpus;
counting, for each user name and IP address respectively, the numbers of uploaded texts marked as normal text and as rubbish text, and generating the user name dictionary and IP dictionary.
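Claims 4 and 5 together can be sketched as follows. The tuple-based corpus format and the smoothed index formula are assumptions; the formula is chosen so that a user with no history scores exactly 1.000000, matching the Fig. 3a example in the description:

```python
from collections import defaultdict

def build_dictionaries(sample_corpus):
    """Claim 5 sketch: sample_corpus is an iterable of
    (user_name, ip_address, label) tuples, label in {"normal", "rubbish"}."""
    user_dict = defaultdict(lambda: {"normal": 0, "rubbish": 0})
    ip_dict = defaultdict(lambda: {"normal": 0, "rubbish": 0})
    for user, ip, label in sample_corpus:
        user_dict[user][label] += 1
        ip_dict[ip][label] += 1
    return user_dict, ip_dict

def cheat_user_index(counts):
    """Claim 4 sketch: turn submission counts into a cheating user index.
    The (2r + 1) / (n + r + 1) form is an assumption: it yields 1.0 for a
    user with no history, rises above 1.0 for mostly-rubbish submitters,
    and falls below 1.0 for mostly-normal ones."""
    n, r = counts["normal"], counts["rubbish"]
    return (2 * r + 1) / (n + r + 1)
```

In a multiplicative combination with a 1.0 threshold, such an index is neutral for unknown users and acts as a penalty or a discount once history accumulates.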
6. The method according to claim 1, characterized in that the method further comprises:
segmenting the text to be processed into words, looking up the normal probability and rubbish probability of each word in a pre-built Bayes dictionary, and calculating the probability that the text to be processed is rubbish text as the Bayes index of the text to be processed;
wherein step S4 multiplies or weights the Bayes index with the cheating characteristic index, and determines a text to be processed whose calculated result exceeds the preset threshold as rubbish text.
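A minimal naive-Bayes-style sketch of claim 6, assuming the text has already been segmented into a word list and that words missing from the dictionary default to (0.5, 0.5); the log-space combination and the equal-priors assumption are choices made for the sketch, not details given by the patent:

```python
import math

def bayes_index(words, bayes_dict, default=(0.5, 0.5)):
    """Claim 6 sketch: bayes_dict maps each word to a
    (normal_probability, rubbish_probability) pair; the text-level
    rubbish probability combines them naive-Bayes style in log space
    to avoid underflow on long texts."""
    log_rubbish = log_normal = 0.0
    for w in words:
        p_normal, p_rubbish = bayes_dict.get(w, default)
        log_rubbish += math.log(p_rubbish)
        log_normal += math.log(p_normal)
    # P(rubbish | words), assuming equal class priors
    return 1.0 / (1.0 + math.exp(log_normal - log_rubbish))
```

An empty word list yields the neutral value 0.5, and words strongly associated with rubbish text drive the index toward 1.0.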
7. The method according to claim 1, characterized in that the method further comprises:
segmenting the text to be processed into words, looking up the normal probability and rubbish probability of each word in a pre-built Fisher dictionary, and calculating the probability that the text to be processed is rubbish text as the Fisher index of the text to be processed;
wherein step S4 multiplies or weights the Fisher index with the cheating characteristic index, and determines a text to be processed whose calculated result exceeds the preset threshold as rubbish text.
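Claim 7 does not spell out how the per-word probabilities are combined. A common Fisher-based scheme in spam filtering is Robinson's inverse chi-square combination, sketched here under that assumption; the clamping bounds and the neutral default of 0.5 are likewise choices made for the sketch:

```python
import math

def chi2q(x, df):
    """Survival function P(X > x) of a chi-square variable with an
    even number of degrees of freedom (closed form, no SciPy needed)."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_index(words, fisher_dict, default=0.5):
    """Claim 7 sketch: fisher_dict maps each word to its rubbish
    probability; per-word probabilities are combined with Fisher's
    inverse chi-square method in Robinson's S/H formulation."""
    probs = [min(max(fisher_dict.get(w, default), 1e-9), 1 - 1e-9)
             for w in words]
    n = len(probs)
    if n == 0:
        return default
    s = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    h = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0  # near 1.0 for rubbish-looking text
```

Unlike the naive-Bayes combination, this formulation stays near 0.5 when the evidence is weak or conflicting, which is why the two indices can usefully complement each other in the final score.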
8. A text classification device, characterized by comprising:
a character replacement module, configured to replace each character in a text to be processed, other than text characters and digits, with a preset fixed character string;
a text content computing module, configured to count the total length of the text after replacement by the character replacement module and the length of the words contained in the text, and to calculate the ratio of the word length to the total text length;
a cheating characteristic index computing module, configured to calculate the cheating characteristic index of the text to be processed using the ratio of the word length to the total text length;
a classification module, configured to determine a text to be processed whose cheating characteristic index exceeds a preset threshold as rubbish text.
9. The device according to claim 8, characterized in that the character replacement module is configured to:
preprocess the characters in the text to be processed other than text characters and digits to remove common punctuation marks; and
replace only the characters remaining after the preprocessing.
10. The device according to claim 8, characterized in that the device further comprises:
a number and character statistics module, configured to count the numbers of links, numbers and email addresses contained in the text to be processed to obtain the link weight and number weight of the text to be processed;
wherein the cheating characteristic index computing module weights a decreasing function of the ratio of the word length to the total text length using the obtained link weight and number weight to obtain the cheating characteristic index of the text to be processed; the larger the link weight and the number weight, the larger the cheating characteristic index.
11. The device according to claim 8, characterized in that the device further comprises:
a user information extraction module, configured to determine the user name and IP address that submitted the text to be processed;
a cheating user index computing module, configured to look up the submission status data corresponding to the user name or IP address in a pre-built user name dictionary or IP dictionary, and to calculate the cheating user index from the proportion of rubbish texts historically submitted from the user name or IP address of the text to be processed;
wherein the classification module is further configured to weight or multiply the cheating user index with the cheating characteristic index, and to determine a text to be processed whose calculated result exceeds the preset threshold as rubbish text.
12. The device according to claim 11, characterized in that a module for establishing the user name dictionary and IP dictionary specifically comprises:
a corpus acquiring unit, configured to obtain a sample corpus containing normal texts and rubbish texts;
a user information recording unit, configured to record the user name and IP address that submitted each text in the sample corpus;
a statistics unit, configured to count, for each user name and IP address respectively, the numbers of uploaded texts marked as normal text and as rubbish text, and to generate the user name dictionary and IP dictionary.
13. The device according to claim 8, characterized in that the device further comprises:
a Bayes index computing module, configured to segment the text to be processed into words, look up the normal probability and rubbish probability of each word in a pre-built Bayes dictionary, calculate the probability that the text to be processed is rubbish text as the Bayes index of the text to be processed, and supply the Bayes index to the classification module;
wherein the classification module is further configured to multiply or weight the Bayes index with the cheating characteristic index, and to determine a text to be processed whose calculated result exceeds the preset threshold as rubbish text.
14. The device according to claim 8, characterized in that the device further comprises:
a Fisher index computing module, configured to segment the text to be processed into words, look up the normal probability and rubbish probability of each word in a pre-built Fisher dictionary, calculate the probability that the text to be processed is rubbish text as the Fisher index of the text to be processed, and supply the Fisher index to the classification module;
wherein the classification module is further configured to multiply or weight the Fisher index with the cheating characteristic index, and to determine a text to be processed whose calculated result exceeds the preset threshold as rubbish text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210206020.5A CN103514174B (en) | 2012-06-18 | 2012-06-18 | A kind of file classification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210206020.5A CN103514174B (en) | 2012-06-18 | 2012-06-18 | A kind of file classification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103514174A CN103514174A (en) | 2014-01-15 |
CN103514174B true CN103514174B (en) | 2019-01-15 |
Family
ID=49896913
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210206020.5A Active CN103514174B (en) | 2012-06-18 | 2012-06-18 | A kind of file classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103514174B (en) |
Families Citing this family (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970832A (en) * | 2014-04-01 | 2014-08-06 | 百度在线网络技术(北京)有限公司 | Method and device for recognizing spam |
CN104408087A (en) * | 2014-11-13 | 2015-03-11 | 百度在线网络技术(北京)有限公司 | Method and system for identifying cheating text |
CN104573411B (en) * | 2014-12-30 | 2018-04-17 | 深圳先进技术研究院 | A kind of biomarker correlation method for visualizing and device |
CN105991620B (en) * | 2015-03-05 | 2019-09-06 | 阿里巴巴集团控股有限公司 | The recognition methods of malice account and device |
CN106372052A (en) * | 2015-07-22 | 2017-02-01 | 北京国双科技有限公司 | Text filtering processing method and apparatus |
CN106528504A (en) * | 2015-09-11 | 2017-03-22 | 北京国双科技有限公司 | Data screening method and device for social application |
CN106528521A (en) * | 2015-09-11 | 2017-03-22 | 北京国双科技有限公司 | Method and device for screening social application data |
US11915796B2 (en) * | 2015-11-20 | 2024-02-27 | Seegene, Inc. | Method for calibrating a data set of a target analyte using a normalization coefficient |
CN106874291A (en) * | 2015-12-11 | 2017-06-20 | 北京国双科技有限公司 | The processing method and processing device of text classification |
CN105574156B (en) * | 2015-12-16 | 2019-03-26 | 华为技术有限公司 | Text Clustering Method, device and calculating equipment |
CN106126605B (en) * | 2016-06-21 | 2019-12-10 | 国家计算机网络与信息安全管理中心 | Short text classification method based on user portrait |
CN107870945B (en) * | 2016-09-28 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Content rating method and apparatus |
CN108287697B (en) * | 2016-12-30 | 2020-10-02 | 广州华多网络科技有限公司 | Html escape character replacing method, device and terminal |
CN107168951B (en) * | 2017-05-10 | 2019-07-05 | 山东大学 | A kind of rule-based prison inmates short message automatic auditing method with dictionary |
CN110019776B (en) * | 2017-09-05 | 2023-04-28 | 腾讯科技(北京)有限公司 | Article classification method and device and storage medium |
CN107832360A (en) * | 2017-10-24 | 2018-03-23 | 广东欧珀移动通信有限公司 | Comment processing method and relevant device |
CN107729520B (en) * | 2017-10-27 | 2020-12-01 | 北京锐安科技有限公司 | File classification method and device, computer equipment and computer readable medium |
CN108090193B (en) * | 2017-12-21 | 2022-04-22 | 创新先进技术有限公司 | Abnormal text recognition method and device |
CN109977729A (en) * | 2017-12-27 | 2019-07-05 | 中移(杭州)信息技术有限公司 | A kind of Method for text detection and device |
CN108763209B (en) * | 2018-05-22 | 2022-04-05 | 创新先进技术有限公司 | Method, device and equipment for feature extraction and risk identification |
CN109241523B (en) * | 2018-08-10 | 2020-12-11 | 北京百度网讯科技有限公司 | Method, device and equipment for identifying variant cheating fields |
CN109460508B (en) * | 2018-10-10 | 2021-10-15 | 浙江大学 | Efficient spam comment user group detection method |
CN109582833B (en) * | 2018-11-06 | 2023-09-22 | 创新先进技术有限公司 | Abnormal text detection method and device |
CN110457597A (en) * | 2019-08-08 | 2019-11-15 | 中科鼎富(北京)科技发展有限公司 | A kind of advertisement recognition method and device |
CN113111234A (en) * | 2020-02-13 | 2021-07-13 | 北京明亿科技有限公司 | Regular expression-based alarm condition category determination method and device |
CN111507350B (en) * | 2020-04-16 | 2024-01-05 | 腾讯科技(深圳)有限公司 | Text recognition method and device |
CN112163789B (en) * | 2020-10-22 | 2021-04-30 | 上海易教科技股份有限公司 | Teacher workload evaluation system and method for online education |
CN113449199B (en) * | 2021-09-01 | 2021-11-26 | 深圳市知酷信息技术有限公司 | Document monitoring and management system based on comprehensive security audit |
CN113806542A (en) * | 2021-09-18 | 2021-12-17 | 上海幻电信息科技有限公司 | Text analysis method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101197793A (en) * | 2007-12-28 | 2008-06-11 | 腾讯科技(深圳)有限公司 | Garbage information detection method and device |
CN101304589A (en) * | 2008-04-14 | 2008-11-12 | 中国联合通信有限公司 | Method and system for monitoring and filtering garbage short message transmitted by short message gateway |
CN102004764A (en) * | 2010-11-04 | 2011-04-06 | 中国科学院计算机网络信息中心 | Internet bad information detection method and system |
CN102298638A (en) * | 2011-08-31 | 2011-12-28 | 北京中搜网络技术股份有限公司 | Method and system for extracting news webpage contents by clustering webpage labels |
2012-06-18: Application filed in China (CN201210206020.5A); current status: Active.
Also Published As
Publication number | Publication date |
---|---|
CN103514174A (en) | 2014-01-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103514174B (en) | A kind of file classification method and device | |
Abozinadah et al. | Detection of abusive accounts with Arabic tweets | |
CN109145216B (en) | Network public opinion monitoring method, device and storage medium | |
Sharif et al. | Sentiment analysis of Bengali texts on online restaurant reviews using multinomial Naïve Bayes | |
CN109325165B (en) | Network public opinion analysis method, device and storage medium | |
Gharge et al. | An integrated approach for malicious tweets detection using NLP | |
US8489689B1 (en) | Apparatus and method for obfuscation detection within a spam filtering model | |
Aisopos et al. | Content vs. context for sentiment analysis: a comparative analysis over microblogs | |
US8112484B1 (en) | Apparatus and method for auxiliary classification for generating features for a spam filtering model | |
CN103336766B (en) | Short text garbage identification and modeling method and device | |
CN103984703A (en) | Mail classification method and device | |
US10216837B1 (en) | Selecting pattern matching segments for electronic communication clustering | |
Sangwan et al. | D-BullyRumbler: a safety rumble strip to resolve online denigration bullying using a hybrid filter-wrapper approach | |
JP6605022B2 (en) | Systems and processes for analyzing, selecting, and capturing sources of unstructured data by experience attributes | |
Ruskanda | Study on the effect of preprocessing methods for spam email detection | |
Chumwatana | Using sentiment analysis technique for analyzing Thai customer satisfaction from social media | |
Samsudin et al. | Mining opinion in online messages | |
CN110020430B (en) | Malicious information identification method, device, equipment and storage medium | |
US11010687B2 (en) | Detecting abusive language using character N-gram features | |
Iyengar et al. | Integrated spam detection for multilingual emails | |
JP2009157450A (en) | Mail sorting system, mail retrieving system, and mail destination sorting system | |
Ali et al. | Identification of profane words in cyberbullying incidents within social networks | |
CN110232124A (en) | A kind of sentiment analysis system | |
Prasad | Micro-blogging sentiment analysis using bayesian classification methods | |
Escalante et al. | A weighted profile intersection measure for profile-based authorship attribution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||