CN103514174B - A text classification method and device

Info

Publication number: CN103514174B
Application number: CN201210206020.5A
Authority: CN
Inventors: 程童
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-06-18
Filing date: 2012-06-18
Publication date: 2019-01-15
Anticipated expiration: 2032-06-18
Also published as: CN103514174A

Abstract

The present invention provides a text classification method and device. The method comprises: replacing each character in a text to be processed, other than text characters and digits, with a preset fixed character string; counting the total length of the replaced text and the word length contained in the text, and calculating the ratio of the word length to the total text length; calculating a cheating characteristic index of the text to be processed from that ratio; and determining a text to be processed whose cheating characteristic index exceeds a preset threshold to be rubbish text. The invention can effectively compensate for the shortcomings of existing machine learning methods and improve classification accuracy.

Description

A text classification method and device

[technical field]

The present invention relates to the technical field of Internet information, and in particular to a text classification method and device.

[background technique]

With the continuous development of the Internet, more and more users exchange information and share resources online, and the volume of network information is growing rapidly. However, the openness of the Internet also means that a great deal of harmful information exists on the network. Monitoring, filtering, and classifying Internet information has therefore become a common requirement.

Comments (also called messages, replies, and so on) are a key function of Internet community products and an important channel for building an interactive atmosphere around a product. Because publishing a comment is cheap, reaches a wide audience, and has a lasting effect, comment functions have been plagued by junk information ever since they appeared, including advertising links, promotional messages, and pornographic or reactionary content. Posting advertisements has even become an industry in its own right. Posting has shifted from manual to automatic machine posting, and the technology involved keeps advancing and repeatedly breaks through anti-cheating measures.

Existing means of dealing with such junk information fall into two broad classes. One class is institutional, including manual review, user level or user group systems, and strict user admission systems. The other class is technical and can be divided into two kinds. The first kind is mechanical, including verification codes, sensitive-word filtering, frequency control, blacklists, and similar-text strategies. The second kind is intelligent and mainly comprises machine learning methods such as naive Bayes, Fisher, support vector machines, and neural networks.

Institutional methods mainly raise the cost of posting, but while they restrain rubbish text producers (spammers) they also make posting difficult for ordinary users, which is hard to accept in communities with a high degree of openness. Mechanical methods target junk information with fixed features and are easily bypassed once spammers understand them. Intelligent methods have some recognition ability, but because of differences in learning mechanisms, training corpora, and so on, they are difficult to implement. The main factors to consider are the accuracy and recall with which they recognize junk information and normal information.

These existing approaches are fairly effective at judging plain text, but their classification results are unsatisfactory for the following kinds of text. First, for text interspersed with large amounts of punctuation, blanks, tabs, or newlines, the misjudgment rate is high. On the one hand, punctuation is generally filtered out during word segmentation and is not returned as part of the segmentation result, so rubbish text that consists largely of punctuation and similar characters cannot be judged. On the other hand, punctuation and stop words carry no semantics and occur with similar frequency in normal text and rubbish text, so they cannot effectively support the posterior probability, which harms the accuracy of machine classification. Second, classification is also poor for text that consists mainly of website links, QQ numbers, mobile phone numbers, and the like, because word segmentation cannot cut out the effective content of such text. Third, meaningless replies are judged poorly. For example, when a user cheats by means of avatar advertising, the user posts large numbers of comments such as "good experience" or "works well, highly recommended". When such text appears in large quantities in the rubbish text training corpus, it also affects the classification of normal comments and reduces accuracy.

[summary of the invention]

In view of this, the present invention provides a text classification method and device that can recognize each kind of text effectively and accurately and improve classification accuracy.

The specific technical solution is as follows:

A text classification method, comprising the following steps:

S1, replacing each character in a text to be processed, other than text characters and digits, with a preset fixed character string;

S2, counting the total length of the replaced text and the word length contained in the text, and calculating the ratio of the word length to the total text length;

S3, calculating a cheating characteristic index of the text to be processed from the ratio of the word length to the total text length;

S4, determining a text to be processed whose cheating characteristic index exceeds a preset threshold to be rubbish text.

According to a preferred embodiment of the present invention, before step S1 the method further comprises:

preprocessing the characters in the text to be processed other than text characters and digits to remove common punctuation marks;

step S1 then replaces only the remaining characters with the preset fixed character string.

According to a preferred embodiment of the present invention, before step S3 the method further comprises:

finding the numbers of links, numbers, and mailboxes contained in the text to be processed to obtain a link weight and a number weight of the text to be processed;

step S3 weights the obtained link weight and number weight together with a decreasing function of the ratio of the word length to the total text length to obtain the cheating characteristic index of the text to be processed; the larger the link weight and the number weight, the larger the cheating characteristic index.

According to a preferred embodiment of the present invention, the method further comprises:

determining the user name and IP address that submitted the text to be processed;

searching a pre-built user name dictionary or IP dictionary for the submission data corresponding to the user name or IP address, and calculating a cheating user index from the numbers of normal texts and rubbish texts submitted by the user;

step S4 weights or multiplies the cheating user index with the cheating characteristic index, and determines a text to be processed whose calculation result exceeds the preset threshold to be rubbish text.

According to a preferred embodiment of the present invention, the method for building the user name dictionary and the IP dictionary specifically comprises:

obtaining a sample corpus comprising normal text and rubbish text;

recording the user name and IP address that submitted each text in the sample corpus;

counting, for each user name and each IP address respectively, the numbers of uploaded texts marked as normal text and as rubbish text, and generating the user name dictionary and the IP dictionary.

The text to be processed is segmented into words; the normal probability and rubbish probability of each term are looked up in a pre-built Bayes dictionary, and the probability that the text to be processed is rubbish text is calculated as the Bayes index of the text to be processed;

step S4 multiplies or weights the Bayes index with the cheating characteristic index, and determines a text to be processed whose calculation result exceeds the preset threshold to be rubbish text.

The text to be processed is segmented into words; the normal probability and rubbish probability of each term are looked up in a pre-built Fisher dictionary, and the probability that the text to be processed is rubbish text is calculated as the Fisher index of the text to be processed;

step S4 multiplies or weights the Fisher index with the cheating characteristic index, and determines a text to be processed whose calculation result exceeds the preset threshold to be rubbish text.

A text classification device, comprising:

a character replacement module, for replacing each character in a text to be processed, other than text characters and digits, with a preset fixed character string;

a text content calculation module, for counting the total length of the text after replacement by the character replacement module and the word length contained in the text, and calculating the ratio of the word length to the total text length;

a cheating characteristic index calculation module, for calculating the cheating characteristic index of the text to be processed from the ratio of the word length to the total text length;

a classification module, for determining a text to be processed whose cheating characteristic index exceeds a preset threshold to be rubbish text.

According to a preferred embodiment of the present invention, the character replacement module is configured to:

preprocess the characters in the text to be processed other than text characters and digits to remove common punctuation marks;

replace only the characters that remain after the preprocessing.

According to a preferred embodiment of the present invention, the device further comprises:

a number and symbol statistics module, for finding the numbers of links, numbers, and mailboxes contained in the text to be processed to obtain a link weight and a number weight of the text to be processed;

the cheating characteristic index calculation module weights the obtained link weight and number weight together with a decreasing function of the ratio of the word length to the total text length to obtain the cheating characteristic index of the text to be processed; the larger the link weight and the number weight, the larger the cheating characteristic index.

a user information extraction module, for determining the user name and IP address that submitted the text to be processed;

a cheating user index calculation module, for searching a pre-built user name dictionary or IP dictionary for the submission data corresponding to the user name or IP address, and calculating a cheating user index from the proportion of rubbish text historically submitted by the user name or IP address of the text to be processed;

the classification module is further used to weight or multiply the cheating user index with the cheating characteristic index, and to determine a text to be processed whose calculation result exceeds the preset threshold to be rubbish text.

According to a preferred embodiment of the present invention, the module for building the user name dictionary and the IP dictionary specifically comprises:

a corpus acquisition unit, for obtaining a sample corpus comprising normal text and rubbish text;

a user information recording unit, for recording the user name and IP address that submitted each text in the sample corpus;

a statistics unit, for counting, for each user name and each IP address respectively, the numbers of uploaded texts marked as normal text and as rubbish text, and generating the user name dictionary and the IP dictionary.

a Bayes index calculation module, for segmenting the text to be processed into words, looking up the normal probability and rubbish probability of each term in a pre-built Bayes dictionary, calculating the probability that the text to be processed is rubbish text as the Bayes index of the text to be processed, and supplying the Bayes index to the classification module;

the classification module is further used to multiply or weight the Bayes index with the cheating characteristic index, and to determine a text to be processed whose calculation result exceeds the preset threshold to be rubbish text.

a Fisher index calculation module, for segmenting the text to be processed into words, looking up the normal probability and rubbish probability of each term in a pre-built Fisher dictionary, calculating the probability that the text to be processed is rubbish text as the Fisher index of the text to be processed, and supplying the Fisher index to the classification module;

the classification module is further used to multiply or weight the Fisher index with the cheating characteristic index, and to determine a text to be processed whose calculation result exceeds the preset threshold to be rubbish text.

As can be seen from the above technical solutions, the text classification method and device provided by the present invention use cheating features amplified by character replacement, together with auxiliary verification of the user's submission behavior, to effectively recognize text interspersed with large numbers of special symbols, escape characters, and links, as well as the meaningless text that avatar-advertising cheaters post in bulk. Recognition precision improves particularly for short texts such as comments, replies, and messages in communities or forums. Combined with machine learning methods, the invention effectively compensates for the shortcomings of existing machine learning methods and improves classification accuracy.

[Detailed description of the invention]

Fig. 1 is a flow chart of the text classification method provided by embodiment one of the present invention;

Fig. 2 is a flow chart of the text classification method provided by embodiment two of the present invention;

Fig. 3a is a schematic diagram of the content of a certain text and its user information;

Fig. 3b is a schematic diagram of a Bayes dictionary obtained by training with the Bayesian classification method;

Fig. 3c is a schematic diagram of a Fisher dictionary obtained by training with the Fisher classification method;

Fig. 3d is a schematic diagram of the user name dictionary obtained by counting;

Fig. 3e is a schematic diagram of the IP dictionary obtained by counting;

Fig. 4 is a flow chart of the text classification method provided by embodiment three of the present invention;

Fig. 5 is a schematic diagram of the result of processing the text of Fig. 3a according to embodiment three of the present invention;

Fig. 6 is a schematic diagram of the text classification device provided by embodiment four of the present invention;

Fig. 7 is a schematic diagram of the text classification device provided by embodiment five of the present invention;

Fig. 8 is a schematic diagram of the text classification device provided by embodiment six of the present invention.

[specific embodiment]

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.

Embodiment one

Fig. 1 is a flow chart of the text classification method provided in this embodiment. As shown in Fig. 1, the method includes:

S101, replacing each character in the text to be processed, other than text characters and digits, with a preset fixed character string.

First, the special symbols in the text to be processed are replaced with the fixed character string. These include English symbols such as "<>-_`~@#$%^&*()+=|", Chinese symbols (full-width quotation marks, brackets, dashes, question marks, and the like), escape characters such as "\n \t \r \n", and spaces.

The fixed character string may be, but is not limited to, a string longer than 1 formed by repeating the same character, for example the fixed string "$$$$" made of four "$" characters. Every character in the text to be processed other than text characters and digits is replaced with this fixed string. For example, when a text consisting of the phrase "method to make money" surrounded by runs of symbols such as "<----", "---:", and ">>>/" is replaced with the fixed string "$$$$", each symbol becomes "$$$$", so the phrase ends up between two long runs of "$" and the total length of the replaced text increases.

Because punctuation marks, stop words, and the like are filtered out during word segmentation, this step first replaces the special symbols with the fixed character string. This amplifies the feature of these special symbols and increases the influence of this part of the text; the effective text content is then counted.

Of course, normal text also contains common punctuation marks such as "," and ".". These belong to content that normally appears in text and need not be replaced. Therefore, before the replacement in this step, the characters in the text to be processed other than text characters and digits may be preprocessed to remove common punctuation marks first, so that only the remaining characters are replaced with the preset fixed character string. This improves efficiency and accuracy.
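The replacement in S101, with the optional removal of common punctuation, can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the set of characters kept as "text and digits" (here CJK ideographs, ASCII letters, and 0-9), the `COMMON_PUNCT` set, and the function name `replace_special` are all hypothetical choices.

```python
import re

FIXED = "$$$$"  # fixed string from the example in the text

# Assumption: "text and digits" means CJK ideographs, ASCII letters and 0-9.
KEEP = re.compile(r"[0-9A-Za-z\u4E00-\u9FA5\uF900-\uFA2D]")

# Assumption: a hypothetical set of "common" punctuation dropped before replacement.
COMMON_PUNCT = set(",.，。")


def replace_special(text, fixed=FIXED, drop_common=True):
    """Replace every character that is not text or a digit with `fixed`."""
    out = []
    for ch in text:
        if KEEP.match(ch):
            out.append(ch)           # text characters and digits are kept
        elif drop_common and ch in COMMON_PUNCT:
            continue                 # common punctuation is removed, not replaced
        else:
            out.append(fixed)        # every other symbol is amplified
    return "".join(out)
```

With these assumptions, a text padded with symbols grows by three characters for every symbol it contains, while ordinary sentences with a few commas and full stops are left essentially unchanged.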

S102, counting the total length of the replaced text and the word length contained in the text, and calculating the ratio of the word length to the total text length.

Replacing characters with the fixed character string lengthens the text to be processed in proportion to the number of special symbols it contains, such as English symbols, Chinese symbols, or escape characters. The total length of the replaced text, L_ORIG, is counted.

The text characters contained in the text to be processed are found with a regular expression. For example, Chinese characters occupy the coding range 0x8140-0xFEFE in the Chinese Internal Code Extension Specification (GBK), 0xA1A1-0xFEFE in the Chinese character set code for information interchange (GB2312), and \u4E00-\u9FA5 and \uF900-\uFA2D in Unicode. A regular expression built from these coding ranges finds the characters that fall within them; the number of characters found is counted to give the word length L_CHAR.

The ratio of the word length to the total text length, K = L_CHAR / L_ORIG, is then calculated. This is the effective text content.
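A minimal sketch of S102, assuming the input has already been through the replacement step and that only the Unicode ranges quoted above count toward the word length. The function name `text_ratio` and the convention that an empty text gives K = 0 are assumptions made for illustration.

```python
import re

# Chinese characters, using the Unicode ranges quoted in the text.
CJK = re.compile(r"[\u4E00-\u9FA5\uF900-\uFA2D]")


def text_ratio(replaced_text):
    """Return (L_ORIG, L_CHAR, K) for an already-replaced text."""
    l_orig = len(replaced_text)               # total length after replacement
    l_char = len(CJK.findall(replaced_text))  # number of Chinese characters
    # Assumption: an empty text is given K = 0 to avoid dividing by zero.
    k = l_char / l_orig if l_orig else 0.0
    return l_orig, l_char, k
```

For example, `text_ratio("$$$$赚钱方法$$$$")` gives a total length of 12, a word length of 4, and K = 4/12.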

S103, calculating the cheating characteristic index of the text to be processed from the ratio of the word length to the total text length.

Rubbish text interspersed with large amounts of punctuation, blanks, escape characters, and the like generally contains few text characters and a large proportion of non-text symbols. In other words, the lower the effective text content of a text, the more pronounced its cheating characteristic and the greater the probability that it is rubbish text.

Accordingly, a cheating characteristic index function is constructed in advance from the ratio K of the word length to the total text length and is used to calculate the cheating characteristic index. Specifically, a decreasing function of K is used as the cheating characteristic index function. The decreasing function may be, but is not limited to:

Score_feature1 = 2 / (1 + K) (formula 1)

In formula 1, because the ratio K of the word length to the total text length takes values in [0, 1], (1 + K) takes values in [1, 2] and 1 / (1 + K) takes values in [0.5, 1], so the score given by the formula is distributed over [1, 2]. The lower the value of K, the higher the score, the more pronounced the cheating characteristic of the text, and the greater the probability that the text is rubbish text.

S104, determining a text to be processed whose cheating characteristic index exceeds the preset threshold to be rubbish text.

The preset threshold is a classification threshold obtained by observing the data set to be processed. A preset threshold for the cheating characteristic index is set according to the needs of the actual application scenario and past experience. Whether the cheating characteristic index Score_feature1 calculated in step S103 exceeds the preset threshold is then judged; if it does, the text is identified as rubbish text.
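Steps S103 and S104 can be sketched together as below. The decreasing function is taken to be 2 / (1 + K), which is an assumption: it is the simplest form consistent with the ranges stated above (1 / (1 + K) in [0.5, 1], score in [1, 2]). The threshold value 1.6 is purely hypothetical; the text says it should be chosen by observing the data set.

```python
def cheating_feature_index(k):
    """Decreasing function of K; assumed form 2 / (1 + K), range [1, 2]."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("K must lie in [0, 1]")
    return 2.0 / (1.0 + k)


def is_rubbish(k, threshold=1.6):
    """S104: flag the text when the index exceeds the preset threshold.

    The default threshold is a hypothetical value for illustration only.
    """
    return cheating_feature_index(k) > threshold
```

Under these assumptions a text whose Chinese characters make up 10% of the replaced length scores about 1.82 and is flagged, while a text at 80% scores about 1.11 and passes on to any further classifiers.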

The method provided in this embodiment can therefore effectively identify rubbish text interspersed with large numbers of specific characters such as punctuation, spaces, and escape characters.

It should be noted that the texts remaining after this identification, that is, those whose cheating characteristic index Score_feature1 does not exceed the preset threshold, may be classified again with an existing classification method such as Bayesian classification or a support vector machine.

Embodiment two

Fig. 2 is a flow chart of the text classification method provided in this embodiment. As shown in Fig. 2, the method includes:

Step S201, replacing the characters in the text to be processed other than text characters and digits with a preset fixed character string.

This step is the same as step S101 in embodiment one and is not repeated here.

Step S202, counting the total length of the replaced text and the word length contained in the text, and calculating a text proportion weight from the ratio of the word length to the total text length.

The ratio K of the word length to the total text length is calculated as in step S102 of embodiment one, that is, K = L_CHAR / L_ORIG.

The text proportion weight Score_char is calculated from the ratio of the word length to the total text length and may use, but is not limited to, the following formula:

(formula 2)

Step S203, finding the numbers of links, numbers, and mailboxes contained in the text to be processed to obtain the link weight and number weight of the text to be processed.

The numbers of links, QQ numbers, mobile phone numbers, and mailboxes are found with regular expressions. For example, in Python, the regular expression re.compile(r"[0-9]{5,9}") may be used to find mobile phone numbers or QQ numbers, re.compile(r"\w+@\w+\.\w+") to find email addresses, and re.compile(r"[http:/]*\w+[\w+\.]+[comnedugvtn]{2,6}") to find links to web addresses.

The link weight Score_link may be, but is not limited to, the sum of the numbers of links and mailboxes contained in the text. Correspondingly, the number weight Score_digit may be, but is not limited to, the sum of the numbers of QQ numbers, mobile phone numbers, and similar numbers contained in the text.
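The counting in step S203 might be sketched as follows. The patterns here are stricter stand-ins chosen for illustration rather than the ones quoted in the text, the list of top-level domains is an arbitrary assumption, and removing email addresses before counting links (so that a mailbox is not counted twice) is a design choice of this sketch, not something the text prescribes.

```python
import re

# Hypothetical patterns for illustration; tune them for real data.
EMAIL_RE = re.compile(r"\w+@\w+\.\w+")
NUMBER_RE = re.compile(r"[0-9]{5,9}")  # QQ-like or phone-like digit runs
LINK_RE = re.compile(
    r"(?:https?://)?(?:[A-Za-z0-9-]+\.)+(?:com|net|edu|gov|org|cn)(?![A-Za-z])"
)


def link_and_number_weights(text):
    """Return (Score_link, Score_digit) as simple counts."""
    emails = EMAIL_RE.findall(text)
    # Strip mailboxes first so their domain part is not also counted as a link.
    rest = EMAIL_RE.sub(" ", text)
    links = LINK_RE.findall(rest)
    numbers = NUMBER_RE.findall(rest)
    score_link = len(links) + len(emails)   # links plus mailboxes
    score_digit = len(numbers)              # QQ numbers, phone numbers, etc.
    return score_link, score_digit
```

For a comment containing one nine-digit number, one URL, and one email address, this returns a link weight of 2 and a number weight of 1.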

Step S204, weighting the obtained link weight and number weight together with a decreasing function of the ratio of the word length to the total text length to obtain the cheating characteristic index of the text to be processed.

The specific weighting formula may be, but is not limited to:

Score_feature2 = Score_char + 0.5 × Score_link + 0.5 × Score_digit (formula 3)

As the formula shows, the larger the link weight Score_link and the number weight Score_digit, the larger the cheating characteristic index. Likewise, the smaller the ratio of the word length to the total text length, the larger the calculated text proportion weight Score_char and the larger the cheating characteristic index.
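Formula 3 translates directly into a small function. The 0.5 coefficients come from the formula; exposing them as the parameters `w_link` and `w_digit` is an assumption made here because the text says the weighting is not limited to this form.

```python
def score_feature2(score_char, score_link, score_digit, w_link=0.5, w_digit=0.5):
    """Formula 3: weighted sum of the text proportion weight,
    the link weight and the number weight."""
    return score_char + w_link * score_link + w_digit * score_digit
```

For instance, a text with Score_char = 1.5, two links or mailboxes, and one number scores 1.5 + 0.5 × 2 + 0.5 × 1 = 3.0.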

Step S205, determining the user name and IP address that submitted the text to be processed.

For the text to be processed, the user information is obtained to determine the user name and IP address that submitted it.

Fig. 3a is a schematic diagram of the content of a certain text and its user information. As shown in Fig. 3a, the user name that submitted the text is sx1816, the user IP is 114.228.210.130, and the text content is "East China Area 1-Sun, Moon and Stars 282zzd8010101060000700067b4t0zmcb50e0".

Step S206, searching the pre-built user name dictionary or IP dictionary for the submission data corresponding to the user name or IP address, and calculating the cheating user index from the numbers of normal texts and rubbish texts submitted by the user.

The user name dictionary and IP dictionary are obtained in advance by classifying and counting a certain volume of historical data: the numbers of normal texts and rubbish texts among the texts submitted by each user are counted by submitting user name and by IP address, and recorded to generate the user name dictionary and the IP dictionary respectively.

The cheating user index is calculated from the proportion of rubbish text historically submitted by the user name or IP address of the text to be processed.

Using the user name and IP address determined in step S205, the submission data corresponding to the user is looked up in the pre-built user name dictionary and IP dictionary, and the numbers of normal texts and rubbish texts submitted by the user are recorded as h_num and s_num respectively.

The cheating user index Score_user may be calculated with, but is not limited to, the following formula:

(formula 4)

Here T is a reference value for the number of rubbish texts, a dividing line obtained by observing the numbers of rubbish texts submitted by normal users and by junk users. It may be set according to the actual situation, for example between 6 and 10.

As formula 4 shows, the cheating user index is mainly influenced by the number of rubbish texts the user has historically submitted and by the proportion of rubbish text in the user's total submissions. For junk users both indicators are usually high. Even if some comments by a normal user have been marked as rubbish text, the user's rubbish text proportion is low, so the resulting cheating user index is also low.

Of course, the features of the user name may also be considered when calculating the cheating user index. Cheating users are often registered by machine, so their user names show certain features: for example, they combine letters and digits according to some rule, or contain wording such as "add Q", "QQ make friends", "contact", "koukou", "add me", or "click me". For users whose names have such features, the cheating user index may be further adjusted in weight.

Step S207, weighting or multiplying the cheating user index with the cheating characteristic index, and determining a text to be processed whose calculation result exceeds the preset threshold to be rubbish text.

The final score of the text to be processed is obtained by weighting or multiplying the cheating characteristic index from step S204 with the cheating user index from step S206. This embodiment calculates the final score by multiplication.

The preset threshold is likewise a classification threshold obtained by observing the data set to be processed. Whether the weighted or multiplied result exceeds the preset threshold is judged; if it does, the text is identified as rubbish text.

Embodiment three

In this embodiment, a Bayes dictionary, a Fisher dictionary, a user name dictionary, and an IP dictionary are first built in advance by generating dictionaries offline. The specific building method includes:

Step S301, obtaining a sample corpus comprising normal text and rubbish text.

The sample corpus may use existing historical data of a certain volume, formed from the texts, comments, or replies that different user names or IP addresses have submitted and that have accumulated on the network.

The classification into normal text and rubbish text may be obtained with an existing classification method, or by manual labeling, distinguishing texts in the sample corpus that administrators or other users have marked as rubbish text from unmarked normal texts.

Step S302, performing word segmentation on the texts in the sample corpus, counting the occurrences of each term, calculating the probability that each term belongs to normal text and to rubbish text, and generating a classification dictionary.

The machine learning method may be an existing method such as Bayesian classification or Fisher classification, each forming its own classification dictionary. Fig. 3b is a schematic diagram of a Bayes dictionary obtained by training with the Bayesian classification method, and Fig. 3c is a schematic diagram of a Fisher dictionary obtained by training with the Fisher classification method. As shown in Figs. 3b and 3c, each dictionary contains the terms together with the normal probability and rubbish probability of each term.
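Step S302 can be illustrated with a small sketch that builds a term dictionary of (normal probability, rubbish probability) pairs. Several things here are assumptions rather than details from the text: the input is taken to be already segmented (a real system would call a Chinese word segmenter first), each term is counted once per text, and add-one smoothing is applied so that unseen combinations do not produce zero probabilities. The function name `build_term_dictionary` is hypothetical.

```python
from collections import Counter


def build_term_dictionary(samples):
    """samples: iterable of (tokens, is_rubbish) pairs, tokens pre-segmented.

    Returns {term: (p_normal, p_rubbish)}, where each value estimates the
    probability that the term appears in a normal or a rubbish text.
    """
    normal_docs = rubbish_docs = 0
    normal_counts, rubbish_counts = Counter(), Counter()
    for tokens, is_rubbish in samples:
        terms = set(tokens)  # count each term at most once per text
        if is_rubbish:
            rubbish_docs += 1
            rubbish_counts.update(terms)
        else:
            normal_docs += 1
            normal_counts.update(terms)

    dictionary = {}
    for term in set(normal_counts) | set(rubbish_counts):
        # Add-one smoothing: an assumption, not specified in the text.
        p_normal = (normal_counts[term] + 1) / (normal_docs + 2)
        p_rubbish = (rubbish_counts[term] + 1) / (rubbish_docs + 2)
        dictionary[term] = (p_normal, p_rubbish)
    return dictionary
```

The resulting mapping has the same shape as the dictionaries described for Figs. 3b and 3c: one entry per term, holding its normal probability and its rubbish probability.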

Step S303, recording the user name and IP address that submitted each text in the sample corpus.

The user name and IP address of each text are extracted from the sample corpus. Fig. 3a is a schematic diagram of the content of a certain text and its user information. As shown in Fig. 3a, the user name that submitted the text is sx1816, the user IP is 114.228.210.130, and the text content is "East China Area 1-Sun, Moon and Stars 282zzd8010101060000700067b4t0zmcb50e0".

Step S304, counting, for each user name and each IP address respectively, the numbers of uploaded texts marked as normal text and as rubbish text, and generating the user name dictionary and the IP dictionary.

Fig. 3d is a schematic diagram of the user name dictionary obtained by counting. As shown in Fig. 3d, the dictionary contains each user name and the corresponding numbers of normal texts and rubbish texts submitted.

Fig. 3e is a schematic diagram of the IP dictionary obtained by counting. As shown in Fig. 3e, the dictionary contains each IP and the corresponding numbers of normal texts and rubbish texts submitted.
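Steps S303 and S304 amount to two grouped counts over the labeled corpus. The sketch below assumes each record is a (user name, IP address, is_rubbish) tuple and reuses the names h_num and s_num from embodiment two for the normal and rubbish counts; the record layout and the function name are hypothetical.

```python
from collections import defaultdict


def build_user_ip_dictionaries(records):
    """records: iterable of (user_name, ip, is_rubbish) tuples.

    Returns (user_dict, ip_dict); each maps a key to
    {"h_num": normal text count, "s_num": rubbish text count}.
    """
    user_dict = defaultdict(lambda: {"h_num": 0, "s_num": 0})
    ip_dict = defaultdict(lambda: {"h_num": 0, "s_num": 0})
    for user_name, ip, is_rubbish in records:
        field = "s_num" if is_rubbish else "h_num"
        user_dict[user_name][field] += 1
        ip_dict[ip][field] += 1
    return dict(user_dict), dict(ip_dict)
```

Keeping raw counts rather than ratios leaves room for any formula that needs both the absolute number of rubbish texts and their share of the total.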

The offline learning that generates the dictionaries can run periodically during the service period and learn automatically, achieving the effect of autonomous learning. In addition, the dictionaries can be backed up continually so that the existing data can be reloaded when the server restarts. When a newer dictionary is provided, the service needs to merge it with the dictionary already loaded: values are replaced for identical indexes (keys), and indexes that did not previously exist are added.
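The merge rule described here (replace values for keys that already exist, add keys that do not) matches the behavior of Python's `dict.update`. A minimal sketch, assuming the dictionaries are held as plain `dict` objects and that the loaded dictionary should not be modified in place:

```python
def merge_dictionary(loaded, newer):
    """Merge a newer dictionary into the loaded one.

    Keys present in both take the newer value; keys only in `newer`
    are added; keys only in `loaded` are kept unchanged.
    """
    merged = dict(loaded)   # copy so the loaded dictionary is left untouched
    merged.update(newer)
    return merged
```

Returning a new dictionary lets the service swap the reference in one step, so lookups made during the merge still see a consistent dictionary.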

Fig. 4 is a flow chart of the text classification method provided in this embodiment. As shown in Fig. 4, the method comprises:

Step S401, the characters in the text to be processed other than text and digits are replaced with a preset fixed character string; the ratio of the word length to the total text length and the numbers of links, numbers and mailboxes contained in the text are calculated; and the cheating characteristic index Score_feature of the text to be processed is calculated.

The processing of this step is the same as that of steps S201 to S204 in embodiment two and is not repeated here.

Step S402, the user name dictionary or IP dictionary constructed in advance is used to look up the proportion of rubbish texts historically submitted by the user name or IP address of the text to be processed, and the cheating user index Score_user of the text to be processed is calculated.

The processing of this step is the same as that of steps S205 to S206 in embodiment two and is not repeated here.

Step S403, the text to be processed is segmented into words; the Bayes dictionary constructed in advance is used to look up the normal probability and rubbish probability of each term; and the probability that the text to be processed is rubbish text is calculated as the Bayes index of the text to be processed.

Bayesian classification is based on Bayes' theorem and the law of total probability. Bayes' theorem is essentially a way of calculating conditional probability, that is, the probability that event A occurs given that event B has occurred: P(A|B) = P(B|A)P(A)/P(B). The law of total probability computes the probability of a complex event from the probabilities of simple events; for example, if A and A' form a partition of the sample space, the probability of event B is P(B) = P(B|A)P(A) + P(B|A')P(A').

Applying Bayes' theorem to text classification rests on the following interpretation. P(A) is called the "prior probability", that is, an estimate of the probability of A before event B occurs. P(A|B) is called the "posterior probability", that is, the re-evaluated probability of A after event B has occurred. P(B|A)/P(B) is called the "likelihood function"; it is an adjustment factor that brings the estimated probability closer to the true probability, and its value is obtained experimentally. If it is greater than 1, the prior probability is strengthened; if it equals 1, event B does not help in judging the likelihood of event A; if it is less than 1, the prior probability is weakened.

Text classification proceeds as follows. Let S denote rubbish text (Spam), H denote normal text (Healthy) and W denote a word (Word), with P(S) = P(H) = 50% under normal conditions. The problem is then to calculate the probability that a text is S given that W occurs, denoted P(S|W). From the formulas above, P(S|W) = P(W|S)P(S)/(P(W|S)P(S) + P(W|H)P(H)), where P(W|S) and P(W|H) are the probabilities that W occurs in rubbish text and in normal text respectively, both of which can be obtained by counting. The probability that a text is rubbish text is thus inferred from the frequencies of the words in the text. When the inference is based on several words contained in the text, the joint probability formula can be used: writing P(S|W₁) as P₁ and P(S|W₂) as P₂, the final probability is P = P₁P₂/(P₁P₂ + (1-P₁)(1-P₂)).

The final probability obtained is used as the Bayes index Score_bayes of the text to be processed.
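As a non-limiting sketch, the per-word posterior and the joint probability formula above can be written as follows. The extension from two words to any number of words, and the neutral value 0.5 returned when a denominator is zero, are assumptions of this sketch; the function names are hypothetical.

```python
def word_rubbish_probability(p_w_given_s, p_w_given_h, p_s=0.5, p_h=0.5):
    """P(S|W) = P(W|S)P(S) / (P(W|S)P(S) + P(W|H)P(H))."""
    denominator = p_w_given_s * p_s + p_w_given_h * p_h
    if denominator == 0:
        return 0.5  # assumption: a word never seen in training is neutral
    return p_w_given_s * p_s / denominator


def combine_word_probabilities(probabilities):
    """Joint probability P = prod(P_i) / (prod(P_i) + prod(1 - P_i)).

    For two words this is P1*P2 / (P1*P2 + (1-P1)*(1-P2)).
    """
    rubbish = 1.0
    normal = 1.0
    for p in probabilities:
        rubbish *= p
        normal *= 1.0 - p
    if rubbish + normal == 0:
        return 0.5  # assumption: contradictory evidence is treated as neutral
    return rubbish / (rubbish + normal)


p1 = word_rubbish_probability(0.3, 0.1)
score_bayes = combine_word_probabilities([p1, 0.9])
print(p1, score_bayes)
```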

Step S404, the text to be processed is segmented into words; the Fisher dictionary constructed in advance is used to look up the normal probability and rubbish probability of each term; and the probability that the text to be processed is rubbish text is calculated as the Fisher index of the text to be processed.

Fisher classification is similar to Bayesian classification and is an alternative to it; the two can also be used in combination. Unlike the Bayesian method, which calculates with word frequencies, the Fisher method counts document probabilities, that is, the probabilities that a document belongs to S and to H when W occurs, and a category must be specified when the result is obtained. The calculation is as follows. Let C be the category (S or H above), and let P(C|W) be the probability that a text belongs to category C when W occurs, which can be obtained by statistics over the training texts. For an input text, the probability that it belongs to rubbish text is P(S) = P(S|W)/(P(S|W) + P(H|W)). For the joint probability of several words the results can be multiplied, P(S) = P(S1)P(S2)..., and P(H) is obtained in the same way. In the Fisher method, the P(C) obtained may be further processed as follows: the result of -2*log(P(C)) is passed to the inverse chi-square function, which returns the final result.

The Fisher index is Score_fisher = P(S)/(P(S) + P(H)), that is, the probability of belonging to the rubbish category as judged by the Fisher method is normalized.
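A minimal sketch of the Fisher index calculation described above follows. The text does not define the inverse chi-square function; the series form used here, and the choice of twice the number of words as degrees of freedom, follow a common implementation of Fisher's method and are assumptions of this sketch. The neutral value 1 for P(S) = P(H) reflects the behavior this embodiment specifies for texts that cannot be judged. All function names are hypothetical.

```python
import math


def inverse_chi_square(chi, degrees_of_freedom):
    """Upper-tail chi-square probability for an even number of degrees of freedom."""
    half = chi / 2.0
    term = math.exp(-half)
    total = term
    for i in range(1, degrees_of_freedom // 2):
        term *= half / i
        total += term
    return min(total, 1.0)


def fisher_category_probability(word_probabilities):
    """Multiply the per-word P(C|W) values, then pass -2*log(product)
    to the inverse chi-square function."""
    product = 1.0
    for p in word_probabilities:
        product *= p
    if product <= 0:
        return 0.0
    return inverse_chi_square(-2.0 * math.log(product), 2 * len(word_probabilities))


def fisher_score(rubbish_probabilities, normal_probabilities):
    """Score_fisher = P(S) / (P(S) + P(H)), with 1 when P(S) equals P(H)."""
    p_s = fisher_category_probability(rubbish_probabilities)
    p_h = fisher_category_probability(normal_probabilities)
    if p_s == p_h:
        return 1.0  # the text cannot be judged; 1 is neutral in a product
    return p_s / (p_s + p_h)


print(fisher_score([0.9, 0.9], [0.1, 0.1]))
```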

It is worth noting that for a text that Fisher classification cannot judge, that is, when P(S) = P(H), Score_fisher = 1, which indicates that the Fisher index is ineffective for that text.

Step S405, the Bayes index, the Fisher index, the cheating user index and the cheating characteristic index are weighted or multiplied, and a text to be processed whose calculated result exceeds a preset threshold is determined to be rubbish text.

The final score of the text to be processed may be obtained by multiplying the four indices. Specifically:

Score=Score_feature*Score_user*Score_bayes*Score_fisher (formula 5)

The preset threshold is likewise a classification threshold obtained by observing the data set to be processed. It is judged whether the weighted or multiplied result exceeds the preset threshold; if so, the text is identified as rubbish text. In this embodiment, 1.0 serves as a boundary that distinguishes normal text from rubbish text well.

For the text shown in Fig. 3 a, the text is at " sun, the moon and the stars " followed by a large amount of spaces and line feed character, warp After crossing this instance processes, result as shown in figure 5, it is 0.406481 that obtain Bayes's index, which be 0.034608, Fei Sheer index, The two indexs all fail to identify the feature in space, symbol, and cheating characteristic index is 2.785714, illustrate the side of feature expansion Formula can more effectively identify such text, and cheating user's index is 1.000000, and what the user did not submit before explanation goes through History, final score 0.0391875824319, is classified as normal text.

This example demonstrates the weakness of the Fisher and Bayes methods, and also shows that feature expansion more effectively identifies text mixed with large amounts of punctuation, blanks, tabs or line feeds. When similar texts appear in large numbers and are deleted by administrators, the Bayes and Fisher indices become correspondingly more accurate after a few rounds of dictionary learning. At the same time, because records of cheating comments exist, the cheating user index rises accordingly. The classifier constructed in this way therefore identifies such texts better, and the cheating mode of avatar advertising can also be identified well. In addition, because the method takes into account the numbers of links and numbers contained in the text, it also classifies well texts whose main content is website links, QQ numbers, mobile phone numbers and the like. The invention proposes feature expansion combined with submission behavior and combines them with existing machine learning methods, improving classification accuracy.

The above is a detailed description of the method provided by the invention. The text classification device provided by the invention is described in detail below.

Embodiment four

Fig. 6 is a schematic diagram of the text classification device provided in this embodiment. As shown in Fig. 6, the device comprises:

Character replacement module 601, for replacing each character in the text to be processed other than text and digits with a preset fixed character string.

Character replacement module 601 first replaces the special symbols in the text to be processed, such as English symbols " < >-_ `~@# $ % ^&* () +=| ", Chinese symbols " " " $ () ---? ", escape characters such as "\n", "\t" and "\r\n", and spaces, with the fixed character string.

Because punctuation marks, stop words and the like are filtered out during word segmentation, this module first replaces these special symbols with the fixed character string. This expands the feature of the special symbols and increases the influence of the special-symbol portion before the effective text content is counted.

Of course, normal text contains some common punctuation marks such as "," and "."; these belong to content that normally appears in text and need not be replaced. Character replacement module 601 may therefore first pre-process the characters in the text to be processed other than text and digits to remove common punctuation marks, and replace only the remaining characters with the preset fixed character string, which improves efficiency and accuracy.
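By way of example only, the replacement performed by module 601, including the optional removal of common punctuation, might look like the sketch below. The fixed string `#####`, the set of common punctuation marks, and the use of `str.isalnum` to decide what counts as text and digits are all assumptions of this sketch.

```python
FIXED_STRING = "#####"          # hypothetical preset fixed character string
COMMON_PUNCTUATION = ",.，。"    # hypothetical set of common punctuation marks


def replace_special_characters(text, fixed=FIXED_STRING, drop_common=True):
    """Replace every character that is neither text nor a digit with `fixed`.

    When `drop_common` is true, common punctuation is removed first so that
    it is not expanded.
    """
    if drop_common:
        text = "".join(ch for ch in text if ch not in COMMON_PUNCTUATION)
    return "".join(ch if ch.isalnum() else fixed for ch in text)


print(replace_special_characters("ab, c\n", fixed="#"))
print(replace_special_characters("ab, c\n", fixed="#", drop_common=False))
```

Because each special character becomes several characters, a text padded with spaces or line feeds grows much longer than its actual word content, which is what the ratio computed by the next module measures.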

Text content computing module 602, for counting the total text length after replacement by character replacement module 601 and the word length contained in the text, and calculating the ratio of the word length to the total text length.

Text content computing module 602 calculates the ratio of the word length to the total text length as K = L_CHAR/L_ORIG, that is, the effective text content.

Cheating characteristic index computing module 603, for calculating the cheating characteristic index of the text to be processed from the ratio of the word length to the total text length.

A cheating characteristic index function is constructed in advance from the ratio K of the word length to the total text length and is used to calculate the cheating characteristic index. Specifically, a decreasing function of K is used as the cheating characteristic index function. The decreasing function can be, but is not limited to being, calculated using formula 1. In formula 1, since K takes values in [0,1], (1+K) takes values in [1,2] and 1/(1+K) takes values in [0.5,1], so the cheating characteristic index is distributed over [1,2]. The lower the value of K, the higher the score, the more pronounced the cheating characteristics of the text, and the greater the probability that the text is rubbish text.
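The ratio K and a function of K with the stated range can be sketched as follows. Formula 1 is not reproduced in this passage; the function 2/(1+K) below is an assumption, chosen only because it decreases as K grows and maps [0,1] onto [1,2]. The fixed string, the use of `str.isalnum`, the value returned for empty text and the function names are likewise hypothetical.

```python
def text_ratio(text, fixed="#####"):
    """K = L_CHAR / L_ORIG.

    L_CHAR counts the text and digit characters; L_ORIG is the total
    length after every other character has been replaced by `fixed`.
    """
    l_char = sum(1 for ch in text if ch.isalnum())
    l_orig = l_char + len(fixed) * (len(text) - l_char)
    if l_orig == 0:
        return 1.0  # assumption: empty text carries no cheating signal
    return l_char / l_orig


def cheating_feature_score(k):
    """An assumed function of K that decreases from 2 (K=0) to 1 (K=1)."""
    return 2.0 / (1.0 + k)


k = text_ratio("ab  ", fixed="###")
print(k, cheating_feature_score(k))
```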

Categorization module 604, for determining a text to be processed whose cheating characteristic index exceeds a preset threshold to be rubbish text.

The preset threshold is a classification threshold obtained by observing the data set to be processed. A preset threshold for the cheating characteristic index is set according to the requirements of the actual application scenario and previous experience, and it is judged whether the cheating characteristic index Score_feature1 calculated by cheating characteristic index computing module 603 exceeds the preset threshold; if so, the text is identified as rubbish text.

Thus, the device provided in this embodiment effectively identifies rubbish text mixed with large amounts of specific characters such as punctuation, spaces and escape characters.

It is worth noting that the texts to be processed that remain after this identification, that is, the texts whose cheating characteristic index Score_feature1 does not exceed the preset threshold, may further be classified by an existing classifier, such as a Bayesian classifier or a support vector machine.

Embodiment five

Fig. 7 is a schematic diagram of the text classification device provided in this embodiment. As shown in Fig. 7, the device comprises:

Character replacement module 701, for replacing the characters in the text to be processed other than text and digits with a preset fixed character string.

This module is the same as module 601 in embodiment four and is not described again here.

Text content computing module 702, for counting the total text length after replacement by character replacement module 701 and the word length contained in the text, and calculating a text proportion weight from the ratio of the word length to the total text length.

The ratio K of the word length to the total text length used in this module is calculated in the same way as in module 602 of embodiment four, that is, K = L_CHAR/L_ORIG.

Text content computing module 702 calculates the text proportion weight Score_char from the ratio of the word length to the total text length; this can be, but is not limited to being, calculated using formula 2.

Numerical character statistical module 703, for finding the numbers of links, numbers and mailboxes contained in the text to be processed and obtaining the link weight and number weight of the text to be processed.
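As an illustration only, the counting performed by module 703 could use regular expressions such as those below. The patterns, the minimum length of five digits for a number, and the function name are hypothetical; how the counts are converted into the link weight and number weight is left to the formulas of the embodiment and is not sketched here.

```python
import re

# Hypothetical patterns; a production system would likely need broader ones.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
MAILBOX_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d{5,}")  # e.g. QQ numbers or mobile phone numbers


def count_links_numbers_mailboxes(text):
    """Return how many links, numbers and mailboxes the text contains."""
    return {
        "links": len(LINK_RE.findall(text)),
        "numbers": len(NUMBER_RE.findall(text)),
        "mailboxes": len(MAILBOX_RE.findall(text)),
    }


sample = "visit http://example.com or mail a@example.com, QQ 123456789"
print(count_links_numbers_mailboxes(sample))
```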

Cheating characteristic index computing module 704, for weighting the decreasing function of the ratio of the word length to the total text length with the obtained link weight and number weight, to obtain the cheating characteristic index of the text to be processed.

The specific weighting in cheating characteristic index computing module 704 can be, but is not limited to being, calculated using formula 3. As formula 3 shows, the larger the link weight Score_link and the number weight Score_digit, the larger the cheating characteristic index; likewise, the smaller the ratio of the word length to the total text length, the larger the calculated text proportion weight Score_char and the larger the cheating characteristic index.

User information extraction module 705, for determining the user name and IP address that submitted the text to be processed.

Cheating user index computing module 706, for looking up, in the user name dictionary or IP dictionary constructed in advance, the submission status data corresponding to the user name or IP address, and calculating the cheating user index from the numbers of normal texts and rubbish texts submitted by the user.

Using the user name and IP address determined by user information extraction module 705, the submission status data of the user is looked up in the user name dictionary and IP dictionary constructed in advance, and the numbers of normal texts and rubbish texts submitted by the user are recorded as h_num and s_num respectively.

The cheating user index Score_user can be, but is not limited to being, calculated using formula 4.

Of course, when calculating the cheating user index, cheating user index computing module 706 may also consider the characteristics of the user name. Cheating users are registered by machine, and their user names may show certain characteristics, for example letters and digits composed according to a certain rule, or wording such as "add Q, QQ, make friends, connection, button button adds me, puts me". For a user with such characteristics, the cheating user index may be further adjusted in weight.

Categorization module 707, for weighting or multiplying the cheating user index and the cheating characteristic index, and determining a text to be processed whose calculated result exceeds a preset threshold to be rubbish text.

The final score of the text to be processed is obtained by weighting or multiplying the cheating characteristic index obtained by cheating characteristic index computing module 704 and the cheating user index obtained by cheating user index computing module 706; in this embodiment the final score is calculated by multiplication.

Embodiment six

In this embodiment, the Bayes dictionary, the Fisher dictionary, the user name dictionary and the IP dictionary are first constructed in advance by generating dictionaries offline. Specifically, the establishing module comprises:

Corpus acquiring unit, for obtaining a sample corpus comprising normal texts and rubbish texts.

Machine classification unit, for performing word segmentation on the texts in the sample corpus, counting the occurrences of each term, calculating the probability that each term belongs to normal text and to rubbish text, and generating the classification dictionaries.

User information recording unit, for recording the user name and IP address that submitted each text in the sample corpus.
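The term statistics produced by the machine classification unit can be sketched as follows. Whitespace splitting stands in for a real word segmenter, and the probability of a term within a category is taken as its share of all term occurrences in that category; both choices, together with the function name and the sample corpus, are assumptions of this sketch.

```python
from collections import Counter


def build_classification_dictionary(samples, tokenize=str.split):
    """Map each term to its relative frequency in normal and rubbish texts.

    `samples` is an iterable of (text, label) pairs, where label is
    "normal" or "rubbish".
    """
    counts = {"normal": Counter(), "rubbish": Counter()}
    for text, label in samples:
        counts[label].update(tokenize(text))
    totals = {label: sum(c.values()) for label, c in counts.items()}
    vocabulary = set(counts["normal"]) | set(counts["rubbish"])
    dictionary = {}
    for term in vocabulary:
        dictionary[term] = {
            label: (counts[label][term] / totals[label]) if totals[label] else 0.0
            for label in counts
        }
    return dictionary


# Hypothetical sample corpus.
corpus = [
    ("buy now", "rubbish"),
    ("buy cheap", "rubbish"),
    ("nice post", "normal"),
]
print(build_classification_dictionary(corpus)["buy"])
```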

Fig. 8 is a schematic diagram of the text classification device provided in this embodiment. As shown in Fig. 8, the device comprises:

Cheating characteristic index processing module 801, for replacing the characters in the text to be processed other than text and digits with a preset fixed character string, calculating the ratio of the word length to the total text length and the numbers of links, numbers and mailboxes contained in the text, and calculating the cheating characteristic index Score_feature of the text to be processed.

The processing of this module is the same as that of modules 701 to 704 in embodiment five and is not described again here.

Cheating user index processing module 802, for using the user name dictionary or IP dictionary constructed in advance to look up the proportion of rubbish texts historically submitted by the user name or IP address of the text to be processed, and calculating the cheating user index Score_user of the text to be processed.

The processing of this module is the same as that of modules 705 to 706 in embodiment five and is not described again here.

Bayes index computing module 803, for segmenting the text to be processed into words, using the Bayes dictionary constructed in advance to look up the normal probability and rubbish probability of each term obtained by segmentation, and calculating the probability that the text to be processed is rubbish text as the Bayes index of the text to be processed.

Bayes index computing module 803 uses the final probability obtained as the Bayes index Score_bayes of the text to be processed.

Fisher index computing module 804, for segmenting the text to be processed into words, using the Fisher dictionary constructed in advance to look up the normal probability and rubbish probability of each term obtained by segmentation, and calculating the probability that the text to be processed is rubbish text as the Fisher index of the text to be processed.

Fisher classification is similar to Bayesian classification and is an alternative to it; the two can also be used in combination. Unlike the Bayesian method, which calculates with word frequencies, the Fisher method counts document probabilities, that is, the probabilities that a document belongs to S and to H when W occurs, and a category must be specified when the result is obtained. The calculation is as follows. Let C be the category (S or H above), and let P(C|W) be the probability that a text belongs to category C when W occurs, which can be obtained by statistics over the training texts. For an input text, the probability that it belongs to rubbish text is P(S) = P(S|W)/(P(S|W) + P(H|W)). For the joint probability of several words the results can be multiplied, P(S) = P(S1)P(S2)..., and P(H) is obtained in the same way. In the Fisher method, the P(C) obtained may be further processed as follows: the result of -2*log(P(C)) is passed to the inverse chi-square function, which returns the final result.

Categorization module 805, for weighting or multiplying the Bayes index, the Fisher index, the cheating user index and the cheating characteristic index, and determining a text to be processed whose calculated result exceeds a preset threshold to be rubbish text.

The final score of the text to be processed may be obtained by multiplying the four indices, as in formula 5.

This example demonstrates the weakness of the Fisher and Bayes methods, and also shows that feature expansion more effectively identifies text mixed with large amounts of punctuation, blanks, tabs or line feeds. When similar texts appear in large numbers and are deleted by administrators, the Bayes and Fisher indices become correspondingly more accurate after a few rounds of dictionary learning. At the same time, because records of cheating comments exist, the cheating user index rises accordingly. The classifier constructed in this way therefore identifies such texts better, and the cheating mode of avatar advertising can also be identified well. In addition, because the invention takes into account the numbers of links and numbers contained in the text, it also classifies well texts whose main content is website links, QQ numbers, mobile phone numbers and the like.

The text classification method and device provided by the invention classify text using text features and user behavior in combination with machine learning methods. They classify all kinds of text accurately and effectively, particularly text mixed with large amounts of special symbols, escape characters and links, and the meaningless text issued in bulk by avatar-advertising cheating users, thereby making up for the deficiencies of existing machine learning methods and improving classification accuracy.

The foregoing is merely a description of preferred embodiments of the invention and is not intended to limit the invention. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A text classification method, characterized by comprising the following steps:

S2, counting the total text length after replacement and the word length contained in the text, and calculating the ratio of the word length to the total text length;

S3, calculating the cheating characteristic index of the text to be processed from the ratio of the word length to the total text length;

2. The method according to claim 1, characterized in that, before step S1, the method further comprises:

pre-processing the characters in the text to be processed other than text and digits to remove common punctuation marks;

3. The method according to claim 1, characterized in that, before step S3, the method further comprises:

finding the numbers of links, numbers and mailboxes contained in the text to be processed, and obtaining the link weight and number weight of the text to be processed;

in step S3, the decreasing function of the ratio of the word length to the total text length is weighted with the obtained link weight and number weight to obtain the cheating characteristic index of the text to be processed, wherein the larger the link weight and number weight, the larger the cheating characteristic index.

4. The method according to claim 1, characterized in that the method further comprises:

looking up, in a user name dictionary or IP dictionary constructed in advance, the submission status data corresponding to the user name or IP address, and calculating a cheating user index from the numbers of normal texts and rubbish texts submitted by the user;

in step S4, the cheating user index and the cheating characteristic index are weighted or multiplied, and a text to be processed whose calculated result exceeds a preset threshold is determined to be rubbish text.

5. The method according to claim 4, characterized in that the method of establishing the user name dictionary and the IP dictionary specifically comprises:

obtaining a sample corpus comprising normal texts and rubbish texts;

counting, for each user name and each IP address respectively, the numbers of uploaded texts marked as normal text and as rubbish text, and generating the user name dictionary and the IP dictionary.

6. The method according to claim 1, characterized in that the method further comprises:

segmenting the text to be processed into words, using a Bayes dictionary constructed in advance to look up the normal probability and rubbish probability of each term, and calculating the probability that the text to be processed is rubbish text as the Bayes index of the text to be processed;

in step S4, the Bayes index and the cheating characteristic index are multiplied or weighted, and a text to be processed whose calculated result exceeds a preset threshold is determined to be rubbish text.

7. The method according to claim 1, characterized in that the method further comprises:

segmenting the text to be processed into words, using a Fisher dictionary constructed in advance to look up the normal probability and rubbish probability of each term, and calculating the probability that the text to be processed is rubbish text as the Fisher index of the text to be processed;

in step S4, the Fisher index and the cheating characteristic index are multiplied or weighted, and a text to be processed whose calculated result exceeds a preset threshold is determined to be rubbish text.

8. A text classification device, characterized by comprising:

a character replacement module, for replacing each character in the text to be processed other than text and digits with a preset fixed character string;

a text content computing module, for counting the total text length after replacement by the character replacement module and the word length contained in the text, and calculating the ratio of the word length to the total text length;

a cheating characteristic index computing module, for calculating the cheating characteristic index of the text to be processed from the ratio of the word length to the total text length;

a categorization module, for determining a text to be processed whose cheating characteristic index exceeds a preset threshold to be rubbish text.

9. The device according to claim 8, characterized in that the character replacement module is configured to:

pre-process the characters in the text to be processed other than text and digits to remove common punctuation marks;

replace only the characters remaining after processing by the pre-processing module.

10. The device according to claim 8, characterized in that the device further comprises:

a numerical character statistical module, for finding the numbers of links, numbers and mailboxes contained in the text to be processed and obtaining the link weight and number weight of the text to be processed;

the cheating characteristic index computing module weights the decreasing function of the ratio of the word length to the total text length with the obtained link weight and number weight to obtain the cheating characteristic index of the text to be processed, wherein the larger the link weight and number weight, the larger the cheating characteristic index.

11. The device according to claim 8, characterized in that the device further comprises:

a cheating user index computing module, for looking up, in a user name dictionary or IP dictionary constructed in advance, the submission status data corresponding to the user name or IP address, and calculating a cheating user index from the proportion of rubbish texts historically submitted by the user name or IP address of the text to be processed;

the categorization module is further configured to weight or multiply the cheating user index and the cheating characteristic index, and to determine a text to be processed whose calculated result exceeds a preset threshold to be rubbish text.

12. The device according to claim 11, characterized in that the module for establishing the user name dictionary and the IP dictionary specifically comprises:

a statistic unit, for counting, for each user name and each IP address respectively, the numbers of uploaded texts marked as normal text and as rubbish text, and generating the user name dictionary and the IP dictionary.

13. The device according to claim 8, characterized in that the device further comprises:

a Bayes index computing module, for segmenting the text to be processed into words, using a Bayes dictionary constructed in advance to look up the normal probability and rubbish probability of each term, calculating the probability that the text to be processed is rubbish text as the Bayes index of the text to be processed, and supplying the Bayes index to the categorization module;

the categorization module is further configured to multiply or weight the Bayes index and the cheating characteristic index, and to determine a text to be processed whose calculated result exceeds a preset threshold to be rubbish text.

14. The device according to claim 8, characterized in that the device further comprises:

a Fisher index computing module, for segmenting the text to be processed into words, using a Fisher dictionary constructed in advance to look up the normal probability and rubbish probability of each term, calculating the probability that the text to be processed is rubbish text as the Fisher index of the text to be processed, and supplying the Fisher index to the categorization module;

the categorization module is further configured to multiply or weight the Fisher index and the cheating characteristic index, and to determine a text to be processed whose calculated result exceeds a preset threshold to be rubbish text.