CN103514174A - Text categorization method and device

Info

Publication number: CN103514174A
Application number: CN201210206020.5A
Authority: CN
Inventors: 程童
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-06-18
Filing date: 2012-06-18
Publication date: 2014-01-15
Anticipated expiration: 2032-06-18
Also published as: CN103514174B

Abstract

The invention provides a text classification method and device. The method comprises: replacing the characters other than words and numbers in a text to be processed with a preset fixed string; counting the total length of the text after replacement and the length of the words it contains, and calculating the ratio of the word length to the total text length; calculating a cheating feature index of the text to be processed from that ratio; and determining a text to be processed whose cheating feature index exceeds a preset threshold to be spam text. The method and device can effectively make up for the deficiencies of existing machine learning methods and improve classification accuracy.

Description

A text classification method and device

[technical field]

The present invention relates to the technical field of Internet information, and in particular to a text classification method and device.

[background technology]

With the development of the Internet, more and more users use it to exchange information and share resources, and the volume of information on the network grows sharply day by day. However, the openness of the Internet also means that the network contains a great deal of harmful information, so monitoring, filtering and classifying Internet information has become a common requirement.

Comments (also called messages, replies, etc.) are a key feature of Internet community products and an important channel for creating an interactive atmosphere around a product. Because comments are cheap to post, reach a wide audience and have a lasting effect, the comment function has been plagued by junk information from the moment it appeared, including advertising links, promotional messages, and pornographic or reactionary content. Posting advertisements has even become an industry: posting has shifted from manual to automated, and the techniques involved grow ever more sophisticated, continually breaking through anti-cheating measures.

Existing means of dealing with such junk information fall into two broad classes. The first is institutional: manual review, user levels or user-group systems, strict user admission systems, and so on. The second is technical, and can be divided into two kinds. One kind is mechanical, including verification codes, sensitive-word filtering, frequency control, blacklists, similar-text strategies, etc.; the other is intelligent, mainly comprising machine learning methods such as naive Bayes, Fisher, support vector machines and neural networks.

Institutional methods mainly raise the cost of posting; while they suppress spam producers (spammers), they also make posting harder for ordinary users, which is difficult to accept in highly open communities. Mechanical methods target junk information with fixed features and are easily bypassed once spammers understand them. Intelligent methods have some recognition capability, but differences in learning mechanism, training corpus and so on make them difficult to implement; the main factors to consider are their precision and recall in distinguishing junk information from normal information.

These existing approaches are fairly effective on plain text, but they classify the following kinds of text poorly. First, texts interspersed with large amounts of punctuation, whitespace, tabs or line breaks have a high misjudgment rate. On the one hand, punctuation is generally filtered out during word segmentation and is not returned as part of the segmentation result, so spam texts heavily interspersed with punctuation cannot be identified; on the other hand, punctuation and stop words carry no semantics and occur with similar frequency in normal and spam texts, so they cannot effectively support the posterior probability, which reduces the accuracy of machine classification. Second, texts consisting mainly of website links, QQ numbers, mobile phone numbers and the like are also classified poorly: word segmentation cannot extract effective content from them, so accuracy is low. Third, meaningless replies are judged poorly. For example, when a user cheats by advertising through the avatar, the user posts large numbers of comments such as "nice experience" or "works well, great". When such texts appear in large numbers in the spam corpus, they also affect the classification of normal comments and reduce accuracy.

[summary of the invention]

In view of this, the present invention provides a text classification method and device that can identify all kinds of text effectively and accurately and improve classification accuracy.

The specific technical solution is as follows:

A text classification method, comprising the following steps:

S1, replacing each character other than words and numbers in a text to be processed with a preset fixed string;

S2, counting the total length of the text after replacement and the length of the words contained in the text, and calculating the ratio of the word length to the total text length;

S3, calculating a cheating feature index of the text to be processed from the ratio of the word length to the total text length;

S4, determining a text to be processed whose cheating feature index exceeds a preset threshold to be spam text.

According to a preferred embodiment of the present invention, before step S1 the method further comprises:

Preprocessing the characters other than words and numbers in the text to be processed to remove common punctuation marks;

Step S1 then replaces only the remaining characters with the preset fixed string.

According to a preferred embodiment of the present invention, before step S3 the method further comprises:

Finding the number of links, numbers and e-mail addresses contained in the text to be processed, and obtaining a link weight and a number weight for the text;

Step S3 then weights the obtained link weight and number weight together with a decreasing function of the ratio of the word length to the total text length to obtain the cheating feature index of the text to be processed; the larger the link weight and number weight, the larger the cheating feature index.

According to a preferred embodiment of the present invention, the method further comprises:

Determining the user name and IP address that submitted the text to be processed;

Looking up the submission records corresponding to the user name or IP address in a pre-built user name dictionary or IP dictionary, and calculating a cheating user index from the numbers of normal texts and spam texts submitted by that user;

Step S4 then weights or multiplies the cheating user index and the cheating feature index, and determines a text to be processed whose result exceeds the preset threshold to be spam text.

According to a preferred embodiment of the present invention, the user name dictionary and the IP dictionary are built as follows:

Obtaining a sample corpus comprising normal texts and spam texts;

Recording the user name and IP address that submitted each text in the sample corpus;

For each user name and each IP address, counting how many of the texts it uploaded are labeled normal and how many are labeled spam, and generating the user name dictionary and the IP dictionary.

Segmenting the text to be processed into words, looking up the normal probability and spam probability of each resulting term in a pre-built Bayes dictionary, and calculating the probability that the text to be processed is spam text, as the Bayes index of the text;

Step S4 then multiplies or weights the Bayes index and the cheating feature index, and determines a text to be processed whose result exceeds the preset threshold to be spam text.

Segmenting the text to be processed into words, looking up the normal probability and spam probability of each resulting term in a pre-built Fisher dictionary, and calculating the probability that the text to be processed is spam text, as the Fisher index of the text;

Step S4 then multiplies or weights the Fisher index and the cheating feature index, and determines a text to be processed whose result exceeds the preset threshold to be spam text.

A text classification device, comprising:

A character replacement module, for replacing each character other than words and numbers in a text to be processed with a preset fixed string;

A word content calculation module, for counting the total length of the text after replacement by the character replacement module and the length of the words contained in the text, and calculating the ratio of the word length to the total text length;

A cheating feature index calculation module, for calculating a cheating feature index of the text to be processed from the ratio of the word length to the total text length;

A classification module, for determining a text to be processed whose cheating feature index exceeds a preset threshold to be spam text.

According to a preferred embodiment of the present invention, the device further comprises:

A preprocessing module, for preprocessing the characters other than words and numbers in the text to be processed to remove common punctuation marks;

The character replacement module then replaces only the characters remaining after processing by the preprocessing module.

According to a preferred embodiment of the present invention, the device further comprises:

A number and character statistics module, for finding the number of links, numbers and e-mail addresses contained in the text to be processed, and obtaining a link weight and a number weight for the text;

The cheating feature index calculation module weights the obtained link weight and number weight together with a decreasing function of the ratio of the word length to the total text length to obtain the cheating feature index of the text to be processed; the larger the link weight and number weight, the larger the cheating feature index.

A user information extraction module, for determining the user name and IP address that submitted the text to be processed;

A cheating user index calculation module, for looking up the submission records corresponding to the user name or IP address in a pre-built user name dictionary or IP dictionary, and calculating a cheating user index from the proportion of spam texts historically submitted by the user name or IP address of the text to be processed;

The classification module is further used for weighting or multiplying the cheating user index and the cheating feature index, and determining a text to be processed whose result exceeds the preset threshold to be spam text.

According to a preferred embodiment of the present invention, the module for building the user name dictionary and the IP dictionary specifically comprises:

A corpus acquisition unit, for obtaining a sample corpus comprising normal texts and spam texts;

A user information recording unit, for recording the user name and IP address that submitted each text in the sample corpus;

A statistics unit, for counting, for each user name and each IP address, how many of the texts it uploaded are labeled normal and how many are labeled spam, and generating the user name dictionary and the IP dictionary.

A Bayes index calculation module, for segmenting the text to be processed into words, looking up the normal probability and spam probability of each resulting term in a pre-built Bayes dictionary, calculating the probability that the text to be processed is spam text as the Bayes index of the text, and providing the Bayes index to the classification module;

The classification module is further used for multiplying or weighting the Bayes index and the cheating feature index, and determining a text to be processed whose result exceeds the preset threshold to be spam text.

A Fisher index calculation module, for segmenting the text to be processed into words, looking up the normal probability and spam probability of each resulting term in a pre-built Fisher dictionary, calculating the probability that the text to be processed is spam text as the Fisher index of the text, and providing the Fisher index to the classification module;

The classification module is further used for multiplying or weighting the Fisher index and the cheating feature index, and determining a text to be processed whose result exceeds the preset threshold to be spam text.

As the above technical solutions show, the text classification method and device provided by the present invention use character replacement to obtain expanded cheating features and use the user's submission behavior as auxiliary verification. They can effectively identify texts interspersed with large amounts of special symbols, escape characters and links, as well as the meaningless texts posted in bulk by users who cheat through avatar advertising. Recognition precision is improved especially for short texts such as comments, replies and messages in communities or forums. Combined with machine learning methods, the invention effectively makes up for the deficiencies of existing machine learning methods and improves classification accuracy.

[description of the drawings]

Fig. 1 is a flowchart of the text classification method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the text classification method provided by Embodiment 2 of the present invention;

Fig. 3a is a schematic diagram of the content of a text and its user information;

Fig. 3b is a schematic diagram of the Bayes dictionary obtained by training with the Bayesian classification method;

Fig. 3c is a schematic diagram of the Fisher dictionary obtained by training with the Fisher classification method;

Fig. 3d is a schematic diagram of the user name dictionary obtained by counting;

Fig. 3e is a schematic diagram of the IP dictionary obtained by counting;

Fig. 4 is a flowchart of the text classification method provided by Embodiment 3 of the present invention;

Fig. 5 is a schematic diagram of the result of processing the text of Fig. 3a in Embodiment 3 of the present invention;

Fig. 6 is a schematic diagram of the text classification device provided by Embodiment 4 of the present invention;

Fig. 7 is a schematic diagram of the text classification device provided by Embodiment 5 of the present invention;

Fig. 8 is a schematic diagram of the text classification device provided by Embodiment 6 of the present invention.

[embodiment]

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.

Embodiment 1

Fig. 1 is a flowchart of the text classification method provided by this embodiment. As shown in Fig. 1, the method comprises:

S101, replacing each character other than words and numbers in the text to be processed with a preset fixed string.

First, the special symbols in the text to be processed, such as the English symbols "<>-_`~@#$%^&*()+=|", the Chinese symbols "“”$()---?", the escape characters "\n \t \r \n", and spaces, are replaced with the fixed string.

The fixed string may be, but is not limited to, a single character repeated to form a string of length greater than 1, for example the fixed string "$$$$" formed by stacking four "$" characters. Each character other than words and numbers in the text to be processed is replaced with this fixed string "$$$$". For example, for the text to be processed "<---method to make money---: " "? ":>>>/", replacement with the fixed string "$$$$" yields "$$$$$$$$$$$$$$$$$$$$method to make money$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$", and the total length of the text after replacement becomes longer.

Because punctuation, stop words and the like are filtered out during word segmentation, this step first replaces the special symbols with the fixed string, which expands their features and increases their influence, and only then counts the effective word content.

Of course, normal text may contain some common punctuation marks such as "," and "."; these legitimately appear in text and need not be replaced. Therefore, before the replacement in this step, the characters other than words and numbers in the text to be processed may be preprocessed to remove common punctuation marks first, and only the remaining characters are replaced with the preset fixed string, which improves efficiency and accuracy.
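The replacement in step S101, together with the optional preprocessing, might be sketched in Python as follows. This is only an illustration under stated assumptions: the names `FIXED`, `COMMON_PUNCT`, `SPECIAL` and `replace_special` are hypothetical, "words" are taken to be Chinese characters or ASCII letters, "numbers" are taken to be ASCII digits, and the set of "common" punctuation marks is chosen arbitrarily.

```python
import re

FIXED = "$$$$"           # fixed string used in the example in the text
COMMON_PUNCT = ",.，。"   # hypothetical set of "common" punctuation marks
# Assumption: words are Chinese characters or ASCII letters, numbers are ASCII digits.
SPECIAL = re.compile(r"[^0-9A-Za-z\u4e00-\u9fa5\uf900-\ufa2d]")


def replace_special(text, strip_common=True):
    """Expand every special character into FIXED.

    When strip_common is true, common punctuation is removed first,
    so it is not expanded.
    """
    if strip_common:
        text = "".join(ch for ch in text if ch not in COMMON_PUNCT)
    return SPECIAL.sub(lambda match: FIXED, text)
```

With these assumptions, `replace_special("<<赚钱>>")` expands each of the four angle brackets into four "$" characters, while the comma and full stop in `"你好，世界。"` are simply dropped.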

S102, counting the total length of the text after replacement and the length of the words contained in the text, and calculating the ratio of the word length to the total text length.

The more special symbols (English symbols, Chinese symbols, escape characters, etc.) a text contains, the longer it becomes after replacement with the fixed string. The total length of the text after replacement is counted as L_ORIG.

A regular expression is used to find the words contained in the text to be processed. For example, Chinese characters occupy the code range 0x8140-0xFEFE in the national standard extension code (GBK), 0xA1A1-0xFEFE in the Chinese coded character set for information interchange (GB2312), and \u4E00-\u9FA5 and \uF900-\uFA2D in the Unicode standard. A regular expression for Chinese characters is built from these code ranges, the characters falling within the ranges are found, and their number is counted to give the word length L_CHAR.

The ratio of the word length to the total text length, K = L_CHAR/L_ORIG, is then calculated; this is the effective word content.
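As a minimal sketch of step S102, the ratio K could be computed in Python with the Unicode ranges quoted above. The names `HANZI` and `word_ratio` are hypothetical, and returning 0.0 for an empty text is an assumption made here, not something the text specifies.

```python
import re

# Unicode ranges for Chinese characters as quoted in the text.
HANZI = re.compile(r"[\u4e00-\u9fa5\uf900-\ufa2d]")


def word_ratio(replaced_text):
    """Return K = L_CHAR / L_ORIG for a text that has already been through S101."""
    l_orig = len(replaced_text)                   # total length after replacement
    l_char = len(HANZI.findall(replaced_text))    # number of Chinese characters
    # Assumption: an empty text is given K = 0.0 to avoid dividing by zero.
    return l_char / l_orig if l_orig else 0.0
```

For instance, a replaced text consisting of two Chinese characters surrounded by two "$$$$" strings has L_CHAR = 2 and L_ORIG = 10, so K = 0.2.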

S103, calculating the cheating feature index of the text to be processed from the ratio of the word length to the total text length.

Spam text interspersed with large amounts of punctuation, whitespace, escape characters and the like usually contains few words and a large share of non-word symbols. In other words, the lower the effective word content of a text, the more pronounced its cheating feature and the greater the probability that it is spam text.

Therefore, a cheating feature index function is built in advance from the ratio K of the word length to the total text length and is used to calculate the cheating feature index. Specifically, a decreasing function of K is used as the cheating feature index function. The decreasing function may be, but is not limited to:

Score_feature1 = \frac{2}{1 + K} (formula 1)

In formula 1, since K lies in [0, 1], (1 + K) lies in [1, 2] and 1/(1 + K) lies in [0.5, 1], so the cheating feature index lies in [1, 2]. The lower the value of K, the higher the score, the more pronounced the cheating feature of the text, and the greater the probability that it is spam text.

S104, determining a text to be processed whose cheating feature index exceeds a preset threshold to be spam text.

The preset threshold is a classification threshold obtained by observing the data set to be processed. A preset threshold for the cheating feature index is set according to the requirements of the actual application scenario and prior experience, and the cheating feature index Score_feature1 calculated in step S103 is compared with it; if the index exceeds the threshold, the text is identified as spam text.
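Formula 1 and the threshold test of step S104 can be illustrated with a short Python sketch. The function names and the threshold value 1.5 are hypothetical; the text leaves the threshold to be chosen from the application scenario and experience.

```python
def score_feature1(k):
    """Formula 1: a decreasing function of K that maps [0, 1] onto [1, 2]."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("K must lie in [0, 1]")
    return 2.0 / (1.0 + k)


def is_spam(k, threshold=1.5):
    """Step S104 with a hypothetical threshold of 1.5."""
    return score_feature1(k) > threshold
```

Under this assumed threshold, a text with K = 0.2 scores about 1.67 and is flagged, whereas a text with K = 0.5 scores about 1.33 and is not.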

The method provided by this embodiment can thus effectively identify spam text interspersed with large amounts of special characters such as punctuation, spaces and escape characters.

It is worth mentioning that the texts remaining after this identification, i.e. those whose cheating feature index Score_feature1 does not exceed the preset threshold, can be classified again with existing classification methods such as the Bayesian classification method or support vector machines.

Embodiment 2

Fig. 2 is a flowchart of the text classification method provided by this embodiment. As shown in Fig. 2, the method comprises:

Step S201, replacing the characters other than words and numbers in the text to be processed with a preset fixed string.

This step is identical to step S101 in Embodiment 1 and is not described again here.

Step S202, counting the total length of the text after replacement and the length of the words contained in the text, and calculating a word ratio weight from the ratio of the word length to the total text length.

The ratio K of the word length to the total text length is calculated in the same way as in step S102 of Embodiment 1, i.e. K = L_CHAR/L_ORIG.

The word ratio weight Score_char is calculated from the ratio of the word length to the total text length and may use, but is not limited to, the following formula:

Score_char = \frac{2}{1 + K} (formula 2)

Step S203, finding the number of links, numbers and e-mail addresses contained in the text to be processed, and obtaining the link weight and number weight of the text.

Regular expressions are used to find the numbers of links, QQ numbers, mobile phone numbers and e-mail addresses. For example, in the Python language, the regular expression re.compile("[0-9\.]{5,9}") can be used to find mobile phone numbers or QQ numbers, re.compile("\w+@\w+\.\w+") to find e-mail addresses, and re.compile("[http:\/]*\w+[\w+\.]+[comnedugvtn\.]{2,6}") to find links to web addresses.

The link weight Score_link may be, but is not limited to, the sum of the numbers of links and e-mail addresses contained in the text; correspondingly, the number weight Score_digit may be, but is not limited to, the sum of the numbers of QQ numbers, mobile phone numbers and similar numbers contained in the text.
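The counting in step S203 might be sketched as below. The three patterns are simplified, hypothetical stand-ins chosen for readability; they are not the expressions quoted in the text, and the function name `link_and_digit_weights` is likewise an assumption. E-mail addresses and links are blanked out before numbers are counted so that digits inside them are not counted twice.

```python
import re

# Simplified, hypothetical patterns (not the expressions quoted in the text).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", re.ASCII)
LINK = re.compile(r"(?:https?://|www\.)\S+")
NUMBER = re.compile(r"(?<!\d)\d{5,11}(?!\d)")   # QQ-like or phone-like digit runs


def link_and_digit_weights(text):
    """Return (Score_link, Score_digit) as plain counts."""
    emails = EMAIL.findall(text)
    rest = EMAIL.sub(" ", text)        # remove e-mail addresses before further matching
    links = LINK.findall(rest)
    rest = LINK.sub(" ", rest)         # remove links before counting numbers
    numbers = NUMBER.findall(rest)
    score_link = len(links) + len(emails)
    score_digit = len(numbers)
    return score_link, score_digit
```

A text containing one link, one e-mail address, one QQ-like number and one 11-digit phone number would give Score_link = 2 and Score_digit = 2 under these patterns.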

Step S204, weighting the obtained link weight and number weight together with the decreasing function of the ratio of the word length to the total text length to obtain the cheating feature index of the text to be processed.

The specific weighting formula may be, but is not limited to:

Score_feature2 = Score_char + 0.5 \times Score_link + 0.5 \times Score_digit (formula 3)

As the above formula shows, the larger the link weight Score_link and the number weight Score_digit, the larger the cheating feature index; likewise, the smaller the ratio of the word length to the total text length, the larger the resulting word ratio weight Score_char and the larger the cheating feature index.
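Formulas 2 and 3 combine as in the following Python sketch; the function name `score_feature2` is hypothetical, and the 0.5 coefficients are the ones given in formula 3.

```python
def score_feature2(k, score_link, score_digit):
    """Formula 3: the word ratio weight plus half of each count-based weight."""
    score_char = 2.0 / (1.0 + k)                     # formula 2
    return score_char + 0.5 * score_link + 0.5 * score_digit
```

For K = 0.2 with two links and two numbers, the index is about 1.67 + 1 + 1 = 3.67; for a text made up entirely of words (K = 1) with no links or numbers, it is 1.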

Step S205, determining the user name and IP address that submitted the text to be processed.

For the text to be processed, user information is obtained and the user name and IP address that submitted the text are determined.

Fig. 3a is a schematic diagram of the content of a text and its user information. As shown in Fig. 3a, the user name that submitted the text is sx1816, the user IP is 114.228.210.130, and the text content is "East China one district-sun, the moon and the stars 282zzd8010101060000700067b4t0zmcb50e0".

Step S206, looking up the submission records corresponding to the user name or IP address in the pre-built user name dictionary or IP dictionary, and calculating the cheating user index from the numbers of normal texts and spam texts submitted by the user.

The user name dictionary and IP dictionary are obtained in advance by classifying and counting historical data of a certain scale: for each submitting user name and IP address, the numbers of normal texts and spam texts among the texts submitted are counted and recorded, generating the user name dictionary and the IP dictionary respectively.

The cheating user index is calculated from the proportion of spam texts historically submitted by the user name or IP address of the text to be processed.

Using the user name and IP address determined in step S205, the submission records of the user are looked up in the pre-built user name dictionary and IP dictionary, and the numbers of normal texts and spam texts submitted by the user are recorded as h_num and s_num respectively.

The cheating user index Score_user may be calculated with, but is not limited to, the following formula:

Score_user = 1 + \log_{T}(s_num) \times \frac{s_num}{s_num + h_num} (formula 4)

Here T is a reference value for the number of spam texts, marking the observed boundary between the spam-text counts of normal users and of spam users. It can be set according to the actual situation, for example between 6 and 10.

As formula 4 shows, the cheating user index is driven mainly by the number of spam texts the user has historically submitted and by the share of spam texts in the user's total submissions. For a spam user both indicators are usually high; for a normal user, even if some comments have been labeled spam, the spam share is low, so the resulting cheating user index is also low.
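Formula 4 can be tried out with a small Python sketch. The function name `score_user` and the default T = 8 (inside the suggested range of 6 to 10) are hypothetical. The formula is undefined when s_num is 0, so returning the baseline value 1.0 in that case is an assumption made here.

```python
import math


def score_user(s_num, h_num, t=8):
    """Formula 4: 1 + log_T(s_num) * s_num / (s_num + h_num)."""
    if s_num <= 0:
        # Assumption: a user with no recorded spam gets the baseline score 1.0,
        # because log_T(0) is undefined.
        return 1.0
    return 1.0 + math.log(s_num, t) * s_num / (s_num + h_num)
```

With T = 8, a user with 8 spam texts and no normal texts scores 2.0, a user with 8 spam and 8 normal texts scores 1.5, and a user with a single spam text among 100 submissions stays at 1.0 because log_T(1) = 0.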

Of course, the features of the user name may also be taken into account when calculating the cheating user index. Cheating users are often registered by machine, and their user names tend to show certain features: for example, letters and digits combined according to some rule, or wording such as "add Q", "QQ", "make friends", "contact", "kou kou" (a homophone of QQ), "add me" or "click me". For users whose names show such features, the cheating user index can be further adjusted by weighting.

Step S207, weighting or multiplying the cheating user index and the cheating feature index, and determining a text to be processed whose result exceeds the preset threshold to be spam text.

The final score of the text to be processed is obtained by weighting or multiplying the cheating feature index from step S204 and the cheating user index from step S206; this embodiment multiplies them.

The preset threshold is again a classification threshold obtained by observing the data set to be processed. The weighted or multiplied result is compared with the preset threshold; if it exceeds the threshold, the text is identified as spam text.
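The multiplicative combination used in this embodiment is sketched below in Python. The function name `classify` and the threshold 3.0 are hypothetical; a weighted sum could be substituted for the product.

```python
def classify(score_feature, score_user, threshold=3.0):
    """Step S207: multiply the two indices and compare with a hypothetical threshold."""
    final_score = score_feature * score_user
    return final_score, final_score > threshold
```

Multiplication means a high cheating user index amplifies a suspicious text score, while a user at the baseline value 1.0 leaves the text score unchanged.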

Embodiment 3

In this embodiment, a Bayes dictionary, a Fisher dictionary, a user name dictionary and an IP dictionary are first built in advance by generating the dictionaries offline. The specific building method comprises:

Step S301, obtaining a sample corpus comprising normal texts and spam texts.

The sample corpus may use existing historical data of a certain scale: texts, comments or replies accumulated on the network and submitted from different user names or IP addresses form the sample corpus.

The classification of the sample texts into normal and spam may be obtained with an existing classification method, or by manual labeling: texts in the sample corpus that administrators or other users have marked as spam are distinguished from the unmarked normal texts.

Step S302, segmenting the texts in the sample corpus into words, counting each term, calculating the probability of each term in normal text and in spam text, and generating a classification dictionary.

The machine learning method may be an existing Bayesian classification method, Fisher classification method, or the like, each producing its own classification dictionary. Fig. 3b is a schematic diagram of the Bayes dictionary obtained by training with the Bayesian classification method, and Fig. 3c is a schematic diagram of the Fisher dictionary obtained by training with the Fisher classification method. As shown in Figs. 3b and 3c, a dictionary contains each term together with its normal probability and spam probability.
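One way to picture the dictionary generated in step S302 is the Python sketch below. It rests on several assumptions not stated in the text: a term's probability in a class is taken as the fraction of texts of that class containing the term, tokens are split on whitespace (real Chinese text would need a word segmenter), and the names `build_class_dictionary`, `samples` and `tokenize` are hypothetical.

```python
from collections import Counter


def build_class_dictionary(samples, tokenize=str.split):
    """Build {term: {"normal": p, "spam": p}} from (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in samples:
        label = bool(is_spam)
        totals[label] += 1
        counts[label].update(set(tokenize(text)))   # count each term once per text
    terms = set(counts[True]) | set(counts[False])
    return {
        term: {
            "normal": counts[False][term] / totals[False] if totals[False] else 0.0,
            "spam": counts[True][term] / totals[True] if totals[True] else 0.0,
        }
        for term in terms
    }
```

With two spam samples "buy now" and "buy cheap" and one normal sample "nice post", the term "buy" gets a spam probability of 1.0 and a normal probability of 0.0.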

Step S303, recording the user name and IP address that submitted each text in the sample corpus.

The user name and IP address of each text are extracted from the sample corpus. Fig. 3a is a schematic diagram of the content of a text and its user information. As shown in Fig. 3a, the user name that submitted the text is sx1816, the user IP is 114.228.210.130, and the text content is "East China one district-sun, the moon and the stars 282zzd8010101060000700067b4t0zmcb50e0".

Step S304, for each user name and each IP address, counting how many of the texts it uploaded are labeled normal and how many are labeled spam, and generating the user name dictionary and the IP dictionary.

Fig. 3d is a schematic diagram of the user name dictionary obtained by counting. As shown in Fig. 3d, the dictionary contains each user name and the corresponding numbers of normal texts and spam texts submitted.

Fig. 3e is a schematic diagram of the IP dictionary obtained by counting. As shown in Fig. 3e, the dictionary contains each IP address and the corresponding numbers of normal texts and spam texts submitted.
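The counting in steps S303 and S304 might look like the following Python sketch. The record layout (user name, IP address, spam flag) and the name `build_user_dictionaries` are assumptions; the keys `h_num` and `s_num` follow the notation used for formula 4.

```python
from collections import defaultdict


def build_user_dictionaries(samples):
    """Count normal (h_num) and spam (s_num) texts per user name and per IP address."""
    users = defaultdict(lambda: {"h_num": 0, "s_num": 0})
    ips = defaultdict(lambda: {"h_num": 0, "s_num": 0})
    for user_name, ip, is_spam in samples:
        key = "s_num" if is_spam else "h_num"
        users[user_name][key] += 1
        ips[ip][key] += 1
    return dict(users), dict(ips)
```

Keeping the two dictionaries separate allows a lookup by either key, so a spammer who changes user names can still be caught by IP address, and vice versa.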

Offline dictionary generation can be run periodically during the service cycle, so that learning is automatic and the effect of autonomous learning is achieved. In addition, the dictionaries are backed up continually to ensure that the previous data can be reloaded when the service restarts. When a newer dictionary is provided, the service needs to merge it with the dictionary already loaded: the value of an existing index (key) is replaced, and an index that did not previously exist is added.
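The merge rule described here (replace the value of an existing key, add a missing key) corresponds to the behavior of Python's `dict.update`. A minimal sketch, with the hypothetical name `merge_dictionary`:

```python
def merge_dictionary(loaded, newer):
    """Return the loaded dictionary overlaid with the newer one."""
    merged = dict(loaded)     # copy, so the running service's dictionary is not mutated
    merged.update(newer)      # existing keys are replaced, new keys are added
    return merged
```

Returning a new dictionary rather than updating in place is a design choice of this sketch; it lets the service swap in the merged result in one step.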

Fig. 4 is a flowchart of the text classification method provided by this embodiment. As shown in Fig. 4, the method comprises:

Step S401, replacing the characters other than words and numbers in the text to be processed with the preset fixed string, calculating the ratio of the word length to the total text length and the numbers of links, numbers and e-mail addresses contained in the text, and calculating the cheating feature index Score_feature of the text to be processed.

This step is processed in the same way as steps S201 to S204 in Embodiment 2 and is not described again here.

Step S402, looking up, in the pre-built user name dictionary or IP dictionary, the proportion of spam texts historically submitted by the user name or IP address of the text to be processed, and calculating the cheating user index Score_user of the text to be processed.

This step is processed in the same way as steps S205 to S206 in Embodiment 2 and is not described again here.

Step S403, described pending text is carried out to participle, utilize the Bayes's dictionary building in advance, search corresponding normal probability and the rubbish probability of each lexical item obtaining, and calculate the probability that described pending text is rubbish text, as Bayes's index of described pending text.

Bayesian classification is based on Bayes' theorem and the law of total probability. Bayes' theorem is in essence a way of computing a conditional probability, that is, the probability that event A occurs given that event B has occurred: P(A|B)=P(B|A)P(A)/P(B). The law of total probability computes the probability of a complex event from the probabilities of simple events. For example, if A and A' form a partition of the sample space, the probability of event B is P(B)=P(B|A)P(A)+P(B|A')P(A').

Applying Bayes' theorem to text classification rests on the following reading of it. P(A) is called the "prior probability", the estimate of the probability of A before event B occurs. P(A|B) is called the "posterior probability", the re-estimate of the probability of A after event B has occurred. P(B|A)/P(B) is called the "likelihood function", an adjustment factor that brings the estimated probability closer to the true probability; its value is obtained experimentally. If it is greater than 1, the prior probability is strengthened; if it equals 1, event B does not help in judging the likelihood of event A; if it is less than 1, the prior probability is weakened.

It is applied to text classification as follows. Let S denote spam text (Spam), H denote normal text (Healthy), with P(S)=P(H)=50% in general, and let W denote a word (Word). The problem becomes computing the probability that a text is S given that W occurs, denoted P(S|W). From the formulas above, P(S|W)=P(W|S)P(S)/(P(W|S)P(S)+P(W|H)P(H)), where P(W|S) and P(W|H), the probabilities that W occurs in spam text and in normal text respectively, can be obtained by statistics. In this way the probability that a text is spam is inferred from the frequency of the words in it. To infer from several words contained in a text, the joint probability formula can be used: writing P(S|W₁) as P₁ and P(S|W₂) as P₂, the final probability is P=P₁P₂/(P₁P₂+(1-P₁)(1-P₂)).

The final probability obtained is used as the Bayesian index Score_bayes of the text to be processed.
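
The two formulas of step S403 can be sketched as follows. The function names are hypothetical. The text gives the joint formula for two words only; extending it to any number of words as a product over all words is a common generalization and is an assumption here, as is the fallback value returned when both products are zero.

```python
def spam_probability_given_word(p_w_spam, p_w_healthy, p_spam=0.5):
    """P(S|W) = P(W|S)P(S) / (P(W|S)P(S) + P(W|H)P(H)), with P(H) = 1 - P(S)."""
    p_healthy = 1.0 - p_spam
    numerator = p_w_spam * p_spam
    denominator = numerator + p_w_healthy * p_healthy
    return numerator / denominator


def combine_word_probabilities(probabilities):
    """Joint probability P = prod(Pi) / (prod(Pi) + prod(1 - Pi)).

    For two words this is exactly P1*P2 / (P1*P2 + (1-P1)(1-P2)).
    """
    prod_spam = 1.0
    prod_healthy = 1.0
    for p in probabilities:
        prod_spam *= p
        prod_healthy *= 1.0 - p
    total = prod_spam + prod_healthy
    if total == 0.0:
        return 0.5  # assumption: treat a fully contradictory input as undecided
    return prod_spam / total


# A word seen in 5% of spam texts and 0.5% of normal texts.
p1 = spam_probability_given_word(0.05, 0.005)
# Two words with P(S|W1) = 0.9 and P(S|W2) = 0.8.
score_bayes = combine_word_probabilities([0.9, 0.8])
```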

Step S404, segment the text to be processed into words, look up the normal probability and the spam probability of each resulting lexical item in the pre-built Fisher dictionary, and calculate the probability that the text to be processed is spam text, as the Fisher index of the text to be processed.

Like the Bayesian classification method, Fisher classification is an alternative to Bayes, and the two can also be used in combination. Unlike Bayes, which calculates with word frequencies, the Fisher method counts the probability of documents, that is, the probability that a document belongs to S and to H when W occurs; a category must be specified when the result is obtained. It is computed as follows. Let C be the category (S or H above); P(C|W), the probability that a text belongs to category C when W occurs, can be obtained from training text statistics. For an input text, the probability that it belongs to spam text is P(S)=P(S|W)/(P(S|W)+P(H|W)). For the joint probability of several words, the results can be multiplied, P(S)=P(S1)P(S2)..., and P(H) is obtained in the same way. In the Fisher method, the resulting P(C) may further be processed as follows: the value -2*log(P(C)) is passed to the inverse chi-square function, which returns the final result.

The Fisher index score is Score_fisher=P(S)/(P(S)+P(H)), that is, the probability of the spam category as judged by Fisher, normalized.

It is worth mentioning that for a text that Fisher classification cannot judge, i.e. when P(S)=P(H), Score_fisher=1, indicating that the Fisher index is ineffective.
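
A minimal sketch of the Fisher index calculation follows. The function names are hypothetical. The text does not give the degrees of freedom for the inverse chi-square function; twice the number of words is assumed here, as is common in Fisher-style spam filters. The per-word probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def inverse_chi_square(chi, degrees_of_freedom):
    """Upper-tail probability of a chi-square value for an even number of degrees of freedom."""
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, degrees_of_freedom // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_class_probability(per_word_probabilities):
    """Multiply per-word probabilities, then pass -2*log(P) to the inverse chi-square function."""
    product = 1.0
    for p in per_word_probabilities:
        product *= p  # assumption: every p is strictly positive
    degrees = 2 * len(per_word_probabilities)  # assumption, not stated in the text
    return inverse_chi_square(-2.0 * math.log(product), degrees)


def fisher_index(spam_probabilities, healthy_probabilities):
    """Score_fisher = P(S) / (P(S) + P(H)); 1 when P(S) = P(H), as the text specifies."""
    p_s = fisher_class_probability(spam_probabilities)
    p_h = fisher_class_probability(healthy_probabilities)
    if math.isclose(p_s, p_h):
        return 1.0  # undecidable case: the index is neutral in a product
    return p_s / (p_s + p_h)


score_fisher = fisher_index([0.9, 0.8], [0.1, 0.2])
```

Returning 1 in the undecidable case means the Fisher index leaves a product of indices unchanged.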

Step S405, weight or multiply the Bayesian index, the Fisher index, the cheating user index and the cheating characteristic index, and determine a text to be processed whose result exceeds a preset threshold to be spam text.

The final score of the text to be processed may be taken as the product of the four indices, specifically:

Score=Score_feature*Score_user*Score_bayes*Score_fisher (formula 5)

The preset threshold is likewise a classification threshold obtained by observing the data set to be processed. Whether the weighted or multiplied result exceeds the preset threshold is judged; if it does, the text is identified as spam text. In this embodiment, taking 1.0 as the boundary distinguishes normal text from spam text well.

Take the text shown in Fig. 3 a as example, the text is connected to a large amount of spaces and line feed character below at " sun, the moon and the stars ", through after this instance processes, its result as shown in Figure 5, obtaining Bayes's index is 0.034608, Fei Sheer index is 0.406481, these two indexs all fail to pick out space, the feature of symbol, cheating characteristic index is 2.785714, the such text of the mode more effective identification of energy that characterization expands, cheating user index is 1.000000, before illustrating, this user does not have the history of submitting to, final score is 0.0391875824319, be classified as normal text.
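
The arithmetic of this example can be checked against formula 5 with a few lines. The function name is hypothetical; the four index values and the 1.0 boundary are taken from the text. Because the indices are quoted to six decimal places, the product agrees with the stated final score only to about six decimal places.

```python
def final_score(score_feature, score_user, score_bayes, score_fisher):
    """Formula 5: the product of the four indices."""
    return score_feature * score_user * score_bayes * score_fisher


THRESHOLD = 1.0  # boundary used in this embodiment

score = final_score(2.785714, 1.000000, 0.034608, 0.406481)
label = "spam" if score > THRESHOLD else "normal"
```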

This example has been verified Fei Sheer and Bayesian weakness, shows that the more effective identification of method energy of feature expansion is mingled with the text of a large amount of punctuation marks and blank, tab or newline simultaneously.When similar text occurs and after the person of being managed deletes in a large number, through several study of taking turns dictionary, the index of Bayes and Fei Sheer will be corresponding accurate, simultaneously, owing to there being cheating review record, user's index of practising fraud also can correspondingly increase, like this, the sorter building just can be identified so similar text preferably, also can well identify the cheating mode of head portrait advertisement simultaneously.In addition, having considered the link that comprises in text and the quantity of number in the method, is website links for principal ingredient, and No. QQ, the text of cell-phone number etc. also has good classifying quality.The present invention propose that feature expands and with the method for the behavior of submission to, combine with the method for existing machine learning, improve classification accuracy.

Be more than the detailed description that method provided by the present invention is carried out, below document sorting apparatus provided by the invention be described in detail.

Embodiment four

Fig. 6 is a schematic diagram of the text classification device provided by this embodiment. As shown in Fig. 6, the device comprises:

Character replacement module 601, configured to replace each character other than words and numbers in the text to be processed with a preset fixed string.

The character replacement module 601 first replaces the special symbols in the text to be processed with a fixed string, such as the English symbols " < >-_ `～@# $ %^& * () +=| ", the Chinese symbols " " " $ ()---? ", the escape characters "\n \t \r \n", spaces and so on.

Because punctuation marks, stop words and the like are filtered out during word segmentation, this module first replaces these special symbols with the fixed string, thereby expanding the features of the special symbols and increasing their influence, and then counts the effective word content.

Of course, normal text contains some common punctuation marks, such as "," and ".", which normally appear in text and need not be replaced. The character replacement module 601 may therefore first pre-process the characters other than words and numbers in the text to be processed, removing common punctuation marks and replacing only the remaining characters with the preset fixed string, which improves efficiency and accuracy.

Word content calculation module 602, configured to count the total text length after replacement by the character replacement module 601 and the length of the words contained in the text, and to calculate the ratio of the word length to the total text length.

The word content calculation module 602 calculates the ratio of the word length to the total text length, K=L_CHAR/L_ORIG, i.e. the effective word content.
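
The replacement and ratio calculation of modules 601 and 602 can be sketched as follows. Several points are assumptions made for the sketch: the fixed string "###", the use of the regular expression `[\W_]` to stand for "characters other than words and numbers", counting digits as part of the word length L_CHAR, and returning 1.0 for an empty text.

```python
import re

FIXED_STRING = "###"             # hypothetical preset fixed string
SPECIAL = re.compile(r"[\W_]")   # anything that is not a letter or digit (Unicode-aware)


def word_ratio(text, fixed_string=FIXED_STRING):
    """Return K = L_CHAR / L_ORIG.

    L_CHAR is the number of letter and digit characters; L_ORIG is the
    total length after each special character is replaced by the fixed
    string, so special characters weigh more than they did originally.
    """
    l_char = len(SPECIAL.sub("", text))
    l_orig = len(SPECIAL.sub(fixed_string, text))
    if l_orig == 0:
        return 1.0  # assumption: an empty text has no special characters
    return l_char / l_orig


clean_ratio = word_ratio("abc123")
padded_ratio = word_ratio("ab" + " " * 2 + "\n")
```

With a three-character fixed string, each space or line feed counts three times in the total length, which is how a text padded with blanks ends up with a low K.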

Cheating characteristic index calculation module 603, configured to calculate the cheating characteristic index of the text to be processed using the ratio of the word length to the total text length.

Accordingly, a cheating characteristic index function is built in advance from the ratio K of the word length to the total text length and is used to calculate the cheating characteristic index. Specifically, a decreasing function of the ratio K is used as the cheating characteristic index function. The decreasing function may be, but is not limited to, formula 1. In formula 1, since the ratio K takes values in [0,1], (1+K) takes values in [1,2] and 1/(1+K) takes values in [0.5,1], so this cheating characteristic index lies in [1,2]. The lower the value of K, the higher the score, the more pronounced the cheating characteristic of the text, and the greater the probability that the text is spam text.

Classification module 604, configured to determine a text to be processed whose cheating characteristic index exceeds a preset threshold to be spam text.

The preset threshold is a classification threshold obtained by observing the data set to be processed. A preset threshold for the cheating characteristic index is set according to the needs of the actual application scenario and prior experience, and whether the cheating characteristic index Score_feature1 calculated by the cheating characteristic index calculation module 603 exceeds this threshold is judged; if it does, the text is identified as spam text.

The device provided by this embodiment can therefore effectively identify spam text mixed with large amounts of special characters such as punctuation, spaces and escape characters.

It is worth mentioning that the texts remaining after this identification, i.e. those whose cheating characteristic index Score_feature1 does not exceed the preset threshold, may be further classified with an existing classifier such as a Bayesian classifier or a support vector machine.
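
Formula 1 is not reproduced in this section, so the following is only a sketch of one decreasing function that matches the stated ranges: 2/(1+K) maps K in [0,1] onto [1,2]. Both the function name and the choice of 2/(1+K) are assumptions, not the patent's formula.

```python
def cheating_feature_index(k):
    """A decreasing function of K with range [1, 2] for K in [0, 1].

    Assumed form 2 / (1 + K); chosen only because 1/(1+K) lies in
    [0.5, 1] and the index is stated to lie in [1, 2].
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("K must lie in [0, 1]")
    return 2.0 / (1.0 + k)


all_words = cheating_feature_index(1.0)   # text made only of words and numbers
no_words = cheating_feature_index(0.0)    # text made only of special characters
```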

Embodiment five

Fig. 7 is a schematic diagram of the text classification device provided by this embodiment. As shown in Fig. 7, the device comprises:

Character replacement module 701, configured to replace the characters other than words and numbers in the text to be processed with a preset fixed string.

This module is the same as module 601 in Embodiment four and is not described again here.

Word content calculation module 702, configured to count the total text length after replacement by the character replacement module 701 and the length of the words contained in the text, and to calculate the word ratio weight from the ratio of the word length to the total text length.

The ratio K of the word length to the total text length is calculated in this module in the same way as in module 602 of Embodiment four, i.e. K=L_CHAR/L_ORIG.

The word content calculation module 702 calculates the word ratio weight Score_char from the ratio of the word length to the total text length, which may be, but is not limited to being, calculated with formula 2.

Number and character statistics module 703, configured to find the numbers of links, numbers and mailboxes contained in the text to be processed, and to obtain the link weight and the number weight of the text to be processed.

Cheating characteristic index calculation module 704, configured to weight the obtained link weight and number weight with the decreasing function of the ratio of the word length to the total text length, to obtain the cheating characteristic index of the text to be processed.

In the cheating characteristic index calculation module 704, the specific weighting may be, but is not limited to being, calculated with formula 3. As formula 3 shows, the larger the link weight Score_link and the number weight Score_digit, the larger the cheating characteristic index; likewise, the smaller the ratio of the word length to the total text length, the larger the calculated word ratio weight Score_char and the larger the cheating characteristic index.

User information extraction module 705, configured to determine the user name and IP address that submitted the text to be processed.

Cheating user index calculation module 706, configured to look up the submission data corresponding to the user name or IP address in the pre-built user name dictionary or IP dictionary, and to calculate the cheating user index from the numbers of normal texts and spam texts submitted by this user.

Using the user name and IP address determined by the user information extraction module 705, the submission data corresponding to this user is looked up in the pre-built user name dictionary and IP dictionary, and the numbers of normal texts and spam texts submitted by this user are recorded as h_num and s_num respectively.

The cheating user index Score_user may be, but is not limited to being, calculated with formula 4.

Of course, when calculating the cheating user index, the cheating user index calculation module 706 may also consider features of the user name. Cheating users often register by machine, so their user names tend to show certain features, for example letters and numbers combined according to a fixed rule, or wording such as "add Q", "QQ", "make friends", "contact", "button button", "add me" or "put me". The cheating user index of a user with such features may be further weighted.

Classification module 707, configured to weight or multiply the cheating user index and the cheating characteristic index, and to determine a text to be processed whose result exceeds a preset threshold to be spam text.

The final score of the text to be processed is obtained by weighting or multiplying the cheating characteristic index obtained by the cheating characteristic index calculation module 704 and the cheating user index obtained by the cheating user index calculation module 706. In this embodiment the final score is calculated by multiplication.

Embodiment six

In this embodiment, the Bayesian dictionary, the Fisher dictionary, the user name dictionary and the IP dictionary are first built in advance by generating dictionaries offline. The building module specifically comprises:

Corpus acquisition unit, configured to obtain a sample corpus containing normal texts and spam texts.

Machine classification unit, configured to segment the texts in the sample corpus into words, count each lexical item, calculate the probability of each lexical item in normal text and in spam text, and generate the classification dictionary.

User information recording unit, configured to record the user name and IP address that submitted each text in the sample corpus.
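
The work of the machine classification unit above can be sketched as follows, assuming the texts have already been segmented into tokens. The function name, the sample tokens and the choice of a document-frequency estimate (the share of texts in a class that contain the item) are assumptions; the text only says that each lexical item is counted and its normal and spam probabilities are calculated.

```python
from collections import Counter


def build_classification_dictionary(samples):
    """Map each lexical item to (normal_probability, spam_probability).

    `samples` is an iterable of (tokens, is_spam) pairs. Each probability
    is the fraction of texts of that class containing the item.
    """
    normal_counts, spam_counts = Counter(), Counter()
    n_normal = n_spam = 0
    for tokens, is_spam in samples:
        if is_spam:
            n_spam += 1
            spam_counts.update(set(tokens))    # count each item once per text
        else:
            n_normal += 1
            normal_counts.update(set(tokens))
    dictionary = {}
    for item in set(normal_counts) | set(spam_counts):
        p_normal = normal_counts[item] / n_normal if n_normal else 0.0
        p_spam = spam_counts[item] / n_spam if n_spam else 0.0
        dictionary[item] = (p_normal, p_spam)
    return dictionary


# Hypothetical segmented corpus.
corpus = [
    (["buy", "now"], True),
    (["buy", "cheap"], True),
    (["hello", "now"], False),
]
classification_dictionary = build_classification_dictionary(corpus)
```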

Fig. 8 is a schematic diagram of the text classification device provided by this embodiment. As shown in Fig. 8, the device comprises:

Cheating characteristic index processing module 801, configured to replace the characters other than words and numbers in the text to be processed with a preset fixed string, calculate the ratio of the word length to the total text length and the numbers of links, numbers and mailboxes contained in the text, and calculate the cheating characteristic index Score_feature of the text to be processed.

The processing of this module is the same as that of modules 701 to 704 in Embodiment five and is not repeated here.

Cheating user index processing module 802, configured to look up, in the pre-built user name dictionary or IP dictionary, the proportion of spam texts historically submitted by the user name or IP address of the text to be processed, and to calculate the cheating user index Score_user of the text to be processed.

The processing of this module is the same as that of modules 705 to 706 in Embodiment five and is not repeated here.

Bayesian index calculation module 803, configured to segment the text to be processed into words, look up the normal probability and the spam probability of each resulting lexical item in the pre-built Bayesian dictionary, and calculate the probability that the text to be processed is spam text, as the Bayesian index of the text to be processed.

The Bayesian index calculation module 803 uses the final probability obtained as the Bayesian index Score_bayes of the text to be processed.

Fisher index calculation module 804, configured to segment the text to be processed into words, look up the normal probability and the spam probability of each resulting lexical item in the pre-built Fisher dictionary, and calculate the probability that the text to be processed is spam text, as the Fisher index of the text to be processed.

Like Bayesian classification, Fisher classification is an alternative to Bayes, and the two can also be used in combination. Unlike Bayes, which calculates with word frequencies, the Fisher method counts the probability of documents, that is, the probability that a document belongs to S and to H when W occurs; a category must be specified when the result is obtained. It is computed as follows. Let C be the category (S or H above); P(C|W), the probability that a text belongs to category C when W occurs, can be obtained from training text statistics. For an input text, the probability that it belongs to spam text is P(S)=P(S|W)/(P(S|W)+P(H|W)). For the joint probability of several words, the results can be multiplied, P(S)=P(S1)P(S2)..., and P(H) is obtained in the same way. In the Fisher method, the resulting P(C) may further be processed as follows: the value -2*log(P(C)) is passed to the inverse chi-square function, which returns the final result.

Classification module 805, configured to weight or multiply the Bayesian index, the Fisher index, the cheating user index and the cheating characteristic index, and to determine a text to be processed whose result exceeds a preset threshold to be spam text.

The final score of the text to be processed may be taken as the product of the four indices, as in formula 5.

This example demonstrates the weakness of Fisher and Bayes, and also shows that feature expansion more effectively identifies text mixed with large amounts of punctuation, blanks, tabs or newlines. When similar texts appear in large numbers and are deleted by administrators, after several rounds of dictionary learning the Bayesian and Fisher indices become correspondingly more accurate. At the same time, because records of cheating comments now exist, the cheating user index also rises accordingly. The classifier so built can then identify such similar texts well, and can also identify the avatar advertisement cheating mode well. In addition, the invention takes into account the numbers of links and numbers contained in the text, so it also classifies well texts whose main content is website links, QQ numbers, mobile phone numbers and the like.

The text classification method and device provided by the invention classify text using the features of the text and user behavior in combination with machine learning methods. They can classify all kinds of text accurately, in particular text mixed with large amounts of special symbols, escape characters and links, and the meaningless texts issued in bulk by avatar advertisement cheating users. They effectively make up for the deficiency of existing machine learning methods and improve classification accuracy.

The foregoing describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A text classification method, characterized by comprising the following steps:

2. The method according to claim 1, characterized in that, before step S1, the method further comprises:

3. The method according to claim 1, characterized in that, before step S3, the method further comprises:

4. The method according to claim 1, characterized in that the method further comprises:

Determining the user name and IP address that submitted the text to be processed;

5. The method according to claim 4, characterized in that the method of building the user name dictionary and the IP dictionary specifically comprises:

6. The method according to claim 1, characterized in that the method further comprises:

7. The method according to claim 1, characterized in that the method further comprises:

8. A text classification device, characterized by comprising:

9. The device according to claim 8, characterized in that the configuration of the character replacement module comprises:

10. The device according to claim 8, characterized in that the device further comprises:

11. The device according to claim 8, characterized in that the device further comprises:

12. The device according to claim 11, characterized in that the module for building the user name dictionary and the IP dictionary specifically comprises:

13. The device according to claim 8, characterized in that the device further comprises:

14. The device according to claim 8, characterized in that the device further comprises: