CN103336766A

CN103336766A - Short text garbage identification and modeling method and device

Info

Publication number: CN103336766A
Application number: CN2013102780126A
Authority: CN
Inventors: 姜贵彬
Original assignee: Weimeng Chuangke Network Technology China Co Ltd
Current assignee: Weimeng Chuangke Network Technology China Co Ltd
Priority date: 2013-07-04
Filing date: 2013-07-04
Publication date: 2013-10-02
Anticipated expiration: 2033-07-04
Also published as: CN103336766B

Abstract

The invention discloses a short text spam identification and modeling method and device. The identification method comprises: performing word segmentation on a short text to be determined to obtain a word set, and performing spam feature analysis on the short text to obtain analysis information; comparing the analysis information and each word in the word set with the feature elements in a predetermined feature element set, and generating a word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the set; and determining whether the short text is spam text according to its word feature vector and a classification model trained in advance, wherein the classification algorithm is selected according to the number of samples in the training set. Because the word feature vector is expanded with the feature values of the analysis information, the accuracy of identifying spam text is improved.

Description

Short text spam identification and modeling method and device

Technical field

The present invention relates to the Internet field, and in particular to a short text spam identification and modeling method and device.

Background art

Internet technology is developing rapidly and the volume of information on the network is growing explosively. As the pace of life and work quickens, people increasingly prefer to communicate in brief text. SNS (Social Network Service) websites that produce, organize, and spread information through short texts, with Twitter and Sina Weibo as representatives, have won the favor of users.

At present, the main approach to automatic spam identification of short text content on the Internet is based on a classification model: a given short text is classified as either spam text or non-spam text. The approach comprises a training phase and a classification phase.

In the training phase, a model is built from a large number of short texts in a training set. Each short text in the training set, already labeled as spam text or non-spam text, is segmented to obtain its word set, and a word feature vector is calculated from that word set. A classification model is then trained on the word feature vectors of the short texts in the training set, for example with the SVM (Support Vector Machine) classification algorithm, a Bayes algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm.

In the classification phase, a short text to be determined is segmented to obtain its word set, and its word feature vector is calculated from that word set. Whether the short text is spam text is then judged from its word feature vector and the previously trained classification model. Several algorithms exist for making this judgment from a word feature vector and a classification model; they are well known to those skilled in the art and are not described further here.

In practice, however, the inventor found that, owing to the social nature of SNS websites, short texts on them are usually brief. The word set extracted from such brief content contains few words, so the effective feature values in the resulting word feature vector are very sparse; sometimes the word feature vector of a short text has only one or two effective feature values. Deciding between the spam text class and the non-spam text class on so few feature values greatly reduces accuracy. In other words, the recognition accuracy of prior-art spam identification methods for short text content is low.

Summary of the invention

In view of the above defects of the prior art, the invention provides a short text spam identification and modeling method and device for improving the accuracy of spam identification of short text content.

According to one aspect of the present invention, a short text spam identification method is provided, comprising:

Performing word segmentation on a short text to be determined to obtain a word set, and performing spam feature analysis on the short text to be determined to obtain analysis information;

Comparing the analysis information of the short text to be determined and each word in the word set with the feature elements in a predetermined feature element set, and generating the word feature vector of the short text to be determined from the feature values of the words or analysis information that match feature elements in the feature element set;

Determining, according to the word feature vector of the short text to be determined and a classification model trained in advance, whether the short text to be determined is spam text.

Preferably, the analysis information comprises any one of the following items, or any combination of them:

Information on whether a contact information feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mix ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns.

Preferably, the feature values of the analysis information are as follows:

For the information on whether a contact information feature is included, the feature value is binary, 0 or 1;

For the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mix ratio of words of different parts of speech, or the quantity ratio of punctuation marks to nouns, the feature value is a number between 0 and 1.

Further, before generating the word feature vector of the short text to be determined, the method also comprises:

Normalizing the feature values of the analysis information that matches feature elements in the feature element set:

The feature value of the information on whether a contact information feature is included is normalized to a binary 0 or 100;

The feature value of the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mix ratio of words of different parts of speech, or the quantity ratio of punctuation marks to nouns is multiplied by 100 to obtain a normalized value between 0 and 100.

Preferably, the feature value of a word is obtained as follows:

Calculate the TF and IDF values of the word, and calculate the feature value of the word according to Formula 1:

log(TF + 1.0) * IDF (Formula 1)

Preferably, the training method of the classification model and the method of determining the feature element set comprise:

For each short text in the training set, already labeled as spam text or non-spam text, performing word segmentation to obtain the word set of the short text, and performing spam feature analysis on the short text to obtain its analysis information;

For each short text in the training set, calculating the feature value of each word in the word set of the short text and the feature value of the analysis information of the short text, and then calculating the class discrimination from the calculated feature values; taking the words and analysis information whose class discrimination is greater than a set threshold as the feature elements in the feature element set;

For each short text in the training set, comparing the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set;

Training the classification model according to the word feature vectors of the short texts in the training set.

Preferably, training the classification model according to the word feature vectors of the short texts in the training set is specifically:

Using the SVM classification algorithm, a Bayes algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm to train the classification model according to the word feature vectors of the short texts in the training set.

According to another aspect of the present invention, a modeling method is also provided, comprising:

For each short text in the training set, calculating the feature value of each word in the word set of the short text and the feature value of the analysis information of the short text, and then calculating the class discrimination from the calculated feature values; taking the words and analysis information whose class discrimination is greater than a set threshold as the feature elements in the feature element set;

Training a classification model according to the word feature vectors of the short texts in the training set.

Preferably, after calculating the feature value of the analysis information of the short text, and before generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set, the method also comprises:

Normalizing the feature value of the analysis information of the short text:

The feature value of the information on whether a contact information feature is included is normalized to a binary 0 or 100;

The feature value of the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mix ratio of words of different parts of speech, or the quantity ratio of punctuation marks to nouns is multiplied by 100 to obtain a normalized value between 0 and 100; and

Generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set is specifically:

Generating the word feature vector of the short text from the feature values of the words, or the normalized feature values of the analysis information, that match feature elements in the feature element set.

According to another aspect of the present invention, a modeling device is also provided, comprising:

A feature extraction module, configured to, for each short text in the training set already labeled as spam text or non-spam text, perform word segmentation to obtain the word set of the short text and perform spam feature analysis on the short text to obtain its analysis information;

A feature element set determination module, configured to, for each short text in the training set, calculate the feature value of each word in the word set of the short text and the feature value of the analysis information of the short text, and then calculate the class discrimination from the calculated feature values; and to take the words and analysis information whose class discrimination is greater than a set threshold as the feature elements in the feature element set;

A feature vector determination module, configured to, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set;

A classification model construction module, configured to build the classification model according to the word feature vectors of the short texts in the training set determined by the feature vector determination module.

Preferably, the feature vector determination module is specifically configured to, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the feature values of the words, or the normalized feature values of the analysis information, that match feature elements in the feature element set.

According to another aspect of the present invention, a short text spam identification device is also provided, comprising:

A feature extraction module, configured to perform word segmentation on a short text to be determined to obtain a word set, and to perform spam feature analysis on the short text to be determined to obtain analysis information;

A feature vector determination module, configured to compare the analysis information of the short text to be determined and each word in the word set with the feature elements in a predetermined feature element set, and to generate the word feature vector of the short text to be determined from the feature values of the words or analysis information that match feature elements in the feature element set;

A spam identification module, configured to obtain the word feature vector of the short text to be determined from the feature vector determination module and then determine, according to that word feature vector and a classification model trained in advance, whether the short text to be determined is spam text.

Preferably, the feature vector determination module is specifically configured to compare the analysis information of the short text to be determined and each word in the word set with the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be determined from the feature values of the words, or the normalized feature values of the analysis information, that match feature elements in the feature element set.

In the technical solution of the present invention, both the word feature vectors of the short texts in the training set, which form the spam text hyperplane and the non-spam text hyperplane, and the word feature vector of the short text to be determined contain the feature values of the added analysis information. Performing spam identification on the short text to be determined with word feature vectors expanded in this way improves the recognition rate and recognition accuracy of spam text identification.

Brief description of the drawings

Fig. 1 is a flow chart of constructing the spam text hyperplane and the non-spam text hyperplane according to an embodiment of the invention;

Fig. 2 is a flow chart of performing spam identification on a short text to be determined according to an embodiment of the invention;

Fig. 3 is a block diagram of the internal structure of the modeling device and the short text spam identification device according to an embodiment of the invention.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. It should be noted, however, that many details listed in the specification are given only to provide the reader with a thorough understanding of one or more aspects of the invention; these aspects can also be practiced without these specific details.

Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as but not limited to hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. For instance, both an application running on a computing device and the computing device itself may be modules. One or more modules may reside within a process and/or thread of execution, and a module may be located on one computer and/or distributed between two or more computers.

The inventor considered that the word feature vector obtained by the prior-art method can be expanded: besides the feature values of words, it can also include the feature values of analysis information obtained by performing spam feature analysis on the short text. For example, the analysis information may include whether a contact information feature is present, the proportion of interference symbols, and the proportion of nouns or verbs. Judging whether a short text to be determined is spam text from this expanded word feature vector is more accurate than the prior-art method; that is, the recognition accuracy for spam short texts is improved.

Based on the above consideration, embodiments of the invention provide a short text spam identification method based on a classification model. In the training phase of the classification model, modeling is performed first: the spam text hyperplane and the non-spam text hyperplane of the classification model are constructed from the short texts in the training set. In the identification phase, the spam text hyperplane and the non-spam text hyperplane of the constructed classification model are used to judge whether a short text is spam.

The modeling method based on the short texts in the training set, that is, the method of constructing the spam text hyperplane and the non-spam text hyperplane of the classification model, follows the flow shown in Fig. 1 and comprises the following steps:

S101: Perform word segmentation on each short text in the training set to obtain the word set of each short text.

Specifically, each short text in the training set, already labeled as spam text or non-spam text, is segmented: the continuous character sequence of the short text is divided into individual words; function words that carry no practical meaning (such as punctuation, auxiliary words, modal particles, interjections, and onomatopoeia) are removed from the resulting words; and the remaining words constitute the word set of the short text.
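Step S101 can be sketched roughly as follows. This is only an illustration under stated assumptions: the source does not name a segmenter, so `segment` is a hypothetical stand-in that splits on whitespace, and `FUNCTION_WORDS` is an illustrative, incomplete list.

```python
import string

# Hypothetical function-word list; a real system would use a fuller list
# (auxiliary words, modal particles, interjections, onomatopoeia).
FUNCTION_WORDS = {"的", "了", "吗", "啊", "哦", "the", "a", "of"}

# ASCII punctuation plus a few common Chinese punctuation marks.
PUNCTUATION = string.punctuation + "，。！？、；："


def segment(text):
    """Stand-in segmenter (assumption): split on whitespace."""
    return text.split()


def build_word_set(text):
    """Return the words of a short text after dropping punctuation and function words.

    A list is returned rather than a set so that repeated words are kept
    and term frequency can still be computed later.
    """
    words = []
    for token in segment(text):
        token = token.strip(PUNCTUATION)
        if token and token not in FUNCTION_WORDS:
            words.append(token)
    return words
```

For Chinese text, `segment` would need to be replaced by a real word segmenter, since Chinese is not whitespace-delimited.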

S102: Perform spam feature analysis on each short text in the training set to obtain the analysis information of each short text.

Specifically, spam feature analysis is performed on each short text in the training set, already labeled as spam text or non-spam text, to obtain the analysis information of the short text. The analysis information comprises any one of the following items, or any combination of them: information on whether a contact information feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mix ratio of words of different parts of speech (such as the quantity ratio of nouns to verbs), and the quantity ratio of punctuation marks to nouns.

For each short text in the training set, the analysis information may be extracted during preprocessing before the word set of the short text is obtained, or it may be obtained after the word set has been obtained.

The contact information feature may be a string of digits or characters that serves as a means of contact, for example a telephone number, a QQ number, or a URL (Uniform Resource Locator). The purpose of some spam text is private gain, so it usually leaves contact information. Whether a short text contains contact information can therefore serve as an important feature for judging whether it is spam text.
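One way to detect the contact information feature is with regular expressions. The patterns below are assumptions made for illustration (an 11-digit mobile number starting with 1, a "QQ" label followed by digits, and an http or www URL); the source does not specify how detection is done.

```python
import re

# Illustrative patterns (assumptions, not taken from the source).
CONTACT_PATTERNS = [
    re.compile(r"(?<!\d)1\d{10}(?!\d)"),         # 11-digit mobile number
    re.compile(r"(?i)qq\D{0,3}\d{5,11}"),        # "QQ" label followed by a number
    re.compile(r"(?i)(https?://|www\.)\S+"),     # URL
]


def has_contact_feature(text):
    """Return 1 if the text contains a contact-like string, otherwise 0."""
    return int(any(p.search(text) for p in CONTACT_PATTERNS))
```

The 0/1 return value corresponds to the binary feature value described for this item of analysis information.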

An interference symbol may be a symbol that is rarely used, for example "$". To evade keyword filtering, some spam text uses rarely used symbols to split up keywords. The proportion of interference symbols in a short text can therefore serve as a feature for judging whether it is spam text.
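A minimal sketch of the interference symbol proportion follows. The symbol set is hypothetical apart from "$", which the text gives as an example; the proportion is taken over all characters of the short text.

```python
# Hypothetical symbol set; only "$" is named in the source.
INTERFERENCE_SYMBOLS = set("$#*^~|_")


def interference_ratio(text):
    """Fraction of characters in the text that are interference symbols (0 to 1)."""
    if not text:
        return 0.0
    hits = sum(1 for ch in text if ch in INTERFERENCE_SYMBOLS)
    return hits / len(text)
```

For example, `interference_ratio("f$r$e$e")` counts 3 symbols in 7 characters.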

The transition probability between words refers to the collocation probability of two adjacent words and the collocation probability of the types of two adjacent words. For example, in the short text "spam identification", "spam" and "identification" form a normal collocation with a corresponding collocation probability; the word for "spam" is a noun and the word for "identification" is a verb, and the collocation probability of a noun followed by a verb is relatively high.

A unigram word may be a single character;

A bigram word may be an idiom, slang expression, or set phrase made up of two characters.
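The transition probability between words described above could be estimated from adjacent-pair counts over a corpus. The sketch below makes two assumptions the source does not state: the probability of a pair is estimated as count(w1, w2) / count(w1 as first word), and a text-level feature is formed by averaging over its adjacent pairs.

```python
from collections import Counter


def bigram_model(corpus):
    """Count adjacent pairs and first-word occurrences over a list of token lists."""
    pair_counts, first_counts = Counter(), Counter()
    for tokens in corpus:
        for w1, w2 in zip(tokens, tokens[1:]):
            pair_counts[(w1, w2)] += 1
            first_counts[w1] += 1
    return pair_counts, first_counts


def transition_probability(tokens, pair_counts, first_counts):
    """Mean of P(w2 | w1) over adjacent pairs (averaging is an assumption of this sketch)."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    probs = [
        pair_counts[(w1, w2)] / first_counts[w1] if first_counts[w1] else 0.0
        for w1, w2 in pairs
    ]
    return sum(probs) / len(probs)
```

Passing part-of-speech tags instead of words to the same two functions would give the transition probability between the parts of speech of adjacent words.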

S103: For each short text in the training set, determine the feature value of the analysis information of the short text and the feature value of each word in the word set of the short text.

In this step, based on the word sets obtained for the short texts in the training set, for each word in the word set of each short text, calculate the TF (Term Frequency) value of the word in that short text and the IDF (Inverse Document Frequency) value of the word in the training set, and calculate the feature value of the word according to Formula 1:

log(TF + 1.0) * IDF (Formula 1)

The calculated feature value of a word is generally a number between 0 and 100.
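Formula 1 can be sketched as below. The source does not say which TF and IDF variants are used, so this sketch assumes TF is the raw count of the word in the short text and IDF is log(N / df), where N is the number of short texts in the training set and df is the number that contain the word.

```python
import math


def tf(word, tokens):
    """Raw count of the word in one short text (assumed TF variant)."""
    return tokens.count(word)


def idf(word, corpus):
    """log(N / df), a common IDF definition (assumed variant)."""
    df = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / df) if df else 0.0


def word_feature_value(word, tokens, corpus):
    """Formula 1: log(TF + 1.0) * IDF."""
    return math.log(tf(word, tokens) + 1.0) * idf(word, corpus)
```

With this IDF variant, a word that appears in every short text of the training set gets a feature value of 0, because log(N / N) is 0.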

Also in this step, for each short text in the training set, determine from the analysis information obtained whether the short text contains a contact information feature. If it does, the feature value of the information on whether a contact information feature is included is set to 1 (or 0); otherwise it is set to 0 (or 1);

The proportion of the counted interference symbols among the characters of the short text is taken as the feature value of the proportion of interference symbols;

The proportion of the counted rare characters among the characters of the short text is taken as the feature value of the proportion of rare characters;

The proportion of the counted traditional Chinese characters among the characters of the short text is taken as the feature value of the proportion of traditional Chinese characters;

The obtained transition probability between words is taken as the feature value of the transition probability between words;

The proportion of the counted nouns among the characters of the short text is taken as the feature value of the proportion of nouns;

The proportion of the counted verbs among the characters of the short text is taken as the feature value of the proportion of verbs;

The proportion of the counted punctuation marks among the characters of the short text is taken as the feature value of the proportion of punctuation marks;

The proportion of the counted unigram words among the characters of the short text is taken as the feature value of the proportion of unigram words;

The proportion of the counted bigram words among the characters of the short text is taken as the feature value of the proportion of bigram words;

The counted quantity ratio of nouns to verbs in the short text is taken as the feature value of the quantity ratio of nouns to verbs;

The counted quantity ratio of punctuation marks to nouns in the short text is taken as the feature value of the quantity ratio of punctuation marks to nouns.

Since the feature values of the above proportions, quantity ratios, and transition probabilities are normally numbers between 0 and 1, in a preferred embodiment the feature values of the analysis information of each short text in the training set may also be normalized for ease of calculation, yielding normalized feature values. For each short text in the training set, the calculated feature value of the information on whether a contact information feature is included is multiplied by 100, giving a normalized value of 0 or 100;

The feature values of the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the quantity ratio of nouns to verbs, and the quantity ratio of punctuation marks to nouns obtained by counting are each multiplied by 100, giving normalized feature values between 0 and 100 for the proportions, quantity ratios, and transition probability of the short text.
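The normalization described above amounts to scaling every analysis-information feature value by 100, which puts it on roughly the same 0 to 100 scale as the word feature values. A minimal sketch, with hypothetical feature names:

```python
def normalize_analysis_features(features):
    """Multiply each analysis-information feature value by 100.

    features: dict mapping a (hypothetical) feature name to a value that is
    either a 0/1 flag or a number normally between 0 and 1.
    """
    return {name: value * 100 for name, value in features.items()}
```

This sketch does not clamp its input; a quantity ratio that happens to exceed 1 would produce a value above 100.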

S104: For each short text in the training set, calculate the class discrimination from the feature value of each word in the word set of the short text and from the feature value of the analysis information of the short text, and then select the feature elements of the feature element set.

Specifically, for each short text in the training set, the AUC algorithm may be applied to the feature value of each word in the word set of the short text to obtain the class discrimination of each word. The class discrimination of a word reflects how much the word contributes to distinguishing spam text from non-spam text.

For each short text in the training set, the class discrimination of the analysis information is calculated from the feature value, or the normalized feature value, of the analysis information of the short text. The class discrimination of analysis information reflects how much that analysis information contributes to distinguishing spam text from non-spam text. For analysis information whose feature value is discrete, the chi-square test may be used to calculate the class discrimination; for analysis information whose feature value is continuous, the AUC (Area Under Curve) algorithm may be used.
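The two class discrimination measures named above can be sketched in plain Python. The source names the measures but not how they are computed, so the details here are assumptions: `auc` uses the pairwise-comparison definition of the area under the ROC curve, and `chi_square` is the Pearson statistic for a 2x2 table of a binary feature against the spam label.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5.

    labels: 1 for spam text, 0 for non-spam text.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def chi_square(flags, labels):
    """Pearson chi-square statistic for a binary feature against a binary label."""
    n = len(flags)
    a = sum(1 for f, y in zip(flags, labels) if f and y)        # feature present, spam
    b = sum(1 for f, y in zip(flags, labels) if f and not y)    # feature present, non-spam
    c = sum(1 for f, y in zip(flags, labels) if not f and y)    # feature absent, spam
    d = n - a - b - c                                           # feature absent, non-spam
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be preferable on a large training set.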

For each short text in the training set, after the class discrimination of the analysis information of the short text and the class discrimination of each word in its word set have been calculated, the words and analysis information whose class discrimination is greater than a set threshold are taken as the feature elements of the feature element set. The threshold may be set empirically by a technician and may differ depending on whether the feature value is discrete or continuous: for example, the threshold may be set to 10 for discrete feature values and to 0.7 for continuous feature values.
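The threshold-based selection can be sketched as follows. The default thresholds of 10 and 0.7 are the example values given in the text; the function name, the dictionary inputs, and the element names in the usage example are hypothetical.

```python
def select_feature_elements(discrimination, is_discrete,
                            discrete_threshold=10, continuous_threshold=0.7):
    """Keep elements whose class discrimination exceeds the threshold for their value type.

    discrimination: dict mapping element name -> class discrimination score.
    is_discrete: dict mapping element name -> True if its feature value is discrete.
    """
    selected = []
    for name, score in discrimination.items():
        threshold = discrete_threshold if is_discrete[name] else continuous_threshold
        if score > threshold:
            selected.append(name)
    return selected
```

For instance, with scores `{"has_contact": 25.3, "noun_ratio": 0.55, "free": 0.82}` where only `has_contact` is discrete, the selected elements are `has_contact` and `free`.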

S105: For each short text in the training set, generate the word feature vector of the short text.

In this step, for each short text in the training set, the analytical information of the short text and each word in its word set are compared with the feature elements in the feature element set, and the word feature vector of the short text is generated from the feature values of the words or analytical information that match feature elements in the feature element set.

More preferably, for each short text in the training set, the analytical information of the short text and each word in its word set may be compared with the feature elements in the feature element set, and the word feature vector of the short text may be generated from the feature values of the matching words or the normalized feature values of the matching analytical information.

Specifically, for each short text in the training set, each dimension of the word feature vector of the short text corresponds to one feature element in the feature element set. Where a vector element corresponds to analytical information of the short text, the feature value, or the normalized feature value, of that analytical information is used as the value of the vector element. Where a vector element corresponds to a word in the word set of the short text, the feature value, or the normalized feature value, of that word is used as the value of the vector element. All other vector elements are empty, or 0.
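The vector layout described in step S105 can be sketched as below. The function and parameter names are hypothetical; the sketch assumes that the feature element set has been fixed into an ordered list and that word names and analytical-information names do not collide.

```python
def build_feature_vector(feature_elements, word_values, analysis_values):
    """Return one value per feature element, in the order of feature_elements.

    word_values:     dict mapping each word of the short text to its feature value
    analysis_values: dict mapping each analytical-information name to its
                     feature value (raw or normalized)
    Feature elements that do not occur in the short text get the value 0.
    """
    vector = []
    for element in feature_elements:
        if element in analysis_values:
            vector.append(analysis_values[element])
        elif element in word_values:
            vector.append(word_values[element])
        else:
            vector.append(0.0)
    return vector
```

Because the order of `feature_elements` is fixed, vectors built for training texts and for texts to be determined share the same dimensions.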

S106: Build the classification model from the word feature vectors of the short texts in the training set.

In this step, an SVM classification algorithm, a Bayes algorithm, a decision tree classification algorithm or a maximum entropy classification algorithm may be used to train the classification model from the word feature vector of each short text in the training set. Specifically, a suitable algorithm may be selected in view of the number of short texts in the training set (that is, the sample size), and the classification model is then trained with that algorithm from the word feature vector of each short text in the training set.

The specific way of training a classification model from the word feature vectors of the short texts in the training set is well known to those skilled in the art and is not repeated here.

In fact, steps S101 and S102 need not be performed in a strict order: they may be performed in parallel, or step S102 may be performed before step S101.

After the classification model has been built in the training stage, it can be used in the recognition stage to perform garbage identification on a short text to be determined. The flow chart of the short text garbage identification method provided by the embodiment of the invention is shown in Figure 2, and the specific steps comprise:

S201: Perform word segmentation on the short text to be determined to obtain the word set of the short text to be determined.

Specifically, word segmentation is performed on the short text to be determined: the continuous character sequence of the short text is divided into individual words; function words that carry no practical meaning (such as punctuation, auxiliary words, modal particles, interjections and onomatopoeia) are removed from the resulting words; and the remaining words constitute the word set of the short text.
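The filtering part of step S201 can be sketched as follows. The segmentation itself is not shown: the sketch assumes that some word segmenter has already produced `(word, part_of_speech)` pairs, and the part-of-speech labels in `FUNCTION_WORD_TAGS` are hypothetical names chosen for illustration, not the tag set of any particular tool.

```python
# Hypothetical part-of-speech labels for words that carry no practical meaning.
FUNCTION_WORD_TAGS = {
    "punctuation",
    "auxiliary",
    "modal_particle",
    "interjection",
    "onomatopoeia",
}


def build_word_set(tagged_tokens):
    """Drop function words and return the remaining words as a set.

    tagged_tokens: iterable of (word, part_of_speech) pairs produced by a
    word segmenter that is outside the scope of this sketch.
    """
    return {word for word, tag in tagged_tokens if tag not in FUNCTION_WORD_TAGS}
```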

S202: Perform garbage feature analysis on the short text to be determined to obtain the analytical information of the short text to be determined.

Specifically, garbage feature analysis is performed on the short text to be determined to obtain its analytical information, which comprises any one of the following items of information, or any combination thereof: information on whether a contact-information feature is present, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing proportions of words of different parts of speech, the quantity ratio of punctuation marks to nouns, and so on.

The analytical information of the short text to be determined may be extracted during preprocessing, before the word set of the short text to be determined is obtained, or it may be obtained after that word set has been obtained.

S203: Determine the feature elements of the short text to be determined.

Specifically, the analytical information of the short text to be determined and each word in its word set are compared with the feature elements in the above feature element set, and the words or analytical information that match feature elements in the feature element set are taken as the feature elements of the short text to be determined.
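The matching in step S203 amounts to a set intersection. A minimal sketch, assuming that words and analytical-information items are identified by string names (the function name is hypothetical):

```python
def match_feature_elements(word_set, analysis_names, feature_element_set):
    """Return the words and analytical-information names of the short text
    that also appear in the predetermined feature element set."""
    candidates = set(word_set) | set(analysis_names)
    return candidates & set(feature_element_set)
```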

S204: Generate the word feature vector of the short text to be determined from the feature values of its feature elements.

In this step, the feature value of each word that serves as a feature element of the short text to be determined is calculated as follows: the TF value of the word in the short text is calculated, the IDF value of the word in the training set is calculated, and the feature value of the word is then calculated according to Formula 1 above.
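Formula 1, as stated in claim 5, is log(TF + 1.0) × IDF. A minimal sketch of the calculation follows. The base of the logarithm and the definition of IDF are not given in this section, so the natural logarithm and the common form log(N / (1 + DF)) are used here as assumptions, and the function names are hypothetical.

```python
import math


def inverse_document_frequency(word, training_word_sets):
    """IDF of a word over the training set.

    Assumed definition: log(N / (1 + DF)), where N is the number of training
    texts and DF is the number of them whose word set contains the word.
    """
    n = len(training_word_sets)
    df = sum(1 for word_set in training_word_sets if word in word_set)
    return math.log(n / (1 + df))


def word_feature_value(tf, idf):
    """Formula 1: log(TF + 1.0) * IDF (natural logarithm assumed)."""
    return math.log(tf + 1.0) * idf
```

The `+ 1.0` inside the logarithm makes a word that does not occur in the short text (TF = 0) yield a feature value of exactly 0.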

In this step, the feature value of each item of analytical information that serves as a feature element of the short text to be determined is calculated as follows:

Determine whether the analytical information serving as feature elements includes the information on whether a contact-information feature is present; if so, the feature value of this information is set to 1 (or 0) when the short text contains a contact-information feature, and to 0 (or 1) otherwise;

Determine whether the analytical information serving as feature elements includes the proportion of interference symbols; if so, the ratio of the counted interference symbols to the characters of the short text is taken as the feature value of the proportion of interference symbols;

Determine whether the analytical information serving as feature elements includes the proportion of rare characters; if so, the ratio of the counted rare characters to the characters of the short text is taken as the feature value of the proportion of rare characters;

Determine whether the analytical information serving as feature elements includes the proportion of traditional Chinese characters; if so, the ratio of the counted traditional Chinese characters to the characters of the short text is taken as the feature value of the proportion of traditional Chinese characters;

Determine whether the analytical information serving as feature elements includes the transition probability between words; if so, the obtained transition probability between words is taken as the feature value of the transition probability between words;

Determine whether the analytical information serving as feature elements includes the proportion of nouns; if so, the ratio of the counted nouns to the characters of the short text is taken as the feature value of the proportion of nouns;

Determine whether the analytical information serving as feature elements includes the proportion of verbs; if so, the ratio of the counted verbs to the characters of the short text is taken as the feature value of the proportion of verbs;

Determine whether the analytical information serving as feature elements includes the proportion of punctuation marks; if so, the ratio of the counted punctuation marks to the characters of the short text is taken as the feature value of the proportion of punctuation marks;

Determine whether the analytical information serving as feature elements includes the proportion of unigram words; if so, the ratio of the counted unigram words to the characters of the short text is taken as the feature value of the proportion of unigram words;

Determine whether the analytical information serving as feature elements includes the proportion of bigram words; if so, the ratio of the counted bigram words to the characters of the short text is taken as the feature value of the proportion of bigram words;

Determine whether the analytical information serving as feature elements includes the quantity ratio of nouns to verbs; if so, the counted quantity ratio of nouns to verbs in the short text is taken as the feature value of the quantity ratio of nouns to verbs;

Determine whether the analytical information serving as feature elements includes the quantity ratio of punctuation marks to nouns; if so, the counted quantity ratio of punctuation marks to nouns in the short text is taken as the feature value of the quantity ratio of punctuation marks to nouns.
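A few of the feature values listed above can be sketched as follows. The text does not define what counts as a contact-information feature or as punctuation, so the regular expression (a run of seven or more digits, or a URL) and the use of ASCII punctuation only are illustrative assumptions, as are all function names.

```python
import re
import string

# Hypothetical pattern: a long digit run (phone or account number) or a URL.
CONTACT_PATTERN = re.compile(r"\d{7,}|https?://\S+|www\.\S+")


def has_contact_feature(text):
    """1 if the text appears to contain contact information, else 0."""
    return 1 if CONTACT_PATTERN.search(text) else 0


def char_ratio(text, predicate):
    """Share of the characters of the text for which predicate(ch) is true."""
    if not text:
        return 0.0
    return sum(1 for ch in text if predicate(ch)) / len(text)


def punctuation_ratio(text):
    """Proportion of punctuation marks (ASCII punctuation only in this sketch)."""
    return char_ratio(text, lambda ch: ch in string.punctuation)


def quantity_ratio(numerator_count, denominator_count):
    """Quantity ratio of two counts, e.g. nouns to verbs; 0.0 if the denominator is 0."""
    if denominator_count == 0:
        return 0.0
    return numerator_count / denominator_count
```

The other character-level proportions (interference symbols, rare characters, traditional Chinese characters) could reuse `char_ratio` with a different predicate, given a suitable character list.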

Since the feature values of the various types of proportion information, quantity ratio information and transition probability of the short text to be determined are generally values between 0 and 1, the feature values of the analytical information of the short text to be determined may, as a preferred embodiment, be normalized to simplify computation, yielding normalized values of these feature values. For the short text to be determined, the calculated feature value of the information on whether a contact-information feature is present is multiplied by 100, giving a normalized value of 0 or 100;

The feature values obtained by counting for the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the quantity ratio of nouns to verbs, and the quantity ratio of punctuation marks to nouns are each multiplied by 100, giving the normalized values of the feature values of the various types of proportion, quantity ratio and transition probability of the short text to be determined: values between 0 and 100.
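The normalization described above is a multiplication by 100. A minimal sketch, with a hypothetical function name and a dictionary of feature values keyed by analytical-information name:

```python
def normalize_analysis_values(analysis_values, scale=100):
    """Multiply every analytical-information feature value by the scale factor."""
    return {name: value * scale for name, value in analysis_values.items()}
```

A quantity ratio such as nouns to verbs is not bounded by 1, so its scaled value can exceed 100; the range of 0 to 100 holds for the values that lie between 0 and 1 before scaling.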

In this step, the word feature vector of the short text to be determined is generated from the feature values, or the normalized feature values, of the feature elements of the short text to be determined. Each dimension of the word feature vector of the short text to be determined corresponds to one feature element in the feature element set. Where a vector element corresponds to analytical information of the short text to be determined, the feature value, or the normalized feature value, of that analytical information is used as the value of the vector element. Where a vector element corresponds to a word in the word set of the short text to be determined, the feature value, or the normalized feature value, of that word is used as the value of the vector element. All other vector elements are empty, or 0.

S205: Determine whether the short text to be determined is garbage text according to the word feature vector of the short text to be determined and the classification model.

How to determine, from the word feature vector of the short text to be determined and the classification model, whether the short text to be determined is garbage text is a technique well known to those skilled in the art and is not repeated here.

In fact, steps S201 and S202 need not be performed in a strict order: they may be performed in parallel, or step S202 may be performed before step S201.

Based on the above modeling method, an embodiment of the invention provides a modeling device whose internal structure is shown in the block diagram of Figure 3 and which specifically comprises: a feature extraction module 301, a feature vector determination module 302, a classification model building module 303 and a feature element set determination module 304.

The feature extraction module 301 is used, for each short text in the training set that has been classified as garbage text or non-garbage text, to obtain the word set of the short text by word segmentation and to obtain the analytical information of the short text by garbage feature analysis; the analytical information of a short text may comprise any one of the following items of information, or any combination thereof:

The feature element set determination module 304 is used, for each short text in the training set, to obtain the word set and the analytical information of the short text from the feature extraction module 301, to calculate the feature value of each word in the word set of the short text and the feature value of the analytical information of the short text, and then to compute class discrimination degrees from the calculated feature values; the words and the analytical information whose class discrimination degree is greater than the set threshold are taken as feature elements of the feature element set;

Specifically, the feature vector determination module 302 described below may, for each short text in the training set, compare the analytical information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the feature values of the matching words or the normalized feature values of the matching analytical information.

The feature vector determination module 302 is used, for each short text in the training set, to compare the analytical information of the short text and each word in its word set with the feature elements in the feature element set obtained by the feature element set determination module 304, and to generate the word feature vector of the short text from the feature values of the words or analytical information that match feature elements in the feature element set;

The classification model building module 303 is used to build the classification model from the word feature vectors of the short texts in the training set determined by the feature vector determination module 302. Specifically, the classification model building module 303 uses an SVM classification algorithm, a Bayes algorithm, a decision tree classification algorithm or a maximum entropy classification algorithm to train the classification model from the word feature vector of each short text in the training set.

Based on the above short text garbage identification method, an embodiment of the invention provides a short text garbage identification device whose internal structure is shown in the block diagram of Figure 3 and which specifically comprises: a feature extraction module 401, a feature vector determination module 402 and a garbage identification module 403.

The feature extraction module 401 is used to obtain the word set of the short text to be determined by word segmentation, and to obtain analytical information by performing garbage feature analysis on the short text to be determined. The specific content of the analytical information of a short text has been described above and is not repeated here.

The feature vector determination module 402 is used to obtain the analytical information and the word set of the short text to be determined from the feature extraction module 401, to compare the analytical information of the short text to be determined and each word in its word set with the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be determined from the feature values of the words or analytical information that match feature elements in the feature element set;

Specifically, the feature vector determination module 402 may compare the analytical information of the short text to be determined and each word in its word set with the feature elements in the predetermined feature element set, and generate the word feature vector of the short text to be determined from the feature values of the matching words or the normalized feature values of the matching analytical information.

The garbage identification module 403 is used, after obtaining the word feature vector of the short text to be determined from the feature vector determination module 402, to determine whether the short text to be determined is garbage text according to that word feature vector and the classification model trained in advance.
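The way the three modules of the identification device hand data to one another can be sketched as a small class. This is an illustrative composition only: the class name, the method name and the three callables are hypothetical stand-ins for modules 401, 402 and 403, and the internals of each stage are left to the caller.

```python
class GarbageRecognizer:
    """Chains three stages that mirror modules 401, 402 and 403."""

    def __init__(self, extract_features, build_vector, classify):
        # extract_features: text -> (word_set, analysis_values)   (module 401)
        # build_vector: (word_set, analysis_values) -> list        (module 402)
        # classify: vector -> truthy if garbage text               (module 403)
        self.extract_features = extract_features
        self.build_vector = build_vector
        self.classify = classify

    def is_garbage(self, text):
        """Run the short text to be determined through the three stages in order."""
        word_set, analysis_values = self.extract_features(text)
        vector = self.build_vector(word_set, analysis_values)
        return bool(self.classify(vector))
```

Passing the stages in as callables keeps the sketch independent of any particular segmenter or classification algorithm.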

In the technical solution of the invention, the word feature vector of each short text in the training set and the word feature vector of the short text to be determined both include the feature values of the added analytical information. Performing garbage identification on the short text to be determined with word feature vectors that include these feature values improves the recognition rate and the recognition accuracy of garbage text identification.

The above is only a preferred embodiment of the invention. It should be pointed out that a person skilled in the art may make improvements and modifications without departing from the principle of the invention, and that such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A short text garbage identification method, characterized by comprising:

2. The method of claim 1, characterized in that the analytical information comprises any one of the following items of information, or any combination thereof:

3. The method of claim 2, characterized in that the feature value of the analytical information specifically comprises:

4. The method of claim 3, characterized in that, before the generating of the word feature vector of the short text to be determined, the method further comprises:

5. The method of any one of claims 1-4, characterized in that the feature value of the word is obtained as follows:

log(TF + 1.0) × IDF (Formula 1).

6. The method of any one of claims 1-4, characterized in that the method of training the classification model and the method of determining the feature element set comprise:

7. The method of claim 6, characterized in that the training of the classification model from the word feature vector of each short text in the training set is specifically:

8. A modeling method, characterized by comprising:

9. The method of claim 8, characterized in that the analytical information comprises any one of the following items of information, or any combination thereof:

10. The method of claim 9, characterized in that the feature value of the analytical information specifically comprises:

11. The method of claim 10, characterized in that, after the calculating of the feature value of the analytical information of the short text, and before the generating of the word feature vector of the short text from the feature values of the words or analytical information that match feature elements in the feature element set, the method further comprises:

12. The method of any one of claims 8-11, characterized in that the training of the classification model from the word feature vector of each short text in the training set is specifically:

13. A modeling device, characterized by comprising:

14. The device of claim 13, characterized in that

the feature vector determination module is specifically used, for each short text in the training set, to compare the analytical information of the short text and each word in its word set with the feature elements in the feature element set, and to generate the word feature vector of the short text from the feature values of the matching words or the normalized feature values of the matching analytical information.

15. The device of claim 13 or 14, characterized in that the analytical information comprises any one of the following items of information, or any combination thereof:

16. A short text garbage identification device, characterized by comprising:

17. The device of claim 16, characterized in that

the feature vector determination module is specifically used to compare the analytical information of the short text to be determined and each word in its word set with the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be determined from the feature values of the matching words or the normalized feature values of the matching analytical information.

18. The device of claim 16 or 17, characterized in that the analytical information comprises any one of the following items of information, or any combination thereof: