CN103336766B - Short text garbage identification and modeling method and device - Google Patents

Short text garbage identification and modeling method and device

Info

Publication number
CN103336766B
Authority
CN
China
Prior art keywords
information
word
short text
accounting
eigenvalue
Legal status
Active
Application number
CN201310278012.6A
Other languages
Chinese (zh)
Other versions
CN103336766A (en)
Inventor
姜贵彬 (Jiang Guibin)
Current Assignee
Weimeng Chuangke Network Technology China Co Ltd
Original Assignee
Weimeng Chuangke Network Technology China Co Ltd
Priority date
Filing date
Publication date
Application filed by Weimeng Chuangke Network Technology China Co Ltd
Priority to CN201310278012.6A
Publication of CN103336766A
Application granted
Publication of CN103336766B
Status: Active


Abstract

The invention discloses a method and device for spam identification and modeling for short texts. The method includes: segmenting a short text to be judged to obtain its word set, and performing spam feature analysis on the short text to obtain analysis information; comparing the analysis information and each word in the word set against the feature elements in a predetermined feature element set, and generating a word feature vector for the short text from the feature values of the words and analysis information that match feature elements in that set; and determining whether the short text is a spam text according to its word feature vector and a classification model, where the classification model is trained in advance with a classification algorithm chosen to suit the number of samples in the training set. Because spam identification uses a word feature vector expanded with the feature values of the analysis information, the accuracy of spam text recognition is improved.

Description

Short text garbage identification and modeling method and device
Technical field
The present invention relates to the Internet field, and in particular to a method and device for spam identification and modeling for short texts.
Background technology
With the rapid development of Internet technology, the amount of information on the network has grown explosively. As the pace of life and work quickens, people increasingly tend to communicate with brief pieces of text. SNS (Social Network Service) websites, represented by Twitter and Sina Weibo, which produce, organize and propagate information in the form of short texts, have won the favor of Internet users.
At present, the main approach to automatic spam identification of short text content on the Internet is a classification-model-based method that classifies a given piece of short text content as either a spam text or a non-spam text. The method comprises a training stage and a classification stage.
In the training stage, a model is built from the large number of short texts in a training set: each short text in the training set, already labeled as a spam text or a non-spam text, is segmented to obtain its word set, and the word feature vector of each short text is calculated from that word set; a classification model is then trained on the word feature vectors of the short texts in the training set. For example, the SVM (Support Vector Machine) classification algorithm, a Bayesian classification algorithm, a decision-tree classification algorithm or a maximum-entropy classification algorithm may be used to train the classification model on the word feature vectors of the short texts in the training set.
In the classification stage, a short text to be judged is segmented to obtain its word set, and the word feature vector of that short text is calculated from the word set; whether the short text is a spam text is then determined from its word feature vector and the previously trained classification model. Many algorithms exist for making this determination from the word feature vector and the classification model; they are well known to those skilled in the art and are not repeated here.
In practice, however, the inventors found that because of the social nature of SNS websites, short texts on such sites are usually very brief. The word set extracted from such brief content contains few words, so the effective feature values in the resulting word feature vector are very sparse; sometimes the word feature vector of a short text has only one or two effective feature values. Judging membership in the spam text set or the non-spam text set from so few feature values markedly reduces accuracy. In other words, the spam identification methods of the prior art have low recognition accuracy on short text content.
Summary of the invention
In view of the above defects of the prior art, the present invention provides a method and device for spam identification and modeling for short texts, in order to improve the accuracy of spam identification on short text content.
According to one aspect of the invention, a spam identification method for short texts is provided, including:
segmenting a short text to be judged to obtain a word set, and performing spam feature analysis on the short text to obtain analysis information;
comparing the analysis information of the short text to be judged and each word in the word set against the feature elements in a predetermined feature element set, and generating a word feature vector for the short text according to the feature values of the words and analysis information that match feature elements in the feature element set;
determining whether the short text to be judged is a spam text according to its word feature vector and a classification model trained in advance.
Preferably, the analysis information includes any one, or any combination, of the following:
whether the text contains a contact-information feature, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between adjacent words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns.
Preferably, the feature values of the analysis information specifically include:
for the information on whether a contact-information feature is contained, a binary feature value of 0 or 1;
for the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of punctuation marks to nouns, a feature value that is a number between 0 and 1.
Further, before the word feature vector of the short text to be judged is generated, the method also includes:
normalizing the feature values of the analysis information that matches feature elements in the feature element set:
the feature value of the information on whether a contact-information feature is contained is normalized to the binary value 0 or 100;
the feature values of the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of punctuation marks to nouns are multiplied by 100, giving normalized values between 0 and 100.
Preferably, the feature value of a word is obtained as follows:
calculating the TF and IDF values of the word, and computing the feature value of the word with Formula 1 below:
log(TF + 1.0) × IDF    (Formula 1)
Preferably, the training method of the classification model and the method of determining the feature element set include:
for each short text in the training set that has been labeled a spam text or a non-spam text, segmenting the short text to obtain its word set, and performing spam feature analysis on the short text to obtain its analysis information;
for each short text in the training set, calculating the feature value of every word in its word set and the feature values of its analysis information, then computing the class discrimination degree of the calculated feature values; the words and the analysis information whose class discrimination degree exceeds a set threshold are taken as the feature elements of the feature element set;
for each short text in the training set, comparing its analysis information and each word in its word set against the feature elements in the feature element set, and generating the word feature vector of the short text according to the feature values of the words and analysis information that match feature elements in the feature element set;
training the classification model on the word feature vectors of the short texts in the training set.
Preferably, training the classification model on the word feature vectors of the short texts in the training set specifically comprises:
training the classification model on the word feature vectors of the short texts in the training set with an SVM classification algorithm, a Bayesian classification algorithm, a decision-tree classification algorithm or a maximum-entropy classification algorithm.
According to another aspect of the present invention, a modeling method is also provided, including:
for each short text in the training set that has been labeled a spam text or a non-spam text, segmenting the short text to obtain its word set, and performing spam feature analysis on the short text to obtain its analysis information;
for each short text in the training set, calculating the feature value of every word in its word set and the feature values of its analysis information, then computing the class discrimination degree of the calculated feature values; the words and the analysis information whose class discrimination degree exceeds a set threshold are taken as the feature elements of the feature element set;
for each short text in the training set, comparing its analysis information and each word in its word set against the feature elements in the feature element set, and generating the word feature vector of the short text according to the feature values of the words and analysis information that match feature elements in the feature element set;
training a classification model on the word feature vectors of the short texts in the training set.
Preferably, the analysis information includes any one, or any combination, of the following:
whether the text contains a contact-information feature, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between adjacent words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns.
Preferably, the feature values of the analysis information specifically include:
for the information on whether a contact-information feature is contained, a binary feature value of 0 or 1;
for the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of punctuation marks to nouns, a feature value that is a number between 0 and 1.
Preferably, after the feature values of the analysis information of the short text are calculated, and before the word feature vector of the short text is generated from the feature values of the words and analysis information that match feature elements in the feature element set, the method also includes:
normalizing the feature values of the analysis information of the short text:
the feature value of the information on whether a contact-information feature is contained is normalized to the binary value 0 or 100;
the feature values of the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of punctuation marks to nouns are multiplied by 100, giving normalized values between 0 and 100; and
generating the word feature vector of the short text from the feature values of the words and analysis information that match feature elements in the feature element set specifically comprises:
generating the word feature vector of the short text from the normalized feature values of the words and analysis information that match feature elements in the feature element set.
Preferably, training the classification model on the word feature vectors of the short texts in the training set specifically comprises:
training the classification model on the word feature vectors of the short texts in the training set with an SVM classification algorithm, a Bayesian classification algorithm, a decision-tree classification algorithm or a maximum-entropy classification algorithm.
According to another aspect of the present invention, a modeling device is also provided, including:
a feature extraction module, configured to segment each short text in the training set that has been labeled a spam text or a non-spam text to obtain its word set, and to perform spam feature analysis on the short text to obtain its analysis information;
a feature element set determination module, configured, for each short text in the training set, to calculate the feature value of every word in its word set and the feature values of its analysis information, then to compute the class discrimination degree of the calculated feature values, and to take the words and analysis information whose class discrimination degree exceeds a set threshold as the feature elements of the feature element set;
a feature vector determination module, configured, for each short text in the training set, to compare its analysis information and each word in its word set against the feature elements in the feature element set, and to generate the word feature vector of the short text from the feature values of the words and analysis information that match feature elements in the feature element set;
a classification model building module, configured to build a classification model from the word feature vectors of the short texts in the training set determined by the feature vector determination module.
Preferably, the feature vector determination module is specifically configured, for each short text in the training set, to compare its analysis information and each word in its word set against the feature elements in the feature element set, and to generate the word feature vector of the short text from the normalized feature values of the words and analysis information that match feature elements in the feature element set.
Preferably, the analysis information includes any one, or any combination, of the following:
whether the text contains a contact-information feature, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between adjacent words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns.
According to another aspect of the present invention, a spam identification device for short texts is also provided, including:
a feature extraction module, configured to segment a short text to be judged to obtain a word set, and to perform spam feature analysis on the short text to be judged to obtain analysis information;
a feature vector determination module, configured to compare the analysis information of the short text to be judged and each word in its word set against the feature elements in a predetermined feature element set, and to generate the word feature vector of the short text to be judged from the feature values of the words and analysis information that match feature elements in the feature element set;
a spam identification module, configured, after the feature vector determination module obtains the word feature vector of the short text to be judged, to determine whether the short text to be judged is a spam text according to that word feature vector and a classification model trained in advance.
Preferably, the feature vector determination module is specifically configured to compare the analysis information of the short text to be judged and each word in its word set against the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be judged from the normalized feature values of the words and analysis information that match feature elements in the feature element set.
Preferably, the analysis information includes any one, or any combination, of the following:
whether the text contains a contact-information feature, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between adjacent words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns.
In the technical solution of the present invention, the word feature vectors of the training-set short texts that form the spam text hyperplane and the non-spam text hyperplane, and the word feature vector of the short text to be judged, all contain the additional feature values of the analysis information. Performing spam identification on the short text to be judged with a word feature vector that contains these additional feature values improves both the recognition rate and the recognition accuracy of spam text identification.
Brief description of the drawings
Fig. 1 is a flow chart of building the spam text hyperplane and the non-spam text hyperplane according to an embodiment of the present invention;
Fig. 2 is a flow chart of performing spam identification on a short text to be judged according to an embodiment of the present invention;
Fig. 3 is a block diagram of the internal structure of the modeling device and of the spam identification device for short texts according to an embodiment of the present invention.
Detailed description of the invention
To make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in more detail below with reference to the drawings and preferred embodiments. It should be noted, however, that many of the details listed in the description are only intended to give the reader a thorough understanding of one or more aspects of the present invention; those aspects of the invention can be realized even without these specific details.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as, but not limited to, hardware, firmware, combinations of hardware and software, software, or software in execution. For example, a module may be, but is not limited to: a process running on a processor, a processor, an object, an executable program, a thread of execution, a program and/or a computer. For instance, both an application running on a computing device and the computing device itself may be modules. One or more modules may reside within a process and/or thread of execution, and a module may be located on one computer and/or distributed between two or more computers.
The inventors of the present invention considered that the word feature vector obtained by the prior-art method can be expanded: besides the feature values of words, it can also include the feature values of the analysis information obtained by performing spam feature analysis on the short text. For example, the analysis information obtained by spam feature analysis may include whether the text contains a contact-information feature, the proportion of interference symbols, the proportion of nouns, the proportion of verbs, and so on. Judging whether the corresponding short text is a spam text from this expanded word feature vector improves accuracy compared with the prior-art method, i.e. it improves the recognition accuracy for spam short texts.
Based on the above consideration, an embodiment of the present invention provides a classification-model-based spam identification method for short texts. In the training stage of the classification model, a model is first built: in the modeling process, the spam text hyperplane and the non-spam text hyperplane of the classification model are built from the short texts in the training set. In the recognition stage, the spam text hyperplane and the non-spam text hyperplane of the built classification model can then be used to judge spam short texts.
In the modeling process, the method of building the model from the short texts in the training set, i.e. the method of building the spam text hyperplane and the non-spam text hyperplane of the classification model, is shown in Fig. 1; the specific steps are as follows:
S101: segment each short text in the training set to obtain the word set of each short text.
Specifically, each short text in the training set that has been labeled a spam text or a non-spam text is segmented: the continuous character sequence of the short text is divided into individual words; among the words obtained, function words without practical meaning (such as punctuation, auxiliary words, modal particles, interjections and onomatopoeia) are excluded; the remaining words form the word set of the short text. A sketch of this step is shown below.
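A minimal sketch of step S101 under stated assumptions: the patent does not name a segmentation tool, so the jieba part-of-speech segmenter and the function-word tag prefixes listed in the comment are illustrative assumptions.

```python
import jieba.posseg as pseg

# Assumed ICTCLAS-style tag prefixes for function words to drop:
# x = punctuation, u = auxiliary words, y = modal particles,
# e = interjections, o = onomatopoeia.
FUNCTION_WORD_FLAGS = ("x", "u", "y", "e", "o")

def build_word_set(short_text: str) -> set:
    """Step S101: segment one short text and keep only content words."""
    return {
        word for word, flag in pseg.cut(short_text)
        if word.strip() and not flag.startswith(FUNCTION_WORD_FLAGS)
    }
```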
S102: perform spam feature analysis on each short text in the training set to obtain the analysis information of each short text.
Specifically, spam feature analysis is performed on each short text in the training set that has been labeled a spam text or a non-spam text, giving the analysis information of the short text, which specifically includes any one, or any combination, of the following: whether the text contains a contact-information feature, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between adjacent words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech (for example the ratio of the number of nouns to the number of verbs), and the ratio of the number of punctuation marks to the number of nouns.
For each short text in the training set, the analysis information of the short text may be extracted in a preprocessing step before the word set of the short text is obtained, or it may be obtained after the word set has been obtained.
The contact-information feature may specifically be a string of digits or characters that carries contact meaning, for example a telephone number, a QQ number or a URL (Uniform Resource Locator). Typically, the purpose of some spam texts is to gain private profit, for which contact information must be left behind; therefore, whether a short text contains contact information can be an important judging feature for deciding whether it is a spam text.
Interference symbols may specifically be rarely used symbols, such as "$". Some spam texts use rarely used symbols to separate keywords in order to evade keyword filtering; therefore, the proportion of interference symbols appearing in a short text can be used as a judging feature for deciding whether it is a spam text.
The transition probability between words refers to the collocation probability of two adjacent words and the collocation probability of the parts of speech of two adjacent words. For example, in the short text "spam identification", "spam" normally collocates with "identification", so there is a corresponding collocation probability; "spam" is a noun and "identification" is a verb, and the probability of a noun collocating with a verb is relatively high.
A single-character word is specifically a word consisting of a single character;
a two-character word may specifically be an idiom, slang term or set phrase composed of two characters. A sketch of extracting a subset of this analysis information follows.
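The sketch below computes a small subset of the analysis information named above (the contact-information flag, the interference-symbol proportion and the punctuation proportion). The regular expression and the symbol sets are illustrative assumptions, not values fixed by the patent.

```python
import re

# Illustrative patterns only: URLs, QQ numbers and long digit runs (phone numbers).
CONTACT_PATTERN = re.compile(r"https?://\S+|QQ[:：]?\s*\d{5,11}|\d{7,12}", re.IGNORECASE)
INTERFERENCE_SYMBOLS = set("$#*&^~|_")        # assumed set of rarely used symbols
PUNCTUATION_MARKS = set("，。！？、；：,.!?;:")

def analyse_spam_features(short_text: str) -> dict:
    """Step S102: derive part of the analysis information of one short text."""
    length = max(len(short_text), 1)
    return {
        "has_contact": 1.0 if CONTACT_PATTERN.search(short_text) else 0.0,
        "interference_ratio": sum(ch in INTERFERENCE_SYMBOLS for ch in short_text) / length,
        "punctuation_ratio": sum(ch in PUNCTUATION_MARKS for ch in short_text) / length,
    }
```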
S103: for each short text in the training set, determine the feature values of the analysis information of the short text and the feature value of each word in its word set.
In this step, based on the word set of each short text in the training set, for each word in the word set of a short text, the TF (Term Frequency) value of the word in that short text and the IDF (Inverse Document Frequency) value of the word in the training set are calculated, and the feature value of the word is computed with Formula 1 below:
log(TF + 1.0) × IDF    (Formula 1)
The calculated feature value of a word is usually a number between 0 and 100. A sketch of this computation is shown below.
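A sketch of Formula 1; the patent fixes the combination log(TF + 1.0) × IDF but not the exact IDF variant, so the logarithmic IDF used here is an assumption.

```python
import math

def inverse_document_frequency(texts_containing_word: int, total_texts: int) -> float:
    """An assumed IDF definition; any standard variant could be substituted."""
    return math.log(total_texts / (1.0 + texts_containing_word))

def word_feature_value(tf: int, idf: float) -> float:
    """Formula 1: log(TF + 1.0) x IDF."""
    return math.log(tf + 1.0) * idf
```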
In this step, for each short text in the training set, it is judged from the obtained analysis information whether the short text contains a contact-information feature; if so, the feature value of "whether a contact-information feature is contained" is set to 1 (or 0); otherwise it is set to 0 (or 1);
the proportion of the counted interference symbols among the characters of the short text is taken as the feature value of the interference-symbol proportion;
the proportion of the counted rare characters among the characters of the short text is taken as the feature value of the rare-character proportion;
the proportion of the counted traditional characters among the characters of the short text is taken as the feature value of the traditional-character proportion;
the obtained transition probability between words is taken as the feature value of the word transition probability;
the proportion of the counted nouns among the characters of the short text is taken as the feature value of the noun proportion;
the proportion of the counted verbs among the characters of the short text is taken as the feature value of the verb proportion;
the proportion of the counted punctuation marks among the characters of the short text is taken as the feature value of the punctuation-mark proportion;
the proportion of the counted single-character words among the characters of the short text is taken as the feature value of the single-character-word proportion;
the proportion of the counted two-character words among the characters of the short text is taken as the feature value of the two-character-word proportion;
the counted ratio of the number of nouns to the number of verbs in the short text is taken as the feature value of the noun-to-verb ratio;
the counted ratio of the number of punctuation marks to the number of nouns in the short text is taken as the feature value of the punctuation-to-noun ratio.
Considering that the feature values of all the above proportion information, quantity-ratio information and transition probabilities are usually numbers between 0 and 1, in order to simplify computation, as a preferred embodiment the feature values of the analysis information of each short text in the training set may also be normalized: for each short text in the training set, the calculated feature value of "whether a contact-information feature is contained" is multiplied by 100, giving a normalized value of 0 or 100;
the feature values of the counted proportion of interference symbols, proportion of rare characters, proportion of traditional characters, transition probability between words, proportion of nouns, proportion of verbs, proportion of punctuation marks, proportion of single-character words, proportion of two-character words, noun-to-verb ratio and punctuation-to-noun ratio are each multiplied by 100, giving normalized feature values between 0 and 100 for all proportions, quantity ratios and transition probabilities of the short text. A sketch of this normalization follows.
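A sketch of the normalization described above: every analysis-information feature value in the 0~1 range (including the binary contact flag) is simply multiplied by 100.

```python
def normalise_analysis_values(analysis_values: dict) -> dict:
    """Scale analysis-information feature values from 0~1 to 0~100 so they are
    on a scale comparable with the word feature values."""
    return {name: value * 100.0 for name, value in analysis_values.items()}
```

For example, an input of {"has_contact": 1.0, "noun_ratio": 0.25} would become {"has_contact": 100.0, "noun_ratio": 25.0} ("noun_ratio" being an illustrative name).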
S104: for each short text in the training set, compute the class discrimination degree of the feature value of every word in its word set and of the feature values of its analysis information, then select the feature elements of the feature element set.
Specifically, for each short text in the training set, the class discrimination degree of each word in the word set of the short text can be computed from its feature value with the AUC algorithm; the class discrimination degree of a word reflects how much the word contributes to discriminating short texts between the spam text class and the non-spam text class.
For each short text in the training set, the class discrimination degree of the analysis information is computed from its feature values or from its normalized feature values; the class discrimination degree of analysis information reflects how much that analysis information contributes to discriminating short texts between the spam text class and the non-spam text class. For analysis information whose feature value is a discrete value, the chi-square test can be used to compute the class discrimination degree; for analysis information whose feature value is a continuous value, the AUC (Area Under Curve) algorithm can be used.
For each short text in the training set, after the class discrimination degrees of the analysis information of the short text and of each word in its word set have been computed, the words and the analysis information whose class discrimination degree exceeds a set threshold are taken as the feature elements of the feature element set. The set threshold can be configured empirically by technical staff, and different thresholds can be set for discrete and continuous feature values: for example, the threshold can be set to 10 for discrete feature values and to 0.7 for continuous feature values. A sketch of this selection is shown below.
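A sketch of the feature-element selection in step S104, assuming scikit-learn's chi-square test and ROC-AUC score as stand-ins for the chi-square and AUC computations named above; the thresholds are the example values 10 and 0.7 from the text.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.metrics import roc_auc_score

def class_discrimination_degree(values, labels, discrete: bool) -> float:
    """values: one candidate feature's value per training text; labels: 1 = spam, 0 = non-spam."""
    if discrete:
        scores, _ = chi2(np.asarray(values, dtype=float).reshape(-1, 1), labels)
        return float(scores[0])
    return float(roc_auc_score(labels, values))

def select_feature_elements(candidates, labels, discrete_threshold=10.0, continuous_threshold=0.7):
    """candidates: {feature element name: (values over the training set, is_discrete)}."""
    selected = set()
    for name, (values, is_discrete) in candidates.items():
        threshold = discrete_threshold if is_discrete else continuous_threshold
        if class_discrimination_degree(values, labels, is_discrete) > threshold:
            selected.add(name)
    return selected
```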
S105: for each short text in the training set, generate the word feature vector of the short text.
In this step, for each short text in the training set, the word feature vector of the short text is generated by comparing its analysis information and each word in its word set against the feature elements in the feature element set, and building the vector from the feature values of the words and analysis information that match feature elements in the feature element set.
More preferably, for each short text in the training set, the word feature vector of the short text may also be generated by comparing its analysis information and each word in its word set against the feature elements in the feature element set and building the vector from the normalized feature values of the words and analysis information that match feature elements in the feature element set.
Specifically, for each short text in the training set, each dimension of the word feature vector of the short text corresponds to one feature element in the feature element set. Where a vector element corresponds to analysis information of the short text, the feature value of that analysis information (or its normalized value) is taken as the value of the vector element; where a vector element corresponds to a word in the word set of the short text, the feature value of that word (or its normalized value) is taken as the value of the vector element; all other vector elements are empty, or 0. A sketch of this construction follows.
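A sketch of building one short text's word feature vector over a fixed ordering of the feature element set; dimensions whose element does not occur in this text are left at 0.

```python
def build_feature_vector(feature_elements, word_values, analysis_values):
    """feature_elements: ordered list shared by every short text;
    word_values / analysis_values: feature (or normalized) values of this text."""
    vector = []
    for element in feature_elements:
        if element in analysis_values:
            vector.append(analysis_values[element])
        elif element in word_values:
            vector.append(word_values[element])
        else:
            vector.append(0.0)   # feature element not present in this short text
    return vector
```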
S106: build the classification model from the word feature vectors of the short texts in the training set.
In this step, an SVM classification algorithm, a Bayesian classification algorithm, a decision-tree classification algorithm or a maximum-entropy classification algorithm can be used to train the classification model on the word feature vectors of the short texts in the training set. Specifically, a suitable algorithm can be selected according to the number of short texts in the training set (i.e. the sample size), and the classification model is then trained on the word feature vectors of the short texts in the training set with that algorithm.
The specific method of training a classification model from the word feature vectors of the short texts in the training set is well known to those skilled in the art and is not repeated here; a minimal sketch of the SVM option is given below.
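A minimal sketch of step S106 with the SVM option, using scikit-learn's LinearSVC as one possible implementation; the Bayesian, decision-tree or maximum-entropy (logistic regression) alternatives could be swapped in the same way.

```python
from sklearn.svm import LinearSVC

def train_classification_model(training_vectors, training_labels):
    """training_vectors: word feature vectors of the training-set short texts;
    training_labels: 1 for spam texts, 0 for non-spam texts."""
    model = LinearSVC()
    return model.fit(training_vectors, training_labels)
```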
In fact, there is no strict order between steps S101 and S102 above; they can be executed in parallel, or step S102 can be executed before step S101.
After the classification model has been built in the training stage, spam identification can be performed on a short text to be judged in the recognition stage according to the built classification model. The flow chart of the spam identification method for short texts provided by the embodiment of the present invention is shown in Fig. 2; the specific steps are as follows:
S201: segment the short text to be judged to obtain the word set of the short text.
Specifically, the short text to be judged is segmented: its continuous character sequence is divided into individual words; among the words obtained, function words without practical meaning (such as punctuation, auxiliary words, modal particles, interjections and onomatopoeia) are excluded; the remaining words form the word set of the short text.
S202: perform spam feature analysis on the short text to be judged to obtain the analysis information of the short text.
Specifically, spam feature analysis is performed on the short text to be judged, giving its analysis information, which specifically includes any one, or any combination, of the following: whether the text contains a contact-information feature, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between adjacent words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns.
The analysis information of the short text to be judged may be extracted in a preprocessing step before the word set of the short text is obtained, or it may be obtained after the word set has been obtained.
S203: determine the feature elements of the short text to be judged.
Specifically, the analysis information of the short text to be judged and each word in its word set are compared against the feature elements in the feature element set described above; the words and analysis information that match feature elements in the feature element set are taken as the feature elements of the short text to be judged.
S204: generate the word feature vector of the short text to be judged from the feature values of its feature elements.
In this step, for each word of the short text to be judged that serves as a feature element, the feature value is calculated: the TF value of the word in this short text and the IDF value of the word in the training set are calculated, and the feature value of the word is computed with Formula 1 above.
In this step, for the analysis information of the short text to be judged that serves as feature elements, the feature values are calculated as follows:
if the analysis information serving as a feature element includes the information on whether a contact-information feature is contained, the feature value of that information is set to 1 (or 0) if a contact-information feature is present, and to 0 (or 1) otherwise;
if it includes the proportion of interference symbols, the proportion of the counted interference symbols among the characters of the short text is taken as the feature value of the interference-symbol proportion;
if it includes the proportion of rare characters, the proportion of the counted rare characters among the characters of the short text is taken as the feature value of the rare-character proportion;
if it includes the proportion of traditional characters, the proportion of the counted traditional characters among the characters of the short text is taken as the feature value of the traditional-character proportion;
if it includes the transition probability between words, the obtained transition probability between words is taken as the feature value of the word transition probability;
if it includes the proportion of nouns, the proportion of the counted nouns among the characters of the short text is taken as the feature value of the noun proportion;
if it includes the proportion of verbs, the proportion of the counted verbs among the characters of the short text is taken as the feature value of the verb proportion;
if it includes the proportion of punctuation marks, the proportion of the counted punctuation marks among the characters of the short text is taken as the feature value of the punctuation-mark proportion;
if it includes the proportion of single-character words, the proportion of the counted single-character words among the characters of the short text is taken as the feature value of the single-character-word proportion;
if it includes the proportion of two-character words, the proportion of the counted two-character words among the characters of the short text is taken as the feature value of the two-character-word proportion;
if it includes the noun-to-verb ratio, the counted ratio of the number of nouns to the number of verbs in the short text is taken as the feature value of the noun-to-verb ratio;
if it includes the punctuation-to-noun ratio, the counted ratio of the number of punctuation marks to the number of nouns in the short text is taken as the feature value of the punctuation-to-noun ratio.
Considering that the feature values of all proportion information, quantity-ratio information and transition probabilities of the short text to be judged are usually numbers between 0 and 1, in order to simplify computation, as a preferred embodiment the feature values of the analysis information of the short text to be judged may also be normalized: for the short text to be judged, the calculated feature value of "whether a contact-information feature is contained" is multiplied by 100, giving a normalized value of 0 or 100;
the feature values of the counted proportion of interference symbols, proportion of rare characters, proportion of traditional characters, transition probability between words, proportion of nouns, proportion of verbs, proportion of punctuation marks, proportion of single-character words, proportion of two-character words, noun-to-verb ratio and punctuation-to-noun ratio are each multiplied by 100, giving normalized feature values between 0 and 100 for all proportions, quantity ratios and transition probabilities of the short text to be judged.
In this step, the word feature vector of the short text to be judged is generated from the feature values of its feature elements, or from their normalized values. Each dimension of the word feature vector of the short text to be judged corresponds to one feature element in the feature element set: where a vector element corresponds to analysis information of the short text to be judged, the feature value of that analysis information (or its normalized value) is taken as the value of the vector element; where a vector element corresponds to a word in the word set of the short text to be judged, the feature value of that word (or its normalized value) is taken as the value of the vector element; all other vector elements are empty, or 0.
S205: determine whether the short text to be judged is a spam text according to its word feature vector and the classification model.
How to determine whether the short text to be judged is a spam text from its word feature vector and the classification model is well known to those skilled in the art and is not repeated here; a minimal sketch is shown after this paragraph.
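A minimal sketch of step S205, assuming a scikit-learn style model such as the one trained in the sketch above.

```python
def is_spam(model, feature_vector) -> bool:
    """Step S205: classify one short text to be judged from its word feature vector."""
    return int(model.predict([feature_vector])[0]) == 1
```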
In fact, there is no strict order between steps S201 and S202 above; they can be executed in parallel, or step S202 can be executed before step S201.
Based on the above modeling method, an embodiment of the present invention provides a modeling device, whose internal structure block diagram is shown in Fig. 3. It specifically includes: a feature extraction module 301, a feature vector determination module 302, a classification model building module 303 and a feature element set determination module 304.
The feature extraction module 301 is configured, for each short text in the training set that has been labeled a spam text or a non-spam text, to segment the short text to obtain its word set, and to perform spam feature analysis on the short text to obtain its analysis information; the analysis information of a short text may include any one, or any combination, of the following:
whether the text contains a contact-information feature, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional characters, the transition probability between adjacent words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of single-character words, the proportion of two-character words, the collocation ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns.
The feature element set determination module 304 is configured, for each short text in the training set, to obtain the word set and analysis information of the short text from the feature extraction module 301, to calculate the feature value of every word in the word set of the short text and the feature values of its analysis information, then to compute the class discrimination degree of the calculated feature values, and to take the words and analysis information whose class discrimination degree exceeds a set threshold as the feature elements of the feature element set.
Specifically, the feature element set determination module 304 can, for each short text in the training set, compare the analysis information of the short text and each word in its word set against the feature elements in the feature element set, and generate the word feature vector of the short text from the normalized feature values of the words and analysis information that match feature elements in the feature element set.
The feature vector determination module 302 is configured, for each short text in the training set, to compare the analysis information and each word in the word set of the short text against the feature elements in the feature element set obtained by the feature element set determination module 304, and to generate the word feature vector of the short text from the feature values of the words and analysis information that match feature elements in the feature element set.
The classification model building module 303 is configured to build a classification model from the word feature vectors of the short texts in the training set determined by the feature vector determination module 302. Specifically, the classification model building module 303 trains the classification model on the word feature vectors of the short texts in the training set with an SVM classification algorithm, a Bayesian classification algorithm, a decision-tree classification algorithm or a maximum-entropy classification algorithm.
Based on the above spam identification method for short texts, an embodiment of the present invention provides a spam identification device for short texts, whose internal structure block diagram is shown in Fig. 3. It specifically includes: a feature extraction module 401, a feature vector determination module 402 and a spam identification module 403.
The feature extraction module 401 is configured to segment the short text to be judged to obtain a word set, and to perform spam feature analysis on the short text to be judged to obtain analysis information. The specific content of the analysis information of a short text has been introduced above and is not repeated here.
The feature vector determination module 402 is configured to obtain the analysis information and word set of the short text to be judged from the feature extraction module 401, to compare the analysis information and each word in the word set of the short text to be judged against the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be judged from the feature values of the words and analysis information that match feature elements in the feature element set.
Specifically, the feature vector determination module 402 may compare the analysis information and each word in the word set of the short text to be judged against the feature elements in the predetermined feature element set, and generate the word feature vector of the short text to be judged from the normalized feature values of the words and analysis information that match feature elements in the feature element set.
The spam identification module 403 is configured, after the feature vector determination module 402 obtains the word feature vector of the short text to be judged, to determine whether the short text to be judged is a spam text according to that word feature vector and a classification model trained in advance; a sketch of how these modules could fit together is shown after this paragraph.
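A sketch of how the three modules of the recognition device could be wired together, reusing the helper functions defined in the earlier sketches; the term-frequency approximation via str.count and the stored IDF table are assumptions made for illustration.

```python
class ShortTextSpamRecogniser:
    """Feature extraction (401), feature vector determination (402) and
    spam identification (403) composed into one recognition device."""

    def __init__(self, feature_elements, idf_table, model):
        self.feature_elements = list(feature_elements)   # feature element set fixed at modeling time
        self.idf_table = idf_table                       # word -> IDF value over the training set
        self.model = model                               # classification model trained in advance

    def recognise(self, short_text: str) -> bool:
        words = build_word_set(short_text)                                    # module 401
        analysis = normalise_analysis_values(analyse_spam_features(short_text))
        word_values = {
            w: word_feature_value(short_text.count(w), self.idf_table.get(w, 0.0))
            for w in words
        }
        vector = build_feature_vector(self.feature_elements, word_values, analysis)  # module 402
        return is_spam(self.model, vector)                                    # module 403
```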
In the technical solution of the present invention, the word feature vector of each short text in the training set and the word feature vector of the short text to be judged all contain the additional feature values of the analysis information; performing spam identification on the short text to be judged with a word feature vector expanded in this way improves both the recognition rate and the recognition accuracy of spam text identification.
The above are only preferred embodiments of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (12)

1. A short text garbage recognition method, characterised in that it comprises:
performing word segmentation on a short text to be determined to obtain a word set, and performing garbage feature analysis on the short text to be determined to obtain analysis information;
comparing the analysis information of the short text to be determined and each word in the word set respectively with feature elements in a predetermined feature element set, and generating a word feature vector of the short text to be determined according to feature values of the words or analysis information that match feature elements in the feature element set;
determining whether the short text to be determined is a garbage text according to the word feature vector of the short text to be determined and a pre-trained classification model; wherein,
the analysis information comprises any one of the following items of information, or any combination of the following items of information:
information on whether a contact method feature is contained, proportion information of interference symbols, proportion information of rarely used characters, proportion information of traditional characters, transition probabilities between words, transition probabilities between the parts of speech of preceding and following words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of unigram words, proportion information of bigram words, collocation ratios of words of different parts of speech, and quantity ratio information of punctuation marks to nouns; and
the feature values of the analysis information specifically comprise:
for the information on whether a contact method feature is contained, its feature value is a binary 0 or 1;
for the proportion information of interference symbols, or the proportion information of rarely used characters, or the proportion information of traditional characters, or the transition probabilities between words, or the transition probabilities between the parts of speech of preceding and following words, or the proportion information of nouns, or the proportion information of verbs, or the proportion information of punctuation marks, or the proportion information of unigram words, or the proportion information of bigram words, or the collocation ratios of words of different parts of speech, or the quantity ratio information of punctuation marks to nouns, its feature value is a numerical value between 0 and 1.
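By way of illustration of the feature-value conventions in claim 1, the sketch below turns a few of the listed analysis items into values in the required ranges; the character sets and the contact-method heuristic are small illustrative assumptions, not the patent's actual lists or rules.

```python
# Assumed character sets used only for this example.
INTERFERENCE = set("*#@&^~｜·")           # stand-in "interference symbol" set
RARE = set("囍靁龘")                       # stand-in rarely-used-character sample
TRADITIONAL = set("發錢貸賺優惠")          # stand-in traditional-character sample

def analysis_feature_values(text):
    n = max(len(text), 1)
    has_contact = any(ch.isdigit() for ch in text)     # crude contact-method cue
    return {
        "contact_info":       1.0 if has_contact else 0.0,                # binary 0 or 1
        "interference_ratio": sum(c in INTERFERENCE for c in text) / n,   # value between 0 and 1
        "rare_char_ratio":    sum(c in RARE for c in text) / n,           # value between 0 and 1
        "traditional_ratio":  sum(c in TRADITIONAL for c in text) / n,    # value between 0 and 1
    }

print(analysis_feature_values("優惠貸款*加Q 12345678"))
```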
2. The method of claim 1, characterised in that before said generating the word feature vector of the short text to be determined, the method further comprises:
normalizing the feature values of the analysis information that matches feature elements in the feature element set:
normalizing the feature value of the information on whether a contact method feature is contained to a binary 0 or 100;
multiplying by 100 the feature values of the proportion information of interference symbols, or the proportion information of rarely used characters, or the proportion information of traditional characters, or the transition probabilities between words, or the transition probabilities between the parts of speech of preceding and following words, or the proportion information of nouns, or the proportion information of verbs, or the proportion information of punctuation marks, or the proportion information of unigram words, or the proportion information of bigram words, or the collocation ratios of words of different parts of speech, or the quantity ratio information of punctuation marks to nouns, to obtain normalized numerical values between 0 and 100.
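A one-function sketch of this normalization step, assuming the analysis feature values are held in a dictionary keyed by item name as in the example after claim 1:

```python
def normalize(feature_values):
    """Map the binary contact-method value to 0/100 and scale every ratio item by 100."""
    normalized = {}
    for name, value in feature_values.items():
        if name == "contact_info":
            normalized[name] = 100.0 if value else 0.0   # binary 0 or 100
        else:
            normalized[name] = value * 100.0             # 0~1 ratio -> 0~100
    return normalized
```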
3. The method of claim 1 or 2, characterised in that the feature value of a word is obtained as follows:
calculating the TF and IDF values of the word, and calculating the feature value of the word according to Formula 1 below:
log(TF + 1.0) × IDF (Formula 1).
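The following sketch computes Formula 1 for a word within one short text; the IDF table is an assumption and would in practice be estimated from the training corpus.

```python
import math

def word_feature_value(word, words_in_text, idf):
    """Formula 1: log(TF + 1.0) * IDF, with TF counted inside this short text."""
    tf = words_in_text.count(word)
    return math.log(tf + 1.0) * idf.get(word, 0.0)

idf = {"微信": 3.2, "转账": 2.7}                                        # assumed IDF values
print(word_feature_value("微信", ["加", "微信", "微信", "转账"], idf))   # log(3.0) * 3.2
```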
4. The method of claim 1 or 2, characterised in that the training method of the classification model, and the determination method of the feature element set, comprise:
for each short text in a training set that has been labelled as a garbage text or a non-garbage text, performing word segmentation to obtain the word set of the short text, and performing garbage feature analysis on the short text to obtain the analysis information of the short text;
for each short text in the training set, calculating the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then computing a class discrimination degree for the calculated feature values; taking the words and analysis information whose class discrimination degree exceeds a set threshold as the feature elements in the feature element set;
for each short text in the training set, comparing the analysis information and each word in the word set of the short text respectively with the feature elements in the feature element set, and generating the word feature vector of the short text according to the feature values of the words or analysis information that match feature elements in the feature element set;
training the classification model according to the word feature vectors of the short texts in the training set.
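The claim leaves the exact class discrimination measure open; as an illustration, the sketch below selects feature elements using the absolute difference between the mean feature values of the garbage class and the non-garbage class, which is only one possible stand-in for the discrimination degree, and the threshold value is likewise an assumption.

```python
from collections import defaultdict

def select_feature_elements(samples, threshold=0.1):
    """samples: list of (feature_value_dict, is_garbage) pairs from the training set."""
    stats = defaultdict(lambda: [0.0, 0, 0.0, 0])   # name -> [garbage_sum, garbage_n, other_sum, other_n]
    for values, is_garbage in samples:
        for name, v in values.items():
            s = stats[name]
            if is_garbage:
                s[0] += v; s[1] += 1
            else:
                s[2] += v; s[3] += 1
    elements = []
    for name, (gs, gn, osum, on) in stats.items():
        garbage_mean = gs / gn if gn else 0.0
        other_mean = osum / on if on else 0.0
        if abs(garbage_mean - other_mean) > threshold:   # stand-in class discrimination degree
            elements.append(name)
    return elements
```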
5. The method of claim 4, characterised in that said training the classification model according to the word feature vectors of the short texts in the training set is specifically:
training the classification model according to the word feature vectors of the short texts in the training set by using an SVM classification algorithm, or a Bayesian classification algorithm, or a decision tree classification algorithm, or a maximum entropy classification algorithm.
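A minimal sketch of the SVM option, using scikit-learn's LinearSVC as a stand-in for the SVM classification algorithm; the tiny feature matrix and labels are made-up examples, and any of the other listed algorithms could be substituted in the same way.

```python
from sklearn.svm import LinearSVC

# X: word feature vectors of the training-set short texts; y: 1 = garbage text, 0 = not.
X = [[1.0, 0.12, 0.9, 0.0],
     [0.0, 0.02, 0.0, 0.0],
     [1.0, 0.20, 0.0, 0.7],
     [0.0, 0.00, 0.0, 0.0]]
y = [1, 0, 1, 0]

model = LinearSVC()
model.fit(X, y)
print(model.predict([[1.0, 0.15, 0.9, 0.7]]))   # label predicted for a new word feature vector
```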
6. A modeling method, characterised in that it comprises:
for each short text in a training set that has been labelled as a garbage text or a non-garbage text, performing word segmentation to obtain the word set of the short text, and performing garbage feature analysis on the short text to obtain the analysis information of the short text;
for each short text in the training set, calculating the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then computing a class discrimination degree for the calculated feature values; taking the words and analysis information whose class discrimination degree exceeds a set threshold as the feature elements in a feature element set;
for each short text in the training set, comparing the analysis information and each word in the word set of the short text respectively with the feature elements in the feature element set, and generating the word feature vector of the short text according to the feature values of the words or analysis information that match feature elements in the feature element set;
training a classification model according to the word feature vectors of the short texts in the training set; wherein,
the analysis information comprises any one of the following items of information, or any combination of the following items of information:
information on whether a contact method feature is contained, proportion information of interference symbols, proportion information of rarely used characters, proportion information of traditional characters, transition probabilities between words, transition probabilities between the parts of speech of preceding and following words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of unigram words, proportion information of bigram words, collocation ratios of words of different parts of speech, and quantity ratio information of punctuation marks to nouns; and
the feature values of the analysis information specifically comprise:
for the information on whether a contact method feature is contained, its feature value is a binary 0 or 1;
for the proportion information of interference symbols, or the proportion information of rarely used characters, or the proportion information of traditional characters, or the transition probabilities between words, or the transition probabilities between the parts of speech of preceding and following words, or the proportion information of nouns, or the proportion information of verbs, or the proportion information of punctuation marks, or the proportion information of unigram words, or the proportion information of bigram words, or the collocation ratios of words of different parts of speech, or the quantity ratio information of punctuation marks to nouns, its feature value is a numerical value between 0 and 1.
7. The method of claim 6, characterised in that after said calculating the feature values of the analysis information of the short text, and before said generating the word feature vector of the short text according to the feature values of the words or analysis information that match feature elements in the feature element set, the method further comprises:
normalizing the feature values of the analysis information of the short text:
normalizing the feature value of the information on whether a contact method feature is contained to a binary 0 or 100;
multiplying by 100 the feature values of the proportion information of interference symbols, or the proportion information of rarely used characters, or the proportion information of traditional characters, or the transition probabilities between words, or the transition probabilities between the parts of speech of preceding and following words, or the proportion information of nouns, or the proportion information of verbs, or the proportion information of punctuation marks, or the proportion information of unigram words, or the proportion information of bigram words, or the collocation ratios of words of different parts of speech, or the quantity ratio information of punctuation marks to nouns, to obtain normalized numerical values between 0 and 100; and
said generating the word feature vector of the short text according to the feature values of the words or analysis information that match feature elements in the feature element set is specifically:
generating the word feature vector of the short text according to the normalized feature values of the words or analysis information that match feature elements in the feature element set.
8. The method of claim 6 or 7, characterised in that said training the classification model according to the word feature vectors of the short texts in the training set is specifically:
training the classification model according to the word feature vectors of the short texts in the training set by using an SVM classification algorithm, or a Bayesian classification algorithm, or a decision tree classification algorithm, or a maximum entropy classification algorithm.
9. A model building device, characterised in that it comprises:
a feature extraction module, configured to, for each short text in a training set that has been labelled as a garbage text or a non-garbage text, perform word segmentation to obtain the word set of the short text, and perform garbage feature analysis on the short text to obtain the analysis information of the short text;
a feature element set determination module, configured to, for each short text in the training set, calculate the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, then compute a class discrimination degree for the calculated feature values, and take the words and analysis information whose class discrimination degree exceeds a set threshold as the feature elements in a feature element set;
a feature vector determination module, configured to, for each short text in the training set, compare the analysis information and each word in the word set of the short text respectively with the feature elements in the feature element set, and generate the word feature vector of the short text according to the feature values of the words or analysis information that match feature elements in the feature element set;
a classification model building module, configured to build a classification model according to the word feature vectors of the short texts in the training set determined by the feature vector determination module;
wherein the analysis information comprises any one of the following items of information, or any combination of the following items of information:
information on whether a contact method feature is contained, proportion information of interference symbols, proportion information of rarely used characters, proportion information of traditional characters, transition probabilities between words, transition probabilities between the parts of speech of preceding and following words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of unigram words, proportion information of bigram words, collocation ratios of words of different parts of speech, and quantity ratio information of punctuation marks to nouns; and
the feature values of the analysis information specifically comprise:
for the information on whether a contact method feature is contained, its feature value is a binary 0 or 1;
for the proportion information of interference symbols, or the proportion information of rarely used characters, or the proportion information of traditional characters, or the transition probabilities between words, or the transition probabilities between the parts of speech of preceding and following words, or the proportion information of nouns, or the proportion information of verbs, or the proportion information of punctuation marks, or the proportion information of unigram words, or the proportion information of bigram words, or the collocation ratios of words of different parts of speech, or the quantity ratio information of punctuation marks to nouns, its feature value is a numerical value between 0 and 1.
10. The device of claim 9, characterised in that
the feature vector determination module is specifically configured to, for each short text in the training set, compare the analysis information and each word in the word set of the short text respectively with the feature elements in the feature element set, and generate the word feature vector of the short text according to the normalized feature values of the words or analysis information that match feature elements in the feature element set.
11. A short text garbage identification device, characterised in that it comprises:
a feature extraction module, configured to perform word segmentation on a short text to be determined to obtain a word set, and to perform garbage feature analysis on the short text to be determined to obtain analysis information;
a feature vector determination module, configured to compare the analysis information of the short text to be determined and each word in the word set respectively with feature elements in a predetermined feature element set, and to generate a word feature vector of the short text to be determined according to feature values of the words or analysis information that match feature elements in the feature element set;
a garbage identification module, configured to, after obtaining the word feature vector of the short text to be determined from the feature vector determination module, determine whether the short text to be determined is a garbage text according to the word feature vector of the short text to be determined and a pre-trained classification model;
wherein the analysis information comprises any one of the following items of information, or any combination of the following items of information:
information on whether a contact method feature is contained, proportion information of interference symbols, proportion information of rarely used characters, proportion information of traditional characters, transition probabilities between words, transition probabilities between the parts of speech of preceding and following words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of unigram words, proportion information of bigram words, collocation ratios of words of different parts of speech, and quantity ratio information of punctuation marks to nouns; and
the feature values of the analysis information specifically comprise:
for the information on whether a contact method feature is contained, its feature value is a binary 0 or 1;
for the proportion information of interference symbols, or the proportion information of rarely used characters, or the proportion information of traditional characters, or the transition probabilities between words, or the transition probabilities between the parts of speech of preceding and following words, or the proportion information of nouns, or the proportion information of verbs, or the proportion information of punctuation marks, or the proportion information of unigram words, or the proportion information of bigram words, or the collocation ratios of words of different parts of speech, or the quantity ratio information of punctuation marks to nouns, its feature value is a numerical value between 0 and 1.
12. The device of claim 11, characterised in that
the feature vector determination module is specifically configured to compare the analysis information of the short text to be determined and each word in the word set respectively with the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be determined according to the normalized feature values of the words or analysis information that match feature elements in the feature element set.
CN201310278012.6A 2013-07-04 2013-07-04 Short text garbage identification and modeling method and device Active CN103336766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310278012.6A CN103336766B (en) 2013-07-04 2013-07-04 Short text garbage identification and modeling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310278012.6A CN103336766B (en) 2013-07-04 2013-07-04 Short text garbage identification and modeling method and device

Publications (2)

Publication Number Publication Date
CN103336766A CN103336766A (en) 2013-10-02
CN103336766B true CN103336766B (en) 2016-12-28

Family

ID=49244935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310278012.6A Active CN103336766B (en) 2013-07-04 2013-07-04 Short text garbage identification and modeling method and device

Country Status (1)

Country Link
CN (1) CN103336766B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615585B (en) 2014-01-06 2017-07-21 腾讯科技(深圳)有限公司 Handle the method and device of text message
CN104199811B (en) * 2014-09-10 2017-06-16 上海携程商务有限公司 Short sentence analytic modell analytical model method for building up and system
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN105808602B (en) * 2014-12-31 2020-04-21 中国移动通信集团公司 Method and device for detecting junk information
CN104722554B (en) * 2015-02-04 2017-07-07 无锡荣博能源环保科技有限公司 Refuse classification equipment, method and application based on chemical element characteristic
CN104809236B (en) * 2015-05-11 2018-03-27 苏州大学 A kind of age of user sorting technique and system based on microblogging
CN105117384A (en) * 2015-08-19 2015-12-02 小米科技有限责任公司 Classifier training method, and type identification method and apparatus
CN105045924A (en) * 2015-08-26 2015-11-11 苏州大学张家港工业技术研究院 Question classification method and system
CN106649255A (en) * 2015-11-04 2017-05-10 江苏引跑网络科技有限公司 Method for automatically classifying and identifying subject terms of short texts
CN105589941A (en) * 2015-12-15 2016-05-18 北京百分点信息科技有限公司 Emotional information detection method and apparatus for web text
CN107180022A (en) * 2016-03-09 2017-09-19 阿里巴巴集团控股有限公司 object classification method and device
CN105956472B (en) * 2016-05-12 2019-10-18 宝利九章(北京)数据技术有限公司 Identify webpage in whether include hostile content method and system
CN106446032A (en) * 2016-08-30 2017-02-22 江苏博智软件科技有限公司 Junk information processing method and apparatus
CN106708961B (en) * 2016-11-30 2020-11-06 北京粉笔蓝天科技有限公司 Method for establishing junk text library, method for filtering junk text library and system
CN107562728A (en) * 2017-09-12 2018-01-09 电子科技大学 Social media short text filter method based on structure and text message
CN109726727A (en) * 2017-10-27 2019-05-07 中移(杭州)信息技术有限公司 A kind of data detection method and system
CN108304442B (en) * 2017-11-20 2021-08-31 腾讯科技(深圳)有限公司 Text information processing method and device and storage medium
CN107943941B (en) * 2017-11-23 2021-10-15 珠海金山网络游戏科技有限公司 Junk text recognition method and system capable of being updated iteratively
CN110019681B (en) * 2017-12-19 2022-05-17 阿里巴巴(中国)有限公司 Comment content filtering method and system
CN108647309B (en) * 2018-05-09 2021-08-10 达而观信息科技(上海)有限公司 Chat content auditing method and system based on sensitive words
CN108847238B (en) * 2018-08-06 2022-09-16 东北大学 Service robot voice recognition method
CN110298041B (en) * 2019-06-24 2023-09-05 北京奇艺世纪科技有限公司 Junk text filtering method and device, electronic equipment and storage medium
CN110442714B (en) * 2019-07-25 2022-05-27 北京百度网讯科技有限公司 POI name normative evaluation method, device, equipment and storage medium
CN111651598A (en) * 2020-05-28 2020-09-11 上海勃池信息技术有限公司 Spam text auditing device and method through center vector similarity matching

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101784022A (en) * 2009-01-16 2010-07-21 北京炎黄新星网络科技有限公司 Method and system for filtering and classifying short messages
CN101996241A (en) * 2010-10-22 2011-03-30 东南大学 Bayesian algorithm-based content filtering method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101784022A (en) * 2009-01-16 2010-07-21 北京炎黄新星网络科技有限公司 Method and system for filtering and classifying short messages
CN101996241A (en) * 2010-10-22 2011-03-30 东南大学 Bayesian algorithm-based content filtering method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on a Content-based Intelligent Short Message Analysis System; Hou Xudong; China Master's Theses Full-text Database, Information Science and Technology; 2011-05-15 (No. 05); I136-260 *
Research on Spam Short Message Filtering Methods Based on Support Vector Machines; Gong Lei; China Master's Theses Full-text Database, Information Science and Technology; 2011-09-15 (No. 09); I136-964 *

Also Published As

Publication number Publication date
CN103336766A (en) 2013-10-02

Similar Documents

Publication Publication Date Title
CN103336766B (en) Short text garbage identification and modeling method and device
CN102054015B (en) System and method of organizing community intelligent information by using organic matter data model
CN102227724B (en) Machine learning for transliteration
CN103324745B (en) Text garbage recognition methods and system based on Bayesian model
CN103729474B (en) Method and system for recognizing forum user vest account
CN105787025B (en) Network platform public account classification method and device
CN109145216A (en) Network public-opinion monitoring method, device and storage medium
CN107102993B (en) User appeal analysis method and device
CN110263248A (en) A kind of information-pushing method, device, storage medium and server
CN102096703A (en) Filtering method and equipment of short messages
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN103593431A (en) Internet public opinion analyzing method and device
CN109978020A (en) A kind of social networks account vest identity identification method based on multidimensional characteristic
CN113033198B (en) Similar text pushing method and device, electronic equipment and computer storage medium
CN109582788A (en) Comment spam training, recognition methods, device, equipment and readable storage medium storing program for executing
CN110096681A (en) Contract terms analysis method, device, equipment and readable storage medium storing program for executing
CN101833579A (en) Method and system for automatically detecting academic misconduct literature
CN110688540B (en) Cheating account screening method, device, equipment and medium
CN114997288A (en) Design resource association method
CN112632982A (en) Dialogue text emotion analysis method capable of being used for supplier evaluation
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network
CN107688594B (en) The identifying system and method for risk case based on social information
CN112579781A (en) Text classification method and device, electronic equipment and medium
JP5477910B2 (en) Text search program, device, server and method using search keyword dictionary and dependency keyword dictionary

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant