CN103336766A - Short text garbage identification and modeling method and device - Google Patents

Info

Publication number
CN103336766A
Authority
CN
China
Prior art keywords
word
information
short text
proportion information
feature value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013102780126A
Other languages
Chinese (zh)
Other versions
CN103336766B (en)
Inventor
姜贵彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weimeng Chuangke Network Technology China Co Ltd
Original Assignee
Weimeng Chuangke Network Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weimeng Chuangke Network Technology China Co Ltd filed Critical Weimeng Chuangke Network Technology China Co Ltd
Priority to CN201310278012.6A priority Critical patent/CN103336766B/en
Publication of CN103336766A publication Critical patent/CN103336766A/en
Application granted granted Critical
Publication of CN103336766B publication Critical patent/CN103336766B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a short text garbage (spam) identification and modeling method and device. The identification method comprises: performing word segmentation on a short text to be determined to obtain a word set, and performing garbage feature analysis on the short text to obtain analysis information; comparing the analysis information and each word in the word set with the feature elements in a predetermined feature element set, and generating a word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the set; and determining, from the word feature vector and a classification model, whether the short text is garbage text. The classification model is trained in advance, using a classification algorithm chosen with regard to the number of samples in the training set. Because the word feature vector used for garbage identification is extended with the feature values of the analysis information, the accuracy of identifying garbage text is improved.

Description

Short text spam identification and modeling method and device
Technical field
The present invention relates to the Internet field, and in particular to a method and device for short text spam (garbage) identification and modeling.
Background art
Internet technology is developing rapidly and the volume of online information is growing explosively. As the pace of life and work quickens, people increasingly prefer to communicate in brief text. SNS (Social Network Service) websites such as Twitter and Sina Weibo, which produce, organize, and spread information in short texts of few characters, have won the favor of Internet users.
At present, the main approach to automatically identifying spam in short text content on the Internet is based on a classification model: a given short text is classified as either spam text or non-spam text. The approach comprises a training stage and a classification stage.
In the training stage, a model is built from the large number of short texts in a training set. Each short text in the training set, already labeled as spam text or non-spam text, is segmented to obtain its word set; a word feature vector is computed for each short text from its word set; and a classification model is trained from the word feature vectors of the short texts in the training set. For example, the classification model may be trained with an SVM (Support Vector Machine) classification algorithm, a Bayes algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm.
In the classification stage, a short text to be determined is segmented to obtain its word set, and its word feature vector is computed from that word set. Whether the short text is spam text is then judged from its word feature vector and the previously trained classification model. Several algorithms exist for making this judgment from a word feature vector and a classification model; they are well known to those skilled in the art and are not described further here.
In practice, however, the inventor found that, because of the social nature of SNS websites, short texts on them are usually brief. Few words can be extracted from such brief content, so the effective feature values in the resulting word feature vector are very sparse; sometimes the word feature vector of a short text contains only one or two effective feature values. Deciding whether a text belongs to the spam set or the non-spam set from so few feature values is much less accurate. In other words, the recognition accuracy of prior-art spam identification methods for short text content is not high.
Summary of the invention
In view of the above defects of the prior art, the invention provides a short text spam identification and modeling method and device, in order to improve the accuracy of spam identification on short text content.
According to one aspect of the invention, a short text spam identification method is provided, comprising:
performing word segmentation on a short text to be determined to obtain a word set, and performing spam feature analysis on the short text to be determined to obtain analysis information;
comparing the analysis information of the short text to be determined and each word in the word set with the feature elements in a predetermined feature element set, and generating a word feature vector of the short text to be determined from the feature values of the words or analysis information that match feature elements in the feature element set; and
determining whether the short text to be determined is spam text according to its word feature vector and a classification model trained in advance.
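The three steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `info:` prefix used to tell analysis-information elements apart from words, and the idea of passing the classifier as a plain callable are all assumptions introduced here.

```python
from typing import Callable, Dict, List


def build_feature_vector(
    words: List[str],
    analysis_values: Dict[str, float],
    feature_elements: List[str],
    word_value: Callable[[str, int], float],
) -> List[float]:
    """Project one short text onto a fixed, ordered feature element set.

    Elements prefixed with "info:" (a hypothetical convention) are looked up
    in the analysis information; all other elements are treated as words.
    Elements that do not match anything in the text contribute 0.0.
    """
    term_counts: Dict[str, int] = {}
    for word in words:
        term_counts[word] = term_counts.get(word, 0) + 1

    vector: List[float] = []
    for element in feature_elements:
        if element.startswith("info:"):
            vector.append(analysis_values.get(element, 0.0))
        elif element in term_counts:
            vector.append(word_value(element, term_counts[element]))
        else:
            vector.append(0.0)
    return vector


def is_spam(vector: List[float], classifier: Callable[[List[float]], bool]) -> bool:
    """Apply a pre-trained classifier (any callable) to the feature vector."""
    return bool(classifier(vector))
```

Keeping the feature element order fixed means that training vectors and vectors for texts to be determined line up position by position, which is what a classification model needs.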
Preferably, the analysis information comprises any one of the following items, or any combination of them:
information on whether a contact information feature is included, the proportion of interfering symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the part-of-speech transition probability between adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns.
Preferably, the feature values of the analysis information are as follows:
for the information on whether a contact information feature is included, the feature value is the binary value 0 or 1;
for each of the proportion of interfering symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the part-of-speech transition probability between adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns, the feature value is a number between 0 and 1.
Further, before the word feature vector of the short text to be determined is generated, the method also comprises:
normalizing the feature values of the analysis information that matches feature elements in the feature element set:
normalizing the feature value of the information on whether a contact information feature is included to the binary value 0 or 100;
multiplying by 100 the feature value of each of the proportion of interfering symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the part-of-speech transition probability between adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns, to obtain a normalized value between 0 and 100.
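The normalization just described amounts to rescaling every analysis-information feature value from the 0 to 1 range to the 0 to 100 range. A minimal sketch, assuming the feature values are held in a dictionary and that the key name `has_contact` (hypothetical) marks the binary contact feature:

```python
from typing import Dict, Iterable


def normalize_analysis_values(
    values: Dict[str, float],
    binary_keys: Iterable[str] = ("has_contact",),
) -> Dict[str, float]:
    """Scale analysis-information feature values to the range 0 to 100.

    Binary features map to exactly 0 or 100; ratio and probability features,
    expected to lie in [0, 1], are multiplied by 100.
    """
    binary = set(binary_keys)
    normalized: Dict[str, float] = {}
    for key, value in values.items():
        if key in binary:
            normalized[key] = 100.0 if value else 0.0
        else:
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{key} is outside [0, 1]: {value}")
            normalized[key] = value * 100.0
    return normalized
```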
Preferably, the feature value of a word is obtained as follows:
computing the TF and IDF values of the word, and computing the feature value of the word according to Formula 1:
log(TF + 1.0) × IDF (Formula 1)
Preferably, the method of training the classification model and of determining the feature element set comprises:
for each short text in a training set, already labeled as spam text or non-spam text, performing word segmentation to obtain the word set of the short text, and performing spam feature analysis to obtain the analysis information of the short text;
for each short text in the training set, computing the feature value of each word in its word set and the feature values of its analysis information, and then computing a class discrimination degree from the computed feature values; taking the words whose class discrimination degree exceeds a set threshold, together with the analysis information, as the feature elements of the feature element set;
for each short text in the training set, comparing the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set; and
training the classification model from the word feature vectors of the short texts in the training set.
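The text does not say how the class discrimination degree is computed. Purely as an illustration, the sketch below assumes one simple measure, the absolute difference between a feature's mean value over spam texts and its mean value over non-spam texts, and keeps the features whose score exceeds the threshold. The function name and the scoring rule are assumptions; other measures (for example chi-square or information gain) would fit the same selection step.

```python
from typing import Dict, List


def select_feature_elements(
    samples: List[Dict[str, float]],
    labels: List[bool],
    threshold: float,
) -> List[str]:
    """Keep features whose class discrimination score exceeds the threshold.

    samples[i] maps feature names to feature values for text i;
    labels[i] is True for spam text and False for non-spam text.
    Score (an assumption): |mean value in spam - mean value in non-spam|.
    """
    spam_n = sum(1 for is_spam in labels if is_spam)
    ham_n = len(labels) - spam_n
    if spam_n == 0 or ham_n == 0:
        raise ValueError("both classes must be present in the training set")

    spam_sum: Dict[str, float] = {}
    ham_sum: Dict[str, float] = {}
    for features, is_spam in zip(samples, labels):
        target = spam_sum if is_spam else ham_sum
        for name, value in features.items():
            target[name] = target.get(name, 0.0) + value

    selected: List[str] = []
    for name in sorted(set(spam_sum) | set(ham_sum)):
        spam_mean = spam_sum.get(name, 0.0) / spam_n
        ham_mean = ham_sum.get(name, 0.0) / ham_n
        if abs(spam_mean - ham_mean) > threshold:
            selected.append(name)
    return selected
```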
Preferably, training the classification model from the word feature vectors of the short texts in the training set specifically comprises:
training the classification model from the word feature vectors of the short texts in the training set using an SVM classification algorithm, a Bayes algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm.
According to another aspect of the invention, a modeling method is also provided, comprising:
for each short text in a training set, already labeled as spam text or non-spam text, performing word segmentation to obtain the word set of the short text, and performing spam feature analysis to obtain the analysis information of the short text;
for each short text in the training set, computing the feature value of each word in its word set and the feature values of its analysis information, and then computing a class discrimination degree from the computed feature values; taking the words whose class discrimination degree exceeds a set threshold, together with the analysis information, as the feature elements of a feature element set;
for each short text in the training set, comparing the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set; and
training a classification model from the word feature vectors of the short texts in the training set.
Preferably, the analysis information comprises any one of the following items, or any combination of them:
information on whether a contact information feature is included, the proportion of interfering symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the part-of-speech transition probability between adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns.
Preferably, the feature values of the analysis information are as follows:
for the information on whether a contact information feature is included, the feature value is the binary value 0 or 1;
for each of the proportion of interfering symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the part-of-speech transition probability between adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns, the feature value is a number between 0 and 1.
Preferably, after the feature values of the analysis information of the short text are computed, and before the word feature vector of the short text is generated from the feature values of the words or analysis information that match feature elements in the feature element set, the method further comprises:
normalizing the feature values of the analysis information of the short text:
normalizing the feature value of the information on whether a contact information feature is included to the binary value 0 or 100;
multiplying by 100 the feature value of each of the proportion of interfering symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the part-of-speech transition probability between adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns, to obtain a normalized value between 0 and 100; and
generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set specifically comprises:
generating the word feature vector of the short text from the feature values of the matching words and the normalized feature values of the matching analysis information.
Preferably, training the classification model from the word feature vectors of the short texts in the training set specifically comprises:
training the classification model from the word feature vectors of the short texts in the training set using an SVM classification algorithm, a Bayes algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm.
According to another aspect of the invention, a modeling device is also provided, comprising:
a feature extraction module, configured to, for each short text in a training set, already labeled as spam text or non-spam text, perform word segmentation to obtain the word set of the short text and perform spam feature analysis to obtain the analysis information of the short text;
a feature element set determination module, configured to, for each short text in the training set, compute the feature value of each word in its word set and the feature values of its analysis information, compute a class discrimination degree from the computed feature values, and take the words whose class discrimination degree exceeds a set threshold, together with the analysis information, as the feature elements of a feature element set;
a feature vector determination module, configured to, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set; and
a classification model construction module, configured to construct a classification model from the word feature vectors of the short texts in the training set determined by the feature vector determination module.
Preferably, the feature vector determination module is specifically configured to, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the feature values of the matching words and the normalized feature values of the matching analysis information.
Preferably, the analysis information comprises any one of the following items, or any combination of them:
information on whether a contact information feature is included, the proportion of interfering symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the part-of-speech transition probability between adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns.
According to another aspect of the invention, a short text spam identification device is also provided, comprising:
a feature extraction module, configured to perform word segmentation on a short text to be determined to obtain a word set, and to perform spam feature analysis on the short text to be determined to obtain analysis information;
a feature vector determination module, configured to compare the analysis information of the short text to be determined and each word in the word set with the feature elements in a predetermined feature element set, and to generate the word feature vector of the short text to be determined from the feature values of the words or analysis information that match feature elements in the feature element set; and
a spam identification module, configured to, after obtaining the word feature vector of the short text to be determined from the feature vector determination module, determine whether the short text to be determined is spam text according to that word feature vector and a classification model trained in advance.
Preferably, the feature vector determination module is specifically configured to compare the analysis information of the short text to be determined and each word in the word set with the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be determined from the feature values of the matching words and the normalized feature values of the matching analysis information.
Preferably, the analysis information comprises any one of the following items, or any combination of them:
information on whether a contact information feature is included, the proportion of interfering symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the part-of-speech transition probability between adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns.
In the technical solution of the invention, both the word feature vectors of the short texts in the training set, which form the spam text hyperplane and the non-spam text hyperplane, and the word feature vector of the short text to be determined include the feature values of the added analysis information. Performing spam identification on the short text to be determined with a word feature vector that includes these feature values improves the recognition rate and the accuracy of spam text identification.
Description of the drawings
Fig. 1 is a flowchart of constructing the spam text hyperplane and the non-spam text hyperplane according to an embodiment of the invention;
Fig. 2 is a flowchart of performing spam identification on a short text to be determined according to an embodiment of the invention;
Fig. 3 is a block diagram of the internal structure of the modeling device and the short text spam identification device according to an embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. It should be noted, however, that many of the details given in the specification are only intended to give the reader a thorough understanding of one or more aspects of the invention; these aspects can be implemented even without these specific details.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as but not limited to hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. For instance, both an application running on a computing device and the computing device itself may be modules. One or more modules may reside within a process and/or thread of execution, and a module may be located on one computer and/or distributed between two or more computers.
The inventor considered that the word feature vector obtained by prior-art methods can be extended: in addition to the feature values of words, it can also include the feature values of analysis information obtained by performing spam feature analysis on the short text. For example, this analysis information may include whether a contact information feature is present, the proportion of interfering symbols, and the proportion of nouns or verbs. Judging whether a short text to be determined is spam text from this extended word feature vector is more accurate than the prior-art method; that is, the recognition accuracy for spam short texts is improved.
Based on the above considerations, embodiments of the invention provide a short text spam identification method based on a classification model. In the training stage of the classification model, modeling is performed first: during modeling, the spam text hyperplane and the non-spam text hyperplane of the classification model are constructed from the short texts in the training set. In the identification stage, the spam text hyperplane and the non-spam text hyperplane of the constructed classification model are then used to judge whether a short text is spam.
The flow of the modeling process, that is, the method of constructing the spam text hyperplane and the non-spam text hyperplane of the classification model from the short texts in the training set, is shown in Fig. 1. The specific steps are as follows:
S101: Perform word segmentation on each short text in the training set to obtain the word set of each short text.
Specifically, each short text in the training set, already labeled as spam text or non-spam text, is segmented: the continuous character sequence in the short text is divided into individual words; function words with no practical meaning (such as punctuation, auxiliary words, modal particles, interjections, and onomatopoeia) are removed from the resulting words; and the remaining words form the word set of the short text.
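The filtering part of step S101 can be sketched as below. The sketch assumes that an external segmenter has already split the text into (word, part-of-speech) pairs; the tag names in `FUNCTION_POS` are hypothetical and would need to be mapped to whatever tag set the chosen segmenter emits.

```python
from typing import List, Tuple

# Hypothetical part-of-speech tags for words with no practical meaning:
# punctuation, auxiliary words, modal particles, interjections, onomatopoeia.
FUNCTION_POS = {"punct", "aux", "modal", "interj", "onom"}


def build_word_set(tagged_tokens: List[Tuple[str, str]]) -> List[str]:
    """Drop function words and keep the remaining words in their original order.

    Duplicates are kept on purpose so that term frequency can be counted later.
    """
    return [word for word, pos in tagged_tokens if pos not in FUNCTION_POS]
```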
S102: Perform spam feature analysis on each short text in the training set to obtain the analysis information of each short text.
Specifically, spam feature analysis is performed on each short text in the training set, already labeled as spam text or non-spam text, to obtain the analysis information of the short text. The analysis information comprises any one of the following items, or any combination of them: information on whether a contact information feature is included, the proportion of interfering symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the part-of-speech transition probability between adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech (for example, the quantity ratio of nouns to verbs), and the quantity ratio of punctuation marks to nouns.
For each short text in the training set, the analysis information may be extracted during preprocessing, before the word set of the short text is obtained, or it may be obtained after the word set has been obtained.
The contact information feature may be a string of digits or characters that serves as a means of contact, for example a telephone number, a QQ number, or a URL (Uniform Resource Locator). Some spam text is written for private gain and therefore leaves contact information, so whether a short text contains contact information can serve as an important feature for deciding whether it is spam text.
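One way to detect such a contact information feature is with regular expressions. The patterns below are illustrative assumptions only (an 11-digit mobile number starting with 1, a number following the letters "QQ", and a URL starting with http://, https://, or www.); a real system would need patterns tuned to its own data.

```python
import re

# Illustrative patterns; all three are assumptions, not taken from the source.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"(?<!\d)1\d{10}(?!\d)")
QQ_RE = re.compile(r"qq\D{0,3}[1-9]\d{4,10}", re.IGNORECASE)


def has_contact_feature(text: str) -> int:
    """Return 1 if the text appears to contain contact information, else 0."""
    for pattern in (URL_RE, PHONE_RE, QQ_RE):
        if pattern.search(text):
            return 1
    return 0
```

The function returns the binary value 0 or 1 so that it can be used directly as the feature value of this item of analysis information.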
An interfering symbol may be an uncommon symbol, for example "$". Some spam text uses uncommon symbols to split up keywords in order to evade keyword filtering, so the proportion of interfering symbols in a short text can serve as a feature for deciding whether it is spam text.
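The proportion of interfering symbols can be computed by counting characters. In this sketch the set of symbols treated as interfering is a hypothetical example; only "$" is named in the text.

```python
# Hypothetical set of uncommon symbols; only "$" is named in the source text.
INTERFERING_SYMBOLS = set("$*#^~|")


def interfering_symbol_ratio(text: str) -> float:
    """Fraction of the characters in the text that are interfering symbols."""
    if not text:
        return 0.0
    hits = sum(1 for ch in text if ch in INTERFERING_SYMBOLS)
    return hits / len(text)
```

For a keyword split up as "b$u$y", two of the five characters are interfering symbols, so the ratio is 0.4, whereas ordinary text gives a ratio at or near 0.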
The transition probability between words refers to the collocation probability of two adjacent words and the collocation probability of the types of two adjacent words. For example, in the short text "spam identification", "spam" and "identification" form a normal collocation, which has a corresponding collocation probability; "spam" is a noun and "identification" functions as a verb, and the probability of a noun collocating with a verb is relatively high.
A unigram word may be a single character.
A bigram word may be an idiom, a slang expression, or a set phrase made up of two characters.
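The text does not state how the transition (collocation) probability between adjacent words is estimated. A common choice, assumed here purely for illustration, is the maximum-likelihood bigram estimate: the number of times the second word follows the first, divided by the number of times the first word is followed by anything. The same function can be applied to part-of-speech tag sequences to estimate transition probabilities between word types.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bigram_transition_probs(
    sequences: List[List[str]],
) -> Dict[Tuple[str, str], float]:
    """Estimate P(next | current) from adjacent pairs in a corpus.

    sequences: one list of words (or part-of-speech tags) per text.
    """
    pair_counts: Counter = Counter()
    first_counts: Counter = Counter()
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }
```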
S103: For each short text in the training set, determine the feature values of the analysis information of the short text and the feature value of each word in its word set.
In this step, for each word in the word set of each short text in the training set, the TF (Term Frequency) value of the word in the short text is computed, the IDF (Inverse Document Frequency) value of the word in the training set is computed, and the feature value of the word is computed according to Formula 1:
log(TF + 1.0) × IDF (Formula 1)
The feature value of a word computed in this way is generally a number between 0 and 100.
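Formula 1 can be written directly in code. The sketch below makes three assumptions that the text leaves open: the logarithm is the natural logarithm, TF is the raw count of the word in the short text, and IDF is the usual log(N / df), where N is the number of texts in the training set and df is the number of texts containing the word.

```python
import math
from typing import List


def inverse_document_frequency(word: str, documents: List[List[str]]) -> float:
    """IDF as log(N / df); returns 0.0 for a word that occurs in no document."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / df) if df else 0.0


def word_feature_value(tf: float, idf: float) -> float:
    """Formula 1: log(TF + 1.0) * IDF (natural logarithm assumed)."""
    return math.log(tf + 1.0) * idf
```

Adding 1.0 inside the logarithm keeps the value at zero when TF is zero and dampens the effect of a word repeated many times in one short text.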
In this step, for each short text in the training set, it is judged from the obtained analysis information whether the short text includes a contact information feature; if so, the feature value of the information on whether a contact information feature is included is set to 1 (or 0); otherwise, it is set to 0 (or 1);
The ratio of the counted interference symbols to the characters of the short text is used as the feature value of the proportion information of interference symbols;
The ratio of the counted rarely used words to the characters of the short text is used as the feature value of the proportion information of rarely used words;
The ratio of the counted traditional characters to the characters of the short text is used as the feature value of the proportion information of traditional characters;
The obtained transition probability between words is used as the feature value of the transition probability between words;
The ratio of the counted nouns to the characters of the short text is used as the feature value of the proportion information of nouns;
The ratio of the counted verbs to the characters of the short text is used as the feature value of the proportion information of verbs;
The ratio of the counted punctuation marks to the characters of the short text is used as the feature value of the proportion information of punctuation marks;
The ratio of the counted monobasic words to the characters of the short text is used as the feature value of the proportion information of monobasic words;
The ratio of the counted binary words to the characters of the short text is used as the feature value of the proportion information of binary words;
The counted quantity ratio of nouns to verbs in the short text is used as the feature value of the quantity ratio information of nouns and verbs;
The counted quantity ratio of punctuation marks to nouns in the short text is used as the feature value of the quantity ratio information of punctuation marks and nouns.
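The ratio features listed above can be illustrated with a short sketch. The symbol sets below are hypothetical (the text only names "$" as an example interference symbol), and returning 0.0 for an empty text or a zero denominator is an assumption of this sketch, not something the text specifies.

```python
# Hypothetical symbol sets; the text only gives "$" as an example interference symbol.
INTERFERENCE_SYMBOLS = set("$#*^~|")
PUNCTUATION_MARKS = set(",.!?;:，。！？；：、")


def char_ratio(text, charset):
    """Share of the characters of `text` that belong to `charset` (0.0 for an empty text)."""
    if not text:
        return 0.0
    return sum(ch in charset for ch in text) / len(text)


def quantity_ratio(numerator_count, denominator_count):
    """Quantity ratio such as nouns to verbs; 0.0 when the denominator is 0 (an assumption)."""
    if denominator_count == 0:
        return 0.0
    return numerator_count / denominator_count


text = "buy$now$!!"
features = {
    "interference_ratio": char_ratio(text, INTERFERENCE_SYMBOLS),
    "punctuation_ratio": char_ratio(text, PUNCTUATION_MARKS),
    "noun_verb_ratio": quantity_ratio(1, 2),  # counts would come from a POS tagger
}
```

Proportions of rarely used words, traditional characters, nouns and verbs would follow the same pattern, given a suitable dictionary or part-of-speech tagger to do the counting.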
Considering that the feature values of the above types of proportion information, quantity ratio information and transition probability are normally numbers between 0 and 1, as a preferred embodiment, and for convenience of calculation, the determined feature values of the analysis information of each short text in the training set may also be normalized to obtain normalized feature values. For each short text in the training set, the calculated feature value of the information on whether a contact information feature is included is multiplied by 100, giving a normalized feature value of 0 or 100;
The feature values of the counted proportion information of interference symbols, proportion information of rarely used words, proportion information of traditional characters, transition probability between words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of monobasic words, proportion information of binary words, quantity ratio information of nouns and verbs, and quantity ratio information of punctuation marks and nouns are each multiplied by 100, giving normalized feature values between 0 and 100 for each type of proportion, quantity ratio and transition probability of the short text.
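The normalization described here is a plain multiplication by 100. A minimal sketch, with hypothetical feature names, could be:

```python
def normalize_analysis_features(features):
    """Multiply each analysis feature value by 100: a 0/1 flag becomes 0/100,
    and a ratio or probability in 0~1 becomes a value in 0~100."""
    return {name: value * 100 for name, value in features.items()}


# Hypothetical feature names and values for one short text.
raw = {"has_contact_info": 1, "interference_ratio": 0.25, "transition_prob": 0.5}
normalized = normalize_analysis_features(raw)
```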
S104: For each short text in the training set, obtain the class discrimination degree for the feature value of each word in the word set of the short text and for the feature values of the analysis information of the short text, and then select the characteristic elements of the characteristic element set.
Specifically, for each short text in the training set, the AUC algorithm may be applied to the feature value of each word in the word set of the short text to obtain the class discrimination degree of each word. The class discrimination degree of a word reflects how much the word contributes to distinguishing rubbish text from non-rubbish text.
For each short text in the training set, the class discrimination degree of the analysis information is obtained from the feature values of the analysis information of the short text, or from the normalized feature values. The class discrimination degree of a piece of analysis information reflects how much that analysis information contributes to distinguishing rubbish text from non-rubbish text. For analysis information whose feature value is discrete, the chi-square test algorithm may be used to obtain the class discrimination degree; for analysis information whose feature value is continuous, the AUC (Area Under Curve) algorithm may be used to obtain the class discrimination degree.
For each short text in the training set, after the class discrimination degree of the analysis information of the short text and of each word in its word set has been calculated, the words and the analysis information whose class discrimination degree is greater than a set threshold are taken as characteristic elements of the characteristic element set. The set threshold may be chosen empirically by the technician, and may differ depending on whether the feature value is discrete or continuous: for example, the threshold may be set to 10 for discrete feature values and to 0.7 for continuous feature values.
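To make the selection step concrete, the sketch below computes a per-feature AUC by pairwise comparison (the Mann-Whitney formulation) and keeps the features whose AUC exceeds the threshold. This is one possible reading of "the AUC algorithm"; the feature names and data are hypothetical, and a feature absent from a short text is assumed to have the value 0.

```python
def auc(values, labels):
    """AUC of one feature: the probability that a random rubbish text (label 1) has a
    larger feature value than a random non-rubbish text (label 0); ties count as 0.5."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_features(feature_columns, labels, threshold=0.7):
    """Keep features whose AUC is greater than the threshold (0.7 is the example in the text)."""
    return [name for name, column in feature_columns.items() if auc(column, labels) > threshold]


labels = [1, 1, 0, 0]  # 1 = rubbish text, 0 = non-rubbish text
columns = {
    "interference_ratio": [0.4, 0.3, 0.0, 0.1],
    "noun_ratio": [0.2, 0.1, 0.2, 0.1],
}
selected = select_features(columns, labels)
```

A feature with an AUC well below 0.5 is also informative, in the opposite direction; this sketch follows the text and keeps only features above the threshold.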
S105: For each short text in the training set, generate the word feature vector of the short text.
In this step, for each short text in the training set, the word feature vector is generated as follows: the analysis information of the short text and each word in its word set are compared with the characteristic elements in the characteristic element set, and the word feature vector of the short text is generated from the feature values of the words or analysis information that match characteristic elements in the set.
More preferably, for each short text in the training set, the analysis information of the short text and each word in its word set may be compared with the characteristic elements in the characteristic element set, and the word feature vector of the short text may be generated from the normalized feature values of the words or analysis information that match characteristic elements in the set.
Specifically, for each short text in the training set, each dimension of the word feature vector of the short text corresponds to one characteristic element in the characteristic element set. Where a vector element corresponds to analysis information of the short text, the feature value of that analysis information, or its normalized feature value, is used as the value of the vector element; where a vector element corresponds to a word in the word set of the short text, the feature value of that word, or its normalized feature value, is used as the value of the vector element; all other vector elements are empty or 0.
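The vector construction can be sketched as a lookup over the characteristic element set in a fixed order. The element names are hypothetical, and the "analysis:" prefix is an assumption of this sketch to keep analysis information from colliding with ordinary words.

```python
def build_feature_vector(feature_elements, word_values, analysis_values):
    """One dimension per characteristic element, in the fixed order of `feature_elements`.
    Elements that do not occur in this short text get the value 0.0."""
    vector = []
    for element in feature_elements:
        if element in analysis_values:
            vector.append(analysis_values[element])
        else:
            vector.append(word_values.get(element, 0.0))
    return vector


# Hypothetical characteristic element set and per-text feature values.
feature_elements = ["analysis:interference_ratio", "discount", "invoice"]
word_values = {"discount": 1.2, "hello": 0.3}  # "hello" is not a characteristic element
analysis_values = {"analysis:interference_ratio": 20.0, "analysis:noun_ratio": 5.0}
vector = build_feature_vector(feature_elements, word_values, analysis_values)
```

Words and analysis information that are not in the characteristic element set are simply ignored, which keeps every short text's vector the same length.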
S106: Build the classification model from the word feature vectors obtained for the short texts in the training set.
In this step, an SVM classification algorithm, a Bayes algorithm, a decision tree classification algorithm or a maximum entropy classification algorithm may be used to train the classification model from the word feature vectors of the short texts in the training set. Specifically, a suitable algorithm may be selected according to the number of short texts in the training set (that is, the sample size), and the classification model is then trained from the word feature vectors of the short texts in the training set.
The specific method of training a classification model from the word feature vectors of the short texts in the training set is well known to those skilled in the art and is not repeated here.
In fact, there is no strict order between steps S101 and S102: they may be executed in parallel, or step S102 may be executed before step S101.
After the classification model has been built in the training stage, rubbish identification can be performed on a short text to be determined in the identification stage according to that model. The flowchart of the short text rubbish identification method provided by the embodiment of the invention is shown in Figure 2; the specific steps include:
S201: Perform word segmentation on the short text to be determined to obtain the word set of the short text to be determined.
Specifically, word segmentation is performed on the short text to be determined: the continuous character sequence in the short text is divided into individual words; function words with no substantive meaning (such as punctuation, auxiliary words, modal particles, interjections and onomatopoeia) are removed from the resulting words; the remaining words constitute the word set of the short text.
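The filtering half of this step can be sketched as follows. Word segmentation itself is assumed to be done by an external segmenter and part-of-speech tagger that returns (word, tag) pairs; the tag names are hypothetical placeholders for the function-word categories named in the text.

```python
# Hypothetical tag names for the function-word categories to be removed.
FUNCTION_WORD_TAGS = {"punct", "aux", "modal", "interj", "onom"}


def word_set(tagged_tokens):
    """Drop function words from (word, tag) pairs; the remaining words form the word set.
    Repeated words are kept so that term frequency can still be counted later."""
    return [word for word, tag in tagged_tokens if tag not in FUNCTION_WORD_TAGS]


# Hypothetical output of a segmenter for the text "垃圾识别吧！".
tagged = [("垃圾", "n"), ("识别", "v"), ("吧", "modal"), ("！", "punct")]
words = word_set(tagged)
```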
S202: Perform rubbish feature analysis on the short text to be determined to obtain the analysis information of the short text to be determined.
Specifically, rubbish feature analysis is performed on the short text to be determined to obtain its analysis information, which includes any one of the following items or any combination of them: information on whether a contact information feature is included, proportion information of interference symbols, proportion information of rarely used words, proportion information of traditional characters, transition probability between words, transition probability between the parts of speech of adjacent words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of monobasic words, proportion information of binary words, mixing ratio of words of different parts of speech, quantity ratio information of punctuation marks and nouns, and so on.
The analysis information of the short text to be determined may be extracted during preprocessing before the word set of the short text to be determined is obtained, or it may be obtained after that word set has been obtained.
S203: Determine the characteristic elements of the short text to be determined.
Specifically, the analysis information of the short text to be determined and each word in its word set are compared with the characteristic elements in the above characteristic element set, and the words or analysis information that match characteristic elements in the set are taken as the characteristic elements of the short text to be determined.
S204: Generate the word feature vector of the short text to be determined from the feature values of its characteristic elements.
In this step, for each word of the short text to be determined that serves as a characteristic element, the feature value is calculated: the TF value of the word in the short text and the IDF value of the word in the training set are calculated, and the feature value of the word is calculated according to formula 1 above.
In this step, for the analysis information of the short text to be determined that serves as a characteristic element, the feature values are calculated as follows:
Where the analysis information serving as characteristic elements includes the information on whether a contact information feature is included: if the short text includes a contact information feature, the feature value of that information is set to 1 (or 0); otherwise, it is set to 0 (or 1);
Judge whether the analysis information serving as characteristic elements includes the proportion information of interference symbols; if so, the ratio of the counted interference symbols to the characters of the short text is used as the feature value of the proportion information of interference symbols;
Judge whether the analysis information serving as characteristic elements includes the proportion information of rarely used words; if so, the ratio of the counted rarely used words to the characters of the short text is used as the feature value of the proportion information of rarely used words;
Judge whether the analysis information serving as characteristic elements includes the proportion information of traditional characters; if so, the ratio of the counted traditional characters to the characters of the short text is used as the feature value of the proportion information of traditional characters;
Judge whether the analysis information serving as characteristic elements includes the transition probability between words; if so, the obtained transition probability between words is used as the feature value of the transition probability between words;
Judge whether the analysis information serving as characteristic elements includes the proportion information of nouns; if so, the ratio of the counted nouns to the characters of the short text is used as the feature value of the proportion information of nouns;
Judge whether the analysis information serving as characteristic elements includes the proportion information of verbs; if so, the ratio of the counted verbs to the characters of the short text is used as the feature value of the proportion information of verbs;
Judge whether the analysis information serving as characteristic elements includes the proportion information of punctuation marks; if so, the ratio of the counted punctuation marks to the characters of the short text is used as the feature value of the proportion information of punctuation marks;
Judge whether the analysis information serving as characteristic elements includes the proportion information of monobasic words; if so, the ratio of the counted monobasic words to the characters of the short text is used as the feature value of the proportion information of monobasic words;
Judge whether the analysis information serving as characteristic elements includes the proportion information of binary words; if so, the ratio of the counted binary words to the characters of the short text is used as the feature value of the proportion information of binary words;
Judge whether the analysis information serving as characteristic elements includes the quantity ratio information of nouns and verbs; if so, the counted quantity ratio of nouns to verbs in the short text is used as the feature value of the quantity ratio information of nouns and verbs;
Judge whether the analysis information serving as characteristic elements includes the quantity ratio information of punctuation marks and nouns; if so, the counted quantity ratio of punctuation marks to nouns in the short text is used as the feature value of the quantity ratio information of punctuation marks and nouns.
Considering that the feature values of the various types of proportion information, quantity ratio information and transition probability of the short text to be determined are generally numbers between 0 and 1, as a preferred embodiment, and for convenience of calculation, the feature values of the analysis information of the short text to be determined may also be normalized to obtain normalized feature values. For the short text to be determined, the calculated feature value of the information on whether a contact information feature is included is multiplied by 100, giving a normalized feature value of 0 or 100;
The feature values of the counted proportion information of interference symbols, proportion information of rarely used words, proportion information of traditional characters, transition probability between words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of monobasic words, proportion information of binary words, quantity ratio information of nouns and verbs, and quantity ratio information of punctuation marks and nouns are each multiplied by 100, giving normalized feature values between 0 and 100 for each type of proportion, quantity ratio and transition probability of the short text to be determined.
In this step, the word feature vector of the short text to be determined is generated from the feature values of its characteristic elements, or from the normalized feature values. Each dimension of the word feature vector of the short text to be determined corresponds to one characteristic element in the characteristic element set. Where a vector element corresponds to analysis information of the short text to be determined, the feature value of that analysis information, or its normalized feature value, is used as the value of the vector element; where a vector element corresponds to a word in the word set of the short text to be determined, the feature value of that word, or its normalized feature value, is used as the value of the vector element; all other vector elements are empty or 0.
S205: Determine whether the short text to be determined is rubbish text according to its word feature vector and the classification model.
How to determine whether the short text to be determined is rubbish text from its word feature vector and the classification model is a technique well known to those skilled in the art and is not repeated here.
In fact, there is no strict order between steps S201 and S202: they may be executed in parallel, or step S202 may be executed before step S201.
Based on the above modeling method, an embodiment of the invention provides a modeling device, whose internal structure is shown in the block diagram of Figure 3. It specifically comprises: a feature extraction module 301, a feature vector determination module 302, a classification model building module 303 and a characteristic element set determination module 304.
The feature extraction module 301 is configured to, for each short text in the training set that has been classified as rubbish text or non-rubbish text, perform word segmentation to obtain the word set of the short text, and perform rubbish feature analysis on the short text to obtain its analysis information; the analysis information of a short text may include any one of the following items or any combination of them:
Information on whether a contact information feature is included, proportion information of interference symbols, proportion information of rarely used words, proportion information of traditional characters, transition probability between words, transition probability between the parts of speech of adjacent words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of monobasic words, proportion information of binary words, mixing ratio of words of different parts of speech, and quantity ratio information of punctuation marks and nouns.
The characteristic element set determination module 304 is configured to, for each short text in the training set, obtain the word set and analysis information of the short text from the feature extraction module 301, calculate the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then obtain the class discrimination degree from the calculated feature values; the words and the analysis information whose class discrimination degree is greater than a set threshold are taken as characteristic elements of the characteristic element set;
Specifically, the characteristic element set determination module 304 may, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the characteristic elements in the characteristic element set, and generate the word feature vector of the short text from the normalized feature values of the words or analysis information that match characteristic elements in the set.
The feature vector determination module 302 is configured to, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the characteristic elements in the characteristic element set obtained by the characteristic element set determination module 304, and generate the word feature vector of the short text from the feature values of the words or analysis information that match characteristic elements in the set;
The classification model building module 303 is configured to build the classification model from the word feature vectors of the short texts in the training set determined by the feature vector determination module 302. Specifically, the classification model building module 303 uses an SVM classification algorithm, a Bayes algorithm, a decision tree classification algorithm or a maximum entropy classification algorithm to train the classification model from the word feature vectors of the short texts in the training set.
Based on the above short text rubbish identification method, an embodiment of the invention provides a short text rubbish identification device, whose internal structure is shown in the block diagram of Figure 3. It specifically comprises: a feature extraction module 401, a feature vector determination module 402 and a rubbish identification module 403.
The feature extraction module 401 is configured to perform word segmentation on the short text to be determined to obtain its word set, and to perform rubbish feature analysis on the short text to be determined to obtain its analysis information. The specific content of the analysis information of a short text has been described above and is not repeated here.
The feature vector determination module 402 is configured to obtain the analysis information and word set of the short text to be determined from the feature extraction module 401, compare the analysis information and each word in the word set with the characteristic elements in the predetermined characteristic element set, and generate the word feature vector of the short text to be determined from the feature values of the words or analysis information that match characteristic elements in the set;
Specifically, the feature vector determination module 402 may compare the analysis information of the short text to be determined and each word in its word set with the characteristic elements in the predetermined characteristic element set, and generate the word feature vector of the short text to be determined from the normalized feature values of the words or analysis information that match characteristic elements in the set.
The rubbish identification module 403 is configured to, after obtaining the word feature vector of the short text to be determined from the feature vector determination module 402, determine whether the short text to be determined is rubbish text according to that word feature vector and the pre-trained classification model.
In the technical solution of the invention, the word feature vector of each short text in the training set and the word feature vector of the short text to be determined both include the feature values of the extended analysis information. Performing rubbish identification on the short text to be determined with word feature vectors that include these feature values improves the recognition rate and recognition accuracy of rubbish text identification.
The above is only a preferred embodiment of the invention. It should be noted that those skilled in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims (18)

1. A short text rubbish identification method, characterized by comprising:
Performing word segmentation on a short text to be determined to obtain a word set, and performing rubbish feature analysis on the short text to be determined to obtain analysis information;
Comparing the analysis information of the short text to be determined and each word in the word set with the characteristic elements in a predetermined characteristic element set, and generating the word feature vector of the short text to be determined from the feature values of the words or analysis information that match characteristic elements in the characteristic element set;
Determining whether the short text to be determined is rubbish text according to the word feature vector of the short text to be determined and a pre-trained classification model.
2. The method of claim 1, characterized in that the analysis information comprises any one of the following items or any combination of them:
Information on whether a contact information feature is included, proportion information of interference symbols, proportion information of rarely used words, proportion information of traditional characters, transition probability between words, transition probability between the parts of speech of adjacent words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of monobasic words, proportion information of binary words, mixing ratio of words of different parts of speech, and quantity ratio information of punctuation marks and nouns.
3. The method of claim 2, characterized in that the feature value of the analysis information specifically comprises:
For the information on whether a contact information feature is included, a binary feature value of 0 or 1;
For the proportion information of interference symbols, of rarely used words or of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion information of nouns, of verbs, of punctuation marks, of monobasic words or of binary words, the mixing ratio of words of different parts of speech, or the quantity ratio information of punctuation marks and nouns, a feature value between 0 and 1.
4. The method of claim 3, characterized in that, before the word feature vector of the short text to be determined is generated, the method further comprises:
Normalizing the feature values of the analysis information that matches characteristic elements in the characteristic element set:
Normalizing the feature value of the information on whether a contact information feature is included to a binary value of 0 or 100;
Multiplying by 100 the feature value of the proportion information of interference symbols, of rarely used words or of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion information of nouns, of verbs, of punctuation marks, of monobasic words or of binary words, the mixing ratio of words of different parts of speech, or the quantity ratio information of punctuation marks and nouns, to obtain a normalized value between 0 and 100.
5. The method of any one of claims 1-4, characterized in that the feature value of a word is obtained as follows:
Calculating the TF and IDF values of the word, and calculating the feature value of the word according to formula 1:
log(TF + 1.0) × IDF (formula 1).
6. The method of any one of claims 1-4, characterized in that the method of training the classification model and the method of determining the characteristic element set comprise:
For each short text in a training set that has been classified as rubbish text or non-rubbish text, performing word segmentation to obtain the word set of the short text, and performing rubbish feature analysis on the short text to obtain the analysis information of the short text;
For each short text in the training set, calculating the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then obtaining the class discrimination degree from the calculated feature values; taking the words and the analysis information whose class discrimination degree is greater than a set threshold as characteristic elements of the characteristic element set;
For each short text in the training set, comparing the analysis information of the short text and each word in its word set with the characteristic elements in the characteristic element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match characteristic elements in the characteristic element set;
Training the classification model from the word feature vectors of the short texts in the training set.
7. The method of claim 6, characterized in that training the classification model from the word feature vectors of the short texts in the training set is specifically:
Using an SVM classification algorithm, a Bayes algorithm, a decision tree classification algorithm or a maximum entropy classification algorithm to train the classification model from the word feature vectors of the short texts in the training set.
8. A modeling method, characterized by comprising:
For each short text in a training set that has been labeled as spam text or non-spam text, performing word segmentation to obtain the word set of the short text, and performing spam feature analysis on the short text to obtain the analysis information of the short text;
For each short text in the training set, calculating the feature value of each word in the word set of the short text and the feature value of the analysis information of the short text, and then computing a class discrimination degree from the calculated feature values; taking the words whose class discrimination degree is greater than a set threshold, and the analysis information, as the feature elements in a feature element set;
For each short text in the training set, comparing the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set;
Training a classification model according to the word feature vector of each short text in the training set.
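The selection and vectorization steps of claim 8 can be sketched as follows. The claim does not define how the class discrimination degree or the feature value of a word is computed, so this sketch assumes a simple stand-in: discrimination is the absolute difference between a word's document frequency in spam and in non-spam texts, and a word's feature value is its relative frequency within the text. The function names, the label convention (1 for spam, 0 for non-spam) and the analysis item name `has_contact` are hypothetical.

```python
from collections import Counter

def select_feature_words(tokenized_texts, labels, threshold):
    """Keep words whose class discrimination degree exceeds the threshold.

    Assumed measure: |P(word | spam) - P(word | non-spam)|, where each
    probability is the fraction of texts in that class containing the word.
    """
    spam_df, ham_df = Counter(), Counter()
    n_spam = sum(1 for label in labels if label == 1)
    n_ham = len(labels) - n_spam
    for words, label in zip(tokenized_texts, labels):
        target = spam_df if label == 1 else ham_df
        target.update(set(words))  # count each word once per text
    vocab = set(spam_df) | set(ham_df)
    scores = {
        w: abs(spam_df[w] / max(n_spam, 1) - ham_df[w] / max(n_ham, 1))
        for w in vocab
    }
    return sorted(w for w, s in scores.items() if s > threshold)

def build_feature_vector(words, analysis_values, feature_words, analysis_names):
    """One slot per selected word (relative frequency), then one per analysis item."""
    counts = Counter(words)
    total = max(len(words), 1)
    word_part = [counts[w] / total for w in feature_words]
    analysis_part = [analysis_values.get(name, 0.0) for name in analysis_names]
    return word_part + analysis_part

# Toy training set: texts are already segmented into words.
texts = [["buy", "now", "cheap"], ["cheap", "buy"],
         ["hello", "friend"], ["meet", "friend", "now"]]
labels = [1, 1, 0, 0]
feature_words = select_feature_words(texts, labels, threshold=0.6)
vector = build_feature_vector(["buy", "buy", "friend", "x"],
                              {"has_contact": 1.0},
                              feature_words, ["has_contact"])
```

Words that do not match any feature element (here `"x"`) simply contribute nothing to the vector, which mirrors the matching step described in the claim.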
9. The method as claimed in claim 8, characterized in that the analysis information comprises any one of the following items of information, or any combination thereof:
Information on whether a contact-information feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns.
10. The method as claimed in claim 9, characterized in that the feature value of the analysis information specifically comprises:
For the information on whether a contact-information feature is included, the feature value is a binary value of 0 or 1;
For the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, or the ratio of the number of punctuation marks to the number of nouns, the feature value is a numerical value between 0 and 1.
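Two of the analysis items in claim 10 can be sketched to show the two kinds of feature value: a binary flag and a ratio between 0 and 1. The claims do not say how a contact-information feature is detected, so the regular expression below (a run of seven or more digits, or a URL-like token) is purely a hypothetical stand-in, as are the function names.

```python
import re
import unicodedata

# Hypothetical contact pattern: 7+ consecutive digits or a URL-like token.
CONTACT_PATTERN = re.compile(r"\d{7,}|https?://\S+|www\.\S+", re.IGNORECASE)

def has_contact_feature(text):
    """Binary feature value: 1 if the text appears to carry contact information."""
    return 1 if CONTACT_PATTERN.search(text) else 0

def punctuation_ratio(text):
    """Share of non-whitespace characters that are punctuation, in [0, 1]."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Unicode general categories starting with "P" are punctuation.
    punct = sum(1 for c in chars if unicodedata.category(c).startswith("P"))
    return punct / len(chars)
```

The remaining proportion-type items (nouns, verbs, rare characters and so on) would follow the same count-divided-by-total shape, but they need a part-of-speech tagger or character lists that this sketch does not assume.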
11. The method as claimed in claim 10, characterized in that, after calculating the feature value of the analysis information of the short text, and before generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set, the method further comprises:
Normalizing the feature value of the analysis information of the short text:
Normalizing the feature value of the information on whether a contact-information feature is included to a binary value of 0 or 100;
Multiplying by 100 the feature value of the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, or the ratio of the number of punctuation marks to the number of nouns, to obtain a normalized value between 0 and 100; and
Generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set specifically comprises:
Generating the word feature vector of the short text from the feature values of the matching words, or the normalized feature values of the matching analysis information, in the feature element set.
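The normalization in claim 11 amounts to mapping the binary flag to 0 or 100 and scaling every ratio-type value by 100. A minimal sketch, assuming the analysis values are held in a dictionary keyed by hypothetical item names; the clipping to the 0 to 1 range before scaling is a defensive assumption of this sketch and is not stated in the claim.

```python
def normalize_analysis_values(values, binary_names=("has_contact",)):
    """Map binary items to 0 or 100 and ratio items from [0, 1] to [0, 100]."""
    normalized = {}
    for name, value in values.items():
        if name in binary_names:
            normalized[name] = 100 if value else 0
        else:
            # Assumed guard: clip to [0, 1] so the result stays within 0-100.
            clipped = min(max(float(value), 0.0), 1.0)
            normalized[name] = clipped * 100
    return normalized

example = normalize_analysis_values({"has_contact": 1, "punctuation_ratio": 0.25})
```

Putting every analysis item on the same 0 to 100 scale keeps one large-valued item from dominating the others when the vector is fed to the classifier.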
12. The method as claimed in any one of claims 8-11, characterized in that training the classification model according to the word feature vector of each short text in the training set specifically comprises:
Using an SVM classification algorithm, a Bayes algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm to train the classification model according to the word feature vector of each short text in the training set.
13. A modeling device, characterized by comprising:
A feature extraction module, configured to, for each short text in a training set that has been labeled as spam text or non-spam text, perform word segmentation to obtain the word set of the short text, and perform spam feature analysis on the short text to obtain the analysis information of the short text;
A feature element set determination module, configured to, for each short text in the training set, calculate the feature value of each word in the word set of the short text and the feature value of the analysis information of the short text, and then compute a class discrimination degree from the calculated feature values; and to take the words whose class discrimination degree is greater than a set threshold, and the analysis information, as the feature elements in a feature element set;
A feature vector determination module, configured to, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set;
A classification model building module, configured to build a classification model according to the word feature vector of each short text in the training set as determined by the feature vector determination module.
14. The device as claimed in claim 13, characterized in that
The feature vector determination module is specifically configured to, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the feature values of the matching words, or the normalized feature values of the matching analysis information, in the feature element set.
15. The device as claimed in claim 13 or 14, characterized in that the analysis information comprises any one of the following items of information, or any combination thereof:
Information on whether a contact-information feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns.
16. A short text spam identification device, characterized by comprising:
A feature extraction module, configured to perform word segmentation on a short text to be determined to obtain a word set, and to perform spam feature analysis on the short text to be determined to obtain analysis information;
A feature vector determination module, configured to compare the analysis information of the short text to be determined and each word in its word set with the feature elements in a predetermined feature element set, and to generate the word feature vector of the short text to be determined from the feature values of the words or analysis information that match feature elements in the feature element set;
A spam identification module, configured to, after obtaining the word feature vector of the short text to be determined from the feature vector determination module, determine whether the short text to be determined is spam text according to that word feature vector and a classification model trained in advance.
17. The device as claimed in claim 16, characterized in that
The feature vector determination module is specifically configured to compare the analysis information of the short text to be determined and each word in its word set with the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be determined from the feature values of the matching words, or the normalized feature values of the matching analysis information, in the feature element set.
18. The device as claimed in claim 16 or 17, characterized in that the analysis information comprises any one of the following items of information, or any combination thereof:
Information on whether a contact-information feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the mixing ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns.
CN201310278012.6A 2013-07-04 2013-07-04 Short text garbage identification and modeling method and device Active CN103336766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310278012.6A CN103336766B (en) 2013-07-04 2013-07-04 Short text garbage identification and modeling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310278012.6A CN103336766B (en) 2013-07-04 2013-07-04 Short text garbage identification and modeling method and device

Publications (2)

Publication Number Publication Date
CN103336766A true CN103336766A (en) 2013-10-02
CN103336766B CN103336766B (en) 2016-12-28

Family

ID=49244935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310278012.6A Active CN103336766B (en) 2013-07-04 2013-07-04 Short text garbage identification and modeling method and device

Country Status (1)

Country Link
CN (1) CN103336766B (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN104615585A (en) * 2014-01-06 2015-05-13 腾讯科技(深圳)有限公司 Text information processing method and device
CN104722554A (en) * 2015-02-04 2015-06-24 无锡荣博能源环保科技有限公司 Garbage classification equipment and garbage classification method based on chemical element properties as well as application
CN104809236A (en) * 2015-05-11 2015-07-29 苏州大学 Microblog-based user age classification method and Microblog-based user age classification system
CN105045924A (en) * 2015-08-26 2015-11-11 苏州大学张家港工业技术研究院 Question classification method and system
CN105589941A (en) * 2015-12-15 2016-05-18 北京百分点信息科技有限公司 Emotional information detection method and apparatus for web text
CN105808602A (en) * 2014-12-31 2016-07-27 中国移动通信集团公司 Detection method and device of junk information
CN105956472A (en) * 2016-05-12 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for identifying whether webpage includes malicious content or not
CN106446032A (en) * 2016-08-30 2017-02-22 江苏博智软件科技有限公司 Junk information processing method and apparatus
WO2017028416A1 (en) * 2015-08-19 2017-02-23 小米科技有限责任公司 Classifier training method, type recognition method, and apparatus
CN106649255A (en) * 2015-11-04 2017-05-10 江苏引跑网络科技有限公司 Method for automatically classifying and identifying subject terms of short texts
CN106708961A (en) * 2016-11-30 2017-05-24 北京粉笔蓝天科技有限公司 Junk text library establishing method and system and junk text filtering method
CN104199811B (en) * 2014-09-10 2017-06-16 上海携程商务有限公司 Short sentence analytic modell analytical model method for building up and system
CN107180022A (en) * 2016-03-09 2017-09-19 阿里巴巴集团控股有限公司 object classification method and device
CN107562728A (en) * 2017-09-12 2018-01-09 电子科技大学 Social media short text filter method based on structure and text message
CN107943941A (en) * 2017-11-23 2018-04-20 珠海金山网络游戏科技有限公司 It is a kind of can iteration renewal rubbish text recognition methods and system
CN108647309A (en) * 2018-05-09 2018-10-12 达而观信息科技(上海)有限公司 Chat content checking method based on sensitive word and system
CN108847238A (en) * 2018-08-06 2018-11-20 东北大学 A kind of new services robot voice recognition methods
CN109726727A (en) * 2017-10-27 2019-05-07 中移(杭州)信息技术有限公司 A kind of data detection method and system
WO2019096032A1 (en) * 2017-11-20 2019-05-23 腾讯科技(深圳)有限公司 Text information processing method, computer device, and computer-readable storage medium
CN110019681A (en) * 2017-12-19 2019-07-16 优酷网络技术(北京)有限公司 A kind of comment content filtering method and system
CN110298041A (en) * 2019-06-24 2019-10-01 北京奇艺世纪科技有限公司 Rubbish text filter method, device, electronic equipment and storage medium
CN110442714A (en) * 2019-07-25 2019-11-12 北京百度网讯科技有限公司 POI name authority appraisal procedure, device, equipment and storage medium
CN111651598A (en) * 2020-05-28 2020-09-11 上海勃池信息技术有限公司 Spam text auditing device and method through center vector similarity matching

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101784022A (en) * 2009-01-16 2010-07-21 北京炎黄新星网络科技有限公司 Method and system for filtering and classifying short messages
CN101996241A (en) * 2010-10-22 2011-03-30 东南大学 Bayesian algorithm-based content filtering method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
侯旭东: "基于内容的短消息智能分析系统研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
龚垒: "基于支持向量机的垃圾短信过滤方法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615585A (en) * 2014-01-06 2015-05-13 腾讯科技(深圳)有限公司 Text information processing method and device
US10387460B2 (en) 2014-01-06 2019-08-20 Tencent Technology (Shenzhen) Company Limited Method and apparatus for processing text information
US11151176B2 (en) 2014-01-06 2021-10-19 Tencent Technology (Shenzhen) Company Limited Method and apparatus for processing text information
CN104615585B (en) * 2014-01-06 2017-07-21 腾讯科技(深圳)有限公司 Handle the method and device of text message
CN104199811B (en) * 2014-09-10 2017-06-16 上海携程商务有限公司 Short sentence analytic modell analytical model method for building up and system
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN105808602B (en) * 2014-12-31 2020-04-21 中国移动通信集团公司 Method and device for detecting junk information
CN105808602A (en) * 2014-12-31 2016-07-27 中国移动通信集团公司 Detection method and device of junk information
CN104722554A (en) * 2015-02-04 2015-06-24 无锡荣博能源环保科技有限公司 Garbage classification equipment and garbage classification method based on chemical element properties as well as application
CN104809236A (en) * 2015-05-11 2015-07-29 苏州大学 Microblog-based user age classification method and Microblog-based user age classification system
CN104809236B (en) * 2015-05-11 2018-03-27 苏州大学 A kind of age of user sorting technique and system based on microblogging
RU2643500C2 (en) * 2015-08-19 2018-02-01 Сяоми Инк. Method and device for training classifier and recognizing type
WO2017028416A1 (en) * 2015-08-19 2017-02-23 小米科技有限责任公司 Classifier training method, type recognition method, and apparatus
CN105045924A (en) * 2015-08-26 2015-11-11 苏州大学张家港工业技术研究院 Question classification method and system
CN106649255A (en) * 2015-11-04 2017-05-10 江苏引跑网络科技有限公司 Method for automatically classifying and identifying subject terms of short texts
CN105589941A (en) * 2015-12-15 2016-05-18 北京百分点信息科技有限公司 Emotional information detection method and apparatus for web text
CN107180022A (en) * 2016-03-09 2017-09-19 阿里巴巴集团控股有限公司 object classification method and device
CN105956472B (en) * 2016-05-12 2019-10-18 宝利九章(北京)数据技术有限公司 Identify webpage in whether include hostile content method and system
CN105956472A (en) * 2016-05-12 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for identifying whether webpage includes malicious content or not
CN106446032A (en) * 2016-08-30 2017-02-22 江苏博智软件科技有限公司 Junk information processing method and apparatus
CN106708961A (en) * 2016-11-30 2017-05-24 北京粉笔蓝天科技有限公司 Junk text library establishing method and system and junk text filtering method
CN106708961B (en) * 2016-11-30 2020-11-06 北京粉笔蓝天科技有限公司 Method for establishing junk text library, method for filtering junk text library and system
CN107562728A (en) * 2017-09-12 2018-01-09 电子科技大学 Social media short text filter method based on structure and text message
CN109726727A (en) * 2017-10-27 2019-05-07 中移(杭州)信息技术有限公司 A kind of data detection method and system
WO2019096032A1 (en) * 2017-11-20 2019-05-23 腾讯科技(深圳)有限公司 Text information processing method, computer device, and computer-readable storage medium
CN107943941A (en) * 2017-11-23 2018-04-20 珠海金山网络游戏科技有限公司 It is a kind of can iteration renewal rubbish text recognition methods and system
CN107943941B (en) * 2017-11-23 2021-10-15 珠海金山网络游戏科技有限公司 Junk text recognition method and system capable of being updated iteratively
CN110019681A (en) * 2017-12-19 2019-07-16 优酷网络技术(北京)有限公司 A kind of comment content filtering method and system
CN108647309B (en) * 2018-05-09 2021-08-10 达而观信息科技(上海)有限公司 Chat content auditing method and system based on sensitive words
CN108647309A (en) * 2018-05-09 2018-10-12 达而观信息科技(上海)有限公司 Chat content checking method based on sensitive word and system
CN108847238A (en) * 2018-08-06 2018-11-20 东北大学 A kind of new services robot voice recognition methods
CN110298041A (en) * 2019-06-24 2019-10-01 北京奇艺世纪科技有限公司 Rubbish text filter method, device, electronic equipment and storage medium
CN110298041B (en) * 2019-06-24 2023-09-05 北京奇艺世纪科技有限公司 Junk text filtering method and device, electronic equipment and storage medium
CN110442714A (en) * 2019-07-25 2019-11-12 北京百度网讯科技有限公司 POI name authority appraisal procedure, device, equipment and storage medium
CN110442714B (en) * 2019-07-25 2022-05-27 北京百度网讯科技有限公司 POI name normative evaluation method, device, equipment and storage medium
CN111651598A (en) * 2020-05-28 2020-09-11 上海勃池信息技术有限公司 Spam text auditing device and method through center vector similarity matching

Also Published As

Publication number Publication date
CN103336766B (en) 2016-12-28

Similar Documents

Publication Publication Date Title
CN103336766A (en) Short text garbage identification and modeling method and device
CN110020422B (en) Feature word determining method and device and server
CN105808526B (en) Commodity short text core word extracting method and device
CN103729474B (en) Method and system for recognizing forum user vest account
CN106919661B (en) Emotion type identification method and related device
CN103324745B (en) Text garbage recognition methods and system based on Bayesian model
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
CN109446404A (en) A kind of the feeling polarities analysis method and device of network public-opinion
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
CN107102993B (en) User appeal analysis method and device
CN102096680A (en) Method and device for analyzing information validity
CN108305180B (en) Friend recommendation method and device
CN102227724A (en) Machine learning for transliteration
CN103795612A (en) Method for detecting junk and illegal messages in instant messaging
CN110134788B (en) Microblog release optimization method and system based on text mining
CN105183717A (en) OSN user emotion analysis method based on random forest and user relationship
CN108009297B (en) Text emotion analysis method and system based on natural language processing
CN104899335A (en) Method for performing sentiment classification on network public sentiment of information
CN109508373A (en) Calculation method, equipment and the computer readable storage medium of enterprise's public opinion index
CN111079029A (en) Sensitive account detection method, storage medium and computer equipment
CN111061837A (en) Topic identification method, device, equipment and medium
CN109978020A (en) A kind of social networks account vest identity identification method based on multidimensional characteristic
CN110688540B (en) Cheating account screening method, device, equipment and medium
CN108462624A (en) A kind of recognition methods of spam, device and electronic equipment
CN107992473B (en) Fraud information feature word extraction method and system based on point-to-point mutual information technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant