CN103336766B - Short text spam identification and modeling method and device - Google Patents
Short text spam identification and modeling method and device
- Publication number: CN103336766B (application CN201310278012.6A)
- Authority: CN (China)
- Prior art keywords: information, word, short text, proportion, feature value
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a short-text spam identification and modeling method and device. The method includes: performing word segmentation on a short text to be judged to obtain a word set, and performing spam feature analysis on the short text to obtain analysis information; comparing the analysis information and each word in the word set with the feature elements in a predetermined feature element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the set; and determining, from the word feature vector and a classification model, whether the short text is spam text. The classification model is trained in advance with a classification algorithm chosen according to the number of samples in the training set. Because the word feature vector used for spam identification is expanded with the feature values of the analysis information, the accuracy of spam text identification is improved.
Description
Technical field
The present invention relates to the Internet field, and in particular to a short-text spam identification and modeling method and device.
Background art
Internet technology is developing rapidly and the amount of information on the network is growing explosively. As the pace of life and work quickens, people increasingly tend to communicate in brief text. SNS (Social Network Service) websites, represented by Twitter and Sina Weibo, which produce, organize, and propagate information as short texts, have won the favor of Internet users.
At present, the main method for automatically identifying spam in short-text content on the Internet is based on a classification model: a given short text is classified as either spam text or non-spam text. The method comprises a training stage and a classification stage.
In the training stage, modeling is performed on a large number of short texts in a training set. Each short text in the training set, already labeled as spam text or non-spam text, is segmented to obtain its word set, and the word feature vector of each short text is computed from its word set. A classification model is then trained on the word feature vectors of the short texts in the training set, for example with the SVM (Support Vector Machine) classification algorithm, a Bayesian classification algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm.
In the classification stage, a short text to be judged is segmented to obtain its word set, and its word feature vector is computed from that word set. Whether the short text is spam text is then determined from its word feature vector and the previously trained classification model. Many algorithms for making this determination from a word feature vector and a classification model are well known to those skilled in the art and are not described further here.
In practice, however, the inventors found that, because of the social nature of SNS websites, the short texts on them are usually very brief. The word set extracted from such brief content contains few words, so the effective feature values in the resulting word feature vector are sparse; sometimes the word feature vector of a short text has only one or two effective feature values. Deciding between the spam text set and the non-spam text set on so few feature values greatly reduces accuracy. In other words, the recognition accuracy of prior-art spam identification methods for short-text content is low.
Summary of the invention
In view of the above defects of the prior art, the present invention provides a short-text spam identification and modeling method and device, in order to improve the accuracy of spam identification for short-text content.
According to one aspect of the invention, a short-text spam identification method is provided, including:
performing word segmentation on a short text to be judged to obtain a word set, and performing spam feature analysis on the short text to be judged to obtain analysis information;
comparing the analysis information of the short text to be judged and each word in the word set with the feature elements in a predetermined feature element set, and generating the word feature vector of the short text to be judged from the feature values of the words or analysis information that match feature elements in the feature element set;
determining, according to the word feature vector of the short text to be judged and a classification model trained in advance, whether the short text to be judged is spam text.
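The vector-generation step above can be illustrated with a minimal Python sketch. It assumes, beyond what the source states, that the feature element set is kept in a fixed order and that a feature element with no match in the short text contributes 0.0; the function name and the example feature names are hypothetical.

```python
def build_feature_vector(word_values, analysis_values, feature_elements):
    """Return one value per feature element, in the order of feature_elements.

    Assumption: elements that do not occur in the short text get 0.0, and
    words outside the feature element set are simply ignored.
    """
    observed = {**word_values, **analysis_values}
    return [observed.get(element, 0.0) for element in feature_elements]


# Hypothetical feature element set mixing words and analysis information.
feature_elements = ["free", "discount", "has_contact", "interference_ratio"]
word_values = {"free": 2.1, "hello": 0.7}  # "hello" is not a feature element
analysis_values = {"has_contact": 100.0, "interference_ratio": 12.5}

vector = build_feature_vector(word_values, analysis_values, feature_elements)
print(vector)
```

Even when only one word matches, the analysis information still supplies non-zero entries, which is the sparsity problem the expanded vector is meant to address.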
Preferably, the analysis information includes any one of the following items, or any combination of them:
whether a contact-method feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unary words, the proportion of binary words, the collocation ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns.
Preferably, the feature values of the analysis information are as follows:
for the information on whether a contact-method feature is included, the feature value is binary, 0 or 1;
for the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unary words, the proportion of binary words, the collocation ratio of words of different parts of speech, or the quantity ratio of punctuation marks to nouns, the feature value is a number between 0 and 1.
Further, before the word feature vector of the short text to be judged is generated, the method also includes:
normalizing the feature values of the analysis information that matches feature elements in the feature element set:
the feature value of the information on whether a contact-method feature is included is normalized to the binary value 0 or 100;
the feature value of the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unary words, the proportion of binary words, the collocation ratio of words of different parts of speech, or the quantity ratio of punctuation marks to nouns is multiplied by 100, giving a normalized value between 0 and 100.
Preferably, the feature value of a word is obtained as follows:
compute the TF and IDF values of the word, and compute the feature value of the word according to Formula 1:
log(TF + 1.0) × IDF (Formula 1)
Preferably, the training method of the classification model and the method of determining the feature element set include:
for each short text in the training set that has been labeled as spam text or non-spam text, performing word segmentation to obtain the word set of the short text, and performing spam feature analysis on the short text to obtain its analysis information;
for each short text in the training set, computing the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then computing the class discrimination degree of the computed feature values; taking the words and analysis information whose class discrimination degree exceeds a set threshold as the feature elements of the feature element set;
for each short text in the training set, comparing the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set;
training the classification model on the word feature vectors of the short texts in the training set.
Preferably, training the classification model on the word feature vectors of the short texts in the training set specifically comprises:
using the SVM classification algorithm, a Bayesian classification algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm to train the classification model on the word feature vectors of the short texts in the training set.
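As a rough illustration of the training step, the sketch below fits a small Gaussian naive Bayes classifier on word feature vectors. This is only one possible form of the Bayesian classification option listed above, not the patent's own implementation; the labels "spam" and "ham", the variance floor of 1e-6, and the toy vectors are all assumptions made for the example.

```python
import math


def train_gaussian_nb(vectors, labels):
    """Fit per-class priors, feature means, and feature variances."""
    model = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small floor (assumed) keeps the variance strictly positive.
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(vectors)), means, variances)
    return model


def predict(model, vector):
    """Return the label with the highest log posterior for this vector."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(vector, means, variances))
    return max(model, key=log_posterior)


# Toy word feature vectors: [word feature value, normalized contact flag].
train_vectors = [[90.0, 100.0], [80.0, 100.0], [5.0, 0.0], [10.0, 0.0]]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_gaussian_nb(train_vectors, train_labels)
print(predict(model, [85.0, 100.0]), predict(model, [8.0, 0.0]))
```

In practice a library implementation of any of the four listed algorithms would normally replace this hand-written classifier.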
According to another aspect of the invention, a modeling method is also provided, including:
for each short text in the training set that has been labeled as spam text or non-spam text, performing word segmentation to obtain the word set of the short text, and performing spam feature analysis on the short text to obtain its analysis information;
for each short text in the training set, computing the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then computing the class discrimination degree of the computed feature values; taking the words and analysis information whose class discrimination degree exceeds a set threshold as the feature elements of a feature element set;
for each short text in the training set, comparing the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set;
training a classification model on the word feature vectors of the short texts in the training set.
Preferably, the analysis information includes any one of the following items, or any combination of them:
whether a contact-method feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unary words, the proportion of binary words, the collocation ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns.
Preferably, the feature values of the analysis information are as follows:
for the information on whether a contact-method feature is included, the feature value is binary, 0 or 1;
for the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unary words, the proportion of binary words, the collocation ratio of words of different parts of speech, or the quantity ratio of punctuation marks to nouns, the feature value is a number between 0 and 1.
Preferably, after the feature values of the analysis information of the short text are computed, and before the word feature vector of the short text is generated from the feature values of the words or analysis information that match feature elements in the feature element set, the method also includes:
normalizing the feature values of the analysis information of the short text:
the feature value of the information on whether a contact-method feature is included is normalized to the binary value 0 or 100;
the feature value of the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unary words, the proportion of binary words, the collocation ratio of words of different parts of speech, or the quantity ratio of punctuation marks to nouns is multiplied by 100, giving a normalized value between 0 and 100; and
generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set specifically comprises:
generating the word feature vector of the short text from the feature values of the matching words or the normalized feature values of the matching analysis information.
Preferably, training the classification model on the word feature vectors of the short texts in the training set specifically comprises:
using the SVM classification algorithm, a Bayesian classification algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm to train the classification model on the word feature vectors of the short texts in the training set.
According to another aspect of the invention, a modeling device is also provided, including:
a feature extraction module, configured to, for each short text in the training set that has been labeled as spam text or non-spam text, perform word segmentation to obtain the word set of the short text, and perform spam feature analysis on the short text to obtain its analysis information;
a feature element set determination module, configured to, for each short text in the training set, compute the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then compute the class discrimination degree of the computed feature values; and to take the words and analysis information whose class discrimination degree exceeds a set threshold as the feature elements of a feature element set;
a feature vector determination module, configured to, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set;
a classification model construction module, configured to build a classification model from the word feature vectors of the short texts in the training set determined by the feature vector determination module.
Preferably, the feature vector determination module is specifically configured to, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the feature values of the matching words or the normalized feature values of the matching analysis information.
Preferably, the analysis information includes any one of the following items, or any combination of them:
whether a contact-method feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unary words, the proportion of binary words, the collocation ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns.
According to another aspect of the invention, a short-text spam identification device is also provided, including:
a feature extraction module, configured to perform word segmentation on a short text to be judged to obtain a word set, and to perform spam feature analysis on the short text to be judged to obtain analysis information;
a feature vector determination module, configured to compare the analysis information of the short text to be judged and each word in the word set with the feature elements in a predetermined feature element set, and to generate the word feature vector of the short text to be judged from the feature values of the words or analysis information that match feature elements in the feature element set;
a spam identification module, configured to obtain the word feature vector of the short text to be judged from the feature vector determination module and then determine, according to that word feature vector and a classification model trained in advance, whether the short text to be judged is spam text.
Preferably, the feature vector determination module is specifically configured to compare the analysis information of the short text to be judged and each word in the word set with the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be judged from the feature values of the matching words or the normalized feature values of the matching analysis information.
Preferably, the analysis information includes any one of the following items, or any combination of them:
whether a contact-method feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unary words, the proportion of binary words, the collocation ratio of words of different parts of speech, and the quantity ratio of punctuation marks to nouns.
In the technical solution of the present invention, the word feature vectors of the short texts in the training set, which form the spam text hyperplane and the non-spam text hyperplane, and the word feature vector of the short text to be judged all contain the feature values of the expanded analysis information. Performing spam identification on the short text to be judged with word feature vectors that contain these feature values improves both the discrimination and the accuracy of spam text identification.
Brief description of the drawings
Fig. 1 is a flow chart of constructing the spam text hyperplane and the non-spam text hyperplane according to an embodiment of the invention;
Fig. 2 is a flow chart of performing spam identification on a short text to be judged according to an embodiment of the invention;
Fig. 3 is a block diagram of the internal structure of the modeling device and the short-text spam identification device according to an embodiment of the invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to the drawings and preferred embodiments. It should be noted, however, that many of the details listed in the description are only intended to give the reader a thorough understanding of one or more aspects of the invention, and these aspects can be implemented even without these specific details.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as but not limited to hardware, firmware, a combination thereof, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable program, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device itself can be modules. One or more modules may reside within a process and/or thread of execution, and a module may be located on one computer and/or distributed between two or more computers.
The inventors considered that the word feature vector obtained by prior-art methods can be expanded: in addition to the feature values of words, it can also include the feature values of analysis information obtained by performing spam feature analysis on the short text. For example, the analysis information obtained from spam feature analysis of a short text may include whether a contact-method feature is present, the proportion of interference symbols, the proportion of nouns, or the proportion of verbs. Judging whether a short text is spam text from this expanded word feature vector is more accurate than the prior-art method; that is, the recognition accuracy for spam short texts is improved.
Based on the above consideration, embodiments of the invention provide a short-text spam identification method based on a classification model. In the training stage of the classification model, modeling is performed first: the spam text hyperplane and the non-spam text hyperplane of the classification model are built from the short texts in the training set. In the identification stage, the spam text hyperplane and non-spam text hyperplane of the built classification model are then used to judge whether a short text is spam.
In the modeling process, the method of modeling on the short texts in the training set, that is, of building the spam text hyperplane and the non-spam text hyperplane of the classification model, follows the flow shown in Fig. 1. The specific steps include:
S101: Perform word segmentation on each short text in the training set to obtain the word set of each short text.
Specifically, each short text in the training set that has been labeled as spam text or non-spam text is segmented: the continuous character sequence of the short text is divided into individual words. From the resulting words, function words without practical meaning (such as punctuation, auxiliary words, modal particles, interjections, and onomatopoeia) are removed. The remaining words form the word set of the short text.
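The function-word filtering in S101 might look like the following Python sketch. It assumes the short text has already been segmented and tagged with parts of speech by some external segmenter; the tag names and the example tokens are hypothetical, and repeated words are kept so that term frequencies can still be counted later.

```python
# Hypothetical part-of-speech tags; a real segmenter defines its own tag set.
FUNCTION_TAGS = {"punct", "particle", "modal", "interjection", "onomatopoeia"}


def content_words(tagged_tokens):
    """Drop function words and keep the remaining words in their original order."""
    return [word for word, tag in tagged_tokens if tag not in FUNCTION_TAGS]


tokens = [("buy", "verb"), ("cheap", "adj"), ("watches", "noun"),
          ("!", "punct"), ("oh", "interjection")]
words = content_words(tokens)
print(words)
```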
S102: Perform spam feature analysis on each short text in the training set to obtain the analysis information of each short text.
Specifically, spam feature analysis is performed on each short text in the training set that has been labeled as spam text or non-spam text to obtain the analysis information of the short text, which includes any one of the following items, or any combination of them: whether a contact-method feature is included, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unary words, the proportion of binary words, the collocation ratio of words of different parts of speech (for example, the quantity ratio of nouns to verbs), and the quantity ratio of punctuation marks to nouns.
For each short text in the training set, the analysis information may be extracted in a preprocessing step before the word set of the short text is obtained, or it may be obtained after the word set has been obtained.
The contact-method feature mentioned above may be a string of digits or characters that serves as contact information, for example a telephone number, a QQ number, or a URL (Uniform Resource Locator). Typically, the purpose of some spam texts is to obtain private gain, which requires leaving a contact method. Therefore, whether a short text contains a contact method can serve as an important feature for judging whether it is spam text.
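A minimal Python sketch of such a contact-method check is given here. The source does not say how contact details are detected, so the regular expressions, the five-digit threshold for phone-like or QQ-like numbers, and the function name are all assumptions chosen for illustration.

```python
import re

# Illustrative patterns only; real detection rules would need tuning.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
NUMBER_PATTERN = re.compile(r"\d{5,}")  # runs of 5+ digits (threshold assumed)


def has_contact_feature(text):
    """Return the binary feature value: 1 if a contact method is found, else 0."""
    if URL_PATTERN.search(text) or NUMBER_PATTERN.search(text):
        return 1
    return 0


print(has_contact_feature("call 13800001234 now"))
print(has_contact_feature("see you at 5 pm"))
```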
An interference symbol may be a rarely used symbol, for example "$". To evade keyword filtering, some spam texts use such rarely used symbols to split up keywords. Therefore, the proportion of interference symbols in a short text can serve as a feature for judging whether it is spam text.
The transition probability between words refers to the collocation probability of two adjacent words and the collocation probability of the types of two adjacent words. For example, in the short text "spam identification", "spam" and "identification" are a normal collocation with a corresponding collocation probability; "spam" is a noun and "identification" is a verb, and the collocation probability of a noun followed by a verb is relatively high.
A unary word may be a single character.
A binary word may be a phrase, slang expression, or idiom composed of 2 characters.
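The source does not state how the collocation probability of adjacent words is estimated. One common choice, shown here purely as an assumption, is the conditional bigram frequency P(second | first) counted over a segmented corpus. The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def train_bigram_model(segmented_texts):
    """Count adjacent word pairs and the words that start those pairs."""
    first_counts = Counter()
    bigram_counts = Counter()
    for words in segmented_texts:
        first_counts.update(words[:-1])  # only words that have a successor
        bigram_counts.update(zip(words, words[1:]))
    return first_counts, bigram_counts


def transition_probability(first, second, first_counts, bigram_counts):
    """Estimate P(second | first); unseen first words give 0.0."""
    if first_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / first_counts[first]


corpus = [["spam", "identification"], ["spam", "filter"],
          ["spam", "identification", "model"]]
first_counts, bigram_counts = train_bigram_model(corpus)
p = transition_probability("spam", "identification", first_counts, bigram_counts)
print(round(p, 3))
```

The same counting applies to part-of-speech tags instead of words. How the per-pair probabilities are combined into a single value for a short text is left open by the source.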
S103: For each short text in the training set, determine the feature values of the analysis information of the short text and the feature value of each word in the word set of the short text.
In this step, based on the word sets obtained for the short texts in the training set, for each word in the word set of each short text, the TF (Term Frequency) value of the word in that short text is computed, the IDF (Inverse Document Frequency) value of the word in the training set is computed, and the feature value of the word is computed according to Formula 1:
log(TF + 1.0) × IDF (Formula 1)
The feature value computed for a word is usually a number between 0 and 100.
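Formula 1 can be sketched in Python as follows. The source does not define TF and IDF precisely, so this example assumes TF is the raw count of the word in the short text, IDF is log(N / document frequency) over the training set, and the logarithm is natural; the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def idf_table(training_texts):
    """IDF per word, assumed here to be log(N / number of texts containing it)."""
    n = len(training_texts)
    doc_freq = Counter()
    for words in training_texts:
        doc_freq.update(set(words))
    return {word: math.log(n / df) for word, df in doc_freq.items()}


def word_feature_values(words, idf):
    """Apply Formula 1, log(TF + 1.0) * IDF, to every word of one short text."""
    tf = Counter(words)
    return {word: math.log(count + 1.0) * idf.get(word, 0.0)
            for word, count in tf.items()}


training = [["free", "gift", "free"], ["meeting", "today"],
            ["free", "lunch", "today"]]
idf = idf_table(training)
values = word_feature_values(training[0], idf)
print(values)
```

The log(TF + 1.0) term dampens repeated words, so a word that appears twice does not count twice as much as a word that appears once.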
In this step, for each short text in the training set, the feature values of the analysis information are determined from the analysis information obtained for the short text, as follows:
Whether the short text contains contact-information features is judged; if so, the feature value of the information on whether contact-information features are contained is set to 1 (or 0); otherwise, it is set to 0 (or 1);
The counted proportion of interference symbols among the characters of the short text is taken as the feature value of the interference-symbol proportion information;
The counted proportion of rare characters among the characters of the short text is taken as the feature value of the rare-character proportion information;
The counted proportion of traditional characters among the characters of the short text is taken as the feature value of the traditional-character proportion information;
The obtained transition probability between words is taken as the feature value of the inter-word transition probability;
The counted proportion of nouns among the characters of the short text is taken as the feature value of the noun proportion information;
The counted proportion of verbs among the characters of the short text is taken as the feature value of the verb proportion information;
The counted proportion of punctuation marks among the characters of the short text is taken as the feature value of the punctuation-mark proportion information;
The counted proportion of unitary words among the characters of the short text is taken as the feature value of the unitary-word proportion information;
The counted proportion of binary words among the characters of the short text is taken as the feature value of the binary-word proportion information;
The counted quantity ratio of nouns to verbs in the short text is taken as the feature value of the noun-to-verb quantity-ratio information;
The counted quantity ratio of punctuation marks to nouns in the short text is taken as the feature value of the punctuation-to-noun quantity-ratio information.
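As a rough illustration of how two of these feature values might be computed, the sketch below derives the binary contact-information flag and a character-level proportion. The regular expression, the punctuation inventory, and the function names are assumptions made for this example; the text does not specify how contact information or symbol classes are detected.

```python
import re
import string

# Hypothetical rule: a run of 7+ digits or a URL counts as contact information.
CONTACT_PATTERN = re.compile(r"\d{7,}|https?://\S+")
# Hypothetical inventory: ASCII punctuation plus a few full-width marks.
PUNCTUATION = set(string.punctuation) | set(",。!?;:、")


def contact_flag(text: str) -> int:
    """Binary feature value: 1 if the text contains contact information, else 0."""
    return 1 if CONTACT_PATTERN.search(text) else 0


def char_proportion(text: str, charset: set) -> float:
    """Share of the characters of `text` that belong to `charset`, in [0, 1]."""
    if not text:
        return 0.0
    return sum(ch in charset for ch in text) / len(text)
```

The same `char_proportion` helper could serve for interference symbols, rare characters, or traditional characters by passing a different character set.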
Because the feature values of the proportion information, quantity-ratio information, and transition probabilities above are usually numbers between 0 and 1, as a preferred embodiment the feature values of the analysis information of each short text in the training set may also be normalized, for ease of calculation, to obtain normalized feature values: for each short text in the training set, the calculated feature value of the information on whether contact-information features are contained is multiplied by 100, giving a normalized value of 0 or 100;
The feature values of the interference-symbol proportion information, the rare-character proportion information, the traditional-character proportion information, the inter-word transition probability, the noun proportion information, the verb proportion information, the punctuation-mark proportion information, the unitary-word proportion information, the binary-word proportion information, the noun-to-verb quantity-ratio information, and the punctuation-to-noun quantity-ratio information are each multiplied by 100, giving normalized feature values of the proportions, quantity ratios, and transition probability of the short text: numbers between 0 and 100.
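The normalization described here is a plain rescaling by 100. A minimal sketch, assuming the analysis feature values of one short text are held in a dictionary keyed by hypothetical feature names:

```python
def normalize(features: dict) -> dict:
    """Multiply every analysis feature value by 100, as in the preferred embodiment."""
    return {name: value * 100 for name, value in features.items()}
```

With this scaling the binary flag becomes 0 or 100, and each proportion in [0, 1] lands in the same 0 to 100 range that the text gives for word feature values.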
S104: For each short text in the training set, compute the class discrimination degree from the feature value of each word in the word set of the short text and from the feature values of the analysis information of the short text, and then select the feature elements of the feature element set.
Specifically, for the feature value of each word in the word set of each short text in the training set, the AUC algorithm may be used to compute the class discrimination degree of the word. The class discrimination degree of a word reflects how much the word contributes to distinguishing spam text from non-spam text.
For each short text in the training set, the class discrimination degree of the analysis information is computed from the feature values, or the normalized feature values, of the analysis information of the short text. The class discrimination degree of a piece of analysis information reflects how much that analysis information contributes to distinguishing spam text from non-spam text. For analysis information whose feature values are discrete, the chi-square (χ²) test may be used to compute the class discrimination degree; for analysis information whose feature values are continuous, the AUC (Area Under Curve) algorithm may be used to compute the class discrimination degree.
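The two scoring methods named above can be sketched in plain Python. This is an illustrative reading, not the patent's own implementation: AUC is computed here by pairwise comparison of spam and non-spam samples, and the chi-square statistic is computed for a 2x2 table of a binary feature against the class label. The label convention (1 for spam, 0 for non-spam) and the function names are assumptions.

```python
def auc(values, labels):
    """AUC of a continuous feature: the probability that a spam sample (label 1)
    has a larger value than a non-spam sample (label 0), ties counting one half."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def chi_square(flags, labels):
    """Chi-square statistic of a binary (0/1) feature against a binary class label."""
    a = sum(1 for f, y in zip(flags, labels) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(flags, labels) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(flags, labels) if f == 0 and y == 1)
    d = sum(1 for f, y in zip(flags, labels) if f == 0 and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # the feature or the label is constant: no discrimination
    return n * (a * d - b * c) ** 2 / denom
```

The pairwise AUC is quadratic in the number of samples; a rank-based formulation would be preferable for a large training set.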
After the class discrimination degrees of the analysis information of the short texts in the training set, and of each word in their word sets, have been computed, the words and the analysis information whose class discrimination degree exceeds a set threshold are taken as the feature elements of the feature element set. The set threshold may be chosen empirically by a technician, and different thresholds may be set depending on whether the feature values are discrete or continuous: for example, the threshold may be set to 10 when the feature values are discrete, and to 0.7 when the feature values are continuous.
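The selection rule can be sketched as a threshold filter. The dictionary layout and the "discrete"/"continuous" labels are assumptions for illustration; the default thresholds are the example values of 10 and 0.7 given in the text.

```python
def select_feature_elements(scores, kinds,
                            discrete_threshold=10.0,
                            continuous_threshold=0.7):
    """Keep the candidates whose class discrimination degree exceeds the threshold
    that applies to their kind of feature value."""
    selected = []
    for name, score in scores.items():
        if kinds[name] == "discrete":
            threshold = discrete_threshold
        else:
            threshold = continuous_threshold
        if score > threshold:
            selected.append(name)
    return selected
```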
S105: For each short text in the training set, generate the word feature vector of the short text.
In this step, for each short text in the training set, the word feature vector is generated as follows: the analysis information of the short text and each word in its word set are compared with the feature elements in the feature element set, and the word feature vector of the short text is generated from the feature values of the words or analysis information that match feature elements in the feature element set.
Preferably, the word feature vector of each short text in the training set may instead be generated from the normalized feature values of the words or analysis information that match feature elements in the feature element set.
Specifically, for each short text in the training set, each dimension of the word feature vector corresponds to one feature element in the feature element set. Where a vector element corresponds to analysis information of the short text, the feature value or normalized feature value of that analysis information is taken as the value of the vector element; where a vector element corresponds to a word in the word set of the short text, the feature value or normalized feature value of that word is taken as the value of the vector element; the values of all other vector elements are empty or 0.
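A minimal sketch of this vector construction follows. It assumes the feature element set is an ordered list and that analysis-information names carry a hypothetical "@" prefix so they cannot collide with words; neither convention comes from the text.

```python
def build_feature_vector(feature_elements, word_values, analysis_values):
    """One vector element per feature element: the matching word or analysis
    feature value when the short text has one, otherwise 0."""
    vector = []
    for element in feature_elements:
        if element in word_values:
            vector.append(word_values[element])
        elif element in analysis_values:
            vector.append(analysis_values[element])
        else:
            vector.append(0.0)
    return vector
```

Analysis information that is not in the feature element set, such as `"@punct_share"` in a text whose feature element set contains only `"@contact"`, simply does not appear in the vector.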
S106: Build the classification model from the word feature vectors of the short texts in the training set.
In this step, an SVM classification algorithm, a Bayesian classification algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm may be used to train the classification model from the word feature vectors of the short texts in the training set. Specifically, a suitable algorithm may be selected in light of the number of short texts in the training set (i.e., the sample size) and used to train the classification model from those word feature vectors.
The specific method of training a classification model from the word feature vectors of the short texts in the training set is well known to those skilled in the art and is not described further here.
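For readers who want a concrete picture, the sketch below trains and applies one possible instance of the Bayesian option: a Bernoulli naive Bayes classifier with Laplace smoothing. It is a simplification chosen for brevity, not the patent's prescribed method: each vector element is reduced to present (value above 0) or absent, so the magnitudes of the feature values are ignored. The label convention (1 for spam, 0 for non-spam) and the function names are assumptions.

```python
import math


def train_bernoulli_nb(vectors, labels):
    """Fit a Bernoulli naive Bayes model on binarized word feature vectors."""
    n_features = len(vectors[0])
    model = {}
    for cls in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        # Laplace-smoothed probability that each feature is non-zero in this class.
        probs = [
            (sum(1 for r in rows if r[j] > 0) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
        model[cls] = (math.log(len(rows) / len(labels)), probs)
    return model


def predict(model, vector):
    """Return the class with the highest log-posterior for `vector`."""
    def log_posterior(cls):
        log_prior, probs = model[cls]
        return log_prior + sum(
            math.log(p if x > 0 else 1 - p) for p, x in zip(probs, vector)
        )
    return max(model, key=log_posterior)
```

A production system would more likely rely on an established library implementation of SVM, naive Bayes, decision trees, or maximum entropy, chosen according to the sample size as the text suggests.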
In fact, there is no strict order between steps S101 and S102 above: they may be performed in parallel, or step S102 may be performed before step S101.
After the classification model has been built in the training phase, it can be used in the identification phase to perform spam identification on a short text to be determined. The flow of the short text spam identification method provided by the embodiment of the present invention is shown in Figure 2, and the specific steps include:
S201: Perform word segmentation on the short text to be determined to obtain the word set of the short text to be determined.
Specifically, word segmentation is performed on the short text to be determined: the continuous character sequence in the short text is divided into individual words; from the words so obtained, function words with no practical meaning (such as punctuation, auxiliary words, modal particles, interjections, and onomatopoeia) are removed; the remaining words constitute the word set of the short text.
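The filtering half of this step can be sketched as follows, assuming a segmenter has already produced (word, part-of-speech tag) pairs. The tag labels are hypothetical placeholders for punctuation, auxiliary words, modal particles, interjections, and onomatopoeia; a real segmenter's tag set may differ.

```python
# Hypothetical part-of-speech tags for the function-word classes listed above.
FUNCTION_WORD_TAGS = {"x", "u", "y", "e", "o"}


def build_word_set(tagged_tokens):
    """Drop function words and keep the remaining words in their original order.
    Duplicates are kept so that term frequencies can still be counted later."""
    return [word for word, tag in tagged_tokens if tag not in FUNCTION_WORD_TAGS]
```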
S202: Perform spam feature analysis on the short text to be determined to obtain the analysis information of the short text to be determined.
Specifically, spam feature analysis is performed on the short text to be determined to obtain its analysis information, which includes any one of the following items of information, or any combination of them: information on whether contact-information features are contained, proportion information of interference symbols, proportion information of rare characters, proportion information of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of unitary words, proportion information of binary words, the collocation ratio of words of different parts of speech, and quantity-ratio information of punctuation marks to nouns.
The analysis information of the short text to be determined may be extracted during the preprocessing that precedes obtaining the word set of the short text, or it may be obtained after the word set has been obtained.
S203: Determine the feature elements of the short text to be determined.
Specifically, the analysis information of the short text to be determined and each word in its word set are compared with the feature elements in the feature element set described above, and the words or analysis information that match feature elements in the feature element set are taken as the feature elements of the short text to be determined.
S204: Generate the word feature vector of the short text to be determined from the feature values of its feature elements.
In this step, a feature value is calculated for each word that serves as a feature element of the short text to be determined: the TF value of the word in the short text and the IDF value of the word in the training set are calculated, and the feature value of the word is then calculated according to Formula 1 above.
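A minimal sketch of this calculation, using the expression log(TF + 1.0) × IDF that the claims give as Formula 1. The definitions assumed here are not taken from this passage: TF is the raw count of the word in the short text, IDF is log(N / df) over the N training texts, and the logarithm is natural.

```python
import math
from collections import Counter


def idf(word, training_docs):
    """Assumed IDF: log(N / df), where df is the number of training texts
    that contain the word (0.0 if no training text contains it)."""
    df = sum(1 for doc in training_docs if word in doc)
    return math.log(len(training_docs) / df) if df else 0.0


def word_feature_value(word, doc_words, training_docs):
    """Feature value of a word: log(TF + 1.0) * IDF, with TF as a raw count."""
    tf = Counter(doc_words)[word]
    return math.log(tf + 1.0) * idf(word, training_docs)
```

Adding 1.0 inside the logarithm keeps the value at zero for a word that does not occur in the text and dampens the effect of repeated words.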
In this step, feature values are also calculated for the analysis information that serves as feature elements of the short text to be determined:
Judge whether the analysis information serving as feature elements includes the information on whether contact-information features are contained; if so, the feature value of that information is set to 1 (or 0); otherwise, it is set to 0 (or 1);
Judge whether the analysis information serving as feature elements includes the proportion information of interference symbols; if so, the counted proportion of interference symbols among the characters of the short text is taken as the feature value of the interference-symbol proportion information;
Judge whether the analysis information serving as feature elements includes the proportion information of rare characters; if so, the counted proportion of rare characters among the characters of the short text is taken as the feature value of the rare-character proportion information;
Judge whether the analysis information serving as feature elements includes the proportion information of traditional characters; if so, the counted proportion of traditional characters among the characters of the short text is taken as the feature value of the traditional-character proportion information;
Judge whether the analysis information serving as feature elements includes the transition probability between words; if so, the obtained transition probability between words is taken as the feature value of the inter-word transition probability;
Judge whether the analysis information serving as feature elements includes the proportion information of nouns; if so, the counted proportion of nouns among the characters of the short text is taken as the feature value of the noun proportion information;
Judge whether the analysis information serving as feature elements includes the proportion information of verbs; if so, the counted proportion of verbs among the characters of the short text is taken as the feature value of the verb proportion information;
Judge whether the analysis information serving as feature elements includes the proportion information of punctuation marks; if so, the counted proportion of punctuation marks among the characters of the short text is taken as the feature value of the punctuation-mark proportion information;
Judge whether the analysis information serving as feature elements includes the proportion information of unitary words; if so, the counted proportion of unitary words among the characters of the short text is taken as the feature value of the unitary-word proportion information;
Judge whether the analysis information serving as feature elements includes the proportion information of binary words; if so, the counted proportion of binary words among the characters of the short text is taken as the feature value of the binary-word proportion information;
Judge whether the analysis information serving as feature elements includes the quantity-ratio information of nouns to verbs; if so, the counted quantity ratio of nouns to verbs in the short text is taken as the feature value of the noun-to-verb quantity-ratio information;
Judge whether the analysis information serving as feature elements includes the quantity-ratio information of punctuation marks to nouns; if so, the counted quantity ratio of punctuation marks to nouns in the short text is taken as the feature value of the punctuation-to-noun quantity-ratio information.
Because the feature values of the proportion information, quantity-ratio information, and transition probabilities of the short text to be determined are usually numbers between 0 and 1, as a preferred embodiment the feature values of the analysis information of the short text to be determined may also be normalized, for ease of calculation, to obtain normalized feature values: for the short text to be determined, the calculated feature value of the information on whether contact-information features are contained is multiplied by 100, giving a normalized value of 0 or 100;
The feature values of the interference-symbol proportion information, the rare-character proportion information, the traditional-character proportion information, the inter-word transition probability, the noun proportion information, the verb proportion information, the punctuation-mark proportion information, the unitary-word proportion information, the binary-word proportion information, the noun-to-verb quantity-ratio information, and the punctuation-to-noun quantity-ratio information are each multiplied by 100, giving normalized feature values of the proportions, quantity ratios, and transition probability of the short text to be determined: numbers between 0 and 100.
In this step, the word feature vector of the short text to be determined is generated from the feature values, or the normalized feature values, of its feature elements. Each dimension of the word feature vector of the short text to be determined corresponds to one feature element in the feature element set. Where a vector element corresponds to analysis information of the short text to be determined, the feature value or normalized feature value of that analysis information is taken as the value of the vector element; where a vector element corresponds to a word in the word set of the short text to be determined, the feature value or normalized feature value of that word is taken as the value of the vector element; the values of all other vector elements are empty or 0.
S205: Determine whether the short text to be determined is spam text according to its word feature vector and the classification model.
How to determine whether the short text to be determined is spam text from its word feature vector and the classification model is well known to those skilled in the art and is not described further here.
In fact, there is no strict order between steps S201 and S202 above: they may be performed in parallel, or step S202 may be performed before step S201.
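Because the two steps are independent of each other, they can be submitted to a small worker pool. The sketch below only illustrates that independence; `segment` and `analyze` are hypothetical placeholders for steps S201 and S202, not real segmentation or spam feature analysis.

```python
from concurrent.futures import ThreadPoolExecutor


def segment(text):
    """Placeholder for S201: split on whitespace instead of real word segmentation."""
    return text.split()


def analyze(text):
    """Placeholder for S202: a single toy statistic instead of spam feature analysis."""
    return {"length": len(text)}


def extract(text):
    """Run both steps concurrently and return (word set, analysis information)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        words = pool.submit(segment, text)
        info = pool.submit(analyze, text)
        return words.result(), info.result()
```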
Based on the modeling method above, an embodiment of the present invention provides a modeling device whose internal structure is shown in the block diagram of Figure 3. It specifically includes: a feature extraction module 301, a feature vector determination module 302, a classification model construction module 303, and a feature element set determination module 304.
The feature extraction module 301 is configured, for each short text in the training set that has been labeled as spam text or non-spam text, to obtain the word set of the short text after word segmentation and to obtain the analysis information of the short text by performing spam feature analysis on it. The analysis information of a short text may include any one of the following items of information, or any combination of them:
Information on whether contact-information features are contained, proportion information of interference symbols, proportion information of rare characters, proportion information of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of unitary words, proportion information of binary words, the collocation ratio of words of different parts of speech, and quantity-ratio information of punctuation marks to nouns.
The feature element set determination module 304 is configured, for each short text in the training set, to obtain the word set and the analysis information of the short text from the feature extraction module 301, to calculate the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then to compute class discrimination degrees from the calculated feature values; the words and the analysis information whose class discrimination degree exceeds a set threshold are taken as the feature elements of the feature element set.
Specifically, the feature element set determination module 304 may, for each short text in the training set, compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of the short text from the normalized feature values of the words or analysis information that match feature elements in the feature element set.
The feature vector determination module 302 is configured, for each short text in the training set, to compare the analysis information of the short text and each word in its word set with the feature elements in the feature element set obtained by the feature element set determination module 304, and to generate the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set.
The classification model construction module 303 is configured to build the classification model from the word feature vectors of the short texts in the training set determined by the feature vector determination module 302. Specifically, the classification model construction module 303 uses an SVM classification algorithm, a Bayesian classification algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm to train the classification model from the word feature vectors of the short texts in the training set.
Based on the short text spam identification method above, an embodiment of the present invention provides a short text spam identification device whose internal structure is shown in the block diagram of Figure 3. It specifically includes: a feature extraction module 401, a feature vector determination module 402, and a spam identification module 403.
The feature extraction module 401 is configured to obtain a word set by performing word segmentation on the short text to be determined, and to obtain analysis information by performing spam feature analysis on the short text to be determined. The specific content of the analysis information of a short text has been described above and is not repeated here.
The feature vector determination module 402 is configured to obtain the analysis information and the word set of the short text to be determined from the feature extraction module 401, to compare the analysis information of the short text to be determined and each word in its word set with the feature elements in a predetermined feature element set, and to generate the word feature vector of the short text to be determined from the feature values of the words or analysis information that match feature elements in the feature element set.
Specifically, the feature vector determination module 402 may compare the analysis information of the short text to be determined and each word in its word set with the feature elements in the predetermined feature element set, and generate the word feature vector of the short text to be determined from the normalized feature values of the words or analysis information that match feature elements in the feature element set.
The spam identification module 403 is configured to obtain the word feature vector of the short text to be determined from the feature vector determination module 402, and then to determine, according to that word feature vector and the pre-trained classification model, whether the short text to be determined is spam text.
In the technical solution of the present invention, the word feature vector of each short text in the training set and the word feature vector of the short text to be determined both contain the feature values of the added analysis information. Performing spam identification on the short text to be determined with word feature vectors that contain these additional feature values improves the recognition rate and the recognition accuracy of spam text identification.
The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.
Claims (12)
1. A short text spam identification method, characterized by comprising:
performing word segmentation on a short text to be determined to obtain a word set, and performing spam feature analysis on the short text to be determined to obtain analysis information;
comparing the analysis information of the short text to be determined and each word in the word set with the feature elements in a predetermined feature element set, and generating the word feature vector of the short text to be determined from the feature values of the words or analysis information that match feature elements in the feature element set;
determining whether the short text to be determined is spam text according to the word feature vector of the short text to be determined and a pre-trained classification model; wherein
the analysis information includes any one of the following items of information, or any combination of them:
information on whether contact-information features are contained, proportion information of interference symbols, proportion information of rare characters, proportion information of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of unitary words, proportion information of binary words, the collocation ratio of words of different parts of speech, and quantity-ratio information of punctuation marks to nouns; and
the feature values of the analysis information specifically include:
for the information on whether contact-information features are contained, a binary feature value of 0 or 1;
for the proportion information of interference symbols, the proportion information of rare characters, the proportion information of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion information of nouns, the proportion information of verbs, the proportion information of punctuation marks, the proportion information of unitary words, the proportion information of binary words, the collocation ratio of words of different parts of speech, or the quantity-ratio information of punctuation marks to nouns, a feature value between 0 and 1.
2. The method as claimed in claim 1, characterized in that, before generating the word feature vector of the short text to be determined, the method further comprises:
normalizing the feature values of the analysis information that matches feature elements in the feature element set:
normalizing the feature value of the information on whether contact-information features are contained to a binary value of 0 or 100;
multiplying by 100 the feature value of the proportion information of interference symbols, the proportion information of rare characters, the proportion information of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion information of nouns, the proportion information of verbs, the proportion information of punctuation marks, the proportion information of unitary words, the proportion information of binary words, the collocation ratio of words of different parts of speech, or the quantity-ratio information of punctuation marks to nouns, to obtain a normalized value between 0 and 100.
3. The method as claimed in claim 1 or 2, characterized in that the feature value of a word is obtained as follows:
calculating the TF and IDF values of the word, and calculating the feature value of the word according to the following Formula 1:
log(TF + 1.0) × IDF (Formula 1).
4. The method as claimed in claim 1 or 2, characterized in that the training method of the classification model and the method of determining the feature element set comprise:
for each short text in a training set that has been labeled as spam text or non-spam text, obtaining the word set of the short text after word segmentation, and obtaining the analysis information of the short text by performing spam feature analysis on it;
for each short text in the training set, calculating the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then computing class discrimination degrees from the calculated feature values; taking the words and the analysis information whose class discrimination degree exceeds a set threshold as the feature elements of the feature element set;
for each short text in the training set, comparing the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set;
training the classification model from the word feature vectors of the short texts in the training set.
5. The method as claimed in claim 4, characterized in that training the classification model from the word feature vectors of the short texts in the training set is specifically:
using an SVM classification algorithm, a Bayesian classification algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm to train the classification model from the word feature vectors of the short texts in the training set.
6. A modeling method, characterized by comprising:
for each short text in a training set that has been labeled as spam text or non-spam text, obtaining the word set of the short text after word segmentation, and obtaining the analysis information of the short text by performing spam feature analysis on it;
for each short text in the training set, calculating the feature value of each word in the word set of the short text and the feature values of the analysis information of the short text, and then computing class discrimination degrees from the calculated feature values; taking the words and the analysis information whose class discrimination degree exceeds a set threshold as the feature elements of a feature element set;
for each short text in the training set, comparing the analysis information of the short text and each word in its word set with the feature elements in the feature element set, and generating the word feature vector of the short text from the feature values of the words or analysis information that match feature elements in the feature element set;
training a classification model from the word feature vectors of the short texts in the training set; wherein
the analysis information includes any one of the following items of information, or any combination of them:
information on whether contact-information features are contained, proportion information of interference symbols, proportion information of rare characters, proportion information of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, proportion information of nouns, proportion information of verbs, proportion information of punctuation marks, proportion information of unitary words, proportion information of binary words, the collocation ratio of words of different parts of speech, and quantity-ratio information of punctuation marks to nouns; and
the feature values of the analysis information specifically include:
for the information on whether contact-information features are contained, a binary feature value of 0 or 1;
for the proportion information of interference symbols, the proportion information of rare characters, the proportion information of traditional characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion information of nouns, the proportion information of verbs, the proportion information of punctuation marks, the proportion information of unitary words, the proportion information of binary words, the collocation ratio of words of different parts of speech, or the quantity-ratio information of punctuation marks to nouns, a feature value between 0 and 1.
7. The method as claimed in claim 6, characterized in that, after the feature values of the analysis information of this short text are calculated, and before the word feature vector of this short text is generated according to the feature values of the words or analysis information that match feature elements in the feature element set, the method further comprises:
normalizing the feature values of the analysis information of this short text:
normalizing the feature value of the information on whether a contact-information feature is present to the binary value 0 or 100;
multiplying by 100 the feature value of the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the collocation ratio of words of different parts of speech, or the ratio of the number of punctuation marks to the number of nouns, to obtain a normalized value between 0 and 100; and
generating the word feature vector of this short text according to the feature values of the words or analysis information that match feature elements in the feature element set specifically comprises:
generating the word feature vector of this short text according to the normalized feature values of the words or analysis information that match feature elements in the feature element set.
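The normalization in claim 7 amounts to rescaling every analysis-information feature value by 100. A minimal Python sketch follows; the function name, the dictionary layout, and the feature key `has_contact` are hypothetical and not taken from the patent.

```python
# Hypothetical name of the binary contact-information feature.
BINARY_FEATURES = {"has_contact"}


def normalize_features(features):
    """Map a binary feature to 0 or 100 and a [0, 1] feature to [0, 100]."""
    normalized = {}
    for name, value in features.items():
        if name in BINARY_FEATURES:
            # Binary 0/1 becomes binary 0/100.
            normalized[name] = 100 if value else 0
        else:
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie between 0 and 1, got {value}")
            # Proportions and probabilities are multiplied by 100.
            normalized[name] = value * 100
    return normalized
```

For example, `normalize_features({"has_contact": 1, "punctuation_ratio": 0.25})` yields a contact value of 100 and a punctuation value of 25.0.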
8. The method as claimed in claim 6 or 7, characterized in that training the classification model according to the word feature vector of each short text in the training set specifically comprises:
using an SVM classification algorithm, a Bayesian classification algorithm, a decision tree classification algorithm, or a maximum entropy classification algorithm to train the classification model according to the word feature vector of each short text in the training set.
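As an illustration of the training step in claim 8, the sketch below fits a Gaussian naive Bayes classifier in plain Python. This is only one possible stand-in for the "Bayesian classification algorithm" option; the claim does not say which Bayesian variant is meant, and the function names, the variance floor, and the label strings are assumptions.

```python
import math
from collections import defaultdict


def train_gaussian_nb(vectors, labels, var_floor=1.0):
    """Fit per-class log prior, feature means and feature variances."""
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)
    total = len(vectors)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Floor the variance so constant features do not cause division by zero.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / total), means, variances)
    return model


def predict(model, vec):
    """Return the label with the highest log posterior for one feature vector."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(vec, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))
```

A usage example with two-dimensional vectors (normalized contact feature, normalized punctuation proportion): train on `[[100, 40.0], [100, 35.0], [0, 5.0], [0, 10.0]]` with labels `["spam", "spam", "ham", "ham"]`, then call `predict(model, [100, 38.0])`.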
9. A modeling device, characterized by comprising:
a feature extraction module, configured to, for each short text in a training set that has been labeled as spam text or non-spam text, obtain the word set of this short text by word segmentation, and perform spam feature analysis on this short text to obtain the analysis information of this short text;
a feature element set determination module, configured to, for each short text in the training set, calculate the feature value of each word in the word set of this short text and the feature values of the analysis information of this short text, and then compute a class discrimination degree from the calculated feature values; and to take the words and analysis information whose class discrimination degree exceeds a set threshold as the feature elements in a feature element set;
a feature vector determination module, configured to, for each short text in the training set, compare the analysis information of this short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of this short text according to the feature values of the words or analysis information that match feature elements in the feature element set;
a classification model building module, configured to build a classification model according to the word feature vector of each short text in the training set as determined by the feature vector determination module;
wherein the analysis information includes any one of the following items of information, or any combination of them:
information on whether a contact-information feature is present, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the collocation ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns; and
the feature values of the analysis information are specifically as follows:
for the information on whether a contact-information feature is present, the feature value is binary, 0 or 1;
for the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the collocation ratio of words of different parts of speech, or the ratio of the number of punctuation marks to the number of nouns, the feature value is a number between 0 and 1.
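The feature element set determination module of claim 9 keeps only the words and analysis information whose class discrimination degree exceeds a threshold. The claim text here does not define how that degree is computed, so the Python sketch below substitutes a simple, assumed measure: the absolute difference between the mean feature value in spam texts and in non-spam texts. The function names, the sample layout, and the threshold value are likewise hypothetical.

```python
from collections import defaultdict


def class_discrimination(samples):
    """Score each feature by |mean value in spam - mean value in non-spam|.

    `samples` is a list of (features, is_spam) pairs, where `features` maps
    a word or analysis-information name to its feature value.
    This measure is an assumption, not the patent's definition.
    """
    sums = {True: defaultdict(float), False: defaultdict(float)}
    counts = {True: 0, False: 0}
    for features, is_spam in samples:
        counts[is_spam] += 1
        for name, value in features.items():
            sums[is_spam][name] += value
    names = set(sums[True]) | set(sums[False])
    return {
        name: abs(
            sums[True][name] / max(counts[True], 1)
            - sums[False][name] / max(counts[False], 1)
        )
        for name in names
    }


def select_feature_elements(samples, threshold):
    """Keep the names whose discrimination score exceeds the threshold."""
    scores = class_discrimination(samples)
    return {name for name, score in scores.items() if score > threshold}
```

Any other discrimination measure could be dropped into `class_discrimination` without changing the thresholding step.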
10. The device as claimed in claim 9, characterized in that
the feature vector determination module is specifically configured to, for each short text in the training set, compare the analysis information of this short text and each word in its word set with the feature elements in the feature element set, and generate the word feature vector of this short text according to the normalized feature values of the words or analysis information that match feature elements in the feature element set.
11. A short text spam identification device, characterized by comprising:
a feature extraction module, configured to obtain a word set by performing word segmentation on a short text to be determined, and to perform spam feature analysis on the short text to be determined to obtain analysis information;
a feature vector determination module, configured to compare the analysis information of the short text to be determined and each word in its word set with the feature elements in a predetermined feature element set, and to generate the word feature vector of the short text to be determined according to the feature values of the words or analysis information that match feature elements in the feature element set;
a spam identification module, configured to, after obtaining the word feature vector of the short text to be determined from the feature vector determination module, determine whether the short text to be determined is spam text according to that word feature vector and a pre-trained classification model;
wherein the analysis information includes any one of the following items of information, or any combination of them:
information on whether a contact-information feature is present, the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the collocation ratio of words of different parts of speech, and the ratio of the number of punctuation marks to the number of nouns; and
the feature values of the analysis information are specifically as follows:
for the information on whether a contact-information feature is present, the feature value is binary, 0 or 1;
for the proportion of interference symbols, the proportion of rare characters, the proportion of traditional Chinese characters, the transition probability between words, the transition probability between the parts of speech of adjacent words, the proportion of nouns, the proportion of verbs, the proportion of punctuation marks, the proportion of unigram words, the proportion of bigram words, the collocation ratio of words of different parts of speech, or the ratio of the number of punctuation marks to the number of nouns, the feature value is a number between 0 and 1.
12. The device as claimed in claim 11, characterized in that
the feature vector determination module is specifically configured to compare the analysis information of the short text to be determined and each word in its word set with the feature elements in the predetermined feature element set, and to generate the word feature vector of the short text to be determined according to the normalized feature values of the words or analysis information that match feature elements in the feature element set.
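The matching step shared by the feature vector determination modules in claims 9 to 12 can be sketched in Python as follows. The sketch assumes that the feature element set is kept as an ordered list, that unmatched feature elements contribute 0, and that word names and analysis-information names do not collide; the function names and the `classifier` callable are hypothetical.

```python
def build_feature_vector(feature_elements, word_values, analysis_values):
    """Build a word feature vector aligned with the feature element list.

    Words or analysis information that match a feature element contribute
    their (normalized) feature value; words that match no feature element
    are ignored, and unmatched feature elements default to 0.0.
    """
    # Assumption: word keys and analysis-information keys do not overlap.
    candidates = {**word_values, **analysis_values}
    return [candidates.get(element, 0.0) for element in feature_elements]


def is_spam(vector, classifier):
    """Apply a pre-trained classifier, passed in as any callable on a vector."""
    return bool(classifier(vector))


# Usage: "hello" is not a feature element, so it does not appear in the vector.
feature_elements = ["free", "has_contact", "punctuation_ratio"]
vector = build_feature_vector(
    feature_elements,
    word_values={"free": 100.0, "hello": 100.0},
    analysis_values={"has_contact": 100.0, "punctuation_ratio": 12.5},
)
```

Keeping the feature element list in a fixed order ensures that the vectors used at identification time line up with the ones used to train the classification model.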
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310278012.6A CN103336766B (en) | 2013-07-04 | 2013-07-04 | Short text garbage identification and modeling method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310278012.6A CN103336766B (en) | 2013-07-04 | 2013-07-04 | Short text garbage identification and modeling method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103336766A CN103336766A (en) | 2013-10-02 |
CN103336766B CN103336766B (en) | 2016-12-28 |
Family
ID=49244935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310278012.6A Active CN103336766B (en) | 2013-07-04 | 2013-07-04 | Short text garbage identification and modeling method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103336766B (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104615585B (en) | 2014-01-06 | 2017-07-21 | 腾讯科技(深圳)有限公司 | Handle the method and device of text message |
CN104199811B (en) * | 2014-09-10 | 2017-06-16 | 上海携程商务有限公司 | Short sentence analytic modell analytical model method for building up and system |
CN104408087A (en) * | 2014-11-13 | 2015-03-11 | 百度在线网络技术(北京)有限公司 | Method and system for identifying cheating text |
CN105808602B (en) * | 2014-12-31 | 2020-04-21 | 中国移动通信集团公司 | Method and device for detecting junk information |
CN104722554B (en) * | 2015-02-04 | 2017-07-07 | 无锡荣博能源环保科技有限公司 | Refuse classification equipment, method and application based on chemical element characteristic |
CN104809236B (en) * | 2015-05-11 | 2018-03-27 | 苏州大学 | A kind of age of user sorting technique and system based on microblogging |
CN105117384A (en) * | 2015-08-19 | 2015-12-02 | 小米科技有限责任公司 | Classifier training method, and type identification method and apparatus |
CN105045924A (en) * | 2015-08-26 | 2015-11-11 | 苏州大学张家港工业技术研究院 | Question classification method and system |
CN106649255A (en) * | 2015-11-04 | 2017-05-10 | 江苏引跑网络科技有限公司 | Method for automatically classifying and identifying subject terms of short texts |
CN105589941A (en) * | 2015-12-15 | 2016-05-18 | 北京百分点信息科技有限公司 | Emotional information detection method and apparatus for web text |
CN107180022A (en) * | 2016-03-09 | 2017-09-19 | 阿里巴巴集团控股有限公司 | object classification method and device |
CN105956472B (en) * | 2016-05-12 | 2019-10-18 | 宝利九章(北京)数据技术有限公司 | Identify webpage in whether include hostile content method and system |
CN106446032A (en) * | 2016-08-30 | 2017-02-22 | 江苏博智软件科技有限公司 | Junk information processing method and apparatus |
CN106708961B (en) * | 2016-11-30 | 2020-11-06 | 北京粉笔蓝天科技有限公司 | Method for establishing junk text library, method for filtering junk text library and system |
CN107562728A (en) * | 2017-09-12 | 2018-01-09 | 电子科技大学 | Social media short text filter method based on structure and text message |
CN109726727A (en) * | 2017-10-27 | 2019-05-07 | 中移(杭州)信息技术有限公司 | A kind of data detection method and system |
CN108304442B (en) * | 2017-11-20 | 2021-08-31 | 腾讯科技(深圳)有限公司 | Text information processing method and device and storage medium |
CN107943941B (en) * | 2017-11-23 | 2021-10-15 | 珠海金山网络游戏科技有限公司 | Junk text recognition method and system capable of being updated iteratively |
CN110019681B (en) * | 2017-12-19 | 2022-05-17 | 阿里巴巴(中国)有限公司 | Comment content filtering method and system |
CN108647309B (en) * | 2018-05-09 | 2021-08-10 | 达而观信息科技(上海)有限公司 | Chat content auditing method and system based on sensitive words |
CN108847238B (en) * | 2018-08-06 | 2022-09-16 | 东北大学 | Service robot voice recognition method |
CN110298041B (en) * | 2019-06-24 | 2023-09-05 | 北京奇艺世纪科技有限公司 | Junk text filtering method and device, electronic equipment and storage medium |
CN110442714B (en) * | 2019-07-25 | 2022-05-27 | 北京百度网讯科技有限公司 | POI name normative evaluation method, device, equipment and storage medium |
CN111651598A (en) * | 2020-05-28 | 2020-09-11 | 上海勃池信息技术有限公司 | Spam text auditing device and method through center vector similarity matching |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101784022A (en) * | 2009-01-16 | 2010-07-21 | 北京炎黄新星网络科技有限公司 | Method and system for filtering and classifying short messages |
CN101996241A (en) * | 2010-10-22 | 2011-03-30 | 东南大学 | Bayesian algorithm-based content filtering method |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101784022A (en) * | 2009-01-16 | 2010-07-21 | 北京炎黄新星网络科技有限公司 | Method and system for filtering and classifying short messages |
CN101996241A (en) * | 2010-10-22 | 2011-03-30 | 东南大学 | Bayesian algorithm-based content filtering method |
Non-Patent Citations (2)
Title |
---|
基于内容的短消息智能分析系统研究;侯旭东;《中国优秀硕士学位论文全文数据库信息科技辑》;20110515(第05期);I136-260 * |
基于支持向量机的垃圾短信过滤方法研究;龚垒;《中国优秀硕士学位论文全文数据库信息科技辑》;20110915(第09期);I136-964 * |
Also Published As
Publication number | Publication date |
---|---|
CN103336766A (en) | 2013-10-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103336766B (en) | Short text garbage identification and modeling method and device | |
CN102054015B (en) | System and method of organizing community intelligent information by using organic matter data model | |
CN102227724B (en) | Machine learning for transliteration | |
CN103324745B (en) | Text garbage recognition methods and system based on Bayesian model | |
CN103729474B (en) | Method and system for recognizing forum user vest account | |
CN105787025B (en) | Network platform public account classification method and device | |
CN109145216A (en) | Network public-opinion monitoring method, device and storage medium | |
CN107102993B (en) | User appeal analysis method and device | |
CN110263248A (en) | A kind of information-pushing method, device, storage medium and server | |
CN102096703A (en) | Filtering method and equipment of short messages | |
CN107688630B (en) | Semantic-based weakly supervised microblog multi-emotion dictionary expansion method | |
CN112686022A (en) | Method and device for detecting illegal corpus, computer equipment and storage medium | |
CN103593431A (en) | Internet public opinion analyzing method and device | |
CN109978020A (en) | A kind of social networks account vest identity identification method based on multidimensional characteristic | |
CN113033198B (en) | Similar text pushing method and device, electronic equipment and computer storage medium | |
CN109582788A (en) | Comment spam training and recognition method, device, equipment and readable storage medium | |
CN110096681A (en) | Contract terms analysis method, device, equipment and readable storage medium | |
CN101833579A (en) | Method and system for automatically detecting academic misconduct literature | |
CN110688540B (en) | Cheating account screening method, device, equipment and medium | |
CN114997288A (en) | Design resource association method | |
CN112632982A (en) | Dialogue text emotion analysis method capable of being used for supplier evaluation | |
CN104794209A (en) | Chinese microblog sentiment classification method and system based on Markov logic network | |
CN107688594B (en) | The identifying system and method for risk case based on social information | |
CN112579781A (en) | Text classification method and device, electronic equipment and medium | |
JP5477910B2 (en) | Text search program, device, server and method using search keyword dictionary and dependency keyword dictionary |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |