CN108268459A - A kind of community's speech filtration system based on naive Bayesian - Google Patents
- Publication number
- CN108268459A (application CN201611254036.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- speech
- cutting
- word string
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present invention provides a community speech filtering system based on naive Bayes. The system comprises a word-segmentation unit, a conversion unit, a memory unit, and an output unit. The word-segmentation unit preprocesses a speech document and comprises a forward module, a reverse module, and a t-test module. After segmentation is complete, the conversion unit converts the speech document into a term vector. The memory unit attaches a class label to each term vector for training a naive Bayes classifier. The output unit outputs the speech document. The system effectively avoids the situation in which a user who merely typed a word from the sensitive dictionary by accident is nevertheless blocked, and thus avoids giving the user a poor experience.
Description
Technical field
The present invention relates to the field of filtering systems, and more particularly to a community speech filtering system based on naive Bayes.
Background technology
With the rapid growth of the Internet, everyday life increasingly depends on the network. Against this background, communities built around all kinds of interests have naturally emerged, but in today's communities we often see inappropriate speech, such as personal attacks on others or extreme political statements. If a community lets this phenomenon go unchecked, its development suffers.
Major community platforms today all have some means of dealing with inappropriate speech, but most rely on the passive pattern of user reports, which is very inefficient. Moreover, current community speech filtering typically just checks whether a user's post contains a word from a sensitive dictionary and, if so, replaces the word with asterisks; this check often does not even use a word-segmentation algorithm. Sometimes a user merely typed a word from the sensitive dictionary by accident, in a context that is not inappropriate at all, yet the post is blocked, giving the user a poor experience. There is therefore an urgent need for a system that performs speech filtering conveniently and with high accuracy.
In view of the drawbacks described above, the inventor arrived at the present invention through prolonged research and practice.
Invention content
To solve the above problems, the technical solution adopted by the present invention provides a community speech filtering system based on naive Bayes. The system comprises a word-segmentation unit, a conversion unit, a memory unit, and an output unit. The word-segmentation unit preprocesses a speech document and comprises a forward module, a reverse module, and a t-test module. After the word-segmentation unit completes segmentation, the conversion unit converts the speech document into a term vector. The memory unit attaches a class label to each term vector for training the naive Bayes classifier. The output unit outputs the speech document.
Preferably, the word-segmentation unit uses the bidirectional matching method, which comprises the forward maximum matching method and the reverse maximum matching method.
The forward maximum matching method comprises the following steps:
A1: Take the leftmost m characters of the text as word string S; if the length of S is less than 2, segmentation ends and S is returned.
A2: Look up S in the dictionary; if it is found, the match succeeds: return S and go to A1. Otherwise go to A3.
A3: Remove the rightmost character of S to obtain word string K; if the length of K is less than 2, segmentation ends: return K and go to A1. Otherwise go to A4.
A4: Look up K in the dictionary; if it is found, the match succeeds: return K, take S-K as the new S, and go to A2.
The reverse maximum matching method comprises the following steps:
B1: Take the rightmost m characters of the text as word string S; if the length of S is less than 2, segmentation ends and S is returned.
B2: Look up S in the dictionary; if it is found, the match succeeds: return S and go to B1. Otherwise go to B3.
B3: Remove the leftmost character of S to obtain word string K; if the length of K is less than 2, segmentation ends: return K and go to B1. Otherwise go to B4.
B4: Look up K in the dictionary; if it is found, the match succeeds: return K, take S-K as the new S, and go to B2.
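As a non-authoritative sketch, the A and B steps above amount to greedy longest-match scans from opposite ends of the text. The toy dictionary below uses Latin letters rather than Chinese characters and is purely illustrative; note how the two directions can disagree, which is exactly the ambiguity the t-test module later resolves.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right longest match (steps A1-A4, simplified)."""
    words, i = [], 0
    while i < len(text):
        # Start from the longest candidate window and shrink from the right
        # until a dictionary hit (or a single character remains).
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if length == 1 or chunk in dictionary:
                words.append(chunk)
                i += length
                break
    return words

def reverse_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left longest match (steps B1-B4), shrinking from the left."""
    words, j = [], len(text)
    while j > 0:
        for length in range(min(max_len, j), 0, -1):
            chunk = text[j - length:j]
            if length == 1 or chunk in dictionary:
                words.append(chunk)
                j -= length
                break
    return list(reversed(words))

dictionary = {"ab", "bc", "abc", "cd"}
print(forward_max_match("abcd", dictionary))  # ['abc', 'd']
print(reverse_max_match("abcd", dictionary))  # ['ab', 'cd']
```

The two scans segment "abcd" differently, so this input would be handed to the disambiguation step.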
Preferably, if the forward maximum matching method and the reverse maximum matching method produce different segmentations of a speech document, the t-test difference method is used to resolve the ambiguity. For an ordered character string xyz, the t-test of y relative to x and z is defined as:
t_x,z(y) = (ρ(z∣y) − ρ(y∣x)) / √(σ²(ρ(z∣y)) + σ²(ρ(y∣x)))
where ρ(z∣y), ρ(y∣x) denote the probability of z given y and the probability of y given x respectively, and σ²(ρ(z∣y)), σ²(ρ(y∣x)) denote the corresponding variances. The quantities in the formula above are computed as follows:
ρ(y∣x) ≈ r(x,y)/r(x),  ρ(z∣y) ≈ r(y,z)/r(y),  σ²(ρ(y∣x)) ≈ r(x,y)/r(x)²,  σ²(ρ(z∣y)) ≈ r(y,z)/r(y)²
where r(y,z), r(x,y) denote the frequencies with which the ordered strings yz and xy occur in the corpus, and r(x), r(y) denote the frequencies with which x and y occur.
Therefore the computed form of t_x,z(y) is:
t_x,z(y) = (r(y,z)/r(y) − r(x,y)/r(x)) / √(r(y,z)/r(y)² + r(x,y)/r(x)²)
For an ordered character string wxyz, the t-test difference between x and y is:
Δt(x,y) = t_w,y(x) − t_x,z(y)
The result is classified as follows:
Case one: t_w,y(x) > 0, t_x,z(y) < 0, Δt(x,y) > 0: x and y attract each other, so xy forms one word.
Case two: t_w,y(x) < 0, t_x,z(y) > 0, Δt(x,y) < 0: x and y repel each other, so x and y are separated.
Case three: t_w,y(x) > 0, t_x,z(y) > 0: z attracts y while y attracts x. When Δt(x,y) > 0, xy forms one word; when Δt(x,y) < 0, x and y are separated.
Case four: t_w,y(x) < 0, t_x,z(y) < 0: w attracts x while x attracts y. When Δt(x,y) > 0, xy forms one word; when Δt(x,y) < 0, x and y are separated.
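Observe that in all four cases the decision reduces to the sign of Δt(x, y): positive means bind xy into one word, negative means separate. A minimal Python sketch of this decision might look as follows; the variance approximations match the computed form above, and the counts in the example are invented for illustration.

```python
import math

def t_stat(r_xy, r_x, r_yz, r_y):
    """t_{x,z}(y) = (p(z|y) - p(y|x)) / sqrt(var(p(z|y)) + var(p(y|x))),
    with p(y|x) = r(x,y)/r(x) and variances approximated by r(x,y)/r(x)**2
    and r(y,z)/r(y)**2."""
    p_fwd = r_yz / r_y
    p_bwd = r_xy / r_x
    var = r_yz / r_y**2 + r_xy / r_x**2
    return (p_fwd - p_bwd) / math.sqrt(var)

def bind_pair(t_w_y_of_x, t_x_z_of_y):
    """Four-case rule: every case reduces to the sign of
    delta_t = t_{w,y}(x) - t_{x,z}(y); positive -> xy forms one word."""
    return (t_w_y_of_x - t_x_z_of_y) > 0

# Hypothetical counts for a window w x y z:
# r(w,x)=2, r(w)=10, r(x,y)=8, r(x)=10, r(y,z)=1, r(y)=10.
t_wy_x = t_stat(2, 10, 8, 10)   # x strongly attracted forward to y
t_xz_y = t_stat(8, 10, 1, 10)   # y only weakly attracted forward to z
print(bind_pair(t_wy_x, t_xz_y))  # True: xy should form one word
```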
Preferably, after the word-segmentation unit of the community speech filtering system based on naive Bayes obtains a preliminary segmentation result, it consults the stop-word list: any word that appears in the stop-word list is deleted. The dimensionality is then further reduced according to the information gain (IG).
The information gain formula is:
IG(t) = −Σ_i P(C_i) log P(C_i) + P(t) Σ_i P(C_i∣t) log P(C_i∣t) + P(t̄) Σ_i P(C_i∣t̄) log P(C_i∣t̄)
where P(C_i) denotes the probability that a text of class C_i occurs in the training samples, P(t) denotes the probability that word t occurs in the training samples, P(t̄) denotes the probability that word t does not occur, P(C_i∣t) denotes the probability of belonging to class C_i given that t occurs, and P(C_i∣t̄) denotes the probability of belonging to class C_i given that t does not occur.
Preferably, after the feature words are obtained, the conversion unit converts the speech document into a term vector of the form:
X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words
where X_i is the i-th speech document and x^(j) is the j-th feature word.
Speech documents are divided into two classes, normal speech and improper speech, expressed as:
y ∈ {0, 1}
where y = 0 for normal speech and y = 1 for improper speech.
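One simple reading of the term-vector form is a presence/absence encoding over the n chosen feature words. The sketch below follows that reading; the feature words are hypothetical English stand-ins.

```python
def to_term_vector(doc_words, feature_words):
    """x^(j) = 1 if the j-th feature word occurs in the document, else 0."""
    present = set(doc_words)
    return [1 if w in present else 0 for w in feature_words]

features = ["attack", "politics", "hello", "thanks"]
print(to_term_vector(["hello", "attack"], features))  # [1, 0, 1, 0]
```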
Preferably, the memory unit attaches a class label to each term vector for training the naive Bayes classifier.
The Bayesian estimates of the conditional probability and the prior probability are, respectively:
P_λ(X^(j) = a_jl ∣ Y = c_k) = (Σ_i I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_i I(y_i = c_k) + S_j λ)
P_λ(Y = c_k) = (Σ_i I(y_i = c_k) + λ) / (N + K λ)
where λ ≥ 0 (S_j is the number of values the j-th feature can take, K the number of classes, and N the number of training samples); λ = 0 gives the maximum-likelihood estimate, and λ = 1 gives Laplace smoothing. Here we take λ = 1, i.e. Laplace smoothing, for both the conditional probability and the prior probability.
From the above, the naive Bayes classifier is:
y = arg max_ck P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k) / Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k)
Since the denominator is the same for all c_k, this is equivalent to:
y = arg max_ck P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k)
where y = 0 for normal speech and y = 1 for improper speech.
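A minimal sketch of training and applying such a classifier on binary term vectors with λ = 1 (Laplace smoothing); the tiny training set is invented for illustration.

```python
def train(vectors, labels, lam=1.0):
    """Smoothed priors and per-feature conditionals for binary features."""
    classes = sorted(set(labels))
    n, n_feat = len(vectors), len(vectors[0])
    prior, cond = {}, {}
    for c in classes:
        idx = [i for i, y in enumerate(labels) if y == c]
        # P(Y=c) = (count(c) + lam) / (N + K*lam)
        prior[c] = (len(idx) + lam) / (n + len(classes) * lam)
        # P(x^(j)=1 | Y=c) = (count present in c + lam) / (count(c) + 2*lam),
        # since a binary feature has S_j = 2 possible values.
        cond[c] = [(sum(vectors[i][j] for i in idx) + lam) / (len(idx) + 2 * lam)
                   for j in range(n_feat)]
    return prior, cond

def predict(x, prior, cond):
    """argmax over classes of P(Y=c) * prod_j P(x^(j) | Y=c)."""
    def score(c):
        s = prior[c]
        for j, xj in enumerate(x):
            p1 = cond[c][j]
            s *= p1 if xj == 1 else (1 - p1)
        return s
    return max(prior, key=score)

X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]            # first feature word marks improper speech here
prior, cond = train(X, y)
print(predict([1, 0, 0], prior, cond))  # 1 -> improper speech
```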
Preferably, the community speech filtering system based on naive Bayes judges whether a speech document is normal speech as follows:
D1: The word-segmentation unit preprocesses the speech document to obtain the processed document text, then segments it with the bidirectional matching method, cutting the text into individual words. If the forward maximum matching result and the reverse maximum matching result are identical, that segmentation is used directly; if the two results differ, the t-test difference method is used to resolve the ambiguity.
D2: After segmentation, any word that appears in the stop-word list is deleted. For the remaining words, the information gain of each candidate feature word is computed, the dimensionality is then further reduced according to the information gain (IG), and the N words with the largest information gain are taken as the effective features.
D3: After the effective features are extracted, the speech document is converted into a term vector:
X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of features
where X_i is the i-th speech document and x^(j) is the j-th feature word.
D4: Laplace smoothing is applied to the conditional probability and the prior probability.
D5: The document to be classified is converted into a term vector as in step D3:
X = (x^(1), …, x^(j), …, x^(n)), where n is the number of features
where x^(j) is the j-th feature word; then P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k) is computed for each class.
D6: The class of the document to be classified is determined: if y = 0, the speech document is normal speech; otherwise it is improper speech.
D7: The output unit outputs the speech document.
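The steps D1-D7 can be condensed into one sketch. Everything here is illustrative: whitespace splitting on an English stand-in corpus replaces bidirectional matching, the stop-word list and training documents are invented, and the feature list stands in for the IG-ranked top-N words.

```python
STOPWORDS = {"the", "a", "an", "is", "for"}

def preprocess(text):                       # D1-D2 (heavily simplified)
    return [w for w in text.lower().split() if w not in STOPWORDS]

def vectorize(words, features):             # D3: presence/absence vector
    present = set(words)
    return [1 if f in present else 0 for f in features]

def train_nb(X, y, lam=1.0):                # D4: Laplace-smoothed estimates
    model = {}
    for c in set(y):
        idx = [i for i, yi in enumerate(y) if yi == c]
        prior = (len(idx) + lam) / (len(y) + 2 * lam)   # K = 2 classes here
        cond = [(sum(X[i][j] for i in idx) + lam) / (len(idx) + 2 * lam)
                for j in range(len(X[0]))]
        model[c] = (prior, cond)
    return model

def classify(x, model):                     # D5-D6: argmax P(Y) prod P(x|Y)
    def score(c):
        prior, cond = model[c]
        s = prior
        for xj, p in zip(x, cond):
            s *= p if xj else 1 - p
        return s
    return max(model, key=score)

features = ["idiot", "scam", "thanks", "great"]
docs = [("you are an idiot", 1), ("this is a scam", 1),
        ("thanks for the help", 0), ("great community", 0)]
X = [vectorize(preprocess(t), features) for t, label in docs]
y = [label for _, label in docs]
model = train_nb(X, y)
label = classify(vectorize(preprocess("what a scam"), features), model)
print("improper" if label == 1 else "normal")   # D7: improper
```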
Compared with the prior art, the beneficial effects of the present invention are: 1. the community speech filtering system based on naive Bayes effectively avoids the situation in which a user who merely typed a word from the sensitive dictionary by accident is nevertheless blocked, and thus avoids giving the user a poor experience; 2. the community speech filtering system based on naive Bayes performs speech filtering conveniently and with high accuracy.
Description of the drawings
Fig. 1 is a functional block diagram of a community speech filtering system based on naive Bayes according to the present invention.
Specific embodiment
The foregoing and additional technical features and advantages are described in more detail below in conjunction with the accompanying drawings.
The present invention provides a community speech filtering system based on naive Bayes. As shown in Fig. 1, the system comprises a word-segmentation unit, a conversion unit, a memory unit, and an output unit.
Embodiment one
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the word-segmentation unit preprocesses the speech document: since a speech document consists of a large number of sentences, it must be segmented into words to facilitate subsequent processing.
The word-segmentation unit comprises a forward module, a reverse module, and a t-test module.
The forward module matches several consecutive characters of the document to be segmented against the dictionary from left to right; on a successful match a word is cut out, otherwise the rightmost character is removed and the dictionary is consulted again, until the whole text has been segmented.
The reverse module matches several consecutive characters of the document to be segmented against the dictionary from right to left; on a successful match a word is cut out, otherwise the leftmost character is removed and the dictionary is consulted again, until the whole text has been segmented.
The t-test module resolves the ambiguity when the forward module and the reverse module produce different segmentations of the speech document.
After the word-segmentation unit completes segmentation, the conversion unit converts the speech document into a term vector. Using the multinomial model of naive Bayes, the term vector has the form:
X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words
where X_i is the i-th speech document and x^(j) is the j-th feature word.
We assume documents fall into two classes, normal speech and improper speech:
y ∈ {0, 1}
where y = 0 for normal speech and y = 1 for improper speech.
The memory unit attaches a class label to each term vector for training the naive Bayes classifier. The naive Bayes method makes a conditional-independence assumption about the conditional probability distribution: the features used for classification are assumed to be conditionally independent given the class. This assumption makes the method simple, at the cost of some accuracy.
The term vectors above are labeled manually and used to train the naive Bayes classifier.
From this, the maximum-likelihood estimate of the prior probability is:
P(Y = c_k) = Σ_i I(y_i = c_k) / N
where I is the indicator function, equal to 1 when y_i = c_k and 0 otherwise.
Let the set of possible values of the j-th feature x^(j) be {a_j1, …, a_jS_j}. The maximum-likelihood estimate of the conditional probability P(X^(j) = a_jl ∣ Y = c_k) is:
P(X^(j) = a_jl ∣ Y = c_k) = Σ_i I(x_i^(j) = a_jl, y_i = c_k) / Σ_i I(y_i = c_k)
where x_i^(j) is the j-th feature of the i-th sample, a_jl is the l-th value the j-th feature can take, and I is the indicator function.
It is clear from the formulas above that the maximum-likelihood estimate may assign a probability of 0 to a value it is asked to estimate, which would distort the computed posterior probability and bias the classification. We therefore use Bayesian estimation here. The Bayesian estimates of the conditional probability and the prior probability are, respectively:
P_λ(X^(j) = a_jl ∣ Y = c_k) = (Σ_i I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_i I(y_i = c_k) + S_j λ)
P_λ(Y = c_k) = (Σ_i I(y_i = c_k) + λ) / (N + K λ)
where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate and λ = 1 gives Laplace smoothing. Here we take λ = 1 for both the conditional probability and the prior probability.
From the above, the naive Bayes classifier is:
y = arg max_ck P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k) / Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k)
Since the denominator is the same for all c_k, this is equivalent to:
y = arg max_ck P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k)
y is the corresponding class output; from it we can determine the class of the input term vector and hence whether the input speech document is normal.
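A small numeric illustration of why the maximum-likelihood estimate is replaced by the Bayesian (Laplace-smoothed) estimate: if a feature word never co-occurs with one class in training, the MLE conditional is 0 and wipes out that class's entire posterior product. The counts are invented.

```python
def cond_prob(count_feature_in_class, count_class, lam=0.0, s_j=2):
    """P(x^(j)=a | Y=c); lam=0 gives the MLE, lam=1 Laplace smoothing
    (s_j = number of values the feature can take)."""
    return (count_feature_in_class + lam) / (count_class + s_j * lam)

# Feature word seen in 0 of 50 "normal" training documents:
print(cond_prob(0, 50))                      # 0.0 -> posterior collapses to 0
print(round(cond_prob(0, 50, lam=1.0), 4))   # 0.0192 -> stays positive
```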
The output unit outputs the speech document.
Embodiment two
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the word-segmentation unit uses the bidirectional matching method, which comprises the forward maximum matching method and the reverse maximum matching method. When the segmentations produced by the two methods agree, that segmentation is adopted. When they disagree, a segmentation ambiguity is deemed to have arisen, and the t-test difference method is used to improve segmentation accuracy.
The forward maximum matching method comprises the following steps:
A1: Take the leftmost m characters of the text as word string S; if the length of S is less than 2, segmentation ends and S is returned.
A2: Look up S in the dictionary; if it is found, the match succeeds: return S and go to A1. Otherwise go to A3.
A3: Remove the rightmost character of S to obtain word string K; if the length of K is less than 2, segmentation ends: return K and go to A1. Otherwise go to A4.
A4: Look up K in the dictionary; if it is found, the match succeeds: return K, take S-K as the new S, and go to A2.
The reverse maximum matching method comprises the following steps:
B1: Take the rightmost m characters of the text as word string S; if the length of S is less than 2, segmentation ends and S is returned.
B2: Look up S in the dictionary; if it is found, the match succeeds: return S and go to B1. Otherwise go to B3.
B3: Remove the leftmost character of S to obtain word string K; if the length of K is less than 2, segmentation ends: return K and go to B1. Otherwise go to B4.
B4: Look up K in the dictionary; if it is found, the match succeeds: return K, take S-K as the new S, and go to B2.
If the forward maximum matching method and the reverse maximum matching method produce different segmentations of the speech document, the t-test difference method is used to resolve the ambiguity. For an ordered character string xyz, the t-test of y relative to x and z is defined as:
t_x,z(y) = (ρ(z∣y) − ρ(y∣x)) / √(σ²(ρ(z∣y)) + σ²(ρ(y∣x)))
where ρ(z∣y), ρ(y∣x) denote the probability of z given y and the probability of y given x respectively, and σ²(ρ(z∣y)), σ²(ρ(y∣x)) denote the corresponding variances. The quantities in the formula above are computed as follows:
ρ(y∣x) ≈ r(x,y)/r(x),  ρ(z∣y) ≈ r(y,z)/r(y),  σ²(ρ(y∣x)) ≈ r(x,y)/r(x)²,  σ²(ρ(z∣y)) ≈ r(y,z)/r(y)²
where r(y,z), r(x,y) denote the frequencies with which the ordered strings yz and xy occur in the corpus, and r(x), r(y) denote the frequencies with which x and y occur.
Therefore the computed form of t_x,z(y) is:
t_x,z(y) = (r(y,z)/r(y) − r(x,y)/r(x)) / √(r(y,z)/r(y)² + r(x,y)/r(x)²)
For an ordered character string wxyz, the t-test difference between x and y is:
Δt(x,y) = t_w,y(x) − t_x,z(y)
The result is classified as follows:
Case one: t_w,y(x) > 0, t_x,z(y) < 0, Δt(x,y) > 0: x and y attract each other, so xy forms one word.
Case two: t_w,y(x) < 0, t_x,z(y) > 0, Δt(x,y) < 0: x and y repel each other, so x and y are separated.
Case three: t_w,y(x) > 0, t_x,z(y) > 0: z attracts y while y attracts x. When Δt(x,y) > 0, xy forms one word; when Δt(x,y) < 0, x and y are separated.
Case four: t_w,y(x) < 0, t_x,z(y) < 0: w attracts x while x attracts y. When Δt(x,y) > 0, xy forms one word; when Δt(x,y) < 0, x and y are separated.
When the forward maximum matching method and the reverse maximum matching method produce different segmentations of the speech document, the word-segmentation unit applies the t-test difference method and judges according to the four cases above, after which a preliminary segmentation result is obtained.
Embodiment three
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that after the word-segmentation unit obtains a preliminary segmentation result, any word that appears in the stop-word list is deleted, and the dimensionality is then further reduced according to the information gain (IG). The information gain measures how much the expected entropy decreases when the sample space is partitioned by an attribute t: the larger IG(t) is, the more t contributes to the classification as a whole. Building on embodiment two, this embodiment removes stop words from the speech document, computes IG(t) for each remaining word, and takes the N words with the largest IG values as the finally chosen feature words.
The information gain formula is:
IG(t) = −Σ_i P(C_i) log P(C_i) + P(t) Σ_i P(C_i∣t) log P(C_i∣t) + P(t̄) Σ_i P(C_i∣t̄) log P(C_i∣t̄)
where P(C_i) denotes the probability that a text of class C_i occurs in the training samples, P(t) denotes the probability that word t occurs in the training samples, P(t̄) denotes the probability that word t does not occur, P(C_i∣t) denotes the probability of belonging to class C_i given that t occurs, and P(C_i∣t̄) denotes the probability of belonging to class C_i given that t does not occur.
Embodiment four
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that after the feature words are obtained, the conversion unit converts the speech document into a term vector.
The term vector has the form:
X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words
where X_i is the i-th speech document and x^(j) is the j-th feature word.
The term vector uses the multinomial model of naive Bayes.
Documents are divided into two classes, normal speech and improper speech, expressed as:
y ∈ {0, 1}
where y = 0 for normal speech and y = 1 for improper speech.
Embodiment five
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the memory unit attaches a class label to each term vector for training the naive Bayes classifier. The naive Bayes method makes a conditional-independence assumption about the conditional probability distribution: the features used for classification are assumed to be conditionally independent given the class. This assumption makes the method simple, at the cost of some accuracy.
The term vectors above are labeled manually and used to train the naive Bayes classifier.
From this, the maximum-likelihood estimate of the prior probability is:
P(Y = c_k) = Σ_i I(y_i = c_k) / N
where I is the indicator function, equal to 1 when y_i = c_k and 0 otherwise.
Let the set of possible values of the j-th feature x^(j) be {a_j1, …, a_jS_j}. The maximum-likelihood estimate of the conditional probability P(X^(j) = a_jl ∣ Y = c_k) is:
P(X^(j) = a_jl ∣ Y = c_k) = Σ_i I(x_i^(j) = a_jl, y_i = c_k) / Σ_i I(y_i = c_k)
where x_i^(j) is the j-th feature of the i-th sample and a_jl is the l-th value the j-th feature can take.
It is clear from the formulas above that the maximum-likelihood estimate may assign a probability of 0 to a value it is asked to estimate, which would distort the computed posterior probability and bias the classification. To avoid this, we use Bayesian estimation here. The Bayesian estimates of the conditional probability and the prior probability are, respectively:
P_λ(X^(j) = a_jl ∣ Y = c_k) = (Σ_i I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_i I(y_i = c_k) + S_j λ)
P_λ(Y = c_k) = (Σ_i I(y_i = c_k) + λ) / (N + K λ)
where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate and λ = 1 gives Laplace smoothing. Here we take λ = 1 for both the conditional probability and the prior probability.
From the above, the naive Bayes classifier is:
y = arg max_ck P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k) / Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k)
Since the denominator is the same for all c_k, this is equivalent to:
y = arg max_ck P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k)
y is the corresponding class output; from it we can determine the class of the input term vector and hence whether the input speech document is normal, where y = 0 for normal speech and y = 1 for improper speech.
Embodiment six
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the specific steps by which the system judges whether a speech document is normal speech are as follows:
D1: The word-segmentation unit preprocesses the speech document. Documents fall into two classes, normal speech and improper speech:
y ∈ {0, 1}
where normal speech is labeled y = 0 and improper speech y = 1.
The processed document text is obtained and then segmented by the word-segmentation unit with the bidirectional matching method, cutting the text into individual words. If the forward maximum matching result and the reverse maximum matching result are identical, that segmentation is used directly; if the two results differ, the t-test difference method is used to resolve the ambiguity.
D2: After segmentation, any word that appears in the stop-word list is deleted. For the remaining words, the information gain of each candidate feature word is computed, the dimensionality is then further reduced according to the information gain (IG), and the N words with the largest information gain are taken as the effective features.
D3: After the effective features are extracted, the speech document is converted into a term vector:
X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of features
where X_i is the i-th speech document and x^(j) is the j-th feature word.
After all speech documents are processed, the training samples are obtained.
D4: Laplace smoothing is applied to the conditional probability and the prior probability.
D5: The document to be classified is converted into a term vector as in step D3:
X = (x^(1), …, x^(j), …, x^(n)), where n is the number of features
where x^(j) is the j-th feature word; then P(Y = c_k) Π_j P(X^(j) = x^(j) ∣ Y = c_k) is computed for each class.
D6: The class of the document to be classified is determined: if y = 0, the speech document to be classified is normal speech; otherwise it is improper speech.
D7: The output unit outputs the speech document.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and additions without departing from the method of the present invention, and such improvements and additions shall also be regarded as falling within the protection scope of the present invention.
Claims (7)
- 1. A community speech filtering system based on naive Bayes, characterized in that the community speech filtering system based on naive Bayes comprises a word-segmentation unit, a conversion unit, a memory unit, and an output unit; the word-segmentation unit is used to preprocess a speech document and comprises a forward module, a reverse module, and a t-test module; the conversion unit is used to convert the speech document into a term vector after the word-segmentation unit completes segmentation; the memory unit is used to attach a class label to each term vector for training a naive Bayes classifier; and the output unit is used to output the speech document.
- 2. The community speech filtering system based on naive Bayes according to claim 1, characterized in that the word-segmentation unit uses the bidirectional matching method, which comprises the forward maximum matching method and the reverse maximum matching method; the forward maximum matching method comprises the following steps: A1: take the leftmost m characters of the text as word string S; if the length of S is less than 2, segmentation ends and S is returned; A2: look up S in the dictionary; if it is found, the match succeeds: return S and go to A1; otherwise go to A3; A3: remove the rightmost character of S to obtain word string K; if the length of K is less than 2, segmentation ends: return K and go to A1; otherwise go to A4; A4: look up K in the dictionary; if it is found, the match succeeds: return K, take S-K as the new S, and go to A2; the reverse maximum matching method comprises the following steps: B1: take the rightmost m characters of the text as word string S; if the length of S is less than 2, segmentation ends and S is returned; B2: look up S in the dictionary; if it is found, the match succeeds: return S and go to B1; otherwise go to B3; B3: remove the leftmost character of S to obtain word string K; if the length of K is less than 2, segmentation ends: return K and go to B1; otherwise go to B4; B4: look up K in the dictionary; if it is found, the match succeeds: return K, take S-K as the new S, and go to B2.
- 3. The community speech filtering system based on naive Bayes according to claim 2, characterized in that if the forward maximum matching method and the reverse maximum matching method produce different segmentations of the speech document, the t-test difference method is used to resolve the ambiguity; for an ordered character string xyz, the t-test of y relative to x and z is defined as: t_x,z(y) = (ρ(z∣y) − ρ(y∣x)) / √(σ²(ρ(z∣y)) + σ²(ρ(y∣x))), where ρ(z∣y), ρ(y∣x) denote the probability of z given y and the probability of y given x respectively, and σ²(ρ(z∣y)), σ²(ρ(y∣x)) denote the corresponding variances, computed as ρ(y∣x) ≈ r(x,y)/r(x), ρ(z∣y) ≈ r(y,z)/r(y), σ²(ρ(y∣x)) ≈ r(x,y)/r(x)², σ²(ρ(z∣y)) ≈ r(y,z)/r(y)², where r(y,z), r(x,y) denote the frequencies with which the ordered strings yz and xy occur and r(x), r(y) denote the frequencies of x and y; therefore t_x,z(y) = (r(y,z)/r(y) − r(x,y)/r(x)) / √(r(y,z)/r(y)² + r(x,y)/r(x)²); for an ordered character string wxyz, the t-test difference between x and y is Δt(x,y) = t_w,y(x) − t_x,z(y); the result is classified as follows: case one: t_w,y(x) > 0, t_x,z(y) < 0, Δt(x,y) > 0: x and y attract each other, so xy forms one word; case two: t_w,y(x) < 0, t_x,z(y) > 0, Δt(x,y) < 0: x and y repel each other, so x and y are separated; case three: t_w,y(x) > 0, t_x,z(y) > 0: z attracts y while y attracts x; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, x and y are separated; case four: t_w,y(x) < 0, t_x,z(y) < 0: w attracts x while x attracts y; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, x and y are separated.
- 4. The community speech filtration system based on naive Bayesian according to claim 3, wherein, after the cutting word unit of the community speech filtration system based on naive Bayesian obtains a preliminary word-cutting result, any word appearing in a stop-word list is deleted according to that list, and the dimensionality is then further reduced according to information gain (IG). The information gain formula is as follows:
  IG(t) = −Σᵢ P(Cᵢ) log P(Cᵢ) + P(t) Σᵢ P(Cᵢ|t) log P(Cᵢ|t) + P(t̄) Σᵢ P(Cᵢ|t̄) log P(Cᵢ|t̄)
  where P(Cᵢ) denotes the probability that a text of class Cᵢ occurs in the training sample, P(t) denotes the probability that word t occurs in the training sample, P(t̄) denotes the probability that word t does not occur in the training sample, P(Cᵢ|t) denotes the probability of belonging to class Cᵢ given that word t occurs, and P(Cᵢ|t̄) denotes the probability of belonging to class Cᵢ given that word t does not occur.
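The information-gain formula is equivalent to H(C) − H(C|t), the drop in class entropy when the presence of word t is known, and can be sketched as follows; the toy document collection in the test is illustrative only.

```python
import math

# Sketch of information gain IG(t) = H(C) - H(C|t) over a labelled
# document collection; `docs` are sets of words, `labels` class labels.
def information_gain(docs, labels, term):
    n = len(docs)
    classes = set(labels)

    def entropy(subset):
        h = 0.0
        for c in classes:
            p = sum(1 for y in subset if y == c) / len(subset)
            if p > 0:
                h -= p * math.log2(p)
        return h

    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    h_cond = 0.0
    for part in (with_t, without_t):
        if part:                                # weight by P(t) / P(not t)
            h_cond += len(part) / n * entropy(part)
    return entropy(labels) - h_cond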
- 5. The community speech filtration system based on naive Bayesian according to claim 4, wherein the converting unit is used, after the feature words are obtained, to convert a speech document into a term vector of the following form:
  xᵢ = (x⁽¹⁾, …, x⁽ʲ⁾, …, x⁽ⁿ⁾), where n is the number of feature words
  and where xᵢ is the i-th speech document and x⁽ʲ⁾ is the j-th feature word. Speech documents are divided into two classes, normal speech and improper speech, represented by the formula:
  y ∈ {0, 1}
  where normal speech takes the value y = 0 and improper speech takes the value y = 1.
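A minimal sketch of the conversion, assuming a binary bag-of-words encoding (the claim does not fix whether components are presence indicators or counts; presence is assumed here) and a hypothetical feature list:

```python
# Sketch: convert a segmented speech document into a term vector
# x_i = (x^(1), ..., x^(n)) over the selected feature words.
def to_term_vector(words, features):
    """1 if feature word j occurs in the document, else 0 (assumption)."""
    present = set(words)
    return [1 if f in present else 0 for f in features]
```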
- 6. The community speech filtration system based on naive Bayesian according to claim 5, wherein the mnemon (memory unit) is used to attach class labels to the term vectors, for training the naive Bayes classifier.
  The Bayesian estimates of the conditional probability and the prior probability are, respectively:
  P_λ(X⁽ʲ⁾ = a_{jl} | Y = c_k) = (Σᵢ I(xᵢ⁽ʲ⁾ = a_{jl}, yᵢ = c_k) + λ) / (Σᵢ I(yᵢ = c_k) + S_j λ)
  P_λ(Y = c_k) = (Σᵢ I(yᵢ = c_k) + λ) / (N + Kλ)
  where I(·) is the indicator function, N is the number of training samples, S_j is the number of possible values of the j-th feature, K is the number of classes, and λ ≥ 0. When λ = 0 these are the maximum-likelihood estimates; when λ = 1 they are Laplace smoothed. Here both the conditional probability and the prior probability are taken with Laplace smoothing (λ = 1). From the above, the naive Bayes classifier is:
  y = argmax_{c_k} [P(Y = c_k) Π_j P(X⁽ʲ⁾ = x⁽ʲ⁾ | Y = c_k)] / [Σ_k P(Y = c_k) Π_j P(X⁽ʲ⁾ = x⁽ʲ⁾ | Y = c_k)]
  Since the denominator is the same for all c_k, this is equivalent to:
  y = argmax_{c_k} P(Y = c_k) Π_j P(X⁽ʲ⁾ = x⁽ʲ⁾ | Y = c_k)
  where normal speech takes the value y = 0 and improper speech takes the value y = 1.
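The Laplace-smoothed estimates and the argmax decision rule can be sketched for binary features (S_j = 2, matching the presence/absence term vectors of claim 5); the training data in the test is a made-up illustration.

```python
import math

# Sketch of a Bernoulli-style naive Bayes with Laplace smoothing (lambda=1).
# X: list of binary term vectors, y: class labels in {0, ..., n_classes-1}.
def train_nb(X, y, n_classes=2, lam=1.0):
    n, d = len(X), len(X[0])
    prior = [0.0] * n_classes
    cond = [[0.0] * d for _ in range(n_classes)]
    for c in range(n_classes):
        idx = [i for i in range(n) if y[i] == c]
        prior[c] = (len(idx) + lam) / (n + n_classes * lam)   # P_lam(Y=c)
        for j in range(d):
            ones = sum(X[i][j] for i in idx)
            cond[c][j] = (ones + lam) / (len(idx) + 2 * lam)  # S_j = 2
    return prior, cond

def classify(x, prior, cond):
    """argmax over classes of log P(Y=c) + sum_j log P(X^(j)=x^(j)|Y=c)."""
    best, best_lp = 0, -math.inf
    for c in range(len(prior)):
        lp = math.log(prior[c])
        for j, v in enumerate(x):
            p = cond[c][j]
            lp += math.log(p if v else 1.0 - p)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

Working in log space avoids underflow from multiplying many small probabilities, and Laplace smoothing guarantees no conditional probability is exactly zero, so `math.log` is always defined.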
- 7. The community speech filtration system based on naive Bayesian according to claim 1, wherein the community speech filtration system based on naive Bayesian judges whether a speech document is normal speech by the following specific steps:
  D1: The speech document is pre-processed by the cutting word unit to obtain the processed speech document text, and the cutting word unit cuts the speech document text into individual words using the bidirectional matching method. If the forward maximum matching result is consistent with the reverse maximum matching result, the segmentation result is used directly; if the two results are inconsistent, the t-test difference is used to disambiguate.
  D2: After word cutting, any word appearing in the stop-word list is deleted according to that list; for the remaining words, which do not appear in the stop-word list, the information gain of each candidate feature word is computed, the dimensionality is further reduced according to information gain (IG), and the N words with the largest information gain values are taken as the effective features.
  D3: After the effective features are extracted, the speech document is converted into a term vector: xᵢ = (x⁽¹⁾, …, x⁽ʲ⁾, …, x⁽ⁿ⁾), where n is the number of features, xᵢ is the i-th speech document, and x⁽ʲ⁾ is the j-th feature word.
  D4: Laplace smoothing is added to the conditional probability and the prior probability.
  D5: The document to be classified is converted into a term vector x = (x⁽¹⁾, …, x⁽ʲ⁾, …, x⁽ⁿ⁾) using step D3, and P(Y = c_k) Π_j P(X⁽ʲ⁾ = x⁽ʲ⁾ | Y = c_k) is calculated for each class.
  D6: The class of the document to be classified is determined as the c_k maximizing the quantity in D5; if y = 0, the speech document to be classified is normal speech, otherwise it is improper speech.
  D7: The speech document is output by the output unit.
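Steps D1-D7 can be wired together in a compact, self-contained sketch. Everything here is a simplifying assumption, not the patented implementation: whitespace splitting stands in for the bidirectional matcher, the stop-word list and two feature words are invented, and the naive Bayes model is trained on a four-document toy corpus.

```python
import math

# End-to-end sketch of claim 7's pipeline under toy assumptions.
STOPWORDS = {"a", "the"}                 # hypothetical stop-word list (D2)
FEATURES = ["buy", "free"]               # hypothetical high-IG features (D2)
TRAIN_X = [[1, 1], [1, 0], [0, 0], [0, 1]]
TRAIN_Y = [1, 1, 0, 0]                   # 1 = improper speech, 0 = normal

def vectorize(text):
    """D1-D3: segment (whitespace stand-in), drop stop words, vectorize."""
    words = {w for w in text.split() if w not in STOPWORDS}
    return [1 if f in words else 0 for f in FEATURES]

def posterior(x, c, lam=1.0):
    """D4-D5: log P(Y=c) + sum_j log P(X^(j)=x^(j)|Y=c), Laplace-smoothed."""
    idx = [i for i, y in enumerate(TRAIN_Y) if y == c]
    lp = math.log((len(idx) + lam) / (len(TRAIN_Y) + 2 * lam))
    for j, v in enumerate(x):
        p = (sum(TRAIN_X[i][j] for i in idx) + lam) / (len(idx) + 2 * lam)
        lp += math.log(p if v else 1.0 - p)
    return lp

def filter_speech(text):
    """D6-D7: classify and decide whether the document may be output."""
    x = vectorize(text)
    return "blocked" if posterior(x, 1) > posterior(x, 0) else "output"
```

Unlike a plain sensitive-word filter, the decision here weighs all feature evidence jointly, which is the property the abstract claims avoids wrongly shielding a user who merely typed a word from the sensitive dictionary.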
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611254036.8A CN108268459A (en) | 2016-12-30 | 2016-12-30 | A kind of community's speech filtration system based on naive Bayesian |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108268459A true CN108268459A (en) | 2018-07-10 |
Family
ID=62754338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611254036.8A Pending CN108268459A (en) | 2016-12-30 | 2016-12-30 | A kind of community's speech filtration system based on naive Bayesian |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108268459A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113807642A (en) * | 2021-06-25 | 2021-12-17 | 国网浙江省电力有限公司金华供电公司 | Power dispatching intelligent interaction method based on program-controlled telephone |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101996241A (en) * | 2010-10-22 | 2011-03-30 | 东南大学 | Bayesian algorithm-based content filtering method |
CN103634473A (en) * | 2013-12-05 | 2014-03-12 | 南京理工大学连云港研究院 | Naive Bayesian classification based mobile phone spam short message filtering method and system |
CN105975454A (en) * | 2016-04-21 | 2016-09-28 | 广州精点计算机科技有限公司 | Chinese word segmentation method and device of webpage text |
Non-Patent Citations (6)
Title |
---|
JIHITE: "The Naive Bayes Method (II): The Basic Method", 《HTTPS://WWW.CNBLOGS.COM/KAITUORENSHENG/P/3379478.HTML》 * |
单月光 (Shan Yueguang): "Research and Implementation of Key Technologies for Microblog-Based Internet Public Opinion Analysis", China Masters' Theses Full-text Database, Information Science and Technology Series * |
徐英慧等 (Xu Yinghui et al.): "Research on Content-Based Spam SMS Filtering Strategies for Mobile Phones", Journal of Beijing Information Science and Technology University * |
曹卫峰 (Cao Weifeng): "Research on Key Technologies of Chinese Word Segmentation", China Masters' Theses Full-text Database, Information Science and Technology Series * |
王思力 (Wang Sili): "Research on Chinese Word Segmentation Technology for Large-Scale Information Retrieval", China Masters' Theses Full-text Database, Information Science and Technology Series * |
马刚 (Ma Gang): "Semantic-Based Web Data Mining", 31 January 2014 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020140372A1 (en) | Recognition model-based intention recognition method, recognition device, and medium | |
CN106021362B (en) | Generation, image searching method and the device that the picture feature of query formulation represents | |
CN105808526B (en) | Commodity short text core word extracting method and device | |
CN106599054B (en) | Method and system for classifying and pushing questions | |
US9189748B2 (en) | Information extraction system, method, and program | |
CN111027324A (en) | Method for extracting open type relation based on syntax mode and machine learning | |
WO2017107566A1 (en) | Retrieval method and system based on word vector similarity | |
Sirsat et al. | Strength and accuracy analysis of affix removal stemming algorithms | |
Hussain et al. | Using linguistic knowledge to classify non-functional requirements in SRS documents | |
CN106503153B (en) | A kind of computer version classification system | |
CN105224520B (en) | A kind of Chinese patent document term automatic identifying method | |
CN106708940A (en) | Method and device used for processing pictures | |
Nguyen et al. | Joint distant and direct supervision for relation extraction | |
CN108363694B (en) | Keyword extraction method and device | |
Agarwal et al. | Frame semantic tree kernels for social network extraction from text | |
Takale et al. | Measuring semantic similarity between words using web documents | |
CN113626604B (en) | Web page text classification system based on maximum interval criterion | |
CN108462624A (en) | A kind of recognition methods of spam, device and electronic equipment | |
CN106294689B (en) | A kind of method and apparatus for selecting to carry out dimensionality reduction based on text category feature | |
CN108268459A (en) | A kind of community's speech filtration system based on naive Bayesian | |
Coenen et al. | Statistical identification of key phrases for text classification | |
CN112818693A (en) | Automatic extraction method and system for electronic component model words | |
CN108073567A (en) | A kind of Feature Words extraction process method, system and server | |
CN110472031A (en) | A kind of regular expression preparation method, device, electronic equipment and storage medium | |
CN107590163B (en) | The methods, devices and systems of text feature selection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180710 ||