CN108268459A - A community speech filtering system based on naive Bayes - Google Patents

A community speech filtering system based on naive Bayes

Info

Publication number
CN108268459A
CN108268459A (application CN201611254036.8A)
Authority
CN
China
Prior art keywords: word, speech, cutting, word string, document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611254036.8A
Other languages
Chinese (zh)
Inventor
麻建
吴剑文
何伟潮
单小红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Fine Point Data Polytron Technologies Inc
Original Assignee
Guangdong Fine Point Data Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Fine Point Data Polytron Technologies Inc
Priority to CN201611254036.8A
Publication of CN108268459A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The present invention provides a community speech filtering system based on naive Bayes. The system comprises a word-segmentation unit, a conversion unit, a memory unit, and an output unit. The word-segmentation unit pre-processes speech documents and comprises a forward module, a reverse module, and a t-test module. The conversion unit converts a speech document into a term vector after the word-segmentation unit completes segmentation. The memory unit attaches class labels to the term vectors for training a naive Bayes classifier. The output unit outputs the speech document. Using this system, the situation where a user who merely happened to type a word from the sensitive dictionary is blocked can be effectively avoided, preventing a bad user experience.

Description

A community speech filtering system based on naive Bayes
Technical field
The present invention relates to the field of filtering systems, and more particularly to a community speech filtering system based on naive Bayes.
Background art
With the rapid development of the Internet, people's daily lives have become inseparable from the network. Against this background, communities built around all kinds of interests have naturally emerged. In today's communities, however, we often see inappropriate speech of various kinds, such as personal attacks on other users and extreme political statements. A community that leaves this phenomenon unchecked harms its own development.
Nowadays the major community platforms all have some means of dealing with inappropriate speech, but most rely on the passive pattern of user reports, which is very inefficient. Moreover, current community speech filtering typically just checks whether a user's post contains any word from a sensitive-word dictionary and, if so, replaces that word with asterisks; this process does not even apply a word-segmentation algorithm. Sometimes a user merely happens to type a word that is in the sensitive dictionary, in a context that is not improper at all, yet the post is blocked, giving the user a bad experience. There is therefore an urgent need for a convenient, highly accurate speech-filtering system.
In view of the above drawbacks, the inventors arrived at the present invention after prolonged research and practice.
Summary of the invention
To solve the above problems, the technical solution adopted by the present invention provides a community speech filtering system based on naive Bayes. The system includes a word-segmentation unit, a conversion unit, a memory unit, and an output unit. The word-segmentation unit pre-processes speech documents and includes a forward module, a reverse module, and a t-test module. The conversion unit converts a speech document into a term vector after the word-segmentation unit completes segmentation. The memory unit attaches class labels to the term vectors for training a naive Bayes classifier. The output unit outputs the speech document.
Preferably, the segmentation method of the word-segmentation unit is bidirectional matching, which combines forward maximum matching and reverse maximum matching (a code sketch of both procedures follows the steps below);
The forward maximum matching method includes the following steps:
A1: Take the next m characters of the text, from left to right, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
A2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to A1; otherwise go to A3;
A3: Remove the rightmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to A1; otherwise go to A4;
A4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step A2;
The reverse maximum matching method includes the following steps:
B1: Take the next m characters of the text, from right to left, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
B2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to B1; otherwise go to B3;
B3: Remove the leftmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to B1; otherwise go to B4;
B4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step B2.
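The following minimal Python sketch illustrates the two matching procedures above. The dictionary lookup, the maximum window length m, and the function names are illustrative assumptions, not part of the patent:

```python
def forward_max_match(text, dictionary, m=5):
    """Forward maximum matching (steps A1-A4): scan left to right,
    shrinking the candidate string from the right until it matches."""
    words, i = [], 0
    while i < len(text):
        k = min(m, len(text) - i)          # A1: take up to m characters
        while k > 1 and text[i:i + k] not in dictionary:
            k -= 1                         # A3: drop the rightmost character
        words.append(text[i:i + k])        # A2/A4: match (or single character)
        i += k
    return words


def reverse_max_match(text, dictionary, m=5):
    """Reverse maximum matching (steps B1-B4): scan right to left,
    shrinking the candidate string from the left until it matches."""
    words, j = [], len(text)
    while j > 0:
        k = min(m, j)                      # B1: take up to m characters
        while k > 1 and text[j - k:j] not in dictionary:
            k -= 1                         # B3: drop the leftmost character
        words.insert(0, text[j - k:j])     # B2/B4: match (or single character)
        j -= k
    return words
```

When the two functions return the same word list, that segmentation is used directly; when they disagree, the t-test difference method described next arbitrates.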
Preferably, if the forward maximum matching method and the reverse maximum matching method produce different segmentations of a speech document, the t-test difference method is used to resolve the ambiguity. For an ordered character string xyz, the t-test of y relative to x and z is defined as:

t_{x,z}(y) = (ρ(z|y) - ρ(y|x)) / sqrt(σ²(ρ(z|y)) + σ²(ρ(y|x)))

where ρ(z|y) and ρ(y|x) denote the probability of z given y and the probability of y given x respectively, and σ²(ρ(z|y)), σ²(ρ(y|x)) denote their respective variances. These quantities are estimated as follows:

ρ(z|y) ≈ r(y,z) / r(y),  ρ(y|x) ≈ r(x,y) / r(x)
σ²(ρ(z|y)) ≈ r(y,z) / r(y)²,  σ²(ρ(y|x)) ≈ r(x,y) / r(x)²

where r(y,z) and r(x,y) denote the frequencies with which the ordered strings yz and xy occur in the dictionary, and r(x), r(y) denote the frequencies with which x and y occur in the dictionary.
Therefore the calculation formula for t_{x,z}(y) is:

t_{x,z}(y) = (r(y,z)/r(y) - r(x,y)/r(x)) / sqrt(r(y,z)/r(y)² + r(x,y)/r(x)²)

For an ordered string wxyz, the t-test difference between x and y is:

Δt(x,y) = t_{w,y}(x) - t_{x,z}(y)

The result is handled by cases (a code sketch follows the four cases):
Case 1: t_{w,y}(x) > 0, t_{x,z}(y) < 0, Δt(x,y) > 0: x and y attract each other, so xy forms one word;
Case 2: t_{w,y}(x) < 0, t_{x,z}(y) > 0, Δt(x,y) < 0: x and y repel each other, so xy is split;
Case 3: t_{w,y}(x) > 0, t_{x,z}(y) > 0: z attracts y while y attracts x; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split;
Case 4: t_{w,y}(x) < 0, t_{x,z}(y) < 0: w attracts x while x attracts y; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split.
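A sketch of the t-test computation under the estimates above, in Python. The unigram and bigram frequency tables r1 and r2 are assumed to be precomputed counts; all names are illustrative:

```python
import math

def t_score(left, mid, right, r1, r2):
    """t_{left,right}(mid): difference of the two conditional-probability
    estimates divided by the square root of the sum of their variances."""
    p_right = r2.get(mid + right, 0) / max(r1.get(mid, 0), 1)      # rho(z|y)
    p_mid = r2.get(left + mid, 0) / max(r1.get(left, 0), 1)        # rho(y|x)
    var_right = r2.get(mid + right, 0) / max(r1.get(mid, 0), 1) ** 2
    var_mid = r2.get(left + mid, 0) / max(r1.get(left, 0), 1) ** 2
    denom = math.sqrt(var_right + var_mid)
    return (p_right - p_mid) / denom if denom else 0.0

def bind_xy(w, x, y, z, r1, r2):
    """For the window wxyz, decide whether xy forms one word. All four
    cases above reduce to the sign of delta-t: bind xy when delta-t > 0."""
    delta = t_score(w, x, y, r1, r2) - t_score(x, y, z, r1, r2)
    return delta > 0
```

Note that the four cases collapse to a single test on the sign of Δt(x,y), which makes the rule convenient to apply mechanically.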
Preferably, after the word-segmentation unit of the system obtains a preliminary segmentation, every word that appears in the stop-word list is deleted according to that list, and the dimensionality is then further reduced according to information gain (IG; a code sketch follows);
The information gain formula is as follows:

IG(t) = -Σ_i P(C_i) log P(C_i) + P(t) Σ_i P(C_i|t) log P(C_i|t) + P(¬t) Σ_i P(C_i|¬t) log P(C_i|¬t)

where P(C_i) denotes the probability that a text of class C_i occurs in the training sample, P(t) the probability that word t occurs in the training sample, P(¬t) the probability that word t does not occur, P(C_i|t) the probability of belonging to class C_i given that word t occurs, and P(C_i|¬t) the probability of belonging to class C_i given that word t does not occur.
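A sketch of the information-gain computation, following the formula above: the first term is the class entropy H(C) and the remaining terms are the conditional entropies given presence or absence of t. The document representation and names are assumptions:

```python
import math

def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]: the expected
    entropy drop from splitting the training sample on whether `term`
    occurs. docs: list of word sets; labels: parallel list of class ids."""
    def entropy(ys):
        n = len(ys)
        if n == 0:
            return 0.0
        return -sum((ys.count(c) / n) * math.log2(ys.count(c) / n)
                    for c in set(ys))

    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    p_t = len(with_t) / len(docs)
    return entropy(labels) - (p_t * entropy(with_t)
                              + (1 - p_t) * entropy(without_t))
```

Ranking all candidate words by IG(t) and keeping the top N implements the feature selection described in the embodiments.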
Preferably, after the feature words are obtained, the conversion unit converts the speech document into a term vector of the following form (a conversion sketch follows):

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words,

and X_i is the i-th speech document and x^(j) is the j-th feature word;
Speech documents are divided into two classes, normal speech and improper speech, expressed as:

y ∈ Y = {0, 1}

where y = 0 for normal speech and y = 1 for improper speech.
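A minimal sketch of the conversion, using binary presence features, one of the simplest instantiations of the discrete feature values a_jl used later; the embodiments mention the multinomial model, under which occurrence counts would be used instead. The helper name is an assumption:

```python
def to_term_vector(doc_words, feature_words):
    """Map a segmented, stop-word-filtered document onto the fixed
    feature-word list: x(j) = 1 if the j-th feature word occurs, else 0."""
    present = set(doc_words)
    return [1 if w in present else 0 for w in feature_words]
```

For example, with feature_words = ["A", "B", "C"], a document containing only "B" maps to the term vector [0, 1, 0], which is then paired with its label y ∈ {0, 1}.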
Preferably, the memory unit attaches class labels to the term vectors for training the naive Bayes classifier (a classifier sketch follows);
The Bayesian estimates of the conditional probability and the prior probability are, respectively:

P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j λ)

P_λ(Y = c_k) = (Σ_{i=1}^{N} I(y_i = c_k) + λ) / (N + Kλ)

where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate, and λ = 1 gives Laplace smoothing.
Here we take λ = 1 (Laplace smoothing) for both the conditional probability and the prior probability.
From the above, the naive Bayes classifier is obtained:

y = arg max_{c_k} [ P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ] / [ Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ]

Since the denominator is the same for all c_k, this is equivalent to:

y = arg max_{c_k} P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)

where y = 0 for normal speech and y = 1 for improper speech.
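A minimal sketch of training and applying the Laplace-smoothed classifier defined by the formulas above, assuming discrete feature values (binary presence by default); the class layout is an illustrative assumption:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Naive Bayes with Bayesian (lambda = 1, Laplace) estimates of the
    prior P(Y=c) and the per-feature conditionals P(X(j)=v | Y=c)."""
    def __init__(self, lam=1.0, n_values=2):
        self.lam = lam            # lambda in the Bayesian estimates
        self.n_values = n_values  # S_j: number of values each feature takes

    def fit(self, X, y):
        n, self.classes = len(X), sorted(set(y))
        k = len(self.classes)
        self.class_count = Counter(y)
        # Prior: (count(Y=c) + lambda) / (N + K*lambda)
        self.prior = {c: (self.class_count[c] + self.lam) / (n + k * self.lam)
                      for c in self.classes}
        # Joint counts count(X(j)=v, Y=c) for the conditionals
        self.cond = defaultdict(Counter)
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.cond[yi][(j, v)] += 1
        return self

    def predict(self, x):
        best, best_log = None, -math.inf
        for c in self.classes:
            log_p = math.log(self.prior[c])
            for j, v in enumerate(x):
                num = self.cond[c][(j, v)] + self.lam
                den = self.class_count[c] + self.n_values * self.lam
                log_p += math.log(num / den)  # smoothed P(X(j)=v | Y=c)
            if log_p > best_log:
                best, best_log = c, log_p
        return best  # 0 = normal speech, 1 = improper speech
```

Working in log space avoids underflow from multiplying many small probabilities; the arg max is unchanged because the logarithm is monotonic.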
Preferably, the community speech filtering system based on naive Bayes judges whether a speech document is normal speech through the following steps (a sketch of the whole flow in code follows the list):
D1: The word-segmentation unit pre-processes the speech document to obtain the processed text, then segments it using bidirectional matching, cutting the text into individual words. If forward maximum matching and reverse maximum matching give the same result, that segmentation is used directly; if their results differ, the t-test difference method is used to resolve the ambiguity;
D2: After segmentation, every word that appears in the stop-word list is deleted. For the remaining words, the information gain of each candidate feature word is computed, the dimensionality is further reduced according to IG, and the N words with the largest information gain are taken as the effective features;
D3: After the effective features are extracted, the speech document is converted into a term vector:

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

and X_i is the i-th speech document and x^(j) is the j-th feature word;
D4: The Laplace smoothing value (λ = 1) is added to the conditional probability and prior probability estimates given above;
D5: A document to be classified is converted into a term vector as in step D3:

X = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

and x^(j) is the j-th feature word;
then P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) is calculated for each class c_k;
D6: The class of the document to be classified is determined as the c_k maximizing the above quantity. If y = 0, the speech document to be classified is normal speech; otherwise it is improper speech;
D7: The output unit outputs the speech document.
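Tying steps D1-D7 together, a sketch of the filtering flow. Every function here refers to the sketches in this description (resolve_with_t_test is sketched under embodiment two below); all of the names are assumptions rather than the patent's prescribed API:

```python
def classify_post(text, dictionary, stop_words, features, model, r1, r2):
    """D1-D7 end to end: segment, filter stop words, vectorize, classify."""
    fwd = forward_max_match(text, dictionary)            # D1: bidirectional
    rev = reverse_max_match(text, dictionary)            #     matching
    words = fwd if fwd == rev else resolve_with_t_test(text, fwd, rev, r1, r2)
    words = [w for w in words if w not in stop_words]    # D2: stop-word removal
    x = to_term_vector(words, features)                  # D3/D5: term vector
    y = model.predict(x)                                 # D4/D6: smoothed NB
    return "normal" if y == 0 else "improper"            # D7: output decision
```

Feature selection by information gain (step D2) happens once, at training time, to fix the `features` list; at classification time the document is only projected onto that list.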
Compared with the prior art, the beneficial effects of the present invention are: first, the community speech filtering system based on naive Bayes effectively avoids the situation where a user who merely happened to type a word in the sensitive dictionary is blocked, preventing a bad user experience; second, the system performs speech filtering conveniently and with high accuracy.
Description of the drawings
Fig. 1 is a functional block diagram of a community speech filtering system based on naive Bayes according to the present invention.
Specific embodiment
The above and further technical features and advantages of the present invention are described in more detail below with reference to the accompanying drawing.
The present invention provides a community speech filtering system based on naive Bayes. As shown in Fig. 1, the system includes a word-segmentation unit, a conversion unit, a memory unit, and an output unit.
Embodiment one
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the word-segmentation unit pre-processes speech documents: since a speech document consists of a large number of sentences, it must be segmented into words to facilitate subsequent processing.
The word-segmentation unit includes a forward module, a reverse module, and a t-test module;
The forward module matches several consecutive characters of the document to be segmented against the dictionary from left to right. On a successful match, a word is cut off; otherwise, the rightmost character is dropped and the dictionary is searched again, until the whole text has been segmented.
The reverse module matches several consecutive characters of the document to be segmented against the dictionary from right to left. On a successful match, a word is cut off; otherwise, the leftmost character is dropped and the dictionary is searched again, until the whole text has been segmented.
The t-test module resolves the ambiguity when the segmentation results of the forward module and the reverse module differ.
The conversion unit converts the speech document into a term vector after the word-segmentation unit completes segmentation.
Using the multinomial model of naive Bayes, the term vector has the form:

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words,

and X_i is the i-th speech document and x^(j) is the j-th feature word.
We assume that documents fall into two classes, normal speech and improper speech:

y ∈ Y = {0, 1}

where y = 0 for normal speech and y = 1 for improper speech.
The memory unit attaches class labels to the term vectors for training the naive Bayes classifier. The naive Bayes method makes a conditional-independence assumption on the conditional probability distribution: given the class, the features used for classification are conditionally independent, i.e.

P(X = x | Y = c_k) = Π_{j=1}^{n} P(X^(j) = x^(j) | Y = c_k).

This assumption makes the naive Bayes method simple, at the cost of some accuracy.
The term vectors are labeled manually and used to train the naive Bayes classifier.
This yields the maximum-likelihood estimate of the prior probability:

P(Y = c_k) = Σ_{i=1}^{N} I(y_i = c_k) / N, k = 1, 2, …, K

where I is the indicator function: 1 when y_i = c_k and 0 otherwise.
Let the set of possible values of the j-th feature x^(j) be {a_j1, a_j2, …, a_jS_j}. The maximum-likelihood estimate of the conditional probability P(X^(j) = a_jl | Y = c_k) is:

P(X^(j) = a_jl | Y = c_k) = Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1}^{N} I(y_i = c_k)

where x_i^(j) is the j-th feature of the i-th sample, a_jl is the l-th value that the j-th feature may take, and I is the indicator function.
From the above formulas it is clear that maximum-likelihood estimation may produce a probability estimate of 0, which would distort the computation of the posterior probability and bias the classification. We therefore use Bayesian estimation here. The Bayesian estimates of the conditional probability and the prior probability are, respectively:

P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j λ)

P_λ(Y = c_k) = (Σ_{i=1}^{N} I(y_i = c_k) + λ) / (N + Kλ)

where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate, and λ = 1 gives Laplace smoothing.
Here we take λ = 1 (Laplace smoothing) for both the conditional probability and the prior probability.
From the above, the naive Bayes classifier is derived:

y = arg max_{c_k} [ P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ] / [ Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ]

Since the denominator is the same for all c_k, this is equivalent to:

y = arg max_{c_k} P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)

y is the corresponding class output; from it the class of the input term vector is determined, and hence whether the input speech document is normal.
The output unit outputs the speech document.
Embodiment two
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the segmentation method of the word-segmentation unit is bidirectional matching, which comprises forward maximum matching and reverse maximum matching. When the segmentation results produced by the two methods are the same, that result is kept. When they differ, a segmentation ambiguity is assumed to have arisen, and the t-test difference method is used to improve segmentation accuracy.
The forward maximum matching method includes the following steps:
A1: Take the next m characters of the text, from left to right, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
A2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to A1; otherwise go to A3;
A3: Remove the rightmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to A1; otherwise go to A4;
A4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step A2.
The reverse maximum matching method includes the following steps:
B1: Take the next m characters of the text, from right to left, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
B2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to B1; otherwise go to B3;
B3: Remove the leftmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to B1; otherwise go to B4;
B4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step B2.
If the forward maximum matching method and the reverse maximum matching method produce different segmentations of a speech document, the t-test difference method is used to resolve the ambiguity. For an ordered character string xyz, the t-test of y relative to x and z is defined as:

t_{x,z}(y) = (ρ(z|y) - ρ(y|x)) / sqrt(σ²(ρ(z|y)) + σ²(ρ(y|x)))

where ρ(z|y) and ρ(y|x) denote the probability of z given y and the probability of y given x respectively, and σ²(ρ(z|y)), σ²(ρ(y|x)) denote their respective variances. These quantities are estimated as follows:

ρ(z|y) ≈ r(y,z) / r(y),  ρ(y|x) ≈ r(x,y) / r(x)
σ²(ρ(z|y)) ≈ r(y,z) / r(y)²,  σ²(ρ(y|x)) ≈ r(x,y) / r(x)²

where r(y,z) and r(x,y) denote the frequencies with which the ordered strings yz and xy occur in the dictionary, and r(x), r(y) denote the frequencies with which x and y occur in the dictionary.
Therefore the calculation formula for t_{x,z}(y) is:

t_{x,z}(y) = (r(y,z)/r(y) - r(x,y)/r(x)) / sqrt(r(y,z)/r(y)² + r(x,y)/r(x)²)

For an ordered string wxyz, the t-test difference between x and y is:

Δt(x,y) = t_{w,y}(x) - t_{x,z}(y)

The result is handled by cases:
Case 1: t_{w,y}(x) > 0, t_{x,z}(y) < 0, Δt(x,y) > 0: x and y attract each other, so xy forms one word.
Case 2: t_{w,y}(x) < 0, t_{x,z}(y) > 0, Δt(x,y) < 0: x and y repel each other, so xy is split.
Case 3: t_{w,y}(x) > 0, t_{x,z}(y) > 0: z attracts y while y attracts x; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split.
Case 4: t_{w,y}(x) < 0, t_{x,z}(y) < 0: w attracts x while x attracts y; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split.
When forward maximum matching and reverse maximum matching disagree on the segmentation of a speech document, the word-segmentation unit applies the t-test difference method and judges according to the four cases above, after which a preliminary segmentation result is obtained. A sketch of how such arbitration could be composed from the earlier pieces follows.
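For brevity, the sketch below re-derives the pairing character by character over the text with the sign of Δt, rather than restricting itself to the spans where the two segmentations actually disagree, as a fuller implementation would. It builds on bind_xy from the earlier sketch, and the signature is only an assumption:

```python
def resolve_with_t_test(text, fwd, rev, r1, r2):
    """Arbitrate between disagreeing forward/reverse segmentations:
    greedily merge an adjacent character pair xy whenever delta-t
    binds it (bind_xy is True), otherwise split after x."""
    words, i = [], 0
    while i < len(text):
        w = text[i - 1] if i > 0 else ""          # pad the wxyz window
        x = text[i]
        y = text[i + 1] if i + 1 < len(text) else ""
        z = text[i + 2] if i + 2 < len(text) else ""
        if y and bind_xy(w, x, y, z, r1, r2):
            words.append(x + y)                   # delta-t > 0: one word
            i += 2
        else:
            words.append(x)                       # delta-t <= 0: split
            i += 1
    return words
```

In this simplified sketch the forward and reverse results are accepted by the caller when they agree, and the pairwise rule decides everything else.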
Embodiment three
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that, after the word-segmentation unit obtains a preliminary segmentation, every word that appears in the stop-word list is deleted according to that list, and the dimensionality is then further reduced according to information gain (IG). Information gain measures how much the expected entropy decreases when the sample space is partitioned by an attribute t; the larger IG(t) is, the more t contributes to the overall classification. On the basis of embodiment two, this embodiment removes stop words from the speech document, computes the IG(t) value of each word, and takes the N words with the largest IG values as the finally chosen feature words.
The information gain formula is as follows:

IG(t) = -Σ_i P(C_i) log P(C_i) + P(t) Σ_i P(C_i|t) log P(C_i|t) + P(¬t) Σ_i P(C_i|¬t) log P(C_i|¬t)

where P(C_i) denotes the probability that a text of class C_i occurs in the training sample, P(t) the probability that word t occurs in the training sample, P(¬t) the probability that word t does not occur, P(C_i|t) the probability of belonging to class C_i given that word t occurs, and P(C_i|¬t) the probability of belonging to class C_i given that word t does not occur.
Embodiment four
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the conversion unit converts the speech document into a term vector after the feature words are obtained.
The term vector has the following form:

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words,

and X_i is the i-th speech document and x^(j) is the j-th feature word.
The term vector follows the multinomial model of naive Bayes.
Documents are divided into two classes, normal speech and improper speech, expressed as:

y ∈ Y = {0, 1}

where y = 0 for normal speech and y = 1 for improper speech.
Embodiment five
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the memory unit attaches class labels to the term vectors for training the naive Bayes classifier. The naive Bayes method makes a conditional-independence assumption on the conditional probability distribution: given the class, the features used for classification are conditionally independent. This assumption makes the naive Bayes method simple, at the cost of some accuracy.
The term vectors are labeled manually and used to train the naive Bayes classifier.
This yields the maximum-likelihood estimate of the prior probability:

P(Y = c_k) = Σ_{i=1}^{N} I(y_i = c_k) / N, k = 1, 2, …, K

where I is the indicator function: 1 when y_i = c_k and 0 otherwise.
Let the set of possible values of the j-th feature x^(j) be {a_j1, a_j2, …, a_jS_j}. The maximum-likelihood estimate of the conditional probability P(X^(j) = a_jl | Y = c_k) is:

P(X^(j) = a_jl | Y = c_k) = Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1}^{N} I(y_i = c_k)

where x_i^(j) is the j-th feature of the i-th sample, a_jl is the l-th value that the j-th feature may take, and I is the indicator function.
From the above formulas it is clear that maximum-likelihood estimation may produce a probability estimate of 0, which would distort the computation of the posterior probability and bias the classification. To avoid this, we use Bayesian estimation here. The Bayesian estimates of the conditional probability and the prior probability are, respectively:

P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j λ)

P_λ(Y = c_k) = (Σ_{i=1}^{N} I(y_i = c_k) + λ) / (N + Kλ)

where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate, and λ = 1 gives Laplace smoothing.
Here we take λ = 1 (Laplace smoothing) for both the conditional probability and the prior probability.
From the above, the naive Bayes classifier is derived:

y = arg max_{c_k} [ P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ] / [ Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ]

Since the denominator is the same for all c_k, this is equivalent to:

y = arg max_{c_k} P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)

y is the corresponding class output; from it the class of the input term vector is determined, and hence whether the input speech document is normal, where y = 0 for normal speech and y = 1 for improper speech.
Embodiment six
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the specific steps by which the system judges whether a speech document is normal speech are as follows:
D1: The word-segmentation unit pre-processes the speech document. Speech documents are divided into two classes, normal speech and improper speech:

y ∈ Y = {0, 1}

where normal speech is labeled 0 and improper speech is labeled 1.
The processed speech document text is obtained, and the word-segmentation unit segments it using bidirectional matching, cutting the text into individual words. If forward maximum matching and reverse maximum matching give the same result, that segmentation is used directly; if their results differ, the t-test difference method is used to resolve the ambiguity.
D2: After segmentation, every word that appears in the stop-word list is deleted. For the remaining words, the information gain of each candidate feature word is computed, the dimensionality is further reduced according to IG, and the N words with the largest information gain are taken as the effective features.
D3: After the effective features are extracted, the speech document is converted into a term vector:

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

and X_i is the i-th speech document and x^(j) is the j-th feature word.
After all speech documents have been processed, the training samples are obtained.
D4: The Laplace smoothing value (λ = 1) is added to the conditional probability and prior probability estimates given above.
D5: A document to be classified is converted into a term vector as in step D3:

X = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

and x^(j) is the j-th feature word;
then P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) is calculated for each class c_k.
D6: The class of the document to be classified is determined as the c_k maximizing the above quantity. If y = 0, the speech document to be classified is normal speech; otherwise it is improper speech.
D7: The output unit outputs the speech document.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and additions without departing from the method of the present invention, and such improvements and additions shall also be regarded as falling within the protection scope of the present invention.

Claims (7)

  1. A community speech filtering system based on naive Bayes, characterized in that the system comprises a word-segmentation unit, a conversion unit, a memory unit, and an output unit; the word-segmentation unit is configured to pre-process speech documents and comprises a forward module, a reverse module, and a t-test module; the conversion unit is configured to convert a speech document into a term vector after the word-segmentation unit completes segmentation; the memory unit is configured to attach class labels to term vectors for training a naive Bayes classifier; and the output unit is configured to output the speech document.
  2. The community speech filtering system based on naive Bayes according to claim 1, characterized in that the segmentation method of the word-segmentation unit is bidirectional matching, which comprises forward maximum matching and reverse maximum matching;
    The forward maximum matching method includes the following steps:
    A1: Take the next m characters of the text, from left to right, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
    A2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to A1; otherwise go to A3;
    A3: Remove the rightmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to A1; otherwise go to A4;
    A4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step A2;
    The reverse maximum matching method includes the following steps:
    B1: Take the next m characters of the text, from right to left, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
    B2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to B1; otherwise go to B3;
    B3: Remove the leftmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to B1; otherwise go to B4;
    B4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step B2.
  3. The community speech filtering system based on naive Bayes according to claim 2, characterized in that, if forward maximum matching and reverse maximum matching produce different segmentations of a speech document, the t-test difference method is used to resolve the ambiguity; for an ordered character string xyz, the t-test of y relative to x and z is defined as:

    t_{x,z}(y) = (ρ(z|y) - ρ(y|x)) / sqrt(σ²(ρ(z|y)) + σ²(ρ(y|x)))

    where ρ(z|y) and ρ(y|x) denote the probability of z given y and the probability of y given x, and σ²(ρ(z|y)), σ²(ρ(y|x)) denote their respective variances; these quantities are estimated as:

    ρ(z|y) ≈ r(y,z) / r(y),  ρ(y|x) ≈ r(x,y) / r(x)
    σ²(ρ(z|y)) ≈ r(y,z) / r(y)²,  σ²(ρ(y|x)) ≈ r(x,y) / r(x)²

    where r(y,z) and r(x,y) denote the frequencies with which the ordered strings yz and xy occur in the dictionary, and r(x), r(y) denote the frequencies with which x and y occur in the dictionary;
    therefore the calculation formula for t_{x,z}(y) is:

    t_{x,z}(y) = (r(y,z)/r(y) - r(x,y)/r(x)) / sqrt(r(y,z)/r(y)² + r(x,y)/r(x)²)

    for an ordered string wxyz, the t-test difference between x and y is:

    Δt(x,y) = t_{w,y}(x) - t_{x,z}(y)

    the result is handled by cases:
    Case 1: t_{w,y}(x) > 0, t_{x,z}(y) < 0, Δt(x,y) > 0: x and y attract each other, so xy forms one word;
    Case 2: t_{w,y}(x) < 0, t_{x,z}(y) > 0, Δt(x,y) < 0: x and y repel each other, so xy is split;
    Case 3: t_{w,y}(x) > 0, t_{x,z}(y) > 0: z attracts y while y attracts x; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split;
    Case 4: t_{w,y}(x) < 0, t_{x,z}(y) < 0: w attracts x while x attracts y; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split.
  4. The community speech filtering system based on naive Bayes according to claim 3, characterized in that, after the word-segmentation unit of the system obtains a preliminary segmentation, every word that appears in the stop-word list is deleted according to that list, and the dimensionality is then further reduced according to information gain (IG);
    The information gain formula is as follows:

    IG(t) = -Σ_i P(C_i) log P(C_i) + P(t) Σ_i P(C_i|t) log P(C_i|t) + P(¬t) Σ_i P(C_i|¬t) log P(C_i|¬t)

    where P(C_i) denotes the probability that a text of class C_i occurs in the training sample, P(t) the probability that word t occurs in the training sample, P(¬t) the probability that word t does not occur, P(C_i|t) the probability of belonging to class C_i given that word t occurs, and P(C_i|¬t) the probability of belonging to class C_i given that word t does not occur.
  5. The community speech filtering system based on naive Bayes according to claim 4, characterized in that the conversion unit is configured to convert the speech document into a term vector after the feature words are obtained, the term vector having the form:

    X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words,

    and X_i is the i-th speech document and x^(j) is the j-th feature word;
    speech documents are divided into two classes, normal speech and improper speech, expressed as:

    y ∈ Y = {0, 1}

    where y = 0 for normal speech and y = 1 for improper speech.
  6. The community speech filtering system based on naive Bayes according to claim 5, characterized in that the memory unit is configured to attach class labels to term vectors for training the naive Bayes classifier;
    the Bayesian estimates of the conditional probability and the prior probability are, respectively:

    P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j λ)

    P_λ(Y = c_k) = (Σ_{i=1}^{N} I(y_i = c_k) + λ) / (N + Kλ)

    where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate, and λ = 1 gives Laplace smoothing; here both the conditional probability and the prior probability use λ = 1 (Laplace smoothing);
    from the above, the naive Bayes classifier is derived:

    y = arg max_{c_k} [ P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ] / [ Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ]

    since the denominator is the same for all c_k, this is equivalent to:

    y = arg max_{c_k} P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)

    where y = 0 for normal speech and y = 1 for improper speech.
  7. The community speech filtering system based on naive Bayes according to claim 1, characterized in that the specific steps by which the system judges whether a speech document is normal speech are as follows:
    D1: The word-segmentation unit pre-processes the speech document to obtain the processed text, then segments it using bidirectional matching, cutting the text into individual words; if forward maximum matching and reverse maximum matching give the same result, that segmentation is used directly; if their results differ, the t-test difference method is used to resolve the ambiguity;
    D2: After segmentation, every word that appears in the stop-word list is deleted; for the remaining words, the information gain of each candidate feature word is computed, the dimensionality is further reduced according to IG, and the N words with the largest information gain are taken as the effective features;
    D3: After the effective features are extracted, the speech document is converted into a term vector:

    X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

    and X_i is the i-th speech document and x^(j) is the j-th feature word;
    D4: The Laplace smoothing value (λ = 1) is added to the conditional probability and prior probability estimates;
    D5: A document to be classified is converted into a term vector as in step D3:

    X = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

    and x^(j) is the j-th feature word;
    then P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) is calculated for each class c_k;
    D6: The class of the document to be classified is determined; if y = 0, the speech document to be classified is normal speech, otherwise it is improper speech;
    D7: The output unit outputs the speech document.
CN201611254036.8A 2016-12-30 2016-12-30 A community speech filtering system based on naive Bayes Pending CN108268459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611254036.8A CN108268459A (en) 2016-12-30 2016-12-30 A community speech filtering system based on naive Bayes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611254036.8A CN108268459A (en) 2016-12-30 2016-12-30 A community speech filtering system based on naive Bayes

Publications (1)

Publication Number Publication Date
CN108268459A 2018-07-10

Family

ID=62754338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611254036.8A Pending CN108268459A (en) A community speech filtering system based on naive Bayes

Country Status (1)

Country Link
CN (1) CN108268459A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807642A (en) * 2021-06-25 2021-12-17 国网浙江省电力有限公司金华供电公司 Power dispatching intelligent interaction method based on program-controlled telephone

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996241A (en) * 2010-10-22 2011-03-30 东南大学 Bayesian algorithm-based content filtering method
CN103634473A (en) * 2013-12-05 2014-03-12 南京理工大学连云港研究院 Naive Bayesian classification based mobile phone spam short message filtering method and system
CN105975454A (en) * 2016-04-21 2016-09-28 广州精点计算机科技有限公司 Chinese word segmentation method and device of webpage text

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996241A (en) * 2010-10-22 2011-03-30 东南大学 Bayesian algorithm-based content filtering method
CN103634473A (en) * 2013-12-05 2014-03-12 南京理工大学连云港研究院 Naive Bayesian classification based mobile phone spam short message filtering method and system
CN105975454A (en) * 2016-04-21 2016-09-28 广州精点计算机科技有限公司 Chinese word segmentation method and device of webpage text

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
JIHITE: "Naive Bayes method (2): basic method", https://www.cnblogs.com/kaituorensheng/p/3379478.html *
单月光: "Research and implementation of key technologies for network public-opinion analysis based on microblogs", China Master's Theses Full-text Database, Information Science and Technology *
徐英慧 et al.: "Research on content-based filtering strategies for spam SMS on mobile phones", Journal of Beijing Information Science and Technology University *
曹卫峰: "Research on key technologies of Chinese word segmentation", China Master's Theses Full-text Database, Information Science and Technology *
王思力: "Research on Chinese word segmentation technology for large-scale information retrieval", China Master's Theses Full-text Database, Information Science and Technology *
马刚: "Semantic-based Web Data Mining", 31 January 2014 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180710)