CN108268459A - A community speech filtering system based on naive Bayes - Google Patents

A community speech filtering system based on naive Bayes

Info

Publication number
CN108268459A
CN108268459A (application CN201611254036.8A)
Authority
CN
China
Prior art keywords: word, speech, cutting, word string, document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611254036.8A
Other languages
Chinese (zh)
Inventor
麻建
吴剑文
何伟潮
单小红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Fine Point Data Polytron Technologies Inc
Original Assignee
Guangdong Fine Point Data Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Fine Point Data Polytron Technologies Inc
Priority to CN201611254036.8A
Publication of CN108268459A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The present invention provides a community speech filtering system based on naive Bayes. The system comprises a word-segmentation unit, a conversion unit, a memory unit, and an output unit. The word-segmentation unit pre-processes speech documents and comprises a forward module, a reverse module, and a t-test module. The conversion unit converts a speech document into a term vector after the word-segmentation unit completes segmentation. The memory unit attaches class labels to the term vectors for training a naive Bayes classifier. The output unit outputs the speech document. Using this system, the situation where a user who merely happened to type a word from the sensitive dictionary is blocked can be effectively avoided, preventing a bad user experience.

Description

A community speech filtering system based on naive Bayes
Technical field
The present invention relates to the field of filtering systems, and more particularly to a community speech filtering system based on naive Bayes.
Background art
With the rapid development of the Internet, people's daily lives have become inseparable from the network. Against this background, communities built around all kinds of interests have naturally emerged. In today's communities, however, we often see inappropriate speech of various kinds, such as personal attacks on other users and extreme political statements. A community that leaves this phenomenon unchecked harms its own development.
Nowadays the major community platforms all have some means of dealing with inappropriate speech, but most rely on the passive pattern of user reports, which is very inefficient. Moreover, current community speech filtering typically just checks whether a user's post contains any word from a sensitive-word dictionary and, if so, replaces that word with asterisks; this process does not even apply a word-segmentation algorithm. Sometimes a user merely happens to type a word that is in the sensitive dictionary, in a context that is not improper at all, yet the post is blocked, giving the user a bad experience. There is therefore an urgent need for a convenient, highly accurate speech-filtering system.
In view of the above drawbacks, the inventors arrived at the present invention after prolonged research and practice.
Summary of the invention
To solve the above problems, the technical solution adopted by the present invention provides a community speech filtering system based on naive Bayes. The system includes a word-segmentation unit, a conversion unit, a memory unit, and an output unit. The word-segmentation unit pre-processes speech documents and includes a forward module, a reverse module, and a t-test module. The conversion unit converts a speech document into a term vector after the word-segmentation unit completes segmentation. The memory unit attaches class labels to the term vectors for training a naive Bayes classifier. The output unit outputs the speech document.
Preferably, the segmentation method of the word-segmentation unit is bidirectional matching, which combines forward maximum matching and reverse maximum matching (a code sketch of both procedures follows the steps below);
The forward maximum matching method includes the following steps:
A1: Take the next m characters of the text, from left to right, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
A2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to A1; otherwise go to A3;
A3: Remove the rightmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to A1; otherwise go to A4;
A4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step A2;
The reverse maximum matching method includes the following steps:
B1: Take the next m characters of the text, from right to left, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
B2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to B1; otherwise go to B3;
B3: Remove the leftmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to B1; otherwise go to B4;
B4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step B2.
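The following minimal Python sketch illustrates the two matching procedures above. The dictionary lookup, the maximum window length m, and the function names are illustrative assumptions, not part of the patent:

```python
def forward_max_match(text, dictionary, m=5):
    """Forward maximum matching (steps A1-A4): scan left to right,
    shrinking the candidate string from the right until it matches."""
    words, i = [], 0
    while i < len(text):
        k = min(m, len(text) - i)          # A1: take up to m characters
        while k > 1 and text[i:i + k] not in dictionary:
            k -= 1                         # A3: drop the rightmost character
        words.append(text[i:i + k])        # A2/A4: match (or single character)
        i += k
    return words


def reverse_max_match(text, dictionary, m=5):
    """Reverse maximum matching (steps B1-B4): scan right to left,
    shrinking the candidate string from the left until it matches."""
    words, j = [], len(text)
    while j > 0:
        k = min(m, j)                      # B1: take up to m characters
        while k > 1 and text[j - k:j] not in dictionary:
            k -= 1                         # B3: drop the leftmost character
        words.insert(0, text[j - k:j])     # B2/B4: match (or single character)
        j -= k
    return words
```

When the two functions return the same word list, that segmentation is used directly; when they disagree, the t-test difference method described next arbitrates.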
Preferably, if the forward maximum matching method and the reverse maximum matching method produce different segmentations of a speech document, the t-test difference method is used to resolve the ambiguity. For an ordered character string xyz, the t-test of y relative to x and z is defined as:

t_{x,z}(y) = (ρ(z|y) - ρ(y|x)) / sqrt(σ²(ρ(z|y)) + σ²(ρ(y|x)))

where ρ(z|y) and ρ(y|x) denote the probability of z given y and the probability of y given x respectively, and σ²(ρ(z|y)), σ²(ρ(y|x)) denote their respective variances. These quantities are estimated as follows:

ρ(z|y) ≈ r(y,z) / r(y),  ρ(y|x) ≈ r(x,y) / r(x)
σ²(ρ(z|y)) ≈ r(y,z) / r(y)²,  σ²(ρ(y|x)) ≈ r(x,y) / r(x)²

where r(y,z) and r(x,y) denote the frequencies with which the ordered strings yz and xy occur in the dictionary, and r(x), r(y) denote the frequencies with which x and y occur in the dictionary.
Therefore the calculation formula for t_{x,z}(y) is:

t_{x,z}(y) = (r(y,z)/r(y) - r(x,y)/r(x)) / sqrt(r(y,z)/r(y)² + r(x,y)/r(x)²)

For an ordered string wxyz, the t-test difference between x and y is:

Δt(x,y) = t_{w,y}(x) - t_{x,z}(y)

The result is handled by cases (a code sketch follows the four cases):
Case 1: t_{w,y}(x) > 0, t_{x,z}(y) < 0, Δt(x,y) > 0: x and y attract each other, so xy forms one word;
Case 2: t_{w,y}(x) < 0, t_{x,z}(y) > 0, Δt(x,y) < 0: x and y repel each other, so xy is split;
Case 3: t_{w,y}(x) > 0, t_{x,z}(y) > 0: z attracts y while y attracts x; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split;
Case 4: t_{w,y}(x) < 0, t_{x,z}(y) < 0: w attracts x while x attracts y; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split.
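A sketch of the t-test computation under the estimates above, in Python. The unigram and bigram frequency tables r1 and r2 are assumed to be precomputed counts; all names are illustrative:

```python
import math

def t_score(left, mid, right, r1, r2):
    """t_{left,right}(mid): difference of the two conditional-probability
    estimates divided by the square root of the sum of their variances."""
    p_right = r2.get(mid + right, 0) / max(r1.get(mid, 0), 1)      # rho(z|y)
    p_mid = r2.get(left + mid, 0) / max(r1.get(left, 0), 1)        # rho(y|x)
    var_right = r2.get(mid + right, 0) / max(r1.get(mid, 0), 1) ** 2
    var_mid = r2.get(left + mid, 0) / max(r1.get(left, 0), 1) ** 2
    denom = math.sqrt(var_right + var_mid)
    return (p_right - p_mid) / denom if denom else 0.0

def bind_xy(w, x, y, z, r1, r2):
    """For the window wxyz, decide whether xy forms one word. All four
    cases above reduce to the sign of delta-t: bind xy when delta-t > 0."""
    delta = t_score(w, x, y, r1, r2) - t_score(x, y, z, r1, r2)
    return delta > 0
```

Note that the four cases collapse to a single test on the sign of Δt(x,y), which makes the rule convenient to apply mechanically.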
Preferably, after the word-segmentation unit of the system obtains a preliminary segmentation, every word that appears in the stop-word list is deleted according to that list, and the dimensionality is then further reduced according to information gain (IG; a code sketch follows);
The information gain formula is as follows:

IG(t) = -Σ_i P(C_i) log P(C_i) + P(t) Σ_i P(C_i|t) log P(C_i|t) + P(¬t) Σ_i P(C_i|¬t) log P(C_i|¬t)

where P(C_i) denotes the probability that a text of class C_i occurs in the training sample, P(t) the probability that word t occurs in the training sample, P(¬t) the probability that word t does not occur, P(C_i|t) the probability of belonging to class C_i given that word t occurs, and P(C_i|¬t) the probability of belonging to class C_i given that word t does not occur.
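A sketch of the information-gain computation, following the formula above: the first term is the class entropy H(C) and the remaining terms are the conditional entropies given presence or absence of t. The document representation and names are assumptions:

```python
import math

def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]: the expected
    entropy drop from splitting the training sample on whether `term`
    occurs. docs: list of word sets; labels: parallel list of class ids."""
    def entropy(ys):
        n = len(ys)
        if n == 0:
            return 0.0
        return -sum((ys.count(c) / n) * math.log2(ys.count(c) / n)
                    for c in set(ys))

    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    p_t = len(with_t) / len(docs)
    return entropy(labels) - (p_t * entropy(with_t)
                              + (1 - p_t) * entropy(without_t))
```

Ranking all candidate words by IG(t) and keeping the top N implements the feature selection described in the embodiments.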
Preferably, after the feature words are obtained, the conversion unit converts the speech document into a term vector of the following form (a conversion sketch follows):

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words,

and X_i is the i-th speech document and x^(j) is the j-th feature word;
Speech documents are divided into two classes, normal speech and improper speech, expressed as:

y ∈ Y = {0, 1}

where y = 0 for normal speech and y = 1 for improper speech.
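A minimal sketch of the conversion, using binary presence features, one of the simplest instantiations of the discrete feature values a_jl used later; the embodiments mention the multinomial model, under which occurrence counts would be used instead. The helper name is an assumption:

```python
def to_term_vector(doc_words, feature_words):
    """Map a segmented, stop-word-filtered document onto the fixed
    feature-word list: x(j) = 1 if the j-th feature word occurs, else 0."""
    present = set(doc_words)
    return [1 if w in present else 0 for w in feature_words]
```

For example, with feature_words = ["A", "B", "C"], a document containing only "B" maps to the term vector [0, 1, 0], which is then paired with its label y ∈ {0, 1}.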
Preferably, the memory unit attaches class labels to the term vectors for training the naive Bayes classifier (a classifier sketch follows);
The Bayesian estimates of the conditional probability and the prior probability are, respectively:

P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j λ)

P_λ(Y = c_k) = (Σ_{i=1}^{N} I(y_i = c_k) + λ) / (N + Kλ)

where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate, and λ = 1 gives Laplace smoothing.
Here we take λ = 1 (Laplace smoothing) for both the conditional probability and the prior probability.
From the above, the naive Bayes classifier is obtained:

y = arg max_{c_k} [ P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ] / [ Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ]

Since the denominator is the same for all c_k, this is equivalent to:

y = arg max_{c_k} P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)

where y = 0 for normal speech and y = 1 for improper speech.
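A minimal sketch of training and applying the Laplace-smoothed classifier defined by the formulas above, assuming discrete feature values (binary presence by default); the class layout is an illustrative assumption:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Naive Bayes with Bayesian (lambda = 1, Laplace) estimates of the
    prior P(Y=c) and the per-feature conditionals P(X(j)=v | Y=c)."""
    def __init__(self, lam=1.0, n_values=2):
        self.lam = lam            # lambda in the Bayesian estimates
        self.n_values = n_values  # S_j: number of values each feature takes

    def fit(self, X, y):
        n, self.classes = len(X), sorted(set(y))
        k = len(self.classes)
        self.class_count = Counter(y)
        # Prior: (count(Y=c) + lambda) / (N + K*lambda)
        self.prior = {c: (self.class_count[c] + self.lam) / (n + k * self.lam)
                      for c in self.classes}
        # Joint counts count(X(j)=v, Y=c) for the conditionals
        self.cond = defaultdict(Counter)
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.cond[yi][(j, v)] += 1
        return self

    def predict(self, x):
        best, best_log = None, -math.inf
        for c in self.classes:
            log_p = math.log(self.prior[c])
            for j, v in enumerate(x):
                num = self.cond[c][(j, v)] + self.lam
                den = self.class_count[c] + self.n_values * self.lam
                log_p += math.log(num / den)  # smoothed P(X(j)=v | Y=c)
            if log_p > best_log:
                best, best_log = c, log_p
        return best  # 0 = normal speech, 1 = improper speech
```

Working in log space avoids underflow from multiplying many small probabilities; the arg max is unchanged because the logarithm is monotonic.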
Preferably, the community speech filtering system based on naive Bayes judges whether a speech document is normal speech through the following steps (a sketch of the whole flow in code follows the list):
D1: The word-segmentation unit pre-processes the speech document to obtain the processed text, then segments it using bidirectional matching, cutting the text into individual words. If forward maximum matching and reverse maximum matching give the same result, that segmentation is used directly; if their results differ, the t-test difference method is used to resolve the ambiguity;
D2: After segmentation, every word that appears in the stop-word list is deleted. For the remaining words, the information gain of each candidate feature word is computed, the dimensionality is further reduced according to IG, and the N words with the largest information gain are taken as the effective features;
D3: After the effective features are extracted, the speech document is converted into a term vector:

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

and X_i is the i-th speech document and x^(j) is the j-th feature word;
D4: The Laplace smoothing value (λ = 1) is added to the conditional probability and prior probability estimates given above;
D5: A document to be classified is converted into a term vector as in step D3:

X = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

and x^(j) is the j-th feature word;
then P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) is calculated for each class c_k;
D6: The class of the document to be classified is determined as the c_k maximizing the above quantity. If y = 0, the speech document to be classified is normal speech; otherwise it is improper speech;
D7: The output unit outputs the speech document.
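Tying steps D1-D7 together, a sketch of the filtering flow. Every function here refers to the sketches in this description (resolve_with_t_test is sketched under embodiment two below); all of the names are assumptions rather than the patent's prescribed API:

```python
def classify_post(text, dictionary, stop_words, features, model, r1, r2):
    """D1-D7 end to end: segment, filter stop words, vectorize, classify."""
    fwd = forward_max_match(text, dictionary)            # D1: bidirectional
    rev = reverse_max_match(text, dictionary)            #     matching
    words = fwd if fwd == rev else resolve_with_t_test(text, fwd, rev, r1, r2)
    words = [w for w in words if w not in stop_words]    # D2: stop-word removal
    x = to_term_vector(words, features)                  # D3/D5: term vector
    y = model.predict(x)                                 # D4/D6: smoothed NB
    return "normal" if y == 0 else "improper"            # D7: output decision
```

Feature selection by information gain (step D2) happens once, at training time, to fix the `features` list; at classification time the document is only projected onto that list.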
Compared with the prior art, the beneficial effects of the present invention are: first, the community speech filtering system based on naive Bayes effectively avoids the situation where a user who merely happened to type a word in the sensitive dictionary is blocked, preventing a bad user experience; second, the system performs speech filtering conveniently and with high accuracy.
Description of the drawings
Fig. 1 is a functional block diagram of a community speech filtering system based on naive Bayes according to the present invention.
Specific embodiment
The above and further technical features and advantages of the present invention are described in more detail below with reference to the accompanying drawing.
The present invention provides a community speech filtering system based on naive Bayes. As shown in Fig. 1, the system includes a word-segmentation unit, a conversion unit, a memory unit, and an output unit.
Embodiment one
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the word-segmentation unit pre-processes speech documents: since a speech document consists of a large number of sentences, it must be segmented into words to facilitate subsequent processing.
The word-segmentation unit includes a forward module, a reverse module, and a t-test module;
The forward module matches several consecutive characters of the document to be segmented against the dictionary from left to right. On a successful match, a word is cut off; otherwise, the rightmost character is dropped and the dictionary is searched again, until the whole text has been segmented.
The reverse module matches several consecutive characters of the document to be segmented against the dictionary from right to left. On a successful match, a word is cut off; otherwise, the leftmost character is dropped and the dictionary is searched again, until the whole text has been segmented.
The t-test module resolves the ambiguity when the segmentation results of the forward module and the reverse module differ.
The conversion unit converts the speech document into a term vector after the word-segmentation unit completes segmentation.
Using the multinomial model of naive Bayes, the term vector has the form:

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words,

and X_i is the i-th speech document and x^(j) is the j-th feature word.
We assume that documents fall into two classes, normal speech and improper speech:

y ∈ Y = {0, 1}

where y = 0 for normal speech and y = 1 for improper speech.
The memory unit attaches class labels to the term vectors for training the naive Bayes classifier. The naive Bayes method makes a conditional-independence assumption on the conditional probability distribution: given the class, the features used for classification are conditionally independent, i.e.

P(X = x | Y = c_k) = Π_{j=1}^{n} P(X^(j) = x^(j) | Y = c_k).

This assumption makes the naive Bayes method simple, at the cost of some accuracy.
The term vectors are labeled manually and used to train the naive Bayes classifier.
This yields the maximum-likelihood estimate of the prior probability:

P(Y = c_k) = Σ_{i=1}^{N} I(y_i = c_k) / N, k = 1, 2, …, K

where I is the indicator function: 1 when y_i = c_k and 0 otherwise.
Let the set of possible values of the j-th feature x^(j) be {a_j1, a_j2, …, a_jS_j}. The maximum-likelihood estimate of the conditional probability P(X^(j) = a_jl | Y = c_k) is:

P(X^(j) = a_jl | Y = c_k) = Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1}^{N} I(y_i = c_k)

where x_i^(j) is the j-th feature of the i-th sample, a_jl is the l-th value that the j-th feature may take, and I is the indicator function.
From the above formulas it is clear that maximum-likelihood estimation may produce a probability estimate of 0, which would distort the computation of the posterior probability and bias the classification. We therefore use Bayesian estimation here. The Bayesian estimates of the conditional probability and the prior probability are, respectively:

P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j λ)

P_λ(Y = c_k) = (Σ_{i=1}^{N} I(y_i = c_k) + λ) / (N + Kλ)

where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate, and λ = 1 gives Laplace smoothing.
Here we take λ = 1 (Laplace smoothing) for both the conditional probability and the prior probability.
From the above, the naive Bayes classifier is derived:

y = arg max_{c_k} [ P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ] / [ Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ]

Since the denominator is the same for all c_k, this is equivalent to:

y = arg max_{c_k} P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)

y is the corresponding class output; from it the class of the input term vector is determined, and hence whether the input speech document is normal.
The output unit outputs the speech document.
Embodiment two
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the segmentation method of the word-segmentation unit is bidirectional matching, which comprises forward maximum matching and reverse maximum matching. When the segmentation results produced by the two methods are the same, that result is kept. When they differ, a segmentation ambiguity is assumed to have arisen, and the t-test difference method is used to improve segmentation accuracy.
The forward maximum matching method includes the following steps:
A1: Take the next m characters of the text, from left to right, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
A2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to A1; otherwise go to A3;
A3: Remove the rightmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to A1; otherwise go to A4;
A4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step A2.
The reverse maximum matching method includes the following steps:
B1: Take the next m characters of the text, from right to left, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
B2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to B1; otherwise go to B3;
B3: Remove the leftmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to B1; otherwise go to B4;
B4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step B2.
If the forward maximum matching method and the reverse maximum matching method produce different segmentations of a speech document, the t-test difference method is used to resolve the ambiguity. For an ordered character string xyz, the t-test of y relative to x and z is defined as:

t_{x,z}(y) = (ρ(z|y) - ρ(y|x)) / sqrt(σ²(ρ(z|y)) + σ²(ρ(y|x)))

where ρ(z|y) and ρ(y|x) denote the probability of z given y and the probability of y given x respectively, and σ²(ρ(z|y)), σ²(ρ(y|x)) denote their respective variances. These quantities are estimated as follows:

ρ(z|y) ≈ r(y,z) / r(y),  ρ(y|x) ≈ r(x,y) / r(x)
σ²(ρ(z|y)) ≈ r(y,z) / r(y)²,  σ²(ρ(y|x)) ≈ r(x,y) / r(x)²

where r(y,z) and r(x,y) denote the frequencies with which the ordered strings yz and xy occur in the dictionary, and r(x), r(y) denote the frequencies with which x and y occur in the dictionary.
Therefore the calculation formula for t_{x,z}(y) is:

t_{x,z}(y) = (r(y,z)/r(y) - r(x,y)/r(x)) / sqrt(r(y,z)/r(y)² + r(x,y)/r(x)²)

For an ordered string wxyz, the t-test difference between x and y is:

Δt(x,y) = t_{w,y}(x) - t_{x,z}(y)

The result is handled by cases:
Case 1: t_{w,y}(x) > 0, t_{x,z}(y) < 0, Δt(x,y) > 0: x and y attract each other, so xy forms one word.
Case 2: t_{w,y}(x) < 0, t_{x,z}(y) > 0, Δt(x,y) < 0: x and y repel each other, so xy is split.
Case 3: t_{w,y}(x) > 0, t_{x,z}(y) > 0: z attracts y while y attracts x; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split.
Case 4: t_{w,y}(x) < 0, t_{x,z}(y) < 0: w attracts x while x attracts y; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split.
When forward maximum matching and reverse maximum matching disagree on the segmentation of a speech document, the word-segmentation unit applies the t-test difference method and judges according to the four cases above, after which a preliminary segmentation result is obtained. A sketch of how such arbitration could be composed from the earlier pieces follows.
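For brevity, the sketch below re-derives the pairing character by character over the text with the sign of Δt, rather than restricting itself to the spans where the two segmentations actually disagree, as a fuller implementation would. It builds on bind_xy from the earlier sketch, and the signature is only an assumption:

```python
def resolve_with_t_test(text, fwd, rev, r1, r2):
    """Arbitrate between disagreeing forward/reverse segmentations:
    greedily merge an adjacent character pair xy whenever delta-t
    binds it (bind_xy is True), otherwise split after x."""
    words, i = [], 0
    while i < len(text):
        w = text[i - 1] if i > 0 else ""          # pad the wxyz window
        x = text[i]
        y = text[i + 1] if i + 1 < len(text) else ""
        z = text[i + 2] if i + 2 < len(text) else ""
        if y and bind_xy(w, x, y, z, r1, r2):
            words.append(x + y)                   # delta-t > 0: one word
            i += 2
        else:
            words.append(x)                       # delta-t <= 0: split
            i += 1
    return words
```

In this simplified sketch the forward and reverse results are accepted by the caller when they agree, and the pairwise rule decides everything else.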
Embodiment three
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that, after the word-segmentation unit obtains a preliminary segmentation, every word that appears in the stop-word list is deleted according to that list, and the dimensionality is then further reduced according to information gain (IG). Information gain measures how much the expected entropy decreases when the sample space is partitioned by an attribute t; the larger IG(t) is, the more t contributes to the overall classification. On the basis of embodiment two, this embodiment removes stop words from the speech document, computes the IG(t) value of each word, and takes the N words with the largest IG values as the finally chosen feature words.
The information gain formula is as follows:

IG(t) = -Σ_i P(C_i) log P(C_i) + P(t) Σ_i P(C_i|t) log P(C_i|t) + P(¬t) Σ_i P(C_i|¬t) log P(C_i|¬t)

where P(C_i) denotes the probability that a text of class C_i occurs in the training sample, P(t) the probability that word t occurs in the training sample, P(¬t) the probability that word t does not occur, P(C_i|t) the probability of belonging to class C_i given that word t occurs, and P(C_i|¬t) the probability of belonging to class C_i given that word t does not occur.
Embodiment four
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the conversion unit converts the speech document into a term vector after the feature words are obtained.
The term vector has the following form:

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words,

and X_i is the i-th speech document and x^(j) is the j-th feature word.
The term vector follows the multinomial model of naive Bayes.
Documents are divided into two classes, normal speech and improper speech, expressed as:

y ∈ Y = {0, 1}

where y = 0 for normal speech and y = 1 for improper speech.
Embodiment five
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the memory unit attaches class labels to the term vectors for training the naive Bayes classifier. The naive Bayes method makes a conditional-independence assumption on the conditional probability distribution: given the class, the features used for classification are conditionally independent. This assumption makes the naive Bayes method simple, at the cost of some accuracy.
The term vectors are labeled manually and used to train the naive Bayes classifier.
This yields the maximum-likelihood estimate of the prior probability:

P(Y = c_k) = Σ_{i=1}^{N} I(y_i = c_k) / N, k = 1, 2, …, K

where I is the indicator function: 1 when y_i = c_k and 0 otherwise.
Let the set of possible values of the j-th feature x^(j) be {a_j1, a_j2, …, a_jS_j}. The maximum-likelihood estimate of the conditional probability P(X^(j) = a_jl | Y = c_k) is:

P(X^(j) = a_jl | Y = c_k) = Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) / Σ_{i=1}^{N} I(y_i = c_k)

where x_i^(j) is the j-th feature of the i-th sample, a_jl is the l-th value that the j-th feature may take, and I is the indicator function.
From the above formulas it is clear that maximum-likelihood estimation may produce a probability estimate of 0, which would distort the computation of the posterior probability and bias the classification. To avoid this, we use Bayesian estimation here. The Bayesian estimates of the conditional probability and the prior probability are, respectively:

P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j λ)

P_λ(Y = c_k) = (Σ_{i=1}^{N} I(y_i = c_k) + λ) / (N + Kλ)

where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate, and λ = 1 gives Laplace smoothing.
Here we take λ = 1 (Laplace smoothing) for both the conditional probability and the prior probability.
From the above, the naive Bayes classifier is derived:

y = arg max_{c_k} [ P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ] / [ Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ]

Since the denominator is the same for all c_k, this is equivalent to:

y = arg max_{c_k} P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)

y is the corresponding class output; from it the class of the input term vector is determined, and hence whether the input speech document is normal, where y = 0 for normal speech and y = 1 for improper speech.
Embodiment six
In the community speech filtering system based on naive Bayes described above, this embodiment differs in that the specific steps by which the system judges whether a speech document is normal speech are as follows:
D1: The word-segmentation unit pre-processes the speech document. Speech documents are divided into two classes, normal speech and improper speech:

y ∈ Y = {0, 1}

where normal speech is labeled 0 and improper speech is labeled 1.
The processed speech document text is obtained, and the word-segmentation unit segments it using bidirectional matching, cutting the text into individual words. If forward maximum matching and reverse maximum matching give the same result, that segmentation is used directly; if their results differ, the t-test difference method is used to resolve the ambiguity.
D2: After segmentation, every word that appears in the stop-word list is deleted. For the remaining words, the information gain of each candidate feature word is computed, the dimensionality is further reduced according to IG, and the N words with the largest information gain are taken as the effective features.
D3: After the effective features are extracted, the speech document is converted into a term vector:

X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

and X_i is the i-th speech document and x^(j) is the j-th feature word.
After all speech documents have been processed, the training samples are obtained.
D4: The Laplace smoothing value (λ = 1) is added to the conditional probability and prior probability estimates given above.
D5: A document to be classified is converted into a term vector as in step D3:

X = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

and x^(j) is the j-th feature word;
then P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) is calculated for each class c_k.
D6: The class of the document to be classified is determined as the c_k maximizing the above quantity. If y = 0, the speech document to be classified is normal speech; otherwise it is improper speech.
D7: The output unit outputs the speech document.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and additions without departing from the method of the present invention, and such improvements and additions shall also be regarded as falling within the protection scope of the present invention.

Claims (7)

  1. A community speech filtering system based on naive Bayes, characterized in that the system comprises a word-segmentation unit, a conversion unit, a memory unit, and an output unit; the word-segmentation unit is configured to pre-process speech documents and comprises a forward module, a reverse module, and a t-test module; the conversion unit is configured to convert a speech document into a term vector after the word-segmentation unit completes segmentation; the memory unit is configured to attach class labels to term vectors for training a naive Bayes classifier; and the output unit is configured to output the speech document.
  2. The community speech filtering system based on naive Bayes according to claim 1, characterized in that the segmentation method of the word-segmentation unit is bidirectional matching, which comprises forward maximum matching and reverse maximum matching;
    The forward maximum matching method includes the following steps:
    A1: Take the next m characters of the text, from left to right, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
    A2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to A1; otherwise go to A3;
    A3: Remove the rightmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to A1; otherwise go to A4;
    A4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step A2;
    The reverse maximum matching method includes the following steps:
    B1: Take the next m characters of the text, from right to left, as word string S; if the length of S is less than 2, segmentation ends and S is returned;
    B2: Look up word string S in the dictionary; if it is found, the match succeeds: output S as a word and go to B1; otherwise go to B3;
    B3: Remove the leftmost character of word string S to obtain word string K; if the length of K is less than 2, segmentation ends: output K and go to B1; otherwise go to B4;
    B4: Look up word string K in the dictionary; if it is found, the match succeeds: output K, take the remainder S-K as the new word string S, and go to step B2.
  3. The community speech filtering system based on naive Bayes according to claim 2, characterized in that, if forward maximum matching and reverse maximum matching produce different segmentations of a speech document, the t-test difference method is used to resolve the ambiguity; for an ordered character string xyz, the t-test of y relative to x and z is defined as:

    t_{x,z}(y) = (ρ(z|y) - ρ(y|x)) / sqrt(σ²(ρ(z|y)) + σ²(ρ(y|x)))

    where ρ(z|y) and ρ(y|x) denote the probability of z given y and the probability of y given x, and σ²(ρ(z|y)), σ²(ρ(y|x)) denote their respective variances; these quantities are estimated as:

    ρ(z|y) ≈ r(y,z) / r(y),  ρ(y|x) ≈ r(x,y) / r(x)
    σ²(ρ(z|y)) ≈ r(y,z) / r(y)²,  σ²(ρ(y|x)) ≈ r(x,y) / r(x)²

    where r(y,z) and r(x,y) denote the frequencies with which the ordered strings yz and xy occur in the dictionary, and r(x), r(y) denote the frequencies with which x and y occur in the dictionary;
    therefore the calculation formula for t_{x,z}(y) is:

    t_{x,z}(y) = (r(y,z)/r(y) - r(x,y)/r(x)) / sqrt(r(y,z)/r(y)² + r(x,y)/r(x)²)

    for an ordered string wxyz, the t-test difference between x and y is:

    Δt(x,y) = t_{w,y}(x) - t_{x,z}(y)

    the result is handled by cases:
    Case 1: t_{w,y}(x) > 0, t_{x,z}(y) < 0, Δt(x,y) > 0: x and y attract each other, so xy forms one word;
    Case 2: t_{w,y}(x) < 0, t_{x,z}(y) > 0, Δt(x,y) < 0: x and y repel each other, so xy is split;
    Case 3: t_{w,y}(x) > 0, t_{x,z}(y) > 0: z attracts y while y attracts x; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split;
    Case 4: t_{w,y}(x) < 0, t_{x,z}(y) < 0: w attracts x while x attracts y; when Δt(x,y) > 0, xy forms one word, and when Δt(x,y) < 0, xy is split.
  4. The community speech filtering system based on naive Bayes according to claim 3, characterized in that, after the word-segmentation unit of the system obtains a preliminary segmentation, every word that appears in the stop-word list is deleted according to that list, and the dimensionality is then further reduced according to information gain (IG);
    The information gain formula is as follows:

    IG(t) = -Σ_i P(C_i) log P(C_i) + P(t) Σ_i P(C_i|t) log P(C_i|t) + P(¬t) Σ_i P(C_i|¬t) log P(C_i|¬t)

    where P(C_i) denotes the probability that a text of class C_i occurs in the training sample, P(t) the probability that word t occurs in the training sample, P(¬t) the probability that word t does not occur, P(C_i|t) the probability of belonging to class C_i given that word t occurs, and P(C_i|¬t) the probability of belonging to class C_i given that word t does not occur.
  5. The community speech filtering system based on naive Bayes according to claim 4, characterized in that the conversion unit is configured to convert the speech document into a term vector after the feature words are obtained, the term vector having the form:

    X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of feature words,

    and X_i is the i-th speech document and x^(j) is the j-th feature word;
    speech documents are divided into two classes, normal speech and improper speech, expressed as:

    y ∈ Y = {0, 1}

    where y = 0 for normal speech and y = 1 for improper speech.
  6. The community speech filtering system based on naive Bayes according to claim 5, characterized in that the memory unit is configured to attach class labels to term vectors for training the naive Bayes classifier;
    the Bayesian estimates of the conditional probability and the prior probability are, respectively:

    P_λ(X^(j) = a_jl | Y = c_k) = (Σ_{i=1}^{N} I(x_i^(j) = a_jl, y_i = c_k) + λ) / (Σ_{i=1}^{N} I(y_i = c_k) + S_j λ)

    P_λ(Y = c_k) = (Σ_{i=1}^{N} I(y_i = c_k) + λ) / (N + Kλ)

    where λ ≥ 0; λ = 0 gives the maximum-likelihood estimate, and λ = 1 gives Laplace smoothing; here both the conditional probability and the prior probability use λ = 1 (Laplace smoothing);
    from the above, the naive Bayes classifier is derived:

    y = arg max_{c_k} [ P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ] / [ Σ_k P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) ]

    since the denominator is the same for all c_k, this is equivalent to:

    y = arg max_{c_k} P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k)

    where y = 0 for normal speech and y = 1 for improper speech.
  7. The community speech filtering system based on naive Bayes according to claim 1, characterized in that the specific steps by which the system judges whether a speech document is normal speech are as follows:
    D1: The word-segmentation unit pre-processes the speech document to obtain the processed text, then segments it using bidirectional matching, cutting the text into individual words; if forward maximum matching and reverse maximum matching give the same result, that segmentation is used directly; if their results differ, the t-test difference method is used to resolve the ambiguity;
    D2: After segmentation, every word that appears in the stop-word list is deleted; for the remaining words, the information gain of each candidate feature word is computed, the dimensionality is further reduced according to IG, and the N words with the largest information gain are taken as the effective features;
    D3: After the effective features are extracted, the speech document is converted into a term vector:

    X_i = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

    and X_i is the i-th speech document and x^(j) is the j-th feature word;
    D4: The Laplace smoothing value (λ = 1) is added to the conditional probability and prior probability estimates;
    D5: A document to be classified is converted into a term vector as in step D3:

    X = (x^(1), …, x^(j), …, x^(n)), where n is the number of features,

    and x^(j) is the j-th feature word;
    then P(Y = c_k) Π_j P(X^(j) = x^(j) | Y = c_k) is calculated for each class c_k;
    D6: The class of the document to be classified is determined; if y = 0, the speech document to be classified is normal speech, otherwise it is improper speech;
    D7: The output unit outputs the speech document.
CN201611254036.8A 2016-12-30 2016-12-30 A community speech filtering system based on naive Bayes Pending CN108268459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611254036.8A CN108268459A (en) 2016-12-30 2016-12-30 A community speech filtering system based on naive Bayes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611254036.8A CN108268459A (en) 2016-12-30 2016-12-30 A community speech filtering system based on naive Bayes

Publications (1)

Publication Number Publication Date
CN108268459A 2018-07-10

Family

ID=62754338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611254036.8A Pending CN108268459A (en) A community speech filtering system based on naive Bayes

Country Status (1)

Country Link
CN (1) CN108268459A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807642A (en) * 2021-06-25 2021-12-17 国网浙江省电力有限公司金华供电公司 Power dispatching intelligent interaction method based on program-controlled telephone

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996241A (en) * 2010-10-22 2011-03-30 东南大学 Bayesian algorithm-based content filtering method
CN103634473A (en) * 2013-12-05 2014-03-12 南京理工大学连云港研究院 Naive Bayesian classification based mobile phone spam short message filtering method and system
CN105975454A (en) * 2016-04-21 2016-09-28 广州精点计算机科技有限公司 Chinese word segmentation method and device of webpage text

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996241A (en) * 2010-10-22 2011-03-30 东南大学 Bayesian algorithm-based content filtering method
CN103634473A (en) * 2013-12-05 2014-03-12 南京理工大学连云港研究院 Naive Bayesian classification based mobile phone spam short message filtering method and system
CN105975454A (en) * 2016-04-21 2016-09-28 广州精点计算机科技有限公司 Chinese word segmentation method and device of webpage text

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
JIHITE: "Naive Bayes method (2): basic method", https://www.cnblogs.com/kaituorensheng/p/3379478.html *
单月光: "Research and implementation of key technologies for network public-opinion analysis based on microblogs", China Master's Theses Full-text Database, Information Science and Technology *
徐英慧 et al.: "Research on content-based filtering strategies for spam SMS on mobile phones", Journal of Beijing Information Science and Technology University *
曹卫峰: "Research on key technologies of Chinese word segmentation", China Master's Theses Full-text Database, Information Science and Technology *
王思力: "Research on Chinese word segmentation technology for large-scale information retrieval", China Master's Theses Full-text Database, Information Science and Technology *
马刚: "Semantic-based Web Data Mining", 31 January 2014 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180710)