Summary of the Invention
Traditional feature-word filtering, when applied to massive volumes of Internet text, lacks effective semantic constraints. It is therefore prone to false positives and missed detections, and cannot accurately detect the public sentiment events that require attention. To address this problem, the present invention proposes a public sentiment event detection method and device.
In a first aspect, the present invention proposes a public sentiment event detection method, comprising:
obtaining a feature word vector of a text to be detected, wherein each element of the feature word vector indicates whether the corresponding feature word occurs in the text to be detected;
obtaining, from a semantic knowledge base, the vectors corresponding to all feature words, and obtaining sensitive sense vectors from a sensitive lexicon, wherein the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and wherein a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is a sensitive sense;
calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all feature words include all sensitive sense vectors;
obtaining the first sensitive sense, namely the sensitive sense corresponding to the maximum similarity; obtaining the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; calculating, according to a first preset weight and a second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold.
Preferably, before the feature word vector of the text to be detected is obtained, the method further comprises:
building the semantic knowledge base from web page content.
Preferably, the web page content is stored in XML files.
Preferably, the web page content is Wikipedia.
Preferably, after the semantic knowledge base is built from the web page content, the method further comprises:
building the sensitive lexicon from the semantic knowledge base and the sensitive senses of preset feature words.
In a second aspect, the present invention further proposes a public sentiment event detection device, comprising:
a feature word vector acquisition module, configured to obtain a feature word vector of a text to be detected, wherein each element of the feature word vector indicates whether the corresponding feature word occurs in the text to be detected;
a corresponding vector acquisition module, configured to obtain, from a semantic knowledge base, the vectors corresponding to all feature words, and to obtain sensitive sense vectors from a sensitive lexicon, wherein the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and wherein a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is a sensitive sense;
a similarity calculation module, configured to calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all feature words include all sensitive sense vectors;
an event detection module, configured to obtain the first sensitive sense, namely the sensitive sense corresponding to the maximum similarity, and to obtain the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; to calculate, according to a first preset weight and a second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and to determine that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold.
Preferably, the device further comprises:
a semantic knowledge base building module, configured to build the semantic knowledge base from web page content.
Preferably, the web page content is stored in XML files.
Preferably, the web page content is Wikipedia.
Preferably, the device further comprises:
a sensitive lexicon building module, configured to build the sensitive lexicon from the semantic knowledge base and the sensitive senses of preset feature words.
As can be seen from the above technical solution, the present invention achieves an effective semantic constraint by vectorizing the text to be detected. At the same time, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public sentiment events that require attention, greatly reducing the probability of false positives and missed detections.
Detailed Description of the Embodiments
Embodiments of the invention are further described below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution of the present invention more clearly and do not limit its scope of protection.
Fig. 1 is a schematic flowchart of a public sentiment event detection method provided by an embodiment of the present invention, comprising:
S101: obtaining a feature word vector of a text to be detected, wherein each element of the feature word vector indicates whether the corresponding feature word occurs in the text to be detected;
S102: obtaining, from a semantic knowledge base, the vectors corresponding to all feature words, and obtaining sensitive sense vectors from a sensitive lexicon, wherein the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and wherein a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is a sensitive sense;
S103: calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all feature words include all sensitive sense vectors;
S104: obtaining the first sensitive sense, namely the sensitive sense corresponding to the maximum similarity; obtaining the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; calculating, according to a first preset weight and a second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold.
Here, when the feature word corresponding to an element of the feature word vector is a sensitive word, that element may be set to 0.
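The occurrence vector of step S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `feature_word_vector`, the English vocabulary, and the sample tokens are all hypothetical, and the text is assumed to have already been segmented into tokens.

```python
from typing import AbstractSet, List

def feature_word_vector(tokens: List[str],
                        feature_words: List[str],
                        sensitive_words: AbstractSet[str] = frozenset()) -> List[int]:
    """Return a 0/1 vector aligned with feature_words.

    Element k is 1 if feature_words[k] occurs in tokens, else 0.
    Elements whose feature word is listed in sensitive_words are forced
    to 0, mirroring the optional masking described in the text.
    """
    present = set(tokens)
    return [
        1 if (word in present and word not in sensitive_words) else 0
        for word in feature_words
    ]

# Hypothetical vocabulary and an already segmented text.
vocab = ["protest", "apple", "strike", "market"]
tokens = ["workers", "strike", "near", "the", "market"]

print(feature_word_vector(tokens, vocab))              # [0, 0, 1, 1]
print(feature_word_vector(tokens, vocab, {"strike"}))  # [0, 0, 0, 1]
```

Masking the sensitive word leaves a representation of the surrounding context only, which is what the later similarity calculation compares each candidate sense against.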
The present embodiment achieves an effective semantic constraint by vectorizing the text to be detected. At the same time, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public sentiment events that require attention, greatly reducing the probability of false positives and missed detections.
As an alternative of the present embodiment, the method further comprises, before step S101:
S100: building the semantic knowledge base from web page content.
Building the semantic knowledge base allows ambiguous public sentiment sensitive words to be labeled with their senses. This provides semantic support for analyzing and detecting public sentiment events, and a basis for finding the correct meaning of the sensitive words in the text to be detected. Public sentiment feature words are often a direct expression of public sentiment, yet they can carry different meanings in different contexts; such ambiguous feature words therefore often cause false positives in text-filtering preprocessing. With the accurate descriptions that the semantic knowledge base provides, the meaning a word expresses in a specific context can be identified.
The vectors corresponding to the feature words stored in the semantic knowledge base are obtained by training the word2vec tool on text that has been preprocessed by word segmentation. Each segmented word (that is, each feature word in the text to be detected) can be effectively represented by a vector of a certain dimension, as shown in the table.
Specifically, the web page content is stored in XML files.
For example, the web page content is Wikipedia.
Wikipedia is one of the largest online encyclopedias. It uses a wiki mechanism of collaborative online editing and is characterized by high quality, broad coverage, real-time updates, and a semi-structured format, which makes it a high-quality corpus source for building the semantic knowledge base. In particular, for the ambiguous words in Wikipedia, the senses that reflect public sentiment features are labeled manually to support subsequent early-warning analysis. Taking the Wikipedia corpus in XML format as input, the description of each word is extracted; each entry is analyzed to determine whether it is an ambiguous word or a redirect and whether conversion between traditional and simplified Chinese is required; the abstract (introductory) part is retained; and sensitive feature words are labeled.
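The extraction pass described above might look roughly like the sketch below. It rests on several assumptions: the XML is a simplified, namespace-free stand-in for a Wikipedia export (real dumps use a namespaced and more deeply nested schema), the element names and the `parse_pages` helper are hypothetical, ambiguity is detected only by a "(disambiguation)" suffix in the title, and traditional/simplified conversion and the manual labeling of sensitive senses are left out.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; not the real Wikipedia export schema.
SAMPLE = """<pages>
  <page>
    <title>Apple</title>
    <text>Apple is a fruit. It grows on trees.

== History ==
More text.</text>
  </page>
  <page>
    <title>Apples</title>
    <redirect title="Apple" />
    <text>#REDIRECT [[Apple]]</text>
  </page>
  <page>
    <title>Strike (disambiguation)</title>
    <text>Strike may refer to several things.</text>
  </page>
</pages>"""

def parse_pages(xml_text):
    """Extract title, redirect target, ambiguity flag and abstract per page."""
    entries = []
    for page in ET.fromstring(xml_text).iter("page"):
        title = page.findtext("title", default="")
        text = page.findtext("text", default="")
        redirect = page.find("redirect")
        # Keep only the abstract: everything before the first section heading.
        abstract = text.split("\n==", 1)[0].strip()
        entries.append({
            "title": title,
            "is_redirect": redirect is not None,
            "redirect_to": redirect.get("title") if redirect is not None else None,
            "is_ambiguous": "(disambiguation)" in title,
            "abstract": abstract,
        })
    return entries

entries = parse_pages(SAMPLE)
for entry in entries:
    print(entry["title"], entry["is_redirect"], entry["is_ambiguous"])
```

Each resulting record can then be reviewed by an annotator, who marks which senses of an ambiguous word reflect public sentiment.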
By drawing on the rich semantic knowledge in Wikipedia, public sentiment sensitive words can be added automatically and the range of public sentiment events that can be characterized is expanded, which helps users better grasp public sentiment trends and formulate corresponding countermeasures.
Further, the method comprises, after step S100:
S1001: building the sensitive lexicon from the semantic knowledge base and the sensitive senses of preset feature words.
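One way to picture step S1001 is as a filter over the knowledge base entries. The sketch below is illustrative only: the `SenseEntry` record simply mirrors the four elements the text lists for a feature word's vector (word, sensitive flag, current sense, embedding), and the names, sample senses, and two-dimensional vectors are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple

class SenseEntry(NamedTuple):
    word: str            # current feature word
    is_sensitive: bool   # whether this sense was labeled sensitive
    sense: str           # current sense of the word
    vector: List[float]  # embedding of this sense

def build_sensitive_lexicon(knowledge_base: Iterable[SenseEntry],
                            preset_words: Iterable[str]) -> Dict[str, List[SenseEntry]]:
    """Collect the sensitive senses of the preset feature words."""
    wanted = set(preset_words)
    lexicon: Dict[str, List[SenseEntry]] = {}
    for entry in knowledge_base:
        if entry.word in wanted and entry.is_sensitive:
            lexicon.setdefault(entry.word, []).append(entry)
    return lexicon

# Hypothetical knowledge base with an ambiguous word "strike".
kb = [
    SenseEntry("strike", True, "labor strike", [0.9, 0.1]),
    SenseEntry("strike", False, "bowling strike", [0.1, 0.9]),
    SenseEntry("market", False, "marketplace", [0.5, 0.5]),
]
lexicon = build_sensitive_lexicon(kb, ["strike", "market"])
print({word: [e.sense for e in senses] for word, senses in lexicon.items()})
# {'strike': ['labor strike']}
```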
When the text to be detected is processed, sensitive words may be handled clause by clause. Specifically, the feature words in the feature word vector of each clause of the text to be detected are matched against the vectors corresponding to the feature words in the semantic knowledge base. The similarity between the senses of different feature words, and the similarity between each sense and the text to be detected, are then calculated; the higher the similarity, the closer that sense is to the word's true meaning in the text. That sense is then selected and matched against the sensitive words. An optimization method is used to obtain the exact meaning of each ambiguous word in the text at the maximum of the objective function. The formula is as follows:
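\[
\begin{aligned}
&\max\; f(w_i)\\
&f(w_i) = f(w_{i+1}) + \mathrm{Sim}(w_i, w_{i+1}) + \mathrm{Sim}(w_i, \mathrm{doc}_i)\\
&\text{s.t.}\quad w_i \in \{v_1, v_2, \ldots, v_m\},\\
&\phantom{\text{s.t.}\quad} \mathrm{doc}_i = (w_1, w_2, \ldots, w_n),\quad w_i = 0
\end{aligned}
\]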
Here, $w_i$ denotes a feature word in the text to be detected; $f(w_i)$ denotes the semantic similarity value from word $w_i$ to the word at the end of the sentence; $\mathrm{doc}_i$ is the vector representation of the text with the sensitive word removed, that is, with the element at the corresponding position set to 0; $v_1, v_2, \ldots$ are the vectors corresponding to the feature word: a non-ambiguous word has a single vector representation, whereas an ambiguous word has several; $\mathrm{Sim}(w_i, w_{i+1})$ is the function that calculates the similarity of adjacent sensitive words; and $\mathrm{Sim}(w_i, \mathrm{doc}_i)$ is the function that calculates the similarity between a sensitive word and the text. Because both words and text are represented by word vectors, cosine similarity can be used as the similarity function.
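Because the objective links each word only to its right-hand neighbor, one plausible way to maximize it is a backward dynamic program over the candidate senses. The source says only that "an optimization method" is used, so the sketch below is an assumption, as are the base case f(w_n) = Sim(w_n, doc_n), the function names `cosine` and `best_senses`, and the choice to pass in each context vector doc_i precomputed in the same embedding space as the senses.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_senses(candidates: List[List[Vector]],
                docs: List[Vector]) -> Tuple[float, List[int]]:
    """Choose one sense index per word so that f(w_1) is maximal.

    candidates[i] holds the sense vectors v_1..v_m of word i (non-empty);
    docs[i] is the context vector doc_i with word i masked out.
    """
    n = len(candidates)
    if n == 0:
        return 0.0, []
    # score[i][k]: best f(w_i) when word i takes sense k.
    score = [[0.0] * len(c) for c in candidates]
    nxt = [[-1] * len(c) for c in candidates]
    for i in range(n - 1, -1, -1):
        for k, v in enumerate(candidates[i]):
            local = cosine(v, docs[i])
            if i == n - 1:
                score[i][k] = local  # assumed base case at the sentence end
                continue
            best_j, best = max(
                ((j, score[i + 1][j] + cosine(v, u))
                 for j, u in enumerate(candidates[i + 1])),
                key=lambda pair: pair[1],
            )
            score[i][k] = best + local
            nxt[i][k] = best_j
    k = max(range(len(candidates[0])), key=lambda j: score[0][j])
    total, path = score[0][k], []
    for i in range(n):
        path.append(k)
        k = nxt[i][k]
    return total, path

# Word 0 is ambiguous (two senses); word 1 has a single sense.
cands = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]]
docs = [[0.0, 1.0], [0.0, 1.0]]
total, path = best_senses(cands, docs)
print(total, path)  # 3.0 [1, 0]
```

In the example, the second sense of the first word agrees with both its neighbor and its context, so it is selected. The recursion costs on the order of n·m² similarity evaluations for n words with at most m senses each.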
For example, when a public sentiment event is detected from a text to be detected, as shown in Fig. 2, word segmentation and stop-word removal may first be performed on the text. Word segmentation means dividing the sentences of the text to be detected into multiple feature words; stop-word removal means deleting the stop words in the text to be detected, such as "meanwhile" and "in addition".
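The preprocessing step can be pictured with the following sketch. It is only a stand-in: the source does not name a word segmenter, so a regular-expression tokenizer on English text takes the place of Chinese word segmentation, and the stop-word list and the `preprocess` helper are hypothetical. Splitting by clause follows the clause-by-clause processing mentioned earlier.

```python
import re
from typing import List

# Hypothetical stop-word list; a real system would load a curated one.
STOP_WORDS = {"the", "a", "and", "at", "in", "of", "meanwhile"}

def preprocess(text: str) -> List[List[str]]:
    """Split text into clauses, tokenize each clause, drop stop words."""
    clauses = [c for c in re.split(r"[.,;!?]+", text) if c.strip()]
    return [
        [tok for tok in re.findall(r"\w+", clause.lower()) if tok not in STOP_WORDS]
        for clause in clauses
    ]

result = preprocess("Workers strike at the plant, and meanwhile the market fell.")
print(result)  # [['workers', 'strike', 'plant'], ['market', 'fell']]
```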
Then, using word2vec, the vectors of the sensitive senses in the text to be detected are obtained from the semantic knowledge base and the sensitive lexicon, so that similarity can subsequently be calculated for adjacent words in the sentences of the text to be detected.
Next, the sensitive sense vectors of each feature word are compared, by similarity calculation, with the vectors corresponding to the other feature words and with the feature word vector of the text to be detected. The meaning of each sensitive sense at maximum similarity is taken, which yields the sensitive senses that combine plausibly with the other words and with the text to be detected, and thereby determines the specific meaning of each feature word in the text to be detected.
Finally, a weighted sum is computed over the named entities and the sensitive senses in the text; if it exceeds a certain threshold, the text is judged to describe a public sentiment event that requires early warning. Here, "named entities" refers to the number of feature words in the text to be detected.
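The final decision reduces to a weighted sum compared against a threshold. The sketch below illustrates it; the function name and the weight and threshold values are hypothetical, since the source states only that the two weights and the threshold are preset.

```python
def is_public_sentiment_event(num_sensitive_senses: int,
                              num_feature_words: int,
                              w1: float,
                              w2: float,
                              threshold: float) -> bool:
    """Return True when w1 * (sensitive sense count) + w2 * (feature word count)
    is strictly greater than the threshold."""
    score = w1 * num_sensitive_senses + w2 * num_feature_words
    return score > threshold

# Hypothetical weights and threshold, for illustration only.
print(is_public_sentiment_event(3, 5, w1=0.7, w2=0.3, threshold=3.0))  # True
print(is_public_sentiment_event(1, 2, w1=0.7, w2=0.3, threshold=3.0))  # False
```

In practice the weights and threshold would be tuned on labeled examples so that the first weight reflects how strongly a confirmed sensitive sense should count relative to a plain feature word.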
The present embodiment uses the different senses of the feature words, together with the labeled information of all feature words in the text to be detected, to perform semantic recognition by supervised learning. This avoids the erroneous detection of public sentiment events that results from relying on keyword matching alone, so that public sentiment events are identified accurately and early warnings are issued for those that require them.
Fig. 3 is a schematic structural diagram of a public sentiment event detection device provided by an embodiment of the present invention, comprising:
a feature word vector acquisition module 31, configured to obtain a feature word vector of a text to be detected, wherein each element of the feature word vector indicates whether the corresponding feature word occurs in the text to be detected;
a corresponding vector acquisition module 32, configured to obtain, from a semantic knowledge base, the vectors corresponding to all feature words, and to obtain sensitive sense vectors from a sensitive lexicon, wherein the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and wherein a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is a sensitive sense;
a similarity calculation module 33, configured to calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all feature words include all sensitive sense vectors;
an event detection module 34, configured to obtain the first sensitive sense, namely the sensitive sense corresponding to the maximum similarity, and to obtain the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; to calculate, according to a first preset weight and a second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and to determine that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold.
The present embodiment achieves an effective semantic constraint by vectorizing the text to be detected. At the same time, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public sentiment events that require attention, greatly reducing the probability of false positives and missed detections.
As an alternative of the present embodiment, the device further comprises:
a semantic knowledge base building module, configured to build the semantic knowledge base from web page content.
Specifically, the web page content is stored in XML files.
For example, the web page content is Wikipedia.
Further, the device comprises:
a sensitive lexicon building module, configured to build the sensitive lexicon from the semantic knowledge base and the sensitive senses of preset feature words.