CN107193796B

CN107193796B - Public opinion event detection method and device

Info

Publication number: CN107193796B
Application number: CN201610197073.3A
Authority: CN
Inventors: 蔡慧慧; 刘克松; 张丹; 于晓明; 杨建武
Original assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Beijing Founder Electronics Co Ltd
Priority date: 2016-03-14
Filing date: 2016-03-31
Publication date: 2021-12-24
Anticipated expiration: 2036-03-31
Also published as: CN107193796A

Abstract

The invention discloses a public sentiment event detection method and device. The method comprises: acquiring a feature word vector of a text to be detected; obtaining the vectors corresponding to all feature words and obtaining sensitive semantic item vectors; calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words; obtaining the first sensitive semantic item that corresponds to the maximum similarity; obtaining the number of first sensitive semantic items and the number of feature words in the text to be detected; calculating the weighted sum of these two numbers according to a first preset weight and a second preset weight; and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold. Vectorizing the text to be detected provides an effective semantic constraint, and calculating the similarity between the feature word vector of the text and the feature word vectors corresponding to all feature words allows public sentiment events that need attention to be detected accurately.

Description

Public opinion event detection method and device

Technical Field

The invention relates to the technical field of computers, in particular to a public sentiment event detection method and device.

Background

With the rapid development of the internet, online public sentiment has become a channel through which ordinary people express their interests and call for social fairness and justice, continuously conveying public thinking and positions to governments at all levels in China. More and more people are willing to post their ideas and the phenomena they observe online, and online dissemination draws in still more people, which greatly influences the emotions of netizens and social stability. Accurately detecting public sentiment events with modern technology is therefore of great significance.

At present, detection of public sentiment events still relies on semantic matching against a set of public sentiment sensitive words, and named entities associated with public sentiment events, such as personal names, transliterated foreign personal names, and organization names, reflect public sentiment only when they appear in the context of the associated event. For named entities that share a name, the meaning must be analyzed in combination with the context of the current public sentiment event, and for such ambiguous feature words a traditional static corpus may not contain the latest explanatory sense. The traditional filtering method based on public sentiment feature words (sensitive words, named entities, and the like) remains an important preprocessing means because it is simple to implement and efficient to execute. However, when faced with massive internet text, especially fragmented and irregular social media content, this preprocessing and filtering mechanism lacks effective semantic constraints and produces false positives, which easily leads to misjudgments and missed judgments, so public sentiment events that need attention cannot be identified accurately. In big-data applications of online public opinion early warning, this passes considerable noise to subsequent processing, so a data preprocessing mechanism with semantic understanding capability is urgently needed.

Disclosure of Invention

The invention provides a public opinion event detection method and device to solve the problem that traditional feature word filtering lacks effective semantic constraints when faced with massive internet text, which easily causes misjudgments and missed judgments, so that public opinion events needing attention cannot be detected accurately.

In a first aspect, the present invention provides a method for detecting a public sentiment event, including:

acquiring a feature word vector of a text to be detected, wherein elements of the feature word vector represent whether corresponding feature words appear in the text to be detected;

obtaining vectors corresponding to all feature words from a semantic knowledge base, and obtaining sensitive semantic item vectors from a sensitive word base, wherein elements of the vectors corresponding to the feature words comprise current feature words, whether the current feature words contain sensitive semantic items, current semantic items of the current feature words and feature word vectors corresponding to the current feature words, and the sensitive semantic item vectors represent that semantic items in the vectors corresponding to the current feature words are current sensitive semantic items;

calculating similarity between a feature word vector of a text to be detected and feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all the feature words comprise all sensitive meaning item vectors;

obtaining the first sensitive meaning item that corresponds to the maximum similarity, obtaining the number of first sensitive meaning items in the text to be detected and the number of feature words in the text to be detected, calculating the weighted sum of the number of first sensitive meaning items and the number of feature words according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold value.

Preferably, before the acquiring of the feature word vector of the text to be detected, the method further includes:

constructing the semantic knowledge base according to web page content.

Preferably, the web page content is stored in an xml format file.

Preferably, the web page content is Wikipedia.

Preferably, after the building of the semantic knowledge base according to the web page content, the method further includes:

establishing a sensitive word bank according to the semantic knowledge base and the sensitive meaning items of preset feature words.

In a second aspect, the present invention further provides a public sentiment event detecting apparatus, including:

the characteristic word vector acquisition module is used for acquiring a characteristic word vector of the text to be detected, wherein elements of the characteristic word vector indicate whether corresponding characteristic words appear in the text to be detected;

the corresponding vector acquisition module is used for acquiring vectors corresponding to all the feature words from the semantic knowledge base and acquiring sensitive semantic item vectors from the sensitive word base, wherein elements of the vectors corresponding to the feature words comprise current feature words, whether the current feature words contain sensitive semantic items, current semantic items of the current feature words and feature word vectors corresponding to the current feature words, and the sensitive semantic item vectors represent that semantic items in the vectors corresponding to the current feature words are current sensitive semantic items;

the similarity calculation module is used for calculating the similarity between the feature word vectors of the text to be detected and the feature word vectors corresponding to all the feature words, wherein the feature word vectors corresponding to all the feature words comprise all sensitive semantic item vectors;

the event detection module is used for acquiring a corresponding first sensitive semantic item when the similarity is maximum, and acquiring the number of the first sensitive semantic items in the text to be detected and the number of the feature words in the text to be detected; and calculating the weighted sum of the number of the first sensitive semantic items and the number of the characteristic words according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold value.

Preferably, the apparatus further includes:

a semantic knowledge base construction module, used for constructing the semantic knowledge base according to web page content.

Preferably, the web page content is stored in an xml format file.

Preferably, the web page content is Wikipedia.

Preferably, the apparatus further includes:

a sensitive word bank establishing module, used for establishing a sensitive word bank according to the semantic knowledge base and the sensitive meaning items of preset feature words.

According to this technical solution, vectorizing the text to be detected provides an effective semantic constraint. Meanwhile, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, public sentiment events that need attention can be detected accurately, and the probability of misjudgment and missed judgment is greatly reduced.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart illustrating a public sentiment event detection method according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating a public sentiment event detection method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a public sentiment event detection device according to an embodiment of the present invention.

Detailed Description

The following further describes embodiments of the invention with reference to the drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

Fig. 1 is a flowchart illustrating a public sentiment event detection method according to an embodiment of the present invention, including:

S101, obtaining a feature word vector of a text to be detected, wherein elements of the feature word vector indicate whether corresponding feature words appear in the text to be detected;

S102, obtaining vectors corresponding to all feature words from a semantic knowledge base, and obtaining sensitive semantic item vectors from a sensitive word base, wherein elements of the vectors corresponding to the feature words comprise current feature words, whether the current feature words contain sensitive semantic items, current semantic items of the current feature words and feature word vectors corresponding to the current feature words, and the sensitive semantic item vectors indicate that semantic items in the vectors corresponding to the current feature words are the current sensitive semantic items;

S103, calculating similarity between the feature word vectors of the text to be detected and the feature word vectors corresponding to all the feature words, wherein the feature word vectors corresponding to all the feature words comprise all sensitive meaning item vectors;

S104, obtaining the first sensitive semantic item that corresponds to the maximum similarity, obtaining the number of first sensitive semantic items in the text to be detected and the number of feature words in the text to be detected, calculating the weighted sum of the number of first sensitive semantic items and the number of feature words according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold value.

When the feature word corresponding to the element of the feature word vector is a sensitive word, the corresponding element may be set to 0.
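The presence vector of step S101, together with the zeroing of sensitive-word positions just described, can be pictured with a minimal Python sketch. The function name, the vocabulary, and the sample tokens are hypothetical and chosen only for illustration; the source does not fix a vocabulary or an encoding beyond "appears or not".

```python
def feature_word_vector(tokens, vocabulary, sensitive_words=frozenset()):
    """Return a 0/1 vector over `vocabulary` (hypothetical helper).

    An element is 1 when the vocabulary word occurs in `tokens`.
    Positions of sensitive words are forced to 0, as the text permits.
    """
    present = set(tokens)
    return [
        1 if (word in present and word not in sensitive_words) else 0
        for word in vocabulary
    ]


# Illustrative data only.
vocabulary = ["protest", "strike", "price", "city"]
tokens = ["strike", "price", "rises", "in", "city"]

plain = feature_word_vector(tokens, vocabulary)
masked = feature_word_vector(tokens, vocabulary, sensitive_words={"strike"})
print(plain, masked)
```

Keeping the masked position at 0 gives a representation of the surrounding text alone, which is what the later similarity comparison against a sensitive word needs.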

In this embodiment, vectorizing the text to be detected provides an effective semantic constraint. Meanwhile, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, public sentiment events that need attention can be detected accurately, and the probability of misjudgment and missed judgment is greatly reduced.

As an optional refinement of this embodiment, before step S101 the method further includes:

S100, building the semantic knowledge base according to web page content.

Building the semantic knowledge base and marking the ambiguity of public sentiment sensitive words provides semantic support for analyzing and detecting public sentiment events, and a basis for finding the correct meaning of sensitive words in the text to be detected. Public sentiment feature words are often a direct embodiment of public sentiment, but they can carry different meanings in different contexts, so ambiguous public sentiment feature words often introduce false positives into text filtering and preprocessing. With the descriptions given in the semantic knowledge base, the meaning a word expresses in a specific context can be recognized.
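One way to picture a knowledge base entry with ambiguity marking is a small record holding the four elements listed in step S102: the feature word, whether it has a sensitive sense, the current sense, and the vector. The following Python sketch is an assumption about layout only; the class name, field names, example senses, and vector values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SenseEntry:
    feature_word: str          # the current feature word
    has_sensitive_sense: bool  # whether the word has any sensitive sense
    sense: str                 # the current sense (meaning item)
    vector: Tuple[float, ...]  # vector for this sense (toy values here)


def group_by_word(entries: List[SenseEntry]) -> Dict[str, List[SenseEntry]]:
    """Index entries by feature word; one list element per sense."""
    kb: Dict[str, List[SenseEntry]] = {}
    for entry in entries:
        kb.setdefault(entry.feature_word, []).append(entry)
    return kb


def is_ambiguous(kb: Dict[str, List[SenseEntry]], word: str) -> bool:
    """A word is treated as ambiguous when it has more than one sense."""
    return len(kb.get(word, [])) > 1


entries = [
    SenseEntry("strike", True, "labor action", (0.9, 0.1)),
    SenseEntry("strike", True, "bowling score", (0.0, 1.0)),
    SenseEntry("factory", False, "industrial site", (0.8, 0.2)),
]
kb = group_by_word(entries)
print(is_ambiguous(kb, "strike"), is_ambiguous(kb, "factory"))
```

Storing one entry per sense means an unambiguous word has a single vector and an ambiguous word has several, which matches how the vectors are described later in the formula's definitions.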

The vectors corresponding to the feature words stored in the semantic knowledge base are obtained by training the deep learning tool word2vec on text that has been preprocessed by word segmentation. Each segmented word (namely, each feature word in the text to be detected) can be effectively represented by a vector of a certain dimension. As shown in the following table

Specifically, the web page content is stored in an xml format file.

For example, the web page content is Wikipedia.

Wikipedia is one of the largest online encyclopedias. It adopts a Wiki mechanism of collaborative online editing, is characterized by high quality, wide coverage, real-time evolution, and semi-structured content, and is a high-quality corpus source for constructing a semantic knowledge base. In particular, for ambiguous words in Wikipedia, the senses that reflect public sentiment characteristics are manually marked to support subsequent early warning analysis. The method takes the Wikipedia corpus in xml format as input, extracts the description of each word from it, analyzes whether the word is an ambiguous word or a redirect word and whether Traditional-to-Simplified Chinese conversion is needed, keeps the introductory summary part, and labels sensitive feature words.

Drawing on the rich semantic knowledge in Wikipedia, public sentiment sensitive words can be added automatically and the range of public sentiment events that can be represented is expanded, which helps users better grasp public sentiment trends and take countermeasures to deal with public sentiment events.
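The extraction pass over the xml corpus might look roughly like the Python sketch below, which uses the standard library `xml.etree.ElementTree`. It makes several assumptions that the source does not state: the sample follows the general page/title/redirect/revision/text layout of a MediaWiki export but omits the XML namespace that real dumps declare, ambiguity is detected by a `{{disambiguation}}` template marker, and the summary is taken as the text before the first section heading. Traditional-to-Simplified conversion and sensitive-word labeling are left out.

```python
import xml.etree.ElementTree as ET

# Toy input; real dumps are far larger and carry an XML namespace.
SAMPLE = """<mediawiki>
  <page>
    <title>Strike</title>
    <revision><text>Strike may refer to several things. {{disambiguation}}</text></revision>
  </page>
  <page>
    <title>Walkout</title>
    <redirect title="Strike action" />
    <revision><text>#REDIRECT [[Strike action]]</text></revision>
  </page>
  <page>
    <title>Strike action</title>
    <revision><text>Strike action is a work stoppage.

== History ==
Details.</text></revision>
  </page>
</mediawiki>"""


def parse_pages(xml_text):
    """Yield one dict per page with redirect/ambiguity flags and a summary."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        text = page.findtext("revision/text") or ""
        redirect = page.find("redirect")
        yield {
            "title": page.findtext("title"),
            "is_redirect": redirect is not None,
            "redirect_target": redirect.get("title") if redirect is not None else None,
            # Assumed marker for ambiguous (disambiguation) pages.
            "is_ambiguous": "{{disambiguation}}" in text.lower(),
            # Keep only the introductory part before the first heading.
            "summary": text.split("\n==", 1)[0].strip(),
        }


for record in parse_pages(SAMPLE):
    print(record["title"], record["is_redirect"], record["is_ambiguous"])
```

For a full dump, a streaming parser such as `ET.iterparse` would avoid loading the whole file into memory.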

Further, step S100 is followed by:

S1001, establishing a sensitive word bank according to the semantic knowledge base and the sensitive meaning items of preset feature words.
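Step S1001 can be read as selecting, from the semantic knowledge base, the senses that have been preset as sensitive. The Python sketch below illustrates that reading under assumed data shapes (nested dictionaries keyed by word and sense label); the function name and the sample entries are hypothetical.

```python
def build_sensitive_word_bank(knowledge_base, preset_sensitive_senses):
    """Select the preset sensitive senses that exist in the knowledge base.

    knowledge_base: {word: {sense_label: vector}}           (assumed shape)
    preset_sensitive_senses: {word: set of sense labels}    (assumed shape)
    Returns {word: {sense_label: vector}} for sensitive senses only.
    """
    bank = {}
    for word, labels in preset_sensitive_senses.items():
        senses = knowledge_base.get(word, {})
        kept = {label: senses[label] for label in labels if label in senses}
        if kept:
            bank[word] = kept
    return bank


knowledge_base = {
    "strike": {"labor action": (0.9, 0.1), "bowling score": (0.0, 1.0)},
    "factory": {"industrial site": (0.8, 0.2)},
}
preset = {"strike": {"labor action"}, "unknown": {"anything"}}
print(build_sensitive_word_bank(knowledge_base, preset))
```

Words or senses that are preset but absent from the knowledge base are simply skipped here; whether the original system reports them instead is not stated.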

When the text to be detected is processed, sensitive words can be handled clause by clause. Specifically, the feature words in the feature word vector of a clause of the text to be detected are matched against the vectors corresponding to those feature words in the semantic knowledge base. The similarity between the senses of different feature words, and between each sense and the text to be detected, is calculated, and the sense with the higher similarity is selected for the sensitive word, since it is closer to the word's real meaning in the text. An optimization method is then used to find the accurate sense of each ambiguous word in the text at which the objective function reaches its maximum. The calculation formula is as follows:

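$$\max f(w_i)$$

$$f(w_i) = f(w_{i+1}) + \mathrm{Sim}(w_i, w_{i+1}) + \mathrm{Sim}(w_i, \mathrm{doc}_i)$$

s.t.

$$w_i \in \{v_1, v_2, \ldots, v_m\}$$

$$\mathrm{doc}_i = (w_1, w_2, \ldots, w_n), \quad w_i = 0$$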

wherein: $w_i$ represents a feature word in the text to be detected; $f(w_i)$ represents the semantic similarity value from $w_i$ to the end of the sentence; $\mathrm{doc}_i$ is the vector representation of the text after the sensitive word is removed, namely the element at the corresponding position is set to 0; $v_1, v_2, \ldots$ are the vectors corresponding to the feature word, where a non-ambiguous word has one vector representation and an ambiguous word has multiple vector representations; $\mathrm{Sim}(w_i, w_{i+1})$ is a function that calculates the similarity of adjacent sensitive words; and $\mathrm{Sim}(w_i, \mathrm{doc}_i)$ is a function that calculates the similarity of the sensitive word to the text. Because words and texts are both represented by word vectors, the similarity function may use cosine similarity.
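The source does not name the optimization method. As one possible reading, if the recursion is assumed to end with f equal to 0 after the last word, it unrolls into the sum of adjacent-word similarities plus each word's similarity to its context, and that sum can be maximized over sense choices with a backward dynamic program. The Python sketch below does this with cosine similarity. The context vector used for doc_i (the sum of the vectors of the unambiguous words, with position i left out) is an assumption made so the example is concrete; all names and vector values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def context_vector(candidates, skip):
    """Assumed doc_i: sum of unambiguous word vectors, position `skip` zeroed."""
    total = [0.0] * len(candidates[0][0][1])
    for i, senses in enumerate(candidates):
        if i != skip and len(senses) == 1:
            total = [t + x for t, x in zip(total, senses[0][1])]
    return total


def disambiguate(candidates):
    """candidates[i] is a list of (sense_label, vector) for word i.

    Returns (best_score, labels) maximizing the unrolled objective.
    """
    n = len(candidates)
    best = [[0.0] * len(c) for c in candidates]  # best[i][k] = max f(w_i) with sense k
    nxt = [[0] * len(c) for c in candidates]     # chosen sense index of word i+1
    for i in range(n - 1, -1, -1):
        doc_i = context_vector(candidates, i)
        for k, (_, vec) in enumerate(candidates[i]):
            score = cosine(vec, doc_i)
            if i + 1 < n:
                options = [
                    best[i + 1][j] + cosine(vec, candidates[i + 1][j][1])
                    for j in range(len(candidates[i + 1]))
                ]
                j_best = max(range(len(options)), key=options.__getitem__)
                score += options[j_best]
                nxt[i][k] = j_best
            best[i][k] = score
    k = max(range(len(best[0])), key=best[0].__getitem__)
    total, labels = best[0][k], []
    for i in range(n):
        labels.append(candidates[i][k][0])
        k = nxt[i][k]
    return total, labels


candidates = [
    [("workers", (1.0, 0.0))],
    [("strike/labor", (0.9, 0.1)), ("strike/bowling", (0.0, 1.0))],
    [("factory", (0.8, 0.2))],
]
score, labels = disambiguate(candidates)
print(labels)
```

Because the assumed context vector does not depend on the senses chosen for other ambiguous words, the chain structure holds and the dynamic program is exact for this simplified objective; a context that mixes in other ambiguous words would need a different search.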

For example, when detecting a public sentiment event in a text to be detected, as shown in Fig. 2, word segmentation and stop-word removal may first be performed on the text. Word segmentation splits the sentences of the text to be detected into feature words, and stop-word removal deletes stop words such as "simultaneously" and "additionally" from the text.
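The two preprocessing operations can be sketched in a few lines of Python. The splitter below is only a stand-in that breaks English text on non-word characters; Chinese text would need a dedicated word segmenter, which the source does not name. The stop-word list is hypothetical apart from the two examples given above.

```python
import re

# "simultaneously" and "additionally" come from the text; the rest are illustrative.
STOP_WORDS = {"simultaneously", "additionally", "the", "a", "of"}


def segment(text):
    """Stand-in segmenter: lowercase and split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in stop_words]


tokens = remove_stop_words(segment("Additionally, the workers began a strike."))
print(tokens)
```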

Then, word2vec is used to obtain, from the semantic knowledge base and the sensitive word bank, the vectors of the sensitive semantic items in the text to be detected, so that similarity can subsequently be calculated between adjacent words in the sentences of the text to be detected.

Then, similarity is calculated among the sensitive semantic item vector of each feature word, the vectors corresponding to the other feature words, and the feature word vector of the text to be detected. The sense of each sensitive semantic item at maximum similarity is taken, which yields the sensitive semantic items that fit the other words and the text to be detected and determines the specific meaning of the feature word in the text.

Finally, a weighted sum of the named entities and the sensitive meaning items in the text is calculated, and if the sum is greater than a certain threshold, the text is judged to describe a public sentiment event requiring early warning. Here the named entity count refers to the number of feature words in the text to be detected.
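The final decision reduces to comparing a two-term weighted sum with a threshold. A minimal Python sketch follows; the weight and threshold values are placeholders chosen for illustration, since the source only says they are preset.

```python
def is_public_sentiment_event(num_sensitive_senses, num_feature_words,
                              w_sensitive=0.7, w_feature=0.3, threshold=2.0):
    """Return (decision, score) for the weighted-sum rule.

    The default weights and threshold are hypothetical, not from the source.
    """
    score = w_sensitive * num_sensitive_senses + w_feature * num_feature_words
    return score > threshold, score


flagged, score = is_public_sentiment_event(num_sensitive_senses=2, num_feature_words=5)
print(flagged, round(score, 2))
```

The comparison is strict, matching the wording "greater than a threshold"; a text whose score exactly equals the threshold is not flagged.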

In this embodiment, supervised semantic recognition is performed using the different senses of the feature words and the information labels of all feature words in the text to be detected. This avoids the false detections that arise when public sentiment events are detected by keyword matching alone, so that public sentiment events are identified accurately and early warnings are issued for those that require them.

Fig. 3 is a schematic structural diagram of a public sentiment event detecting device according to an embodiment of the present invention, including:

the feature word vector obtaining module 31 is configured to obtain a feature word vector of the text to be detected, where an element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

a corresponding vector obtaining module 32, configured to obtain vectors corresponding to all feature words from a semantic knowledge base, and obtain a sensitive semantic item vector from a sensitive word base, where elements of the vector corresponding to the feature words include a current feature word, whether the current feature word includes a sensitive semantic item, a current semantic item of the current feature word, and a feature word vector corresponding to the current feature word, and the sensitive semantic item vector indicates that a semantic item in the vector corresponding to the current feature word is a current sensitive semantic item;

the similarity calculation module 33 is configured to calculate similarities between feature word vectors of the text to be detected and feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive semantic item vectors;

the event detection module 34 is configured to obtain a corresponding first sensitive semantic item when the similarity is maximum, and obtain the number of the first sensitive semantic items in the text to be detected and the number of the feature words in the text to be detected; and calculating the weighted sum of the number of the first sensitive semantic items and the number of the characteristic words according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold value.
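The division into modules 31 to 34 can be pictured as four pluggable callables wired in sequence. The Python sketch below shows only that wiring; the class name, the callable signatures, and the toy lambdas standing in for each module are assumptions for illustration and do not reproduce the actual module logic.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PublicSentimentDetector:
    acquire_text_vector: Callable[[str], Any]       # role of module 31
    acquire_knowledge_vectors: Callable[[], Any]    # role of module 32
    compute_similarity: Callable[[Any, Any], Any]   # role of module 33
    detect_event: Callable[[Any, Any], bool]        # role of module 34

    def detect(self, text: str) -> bool:
        text_vec = self.acquire_text_vector(text)
        kb_vecs = self.acquire_knowledge_vectors()
        sims = self.compute_similarity(text_vec, kb_vecs)
        return self.detect_event(text_vec, sims)


# Toy stand-ins, not the patented logic.
detector = PublicSentimentDetector(
    acquire_text_vector=lambda text: text.lower().split(),
    acquire_knowledge_vectors=lambda: {"strike": 1.0},
    compute_similarity=lambda tokens, kb: [kb.get(t, 0.0) for t in tokens],
    detect_event=lambda tokens, sims: sum(sims) > 0.5,
)
print(detector.detect("Workers began a strike"), detector.detect("Nice weather"))
```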

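As an optional refinement of this embodiment, the apparatus further includes:

a semantic knowledge base construction module, configured to construct the semantic knowledge base according to web page content.

Specifically, the web page content is stored in an xml format file.

For example, the web page content is Wikipedia.

Further, the apparatus still includes:

a sensitive word bank establishing module, configured to establish a sensitive word bank according to the semantic knowledge base and the sensitive meaning items of preset feature words.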

In the description of the present invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Claims

1. A public opinion event detection method is characterized by comprising the following steps:

obtaining vectors corresponding to all the feature words from a semantic knowledge base, wherein elements of the vectors corresponding to the feature words comprise the current feature words, whether the current feature words contain sensitive semantic items, the current semantic items of the current feature words and feature word vectors corresponding to the current feature words; acquiring a sensitive sense item vector from a sensitive word bank, wherein the sensitive sense item vector indicates that a sense item in a vector corresponding to the current feature word is a preset feature word sensitive sense item; the sensitive word bank is established based on the semantic knowledge bank and sensitive meaning items of preset characteristic words; the vectors corresponding to the sensitive characteristic words in all the characteristic words of the semantic knowledge base have ambiguous labels;

calculating the overall similarity of the feature word vectors of the text to be detected and the feature word vectors corresponding to all the feature words, wherein the feature word vectors corresponding to all the feature words comprise all sensitive meaning item vectors obtained from a sensitive word bank; and the calculating the overall similarity of the feature word vectors of the text to be detected and the feature word vectors corresponding to all the feature words specifically comprises: calculating the similarity among a plurality of feature word vectors and calculating the sum of the similarity of each feature word and a background text vector to serve as the overall similarity;

wherein the calculation formula for the overall similarity function $f(w_i)$ is as follows:

$$\max f(w_i)$$

$$f(w_i) = f(w_{i+1}) + \mathrm{Sim}(w_i, w_{i+1}) + \mathrm{Sim}(w_i, \mathrm{doc}_i)$$

s.t.

$$w_i \in \{v_1, v_2, \ldots, v_m\}$$

$$\mathrm{doc}_i = (w_1, w_2, \ldots, w_n), \quad w_i = 0$$

wherein $w_i$ represents a feature word vector in the text to be detected; $v_1, v_2, \ldots$ are the feature word vectors corresponding to each word in the text to be detected, where a non-ambiguous word has one vector representation and an ambiguous word has multiple vector representations; $f(w_i)$ represents the overall similarity of the feature word vectors of the text to be detected and the feature word vectors corresponding to all the feature words; $\mathrm{Sim}(w_i, w_{i+1})$ represents the similarity between every two adjacent feature word vectors; $\mathrm{Sim}(w_i, \mathrm{doc}_i)$ represents the similarity between each feature word and the background text vector; and $\mathrm{doc}_i$ is the background text vector, namely the vector representation of the background text of the text to be detected after the sensitive feature words are removed, in which the elements at the positions corresponding to the sensitive meaning items are set to 0;

acquiring the first sensitive meaning item corresponding to the feature word vector of the text to be detected when the overall similarity function of the text to be detected takes its maximum value, which comprises: acquiring, through an optimization method and based on the ambiguity marks in the semantic knowledge base, the first sensitive meaning item corresponding to the feature word vector of the text to be detected when the overall similarity function takes its maximum value, so as to determine the accurate meaning of the ambiguous word in the text;

repeatedly executing the steps to obtain corresponding first sensitive meaning items respectively for a plurality of sensitive characteristic words in the text to be detected; according to the obtained first sensitive semantic items, the number of the first sensitive semantic items in the text to be detected and the number of the feature words in the text to be detected are obtained, according to a first preset weight and a second preset weight, the weighted sum of the number of the first sensitive semantic items and the number of the feature words is calculated, and when the weighted sum is larger than a threshold value, the event described in the text to be detected is determined to be a public sentiment event.

2. The method according to claim 1, wherein before obtaining the feature word vector of the text to be detected, the method further comprises:

constructing the semantic knowledge base according to web page content.

3. The method of claim 2, wherein the web page content is stored in an xml-format file.

4. The method of claim 3, wherein the web page content is Wikipedia.

5. The method of claim 4, wherein after building the semantic knowledge base according to the web page content, the method further comprises:

establishing a sensitive word bank according to the semantic knowledge base and preset feature word sensitive semantic items.

6. A public opinion event detection device, comprising:

a corresponding vector acquisition module, configured to acquire vectors corresponding to all feature words from a semantic knowledge base, where elements of the vectors corresponding to the feature words include a current feature word, whether the current feature word includes a sensitive semantic item, a current semantic item of the current feature word, and a feature word vector corresponding to the current feature word; acquiring a sensitive sense item vector from a sensitive word bank, wherein the sensitive sense item vector indicates that a sense item in a vector corresponding to the current feature word is a preset feature word sensitive sense item; the sensitive word bank is established based on the semantic knowledge bank and sensitive meaning items of preset characteristic words; the vectors corresponding to the sensitive characteristic words in all the characteristic words of the semantic knowledge base have ambiguous labels;

the similarity calculation module is used for calculating the total similarity of the feature word vectors of the text to be detected and the feature word vectors corresponding to all the feature words, wherein the feature word vectors corresponding to all the feature words comprise all sensitive semantic item vectors acquired from a sensitive word bank; and the calculating the overall similarity of the feature word vectors of the text to be detected and the feature word vectors corresponding to all the feature words specifically comprises: calculating the similarity among a plurality of feature word vectors and calculating the sum of the similarity of each feature word and a background text vector to serve as the overall similarity;

$$\max f(w_i)$$

$$f(w_i) = f(w_{i+1}) + \mathrm{Sim}(w_i, w_{i+1}) + \mathrm{Sim}(w_i, \mathrm{doc}_i)$$

s.t.

$$w_i \in \{v_1, v_2, \ldots, v_m\}$$

$$\mathrm{doc}_i = (w_1, w_2, \ldots, w_n), \quad w_i = 0$$

wherein $w_i$ represents a feature word vector in the text to be detected; $v_1, v_2, \ldots$ are the feature word vectors corresponding to each word in the text to be detected, where a non-ambiguous word has one vector representation and an ambiguous word has multiple vector representations; $f(w_i)$ represents the overall similarity of the feature word vectors of the text to be detected and the feature word vectors corresponding to all the feature words; $\mathrm{Sim}(w_i, w_{i+1})$ represents the similarity between every two adjacent feature word vectors; $\mathrm{Sim}(w_i, \mathrm{doc}_i)$ represents the similarity between each feature word and the background text vector; and $\mathrm{doc}_i$ is the background text vector, namely the vector representation of the background text of the text to be detected after the sensitive feature words are removed, in which the elements at the positions corresponding to the sensitive meaning items are set to 0;

the event detection module is used for acquiring the first sensitive meaning item corresponding to the feature word vector of the text to be detected when the overall similarity function takes its maximum value, which comprises: acquiring, through an optimization method and based on the ambiguity marks in the semantic knowledge base, the first sensitive meaning item corresponding to the feature word vector of the text to be detected when the overall similarity function takes its maximum value, so as to determine the accurate meaning of the ambiguous word in the text;

the event detection module is also used for repeatedly executing the steps to respectively obtain the corresponding first sensitive meaning items of a plurality of sensitive characteristic words in the text to be detected; according to the obtained first sensitive semantic item, acquiring the number of the first sensitive semantic items in the text to be detected and the number of the feature words in the text to be detected; and calculating the weighted sum of the number of the first sensitive semantic items and the number of the characteristic words according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold value.

7. The apparatus of claim 6, further comprising:

a semantic knowledge base construction module, used for constructing the semantic knowledge base according to web page content.

8. The apparatus of claim 7, wherein the web page content is stored in an xml format file.

9. The apparatus of claim 8, wherein the web page content is Wikipedia.

10. The apparatus of claim 9, further comprising:

a sensitive word bank establishing module, used for establishing a sensitive word bank according to the semantic knowledge base and the preset feature word sensitive semantic items.