Disclosure of Invention
The invention provides a public sentiment event detection method and device to address the following problem: when faced with massive volumes of internet text, traditional feature word filtering lacks effective semantic constraints, which readily causes false positives and missed detections, so that public sentiment events requiring attention cannot be detected accurately.
In a first aspect, the present invention provides a method for detecting a public sentiment event, including:
acquiring a feature word vector of a text to be detected, wherein elements of the feature word vector represent whether corresponding feature words appear in the text to be detected;
obtaining vectors corresponding to all feature words from a semantic knowledge base, and obtaining sensitive semantic item vectors from a sensitive word base, wherein elements of the vectors corresponding to the feature words comprise current feature words, whether the current feature words contain sensitive semantic items, current semantic items of the current feature words and feature word vectors corresponding to the current feature words, and the sensitive semantic item vectors represent that semantic items in the vectors corresponding to the current feature words are current sensitive semantic items;
calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all the feature words comprise all sensitive semantic item vectors;
acquiring the corresponding first sensitive semantic item when the similarity is maximum, acquiring the number of first sensitive semantic items in the text to be detected and the number of feature words in the text to be detected, calculating the weighted sum of the number of first sensitive semantic items and the number of feature words according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold value.
Preferably, before the acquiring of the feature word vector of the text to be detected, the method further includes:
constructing the semantic knowledge base according to web page content.
Preferably, the web page content is stored in an XML format file.
Preferably, the web page content is Wikipedia.
Preferably, after the constructing of the semantic knowledge base according to the web page content, the method further includes:
establishing a sensitive word base according to the semantic knowledge base and the sensitive semantic items of preset feature words.
In a second aspect, the present invention further provides a public sentiment event detecting apparatus, including:
a feature word vector acquisition module, configured to acquire a feature word vector of the text to be detected, wherein elements of the feature word vector indicate whether the corresponding feature words appear in the text to be detected;
a corresponding vector acquisition module, configured to acquire vectors corresponding to all feature words from the semantic knowledge base and to acquire sensitive semantic item vectors from the sensitive word base, wherein elements of the vector corresponding to a feature word comprise the current feature word, whether the current feature word contains a sensitive semantic item, the current semantic item of the current feature word, and the feature word vector corresponding to the current feature word, and the sensitive semantic item vector indicates that the semantic item in the vector corresponding to the current feature word is the current sensitive semantic item;
a similarity calculation module, configured to calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all the feature words comprise all sensitive semantic item vectors;
an event detection module, configured to acquire the corresponding first sensitive semantic item when the similarity is maximum, to acquire the number of first sensitive semantic items in the text to be detected and the number of feature words in the text to be detected, to calculate the weighted sum of the number of first sensitive semantic items and the number of feature words according to a first preset weight and a second preset weight, and to determine that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold value.
Preferably, the apparatus further includes:
a semantic knowledge base construction module, configured to construct the semantic knowledge base according to web page content.
Preferably, the web page content is stored in an XML format file.
Preferably, the web page content is Wikipedia.
Preferably, the apparatus further includes:
a sensitive word base establishing module, configured to establish a sensitive word base according to the semantic knowledge base and the sensitive semantic items of preset feature words.
According to the above technical solution, vectorizing the text to be detected provides effective semantic constraints; meanwhile, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, public sentiment events requiring attention can be detected accurately, and the probability of false positives and missed detections is greatly reduced.
Detailed Description
The following further describes embodiments of the invention with reference to the drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Fig. 1 is a flowchart illustrating a public sentiment event detection method according to an embodiment of the present invention, including:
S101, obtaining a feature word vector of a text to be detected, wherein elements of the feature word vector indicate whether the corresponding feature words appear in the text to be detected;
S102, obtaining vectors corresponding to all feature words from a semantic knowledge base, and obtaining sensitive semantic item vectors from a sensitive word base, wherein elements of the vector corresponding to a feature word comprise the current feature word, whether the current feature word contains a sensitive semantic item, the current semantic item of the current feature word, and the feature word vector corresponding to the current feature word, and the sensitive semantic item vector indicates that the semantic item in the vector corresponding to the current feature word is the current sensitive semantic item;
S103, calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all the feature words comprise all sensitive semantic item vectors;
S104, acquiring the corresponding first sensitive semantic item when the similarity is maximum, acquiring the number of first sensitive semantic items in the text to be detected and the number of feature words in the text to be detected, calculating the weighted sum of the number of first sensitive semantic items and the number of feature words according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold value.
When the feature word corresponding to the element of the feature word vector is a sensitive word, the corresponding element may be set to 0.
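As an illustration only, the feature word vector of step S101 could be built as in the following minimal sketch. The function name `feature_word_vector`, the toy vocabulary, and the token list are hypothetical and are not taken from the embodiment; the text is assumed to be already segmented into words.

```python
from typing import AbstractSet, List, Sequence


def feature_word_vector(
    tokens: Sequence[str],
    vocabulary: Sequence[str],
    sensitive_words: AbstractSet[str] = frozenset(),
) -> List[int]:
    """Return a 0/1 vector aligned with `vocabulary`.

    An element is 1 when the corresponding feature word appears in the text.
    Elements that correspond to sensitive words are set to 0, which mirrors
    the optional masking described above.
    """
    present = set(tokens)
    return [
        1 if (word in present and word not in sensitive_words) else 0
        for word in vocabulary
    ]


# Hypothetical vocabulary and segmented text, for illustration only.
vocabulary = ["protest", "square", "apple", "price"]
tokens = ["apple", "price", "protest"]

print(feature_word_vector(tokens, vocabulary))               # [1, 0, 1, 1]
print(feature_word_vector(tokens, vocabulary, {"protest"}))  # [0, 0, 1, 1]
```

The second call shows the masked form, in which the position of the sensitive word is zeroed before the remaining text is compared with candidate semantic items.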
In this embodiment, vectorizing the text to be detected provides effective semantic constraints; meanwhile, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, public sentiment events requiring attention can be detected accurately, and the probability of false positives and missed detections is greatly reduced.
As an optional implementation of this embodiment, before step S101 the method further includes:
S100, constructing the semantic knowledge base according to web page content.
Building the semantic knowledge base and annotating the ambiguous senses of public sentiment sensitive words provides semantic support for analyzing and detecting public sentiment events, and provides a basis for finding the correct meaning of sensitive words in the text to be detected. Public sentiment feature words are often a direct embodiment of public sentiment, but they can carry different meanings in different contexts, so ambiguous feature words often cause false positives in text filtering preprocessing. With the descriptions given by the semantic knowledge base, the meaning a word expresses in a specific context can be recognized.
The vector corresponding to each feature word stored in the semantic knowledge base is obtained by training the deep learning tool word2vec on text that has undergone word segmentation preprocessing. Each segmented word (that is, each feature word in the text to be detected) can be effectively represented by a vector of a certain dimension, as shown in the following table.
Specifically, the web page content is stored in an XML format file.
For example, the web page content is Wikipedia.
Wikipedia is one of the largest online encyclopedias. It uses the Wiki mechanism of collaborative online editing, is characterized by high quality, wide coverage, real-time evolution, and semi-structured content, and is a high-quality corpus source for constructing a semantic knowledge base. In particular, for ambiguous words in Wikipedia, the semantic items that reflect public sentiment characteristics are manually annotated, which supports subsequent early-warning analysis. The Wikipedia corpus in XML format is taken as input; the description of each entry is extracted from it; each entry is analyzed to determine whether it is an ambiguous word or a redirect and whether Traditional-to-Simplified Chinese conversion is needed; the abstract (introduction) part is retained; and sensitive feature words are labeled.
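The extraction described above might look roughly like the following sketch, which uses Python's standard `xml.etree.ElementTree`. The sample document is a deliberately simplified stand-in: real Wikipedia dumps declare an XML namespace and mark disambiguation pages with language-specific templates, so the `{{disambiguation}}` marker, the function name `extract_entries`, and the sample pages are all assumptions. Traditional-to-Simplified conversion and the manual labeling of sensitive semantic items are not shown.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; not the real dump schema.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Apple</title>
    <revision><text>Apple may refer to a fruit or a company. {{disambiguation}}</text></revision>
  </page>
  <page>
    <title>Apples</title>
    <redirect title="Apple" />
    <revision><text>#REDIRECT [[Apple]]</text></revision>
  </page>
  <page>
    <title>Protest</title>
    <revision><text>A protest is a public expression of objection.

== History ==
Longer body text that is not kept.</text></revision>
  </page>
</mediawiki>"""


def extract_entries(xml_text, disambiguation_marker="{{disambiguation}}"):
    """Extract title, redirect/ambiguity flags, and the abstract of each page."""
    entries = []
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title", default="")
        text = page.findtext("revision/text", default="") or ""
        redirect = page.find("redirect")
        # Keep only the introduction: everything before the first section heading.
        abstract = text.split("\n==", 1)[0].strip()
        entries.append({
            "title": title,
            "is_redirect": redirect is not None,
            "redirect_target": redirect.get("title") if redirect is not None else None,
            "is_ambiguous": disambiguation_marker in text,
            "abstract": abstract,
        })
    return entries


for entry in extract_entries(SAMPLE_DUMP):
    print(entry["title"], entry["is_redirect"], entry["is_ambiguous"])
```

Entries flagged as ambiguous are the ones whose individual semantic items would then be reviewed and annotated by hand.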
By drawing on the rich semantic knowledge in Wikipedia, public sentiment sensitive words can be added automatically and the range of public sentiment events that can be represented is expanded, which helps users better grasp public sentiment trends and take relevant countermeasures to deal with public sentiment events.
Further, after step S100 the method also includes:
S1001, establishing a sensitive word base according to the semantic knowledge base and the sensitive semantic items of preset feature words.
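One possible shape for step S1001 is sketched below. The record fields follow the elements listed for the vector corresponding to a feature word (the feature word, whether it has a sensitive semantic item, the current semantic item, and its vector), but the class name `SenseEntry`, the function name `build_sensitive_lexicon`, and all sample values, including the placeholder vectors, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class SenseEntry:
    feature_word: str            # the current feature word
    has_sensitive_sense: bool    # whether the word has any sensitive semantic item
    sense: str                   # the current semantic item of the word
    vector: Tuple[float, ...]    # vector for this sense (placeholder values below)


def build_sensitive_lexicon(
    knowledge_base: Iterable[SenseEntry],
    preset_sensitive_senses: Dict[str, Set[str]],
) -> Dict[str, List[SenseEntry]]:
    """Keep only the entries whose sense is preset as sensitive for that word."""
    lexicon: Dict[str, List[SenseEntry]] = {}
    for entry in knowledge_base:
        if entry.sense in preset_sensitive_senses.get(entry.feature_word, set()):
            lexicon.setdefault(entry.feature_word, []).append(entry)
    return lexicon


# Hypothetical knowledge base with one ambiguous and one unambiguous word.
knowledge_base = [
    SenseEntry("strike", True, "labor action", (0.9, 0.1)),
    SenseEntry("strike", True, "bowling", (0.1, 0.9)),
    SenseEntry("price", False, "cost of goods", (0.5, 0.5)),
]
preset = {"strike": {"labor action"}}

lexicon = build_sensitive_lexicon(knowledge_base, preset)
print({word: [e.sense for e in entries] for word, entries in lexicon.items()})
# {'strike': ['labor action']}
```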
When the text to be detected is processed, sensitive words can be handled clause by clause. Specifically, the feature words in the feature word vector of a clause of the text to be detected are matched against the vectors corresponding to the feature words in the semantic knowledge base. The similarity between the semantic items of different feature words, and the similarity between each semantic item and the text to be detected, are calculated; the semantic item with the higher similarity is selected for the sensitive word, because it is closer to the word's real meaning in the text. An optimization method is used to find the maximum of the objective function, which yields the accurate meaning of each ambiguous word in the text. The calculation formula is as follows:
$$
\begin{aligned}
& \max f(w_i) \\
& f(w_i) = f(w_{i+1}) + \mathrm{Sim}(w_i, w_{i+1}) + \mathrm{Sim}(w_i, \mathrm{doc}_i) \\
& \text{s.t.} \quad w_i \in \{v_1, v_2, \ldots, v_m\} \\
& \phantom{\text{s.t.} \quad} \mathrm{doc}_i = (w_1, w_2, \ldots, w_n),\ w_i = 0
\end{aligned}
$$
wherein: $w_i$ represents a feature word in the text to be detected; $f(w_i)$ represents the semantic similarity value from $w_i$ to the end of the sentence; $\mathrm{doc}_i$ is the vector representation of the text after the sensitive word is removed, that is, with the element at the corresponding position set to 0; $v_1, v_2, \ldots, v_m$ are the vectors corresponding to the feature word, where a non-ambiguous word has one vector representation and an ambiguous word has multiple vector representations; $\mathrm{Sim}(w_i, w_{i+1})$ is a function that calculates the similarity of adjacent sensitive words; and $\mathrm{Sim}(w_i, \mathrm{doc}_i)$ is a function that calculates the similarity of the sensitive word to the text. Because words and texts are both represented by word vectors, the similarity functions may use cosine similarity.
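The recursion above can be evaluated from the end of the sentence backwards. The following is a minimal sketch under stated assumptions, not the embodiment's own optimization method: the boundary value for the last word is taken to be its similarity to the text alone, and each doc_i is supplied by the caller as a vector in the same space as the word vectors (for instance, an average of the other words' vectors). The names `cosine` and `disambiguate` and the two-dimensional toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(
    candidates: List[List[Vector]], contexts: List[Vector]
) -> Tuple[List[int], float]:
    """Pick one candidate vector per word so that f(w_1) is maximal.

    candidates[i] holds the possible vectors v_1..v_m of word i;
    contexts[i] is the assumed doc_i vector for word i.
    """
    n = len(candidates)
    if n == 0:
        return [], 0.0
    best = [[0.0] * len(c) for c in candidates]   # best[i][s] = f(w_i) for sense s
    choice = [[0] * len(c) for c in candidates]   # best sense of word i+1 given s
    for i in range(n - 1, -1, -1):
        for s, v in enumerate(candidates[i]):
            score = cosine(v, contexts[i])         # Sim(w_i, doc_i)
            if i + 1 < n:
                # f(w_{i+1}) + Sim(w_i, w_{i+1}) for every sense of the next word
                options = [
                    best[i + 1][t] + cosine(v, u)
                    for t, u in enumerate(candidates[i + 1])
                ]
                t_best = max(range(len(options)), key=options.__getitem__)
                score += options[t_best]
                choice[i][s] = t_best
            best[i][s] = score
    s = max(range(len(best[0])), key=best[0].__getitem__)
    total = best[0][s]
    path = []
    for i in range(n):
        path.append(s)
        s = choice[i][s]
    return path, total


# Word 0 is ambiguous (two candidate vectors); word 1 is not.
candidates = [[(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0)]]
contexts = [(1.0, 0.0), (1.0, 0.0)]
path, total = disambiguate(candidates, contexts)
print(path)  # [0, 0]
```

Because each word's score depends only on its own choice and that of its successor, the backward pass finds the maximum without enumerating every combination of senses.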
For example, when detecting a public sentiment event in a text to be detected, as shown in Fig. 2, word segmentation and stop word removal may first be performed on the text. Word segmentation splits each sentence of the text into a plurality of feature words, and stop word removal deletes stop words such as "simultaneously" and "additionally" from the text.
Then, using word2vec, the vectors of the sensitive semantic items in the text to be detected are obtained from the semantic knowledge base and the sensitive word base, so that similarity can subsequently be calculated between adjacent words in the sentences of the text.
Next, similarity is calculated between the sensitive semantic item vector of each feature word, the vectors corresponding to the other feature words, and the feature word vector of the text to be detected. The sense of each sensitive semantic item at which the similarity is maximum is taken, which yields the sensitive semantic items that best match the other words and the text, and thereby determines the specific meaning of each feature word in the text to be detected.
Finally, a weighted sum of the named entities and the sensitive semantic items in the text is computed, and if the sum is greater than a certain threshold, the text is judged to describe a public sentiment event requiring early warning. Here, "named entities" refers to the number of feature words in the text to be detected.
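The final decision reduces to a weighted sum compared with a threshold, as in the sketch below. The function name `is_public_sentiment_event` and the weight and threshold values are hypothetical; the embodiment only states that the two weights and the threshold are preset.

```python
from typing import Tuple


def is_public_sentiment_event(
    sensitive_item_count: int,
    feature_word_count: int,
    w_sensitive: float = 2.0,   # hypothetical first preset weight
    w_feature: float = 0.5,     # hypothetical second preset weight
    threshold: float = 3.0,     # hypothetical threshold
) -> Tuple[bool, float]:
    """Return (decision, score) for one text to be detected."""
    score = w_sensitive * sensitive_item_count + w_feature * feature_word_count
    return score > threshold, score


print(is_public_sentiment_event(2, 4))  # (True, 6.0)
print(is_public_sentiment_event(1, 1))  # (False, 2.5)
```

In practice the weights and threshold would need to be tuned on labeled texts, since they control the trade-off between false positives and missed detections.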
In this embodiment, supervised semantic recognition is performed using the different semantic items of the feature words and the annotation information of all feature words in the text to be detected. This avoids the false detections that arise when public sentiment events are detected by keyword matching alone, so that public sentiment events are identified accurately and early-warning prompts are issued for those that require early warning.
Fig. 3 is a schematic structural diagram of a public sentiment event detecting device according to an embodiment of the present invention, including:
a feature word vector obtaining module 31, configured to obtain a feature word vector of the text to be detected, where an element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;
a corresponding vector obtaining module 32, configured to obtain vectors corresponding to all feature words from a semantic knowledge base, and obtain a sensitive semantic item vector from a sensitive word base, where elements of the vector corresponding to the feature words include a current feature word, whether the current feature word includes a sensitive semantic item, a current semantic item of the current feature word, and a feature word vector corresponding to the current feature word, and the sensitive semantic item vector indicates that a semantic item in the vector corresponding to the current feature word is a current sensitive semantic item;
the similarity calculation module 33 is configured to calculate similarities between feature word vectors of the text to be detected and feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive semantic item vectors;
the event detection module 34 is configured to obtain a corresponding first sensitive semantic item when the similarity is maximum, and obtain the number of the first sensitive semantic items in the text to be detected and the number of the feature words in the text to be detected; and calculating the weighted sum of the number of the first sensitive semantic items and the number of the characteristic words according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public sentiment event when the weighted sum is greater than a threshold value.
In this embodiment, vectorizing the text to be detected provides effective semantic constraints; meanwhile, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, public sentiment events requiring attention can be detected accurately, and the probability of false positives and missed detections is greatly reduced.
As an optional implementation of this embodiment, the apparatus further includes:
a semantic knowledge base construction module, configured to construct the semantic knowledge base according to web page content.
Specifically, the web page content is stored in an XML format file.
For example, the web page content is Wikipedia.
Further, the apparatus also includes:
a sensitive word base establishing module, configured to establish a sensitive word base according to the semantic knowledge base and the sensitive semantic items of preset feature words.
In the description of the present invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.