CN106547875B

CN106547875B - Microblog online emergency detection method based on emotion analysis and labels

Info

Publication number: CN106547875B
Application number: CN201610945406.6A
Authority: CN
Inventors: 邹晓梅; 杨静; 张健沛
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2016-11-02
Filing date: 2016-11-02
Publication date: 2020-05-15
Anticipated expiration: 2036-11-02
Also published as: CN106547875A

Abstract

The invention belongs to the field of network detection and relates in particular to a microblog online emergency detection method based on emotion analysis and labels. The method comprises the following steps: (1) constructing an emotion analysis model, namely an emotion concurrence graph, using the emotion wheel emotion classification model; (2) performing emotion classification on the microblogs in the microblog stream using the emotion analysis model constructed in step (1), and detecting the burst periods of the microblog stream using the Kleinberg algorithm; (3) extracting the microblog labels within a burst period, filtering out junk labels, and performing word segmentation on the remaining labels to form the initial keywords of an event; (4) using the keywords generated in step (3), extracting the words related to the keywords from the microblogs to form the final description of the event. Because the emotion concurrence graph is constructed on the basis of the emotion wheel, the emotion classification is finer-grained and the emotions are easier to understand and explain, and event detection is more accurate than detection based on emotion symbols.

Description

Microblog online emergency detection method based on emotion analysis and labels

Technical Field

The invention belongs to the field of network detection, and particularly relates to a microblog online emergency detection method based on emotion analysis and labels.

Background

With the rapid development of Web 2.0 technology in recent years, a series of social networks has emerged. These social networks, such as newsbook and Twitter, attract a large number of users. Users are active on social networks and post large numbers of microblog messages containing their opinions or views on particular events. By mining these messages, a large amount of deeper information, such as user emotion, can be obtained. This information can be used to provide services to governments and enterprises. For example, a government can use it to judge whether people support a law and what opinions they hold on a social event, and thus to monitor and guide public opinion; an enterprise can learn the behavior habits and preferences of users by mining their microblog messages, and thus recommend to each user the commodities that the user is most likely to be interested in or to buy.

There are two conventional approaches to emergency detection: document-based detection and feature-based detection. Document-based detection represents documents as word vectors or named-entity vectors, calculates the similarity between documents, and clusters the documents to form events. Feature-based detection, which relies on feature bursts, is an effective way of mining burst events in a data stream: the feature words of the documents are first extracted, bursts are detected by analyzing how the feature words vary over time, and the feature words with the same burst trajectory are then aggregated to form burst events. However, neither method is suitable for short microblog texts. First, the volume of microblog data is large, and extracting feature words and forming a TF-IDF matrix for each microblog takes a great deal of time. Second, microblogs are written irregularly, vary in form, and are likely to contain many new words, so the resulting matrix is sparse, similarity is hard to calculate, and identification becomes more difficult. Moreover, the traditional methods only extract the emergency and do not analyze it further, for example by emotion analysis.

Disclosure of Invention

The invention aims to provide an online emergency detection model for the short texts of microblog data streams, namely an online emergency detection method based on emotion analysis and labels that can accurately and quickly extract the emergencies in the data stream.

The purpose of the invention is realized as follows:

a microblog online emergency detection method based on emotion analysis and labels, comprising the following steps:

(1) constructing an emotion analysis model, namely an emotion concurrence graph, using the emotion wheel emotion classification model;

(2) performing emotion classification on the microblogs in the microblog stream using the emotion analysis model constructed in step (1), and detecting the burst periods of the microblog stream using the Kleinberg algorithm;

(3) extracting the microblog labels within a burst period, filtering out junk labels, and performing word segmentation on the remaining labels to form the initial keywords of an event;

(4) using the keywords generated in step (3), extracting the words related to the keywords from the microblogs to form the final description of the event.

In the step (1), an emotion concurrence graph is constructed by the following method:

(1.1) using the emotion wheel model and manually assigning suitable words to the emotion symbols;

(1.2) performing word segmentation on the original microblog data to form a microblog corpus;

(1.3) calculating the similarity between the words of the microblog corpus and the words of the emotion symbols using the HowNet dictionary and a distance-based word similarity;

the word similarity in step (1.3) is calculated using the following formula:

in the formula, W₁ and W₂ denote words; word W₁ has k sense items {n₁₁, n₁₂, …, n₁ₖ} and word W₂ has p sense items {n₂₁, n₂₂, …, n₂ₚ}; p₁ and p₂ denote two sememes; d is the path length between p₁ and p₂ in the semantic hierarchy and is a positive integer; α is an adjustable parameter;

(1.4) establishing connection among words with similarity larger than a given threshold lambda to finish the construction of the emotion concurrence graph; lambda is selected to be 0.6.
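The formula referred to in step (1.3) is not reproduced in this text. A reconstruction consistent with the symbols defined there, assuming the widely used HowNet distance-based similarity (word similarity as the maximum over all pairs of sense items, sememe similarity decreasing with path length), would be:

```latex
\mathrm{Sim}(W_1, W_2) = \max_{i = 1,\dots,k;\; j = 1,\dots,p} \mathrm{Sim}(n_{1i}, n_{2j}),
\qquad
\mathrm{Sim}(p_1, p_2) = \frac{\alpha}{d + \alpha}
```

Under this assumption, and with the value α = 1.6 given in the detailed description, two sememes at path length d = 1 have similarity 1.6 / 2.6 ≈ 0.62, which exceeds λ = 0.6, while d = 2 gives 1.6 / 3.6 ≈ 0.44, which does not. How the similarity of two sense items is composed from sememe similarities is not stated in this text.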

The step (3) comprises the following steps:

(3.1) performing part-of-speech tagging on the extracted labels, and removing the labels that consist only of verbs or only of nouns;

(3.2) removing the labels that contain special symbols;

(3.3) removing the labels that are in a standard date format and the labels that consist only of numbers and punctuation marks;

the step (4) comprises the following steps:

(4.1) performing word segmentation on the residual labels in the burst period;

(4.2) mining the frequent patterns of the label keywords of the related microblogs in the burst period;

(4.3) extracting the 2-itemsets from the frequent patterns, and calculating the mutual information between the words in each 2-itemset;

(4.4) keeping the words whose mutual information is larger than a given threshold γ to form the final event description, γ being selected as 1.5;

the mutual information in step (4.4) is calculated using the following formula:

in the formula, C(W₁) and C(W₂) denote the numbers of microblogs in the corpus that contain W₁ and W₂, respectively; C(W₁, W₂) denotes the number of microblogs that contain both W₁ and W₂; and R is the size of the corpus, namely the total number of microblogs.
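The mutual information formula is not reproduced in this text. Assuming the standard pointwise mutual information estimated from document counts, which uses exactly the quantities C(W₁), C(W₂), C(W₁, W₂) and R, it would read:

```latex
\mathrm{MI}(W_1, W_2) = \log \frac{R \cdot C(W_1, W_2)}{C(W_1)\, C(W_2)}
```

The base of the logarithm is likewise an assumption. With base 2 and, for illustration, R = 1000, C(W₁) = 50, C(W₂) = 40 and C(W₁, W₂) = 20, the ratio is 10 and the mutual information is log₂ 10 ≈ 3.32, which would pass the threshold of 1.5.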

The invention has the beneficial effects that:

Because the emotion concurrence graph is constructed on the basis of the emotion wheel, the emotion classification is finer-grained and the emotions are easier to understand and explain, and event detection is more accurate than detection based on emotion symbols. Emotion analysis with the constructed emotion concurrence graph filters out a large number of useless microblogs, and the burst state of the microblog data stream is detected from the emotion analysis result, so efficiency is high. Using microblog labels as a guide for discovering emergencies gives higher accuracy than clustering-based event discovery and a short detection time.

Drawings

FIG. 1 shows the framework of the online emergency detection model based on the emotion concurrence graph.

Detailed Description

The following describes the implementation of the present invention in further detail with reference to the accompanying drawings and the detailed description.

Step 1: construct an emotion analysis model, namely an emotion concurrence graph, using the emotion wheel emotion classification model. This specifically comprises the following steps:

Step 1.1: using the emotion wheel model, manually assign suitable words to the emotion symbols.

Step 1.2: perform word segmentation on the original microblog data to form a microblog corpus.

Step 1.3: calculate the similarity between the words of the microblog corpus and the words of the emotion symbols using the HowNet dictionary and a distance-based word similarity.

In step 1.3, the word similarity is calculated using the following formula:

In the formula, W₁ and W₂ denote words; word W₁ has k sense items (concepts) {n₁₁, n₁₂, …, n₁ₖ} and word W₂ has p sense items (concepts) {n₂₁, n₂₂, …, n₂ₚ}; p₁ and p₂ denote two sememes; d is the path length between p₁ and p₂ in the semantic hierarchy and is a positive integer; α is an adjustable parameter, which is taken to be 1.6 in the present invention.

Step 1.4: establish connections between the words whose similarity is larger than a given threshold λ to complete the construction of the emotion concurrence graph. In the present invention, λ is chosen to be 0.6.
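The graph construction of steps 1.1 to 1.4 can be illustrated with a minimal sketch. Several things here are assumptions made only for illustration: the toy sememe hierarchy `PARENT` and the lexicon `SENSES` are hypothetical and are not taken from HowNet; each sense item is represented by a single sememe; and the sememe similarity is taken to be α / (d + α), a common distance-based form, because the formula itself is not reproduced in this text.

```python
from itertools import product

ALPHA = 1.6   # adjustable parameter alpha from step 1.3
LAMBDA = 0.6  # similarity threshold lambda from step 1.4

# Hypothetical toy sememe hierarchy (child -> parent); not taken from HowNet.
PARENT = {
    "emotion": None,
    "joy": "emotion",
    "sadness": "emotion",
    "delight": "joy",
    "grief": "sadness",
}

# Hypothetical lexicon: each word maps to one sememe per sense item.
SENSES = {
    "happy": ["joy"],
    "glad": ["delight"],
    "sad": ["sadness"],
    "mourn": ["grief"],
}


def ancestors(sememe):
    """Return the chain from a sememe up to the root, starting with itself."""
    chain = []
    while sememe is not None:
        chain.append(sememe)
        sememe = PARENT[sememe]
    return chain


def path_length(p1, p2):
    """Number of edges between two sememes in the hierarchy tree."""
    chain1, chain2 = ancestors(p1), ancestors(p2)
    depth2 = {s: i for i, s in enumerate(chain2)}
    for i, s in enumerate(chain1):
        if s in depth2:  # first common ancestor
            return i + depth2[s]
    raise ValueError("sememes are not in the same tree")


def sememe_similarity(p1, p2, alpha=ALPHA):
    """Assumed distance-based similarity alpha / (d + alpha)."""
    return alpha / (path_length(p1, p2) + alpha)


def word_similarity(w1, w2):
    """Maximum similarity over all pairs of sense items of the two words."""
    return max(sememe_similarity(a, b) for a, b in product(SENSES[w1], SENSES[w2]))


def build_emotion_graph(emotion_words, corpus_words, threshold=LAMBDA):
    """Connect each emotion word to the corpus words that are similar enough."""
    graph = {e: set() for e in emotion_words}
    for e, w in product(emotion_words, corpus_words):
        if e != w and word_similarity(e, w) > threshold:
            graph[e].add(w)
    return graph


graph = build_emotion_graph(["happy", "sad"], ["glad", "mourn"])
print(graph)
```

In this toy data, "happy" and "glad" are one edge apart in the hierarchy, so their similarity is 1.6 / 2.6 and they are connected, whereas "happy" and "mourn" are three edges apart and are not.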

Step 2: perform emotion classification on the microblogs in the microblog stream using the emotion analysis model constructed in step 1, and detect the burst periods of the microblog stream using the Kleinberg algorithm.

Step 2.1: perform word segmentation on each microblog in the microblog stream.

Step 2.2: for each segmented microblog, build its emotion vector Sd using the constructed emotion concurrence graph model.

Step 2.3: set a flag bit flag to true; if the corresponding emotion mark σ_sk of the vector Sd is 1, add the microblog to the emotion document set Ds_Tk and set flag to false.

Step 2.4: repeat steps 2.2 and 2.3 until all microblogs are classified.

Step 2.5: for each emotion class of microblogs, detect the burst periods using the Kleinberg algorithm.
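Step 2.5 does not say which variant of the Kleinberg algorithm is used. The sketch below assumes the two-state batched variant, in which each time window has a count of microblogs of one emotion class (`relevant`) out of all microblogs in that window (`total`), and the cheapest state sequence is found by dynamic programming. The function names, the parameter values `s` and `gamma`, and the sample counts are all illustrative assumptions.

```python
import math


def kleinberg_bursts(relevant, total, s=2.0, gamma=1.0):
    """Two-state batched burst detection; returns one 0/1 state per window.

    relevant[t] -- microblogs of one emotion class in window t
    total[t]    -- all microblogs in window t
    s, gamma    -- rate scaling and transition cost (assumed values)
    """
    n = len(relevant)
    p0 = sum(relevant) / sum(total)   # baseline rate
    if not 0 < p0 < 1:
        raise ValueError("baseline rate must lie strictly between 0 and 1")
    p1 = min(s * p0, 0.9999)          # elevated (burst) rate, capped below 1
    probs = (p0, p1)

    def cost(state, r, d):
        # Negative log-likelihood of r relevant items out of d (binomial).
        p = probs[state]
        log_binom = math.lgamma(d + 1) - math.lgamma(r + 1) - math.lgamma(d - r + 1)
        return -(log_binom + r * math.log(p) + (d - r) * math.log(1 - p))

    up_cost = gamma * math.log(n)     # cost of entering the burst state

    # Viterbi: best[j] is the minimal cost of a state sequence ending in j.
    best = [0.0, float("inf")]        # the automaton starts in the baseline state
    back = []
    for r, d in zip(relevant, total):
        new_best, pointers = [], []
        for j in (0, 1):
            candidates = [best[i] + (up_cost if j > i else 0.0) for i in (0, 1)]
            i_min = 0 if candidates[0] <= candidates[1] else 1
            new_best.append(candidates[i_min] + cost(j, r, d))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace the optimal state sequence backwards.
    state = 0 if best[0] <= best[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


def burst_periods(states):
    """Collapse the 0/1 state sequence into (start, end) window index pairs."""
    periods, start = [], None
    for t, state in enumerate(states):
        if state == 1 and start is None:
            start = t
        elif state == 0 and start is not None:
            periods.append((start, t - 1))
            start = None
    if start is not None:
        periods.append((start, len(states) - 1))
    return periods


# Illustrative counts for one emotion class over eight time windows.
relevant = [2, 3, 2, 20, 25, 22, 3, 2]
total = [100] * 8
states = kleinberg_bursts(relevant, total)
print(burst_periods(states))
```

Running the detector once per emotion document set yields one list of burst periods per emotion class, which is the input expected by step 3.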

Step 3: extract the microblog labels within a burst period, filter out junk labels, and perform word segmentation on the remaining labels to form the initial keywords of the event.

Step 3.1: perform part-of-speech tagging on the extracted labels and remove the labels that consist only of verbs or only of nouns, such as "# early-safe #", "# late-safe #", "# sing bar #", "# nine village #" and "# journey #".

Step 3.2: remove the labels that contain special symbols ("," + ","), such as "# laugh + video #", "# early love house #" and "# Weico + #".

Step 3.3: remove the labels that are in a standard date format and the labels that consist only of numbers and punctuation, such as "# 365 #" and "# 4.01 #".

Step 4: using the keywords generated in step 3, extract the words related to the keywords from the microblogs to form the final description of the event.

Step 4.1: perform word segmentation on the remaining labels in the burst period.

Step 4.2: mine the frequent patterns of the microblog label keywords in the burst period.

Step 4.3: extract the 2-itemsets from the frequent patterns and calculate the mutual information between the words in each 2-itemset.

Step 4.4: keep the words whose mutual information is larger than a given threshold γ and sort them by word frequency to form the final event description. In the present invention, γ is selected to be 1.5.

The mutual information in step 4.4 is calculated using the following formula:

In the formula, C(W₁) and C(W₂) denote the numbers of microblogs in the corpus that contain W₁ and W₂, respectively; C(W₁, W₂) denotes the number of microblogs that contain both W₁ and W₂; and R is the size of the corpus, namely the total number of microblogs.
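Steps 4.2 to 4.4 can be sketched as follows. The sketch makes three assumptions that are not stated in the text: frequent 2-itemsets are found by direct co-occurrence counting with a hypothetical minimum support `MIN_SUPPORT` rather than by a full frequent-pattern miner; mutual information is the pointwise form log(R · C(W₁, W₂) / (C(W₁) · C(W₂))); and the logarithm is taken to base 2. The sample microblogs are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

GAMMA = 1.5      # mutual information threshold from step 4.4
MIN_SUPPORT = 2  # assumed minimum support for a frequent 2-itemset


def describe_event(docs, gamma=GAMMA, min_support=MIN_SUPPORT):
    """Return event words ordered by frequency; docs is a list of token lists."""
    total = len(docs)       # R: total number of microblogs
    word_count = Counter()  # C(W): microblogs containing W
    pair_count = Counter()  # C(W1, W2): microblogs containing both words
    for tokens in docs:
        unique = sorted(set(tokens))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))

    kept = set()
    for (w1, w2), c12 in pair_count.items():
        if c12 < min_support:  # not a frequent 2-itemset
            continue
        # Assumed pointwise mutual information, base-2 logarithm.
        mi = math.log2(total * c12 / (word_count[w1] * word_count[w2]))
        if mi > gamma:
            kept.update((w1, w2))
    # Sort by descending document frequency, ties broken alphabetically.
    return sorted(kept, key=lambda w: (-word_count[w], w))


# Invented, already segmented microblogs from one burst period.
docs = [
    ["earthquake", "rescue", "today"],
    ["earthquake", "rescue"],
    ["today", "weather"],
    ["today", "lunch"],
    ["today", "traffic"],
    ["weather", "today"],
    ["lunch"],
    ["traffic"],
]
print(describe_event(docs))
```

In this sample, "today" and "weather" co-occur twice but "today" is common overall, so their mutual information stays below the threshold, whereas "earthquake" and "rescue" always appear together and are kept.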

Claims

1. A microblog online emergency detection method based on emotion analysis and labels is characterized by comprising the following steps:

(4) extracting words related to the keywords in the microblog by using the keywords generated in the step (3) to form final description of the event;

in the formula, W₁ and W₂ denote words; word W₁ has k sense items {n₁₁, n₁₂, …, n₁ₖ} and word W₂ has p sense items {n₂₁, n₂₂, …, n₂ₚ}; p₁ and p₂ denote two sememes; d is the path length between p₁ and p₂ in the semantic hierarchy and is a positive integer; α is an adjustable parameter;

(1.4) establishing connection among words with similarity larger than a given threshold lambda to finish the construction of the emotion concurrence graph; lambda is selected to be 0.6;

the step (3) comprises the following steps:

(3.2) rejecting labels containing special symbols in the labels;

the step (4) comprises the following steps:

(4.1) performing word segmentation on the residual labels in the burst period;

the mutual information calculation formula in step 4.4 is:

C(W₁) and C(W₂) denote the numbers of microblogs in the corpus that contain W₁ and W₂, respectively; C(W₁, W₂) denotes the number of microblogs that contain both W₁ and W₂; R is the size of the corpus, namely the total number of microblogs.