CN112287197A

CN112287197A - Method for detecting sarcasm of case-related microblog comments described by dynamic memory cases

Info

Publication number: CN112287197A
Application number: CN202011005842.8A
Authority: CN
Inventors: 余正涛; 谭陈琛; 相艳; 郭军军; 黄于欣; 线岩团
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2020-09-23
Filing date: 2020-09-23
Publication date: 2021-01-29
Anticipated expiration: 2040-09-23
Also published as: CN112287197B

Abstract

The invention relates to a case description dynamic memory case description-related microblog comment sarcasm detection method, and belongs to the technical field of natural language processing. The invention comprises the following steps: constructing a irony data set of microblog related; respectively carrying out feature coding on the text of the case-related microblog and the case-related microblog comments by utilizing word embedding and position embedding, and introducing an attention mechanism into the case-related microblog comments subjected to feature coding; obtaining a text representation related to the microblog comments related to the case through a dynamic memory mechanism; and splicing the obtained text representation and the comment representation to obtain a new representation, training a model by taking the representation as input, and performing case-related microblog comment ironic detection through the model. The invention realizes the irony detection of case-related comment sentences by using the case description information as the background information, detects the irony from the acquired public sentiment data and provides support for the subsequent sentiment analysis.

Description

Method for detecting sarcasm of case-related microblog comments described by dynamic memory cases

Technical Field

The invention relates to a case description dynamic memory case description-related microblog comment sarcasm detection method, and belongs to the technical field of natural language processing.

Background

With the rapid development of the internet, people pay more and more attention to case-related events, and views of the case-related events are often published on microblogs. In the field of endeavor, people often choose sarcasm to express their own subjective emotion to certain sensitive information. Therefore, ironic detection is an important task for public opinion analysis in the field of case.

The microblog ironic detection task has the following two characteristics: firstly, the ironic sentence is similar to the display emotional sentence in expression, and the ironic sentence is opposite in meaning and is difficult to distinguish on the premise of lacking background; and secondly, the contents of the microblog comments related to the case are not only related to the corresponding microblog texts, but also related to other microblog texts of the same case. The microblog is used for describing the same case, a plurality of microblog texts with different expressions exist, and the microblog texts of the same case can represent the complete case together to form complete description of the case. The sarcasm is detected from the acquired public sentiment data, so that the public sentiment can be correctly guided, and the negative influence caused by the public sentiment event is effectively reduced.

Disclosure of Invention

The invention provides a method for detecting sarcasm of case-related microblog comments described by dynamic memory cases, which is used for detecting sarcasm in case-related microblog emotion analysis and solves the problem of low emotion analysis performance caused by inaccurate sarcasm detection of case-related microblog comments.

The technical scheme of the invention is as follows: a method for detecting sarcasm of case-related microblog comments described by dynamic memory cases comprises the following steps:

step1, constructing a irony data set of the case microblog: a plurality of microblog comments and microblog texts related to the case are crawled through a crawler technology, and the data set is manually labeled to obtain a case-related microblog ironic data set.

Step2, respectively carrying out feature coding on the case-related microblog text and the case-related microblog comments by word embedding and position embedding, and introducing an attention mechanism into the case-related microblog comments after feature coding; obtaining a text representation related to the microblog comments related to the case through a dynamic memory mechanism; and splicing the obtained text representation and the comment representation to obtain a new representation, training a model by taking the representation as input, and performing case-related microblog comment ironic detection through the model.

As a further scheme of the invention, the specific steps of Step1 are as follows:

step1.1, crawling relevant microblog texts and comments of dozens of current hot cases from the Xinlang microblog by using a crawler based on a Scapy frame;

step1.2, filtering and screening the texts and comments of the microblog involved in the case, wherein the filtering and screening mode is as follows: (1) dividing microblog messages according to a forwarding relation '//' for ensuring that comments below the forwarded microblog are analyzed based on the original microblog, (2) deleting a structure of '@ + user name + reply' in the microblog comments and deleting irrelevant hyperlink advertisements;

step1.3, obtaining an involved microblog sarcasia data set by adopting artificial marking, marking the data set by taking one microblog comment as a unit, marking 0 cases as sarcasia in the microblog comment sentences, marking 1 as non-involved microblog sarcasia and obtaining intersection by three blind persons.

As a further scheme of the invention, the specific steps of Step2 are as follows:

step2.1, coding the microblog text involved in the case by adopting a word position coding mode to obtain a word vector l with position information_se；

Wherein S belongs to [0, S-1], S represents the number of words involved in the microblog text, E belongs to [0, E-1], E represents the dimension of word embedding;

step2.2, respectively using the text of the microblog related to the case as a position information word vector l_seAnd word vector w trained with word2vec on large-scale microblog corpus_sElement multiplication is carried out to obtain the text representation f of the case-related microblog_i；

l_sIs characterized by one-hot, representing a column vector representing information of a position information word, S_iRepresenting the word number of the ith case-related microblog text, multiplying the' representative vector by elements, and f_i∈R^EEmbedded representation with location information representing the ith case-related microblog text, where i e 0, N]N represents the number of case-related microblog texts, and the number of case-related microblog texts is N when consistent with the number of case-related microblog comments, w_sEmbedding a representative word into a vector;

step2.3, sending the text of the case-related microblog into a BiGRU model capable of extracting bidirectional semantic features to obtain coded text

Step2.4, commenting the microblog related to the case

Word vector representation w trained with word2vec_cAnd sending it into BiGRU model to obtain codeThe latter remarks

Wherein C belongs to [0, C ], and C represents the number of words in the comment;

step2.5, introducing an attention mechanism to the microblog comments related to the case, so that the microblog comments related to the case can pay attention to the word-level more key information in the sentence, and obtaining a comment v introduced with the attention mechanism;

wherein, W_uIs a parameter matrix, b_uIs an offset term, u_wRepresenting a word-level context vector, alpha_cIs a weight matrix, v_iThe weighted and summed comment sentence vectors are obtained, and v is the weighted comment;

step2.6, and characterizing the obtained text of the case

Characterization v of microblog comments involved in case and previous round of memory information m^t-1And performing interactive calculation to splice the obtained results to obtain a characteristic splicing matrix z^t；

m⁰＝v (6)

m⁰Representing initial memory information, initializing by using v, wherein t is memory frequency, |, represents element absolute value, [;]representing the concatenation of the vectors;

step2.7, introducing an attention mechanism, and acquiring a weighted microblog text representation g^t：

Wherein, W_z、W_ztIs a parameter matrix, b_z、b_wzIs a bias term;

step2.8, mixing g^tReplace the update gate in the GRU;

r^t＝σ(W_rzz^t+W_rhh^t-1+b_r) (8)

wherein r is^tIndicating a reset gate in the gated attention GRU,

representing candidate hidden layers, h^tRepresents a hidden state, W_rz、W_rh、

And

is a parameter matrix; b_rAnd

is a bias term;

step2.9, memorizing the information m of the previous round^t-1Hidden state h obtained by gated attention GRU^tSplicing with the comment v, inputting a linear layer, and activating by using a ReLU function; because one input round cannot well remember all required information, multiple iterations are required;

wherein m is^t∈R^2d×NMemory information of t memory times; w_mIs a parameter matrix;

step2.10, splicing the obtained text representation and the comment representation to obtain a new representation, and training a model by taking the representation as input; splicing the text representation related to the comment and the comment subjected to feature coding to obtain a new representation, and deciding the category of the maximum probability by adopting a softmax classification function;

wherein, W_yA parameter matrix is represented.

The invention has the beneficial effects that:

1. the consistency of the sarcasm of the microblog related files and the case description information is realized, the problem that the difference between the literal meaning and the actual meaning is large can be well solved, and the detection and measurement effect is improved.

2. A dynamic memory mechanism is utilized to obtain better semantic representation through multiple iterations, and related background knowledge is effectively memorized.

Drawings

Fig. 1 is a schematic diagram of a specific structure of the recognition model in the present invention.

Detailed Description

Example 1: as shown in fig. 1, a method for detecting sarcasm of case description related to microblog comments dynamically includes:

step1, constructing a irony data set of the case-related microblog; the method comprises the following specific steps:

step1.3, obtaining a sarcasm data set related to a microblog by adopting artificial marking; marking work is carried out by taking one microblog comment as a unit, the microblog comment sentences contain sarcasm and cases, the cases are marked as 0, other cases are regarded as non-sarcasm microblog sarcasm and marked as 1, and the intersection is obtained by blind judgment of three persons. After the data are out of order, the data are divided into a training set and a testing set, and the data distribution condition is shown in table 1.

TABLE 1 statistical information of the microblog irony dataset

Step2.4, commenting the microblog related to the case

Word vector representation w trained with word2vec_cAnd sending the comment into a BiGRU model to obtain the coded comment

step2.6, and characterizing the obtained text of the case

m⁰＝v (6)

Wherein, W_z、W_ztIs a parameter matrix, b_z、b_wzIs a bias term;

step2.8, mixing g^tReplace the update gate in the GRU;

r^t＝σ(W_rzz^t+W_rhh^t-1+b_r) (8)

wherein r is^tIndicating a reset gate in the gated attention GRU,

And

is a parameter matrix; b_rAnd

is a bias term;

wherein, W_yA parameter matrix is represented.

To illustrate the effect of the present invention, 2-group comparative experiments were set up. The first group of experiments verify the effectiveness of the case description information in fusion, and the other group of experiments verify the effectiveness of the dynamic memory mechanism.

(1) Validity verification of dynamic memory mechanism integrated with case description information

And comparing the modes of ironic detection only by using microblog comment sentences and ironic detection by combining the comment sentences of case description information in the reference model. And taking the text corresponding to the case-related microblog comment as case description information of the case-related microblog comment in a reference model, firstly, respectively inputting the microblog comment and the text corresponding to the microblog comment into the model, then, splicing the characteristics of the microblog comment and the text, and finally, classifying. The results of the experiment are shown in table 2.

TABLE 2 comparison of experimental results with case description information

As can be seen from the analysis of table 2, comparing the two data sets under the same model, it can be seen that the performance index combining the case description information and the comment data set is superior to the performance index of the comment data set only, wherein the F1 value is raised by 2.41% at the maximum after the HAN model adds the case description information, which proves that the introduction of the case description information in the ironic detection is effective. The F1 value for the selfatention model was the greatest in both datasets, reaching 83.10% and 83.15% for Acc and F1 values, respectively, after combining case description information, but still lower than the model herein. Acc and F1 values of the model in the table 2 are optimal results, and reach 85.65% and 85.91% respectively, so that the effectiveness of the model on the task of micro-blog ironic identification is proved.

(2) Validity verification of dynamic memory mechanism

The second part verifies the validity of the dynamic memory mechanism, namely model performance under different memory times is compared, and the experimental results are shown in table 3.

TABLE 3 validation of memory mechanisms

Analysis of Table 3 reveals that when the number of memorization is 0, m is⁰Is v: acc value, P value, R value and F1 value are lowest; when the Acc value and the F1 value in the memory times of 0-3 are increased along with the increase of the memory times and reach the optimal performance in three times of memory, the performance of the model can be influenced and optimized by memorizing case description information for many times. By comparing the evaluation indexes of 3 and 4 memoization, it can be seen that the Acc value and the F1 value begin to decrease as the number of memoization continues to increase, respectively 1.94% and 2.39% lower than the 3 memoization, indicating that overfitting occurs as the number of memoization increases. Experiments prove that the best effect is obtained by 3 times of memory.

Furthermore, combining tables 2 and 3, it can be seen that when the number of remembers is 0, the model herein outperforms the baseline model at Acc, P, and F1 values, indicating that a single body of text does not provide support for comments. When the number of memorization times is 3, the Acc value is raised by 6.01% and the F1 value is raised by 6.27% compared with the RCNN model with the comment as input. On one hand, it is verified that the case description information is effectively added into the case detection task related to the microblog sarcasm, namely, the case description information is beneficial to improving the case detection performance related to the microblog sarcasm; on the other hand, the dynamic memory mechanism provided by the text can fully memorize case description information related to case-related microblog comments, and the memorized information can effectively guide the sarcasm detection task.

The experimental data prove that the problem that the difference between the literal meaning and the actual meaning is large can be well solved by judging the consistency of case description information and comment sentences. Meanwhile, a dynamic memory mechanism is utilized to obtain better semantic representation through multiple iterations. Experiments prove that the method provided by the invention achieves the optimal effect compared with a plurality of baseline models. Aiming at the sarcasm detection task of the concerned microblog comment, the method for guiding the microblog comment detection by using the case description information is effective.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A method for detecting sarcasm of case-related microblog comments described by dynamic memory cases is characterized by comprising the following steps: the method comprises the following steps:

step1, constructing a irony data set of the case-related microblog;

2. The detection method for the sarcasm of the related microblog comments described by the dynamic memory case as claimed in claim 1, wherein the detection method comprises the following steps: in Step1, crawling a plurality of microblog comments and microblog texts related to the case through a crawler technology, and manually labeling the data set to obtain a case-related microblog ironic data set.

3. The detection method for the sarcasm of the micro-blog comments related to the dynamic memory case description according to claim 1 or 2, wherein: the specific steps of Step1 are as follows:

step1.3, obtaining a sarcasm data set related to a microblog by adopting artificial marking; marking work is carried out by taking one microblog comment as a unit, the microblog comment sentences contain sarcasm and cases, the cases are marked as 0, other cases are regarded as non-sarcasm microblog sarcasm and marked as 1, and the intersection is obtained by blind judgment of three persons.

4. The detection method for the sarcasm of the related microblog comments described by the dynamic memory case as claimed in claim 1, wherein the detection method comprises the following steps: the specific steps of Step2 are as follows:

l_sIs characterized by one-hot, representing a column vector representing information of a position information word, S_iRepresenting the word number of the ith case-related microblog text, multiplying the' representative vector by elements, and f_i∈R^ERepresents the ithEmbedded representation with location information for a subject microblog text, where i e [0, N]N represents the number of case-related microblog texts, and the number of case-related microblog texts is N when consistent with the number of case-related microblog comments, w_sEmbedding a representative word into a vector;