CN111324734A

CN111324734A - Case microblog comment emotion classification method integrating emotion knowledge

Info

Publication number: CN111324734A
Application number: CN202010096024.7A
Authority: CN
Inventors: 余正涛; 郭贤伟; 相艳; 郭军军; 黄于欣; 朱恩昌
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2020-02-17
Filing date: 2020-02-17
Publication date: 2020-06-23
Anticipated expiration: 2040-02-17
Also published as: CN111324734B

Abstract

The invention relates to a case microblog comment emotion classification method fusing emotion knowledge, and belongs to the technical field of natural language processing. First, an emotion knowledge base is constructed, comprising a case microblog emotion dictionary, emoticons, internet slang, a negation dictionary and a degree adverb dictionary. Second, attribute feature representations are constructed. Finally, the semantic representation and the attribute feature representation of the comments are fused based on a convolutional neural network to build an emotion classification model, which extracts deep semantic features and serialized emotion knowledge features to realize emotion classification. Experimental results on a case microblog comment corpus show that, compared with the benchmark method INIT-CNN, the Macro_F1 and Micro_F1 indexes are improved by 1.88% and 1.94% respectively.

Description

Case microblog comment emotion classification method integrating emotion knowledge

Technical Field

The invention relates to a case microblog comment emotion classification method fusing emotion knowledge, and belongs to the technical field of natural language processing.

Background

Case public opinion refers to internet public opinion related to legal cases. Microblogs, as a prominent representative of social media platforms, are an important source of case public opinion. Hot cases tend to ferment and evolve rapidly on microblog platforms, generating a large volume of comments and case-related public opinion. Such public opinion can easily cause social disorder and influence judicial decisions, and the state is also actively promoting the construction of intelligent courts. Emotion classification of case microblog comments is therefore necessary and significant for understanding case public opinion in a timely manner so that the relevant departments can make decisions quickly.

Emotion classification is a subtask of text emotion analysis. Emotion classification of case microblog comments can be regarded as a fine-grained emotion classification task in a specific domain, and it is important for preventing public opinion risks. Emotion classification methods generally fall into three groups: emotion dictionary based methods, traditional machine learning based methods, and deep learning based methods. Emotion dictionary based methods classify a text mainly by combining and scoring the emotion words it contains with the help of an emotion dictionary. They depend heavily on the dictionary, and at present no emotion dictionary is complete enough to work well for microblog texts. Emotion classification based on traditional machine learning is a common supervised learning approach that requires a large amount of labeled data and complex manual feature engineering. Most existing emotion classification research is based on deep learning. Although deep learning methods avoid, to a certain extent, the shortcomings of the dictionary based and traditional machine learning based methods, most of them only encode the text as a whole; they do not make effective use of existing emotion computing resources, and the contribution of emoticons, internet slang and the like to emotion classification is not well reflected. To address the difficulty that existing emotion classification methods have in effectively using emotion knowledge in case microblog comments, such as common emoticons, negation rules and new domain emotion words, a case microblog comment emotion classification method fusing emotion knowledge is proposed.

Disclosure of Invention

The invention provides a case microblog comment emotion classification method fusing emotion knowledge. It addresses two problems: the classification performance of traditional methods on the case microblog comment emotion classification task is not high, and traditional emotion analysis methods have difficulty making effective use of emotion knowledge in the comments, such as common emoticons, negation rules and new domain emotion words.

The technical scheme of the invention is as follows: the case microblog comment emotion classification method integrating emotion knowledge comprises the following specific steps of:

Step1, constructing the case microblog comment corpus vocabulary: collecting case microblog comment texts as the experimental data set, and preprocessing the texts by deleting meaningless characters, word segmentation and part-of-speech tagging to obtain the case microblog comment corpus vocabulary;

Step2, constructing the basic emotion dictionary: based on the emotion vocabulary ontology of Dalian University of Technology, its 7 emotion categories of happiness, like, anger, sadness, fear, disgust and surprise are retained to construct the basic emotion dictionary; by sorting existing emotion computing resources, and by collecting and classifying the emoticons and internet slang commonly used on microblogs, a negation dictionary, a degree adverb dictionary, an emoticon set and an internet slang set are obtained;

Step3, constructing the seed emotion word set: taking all words of the basic emotion dictionary that appear in the case microblog comment corpus vocabulary as seed emotion words to form the seed emotion word set;

Step4, constructing the case microblog emotion dictionary: first, candidate emotion words of the 7 emotion categories are mined from the case microblog comment corpus vocabulary using the SO-PMI (semantic orientation pointwise mutual information) algorithm; then, the word vector cosine similarity between the candidate emotion words of each category and the seed emotion words of the corresponding category is calculated, and candidate emotion words whose average cosine similarity is greater than 0.5 are kept as new case microblog emotion words of the corresponding category, forming the expanded emotion dictionary; next, the expanded emotion words are manually screened and added to the seed emotion word set, and the process is iterated incrementally to mine further new domain emotion words; finally, the iteration stops when the algorithm can no longer mine new emotion words, and the expanded emotion dictionary and the basic emotion dictionary are merged to obtain the case microblog emotion dictionary;

As a preferred scheme of the invention, in Step4, the SO-PMI algorithm first screens out, from the case microblog comment corpus vocabulary, all words whose part of speech is adjective, verb, noun, adverb or the emoticon part of speech "emoji", where "emoji" is a manually defined part-of-speech label for emoticon words; then, for each such word, the SO-PMI value between the word and the seed emotion words of each emotion category in the seed emotion word set is calculated, and words with an SO-PMI value greater than zero are kept as candidate emotion words of the corresponding category; an SO-PMI value greater than zero indicates that the word is related to the current emotion category, and the larger the value, the stronger the relation; the calculation formula of the SO-PMI algorithm is as follows:

$$\mathrm{PMI}(word_1, word_2) = \log \frac{p(word_1 \,\&\, word_2)}{p(word_1)\, p(word_2)}$$

$$\text{SO-PMI}(word_1) = \sum_{word_2 \in S_{some\text{-}kind}} \mathrm{PMI}(word_1, word_2) - \sum_{word_2 \in S_{others}} \mathrm{PMI}(word_1, word_2)$$

wherein $word_1$ represents a word appearing in the case microblog comment corpus, and $word_2$ represents a seed emotion word; $p(word_1 \,\&\, word_2)$ represents the probability that $word_1$ and $word_2$ occur together in the case microblog comment corpus, $p(word_1)$ represents the probability that $word_1$ occurs in the corpus, and $p(word_2)$ represents the probability that $word_2$ occurs in the corpus; $S_{some\text{-}kind}$ represents the seed emotion word set of a given emotion category, and $S_{others}$ represents the seed emotion word sets of the other 6 categories.
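The SO-PMI screening described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: probabilities are estimated as the fraction of comments containing a word (the text does not fix the co-occurrence window), the logarithm is taken in base 2, pairs that never co-occur contribute zero, and the function names `pmi`, `so_pmi` and `candidate_words` are hypothetical.

```python
import math


def pmi(word1, word2, docs):
    """Pointwise mutual information of two words over a list of token sets.

    Assumption: p(.) is the fraction of comments that contain the word(s).
    Pairs that never co-occur contribute 0.0 instead of minus infinity.
    """
    n = len(docs)
    c1 = sum(1 for d in docs if word1 in d)
    c2 = sum(1 for d in docs if word2 in d)
    c12 = sum(1 for d in docs if word1 in d and word2 in d)
    if c1 == 0 or c2 == 0 or c12 == 0:
        return 0.0
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))


def so_pmi(word, category, seeds, docs):
    """PMI mass towards one category's seeds minus the mass towards all others."""
    same = sum(pmi(word, s, docs) for s in seeds[category])
    others = sum(
        pmi(word, s, docs)
        for cat, words in seeds.items()
        if cat != category
        for s in words
    )
    return same - others


def candidate_words(words, seeds, docs):
    """Keep, per category, the non-seed words whose SO-PMI is greater than zero."""
    result = {cat: [] for cat in seeds}
    for w in words:
        for cat in seeds:
            if w not in seeds[cat] and so_pmi(w, cat, seeds, docs) > 0:
                result[cat].append(w)
    return result
```

The part-of-speech screening (adjective, verb, noun, adverb, "emoji") would be applied to `words` before calling `candidate_words`; it is left out here to keep the sketch short.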

As a preferred scheme of the invention, in Step4, the word vector cosine similarity calculation formula is as follows:

$$\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}, \qquad \overline{\cos}(v_i) = \frac{1}{n} \sum_{j=1}^{n} \cos(v_i, v_j), \quad i = 1, 2, \dots, m$$

wherein $v_i$ represents the word vector of a candidate emotion word, and $m$ represents the total number of candidate emotion words in the current category; $v_j$ represents the word vector of a seed emotion word, and $n$ represents the total number of seed emotion words in the current category.

Step5, integrating all of the above resources into a case microblog emotion knowledge base comprising the case microblog emotion dictionary, the negation dictionary, the degree adverb dictionary, the emoticon set and the internet slang set;

Step6, using the case microblog emotion knowledge base to define part-of-speech and emotion label attribute features for words and to construct the attribute feature representation of case microblog comments; fusing the semantic representation and the attribute feature representation of the comments through a dual-channel convolutional neural network and training the comment emotion classifier, thereby constructing a dual-channel convolutional neural network model that fuses case microblog emotion knowledge and realizes emotion classification of case microblog comments.

As a preferred scheme of the invention, in Step6, the semantic representation of a case microblog comment is obtained by segmenting the comment sentence into words, loading the pre-trained word vector table $W^{N \times d}$, and looking up a word vector for each word;

wherein $N$ represents the number of words in the vocabulary, and $d$ represents the dimension of the word vectors; given a comment text sequence $T = \{w_1, w_2, \dots, w_n\}$ containing $n$ words, the word vector $v_i$ of each word $w_i$ in $T$ is looked up in the word vector table $W^{N \times d}$, and the semantic representation matrix $M_T$ of the sequence $T$ is:

$$M_T = v_1 \oplus v_2 \oplus \cdots \oplus v_n$$

wherein $\oplus$ represents the concatenation operation along the row-vector direction;

the attribute feature representation of a case microblog comment is obtained by constructing an attribute feature representation matrix based on a sparse binary vector representation; first, $K$ part-of-speech and emotion label attributes are defined for each word; then, for a given comment text sequence $T = \{w_1, w_2, \dots, w_n\}$ containing $n$ words, each word $w_i$ is mapped, through part-of-speech tagging and lookup in the case microblog emotion knowledge base, to a $K$-dimensional Boolean vector $v_{bool_i}$ whose entries take the value 0 or 1, where 0 indicates that the feature is absent and 1 indicates that the feature is present; finally, an $n \times K$-dimensional attribute feature representation matrix $M_E$ is obtained:

$$M_E = v_{bool_1} \oplus v_{bool_2} \oplus \cdots \oplus v_{bool_n}$$

As a preferred scheme of the invention, in Step6, the dual-channel convolutional neural network model is constructed by taking INIT-CNN as the reference method; INIT-CNN is a text classification model built on a convolutional neural network using an initialized convolution filter technique.

Further, convolutional neural networks are widely used for short text classification because their computation can be parallelized and they extract classification features automatically, and many researchers have shown them to be very effective. Existing research has built a text classification model on a convolutional neural network using an initialized convolution filter technique and achieved excellent results. The convolutional neural network model using this technique is called INIT-CNN; taking it as the reference method, a dual-channel convolutional neural network model is constructed that fuses case microblog emotion knowledge and realizes emotion classification of case microblog comments.

As a preferred scheme of the invention, in Step6, after the semantic representation and the attribute feature representation of the comments are constructed, both are input into the dual-channel convolutional neural network, which extracts deep semantic features and emotion knowledge features; the two kinds of features are then concatenated directly to obtain the combined semantic features; finally, the combined semantic features are passed through a fully connected layer for linear transformation and dimension reduction, and a Softmax layer outputs the class probability distribution for each input text.

The invention has the beneficial effects that:

The method first constructs an emotion knowledge base comprising a case microblog emotion dictionary, emoticons, internet slang, a negation dictionary and a degree adverb dictionary. Second, an attribute feature representation is constructed. Finally, the semantic representation and the attribute feature representation of the comments are fused based on a convolutional neural network to build an emotion classification model, which extracts deep semantic features and serialized emotion knowledge features to realize emotion classification. The method was verified on a case microblog comment data set; the experimental results show improvements of 1.88% in Macro_F1 and 1.94% in Micro_F1 over the reference method INIT-CNN, which demonstrates the effectiveness of the method.

Drawings

FIG. 1 is a flow chart of construction of case microblog emotion dictionaries in a case microblog emotion knowledge base according to the present invention;

FIG. 2 is a model diagram of the case microblog comment emotion classification method with emotion knowledge fused.

Detailed Description

Example 1: as shown in fig. 1-2, the case microblog comment emotion classification method with emotion knowledge integrated includes the following specific steps:

wherein $\oplus$ represents the concatenation operation along the row-vector direction;

Example 2: as shown in fig. 1-2, the case microblog comment emotion classification method with emotion knowledge integrated includes the following specific steps:

Step1, collecting microblog comment data on 7 cases that have attracted wide attention in recent years from the Sina Weibo platform. Wherein, the Laiyuan killing case is 10812 sentences, and the Jiangsong case is 13624 sentences; 17491 sentences for Zhaoyu, Yi and Yong; 17875 sentences in all the cases of being damaged are clear; 33774 sentences for the Chongqing bus falling into the river case; 37162 sentences for the Henan Maserati hit-and-run case; and 58626 sentences for the Xi'an Mercedes-Benz female car owner rights-protection case, for 189364 sentences in total. All 189364 comments are used as experimental data for constructing the case microblog emotion dictionary; in addition, 30000 comments are randomly sampled from these data for manual emotion annotation, and 11593 comments are finally retained as experimental data for emotion classification. Data preprocessing operations such as deleting meaningless characters, word segmentation and part-of-speech tagging are also carried out on the experimental data set;

As a preferred scheme of the invention, in Step1, a crawler program is written in Python, a pool of user accounts and cookies is built, and the case microblog comment data set is collected through the Sina Weibo API; the data preprocessing is also implemented in Python: the text is first deduplicated and stripped of meaningless characters ("/", "@", URLs and the like), and the comment corpus is then segmented and part-of-speech tagged with the jieba word segmentation tool, using its user dictionary function, to obtain the case microblog comment corpus vocabulary.

This preferred design is an important component of the method: it constructs the case microblog comment experimental data set and provides data support for case microblog comment emotion classification fusing emotion knowledge.
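The cleaning part of the Step1 preprocessing can be sketched as follows. This is a minimal, dependency-free illustration under assumptions: the URL pattern covers only ASCII `http`/`https` links, deduplication is applied after cleaning (the text does not fix the order), and the names `clean_comment` and `preprocess` are hypothetical.

```python
import re

# Assumption: URLs are plain ASCII http(s) links, as in typical short links.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&_%#~:-]+")
# The text names "/" and "@" as examples of meaningless characters.
NOISE_RE = re.compile(r"[/@]")
SPACE_RE = re.compile(r"\s+")


def clean_comment(text):
    """Remove URLs and the noise characters, then normalise whitespace."""
    text = URL_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(comments):
    """Clean every comment, drop empty results and exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for raw in comments:
        comment = clean_comment(raw)
        if comment and comment not in seen:
            seen.add(comment)
            cleaned.append(comment)
    return cleaned
```

Word segmentation and part-of-speech tagging would follow on the cleaned comments, for example with jieba's `load_userdict` and `posseg.cut`; that step is left out here so the sketch runs without third-party packages.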

Step2, sorting existing emotion computing resources, and collecting and classifying the emoticons and internet slang commonly used on microblogs, to construct the basic emotion dictionary, the negation dictionary, the degree adverb dictionary, the emoticon set and the internet slang set;

As a preferred scheme of the invention, in Step2, the basic emotion dictionary is constructed based on DUTSD, the emotion vocabulary ontology of Dalian University of Technology, with its 7 emotion categories: happiness, like, anger, sadness, fear, disgust and surprise; considering the important role that emoticons and internet slang play in expressing emotion, the commonly used emotionally colored items in the emoticon set and the internet slang set are classified into these 7 categories and added to DUTSD to form the basic emotion dictionary;

This preferred design is an important component of the method: it constructs the basic emotion dictionary and provides the basis for constructing the case microblog emotion dictionary.

Step3, acquiring a seed emotion word set;

As a preferred scheme of the invention, in Step3, all words of the basic emotion dictionary that appear in the case microblog comment corpus vocabulary are taken as seed emotion words to form the seed emotion word set.

This preferred design is an important component of the method: it obtains the seed emotion word set and provides the basis for constructing the case microblog emotion dictionary.
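Selecting the seed emotion words in Step3 amounts to intersecting each category of the basic emotion dictionary with the corpus vocabulary. A minimal sketch, assuming the dictionary is held as a mapping from category name to words (the data layout and the name `build_seed_set` are hypothetical):

```python
def build_seed_set(basic_dict, corpus_vocab):
    """Return, per emotion category, the dictionary words found in the corpus.

    basic_dict:   {category: iterable of words} (assumed layout)
    corpus_vocab: iterable of words in the comment corpus vocabulary
    """
    vocab = set(corpus_vocab)
    return {category: set(words) & vocab for category, words in basic_dict.items()}
```

Categories with no word in the corpus end up with an empty seed set, which makes gaps in dictionary coverage easy to spot before Step4.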

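As a preferred scheme of the invention, in Step4, considering that the candidate emotion words automatically mined by the SO-PMI algorithm may contain noise, the noisy candidates are filtered out by calculating the word vector cosine similarity between the candidate emotion words of each category and the seed emotion words of the corresponding category, and keeping only the candidates whose average cosine similarity is greater than 0.5 as new case microblog emotion words of the corresponding category. The word vector cosine similarity calculation formula is as follows:

$$\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}, \qquad \overline{\cos}(v_i) = \frac{1}{n} \sum_{j=1}^{n} \cos(v_i, v_j), \quad i = 1, 2, \dots, m$$

wherein $v_i$ is the word vector of a candidate emotion word, $v_j$ is the word vector of a seed emotion word, and $m$ and $n$ are the numbers of candidate and seed emotion words in the current category.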
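The noise filter of Step4 can be illustrated with plain Python lists standing in for word vectors. This is a sketch under assumptions: the vectors come from some pre-trained embedding held in a dictionary, a zero vector is given similarity 0, and the names `cosine` and `filter_candidates` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_candidates(candidates, seeds, vectors, threshold=0.5):
    """Keep candidates whose average cosine similarity to the seeds exceeds threshold."""
    kept = []
    for word in candidates:
        sims = [cosine(vectors[word], vectors[seed]) for seed in seeds]
        if sims and sum(sims) / len(sims) > threshold:
            kept.append(word)
    return kept
```

The function is applied once per emotion category, with that category's candidates and seeds; the default `threshold=0.5` is the value stated in the text.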

Step5, integrating all of the above resources into a case microblog emotion knowledge base containing the case microblog emotion dictionary, the negation dictionary, the degree adverb dictionary, the emoticon set and the internet slang set, as shown in Table 1;

Table 1. Case microblog emotion knowledge base (unit: entries)

This preferred design is an important component of the invention: it constructs the case microblog emotion knowledge base, which supports the construction of the case microblog comment emotion classification model fusing emotion knowledge.

The case microblog comment emotion classification model fusing emotion knowledge is constructed on the basis of a convolutional neural network and the case microblog emotion knowledge base, and its structure is a dual-channel convolutional neural network. Convolutional neural networks are widely used for short text classification because their computation can be parallelized and they extract classification features automatically, and many researchers have shown them to be very effective. Existing research has built a text classification model on a convolutional neural network using an initialized convolution filter technique and achieved excellent results. The convolutional neural network model using this technique is called INIT-CNN; taking it as the reference method, a dual-channel convolutional neural network model is constructed that fuses case microblog emotion knowledge and realizes emotion classification of case microblog comments.

In Step6, before the model can be trained and used for classification, the text must be represented in a computer-readable form. The first step is therefore to construct the semantic representation and the attribute feature representation of the case microblog comments that are input to the model. The semantic representation of a case microblog comment is obtained by segmenting the comment sentence into words, loading the pre-trained word vector table $W^{N \times d}$, and looking up a word vector $v_i$ for each word $w_i$; the semantic representation matrix of a comment containing $n$ words is:

$$M_T = v_1 \oplus v_2 \oplus \cdots \oplus v_n$$

wherein $\oplus$ represents the concatenation operation along the row-vector direction;
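The word vector lookup behind the semantic representation can be sketched as stacking one row per word. The handling of words missing from the table is an assumption (a zero vector is used here; the text does not say), and the name `semantic_matrix` is hypothetical.

```python
def semantic_matrix(tokens, word_vectors, dim):
    """Build the n x d semantic matrix by looking up each token's vector.

    word_vectors: {word: list of d floats}, standing in for the table W.
    Assumption: out-of-vocabulary words map to a d-dimensional zero vector.
    """
    zero = [0.0] * dim
    return [list(word_vectors.get(token, zero)) for token in tokens]
```

Each row is a copy of the stored vector, so later changes to the matrix do not alter the lookup table.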

the attribute feature representation of a case microblog comment is obtained by constructing an attribute feature representation matrix based on a sparse binary vector representation; first, 15 part-of-speech and emotion label attributes are defined for each word, as shown in Table 2; then, for a given comment text sequence $T = \{w_1, w_2, \dots, w_n\}$ containing $n$ words, each word $w_i$ is mapped, through part-of-speech tagging and lookup in the case microblog emotion knowledge base, to a 15-dimensional Boolean vector $v_{bool_i}$ whose entries take the value 0 or 1, where 0 indicates that the attribute is absent and 1 indicates that the attribute is present; finally, an $n \times 15$-dimensional attribute feature representation matrix $M_E$ is obtained:

$$M_E = v_{bool_1} \oplus v_{bool_2} \oplus \cdots \oplus v_{bool_n}$$

Table 2. Part-of-speech and emotion label attributes of words
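The Boolean attribute matrix can be sketched as below. The 15 attribute names are purely illustrative assumptions, since the contents of Table 2 are not reproduced in this text; they are chosen only so that the vector has 15 dimensions (five parts of speech, seven emotion categories, and three knowledge-base flags). The function names and the knowledge-base layout are likewise hypothetical.

```python
# Hypothetical attribute inventory (the real one is defined in Table 2).
ATTRIBUTES = [
    "adj", "verb", "noun", "adv", "emoji",
    "happiness", "like", "anger", "sadness", "fear", "disgust", "surprise",
    "negation", "degree_adverb", "slang",
]


def attribute_vector(word, pos_tag, knowledge_base):
    """Map one tagged word to a 0/1 vector over ATTRIBUTES.

    knowledge_base: {word: set of attribute names} (assumed layout)
    """
    active = {pos_tag} | set(knowledge_base.get(word, ()))
    return [1 if name in active else 0 for name in ATTRIBUTES]


def attribute_matrix(tagged_tokens, knowledge_base):
    """Stack the per-word vectors into the n x K attribute matrix."""
    return [attribute_vector(w, pos, knowledge_base) for w, pos in tagged_tokens]
```

A part-of-speech tag outside the inventory simply sets no bit, so the matrix stays sparse for words that carry no emotion knowledge.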

As a preferred scheme of the invention, in Step6, after the semantic representation and the attribute feature representation of the comments are constructed, both are input into the dual-channel convolutional neural network, which extracts deep semantic features and emotion knowledge features; the two kinds of features are then concatenated directly to obtain the combined semantic features; finally, the combined semantic features are passed through a fully connected layer for linear transformation and dimension reduction, and a Softmax layer outputs the class probability distribution for each input text. The invention uses a cross-entropy loss function to measure the difference between the predicted and the true probability distributions of the emotion labels, and trains the model with the back-propagation algorithm.

This preferred design is an important component of the method: it constructs the case microblog comment emotion classification model fusing emotion knowledge and provides the theoretical support for the emotion classification model.
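The forward pass of the dual-channel model described above (convolution on each channel, concatenation, fully connected layer, Softmax, cross-entropy) can be sketched in plain Python. This is only an illustration under assumptions: it uses ReLU with max-over-time pooling, takes all filters and weights as given rather than trained, and does not reproduce the INIT-CNN filter initialisation; every function and parameter name is hypothetical. The experiments in the text use TensorFlow instead.

```python
import math


def conv_max_pool(matrix, filters, bias=0.0):
    """1-D convolution over rows, ReLU, then max-over-time pooling.

    matrix:  n x d list of rows; filters: list of h x d kernels.
    Returns one feature per kernel (0.0 if the text is shorter than h).
    """
    features = []
    for kernel in filters:
        h = len(kernel)
        best = 0.0  # ReLU floor: max(relu(s)) equals max(0, max(s))
        for start in range(len(matrix) - h + 1):
            s = bias
            for r in range(h):
                s += sum(x * w for x, w in zip(matrix[start + r], kernel[r]))
            best = max(best, s)
        features.append(best)
    return features


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(m_t, m_e, sem_filters, attr_filters, weights, biases):
    """Concatenate both channels' features, apply a linear layer and softmax."""
    fused = conv_max_pool(m_t, sem_filters) + conv_max_pool(m_e, attr_filters)
    logits = [
        sum(w * f for w, f in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


def cross_entropy(probs, gold_index):
    """Cross-entropy loss for a single example with a one-hot gold label."""
    return -math.log(probs[gold_index])
```

Keeping the two channels separate until the concatenation means the attribute channel can use its own filter widths and counts, independent of the word vector dimension.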

Step7, carrying out experiments on the constructed model aiming at the case microblog comment data set, and verifying the effectiveness of the invention.

As a preferred scheme of the invention, in Step7, the experimental models are built with the TensorFlow deep learning framework. For the model of the invention, the pre-trained word vectors and the 15 part-of-speech and emotion label attributes defined from the case microblog emotion knowledge base are used to construct, respectively, the semantic feature representation and the attribute feature representation of the comment text; both are used together as the model input to train a 7-class emotion classifier. For the reference method, only the pre-trained word vectors are used to construct the semantic feature representation of the comment text, which is used directly as the model input to train a 7-class emotion classifier. The experimental results and analysis are as follows:

Table 3. Comparison of the overall classification performance of Our Model and INIT-CNN

Table 4. Experimental results of the reference method INIT-CNN for each emotion category

Table 5. Experimental results of Our Model for each emotion category

The method of the invention, Our Model, integrates the constructed emotion knowledge base into the reference method INIT-CNN. Table 3 shows that, compared with INIT-CNN, Our Model improves Macro_Precision by 1.82%, Macro_Recall by 1.91%, Macro_F1 by 1.88% and Micro_F1 by 1.94%. These results show that Our Model has better overall classification performance than INIT-CNN. Tables 4 and 5 show that the Precision, Recall and F1 of Our Model for the individual emotion categories are, on the whole, higher than those of INIT-CNN. In particular, on the disgust category, the Precision, Recall and F1 of Our Model are 6.30%, 8.86% and 7.64% higher, respectively, than those of INIT-CNN. Overall, Our Model is therefore also better than INIT-CNN at recognizing the individual emotion categories. The experimental results show that the proposed case microblog comment emotion classification method fusing emotion knowledge is effective.
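For readers who want to recompute the reported indexes, the sketch below shows one common way to obtain Macro_F1 and Micro_F1 from gold and predicted labels. It assumes that Macro_F1 is the unweighted mean of the per-class F1 values and that Micro_F1 is computed from counts pooled over all classes; the text does not state which definitions were used, and the name `evaluate` is hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(gold, pred, labels):
    """Return (macro_f1, micro_f1) for parallel lists of gold and predicted labels."""
    tp_sum = fp_sum = fn_sum = 0
    class_f1 = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        class_f1.append(f1(precision, recall))
        tp_sum += tp
        fp_sum += fp
        fn_sum += fn
    macro_f1 = sum(class_f1) / len(labels)
    micro_p = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    micro_r = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    return macro_f1, f1(micro_p, micro_r)
```

Macro averaging weights all 7 emotion categories equally, so gains on a small category such as disgust show up more clearly there than in the micro average.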

Table 6. Example data analysis for Our Model and the reference method INIT-CNN

The experiments and the example data analysis show that when only the benchmark method INIT-CNN is used, some emotion knowledge features common in case microblog comments cannot be learned effectively by the neural network. For example, in Table 6, emotion knowledge such as the emoticon "[spit]", the new case microblog emotion word "woman hero", the internet slang "TMD", the negation rule "dislike" and the degree adverb rule "too toxic" is learned well by Our Model, which correctly predicts the emotion category of every example; the benchmark method INIT-CNN does not learn this emotion knowledge well and therefore fails to predict the emotion categories of these examples correctly. The case microblog comment emotion classification method fusing emotion knowledge is thus practical for emotion classification of case microblog comments.

This preferred design is an important component of the method: it verifies the feasibility and effectiveness of the case microblog comment emotion classification model fusing emotion knowledge that the method constructs.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A case microblog comment emotion classification method fusing emotion knowledge, characterized in that the method comprises the following specific steps:

Step4, constructing the case microblog emotion dictionary: first, candidate emotion words of the 7 emotion categories are mined from the case microblog comment corpus vocabulary using the SO-PMI (semantic orientation pointwise mutual information) algorithm; then, the word vector cosine similarity between the candidate emotion words of each category and the seed emotion words of the corresponding category is calculated, and candidate emotion words whose average cosine similarity is greater than 0.5 are kept as new case microblog emotion words of the corresponding category, forming the expanded emotion dictionary; next, the expanded emotion words are manually screened and added to the seed emotion word set, and the process is iterated incrementally to mine further new domain emotion words; finally, the iteration stops when the algorithm can no longer mine new emotion words, and the expanded emotion dictionary and the basic emotion dictionary are merged to obtain the case microblog emotion dictionary;

2. The case microblog comment emotion classification method fusing emotion knowledge according to claim 1, characterized in that: in Step6, the semantic representation of a case microblog comment is obtained by segmenting the comment sentence into words, loading the pre-trained word vector table $W^{N \times d}$, and looking up a word vector $v_i$ for each word $w_i$; the semantic representation matrix of a comment containing $n$ words is:

$$M_T = v_1 \oplus v_2 \oplus \cdots \oplus v_n$$

wherein $\oplus$ represents the concatenation operation along the row-vector direction;

3. The case microblog comment emotion classification method fusing emotion knowledge according to claim 1, characterized in that: in Step6, the dual-channel convolutional neural network model is constructed by taking INIT-CNN as the reference method; INIT-CNN is a text classification model built on a convolutional neural network using an initialized convolution filter technique.

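4. The case microblog comment emotion classification method fusing emotion knowledge according to claim 1, characterized in that: in Step4, the SO-PMI algorithm first screens out, from the case microblog comment corpus vocabulary, all words whose part of speech is adjective, verb, noun, adverb or the emoticon part of speech "emoji", where "emoji" is a manually defined part-of-speech label for emoticon words; then, for each such word, the SO-PMI value between the word and the seed emotion words of each emotion category in the seed emotion word set is calculated, and words with an SO-PMI value greater than zero are kept as candidate emotion words of the corresponding category; an SO-PMI value greater than zero indicates that the word is related to the current emotion category, and the larger the value, the stronger the relation; the calculation formula of the SO-PMI algorithm is as follows:

$$\mathrm{PMI}(word_1, word_2) = \log \frac{p(word_1 \,\&\, word_2)}{p(word_1)\, p(word_2)}$$

$$\text{SO-PMI}(word_1) = \sum_{word_2 \in S_{some\text{-}kind}} \mathrm{PMI}(word_1, word_2) - \sum_{word_2 \in S_{others}} \mathrm{PMI}(word_1, word_2)$$

wherein $word_1$ represents a word appearing in the case microblog comment corpus, $word_2$ represents a seed emotion word, $S_{some\text{-}kind}$ represents the seed emotion word set of a given emotion category, and $S_{others}$ represents the seed emotion word sets of the other 6 categories.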

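5. The case microblog comment emotion classification method fusing emotion knowledge according to claim 1, characterized in that: in Step4, the word vector cosine similarity calculation formula is as follows:

$$\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}, \qquad \overline{\cos}(v_i) = \frac{1}{n} \sum_{j=1}^{n} \cos(v_i, v_j), \quad i = 1, 2, \dots, m$$

wherein $v_i$ represents the word vector of a candidate emotion word, $v_j$ represents the word vector of a seed emotion word, and $m$ and $n$ represent the total numbers of candidate and seed emotion words in the current category.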

6. The case microblog comment emotion classification method fusing emotion knowledge according to claim 2 or 3, characterized in that: in Step6, after the semantic representation and the attribute feature representation of the comments are constructed, both are input into the dual-channel convolutional neural network, which extracts deep semantic features and emotion knowledge features; the two kinds of features are then concatenated directly to obtain the combined semantic features; finally, the combined semantic features are passed through a fully connected layer for linear transformation and dimension reduction, and a Softmax layer outputs the class probability distribution for each input text.