CN111324734A - Case microblog comment emotion classification method integrating emotion knowledge - Google Patents

Case microblog comment emotion classification method integrating emotion knowledge

Info

Publication number
CN111324734A
Authority
CN
China
Prior art keywords
emotion
words
word
case microblog
comment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010096024.7A
Other languages
Chinese (zh)
Other versions
CN111324734B (en)
Inventor
余正涛
郭贤伟
相艳
郭军军
黄于欣
朱恩昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Application filed by Kunming University of Science and Technology
Priority to CN202010096024.7A
Publication of CN111324734A
Application granted
Publication of CN111324734B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a case microblog comment emotion classification method integrating emotion knowledge, and belongs to the technical field of natural language processing. The method first constructs an emotion knowledge base comprising a case microblog emotion dictionary, emoticons, internet slang, a negation dictionary and a degree adverb dictionary, and then constructs attribute feature representations. Finally, the semantic representation and the attribute feature representation of the comments are fused by a convolutional neural network to build an emotion classification model, which extracts deep semantic features and serialized emotion knowledge features to perform emotion classification. Experimental results on a case microblog comment corpus show that, compared with the baseline method INIT-CNN, the Macro_F1 and Micro_F1 scores are improved by 1.88% and 1.94% respectively.

Description

Case microblog comment emotion classification method integrating emotion knowledge
Technical Field
The invention relates to a case microblog comment emotion classification method fusing emotion knowledge, and belongs to the technical field of natural language processing.
Background
Case-related public opinion refers to internet public opinion about legal cases. Microblogs, a prominent kind of social media platform, are an important source of it. High-profile cases tend to ferment and evolve rapidly on microblog platforms, generating massive numbers of comments. Such public opinion can easily cause social disorder and influence judicial decisions, and the state is also actively promoting the construction of smart courts. Emotion classification of case microblog comments is therefore necessary and valuable: it helps relevant departments grasp case-related public opinion in time and make decisions quickly.
Emotion classification is a subtask of text emotion analysis. Emotion classification of case microblog comments can be regarded as a fine-grained, domain-specific emotion classification task, and it is important for preventing public opinion risks. Emotion classification methods generally fall into three groups: methods based on emotion dictionaries, methods based on traditional machine learning, and methods based on deep learning. Dictionary-based methods classify a text by combining, according to certain rules, the scores of the emotion words it contains. They depend heavily on the emotion dictionary, and at present no emotion dictionary is complete enough to classify microblog texts well. Methods based on traditional machine learning are typically supervised and require a large amount of labeled data and complex manual feature engineering. Most current emotion classification research is based on deep learning. Although deep learning avoids, to some extent, the shortcomings of the other two groups, most deep learning methods only encode the text as a whole; they do not make effective use of existing emotion computing resources and do not capture well the contribution of emoticons, internet slang and the like to emotion classification. To address the difficulty existing methods have in exploiting emotion knowledge in case microblog comments, such as common emoticons, negation rules and domain-specific new emotion words, a case microblog comment emotion classification method integrating emotion knowledge is proposed.
Disclosure of Invention
The invention provides a case microblog comment emotion classification method integrating emotion knowledge. It addresses two problems: the low classification performance of traditional methods on the case microblog comment emotion classification task, and the difficulty traditional emotion analysis methods have in effectively using emotion knowledge in comments, such as common emoticons, negation rules and domain-specific new emotion words.
The technical scheme of the invention is as follows: the case microblog comment emotion classification method integrating emotion knowledge comprises the following specific steps of:
Step1, constructing the case microblog comment corpus vocabulary: case microblog comment texts are collected as the experimental data set and preprocessed by deleting meaningless characters, segmenting the text into words and tagging parts of speech, to obtain the case microblog comment corpus vocabulary;
Step2, constructing a basic emotion dictionary: based on the emotion vocabulary ontology of Dalian University of Technology, its 7 emotion categories (happiness, goodness, anger, sadness, fear, aversion and surprise) are retained to construct a basic emotion dictionary; existing emotion computing resources are sorted, and commonly used microblog emoticons and internet slang are collected and classified, to obtain a negation dictionary, a degree adverb dictionary, an emoticon set and an internet slang set;
Step3, constructing a seed emotion word set: all words of the basic emotion dictionary that appear in the case microblog comment corpus vocabulary are taken as seed emotion words to form the seed emotion word set;
Step4, constructing a case microblog emotion dictionary: first, candidate emotion words for the 7 emotion categories are mined from the case microblog comment corpus vocabulary using the SO-PMI (semantic orientation pointwise mutual information) algorithm; then, the word vector cosine similarity between each candidate emotion word of a category and the seed emotion words of the same category is calculated, and candidates whose average cosine similarity is greater than 0.5 are kept as new case microblog emotion words of that category, forming an expanded emotion dictionary; next, the expanded emotion words are screened manually and added to the seed emotion word set, and the procedure is iterated incrementally to mine further domain-specific new emotion words; finally, iteration stops when the algorithm can mine no more new emotion words, and the expanded emotion dictionary is merged with the basic emotion dictionary to obtain the case microblog emotion dictionary;
As a preferred scheme of the invention, in Step4, the SO-PMI algorithm first selects from the case microblog comment corpus vocabulary all words whose part of speech is adjective, verb, noun, adverb, or the emoticon part of speech "emoji", where "emoji" is a manually defined tag for emoticon words; it then calculates the SO-PMI value between each selected word and the seed emotion words of each emotion category in the seed emotion word set, and retains the words whose SO-PMI value is greater than zero as candidate emotion words of the corresponding category. An SO-PMI value greater than zero indicates that the word is associated with the current emotion category, and the larger the value, the stronger the association. The SO-PMI algorithm is calculated as follows:
$$\mathrm{PMI}(\mathrm{word1}, \mathrm{word2}) = \log \frac{p(\mathrm{word1}\ \&\ \mathrm{word2})}{p(\mathrm{word1})\, p(\mathrm{word2})}$$
$$\text{SO-PMI}(\mathrm{word1}) = \sum_{\mathrm{word2} \in S_{some\text{-}kind}} \mathrm{PMI}(\mathrm{word1}, \mathrm{word2}) - \sum_{\mathrm{word2} \in S_{others}} \mathrm{PMI}(\mathrm{word1}, \mathrm{word2})$$
where word1 denotes a word appearing in the case microblog comment corpus and word2 denotes a seed emotion word; p(word1 & word2) is the probability that word1 and word2 co-occur in the corpus, p(word1) is the probability that word1 occurs in the corpus, and p(word2) is the probability that word2 occurs in the corpus; $S_{some\text{-}kind}$ denotes the seed emotion word set of a given emotion category, and $S_{others}$ denotes the seed emotion word sets of the other 6 categories.
As a preferred scheme of the invention, in Step4, the word vector cosine similarity is calculated as follows:
$$\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert\, \lVert v_j \rVert}$$
$$\overline{\mathrm{sim}}(v_i) = \frac{1}{n} \sum_{j=1}^{n} \cos(v_i, v_j), \quad i = 1, \dots, m$$
where $v_i$ is the word vector of a candidate emotion word and m is the total number of candidate emotion words in the current category; $v_j$ is the word vector of a seed emotion word and n is the total number of seed emotion words in the current category.
Step5, integrating all of the above resources into a case microblog emotion knowledge base comprising the case microblog emotion dictionary, the negation dictionary, the degree adverb dictionary, the emoticon set and the internet slang set;
Step6, using the case microblog emotion knowledge base to define part-of-speech and emotion label attribute features for words and to construct the attribute feature representation of case microblog comments; fusing the semantic representation and the attribute feature representation of the comments through a dual-channel convolutional neural network and training an emotion classifier on the comments, so that the resulting dual-channel convolutional neural network model incorporates case microblog emotion knowledge and performs emotion classification of case microblog comments.
As a preferred scheme of the invention, in Step6, the semantic representation of a case microblog comment is obtained by segmenting the comment sentence into words, loading a pre-trained word vector table $W_{N \times d}$, and looking up a word vector for each word;
where N is the vocabulary size and d is the word vector dimension. Given a comment text sequence $T = \{w_1, w_2, \dots, w_n\}$ containing n words, the word vector $v_i$ of each word $w_i$ in T is looked up in $W_{N \times d}$, and the semantic representation matrix $M_T$ of the sequence T is:
$$M_T = v_1 \oplus v_2 \oplus \cdots \oplus v_n$$
where $\oplus$ denotes concatenation along the row-vector direction;
the attribute feature representation of a case microblog comment is obtained by constructing an attribute feature representation matrix with a sparse binary vector representation. First, K part-of-speech and emotion label attributes are defined for each word. Then, for a given comment text sequence $T = \{w_1, w_2, \dots, w_n\}$ containing n words, each word $w_i$ is mapped, through part-of-speech tagging and lookup in the case microblog emotion knowledge base, to a K-dimensional Boolean vector $v_{bool\_i}$ whose entries are 0 or 1, where 0 indicates that the attribute is absent and 1 that it is present. Finally, an n × K attribute feature representation matrix $M_E$ is obtained:
$$M_E = v_{bool\_1} \oplus v_{bool\_2} \oplus \cdots \oplus v_{bool\_n}$$
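The two representations described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the attribute names, the `knowledge_base` layout (word mapped to a set of labels) and the zero vector used for out-of-vocabulary words are all assumptions made for illustration.

```python
def semantic_matrix(tokens, embeddings, dim):
    """Build the n x d matrix M_T by looking up one word vector per token.

    Assumption: words missing from the embedding table get a zero vector.
    """
    zero = [0.0] * dim
    return [list(embeddings.get(token, zero)) for token in tokens]


def attribute_matrix(tagged_tokens, knowledge_base, attributes):
    """Build the n x K binary matrix M_E.

    tagged_tokens:  list of (word, part_of_speech) pairs
    knowledge_base: hypothetical dict mapping a word to its emotion/knowledge labels
    attributes:     ordered list of the K attribute names (hypothetical)
    """
    index = {name: i for i, name in enumerate(attributes)}
    matrix = []
    for word, pos in tagged_tokens:
        row = [0] * len(attributes)
        if pos in index:                      # part-of-speech attribute present
            row[index[pos]] = 1
        for label in knowledge_base.get(word, ()):
            if label in index:                # emotion-knowledge attribute present
                row[index[label]] = 1
        matrix.append(row)
    return matrix
```

Each row of the second matrix is the Boolean vector of one word, so a negation word that is also tagged as an adverb would simply have two entries set to 1, provided both attributes are among the K defined ones.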
As a preferred scheme of the invention, in Step6, the dual-channel convolutional neural network model is built on INIT-CNN, which serves as the baseline method; INIT-CNN is a text classification model based on a convolutional neural network that uses an initialized convolution filter technique.
Further, convolutional neural networks are widely used for short text classification because their computation can be parallelized and they extract classification features automatically, and many researchers have shown them to be very effective. Existing work builds a text classification model on a convolutional neural network with an initialized convolution filter technique and achieves excellent results. The invention calls the convolutional neural network model using this technique INIT-CNN, takes it as the baseline method, constructs a dual-channel convolutional neural network model on top of it, fuses in case microblog emotion knowledge, and thereby realizes emotion classification of case microblog comments.
As a preferred scheme of the invention, in Step6, once the semantic representation and the attribute feature representation of a comment have been constructed, both are fed into the dual-channel convolutional neural network, which extracts deep semantic features and emotion knowledge features; the two feature vectors are then concatenated directly to obtain a combined semantic feature; finally, the combined feature is passed through a fully connected layer for linear transformation and dimension reduction, and a Softmax layer outputs the class probability distribution for each input text.
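The fusion step can be sketched as a dependency-free forward pass. This is only an illustrative simplification under stated assumptions: each channel uses a plain convolution over the token axis with ReLU and max-over-time pooling, the filter and weight layouts are hypothetical, and neither training nor the INIT-CNN filter initialization is shown.

```python
import math


def conv_max_pool(matrix, filters, bias=0.0):
    """Slide each filter (h rows x width columns) over the token axis,
    apply ReLU, and keep the maximum activation per filter."""
    features = []
    for filt in filters:
        h = len(filt)
        best = 0.0  # ReLU output is never below zero
        for start in range(len(matrix) - h + 1):
            window = matrix[start:start + h]
            activation = bias + sum(
                w * x
                for f_row, m_row in zip(filt, window)
                for w, x in zip(f_row, m_row)
            )
            best = max(best, activation)
        features.append(best)
    return features


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sem_matrix, attr_matrix, sem_filters, attr_filters, fc_weights, fc_bias):
    """Two channels -> concatenated features -> fully connected layer -> softmax."""
    fused = conv_max_pool(sem_matrix, sem_filters) + conv_max_pool(attr_matrix, attr_filters)
    logits = [b + sum(w * x for w, x in zip(row, fused))
              for row, b in zip(fc_weights, fc_bias)]
    return softmax(logits)
```

In the method described here the output would have 7 entries, one per emotion category; the number of filters per channel and their widths are design choices the text above does not fix.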
The invention has the beneficial effects that:
The method first constructs an emotion knowledge base comprising a case microblog emotion dictionary, emoticons, internet slang, a negation dictionary and a degree adverb dictionary. Second, it constructs an attribute feature representation. Finally, it fuses the semantic representation and the attribute feature representation of the comments with a convolutional neural network to build an emotion classification model that extracts deep semantic features and serialized emotion knowledge features and performs emotion classification. The method was validated on a case microblog comment data set: compared with the baseline method INIT-CNN, it improves Macro_F1 and Micro_F1 by 1.88% and 1.94% respectively, which demonstrates its effectiveness.
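For readers unfamiliar with the two reported metrics, the following sketch shows the standard definitions of macro- and micro-averaged F1 for single-label multi-class classification. It is not taken from the patent, and the function names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per class
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth][0] += 1
        else:
            counts[pred][1] += 1   # false positive for the predicted class
            counts[truth][2] += 1  # false negative for the true class
    # Macro: average of per-class F1, every class weighted equally.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: F1 computed from counts pooled over all classes.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)
```

Macro_F1 is sensitive to performance on rare emotion categories, while Micro_F1 is dominated by the frequent ones, which is presumably why both are reported.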
Drawings
FIG. 1 is a flow chart of construction of case microblog emotion dictionaries in a case microblog emotion knowledge base according to the present invention;
FIG. 2 is a model diagram of the case microblog comment emotion classification method with emotion knowledge fused.
Detailed Description
Example 1: as shown in fig. 1-2, the case microblog comment emotion classification method with emotion knowledge integrated includes the following specific steps:
Step1, constructing the case microblog comment corpus vocabulary: case microblog comment texts are collected as the experimental data set and preprocessed by deleting meaningless characters, segmenting the text into words and tagging parts of speech, to obtain the case microblog comment corpus vocabulary;
Step2, constructing a basic emotion dictionary: based on the emotion vocabulary ontology of Dalian University of Technology, its 7 emotion categories (happiness, goodness, anger, sadness, fear, aversion and surprise) are retained to construct a basic emotion dictionary; existing emotion computing resources are sorted, and commonly used microblog emoticons and internet slang are collected and classified, to obtain a negation dictionary, a degree adverb dictionary, an emoticon set and an internet slang set;
Step3, constructing a seed emotion word set: all words of the basic emotion dictionary that appear in the case microblog comment corpus vocabulary are taken as seed emotion words to form the seed emotion word set;
Step4, constructing a case microblog emotion dictionary: first, candidate emotion words for the 7 emotion categories are mined from the case microblog comment corpus vocabulary using the SO-PMI (semantic orientation pointwise mutual information) algorithm; then, the word vector cosine similarity between each candidate emotion word of a category and the seed emotion words of the same category is calculated, and candidates whose average cosine similarity is greater than 0.5 are kept as new case microblog emotion words of that category, forming an expanded emotion dictionary; next, the expanded emotion words are screened manually and added to the seed emotion word set, and the procedure is iterated incrementally to mine further domain-specific new emotion words; finally, iteration stops when the algorithm can mine no more new emotion words, and the expanded emotion dictionary is merged with the basic emotion dictionary to obtain the case microblog emotion dictionary;
As a preferred scheme of the invention, in Step4, the SO-PMI algorithm first selects from the case microblog comment corpus vocabulary all words whose part of speech is adjective, verb, noun, adverb, or the emoticon part of speech "emoji", where "emoji" is a manually defined tag for emoticon words; it then calculates the SO-PMI value between each selected word and the seed emotion words of each emotion category in the seed emotion word set, and retains the words whose SO-PMI value is greater than zero as candidate emotion words of the corresponding category. An SO-PMI value greater than zero indicates that the word is associated with the current emotion category, and the larger the value, the stronger the association. The SO-PMI algorithm is calculated as follows:
$$\mathrm{PMI}(\mathrm{word1}, \mathrm{word2}) = \log \frac{p(\mathrm{word1}\ \&\ \mathrm{word2})}{p(\mathrm{word1})\, p(\mathrm{word2})}$$
$$\text{SO-PMI}(\mathrm{word1}) = \sum_{\mathrm{word2} \in S_{some\text{-}kind}} \mathrm{PMI}(\mathrm{word1}, \mathrm{word2}) - \sum_{\mathrm{word2} \in S_{others}} \mathrm{PMI}(\mathrm{word1}, \mathrm{word2})$$
where word1 denotes a word appearing in the case microblog comment corpus and word2 denotes a seed emotion word; p(word1 & word2) is the probability that word1 and word2 co-occur in the corpus, p(word1) is the probability that word1 occurs in the corpus, and p(word2) is the probability that word2 occurs in the corpus; $S_{some\text{-}kind}$ denotes the seed emotion word set of a given emotion category, and $S_{others}$ denotes the seed emotion word sets of the other 6 categories.
As a preferred scheme of the invention, in Step4, the word vector cosine similarity is calculated as follows:
$$\cos(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert\, \lVert v_j \rVert}$$
$$\overline{\mathrm{sim}}(v_i) = \frac{1}{n} \sum_{j=1}^{n} \cos(v_i, v_j), \quad i = 1, \dots, m$$
where $v_i$ is the word vector of a candidate emotion word and m is the total number of candidate emotion words in the current category; $v_j$ is the word vector of a seed emotion word and n is the total number of seed emotion words in the current category.
Step5, integrating all of the above resources into a case microblog emotion knowledge base comprising the case microblog emotion dictionary, the negation dictionary, the degree adverb dictionary, the emoticon set and the internet slang set;
Step6, using the case microblog emotion knowledge base to define part-of-speech and emotion label attribute features for words and to construct the attribute feature representation of case microblog comments; fusing the semantic representation and the attribute feature representation of the comments through a dual-channel convolutional neural network and training an emotion classifier on the comments, so that the resulting dual-channel convolutional neural network model incorporates case microblog emotion knowledge and performs emotion classification of case microblog comments.
As a preferred scheme of the invention, in Step6, the semantic representation of a case microblog comment is obtained by segmenting the comment sentence into words, loading a pre-trained word vector table $W_{N \times d}$, and looking up a word vector for each word;
where N is the vocabulary size and d is the word vector dimension. Given a comment text sequence $T = \{w_1, w_2, \dots, w_n\}$ containing n words, the word vector $v_i$ of each word $w_i$ in T is looked up in $W_{N \times d}$, and the semantic representation matrix $M_T$ of the sequence T is:
$$M_T = v_1 \oplus v_2 \oplus \cdots \oplus v_n$$
where $\oplus$ denotes concatenation along the row-vector direction;
the attribute feature representation of a case microblog comment is obtained by constructing an attribute feature representation matrix with a sparse binary vector representation. First, K part-of-speech and emotion label attributes are defined for each word. Then, for a given comment text sequence $T = \{w_1, w_2, \dots, w_n\}$ containing n words, each word $w_i$ is mapped, through part-of-speech tagging and lookup in the case microblog emotion knowledge base, to a K-dimensional Boolean vector $v_{bool\_i}$ whose entries are 0 or 1, where 0 indicates that the attribute is absent and 1 that it is present. Finally, an n × K attribute feature representation matrix $M_E$ is obtained:
$$M_E = v_{bool\_1} \oplus v_{bool\_2} \oplus \cdots \oplus v_{bool\_n}$$
As a preferred scheme of the invention, in Step6, the dual-channel convolutional neural network model is built on INIT-CNN, which serves as the baseline method; INIT-CNN is a text classification model based on a convolutional neural network that uses an initialized convolution filter technique.
As a preferred scheme of the invention, in Step6, once the semantic representation and the attribute feature representation of a comment have been constructed, both are fed into the dual-channel convolutional neural network, which extracts deep semantic features and emotion knowledge features; the two feature vectors are then concatenated directly to obtain a combined semantic feature; finally, the combined feature is passed through a fully connected layer for linear transformation and dimension reduction, and a Softmax layer outputs the class probability distribution for each input text.
Example 2: as shown in fig. 1-2, the case microblog comment emotion classification method with emotion knowledge integrated includes the following specific steps:
Step1, comment data for 7 case-related microblog topics that attracted wide attention in recent years are collected from the Sina Weibo platform: 10812 sentences for the Laiyuan killing case; 13624 sentences for the Jiangsong case; 17491 sentences for the Zhao Yu act-of-bravery case; 17875 sentences for the case rendered as "all the cases of being damaged are clear"; 33774 sentences for the Chongqing bus river-plunge case; 37162 sentences for the Henan Maserati hit-and-run case; and 58626 sentences for the Xi'an Mercedes-Benz female owner rights-protection case, 189364 sentences in total. All 189364 comments are used as experimental data for constructing the case microblog emotion dictionary; in addition, 30000 comments are randomly sampled from them for manual emotion annotation, of which 11593 are finally retained as experimental data for emotion classification. Data preprocessing operations such as deleting meaningless characters, word segmentation and part-of-speech tagging are also applied to the experimental data set;
as a preferred scheme of the invention, in Step1, a crawler is written in Python, a pool of user accounts and cookies is built, and the case microblog comment data set is collected through the Sina Weibo API; data preprocessing is likewise implemented in Python: duplicates and meaningless characters ("/", "@", URLs and the like) are first removed from the text, and then the jieba word segmentation tool, with its user dictionary function, is used to segment and part-of-speech tag the comment corpus, yielding the case microblog comment corpus vocabulary.
This preferred design is an important component of the method: it constructs the case microblog comment experimental data set and provides data support for case microblog comment emotion classification integrating emotion knowledge.
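The cleaning part of the Step1 preprocessing can be sketched as follows. This is a minimal illustration, not the inventors' program: the regular expression, the function names, and the choice to deduplicate after cleaning are assumptions, and the jieba segmentation and part-of-speech tagging step is left out to keep the sketch dependency-free.

```python
import re

# Assumption: URLs are http(s) links; the source only names "/", "@" and URLs as examples.
URL_RE = re.compile(r"https?://\S+")


def clean_comment(text):
    """Remove URLs first (they contain "/"), then the stray "/" and "@" characters."""
    text = URL_RE.sub("", text)
    text = text.replace("/", "").replace("@", "")
    return re.sub(r"\s+", " ", text).strip()


def preprocess(comments):
    """Clean every comment and drop empty results and exact duplicates, keeping order."""
    seen = set()
    result = []
    for comment in comments:
        cleaned = clean_comment(comment)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result
```

Deduplicating after cleaning means that two comments differing only in a trailing link collapse into one, which may or may not match the order used in the original program.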
Step2, existing emotion computing resources are sorted, and commonly used microblog emoticons and internet slang are collected and classified, to construct a basic emotion dictionary, a negation dictionary, a degree adverb dictionary, an emoticon set and an internet slang set;
as a preferred scheme of the invention, in Step2, the basic emotion dictionary is built on DUTSD, the emotion vocabulary ontology of Dalian University of Technology, which covers 7 types of emotion: happiness, goodness, anger, sadness, fear, aversion and surprise; considering the important role of emoticons and internet slang in expressing emotion, the commonly used, emotionally colored entries of the emoticon set and the internet slang set are classified according to these 7 types of emotion and added to DUTSD to form the basic emotion dictionary;
this preferred design is an important component of the method: it builds the basic emotion dictionary, which is the basis for constructing the case microblog emotion dictionary.
Step3, obtaining the seed emotion word set;
in Step3, all words of the basic emotion dictionary that appear in the case microblog comment corpus vocabulary are taken as seed emotion words to form the seed emotion word set.
This preferred design is an important component of the method: it obtains the seed emotion word set, which is the basis for constructing the case microblog emotion dictionary.
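Step3 amounts to a per-category set intersection, as the following short sketch shows. The data layout (a dict from emotion category to a set of words) and the function name are assumptions made for illustration.

```python
def build_seed_sets(base_dictionary, corpus_vocabulary):
    """Keep, for each emotion category, only the dictionary words
    that actually occur in the comment corpus vocabulary."""
    return {
        category: words & corpus_vocabulary
        for category, words in base_dictionary.items()
    }
```

Restricting the seeds to words that occur in the corpus matters for the next step, because co-occurrence statistics can only be estimated for words the corpus actually contains.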
Step4, constructing a case microblog emotion dictionary: first, candidate emotion words for the 7 emotion categories are mined from the case microblog comment corpus vocabulary using the SO-PMI (semantic orientation pointwise mutual information) algorithm; then, the word vector cosine similarity between each candidate emotion word of a category and the seed emotion words of the same category is calculated, and candidates whose average cosine similarity is greater than 0.5 are kept as new case microblog emotion words of that category, forming an expanded emotion dictionary; next, the expanded emotion words are screened manually and added to the seed emotion word set, and the procedure is iterated incrementally to mine further domain-specific new emotion words; finally, iteration stops when the algorithm can mine no more new emotion words, and the expanded emotion dictionary is merged with the basic emotion dictionary to obtain the case microblog emotion dictionary;
as a preferred scheme of the invention, in Step4, the SO-PMI algorithm firstly screens out parts of speech in the thesaurus of the case microblog comment corpus as follows: adjectives, verbs, nouns, adverbs, and all words of the emoticon part-of-speech "emoji", where the emoticon part-of-speech "emoji" is a manually defined label for an emoticon word; then calculating SO-PMI values between all emotion words of all emotion categories in each word and the seed emotion word set, reserving the words with SO-PMI values larger than zero as candidate emotion words of corresponding categories, and indicating that the words are more relevant to the current emotion categories when the SO-PMI values of the words are larger than zero and the values of the words are larger; the calculation formula of the SO-PMI algorithm is as follows:
Figure BDA0002385296790000081
Figure BDA0002385296790000082
wherein word1 represents a word appearing in the case microblog comment corpus, and word2 represents a seed emotion word; p(word1 & word2) represents the probability that word1 and word2 co-occur in the case microblog comment corpus, p(word1) represents the probability that word1 occurs in the corpus, and p(word2) represents the probability that word2 occurs in the corpus; S_some-kind represents the set of seed emotion words of a given emotion category, and S_others represents the sets of seed emotion words of the other 6 categories.
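By way of example only, the candidate mining may be sketched as follows. The sketch assumes the commonly used form of SO-PMI, in which PMI(word1, word2) = log2(p(word1 & word2) / (p(word1) p(word2))) and the SO-PMI value of a word for a category is the sum of its PMI with the seeds in S_some-kind minus the sum of its PMI with the seeds in S_others. It further assumes that probabilities are estimated as the fraction of comments containing the word or word pair and that a pair which never co-occurs contributes 0; these estimation choices and all function names are assumptions, not taken from the disclosure.

```python
import math
from typing import Dict, Iterable, List, Set


def pmi(word1: str, word2: str, docs: List[Set[str]]) -> float:
    """PMI estimated from comment-level co-occurrence counts."""
    n = len(docs)
    c1 = sum(1 for d in docs if word1 in d)
    c2 = sum(1 for d in docs if word2 in d)
    c12 = sum(1 for d in docs if word1 in d and word2 in d)
    if c12 == 0:
        return 0.0  # assumed convention for pairs that never co-occur
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))


def so_pmi(word: str, category: str,
           seeds: Dict[str, Set[str]], docs: List[Set[str]]) -> float:
    """Sum of PMI with the category's seeds minus PMI with all other seeds."""
    own = sum(pmi(word, s, docs) for s in seeds[category])
    others = sum(
        pmi(word, s, docs)
        for cat, words in seeds.items() if cat != category
        for s in words
    )
    return own - others


def mine_candidates(vocab: Iterable[str], seeds: Dict[str, Set[str]],
                    docs: List[Set[str]]) -> Dict[str, Set[str]]:
    """Keep, per category, the non-seed words whose SO-PMI value is > 0."""
    all_seeds = set().union(*seeds.values())
    return {
        cat: {w for w in vocab
              if w not in all_seeds and so_pmi(w, cat, seeds, docs) > 0}
        for cat in seeds
    }
```

Here `vocab` is expected to contain only the words that passed the part-of-speech screening, and each element of `docs` is the set of words of one segmented comment.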
As a preferred scheme of the invention, in Step4, because the candidate emotion words automatically mined by the SO-PMI algorithm may contain noise, noisy candidates are filtered out by calculating the word-vector cosine similarity between the candidate emotion words of each category and the seed emotion words of the corresponding category, and retaining only those candidates whose average cosine similarity is greater than 0.5 as new case microblog emotion words of that category. The word-vector cosine similarity is calculated as follows:
Figure BDA0002385296790000091
Figure BDA0002385296790000092
wherein v_i represents the word vector of a candidate emotion word, and m represents the total number of candidate emotion words in the current category; v_j represents the word vector of a seed emotion word, and n represents the total number of seed emotion words in the current category.
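The noise filter can be illustrated by the following minimal sketch, which assumes that the average is taken, for each candidate vector v_i, over its cosine similarity with all n seed vectors v_j of the same category. The function names, the dictionary-based word vector store and the handling of zero vectors are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_cosine(candidate: Sequence[float],
                seed_vecs: List[Sequence[float]]) -> float:
    """Average cosine similarity between one candidate and all seed vectors."""
    return sum(cosine(candidate, s) for s in seed_vecs) / len(seed_vecs)


def filter_candidates(candidates: Dict[str, Set[str]],
                      seeds: Dict[str, Set[str]],
                      vectors: Dict[str, Sequence[float]],
                      threshold: float = 0.5) -> Dict[str, Set[str]]:
    """Keep candidates whose mean similarity to same-category seeds > threshold."""
    kept = {}
    for cat, words in candidates.items():
        seed_vecs = [vectors[s] for s in seeds.get(cat, ()) if s in vectors]
        kept[cat] = {
            w for w in words
            if w in vectors and seed_vecs
            and mean_cosine(vectors[w], seed_vecs) > threshold
        }
    return kept
```

Words without a word vector are simply dropped in this sketch; a different fallback could equally be chosen.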
Step5, integrating all of the above resources into a case microblog emotion knowledge base comprising the case microblog emotion dictionary, a negation dictionary, a degree adverb dictionary, an emoticon set and an internet slang set, as shown in Table 1;
Table 1. Case microblog emotion knowledge base (unit: entries)
Figure BDA0002385296790000093
This preferred design is an important component of the invention: it constructs the case microblog emotion knowledge base, which supplies the emotion knowledge needed to build the emotion-knowledge-fused case microblog comment emotion classification model.
Step6, using the case microblog emotion knowledge base to define part-of-speech and emotion label attribute features for words and to construct the attribute feature representation of case microblog comments; fusing the semantic representation and the attribute feature representation of the comments through a dual-channel convolutional neural network and training a comment emotion classifier, so that the resulting dual-channel convolutional neural network model incorporates case microblog emotion knowledge and performs emotion classification of case microblog comments.
The emotion-knowledge-fused case microblog comment emotion classification model is built on a convolutional neural network and the case microblog emotion knowledge base, and its structure is a dual-channel convolutional neural network. Convolutional neural networks are widely used for short-text classification because their computation can be parallelized and they extract classification features automatically, and many researchers have shown them to be effective. Prior work based on convolutional neural networks has used an initialized convolution filter technique to build text classification models with excellent results. The convolutional neural network model using this technique is called INIT-CNN; the invention takes it as the baseline method and builds on it a dual-channel convolutional neural network model that incorporates case microblog emotion knowledge to classify the emotion of case microblog comments.
In Step6, before the model can be trained and used for classification, the text must be represented in a computer-readable form. The first step is therefore to construct the semantic representation and the attribute feature representation of the case microblog comments that are input to the model. The semantic representation of a case microblog comment is obtained by segmenting the comment sentence into words, loading a pre-trained word vector table W_{N×d}, and looking up each word to assign it a word vector;
wherein N represents the number of words in the vocabulary, and d represents the dimension of the word vectors; suppose a comment text sequence T = {w1, w2, …, wn} contains n words; for each word w_i in T, a word vector v_i can be looked up in the word vector table W_{N×d}, and the semantic representation matrix M_T of the sequence T is then:
Figure BDA0002385296790000101
wherein the operator
Figure BDA0002385296790000102
represents concatenation along the row-vector direction;
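For illustration only, the lookup that produces the n × d semantic representation matrix M_T may be sketched as below. The names `semantic_matrix`, `vocab_index` and `unk_index`, and the choice of mapping out-of-vocabulary words to a reserved row of the table, are assumptions that the disclosure does not specify.

```python
from typing import Dict, List, Sequence


def semantic_matrix(tokens: Sequence[str],
                    vocab_index: Dict[str, int],
                    W: Sequence[Sequence[float]],
                    unk_index: int = 0) -> List[List[float]]:
    """Stack the word vectors of the tokens row by row (one row per word).

    W is the N x d pre-trained word vector table; vocab_index maps a word
    to its row in W. Unknown words fall back to row unk_index (assumed).
    """
    return [list(W[vocab_index.get(tok, unk_index)]) for tok in tokens]


# Toy table with N = 3 rows and d = 2; row 0 is reserved for unknown words.
W_toy = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
index_toy = {"good": 1, "bad": 2}
M_T = semantic_matrix(["bad", "good", "unseen"], index_toy, W_toy)
```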
the attribute feature representation of a case microblog comment is obtained by constructing an attribute feature representation matrix using a sparse binary vector representation; first, 15 part-of-speech and emotion label attributes are defined for each word, as shown in Table 2; then, for a given comment text sequence T = {w1, w2, …, wn} containing n words, each word w_i is mapped, through part-of-speech tagging and lookup in the case microblog emotion knowledge base, to a 15-dimensional Boolean vector v_bool_i whose components take the value 0 or 1, where 0 indicates that the attribute is absent and 1 indicates that it is present; finally, an n × 15 attribute feature representation matrix M_E is obtained:
Figure BDA0002385296790000103
Table 2. Part-of-speech and emotion label attributes of words
Figure BDA0002385296790000104
Figure BDA0002385296790000111
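The construction of the Boolean attribute matrix M_E can be illustrated with the sketch below. Because the 15 attributes of Table 2 are not reproduced in this text, the attribute names used here (prefixed "pos:" for part-of-speech attributes, otherwise knowledge-base labels) are hypothetical placeholders, as are the function name and the data layout of the knowledge base.

```python
from typing import Dict, List, Sequence, Set, Tuple


def attribute_matrix(tagged_tokens: Sequence[Tuple[str, str]],
                     knowledge_base: Dict[str, Set[str]],
                     attributes: Sequence[str]) -> List[List[int]]:
    """Map each (word, pos_tag) pair to a 0/1 vector, one entry per attribute.

    An attribute named "pos:<tag>" is set when the word carries that tag;
    any other attribute is set when the word is listed under that name in
    the knowledge base.
    """
    rows = []
    for word, pos in tagged_tokens:
        row = []
        for attr in attributes:
            if attr.startswith("pos:"):
                row.append(1 if pos == attr[4:] else 0)
            else:
                row.append(1 if word in knowledge_base.get(attr, set()) else 0)
        rows.append(row)
    return rows


# Hypothetical 4-attribute example (the disclosure uses 15 attributes).
attrs = ["pos:adj", "pos:emoji", "emo:disgust", "negation"]
kb = {"emo:disgust": {"[spit]", "gross"}, "negation": {"not"}}
M_E = attribute_matrix(
    [("not", "adv"), ("gross", "adj"), ("[spit]", "emoji")], kb, attrs
)
```

Each row of `M_E` corresponds to one word of the comment, so the matrix has the same number of rows as the semantic representation of that comment.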
As a preferred scheme of the invention, in Step6, after the semantic representation and the attribute feature representation of a comment are constructed, they are input together into the dual-channel convolutional neural network, which extracts deep semantic features and emotion knowledge features; the two kinds of features are then directly concatenated to obtain comprehensive semantic features; finally, the comprehensive semantic features are fed into a fully connected layer for linear transformation and dimensionality reduction, and a Softmax layer outputs the probability distribution over classes for each input text. The invention uses a cross-entropy loss function to measure the difference between the predicted and the true probability distributions of the emotion labels, and trains the model with the back-propagation algorithm.
This preferred design is an important component of the method: it constructs the emotion-knowledge-fused case microblog comment emotion classification model and provides the theoretical basis for building that model.
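As a non-limiting sketch, the forward pass of the dual-channel network may be written in plain Python as follows: one convolution channel over the semantic matrix, one over the attribute matrix, concatenation of the pooled features, a fully connected layer and Softmax. The ReLU activation, max-over-time pooling, single filter width and all names are assumptions; the filter initialization of INIT-CNN and the back-propagation training are not reproduced here.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def conv_max_pool(matrix: Matrix, filters: Matrix, width: int) -> List[float]:
    """Slide each filter over windows of `width` rows, apply ReLU, max-pool.

    Each filter is a flat weight list of length width * row_dimension.
    Starting `best` at 0.0 realizes ReLU followed by max pooling.
    """
    feats = []
    for f in filters:
        best = 0.0
        for start in range(len(matrix) - width + 1):
            window = [x for row in matrix[start:start + width] for x in row]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        feats.append(best)
    return feats


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(M_T: Matrix, M_E: Matrix, sem_filters: Matrix,
            attr_filters: Matrix, W_fc: Matrix, b_fc: Sequence[float],
            width: int = 2) -> List[float]:
    """Two channels -> concatenation -> fully connected layer -> Softmax."""
    fused = (conv_max_pool(M_T, sem_filters, width)
             + conv_max_pool(M_E, attr_filters, width))  # list concatenation
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(W_fc, b_fc)]
    return softmax(logits)


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Loss for one example whose true class index is `label`."""
    return -math.log(probs[label])
```

In the described method the output layer has 7 units, one per emotion category; the number of rows of `W_fc` determines the number of classes in this sketch.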
Step7, carrying out experiments with the constructed model on the case microblog comment data set to verify the effectiveness of the invention.
As a preferred scheme of the invention, in Step7, the experimental models are built with the TensorFlow deep learning framework. For the model of the invention (Our Model), pre-trained word vectors and the 15 part-of-speech and emotion label attributes defined from the case microblog emotion knowledge base are used to construct, respectively, the semantic feature representation and the attribute feature representation of each comment text; both are fed to the model as input, and a 7-class emotion classifier is trained. For the baseline model, only the pre-trained word vectors are used to construct the semantic feature representation of each comment text, which is fed directly to the model to train a 7-class emotion classifier. The experimental results and analysis are as follows:
Table 3. Comparison of the overall classification performance of Our Model and INIT-CNN
Figure BDA0002385296790000112
Table 4. Experimental results of the baseline method INIT-CNN for each emotion category
Figure BDA0002385296790000113
Figure BDA0002385296790000121
Table 5. Experimental results of Our Model for each emotion category
Figure BDA0002385296790000122
The method of the invention, Our Model, incorporates the constructed emotion knowledge base on top of the baseline method INIT-CNN. Table 3 shows that, compared with INIT-CNN, Our Model improves Macro_Precision by 1.82%, Macro_Recall by 1.91%, Macro_F1 by 1.88% and Micro_F1 by 1.94%, which indicates that Our Model has better overall classification performance than INIT-CNN. Tables 4 and 5 show that the Precision, Recall and F1 of Our Model exceed those of INIT-CNN for the emotion categories overall. In particular, for the disgust category, the Precision, Recall and F1 of Our Model are higher than those of INIT-CNN by 6.30%, 8.86% and 7.64%, respectively. These results show that, overall, Our Model is also better than INIT-CNN at recognizing the individual emotion categories, and they demonstrate that the proposed emotion-knowledge-fused case microblog comment emotion classification method is effective.
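For reference, macro-averaged and micro-averaged scores of the kind reported in Table 3 can be computed as in the sketch below. It assumes that the macro F1 is the unweighted mean of the per-category F1 values (another convention computes F1 from the macro precision and recall), and the function name and return format are illustrative; the evaluation code actually used in the experiments is not disclosed.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def macro_micro(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                labels: Sequence[Hashable]) -> Dict[str, float]:
    """Macro precision/recall/F1 and micro F1 for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[t] += 1  # true category gains a false negative

    def ratio(a: float, b: float) -> float:
        return a / b if b else 0.0

    precisions = [ratio(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [ratio(tp[c], tp[c] + fn[c]) for c in labels]
    f1s = [ratio(2 * p * r, p + r) for p, r in zip(precisions, recalls)]

    k = len(labels)
    sum_tp = sum(tp[c] for c in labels)
    micro_p = ratio(sum_tp, sum_tp + sum(fp[c] for c in labels))
    micro_r = ratio(sum_tp, sum_tp + sum(fn[c] for c in labels))
    return {
        "Macro_Precision": sum(precisions) / k,
        "Macro_Recall": sum(recalls) / k,
        "Macro_F1": sum(f1s) / k,
        "Micro_F1": ratio(2 * micro_p * micro_r, micro_p + micro_r),
    }
```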
Table 6. Example data analysis of Our Model and the baseline method INIT-CNN
Figure BDA0002385296790000123
The above experiments and example data analysis show that, when only the baseline method INIT-CNN is used, the neural network cannot effectively learn some emotion knowledge features that are common in case microblog comments. For example, in Table 6, emotion knowledge such as the emoticon "[spit]", the new case microblog emotion word "woman hero", the internet slang term "TMD", the negation rule "dislike" and the degree adverb rule "too toxic" is learned well by Our Model, which predicts the emotion category of every example correctly; the baseline method INIT-CNN does not learn this emotion knowledge well and therefore fails to predict the emotion categories of these examples correctly. The emotion-knowledge-fused case microblog comment emotion classification method is thus practical for classifying the emotion of case microblog comments.
This preferred design is an important component of the method: it verifies the feasibility and effectiveness of the emotion-knowledge-fused case microblog comment emotion classification model constructed by the method.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. A case microblog comment emotion classification method integrating emotion knowledge, characterized in that the method comprises the following specific steps:
Step1, constructing a case microblog comment corpus vocabulary: collecting case microblog comment texts as the experimental data set, and preprocessing the texts by deleting meaningless character data, segmenting words and tagging parts of speech to obtain the vocabulary of the case microblog comment corpus;
Step2, constructing a basic emotion dictionary: constructing the basic emotion dictionary on the basis of the Dalian University of Technology emotion vocabulary ontology, adopting its 7 emotion categories of happiness, goodness, anger, sadness, fear, aversion and surprise; obtaining a negation dictionary, a degree adverb dictionary, an emoticon set and an internet slang set by collating existing emotion computing resources and by collecting and classifying emoticons and internet slang commonly used on microblogs;
Step3, constructing a seed emotion word set: taking all words of the basic emotion dictionary that appear in the vocabulary of the case microblog comment corpus as seed emotion words to form the seed emotion word set;
Step4, constructing a case microblog emotion dictionary: first, mining candidate emotion words for the 7 emotion categories from the vocabulary of the case microblog comment corpus using the SO-PMI (semantic orientation pointwise mutual information) algorithm; then, calculating the word-vector cosine similarity between the candidate emotion words of each category and the seed emotion words of the corresponding category, and retaining candidate emotion words whose average cosine similarity is greater than 0.5 as new case microblog emotion words of the corresponding category to form an expanded emotion dictionary; next, screening the expanded emotion words manually, adding them to the seed emotion word set, and iterating incrementally to mine further new domain emotion words; finally, stopping the iteration when the algorithm can no longer mine new emotion words, and merging the expanded emotion dictionary with the basic emotion dictionary to obtain the case microblog emotion dictionary;
Step5, integrating all of the above resources into a case microblog emotion knowledge base comprising the case microblog emotion dictionary, the negation dictionary, the degree adverb dictionary, the emoticon set and the internet slang set;
Step6, using the case microblog emotion knowledge base to define part-of-speech and emotion label attribute features for words and to construct the attribute feature representation of case microblog comments; fusing the semantic representation and the attribute feature representation of the comments through a dual-channel convolutional neural network and training a comment emotion classifier, so that the resulting dual-channel convolutional neural network model incorporates case microblog emotion knowledge and performs emotion classification of case microblog comments.
2. The case microblog comment emotion classification method integrating emotion knowledge according to claim 1, characterized in that: in Step6, the semantic representation of a case microblog comment is obtained by segmenting the comment sentence into words, loading a pre-trained word vector table W_{N×d}, and looking up each word to assign it a word vector;
wherein N represents the number of words in the vocabulary, and d represents the dimension of the word vectors; suppose a comment text sequence T = {w1, w2, …, wn} contains n words; for each word w_i in T, a word vector v_i can be looked up in the word vector table W_{N×d}, and the semantic representation matrix M_T of the sequence T is then:
Figure FDA0002385296780000021
wherein the operator
Figure FDA0002385296780000022
represents concatenation along the row-vector direction;
the attribute feature representation of a case microblog comment is obtained by constructing an attribute feature representation matrix using a sparse binary vector representation; first, K part-of-speech and emotion label attributes are defined for each word; then, for a given comment text sequence T = {w1, w2, …, wn} containing n words, each word w_i is mapped, through part-of-speech tagging and lookup in the case microblog emotion knowledge base, to a K-dimensional Boolean vector v_bool_i whose components take the value 0 or 1, where 0 indicates that the feature is absent and 1 indicates that it is present; finally, an n × K attribute feature representation matrix M_E is obtained:
Figure FDA0002385296780000023
3. The case microblog comment emotion classification method integrating emotion knowledge according to claim 1, characterized in that: in Step6, the dual-channel convolutional neural network model is constructed with INIT-CNN as the baseline method; INIT-CNN is a text classification model built on a convolutional neural network using an initialized convolution filter technique.
4. The case microblog comment emotion classification method integrating emotion knowledge according to claim 1, characterized in that: in Step4, the SO-PMI algorithm first selects from the vocabulary of the case microblog comment corpus all words whose part of speech is adjective, verb, noun, adverb, or the emoticon part of speech "emoji", where "emoji" is a manually defined part-of-speech label for emoticon words; it then calculates, for each selected word, the SO-PMI value with respect to the seed emotion words of each emotion category in the seed emotion word set, and retains words whose SO-PMI value is greater than zero as candidate emotion words of the corresponding category; for a word whose SO-PMI value is greater than zero, a larger value indicates that the word is more strongly related to the current emotion category; the calculation formulas of the SO-PMI algorithm are as follows:
Figure FDA0002385296780000024
Figure FDA0002385296780000025
wherein word1 represents a word appearing in the case microblog comment corpus, and word2 represents a seed emotion word; p(word1 & word2) represents the probability that word1 and word2 co-occur in the case microblog comment corpus, p(word1) represents the probability that word1 occurs in the corpus, and p(word2) represents the probability that word2 occurs in the corpus; S_some-kind represents the set of seed emotion words of a given emotion category, and S_others represents the sets of seed emotion words of the other 6 categories.
5. The case microblog comment emotion classification method fusing emotion knowledge according to claim 1, characterized in that: in Step4, the word vector cosine similarity calculation formula is as follows:
Figure FDA0002385296780000031
Figure FDA0002385296780000032
wherein v_i represents the word vector of a candidate emotion word, and m represents the total number of candidate emotion words in the current category; v_j represents the word vector of a seed emotion word, and n represents the total number of seed emotion words in the current category.
6. The case microblog comment emotion classification method integrating emotion knowledge according to claim 2 or 3, characterized in that: in Step6, after the semantic representation and the attribute feature representation of a comment are constructed, they are input together into the dual-channel convolutional neural network, which extracts deep semantic features and emotion knowledge features; the two kinds of features are then directly concatenated to obtain comprehensive semantic features; finally, the comprehensive semantic features are fed into a fully connected layer for linear transformation and dimensionality reduction, and a Softmax layer outputs the probability distribution over classes for each input text.
CN202010096024.7A 2020-02-17 2020-02-17 Case microblog comment emotion classification method integrating emotion knowledge Active CN111324734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010096024.7A CN111324734B (en) 2020-02-17 2020-02-17 Case microblog comment emotion classification method integrating emotion knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010096024.7A CN111324734B (en) 2020-02-17 2020-02-17 Case microblog comment emotion classification method integrating emotion knowledge

Publications (2)

Publication Number Publication Date
CN111324734A true CN111324734A (en) 2020-06-23
CN111324734B CN111324734B (en) 2021-03-02

Family

ID=71172741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010096024.7A Active CN111324734B (en) 2020-02-17 2020-02-17 Case microblog comment emotion classification method integrating emotion knowledge

Country Status (1)

Country Link
CN (1) CN111324734B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765350A (en) * 2021-01-15 2021-05-07 西华大学 Microblog comment emotion classification method based on emoticons and text information
CN112800225A (en) * 2021-01-28 2021-05-14 南京邮电大学 Microblog comment emotion classification method and system
CN112800229A (en) * 2021-02-05 2021-05-14 昆明理工大学 Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field
CN113076490A (en) * 2021-04-25 2021-07-06 昆明理工大学 Case-related microblog object-level emotion classification method based on mixed node graph
CN113409821A (en) * 2021-05-27 2021-09-17 南京邮电大学 Method for recognizing unknown emotional state of voice signal

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022805A (en) * 2015-07-02 2015-11-04 四川大学 Emotional analysis method based on SO-PMI (Semantic Orientation-Pointwise Mutual Information) commodity evaluation information
CN107590134A (en) * 2017-10-26 2018-01-16 福建亿榕信息技术有限公司 Text sentiment classification method, storage medium and computer
CN108108433A (en) * 2017-12-19 2018-06-01 杭州电子科技大学 A kind of rule-based and the data network integration sentiment analysis method
CN109299268A (en) * 2018-10-24 2019-02-01 河南理工大学 A kind of text emotion analysis method based on dual channel model
CN109815485A (en) * 2018-12-24 2019-05-28 厦门市美亚柏科信息股份有限公司 A kind of method, apparatus and storage medium of the identification of microblogging short text feeling polarities
CN110046250A (en) * 2019-03-17 2019-07-23 华南师范大学 Three embedded convolutional neural networks model and its more classification methods of text
CN110362819A (en) * 2019-06-14 2019-10-22 中电万维信息技术有限责任公司 Text emotion analysis method based on convolutional neural networks
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
贾一凡: ""基于微博表情符号的中文情感词典构建方法研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765350A (en) * 2021-01-15 2021-05-07 西华大学 Microblog comment emotion classification method based on emoticons and text information
CN112800225A (en) * 2021-01-28 2021-05-14 南京邮电大学 Microblog comment emotion classification method and system
CN112800225B (en) * 2021-01-28 2022-09-16 南京邮电大学 Microblog comment emotion classification method and system
CN112800229A (en) * 2021-02-05 2021-05-14 昆明理工大学 Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field
CN113076490A (en) * 2021-04-25 2021-07-06 昆明理工大学 Case-related microblog object-level emotion classification method based on mixed node graph
CN113409821A (en) * 2021-05-27 2021-09-17 南京邮电大学 Method for recognizing unknown emotional state of voice signal

Also Published As

Publication number Publication date
CN111324734B (en) 2021-03-02

Similar Documents

Publication Publication Date Title
CN111324734B (en) Case microblog comment emotion classification method integrating emotion knowledge
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
Ishaq et al. Aspect-based sentiment analysis using a hybridized approach based on CNN and GA
CN106776581B (en) Subjective text emotion analysis method based on deep learning
CN110287323B (en) Target-oriented emotion classification method
Kaibi et al. A comparative evaluation of word embeddings techniques for twitter sentiment analysis
Kaur Incorporating sentimental analysis into development of a hybrid classification model: A comprehensive study
Ayishathahira et al. Combination of neural networks and conditional random fields for efficient resume parsing
Rashid et al. Feature level opinion mining of educational student feedback data using sequential pattern mining and association rule mining
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
Gosai et al. A review on a emotion detection and recognization from text using natural language processing
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
CN114936277A (en) Similarity problem matching method and user similarity problem matching system
Li et al. A method for resume information extraction using bert-bilstm-crf
Zhang et al. Exploring deep recurrent convolution neural networks for subjectivity classification
Ajallouda et al. Kp-use: an unsupervised approach for key-phrases extraction from documents
Srivastava et al. Comparative analysis of lexicon and machine learning approach for sentiment analysis
Dabade Sentiment analysis of Twitter data by using deep learning And machine learning
Narayanan et al. Character level neural architectures for boosting named entity recognition in code mixed tweets
Zhang et al. Grasp the implicit features: Hierarchical emotion classification based on topic model and SVM
Maisha et al. Supervised machine learning algorithms for sentiment analysis of Bangla newspaper
Putri et al. Bahasa Indonesia pre-trained word vector generation using word2vec for computer and information technology field
Wang et al. Sentiment analysis of science fiction movie reviews based on deep learning
Jagdale et al. Review on sentiment lexicons

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant