CN112507071A - Network platform short text mixed emotion classification method based on novel emotion dictionary


Info

Publication number: CN112507071A
Application number: CN202011408818.9A
Authority: CN (China)
Prior art keywords: emotion, word, dictionary, network platform, weight
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN112507071B (en)
Inventors: 徐小龙, 黄寄
Current assignee: Nanjing University of Posts and Telecommunications
Original assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202011408818.9A; application granted and published as CN112507071B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention provides a network platform short text mixed emotion classification method based on a novel emotion dictionary. The method performs word segmentation on the texts of emotion-labeled samples; counts the frequency of each word across all samples of a given emotion; calculates an emotion weight for each word from these word frequencies; and records the emotion weight of each word to obtain a novel emotion dictionary. The novel emotion dictionary is then used to compute, for each sample to be classified, an emotion vector containing the emotion weight of every emotion; this emotion vector is input into a deep learning model as the feature representation of its input layer, and a mixed emotion classification result is finally obtained. By condensing the emotion in a network platform short text into a short emotion vector, the method speeds up the deep learning training used for mixed emotion classification and reduces the memory occupied by the model during training.

Description

Network platform short text mixed emotion classification method based on novel emotion dictionary
Technical Field
The invention relates to a network platform short text mixed emotion classification method based on a novel emotion dictionary, and belongs to the field of artificial intelligence.
Background
Network platform short texts mainly include comments under news applications, forum comments, blog comments, chat room content, and the like. These platforms generate a large amount of text, and web texts in various forms have become a channel for humans to receive information and a means of emotional communication.
Emotion is an important component of human life: it affects behavior, thinking, decision making, and social interaction. Affective computing (emotion calculation) is the task of giving computers the ability to recognize, understand, and express human emotions. It divides into three parts: recognition, expression, and decision. Recognition concerns how to enable a computer to accurately identify human emotions and eliminate the uncertainty and ambiguity of natural language; the recognized carrier can generally be text, voice, facial expression, posture, and so on. Expression concerns how to convey abstract emotion through an information carrier that humans can intuitively understand. Decision refers to how to use emotion mechanisms to make better decisions. Affective computing can calculate the emotional tendency of an emotional subject, and once that tendency is obtained, the subject's behavior or state can be further predicted, for example to make personalized recommendations based on the tendency or to predict the subject's attention.
Emotion classification is the recognition part of affective computing. Its main task is to identify and classify the emotion expressed in words, sentences, or documents to obtain their emotion polarity. Current emotion classification methods mainly fall into three categories. The first is emotion-dictionary-based methods, which segment a text, find words of different parts of speech, and compute their corresponding scores; such methods depend heavily on an emotion dictionary, and manually constructing one, such as the improved WordNet dictionary, is very laborious. The second is machine learning methods based on manually extracted features, which require a large amount of pre-labeled data and human-designed features, and then classify emotion with methods such as support vector machines or naive Bayes, for example an improved KNN model. The third is deep-learning-based methods, which also require a large amount of pre-labeled data but need no hand-crafted features, since the deep learning model extracts features from the data automatically; common models include recurrent neural networks and convolutional neural networks. However, the above methods generally share the following problems:
1. Most research classifies only binary (positive/negative) or ternary emotion, but a document often contains more than one emotion, and several emotions can occur in the same document at the same time.
2. Machine learning methods require manual feature extraction, and the quality of the final model depends on the quality of the manually extracted features, so the process cannot be automated and good features are difficult to obtain.
3. Deep learning methods require a feature representation of the document, but because texts are long, the dimensionality of the representation is huge, leading to slow training and large memory occupation.
In view of the above, it is necessary to provide a short text mixed emotion classification method for a network platform based on a novel emotion dictionary to solve the above problems.
Disclosure of Invention
The invention aims to provide a network platform short text mixed emotion classification method based on a novel emotion dictionary.
In order to achieve the purpose, the invention provides a network platform short text mixed emotion classification method based on a novel emotion dictionary, which comprises the following steps:
step 1, carrying out artificial emotion marking on the collected short texts of the historical network platform, and using the short texts as a training set;
step 2, performing word segmentation processing on each sample in the training set, then calculating the word frequency of each word under each emotion, calculating the emotion weight of each word by using the word frequency of each word under each emotion, and storing each word and the emotion weight thereof in a dictionary in a key value pair mode to form an emotion dictionary;
step 3, accumulating the emotion weight under a certain emotion of all participles of each sample to obtain the sum of the emotion weight of the certain emotion of the sample, and combining the sum of the emotion weight of each emotion of the sample to form an emotion vector of a training set;
step 4, taking the emotion vectors of the training set as the feature representation of the input layer, and using the emotion vectors for the training of the DNN mixed emotion classification model to obtain a trained DNN model for mixed emotion classification;
step 5, performing word segmentation on the new network platform short text, searching each word after word segmentation in an emotion dictionary to obtain emotion weight corresponding to each word, respectively summing the emotion weights of all words under each emotion to obtain the sum of the emotion weights of the new network platform short text under each emotion, and combining the sum to form an emotion vector;
and 6, inputting the emotion vector formed from the new network platform short text into the trained DNN model to obtain a vector containing a probability value for each emotion, and comparing the two largest probability values with a set threshold: if a probability value is greater than or equal to the threshold, the new network platform short text contains the corresponding emotion; otherwise it does not.
As a further improvement of the present invention, the step 2 specifically includes:
step 21, classifying the samples of different emotion labels in the training set to obtain sample sets under n classes of emotion labels, each set being S_i (1 ≤ i ≤ n); setting i = 1; setting an all_words set, initially empty, to record all words that appear; the total number of training samples is N;
step 22, setting the words_i set to empty; letting the k-th sample of the i-th class emotion label training set be S_i^k; setting k = 1; setting count_i as the counter of the i-th class emotion label training set, count_i = 1;
step 23, segmenting the text part of S_i^k into words, and storing the segmentation results in the words_i set and in the all_words set;
step 24, letting k = k + 1 and count_i = count_i + 1, and repeating step 23 until k equals the total number of samples in the i-th class emotion label training set;
step 25, letting i = i + 1, and repeating steps 22 to 24 until i = n;
step 26, extracting a word w from the all_words set (without replacement), and counting in each words_i (1 ≤ i ≤ n) its occurrence count tf_w^i, called the word frequency; n_w is the number of emotions i for which tf_w^i is not 0;
step 27, calculating the emotion weight weight_w^i of the word w under the i-th class emotion label, through to the emotion weight of w under the n-th class emotion;
step 28, if the all_words set is not empty, going to step 26;
step 29, storing the emotion weight of the word w under the i-th emotion as a key-value pair w : weight_w^i in the dictionary weight_i, finally obtaining n dictionaries; the n dictionaries are grouped into the emotion dictionary, and each weight_i is called an emotion page of the emotion dictionary.
As a further improvement of the invention, the emotion weight weight_w^i of the word w under the i-th class emotion label in step 27 is calculated as:

weight_w^i = (tf_w^i / count_i) × log(N / n_w)
as a further improvement of the present invention, the step 3 specifically includes:
step 31, letting the k-th sample of the i-th class emotion label training set be S_i^k, and setting k = 1;
step 32, segmenting the text part of S_i^k, using each segmented word w as a key to query each emotion page weight_i of the emotion dictionary obtained in step 2 and obtain the value weight_w^i corresponding to the key w, then calculating the emotion score score_i of the i-th emotion of S_i^k, and combining the emotion scores of all emotion types to form the emotion vector of the training set V = (score_1, ..., score_n).
As a further improvement of the invention, the emotion score score_i in step 32 is calculated as:

score_i = Σ_{w ∈ S_i^k} weight_w^i
as a further improvement of the present invention, the step 5 specifically includes:
step 51, defining the new network platform short texts as test set samples, the r-th of which is test_r; performing word segmentation on the text of test_r to obtain the word set of sample test_r;
step 52, using each word in the word set as a key to query its corresponding emotion weight in the emotion dictionary obtained in step 2, then using the formula

score_i^r = Σ_{w ∈ test_r} weight_w^i

to obtain the i-th class emotion score score_i^r of the test sample test_r, and combining the emotion scores of each emotion to form the emotion vector V_r = (score_1^r, ..., score_n^r).
As a further improvement of the present invention, the step 6 specifically includes:
step 61, inputting the V_r obtained in step 5 into the trained DNN model to obtain a vector V_p = (p_1, ..., p_n) containing a probability value for each emotion, where p_i (1 ≤ i ≤ n) is the probability that the test sample test_r contains the i-th emotion;
step 62, extracting from V_p = (p_1, ..., p_n) the two emotion probability values with the largest and second largest values, denoted p_j and p_l;
step 63, setting an emotion threshold P and comparing p_j and p_l with P, with the following rules:
if p_j ≥ P and p_l ≥ P, the test sample test_r contains the j-th and the l-th emotions;
if p_j ≥ P and p_l < P, the test sample test_r contains the j-th emotion;
if p_j < P and p_l ≥ P, the test sample test_r contains the l-th emotion;
if p_j < P and p_l < P, the test sample test_r contains no emotion.
The invention has the following beneficial effects. First, deep learning approaches to emotion classification have generally assigned a text to a single emotion class, but a text often contains more than one emotion, and combinations of emotions may occur; the task of classifying the multiple emotions contained in a text is called the mixed emotion classification task, and the present method addresses it directly. Second, solving emotion classification with deep learning requires a feature representation of the text to be classified, and existing feature representation methods usually produce vectors of huge dimensionality, so a great deal of time and memory is consumed during feature representation and training; by condensing each text into a short emotion vector, the present method greatly reduces this cost.
Drawings
FIG. 1 is a flow chart of a mixed emotion classification method of a network platform short text based on a novel emotion dictionary.
FIG. 2 is a coordinate axis diagram of the mixed emotion classification of the present invention.
FIG. 3 is a schematic diagram of the structure of FIG. 1 when the text is converted into an emotion vector.
Fig. 4 is a schematic structural diagram of the DNN model in fig. 1.
Fig. 5 is a schematic diagram showing the structure of the emotion classification determination in fig. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention realizes a fast and efficient mixed emotion classification task in the network platform short text environment, improves the efficiency of mixed emotion classification, and reduces time and memory costs. It provides a novel emotion dictionary and completes the mixed emotion classification task by combining this dictionary with a deep learning method.
The network platform generates massive text data every day, and this text contains a large amount of emotion. In the past, when deep learning was used for emotion classification, a text was generally classified into a single emotion class; but due to the characteristics of human emotion, a text often contains more than one emotion, and the contained emotions may occur in combination, as illustrated by the emotion mixtures in each quadrant of fig. 2. The task of classifying the multiple emotions contained in a text is called the mixed emotion classification task, which can be solved with a multi-label deep learning classification model. The emotions contained in each collected network platform comment text are labeled manually, and the comments are used as a training set.
Word segmentation is performed on each comment text in the training set to obtain the set of all words in the training set and the set of words for each class of emotion. The word frequency of every word in the training-set word set is calculated under each emotion; the emotion weight of each word is then calculated from its word frequencies and stored with the word in the emotion dictionary as a key-value pair. The emotion weight sums of each emotion of every sample are combined into the emotion vectors of the training set, which are used as the feature representation input to the DNN model for training, yielding a trained mixed emotion classifier.
Word segmentation is performed on a newly obtained network platform text; all resulting words are used as keys to query their values in the constructed emotion dictionary, yielding the emotion weights of the text's words under each emotion. The weights under each emotion are summed to give that emotion's weight sum, and all the sums are combined into an emotion vector. This vector is input into the trained mixed emotion classifier to obtain a vector of predicted probability values for each emotion; the two largest probability values are then compared with a set threshold, and if a value is greater than or equal to the threshold, the new network platform text contains the corresponding emotion; otherwise it does not.
To facilitate understanding of the technical solution of the present invention, some concepts are defined below:
definition 1 mixed emotion classification: the emotion calculation task is used for extracting two or more emotion types in a text according to the emotion contained in the text or a sentence. For example, a text may contain two emotions of anger and fear, and the emotions classified by the classifier should be "anger" and "fear".
Definition 2 feature representation: for network platform short text to be recognized by a computer, the text must be represented in a format the computer can process. The feature representation used in the invention is the emotion vector of the training set constructed with the novel emotion dictionary; this emotion vector serves as the feature representation.
Definition 3 emotion dictionary: the emotion dictionary is a hash set of words or phrases labeled with emotional tendency, generally consisting of word-tendency or phrase-tendency key-value pairs. The emotional tendency value can be a set of discrete values, such as positive, neutral, and negative emotion, or a set of continuous values, such as values in the range -1 to 1, where a value greater than 0 is regarded as positive emotion and a value less than 0 as negative emotion.
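For illustration, Definition 3 can be instantiated as follows (the words and tendency values here are invented examples, not taken from the patent's corpus):

```python
# An emotion dictionary is a hash set of word -> emotional-tendency key-value pairs.

# Discrete emotional tendencies:
discrete_dictionary = {
    "优秀": "positive",   # "excellent"
    "一般": "neutral",    # "average"
    "糟糕": "negative",   # "terrible"
}

# Continuous tendencies in [-1, 1]: a value > 0 is positive, a value < 0 negative.
continuous_dictionary = {"优秀": 0.9, "一般": 0.0, "糟糕": -0.8}

def tendency(word):
    """Map a continuous tendency value to a discrete label."""
    v = continuous_dictionary.get(word, 0.0)
    return "positive" if v > 0 else "negative" if v < 0 else "neutral"
```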
The construction method of the emotion dictionary can be divided into an emotion dictionary construction method based on a dictionary and an emotion dictionary construction method based on a corpus. The construction method of the emotion dictionary based on the dictionary is generally based on the improvement or the expansion of the original emotion dictionary; the method for constructing the emotion dictionary based on the corpus is to perform some processing on words in a corpus to obtain the emotional tendency of the words. General processing methods include calculation of emotional tendency using syntactic relations of words, labeling of emotional tendency using machine learning or deep learning, and the like.
The emotion dictionary used by the invention calculates the emotion weight of each word under each emotion from the word's frequency under each emotion, and stores each key-value pair consisting of a word and its emotion weight under a certain emotion in a dictionary. Calculating and storing the emotion weights of every training-sample word under every emotion forms the emotion dictionary.
Definition 4 word segmentation: Chinese word segmentation takes a continuous text string, which may be a long text such as a paragraph or an article, or a short text such as a sentence or a phrase, and decomposes and recombines it according to a certain standard into substrings whose basic units are characters and words, facilitating subsequent processing and analysis. Since network platform short text is unstructured data, it must be converted into structured data by Chinese word segmentation.
The invention adopts a word segmentation method based on a prefix dictionary. A prefix dictionary is constructed from the words and word frequencies of a known statistical dictionary: each word in the statistical dictionary is traversed, and each of its prefixes is added to the prefix dictionary together with its word frequency. A directed acyclic graph of all possible word combinations in the sentence to be segmented is then generated, and the path with the maximum word frequency is found to obtain the maximum segmentation combination based on word frequency; this combination is the word segmentation result.
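A minimal sketch of this prefix-dictionary segmentation follows (the toy statistical dictionary is invented for illustration; a real system would load a large word-frequency dictionary; log probabilities let the maximum-word-frequency path be found by dynamic programming over the directed acyclic graph):

```python
import math

# Toy statistical dictionary: word -> frequency (invented counts).
FREQ = {"网络": 50, "平台": 40, "网络平台": 30, "短": 20, "文本": 45, "短文本": 25}

def build_prefix_dict(freq):
    """Every prefix of every word gets an entry; prefixes that are
    not themselves dictionary words get frequency 0."""
    pd = {}
    for word in freq:
        for i in range(1, len(word)):
            pd.setdefault(word[:i], 0)
    for word, f in freq.items():
        pd[word] = f
    return pd

def segment(sentence, freq):
    pd = build_prefix_dict(freq)
    total = sum(freq.values())
    n = len(sentence)
    # DAG: for each start index i, the end indices j such that
    # sentence[i:j] is a dictionary word.
    dag = {}
    for i in range(n):
        ends = []
        for j in range(i + 1, n + 1):
            frag = sentence[i:j]
            if frag not in pd:
                break                  # no dictionary word starts with frag
            if pd[frag] > 0:
                ends.append(j)
        dag[i] = ends or [i + 1]       # unknown character: cut it alone
    # Dynamic programming, right to left: route[i] = (best log-prob, next cut).
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(pd.get(sentence[i:j], 0) or 1) - math.log(total)
             + route[j][0], j)
            for j in dag[i]
        )
    # Follow the maximum-frequency path.
    out, i = [], 0
    while i < n:
        j = route[i][1]
        out.append(sentence[i:j])
        i = j
    return out
```

Design note: this is the same idea used by common Chinese segmenters such as jieba; the break on a missing prefix works because every true word's prefixes are guaranteed to be present in the prefix dictionary.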
Definition 5 emotion weight: the emotion weight weight_w^i of a word w under the i-th emotion represents the emotional intensity of the word for that emotion. Its calculation formula is:

weight_w^i = (tf_w^i / count_i) × log(N / n_w)

where tf_w^i is the word frequency of the word w under the i-th class of emotion; count_i is the total number of samples of the i-th emotion in the training set; N is the total number of samples in the training set; and n_w is the number of emotions in whose samples the word appears.
Definition 6 emotion vector: the emotion vector is an n-dimensional vector, where n is the number of emotion types to be classified. Each dimension is the sum, over all words in the sample, of their emotion weights under that emotion. The vector can be used as a feature representation for deep learning.
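Definitions 5 and 6 can be sketched directly in code. The closed form used here, (tf_w^i / count_i) · log(N / n_w), is an assumed reading of the patent's stated quantities tf_w^i, count_i, N and n_w, not a verified transcription of the original formula image:

```python
import math

def emotion_weight(tf_wi, count_i, N, n_w):
    """Definition 5 (assumed form): weight_w^i = (tf_w^i / count_i) * log(N / n_w)."""
    if tf_wi == 0 or n_w == 0:
        return 0.0
    return (tf_wi / count_i) * math.log(N / n_w)

def emotion_vector(sample_words, pages):
    """Definition 6: pages is a list of n emotion pages (dicts word -> weight_w^i).
    Each dimension is the sum of the sample's word weights under that emotion;
    words missing from a page contribute 0."""
    return [sum(page.get(w, 0.0) for w in sample_words) for page in pages]
```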
The method of the invention performs mixed emotion classification on network platform short texts, obtaining emotion vectors with the novel emotion dictionary. Using these emotion vectors as the feature representation input of the deep learning model greatly reduces the time consumed and the memory occupied during deep learning training.
As shown in fig. 1, taking comment short texts of a microblog network platform as an example, the mixed emotion classification of microblog comments is solved with the following specific operation steps:
step 1: firstly, carrying out artificial emotion marking on collected microblog comment samples, namely historical network platform short texts, and then using the data samples as a training set.
Step 2: as shown in fig. 3, each sample in the training set is subjected to word segmentation, then the word frequency of each word under each emotion is calculated, the emotion weight of the word is calculated by using the word frequency of the word under each emotion, and each word and the emotion weight thereof are stored in a dictionary in a key-value pair manner to form an emotion dictionary, and the specific implementation method is as follows:
step 21: classifying samples of different emotion labels of the training set to obtain nSample sets under emotion-like labels, each set being called Si(i is more than or equal to 1 and less than or equal to N), i is the ith emotion, i is set to 1, an all _ words set is set for recording all appeared words, the all _ words set is set to be empty, and the number of training lumped samples is set to be N;
step 22: let all _ words set equal to null for recording all words appearing in training set, let kth sample of i-th class emotion label training set be
Figure RE-GDA0002930866120000111
Set k equal to 1, set countiSetting count for counter of ith type emotion label training seti=1;
Step 23: order wordsiThe collection is empty, all texts in the sample of the ith emotion are subjected to word segmentation, and word segmentation results are stored in wordsiThe set is also stored in an all _ words set;
step 24: let k equal k +1, counti=counti+1, repeating the step 23 until k equals to the total number of samples in the ith type emotion label training set;
step 25: repeating the steps 22 to 24 until i is equal to n;
step 26: retrieving words w from the all _ words set without putting back, counting words w in wordsi(1. ltoreq. i. ltoreq.n)
Figure RE-GDA0002930866120000112
Called word frequency, nwIs not 0
Figure RE-GDA0002930866120000113
The number of (2);
step 27: calculating the emotion weight of the word w under the ith type of emotion label
Figure RE-GDA0002930866120000114
And the emotional weight of the word under the n-th type emotion; weight of word w under type i emotion label
Figure RE-GDA0002930866120000115
The calculation formula of (2) is as follows:
Figure RE-GDA0002930866120000116
step 28: if the all _ words set is not empty, then go to step 26;
step 29: and (3) adding the emotion weight of the word w under the ith type of emotion to w:
Figure RE-GDA0002930866120000117
the form of the key-value pair is stored in the dictionary weightiFinally, n dictionaries are obtained. Grouping n dictionaries into an emotion dictionary, each weightiAn emotion page called emotion dictionary.
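Steps 21 to 29 can be condensed into the following sketch. Samples are assumed to be already segmented into word lists, and the TF-IDF-style weight formula here, (tf_w^i / count_i) · log(N / n_w), is an assumed reading of the patent's quantities rather than a verified transcription:

```python
import math
from collections import Counter

def build_emotion_dictionary(labeled_samples):
    """labeled_samples: list of (words, label) pairs, label in 0..n-1.
    Returns the n emotion pages: pages[i] maps word w -> weight_w^i."""
    n = max(label for _, label in labeled_samples) + 1
    N = len(labeled_samples)                  # total training samples
    count = [0] * n                           # count_i: samples per emotion
    tf = [Counter() for _ in range(n)]        # tf[i][w]: word frequency under emotion i
    all_words = set()
    for words, label in labeled_samples:
        count[label] += 1
        tf[label].update(words)
        all_words.update(words)
    pages = [{} for _ in range(n)]
    for w in all_words:
        n_w = sum(1 for i in range(n) if tf[i][w] > 0)   # emotions containing w
        for i in range(n):
            if tf[i][w] > 0:
                pages[i][w] = (tf[i][w] / count[i]) * math.log(N / n_w)
    return pages
```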
Step 3: as shown in fig. 3, the emotion weights of all participles of each sample under a certain emotion are accumulated to obtain the sum of the sample's emotion weights for that emotion, and the sums for each emotion are combined to form the emotion vector of the training set.
The specific implementation method of the step 3 is as follows:
step 31: let the kth sample of the ith type emotion label training set be
Figure RE-GDA0002930866120000121
Setting k to 1; step 32: will be provided with
Figure RE-GDA0002930866120000122
Then, each word w after word segmentation is used as each emotion page weight for inquiring the emotion dictionary constructed in the step 2iObtaining the value corresponding to the key w
Figure RE-GDA0002930866120000123
Then, the current sample is calculated by using the following formula
Figure RE-GDA0002930866120000124
Sentiment score of the ith sentiment of (1)
Figure RE-GDA0002930866120000125
Figure RE-GDA0002930866120000126
And combining the emotion scores of each class to form an emotion vector of a training set
Figure RE-GDA0002930866120000127
Step 4: as shown in fig. 4, the emotion vectors of the training set obtained in step 3 are used as the feature representation of the input layer for training the DNN mixed emotion classification model, obtaining a trained DNN model for mixed emotion classification.
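The patent does not fix the DNN's architecture beyond the n-dimensional emotion-vector input layer and a per-emotion probability output; the minimal forward-pass sketch below is therefore illustrative (layer sizes, weights, and the sigmoid output activation are assumptions; sigmoid outputs suit a multi-label task, since each p_i is an independent probability):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(x, W, b, act):
    """One fully connected layer: act(W x + b)."""
    return [act(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def dnn_forward(v, layers):
    """v: the n-dimensional emotion vector (the input layer's features).
    layers: list of (W, b, activation) tuples; the last layer should use
    sigmoid so each output p_i is the probability of containing emotion i."""
    for W, b, act in layers:
        v = dense(v, W, b, act)
    return v
```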
Step 5: perform word segmentation on the new network platform short text, look up each segmented word in the emotion dictionary to obtain its corresponding emotion weights, sum the emotion weights of all words under each emotion to obtain the short text's emotion weight sum under that emotion, and combine the sums to form an emotion vector.
The concrete implementation method of the step 5 is as follows:
step 51: setting the short text of the new network platform as a test set sample, and setting the r-th sample in the test set as testrTest will berPerforming word segmentation processing on the text to obtain a sample testrWord sets after word segmentation;
step 52: traversing the word set, and using each traversed word w as a key to inquire the weight of each dictionary page in the emotion dictionaryi(1. ltoreq. i. ltoreq. n)
Figure RE-GDA0002930866120000128
And calculating the emotion score in each type of emotion by using the following formula:
Figure RE-GDA0002930866120000129
obtaining a test sample testrSentiment score of the ith category
Figure RE-GDA0002930866120000131
Then combining the obtained emotion scores under each emotion into an emotion vector
Figure RE-GDA0002930866120000132
Step 6: as shown in FIG. 5, the emotion vector is input into the trained DNN model to obtain a vector containing a probability value for each emotion. The two largest probability values are compared with a set threshold: if such a probability value is greater than or equal to the threshold, the sample contains the corresponding emotion; otherwise it does not.
Step 6 is implemented as follows:
Step 61: input the V_r obtained in step 5 into the trained DNN model to obtain a vector V_p = (p_1, ..., p_n) containing the probability value of each emotion, where p_i (1 ≤ i ≤ n) is the probability that the test sample test_r contains the i-th emotion.
Step 62: take from V_p = (p_1, ..., p_n) the largest and the second-largest probability values, denoted p_j and p_l.
Step 63: set an emotion threshold P and judge p_j and p_l against P with the following rules:
if p_j ≥ P and p_l ≥ P, the test sample test_r contains the j-th and the l-th emotions;
if p_j ≥ P and p_l < P, the test sample test_r contains the j-th emotion;
if p_j < P and p_l ≥ P, the test sample test_r contains the l-th emotion;
if p_j < P and p_l < P, the test sample test_r contains no emotion.
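The top-two thresholding of steps 61–63 can be sketched as below; the function name and the example threshold value are assumptions for illustration.

```python
# Sketch of step 6's decision rule: keep the two largest emotion
# probabilities and accept each only if it reaches the threshold P.
# Name and threshold value are illustrative assumptions.

def decide_emotions(probs, P):
    """probs: per-emotion probability values from the DNN; returns the
    indices of the contained emotions (at most the top two, each >= P)."""
    # Indices of the largest and second-largest probabilities (p_j, p_l).
    top_two = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    return [i for i in top_two if probs[i] >= P]

labels = decide_emotions([0.05, 0.6, 0.3, 0.05], P=0.25)
```

With P = 0.25 this sample is judged to contain both emotion 1 and emotion 2, which is exactly the mixed-emotion case the threshold rule is designed to admit.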
In summary, the invention provides a network platform short text mixed emotion classification method based on a novel emotion dictionary. Collected network platform comment data are manually labeled and used as a training set; the training-set comments are processed and stored in an emotion dictionary, and the processed data are formed into training-set emotion vectors, which are input as feature representations into a DNN model for training to obtain a trained mixed-emotion classifier. A network platform text to be examined is segmented into words, queried against the constructed emotion dictionary to form the emotion vector of the text to be examined, and this emotion vector is finally input into the trained mixed-emotion classifier to determine which emotions the text contains.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (7)

1. A network platform short text mixed emotion classification method based on a novel emotion dictionary is characterized by comprising the following steps:
step 1, manually labeling the collected historical network platform short texts with emotions, and using them as a training set;
step 2, performing word segmentation processing on each sample in the training set, then calculating the word frequency of each word under each emotion, calculating the emotion weight of each word by using the word frequency of each word under each emotion, and storing each word and the emotion weight thereof in a dictionary in a key value pair mode to form an emotion dictionary;
step 3, accumulating the emotion weight under a certain emotion of all participles of each sample to obtain the sum of the emotion weight of the certain emotion of the sample, and combining the sum of the emotion weight of each emotion of the sample to form an emotion vector of a training set;
step 4, taking the emotion vectors of the training set as the feature representation of the input layer, and using the emotion vectors for the training of the DNN mixed emotion classification model to obtain a trained DNN model for mixed emotion classification;
step 5, performing word segmentation on the new network platform short text, searching each word after word segmentation in an emotion dictionary to obtain emotion weight corresponding to each word, respectively summing the emotion weights of all words under each emotion to obtain the sum of the emotion weights of the new network platform short text under each emotion, and combining the sum to form an emotion vector;
and step 6, inputting the emotion vector formed from the new network platform short text into the trained DNN model to obtain a vector containing the probability value of each emotion, and judging the two largest probability values against a set threshold; if such a probability value is greater than or equal to the threshold, the new network platform short text contains the corresponding emotion; otherwise it does not.
2. The network platform short text mixed emotion classification method based on a novel emotion dictionary according to claim 1, wherein the step 2 specifically comprises:
step 21, classifying the samples of different emotion labels in the training set to obtain sample sets under n classes of emotion labels, each set denoted S_i (1 ≤ i ≤ n); setting i = 1; creating an all_words set for recording all words that appear, initialized to empty; and letting the total number of training-set samples be N;
step 22, setting the words_i set to empty; letting the k-th sample of the i-th emotion-label training set be s_k^i; setting k = 1; and setting the counter count_i of the i-th emotion-label training set to count_i = 1;
step 23, segmenting the text part of s_k^i into words, and storing the segmentation results in the words_i set and in the all_words set;
step 24, letting k = k + 1 and count_i = count_i + 1, and repeating step 23 until k equals the total number of samples in the i-th emotion-label training set;
step 25, letting i = i + 1, and repeating steps 22 to 24 until i = n;
step 26, extracting a word w from the all_words set (without replacement), and counting the frequency of occurrence f_i^w of the word w in each words_i (1 ≤ i ≤ n); f_i^w is called the word frequency, and n_w is the number of nonzero f_i^w (1 ≤ i ≤ n);
step 27, calculating the emotion weight weight_i^w of the word w under each emotion label, from the 1st emotion label through the n-th;
step 28, if the all_words set is not empty, returning to step 26;
step 29, storing the emotion weight of the word w under the i-th emotion in the dictionary weight_i in key-value form as w: weight_i^w; n dictionaries are finally obtained, the n dictionaries together constitute the emotion dictionary, and each weight_i is called an emotion page of the emotion dictionary.
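Steps 21–29 above can be sketched as follows. The toy corpus, whitespace tokenization, and the weight formula (simple class-relative frequency) are illustrative stand-ins: the patent defines its own weight formula (given as an image in the original), so the weighting below is an assumption, not the claimed computation.

```python
from collections import Counter

# Sketch of claim-2 steps 21-29: build one weight dictionary ("emotion
# page") per emotion class from per-class word frequencies f_i^w.
# Corpus, tokenization, and weighting are illustrative assumptions.

def build_emotion_dict(labeled_samples, n):
    """labeled_samples: list of (text, label) with labels in 0..n-1;
    returns n dicts, one emotion page per class, mapping word -> weight."""
    words_per_class = [Counter() for _ in range(n)]  # f_i^w counts
    all_words = set()
    for text, label in labeled_samples:
        tokens = text.split()        # stand-in for real word segmentation
        words_per_class[label].update(tokens)
        all_words.update(tokens)
    pages = []
    for i in range(n):
        total = sum(words_per_class[i].values()) or 1
        # Assumed weight: relative frequency of w within class i.
        pages.append({w: words_per_class[i][w] / total for w in all_words})
    return pages

pages = build_emotion_dict([("good good film", 0), ("bad film", 1)], n=2)
```

Every word that appears anywhere ends up as a key on every page (with weight 0 where it never occurs in that class), so lookups in the later scoring steps never miss.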
3. The novel emotion dictionary-based network platform short text mixed emotion classification method as claimed in claim 2, wherein the emotion weight weight_i^w of the word w under the i-th emotion label in step 27 is calculated from the word frequencies f_i^w and the count n_w [calculation formula given as an image in the original document].
4. The novel emotion dictionary-based network platform short text mixed emotion classification method as claimed in claim 2, wherein said step 3 specifically comprises:
step 31, letting the k-th sample of the i-th emotion-label training set be s_k^i, and setting k = 1;
step 32, segmenting the text part of s_k^i into words; querying, with each segmented word w as the key, each emotion page weight_i of the emotion dictionary obtained in step 2 to obtain the value weight_i^w corresponding to the key w; then calculating the emotion score Score_i^k of the i-th emotion of s_k^i; and combining the emotion scores of all emotion classes to form the training-set emotion vector V_k = (Score_1^k, ..., Score_n^k).
5. The novel emotion dictionary-based network platform short text mixed emotion classification method as claimed in claim 4, wherein the emotion score Score_i^k in step 32 is calculated with the formula:

Score_i^k = Σ_{w ∈ s_k^i} weight_i^w
6. The novel emotion dictionary-based network platform short text mixed emotion classification method as claimed in claim 4, wherein said step 5 specifically comprises:
step 51, defining the new network platform short text as a test-set sample, the r-th sample of which is test_r; segmenting the text of test_r to obtain the word set of the sample test_r after segmentation;
step 52, querying, with each word in the word set as the key, the emotion weight corresponding to each word in the emotion dictionary obtained in step 2, then using the formula Score_i^r = Σ_{w ∈ test_r} weight_i^w to obtain the i-th emotion score Score_i^r of the test sample test_r, and combining the emotion scores of each emotion to form the emotion vector V_r = (Score_1^r, ..., Score_n^r).
7. The novel emotion dictionary-based network platform short text mixed emotion classification method as claimed in claim 6, wherein said step 6 specifically comprises:
step 61, inputting the V_r obtained in step 5 into the trained DNN model to obtain a vector V_p = (p_1, ..., p_n) containing the probability value of each emotion, where p_i (1 ≤ i ≤ n) is the probability that the test sample test_r contains the i-th emotion;
step 62, taking from V_p = (p_1, ..., p_n) the largest and the second-largest probability values, denoted p_j and p_l;
step 63, setting an emotion threshold P and judging p_j and p_l against P with the following rules:
if p_j ≥ P and p_l ≥ P, the test sample test_r contains the j-th and the l-th emotions;
if p_j ≥ P and p_l < P, the test sample test_r contains the j-th emotion;
if p_j < P and p_l ≥ P, the test sample test_r contains the l-th emotion;
if p_j < P and p_l < P, the test sample test_r contains no emotion.
CN202011408818.9A 2020-12-03 2020-12-03 Network platform short text mixed emotion classification method based on novel emotion dictionary Active CN112507071B (en)

Publications (2)

Publication Number Publication Date
CN112507071A true CN112507071A (en) 2021-03-16
CN112507071B CN112507071B (en) 2022-10-14



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant