CN108681532B - Sentiment analysis method for Chinese microblog

Info

Publication number: CN108681532B
Application number: CN201810304972.8A
Authority: CN
Inventors: 喻梅; 张功; 于瑞国; 于健; 徐天一; 刘春岩
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2018-04-08
Filing date: 2018-04-08
Publication date: 2022-03-25
Anticipated expiration: 2038-04-08
Also published as: CN108681532A

Abstract

The invention discloses a Chinese microblog-oriented emotion analysis method, comprising the following steps: selecting a training set M containing emoticons by detecting whether the samples of the preprocessed microblog sample set L' contain emoticons; using the emoticons to mark the emotional polarity of the samples in the training set M as weak marks, and using the weakly marked samples as training samples for supervised machine learning; denoising the training set M by the denoising method SAT to obtain a sample set K; constructing a classifier C from the denoised sample set K in combination with a supervised learning method; and, using accuracy, precision, recall and F value as evaluation criteria, testing the performance of the classifier C on the marked sample set P. The method further comprises automatically updating according to the weakly marked samples. The invention uses the information carried by emoticons in microblogs to study the emotion analysis problem for Chinese microblogs, and solves the problem of noise generated when emoticons are used to weakly mark microblogs.

Description

Sentiment analysis method for Chinese microblog

Technical Field

The invention relates to the field of machine learning and natural language processing, in particular to a Chinese microblog-oriented emotion analysis method.

Background

Currently, emotion classification methods for Chinese microblogs can be divided into emotion analysis algorithms based on emotion dictionaries and emotion analysis algorithms based on machine learning. In the dictionary-based method, text emotion analysis at different granularities is carried out according to the emotional tendency of words given by the emotion dictionary. Among models built with various machine learning methods, the supervised method that uses a Support Vector Machine with Naive Bayes re-weighting (NB-SVM) has the highest accuracy. Current emotion analysis algorithms usually combine the advantages of the two approaches to obtain a better emotion analysis effect.

Emoticons are very popular with users today as a direct way to express emotion, and a user's stance can be seen from the emoticons the user chooses. Emotion analysis of Chinese text is difficult: unlike English, the linguistic characteristics of Chinese mean that emotional features are not expressed explicitly, and Chinese microblogs do not contain enough emotion words for extraction and analysis. Another difficulty is how to handle emoticons in microblog text.

Some studies choose to delete emoticons: "if we leave the emoticons in the sentences, the accuracy of the MaxEnt (maximum entropy) and SVM (support vector machine) classifiers is negatively impacted". Other studies use a single emoticon in the text to represent the emotion of the entire text: "assume that one emoticon in a message represents the emotion of the entire message and that all words of the message are related to this emotion". Neither approach takes account of the fact that a microblog often contains many emoticons, and the emoticons appearing in the microblog are not effectively analyzed during emotion analysis, so the analysis result is affected to a certain extent.

Disclosure of Invention

The invention uses the information carried by emoticons in microblogs to study the emotion analysis problem for Chinese microblogs, and solves the problem of noise generated when emoticons are used to weakly mark microblogs, as described in detail below:

a sentiment analysis method for Chinese microblogs comprises the following steps:

selecting a training set M containing emoticons by detecting whether the preprocessed microblog sample set L' contains the emoticons;

using the emoticons to mark the emotional polarity of the samples in the training set M as weak marks, and using the weakly marked samples as training samples for supervised machine learning; denoising the training set M by the denoising method SAT to obtain a sample set K;

constructing a classifier C from the denoised sample set K in combination with a supervised learning method; and, using accuracy, precision, recall and F value as evaluation criteria, testing the performance of the classifier C on the marked sample set P;

the method further comprises: automatically updating according to the weakly marked samples.

Wherein the preprocessing specifically comprises:

extracting main contents of the microblog from the new sample set; converting the emoticons into corresponding emotion vocabularies; and removing stop words and low-frequency words which cannot represent text features in the word segmentation result.
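The preprocessing described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the emoticon table `EMOTICON_WORDS`, the stop-word set `STOP_WORDS`, the choice of URLs, @mentions and # markers as "useless information", and the threshold `min_freq` are all assumptions made for illustration, and the input is assumed to be already word-segmented text with tokens separated by spaces.

```python
import re
from collections import Counter

# Hypothetical emoticon-to-emotion-word table (assumption, not from the patent).
EMOTICON_WORDS = {"[哈哈]": "高兴", "[泪]": "伤心"}
# Hypothetical stop-word list (assumption).
STOP_WORDS = {"的", "了", "是"}


def strip_noise(text):
    """Keep the main content: drop URLs, @mentions and '#' topic markers."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\S+", " ", text)
    return text.replace("#", " ")


def replace_emoticons(text):
    """Convert each known emoticon into its corresponding emotion word."""
    for emoticon, word in EMOTICON_WORDS.items():
        text = text.replace(emoticon, f" {word} ")
    return text


def preprocess(corpus, min_freq=2):
    """Return token lists with stop words and low-frequency words removed."""
    tokenized = [replace_emoticons(strip_noise(t)).split() for t in corpus]
    freq = Counter(tok for doc in tokenized for tok in doc)
    return [
        [tok for tok in doc if tok not in STOP_WORDS and freq[tok] >= min_freq]
        for doc in tokenized
    ]


corpus = [
    "今天 天气 好 [哈哈] @小明 http://t.cn/abc",
    "今天 的 电影 好 [哈哈]",
    "今天 了 [泪]",
]
cleaned = preprocess(corpus)
```

In this toy corpus, words that occur only once are treated as low-frequency and dropped; a real corpus would need a threshold chosen for its size.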

Further, using the emoticons to mark the emotional polarity of the samples in the training set M as weak marks is specifically:

where pN and nN represent the weighted sum of the emotion values of the positive emoticons and the weighted sum of the emotion values of the negative emoticons, respectively.
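The marking formula itself is not reproduced in this text, so the following Python sketch shows only one plausible reading of it: a microblog is marked positive when pN exceeds nN and negative when nN exceeds pN. The lexicon `EMOTICON_VALUES`, its weights, and the handling of ties (left unmarked) are assumptions for illustration and are not taken from the patent.

```python
# Hypothetical emoticon lexicon: emoticon -> (polarity, emotion value).
EMOTICON_VALUES = {
    "[哈哈]": ("pos", 1.0),
    "[赞]": ("pos", 0.8),
    "[泪]": ("neg", 1.0),
    "[怒]": ("neg", 0.9),
}


def weak_label(emoticons):
    """Return 'pos', 'neg' or None for the emoticons found in one microblog."""
    pN = nN = 0.0
    for emoticon in emoticons:
        polarity, value = EMOTICON_VALUES.get(emoticon, (None, 0.0))
        if polarity == "pos":
            pN += value  # weighted sum over positive emoticons
        elif polarity == "neg":
            nN += value  # weighted sum over negative emoticons
    if pN > nN:
        return "pos"
    if nN > pN:
        return "neg"
    return None  # tie or no known emoticon: assumed to stay unmarked
```

For example, a microblog containing `[哈哈]`, `[泪]` and `[赞]` has pN = 1.8 and nN = 1.0 under these assumed weights and would be weakly marked as positive.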

Further, denoising the training set M by the denoising method SAT to obtain the sample set K is specifically as follows:

weakly marking the training set M, constructing an original classifier from the weakly marked samples, and using the original classifier to detect the training set M;

treating each sample whose detected mark differs from its original weak mark as wrongly marked, and removing these samples from the training set M to obtain the filtered sample set K.

Constructing the classifier C from the denoised sample set K in combination with the supervised learning method specifically comprises:

constructing the classifier C from the denoised sample set K using four well-known and effective supervised learning methods: BernoulliNB, MultinomialNB, LinearSVC and NuSVC.
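The four method names match classifier classes in the scikit-learn library, so the construction step might be sketched as below. The use of scikit-learn is an assumption, as are the toy data standing in for the denoised set K, the bag-of-words features, and the default hyperparameters; the patent does not specify any of these.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC, NuSVC

# Toy stand-in for the denoised sample set K: segmented text, tokens joined by spaces.
texts = ["开心 好", "好 喜欢", "开心 喜欢", "喜欢 好 开心",
         "难过 差", "差 讨厌", "难过 讨厌", "讨厌 差 难过"]
labels = ["pos"] * 4 + ["neg"] * 4

# token_pattern=r"\S+" keeps single-character tokens, which the default pattern drops.
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(texts)

classifiers = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(random_state=0),
    "NuSVC": NuSVC(),
}
for clf in classifiers.values():
    clf.fit(X, labels)

# Classify one new segmented microblog with each trained model.
new_sample = vectorizer.transform(["开心 好 喜欢"])
predictions = {name: clf.predict(new_sample)[0] for name, clf in classifiers.items()}
```

Training all four models on the same feature matrix makes it straightforward to compare them on a held-out marked set and keep the best performers.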

The technical scheme provided by the invention has the beneficial effects that:

1. in the emotion analysis process, the accuracy of emotion analysis is improved by processing emotion symbols and performing fine-grained analysis on emotion words;

2. the method solves the problem that noise information is generated when the emotional symbol is used for weakly marking the microblog;

3. because microblogs are updated frequently, an old classifier may perform poorly on a new microblog set; the method can update itself automatically from weakly marked samples, which is far more efficient than manual marking, greatly reduces cost, and in this respect is superior to ordinary supervised machine learning methods.

Drawings

FIG. 1 is a flow chart of an emotion analysis method for Chinese microblogs;

FIG. 2 is a flow chart of the SAT algorithm;

FIG. 3 shows the change in the proportion of correctly marked samples in the experiment verifying the effect of the noise reduction algorithm.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.

Example 1

In order to achieve the above object, an embodiment of the present invention provides an emotion analysis method for Chinese microblogs, including the following steps:

101: preprocessing a text;

wherein step 101 comprises: applying natural language processing techniques such as word segmentation, part-of-speech tagging and stop-word removal to the text.

102: extracting emotion information;

the purpose of step 102 is to extract valuable emotion information from the text, and extract unit elements with tendency characteristics in the microblog according to the emotion word tendency definition given in the emotion dictionary.

103: emotion classification.

Step 103 classifies text units into several categories by using the emotion information extraction result in step 102, and classifies subjective text polarity and strength.

In a concrete implementation, step 101 preprocesses the microblog text; the concrete steps include extracting the main content, word segmentation and stop-word removal.

This step is well known to those skilled in the art, and will not be described in detail herein.

The step 102 is to extract emotion information based on the step 101, and specifically includes the following steps:

the emoticons of a microblog can be regarded as marks that the microblog carries with it, but these marks contain some errors, so they are called weak marks. Using the weakly marked samples as training samples for supervised machine learning makes it possible to obtain a large number of marked samples quickly and saves the manpower and material resources that manual marking would require.

Microblogs containing emoticons are selected to form a training set M by detecting whether emoticons exist in each microblog of the sample set L; the emoticons are then used to mark the emotional tendency of the samples in the training set M, that is, the training text is weakly marked.

In the embodiment of the invention, the SAT (Self-Alternative Training) denoising method is used to denoise the training set M to obtain a sample set K. In the SAT noise reduction method, the training set M is weakly marked and an original classifier is constructed from the weakly marked samples; the original classifier is then used to detect the training set M. Some of the marks obtained by detection differ from the original weak marks; these samples are regarded as wrongly marked and are removed from the training set M, yielding the filtered sample set K.

In this elimination process, some correctly marked samples are also removed, but the noise is reduced as well. Iterating several times reduces the proportion of wrongly marked samples among the weakly marked samples, which improves the precision of a classifier trained on the sample set.

In summary, the embodiment of the invention researches the emotion analysis problem facing the Chinese microblog by using the information about the emotion symbol in the microblog, and solves the problem of generating noise information when the emotion symbol is used for weakly marking the microblog.

Example 2

The scheme of Example 1 is further described below in conjunction with FIGS. 1-3:

The self-training algorithm is a semi-supervised learning method used to address the shortage of labeled samples. Its main idea is as follows: using a supervised learning model, an original classifier is constructed from the few existing labeled samples; the original classifier is used to classify the unlabeled samples; and the samples with the highest confidence are added to the labeled sample set to expand it. The flow of the self-training algorithm used in the embodiment of the present invention is shown in FIG. 1.

The SAT noise reduction method achieves the purpose of noise reduction of weak marker samples in an iterative training and self-optimization mode, wherein the flow of the SAT algorithm is shown in FIG. 2, and the main idea of the SAT noise reduction method is as follows:

A weakly marked sample set L is used as the training set to train a classifier C; the trained classifier C is then used to detect the training set L, and the set of samples whose detected result differs from the original weak mark is denoted M. The wrongly detected samples M are removed from the training set L to obtain a new training set L', namely L' = L - M. The classifier C is then retrained on L', the optimized training set L' is detected again, and the process is iterated in this way to obtain a sample set with less noise.
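The iteration just described (train on the current set, re-detect it, remove the disagreeing samples M, repeat) can be sketched in Python as follows. This is a sketch under stated assumptions, not the patent's code: `TokenVoteClassifier` is a hypothetical stand-in for whatever supervised model is used, and the stopping rule (`max_iters`, or stopping once nothing is removed) is also an assumption.

```python
from collections import Counter, defaultdict


class TokenVoteClassifier:
    """Hypothetical stand-in model: each token votes for the marks it was seen with."""

    def fit(self, docs, labels):
        self.votes = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            for tok in tokens:
                self.votes[tok][label] += 1
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, docs):
        preds = []
        for tokens in docs:
            tally = Counter()
            for tok in tokens:
                tally.update(self.votes.get(tok, {}))
            preds.append(tally.most_common(1)[0][0] if tally else self.default)
        return preds


def sat_denoise(docs, labels, make_classifier, max_iters=5):
    """Iteratively drop samples whose predicted mark disagrees with the weak mark."""
    docs, labels = list(docs), list(labels)
    for _ in range(max_iters):
        clf = make_classifier().fit(docs, labels)   # train C on the current set L
        preds = clf.predict(docs)                   # detect L with C
        keep = [i for i, (p, y) in enumerate(zip(preds, labels)) if p == y]
        if not keep or len(keep) == len(docs):      # M is empty (or everything): stop
            break
        docs = [docs[i] for i in keep]              # L' = L - M
        labels = [labels[i] for i in keep]
    return docs, labels


docs = [["开心", "好"], ["好", "喜欢"], ["开心", "喜欢"],
        ["难过", "差"], ["差", "讨厌"], ["难过", "讨厌"],
        ["开心", "好", "喜欢"]]                     # last sample carries a noisy mark
labels = ["pos", "pos", "pos", "neg", "neg", "neg", "neg"]
clean_docs, clean_labels = sat_denoise(docs, labels, TokenVoteClassifier)
```

In this toy run the last sample, whose weak mark contradicts its words, is removed in the first pass and the loop stops on the second pass because no further disagreement remains.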

In the experiment verifying the effect of the noise reduction algorithm, the change in the proportion of correctly marked samples is shown in FIG. 3. This shows that the SAT noise reduction method achieves a good effect and effectively reduces the noise in the weakly marked samples.

The emotion analysis method adopting the self-training algorithm shown in fig. 1 is specifically divided into the following 6 steps:

201: preprocessing the new sample set L': removing useless information from each microblog and extracting its main content; segmenting words; converting the emoticons into corresponding emotion vocabularies; and removing stop words and low-frequency words that cannot represent text features from the word segmentation result;

202: selecting the microblogs containing emoticons to form a training set M by detecting whether emoticons exist in each microblog of the sample set (namely the new sample set) L';

203: using the emoticons to mark the emotional polarity of the samples in the training set M as weak marks;

In a specific implementation, some microblogs contain multiple emoticons, possibly of different polarities, so the overall emotional tendency of a microblog is obtained by a weighted calculation over the emotion values of all emoticons in it. For microblog samples containing emoticons, the rule for judging the emotional tendency is given by formula (1):

204: denoising the training set M by a denoising method SAT to obtain a sample set K;

the SAT denoising process is described in detail in Example 1 and is not repeated here.

205: constructing a classifier C by combining the denoised sample set K with a supervised learning method;

The classifier C is constructed from the denoised sample set K using four well-known and effective supervised learning methods, BernoulliNB, MultinomialNB, LinearSVC and NuSVC, and the performance of the classifier C is tested on the marked sample set P.

Performance is tested on the test set with accuracy as the evaluation criterion; through experimental screening, MultinomialNB and LinearSVC, the two best-performing of the four classifiers, are selected as the classification models.

206: using accuracy, precision, recall and F value as evaluation criteria, the performance of the classifier C is tested on the marked sample set P.
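The four evaluation criteria named in step 206 can be computed as in the following minimal sketch. It assumes binary polarity marks and takes the F value to be the balanced F1 score (the harmonic mean of precision and recall); the patent does not state which F variant is used, and the function name `evaluate` and the sample data are illustrative only.

```python
def evaluate(y_true, y_pred, positive="pos"):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f": f_value}


# Marks of the sample set P versus the classifier's predictions (toy data).
gold = ["pos", "pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "neg", "pos", "pos"]
scores = evaluate(gold, pred)
```

With these toy marks there are two true positives, one false positive and one false negative, so accuracy is 3/5 and precision, recall and F all equal 2/3.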

In summary, compared with supervised machine learning that manually marks samples, the embodiment of the present invention can train using samples with weak marks and update in time according to new microblogs, so that it has stronger timeliness.

Example 3

The feasibility verification of the solutions of examples 1 and 2 is carried out below with reference to fig. 3, which is described in detail below:

In the experiment, Unigram + Bigram, which gave the best results, is selected as the text feature, and MultinomialNB and LinearSVC are each used as the classification model to test the performance of the method. Accuracy, Precision, Recall and F value (F-measure) are used as evaluation criteria.
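To make the Unigram + Bigram feature concrete, the sketch below builds such features from an already segmented microblog. The function name `unigram_bigram_features` and the `_` separator used to join bigram parts are assumptions for illustration; the patent does not describe how the features are encoded.

```python
from collections import Counter


def unigram_bigram_features(tokens):
    """Count single tokens and adjacent token pairs in one segmented text."""
    unigrams = list(tokens)
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(unigrams + bigrams)


features = unigram_bigram_features(["今天", "很", "开心"])
```

Three tokens yield three unigram features and two bigram features; the bigrams let a model distinguish, for instance, a negated emotion word from the same word on its own.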

The microblog data set used in the experiment is the data set provided by the Chinese microblog emotion tendency analysis and vocabulary semantic relation extraction evaluation task of the 2012 CCF Conference on Natural Language Processing and Chinese Computing (NLP&CC 2012). The accuracy of emotional tendency judgment is tested for seven methods: SVM-SAT, NB-SAT, Lexicon, SVM, NB, SVM-weak and NB-weak. The comparison shows that the accuracy and F value of SVM-SAT and NB-SAT are far higher than those of Lexicon, SVM-weak and NB-weak, and close to those of SVM and NB, which shows that the SAT noise reduction method works well and that the constructed classifier performs excellently.

The emotion tendency analysis results of the classifier obtained by this method are close to those of a classifier trained on manually marked samples, because after several iterations of the SAT noise reduction method the proportion of wrongly marked samples in the sample set is greatly reduced.

To measure the noise reduction effect, a certain number of manually marked samples containing emoticons are selected in the experiment. The emoticons are used to obtain the weak mark of each sample, and the weak marks are used as the marks; MultinomialNB is used as the classification model and Unigram + Bigram as the text features; and the SAT algorithm described above is used to denoise the data. After each iteration, the numbers of correctly marked and wrongly marked samples among the remaining samples are counted and the proportion of correct marks is calculated, as shown in FIG. 3.

In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.

Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A Chinese microblog-oriented emotion analysis method is characterized by comprising the following steps:

the method further comprises the following steps: automatically updating according to the weakly marked sample;

using the emoticons to mark the emotional polarity of the samples in the training set M as weak marks is specifically as follows:

wherein pN, nN represent the weighted sum of the positive emoticons and the weighted sum of the negative emoticon emotion values, respectively;

denoising the training set M by the noise reduction method SAT to obtain the sample set K specifically comprises the following steps:

weak labeling is carried out on a training set M, an original classifier is constructed by using weak labeled samples, and the training set M is detected by using the original classifier;

treating each sample whose detected mark differs from its original weak mark as wrongly marked, and removing these samples from the training set M to obtain the filtered sample set K.

2. The Chinese microblog-oriented emotion analysis method according to claim 1, wherein the preprocessing specifically comprises:

extracting the content of the microblog from the new sample set; converting the emoticons into corresponding emotion vocabularies; and removing stop words and low-frequency words which cannot represent text features in the word segmentation result.

3. The Chinese microblog-oriented emotion analysis method according to claim 1, wherein constructing the classifier C from the denoised sample set K in combination with the supervised learning method specifically comprises:

the denoised sample set K is used to construct the classifier C by four known supervised learning methods: BernoulliNB, MultinomialNB, LinearSVC and NuSVC.