CN108681532B - Sentiment analysis method for Chinese microblog - Google Patents
- Publication number
- CN108681532B
- Authority
- CN
- China
- Prior art keywords
- sample set
- emoticons
- sample
- classifier
- emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an emotion analysis method for Chinese microblogs, comprising the following steps: selecting a training set M of samples containing emoticons by detecting whether each sample in the preprocessed microblog sample set L' contains emoticons; using the emoticons to mark the emotional polarity of the samples in the training set M as weak marks, and using the weakly marked samples as training samples for supervised machine learning; denoising the training set M with the SAT denoising method to obtain a sample set K; constructing a classifier C from the denoised sample set K with a supervised learning method; and checking the performance of the classifier C on the marked sample set P, using accuracy, precision, recall and F value as evaluation criteria. The method further comprises automatically updating according to the weakly marked samples. The invention uses the emoticon information in microblogs to study emotion analysis for Chinese microblogs, and solves the problem of noise being generated when emoticons are used to weakly mark microblogs.
Description
Technical Field
The invention relates to the field of machine learning and natural language processing, in particular to a Chinese microblog-oriented emotion analysis method.
Background
Currently, emotion classification methods for Chinese microblogs can be divided into emotion analysis algorithms based on emotion dictionaries and emotion analysis algorithms based on machine learning. Dictionary-based methods perform text emotion analysis at different granularities according to the emotional tendency of words given in the emotion dictionary. Among models built with various machine learning methods, the supervised method that applies Naive Bayes re-weighting to a Support Vector Machine (NB-SVM) has the highest accuracy. Current emotion analysis algorithms usually combine the advantages of the two approaches to obtain better results.
Emoticons are very popular with users today as a way to express emotion directly, and the emoticons a user chooses reveal that user's attitude. Emotion analysis of Chinese text is difficult: unlike English, the linguistic characteristics of Chinese keep emotional features from showing up clearly, and Chinese microblogs often do not contain enough emotion words to extract and analyze. Another difficulty is how to handle emoticons in microblog text.
Some studies choose to delete emoticons: "if emoticons are left in the sentences during analysis, the accuracy of the MaxEnt (maximum entropy) and SVM (support vector machine) classifiers is negatively affected". Other studies use a single emoticon in the text to represent the emotion of the entire text: "assume that one emoticon in a message represents the emotion of the entire message and that all words of the message are related to this emotion". These methods ignore the abundance of emoticons in microblogs and do not analyze the emoticons effectively during emotion analysis, which affects the analysis results to some extent.
Disclosure of Invention
The invention uses the emoticon information in microblogs to study emotion analysis for Chinese microblogs and solves the problem of noise being generated when emoticons are used to weakly mark microblogs, as described in detail below:
An emotion analysis method for Chinese microblogs comprises the following steps:
selecting a training set M of samples containing emoticons by detecting whether each sample in the preprocessed microblog sample set L' contains emoticons;
using the emoticons to mark the emotional polarity of the samples in the training set M as weak marks, and using the weakly marked samples as training samples for supervised machine learning; denoising the training set M with the SAT denoising method to obtain a sample set K;
constructing a classifier C from the denoised sample set K with a supervised learning method; and checking the performance of the classifier C on the marked sample set P, using accuracy, precision, recall and F value as evaluation criteria;
the method further comprises automatically updating according to the weakly marked samples.
The preprocessing specifically comprises the following steps:
extracting the main content of each microblog in the new sample set; converting the emoticons into corresponding emotion vocabulary; and removing stop words and low-frequency words that cannot represent text features from the word segmentation result.
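The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the emoticon table `EMOTICON_WORDS`, the stop-word set `STOP_WORDS`, the bracketed emoticon tag format, the URL and @mention stripping, and the `min_count` threshold are all assumptions, and the posts are assumed to be already word-segmented with spaces by an external tool.

```python
import re
from collections import Counter

# Hypothetical lookup tables: the patent does not list its emoticon lexicon or stop words.
EMOTICON_WORDS = {"[哈哈]": "开心", "[泪]": "伤心"}
STOP_WORDS = {"的", "了"}

def tokenize(post):
    """Strip URLs and @mentions, map emoticon tags to emotion words, split on spaces.

    Assumes the post was already word-segmented with spaces by an external tool.
    """
    post = re.sub(r"https?://\S+|@\S+", " ", post)
    for tag, word in EMOTICON_WORDS.items():
        post = post.replace(tag, f" {word} ")
    return post.split()

def filter_tokens(posts, min_count=2):
    """Remove stop words and tokens that occur fewer than min_count times overall."""
    counts = Counter(tok for post in posts for tok in post)
    return [[t for t in post if t not in STOP_WORDS and counts[t] >= min_count]
            for post in posts]

raw = ["今天 天气 好 [哈哈] http://example.com/a", "@某人 天气 的 预报 [泪]", "今天 好 了"]
tokens = [tokenize(p) for p in raw]
cleaned = filter_tokens(tokens, min_count=2)
```

On a corpus this small the frequency filter also drops the converted emotion words, which is why a real threshold would need tuning against the data set.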
Further, using the emoticons to mark the emotional polarity of the samples in the training set M, as weak marks, is specifically:
where pN and nN denote the weighted sum of the emotion values of the positive emoticons and of the negative emoticons, respectively.
Further, denoising the training set M with the SAT denoising method to obtain the sample set K specifically comprises:
weakly marking the training set M, constructing an original classifier from the weakly marked samples, and detecting the training sample set M with the original classifier;
taking the samples whose detected mark differs from the original weak mark as wrongly marked, and removing them from the training sample set M to obtain the filtered sample set K.
Constructing the classifier C from the denoised sample set K with a supervised learning method specifically comprises:
constructing the classifier C from the denoised sample set K with four well-known, well-performing supervised learning methods: BernoulliNB, MultinomialNB, LinearSVC and NuSVC.
The technical scheme provided by the invention has the beneficial effects that:
1. In the emotion analysis process, the accuracy of emotion analysis is improved by processing emoticons and performing fine-grained analysis of emotion words;
2. The method solves the problem of noise being generated when emoticons are used to weakly mark microblogs;
3. Because microblogs are updated frequently, an old classifier may degrade the classification of a new microblog set. The method can update automatically from weakly marked samples, which is far more efficient and much cheaper than manual marking, and in this respect it is superior to ordinary supervised machine learning methods.
Drawings
FIG. 1 is a flow chart of an emotion analysis method for Chinese microblogs;
FIG. 2 is a flow chart of the SAT algorithm;
FIG. 3 shows how the proportion of correctly marked samples changes in the experiment that verifies the effect of the noise reduction algorithm.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
To achieve the above object, an embodiment of the present invention provides an emotion analysis method for Chinese microblogs, including the following steps:
101: preprocessing the text;
wherein step 101 comprises applying natural language processing techniques such as word segmentation, part-of-speech tagging and stop-word removal to the text.
102: extracting emotion information;
the purpose of step 102 is to extract valuable emotion information from the text, extracting the unit elements with tendency characteristics in the microblog according to the emotion word tendencies defined in the emotion dictionary.
103: emotion classification.
Step 103 uses the emotion information extracted in step 102 to classify text units into several categories, classifying the polarity and strength of subjective text.
In a specific implementation, step 101 preprocesses the microblog text; the specific steps include extracting the main content, word segmentation and stop-word removal.
This step is well known to those skilled in the art, and will not be described in detail herein.
The step 102 is to extract emotion information based on the step 101, and specifically includes the following steps:
The emoticons of a microblog can be regarded as marks that the microblog carries with it, but these marks contain some errors, so they are called weak marks. Using the weakly marked samples as training samples for supervised machine learning makes it possible to obtain a large number of marked samples quickly and saves the manpower and material cost of manual marking.
The microblogs containing emoticons are selected to form the training set M by detecting whether each microblog in the microblog sample set L contains emoticons; the emoticons are then used to weakly mark the emotional tendency of the samples in the training set M.
In the embodiment of the invention, the SAT (SelfAlternative training) denoising method is used to denoise the training set M and obtain the sample set K. In the SAT method, the training set M is first weakly marked and the weakly marked samples are used to construct an original classifier. The original classifier is then used to detect the training sample set M. Some of the detected marks differ from the original weak marks; these samples are regarded as wrongly marked and are removed from the training sample set M, giving the filtered sample set K.
In the above-mentioned elimination process, although some samples marked correctly are also eliminated, the noise is also reduced. The proportion of the wrongly marked samples in the weakly marked samples can be reduced by iterating for several times, so that the precision of the classifier trained by the sample set can be improved.
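The iterative filtering described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the patent's code: a small hand-written multinomial Naive Bayes with add-one smoothing stands in for the classifier so the example has no dependencies, and the names `train_nb`, `predict_nb` and `sat_denoise`, the `max_iters` cap, and the stopping conditions are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Fit a multinomial Naive Bayes model on (tokens, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest add-one-smoothed log posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total)
        for tok in tokens:
            if tok in vocab:
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def sat_denoise(samples, max_iters=5):
    """Repeatedly train on the current set and drop samples the model disagrees with."""
    current = list(samples)
    for _ in range(max_iters):
        model = train_nb(current)
        kept = [(toks, lab) for toks, lab in current if predict_nb(model, toks) == lab]
        # Assumed stopping rule: stop when nothing was removed, or when
        # removal would leave fewer than two classes to train on.
        if len(kept) == len(current) or len({lab for _, lab in kept}) < 2:
            break
        current = kept
    return current

weak = ([(["好", "开心"], "pos")] * 3 + [(["差", "伤心"], "neg")] * 3
        + [(["好", "开心"], "neg")])  # last sample carries a wrong weak mark
filtered = sat_denoise(weak)
```

In this toy set the one sample whose weak mark contradicts its words is removed in the first pass, and the second pass removes nothing, so the loop stops.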
In summary, the embodiment of the invention uses the emoticon information in microblogs to study emotion analysis for Chinese microblogs, and solves the problem of noise being generated when emoticons are used to weakly mark microblogs.
Example 2
The scheme of Example 1 is further described below with reference to FIGS. 1-3:
The self-training algorithm is a semi-supervised learning method used to address a shortage of labeled samples. Its main idea is as follows: using a supervised learning model, an original classifier is constructed from the few labeled samples available; the original classifier is used to classify the other, unlabeled samples; and the samples with the highest confidence are added to the labeled sample set to expand it. The flow of the self-training algorithm used in the embodiment of the present invention is shown in FIG. 1.
The SAT noise reduction method achieves the purpose of noise reduction of weak marker samples in an iterative training and self-optimization mode, wherein the flow of the SAT algorithm is shown in FIG. 2, and the main idea of the SAT noise reduction method is as follows:
Using the weakly marked sample set L as the training set, a classifier C is trained. The trained classifier C is then used to detect the training set L, and the samples whose detected result differs from the original weak mark form a sample set M. The sample set M is removed from the training set L to obtain a new training set L', that is, L' = L - M. The classifier C is then retrained on L', the optimized training set is detected again, and the process iterates in this way to obtain a sample set with less noise.
FIG. 3 shows how the proportion of correctly marked samples changes in the experiment that verifies the effect of the noise reduction algorithm. It shows that the SAT noise reduction method works well and effectively reduces the noise in the weakly marked samples.
The emotion analysis method adopting the self-training algorithm shown in fig. 1 is specifically divided into the following 6 steps:
201: preprocessing the new sample set L': removing useless information from each microblog and extracting its main content; segmenting words; converting the emoticons into corresponding emotion vocabulary; and removing stop words and low-frequency words that cannot represent text features from the word segmentation result;
202: selecting the samples that contain emoticons to form the training set M, by detecting whether each sample in the microblog sample set (namely the new sample set) L' contains emoticons;
203: using the emoticons to mark the emotional polarity of the samples in the training set M, as weak marks;
In a specific implementation, some microblogs contain several emoticons, possibly with different polarities, so the overall emotional tendency of a microblog is obtained by a weighted calculation over the emotion values of all the emoticons in it. For microblog samples containing emoticons, the rule for judging the emotional tendency is given by formula (1):
where pN and nN denote the weighted sum of the emotion values of the positive emoticons and of the negative emoticons, respectively.
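Formula (1) itself is not reproduced in this text, so the sketch below only illustrates one plausible reading of it: compare pN with nN and take the larger side as the weak mark. The comparison rule, the tie handling (returning `None`), the lexicon `EMOTICON_SCORES` and its weights are all assumptions made for illustration.

```python
# Hypothetical emoticon lexicon: tag -> (polarity, weight). Values are illustrative only.
EMOTICON_SCORES = {
    "[哈哈]": ("pos", 1.0),
    "[赞]": ("pos", 0.5),
    "[泪]": ("neg", 1.0),
    "[怒]": ("neg", 1.5),
}

def weak_label(emoticons):
    """Return 'positive', 'negative', or None from weighted emoticon sums.

    pN is the weighted sum over positive emoticons, nN over negative ones.
    Comparing the two sums is an assumed reading of the rule; a tie yields
    None, meaning the post is left out of the weakly marked set.
    """
    pN = nN = 0.0
    for tag in emoticons:
        if tag not in EMOTICON_SCORES:
            continue  # unknown emoticons contribute nothing
        polarity, weight = EMOTICON_SCORES[tag]
        if polarity == "pos":
            pN += weight
        else:
            nN += weight
    if pN > nN:
        return "positive"
    if nN > pN:
        return "negative"
    return None

mixed = weak_label(["[哈哈]", "[赞]", "[泪]"])   # pN = 1.5, nN = 1.0
angry = weak_label(["[哈哈]", "[怒]"])           # pN = 1.0, nN = 1.5
tied = weak_label(["[哈哈]", "[泪]"])            # pN = 1.0, nN = 1.0
```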
204: denoising the training set M by a denoising method SAT to obtain a sample set K;
the SAT denoising process is described in detail in embodiment 1, and is not described herein again in this embodiment of the present invention.
205: constructing a classifier C by combining the denoised sample set K with a supervised learning method;
The denoised sample set K is used to construct the classifier C with four well-known, well-performing supervised learning methods, BernoulliNB, MultinomialNB, LinearSVC and NuSVC, and the performance of the classifier C is checked on the marked sample set P.
Performance is measured on the test set with accuracy as the evaluation standard, and MultinomialNB and LinearSVC, which perform best among the four classifiers, are selected as the classification models through experimental screening.
206: using accuracy, precision, recall and F value as evaluation criteria, and checking the performance of the classifier C on the marked sample set P.
In summary, compared with supervised machine learning that manually marks samples, the embodiment of the present invention can train using samples with weak marks and update in time according to new microblogs, so that it has stronger timeliness.
Example 3
The feasibility verification of the solutions of examples 1 and 2 is carried out below with reference to fig. 3, which is described in detail below:
Unigram + Bigram, which performs best, is selected as the text feature in the experiment, and MultinomialNB and LinearSVC are each used as the classification model to test the performance of the method. The experiment uses accuracy, precision, recall and F value (F-measure) as evaluation criteria.
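As a small illustration of the Unigram + Bigram text feature, the sketch below counts single tokens together with adjacent token pairs for one segmented post. The function name and the underscore used to join bigrams are hypothetical choices, not details taken from the experiment.

```python
from collections import Counter

def unigram_bigram_features(tokens):
    """Count each token and each pair of adjacent tokens in a segmented post."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(list(tokens) + bigrams)

features = unigram_bigram_features(["今天", "天气", "好"])
```

A post of n tokens yields n unigram and n - 1 bigram occurrences, so the three-token example above produces five feature counts.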
The microblog data set used in the experiment is the one provided for the Chinese microblog emotion tendency analysis and lexical semantic relation extraction evaluation tasks of the 2012 CCF Conference on Natural Language Processing and Chinese Computing (NLP&CC 2012). Seven methods, SVM-SAT, NB-SAT, Lexicon, SVM, NB, SVM-weak and NB-weak, are tested for the accuracy of their emotional tendency judgments. The comparison shows that the accuracy and F value of SVM-SAT and NB-SAT are far higher than those of Lexicon, SVM-weak and NB-weak and close to those of SVM and NB, which indicates that the SAT noise reduction method is effective and that the resulting classifier performs well.
The classifier obtained by this method gives emotion tendency results close to those of a classifier trained on manually marked samples, because after several iterations of SAT noise reduction the proportion of wrongly marked samples is greatly reduced.
To measure the noise reduction effect, a number of manually marked samples containing emoticons are selected in the experiment. The emoticons are used to obtain weak marks for these samples, MultinomialNB is used as the classification model, Unigram + Bigram is used as the text feature, and the SAT algorithm is used to denoise the data. After each iteration, the numbers of correctly and wrongly marked samples among the remaining samples are counted and the proportion of correct marks is calculated, as shown in FIG. 3.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (3)
1. A Chinese microblog-oriented emotion analysis method is characterized by comprising the following steps:
selecting a training set M containing emoticons by detecting whether the preprocessed microblog sample set L' contains the emoticons;
using the emoticons to mark the emotional polarity of the samples in the training set M as weak marks, and using the weakly marked samples as training samples for supervised machine learning; denoising the training set M with the SAT denoising method to obtain a sample set K;
constructing a classifier C by combining the denoised sample set K with a supervised learning method; using the accuracy, precision, recall and F value as evaluation criteria, and detecting the precision of the classifier C through the marked sample set P;
the method further comprises the following steps: automatically updating according to the weakly marked sample;
the emotion polarity of the sample in the training set M is marked by using the emoticon, and the weak mark is specifically as follows:
wherein pN, nN represent the weighted sum of the positive emoticons and the weighted sum of the negative emoticon emotion values, respectively;
the noise reduction of the training set M by the noise reduction method SAT is carried out, and the obtained sample set K specifically comprises the following steps:
weak labeling is carried out on a training set M, an original classifier is constructed by using weak labeled samples, and the training set M is detected by using the original classifier;
taking the samples whose detected mark differs from the original weak mark as wrongly marked, and removing them from the training set M to obtain the filtered sample set K.
2. The method for analyzing emotion facing to Chinese microblogs, according to claim 1, wherein the preprocessing specifically comprises:
extracting the content of the microblog from the new sample set; converting the emoticons into corresponding emotion vocabularies; and removing stop words and low-frequency words which cannot represent text features in the word segmentation result.
3. The emotion analysis method for Chinese microblogs according to claim 1, wherein constructing the classifier C from the noise-reduced sample set K in combination with a supervised learning method specifically comprises:
constructing the classifier C from the denoised sample set K with four known supervised learning methods: BernoulliNB, MultinomialNB, LinearSVC and NuSVC.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810304972.8A CN108681532B (en) | 2018-04-08 | 2018-04-08 | Sentiment analysis method for Chinese microblog |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108681532A (en) | 2018-10-19 |
CN108681532B (en) | 2022-03-25 |
Family
ID=63800728
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810304972.8A Active CN108681532B (en) | 2018-04-08 | 2018-04-08 | Sentiment analysis method for Chinese microblog |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108681532B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109492226B (en) * | 2018-11-10 | 2023-03-24 | 上海五节数据科技有限公司 | Method for improving low text pre-segmentation accuracy rate of emotional tendency proportion |
CN111339306B (en) | 2018-12-18 | 2023-05-12 | 腾讯科技(深圳)有限公司 | Classification model training method, classification method and device, equipment and medium |
CN109977814A (en) * | 2019-03-13 | 2019-07-05 | 武汉大学 | A kind of AdaBoost pedestrian detection method based on unification LBP |
CN116580847B (en) * | 2023-07-14 | 2023-11-28 | 天津医科大学总医院 | Method and system for predicting prognosis of septic shock |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605658A (en) * | 2013-10-14 | 2014-02-26 | 北京航空航天大学 | Search engine system based on text emotion analysis |
CN103761239A (en) * | 2013-12-09 | 2014-04-30 | 国家计算机网络与信息安全管理中心 | Method for performing emotional tendency classification to microblog by using emoticons |
CN105354305A (en) * | 2015-11-05 | 2016-02-24 | 北京邮电大学 | Online-rumor identification method and apparatus |
CN106446147A (en) * | 2016-09-20 | 2017-02-22 | 天津大学 | Emotion analysis method based on structuring features |
CN106569996A (en) * | 2016-03-30 | 2017-04-19 | 广东工业大学 | Chinese-microblog-oriented emotional tendency analysis method |
CN106598942A (en) * | 2016-11-17 | 2017-04-26 | 天津大学 | Expression analysis and deep learning-based social network sentiment analysis method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160189037A1 (en) * | 2014-12-24 | 2016-06-30 | Intel Corporation | Hybrid technique for sentiment analysis |
CN105205124B (en) * | 2015-09-11 | 2016-11-30 | 合肥工业大学 | A kind of semi-supervised text sentiment classification method based on random character subspace |
CN107169001A (en) * | 2017-03-31 | 2017-09-15 | 华东师范大学 | A kind of textual classification model optimization method based on mass-rent feedback and Active Learning |
CN107451118A (en) * | 2017-07-21 | 2017-12-08 | 西安电子科技大学 | Sentence-level sensibility classification method based on Weakly supervised deep learning |
Non-Patent Citations (3)
Title |
---|
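Research on sentiment analysis based on Chinese microblogs; Xu Shuai; China Master's Theses Full-text Database, Information Science and Technology; 20140615 (No. 6); pp. I139-197, 41-42, 46-48 * |
Microblog sentiment analysis method based on text semantics and emoticon tendency; Wang Wen et al.; Journal of Nanjing University of Science and Technology; 20141231; Vol. 38 (No. 6); pp. 733, 735-736 * |
Xu Shuai. Research on sentiment analysis based on Chinese microblogs. China Master's Theses Full-text Database, Information Science and Technology. 2014, (No. 6), pp. I139-197. * |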
Also Published As
Publication number | Publication date |
---|---|
CN108681532A (en) | 2018-10-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107133220B (en) | Geographic science field named entity identification method | |
CN105868184B (en) | A kind of Chinese personal name recognition method based on Recognition with Recurrent Neural Network | |
CN108681532B (en) | Sentiment analysis method for Chinese microblog | |
CN108804612B (en) | Text emotion classification method based on dual neural network model | |
CN108268447B (en) | Labeling method for Tibetan named entities | |
CN107491435B (en) | Method and device for automatically identifying user emotion based on computer | |
CN108563638B (en) | Microblog emotion analysis method based on topic identification and integrated learning | |
Alotaibi et al. | Optical character recognition for quranic image similarity matching | |
CN110188195B (en) | Text intention recognition method, device and equipment based on deep learning | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
CN112434164B (en) | Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration | |
CN111462752B (en) | Attention mechanism, feature embedding and BI-LSTM (business-to-business) based customer intention recognition method | |
CN113505200A (en) | Sentence-level Chinese event detection method combining document key information | |
CN112417132B (en) | New meaning identification method for screening negative samples by using guest information | |
Zhou et al. | Minimum-risk training for semi-Markov conditional random fields with application to handwritten Chinese/Japanese text recognition | |
CN105912525A (en) | Sentiment classification method for semi-supervised learning based on theme characteristics | |
CN108763192B (en) | Entity relation extraction method and device for text processing | |
CN114416979A (en) | Text query method, text query equipment and storage medium | |
CN113723083A (en) | Weighted negative supervision text emotion analysis method based on BERT model | |
CN113051887A (en) | Method, system and device for extracting announcement information elements | |
CN112069312A (en) | Text classification method based on entity recognition and electronic device | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN113486174B (en) | Model training, reading understanding method and device, electronic equipment and storage medium | |
CN111209373A (en) | Sensitive text recognition method and device based on natural semantics | |
CN114416991A (en) | Method and system for analyzing text emotion reason based on prompt |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||