CN106569996B - A sentiment orientation analysis method for Chinese microblogs
- Publication number
- CN106569996B (application CN201610898432.8A)
- Authority
- CN
- China
- Prior art keywords
- microblogging
- module
- word
- model
- sentiment orientation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Abstract
The invention discloses a sentiment orientation analysis method for Chinese microblogs. It comprises the following modules: an undersampling module, a microblog text preprocessing module, a Word2vec-based microblog expansion module, a feature extraction module, a sentiment analysis model training module, and a sentiment orientation discrimination module. The trained AWCRF model is finally used to determine the sentiment orientation of the microblogs to be predicted. The advantage of the invention is that it can effectively solve the sentiment orientation classification problem for Chinese microblog datasets whose sentiment orientation distribution is imbalanced; it is simple to implement, achieves a high recognition rate, and has strong practical value.
Description
Technical field
The invention belongs to the technical field of network information processing, and specifically relates to a sentiment orientation analysis method for Chinese microblogs.
Background art
As a new social platform, microblogging is popular with many users. More and more people use microblogs to express their opinions, so thoroughly analyzing and mining the sentiment in users' microblogs is of great significance. The purpose of sentiment analysis is to mine users' opinions from microblog text and identify their sentiment orientation. For example, an enterprise can learn from microblogs how users evaluate its products and services. As with traditional sentiment analysis work, sentiment analysis methods for microblogs can be divided into two categories. One category comprises methods based on sentiment dictionaries and rules, which identify sentiment orientation according to the number of positive and negative sentiment words in a sentence. The other comprises methods based on machine learning, which train a classifier by selecting suitable features.
However, the above methods all ignore the influence that an imbalanced sentiment orientation distribution in Chinese microblog datasets has on sentiment classification. That is, when the numbers of negative and positive sentences in a dataset differ greatly, the discrimination accuracy of the classifier suffers. In real life, the topics or events discussed on microblogs often carry a strong sentiment orientation themselves, which makes the sentiment orientation distribution of many topics imbalanced. For example, topics such as "#edible oil price rise#" and "#leather shoe jelly#" carry an obvious negative sentiment, whereas the topic "#Tu Youyou wins prize#" carries an obvious positive sentiment. This imbalance in the sentiment orientation distribution of the data is a key factor that causes many machine learning algorithms to perform poorly, especially in recognizing the minority sentiment class. In addition, microblogs are generally much shorter than traditional texts, which makes it difficult for conventional methods to extract information useful for sentiment classification from them, and there is currently no sentiment dictionary large enough to cover all sentiment words.
Summary of the invention
To solve the above problems, the invention proposes a sentiment orientation analysis method for Chinese microblogs, whose main steps are as follows:
(1) Undersampling module. The Affinity Propagation algorithm is used to reduce the number of majority-class samples in the training set so as to balance the training set, thereby reducing the influence of the imbalanced sentiment orientation distribution of the dataset on classification performance.
(2) Microblog text preprocessing module. The microblog text is cleaned, and then word segmentation, part-of-speech tagging, stop-word processing, and similar operations are performed.
(3) Word2vec-based microblog expansion module. Each microblog is expanded by using Word2vec to find the top k similar words of every word in it.
(4) Feature extraction module. The relevant dictionaries are loaded, and features are extracted from the preprocessed microblogs.
(5) Sentiment analysis model training module. The AWCRF model is trained on the balanced and expanded training set.
(6) Sentiment orientation discrimination module. The trained AWCRF model is used to determine the sentiment orientation of the microblogs to be predicted.
Description of the drawings
Fig. 1 is the analysis flow chart of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawing. The invention addresses the sentiment orientation classification problem for Chinese microblog datasets with an imbalanced sentiment orientation distribution. Fig. 1 shows the overall algorithm flow of the invention.
The details of each step are described below:
1. Undersampling module
The invention uses the Affinity Propagation algorithm to reduce the number of majority-class samples in the training set so as to balance the training set.
The undersampling procedure consists of the following steps:
(1) Given a training set t1, divide it into a majority class maj1 and a minority class min1;
(2) Cluster the majority class maj1 into several clusters using the Affinity Propagation clustering algorithm, denoted C = {c1, c2, ..., cn};
(3) To construct a balanced dataset, randomly select samples from each cluster of C in proportion to obtain maj2, such that the sample size of maj2 is close to that of min1;
(4) Merge the datasets maj2 and min1 to obtain a balanced training set t2;
(5) Use the balanced training set t2 in place of t1 as the final training set.
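Steps (3) to (5) above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: it assumes the cluster label of every majority-class sample has already been computed (for example with an Affinity Propagation implementation such as scikit-learn's `AffinityPropagation`), that the majority class is non-empty, and that "in proportion" means each cluster keeps the same fraction of its members. The function names, the `seed` parameter, and the rule of keeping at least one sample per cluster are assumptions made for illustration.

```python
import random
from collections import defaultdict


def undersample_by_cluster(majority, cluster_ids, target_size, seed=0):
    """Pick roughly `target_size` majority samples, proportionally per cluster.

    `cluster_ids[i]` is the cluster label of `majority[i]`. The clustering
    itself (Affinity Propagation in the text) is assumed to be done already.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, cid in zip(majority, cluster_ids):
        clusters[cid].append(sample)

    # Fraction of each cluster to keep so that the total is near target_size.
    ratio = min(1.0, target_size / len(majority))
    selected = []
    for cid in sorted(clusters):
        members = clusters[cid]
        # Assumption: keep at least one sample so no cluster disappears.
        quota = max(1, round(len(members) * ratio))
        selected.extend(rng.sample(members, quota))
    return selected


def balance_training_set(maj1, min1, cluster_ids, seed=0):
    """Build t2 = maj2 + min1, where |maj2| is close to |min1|."""
    maj2 = undersample_by_cluster(maj1, cluster_ids, len(min1), seed)
    return maj2 + min1
```

Because each cluster is rounded separately, the size of maj2 is only approximately equal to the size of min1, which matches the "close" wording in step (3).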
2. Microblog text preprocessing module
The main work of this module is to clean the microblog text and then perform word segmentation, part-of-speech tagging, stop-word processing, and similar operations.
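The cleaning and stop-word steps might look like the sketch below. The source does not say exactly what cleaning removes, so stripping URLs and @mentions is an assumption, as are the function names and the regular expressions. Word segmentation and part-of-speech tagging are assumed to be done by an external Chinese NLP tool and are represented here only by their output, a list of (word, tag) pairs.

```python
import re

# Assumption: URLs are ASCII-only, so adjacent Chinese text is not swallowed.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&%_#~:+-]+")
# Assumption: a mention is "@" followed by word characters or hyphens and is
# terminated by whitespace or punctuation.
MENTION_RE = re.compile(r"@[\w\-]+")
SPACE_RE = re.compile(r"\s+")


def clean_microblog(text):
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def remove_stop_words(tagged_tokens, stop_words):
    """Drop stop words from a list of (word, pos_tag) pairs."""
    return [(word, tag) for word, tag in tagged_tokens if word not in stop_words]
```

Keeping the part-of-speech tag next to each word lets later stages use it as a feature if desired.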
3. Word2vec-based microblog expansion module
The invention expands each microblog by using Word2vec to find the top k similar words of every word in it. This involves two steps: training word vectors and expanding the microblog.
(1) Training word vectors. A large microblog corpus was collected through the Sina Weibo API; useless symbols and URLs were filtered out, leaving 5 GB of microblog data as the training set. Word vectors are then trained with the Skip-gram model in Word2vec, and the similar words of each word in a microblog are finally obtained from these word vectors.
(2) Expanding the microblog. First, given a microblog sentence ti, word segmentation yields its word sequence, denoted {w1, w2, ..., wn}. Then the trained word vectors are used to find the top k similar words of each word in ti, thereby expanding the microblog. The expanded microblog can be denoted {w1, w2, ..., wn, w11, w12, ..., w1k, w21, w22, ..., w2k, ..., wn1, wn2, ..., wnk}, where {w11, w12, ..., w1k} are the top k similar words of w1. Emoticons and punctuation marks in the microblog are retained as they are, so the expanded microblog contains more information than the original.
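The expansion step can be illustrated with a small self-contained sketch. It assumes the word vectors are already trained and available as a plain dictionary from word to vector, and it ranks similar words by cosine similarity, which is the usual Word2vec similarity measure but is not stated in the source. The function names and the tie-breaking rule are hypothetical. Tokens without a vector, such as emoticons and punctuation, are kept but not expanded.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(word, vectors, k):
    """Return the k vocabulary words closest to `word` by cosine similarity."""
    if word not in vectors:
        return []  # emoticons, punctuation, and unknown words are not expanded
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def expand_microblog(words, vectors, k):
    """Return {w1..wn} followed by the top-k similar words of each wi."""
    expanded = list(words)
    for word in words:
        expanded.extend(top_k_similar(word, vectors, k))
    return expanded
```

With a real model the lookup would more likely go through a library; for instance, gensim's `Word2Vec` trains Skip-gram vectors when constructed with `sg=1` and exposes `model.wv.most_similar(word, topn=k)` for the nearest-neighbour query.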
4. Feature extraction module
(1) Load the relevant dictionaries, including a sentiment dictionary, an emoticon dictionary, a popular-word dictionary, a negation-word dictionary, and so on, for use in feature extraction.
(2) Using the loaded dictionaries, extract the predefined features from the preprocessed microblog text, vectorize the text, and convert it into a format that the sentiment analysis model training module can process.
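The source does not list the predefined features, so the sketch below uses a purely hypothetical feature set: one count per dictionary. The lexicon names, the split of the sentiment dictionary into positive and negative parts, and the function names are all assumptions made for illustration.

```python
# Hypothetical lexicon names; the source only lists the dictionary types.
LEXICON_NAMES = ("positive", "negative", "negation", "emoticon", "popular")


def extract_features(words, lexicons):
    """Count how many words of a segmented microblog fall in each lexicon."""
    features = {f"n_{name}": 0 for name in LEXICON_NAMES}
    for word in words:
        for name in LEXICON_NAMES:
            if word in lexicons.get(name, ()):
                features[f"n_{name}"] += 1
    return features


def to_vector(features):
    """Flatten the feature dict into a fixed-order numeric vector."""
    return [features[f"n_{name}"] for name in LEXICON_NAMES]
```

A dict of named features is also the input shape that common CRF toolkits accept, so `extract_features` could feed a trainer directly while `to_vector` serves models that expect fixed-length arrays.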
5. Sentiment analysis model training module
The invention applies a CRF model to the data processed by the undersampling and Word2vec expansion described above, which yields the AWCRF model. The feature vectors that the feature extraction module extracts from the microblogs are then used as input, and the AWCRF model is trained with the L-BFGS algorithm. The model not only overcomes the influence of an imbalanced sentiment distribution in the training set, but also enriches the sentiment information of microblog sentences, which alleviates the limited coverage of sentiment dictionaries. In addition, because there are fewer training samples, the model trains quickly and efficiently, which gives it strong practical value.
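The source names the CRF model and the L-BFGS optimizer but gives no formulas. For orientation, the standard linear-chain CRF and the objective that L-BFGS typically maximizes are written out below. The linear-chain form, the Gaussian (L2) regularizer with variance σ², and the symbols x, y, f_k, and λ_k are assumptions based on the textbook formulation, not details taken from the patent.

```latex
% Conditional probability of a label sequence y given observations x
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)}
  \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)

% Regularized log-likelihood over N training pairs (x^{(i)}, y^{(i)})
\mathcal{L}(\lambda) = \sum_{i=1}^{N} \log p_\lambda\big(y^{(i)} \mid x^{(i)}\big)
  - \frac{1}{2\sigma^2} \sum_{k} \lambda_k^2

% Gradient used by L-BFGS: empirical minus expected feature counts
\frac{\partial \mathcal{L}}{\partial \lambda_k}
  = \sum_{i=1}^{N} \sum_{t} f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}, t\big)
  - \sum_{i=1}^{N} \sum_{t} \mathbb{E}_{y \sim p_\lambda(\cdot \mid x^{(i)})}
      \big[ f_k(y_{t-1}, y_t, x^{(i)}, t) \big]
  - \frac{\lambda_k}{\sigma^2}
```

L-BFGS needs only the objective value and this gradient at each iteration, and it approximates curvature from a short history of past gradients, which is why it is a common choice for CRF training.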
6. Sentiment orientation discrimination module
First, the data to be predicted undergo text preprocessing, Word2vec expansion, feature extraction, and similar operations, yielding the feature vectors of the data to be predicted. These feature vectors are then fed to the AWCRF model, and the trained AWCRF model determines the sentiment orientation of the microblogs to be predicted.
Claims (2)
1. A sentiment orientation analysis method for Chinese microblogs, characterized by comprising the following modules:
(1) an undersampling module, which uses the Affinity Propagation algorithm to reduce the number of majority-class samples in the training set so as to balance the training set, thereby reducing the influence of the imbalanced sentiment orientation distribution of the dataset on classification performance, specifically as follows:
S101) given a training set t1, dividing it into a majority class maj1 and a minority class min1;
S102) clustering the majority class maj1 into several clusters using the Affinity Propagation clustering algorithm, denoted C = {c1, c2, ..., cn};
S103) to construct a balanced dataset, randomly selecting samples from each cluster of C in proportion to obtain maj2, such that the sample size of maj2 is close to that of min1;
S104) merging the datasets maj2 and min1 to obtain a balanced training set t2;
S105) using the balanced training set t2 in place of t1 as the final training set;
(2) a microblog text preprocessing module, which cleans the microblog text and performs word segmentation, part-of-speech tagging, and stop-word processing;
(3) a Word2vec-based microblog expansion module, which expands each microblog by using Word2vec to find the top k similar words of every word in it, specifically:
S301) training word vectors: a large microblog corpus is collected through the Sina Weibo API, useless symbols and URLs are filtered out, and the remaining 5 GB of microblog data are used as the training set; word vectors are then trained with the Skip-gram model in Word2vec, and the similar words of each word in a microblog are finally obtained from these word vectors;
S302) expanding the microblog: first, given a microblog sentence ti, word segmentation yields its word sequence, denoted {w1, w2, ..., wn}; then the trained word vectors are used to find the top k similar words of each word in ti, thereby expanding the microblog; the expanded microblog can be denoted {w1, w2, ..., wn, w11, w12, ..., w1k, w21, w22, ..., w2k, ..., wn1, wn2, ..., wnk}, where {w11, w12, ..., w1k} are the top k similar words of w1; emoticons and punctuation marks in the microblog are retained as they are, so the expanded microblog contains more information than the original;
(4) a feature extraction module, which loads the relevant dictionaries and extracts features from the preprocessed microblogs;
(5) a sentiment analysis model training module, which trains the AWCRF model on the balanced and expanded training set, specifically comprising the following steps:
S501) applying a CRF model to the data processed by the undersampling and Word2vec expansion above to obtain the AWCRF model;
S502) using the feature vectors that the feature extraction module extracts from the microblogs as input, and training the AWCRF model with the L-BFGS algorithm;
(6) a sentiment orientation discrimination module, which uses the trained AWCRF model to determine the sentiment orientation of the microblogs to be predicted.
2. The method according to claim 1, characterized in that step (6) further comprises the following steps:
S601) performing text preprocessing, Word2vec expansion, feature extraction, and similar operations on the data to be predicted, thereby obtaining the feature vectors of the data to be predicted;
S602) using the feature vectors of the data to be predicted as input to the AWCRF model, and using the trained AWCRF model to determine the sentiment orientation of the microblogs to be predicted.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610192686 | 2016-03-30 | ||
CN2016101926868 | 2016-03-30 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106569996A (en) | 2017-04-19 |
CN106569996B (en) | 2019-06-21 |
Family
ID=58532883
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201610898432.8A | Active | CN106569996B (en) | 2016-03-30 | 2016-10-14 | A sentiment orientation analysis method for Chinese microblogs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106569996B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10394959B2 (en) | 2017-12-21 | 2019-08-27 | International Business Machines Corporation | Unsupervised neural based hybrid model for sentiment analysis of web/mobile application using public data sources |
CN108304568B (en) * | 2018-02-12 | 2021-01-05 | 郑长敬 | Real estate public expectation big data processing method and system |
CN108681532B (en) * | 2018-04-08 | 2022-03-25 | 天津大学 | Sentiment analysis method for Chinese microblog |
CN111460158B (en) * | 2020-04-01 | 2022-09-23 | 安徽理工大学 | Microblog topic public emotion prediction method based on emotion analysis |
CN111611455A (en) * | 2020-05-22 | 2020-09-01 | 安徽理工大学 | User group division method based on user emotional behavior characteristics under microblog hot topics |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894102A (en) * | 2010-07-16 | 2010-11-24 | 浙江工商大学 | Method and device for analyzing emotion tendentiousness of subjective text |
CN104462065A (en) * | 2014-12-15 | 2015-03-25 | 北京国双科技有限公司 | Event emotion type analyzing method and device |
CN104899298A (en) * | 2015-06-09 | 2015-09-09 | 华东师范大学 | Microblog sentiment analysis method based on large-scale corpus characteristic learning |
Non-Patent Citations (2)
Title |
---|
A Study on Sentiment Computing and Classification of Sina Weibo with Word2vec;Bai Xue 等;《2014 IEEE International Congress on Big Data》;20140702;第358-363页 |
基于词亲和度的微博词语语义倾向识别算法;唐浩浩 等;《数据采集与处理》;20150113;第30卷(第1期);第137-147页 |
Also Published As
Publication number | Publication date |
---|---|
CN106569996A (en) | 2017-04-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||