CN106569996B - Sentiment orientation analysis method for Chinese microblogs - Google Patents

Sentiment orientation analysis method for Chinese microblogs

Info

Publication number
CN106569996B
Authority
CN
China
Prior art keywords
microblog
module
word
model
sentiment orientation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610898432.8A
Other languages
Chinese (zh)
Other versions
CN106569996A (en)
Inventor
郝志峰
梁礼欣
蔡瑞初
温雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/247 - Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sentiment orientation analysis method for Chinese microblogs. The method comprises the following modules: an undersampling module, a microblog text preprocessing module, a Word2vec-based microblog expansion module, a feature extraction module, a sentiment analysis model training module, and a sentiment orientation discrimination module. The trained AWCRF model is finally used to determine the sentiment orientation of the microblogs to be predicted. The advantage of the invention is that it can effectively solve the sentiment orientation classification problem for Chinese microblog datasets whose sentiment orientation distribution is imbalanced. It is simple to implement, achieves a high recognition rate, and has strong practical value.

Description

Sentiment orientation analysis method for Chinese microblogs
Technical field
The invention belongs to the technical field of network information processing, and specifically relates to a sentiment orientation analysis method for Chinese microblogs.
Background
Microblogs, as a new kind of social platform, are popular with many users. More and more people use microblogs to express their opinions, so it is valuable to thoroughly analyze and mine the sentiment in users' microblogs. The goal of sentiment analysis is to mine users' opinions from microblog text and identify their sentiment orientation. For example, a company can use microblogs to learn how users rate its products and services. As in traditional sentiment analysis work, sentiment analysis methods for microblogs fall into two classes. The first is based on sentiment dictionaries and rules: it identifies the sentiment orientation from the numbers of positive and negative sentiment words in a sentence. The second is based on machine learning: it trains a classifier on suitably selected features.
However, both classes of methods ignore the effect that an imbalanced sentiment orientation distribution in a Chinese microblog dataset has on sentiment classification. That is, when the number of negative sentences in the dataset differs greatly from the number of positive sentences, the accuracy of the classifier suffers. In practice, the topics or events discussed on microblogs often carry a strong sentiment tendency of their own, which leaves the sentiment orientation distribution of many topics imbalanced. For example, topics such as "#edible oil price increase#" and "#leather-shoe jelly#" are clearly negative, whereas the topic "#Tu Youyou wins prize#" is clearly positive. This imbalance is a key reason why many machine learning algorithms perform poorly, especially at recognizing the minority sentiment class. In addition, microblogs are generally much shorter than traditional texts, which makes it hard for conventional methods to extract information useful for sentiment classification, and no existing sentiment dictionary is large enough to cover all sentiment words.
Summary of the invention
To solve the above problems, the invention proposes a sentiment orientation analysis method for Chinese microblogs. Its main steps are as follows:
(1) Undersampling module. The Affinity Propagation algorithm is used to reduce the number of majority-class samples in the training set and thereby balance it, which reduces the effect of the imbalanced sentiment orientation distribution of the dataset on classification performance.
(2) Microblog text preprocessing module. The microblog text is cleaned, then segmented into words, tagged with parts of speech, and filtered for stop words.
(3) Word2vec-based microblog expansion module. Word2vec is used to find the top $k$ most similar words for each word in a microblog, and these words are added to expand the microblog.
(4) Feature extraction module. The relevant dictionaries are loaded, and features are extracted from the preprocessed microblogs.
(5) Sentiment analysis model training module. The AWCRF model is trained on the balanced and expanded training set.
(6) Sentiment orientation discrimination module. The trained AWCRF model is used to determine the sentiment orientation of the microblogs to be predicted.
Brief description of the drawings
Fig. 1 is the analysis flowchart of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawing. The invention addresses the sentiment orientation classification problem for Chinese microblog datasets whose sentiment orientation distribution is imbalanced. Fig. 1 shows the overall algorithm flow of the invention.
Each step is described in detail below:
1. Undersampling module
The invention uses the Affinity Propagation algorithm to reduce the number of majority-class samples in the training set and thereby balance it.
The undersampling procedure consists of the following steps:
(1) Given a training set $t_1$, divide it into a majority class $maj_1$ and a minority class $min_1$.
(2) Cluster the majority class $maj_1$ into several clusters with the Affinity Propagation clustering algorithm; the result can be written as $C = \{c_1, c_2, \ldots, c_n\}$.
(3) To construct a balanced dataset, randomly select samples from each cluster of $C$ in proportion to obtain $maj_2$, such that the number of samples in $maj_2$ is close to the number of samples in $min_1$.
(4) Combine the datasets $maj_2$ and $min_1$ to obtain a balanced training set $t_2$.
(5) Use the balanced training set $t_2$ in place of $t_1$ as the final training set.
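Steps (3) to (5) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the cluster assignment of each majority sample is taken as given (it could come from any Affinity Propagation implementation, for example scikit-learn's `AffinityPropagation`), and the function names, the `seed` parameter, and the rule of keeping at least one sample per cluster are hypothetical choices.

```python
import random
from collections import defaultdict


def undersample_by_cluster(majority, cluster_ids, target_size, seed=0):
    """Draw roughly target_size majority samples, proportionally per cluster.

    majority:    list of majority-class samples (maj_1)
    cluster_ids: cluster label of each majority sample (assumed precomputed)
    target_size: desired size of maj_2, e.g. len(min_1)
    """
    if not majority:
        return []
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, cid in zip(majority, cluster_ids):
        clusters[cid].append(sample)
    # Fraction of each cluster to keep so that the total is close to target_size.
    ratio = min(1.0, target_size / len(majority))
    selected = []
    for cid in sorted(clusters):
        members = clusters[cid]
        # Assumption: keep at least one sample so every cluster stays represented.
        quota = min(len(members), max(1, round(ratio * len(members))))
        selected.extend(rng.sample(members, quota))
    return selected


def balance_training_set(majority, cluster_ids, minority, seed=0):
    """Build t_2 = maj_2 + min_1 from the majority clusters and the minority class."""
    maj2 = undersample_by_cluster(majority, cluster_ids, len(minority), seed)
    return maj2 + list(minority)
```

With 12 majority samples split into clusters of 8 and 4 and a minority class of 3, the sketch keeps 2 and 1 samples from the two clusters, so the balanced set has 6 samples. Because of rounding and the one-per-cluster rule, the size of $maj_2$ is only close to that of $min_1$, which matches the wording of step (3).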
2. Microblog text preprocessing module
The main work of this module is to clean the microblog text and then perform word segmentation, part-of-speech tagging, stop-word filtering, and similar operations.
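The cleaning and stop-word steps might look like the sketch below. The patent does not list its cleaning rules, so the patterns here (removing URLs and @-mentions) are assumptions, as are the function names. Word segmentation and part-of-speech tagging are assumed to be done by an external Chinese NLP tool, whose output is passed in as `(word, tag)` pairs.

```python
import re

# Assumed cleaning rules: drop ASCII URLs and @-mentions, collapse whitespace.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&_%#:\-]+")
MENTION_RE = re.compile(r"@[\w\-]+")
SPACE_RE = re.compile(r"\s+")


def clean_microblog(text):
    """Remove URLs and @-mentions, then normalize whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def remove_stop_words(tagged_tokens, stop_words):
    """Filter stop words from (word, pos_tag) pairs produced by a segmenter."""
    return [(word, tag) for word, tag in tagged_tokens if word not in stop_words]
```

Note that `MENTION_RE` consumes every word character after `@`, so a mention that is not followed by a space or punctuation would also swallow the text after it; a production rule set would need to handle that case.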
3. Word2vec-based microblog expansion module
The invention uses Word2vec to find the top $k$ most similar words for each word in a microblog and adds them to expand the microblog. This involves two steps: training the word vectors and expanding the microblog.
(1) Training the word vectors. We collected a large microblog corpus through the Sina Weibo API and filtered out useless symbols and URLs, leaving 5 GB of microblog data as the training set. The Skip-gram model in Word2vec is then used to train the word vectors, and these word vectors are finally used to find the similar words of each word in a microblog.
(2) Expanding the microblog. First, given a microblog sentence $t_i$, word segmentation yields its word sequence, written as $\{w_1, w_2, \ldots, w_n\}$. Then the trained word vectors are used to find the top $k$ most similar words for each word in $t_i$, which expands the microblog. The expanded microblog can be written as $\{w_1, w_2, \ldots, w_n, w_{11}, w_{12}, \ldots, w_{1k}, w_{21}, w_{22}, \ldots, w_{2k}, \ldots, w_{n1}, w_{n2}, \ldots, w_{nk}\}$, where $\{w_{11}, w_{12}, \ldots, w_{1k}\}$ are the top $k$ most similar words of $w_1$. Emoticons and punctuation marks in the microblog are retained as they are. The expanded microblog therefore contains more information than the original.
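The expansion step can be illustrated with a small sketch. It assumes the word vectors have already been trained and are available as a plain dictionary; the toy two-dimensional vectors, the use of cosine similarity as the similarity measure, and the function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(word, vectors, k):
    """Return the k words most similar to `word`; empty if `word` has no vector."""
    if word not in vectors:
        return []
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def expand_microblog(words, vectors, k):
    """Append the top-k similar words of every word to the original sequence.

    Tokens without a vector (emoticons, punctuation) are kept unchanged.
    """
    expanded = list(words)
    for word in words:
        expanded.extend(top_k_similar(word, vectors, k))
    return expanded
```

For example, with toy vectors `{"好": [1.0, 0.0], "棒": [0.9, 0.1], "赞": [0.8, 0.3], "差": [-1.0, 0.0]}`, calling `expand_microblog(["好", "!"], vectors, 2)` keeps both original tokens and appends "棒" and "赞".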
4. Feature extraction module
(1) Load the relevant dictionaries for feature extraction, including the sentiment dictionary, the emoticon dictionary, the popular-word dictionary, the negation-word dictionary, and so on.
(2) Using the loaded dictionary data, extract the predefined features from the preprocessed microblog text, vectorize the text, and convert it into a format that the sentiment analysis model training module can process.
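A dictionary-based feature extractor could be sketched as below. The patent does not list its predefined features, so the count-per-dictionary features, the lexicon names, and the output format are all hypothetical; real dictionaries would be loaded from files rather than written inline.

```python
def extract_features(words, lexicons):
    """Count how many tokens of a microblog appear in each dictionary.

    words:    list of tokens from a preprocessed (and expanded) microblog
    lexicons: mapping from a dictionary name to a set of entries
    Returns a flat feature dictionary (hypothetical feature set).
    """
    features = {
        f"{name}_count": sum(1 for word in words if word in entries)
        for name, entries in lexicons.items()
    }
    features["length"] = len(words)
    return features
```

A call such as `extract_features(["不", "好", "[哈哈]", "给力"], lexicons)` with small positive, negative, negation, emoticon, and popular-word sets yields one count per dictionary plus the token count, which can then be serialized into whatever input format the chosen CRF toolkit expects.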
5. Sentiment analysis model training module
The invention applies the CRF model to data processed by the undersampling technique and the Word2vec expansion described above, which yields the AWCRF model. The feature vectors that the feature extraction module extracts from the microblogs are then used as input, and the AWCRF model is trained with the L-BFGS algorithm. The model can overcome the effect of an imbalanced sentiment distribution in the training set, and it can enrich the sentiment information of a microblog sentence, which mitigates the limited coverage of sentiment dictionaries. In addition, because undersampling leaves fewer training samples, the model trains quickly and efficiently, which gives it strong practical value.
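The patent does not write out the CRF or its training objective. As an assumption, the standard linear-chain CRF and an L2-regularized negative log-likelihood are shown below to indicate what L-BFGS would minimize; the feature functions $f_j$, the weights $\lambda_j$, and the regularization strength $\sigma$ are notation introduced here, not taken from the source.

```latex
% Conditional probability of a label sequence y given an observation sequence x
p(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\lambda})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{t=1}^{T} \sum_{j} \lambda_j \, f_j(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \exp\Big( \sum_{t=1}^{T} \sum_{j} \lambda_j \, f_j(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)

% Objective minimized by L-BFGS over N training sequences (L2 term is an assumption)
L(\boldsymbol{\lambda})
  = - \sum_{i=1}^{N} \log p\big(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\lambda}\big)
    + \frac{\sigma}{2} \lVert \boldsymbol{\lambda} \rVert^2

% Gradient supplied to L-BFGS, with F_j(y, x) = \sum_t f_j(y_{t-1}, y_t, x, t)
\frac{\partial L}{\partial \lambda_j}
  = - \sum_{i=1}^{N} \Big( F_j\big(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}\big)
      - \mathbb{E}_{p(\mathbf{y} \mid \mathbf{x}^{(i)}; \boldsymbol{\lambda})}
        \big[ F_j\big(\mathbf{y}, \mathbf{x}^{(i)}\big) \big] \Big)
    + \sigma \lambda_j
```

L-BFGS is a quasi-Newton method that needs only the objective value and this gradient at each iteration, which is why it is a common choice for CRF training.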
6. Sentiment orientation discrimination module
First, the data to be predicted go through text preprocessing, Word2vec expansion, and feature extraction, which yields their feature vectors. These feature vectors are then fed to the trained AWCRF model, which determines the sentiment orientation of the microblogs to be predicted.

Claims (2)

1. A sentiment orientation analysis method for Chinese microblogs, characterized by comprising the following modules:
(1) an undersampling module, which uses the Affinity Propagation algorithm to reduce the number of majority-class samples in the training set and thereby balance the training set, so as to reduce the effect of the imbalanced sentiment orientation distribution of the dataset on classification performance, specifically as follows:
S101) given a training set $t_1$, dividing it into a majority class $maj_1$ and a minority class $min_1$;
S102) clustering the majority class $maj_1$ into several clusters with the Affinity Propagation clustering algorithm, the result being expressed as $C = \{c_1, c_2, \ldots, c_n\}$;
S103) to construct a balanced dataset, randomly selecting samples from each cluster of $C$ in proportion to obtain $maj_2$, such that the number of samples in $maj_2$ is close to the number of samples in $min_1$;
S104) combining the datasets $maj_2$ and $min_1$ to obtain a balanced training set $t_2$;
S105) using the balanced training set $t_2$ in place of $t_1$ as the final training set;
(2) a microblog text preprocessing module, which cleans the microblog text and performs word segmentation, part-of-speech tagging, and stop-word filtering;
(3) a Word2vec-based microblog expansion module, which uses Word2vec to find the top $k$ most similar words for each word in a microblog and adds them to expand the microblog, specifically:
S301) training the word vectors: a large microblog corpus is collected through the Sina Weibo API, useless symbols and URLs are filtered out, and the remaining 5 GB of microblog data are used as the training set; the Skip-gram model in Word2vec is then used to train the word vectors, and these word vectors are finally used to find the similar words of each word in a microblog;
S302) expanding the microblog: first, given a microblog sentence $t_i$, word segmentation yields its word sequence, expressed as $\{w_1, w_2, \ldots, w_n\}$; then the trained word vectors are used to find the top $k$ most similar words for each word in $t_i$, thereby expanding the microblog; the expanded microblog can be expressed as $\{w_1, w_2, \ldots, w_n, w_{11}, w_{12}, \ldots, w_{1k}, w_{21}, w_{22}, \ldots, w_{2k}, \ldots, w_{n1}, w_{n2}, \ldots, w_{nk}\}$, where $\{w_{11}, w_{12}, \ldots, w_{1k}\}$ are the top $k$ most similar words of $w_1$; emoticons and punctuation marks in the microblog are retained as they are, so the expanded microblog contains more information than the original;
(4) a feature extraction module, which loads the relevant dictionaries and extracts features from the preprocessed microblogs;
(5) a sentiment analysis model training module, which trains the AWCRF model on the balanced and expanded training set, specifically comprising the following steps:
S501) applying the CRF model to the data processed by the undersampling technique and the Word2vec expansion described above to obtain the AWCRF model;
S502) using the feature vectors that the feature extraction module extracts from the microblogs as input, and training the AWCRF model with the L-BFGS algorithm;
(6) a sentiment orientation discrimination module, which uses the trained AWCRF model to determine the sentiment orientation of the microblogs to be predicted.
2. The method according to claim 1, characterized in that step (6) further comprises the following steps:
S601) performing text preprocessing, Word2vec expansion, and feature extraction on the data to be predicted to obtain their feature vectors;
S602) using the feature vectors of the data to be predicted as input to the AWCRF model, and using the trained AWCRF model to determine the sentiment orientation of the microblogs to be predicted.
CN201610898432.8A 2016-03-30 2016-10-14 Sentiment orientation analysis method for Chinese microblogs Active CN106569996B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610192686 2016-03-30
CN2016101926868 2016-03-30

Publications (2)

Publication Number Publication Date
CN106569996A (en) 2017-04-19
CN106569996B (en) 2019-06-21

Family

ID=58532883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610898432.8A Active CN106569996B (en) 2016-03-30 2016-10-14 Sentiment orientation analysis method for Chinese microblogs

Country Status (1)

Country Link
CN (1) CN106569996B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10394959B2 (en) 2017-12-21 2019-08-27 International Business Machines Corporation Unsupervised neural based hybrid model for sentiment analysis of web/mobile application using public data sources
CN108304568B (en) * 2018-02-12 2021-01-05 郑长敬 Real estate public expectation big data processing method and system
CN108681532B (en) * 2018-04-08 2022-03-25 天津大学 Sentiment analysis method for Chinese microblog
CN111460158B (en) * 2020-04-01 2022-09-23 安徽理工大学 Microblog topic public emotion prediction method based on emotion analysis
CN111611455A (en) * 2020-05-22 2020-09-01 安徽理工大学 User group division method based on user emotional behavior characteristics under microblog hot topics

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894102A (en) * 2010-07-16 2010-11-24 浙江工商大学 Method and device for analyzing emotion tendentiousness of subjective text
CN104462065A (en) * 2014-12-15 2015-03-25 北京国双科技有限公司 Event emotion type analyzing method and device
CN104899298A (en) * 2015-06-09 2015-09-09 华东师范大学 Microblog sentiment analysis method based on large-scale corpus characteristic learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Study on Sentiment Computing and Classification of Sina Weibo with Word2vec;Bai Xue 等;《2014 IEEE International Congress on Big Data》;20140702;第358-363页
基于词亲和度的微博词语语义倾向识别算法;唐浩浩 等;《数据采集与处理》;20150113;第30卷(第1期);第137-147页

Also Published As

Publication number Publication date
CN106569996A (en) 2017-04-19

Similar Documents

Publication Publication Date Title
US20210216723A1 (en) Classification model training method, classification method, device, and medium
CN106569996B (en) Sentiment orientation analysis method for Chinese microblogs
CN106649818B (en) Application search intention identification method and device, application search method and server
US10685186B2 (en) Semantic understanding based emoji input method and device
Barbieri et al. Multimodal emoji prediction
CN105183717B (en) OSN user feeling analysis method based on random forest and customer relationship
CN109376251A (en) Microblog Chinese sentiment dictionary construction method based on word vector learning model
CN108874937B (en) Emotion classification method based on part of speech combination and feature selection
CN104281622B (en) Information recommendation method and device in social media
CN106202053B (en) Social-network-driven microblog topic sentiment analysis method
CN106354818B (en) Social media-based dynamic user attribute extraction method
CN111125354A (en) Text classification method and device
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN104298665A (en) Identification method and device of evaluation objects of Chinese texts
CN110457711B (en) Subject word-based social media event subject identification method
CN102929861A (en) Method and system for calculating text emotion index
CN106202584A (en) Microblog sentiment analysis method based on standard dictionary and semantic rules
CN109992781B (en) Text feature processing method and device and storage medium
CN113407842B (en) Model training method, theme recommendation reason acquisition method and system and electronic equipment
KR20200087977A (en) Multimodal document summary system and method
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN108009297B (en) Text emotion analysis method and system based on natural language processing
CN104035955B (en) searching method and device
Resyanto et al. Choosing the most optimum text preprocessing method for sentiment analysis: Case: iPhone Tweets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant