CN106569996A - Chinese-microblog-oriented emotional tendency analysis method - Google Patents

Chinese-microblog-oriented emotional tendency analysis method

Info

Publication number
CN106569996A
Authority
CN
China
Prior art keywords
microblogging
module
microblog
word
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610898432.8A
Other languages
Chinese (zh)
Other versions
CN106569996B (en)
Inventor
郝志峰
梁礼欣
蔡瑞初
温雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Publication of CN106569996A
Application granted
Publication of CN106569996B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Abstract

The invention discloses a Chinese-microblog-oriented emotional tendency analysis method. The method comprises the following modules: an undersampling module, a microblog text preprocessing module, a Word2vec-based microblog expansion module, a feature extraction module, an emotion analysis model training module, and an emotional tendency judgment module. Finally, a trained AWCRF model is used to judge the emotional tendency of a microblog to be predicted. The method effectively solves the emotional tendency classification problem when a Chinese microblog data set exhibits an unbalanced emotional tendency distribution; it is simple to implement, has a high recognition rate, and has strong practical value.

Description

A sentiment orientation analysis method for Chinese microblogs
Technical field
The invention belongs to the technical field of network information processing, and in particular relates to a sentiment orientation analysis method for Chinese microblogs.
Background
Microblogs, as a new kind of social platform, are popular with many users. More and more people use microblogs to express their opinions, so thoroughly analyzing and mining the sentiment in users' microblogs is of great significance. The purpose of sentiment analysis is to mine users' opinions from microblog text and identify their sentiment orientation. For example, an enterprise can use microblogs to learn how users evaluate its products and services. As with traditional sentiment analysis, sentiment analysis methods for microblogs fall into two categories. One is based on sentiment dictionaries and rules: it identifies sentiment orientation from the numbers of positive and negative sentiment words in a sentence. The other is based on machine learning: it trains a classifier on suitably selected features.
However, the methods above all ignore the effect that an imbalanced sentiment distribution in a Chinese microblog data set has on sentiment classification. That is, when the number of negative sentences in the data set differs greatly from the number of positive sentences, the accuracy of the classifier suffers. In real life, the topics or events discussed on microblogs often carry a strong sentiment orientation of their own, which makes the sentiment distribution of many topics imbalanced. For example, topics such as "#edible oil price rise#" and "#leather-shoe jelly#" are clearly negative, whereas the topic "#Tu Youyou wins prize#" is clearly positive. This imbalance is precisely a key factor that makes many machine learning algorithms perform poorly, especially in recognizing the minority sentiment class. In addition, compared with traditional text, microblogs are generally very short, which makes it hard for traditional methods to extract information useful for sentiment classification, and no sentiment dictionary is yet large enough to cover all sentiment words.
Summary of the invention
To solve the above problems, the invention proposes a sentiment orientation analysis method for Chinese microblogs, whose main steps are as follows:
(1) Undersampling module. The Affinity Propagation algorithm is used to reduce the number of majority-class samples in the training set and thereby balance it, which reduces the effect of the imbalanced sentiment distribution of the data set on classification.
(2) Microblog text preprocessing module. The microblog text is cleaned, then word segmentation, part-of-speech tagging, stop-word handling, and similar operations are performed.
(3) Word2vec-based microblog expansion module. Word2vec is used to find the top K similar words of each word in a microblog, and these words are used to expand the microblog.
(4) Feature extraction module. The relevant dictionaries are loaded, and features are extracted from the preprocessed microblogs.
(5) Sentiment analysis model training module. The AWCRF model is trained on the training set that has been balanced and expanded as described above.
(6) Sentiment orientation judgment module. The trained AWCRF model is used to judge the sentiment orientation of a microblog to be predicted.
Description of the drawings
Fig. 1 is the analysis flowchart of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawing. The invention addresses the sentiment orientation classification problem for Chinese microblog data sets whose sentiment distribution is imbalanced. Fig. 1 shows the overall algorithm flow of the invention.
The content of each step is described below:
1. Undersampling module
The invention uses the Affinity Propagation algorithm to reduce the number of majority-class samples in the training set and thereby balance it.
The undersampling technique of the invention consists of the following steps:
(1) Given a training set t1, divide it into the majority class maj1 and the minority class min1;
(2) Cluster the majority class maj1 into several clusters with the Affinity Propagation clustering algorithm; the result can be expressed as C = {c1, c2, ..., cn};
(3) To build a balanced data set, randomly select samples from each cluster of C in proportion to obtain maj2, such that the number of samples in maj2 is close to the number of samples in min1;
(4) Combine the data sets maj2 and min1 to obtain a balanced training set t2;
(5) Use the balanced training set t2 in place of t1 as the final training set.
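Steps (3) to (5) can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names and the per-cluster rounding rule are assumptions, and the cluster labels for step (2) are taken as given (they could come, for example, from scikit-learn's `AffinityPropagation().fit_predict(X)` on vectorized majority samples).

```python
import random
from collections import defaultdict


def undersample_majority(majority, cluster_labels, target_size, seed=0):
    """Draw roughly target_size majority samples, proportionally per cluster.

    majority and cluster_labels are parallel lists; the labels are assumed
    to come from a clustering of the majority class (step 2).
    """
    if len(majority) != len(cluster_labels):
        raise ValueError("majority and cluster_labels must have equal length")
    if not majority:
        return []
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, label in zip(majority, cluster_labels):
        clusters[label].append(sample)
    # Fraction of every cluster to keep so the total is close to target_size.
    ratio = min(1.0, target_size / len(majority))
    selected = []
    for label in sorted(clusters):
        members = clusters[label]
        # Assumption: keep at least one sample from every cluster.
        quota = max(1, round(len(members) * ratio))
        selected.extend(rng.sample(members, quota))
    return selected


def balance_training_set(majority, cluster_labels, minority, seed=0):
    """Build t2 = maj2 + min1, where maj2 is about as large as min1."""
    maj2 = undersample_majority(majority, cluster_labels, len(minority), seed)
    return maj2 + list(minority)
```

Sampling per cluster rather than from the whole majority class keeps every region of the majority class represented in maj2, which plain random undersampling does not guarantee.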
2. Microblog text preprocessing module
The main work of this module is to clean the microblog text and to perform word segmentation, part-of-speech tagging, stop-word handling, and similar operations.
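The cleaning and stop-word steps might look like the sketch below. The patent does not say which symbols are removed or which segmenter is used, so the regular expressions and function names here are assumptions, and the tokens are taken as already segmented by some Chinese word segmenter.

```python
import re

# Assumed cleaning rules: strip URLs and @mentions, then collapse whitespace.
# The URL pattern is ASCII-only so it stops at adjacent Chinese characters.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&%_#~:+-]+")
MENTION_RE = re.compile(r"@[\w\-]+")
SPACE_RE = re.compile(r"\s+")


def clean_microblog(text):
    """Remove URLs and @mentions from raw microblog text."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def remove_stop_words(tokens, stop_words):
    """Drop stop words from an already segmented token list."""
    return [token for token in tokens if token not in stop_words]
```

For example, `clean_microblog` turns a post containing a short link and a mention into plain text that can then be passed to the segmenter.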
3. Word2vec-based microblog expansion module
The invention uses Word2vec to find the top K similar words of each word in a microblog and thereby expand the microblog. This involves two steps: training word vectors and expanding the microblog.
(1) Train word vectors. We collected a large microblog corpus through the Sina Weibo API and filtered out useless symbols and URLs, leaving 5 GB of microblog data as the training set. Word vectors are then trained with the Skip-gram model in Word2vec, and these vectors are finally used to find the similar words of each word in a microblog.
(2) Expand the microblog. First, given a microblog sentence ti, word segmentation yields its word sequence, denoted {w1, w2, ..., wn}. Then the word vectors trained above are used to find the top k similar words of each word in ti, which expands the microblog. The expanded microblog can be expressed as {w1, w2, ..., wn, w11, w12, ..., w1k, w21, w22, ..., w2k, ..., wn1, wn2, ..., wnk}, where {w11, w12, ..., w1k} denotes the top k similar words of w1. Emoticons and punctuation marks are kept in the microblog as they are, so the expanded microblog contains more information than the original.
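The expansion step can be illustrated with a small sketch. It assumes the word vectors are already available as a plain dictionary and ranks neighbours by cosine similarity, which is the usual Word2vec similarity measure but is not stated in the text; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(word, vectors, k):
    """Return the k words most similar to word; empty if word has no vector."""
    if word not in vectors:
        return []
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def expand_microblog(tokens, vectors, k):
    """Append the top-k similar words of every token after the original tokens."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(top_k_similar(token, vectors, k))
    return expanded
```

Tokens without a vector, such as emoticons and punctuation, contribute no neighbours and simply stay in the sequence, matching the description above.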
4. Feature extraction module
(1) Load the relevant dictionaries for feature extraction, including the sentiment dictionary, emoticon dictionary, popular-word dictionary, and negation-word dictionary.
(2) Using the dictionary data loaded above, extract the predefined features from the preprocessed microblog text, vectorize the text, and convert it into a form that the sentiment analysis model training module can process.
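A dictionary-lookup feature extractor could be sketched as below. The patent only says that predefined features are extracted, so the concrete feature set, the split of the sentiment dictionary into positive and negative parts, and the dictionary keys are all assumptions made for illustration.

```python
def token_features(token, dictionaries):
    """Build one feature dict for a token from dictionary membership.

    dictionaries maps a hypothetical name ("positive", "negative",
    "emoticon", "popular", "negation") to a set of words.
    """
    return {
        "token": token,
        "is_positive": token in dictionaries["positive"],
        "is_negative": token in dictionaries["negative"],
        "is_emoticon": token in dictionaries["emoticon"],
        "is_popular": token in dictionaries["popular"],
        "is_negation": token in dictionaries["negation"],
    }


def extract_features(tokens, dictionaries):
    """Turn a token sequence into a sequence of feature dicts."""
    return [token_features(token, dictionaries) for token in tokens]
```

A list of per-token feature dictionaries is a common input format for CRF toolkits; whether the original system used this format is not stated.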
5. Sentiment analysis model training module
The invention applies a CRF model to the data processed by the undersampling technique and the Word2vec expansion described above, which yields the AWCRF model. The feature vectors that the feature extraction module extracts from the microblogs are then used as input, and the AWCRF model is trained with the L-BFGS algorithm. The model not only overcomes the effect of the imbalanced sentiment distribution in the training set, but also enriches the sentiment information of each microblog sentence, which mitigates insufficient sentiment dictionary coverage. In addition, because there are fewer training samples, the model takes less time to train and trains more efficiently, which gives it strong practical value.
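For orientation, the textbook linear-chain CRF and the objective that L-BFGS would optimize are written out below. The patent gives no formulas, so the linear-chain form, the feature functions f_j, the weights theta, and the L2 penalty with coefficient lambda are assumptions taken from the standard formulation, not from the source.

```latex
% Conditional probability of a label sequence y given an input sequence x
p(y \mid x; \theta) = \frac{1}{Z(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{j} \theta_j \, f_j(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{j} \theta_j \, f_j(y'_{t-1}, y'_t, x, t) \Big)

% Penalized log-likelihood over N training sequences (maximized by L-BFGS)
\mathcal{L}(\theta) = \sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big)
  - \frac{\lambda}{2} \lVert \theta \rVert^2

% Gradient: observed minus expected feature counts, minus the penalty term
\frac{\partial \mathcal{L}}{\partial \theta_j}
  = \sum_{i=1}^{N} \Big( F_j\big(y^{(i)}, x^{(i)}\big)
  - \mathbb{E}_{y \sim p(\cdot \mid x^{(i)}; \theta)}\big[ F_j\big(y, x^{(i)}\big) \big] \Big)
  - \lambda \theta_j,
\qquad
F_j(y, x) = \sum_{t=1}^{T} f_j(y_{t-1}, y_t, x, t)
```

L-BFGS needs only the objective value and this gradient at each iteration, which is why it is a common choice for CRF training.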
6. Sentiment orientation judgment module
First, the data to be predicted undergo text preprocessing, Word2vec expansion, and feature extraction, which yields the feature vectors of the prediction data. These feature vectors are then fed to the trained AWCRF model, which judges the sentiment orientation of each microblog to be predicted.
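The prediction flow is the same chain of stages applied to unseen text. The sketch below only shows how the stages compose; every callable is a hypothetical stand-in supplied by the caller, and nothing here reproduces the trained AWCRF model itself.

```python
def predict_sentiment(raw_text, preprocess, expand, extract, model):
    """Run one microblog through preprocessing, expansion, features, and the model.

    preprocess: raw text -> token list
    expand:     token list -> expanded token list
    extract:    token list -> feature sequence
    model:      feature sequence -> sentiment label
    All four are caller-supplied callables (assumed interfaces).
    """
    tokens = preprocess(raw_text)
    expanded = expand(tokens)
    features = extract(expanded)
    return model(features)
```

Keeping the stages as separate callables makes it straightforward to reuse exactly the same preprocessing, expansion, and feature extraction at training and prediction time.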

Claims (5)

1. A sentiment orientation analysis method for Chinese microblogs, characterized by comprising the following modules:
(1) an undersampling module, which uses the Affinity Propagation algorithm to reduce the number of majority-class samples in the training set and thereby balance it, so as to reduce the effect of the imbalanced sentiment distribution of the data set on classification;
(2) a microblog text preprocessing module, which cleans the microblog text and performs word segmentation, part-of-speech tagging, stop-word handling, and similar operations;
(3) a Word2vec-based microblog expansion module, which uses Word2vec to find the top K similar words of each word in a microblog and thereby expand the microblog;
(4) a feature extraction module, which loads the relevant dictionaries and extracts features from the preprocessed microblogs;
(5) a sentiment analysis model training module, which trains the AWCRF model on the training set that has been balanced and expanded as described above;
(6) a sentiment orientation judgment module, which uses the trained AWCRF model to judge the sentiment orientation of a microblog to be predicted.
2. The method according to claim 1, characterized in that step (1) further comprises the following steps:
(2-1) given a training set t1, dividing it into the majority class maj1 and the minority class min1;
(2-2) clustering the majority class maj1 into several clusters with the Affinity Propagation clustering algorithm, the result being expressed as C = {c1, c2, ..., cn};
(2-3) to build a balanced data set, randomly selecting samples from each cluster of C in proportion to obtain maj2, such that the number of samples in maj2 is close to the number of samples in min1;
(2-4) combining the data sets maj2 and min1 to obtain a balanced training set t2;
(2-5) using the balanced training set t2 in place of t1 as the final training set.
3. The method according to claim 1, characterized in that step (3) further comprises the following steps:
(3-1) training word vectors: a large microblog corpus is collected through the Sina Weibo API and useless symbols and URLs are filtered out, leaving 5 GB of microblog data as the training set; word vectors are then trained with the Skip-gram model in Word2vec, and these vectors are finally used to find the similar words of each word in a microblog;
(3-2) expanding the microblog: first, given a microblog sentence ti, word segmentation yields its word sequence, denoted {w1, w2, ..., wn}; then the word vectors trained above are used to find the top k similar words of each word in ti, which expands the microblog. The expanded microblog can be expressed as {w1, w2, ..., wn, w11, w12, ..., w1k, w21, w22, ..., w2k, ..., wn1, wn2, ..., wnk}, where {w11, w12, ..., w1k} denotes the top k similar words of w1. Emoticons and punctuation marks are kept in the microblog as they are, so the expanded microblog contains more information than the original.
4. The method according to claim 1, characterized in that step (5) further comprises the following steps:
(4-1) applying a CRF model to the data processed by the undersampling technique and the Word2vec expansion, thereby obtaining the AWCRF model;
(4-2) using the feature vectors that the feature extraction module extracts from the microblogs as input, and training the AWCRF model with the L-BFGS algorithm.
5. The method according to claim 1, characterized in that step (6) further comprises the following steps:
(5-1) performing text preprocessing, Word2vec expansion, and feature extraction on the data to be predicted, thereby obtaining the feature vectors of the prediction data;
(5-2) using the feature vectors of the prediction data as input to the AWCRF model, and using the trained AWCRF model to judge the sentiment orientation of the microblog to be predicted.
CN201610898432.8A 2016-03-30 2016-10-14 Sentiment orientation analysis method for Chinese microblogs Active CN106569996B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610192686 2016-03-30
CN2016101926868 2016-03-30

Publications (2)

Publication Number Publication Date
CN106569996A (en) 2017-04-19
CN106569996B (en) 2019-06-21

Family

ID=58532883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610898432.8A Active CN106569996B (en) 2016-03-30 2016-10-14 Sentiment orientation analysis method for Chinese microblogs

Country Status (1)

Country Link
CN (1) CN106569996B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894102A (en) * 2010-07-16 2010-11-24 浙江工商大学 Method and device for analyzing emotion tendentiousness of subjective text
CN104462065A (en) * 2014-12-15 2015-03-25 北京国双科技有限公司 Event emotion type analyzing method and device
CN104899298A (en) * 2015-06-09 2015-09-09 华东师范大学 Microblog sentiment analysis method based on large-scale corpus characteristic learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAI XUE 等: "A Study on Sentiment Computing and Classification of Sina Weibo with Word2vec", 《2014 IEEE INTERNATIONAL CONGRESS ON BIG DATA》 *
唐浩浩 等: "基于词亲和度的微博词语语义倾向识别算法", 《数据采集与处理》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10394959B2 (en) 2017-12-21 2019-08-27 International Business Machines Corporation Unsupervised neural based hybrid model for sentiment analysis of web/mobile application using public data sources
US10719665B2 (en) 2017-12-21 2020-07-21 International Business Machines Corporation Unsupervised neural based hybrid model for sentiment analysis of web/mobile application using public data sources
CN108304568A (en) * 2018-02-12 2018-07-20 郑长敬 A kind of real estate Expectations big data processing method and system
CN108304568B (en) * 2018-02-12 2021-01-05 郑长敬 Real estate public expectation big data processing method and system
CN108681532A (en) * 2018-04-08 2018-10-19 天津大学 A kind of sentiment analysis method towards Chinese microblogging
CN108681532B (en) * 2018-04-08 2022-03-25 天津大学 Sentiment analysis method for Chinese microblog
CN111460158A (en) * 2020-04-01 2020-07-28 安徽理工大学 Microblog topic public emotion prediction method based on emotion analysis
CN111460158B (en) * 2020-04-01 2022-09-23 安徽理工大学 Microblog topic public emotion prediction method based on emotion analysis
CN111611455A (en) * 2020-05-22 2020-09-01 安徽理工大学 User group division method based on user emotional behavior characteristics under microblog hot topics

Also Published As

Publication number Publication date
CN106569996B (en) 2019-06-21

Similar Documents

Publication Publication Date Title
US11853704B2 (en) Classification model training method, classification method, device, and medium
CN106202032B (en) A kind of sentiment analysis method and its system towards microblogging short text
CN105183717B (en) A kind of OSN user feeling analysis methods based on random forest and customer relationship
CN106569996A (en) Chinese-microblog-oriented emotional tendency analysis method
CN109101478B (en) Aspect-level emotion analysis method for E-commerce comment text
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108427670A (en) A kind of sentiment analysis method based on context word vector sum deep learning
CN108984530A (en) A kind of detection method and detection system of network sensitive content
CN104331506A (en) Multiclass emotion analyzing method and system facing bilingual microblog text
CN105550269A (en) Product comment analyzing method and system with learning supervising function
Tiwari et al. Social media sentiment analysis on Twitter datasets
CN109446404A (en) A kind of the feeling polarities analysis method and device of network public-opinion
CN106354818B (en) Social media-based dynamic user attribute extraction method
CN104281653A (en) Viewpoint mining method for ten million microblog texts
CN107122349A (en) A kind of feature word of text extracting method based on word2vec LDA models
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN107679110A (en) The method and device of knowledge mapping is improved with reference to text classification and picture attribute extraction
CN106776574A (en) User comment text method for digging and device
CN106126502A (en) A kind of emotional semantic classification system and method based on support vector machine
CN108846047A (en) A kind of picture retrieval method and system based on convolution feature
CN108009297B (en) Text emotion analysis method and system based on natural language processing
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN106446147A (en) Emotion analysis method based on structuring features
Chumwatana Using sentiment analysis technique for analyzing Thai customer satisfaction from social media
CN109815485A (en) A kind of method, apparatus and storage medium of the identification of microblogging short text feeling polarities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant