CN106569996A - Chinese-microblog-oriented emotional tendency analysis method - Google Patents
- Publication number
- CN106569996A (application CN201610898432.8A)
- Authority
- CN
- China
- Prior art keywords
- microblogging
- module
- microblog
- word
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Abstract
The invention discloses a Chinese-microblog-oriented emotional tendency analysis method. The method comprises the following modules: an undersampling module, a microblog text preprocessing module, a Word2vec-based microblog extension module, a feature extraction module, an emotion analysis model training module, and an emotional tendency judgment module. Finally, a trained AWCRF model is used to judge the emotional tendency of a microblog to be predicted. The method effectively solves the emotional tendency classification problem when a Chinese microblog data set exhibits an unbalanced emotional tendency distribution. It is simple to implement, achieves a high recognition rate, and has strong practical value.
Description
Technical field
The invention belongs to the technical field of network information processing, and in particular relates to a sentiment orientation analysis method for Chinese microblogs.
Background
Microblogs are popular with many users as a new kind of social platform. More and more people use microblogs to express their opinions, so thoroughly analyzing and mining the sentiment in users' microblogs is of great significance. The goal of sentiment analysis is to mine users' opinions from microblog text and identify their sentiment orientation. For example, an enterprise can use microblogs to learn how users evaluate its products and services. As with traditional sentiment analysis, sentiment analysis methods for microblogs fall into two classes. One class is based on sentiment dictionaries and rules: these methods identify sentiment orientation from the numbers of positive and negative sentiment words in a sentence. The other class is based on machine learning: these methods train a classifier on suitably selected features.
However, the above methods all ignore the effect that an imbalanced sentiment orientation distribution in Chinese microblog datasets has on sentiment classification. That is, when the numbers of negative and positive sentences in a dataset differ greatly, the accuracy of the classifier suffers. In real life, the topics or events discussed on microblogs often carry a strong sentiment orientation themselves, which makes the sentiment orientation distribution of many topics imbalanced. For example, topics such as "#edible oil price rise#" and "#leather-shoe jelly#" are clearly negative, while the topic "#Tu Youyou wins award#" is clearly positive. This imbalance is precisely a key factor that causes many machine learning algorithms to perform poorly, especially when recognizing the minority sentiment class. In addition, compared with traditional text, microblogs are generally very short, which makes it hard for traditional methods to extract information that helps sentiment classification, and no sentiment dictionary is yet large enough to cover all sentiment words.
Summary of the invention
To solve the above problems, the present invention proposes a sentiment orientation analysis method for Chinese microblogs. Its main steps are as follows:
(1) Undersampling module. The Affinity Propagation algorithm is used to reduce the number of majority-class samples in the training set, thereby balancing the training set and reducing the effect of the imbalanced sentiment orientation distribution of the dataset on classification performance.
(2) Microblog text preprocessing module. The microblog text is cleaned, and word segmentation, part-of-speech tagging, and stop-word processing are performed.
(3) Word2vec-based microblog extension module. Word2vec is used to find the top K similar words of each word in a microblog, and the microblog is extended with them.
(4) Feature extraction module. The relevant dictionaries are loaded, and features are extracted from the preprocessed microblogs.
(5) Sentiment analysis model training module. The AWCRF model is trained on the balanced and extended training set.
(6) Sentiment orientation judgment module. The trained AWCRF model is used to judge the sentiment orientation of the microblogs to be predicted.
Brief description of the drawings
Fig. 1 is the analysis flow chart of the present invention.
Detailed description
The present invention is further described below with reference to the accompanying drawing. It addresses the sentiment orientation classification problem for Chinese microblog datasets whose sentiment orientation distribution is imbalanced. Fig. 1 shows the overall algorithm flow of the present invention.
Each step is described in detail below:
1. Undersampling module
The present invention uses the Affinity Propagation algorithm to reduce the number of majority-class samples in the training set, thereby balancing the training set.
The undersampling procedure consists of the following steps:
(1) Given a training set t1, divide it into the majority class maj1 and the minority class min1.
(2) Cluster the majority class maj1 into several clusters with the Affinity Propagation clustering algorithm. The result can be written as C = {c1, c2, ..., cn}.
(3) To build a balanced dataset, randomly select samples from each cluster of C in proportion to obtain maj2, such that the sample size of maj2 is close to that of min1.
(4) Combine the datasets maj2 and min1 to obtain a balanced training set t2.
(5) Use the balanced training set t2 in place of t1 as the final training set.
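Steps (3) to (5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the cluster labels for maj1 are assumed to come from a separate clustering step (for example an Affinity Propagation implementation such as the one in scikit-learn), and the function names `undersample_majority` and `balance_training_set`, the rounding rule, and the "at least one sample per cluster" rule are hypothetical choices that the source does not specify.

```python
import random
from collections import defaultdict


def undersample_majority(maj1, cluster_labels, target_size, seed=0):
    """Draw roughly `target_size` samples from maj1, proportionally per cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, label in zip(maj1, cluster_labels):
        clusters[label].append(sample)

    maj2 = []
    for label in sorted(clusters):
        members = clusters[label]
        # Quota proportional to the cluster's share of the majority class
        # (assumed rule); at least one sample is kept per cluster.
        quota = round(target_size * len(members) / len(maj1))
        quota = min(len(members), max(1, quota))
        maj2.extend(rng.sample(members, quota))
    return maj2


def balance_training_set(maj1, min1, cluster_labels, seed=0):
    """Return t2 = maj2 + min1, where maj2 is about as large as min1."""
    maj2 = undersample_majority(maj1, cluster_labels, len(min1), seed)
    return maj2 + list(min1)
```

Because each quota is rounded independently, maj2 is only approximately the size of min1, which matches the requirement that the two sample sizes be "close".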
2. Microblog text preprocessing module
The main task of this module is to clean the microblog text and to perform word segmentation, part-of-speech tagging, and stop-word processing.
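A minimal sketch of the cleaning and stop-word steps is given below. The source does not say which symbols are removed or which segmenter is used, so the regular expressions (URLs and @mentions) are assumptions, and `segment` is a hypothetical callable standing in for a real Chinese word segmenter. Part-of-speech tagging is omitted here.

```python
import re

# Assumed cleaning rules: the source only says the text is "cleaned".
URL_RE = re.compile(r"https?://\S+")
# Assumes a mention is followed by whitespace or punctuation.
MENTION_RE = re.compile(r"@[\w\-]+")


def clean(text):
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text, segment, stop_words):
    """Clean the text, segment it into words, and drop stop words."""
    tokens = segment(clean(text))
    return [tok for tok in tokens if tok.strip() and tok not in stop_words]
```

In the example call `preprocess("@user 今天 天气 很 好 http://t.cn/abc", str.split, {"很"})`, whitespace splitting stands in for segmentation only because the sample text is already separated by spaces.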
3. Word2vec-based microblog extension module
The present invention uses Word2vec to find the top k similar words of each word in a microblog and extends the microblog with them. This involves two steps: training the word vectors and extending the microblog.
(1) Training the word vectors. A large microblog corpus was collected through the Sina Weibo API. After useless symbols and URLs were filtered out, 5 GB of microblog data remained and was used as the training set. The Skip-gram model in Word2vec is then used to train the word vectors, which are finally used to find the similar words of each word in a microblog.
(2) Extending the microblog. First, given a microblog sentence ti, word segmentation yields its word sequence, denoted {w1, w2, ..., wn}. Then the trained word vectors are used to find the top k similar words of each word in ti, which extends the microblog. The extended microblog can be written as {w1, w2, ..., wn, w11, w12, ..., w1k, w21, w22, ..., w2k, ..., wn1, wn2, ..., wnk}, where {w11, w12, ..., w1k} denotes the top k similar words of w1. Emoticons and punctuation marks are kept in the microblog unchanged, so the extended microblog contains more information than the original.
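The extension step can be illustrated with the sketch below. It assumes the word vectors are already available as a plain dictionary mapping each word to a list of floats. In practice they would come from a trained Skip-gram model, and a Word2vec library would normally supply its own nearest-neighbour lookup. Cosine similarity is the usual choice for Word2vec neighbours but is an assumption here, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(word, vectors, k):
    """Return the k words whose vectors are most similar to `word`."""
    if word not in vectors:
        return []  # emoticons, punctuation, out-of-vocabulary words
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:k]]


def extend_microblog(words, vectors, k):
    """Append the top-k similar words of every word to the original sequence."""
    extended = list(words)  # the original words, emoticons and punctuation stay
    for word in words:
        extended.extend(top_k_similar(word, vectors, k))
    return extended
```

Tokens without a vector, such as emoticons and punctuation marks, are kept in place and contribute no extra words, which mirrors the statement that they are retained unchanged.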
4. Feature extraction module
(1) Load the relevant dictionaries used for feature extraction, including the sentiment dictionary, the emoticon dictionary, the popular-word dictionary, and the negation-word dictionary.
(2) Using the loaded dictionaries, extract the predefined features from the preprocessed microblog text, vectorize the text, and convert it into a form that the sentiment analysis model training module can process.
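The source does not list the predefined features, so the following sketch is only one plausible reading: each token is turned into a small feature dictionary that records which loaded dictionaries contain it. The feature names, the `prev_is_negation` feature, and the per-token dictionary layout (a common input format for CRF toolkits) are all assumptions made for illustration.

```python
def token_features(tokens, dictionaries):
    """Build one feature dict per token from dictionary membership.

    `dictionaries` maps a dictionary name (e.g. "positive", "negation")
    to a set of entries. All feature names are hypothetical.
    """
    negations = dictionaries.get("negation", set())
    features = []
    for i, tok in enumerate(tokens):
        feats = {"token": tok}
        for name, entries in dictionaries.items():
            feats["in_" + name] = tok in entries
        # A preceding negation word often flips the polarity of a token.
        feats["prev_is_negation"] = i > 0 and tokens[i - 1] in negations
        features.append(feats)
    return features
```

For example, `token_features(["不", "好"], {"positive": {"好"}, "negation": {"不"}})` marks the second token as a positive word preceded by a negation.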
5. Sentiment analysis model training module
The present invention applies a CRF model to the data processed by the undersampling and Word2vec techniques described here, which yields the AWCRF model. The feature vectors that the feature extraction module extracts from the microblogs are then used as input, and the AWCRF model is trained with the L-BFGS algorithm. The model not only overcomes the effect of an imbalanced sentiment distribution in the training set, but also enriches the sentiment information of each microblog sentence, which mitigates the insufficient coverage of sentiment dictionaries. In addition, because undersampling reduces the number of training samples, the model trains quickly and efficiently, which gives it strong practical value.
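For orientation, the objective that L-BFGS would optimize can be written out under the assumption of a standard linear-chain CRF with Gaussian (L2) regularization. The source only says "CRF" and "L-BFGS", so the chain structure, the feature functions f_k, the weights λ_k, and the regularization strength σ² are assumptions introduced here for illustration.

```latex
% Conditional probability of a label sequence y given an input sequence x
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)

% Regularized log-likelihood over N training microblogs
\mathcal{L}(\lambda) = \sum_{i=1}^{N} \log p_\lambda\big(y^{(i)} \mid x^{(i)}\big)
  - \frac{1}{2\sigma^2} \sum_{k} \lambda_k^2

% Gradient: empirical feature counts minus expected feature counts
\frac{\partial \mathcal{L}}{\partial \lambda_k}
  = \sum_{i=1}^{N} \sum_{t} f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}, t\big)
  - \sum_{i=1}^{N} \sum_{t} \mathbb{E}_{y' \sim p_\lambda(\cdot \mid x^{(i)})}
      \big[ f_k\big(y'_{t-1}, y'_t, x^{(i)}, t\big) \big]
  - \frac{\lambda_k}{\sigma^2}
```

L-BFGS is a quasi-Newton method: it needs only the value of this objective and its gradient, and it approximates curvature from a short history of recent gradients, which keeps memory use low when the number of features is large.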
6. Sentiment orientation judgment module
First, the data to be predicted undergoes text preprocessing, Word2vec extension, and feature extraction, which yields its feature vectors. These feature vectors are then fed to the trained AWCRF model, which judges the sentiment orientation of the microblogs to be predicted.
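The prediction flow can be sketched as a simple chain of the earlier stages. Everything here is hypothetical scaffolding: the stage callables are passed in as parameters, and `KeywordModel` is a toy stand-in with an assumed `predict` method, not the trained AWCRF model.

```python
def predict_sentiment(text, preprocess, extend, extract_features, model):
    """Run one microblog through preprocessing, extension, feature
    extraction, and the trained model, in that order."""
    tokens = preprocess(text)
    extended = extend(tokens)
    features = extract_features(extended)
    return model.predict(features)


class KeywordModel:
    """Toy stand-in for a trained model, used only to exercise the pipeline."""

    def predict(self, features):
        score = sum(f["polarity"] for f in features)
        return "positive" if score >= 0 else "negative"
```

The same preprocessing, extension, and feature extraction functions used during training should be reused at prediction time so that the model sees features in the form it was trained on.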
Claims (5)
1. A sentiment orientation analysis method for Chinese microblogs, characterized in that it comprises the following modules:
(1) an undersampling module, which uses the Affinity Propagation algorithm to reduce the number of majority-class samples in the training set, thereby balancing the training set and reducing the effect of the imbalanced sentiment orientation distribution of the dataset on classification performance;
(2) a microblog text preprocessing module, which cleans the microblog text and performs word segmentation, part-of-speech tagging, and stop-word processing;
(3) a Word2vec-based microblog extension module, which uses Word2vec to find the top K similar words of each word in a microblog and extends the microblog with them;
(4) a feature extraction module, which loads the relevant dictionaries and extracts features from the preprocessed microblogs;
(5) a sentiment analysis model training module, which trains the AWCRF model on the balanced and extended training set;
(6) a sentiment orientation judgment module, which uses the trained AWCRF model to judge the sentiment orientation of the microblogs to be predicted.
2. The method according to claim 1, characterized in that step (1) further comprises the following steps:
(2-1) given a training set t1, dividing it into the majority class maj1 and the minority class min1;
(2-2) clustering the majority class maj1 into several clusters with the Affinity Propagation clustering algorithm, the result being written as C = {c1, c2, ..., cn};
(2-3) to build a balanced dataset, randomly selecting samples from each cluster of C in proportion to obtain maj2, such that the sample size of maj2 is close to that of min1;
(2-4) combining the datasets maj2 and min1 to obtain a balanced training set t2;
(2-5) using the balanced training set t2 in place of t1 as the final training set.
3. The method according to claim 1, characterized in that step (3) further comprises the following steps:
(3-1) training the word vectors: a large microblog corpus is collected through the Sina Weibo API; after useless symbols and URLs are filtered out, 5 GB of microblog data remains and is used as the training set; the Skip-gram model in Word2vec is then used to train the word vectors, which are finally used to find the similar words of each word in a microblog;
(3-2) extending the microblog: first, given a microblog sentence ti, word segmentation yields its word sequence, denoted {w1, w2, ..., wn}; then the trained word vectors are used to find the top k similar words of each word in ti, which extends the microblog; the extended microblog can be written as {w1, w2, ..., wn, w11, w12, ..., w1k, w21, w22, ..., w2k, ..., wn1, wn2, ..., wnk}, where {w11, w12, ..., w1k} denotes the top k similar words of w1; emoticons and punctuation marks are kept in the microblog unchanged, so the extended microblog contains more information than the original.
4. The method according to claim 1, characterized in that step (5) further comprises the following steps:
(4-1) applying a CRF model to the data processed by the undersampling and Word2vec techniques described here, thereby obtaining the AWCRF model;
(4-2) using the feature vectors that the feature extraction module extracts from the microblogs as input, and training the AWCRF model with the L-BFGS algorithm.
5. The method according to claim 1, characterized in that step (6) further comprises the following steps:
(5-1) performing text preprocessing, Word2vec extension, and feature extraction on the data to be predicted, thereby obtaining its feature vectors;
(5-2) using these feature vectors as input to the trained AWCRF model, which judges the sentiment orientation of the microblogs to be predicted.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610192686 | 2016-03-30 | ||
CN2016101926868 | 2016-03-30 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106569996A (en) | 2017-04-19 |
CN106569996B (en) | 2019-06-21 |
Family
ID=58532883
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610898432.8A (Active, CN106569996B) | 2016-03-30 | 2016-10-14 | A sentiment orientation analysis method for Chinese microblogs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106569996B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304568A (en) * | 2018-02-12 | 2018-07-20 | 郑长敬 | A kind of real estate Expectations big data processing method and system |
CN108681532A (en) * | 2018-04-08 | 2018-10-19 | 天津大学 | A kind of sentiment analysis method towards Chinese microblogging |
US10394959B2 (en) | 2017-12-21 | 2019-08-27 | International Business Machines Corporation | Unsupervised neural based hybrid model for sentiment analysis of web/mobile application using public data sources |
CN111460158A (en) * | 2020-04-01 | 2020-07-28 | 安徽理工大学 | Microblog topic public emotion prediction method based on emotion analysis |
CN111611455A (en) * | 2020-05-22 | 2020-09-01 | 安徽理工大学 | User group division method based on user emotional behavior characteristics under microblog hot topics |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894102A (en) * | 2010-07-16 | 2010-11-24 | 浙江工商大学 | Method and device for analyzing emotion tendentiousness of subjective text |
CN104462065A (en) * | 2014-12-15 | 2015-03-25 | 北京国双科技有限公司 | Event emotion type analyzing method and device |
CN104899298A (en) * | 2015-06-09 | 2015-09-09 | 华东师范大学 | Microblog sentiment analysis method based on large-scale corpus characteristic learning |
Non-Patent Citations (2)
Title |
---|
BAI XUE 等: "A Study on Sentiment Computing and Classification of Sina Weibo with Word2vec", 《2014 IEEE INTERNATIONAL CONGRESS ON BIG DATA》 * |
唐浩浩 等: "基于词亲和度的微博词语语义倾向识别算法", 《数据采集与处理》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10394959B2 (en) | 2017-12-21 | 2019-08-27 | International Business Machines Corporation | Unsupervised neural based hybrid model for sentiment analysis of web/mobile application using public data sources |
US10719665B2 (en) | 2017-12-21 | 2020-07-21 | International Business Machines Corporation | Unsupervised neural based hybrid model for sentiment analysis of web/mobile application using public data sources |
CN108304568A (en) * | 2018-02-12 | 2018-07-20 | 郑长敬 | A kind of real estate Expectations big data processing method and system |
CN108304568B (en) * | 2018-02-12 | 2021-01-05 | 郑长敬 | Real estate public expectation big data processing method and system |
CN108681532A (en) * | 2018-04-08 | 2018-10-19 | 天津大学 | A kind of sentiment analysis method towards Chinese microblogging |
CN108681532B (en) * | 2018-04-08 | 2022-03-25 | 天津大学 | Sentiment analysis method for Chinese microblog |
CN111460158A (en) * | 2020-04-01 | 2020-07-28 | 安徽理工大学 | Microblog topic public emotion prediction method based on emotion analysis |
CN111460158B (en) * | 2020-04-01 | 2022-09-23 | 安徽理工大学 | Microblog topic public emotion prediction method based on emotion analysis |
CN111611455A (en) * | 2020-05-22 | 2020-09-01 | 安徽理工大学 | User group division method based on user emotional behavior characteristics under microblog hot topics |
Also Published As
Publication number | Publication date |
---|---|
CN106569996B (en) | 2019-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||