CN103530286A - Multi-class sentiment classification method - Google Patents


Info

Publication number
CN103530286A
Authority
CN
China
Prior art keywords
language material
chinese
marked
chinese language
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310533688.5A
Other languages
Chinese (zh)
Inventor
李寿山
汪蓉
周国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201310533688.5A
Publication of CN103530286A
Legal status: Pending

Abstract

The invention provides a multi-class sentiment classification method comprising the following steps: step 1, obtaining a preset number of labeled English corpus samples and unlabeled Chinese corpus samples from a corpus, translating the labeled English samples into Chinese, and labeling the translations; step 2, extracting the labeled portion from the Chinese corpus and performing sentiment classification on the remaining unlabeled portion of the Chinese corpus; step 3, calculating the classification accuracy from the number P of correctly classified positive samples in the labeled Chinese corpus, the number N of correctly classified negative samples in the labeled Chinese corpus, and the total number A of samples in the Chinese corpus.

Description

A cross-language sentiment classification method
Technical field
The present invention relates to the fields of natural language processing and machine learning, and in particular to a cross-language sentiment classification method.
Background
With the rapid development of network technology, the Internet has produced a large volume of text commenting on people, events, products, and the like, and this mass of opinion information holds considerable value. In addition, as online media grow, public opinion monitoring is becoming increasingly important: governments and institutions urgently need to understand the public's views in order to make sound decisions.
Text sentiment classification analyzes and mines the subjective text that users publish in order to judge its sentiment orientation, that is, whether it expresses a positive (Positive) or a negative (Negative) sentiment. Monolingual text sentiment classification has become a widely discussed topic in the field, but research on sentiment classification across different languages remains scarce.
Because research on English sentiment classification started early, a large number of mature resources, such as sentiment lexicons and labeled corpora, already exist for English. With the rapid development of information technology, text in other languages, for example Chinese, German, French, and Japanese, is increasingly appearing online. These large-scale texts include product reviews, news, blogs, and microblogs, and likewise contain a great deal of valuable information. Building a multilingual sentiment classification system therefore has significant theoretical and practical value.
In view of the above, the invention provides a cross-language sentiment classification method that takes a multilingual perspective and fully accounts for the gap between different languages.
For ease of understanding, the main terms used in the invention are defined first. Sentiment classification (Sentiment Classification) is the task of classifying text as commendatory or derogatory according to the sentiment polarity it expresses. Cross-language sentiment classification (Multi-class Classification) refers to using a source language to perform sentiment classification on another language. Machine-learning classification methods (Classification Methods Based on Machine Learning) are statistical learning methods for building a classifier whose input is a vector representing a sample and whose output is the sample's class label.
Summary of the invention
The invention provides a cross-language sentiment classification method comprising the following steps:
S1, obtaining from a corpus a preset number of labeled English corpus samples and unlabeled Chinese corpus samples, translating the labeled English samples into Chinese, and labeling the translations;
S2, extracting the labeled portion from the Chinese corpus, and performing sentiment classification on the remaining unlabeled portion of the Chinese corpus;
S3, calculating the classification accuracy from the number P of correctly classified positive samples in the labeled Chinese corpus, the number N of correctly classified negative samples in the labeled Chinese corpus, and the total number A of samples in the Chinese corpus.
Preferably, in step S1, the English corpus and the Chinese corpus are obtained from a corpus in the electronics domain, and a Chinese test corpus is obtained at the same time.
Preferably, in step S1, a computer is used to translate the labeled English corpus into Chinese, and the translations are labeled.
Preferably, in step S2, a self-training method is used to perform semi-supervised sentiment classification on the remaining unlabeled portion of the Chinese corpus.
Preferably, in the self-training method, a maximum entropy classifier is used to classify the Chinese corpus.
Preferably, in step S3, the classification accuracy is calculated as accuracy = (P + N) / A.
Preferably, the English corpus and the Chinese corpus are corpora of known sentiment polarity.
According to the cross-language sentiment classification method provided by the invention, the labeled English corpus is machine-translated into Chinese and labeled, and the labeled portion is extracted, so that the labeled English corpus is used effectively. The remaining unlabeled portion of the Chinese corpus is then classified by sentiment, and the classification accuracy is calculated. This addresses the shortage of labeled corpora in Chinese classification and improves Chinese classification performance.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the cross-language sentiment classification method provided by a preferred embodiment of the invention.
Detailed description
The invention is described in detail below with reference to the drawings and in conjunction with embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another.
Fig. 1 is a flowchart of the cross-language sentiment classification method provided by a preferred embodiment of the invention. As shown in Fig. 1, the method comprises steps S1 to S3.
Step S1: obtain from a corpus a preset number of labeled English corpus samples and unlabeled Chinese corpus samples, translate the labeled English samples into Chinese, and label the translations.
Specifically, the English corpus and the Chinese corpus are obtained from a corpus in the electronics domain, and a Chinese test corpus is obtained at the same time. In this embodiment, the Chinese and English corpora are taken from the electronics domain of the wanxiaojun corpus and comprise the labeled English corpus, the unlabeled Chinese corpus, and the Chinese test corpus. In this embodiment, all English and Chinese corpora have known sentiment polarity.
Next, a computer is used to translate the labeled English corpus into Chinese, and the translations are labeled. In this embodiment, Google Translate is used to translate the labeled English corpus into a labeled Chinese corpus. Because labeled English resources are abundant and updated relatively quickly, converting from English to Chinese helps ensure the scale and timeliness of the collected corpus. Machine translation also makes it possible for the English and Chinese corpora to exchange information.
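Step S1 can be read as label projection: each English text is translated and its existing polarity label is carried over to the translation. The following is a minimal sketch under that reading. The names `project_labels` and `toy_translate`, the lookup table, and the sample sentences are hypothetical stand-ins and are not part of the source; a real system would call a machine translation service in place of `toy_translate`.

```python
from typing import Callable, List, Tuple

LabeledText = Tuple[str, str]  # (text, polarity label)


def project_labels(
    english_labeled: List[LabeledText],
    translate: Callable[[str], str],
) -> List[LabeledText]:
    """Translate each labeled English text and keep its original polarity label."""
    return [(translate(text), label) for text, label in english_labeled]


# Hypothetical lookup table standing in for a machine translation service.
_TOY_DICTIONARY = {
    "great battery life": "电池续航很好",
    "the screen broke quickly": "屏幕很快就坏了",
}


def toy_translate(text: str) -> str:
    """Return the canned translation of a known sentence (illustration only)."""
    return _TOY_DICTIONARY[text]


english = [
    ("great battery life", "positive"),
    ("the screen broke quickly", "negative"),
]
chinese_labeled = project_labels(english, toy_translate)
```

Passing the translator in as a callable keeps the sketch independent of any particular translation service.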
Step S2: extract the labeled portion from the Chinese corpus, and perform sentiment classification on the remaining unlabeled portion of the Chinese corpus.
Specifically, the labeled portion of the Chinese corpus is extracted; the rest is the unlabeled portion, which undergoes semi-supervised sentiment classification by the self-training method. In the self-training method, a maximum entropy classifier is used to classify the Chinese corpus.
The self-training method uses the Chinese text translated from the labeled English corpus to perform maximum entropy classification on the unlabeled Chinese text. Based on the classifier's results on the unlabeled Chinese corpus, the predictions with high confidence are selected, labeled, and added to the labeled text, and the corresponding texts are removed from the unlabeled text. In this embodiment, to ensure good classification performance, the number of texts selected per round can be set, and the selection is repeated over multiple rounds.
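The selection loop described above can be sketched as follows. This is an illustrative sketch, not the implementation used in the embodiment: `self_train`, `fit_toy`, and the whitespace-tokenized sample texts are hypothetical, and a simple word-count scorer stands in for the maximum entropy classifier so that the example needs only the standard library.

```python
from collections import Counter
from typing import Callable, List, Tuple

Sample = Tuple[str, str]                        # (text, label)
Predictor = Callable[[str], Tuple[str, float]]  # text -> (label, confidence)


def self_train(
    labeled: List[Sample],
    unlabeled: List[str],
    fit: Callable[[List[Sample]], Predictor],
    per_round: int,
    rounds: int,
) -> Tuple[List[Sample], List[str]]:
    """Repeatedly move the most confident predictions into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        predict = fit(labeled)  # retrain on the current labeled set
        predictions = [predict(text) for text in pool]
        # Rank pool indices by confidence, highest first.
        ranked = sorted(range(len(pool)), key=lambda i: predictions[i][1], reverse=True)
        chosen = ranked[:per_round]
        labeled.extend((pool[i], predictions[i][0]) for i in chosen)
        chosen_set = set(chosen)
        pool = [text for i, text in enumerate(pool) if i not in chosen_set]
    return labeled, pool


def fit_toy(labeled: List[Sample]) -> Predictor:
    """Toy stand-in classifier: scores a text by per-class word counts."""
    counts = {"positive": Counter(), "negative": Counter()}
    for text, label in labeled:
        counts[label].update(text.split())

    def predict(text: str) -> Tuple[str, float]:
        pos = sum(counts["positive"][w] for w in text.split())
        neg = sum(counts["negative"][w] for w in text.split())
        if pos + neg == 0:
            return "positive", 0.5  # no evidence either way
        p_pos = pos / (pos + neg)
        return ("positive", p_pos) if p_pos >= 0.5 else ("negative", 1.0 - p_pos)

    return predict


seed = [("good fast", "positive"), ("bad slow", "negative")]
pool = ["good screen", "bad battery", "slow battery", "fast screen"]
augmented, remaining = self_train(seed, pool, fit_toy, per_round=2, rounds=2)
```

In this sketch `per_round` and `rounds` play the role of the per-round selection size and the number of selection rounds mentioned above; the classifier is retrained at the start of every round so that newly added texts influence later selections.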
The working principle of the maximum entropy classifier is introduced below. Maximum entropy classification is based on the maximum entropy principle of information theory. Its basic idea is to seek, under all the conditions currently given to the system, the most uniform distribution that satisfies them: the known facts are taken as constraints, and the probability distribution that maximizes entropy is taken as the correct one.
In a maximum entropy model, the feature functions are usually binary-valued and are defined as follows:
$$f_i(a, b) = \begin{cases} 1, & \text{if } b \in a \\ 0, & \text{otherwise} \end{cases}$$
Under the maximum entropy model, the predicted conditional probability $p^*(a \mid b)$ is given by:
$$p^*(a \mid b) = \frac{1}{\pi(b)} \exp\left( \sum_{i=1}^{k} \lambda_i f_i(a, b) \right)$$
where $\pi(b)$ is the normalization factor and the $\lambda_i$ are parameters, which can be obtained with the GIS algorithm.
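The source names the normalization factor and the GIS algorithm without writing them out. For reference, the normalization factor that makes the probabilities sum to one over all classes, and the update rule of Generalized Iterative Scaling in its standard textbook form, can be written as below. The symbols $a'$, $\tilde{p}$ (the empirical distribution of the training data), $p^{(t)}$ (the model after iteration $t$), and $C$ are introduced here for illustration and do not appear in the source; the update assumes that the feature values sum to the constant $C$ for every pair $(a, b)$, which is usually arranged by adding a correction feature.

```latex
\pi(b) = \sum_{a'} \exp\left( \sum_{i=1}^{k} \lambda_i f_i(a', b) \right)

\lambda_i^{(t+1)} = \lambda_i^{(t)} + \frac{1}{C} \log \frac{E_{\tilde{p}}[f_i]}{E_{p^{(t)}}[f_i]},
\qquad C = \sum_{i=1}^{k} f_i(a, b) \ \text{for all } (a, b)
```

Each iteration moves $\lambda_i$ so that the model's expected value of $f_i$ approaches its empirical expected value.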
An advantage of maximum entropy classification is that it does not require features to be conditionally independent of one another. The method is therefore well suited to combining many different kinds of features without having to consider how they affect each other.
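To make the conditional probability concrete, the sketch below evaluates the exponential form of the maximum entropy model for a two-class toy case. The feature functions, the weights, and the token list are hypothetical values chosen for illustration; in practice the weights would come from training, for example with GIS, and the features would be extracted from the corpus.

```python
import math
from typing import Callable, List, Sequence

# A feature function f_i(a, b): a is a candidate class, b is the context (tokens).
Feature = Callable[[str, Sequence[str]], int]


def maxent_probability(
    label: str,
    context: Sequence[str],
    labels: List[str],
    features: List[Feature],
    weights: List[float],
) -> float:
    """Return exp(sum_i lambda_i * f_i(label, context)) divided by the normalizer."""

    def unnormalized(a: str) -> float:
        return math.exp(sum(w * f(a, context) for f, w in zip(features, weights)))

    normalizer = sum(unnormalized(a) for a in labels)  # the normalization factor
    return unnormalized(label) / normalizer


def good_with_positive(a: str, b: Sequence[str]) -> int:
    """Hypothetical binary feature: fires for class 'positive' when 'good' occurs."""
    return int(a == "positive" and "good" in b)


def bad_with_negative(a: str, b: Sequence[str]) -> int:
    """Hypothetical binary feature: fires for class 'negative' when 'bad' occurs."""
    return int(a == "negative" and "bad" in b)


LABELS = ["positive", "negative"]
FEATURES = [good_with_positive, bad_with_negative]
WEIGHTS = [math.log(3.0), math.log(3.0)]  # illustrative weights, not trained values

tokens = ["good", "screen"]
p_positive = maxent_probability("positive", tokens, LABELS, FEATURES, WEIGHTS)
p_negative = maxent_probability("negative", tokens, LABELS, FEATURES, WEIGHTS)
```

With these illustrative weights only the first feature fires for the positive class, so the unnormalized scores are about 3 and 1 and the positive class receives a probability of about 0.75.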
Step S3: calculate the classification accuracy from the number P of correctly classified positive samples in the labeled Chinese corpus, the number N of correctly classified negative samples in the labeled Chinese corpus, and the total number A of samples in the Chinese corpus.
Specifically, accuracy is a comprehensive evaluation criterion for general classification problems, and it is computed as accuracy = (P + N) / A.
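The accuracy formula translates directly into code. In the short sketch below, the function name, the input checks, and the sample counts are hypothetical and are not results from the source.

```python
def classification_accuracy(correct_positive: int, correct_negative: int, total: int) -> float:
    """Compute accuracy = (P + N) / A from the counts defined in step S3."""
    if total <= 0:
        raise ValueError("total must be positive")
    if correct_positive + correct_negative > total:
        raise ValueError("correct counts cannot exceed the total")
    return (correct_positive + correct_negative) / total


# Hypothetical counts for illustration: P = 40, N = 35, A = 100.
example_accuracy = classification_accuracy(40, 35, 100)
```

For these hypothetical counts the result is 0.75, that is, 75 of 100 samples classified correctly.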
A specific example below illustrates the improvement the invention brings over the conventional method.
First, from the wanxiaojun corpus, 1000 positive and 1000 negative labeled English samples, 963 positive and 757 negative unlabeled Chinese samples, and 451 positive and 435 negative Chinese test samples are obtained.
Next, steps S1 to S3 are carried out in sequence: the Chinese corpus is labeled, the unlabeled Chinese corpus is classified, and the classification accuracy is finally calculated. The accuracy obtained in this embodiment is compared with that of the baseline method below.
Table 1 compares the accuracy of the baseline method and the method of the invention. It shows the experimental results of the method of the invention in the electronics domain, with 400 texts selected per round over 3 rounds, for a total of 1200 texts.
[Image BDA0000406209990000051: accuracy comparison of the baseline method and the method of the invention]
Table 1
As Table 1 shows, with the sentiment classification method of this embodiment, classification accuracy is clearly better than that of the conventional baseline method. The accuracy of the method also rises gradually as the number of text selection rounds increases.
In summary, according to the cross-language sentiment classification method provided by the preferred embodiment of the invention, the labeled English corpus is machine-translated into Chinese and labeled, and the remainder is processed after the labeled portion is extracted, which makes sound use of the abundant English corpus. A self-training method is then used to classify the remaining unlabeled portion of the Chinese corpus by sentiment, and texts can be selected over multiple rounds to achieve better classification performance and higher classification accuracy. This addresses the shortage of labeled corpora in Chinese classification and achieves cross-language sentiment classification.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. The invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A cross-language sentiment classification method, characterized in that it comprises the following steps:
S1, obtaining from a corpus a preset number of labeled English corpus samples and unlabeled Chinese corpus samples, translating the labeled English samples into Chinese, and labeling the translations;
S2, extracting the labeled portion from the Chinese corpus, and performing sentiment classification on the remaining unlabeled portion of the Chinese corpus;
S3, calculating the classification accuracy from the number P of correctly classified positive samples in the labeled Chinese corpus, the number N of correctly classified negative samples in the labeled Chinese corpus, and the total number A of samples in the Chinese corpus.
2. The method according to claim 1, characterized in that, in step S1, the English corpus and the Chinese corpus are obtained from a corpus in the electronics domain, and a Chinese test corpus is obtained at the same time.
3. The method according to claim 1, characterized in that, in step S1, a computer is used to translate the labeled English corpus into Chinese, and the translations are labeled.
4. The method according to claim 1, characterized in that, in step S2, a self-training method is used to perform semi-supervised sentiment classification on the remaining unlabeled portion of the Chinese corpus.
5. The method according to claim 4, characterized in that, in the self-training method, a maximum entropy classifier is used to classify the Chinese corpus.
6. The method according to claim 1, characterized in that, in step S3, the classification accuracy is calculated as accuracy = (P + N) / A.
7. The method according to claim 1, characterized in that the English corpus and the Chinese corpus are corpora of known sentiment polarity.
CN201310533688.5A 2013-10-31 2013-10-31 Multi-class sentiment classification method Pending CN103530286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310533688.5A CN103530286A (en) 2013-10-31 2013-10-31 Multi-class sentiment classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310533688.5A CN103530286A (en) 2013-10-31 2013-10-31 Multi-class sentiment classification method

Publications (1)

Publication Number Publication Date
CN103530286A true CN103530286A (en) 2014-01-22

Family

ID=49932308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310533688.5A Pending CN103530286A (en) 2013-10-31 2013-10-31 Multi-class sentiment classification method

Country Status (1)

Country Link
CN (1) CN103530286A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002133389A (en) * 2000-10-26 2002-05-10 Nippon Telegr & Teleph Corp <Ntt> Data classification learning method, data classification method, data classification learner, data classifier, storage medium with data classification learning program recorded, and recording medium with data classification program recorded
CN102541838A (en) * 2010-12-24 2012-07-04 日电(中国)有限公司 Method and equipment for optimizing emotional classifier
CN102831109A (en) * 2012-08-08 2012-12-19 中国专利信息中心 Machine translating device based on intelligent matching and method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李荣陆 等: "使用最大熵模型进行中文文本分类", 《计算机研究与发展》 *
胡亚楠 等: ""基于机器翻译的跨语言关系抽取"", 《中文信息学报》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462229A (en) * 2014-11-13 2015-03-25 苏州大学 Event classification method and device
CN106294507A (en) * 2015-06-10 2017-01-04 华中师范大学 Viewpoint data classification method and device across language
CN105740349A (en) * 2016-01-25 2016-07-06 重庆邮电大学 Sentiment classification method capable of combining Doc2vce with convolutional neural network
CN105740349B (en) * 2016-01-25 2019-03-08 重庆邮电大学 A kind of sensibility classification method of combination Doc2vec and convolutional neural networks
CN106202181A (en) * 2016-06-27 2016-12-07 苏州大学 A kind of sensibility classification method, Apparatus and system
CN106897274A (en) * 2017-01-09 2017-06-27 北京众荟信息技术股份有限公司 Method is repeated in a kind of comment across languages
CN106844743B (en) * 2017-02-14 2020-04-24 国网新疆电力公司信息通信公司 Emotion classification method and device for Uygur language text
CN106844743A (en) * 2017-02-14 2017-06-13 国网新疆电力公司信息通信公司 The sensibility classification method and device of Uighur text
CN107220293A (en) * 2017-04-26 2017-09-29 天津大学 File classification method based on mood
CN107220293B (en) * 2017-04-26 2020-08-18 天津大学 Emotion-based text classification method
CN111125124A (en) * 2019-11-18 2020-05-08 云知声智能科技股份有限公司 Corpus labeling method and apparatus based on big data platform
CN111125124B (en) * 2019-11-18 2023-04-25 云知声智能科技股份有限公司 Corpus labeling method and device based on big data platform
CN111897912A (en) * 2020-07-13 2020-11-06 上海乐言信息科技有限公司 Active learning short text classification method and system based on sampling frequency optimization
CN111897912B (en) * 2020-07-13 2021-04-06 上海乐言科技股份有限公司 Active learning short text classification method and system based on sampling frequency optimization
CN113657123A (en) * 2021-07-14 2021-11-16 内蒙古工业大学 Mongolian aspect level emotion analysis method based on target template guidance and relation head coding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140122

RJ01 Rejection of invention patent application after publication