CN103530286A - Multi-class sentiment classification method - Google Patents
Multi-class sentiment classification method
- Publication number
- CN103530286A (application number CN201310533688.5A)
- Authority
- CN
- China
- Prior art keywords
- language material
- chinese
- marked
- chinese language
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a multi-class sentiment classification method comprising the following steps: step 1, obtaining a preset number of labeled English texts and unlabeled Chinese texts from a corpus, translating the labeled English texts into Chinese, and labeling the translated texts; step 2, extracting the labeled portion of the Chinese corpus and performing sentiment classification on the remaining unlabeled portion of the Chinese corpus; step 3, calculating the classification accuracy from the number P of correctly classified positive samples in the labeled Chinese corpus, the number N of correctly classified negative samples in the labeled Chinese corpus, and the total number A of samples in the Chinese corpus.
Description
Technical field
The present invention relates to the fields of natural language processing and machine learning, and in particular to a cross-language sentiment classification method.
Background art
With the rapid development of network technology, a large amount of text commenting on people, events, products, and so on has appeared on the Internet, and this mass of opinion information holds great value. In addition, as online media flourish, public opinion monitoring is becoming increasingly important, and governments and institutions urgently need to understand the public's views in order to make sound decisions.
Text sentiment classification refers to analyzing and mining subjective texts posted by users in order to judge the sentiment orientation of a text, that is, whether it expresses a positive (Positive) or a negative (Negative) sentiment. Monolingual text sentiment classification has become a focus of discussion in the field, but there has been little research on sentiment classification across different languages.
Because research on English sentiment classification started early, a large number of mature resources such as sentiment lexicons and corpora are already available. With the rapid development of information technology, texts in other languages, such as Chinese, German, French, and Japanese, are increasingly appearing on the network. These large-scale texts include product reviews, news, blogs, and microblogs, and likewise contain a great deal of valuable information. Building a multilingual sentiment classification system therefore has significant theoretical and practical value.
In view of the above, the present invention provides a cross-language sentiment classification method that approaches the problem from a multilingual perspective and takes full account of the gap between different languages.
For ease of understanding, the main terms used in the present invention are first defined. Sentiment classification (Sentiment Classification): a task of classifying text as positive or negative according to the sentiment polarity it expresses. Cross-language sentiment classification (Multi-class Classification): using resources in a source language to perform sentiment classification on another language. Machine learning classification method (Classification Methods Based on Machine Learning): a statistical learning method for building a classifier, whose input is a vector representing a sample and whose output is the class label of that sample.
Summary of the invention
The present invention provides a cross-language sentiment classification method comprising the following steps:
S1: obtain a preset number of labeled English texts and unlabeled Chinese texts from a corpus, translate the labeled English texts into Chinese, and label the translated texts;
S2: extract the labeled portion of the Chinese corpus, and perform sentiment classification on the remaining unlabeled portion of the Chinese corpus;
S3: calculate the classification accuracy from the number P of correctly classified positive samples in the labeled Chinese corpus, the number N of correctly classified negative samples in the labeled Chinese corpus, and the total number A of samples in the Chinese corpus.
Preferably, in step S1, the English corpus and the Chinese corpus are obtained from a corpus in the electronics domain, and a Chinese test corpus is obtained at the same time.
Preferably, in step S1, a computer is used to translate the labeled English corpus into Chinese, and the translation is labeled.
Preferably, in step S2, a self-training method is used to perform semi-supervised sentiment classification on the remaining unlabeled portion of the Chinese corpus.
Preferably, in the self-training method, a maximum entropy classifier is used to classify the Chinese corpus.
Preferably, in step S3, the classification accuracy is calculated as accuracy = (P + N) / A.
Preferably, the English corpus and the Chinese corpus are corpora of known sentiment polarity.
In the cross-language sentiment classification method provided by the present invention, the labeled English corpus is machine-translated into Chinese and then labeled, and the labeled portion is extracted, so that the labeled English corpus is used effectively. Sentiment classification is then performed on the remaining unlabeled portion of the Chinese corpus, and the classification accuracy is calculated. This effectively addresses the shortage of labeled corpora for Chinese classification and improves Chinese classification performance.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the cross-language sentiment classification method provided by a preferred embodiment of the present invention.
Detailed description
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another.
Fig. 1 is a flowchart of the cross-language sentiment classification method provided by a preferred embodiment of the present invention. As shown in Fig. 1, the cross-language sentiment classification method provided by the preferred embodiment comprises steps S1 to S3.
Step S1: obtain a preset number of labeled English texts and unlabeled Chinese texts from a corpus, translate the labeled English texts into Chinese, and label the translated texts.
Specifically, the English corpus and the Chinese corpus are obtained from a corpus in the electronics domain, and a Chinese test corpus is obtained at the same time. In this embodiment, the Chinese and English data are taken from the electronics domain of the wanxiaojun corpus and comprise a labeled English corpus, an unlabeled Chinese corpus, and a Chinese test corpus. In this embodiment, the sentiment polarity of all English and Chinese texts is known.
Next, a computer is used to translate the labeled English corpus into Chinese, and the translation is labeled. In this embodiment, Google Translate is used to translate the labeled English corpus into a labeled Chinese corpus. Because labeled English corpora are abundant and are updated relatively quickly, translating from English to Chinese helps ensure the scale and timeliness of the collected corpus. Machine translation also makes it possible for the English and Chinese corpora to share information.
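The translate-then-label idea of step S1 can be sketched as follows. This is only an illustrative sketch: `project_labels`, `toy_translate`, and the two-entry dictionary are hypothetical names and data introduced here, and the toy function merely stands in for a real machine translation engine, which the patent does not describe at the API level.

```python
from typing import Callable, List, Tuple

# (text, polarity) pairs: 1 = positive, 0 = negative.
LabeledDoc = Tuple[str, int]


def project_labels(english_docs: List[LabeledDoc],
                   translate: Callable[[str], str]) -> List[LabeledDoc]:
    """Translate each labeled English text and carry its label over unchanged."""
    return [(translate(text), label) for text, label in english_docs]


# Hypothetical stand-in for a machine translation service.
_TOY_DICTIONARY = {"great phone": "很棒的手机", "poor battery": "电池很差"}


def toy_translate(text: str) -> str:
    """Look the text up in the toy dictionary; unknown text is returned as is."""
    return _TOY_DICTIONARY.get(text, text)


english_labeled = [("great phone", 1), ("poor battery", 0)]
chinese_labeled = project_labels(english_labeled, toy_translate)
```

Passing the translator in as a function keeps the labeling logic independent of whichever translation service is actually used.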
Step S2: extract the labeled portion of the Chinese corpus, and perform sentiment classification on the remaining unlabeled portion of the Chinese corpus.
Specifically, apart from the labeled portion, the rest of the Chinese corpus is unlabeled, and semi-supervised sentiment classification is performed on this unlabeled portion using a self-training method. In the self-training method, a maximum entropy classifier is used to classify the Chinese corpus.
The self-training method trains on the Chinese texts translated from the labeled English corpus and applies maximum entropy classification to the unlabeled Chinese texts. Based on the classifier's results on the unlabeled Chinese corpus, the texts classified with high confidence are selected, labeled, and added to the labeled texts, and are at the same time removed from the unlabeled texts. In this embodiment, to ensure classification quality, the number of texts selected per round can be set, and selection is repeated over multiple rounds.
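The self-training loop described above can be sketched roughly as follows. The names `self_train` and `toy_trainer` and the parameters `batch_size` and `rounds` are assumptions made for illustration, and the word-count voter is only a small stand-in so the sketch runs on its own; it is not the maximum entropy classifier used in the embodiment.

```python
from collections import Counter
from typing import Callable, List, Tuple

Doc = List[str]               # a tokenized text
Labeled = Tuple[Doc, int]     # (tokens, label): 1 = positive, 0 = negative
Predictor = Callable[[Doc], Tuple[int, float]]   # text -> (label, confidence)
Trainer = Callable[[List[Labeled]], Predictor]


def self_train(labeled: List[Labeled], unlabeled: List[Doc], train: Trainer,
               batch_size: int, rounds: int):
    """Repeatedly move the most confidently classified texts into the labeled set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        predict = train(labeled)
        scored = [(predict(doc), doc) for doc in pool]
        # Highest confidence first; the sort is stable, so ties keep pool order.
        scored.sort(key=lambda item: item[0][1], reverse=True)
        chosen, rest = scored[:batch_size], scored[batch_size:]
        labeled.extend((doc, label) for (label, _), doc in chosen)
        pool = [doc for _, doc in rest]
    return train(labeled), labeled, pool


def toy_trainer(data: List[Labeled]) -> Predictor:
    """Stand-in classifier: votes by how often each token was seen per class."""
    counts = {0: Counter(), 1: Counter()}
    for tokens, label in data:
        counts[label].update(tokens)

    def predict(doc: Doc) -> Tuple[int, float]:
        pos = 1 + sum(counts[1][t] for t in doc)   # add-one smoothing
        neg = 1 + sum(counts[0][t] for t in doc)
        return (1 if pos >= neg else 0), max(pos, neg) / (pos + neg)

    return predict


seed = [(["good"], 1), (["bad"], 0)]
pool = [["good", "screen"], ["bad", "screen"], ["screen"]]
model, grown, remaining = self_train(seed, pool, toy_trainer, batch_size=2, rounds=1)
```

With the settings reported later in the embodiment, the call would use `batch_size=400` and `rounds=3`.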
The working principle of the maximum entropy classifier is introduced below. Maximum entropy classification is based on the maximum entropy principle of information theory. Its basic idea is to seek the most uniform distribution that satisfies all the conditions currently known to the system: the known facts are taken as constraints, and the probability distribution that maximizes the entropy under those constraints is taken as the correct one.
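In a maximum entropy model, feature functions are usually represented as binary-valued functions, defined in the standard form as follows, where a is a class and b is a text:

$$f_i(a, b) = \begin{cases} 1, & \text{if } (a, b) \text{ satisfies the condition of feature } i \\ 0, & \text{otherwise} \end{cases}$$

Under the maximum entropy model, the predicted conditional probability $p^*(a \mid b)$ is given by:

$$p^*(a \mid b) = \frac{1}{\pi(b)} \exp\Big(\sum_i \lambda_i f_i(a, b)\Big)$$

where $\pi(b) = \sum_{a} \exp\big(\sum_i \lambda_i f_i(a, b)\big)$ is the normalization factor and the $\lambda_i$ are parameters, which can be obtained with the GIS algorithm.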
The advantage of maximum entropy classification is that it does not require the features to be conditionally independent of one another. The method is therefore well suited to combining many different kinds of features without having to account for interactions among them.
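As a small illustration of how a trained maximum entropy model assigns class probabilities, the sketch below assumes the usual log-linear form, in which p*(a|b) is proportional to the exponential of the summed weights λ_i of the binary features that fire for class a and text b. The function name, the word-and-class feature scheme, and the example weights are all hypothetical; estimating the weights (for example with GIS) is not shown.

```python
import math
from typing import Dict, Iterable, List, Tuple

# weights[(word, label)] plays the role of lambda_i for the binary feature
# "word occurs in the text and the class is label" (an assumed feature scheme).
Weights = Dict[Tuple[str, str], float]


def maxent_probabilities(words: Iterable[str], labels: List[str],
                         weights: Weights) -> Dict[str, float]:
    """Return p(label | text) for every label under a log-linear model."""
    present = set(words)   # binary features: a word either occurs or it does not
    scores = {
        label: math.exp(sum(weights.get((w, label), 0.0) for w in present))
        for label in labels
    }
    normalizer = sum(scores.values())   # the normalization factor pi(b)
    return {label: score / normalizer for label, score in scores.items()}


# Hypothetical weights chosen by hand for the example.
example_weights = {("good", "pos"): 1.0, ("good", "neg"): -1.0}
probs = maxent_probabilities(["good", "screen"], ["pos", "neg"], example_weights)
```

A word with no weight for a class contributes nothing to that class's score, so a text made only of unseen words receives a uniform distribution over the classes.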
Step S3: calculate the classification accuracy from the number P of correctly classified positive samples in the labeled Chinese corpus, the number N of correctly classified negative samples in the labeled Chinese corpus, and the total number A of samples in the Chinese corpus.
Specifically, accuracy is a comprehensive evaluation metric for general classification problems, calculated as accuracy = (P + N) / A.
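The accuracy formula can be written as a short function. The function name and the counts in the example are hypothetical and are not results from the patent.

```python
def classification_accuracy(correct_positive: int, correct_negative: int,
                            total: int) -> float:
    """Accuracy = (P + N) / A for correctly classified positives P,
    correctly classified negatives N, and total sample count A."""
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    if correct_positive < 0 or correct_negative < 0:
        raise ValueError("counts must be non-negative")
    if correct_positive + correct_negative > total:
        raise ValueError("correct counts cannot exceed the total")
    return (correct_positive + correct_negative) / total


# Hypothetical counts: P = 40, N = 35, A = 100.
example_accuracy = classification_accuracy(40, 35, 100)
```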
A specific example is used below to illustrate the improvement that the present invention brings over the conventional method.
First, the following are obtained from the wanxiaojun corpus: a labeled English corpus of 1000 positive and 1000 negative texts, an unlabeled Chinese corpus of 963 positive and 757 negative texts, and a Chinese test corpus of 451 positive and 435 negative texts.
Next, following steps S1 to S3 in order, the Chinese corpus is labeled, the unlabeled Chinese texts are classified, and finally the classification accuracy is calculated. The accuracy obtained in this embodiment is compared with that of the baseline method below.
Table 1 compares the accuracy of the baseline method and the method of the present invention. The results in Table 1 for the method of the present invention were obtained in the electronics domain with 400 texts selected per round over 3 rounds, for a total of 1200 selected texts.
Table 1
As can be seen from Table 1, the classification accuracy of the sentiment classification method provided by this embodiment is clearly better than that of the conventional baseline method. It can also be seen that the accuracy of the sentiment classification method provided by the present invention rises gradually as the number of text selection rounds increases.
In summary, in the cross-language sentiment classification method provided by the preferred embodiment of the present invention, the labeled English corpus is machine-translated into Chinese and then labeled, and after the labeled portion is extracted the remainder is processed, so that the abundant English corpus is put to good use. The self-training method is then used to perform sentiment classification on the remaining unlabeled portion of the Chinese corpus, and texts can be selected over multiple rounds to achieve better classification and higher accuracy. This effectively addresses the shortage of labeled corpora for Chinese classification and achieves cross-language sentiment classification.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. The present invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. A cross-language sentiment classification method, characterized by comprising the following steps:
S1: obtaining a preset number of labeled English texts and unlabeled Chinese texts from a corpus, translating the labeled English texts into Chinese, and labeling the translated texts;
S2: extracting the labeled portion of the Chinese corpus, and performing sentiment classification on the remaining unlabeled portion of the Chinese corpus;
S3: calculating the classification accuracy from the number P of correctly classified positive samples in the labeled Chinese corpus, the number N of correctly classified negative samples in the labeled Chinese corpus, and the total number A of samples in the Chinese corpus.
2. The method according to claim 1, characterized in that, in step S1, the English corpus and the Chinese corpus are obtained from a corpus in the electronics domain, and a Chinese test corpus is obtained at the same time.
3. The method according to claim 1, characterized in that, in step S1, a computer is used to translate the labeled English corpus into Chinese, and the translation is labeled.
4. The method according to claim 1, characterized in that, in step S2, a self-training method is used to perform semi-supervised sentiment classification on the remaining unlabeled portion of the Chinese corpus.
5. The method according to claim 4, characterized in that, in the self-training method, a maximum entropy classifier is used to classify the Chinese corpus.
6. The method according to claim 1, characterized in that, in step S3, the classification accuracy is calculated as accuracy = (P + N) / A.
7. The method according to claim 1, characterized in that the English corpus and the Chinese corpus are corpora of known sentiment polarity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310533688.5A CN103530286A (en) | 2013-10-31 | 2013-10-31 | Multi-class sentiment classification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310533688.5A CN103530286A (en) | 2013-10-31 | 2013-10-31 | Multi-class sentiment classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103530286A true CN103530286A (en) | 2014-01-22 |
Family
ID=49932308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310533688.5A Pending CN103530286A (en) | 2013-10-31 | 2013-10-31 | Multi-class sentiment classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103530286A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462229A (en) * | 2014-11-13 | 2015-03-25 | 苏州大学 | Event classification method and device |
CN105740349A (en) * | 2016-01-25 | 2016-07-06 | 重庆邮电大学 | Sentiment classification method capable of combining Doc2vce with convolutional neural network |
CN106202181A (en) * | 2016-06-27 | 2016-12-07 | 苏州大学 | A kind of sensibility classification method, Apparatus and system |
CN106294507A (en) * | 2015-06-10 | 2017-01-04 | 华中师范大学 | Viewpoint data classification method and device across language |
CN106844743A (en) * | 2017-02-14 | 2017-06-13 | 国网新疆电力公司信息通信公司 | The sensibility classification method and device of Uighur text |
CN106897274A (en) * | 2017-01-09 | 2017-06-27 | 北京众荟信息技术股份有限公司 | Method is repeated in a kind of comment across languages |
CN107220293A (en) * | 2017-04-26 | 2017-09-29 | 天津大学 | File classification method based on mood |
CN111125124A (en) * | 2019-11-18 | 2020-05-08 | 云知声智能科技股份有限公司 | Corpus labeling method and apparatus based on big data platform |
CN111897912A (en) * | 2020-07-13 | 2020-11-06 | 上海乐言信息科技有限公司 | Active learning short text classification method and system based on sampling frequency optimization |
CN113657123A (en) * | 2021-07-14 | 2021-11-16 | 内蒙古工业大学 | Mongolian aspect level emotion analysis method based on target template guidance and relation head coding |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002133389A (en) * | 2000-10-26 | 2002-05-10 | Nippon Telegr & Teleph Corp <Ntt> | Data classification learning method, data classification method, data classification learner, data classifier, storage medium with data classification learning program recorded, and recording medium with data classification program recorded |
CN102541838A (en) * | 2010-12-24 | 2012-07-04 | 日电(中国)有限公司 | Method and equipment for optimizing emotional classifier |
CN102831109A (en) * | 2012-08-08 | 2012-12-19 | 中国专利信息中心 | Machine translating device based on intelligent matching and method thereof |
-
2013
- 2013-10-31 CN CN201310533688.5A patent/CN103530286A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002133389A (en) * | 2000-10-26 | 2002-05-10 | Nippon Telegr & Teleph Corp <Ntt> | Data classification learning method, data classification method, data classification learner, data classifier, storage medium with data classification learning program recorded, and recording medium with data classification program recorded |
CN102541838A (en) * | 2010-12-24 | 2012-07-04 | 日电(中国)有限公司 | Method and equipment for optimizing emotional classifier |
CN102831109A (en) * | 2012-08-08 | 2012-12-19 | 中国专利信息中心 | Machine translating device based on intelligent matching and method thereof |
Non-Patent Citations (2)
Title |
---|
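李荣陆 (Li Ronglu) et al.: "Chinese text classification using the maximum entropy model", 《计算机研究与发展》 (Journal of Computer Research and Development) * |
胡亚楠 (Hu Yanan) et al.: "Cross-language relation extraction based on machine translation", 《中文信息学报》 (Journal of Chinese Information Processing) * |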
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462229A (en) * | 2014-11-13 | 2015-03-25 | 苏州大学 | Event classification method and device |
CN106294507A (en) * | 2015-06-10 | 2017-01-04 | 华中师范大学 | Viewpoint data classification method and device across language |
CN105740349A (en) * | 2016-01-25 | 2016-07-06 | 重庆邮电大学 | Sentiment classification method capable of combining Doc2vce with convolutional neural network |
CN105740349B (en) * | 2016-01-25 | 2019-03-08 | 重庆邮电大学 | A kind of sensibility classification method of combination Doc2vec and convolutional neural networks |
CN106202181A (en) * | 2016-06-27 | 2016-12-07 | 苏州大学 | A kind of sensibility classification method, Apparatus and system |
CN106897274A (en) * | 2017-01-09 | 2017-06-27 | 北京众荟信息技术股份有限公司 | Method is repeated in a kind of comment across languages |
CN106844743B (en) * | 2017-02-14 | 2020-04-24 | 国网新疆电力公司信息通信公司 | Emotion classification method and device for Uygur language text |
CN106844743A (en) * | 2017-02-14 | 2017-06-13 | 国网新疆电力公司信息通信公司 | The sensibility classification method and device of Uighur text |
CN107220293A (en) * | 2017-04-26 | 2017-09-29 | 天津大学 | File classification method based on mood |
CN107220293B (en) * | 2017-04-26 | 2020-08-18 | 天津大学 | Emotion-based text classification method |
CN111125124A (en) * | 2019-11-18 | 2020-05-08 | 云知声智能科技股份有限公司 | Corpus labeling method and apparatus based on big data platform |
CN111125124B (en) * | 2019-11-18 | 2023-04-25 | 云知声智能科技股份有限公司 | Corpus labeling method and device based on big data platform |
CN111897912A (en) * | 2020-07-13 | 2020-11-06 | 上海乐言信息科技有限公司 | Active learning short text classification method and system based on sampling frequency optimization |
CN111897912B (en) * | 2020-07-13 | 2021-04-06 | 上海乐言科技股份有限公司 | Active learning short text classification method and system based on sampling frequency optimization |
CN113657123A (en) * | 2021-07-14 | 2021-11-16 | 内蒙古工业大学 | Mongolian aspect level emotion analysis method based on target template guidance and relation head coding |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103530286A (en) | Multi-class sentiment classification method | |
CN106776581B (en) | Subjective text emotion analysis method based on deep learning | |
CN108804417B (en) | Document-level emotion analysis method based on specific field emotion words | |
De Choudhury et al. | Happy, nervous or surprised? classification of human affective states in social media | |
El-Masri et al. | A web-based tool for Arabic sentiment analysis | |
CN104268160B (en) | A kind of OpinionTargetsExtraction Identification method based on domain lexicon and semantic role | |
CN106919673A (en) | Text mood analysis system based on deep learning | |
CN108664474B (en) | Resume analysis method based on deep learning | |
CN107609132A (en) | One kind is based on Ontology storehouse Chinese text sentiment analysis method | |
CN106202584A (en) | A kind of microblog emotional based on standard dictionary and semantic rule analyzes method | |
CN104199845B (en) | Line Evaluation based on agent model discusses sensibility classification method | |
Ansari et al. | Sentiment analysis of mixed code for the transliterated hindi and marathi texts | |
Ljubešić et al. | Predicting the level of text standardness in user-generated content | |
CN104346326A (en) | Method and device for determining emotional characteristics of emotional texts | |
CN104317965A (en) | Establishment method of emotion dictionary based on linguistic data | |
CN110134934A (en) | Text emotion analysis method and device | |
CN104504087A (en) | Low-rank decomposition based delicate topic mining method | |
CN106569996B (en) | A kind of Sentiment orientation analysis method towards Chinese microblogging | |
Stavrianou et al. | NLP-based feature extraction for automated tweet classification | |
CN104573030A (en) | Textual emotion prediction method and device | |
Sapkota et al. | Domain adaptation for authorship attribution: Improved structural correspondence learning | |
CN110321434A (en) | A kind of file classification method based on word sense disambiguation convolutional neural networks | |
Chen et al. | Chinese Weibo sentiment analysis based on character embedding with dual-channel convolutional neural network | |
CN107220293A (en) | File classification method based on mood | |
CN104182463A (en) | Semantic-based text classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20140122 |