CN103020249A - Classifier construction method and device as well as Chinese text sentiment classification method and system - Google Patents

Info

Publication number
CN103020249A
Authority
CN
China
Prior art keywords
emotional
labeled
polarity
sample
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012105564463A
Other languages
Chinese (zh)
Inventor
李寿山
张小倩
周国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN2012105564463A priority Critical patent/CN103020249A/en
Publication of CN103020249A publication Critical patent/CN103020249A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a classifier construction method and device, and a Chinese text sentiment classification method and system. The classification method comprises the following steps: obtaining a sample to be labeled from a sample set to be labeled; looking up the sentiment words in the sample to be labeled; obtaining the sentiment polarity of each sentiment word; converting the sentiment polarity of those sentiment words in the sample that satisfy a sentiment polarity conversion rule; counting the positive sentiment words and the negative sentiment words in the sample to be labeled; determining the sentiment polarity of the sample from these two counts to obtain a labeled sample; labeling the other samples in the sample set according to the labeled sample to obtain a labeled sample set; constructing a maximum entropy classifier from the labeled sample set; and classifying the Chinese text to be classified with the maximum entropy classifier. The method, device and system shorten Chinese text classification time and improve classification accuracy.

Description

Classifier construction method and device, Chinese text sentiment classification method and system

Technical field

The invention relates to the technical field of natural language processing and pattern recognition, and in particular to a method and device for constructing a classifier and a method and system for sentiment classification of Chinese text.

Background art

With the rapid development of Web 2.0, the Internet has accumulated a large volume of emotionally colored comments from the public on people, events, products and so on. By browsing these comments, users can learn how public opinion views a given event or product. Because the volume of comments is large, collecting and organizing them purely by hand wastes a great deal of time and effort. There is therefore an urgent need to use computers to help users quickly obtain and organize these comments, and text sentiment analysis technology emerged in response.

Text sentiment analysis is the process of using computers to help users quickly acquire, organize and analyze comment information, and to analyze, process, summarize and reason about subjective text that carries sentiment. Text sentiment classification is a basic task of text sentiment analysis and, by granularity, can be divided into sentence level, document level and so on. At the sentence and document levels, text sentiment classification means dividing texts into positive texts and negative texts. For example, "我很喜欢这个产品" ("I like this product very much") is classified as positive text, while "这本书实在太差了" ("This book is really terrible") is classified as negative text.

At present, the commonly used text sentiment classification methods are supervised: they train a domain-specific classifier on labeled data from that domain. Although such methods achieve good classification results, they require a large manually annotated corpus, so constructing the classifier takes a long time. Moreover, moving to a different domain requires annotating a new corpus, that is, the methods are strongly domain dependent.

Summary of the invention

In view of this, the present invention provides a method and device for constructing a classifier and a method and system for sentiment classification of Chinese text, to solve the problems that existing classification methods take a long time to construct a classifier and depend heavily on the application domain. The technical solution is as follows:

A method for constructing a classifier, comprising:

obtaining a sample set to be labeled and obtaining one sample to be labeled from the sample set to be labeled, wherein the sample set to be labeled includes at least two samples to be labeled;

looking up the sentiment words in the sample to be labeled and obtaining the sentiment polarity of each sentiment word, wherein the sentiment polarity includes positive and negative;

converting the sentiment polarity of those sentiment words in the sample to be labeled that satisfy a sentiment polarity conversion rule;

counting the number of sentiment words whose sentiment polarity is positive and the number of sentiment words whose sentiment polarity is negative in the sample to be labeled;

determining the sentiment polarity of the sample to be labeled according to the number of positive sentiment words and the number of negative sentiment words, to obtain a labeled sample;

labeling the other samples to be labeled in the sample set to be labeled by a self-learning method according to the labeled sample, to obtain a labeled sample set;

constructing a maximum entropy classifier using the labeled samples in the labeled sample set.

Preferably, converting the sentiment polarity of those sentiment words in the sample to be labeled that satisfy the sentiment polarity conversion rule comprises:

if a negation keyword appears in the sentence containing a sentiment word in the sample to be labeled, converting the sentiment polarity of that sentiment word;

if a transition keyword appears in the sentence or paragraph following the sentence containing a sentiment word in the sample to be labeled, converting the sentiment polarity of that sentiment word;

and/or, if a modal keyword appears in the sentence containing a sentiment word in the sample to be labeled, converting the sentiment polarity of that sentiment word.

Preferably, determining the sentiment polarity of the sample to be labeled according to the number of positive sentiment words and the number of negative sentiment words comprises:

if the number of positive sentiment words minus the number of negative sentiment words is greater than a set threshold, determining that the sentiment polarity of the sample to be labeled is positive;

if the number of negative sentiment words minus the number of positive sentiment words is greater than the set threshold, determining that the sentiment polarity of the sample to be labeled is negative.

Preferably, labeling the other samples to be labeled in the sample set to be labeled by the self-learning method according to the labeled sample, to obtain the labeled sample set, comprises:

constructing a maximum entropy classifier using the labeled sample;

classifying the other samples to be labeled in the sample set to be labeled with the maximum entropy classifier to obtain classification results;

determining the sentiment polarity of each sample to be labeled according to the classification results, to obtain the labeled sample set.

A Chinese text sentiment classification method, comprising the above method for constructing a classifier, and further comprising:

classifying the Chinese text to be classified using the constructed maximum entropy classifier.

A device for constructing a classifier, comprising: an acquisition unit, a search unit, a polarity conversion unit, a statistical unit, a determination unit, a self-learning unit and a classifier construction unit;

the acquisition unit is configured to obtain a sample set to be labeled and obtain one sample to be labeled from the sample set to be labeled, wherein the sample set to be labeled includes at least two samples to be labeled;

the search unit is configured to look up the sentiment words in the sample to be labeled and obtain the sentiment polarity of each sentiment word, wherein the sentiment polarity includes positive and negative;

the polarity conversion unit is configured to convert the sentiment polarity of those sentiment words in the sample to be labeled that satisfy a sentiment polarity conversion rule;

the statistical unit is configured to count the number of sentiment words whose sentiment polarity is positive and the number of sentiment words whose sentiment polarity is negative in the sample to be labeled;

the determination unit is configured to determine the sentiment polarity of the sample to be labeled according to the number of positive sentiment words and the number of negative sentiment words, to obtain a labeled sample;

the self-learning unit is configured to label the other samples to be labeled in the sample set to be labeled by a self-learning method according to the labeled sample, to obtain a labeled sample set;

the classifier construction unit is configured to construct a maximum entropy classifier using the labeled samples in the labeled sample set.

Preferably, the polarity conversion unit includes: a first polarity conversion subunit, a second polarity conversion subunit and/or a third polarity conversion subunit;

the first polarity conversion subunit is configured to convert the sentiment polarity of a sentiment word when a negation keyword appears in the sentence containing that sentiment word in the sample to be labeled;

the second polarity conversion subunit is configured to convert the sentiment polarity of a sentiment word when a transition keyword appears in the sentence or paragraph following the sentence containing that sentiment word in the sample to be labeled;

the third polarity conversion subunit is configured to convert the sentiment polarity of a sentiment word when a modal keyword appears in the sentence containing that sentiment word in the sample to be labeled.

Preferably, the determination unit includes: a first determination subunit and a second determination subunit;

the first determination subunit is configured to determine that the sentiment polarity of the sample to be labeled is positive when the number of positive sentiment words minus the number of negative sentiment words is greater than a set threshold;

the second determination subunit is configured to determine that the sentiment polarity of the sample to be labeled is negative when the number of negative sentiment words minus the number of positive sentiment words is greater than the set threshold.

Preferably, the self-learning unit includes: a classifier construction subunit, a classification subunit and a third determination subunit;

the classifier construction subunit is configured to construct a maximum entropy classifier using the labeled sample;

the classification subunit is configured to classify the other samples to be labeled in the sample set to be labeled with the maximum entropy classifier, to obtain classification results;

the third determination subunit is configured to determine the sentiment polarity of each sample to be labeled according to the classification results.

A Chinese text sentiment classification system, comprising the above device for constructing a classifier, and further comprising: a classification unit;

the classification unit is configured to classify the Chinese text to be classified using the maximum entropy classifier constructed by the device for constructing a classifier.

The classifier construction method and device and the Chinese text sentiment classification method and system provided by the present invention apply sentiment polarity conversion rules to convert the polarity of sentiment words, label the other samples to be labeled in the sample set by a self-learning method based on the labeled sample, and use the maximum entropy classifier built from the labeled samples of the labeled sample set as the classifier for Chinese text sentiment classification. This avoids the labor cost of manually labeling training samples, shortens the time needed to construct a classifier for Chinese text sentiment classification, and at the same time improves the accuracy of Chinese text sentiment classification.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of the method for constructing a classifier provided by an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of the Chinese text sentiment classification system provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

The present invention provides a method for constructing a classifier. FIG. 1 is a flowchart of the method, which may include:

S101: Obtain a sample set to be labeled and obtain one sample to be labeled from the sample set to be labeled, where the sample set to be labeled includes at least two samples to be labeled.

S102: Look up the sentiment words in the sample to be labeled and obtain the sentiment polarity of each sentiment word, where the sentiment polarity of a sentiment word includes positive and negative.

S103: Convert the sentiment polarity of those sentiment words in the sample to be labeled that satisfy a sentiment polarity conversion rule.

S104: Count the number of sentiment words whose sentiment polarity is positive and the number of sentiment words whose sentiment polarity is negative in the sample to be labeled.

S105: Determine the sentiment polarity of the sample to be labeled according to the number of positive sentiment words and the number of negative sentiment words, to obtain a labeled sample.

S106: Label the other samples to be labeled in the sample set to be labeled by a self-learning method according to the labeled sample, to obtain a labeled sample set that includes all the labeled samples.

S107: Construct a maximum entropy classifier using the labeled samples in the labeled sample set.

In another embodiment of the present invention, step S102 may include: looking up sentiment words in the sample to be labeled against a preset table that maps sentiment words to sentiment polarities, and obtaining the sentiment polarity of each sentiment word from that table. Table 1 gives one such table. This embodiment is not limited to the sentiment words listed in Table 1; other sentiment words may also be used.

Table 1

Sentiment polarity    Sentiment words
Positive              喜欢 (like), 乐意 (willing), 满意 (satisfied), 好 (good), 很好 (very good)
Negative              讨厌 (hate), 厌烦 (fed up), 伤心 (sad), 坏 (bad)
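The table lookup in step S102 can be sketched in Python as follows. This is only an illustration under stated assumptions: the lexicon contains just the Table 1 entries, the function name `find_sentiment_words` is hypothetical, and matching is done by greedy longest-match scanning of the raw string, because the source does not say how words are segmented or matched.

```python
# Lexicon limited to the entries of Table 1 (a real lexicon would be much larger).
SENTIMENT_LEXICON = {
    "喜欢": "positive", "乐意": "positive", "满意": "positive",
    "好": "positive", "很好": "positive",
    "讨厌": "negative", "厌烦": "negative", "伤心": "negative", "坏": "negative",
}


def find_sentiment_words(text, lexicon=SENTIMENT_LEXICON):
    """Return (word, polarity, position) for each lexicon word found in text.

    Assumption: greedy longest-match scanning, so "很好" is reported once
    and is not additionally counted as "好".
    """
    words = sorted(lexicon, key=len, reverse=True)  # try longer words first
    hits = []
    i = 0
    while i < len(text):
        for word in words:
            if text.startswith(word, i):
                hits.append((word, lexicon[word], i))
                i += len(word)
                break
        else:
            i += 1  # no lexicon word starts at this position
    return hits
```

For instance, `find_sentiment_words("我很喜欢这个产品")` finds the single sentiment word "喜欢" with positive polarity.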

In a further embodiment of the present invention, the sentiment polarity conversion rules may include a negation rule, a transition rule and/or a modal rule. Accordingly, step S103 may include: if a negation keyword appears in the sentence containing a sentiment word in the sample to be labeled, converting the sentiment polarity of that sentiment word; if a transition keyword appears in the sentence or paragraph following the sentence containing a sentiment word, converting the sentiment polarity of that sentiment word; and/or, if a modal keyword appears in the sentence containing a sentiment word, converting the sentiment polarity of that sentiment word. Table 2 lists common negation keywords, transition keywords and modal keywords. This embodiment is not limited to these keywords; other keywords expressing negation, transition or modality may also be included.

Table 2

[Table 2 appears only as an image in the original publication (Figure BDA00002616392400071); its contents are not recoverable from this text.]

Three concrete examples of converting the sentiment polarity of a sentiment word under the negation rule, the transition rule and the modal rule are given below:

Example 1: 我不喜欢这个产品。 (I don't like this product.)

In Example 1, the sentiment word is "喜欢" (like) and the negation keyword "不" (not) appears in the same sentence, so the sentiment polarity of "喜欢" is converted from positive to negative.

Example 2: 我喜欢这个产品的想法,但是这个质量我不能接受。 (I like the idea of this product, but the quality is unacceptable to me.)

In Example 2, the sentiment word is "喜欢" (like) and the transition keyword "但是" (but) appears in the sentence that follows the one containing it, so the sentiment polarity of "喜欢" is converted from positive to negative.

Example 3: 如果颜色是红色的就好了。 (It would be good if the color were red.)

In Example 3, the sentiment word is "好" (good) and the modal keyword "如果" (if) appears before it in the same sentence, so the sentiment polarity of "好" is converted from positive to negative.
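The three rules can be sketched in Python and checked against Examples 1 to 3. The details below are assumptions rather than the patent's implementation. The keyword lists hold only the keywords that appear in the examples, since Table 2 is not available in this text. A "sentence" is taken to be a clause delimited by Chinese or ASCII punctuation. The polarity is flipped once if any rule fires. Only the first occurrence of each lexicon word in a clause is considered. All names are hypothetical.

```python
import re

# Assumed keyword lists: only the keywords used in Examples 1-3.
NEGATION_KEYWORDS = ("不",)
TRANSITION_KEYWORDS = ("但是",)
MODAL_KEYWORDS = ("如果",)

# Small subset of Table 1, enough for the three examples.
LEXICON = {"喜欢": "positive", "好": "positive", "讨厌": "negative", "坏": "negative"}
FLIP = {"positive": "negative", "negative": "positive"}


def split_clauses(text):
    """Split on Chinese and ASCII punctuation (assumed notion of 'sentence')."""
    return [c for c in re.split(r"[,,。!?!?;;]", text) if c]


def shifted_polarities(text):
    """Return (word, polarity) pairs after applying the three conversion rules."""
    clauses = split_clauses(text)
    results = []
    for idx, clause in enumerate(clauses):
        next_clause = clauses[idx + 1] if idx + 1 < len(clauses) else ""
        for word, polarity in LEXICON.items():
            if word not in clause:
                continue
            negated = any(k in clause for k in NEGATION_KEYWORDS)
            transition = any(k in next_clause for k in TRANSITION_KEYWORDS)
            modal = any(k in clause for k in MODAL_KEYWORDS)
            if negated or transition or modal:
                polarity = FLIP[polarity]  # flip once if any rule applies
            results.append((word, polarity))
    return results
```

With these assumptions, each of the three example sentences yields a single sentiment word whose polarity has been converted to negative, while a plain sentence such as "我喜欢这个产品。" keeps "喜欢" positive.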

In yet another embodiment of the present invention, step S105 may include: if the number of positive sentiment words minus the number of negative sentiment words is greater than a set threshold, determining that the sentiment polarity of the sample to be labeled is positive; if the number of negative sentiment words minus the number of positive sentiment words is greater than the set threshold, determining that the sentiment polarity of the sample to be labeled is negative. Let N+ be the number of sentiment words with positive polarity, N- the number of sentiment words with negative polarity, and Nmax the set threshold. If N+ - N- > Nmax, the sample to be labeled is determined to be positive; if N- - N+ > Nmax, it is determined to be negative.

In yet another embodiment of the present invention, step S106 may include: constructing a maximum entropy classifier from the labeled sample; classifying the other samples to be labeled in the sample set to be labeled with that maximum entropy classifier to obtain classification results; and determining the sentiment polarity of each sample to be labeled according to the classification results, finally obtaining two labeled sample sets: a positively labeled sample set and a negatively labeled sample set.
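The count-based rule and the self-learning step described above can be sketched as follows. This is a minimal Python sketch under assumptions. The default threshold of 3 is the Nmax value reported for the experiment. The callables `count_fn` and `train_fn` are hypothetical interfaces standing in for the sentiment word counter and the maximum entropy trainer. Samples whose count margin does not exceed the threshold are left for the trained classifier to label, which the source implies but does not state explicitly.

```python
def label_by_counts(n_pos, n_neg, n_max=3):
    """Return 'positive' or 'negative' when the count margin exceeds n_max, else None."""
    if n_pos - n_neg > n_max:
        return "positive"
    if n_neg - n_pos > n_max:
        return "negative"
    return None  # margin too small: leave the sample unlabeled


def self_label(samples, count_fn, train_fn, n_max=3):
    """Label seed samples by counts, then label the rest with a trained classifier.

    count_fn(sample) -> (n_pos, n_neg)             (hypothetical interface)
    train_fn(labeled_pairs) -> predict(sample)     (hypothetical interface)
    """
    seeds, rest = [], []
    for sample in samples:
        label = label_by_counts(*count_fn(sample), n_max=n_max)
        if label is not None:
            seeds.append((sample, label))
        else:
            rest.append(sample)
    predict = train_fn(seeds)  # e.g. train a maximum entropy classifier on the seeds
    return seeds + [(sample, predict(sample)) for sample in rest]
```

Keeping the trainer behind `train_fn` means any classifier with a train-then-predict shape can be plugged in for experimentation; the source itself specifies a maximum entropy classifier.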

The maximum entropy classifier is a machine learning classification method based on maximum entropy information theory. Its basic idea is to model everything that is known and to assume nothing about what is unknown. In other words, it seeks a probability distribution that satisfies all known facts while leaving the unknown factors as random as possible. Compared with the naive Bayes method, its main advantage is that it does not require conditional independence between features. The method is therefore suited to combining heterogeneous features without having to account for interactions between them.

Under the maximum entropy model, the conditional probability P(c_i|D) is predicted as follows:

$$P(c_i \mid D) = \frac{1}{Z(D)} \exp\left( \sum_{k} \lambda_{k,c} F_{k,c}(D, c_i) \right)$$

where Z(D) is the normalization factor and F_{k,c} is the feature function, defined as:

$$F_{k,c}(D, c') = \begin{cases} 1, & n_k(D) > 0 \text{ and } c' = c \\ 0, & \text{otherwise} \end{cases}$$
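To make the formula concrete, the following Python sketch evaluates P(c|D) for given weights. It assumes binary presence features, that is, a feature k is active for class c exactly when it occurs in D, and a hypothetical weight table keyed by (feature, class). How the weights are trained is not shown and is not described in this text.

```python
import math


def maxent_probabilities(doc_features, weights, classes):
    """Evaluate P(c | D) for every class c under the maximum entropy model.

    doc_features: set of features present in D (those with n_k(D) > 0)
    weights: dict mapping (feature, class) to lambda; missing pairs count as 0
    """
    # With the indicator feature function, the exponent for class c reduces
    # to the sum of lambda_{k,c} over the features k present in D.
    scores = {c: sum(weights.get((k, c), 0.0) for k in doc_features) for c in classes}
    m = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp_scores.values())  # normalizer, equal to Z(D) scaled by exp(-m)
    return {c: v / z for c, v in exp_scores.items()}
```

The returned probabilities sum to 1 over the classes, and the predicted class is the one with the highest probability.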

The present invention also provides a Chinese text sentiment classification method. In addition to steps S101-S107 above, the method further includes: classifying the Chinese text to be classified using the constructed maximum entropy classifier.

To compare the Chinese text sentiment classification method provided by this embodiment with the existing method, this embodiment uses review corpora from several domains as unlabeled samples to be classified and tests both methods. The corpora used in the test come from two domains: reviews of bags and reviews of hotels. The evaluation metric is accuracy, a general-purpose metric for classification problems. For each domain, accuracy is computed as Accuracy = (TP + NP) / A, where TP is the number of positive texts classified correctly, NP is the number of negative texts classified correctly, and A is the total number of samples evaluated.
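The accuracy computation can be written out as a short Python sketch. It takes A to be the total number of evaluated samples, which is an assumption about the intended meaning of A, and uses the hypothetical labels "positive" and "negative".

```python
def accuracy(gold, predicted):
    """Accuracy = (TP + NP) / A over two parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == p == "positive")   # positive texts classified correctly
    np_ = sum(1 for g, p in pairs if g == p == "negative")  # negative texts classified correctly
    return (tp + np_) / len(pairs)
```

For example, with two positive and two negative texts of which one positive text is misclassified, TP = 1, NP = 2 and A = 4, giving an accuracy of 0.75.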

It should be noted that the correctness of the sentiment polarity assigned to a Chinese text is judged as follows. For a positive text, the classification is correct if the number of positive sentiment words exceeds the number of negative sentiment words, and incorrect if it is smaller than or equal to the number of negative sentiment words. For a negative text, the classification is correct if the number of negative sentiment words exceeds the number of positive sentiment words, and incorrect if it is smaller than or equal to the number of positive sentiment words.

Table 3 compares the results of classifying Chinese text with the classification method provided by the present invention and with the prior-art classification method:

Table 3

[Table 3 appears only as an image in the original publication (Figure BDA00002616392400091); its contents are not recoverable from this text.]

In this experiment, different numbers of labeled samples were used for verification, with Nmax = 3.

The traditional classification method uses the count of sentiment words in each sample as the basis for determining the sample's sentiment category. The method provided by the embodiment of the present invention first applies the sentiment polarity conversion rules (the negation rule, the transition rule and the modal rule) to the sentiment words, which prevents polarity shifts from distorting the sentiment word judgment. It then uses the maximum entropy classifier, built after automatically labeling the unlabeled samples, for Chinese text sentiment classification.

As the data in Table 3 show, the accuracy of the Chinese sentiment classification method provided by this embodiment is higher than that of the traditional text sentiment classification method, with the largest improvement exceeding 3 percentage points. This indicates that the method achieves high accuracy while reducing the cost of manual labeling, and that it avoids the adverse effect of polarity-shifted sentiment words on the classification result, which helps improve text classification.

与上述分类器的构建方法对应,本发明实施例还提供了一种分类器的构建装置,图2为该装置的结构示意图,该装置可以包括:获取单元101、查找单元102、极性转变单元103、统计单元104、确定单元105、自学习单元106和分类器构建单元107。其中:Corresponding to the construction method of the above-mentioned classifier, the embodiment of the present invention also provides a construction device of the classifier. FIG. 103 , a statistical unit 104 , a determination unit 105 , a self-learning unit 106 and a classifier construction unit 107 . in:

The acquisition unit 101 is configured to acquire a sample set to be labeled and to acquire one sample to be labeled from the sample set, wherein the sample set to be labeled includes at least two samples to be labeled. The search unit 102 is configured to search for sentiment words in the sample to be labeled and to obtain the sentiment polarity of each sentiment word, wherein the sentiment polarity is either positive or negative. The polarity conversion unit 103 is configured to convert the sentiment polarity of those sentiment words in the sample to be labeled that conform to a sentiment polarity conversion rule. The statistical unit 104 is configured to count the number of sentiment words with positive polarity and the number of sentiment words with negative polarity in the sample to be labeled. The determination unit 105 is configured to determine the sentiment polarity of the sample to be labeled from the number of positive sentiment words and the number of negative sentiment words, thereby obtaining a labeled sample. The self-learning unit 106 is configured to label, by a self-learning method based on the labeled sample, the other samples to be labeled in the sample set, thereby obtaining a labeled sample set. The classifier construction unit 107 is configured to construct a maximum entropy classifier from the labeled samples in the labeled sample set.
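The search and counting steps performed by units 102 and 104 can be illustrated with a minimal Python sketch. The lexicon contents, the function names, and the assumption that the text has already been segmented into tokens are all hypothetical and are not taken from the embodiment.

```python
from collections import Counter

# Hypothetical toy lexicon; a real system would load a full sentiment dictionary.
LEXICON = {"好": "positive", "喜欢": "positive", "差": "negative", "失望": "negative"}

def find_sentiment_words(tokens, lexicon=LEXICON):
    """Return (index, word, polarity) for every token found in the lexicon."""
    return [(i, tok, lexicon[tok]) for i, tok in enumerate(tokens) if tok in lexicon]

def count_polarities(hits):
    """Count positive and negative sentiment words among the lexicon hits."""
    counts = Counter(polarity for _, _, polarity in hits)
    return counts["positive"], counts["negative"]

# Assumes the sample has already been word-segmented.
tokens = ["屏幕", "很", "好", "，", "电池", "差"]
hits = find_sentiment_words(tokens)
pos_count, neg_count = count_polarities(hits)
```

In this sketch the polarity conversion step of unit 103 would run between the two functions, rewriting the polarity field of each hit before counting.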

In another embodiment of the present invention, the polarity conversion unit 103 may include a first polarity conversion subunit, a second polarity conversion subunit, and/or a third polarity conversion subunit, wherein:

The first polarity conversion subunit is configured to convert the sentiment polarity of a sentiment word in the sample to be labeled when a negation keyword appears in the sentence containing that sentiment word. The second polarity conversion subunit is configured to convert the sentiment polarity of a sentiment word in the sample to be labeled when a transition keyword appears in the sentence or paragraph that follows the sentence containing that sentiment word. The third polarity conversion subunit is configured to convert the sentiment polarity of a sentiment word in the sample to be labeled when a modal keyword appears in the sentence containing that sentiment word.
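The three conversion rules can be sketched in Python as follows. The keyword sets are hypothetical examples, the sample is assumed to be a list of already segmented sentences, and only the "next sentence" case of the transition rule is covered. The sketch also assumes that a sentiment word is flipped once if any rule fires, which the embodiment does not specify.

```python
# Hypothetical keyword lists; the embodiment does not enumerate them here.
NEGATION = {"不", "没有"}
TRANSITION = {"但是", "然而"}
MODAL = {"应该", "希望"}
FLIP = {"positive": "negative", "negative": "positive"}

def converted_polarity(sentences, sent_idx, polarity):
    """Return the polarity of a sentiment word located in sentences[sent_idx]
    after applying the negation, transition, and modal rules."""
    current = set(sentences[sent_idx])
    has_next = sent_idx + 1 < len(sentences)
    following = set(sentences[sent_idx + 1]) if has_next else set()
    rule_fires = (
        bool(current & NEGATION)        # negation keyword in the same sentence
        or bool(following & TRANSITION) # transition keyword in the next sentence
        or bool(current & MODAL)        # modal keyword in the same sentence
    )
    # Assumption: a single flip regardless of how many rules match.
    return FLIP[polarity] if rule_fires else polarity

sentences = [["屏幕", "不", "好"], ["电池", "好"], ["但是", "太", "贵"]]
```

Here the word "好" in the first sentence is flipped by the negation rule, and "好" in the second sentence is flipped because the following sentence opens with a transition keyword.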

In another embodiment of the present invention, the determination unit 105 may include a first determination subunit and a second determination subunit, wherein:

The first determination subunit is configured to determine that the sentiment polarity of the sample to be labeled is positive when the number of positive sentiment words minus the number of negative sentiment words is greater than a set threshold. The second determination subunit is configured to determine that the sentiment polarity of the sample to be labeled is negative when the number of negative sentiment words minus the number of positive sentiment words is greater than the set threshold.
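The threshold decision made by the two determination subunits can be written as a short Python sketch. The function name and the default threshold value are hypothetical, and returning `None` when neither condition holds is an assumption, since this passage does not state how such samples are handled.

```python
def label_sample(pos_count, neg_count, threshold=0):
    """Label a sample from its sentiment word counts.

    Returns "positive" or "negative" when the margin exceeds the threshold,
    and None otherwise (assumed here to mean the sample stays unlabeled).
    """
    if pos_count - neg_count > threshold:
        return "positive"
    if neg_count - pos_count > threshold:
        return "negative"
    return None
```

A larger threshold yields fewer automatically labeled samples, but each label rests on a wider margin between the two counts.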

In yet another embodiment of the present invention, the self-learning unit 106 may include a classifier construction subunit, a classification subunit, and a third determination subunit, wherein:

The classifier construction subunit is configured to construct a maximum entropy classifier from the labeled samples. The classification subunit is configured to classify, with the maximum entropy classifier, the other samples to be labeled in the sample set, thereby obtaining a classification result. The third determination subunit is configured to determine the sentiment polarity of each sample to be labeled according to the classification result.
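The self-learning step can be sketched in Python as a single pass: train on the automatically labeled seed samples, then label the remaining samples with the trained model. The sketch relies on the fact that a two-class maximum entropy model is equivalent to logistic regression, and it fits the model with plain stochastic gradient ascent. The bag-of-words features, the hyperparameters, and all function names are assumptions and are not taken from the embodiment.

```python
import math

def featurize(tokens, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocab]

def predict_positive(weights, x):
    """Probability of the positive class; weights[-1] is the bias term."""
    z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(X, y, epochs=100, lr=0.5):
    """Fit a two-class maximum entropy model by stochastic gradient ascent."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = target - predict_positive(weights, x)
            for j, xj in enumerate(x):
                weights[j] += lr * error * xj
            weights[-1] += lr * error
    return weights

def self_label(seed, unlabeled):
    """Train on seed (tokens, label) pairs, then label the remaining samples."""
    vocab = sorted({tok for tokens, _ in seed for tok in tokens})
    X = [featurize(tokens, vocab) for tokens, _ in seed]
    y = [1.0 if label == "positive" else 0.0 for _, label in seed]
    weights = train_maxent(X, y)
    labeled = list(seed)
    for tokens in unlabeled:
        p = predict_positive(weights, featurize(tokens, vocab))
        labeled.append((tokens, "positive" if p >= 0.5 else "negative"))
    return labeled

# Hypothetical seed samples, as if labeled by the lexicon-based steps above.
seed = [(["好", "喜欢"], "positive"), (["差", "失望"], "negative")]
unlabeled = [["喜欢", "屏幕"], ["失望", "电池"]]
labeled_set = self_label(seed, unlabeled)
```

The resulting `labeled_set` plays the role of the labeled sample set from which the final maximum entropy classifier would be built.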

Corresponding to the Chinese text sentiment classification method described above, an embodiment of the present invention further provides a Chinese text sentiment classification system. In addition to the classifier construction device described above, the system includes a classification unit, which is configured to classify the Chinese text to be classified using the maximum entropy classifier constructed by the classifier construction device.

The classifier construction method and device and the Chinese text sentiment classification method and system provided by the embodiments of the present invention apply sentiment polarity conversion rules to convert the polarity of sentiment words, label the other samples in the sample set to be labeled by a self-learning method based on the labeled sample, and use the maximum entropy classifier constructed from the labeled samples in the labeled sample set as the classifier for Chinese text sentiment classification. They thereby avoid the labor cost of manually labeling training samples, shorten the time needed to construct a classifier for Chinese text sentiment classification, and improve the accuracy of Chinese text sentiment classification.

For convenience of description, the above device is described as a set of units divided by function. When the present invention is implemented, the functions of the units may of course be realized in one or more pieces of software and/or hardware.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention or in certain parts of those embodiments.

The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is basically similar to the method embodiment, it is described relatively briefly, and the relevant parts of the method embodiment may be consulted. The system embodiment described above is merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.

The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations.

The above are only specific embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.

Claims (10)

1. A method for constructing a classifier, characterized by comprising: acquiring a sample set to be labeled and acquiring one sample to be labeled from the sample set to be labeled, wherein the sample set to be labeled includes at least two samples to be labeled; searching for sentiment words in the sample to be labeled and obtaining the sentiment polarity of each sentiment word, wherein the sentiment polarity includes positive and negative; converting the sentiment polarity of the sentiment words in the sample to be labeled that conform to a sentiment polarity conversion rule; counting the number of sentiment words with positive sentiment polarity and the number of sentiment words with negative sentiment polarity in the sample to be labeled; determining the sentiment polarity of the sample to be labeled according to the number of sentiment words with positive sentiment polarity and the number of sentiment words with negative sentiment polarity, to obtain a labeled sample; labeling, by a self-learning method according to the labeled sample, the other samples to be labeled in the sample set to be labeled, to obtain a labeled sample set; and constructing a maximum entropy classifier using the labeled samples in the labeled sample set.

2. The method according to claim 1, characterized in that converting the sentiment polarity of the sentiment words in the sample to be labeled that conform to a sentiment polarity conversion rule comprises: if a negation keyword appears in the sentence containing a sentiment word in the sample to be labeled, converting the sentiment polarity of that sentiment word; if a transition keyword appears in the sentence or paragraph following the sentence containing a sentiment word in the sample to be labeled, converting the sentiment polarity of that sentiment word; and/or, if a modal keyword appears in the sentence containing a sentiment word in the sample to be labeled, converting the sentiment polarity of that sentiment word.

3. The method according to claim 1, characterized in that determining the sentiment polarity of the sample to be labeled according to the number of sentiment words with positive sentiment polarity and the number of sentiment words with negative sentiment polarity comprises: if the number of sentiment words with positive sentiment polarity minus the number of sentiment words with negative sentiment polarity is greater than a set threshold, determining that the sentiment polarity of the sample to be labeled is positive; and if the number of sentiment words with negative sentiment polarity minus the number of sentiment words with positive sentiment polarity is greater than the set threshold, determining that the sentiment polarity of the sample to be labeled is negative.

4. The method according to claim 1, characterized in that labeling, by a self-learning method according to the labeled sample, the other samples to be labeled in the sample set to be labeled, to obtain a labeled sample set, comprises: constructing a maximum entropy classifier using the labeled sample; classifying the other samples to be labeled in the sample set to be labeled with the maximum entropy classifier, to obtain a classification result; and determining the sentiment polarity of each sample to be labeled according to the classification result, to obtain the labeled sample set.

5. A Chinese text sentiment classification method, characterized by comprising the method for constructing a classifier according to any one of claims 1 to 4, and further comprising: classifying the Chinese text to be classified using the constructed maximum entropy classifier.

6. A device for constructing a classifier, characterized by comprising: an acquisition unit, a search unit, a polarity conversion unit, a statistical unit, a determination unit, a self-learning unit, and a classifier construction unit; the acquisition unit being configured to acquire a sample set to be labeled and to acquire one sample to be labeled from the sample set to be labeled, wherein the sample set to be labeled includes at least two samples to be labeled; the search unit being configured to search for sentiment words in the sample to be labeled and to obtain the sentiment polarity of each sentiment word, wherein the sentiment polarity includes positive and negative; the polarity conversion unit being configured to convert the sentiment polarity of the sentiment words in the sample to be labeled that conform to a sentiment polarity conversion rule; the statistical unit being configured to count the number of sentiment words with positive sentiment polarity and the number of sentiment words with negative sentiment polarity in the sample to be labeled; the determination unit being configured to determine the sentiment polarity of the sample to be labeled according to the number of sentiment words with positive sentiment polarity and the number of sentiment words with negative sentiment polarity, to obtain a labeled sample; the self-learning unit being configured to label, by a self-learning method according to the labeled sample, the other samples to be labeled in the sample set to be labeled, to obtain a labeled sample set; and the classifier construction unit being configured to construct a maximum entropy classifier using the labeled samples in the labeled sample set.

7. The device according to claim 6, characterized in that the polarity conversion unit comprises: a first polarity conversion subunit, a second polarity conversion subunit, and/or a third polarity conversion subunit; the first polarity conversion subunit being configured to convert the sentiment polarity of a sentiment word in the sample to be labeled when a negation keyword appears in the sentence containing that sentiment word; the second polarity conversion subunit being configured to convert the sentiment polarity of a sentiment word in the sample to be labeled when a transition keyword appears in the sentence or paragraph following the sentence containing that sentiment word; and the third polarity conversion subunit being configured to convert the sentiment polarity of a sentiment word in the sample to be labeled when a modal keyword appears in the sentence containing that sentiment word.

8. The device according to claim 6, characterized in that the determination unit comprises: a first determination subunit and a second determination subunit; the first determination subunit being configured to determine that the sentiment polarity of the sample to be labeled is positive when the number of sentiment words with positive sentiment polarity minus the number of sentiment words with negative sentiment polarity is greater than a set threshold; and the second determination subunit being configured to determine that the sentiment polarity of the sample to be labeled is negative when the number of sentiment words with negative sentiment polarity minus the number of sentiment words with positive sentiment polarity is greater than the set threshold.

9. The device according to claim 6, characterized in that the self-learning unit comprises: a classifier construction subunit, a classification subunit, and a third determination subunit; the classifier construction subunit being configured to construct a maximum entropy classifier using the labeled sample; the classification subunit being configured to classify the other samples to be labeled in the sample set to be labeled with the maximum entropy classifier, to obtain a classification result; and the third determination subunit being configured to determine the sentiment polarity of each sample to be labeled according to the classification result.

10. A Chinese text sentiment classification system, characterized by comprising the device for constructing a classifier according to any one of claims 6 to 9, and further comprising a classification unit; the classification unit being configured to classify the Chinese text to be classified using the maximum entropy classifier constructed by the device for constructing a classifier.

CN2012105564463A 2012-12-19 2012-12-19 Classifier construction method and device as well as Chinese text sentiment classification method and system Pending CN103020249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012105564463A CN103020249A (en) 2012-12-19 2012-12-19 Classifier construction method and device as well as Chinese text sentiment classification method and system

Publications (1)

Publication Number Publication Date
CN103020249A true CN103020249A (en) 2013-04-03

Family

ID=47968852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012105564463A Pending CN103020249A (en) 2012-12-19 2012-12-19 Classifier construction method and device as well as Chinese text sentiment classification method and system

Country Status (1)

Country Link
CN (1) CN103020249A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530283A (en) * 2013-10-25 2014-01-22 苏州大学 Method for extracting emotional triggers
CN103617245A (en) * 2013-11-27 2014-03-05 苏州大学 Bilingual sentiment classification method and device
CN104317965A (en) * 2014-11-14 2015-01-28 南京理工大学 Establishment method of emotion dictionary based on linguistic data
CN106844743A (en) * 2017-02-14 2017-06-13 国网新疆电力公司信息通信公司 The sensibility classification method and device of Uighur text
CN107644101A (en) * 2017-09-30 2018-01-30 百度在线网络技术(北京)有限公司 Information classification approach and device, information classification equipment and computer-readable medium
CN108241650A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 The training method and device of training criteria for classification
WO2019042450A1 (en) * 2017-09-04 2019-03-07 华为技术有限公司 Natural language processing method and apparatus
CN112445897A (en) * 2021-01-28 2021-03-05 京华信息科技股份有限公司 Method, system, device and storage medium for large-scale classification and labeling of text data
CN114443849A (en) * 2022-02-09 2022-05-06 北京百度网讯科技有限公司 Method and device for selecting marked sample, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090125371A1 (en) * 2007-08-23 2009-05-14 Google Inc. Domain-Specific Sentiment Classification
CN102323944A (en) * 2011-09-02 2012-01-18 苏州大学 Sentiment Classification Method Based on Polarity Transition Rules
CN102682124A (en) * 2012-05-16 2012-09-19 苏州大学 A text sentiment classification method and device
CN102682130A (en) * 2012-05-17 2012-09-19 苏州大学 Text sentiment classification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
代大明等: "基于情绪词的非监督中文情感分类方法研究", 《中文信息学报》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530283A (en) * 2013-10-25 2014-01-22 苏州大学 Method for extracting emotional triggers
CN103617245A (en) * 2013-11-27 2014-03-05 苏州大学 Bilingual sentiment classification method and device
CN104317965B (en) * 2014-11-14 2018-04-03 南京理工大学 Sentiment dictionary construction method based on language material
CN104317965A (en) * 2014-11-14 2015-01-28 南京理工大学 Establishment method of emotion dictionary based on linguistic data
CN108241650B (en) * 2016-12-23 2020-08-11 北京国双科技有限公司 Training method and device for training classification standard
CN108241650A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 The training method and device of training criteria for classification
CN106844743B (en) * 2017-02-14 2020-04-24 国网新疆电力公司信息通信公司 Emotion classification method and device for Uygur language text
CN106844743A (en) * 2017-02-14 2017-06-13 国网新疆电力公司信息通信公司 The sensibility classification method and device of Uighur text
WO2019042450A1 (en) * 2017-09-04 2019-03-07 华为技术有限公司 Natural language processing method and apparatus
US11630957B2 (en) 2017-09-04 2023-04-18 Huawei Technologies Co., Ltd. Natural language processing method and apparatus
CN107644101A (en) * 2017-09-30 2018-01-30 百度在线网络技术(北京)有限公司 Information classification approach and device, information classification equipment and computer-readable medium
CN112445897A (en) * 2021-01-28 2021-03-05 京华信息科技股份有限公司 Method, system, device and storage medium for large-scale classification and labeling of text data
CN114443849A (en) * 2022-02-09 2022-05-06 北京百度网讯科技有限公司 Method and device for selecting marked sample, electronic equipment and storage medium
CN114443849B (en) * 2022-02-09 2023-10-27 北京百度网讯科技有限公司 Annotated sample selection method, device, electronic equipment and storage medium
US11907668B2 (en) 2022-02-09 2024-02-20 Beijing Baidu Netcom Science Technology Co., Ltd. Method for selecting annotated sample, apparatus, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN103020249A (en) Classifier construction method and device as well as Chinese text sentiment classification method and system
Desai et al. Techniques for sentiment analysis of Twitter data: A comprehensive survey
CN102682130B (en) A text sentiment classification method and system
CN102682124B (en) Emotion classifying method and device for text
CN104679728B (en) A kind of text similarity detection method
CN103678564B (en) Internet product research system based on data mining
Batchkarov et al. A critique of word similarity as a method for evaluating distributional semantic models
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN103631961A (en) Method for identifying relationship between sentiment words and evaluation objects
CN107239439A (en) Public sentiment sentiment classification method based on word2vec
CN105930411A (en) Classifier training method, classifier and sentiment classification system
CN103744984B (en) Method of retrieving documents by semantic information
CN101593204A (en) A Sentiment Analysis System Based on News Comment Webpage
CN103744953A (en) Network hotspot mining method based on Chinese text emotion recognition
CN102323944A (en) Sentiment Classification Method Based on Polarity Transition Rules
CN102279890A (en) Sentiment word extracting and collecting method based on micro blog
CN105589941A (en) Emotional information detection method and apparatus for web text
CN106294845B (en) Multi-emotion classification method and device based on weight learning and multi-feature extraction
CN104915443B (en) A kind of abstracting method of Chinese microblogging evaluation object
CN108280057A (en) A kind of microblogging rumour detection method based on BLSTM
CN108038627A (en) A kind of object evaluation method and device
CN104268134A (en) Subjective and objective classifier building method and system
CN104346326A (en) Method and device for determining emotional characteristics of emotional texts
CN104794209B (en) Chinese microblogging mood sorting technique based on Markov logical network and system
CN104199845B (en) Line Evaluation based on agent model discusses sensibility classification method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130403