CN102541838B - Method and equipment for optimizing emotional classifier - Google Patents

Publication number
CN102541838B
Authority
CN
China
Prior art keywords
classification
language material
emotion classifiers
marked
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010612244.7A
Other languages
Chinese (zh)
Other versions
CN102541838A (en
Inventor
胡长建
邱立坤
赵凯
许洪志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC China Co Ltd
Original Assignee
NEC China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC China Co Ltd filed Critical NEC China Co Ltd
Priority to CN201010612244.7A priority Critical patent/CN102541838B/en
Publication of CN102541838A publication Critical patent/CN102541838A/en
Application granted granted Critical
Publication of CN102541838B publication Critical patent/CN102541838B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and equipment for optimizing emotion classifiers. The method may comprise: selecting, based on a labeled set, a group of emotion classifiers with large classification bias differences from a set of emotion classifiers; labeling unlabeled corpora using the selected emotion classifiers; extracting reliable labeled corpora from the unlabeled corpora according to the labeling results; updating the labeled set with the reliable labeled corpora; and training the selected emotion classifiers with the updated labeled set to optimize them. The method and equipment can eliminate sentiment classification bias and increase the precision of the emotion classifiers.

Description

Method and apparatus for optimizing an emotion classifier
Technical field
The present invention relates generally to the field of information processing, and in particular to a method and apparatus for optimizing an emotion classifier.
Background
With the widespread adoption of Web 2.0, the information publishing model of Web 1.0, namely "I speak, you listen; I perform, you watch; I write, you read", is shifting toward one in which users are the center of information production. Correspondingly, more and more users comment on the quality of products or services. Such comments express the users' own sentiment and can be called user-generated content (UGC). This content is an important reference for consumers and for producers/merchants alike. Based on the objective evaluations of other users, consumers can make purchase decisions more quickly, and producers/merchants can better improve their products or services according to user feedback. One purpose of analyzing such review information is to extract the user's sentiment orientation from it. This technique is called sentiment classification; its goal is to determine, for a given text, the sentiment orientation expressed by its author: positive or negative. Only when the user's sentiment is reflected accurately can the result be of real use to consumers and merchants, so objective sentiment classification without classification bias is very important.
Sentiment classification is a classification problem in the field of natural language processing. In practice there are usually two kinds of approaches: corpus-based methods and lexicon-based methods. Experiments show that both kinds of sentiment classification algorithm suffer from classification bias. In a real system, only by eliminating classification bias can the user's true intent be reflected more objectively, so the bias problem of sentiment classification urgently needs to be solved.
To address this problem, the industry has proposed some related solutions. For example, US patent US20080249764 proposes a method that aggregates classifiers to improve sentiment classification accuracy, which reduces some sentiment classification bias to a certain extent. However, the prior art neither analyzes classification bias in depth nor addresses it specifically. US20080249764, for example, only aggregates different classifiers, that is, it uses more classification features to improve classification precision, and this cannot effectively eliminate sentiment classification bias.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a method and apparatus for optimizing an emotion classifier, such that the optimized emotion classifier eliminates sentiment classification bias.
According to a first aspect of the invention, a method for optimizing an emotion classifier is provided. The method may comprise: selecting, based on a labeled set, a group of emotion classifiers with large classification bias differences from a set of emotion classifiers; labeling unlabeled corpora using this group of emotion classifiers; extracting reliable labeled corpora from the unlabeled corpora according to the labeling results; updating the labeled set using the reliable labeled corpora; and training this group of emotion classifiers with the updated labeled set, so as to optimize the emotion classifiers.
According to a second aspect of the invention, an equipment for optimizing an emotion classifier is provided. The equipment may comprise: a selecting device for selecting, based on a labeled set, a group of emotion classifiers with large classification bias differences from a set of emotion classifiers; a labeling device for labeling unlabeled corpora using this group of emotion classifiers; an extracting device for extracting reliable labeled corpora from the unlabeled corpora according to the labeling results; an updating device for updating the labeled set using the reliable labeled corpora; and a training device for training this group of emotion classifiers with the updated labeled set, so as to optimize the emotion classifiers.
Other features and advantages of the present invention will become apparent from the following description of preferred embodiments, taken in conjunction with the accompanying drawings.
Brief description of the drawings
Other objects and effects of the present invention will become clearer and easier to understand from the following description taken in conjunction with the drawings, and as the invention is more fully understood, in which:
Fig. 1 is a flowchart of the method for optimizing an emotion classifier according to one embodiment of the present invention;
Fig. 2 is a flowchart of the method for optimizing an emotion classifier according to another embodiment of the present invention; and
Fig. 3 is a block diagram of the equipment for optimizing an emotion classifier according to one embodiment of the present invention.
In all of the above drawings, identical reference numerals denote identical, similar or corresponding features or functions.
Detailed description of the embodiments
The present invention is explained and illustrated in more detail below with reference to the drawings. It should be understood that the drawings and embodiments of the present invention are exemplary only and are not intended to limit the scope of the invention.
For clarity, the terms used in the present invention are explained first.
1. Corpus
A corpus in the present invention, also called free text, can be a character, a word, a sentence, a passage, an article and so on, or any combination thereof.
An unlabeled corpus is a corpus that has not been labeled with a sentiment class.
A labeled corpus is a corpus that has been labeled with a sentiment class. Obtaining a labeled corpus means obtaining both the corpus itself and the sentiment class with which it has been labeled.
2. Sentiment classification and emotion classifiers
Sentiment classification is a classification problem in the field of natural language processing. Generally speaking, it refers to analyzing a corpus and labeling its sentiment orientation, for example positive or negative, so that corpora are classified into positive-orientation corpora and negative-orientation corpora. Besides this two-class labeling, sentiment can also be labeled with multiple classes. Because those skilled in the art can easily extend the two-class processing to multiple classes, the present invention is described mainly in terms of two-class labeling. It should be noted, however, that the present invention is not limited to classifying sentiment into two classes.
At present, those skilled in the art commonly use two kinds of sentiment classification methods: corpus-based methods and lexicon-based methods.
A corpus-based method starts from a batch of corpora that have been labeled with sentiment classes in advance (for example, a set of texts labeled as positive and a set of texts labeled as negative). These corpora are used to train, by machine learning, an emotion classifier that has learned a classification algorithm, and the trained emotion classifier is then used to label corpora whose sentiment class is not yet labeled. A lexicon-based method prepares a sentiment dictionary in advance, selecting words that commonly express positive sentiment and words that commonly express negative sentiment. For a given corpus without a sentiment label, it counts the occurrences of positive words and negative words and determines the sentiment orientation of the corpus after normalization.
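The lexicon-based approach described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the two word lists, the function name `lexicon_label`, the normalization by the total number of matched words, and the tie-breaking default of "+" are all hypothetical and are not specified by the patent.

```python
import re

# Hypothetical toy lexicons; a real sentiment dictionary would be far larger.
POSITIVE_WORDS = {"perfect", "nice", "like", "speedy", "inexpensive"}
NEGATIVE_WORDS = {"bad", "useless", "suck", "failure", "small"}


def lexicon_label(text):
    """Label a text '+' or '-' from normalized positive/negative word counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    total = pos + neg
    if total == 0:
        return "+"  # assumed default when no lexicon word occurs
    score = (pos - neg) / total  # normalized to the range [-1, 1]
    return "+" if score >= 0 else "-"


print(lexicon_label("The screen of the mobile is perfect"))
print(lexicon_label("the product is too bad"))
```

With these toy lexicons, a sentence that contains one positive and one negative word (such as "very nice for the price ... essentially useless") receives a score of 0 and is labeled "+", which shows how simple word counting can mislabel a negative review.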
Both corpus-based and lexicon-based methods cover multiple concrete algorithms rather than a single specific one. A corpus-based method can be, for example, a sentiment classification method based on a maximum entropy model, a decision tree model, a CRF (Conditional Random Field) model, a neural network model or a Naive Bayes model. A lexicon-based method can be, for example, a method based on a dictionary only, or a method based on a dictionary and rules.
An emotion classifier is a tool that uses a sentiment classification algorithm to label corpora with sentiment classes. In the present invention, an emotion classifier may correspond to one sentiment classification algorithm, and it may be trained on labeled corpora so as to reduce the sentiment classification bias produced when it classifies unlabeled corpora. For convenience, an emotion classifier is sometimes simply called a classifier below.
3. Sentiment classification bias
Experiments show that corpus-based sentiment classification methods tend to label corpora with positive sentiment as negative, whereas lexicon-based methods are more inclined to label corpora with negative sentiment as positive. Evidently, neither kind of method or classifier can completely avoid sentiment labeling errors. For convenience, the error in which a classifier labels positive sentiment as negative is called negative bias, the error in which a classifier labels negative sentiment as positive is called positive bias, and negative bias and positive bias are collectively called sentiment classification bias.
The present invention relates to a method for optimizing an emotion classifier. The method may comprise: selecting, based on a labeled set, a group of emotion classifiers with large classification bias differences from a set of emotion classifiers; labeling unlabeled corpora using this group of emotion classifiers; extracting reliable labeled corpora from the unlabeled corpora according to the labeling results; updating the labeled set using the reliable labeled corpora; and training this group of emotion classifiers with the updated labeled set, so as to optimize the emotion classifiers, eliminate sentiment classification bias and improve sentiment classification precision.
In general, the method of the present invention has two features: it automatically detects mislabeled corpora, and it automatically adjusts the emotion classifiers. For example, one embodiment may first select a group of classifiers with large classification bias differences, such as a classifier A and a classifier B, and then use the selected group to label given unclassified documents with sentiment classes. Corpora on which the two classifiers' results differ are treated as corpora misclassified by classifier A, which realizes the automatic detection of mislabeled corpora. In addition, corpora on which the two classifiers agree can be taken as reliable labeled corpora and added to the labeled corpus set for retraining classifier A; repeating this iteratively realizes the continuous automatic adjustment of classifier A. Because classifiers A and B are symmetric, the above process can be applied to both at the same time. The group comprising classifiers A and B thus keeps obtaining reliable labeled corpora during mutual co-training, which overcomes the classification bias problem and significantly improves classification precision.
The embodiments of the present invention are described in detail below.
Fig. 1 is a flowchart of the method for optimizing an emotion classifier according to one embodiment of the present invention.
In step 101, a group of emotion classifiers with large classification bias differences is selected from a set of emotion classifiers based on a labeled set.
In the present invention, the set of emotion classifiers in step 101 may comprise multiple emotion classifiers, which can be obtained by training multiple emotion classifier models on the labeled set.
According to one embodiment of the present invention, this selection can be realized as follows: the emotion classifiers in the set of emotion classifiers are used to label the labeled corpora in the labeled set; the classification bias differences between the emotion classifiers are determined from the labeling results for the labeled corpora; and a group of emotion classifiers is selected according to a predetermined policy based on the classification bias differences.
In one embodiment of the invention, the classification bias difference between emotion classifiers can be obtained from the classification similarity between the classifiers, computed from their labeling results for the labeled corpora. The classification similarity can be obtained in several ways. For example, the number of corpora without classification conflict (corpora for which the different emotion classifiers give the same label) can be counted in the labeling results, and the classification similarity can be determined from this number and the total number of corpora in the labeled set. As another example, it can be computed with a similarity measure such as cosine similarity, the Dice coefficient, Chi-square, log-likelihood or an F1-measure-like measure.
In one embodiment of the invention, selecting a group of emotion classifiers according to a predetermined policy based on the classification bias differences can be done in several ways. For example, the classification bias differences can be sorted and the group of emotion classifiers corresponding to the largest classification bias difference selected. As another example, the classification bias differences greater than a predetermined threshold can be determined and a group of emotion classifiers corresponding to such a difference selected.
In step 102, unlabeled corpora are labeled using this group of emotion classifiers.
The group may comprise two or more different classifiers with large classification bias differences between them. Because an emotion classifier may correspond to one sentiment classification algorithm, labeling the unlabeled corpora with the two or more classifiers in the group yields two or more labeling results for each unlabeled corpus; the same unlabeled corpus may receive the same class or different classes.
In step 103, reliable labeled corpora are extracted from the unlabeled corpora according to the labeling results.
In the present invention, a reliable labeled corpus is determined from a group of classifiers' labeling results for an unlabeled corpus. For example, when all classifiers in the group label the same unlabeled corpus identically, that corpus can be regarded as a reliable labeled corpus. Here "labeled identically" means that every classifier in the group labels the corpus as positive, or every classifier labels it as negative. As another example, when the probability that a group of emotion classifiers labels the same unlabeled corpus identically is greater than a predetermined confidence level, that corpus can be regarded as a reliable labeled corpus. Here the "probability of identical labeling" can be the larger of two probabilities: the probability that every classifier in the group labels the corpus as positive, and the probability that every classifier in the group labels it as negative.
According to another embodiment of the invention, extracting reliable labeled corpora from the unlabeled corpora according to the labeling results can be done in several ways. For example, the unlabeled corpora that the group of emotion classifiers labels identically can be determined from the labeling results and extracted as reliable labeled corpora. As another example, the probability that the group labels an unlabeled corpus identically can be determined from the labeling results; if that probability is greater than a predetermined confidence level, the corpus is extracted as a reliable labeled corpus.
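The confidence-based variant of step 103 can be sketched as below. The patent does not say how the probability of identical labeling is computed; this sketch assumes that each classifier reports a probability of labeling a corpus "+" and that the classifiers are independent, so the joint probability is a product. The function names and input layout are hypothetical.

```python
from math import prod


def agreement_probability(pos_probs):
    """Return (label, probability) for the more likely unanimous labeling.

    pos_probs holds each classifier's probability of labeling the corpus '+'.
    Assumption: classifiers are treated as independent.
    """
    all_pos = prod(pos_probs)
    all_neg = prod(1.0 - p for p in pos_probs)
    # The "probability of identical labeling" is the larger of the two.
    return ("+", all_pos) if all_pos >= all_neg else ("-", all_neg)


def extract_reliable(pos_probs_by_corpus, confidence):
    """Keep corpora whose identical-labeling probability exceeds the confidence."""
    reliable = []
    for corpus, pos_probs in pos_probs_by_corpus.items():
        label, prob = agreement_probability(pos_probs)
        if prob > confidence:
            reliable.append((corpus, label))
    return reliable


example = {"a": [0.9, 0.95], "b": [0.6, 0.4], "c": [0.05, 0.1]}
print(extract_reliable(example, confidence=0.8))
```

In this toy input, corpora "a" and "c" pass a confidence level of 0.8 (as "+" and "-" respectively), while "b" is left in the unlabeled pool.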
In step 104, the labeled set is updated using the reliable labeled corpora.
In this step, the reliable labeled corpora and the classes with which they were labeled can be added to the labeled set, thereby expanding it. These reliable labeled corpora are then no longer used as unlabeled corpora.
In step 105, this group of emotion classifiers is trained with the updated labeled set, so as to optimize the emotion classifiers.
Because the labeled set was updated in step 104, the updated labeled set can train the group of emotion classifiers selected in step 101 better, giving the group better classification precision.
According to another embodiment of the invention, the method for optimizing an emotion classifier may further comprise obtaining the labeled set in advance, the labeled set comprising labeled corpora and their classes. Collecting a large number of labeled corpora in advance reduces the burden of manual labeling and can further improve the classification precision of the present invention.
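Steps 101 to 105 can be put together in a short sketch. It assumes hypothetical classifier objects that expose `fit(labeled)` and `predict(text)`, a group of exactly two classifiers, agreement-based similarity, and the "1/(1 + similarity)" bias difference; none of these interface details is prescribed by the patent.

```python
from itertools import combinations


def agreement(a, b, texts):
    """Share of texts on which classifiers a and b give the same label."""
    return sum(a.predict(t) == b.predict(t) for t in texts) / len(texts)


def optimize(classifiers, labeled, unlabeled, rounds=1):
    """labeled: list of (text, label) pairs; unlabeled: list of texts."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for clf in classifiers:
        clf.fit(labeled)
    # Step 101: select the pair with the largest classification bias difference.
    texts = [t for t, _ in labeled]
    a, b = max(
        combinations(classifiers, 2),
        key=lambda pair: 1.0 / (1.0 + agreement(pair[0], pair[1], texts)),
    )
    for _ in range(rounds):
        # Step 102: label the unlabeled corpora with both classifiers.
        results = [(t, a.predict(t), b.predict(t)) for t in unlabeled]
        # Step 103: corpora labeled identically are taken as reliable.
        reliable = [(t, la) for t, la, lb in results if la == lb]
        if not reliable:
            break
        # Step 104: move the reliable corpora into the labeled set.
        labeled.extend(reliable)
        moved = {t for t, _ in reliable}
        unlabeled = [t for t in unlabeled if t not in moved]
        # Step 105: retrain the selected group on the updated labeled set.
        a.fit(labeled)
        b.fit(labeled)
    return (a, b), labeled, unlabeled
```

Running more than one round corresponds to iterating the labeling and retraining until no further reliable corpora are found.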
Fig. 2 is a flowchart of the method for optimizing an emotion classifier according to another embodiment of the present invention. In this embodiment, suppose the set of emotion classifiers C comprises 3 emotion classifiers, as follows:
C = {classifier 1, classifier 2, classifier 3},
where classifier 1, classifier 2 and classifier 3 can be built with any sentiment classification algorithm.
Suppose the labeled set L comprises 4 corpora, each labeled with a sentiment class, as follows:
L = { "positive-The screen of the mobile is perfect",
"positive-It's speedy and space saving and inexpensive",
"negative-The sound quality is very nice for the price, but since the player doesn't work, it's essentially useless",
"negative-They just suck and have a high failure rate"
}
In the labeled set L, labeled corpus 1 "The screen of the mobile is perfect" is labeled as positive, labeled corpus 2 "It's speedy and space saving and inexpensive" is labeled as positive, labeled corpus 3 "The sound quality is very nice for the price, but since the player doesn't work, it's essentially useless" is labeled as negative, and labeled corpus 4 "They just suck and have a high failure rate" is labeled as negative.
Suppose the unlabeled set T is as follows:
T={“the product is too bad”,
“The phone screen is too small”,
“I like the appearance of the product”
}
The unlabeled set T comprises 3 corpora, namely unlabeled corpus 1 "the product is too bad", unlabeled corpus 2 "The phone screen is too small" and unlabeled corpus 3 "I like the appearance of the product". None of these corpora has been given a sentiment class.
In step 201, the emotion classifiers in the set of emotion classifiers are used to label the labeled corpora in the labeled set.
In the present embodiment, the set of emotion classifiers C comprises 3 emotion classifiers: classifier 1, classifier 2 and classifier 3. These three classifiers can be obtained by training multiple emotion classifier models on the labeled set L.
Although the labeled set L already contains the sentiment classes of its corpora, this step disregards those classes and uses every classifier in the set C to label the sentiment class of each labeled corpus again.
The labeling results of step 201 are as follows, where "+" denotes positive sentiment orientation and "-" denotes negative sentiment orientation:
Labeling results of classifier 1: <labeled corpus 1, +>,
<labeled corpus 2, ->,
<labeled corpus 3, ->,
<labeled corpus 4, ->.
Labeling results of classifier 2: <labeled corpus 1, +>,
<labeled corpus 2, +>,
<labeled corpus 3, +>,
<labeled corpus 4, +>.
Labeling results of classifier 3: <labeled corpus 1, +>,
<labeled corpus 2, ->,
<labeled corpus 3, ->,
<labeled corpus 4, +>.
In step 202, the classification bias differences between the emotion classifiers are determined from the labeling results for the labeled corpora.
The labeling results above give the results of each of the three classifiers. Step 202 therefore computes the classification bias difference between classifier 1 and classifier 2, between classifier 1 and classifier 3, and between classifier 2 and classifier 3.
The classification bias difference can be obtained from the classification similarity between emotion classifiers, which can be computed in several ways.
For example, the number of corpora without classification conflict can be counted in the labeling results above, and the classification similarity determined from that number M and the total number N of corpora in the labeled set. The classification similarity can be defined, for example, as the ratio M/N, or by any other similarity measure commonly used by those skilled in the art.
Suppose the present embodiment uses M/N as the classification similarity. According to the labeling results of classifier 1 and classifier 2, the two assign the same class only to labeled corpus 1 and different classes to the other three labeled corpora. The number of corpora without classification conflict between classifier 1 and classifier 2 is therefore 1, and since the labeled set contains 4 labeled corpora in total, the classification similarity between classifier 1 and classifier 2 is 1/4 = 0.25. Likewise, the classification similarity between classifier 1 and classifier 3 is 3/4 = 0.75, and that between classifier 2 and classifier 3 is 2/4 = 0.50.
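The M/N computation above can be reproduced with a few lines of Python. The dictionary layout and the names `RESULTS` and `similarity` are illustrative assumptions; the label sequences are the step 201 results listed earlier.

```python
from itertools import combinations

# Step 201 labeling results for labeled corpora 1 to 4.
RESULTS = {
    "classifier 1": ["+", "-", "-", "-"],
    "classifier 2": ["+", "+", "+", "+"],
    "classifier 3": ["+", "-", "-", "+"],
}


def similarity(labels_a, labels_b):
    """M/N: share of corpora on which two classifiers give the same label."""
    m = sum(x == y for x, y in zip(labels_a, labels_b))
    return m / len(labels_a)


pairwise = {
    (a, b): similarity(RESULTS[a], RESULTS[b])
    for a, b in combinations(RESULTS, 2)
}
for pair, value in pairwise.items():
    print(pair, value)
```

The three printed values are 0.25, 0.75 and 0.5, matching the figures in the text.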
It should be noted that the similarity among classifier 1, classifier 2 and classifier 3 taken together can be determined as the ratio of the number of corpora on which all three classifiers are free of classification conflict to the total number of corpora in the labeled set. For example, because the three classifiers give the same labeling result only for labeled corpus 1 (all label it "+", positive sentiment orientation), the classification similarity among the three classifiers is 1/4 = 0.25. When the set of emotion classifiers contains many classifiers (for example, 100 classifiers), the classification similarity of a group of three or more classifiers can be computed in the same way.
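For a group of three or more classifiers, the same ratio can be computed by checking each corpus for unanimous labels. The sketch below assumes one label list per classifier, in the same corpus order; the function name `group_similarity` is hypothetical.

```python
def group_similarity(label_lists):
    """Share of corpora that every classifier in the group labels identically."""
    n = len(label_lists[0])
    # zip(*label_lists) yields, for each corpus, the labels from all classifiers.
    m = sum(len(set(column)) == 1 for column in zip(*label_lists))
    return m / n


three = [
    ["+", "-", "-", "-"],  # classifier 1
    ["+", "+", "+", "+"],  # classifier 2
    ["+", "-", "-", "+"],  # classifier 3
]
print(group_similarity(three))
```

For the three classifiers of this embodiment the result is 0.25, since only labeled corpus 1 is labeled identically by all of them.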
In addition, the above classification similarity can also be computed with a similarity measure such as cosine similarity, the Dice coefficient, Chi-square, log-likelihood or an F1-measure-like measure.
From the classification similarity between classifiers, the classification bias difference can be calculated in several ways. For example, it can be defined as the reciprocal of the classification similarity, as 1/(1 + classification similarity), or by any other calculation commonly used in the art. Suppose the present embodiment uses "1/(1 + classification similarity)". The classification bias difference between classifier 1 and classifier 2 is then 1/(1+0.25) = 0.8. Likewise, the classification bias difference between classifier 1 and classifier 3 is 1/(1+0.75) = 0.57, and that between classifier 2 and classifier 3 is 1/(1+0.50) = 0.67.
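The "1/(1 + classification similarity)" definition can be checked numerically as follows. The dictionary of similarities restates the values from the text; the names `SIMILARITY` and `bias_difference` are illustrative.

```python
# Pairwise classification similarities from the M/N example.
SIMILARITY = {
    ("classifier 1", "classifier 2"): 0.25,
    ("classifier 1", "classifier 3"): 0.75,
    ("classifier 2", "classifier 3"): 0.50,
}


def bias_difference(similarity):
    """Classification bias difference defined as 1 / (1 + similarity)."""
    return 1.0 / (1.0 + similarity)


differences = {pair: bias_difference(s) for pair, s in SIMILARITY.items()}
for pair, value in differences.items():
    print(pair, round(value, 2))
```

Rounded to two decimals, the values are 0.8, 0.57 and 0.67. Under this definition a lower similarity always gives a larger bias difference, and the value stays finite even when the similarity is 0, which the plain reciprocal would not.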
In another embodiment, suppose part of the labeling results of step 201 is as follows.
Labeling results of classifier 1: <labeled corpus 1, +, 98%>,
<labeled corpus 1, -, 78%>,
…
Labeling results of classifier 2: <labeled corpus 1, +, 78%>,
<labeled corpus 1, -, 90%>,
…
The above shows only, schematically, the classification results of classifier 1 and classifier 2 for labeled corpus 1. "+" denotes positive sentiment orientation, "-" denotes negative sentiment orientation, and the percentage denotes the classification precision (confidence) of labeling the corpus as "+" or "-". For example, "<labeled corpus 1, +, 98%>" means that the classification precision of labeling labeled corpus 1 as "+" is 98%.
When the similarity is computed with the F1-measure-like method, the classification bias difference between classifier 1 and classifier 2 (denoted a) can be calculated with the following equation:
$$a = \frac{2 \times \operatorname{diff}(p_1^{+} - p_2^{+}) \times \operatorname{diff}(p_1^{-} - p_2^{-})}{\operatorname{diff}(p_1^{+} - p_2^{+}) + \operatorname{diff}(p_1^{-} - p_2^{-})} \times 100 \qquad (1)$$
where $p_1^{+}$ and $p_1^{-}$ denote the classification precision with which classifier 1 labels a corpus as positive and as negative, respectively ($p_2^{+}$ and $p_2^{-}$ are the corresponding values for classifier 2), and
$$\operatorname{diff}(p_1^{+} - p_2^{+}) = \frac{\operatorname{Abs}(p_1^{+} - p_2^{+})}{\operatorname{Max}(p_1^{+}, p_2^{+})} \qquad (2)$$
where $\operatorname{Abs}(p_1^{+} - p_2^{+})$ is the absolute value of $p_1^{+} - p_2^{+}$, and $\operatorname{Max}(p_1^{+}, p_2^{+})$ is the larger of $p_1^{+}$ and $p_2^{+}$.
Equation (1) gives the classification bias difference a of classifier 1 and classifier 2 for labeled corpus 1:
$$a = \frac{2 \times \frac{|0.98 - 0.78|}{0.98} \times \frac{|0.78 - 0.9|}{0.9}}{\frac{|0.98 - 0.78|}{0.98} + \frac{|0.78 - 0.9|}{0.9}} \times 100 = 16.1$$
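Equations (1) and (2) translate directly into code. In the sketch below the function and parameter names are illustrative, and returning 0 when both relative differences are 0 is an assumed convention that the patent does not address.

```python
def diff(p, q):
    """Equation (2): relative difference |p - q| / max(p, q)."""
    return abs(p - q) / max(p, q)


def f1_like_bias_difference(p1_pos, p1_neg, p2_pos, p2_neg):
    """Equation (1): harmonic-mean style combination of the two differences."""
    d_pos = diff(p1_pos, p2_pos)
    d_neg = diff(p1_neg, p2_neg)
    if d_pos + d_neg == 0:
        return 0.0  # assumed convention when both classifiers coincide
    return 2 * d_pos * d_neg / (d_pos + d_neg) * 100


# Classifier 1: + at 98%, - at 78%; classifier 2: + at 78%, - at 90%.
a = f1_like_bias_difference(0.98, 0.78, 0.78, 0.90)
print(round(a, 1))
```

The printed value is 16.1, in agreement with the worked calculation for labeled corpus 1.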
Similarly, the classification that sorter 1 and sorter 2 marked language material for other can be obtained and be biased difference.Correspondingly, also can obtain sorter 1 and sorter 3 is biased difference for each classification having marked language material, and sorter 2 and sorter 3 are biased difference for each classification having marked language material.
In step 203, a group of emotion classifiers is selected based on the classification bias differences according to a predetermined policy.
In this step, for example, the classification bias differences can be sorted and the group of emotion classifiers corresponding to the largest classification bias difference selected. Alternatively, the classification bias differences greater than a predetermined threshold can be determined and a group of emotion classifiers corresponding to one of them selected.
In one example, suppose that this embodiment selects the group of emotion classifiers corresponding to the largest classification bias difference. Sorting the three classification bias differences obtained in step 202 shows that the difference between classifier 1 and classifier 2, 0.8, is the largest of the three, so the group of classifiers corresponding to it, namely classifier 1 and classifier 2, is selected.
In another example, suppose that this embodiment selects a group of emotion classifiers corresponding to a classification bias difference greater than the predetermined threshold, and that the threshold is 0.6. Of the three classification bias differences obtained in step 202, the difference of 0.8 between classifier 1 and classifier 2 and the difference of 0.67 between classifier 2 and classifier 3 both exceed the threshold, so step 203 may select either the group consisting of classifier 1 and classifier 2 or the group consisting of classifier 2 and classifier 3. Alternatively, any other suitable algorithm may be used to choose between these two groups and thereby determine which group is finally selected.
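The two selection policies of step 203 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the value for the pair of classifier 1 and classifier 3 is not given in the text, so a placeholder below the threshold is assumed.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def select_max(differences: Dict[Pair, float]) -> Pair:
    """Policy 1: pick the pair with the largest classification bias difference."""
    return max(differences, key=lambda pair: differences[pair])


def select_above_threshold(differences: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Policy 2: keep every pair whose difference exceeds the threshold."""
    return [pair for pair, value in differences.items() if value > threshold]


differences = {
    ("classifier 1", "classifier 2"): 0.8,
    ("classifier 2", "classifier 3"): 0.67,
    ("classifier 1", "classifier 3"): 0.5,  # hypothetical value, not given in the text
}

best = select_max(differences)
candidates = select_above_threshold(differences, threshold=0.6)
```

With the threshold policy, more than one pair may qualify, so a further choice among `candidates` is still needed, as the text notes.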
In step 204, the selected group of emotion classifiers is used to annotate the unannotated corpora.
Assuming that step 203 finally selects the group consisting of classifier 1 and classifier 2, step 204 uses classifier 1 and classifier 2 separately to annotate the unannotated set T.
For example, step 204 produces the following annotation results for the unannotated corpora, where "+" denotes a positive emotion tendency and "-" denotes a negative emotion tendency:
Annotation results of classifier 1: <unannotated corpus 1, ->,
<unannotated corpus 2, ->,
<unannotated corpus 3, +>.
Annotation results of classifier 2: <unannotated corpus 1, ->,
<unannotated corpus 2, +>,
<unannotated corpus 3, ->.
In step 205, the unannotated corpora that the group of emotion classifiers annotates identically are determined from the annotation results.
According to the annotation results of step 204, classifier 1 and classifier 2 both annotate unannotated corpus 1 as having a negative emotion tendency "-", but they annotate unannotated corpus 2 and unannotated corpus 3 differently. Therefore, the only corpus that the group annotates identically is unannotated corpus 1.
In step 206, the determined unannotated corpora are extracted as credible annotated corpora.
Here, unannotated corpus 1 can be regarded as a credible annotated corpus, because a group of classifiers with a large classification bias difference all annotate it as having a negative emotion tendency. In other examples, if the group of classifiers had all annotated unannotated corpus 1 as having a positive emotion tendency, it would likewise be regarded as a credible annotated corpus. In the present invention, therefore, a credible annotated corpus means only that every classifier in a group annotates the corpus identically; it does not restrict which specific category the classifiers must assign.
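Steps 205 and 206 amount to keeping the corpora on which every classifier in the group agrees. A minimal sketch using the annotation results listed for step 204 follows; the function name `agreed_corpora` and the dictionary representation are illustrative choices, not part of the described method.

```python
from typing import Dict, List

Annotations = Dict[str, str]  # corpus identifier -> label ("+" or "-")


def agreed_corpora(results: List[Annotations]) -> Dict[str, str]:
    """Return the corpora that every classifier in the group labels identically."""
    agreed = {}
    first = results[0]
    for corpus, label in first.items():
        if all(other.get(corpus) == label for other in results[1:]):
            agreed[corpus] = label
    return agreed


classifier_1 = {"corpus 1": "-", "corpus 2": "-", "corpus 3": "+"}
classifier_2 = {"corpus 1": "-", "corpus 2": "+", "corpus 3": "-"}

credible = agreed_corpora([classifier_1, classifier_2])
print(credible)  # prints {'corpus 1': '-'}
```

The check does not look at which label was assigned, only at whether the labels match, which mirrors the definition of a credible annotated corpus given above.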
In addition, extracting unannotated corpus 1 as a credible annotated corpus means removing it from the unannotated set T, which then becomes:
T={“The phone screen is too small”,
“I like the appearance of the product”
}
It should also be noted that steps 205 and 206 can be replaced by other ways of extracting credible annotated corpora. In another embodiment of the present invention, the probability that an unannotated corpus is annotated identically by the group of emotion classifiers is determined from the annotation results, and if that probability is greater than a predetermined confidence level, the unannotated corpus is extracted as a credible annotated corpus. For example, when the group comprises four classifiers, if three of them annotate the same unannotated corpus with one category and the fourth annotates it with a different category, the probability that the group annotates this corpus identically is 3/4, that is, 0.75. If the predetermined confidence level is 0.7, then since 0.75 > 0.7 the unannotated corpus can be extracted as a credible annotated corpus.
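The probability-based variant can be sketched as a majority-agreement check. The sketch below assumes that the "probability of being annotated identically" is the share of classifiers that chose the most common label, which matches the 3/4 example; the function names and the concrete labels are illustrative.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_agreement(labels: List[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of classifiers that chose it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def extract_if_credible(labels: List[str], confidence: float) -> Optional[str]:
    """Return the majority label if its agreement exceeds the confidence level."""
    label, probability = majority_agreement(labels)
    return label if probability > confidence else None


# Four classifiers, three of which agree (the labels themselves are illustrative).
labels = ["+", "+", "+", "-"]
label, probability = majority_agreement(labels)
result = extract_if_credible(labels, confidence=0.7)
print(probability, result)  # prints 0.75 +
```

With full agreement the probability is 1.0, so the strict-agreement rule of steps 205 and 206 is the special case in which only a probability of 1.0 is accepted.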
In step 207, it is judged whether the number of credible annotated corpora equals 0.
This step first determines the number of credible annotated corpora obtained in step 206. If the number equals 0, no credible annotated corpus exists, and the flow ends. If the number is not 0, at least one credible annotated corpus was extracted from the unannotated set in step 206, and the flow proceeds to step 208.
In this embodiment, step 206 extracted a credible annotated corpus, namely unannotated corpus 1, so the flow continues from step 207 to step 208.
In step 208, the credible annotated corpora are used to update the annotated set.
In this step, the credible annotated corpus and the category assigned to it are added to the annotated set, so that the annotated set L is updated as follows:
L={“positive-The screen of the mobile is perfect”,
“positive-It's speedy and space saving and inexpensive”,
“negative-The sound quality is very nice for the price, but
since the player doesn't work, it's essentially useless”,
“negative-They just suck and have a high failure rate”,
“negative-the product is too bad”
}
The credible annotated corpus and its category are shown underlined.
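The update of step 208, together with the removal from the unannotated set described for step 206, can be sketched as below. The function name `update_sets` is hypothetical, only two of the existing annotated entries are reproduced for brevity, and taking "the product is too bad" as the text of unannotated corpus 1 is an assumption based on the updated set shown above.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (category, text)


def update_sets(
    annotated: List[Entry],
    unannotated: List[str],
    credible: Dict[str, str],
) -> Tuple[List[Entry], List[str]]:
    """Append credible corpora to the annotated set and drop them from the unannotated set."""
    new_annotated = annotated + [(category, text) for text, category in credible.items()]
    new_unannotated = [text for text in unannotated if text not in credible]
    return new_annotated, new_unannotated


L = [
    ("positive", "The screen of the mobile is perfect"),
    ("negative", "They just suck and have a high failure rate"),
]
T = [
    "the product is too bad",  # assumed text of unannotated corpus 1
    "The phone screen is too small",
    "I like the appearance of the product",
]
credible = {"the product is too bad": "negative"}

L, T = update_sets(L, T, credible)
```

After the call, the credible corpus appears once in L with its category and no longer appears in T.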
In step 209, the updated annotated set is used to train the group of emotion classifiers.
Here, the updated annotated set is used to train the group of emotion classifiers selected in step 203. That is, the corpora in the annotated set L obtained in step 208 are used to train classifier 1 and classifier 2, rather than all the classifiers in the emotion classifier set initially used in step 201.
A classifier can be trained on corpora by a variety of methods, for example training methods based on naive Bayes, on a maximum entropy model, on SVM classification, or on a CRF (conditional random field) model.
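One pass through steps 204 to 209 can be sketched end to end. The sketch below is an illustration under stated assumptions, not the patented implementation: `WordCountClassifier` is a toy stand-in written for this example (it is not naive Bayes, maximum entropy, SVM, or CRF), the `default_label` parameter exists only to give the two toy classifiers different biases, and the sample sentences are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Annotated = List[Tuple[str, str]]  # (label, text) pairs


class WordCountClassifier:
    """Toy stand-in for a trainable emotion classifier (hypothetical, for illustration)."""

    def __init__(self, default_label: str) -> None:
        self.default_label = default_label
        self.counts: Dict[str, Counter] = defaultdict(Counter)

    def train(self, annotated: Annotated) -> None:
        """Recount how often each word occurs under each label."""
        self.counts = defaultdict(Counter)
        for label, text in annotated:
            self.counts[label].update(text.lower().split())

    def predict(self, text: str) -> str:
        """Pick the label whose training words overlap most with the text."""
        words = text.lower().split()
        scores = {label: sum(c[w] for w in words) for label, c in self.counts.items()}
        best = max(scores.values(), default=0)
        if best == 0:
            return self.default_label
        winners = [label for label, score in scores.items() if score == best]
        return winners[0] if len(winners) == 1 else self.default_label


def one_round(
    group: List[WordCountClassifier], annotated: Annotated, unannotated: List[str]
) -> Tuple[Annotated, List[str], bool]:
    """Steps 204 to 209 for one pass: annotate, extract agreements, update, retrain."""
    credible = []
    for text in unannotated:
        labels = {clf.predict(text) for clf in group}
        if len(labels) == 1:  # every classifier in the group agrees
            credible.append((labels.pop(), text))
    if not credible:  # step 207: no credible corpus, so the flow ends
        return annotated, unannotated, False
    annotated = annotated + credible
    taken = {text for _, text in credible}
    unannotated = [t for t in unannotated if t not in taken]
    for clf in group:  # step 209: retrain only the selected group
        clf.train(annotated)
    return annotated, unannotated, True


annotated = [("positive", "the screen is perfect"), ("negative", "a high failure rate")]
group = [WordCountClassifier("positive"), WordCountClassifier("negative")]
for clf in group:
    clf.train(annotated)
annotated, unannotated, changed = one_round(
    group, annotated, ["the screen is nice", "great battery"]
)
```

In this run both toy classifiers label "the screen is nice" as positive, so it moves into the annotated set, while "great battery" stays unannotated because the two classifiers fall back to different default labels. Calling `one_round` repeatedly until it reports no change is one possible way to iterate; whether and how the flow repeats is an assumption here.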
In addition, whether an emotion classifier is corpus-based or dictionary-based, its precision depends to a large extent on the quality and quantity of its training corpora or internal dictionary, so efficiently obtaining a large number of training corpora plays an important role in improving the overall precision of emotion classification. Therefore, in another embodiment, in addition to the steps of the method for optimizing emotion classifiers shown in Fig. 2, the method may further comprise a step of obtaining the annotated set in advance. This step may be implemented, for example, as follows: judging whether an Internet resource is relevant to emotion classification; extracting positive evaluation information and negative evaluation information from the resources relevant to emotion classification; and filtering the positive and negative evaluation information based on statistical information or predetermined rules, thereby obtaining corpora and their categories. By obtaining the annotated set in advance in this way, a large number of annotated corpora can be extracted semi-automatically from the existing Internet, in particular from B2C websites, to improve the precision of sentiment analysis. Moreover, collecting a large number of annotated corpora in advance reduces the burden of manual annotation and can further improve the classification precision of the present invention.
Fig. 3 is a block diagram of an equipment 300 for optimizing emotion classifiers according to one embodiment of the present invention.
The equipment 300 may comprise: a selecting device 310, an annotating device 320, an extracting device 330, an updating device 340, and a training device 350.
The selecting device 310 is configured to select, from an emotion classifier set and based on the annotated set, a group of emotion classifiers with a large classification bias difference. In one embodiment, the selecting device 310 may comprise: a classification unit for using the emotion classifiers in the emotion classifier set to annotate the annotated corpora in the annotated set; a classification bias difference determining unit for determining the classification bias differences between the emotion classifiers according to the annotation results for the annotated corpora; and an emotion classifier selection unit for selecting a group of emotion classifiers based on the classification bias differences according to a predetermined policy.
In one implementation, the classification bias difference determining unit may comprise: means for calculating the classification similarity between the emotion classifiers according to the annotation results for the annotated corpora; and means for obtaining the classification bias differences between the emotion classifiers based on the classification similarity. For example, the means for calculating the classification similarity may comprise: means for counting, in the annotation results for the annotated corpora, the number of corpora without classification conflict, where different emotion classifiers give the same annotation result for a corpus without classification conflict; and means for determining the classification similarity based on that number and the total number of corpora in the annotated set. As another example, the means for calculating the classification similarity may comprise means for calculating the classification similarity by one of the following similarity measures: cosine of the included angle, Dice coefficient, chi-square, log-likelihood, and F1-like measure.
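The conflict-free counting variant of the classification similarity can be sketched as follows. The ratio of conflict-free corpora to the total follows the description above; taking the classification bias difference as one minus the similarity is an assumption made for this sketch, since this passage does not state the mapping. The function names and the five annotation results are hypothetical.

```python
from typing import List


def classification_similarity(labels_a: List[str], labels_b: List[str]) -> float:
    """Share of annotated corpora on which two classifiers give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both classifiers must label the same non-empty set")
    conflict_free = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return conflict_free / len(labels_a)


def bias_difference_from_similarity(similarity: float) -> float:
    """Assumption: take the difference as the complement of the similarity."""
    return 1.0 - similarity


# Hypothetical annotation results on five annotated corpora.
classifier_1 = ["+", "+", "-", "-", "+"]
classifier_2 = ["+", "-", "+", "+", "-"]
similarity = classification_similarity(classifier_1, classifier_2)
difference = bias_difference_from_similarity(similarity)
```

Here the two classifiers agree on one corpus out of five, so the similarity is 0.2 and, under the stated assumption, the difference is 0.8.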
In one implementation, the emotion classifier selection unit may comprise: means for sorting the classification bias differences; and means for selecting the group of emotion classifiers corresponding to the largest classification bias difference.
In another implementation, the emotion classifier selection unit may comprise: means for determining the classification bias differences greater than a predetermined threshold; and means for selecting a group of emotion classifiers corresponding to a determined classification bias difference.
The annotating device 320 is configured to use the group of emotion classifiers to annotate the unannotated corpora.
The extracting device 330 is configured to extract credible annotated corpora from the unannotated corpora according to the annotation results.
In one embodiment, the extracting device 330 may comprise: means for determining, according to the annotation results, the unannotated corpora annotated identically by the group of emotion classifiers; and means for extracting the determined unannotated corpora as credible annotated corpora.
In another embodiment, the extracting device 330 may comprise: means for determining, according to the annotation results, the probability that an unannotated corpus is annotated identically by the group of emotion classifiers; and means for extracting the unannotated corpus as a credible annotated corpus if the probability is greater than a predetermined confidence level.
The updating device 340 is configured to use the credible annotated corpora to update the annotated set.
The training device 350 is configured to train the group of emotion classifiers with the updated annotated set, so as to optimize the emotion classifiers.
In addition, in one embodiment, the equipment for optimizing emotion classifiers of the present invention may further comprise an acquiring device for obtaining the annotated set in advance, where the annotated set comprises annotated corpora and their categories.
In summary, the method and equipment for optimizing emotion classifiers of the present invention can eliminate the classification bias problem of emotion classification while significantly improving emotion classification precision. Specifically, because the present invention can automatically extract a group of classifiers whose classification biases are symmetric, based on training corpora, relevant sentiment dictionaries, and test corpora, eliminating classification bias becomes possible. The present invention uses a co-training framework to repeatedly use a group of classifiers with symmetric classification biases and to correct each classifier in the group separately, so that classification bias is repaired and classification precision improved through co-training and the credible document set. The method and equipment of the present invention not only greatly reduce the labor cost of annotating training corpora but also make it possible to use large-scale training corpora, further improving classification precision.
The method of the present invention can be implemented in software, hardware, or a combination of software and hardware. The hardware part can be implemented with dedicated logic; the software part can be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor, a personal computer (PC), or a mainframe.
It should be noted that, to make the present invention easier to understand, the above description omits some more specific technical details that are well known to those skilled in the art and that may be necessary for implementing the present invention.
The description of the present invention is provided for the purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the disclosed form. Many modifications and changes will be apparent to those of ordinary skill in the art.
Therefore, the embodiments were chosen and described in order to better explain the principles of the invention and its practical application, and to enable those of ordinary skill in the art to understand that all modifications and changes that do not depart from the essence of the invention fall within the scope of protection defined by the claims.

Claims (18)

1. A method for optimizing emotion classifiers, comprising:
selecting, from an emotion classifier set and based on an annotated set, a group of emotion classifiers with a large classification bias difference;
using the group of emotion classifiers to annotate unannotated corpora;
extracting credible annotated corpora from the unannotated corpora according to the annotation results;
using the credible annotated corpora to update the annotated set; and
training the group of emotion classifiers with the updated annotated set, so as to optimize the emotion classifiers;
wherein extracting credible annotated corpora from the unannotated corpora according to the annotation results comprises:
determining, according to the annotation results, the probability that an unannotated corpus is annotated identically by the group of emotion classifiers; and
if the probability is greater than a predetermined confidence level, extracting the unannotated corpus as a credible annotated corpus.
2. The method according to claim 1, wherein selecting, from the emotion classifier set and based on the annotated set, a group of emotion classifiers with a large classification bias difference comprises:
using the emotion classifiers in the emotion classifier set to annotate the annotated corpora in the annotated set;
determining the classification bias differences between the emotion classifiers according to the annotation results for the annotated corpora; and
selecting a group of emotion classifiers based on the classification bias differences according to a predetermined policy.
3. The method according to claim 2, wherein determining the classification bias differences between the emotion classifiers according to the annotation results for the annotated corpora comprises:
calculating the classification similarity between the emotion classifiers according to the annotation results for the annotated corpora; and
obtaining the classification bias differences between the emotion classifiers based on the classification similarity.
4. The method according to claim 3, wherein calculating the classification similarity between the emotion classifiers according to the annotation results for the annotated corpora comprises:
counting, in the annotation results for the annotated corpora, the number of corpora without classification conflict, wherein different emotion classifiers give the same annotation result for a corpus without classification conflict; and
determining the classification similarity based on the number of corpora without classification conflict and the total number of corpora in the annotated set.
5. The method according to claim 3, wherein calculating the classification similarity between the emotion classifiers according to the annotation results for the annotated corpora comprises:
calculating the classification similarity by one of the following similarity measures: cosine of the included angle, Dice coefficient, chi-square, log-likelihood, and F1-like measure.
6. The method according to claim 2, wherein selecting a group of emotion classifiers based on the classification bias differences according to a predetermined policy comprises:
sorting the classification bias differences; and
selecting the group of emotion classifiers corresponding to the largest classification bias difference.
7. The method according to claim 2, wherein selecting a group of emotion classifiers based on the classification bias differences according to a predetermined policy comprises:
determining the classification bias differences greater than a predetermined threshold; and
selecting a group of emotion classifiers corresponding to a determined classification bias difference.
8. The method according to claim 1, wherein extracting credible annotated corpora from the unannotated corpora according to the annotation results comprises:
determining, according to the annotation results, the unannotated corpora annotated identically by the group of emotion classifiers; and
extracting the determined unannotated corpora as credible annotated corpora.
9. The method according to claim 1, further comprising:
obtaining the annotated set in advance, wherein the annotated set comprises annotated corpora and their categories.
10. An equipment for optimizing emotion classifiers, comprising:
a selecting device for selecting, from an emotion classifier set and based on an annotated set, a group of emotion classifiers with a large classification bias difference;
an annotating device for using the group of emotion classifiers to annotate unannotated corpora;
an extracting device for extracting credible annotated corpora from the unannotated corpora according to the annotation results;
an updating device for using the credible annotated corpora to update the annotated set; and
a training device for training the group of emotion classifiers with the updated annotated set, so as to optimize the emotion classifiers;
wherein the extracting device comprises:
means for determining, according to the annotation results, the probability that an unannotated corpus is annotated identically by the group of emotion classifiers; and
means for extracting the unannotated corpus as a credible annotated corpus if the probability is greater than a predetermined confidence level.
11. The equipment according to claim 10, wherein the selecting device comprises:
a classification unit for using the emotion classifiers in the emotion classifier set to annotate the annotated corpora in the annotated set;
a classification bias difference determining unit for determining the classification bias differences between the emotion classifiers according to the annotation results for the annotated corpora; and
an emotion classifier selection unit for selecting a group of emotion classifiers based on the classification bias differences according to a predetermined policy.
12. The equipment according to claim 11, wherein the classification bias difference determining unit comprises:
means for calculating the classification similarity between the emotion classifiers according to the annotation results for the annotated corpora; and
means for obtaining the classification bias differences between the emotion classifiers based on the classification similarity.
13. The equipment according to claim 12, wherein the means for calculating the classification similarity between the emotion classifiers according to the classification results for the annotated corpora comprises:
means for counting, in the annotation results for the annotated corpora, the number of corpora without classification conflict, wherein different emotion classifiers give the same annotation result for a corpus without classification conflict; and
means for determining the classification similarity based on the number of corpora without classification conflict and the total number of corpora in the annotated set.
14. The equipment according to claim 12, wherein the means for calculating the classification similarity between the emotion classifiers according to the annotation results for the annotated corpora comprises:
means for calculating the classification similarity by one of the following similarity measures: cosine of the included angle, Dice coefficient, chi-square, log-likelihood, and F1-like measure.
15. The equipment according to claim 11, wherein the emotion classifier selection unit comprises:
means for sorting the classification bias differences; and
means for selecting the group of emotion classifiers corresponding to the largest classification bias difference.
16. The equipment according to claim 11, wherein the emotion classifier selection unit comprises:
means for determining the classification bias differences greater than a predetermined threshold; and
means for selecting a group of emotion classifiers corresponding to a determined classification bias difference.
17. The equipment according to claim 10, wherein the extracting device comprises:
means for determining, according to the annotation results, the unannotated corpora annotated identically by the group of emotion classifiers; and
means for extracting the determined unannotated corpora as credible annotated corpora.
18. The equipment according to claim 10, further comprising:
an acquiring device for obtaining the annotated set in advance, wherein the annotated set comprises annotated corpora and their categories.
CN201010612244.7A 2010-12-24 2010-12-24 Method and equipment for optimizing emotional classifier Expired - Fee Related CN102541838B (en)


Publications (2)

Publication Number Publication Date
CN102541838A CN102541838A (en) 2012-07-04
CN102541838B true CN102541838B (en) 2015-03-11





Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150311

Termination date: 20161224