
CN102541838A - Method and equipment for optimizing emotional classifier

Info

Publication number: CN102541838A
Application number: CN2010106122447A
Authority: CN
Inventors: 胡长建; 邱立坤; 赵凯; 许洪志
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2010-12-24
Filing date: 2010-12-24
Publication date: 2012-07-04
Anticipated expiration: 2030-12-24
Also published as: CN102541838B

Abstract

The invention discloses a method and equipment for optimizing emotional classifiers. The method may comprise: selecting emotional classifiers with high classification difference from marked sets; marking the non-marked corpora by using the selected emotional classifiers; extracting reliable marked corpora from non-marked corpora according to the result of marking; updating the marked sets by using the reliable corpora; and training the selected emotional classifiers by using the marked sets to optimize the emotional classifiers. The method and equipment can eliminate the differences between the emotional classifiers and increase the precision of the emotional classifiers.

Description

Method and apparatus for optimizing emotion classifiers

Technical field

The present invention relates generally to the field of information processing, and in particular to a method and apparatus for optimizing emotion classifiers.

Background technology

With the widespread adoption of Web 2.0, the Web 1.0 mode of information exchange, in which "I speak and you listen, I perform and you watch, I write and you read", is giving way to one in which users are becoming the center of information generation. Accordingly, more and more users comment on the quality of products or services. Such comments express the users' own emotions and may be referred to as user generated content (UGC). User generated content is of great significance both to consumers and to manufacturers/merchants. Based on the objective evaluations of other users, consumers can reach purchase decisions more quickly, and manufacturers/merchants can better improve their products or services according to user feedback. One purpose of analyzing such review information is to extract the users' emotion tendency. This technique is called emotion classification, and its goal is to determine, for a given text, the emotion tendency expressed by the person who wrote it: positive or negative. Only when users' emotional expression is reflected accurately can it be of real and effective use to consumers and merchants, so an objective emotion classification method free of classification bias is very important.

Emotion classification is a classification problem in the field of natural language processing. In practice there are usually two types of approach: corpus-based methods and lexicon-based methods. Experiments show that both types of emotion classification algorithm suffer from a classification bias problem. In a real system, only by eliminating the classification bias can the true intention of the user be reflected more objectively, so the bias problem in emotion classification urgently needs to be solved.

To address the above problem, the industry has proposed some related solutions. For example, U.S. patent US20080249764 proposes a method that uses classifier aggregation to improve emotion classification accuracy, which reduces part of the emotion classification bias to some extent. However, the prior art neither analyzes the classification bias in depth nor addresses it in a targeted manner. US20080249764, for example, merely aggregates different classifiers, that is, it uses more classification features to improve classification precision, which cannot effectively solve the problem of how to eliminate emotion classification bias.

Summary of the invention

In view of the above problems in the prior art, an object of the present invention is to provide a method and apparatus for optimizing emotion classifiers, so that emotion classification bias can be eliminated through the optimized emotion classifiers.

According to a first aspect of the present invention, a method for optimizing emotion classifiers is provided. The method may comprise: selecting, based on a labeled set, a group of emotion classifiers with a large classification bias difference from an emotion classifier set; labeling unlabeled corpora using the group of emotion classifiers; extracting credible labeled corpora from the unlabeled corpora according to the labeling results; updating the labeled set using the credible labeled corpora; and training the group of emotion classifiers using the updated labeled set, so as to optimize the emotion classifiers.

According to a second aspect of the present invention, an apparatus for optimizing emotion classifiers is provided. The apparatus may comprise: a selecting device for selecting, based on a labeled set, a group of emotion classifiers with a large classification bias difference from an emotion classifier set; a labeling device for labeling unlabeled corpora using the group of emotion classifiers; an extracting device for extracting credible labeled corpora from the unlabeled corpora according to the labeling results; an updating device for updating the labeled set using the credible labeled corpora; and a training device for training the group of emotion classifiers using the updated labeled set, so as to optimize the emotion classifiers.

Other features and advantages of the present invention will become apparent from the following description of preferred embodiments taken in conjunction with the accompanying drawings.

Description of drawings

Other objects and effects of the present invention will become clearer and easier to understand from the following description taken in conjunction with the accompanying drawings, and as the invention is more fully understood, in which:

Fig. 1 is a flowchart of a method for optimizing emotion classifiers according to one embodiment of the present invention;

Fig. 2 is a flowchart of a method for optimizing emotion classifiers according to another embodiment of the present invention; and

Fig. 3 is a block diagram of an apparatus for optimizing emotion classifiers according to one embodiment of the present invention.

In all of the above drawings, identical reference numerals denote identical, similar or corresponding features or functions.

Detailed description of the embodiments

The present invention is described and explained in more detail below with reference to the accompanying drawings. It should be understood that the drawings and embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of protection of the present invention.

For the sake of clarity, the terms used in the present invention are explained first.

1. Corpus

A corpus in the present invention, also referred to as free text, may be a character, a word, a sentence, a fragment, an article, etc., or any combination thereof.

An unlabeled corpus is a corpus that has not been labeled with an emotion category.

A labeled corpus is a corpus that has been labeled with an emotion category. Obtaining a labeled corpus means obtaining both the corpus itself and the emotion category with which it has been labeled.

2. Emotion classification and emotion classifiers

Emotion classification is a classification problem in the field of natural language processing. Generally speaking, emotion classification refers to analyzing a corpus and labeling its emotion tendency, for example a positive emotion tendency or a negative emotion tendency, so that corpora are classified into positive-tendency corpora and negative-tendency corpora. Besides this two-category labeling, emotion may also be labeled with multiple categories. Since those skilled in the art can readily extend the processing for two categories to multiple categories, the present invention mainly describes two-category labeling. It should be noted, however, that the present invention is not limited to the case where emotion is classified into two categories.

At present, those skilled in the art commonly use the following emotion classification methods: corpus-based emotion classification methods and lexicon-based emotion classification methods.

A corpus-based method starts from a batch of corpora labeled in advance with emotion categories (for example, the batch may comprise a set of texts labeled as having a positive emotion tendency and a set of texts labeled as having a negative emotion tendency). This batch of corpora is used to train, by means of machine learning, an emotion classifier that learns a classification algorithm, and the trained emotion classifier is then used to label corpora that have not been labeled with an emotion category. A lexicon-based method prepares an emotion dictionary in advance, in which words that commonly express positive emotion and words that commonly express negative emotion are selected beforehand. For a given corpus not labeled with an emotion category, the numbers of occurrences of positive words and negative words are counted, and the emotion tendency of the corpus is then determined through normalization.
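The lexicon-based approach described above can be illustrated with a minimal Python sketch. The word lists, the function name `lexicon_label`, the length-based normalization and the tie-breaking rule are all assumptions made for illustration; the description does not specify a particular dictionary or normalization.

```python
import re

# Hypothetical toy emotion dictionary; a real one would be far larger.
POSITIVE_WORDS = {"perfect", "nice", "like", "speedy", "inexpensive"}
NEGATIVE_WORDS = {"bad", "useless", "suck", "failure", "small"}

def lexicon_label(text: str) -> str:
    """Label a corpus '+' or '-' from counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return "+"  # assumption: empty text defaults to positive
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    # Assumed normalization: difference of counts divided by text length.
    score = (positive - negative) / len(tokens)
    return "+" if score >= 0 else "-"  # assumption: ties count as positive

print(lexicon_label("The screen of the mobile is perfect"))
print(lexicon_label("the product is too bad"))
```

Counting a tie as positive is an arbitrary choice in this sketch; any such default necessarily pushes ambiguous corpora toward one category.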

Both the corpus-based methods and the lexicon-based methods may comprise a variety of specific algorithms rather than a single particular algorithm. A corpus-based method may be, for example, a specific emotion classification method based on a maximum entropy model, a decision tree model, a CRF (Conditional Random Field) model, a neural network model, or a Naive Bayes model. A lexicon-based method may be, for example, an emotion classification method based on a dictionary only, or one based on a dictionary and rules.

An emotion classifier is a tool that uses an emotion classification algorithm to label corpora with emotion categories. In the present invention, one emotion classifier may correspond to one emotion classification algorithm, and an emotion classifier may be trained on labeled corpora so as to reduce the emotion classification bias it produces when classifying unlabeled corpora. For ease of description, an emotion classifier is sometimes simply referred to as a classifier hereinafter.

3. Emotion classification bias

Experiments show that corpus-based emotion classification methods often label a corpus with positive emotion as negative, whereas lexicon-based emotion classification methods tend to label a corpus with negative emotion as positive. It can thus be seen that neither the corpus-based nor the lexicon-based emotion classification methods and classifiers can avoid emotion labeling errors. For ease of description, in the present invention the error of a classifier labeling positive emotion as negative is called negative bias, the error of a classifier labeling negative emotion as positive is called positive bias, and negative bias and positive bias are collectively called emotion classification bias.

The present invention relates to a method for optimizing emotion classifiers. The method may comprise: selecting, based on a labeled set, a group of emotion classifiers with a large classification bias difference from an emotion classifier set; labeling unlabeled corpora using the group of emotion classifiers; extracting credible labeled corpora from the unlabeled corpora according to the labeling results; updating the labeled set using the credible labeled corpora; and training the group of emotion classifiers using the updated labeled set, so as to optimize the emotion classifiers, thereby eliminating emotion classification bias and improving emotion classification precision.

In general, the method of the present invention has two characteristics: it automatically perceives mislabeled corpora, and it can automatically adjust the emotion classifiers. For example, one embodiment of the present invention may first select a group of classifiers with a large classification bias difference, the group comprising, for example, classifier A and classifier B, and then use the selected group to label given unclassified documents with emotion categories. The corpora on which the two classifiers produce different classification results may then be treated as corpora misclassified by classifier A, which realizes the automatic perception of mislabeled corpora. In addition, the corpora on which the two classifiers agree may be treated as credible labeled corpora and added to the labeled corpus set for retraining classifier A. Iterating in this way ensures that classifier A is continually and automatically adjusted. Because classifiers A and B are symmetric, the above process can be applied to classifiers A and B simultaneously. The group comprising classifiers A and B can thus continually obtain credible labeled corpora through mutual co-training, thereby overcoming the classification bias problem and significantly improving classification precision.

The embodiments of the present invention are described in detail below.

Fig. 1 is a flowchart of a method for optimizing emotion classifiers according to one embodiment of the present invention.

In step 101, a group of emotion classifiers with a large classification bias difference is selected from an emotion classifier set based on a labeled set.

In the present invention, the emotion classifier set in step 101 may comprise a plurality of emotion classifiers, which may be obtained by training a plurality of emotion classifier models on the labeled set.

According to one embodiment of the present invention, selecting a group of emotion classifiers with a large classification bias difference from the emotion classifier set based on the labeled set may be realized as follows: the emotion classifiers in the emotion classifier set are used to label the labeled corpora in the labeled set; the classification bias differences between the emotion classifiers are determined according to the labeling results for the labeled corpora; and a group of emotion classifiers is selected according to a predetermined policy based on the classification bias differences.

In one embodiment of the present invention, the classification bias difference between emotion classifiers may be obtained from the classification similarity between the emotion classifiers, which is calculated from the labeling results for the labeled corpora. The classification similarity may be obtained in a variety of ways. For example, the number of conflict-free corpora (corpora for which the different emotion classifiers produce the same labeling result) may be counted in the labeling results for the labeled corpora, and the classification similarity determined from the number of conflict-free corpora and the total number of labeled corpora in the labeled set. As another example, the classification similarity may be calculated by one of the similarity measures such as cosine similarity, the Dice coefficient, Chi-square, log-likelihood, and an F1-measure-like method. A group of emotion classifiers is then selected according to a predetermined policy based on the classification bias differences.

In one embodiment of the present invention, selecting a group of emotion classifiers according to a predetermined policy based on the classification bias differences may be accomplished in several ways. For example, the classification bias differences may be sorted, and the group of emotion classifiers corresponding to the largest classification bias difference selected. As another example, the classification bias differences greater than a predetermined threshold may be determined, and a group of emotion classifiers corresponding to a determined classification bias difference selected.

In step 102, unlabeled corpora are labeled using the group of emotion classifiers.

The group of emotion classifiers may comprise two or more different classifiers that have a large classification bias difference among them. Since one emotion classifier may correspond to one emotion classification algorithm, labeling the unlabeled corpora with the two or more emotion classifiers in the group yields two or more labeling results for each unlabeled corpus; the same unlabeled corpus may be labeled with the same category or with different categories.

In step 103, credible labeled corpora are extracted from the unlabeled corpora according to the labeling results.

In the present invention, a credible labeled corpus is determined from the labeling results of a group of classifiers for an unlabeled corpus. For example, when a group of classifiers labels the same unlabeled corpus identically, that unlabeled corpus may be regarded as a credible labeled corpus. Here, "labels identically" means that every classifier in the group labels the unlabeled corpus as having a positive emotion tendency, or every classifier labels it as having a negative emotion tendency. As another example, when the probability that a group of emotion classifiers labels the same unlabeled corpus identically is greater than a predetermined confidence level, that unlabeled corpus may be regarded as a credible labeled corpus. Here, "the probability of labeling identically" may be the larger of the following two probabilities: the probability that every classifier in the group labels the unlabeled corpus as having a positive emotion tendency, and the probability that every classifier in the group labels it as having a negative emotion tendency.

According to another embodiment of the present invention, extracting credible labeled corpora from the unlabeled corpora according to the labeling results may be accomplished in several ways. For example, the unlabeled corpora labeled identically by the group of emotion classifiers may be determined according to the labeling results, and the determined unlabeled corpora extracted as credible labeled corpora. As another example, the probability that an unlabeled corpus is labeled identically by the group of emotion classifiers may be determined according to the labeling results; if the probability is greater than a predetermined confidence level, the unlabeled corpus is extracted as a credible labeled corpus.
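The probability-based variant of step 103 might be sketched in Python as follows. The description does not say how the probability that every classifier assigns the same label is computed; this sketch assumes that each classifier outputs a probability of the positive label, that the negative probability is its complement, and that the classifiers are independent so the joint probability is a product. The function names and the confidence value 0.6 are hypothetical.

```python
from math import prod

def agreement_probability(positive_probs):
    """Return (label, probability) for the likelier unanimous outcome.

    positive_probs holds one P(label is '+') per classifier in the group.
    Assumption: classifiers are independent, so joint probability is a product.
    """
    all_positive = prod(positive_probs)
    all_negative = prod(1.0 - p for p in positive_probs)
    if all_positive >= all_negative:
        return "+", all_positive
    return "-", all_negative

def extract_credible(unlabeled_probs, confidence=0.6):
    """Keep corpora whose unanimous-label probability exceeds the confidence."""
    credible = []
    for text, positive_probs in unlabeled_probs.items():
        label, probability = agreement_probability(positive_probs)
        if probability > confidence:
            credible.append((text, label))
    return credible

# Hypothetical per-classifier probabilities for three unlabeled corpora.
scores = {"a": [0.9, 0.8], "b": [0.6, 0.4], "c": [0.1, 0.05]}
print(extract_credible(scores))
```

Taking the maximum of the two unanimous outcomes follows the definition of "the probability of labeling identically" given above; only the way each of the two probabilities is obtained is assumed here.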

In step 104, the labeled set is updated using the credible labeled corpora.

In this step, the credible labeled corpora, together with the categories with which they have been labeled, may be added to the labeled set, thereby expanding the labeled set. The credible labeled corpora are then no longer used as unlabeled corpora.

In step 105, the group of emotion classifiers is trained using the updated labeled set, so as to optimize the emotion classifiers.

Because the labeled set has been updated in step 104, the updated labeled set can better train the group of emotion classifiers selected in step 101, so that the group achieves better classification precision.
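Steps 101 to 105 can be put together in a short Python sketch. It is only one possible reading under stated assumptions: classifiers are hypothetical objects exposing `fit(texts, labels)` and `predict(text)`, the group is a pair, similarity is the share of identically labeled corpora, bias difference is 1/(1 + similarity), the pair with the largest bias difference is chosen, and steps 102 to 105 are repeated for a fixed number of rounds. The labeled set must be non-empty.

```python
from itertools import combinations

def agreement(labels_a, labels_b):
    """Share of corpora on which two classifiers give the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def optimize(classifiers, labeled, unlabeled, rounds=3):
    """labeled: list of (text, label); unlabeled: list of text."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    texts = [text for text, _ in labeled]

    # Step 101: pick the pair with the largest classification bias difference.
    def bias_difference(pair):
        first, second = pair
        similarity = agreement([first.predict(t) for t in texts],
                               [second.predict(t) for t in texts])
        return 1.0 / (1.0 + similarity)
    group = max(combinations(classifiers, 2), key=bias_difference)

    for _ in range(rounds):
        # Step 102: every classifier in the group labels every unlabeled corpus.
        votes = [(t, [clf.predict(t) for clf in group]) for t in unlabeled]
        # Step 103: corpora labeled identically by the whole group are credible.
        credible = [(t, v[0]) for t, v in votes if len(set(v)) == 1]
        if not credible:
            break
        # Step 104: move the credible corpora into the labeled set.
        labeled.extend(credible)
        moved = {t for t, _ in credible}
        unlabeled = [t for t in unlabeled if t not in moved]
        # Step 105: retrain the selected group on the updated labeled set.
        for clf in group:
            clf.fit([t for t, _ in labeled], [y for _, y in labeled])
    return group, labeled, unlabeled
```

The loop stops early once the group no longer agrees on any remaining corpus, since further rounds would then leave the labeled set unchanged.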

According to another embodiment of the present invention, the method for optimizing emotion classifiers may further comprise a process of obtaining the labeled set in advance; the labeled set may contain labeled corpora and their categories. Collecting a large number of labeled corpora in advance can reduce the burden of manual labeling and can also further improve the classification precision of the present invention.

Fig. 2 is a flowchart of a method for optimizing emotion classifiers according to another embodiment of the present invention. In this embodiment, it is assumed that the emotion classifier set C comprises 3 emotion classifiers, as follows:

C = {classifier 1, classifier 2, classifier 3},

where classifier 1, classifier 2 and classifier 3 may be built using any emotion classification algorithm.

It is assumed that the labeled set L comprises 4 corpora, each of which has been labeled with an emotion category, as follows:

L = { "positive - The screen of the mobile is perfect",
      "positive - It's speedy and space saving and inexpensive",
      "negative - The sound quality is very nice for the price, but since the player doesn't work, it's essentially useless",
      "negative - They just suck and have a high failure rate"
}

In the labeled set L, labeled corpus 1 "The screen of the mobile is perfect" is labeled as having a positive emotion tendency, labeled corpus 2 "It's speedy and space saving and inexpensive" is labeled as having a positive emotion tendency, labeled corpus 3 "The sound quality is very nice for the price, but since the player doesn't work, it's essentially useless" is labeled as having a negative emotion tendency, and labeled corpus 4 "They just suck and have a high failure rate" is labeled as having a negative emotion tendency.

It is assumed that the unlabeled set T is as follows:

T = { "the product is too bad",
      "The phone screen is too small",
      "I like the appearance of the product"
}

The unlabeled set T comprises 3 corpora, namely unlabeled corpus 1 "the product is too bad", unlabeled corpus 2 "The phone screen is too small" and unlabeled corpus 3 "I like the appearance of the product", none of which has been labeled with an emotion category.

In step 201, the labeled corpora in the labeled set are labeled using the emotion classifiers in the emotion classifier set.

In this embodiment, the emotion classifier set C comprises 3 emotion classifiers: classifier 1, classifier 2 and classifier 3. These 3 emotion classifiers may be obtained by training a plurality of emotion classifier models on the labeled set L.

Although the labeled set L contains the emotion categories of the labeled corpora, the existing emotion categories are not considered in this step; instead, all the classifiers in the emotion classifier set C are used to label the emotion categories of these labeled corpora again.

The labeling results of step 201 are as follows, where "+" denotes a positive emotion tendency and "-" denotes a negative emotion tendency:

Labeling results of classifier 1: <labeled corpus 1, +>, <labeled corpus 2, ->, <labeled corpus 3, ->, <labeled corpus 4, ->.

Labeling results of classifier 2: <labeled corpus 1, +>, <labeled corpus 2, +>, <labeled corpus 3, +>, <labeled corpus 4, +>.

Labeling results of classifier 3: <labeled corpus 1, +>, <labeled corpus 2, ->, <labeled corpus 3, ->, <labeled corpus 4, +>.

In step 202, the classification bias differences between the emotion classifiers are determined according to the labeling results for the labeled corpora.

The labeling results above give the individual results of three classifiers. Step 202 therefore calculates the classification bias difference between classifier 1 and classifier 2, the classification bias difference between classifier 1 and classifier 3, and the classification bias difference between classifier 2 and classifier 3.

The classification bias difference may be obtained by calculating the classification similarity between the emotion classifiers. The classification similarity may be calculated in a variety of ways, as follows.

For example, the number M of conflict-free corpora may be counted in the above labeling results, and the classification similarity determined from M and the total number N of labeled corpora in the labeled set. The classification similarity may be defined, for example, as the ratio M/N of the number of conflict-free corpora to the total number of labeled corpora in the labeled set, or by any other similarity calculation commonly used by those skilled in the art.

It is assumed that this embodiment uses M/N to determine the classification similarity. According to the labeling results of classifier 1 and classifier 2, the two assign the same category only to labeled corpus 1 and assign different categories to the other three labeled corpora. The number of conflict-free corpora between classifier 1 and classifier 2 is therefore 1, and since the labeled set contains 4 labeled corpora in total, the classification similarity between classifier 1 and classifier 2 is 1/4 = 0.25. Likewise, the classification similarity between classifier 1 and classifier 3 is 3/4 = 0.75, and the classification similarity between classifier 2 and classifier 3 is 2/4 = 0.50.

It should be noted that the similarity among classifier 1, classifier 2 and classifier 3 taken together may also be determined as the ratio of the number of corpora on which all three classifiers are conflict-free to the total number of labeled corpora in the labeled set. For example, since the three classifiers produce the same labeling result only for labeled corpus 1 (all label it "+", denoting a positive emotion tendency), the classification similarity among the three classifiers is 1/4 = 0.25. When the emotion classifier set contains many classifiers (for example, 100 classifiers), the classification similarity of a group comprising three or more classifiers can be calculated in the same way.
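The M/N calculation for this example can be reproduced with a few lines of Python. The label lists are the step 201 results given above; the function name `classification_similarity` and the dictionary layout are illustrative choices, not part of the described method.

```python
from itertools import combinations

# Step 201 labeling results for labeled corpora 1 to 4.
results = {
    "classifier 1": ["+", "-", "-", "-"],
    "classifier 2": ["+", "+", "+", "+"],
    "classifier 3": ["+", "-", "-", "+"],
}

def classification_similarity(*label_lists):
    """M/N: share of corpora on which all given classifiers agree."""
    total = len(label_lists[0])
    conflict_free = sum(len(set(labels)) == 1 for labels in zip(*label_lists))
    return conflict_free / total

# Pairwise similarities, then the similarity of the group of all three.
pairwise = {
    (a, b): classification_similarity(results[a], results[b])
    for a, b in combinations(results, 2)
}
group_of_three = classification_similarity(*results.values())

for pair, value in pairwise.items():
    print(pair, value)
print("all three:", group_of_three)
```

Because the function accepts any number of label lists, the same code covers both the pairwise case and groups of three or more classifiers.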

In addition, the above classification similarity may also be calculated by one of the similarity measures such as cosine similarity, the Dice coefficient, Chi-square, log-likelihood, and an F1-measure-like method.

From the classification similarity between classifiers, the classification bias difference may be calculated in several ways. For example, the classification bias difference may be defined as the reciprocal of the classification similarity, or as 1/(1 + classification similarity), or according to other calculations commonly used in the art. Assuming that this embodiment uses "1/(1 + classification similarity)" to determine the classification bias difference, the classification bias difference between classifier 1 and classifier 2 is 1/(1 + 0.25) = 0.8. Likewise, the classification bias difference between classifier 1 and classifier 3 is 1/(1 + 0.75) = 0.57, and the classification bias difference between classifier 2 and classifier 3 is 1/(1 + 0.50) = 0.67.
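As a quick numerical check of the "1/(1 + classification similarity)" definition, the following Python sketch recomputes the three values from the similarities 0.25, 0.75 and 0.50 given in this embodiment. The function name and the rounding to two decimals are illustrative.

```python
def bias_difference(similarity: float) -> float:
    """Classification bias difference defined as 1 / (1 + similarity)."""
    return 1.0 / (1.0 + similarity)

# Pairwise classification similarities from the M/N calculation.
similarities = {
    ("classifier 1", "classifier 2"): 0.25,
    ("classifier 1", "classifier 3"): 0.75,
    ("classifier 2", "classifier 3"): 0.50,
}

differences = {
    pair: round(bias_difference(value), 2)
    for pair, value in similarities.items()
}
print(differences)
```

One practical difference between the two definitions mentioned above is that the plain reciprocal is undefined when two classifiers never agree (similarity 0), whereas 1/(1 + similarity) stays within the range 0.5 to 1.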

In another embodiment, it is assumed that part of the labeling results of step 201 are as follows.

Labeling results of classifier 1: <labeled corpus 1, +, 98%>, <labeled corpus 1, -, 78%>, ...

Labeling results of classifier 2: <labeled corpus 1, +, 78%>, <labeled corpus 1, -, 90%>, ...

The above only schematically shows the classification results of classifier 1 and classifier 2 for labeled corpus 1, where "+" denotes a positive emotion tendency, "-" denotes a negative emotion tendency, and the percentage denotes the classification precision (reliability) of labeling the corpus as "+" or "-". For example, "<labeled corpus 1, +, 98%>" indicates that the classification precision of labeling labeled corpus 1 as "+" is 98%.

When the similarity is calculated by the F1-measure-like method, the classification bias difference between classifier 1 and classifier 2 (denoted a) may be calculated with the following equation:

a = 2 \times \frac{\mathrm{diff}(p_1^+ - p_2^+) \times \mathrm{diff}(p_1^- - p_2^-)}{\mathrm{diff}(p_1^+ - p_2^+) + \mathrm{diff}(p_1^- - p_2^-)} \times 100 \qquad (1)

where p_1^+ and p_1^- denote the classification precision with which classifier 1 labels a corpus as having a positive emotion tendency and a negative emotion tendency respectively, p_2^+ and p_2^- denote the same quantities for classifier 2, and

\mathrm{diff}(p_1^+ - p_2^+) = \frac{\mathrm{Abs}(p_1^+ - p_2^+)}{\mathrm{Max}(p_1^+, p_2^+)} \qquad (2)

where "Abs(p_1^+ - p_2^+)" denotes the absolute value of p_1^+ - p_2^+, and "Max(p_1^+, p_2^+)" denotes the larger of p_1^+ and p_2^+. diff(p_1^- - p_2^-) is defined in the same way for the negative precisions.

Equation (1) can be used to calculate the classification bias difference a of classifier 1 and classifier 2 for labeled corpus 1:

a = 2 \times \frac{(| 0.98 - 0.78 | / 0.98) \times (| 0.78 - 0.9 | / 0.9)}{| 0.98 - 0.78 | / 0.98 + | 0.78 - 0.9 | / 0.9} \times 100 = 16.1
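The worked value 16.1 can be checked with a short Python sketch of equations (1) and (2). The function names are illustrative, and returning 0 when both relative differences are zero is an assumption added here to avoid a division by zero; the description does not cover that case.

```python
def diff(p_first: float, p_second: float) -> float:
    """Equation (2): relative difference |p1 - p2| / max(p1, p2)."""
    return abs(p_first - p_second) / max(p_first, p_second)

def f1_like_bias_difference(pos_1, neg_1, pos_2, neg_2):
    """Equation (1): harmonic-mean style combination of the two diffs, times 100."""
    d_pos = diff(pos_1, pos_2)
    d_neg = diff(neg_1, neg_2)
    if d_pos + d_neg == 0:
        return 0.0  # assumption: identical precisions give zero bias difference
    return 2 * d_pos * d_neg / (d_pos + d_neg) * 100

# Classifier 1: 98% for "+", 78% for "-"; classifier 2: 78% for "+", 90% for "-".
a = f1_like_bias_difference(0.98, 0.78, 0.78, 0.90)
print(round(a, 1))
```

The combination has the same shape as the F1 measure: it is large only when the two classifiers differ noticeably on both the positive and the negative precision.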

Similarly, the classification bias differences of classifier 1 and classifier 2 for the other labeled corpora can be obtained. Correspondingly, the classification bias differences of classifier 1 and classifier 3 for each labeled corpus, and of classifier 2 and classifier 3 for each labeled corpus, can also be obtained.

In step 203, a group of emotion classifiers is selected according to a predetermined policy based on the classification bias differences.

In this step, for example, the classification bias differences may be sorted and the group of emotion classifiers corresponding to the largest classification bias difference selected; alternatively, the classification bias differences greater than a predetermined threshold may be determined, and a group of emotion classifiers corresponding to a determined classification bias difference selected.

In one example, it is assumed that this embodiment selects the group of emotion classifiers corresponding to the largest classification bias difference. By sorting the 3 classification bias differences obtained in step 202, it can be determined that the classification bias difference of 0.8 between classifier 1 and classifier 2 is the largest of the three, so the group of classifiers corresponding to this largest classification bias difference, namely classifier 1 and classifier 2, can be selected.

In another example, it is assumed that this embodiment selects a group of emotion classifiers corresponding to a classification bias difference greater than a predetermined threshold, and that the predetermined threshold is 0.6. Of the 3 classification bias differences obtained in step 202, the classification bias difference of 0.8 between classifier 1 and classifier 2 and the classification bias difference of 0.67 between classifier 2 and classifier 3 are both greater than this threshold. In step 203, therefore, either the group consisting of classifier 1 and classifier 2 or the group consisting of classifier 2 and classifier 3 may be selected. Alternatively, a further selection may be made between these two groups of classifiers according to any other suitable algorithm, so as to finally determine which group to select.
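The two selection policies of step 203 can be sketched in Python using the bias differences 0.8, 0.57 and 0.67 from step 202. The function names `select_largest` and `select_above_threshold` are hypothetical.

```python
# Classification bias differences from step 202.
differences = {
    ("classifier 1", "classifier 2"): 0.8,
    ("classifier 1", "classifier 3"): 0.57,
    ("classifier 2", "classifier 3"): 0.67,
}

def select_largest(bias_differences):
    """Policy 1: the group with the largest classification bias difference."""
    return max(bias_differences, key=bias_differences.get)

def select_above_threshold(bias_differences, threshold):
    """Policy 2: every group whose bias difference exceeds the threshold."""
    return [pair for pair, value in bias_differences.items() if value > threshold]

print(select_largest(differences))
print(select_above_threshold(differences, 0.6))
```

The threshold policy can return several candidate groups, which is why a further selection among them may be needed before step 204.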

In step 204, the unlabeled corpora are labeled using this group of emotion classifiers.

Suppose that step 203 finally selects the group of emotion classifiers consisting of classifier 1 and classifier 2. Step 204 can then use classifier 1 and classifier 2 separately to label the unlabeled set T.

For example, the annotation results of step 204 for the unlabeled corpora are as follows, where "+" denotes a positive emotion tendency and "-" denotes a negative emotion tendency:

Annotation results of classifier 1: unlabeled corpus 1, -;

unlabeled corpus 2, -;

unlabeled corpus 3, +.

Annotation results of classifier 2: unlabeled corpus 1, -;

unlabeled corpus 2, +;

unlabeled corpus 3, -.

In step 205, the unlabeled corpora that this group of emotion classifiers labels identically are determined according to the annotation results.

From the annotation results for the unlabeled corpora in step 204, classifier 1 and classifier 2 both label unlabeled corpus 1 as having a negative emotion tendency "-", but their labels for unlabeled corpus 2 and unlabeled corpus 3 differ. Therefore, it can be determined that the only corpus labeled identically by this group of emotion classifiers is unlabeled corpus 1.

In step 206, the determined unlabeled corpora are extracted as credible labeled corpora.

At this point, unlabeled corpus 1 can be regarded as a credible labeled corpus, because it is labeled as having a negative emotion tendency by every classifier in a group with a large classification bias difference. In other examples, if unlabeled corpus 1 were labeled by every classifier in the group as having a positive emotion tendency, it should likewise be regarded as a credible labeled corpus. Therefore, in the present invention, a credible labeled corpus means only that every classifier in the group assigns the same label to the corpus; it does not restrict which particular class that label must be.
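Steps 205 and 206 amount to keeping the unlabeled corpora on which every classifier in the group agrees. The sketch below illustrates this under stated assumptions: the function name `extract_credible` and the dictionary layout of the annotation results are hypothetical, while the "+" and "-" labels are those of the example in step 204.

```python
from typing import Dict, List, Tuple


def extract_credible(annotations: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (corpus id, label) for each corpus labeled identically by all classifiers."""
    classifiers = list(annotations)
    corpus_ids = list(annotations[classifiers[0]])
    credible = []
    for corpus_id in corpus_ids:
        # Collect the distinct labels this corpus received from the group.
        labels = {annotations[name][corpus_id] for name in classifiers}
        if len(labels) == 1:
            credible.append((corpus_id, labels.pop()))
    return credible


# Annotation results from the example in step 204.
annotations = {
    "classifier 1": {"corpus 1": "-", "corpus 2": "-", "corpus 3": "+"},
    "classifier 2": {"corpus 1": "-", "corpus 2": "+", "corpus 3": "-"},
}

credible = extract_credible(annotations)
```

The check is on agreement only, so a corpus labeled "+" by every classifier would be extracted in exactly the same way.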

In addition, extracting unlabeled corpus 1 as a credible labeled corpus means that unlabeled corpus 1 is removed from the unlabeled set T. The unlabeled set T is then as follows:

T = {“The phone screen is too small”,

“I like the appearance of the product”

}

In addition, it should be noted that steps 205 and 206 can be replaced by other ways of extracting credible labeled corpora. In another embodiment of the present invention, the probability that an unlabeled corpus is labeled identically by the group of emotion classifiers can be determined according to the annotation results, and if this probability is greater than a predetermined confidence level, the unlabeled corpus is extracted as a credible labeled corpus. For example, when the group of emotion classifiers comprises four classifiers, if three classifiers label the same unlabeled corpus as a certain class and the fourth classifier labels it as a different class, then the probability that this unlabeled corpus is labeled identically by the group is 3/4, i.e. 0.75. Supposing the predetermined confidence level is 0.7, since 0.75 > 0.7, this unlabeled corpus can be extracted as a credible labeled corpus.
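The probability-based alternative can be sketched as below. It assumes that the "probability of being labeled identically" is the share of classifiers voting for the most common label, which matches the 3/4 example but is an interpretation; the function names `agreement` and `credible_label` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def agreement(labels: List[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of classifiers that chose it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def credible_label(labels: List[str], confidence: float) -> Optional[str]:
    """Return the majority label if its share exceeds the confidence level, else None."""
    label, probability = agreement(labels)
    return label if probability > confidence else None


# Four classifiers: three assign one class, the fourth assigns another.
votes = ["+", "+", "+", "-"]
majority, probability = agreement(votes)
accepted = credible_label(votes, confidence=0.7)
```

With a group of only two classifiers, any confidence level of 0.5 or more reduces this rule to the strict agreement of steps 205 and 206.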

In step 207, it is judged whether the number of credible labeled corpora equals 0.

In this step, the number of credible labeled corpora obtained in step 206 is first determined. If this number equals 0, there is no credible labeled corpus, and the flow ends. If this number is not 0, at least one credible labeled corpus was extracted from the unlabeled set in step 206, and the flow proceeds to step 208.

In the present embodiment, because step 206 extracted a credible labeled corpus, namely unlabeled corpus 1, the flow continues from step 207 to step 208.

In step 208, the labeled set is updated using the credible labeled corpora.

In this step, the credible labeled corpora, together with the classes they were labeled with, can be added to the labeled set, so that the labeled set L is updated as follows:

L = {“positive-The screen of the mobile is perfect”,

“positive-It's speedy and space saving and inexpensive”,

“negative-The sound quality is very nice for the price, but since the player doesn't work, it's essentially useless”,

“negative-They just suck and have a high failure rate”,

“negative-the product is too bad”

}

Here, the newly added entry, “negative-the product is too bad”, is the credible labeled corpus together with its class (underlined in the original publication).

In step 209, this group of emotion classifiers is trained using the updated labeled set.

At this point, the updated labeled set is used to train the group of emotion classifiers selected in step 203. That is, the corpora in the labeled set L obtained in step 208 are used to train classifier 1 and classifier 2, rather than all the classifiers in the initial emotion classifier set used in step 201.

A classifier can be trained on corpora using a variety of methods, for example a training method based on naive Bayes, a training method based on a maximum entropy model, a training method based on SVM classification, a training method based on a CRF (conditional random field) model, and so on.
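Steps 204 to 209 can be put together as a loop, sketched below under several assumptions: the flow is assumed to return to step 204 after step 209 (the actual return point is set by Fig. 2, which is not reproduced here), `max_rounds` is an added safety limit, and `KeywordClassifier` is a toy stand-in with a built-in default bias, not one of the training methods listed above. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


class KeywordClassifier:
    """Toy classifier: sums per-word scores and falls back to a biased default label."""

    def __init__(self, default: str) -> None:
        self.default = default
        self.words: Dict[str, int] = {}

    def fit(self, data: List[Example]) -> None:
        self.words = {}
        for text, label in data:
            for word in text.lower().split():
                step = 1 if label == "positive" else -1
                self.words[word] = self.words.get(word, 0) + step

    def predict(self, text: str) -> str:
        score = sum(self.words.get(w, 0) for w in text.lower().split())
        if score == 0:
            return self.default
        return "positive" if score > 0 else "negative"


def co_train(group, labeled: List[Example], unlabeled: List[str], max_rounds: int = 10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        credible, remaining = [], []
        for text in unlabeled:                       # step 204: label with the group
            labels = {clf.predict(text) for clf in group}
            if len(labels) == 1:                     # steps 205-206: full agreement
                credible.append((text, labels.pop()))
            else:
                remaining.append(text)
        if not credible:                             # step 207: nothing credible, stop
            break
        labeled.extend(credible)                     # step 208: update the labeled set
        unlabeled = remaining
        for clf in group:                            # step 209: retrain the group only
            clf.fit(labeled)
    return labeled, unlabeled


seed = [("screen is perfect", "positive"), ("product is bad", "negative")]
group = [KeywordClassifier("positive"), KeywordClassifier("negative")]
for clf in group:
    clf.fit(seed)

new_labeled, still_unlabeled = co_train(group, seed, ["the product is too bad", "nice appearance"])
```

The two toy classifiers disagree whenever they have no evidence, which mimics a group with a large classification bias difference: only texts that both label the same way are promoted into the labeled set.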

In addition, whether an emotion classifier is corpus-based or dictionary-based, its precision depends to a large extent on the quality and quantity of its training corpora or internal dictionary, so efficiently obtaining a large amount of training corpora plays an important role in improving the overall precision of emotion classification. Therefore, in another embodiment, in addition to the steps of the method for optimizing emotion classifiers shown in Fig. 2, a step of obtaining the labeled set in advance can also be included. This step can be implemented, for example, in the following manner: judging whether a network resource is a resource related to emotion classification; extracting positive evaluation information and negative evaluation information from the resources related to emotion classification; and filtering the positive evaluation information and negative evaluation information based on statistical information or predetermined rules, thereby obtaining corpora and their classes. Through this step of obtaining the labeled set in advance, a large number of well-labeled corpora can be extracted semi-automatically from the existing Internet, particularly from B2C websites, to optimize the precision of emotion analysis. In addition, collecting a large number of labeled corpora in advance alleviates the burden of manual labeling and can further improve the classification precision of the present invention.

Fig. 3 is a block diagram of equipment 300 for optimizing emotion classifiers according to one embodiment of the present invention.

Equipment 300 can comprise: a selection device 310, an annotation device 320, an extraction device 330, an updating device 340 and a training device 350.

Selection device 310 is used to select, based on the labeled set, a group of emotion classifiers with a large classification bias difference from the emotion classifier set. In one embodiment, selection device 310 can comprise: a classification unit, used to label the labeled corpora in the labeled set using the emotion classifiers in the emotion classifier set; a classification bias difference determining unit, used to determine the classification bias differences between the emotion classifiers according to the annotation results for the labeled corpora; and an emotion classifier selection unit, used to select a group of emotion classifiers according to a predetermined policy based on the classification bias differences.

In one implementation, the classification bias difference determining unit can comprise: means for calculating the classification similarity between the emotion classifiers according to the annotation results for the labeled corpora; and means for obtaining the classification bias differences between the emotion classifiers based on the classification similarity. For example, the means for calculating the classification similarity between the emotion classifiers according to the annotation results for the labeled corpora can comprise: means for counting, in the annotation results for the labeled corpora, the number of corpora without classification conflict, where different emotion classifiers assign the same annotation result to a corpus without classification conflict; and means for determining the classification similarity based on the number of corpora without classification conflict and the total number of corpora in the labeled set. As another example, the means for calculating the classification similarity between the emotion classifiers according to the annotation results for the labeled corpora can comprise: means for calculating the classification similarity through one of the following similarity calculation methods: cosine similarity, Dice coefficient, chi-square, log-likelihood, and an F1-measure-like method.
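The count-based similarity can be illustrated with a short sketch. The similarity is computed as the number of corpora without classification conflict divided by the total number of labeled corpora, as described above. How the bias difference is derived from the similarity is not restated in this passage, so the rule `1 - similarity` used here is an assumption, as are the function names and the sample labels.

```python
from typing import Sequence


def agreement_similarity(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of labeled corpora to which both classifiers assign the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both classifiers must label the same non-empty set")
    conflict_free = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return conflict_free / len(labels_a)


def bias_difference(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Assumed rule: the bias difference is the complement of the similarity."""
    return 1.0 - agreement_similarity(labels_a, labels_b)


# Hypothetical annotation results of two classifiers on five labeled corpora.
labels_a = ["+", "+", "-", "-", "+"]
labels_b = ["+", "-", "+", "-", "-"]

similarity = agreement_similarity(labels_a, labels_b)
difference = bias_difference(labels_a, labels_b)
```

Any of the other listed similarity measures could be substituted for `agreement_similarity` without changing the rest of the selection step.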

In one implementation, the emotion classifier selection unit can comprise: means for sorting the classification bias differences; and means for selecting the group of emotion classifiers corresponding to the largest classification bias difference.

In another implementation, the emotion classifier selection unit can comprise: means for determining the classification bias differences greater than a predetermined threshold; and means for selecting a group of emotion classifiers corresponding to a determined classification bias difference.

Annotation device 320 is used to label the unlabeled corpora using this group of emotion classifiers.

Extraction device 330 is used to extract credible labeled corpora from the unlabeled corpora according to the annotation results.

In one embodiment, extraction device 330 can comprise: means for determining, according to the annotation results, the unlabeled corpora labeled identically by this group of emotion classifiers; and means for extracting the determined unlabeled corpora as credible labeled corpora.

In another embodiment, extraction device 330 can comprise: means for determining, according to the annotation results, the probability that an unlabeled corpus is labeled identically by this group of emotion classifiers; and means for extracting the unlabeled corpus as a credible labeled corpus if the probability is greater than a predetermined confidence level.

Updating device 340 is used to update the labeled set using the credible labeled corpora.

Training device 350 is used to train this group of emotion classifiers using the updated labeled set, to optimize the emotion classifiers.

In addition, in one embodiment, the equipment for optimizing emotion classifiers of the present invention can also comprise an obtaining device, used to obtain the labeled set in advance, where the labeled set comprises labeled corpora and their classes.

In summary, the method and equipment for optimizing emotion classifiers of the present invention can eliminate the classification bias problem of emotion classification while significantly improving emotion classification precision. Specifically, because the present invention can automatically extract a group of classifiers with symmetric classification biases according to the training corpora, the relevant emotion dictionaries and the test corpora, eliminating the classification bias becomes possible. The present invention uses a co-training framework to cyclically use the group of classifiers with symmetric classification biases and to revise each classifier in the group separately, so that the classification bias can be repaired and the classification precision improved through co-training and the credible document set. The method and equipment of the present invention not only greatly reduce the labor cost of labeling training corpora, but also make it possible to use large-scale training corpora, further improving the classification precision.

The method of the present invention can be implemented in software, hardware, or a combination of software and hardware. The hardware part can be implemented using dedicated logic; the software part can be stored in memory and executed by a suitable instruction execution system, for example a microprocessor, a personal computer (PC) or a mainframe.

It should be noted that, to make the present invention easier to understand, the above description omits some more specific technical details that are well known to those skilled in the art and may be essential for implementing the present invention.

The description of the present invention is provided for the purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the disclosed form. Many modifications and changes will be apparent to those of ordinary skill in the art.

Therefore, the embodiments were chosen and described in order to better explain the principles of the present invention and its practical application, and to enable those of ordinary skill in the art to understand that, without departing from the essence of the present invention, all modifications and changes fall within the protection scope of the present invention as defined by the claims.

Claims

1. A method for optimizing emotion classifiers, comprising:

selecting, based on a labeled set, a group of emotion classifiers with a large classification bias difference from an emotion classifier set;

labeling unlabeled corpora using said group of emotion classifiers;

extracting credible labeled corpora from the unlabeled corpora according to the annotation results;

updating the labeled set using said credible labeled corpora; and

training said group of emotion classifiers using the updated labeled set, to optimize the emotion classifiers.

2. The method according to claim 1, wherein selecting, based on the labeled set, a group of emotion classifiers with a large classification bias difference from the emotion classifier set comprises:

labeling the labeled corpora in said labeled set using the emotion classifiers in said emotion classifier set;

determining the classification bias differences between said emotion classifiers according to the annotation results for said labeled corpora;

selecting a group of emotion classifiers according to a predetermined policy based on said classification bias differences.

3. The method according to claim 2, wherein determining the classification bias differences between said emotion classifiers according to the annotation results for said labeled corpora comprises:

calculating the classification similarity between said emotion classifiers according to the annotation results for said labeled corpora;

obtaining the classification bias differences between said emotion classifiers based on said classification similarity.

4. The method according to claim 3, wherein calculating the classification similarity between said emotion classifiers according to the annotation results for said labeled corpora comprises:

counting, in the annotation results for said labeled corpora, the number of corpora without classification conflict, wherein different emotion classifiers assign the same annotation result to a corpus without classification conflict;

determining said classification similarity based on the number of said corpora without classification conflict and the total number of corpora in said labeled set.

5. The method according to claim 3, wherein calculating the classification similarity between said emotion classifiers according to the annotation results for said labeled corpora comprises:

calculating said classification similarity through one of the following similarity calculation methods: cosine similarity, Dice coefficient, chi-square, log-likelihood, and an F1-measure-like method.

6. The method according to claim 2, wherein selecting a group of emotion classifiers according to a predetermined policy based on said classification bias differences comprises:

sorting said classification bias differences;

selecting the group of emotion classifiers corresponding to the largest classification bias difference.

7. The method according to claim 2, wherein selecting a group of emotion classifiers according to a predetermined policy based on said classification bias differences comprises:

determining the classification bias differences greater than a predetermined threshold;

selecting a group of emotion classifiers corresponding to a determined classification bias difference.

8. The method according to claim 1, wherein extracting credible labeled corpora from the unlabeled corpora according to the annotation results comprises:

determining, according to said annotation results, the unlabeled corpora labeled identically by said group of emotion classifiers;

extracting the determined unlabeled corpora as credible labeled corpora.

9. The method according to claim 1, wherein extracting credible labeled corpora from the unlabeled corpora according to the annotation results comprises:

determining, according to said annotation results, the probability that an unlabeled corpus is labeled identically by said group of emotion classifiers;

if said probability is greater than a predetermined confidence level, extracting said unlabeled corpus as a credible labeled corpus.

10. The method according to claim 1, further comprising:

obtaining the labeled set in advance, wherein said labeled set comprises labeled corpora and their classes.

11. Equipment for optimizing emotion classifiers, comprising:

a selection device, used to select, based on a labeled set, a group of emotion classifiers with a large classification bias difference from an emotion classifier set;

an annotation device, used to label unlabeled corpora using said group of emotion classifiers;

an extraction device, used to extract credible labeled corpora from the unlabeled corpora according to the annotation results;

an updating device, used to update the labeled set using said credible labeled corpora; and

a training device, used to train said group of emotion classifiers using the updated labeled set, to optimize the emotion classifiers.

12. The equipment according to claim 11, wherein said selection device comprises:

a classification unit, used to label the labeled corpora in said labeled set using the emotion classifiers in said emotion classifier set;

a classification bias difference determining unit, used to determine the classification bias differences between said emotion classifiers according to the annotation results for said labeled corpora;

an emotion classifier selection unit, used to select a group of emotion classifiers according to a predetermined policy based on said classification bias differences.

13. The equipment according to claim 12, wherein said classification bias difference determining unit comprises:

means for calculating the classification similarity between said emotion classifiers according to the annotation results for said labeled corpora;

means for obtaining the classification bias differences between said emotion classifiers based on said classification similarity.

14. The equipment according to claim 13, wherein the means for calculating the classification similarity between said emotion classifiers according to the annotation results for said labeled corpora comprises:

means for counting, in the annotation results for said labeled corpora, the number of corpora without classification conflict, wherein different emotion classifiers assign the same annotation result to a corpus without classification conflict;

means for determining said classification similarity based on the number of said corpora without classification conflict and the total number of corpora in said labeled set.

15. The equipment according to claim 13, wherein the means for calculating the classification similarity between said emotion classifiers according to the annotation results for said labeled corpora comprises:

means for calculating said classification similarity through one of the following similarity calculation methods: cosine similarity, Dice coefficient, chi-square, log-likelihood, and an F1-measure-like method.

16. The equipment according to claim 12, wherein said emotion classifier selection unit comprises:

means for sorting said classification bias differences;

means for selecting the group of emotion classifiers corresponding to the largest classification bias difference.

17. The equipment according to claim 12, wherein said emotion classifier selection unit comprises:

means for determining the classification bias differences greater than a predetermined threshold;

means for selecting a group of emotion classifiers corresponding to a determined classification bias difference.

18. The equipment according to claim 11, wherein said extraction device comprises:

means for determining, according to said annotation results, the unlabeled corpora labeled identically by said group of emotion classifiers;

means for extracting the determined unlabeled corpora as credible labeled corpora.

19. The equipment according to claim 11, wherein said extraction device comprises:

means for determining, according to said annotation results, the probability that an unlabeled corpus is labeled identically by said group of emotion classifiers;

means for extracting said unlabeled corpus as a credible labeled corpus if said probability is greater than a predetermined confidence level.

20. The equipment according to claim 11, further comprising:

an obtaining device, used to obtain the labeled set in advance, wherein said labeled set comprises labeled corpora and their classes.