CN104462409A - Cross-language emotional resource data identification method based on AdaBoost - Google Patents

Cross-language emotional resource data identification method based on AdaBoost

Info

Publication number
CN104462409A
CN104462409A (application CN201410766618.9A; granted publication CN104462409B)
Authority
CN
China
Prior art keywords
weak classifier
language
training set
adaboost
data identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410766618.9A
Other languages
Chinese (zh)
Other versions
CN104462409B (en)
Inventor
卢玲 (Lu Ling)
杨武 (Yang Wu)
刘恒洋 (Liu Hengyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology
Priority to CN201410766618.9A
Publication of CN104462409A
Application granted
Publication of CN104462409B
Status: Expired - Fee Related

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking


Abstract

The invention discloses a cross-language emotional resource data identification method based on AdaBoost. The method comprises: (1) building an emotional resource data identification model that judges the category of a raw datum d by estimating its posterior probability for each category from the prior and conditional probabilities; (2) translating the target-language training set into the source language, training emotional resource data on the joint training set with the AdaBoost-based emotional resource data identification algorithm, and constructing weak classifiers; (3) updating the training set with a sliding window, training the optimal weak classifier, and finally obtaining a classifier suited to target-language emotional resource data identification. Emotional resource data of the specific language are thereby identified.

Description

AdaBoost-based cross-language emotional resource data identification method
Technical field
The present invention relates to the field of computing, and in particular to an AdaBoost-based cross-language emotional resource data identification method.
Background technology
With the rapid development of social-network platforms such as microblogs, text sentiment classification has become a focus of text information processing. Annotated emotional resources provide the foundation for research on text sentiment recognition. At present, English-language resources include SentiWordNet and the fine-grained sentiment analysis corpus MPQA; Chinese-language resources include the HowNet sentiment dictionary and the Dalian University of Technology emotion vocabulary ontology. However, annotated corpora are unevenly distributed across languages. When a language lacks an annotated corpus, using the annotated corpus of another language to assist sentiment recognition has become a hot topic.
Cross-Lingual Sentiment Analysis (CLSA) uses the existing annotated corpus of one language to assist sentiment-polarity analysis in another. Existing CLSA techniques use bilingual dictionaries or parallel corpora to establish correspondences between the two languages and then apply monolingual techniques to the target language; alternatively, machine translation is used first to map the different languages into one, after which monolingual sentiment analysis methods are applied. Wan et al. used machine translation to inter-translate annotated English texts and unannotated Chinese texts, then applied a Co-Training algorithm for Chinese sentiment recognition. Xu Jun proposed a transfer self-learning algorithm to address machine-translation inaccuracy, iteratively training the classifier by automatically annotating high-confidence translated samples in the training set. These studies all assume different corpus backgrounds, and the CLSA strategy differs with the background of the available resources. Moreover, the emotional-resource transfer strategy is closely tied to the sentiment recognition method, so the transfer strategy cannot be studied in isolation from it.
The present invention proposes an emotional-resource transfer method based on the AdaBoost algorithm. First, the small target-language training set is translated into the source language and merged with the large source-language training set to build an initial weak classifier; the AdaBoost algorithm is then used to train multiple classifiers; cross-language sentiment recognition is achieved through the cooperation of these classifiers.
Summary of the invention
The present invention aims to solve at least the technical problems of the prior art, and in particular innovatively proposes an AdaBoost-based cross-language emotional resource data identification method.
To achieve the above object, the invention provides an AdaBoost-based cross-language emotional resource data identification method, characterized by comprising the following steps:
Step 1: build an emotional resource data identification model; estimate the posterior probability of a raw datum d for each category from the prior and conditional probabilities, and judge the category of d accordingly;
Step 2: translate the target-language training set into the source language, then train emotional resource data on the joint training set with the AdaBoost-based emotional resource data identification algorithm, constructing weak classifiers;
Step 3: update the training set with a sliding window and train the optimal weak classifier; finally obtain the classifier suited to target-language emotional resource data identification, forming the optimal classifier, and thereby identify emotional resource data of the specific language.
In the above AdaBoost-based cross-language emotional resource data identification method, preferably, step 1 comprises:
calculating the prior probability of the raw data; extracting the affective features of the raw data and calculating the conditional probability of each feature; and taking the category with the maximum posterior probability as the preliminary result of emotional resource data identification.
In the above AdaBoost-based cross-language emotional resource data identification method, preferably, step 2 comprises:
Step 2-1: construct multiple cooperating weak classifiers with the AdaBoost emotional resource data identification algorithm, continually adjusting the sample distribution to train new weak classifiers and, through repeated iteration, producing a vector containing the weight of each weak classifier;
Step 2-2: train on the source-language training set and the target-language training set with the AdaBoost emotional resource data identification algorithm.
In the above AdaBoost-based cross-language emotional resource data identification method, preferably, step 2 further comprises:
Step 2-3: initialize, setting the iteration round k = 1;
Step 2-4: establish the joint training set CR_k = R ∪ T_s:
CR_k = {d_i(y_i, w_i(k))}_{i=1}^{|T_s|+|R|}
For every d_i ∈ CR_k, y_i is its category label and w_i(k) is the weight of d_i in round k, where d_i is a raw datum. The source-language training set is denoted R; the target-language training set is denoted T, with |T| << |R|, i.e. T has far fewer samples than R; the source-language training set obtained by translating T is denoted T_s; the sentiment categories are Y = {0, 1}; the number of AdaBoost iterations is K; the weak-classifier weight vector is W;
Step 2-5: initialize the weights: for k = 1, w_i(k) = 1/(|T_s| + |R|);
Step 2-6: for k = 1 … K, train the optimal weak classifier h_k: CR_k → Y under the current weight distribution of CR_k; classify all samples of CR_k with h_k; compute the classification error:
ε_k = Σ_{i: h_k(d_i) ≠ y_i} w_i(k);
Step 2-7: if ε_k > 1/2, then { k = k − 1; break; }; otherwise compute the weight α_k of weak classifier h_k:
α_k = (1/2) × ln((1 − ε_k)/ε_k);
Step 2-8: record the weak-classifier weight: W(k) = α_k; update the weight of each sample:
w_{k+1}(i) = (w_k(i)/Z_k) × e^{−α_k} if h_k(d_i) = y_i, and (w_k(i)/Z_k) × e^{α_k} if h_k(d_i) ≠ y_i,
where e is the base of the exponential function and Z_k is a normalization factor chosen so that the updated weights sum to 1;
Step 2-9: the classification result is:
H(d_i) = sign(Σ_{k=1}^{K} α_k h_k(d_i)).
In the above AdaBoost-based cross-language emotional resource data identification method, preferably, step 3 comprises:
Step 3-1: update the training set with a sliding window and train the optimal weak classifier through repeated iteration;
Step 3-2: to train the optimal weak classifier, sort the joint training set CR_k by sample weight in descending order. Let h_k be the classifier of round k; label CR_k with h_k, with cnum correctly classified samples and enum misclassified samples. Let scale be the sliding-window size, step the window sliding stride, TN_k the set of samples in the window, H the set of candidate weak classifiers, and CN_r the training set used to generate the optimal weak classifier h_{k+1};
Step 3-3: set pos = |CR_k| − cnum and H = {Φ}, where pos is the starting position of the sliding window;
Step 3-4: form the training set:
CN_r = {d_i(y_i, w_i(k))}_{i=|CR_k|−cnum}^{|CR_k|};
TN_r = {d_i(y_i, w_i(k))}_{i=pos−scale}^{pos};
CN_r = CN_r ∪ TN_r;
Step 3-5: train the weak classifier h_r on CN_r; classify CR_k with h_r and compute the classification error rate:
e_r = Σ_{i: h_r(d_i) ≠ y_i} w_i(k);
Step 3-6: update the candidate weak-classifier set H = H ∪ {h_r(e_r)};
pos = pos − step;
if (pos − scale) < 0 then { break; }, i.e. stop training when fewer samples remain in the data set than one sliding window;
Step 3-7: the optimal weak classifier is:
h_{k+1} = argmin_{h_r ∈ H} {e_r}, h_r: CR_k → Y.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
An AdaBoost-based cross-language emotional resource data identification method is proposed. The target-language training set is first translated into the source language; the AdaBoost algorithm is then applied on the joint training set; the training set is updated with a sliding window to train the optimal weak classifier; finally a classifier suited to target-language sentiment recognition is obtained. The AdaBoost-based classification strategy of the invention achieves accuracy and recall superior to the baseline, demonstrating the validity of the method.
Additional aspects and advantages of the invention will be set forth in part in the following description, will in part become apparent from it, or will be learned through practice of the invention.
Accompanying drawing explanation
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken together with the accompanying drawings, in which:
Fig. 1 is a schematic comparison of the classification performance of UNB-A and UNB-B according to the invention;
Fig. 2 is a schematic diagram of the accuracy and recall of the invention over successive AdaBoost iterations;
Fig. 3 is a flowchart of the AdaBoost-based cross-language emotional resource data identification method of the invention.
Embodiment
Embodiments of the invention are described in detail below; examples of the embodiments are shown in the drawings, in which identical or similar reference numbers throughout denote identical or similar elements, or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended only to explain the invention, and are not to be construed as limiting it.
In the description of the invention, it should be understood that terms indicating orientation or positional relationships, such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer", are based on the orientations or positional relationships shown in the drawings, serve only to simplify the description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the invention.
In the description of the invention, unless otherwise specified and limited, it should be noted that terms such as "mounted", "connected" and "coupled" are to be interpreted broadly: for example, a connection may be mechanical or electrical, may be internal to two elements, may be direct, or may be indirect through an intermediary; for persons of ordinary skill in the art, the specific meanings of these terms can be understood according to the circumstances.
Emotional resource transfer refers to moving emotional resources from a source language to a target language, i.e. using a large annotated source-language sample set together with a small annotated target-language sample set to perform positive/negative polarity recognition on target-language text. In the research of the present invention, the resource used is a source-language sentiment dictionary. For cross-language emotional resource transfer, the invention mainly solves three problems: (1) establishing the emotional resource data identification model; (2) choosing an applicable machine-translation scheme; (3) designing the AdaBoost-based multi-classifier cooperation strategy.
1. Establish the emotional resource data identification model: estimate the posterior probability of a raw datum d for category c_k from the prior and conditional probabilities, and judge the category of d accordingly. Formula (1) is the classification formula of the multinomial Naive Bayes model:
c_NB = argmax_{c_k ∈ C} P(c_k) · Π_{i=1}^{n} P(w_i | c_k)^{wt(w_i)}   (1)
where c_NB is the category chosen by the multinomial Naive Bayes model and n, a positive integer, is the number of features; P(c_k) is the prior probability of each category of the category set C in the training set D; w_i is the i-th feature of the raw datum d; and wt(w_i) is the weight of feature w_i in d.
The invention performs emotional resource data identification with the multinomial Naive Bayes model, taking the annotated source-language and target-language samples as the training set. First the prior probabilities of the texts are computed; then the affective features of each text are extracted and the conditional probability of each feature is computed; finally the category with the maximum posterior probability is taken as the sentiment recognition result.
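As an illustration of the model above, the following is a minimal sketch of multinomial Naive Bayes classification by maximum posterior probability. It is not the patent's implementation: the function names are invented for this sketch, and raw term frequency is assumed as the feature weight wt(w_i) of formula (1).

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    # Log-priors P(c_k) and Laplace-smoothed log-conditionals P(w_i|c_k):
    # the two quantities that formula (1) combines.
    classes = sorted(set(labels))
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    vocab = sorted({w for c in classes for w in counts[c]})
    log_cond = {}
    for c in classes:
        total = sum(counts[c].values())
        log_cond[c] = {w: math.log((counts[c][w] + alpha) / (total + alpha * len(vocab)))
                       for w in vocab}
    return log_prior, log_cond

def classify_mnb(doc, log_prior, log_cond):
    # Maximum-posterior decision in log space; term frequency stands in
    # for the feature weight wt(w_i).
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            tf * log_cond[c][w] for w, tf in Counter(doc).items() if w in log_cond[c])
    return max(scores, key=scores.get)
```

For example, a model trained on a handful of labelled reviews classifies a new review by whichever class yields the larger posterior score.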
2. When the only existing resource is a source-language sentiment dictionary, translating the source-language training set and the sentiment dictionary simultaneously would, after de-duplication of the translations, sharply reduce the dictionary's dimensionality and lower the recall of emotional-resource features. The invention therefore translates the target language into the source language and then builds the training set in the source language.
3. In the cross-language sentiment recognition problem, because the target-language training set is very small, a sentiment classifier built on the joint source/target training set is weak. The invention therefore constructs multiple cooperating weak classifiers. AdaBoost is an algorithmic framework for combining weak classifiers into a strong classifier: it trains new weak classifiers by continually adjusting the sample distribution and, through repeated iteration, produces a vector of weak-classifier weights, each weight expressing that classifier's importance in classification.
The source-language training set is denoted R; the target-language training set is denoted T, with |T| << |R|, i.e. T has far fewer samples than R. The source-language training set obtained by translating T is denoted T_s; the sentiment categories are Y = {0, 1}; the number of AdaBoost iterations is K; the weak-classifier weight vector is W. The AdaBoost-based cross-language emotional resource data identification algorithm is described as Algorithm 1:
Algorithm 1:
(1) Initialize: set the iteration round k = 1;
(2) Establish the joint training set: CR_k = R ∪ T_s, as in formula (2):
CR_k = {d_i(y_i, w_i(k))}_{i=1}^{|T_s|+|R|}   (2)
For every d_i ∈ CR_k, y_i is its category label and w_i(k) is the weight of d_i in round k, where d_i is a raw datum;
(3) Initialize the weights: for k = 1, w_i(k) = 1/(|T_s| + |R|);
(4) for (k = 1 … K):
① train the optimal weak classifier h_k: CR_k → Y under the current weight distribution of CR_k;
② classify all samples of CR_k with h_k and compute the classification error ε_k, as in formula (3):
ε_k = Σ_{i: h_k(d_i) ≠ y_i} w_i(k)   (3)
③ if ε_k > 1/2, then { k = k − 1; break; }
④ compute the weight α_k of weak classifier h_k, as in formula (4):
α_k = (1/2) × ln((1 − ε_k)/ε_k)   (4)
⑤ record the weak-classifier weight: W(k) = α_k;
⑥ update the weight of each sample, as in formula (5):
w_{k+1}(i) = (w_k(i)/Z_k) × e^{−α_k} if h_k(d_i) = y_i; (w_k(i)/Z_k) × e^{α_k} if h_k(d_i) ≠ y_i   (5)
where Z_k is a normalization factor chosen so that the updated weights sum to 1;
(5) the final classification result is given by formula (6):
H(d_i) = sign(Σ_{k=1}^{K} α_k h_k(d_i))   (6)
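Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the patent's code: the weak learner here is a one-feature decision stump instead of the Naive Bayes classifier of the invention, and labels are taken in {−1, +1} so that the sign rule of formula (6) applies directly.

```python
import math

def stump_predict(stump, x):
    j, thr, pol = stump
    return pol if x[j] >= thr else -pol

def train_stump(X, y, w):
    # Exhaustive search for the stump with the lowest weighted error:
    # a stand-in for "train the optimal weak classifier under the
    # current weight distribution".
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for thr in {x[j] for x in X}:
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict((j, thr, pol), xi) != yi)
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best, best_err

def adaboost(X, y, K=10):
    n = len(X)
    w = [1.0 / n] * n                             # uniform initial weights
    ensemble = []
    for _ in range(K):
        h, eps = train_stump(X, y, w)             # epsilon_k, formula (3)
        if eps > 0.5:                             # worse than chance: stop
            break
        eps = max(eps, 1e-10)                     # guard log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)   # formula (4)
        ensemble.append((alpha, h))
        # formula (5): shrink correct samples, grow misclassified ones
        w = [wi * math.exp(-alpha * yi * stump_predict(h, xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)                                # Z_k: renormalise to sum to 1
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    # formula (6): sign of the alpha-weighted vote
    s = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if s >= 0 else -1
```

With the {0, 1} labels of the patent, the same scheme applies after mapping 0 to −1 before the sign-based combination.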
Optimal weak classifier based on a sliding window
For the optimal-weak-classifier training of the AdaBoost algorithm, the training set is updated with a sliding window, and the optimal weak classifier is trained through repeated iteration.
Let h_k be the classifier of round k; label CR_k with h_k, with cnum correctly classified samples and enum misclassified samples. Let scale be the sliding-window size, step the window sliding stride, TN_k the set of samples in the window, H the set of candidate weak classifiers, and CN_r the training set used to generate the optimal weak classifier h_{k+1}. The algorithm for training the optimal weak classifier h_{k+1} is described as Algorithm 2:
Algorithm 2:
(1) Sort CR_k by sample weight in descending order;
(2) set pos = |CR_k| − cnum; H = {Φ};
(3) while (true)
{
CN_r = {d_i(y_i, w_i(k))}_{i=|CR_k|−cnum}^{|CR_k|};
TN_r = {d_i(y_i, w_i(k))}_{i=pos−scale}^{pos};
CN_r = CN_r ∪ TN_r;
train the weak classifier h_r on CN_r;
classify CR_k with h_r and compute the classification error rate e_r, as in formula (7):
e_r = Σ_{i: h_r(d_i) ≠ y_i} w_i(k)   (7)
H = H ∪ {h_r(e_r)};
pos = pos − step;
if (pos − scale) < 0 then { break; }
}
(4) h_{k+1} = argmin_{h_r ∈ H} {e_r}, h_r: CR_k → Y.
Regarding the sliding-window size: if scale << cnum, the training sets are small and the similarity between CN_r and CN_{r+1} increases. On the one hand this depresses recall; on the other it reduces the differences between the candidate weak classifiers, so the optimal weak classifier is poorly discriminated. The research of the invention shows that when scale lies in [cnum/7, cnum/3], classification performance shows no significant difference; the invention sets the window size to scale = cnum/4. To increase the differences between weak classifiers, the sliding stride is set so that step ≥ scale, so that no sample is selected by two successive training windows.
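Algorithm 2 can be sketched as below. This is a hedged sketch in which `train` and `error` are caller-supplied stand-ins (`train(subset)` returns a classifier, `error(clf, samples)` its weighted error on the full round-k set), since at this point the patent trains Naive Bayes weak classifiers.

```python
def best_weak_classifier(samples, cnum, scale, step, train, error):
    # samples: (x, y, weight) triples of the joint set CR_k in round k.
    ordered = sorted(samples, key=lambda s: s[2], reverse=True)  # weight, descending
    base = ordered[len(ordered) - cnum:]     # CN_r: the cnum lowest-weight samples
    candidates = []                          # H: candidate classifiers with errors
    pos = len(ordered) - cnum                # window starts above the correct block
    while pos - scale >= 0:                  # stop when < one window remains
        window = ordered[pos - scale:pos]    # TN_r: one window of high-weight samples
        clf = train(base + window)           # h_r trained on CN_r ∪ TN_r
        candidates.append((error(clf, samples), clf))   # e_r, formula (7)
        pos -= step
    return min(candidates, key=lambda c: c[0])[1]       # h_{k+1} = argmin e_r
```

Each candidate sees the same low-weight base plus one window of hard samples; choosing the candidate with the lowest weighted error on the whole set realises step (4) of Algorithm 2.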
Experimental results:
The experimental corpora come from the NLP&CC2013 Chinese microblog evaluation task. The source language is English and the target language is Chinese. The source-language training set is the review corpus of the MUSIC domain, 4,000 review texts in total, 2,000 positive and 2,000 negative. The target-language training set is also MUSIC-domain review text, 40 texts in total, 20 positive and 20 negative (the numbers of source- and target-language training samples are consistent with the NLP&CC2013 evaluation standard). The annotated result set of the NLP&CC2013 evaluation (MUSIC domain) serves as the test set: 814 target-language samples, of which 412 are positive-class and 402 negative-class. The corpus distribution is shown in Table 1.
Table 1: experimental corpora
The MPQA dictionary is adopted, containing 2,789 positive words and 6,079 negative words. Microsoft Translator is chosen as the machine-translation tool. Accuracy and recall are the evaluation indices of the experimental results.
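The per-class evaluation indices can be computed as in the minimal sketch below. It assumes that the per-class "accuracy" of the experiments corresponds to precision in the usual precision/recall/F terminology, and that the F value reported later is the balanced F1; both are interpretations, not statements from the patent.

```python
def prf(y_true, y_pred, positive):
    # Per-class precision, recall and F1 from the confusion counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Running it once per class (positive, then negative) reproduces the kind of per-class figures reported in Table 2.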
Comparison of machine-translation strategies
Four groups of test samples are drawn at random from the test set, 500 samples per group (the groups may intersect). Method ①: translate the target-language training set into the source language and build classifier UNB-A on the joint training set. Method ②: translate the source-language training set and the MPQA dictionary into the target language and build classifier UNB-B on the joint training set. The performance of the two classifiers is shown in Fig. 1, where accuracy and recall are averaged over the four groups.
Fig. 1 compares the classification performance of UNB-A and UNB-B.
The recall of UNB-B is clearly lower than that of UNB-A, mainly because of machine-translation inaccuracy and differences between Chinese and Western modes of expression. For example:
Original: Bread's music was more interesting than this!
Translation (rendered back into English): Bread's music is more interesting than this!
Original: Antony has a very interesting voice.
Translation (rendered back into English): Antony has an absorbing voice.
In the MPQA dictionary, "interesting" has a single Chinese translation, whereas the two machine-translated texts render "interesting" as two different Chinese words; neither translated form matches the dictionary entry, so neither affective feature can be extracted from the Chinese texts, which lowers the classification recall.
Emotional resource transfer results
To verify the validity of the AdaBoost-based method, two classifiers serve as baselines, and a comparative experiment is run on all 814 samples of the test set: ① the classifier SNB obtained on the source-language training set; ② the classifier UNB trained on the joint source/target training set; ③ the classifier AdaNB obtained with the method of the invention. AdaNB is run for 1-20 iterations; with sliding window scale = cnum/4 and step = scale, the classification performance is shown in Fig. 2 (accuracy and recall over successive AdaBoost iterations).
The overall performance of AdaNB rises and then stabilizes once the iteration count reaches 7, and approaches its optimum at 15 iterations. With further iterations, however, performance does not keep improving: positive-class accuracy and negative-class recall decline slightly and then hold at a stable level. The likely reasons are that each weak classifier of the invention selects only part of the training set, so the training sets are small, each weak classifier achieves only locally good performance and recall is somewhat low; and as iterations increase, the classifier begins to overfit. Table 2 gives the accuracy and recall of each classifier, with AdaNB at 15 iterations.
Table 2: classification results of the classifiers
The positive-class F value of AdaNB reaches 0.749789, 0.11 better than SNB and 0.10 better than UNB; the negative-class F value reaches 0.693136, 0.06 better than SNB and 0.02 better than UNB. Overall, within a certain number of iterations AdaNB outperforms the baselines and achieves relatively balanced accuracy and recall. In the NLP&CC2013 evaluation, the sentiment recognition accuracy on MUSIC-domain reviews was at most 0.76 and at least 0.50. The experiment of the invention uses the result set of that evaluation as the test set, i.e. the same test samples; the average accuracy over the positive and negative classes reaches 0.71, above the average level of the evaluation, which also shows the validity of the method.
Beneficial effects of the invention: on the basis of machine translation, multiple weak classifiers are combined into a strong classifier by weighting. In each AdaBoost iteration, the proposed sliding-window update of the training set is used to train the optimal weak classifier. Experiments show that the adopted machine-translation scheme is feasible; the AdaBoost-based cross-language sentiment classification outperforms the baselines, and on the MUSIC-domain reviews of the NLP&CC2013 evaluation the average accuracy of the method reaches 0.69, indicating its validity.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example, and the particular features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, persons of ordinary skill in the art will appreciate that various changes, amendments, substitutions and variations may be made to these embodiments without departing from the principle and purpose of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (5)

1. An AdaBoost-based cross-language emotional resource data identification method, characterized by comprising the following steps:
Step 1: build an emotional resource data identification model; estimate the posterior probability of a raw datum d for each category from the prior and conditional probabilities, and judge the category of d accordingly;
Step 2: translate the target-language training set into the source language, then train emotional resource data on the joint training set with the AdaBoost-based emotional resource data identification algorithm, constructing weak classifiers;
Step 3: update the training set with a sliding window and train the optimal weak classifier; finally obtain the classifier suited to target-language emotional resource data identification, forming the optimal classifier, and thereby identify emotional resource data of the specific language.
2. The AdaBoost-based cross-language emotional resource data identification method according to claim 1, characterized in that step 1 comprises:
calculating the prior probability of the raw data; extracting the affective features of the raw data and calculating the conditional probability of each feature; and taking the category with the maximum posterior probability as the preliminary result of emotional resource data identification.
3. The AdaBoost-based cross-language emotional resource data identification method according to claim 1, characterized in that step 2 comprises:
Step 2-1: construct multiple cooperating weak classifiers with the AdaBoost emotional resource data identification algorithm, continually adjusting the sample distribution to train new weak classifiers and, through repeated iteration, producing a vector containing the weight of each weak classifier;
Step 2-2: train on the source-language training set and the target-language training set with the AdaBoost emotional resource data identification algorithm.
4. The cross-language emotional resource data identification method based on AdaBoost according to claim 1, characterized in that said step 2 further comprises:
Step 2-3: initialize by setting the iteration round k = 1;
Step 2-4: build the joint training set CR_k = R ∪ T_s, as follows:
CR_k = {d_i(y_i, w_i(k))}, i = 1, …, |T_s| + |R|; for every d_i ∈ CR_k, y_i is its class label and w_i(k) is the weight of d_i in the k-th iteration round, where d_i is a raw datum; the source-language training set is denoted R; the target-language training set is denoted T, with |T| << |R|, i.e. T has far fewer samples than R; the source-language training set obtained by translating T is denoted T_s; the emotion classes are denoted Y = {0, 1}; the number of AdaBoost iterations is denoted K; and the weak-classifier weight vector is denoted W;
Step 2-5: initialize the weights: for k = 1, w_i(k) = 1/(|T_s| + |R|);
Step 2-6: for k = 1, …, K, train the optimal weak classifier h_k: CR_k → Y under the current weight distribution of CR_k; classify all samples of CR_k with h_k; compute the classification error ε_k as follows:
ε_k = Σ_{i: h_k(d_i) ≠ y_i} w_i(k);
Step 2-7: if ε_k > 1/2, then {k = k − 1; break;}; otherwise compute the weight α_k of the weak classifier h_k as follows:
α_k = (1/2) × ln((1 − ε_k)/ε_k);
Step 2-8: record the weak-classifier weight by setting W(k) = α_k, and update the weight of each sample as follows:
w_i(k+1) = w_i(k) × exp(−α_k × y_i × h_k(d_i)) / Z_k,
where exp is the exponential function with base e, Z_k is the normalization factor that makes the updated weights sum to 1, and the class labels are mapped to ±1;
Step 2-9: the classification result is as follows:
H(d_i) = sign(Σ_{k=1}^{K} α_k × h_k(d_i)).
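Steps 2-3 to 2-9 describe a standard AdaBoost round: uniform initial weights, a weighted-error check against 1/2, the weak-classifier weight α_k, a multiplicative sample-weight update with normalization, and a sign-of-weighted-vote combination. A minimal Python sketch, assuming ±1 class labels and a caller-supplied weak learner (both assumptions for illustration, not fixed by the claim):

```python
import math

def adaboost(samples, labels, train_weak, K):
    """AdaBoost training loop (steps 2-3 to 2-9).
    labels must be +/-1 so the sign() combination applies;
    train_weak(samples, labels, weights) returns a callable h: sample -> +/-1."""
    n = len(samples)
    w = [1.0 / n] * n                       # step 2-5: uniform initial weights
    hs, alphas = [], []
    for k in range(K):                      # step 2-6: iterate up to K rounds
        h = train_weak(samples, labels, w)
        err = sum(wi for wi, x, y in zip(w, samples, labels) if h(x) != y)
        if err > 0.5:                       # step 2-7: stop on worse-than-chance rounds
            break
        err = max(err, 1e-10)               # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        hs.append(h)
        alphas.append(alpha)                # step 2-8: record weak-classifier weight
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, samples, labels)]
        z = sum(w)                          # normalization factor Z_k
        w = [wi / z for wi in w]
    def strong(x):                          # step 2-9: sign of the weighted vote
        return 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1
    return strong
```

A usage sketch with a trivial threshold stump as the weak learner converges in one round on linearly separable 1-D data, since the stump's weighted error is already zero.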
5. The cross-language emotional resource data identification method based on AdaBoost according to claim 1, characterized in that said step 3 comprises:
Step 3-1: adopt the sliding-window method of updating the training set, training the optimal weak classifier through repeated iteration;
Step 3-2: to train the optimal weak classifier, sort the joint training set CR_k in descending order of sample weight, where the classifier of the k-th iteration round is h_k; label CR_k with h_k; the number of correctly classified samples is cnum and the number of misclassified samples is enum; the sliding-window size is scale; the window sliding step is step; the set of samples inside the window is TN_r; H is the set of candidate weak classifiers; and CN_r is the training set used to generate the optimal weak classifier h_{k+1};
Step 3-3: set pos = |CR_k| − cnum and H = {Φ}, where pos denotes the starting position of the sliding window;
Step 3-4: build the training set:
CN_r = {d_i(y_i, w_i(k))}, i = |CR_k| − cnum, …, |CR_k|,
TN_r = {d_i(y_i, w_i(k))}, i = pos − scale, …, pos,
CN_r = CN_r ∪ TN_r;
Step 3-5: train a weak classifier h_r on the training set CN_r; classify CR_k with h_r and compute the classification error rate e_r as follows:
e_r = Σ_{i: h_r(d_i) ≠ y_i} w_i(k);
Step 3-6: update the candidate weak classifier set: H = H ∪ {h_r(e_r)};
pos = pos − step;
if (pos − scale) < 0 then {break;}, i.e. stop training when the number of samples remaining is less than one sliding-window size;
Step 3-7: the optimal weak classifier is h_{k+1} = argmin_{h_r ∈ H} e_r, with h_r: CR_k → Y.
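The sliding-window procedure of steps 3-2 to 3-7 can be sketched as follows. The reading that the fixed part of CN_r holds the cnum lowest-weight (correctly classified) samples while the window slides over the high-weight region follows from the index ranges in step 3-4; that interpretation, the ±1 labels, and the caller-supplied weak learner are assumptions for illustration:

```python
def sliding_window_select(samples, labels, weights, cnum, scale, step, train_weak):
    """Slide a window over the high-weight region of the weight-sorted joint set,
    train one candidate weak classifier per window, and keep the candidate with
    the lowest weighted error on the whole set (steps 3-2 to 3-7)."""
    # step 3-2: sort the joint set by sample weight, descending
    order = sorted(range(len(samples)), key=lambda i: weights[i], reverse=True)
    n = len(order)
    fixed = order[n - cnum:]        # the cnum lowest-weight (correct) samples
    pos = n - cnum                  # step 3-3: window starts at the boundary
    candidates = []                 # H: (error, classifier) pairs to choose from
    while pos - scale >= 0:         # step 3-6: stop when a full window no longer fits
        window = order[pos - scale:pos]
        idx = window + fixed        # step 3-4: CN_r = window samples + fixed part
        h = train_weak([samples[i] for i in idx],
                       [labels[i] for i in idx],
                       [weights[i] for i in idx])
        # step 3-5: weighted error of h over the full joint set
        e = sum(weights[i] for i in range(len(samples)) if h(samples[i]) != labels[i])
        candidates.append((e, h))
        pos -= step
    # step 3-7: the optimal weak classifier minimizes the weighted error
    return min(candidates, key=lambda t: t[0])[1]
```

Training each candidate on a window of hard (high-weight) samples plus a fixed pool of easy ones, then validating on the full set, is what lets the next round's weak classifier focus on the samples the previous round misclassified.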
CN201410766618.9A 2014-12-12 2014-12-12 Cross-language emotional resource data identification method based on AdaBoost Expired - Fee Related CN104462409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410766618.9A CN104462409B (en) 2014-12-12 2014-12-12 Cross-language emotional resource data identification method based on AdaBoost


Publications (2)

Publication Number Publication Date
CN104462409A true CN104462409A (en) 2015-03-25
CN104462409B CN104462409B (en) 2017-08-25

Family

ID=52908444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410766618.9A Expired - Fee Related CN104462409B (en) 2014-12-12 2014-12-12 Cross-language emotional resource data identification method based on AdaBoost

Country Status (1)

Country Link
CN (1) CN104462409B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231702A * 2008-01-25 2008-07-30 华中科技大学 Classifier integration method
US20100217595A1 (en) * 2009-02-24 2010-08-26 Korea Institute Of Science And Technology Method For Emotion Recognition Based On Minimum Classification Error
CN103617245A (en) * 2013-11-27 2014-03-05 苏州大学 Bilingual sentiment classification method and device
CN103761311A (en) * 2014-01-23 2014-04-30 中国矿业大学 Sentiment classification method based on multi-source field instance migration


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709829A * 2015-08-03 2017-05-24 科大讯飞股份有限公司 Learning situation diagnosis method and system based on an online question bank
CN106709829B * 2015-08-03 2020-06-02 科大讯飞股份有限公司 Learning situation diagnosis method and system based on online question bank
CN105938565A * 2016-06-27 2016-09-14 西北工业大学 Color image emotion classification method based on multi-layer classifiers and Internet-image-assisted training
CN108090040B * 2016-11-23 2021-08-17 北京国双科技有限公司 Text information classification method and system
CN108090040A * 2016-11-23 2018-05-29 北京国双科技有限公司 Text information classification method and system
US11151182B2 * 2017-07-24 2021-10-19 Huawei Technologies Co., Ltd. Classification model training method and apparatus
CN107564580B * 2017-09-11 2019-02-12 合肥工业大学 Gastroscope visual-aid processing system and method based on ensemble learning
CN107564580A * 2017-09-11 2018-01-09 合肥工业大学 Gastroscope visual-aid processing system and method based on ensemble learning
CN107360200A * 2017-09-20 2017-11-17 广东工业大学 Phishing detection method based on classification confidence and website features
CN108376133A * 2018-03-21 2018-08-07 北京理工大学 Short-text sentiment classification method based on emotion-word expansion
CN110222181A * 2019-06-06 2019-09-10 福州大学 Film review sentiment analysis method based on Python
CN110222181B * 2019-06-06 2021-08-31 福州大学 Film review sentiment analysis method based on Python
CN112559685A * 2020-12-11 2021-03-26 芜湖汽车前瞻技术研究院有限公司 Spam comment identification method for automobile forums



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170825

Termination date: 20201212