CN104462409A - Cross-language emotional resource data identification method based on AdaBoost - Google Patents

Cross-language emotional resource data identification method based on AdaBoost

Info

Publication number
CN104462409A
CN104462409A (application CN201410766618.9A; granted publication CN104462409B)
Authority
CN
China
Prior art keywords
weak classifier
language
training set
adaboost
data identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410766618.9A
Other languages
Chinese (zh)
Other versions
CN104462409B (en)
Inventor
卢玲 (Lu Ling)
杨武 (Yang Wu)
刘恒洋 (Liu Hengyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology
Priority to CN201410766618.9A
Publication of CN104462409A
Application granted
Publication of CN104462409B
Status: Expired - Fee Related

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking


Abstract

The invention discloses a cross-language emotional resource data identification method based on AdaBoost. The method comprises: (1) building an emotional resource data identification model that judges the category of a raw datum d by estimating its posterior probability for each category from the prior and conditional probabilities; (2) translating the target-language training set into the source language, training emotional resource data on the joint training set with the AdaBoost-based emotional resource data identification algorithm, and constructing weak classifiers; (3) updating the training set with a sliding window, training the optimal weak classifier, and finally obtaining a classifier suited to target-language emotional resource data identification. Emotional resource data of the specific language are thereby identified.

Description

AdaBoost-based cross-language emotional resource data identification method
Technical field
The present invention relates to the field of computing, and in particular to an AdaBoost-based cross-language emotional resource data identification method.
Background technology
With the rapid development of social-network platforms such as microblogs, text sentiment classification has become a focus of text information processing. Annotated emotional resources provide the foundation for research on text sentiment recognition. At present, English-language resources include SentiWordNet and the fine-grained sentiment analysis corpus MPQA; Chinese-language resources include the HowNet sentiment dictionary and the Dalian University of Technology emotion vocabulary ontology. However, annotated corpora are unevenly distributed across languages. When a language lacks an annotated corpus, using the annotated corpus of another language to assist sentiment recognition has become a hot topic.
Cross-Lingual Sentiment Analysis (CLSA) uses the existing annotated corpus of one language to assist sentiment-polarity analysis in another. Existing CLSA techniques use bilingual dictionaries or parallel corpora to establish correspondences between the two languages and then apply monolingual techniques to the target language; alternatively, machine translation is used first to map the different languages into one, after which monolingual sentiment analysis methods are applied. Wan et al. used machine translation to inter-translate annotated English texts and unannotated Chinese texts, then applied a Co-Training algorithm for Chinese sentiment recognition. Xu Jun proposed a transfer self-learning algorithm to address machine-translation inaccuracy, iteratively training the classifier by automatically annotating high-confidence translated samples in the training set. These studies all assume different corpus backgrounds, and the CLSA strategy differs with the background of the available resources. Moreover, the emotional-resource transfer strategy is closely tied to the sentiment recognition method, so the transfer strategy cannot be studied in isolation from it.
The present invention proposes an emotional-resource transfer method based on the AdaBoost algorithm. First, the small target-language training set is translated into the source language and merged with the large source-language training set to build an initial weak classifier; the AdaBoost algorithm is then used to train multiple classifiers; cross-language sentiment recognition is achieved through the cooperation of these classifiers.
Summary of the invention
The present invention aims to solve at least the technical problems of the prior art, and in particular innovatively proposes an AdaBoost-based cross-language emotional resource data identification method.
To achieve the above object, the invention provides an AdaBoost-based cross-language emotional resource data identification method, characterized by comprising the following steps:
Step 1: build an emotional resource data identification model; estimate the posterior probability of a raw datum d for each category from the prior and conditional probabilities, and judge the category of d accordingly;
Step 2: translate the target-language training set into the source language, then train emotional resource data on the joint training set with the AdaBoost-based emotional resource data identification algorithm, constructing weak classifiers;
Step 3: update the training set with a sliding window and train the optimal weak classifier; finally obtain the classifier suited to target-language emotional resource data identification, forming the optimal classifier, and thereby identify emotional resource data of the specific language.
In the above AdaBoost-based cross-language emotional resource data identification method, preferably, step 1 comprises:
calculating the prior probability of the raw data; extracting the affective features of the raw data and calculating the conditional probability of each feature; and taking the category with the maximum posterior probability as the preliminary result of emotional resource data identification.
In the above AdaBoost-based cross-language emotional resource data identification method, preferably, step 2 comprises:
Step 2-1: construct multiple cooperating weak classifiers with the AdaBoost emotional resource data identification algorithm, continually adjusting the sample distribution to train new weak classifiers and, through repeated iteration, producing a vector containing the weight of each weak classifier;
Step 2-2: train on the source-language training set and the target-language training set with the AdaBoost emotional resource data identification algorithm.
In the above AdaBoost-based cross-language emotional resource data identification method, preferably, step 2 further comprises:
Step 2-3: initialize, setting the iteration round k = 1;
Step 2-4: establish the joint training set CR_k = R ∪ T_s:
CR_k = {d_i(y_i, w_i(k))}_{i=1}^{|T_s|+|R|}
For every d_i ∈ CR_k, y_i is its category label and w_i(k) is the weight of d_i in round k, where d_i is a raw datum. The source-language training set is denoted R; the target-language training set is denoted T, with |T| << |R|, i.e. T has far fewer samples than R; the source-language training set obtained by translating T is denoted T_s; the sentiment categories are Y = {0, 1}; the number of AdaBoost iterations is K; the weak-classifier weight vector is W;
Step 2-5: initialize the weights: for k = 1, w_i(k) = 1/(|T_s| + |R|);
Step 2-6: for k = 1 … K, train the optimal weak classifier h_k: CR_k → Y under the current weight distribution of CR_k; classify all samples of CR_k with h_k; compute the classification error:
ε_k = Σ_{i: h_k(d_i) ≠ y_i} w_i(k);
Step 2-7: if ε_k > 1/2, then { k = k − 1; break; }; otherwise compute the weight α_k of weak classifier h_k:
α_k = (1/2) × ln((1 − ε_k)/ε_k);
Step 2-8: record the weak-classifier weight: W(k) = α_k; update the weight of each sample:
w_{k+1}(i) = (w_k(i)/Z_k) × e^{−α_k} if h_k(d_i) = y_i, and (w_k(i)/Z_k) × e^{α_k} if h_k(d_i) ≠ y_i,
where e is the base of the exponential function and Z_k is a normalization factor chosen so that the updated weights sum to 1;
Step 2-9: the classification result is:
H(d_i) = sign(Σ_{k=1}^{K} α_k h_k(d_i)).
In the above AdaBoost-based cross-language emotional resource data identification method, preferably, step 3 comprises:
Step 3-1: update the training set with a sliding window and train the optimal weak classifier through repeated iteration;
Step 3-2: to train the optimal weak classifier, sort the joint training set CR_k by sample weight in descending order. Let h_k be the classifier of round k; label CR_k with h_k, with cnum correctly classified samples and enum misclassified samples. Let scale be the sliding-window size, step the window sliding stride, TN_k the set of samples in the window, H the set of candidate weak classifiers, and CN_r the training set used to generate the optimal weak classifier h_{k+1};
Step 3-3: set pos = |CR_k| − cnum and H = {Φ}, where pos is the starting position of the sliding window;
Step 3-4: form the training set:
CN_r = {d_i(y_i, w_i(k))}_{i=|CR_k|−cnum}^{|CR_k|};
TN_r = {d_i(y_i, w_i(k))}_{i=pos−scale}^{pos};
CN_r = CN_r ∪ TN_r;
Step 3-5: train the weak classifier h_r on CN_r; classify CR_k with h_r and compute the classification error rate:
e_r = Σ_{i: h_r(d_i) ≠ y_i} w_i(k);
Step 3-6: update the candidate weak-classifier set H = H ∪ {h_r(e_r)};
pos = pos − step;
if (pos − scale) < 0 then { break; }, i.e. stop training when fewer samples remain in the data set than one sliding window;
Step 3-7: the optimal weak classifier is:
h_{k+1} = argmin_{h_r ∈ H} {e_r}, h_r: CR_k → Y.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
An AdaBoost-based cross-language emotional resource data identification method is proposed. The target-language training set is first translated into the source language; the AdaBoost algorithm is then applied on the joint training set; the training set is updated with a sliding window to train the optimal weak classifier; finally a classifier suited to target-language sentiment recognition is obtained. The AdaBoost-based classification strategy of the invention achieves accuracy and recall superior to the baseline, demonstrating the validity of the method.
Additional aspects and advantages of the invention will be set forth in part in the following description, will in part become apparent from it, or will be learned through practice of the invention.
Accompanying drawing explanation
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken together with the accompanying drawings, in which:
Fig. 1 is a schematic comparison of the classification performance of UNB-A and UNB-B according to the invention;
Fig. 2 is a schematic diagram of the accuracy and recall of the invention over successive AdaBoost iterations;
Fig. 3 is a flowchart of the AdaBoost-based cross-language emotional resource data identification method of the invention.
Embodiment
Embodiments of the invention are described in detail below; examples of the embodiments are shown in the drawings, in which identical or similar reference numbers throughout denote identical or similar elements, or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended only to explain the invention, and are not to be construed as limiting it.
In the description of the invention, it should be understood that terms indicating orientation or positional relationships, such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer", are based on the orientations or positional relationships shown in the drawings, serve only to simplify the description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the invention.
In the description of the invention, unless otherwise specified and limited, it should be noted that terms such as "mounted", "connected" and "coupled" are to be interpreted broadly: for example, a connection may be mechanical or electrical, may be internal to two elements, may be direct, or may be indirect through an intermediary; for persons of ordinary skill in the art, the specific meanings of these terms can be understood according to the circumstances.
Emotional resource transfer refers to moving emotional resources from a source language to a target language, i.e. using a large annotated source-language sample set together with a small annotated target-language sample set to perform positive/negative polarity recognition on target-language text. In the research of the present invention, the resource used is a source-language sentiment dictionary. For cross-language emotional resource transfer, the invention mainly solves three problems: (1) establishing the emotional resource data identification model; (2) choosing an applicable machine-translation scheme; (3) designing the AdaBoost-based multi-classifier cooperation strategy.
1. Establish the emotional resource data identification model: estimate the posterior probability of a raw datum d for category c_k from the prior and conditional probabilities, and judge the category of d accordingly. Formula (1) is the classification formula of the multinomial Naive Bayes model:
c_NB = argmax_{c_k ∈ C} P(c_k) · Π_{i=1}^{n} P(w_i | c_k)^{wt(w_i)}   (1)
where c_NB is the category chosen by the multinomial Naive Bayes model and n, a positive integer, is the number of features; P(c_k) is the prior probability of each category of the category set C in the training set D; w_i is the i-th feature of the raw datum d; and wt(w_i) is the weight of feature w_i in d.
The invention performs emotional resource data identification with the multinomial Naive Bayes model, taking the annotated source-language and target-language samples as the training set. First the prior probabilities of the texts are computed; then the affective features of each text are extracted and the conditional probability of each feature is computed; finally the category with the maximum posterior probability is taken as the sentiment recognition result.
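As an illustration of the model above, the following is a minimal sketch of multinomial Naive Bayes classification by maximum posterior probability. It is not the patent's implementation: the function names are invented for this sketch, and raw term frequency is assumed as the feature weight wt(w_i) of formula (1).

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    # Log-priors P(c_k) and Laplace-smoothed log-conditionals P(w_i|c_k):
    # the two quantities that formula (1) combines.
    classes = sorted(set(labels))
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    vocab = sorted({w for c in classes for w in counts[c]})
    log_cond = {}
    for c in classes:
        total = sum(counts[c].values())
        log_cond[c] = {w: math.log((counts[c][w] + alpha) / (total + alpha * len(vocab)))
                       for w in vocab}
    return log_prior, log_cond

def classify_mnb(doc, log_prior, log_cond):
    # Maximum-posterior decision in log space; term frequency stands in
    # for the feature weight wt(w_i).
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            tf * log_cond[c][w] for w, tf in Counter(doc).items() if w in log_cond[c])
    return max(scores, key=scores.get)
```

For example, a model trained on a handful of labelled reviews classifies a new review by whichever class yields the larger posterior score.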
2. When the only existing resource is a source-language sentiment dictionary, translating the source-language training set and the sentiment dictionary simultaneously would, after de-duplication of the translations, sharply reduce the dictionary's dimensionality and lower the recall of emotional-resource features. The invention therefore translates the target language into the source language and then builds the training set in the source language.
3. In the cross-language sentiment recognition problem, because the target-language training set is very small, a sentiment classifier built on the joint source/target training set is weak. The invention therefore constructs multiple cooperating weak classifiers. AdaBoost is an algorithmic framework for combining weak classifiers into a strong classifier: it trains new weak classifiers by continually adjusting the sample distribution and, through repeated iteration, produces a vector of weak-classifier weights, each weight expressing that classifier's importance in classification.
The source-language training set is denoted R; the target-language training set is denoted T, with |T| << |R|, i.e. T has far fewer samples than R. The source-language training set obtained by translating T is denoted T_s; the sentiment categories are Y = {0, 1}; the number of AdaBoost iterations is K; the weak-classifier weight vector is W. The AdaBoost-based cross-language emotional resource data identification algorithm is described as Algorithm 1:
Algorithm 1:
(1) Initialize: set the iteration round k = 1;
(2) Establish the joint training set: CR_k = R ∪ T_s, as in formula (2):
CR_k = {d_i(y_i, w_i(k))}_{i=1}^{|T_s|+|R|}   (2)
For every d_i ∈ CR_k, y_i is its category label and w_i(k) is the weight of d_i in round k, where d_i is a raw datum;
(3) Initialize the weights: for k = 1, w_i(k) = 1/(|T_s| + |R|);
(4) for (k = 1 … K):
① train the optimal weak classifier h_k: CR_k → Y under the current weight distribution of CR_k;
② classify all samples of CR_k with h_k and compute the classification error ε_k, as in formula (3):
ε_k = Σ_{i: h_k(d_i) ≠ y_i} w_i(k)   (3)
③ if ε_k > 1/2, then { k = k − 1; break; }
④ compute the weight α_k of weak classifier h_k, as in formula (4):
α_k = (1/2) × ln((1 − ε_k)/ε_k)   (4)
⑤ record the weak-classifier weight: W(k) = α_k;
⑥ update the weight of each sample, as in formula (5):
w_{k+1}(i) = (w_k(i)/Z_k) × e^{−α_k} if h_k(d_i) = y_i; (w_k(i)/Z_k) × e^{α_k} if h_k(d_i) ≠ y_i   (5)
where Z_k is a normalization factor chosen so that the updated weights sum to 1;
(5) the final classification result is given by formula (6):
H(d_i) = sign(Σ_{k=1}^{K} α_k h_k(d_i))   (6)
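Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the patent's code: the weak learner here is a one-feature decision stump instead of the Naive Bayes classifier of the invention, and labels are taken in {−1, +1} so that the sign rule of formula (6) applies directly.

```python
import math

def stump_predict(stump, x):
    j, thr, pol = stump
    return pol if x[j] >= thr else -pol

def train_stump(X, y, w):
    # Exhaustive search for the stump with the lowest weighted error:
    # a stand-in for "train the optimal weak classifier under the
    # current weight distribution".
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for thr in {x[j] for x in X}:
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict((j, thr, pol), xi) != yi)
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best, best_err

def adaboost(X, y, K=10):
    n = len(X)
    w = [1.0 / n] * n                             # uniform initial weights
    ensemble = []
    for _ in range(K):
        h, eps = train_stump(X, y, w)             # epsilon_k, formula (3)
        if eps > 0.5:                             # worse than chance: stop
            break
        eps = max(eps, 1e-10)                     # guard log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)   # formula (4)
        ensemble.append((alpha, h))
        # formula (5): shrink correct samples, grow misclassified ones
        w = [wi * math.exp(-alpha * yi * stump_predict(h, xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)                                # Z_k: renormalise to sum to 1
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    # formula (6): sign of the alpha-weighted vote
    s = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if s >= 0 else -1
```

With the {0, 1} labels of the patent, the same scheme applies after mapping 0 to −1 before the sign-based combination.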
Optimal weak classifier based on a sliding window
For the optimal-weak-classifier training of the AdaBoost algorithm, the training set is updated with a sliding window, and the optimal weak classifier is trained through repeated iteration.
Let h_k be the classifier of round k; label CR_k with h_k, with cnum correctly classified samples and enum misclassified samples. Let scale be the sliding-window size, step the window sliding stride, TN_k the set of samples in the window, H the set of candidate weak classifiers, and CN_r the training set used to generate the optimal weak classifier h_{k+1}. The algorithm for training the optimal weak classifier h_{k+1} is described as Algorithm 2:
Algorithm 2:
(1) Sort CR_k by sample weight in descending order;
(2) set pos = |CR_k| − cnum; H = {Φ};
(3) while (true)
{
CN_r = {d_i(y_i, w_i(k))}_{i=|CR_k|−cnum}^{|CR_k|};
TN_r = {d_i(y_i, w_i(k))}_{i=pos−scale}^{pos};
CN_r = CN_r ∪ TN_r;
train the weak classifier h_r on CN_r;
classify CR_k with h_r and compute the classification error rate e_r, as in formula (7):
e_r = Σ_{i: h_r(d_i) ≠ y_i} w_i(k)   (7)
H = H ∪ {h_r(e_r)};
pos = pos − step;
if (pos − scale) < 0 then { break; }
}
(4) h_{k+1} = argmin_{h_r ∈ H} {e_r}, h_r: CR_k → Y.
Regarding the sliding-window size: if scale << cnum, the training sets are small and the similarity between CN_r and CN_{r+1} increases. On the one hand this depresses recall; on the other it reduces the differences between the candidate weak classifiers, so the optimal weak classifier is poorly discriminated. The research of the invention shows that when scale lies in [cnum/7, cnum/3], classification performance shows no significant difference; the invention sets the window size to scale = cnum/4. To increase the differences between weak classifiers, the sliding stride is set so that step ≥ scale, so that no sample is selected by two successive training windows.
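Algorithm 2 can be sketched as below. This is a hedged sketch in which `train` and `error` are caller-supplied stand-ins (`train(subset)` returns a classifier, `error(clf, samples)` its weighted error on the full round-k set), since at this point the patent trains Naive Bayes weak classifiers.

```python
def best_weak_classifier(samples, cnum, scale, step, train, error):
    # samples: (x, y, weight) triples of the joint set CR_k in round k.
    ordered = sorted(samples, key=lambda s: s[2], reverse=True)  # weight, descending
    base = ordered[len(ordered) - cnum:]     # CN_r: the cnum lowest-weight samples
    candidates = []                          # H: candidate classifiers with errors
    pos = len(ordered) - cnum                # window starts above the correct block
    while pos - scale >= 0:                  # stop when < one window remains
        window = ordered[pos - scale:pos]    # TN_r: one window of high-weight samples
        clf = train(base + window)           # h_r trained on CN_r ∪ TN_r
        candidates.append((error(clf, samples), clf))   # e_r, formula (7)
        pos -= step
    return min(candidates, key=lambda c: c[0])[1]       # h_{k+1} = argmin e_r
```

Each candidate sees the same low-weight base plus one window of hard samples; choosing the candidate with the lowest weighted error on the whole set realises step (4) of Algorithm 2.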
Experimental results:
The experimental corpora come from the NLP&CC2013 Chinese microblog evaluation task. The source language is English and the target language is Chinese. The source-language training set is the review corpus of the MUSIC domain, 4,000 review texts in total, 2,000 positive and 2,000 negative. The target-language training set is also MUSIC-domain review text, 40 texts in total, 20 positive and 20 negative (the numbers of source- and target-language training samples are consistent with the NLP&CC2013 evaluation standard). The annotated result set of the NLP&CC2013 evaluation (MUSIC domain) serves as the test set: 814 target-language samples, of which 412 are positive-class and 402 negative-class. The corpus distribution is shown in Table 1.
Table 1: experimental corpora
The MPQA dictionary is adopted, containing 2,789 positive words and 6,079 negative words. Microsoft Translator is chosen as the machine-translation tool. Accuracy and recall are the evaluation indices of the experimental results.
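The per-class evaluation indices can be computed as in the minimal sketch below. It assumes that the per-class "accuracy" of the experiments corresponds to precision in the usual precision/recall/F terminology, and that the F value reported later is the balanced F1; both are interpretations, not statements from the patent.

```python
def prf(y_true, y_pred, positive):
    # Per-class precision, recall and F1 from the confusion counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Running it once per class (positive, then negative) reproduces the kind of per-class figures reported in Table 2.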
Comparison of machine-translation strategies
Four groups of test samples are drawn at random from the test set, 500 samples per group (the groups may intersect). Method ①: translate the target-language training set into the source language and build classifier UNB-A on the joint training set. Method ②: translate the source-language training set and the MPQA dictionary into the target language and build classifier UNB-B on the joint training set. The performance of the two classifiers is shown in Fig. 1, where accuracy and recall are averaged over the four groups.
Fig. 1 compares the classification performance of UNB-A and UNB-B.
The recall of UNB-B is clearly lower than that of UNB-A, mainly because of machine-translation inaccuracy and differences between Chinese and Western modes of expression. For example:
Original: Bread's music was more interesting than this!
Translation (rendered back into English): Bread's music is more interesting than this!
Original: Antony has a very interesting voice.
Translation (rendered back into English): Antony has an absorbing voice.
In the MPQA dictionary, "interesting" has a single Chinese translation, whereas the two machine-translated texts render "interesting" as two different Chinese words; neither translated form matches the dictionary entry, so neither affective feature can be extracted from the Chinese texts, which lowers the classification recall.
Emotional resource transfer results
To verify the validity of the AdaBoost-based method, two classifiers serve as baselines, and a comparative experiment is run on all 814 samples of the test set: ① the classifier SNB obtained on the source-language training set; ② the classifier UNB trained on the joint source/target training set; ③ the classifier AdaNB obtained with the method of the invention. AdaNB is run for 1-20 iterations; with sliding window scale = cnum/4 and step = scale, the classification performance is shown in Fig. 2 (accuracy and recall over successive AdaBoost iterations).
The overall performance of AdaNB rises and then stabilizes once the iteration count reaches 7, and approaches its optimum at 15 iterations. With further iterations, however, performance does not keep improving: positive-class accuracy and negative-class recall decline slightly and then hold at a stable level. The likely reasons are that each weak classifier of the invention selects only part of the training set, so the training sets are small, each weak classifier achieves only locally good performance and recall is somewhat low; and as iterations increase, the classifier begins to overfit. Table 2 gives the accuracy and recall of each classifier, with AdaNB at 15 iterations.
Table 2: classification results of the classifiers
The positive-class F value of AdaNB reaches 0.749789, 0.11 better than SNB and 0.10 better than UNB; the negative-class F value reaches 0.693136, 0.06 better than SNB and 0.02 better than UNB. Overall, within a certain number of iterations AdaNB outperforms the baselines and achieves relatively balanced accuracy and recall. In the NLP&CC2013 evaluation, the sentiment recognition accuracy on MUSIC-domain reviews was at most 0.76 and at least 0.50. The experiment of the invention uses the result set of that evaluation as the test set, i.e. the same test samples; the average accuracy over the positive and negative classes reaches 0.71, above the average level of the evaluation, which also shows the validity of the method.
Beneficial effects of the invention: on the basis of machine translation, multiple weak classifiers are combined into a strong classifier by weighting. In each AdaBoost iteration, the proposed sliding-window update of the training set is used to train the optimal weak classifier. Experiments show that the adopted machine-translation scheme is feasible; the AdaBoost-based cross-language sentiment classification outperforms the baselines, and on the MUSIC-domain reviews of the NLP&CC2013 evaluation the average accuracy of the method reaches 0.69, indicating its validity.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example, and the particular features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, persons of ordinary skill in the art will appreciate that various changes, amendments, substitutions and variations may be made to these embodiments without departing from the principle and purpose of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (5)

1. An AdaBoost-based cross-language emotional resource data identification method, characterized by comprising the following steps:
Step 1: build an emotional resource data identification model; estimate the posterior probability of a raw datum d for each category from the prior and conditional probabilities, and judge the category of d accordingly;
Step 2: translate the target-language training set into the source language, then train emotional resource data on the joint training set with the AdaBoost-based emotional resource data identification algorithm, constructing weak classifiers;
Step 3: update the training set with a sliding window and train the optimal weak classifier; finally obtain the classifier suited to target-language emotional resource data identification, forming the optimal classifier, and thereby identify emotional resource data of the specific language.
2. The AdaBoost-based cross-language emotional resource data identification method according to claim 1, characterized in that step 1 comprises:
calculating the prior probability of the raw data; extracting the affective features of the raw data and calculating the conditional probability of each feature; and taking the category with the maximum posterior probability as the preliminary result of emotional resource data identification.
3. The AdaBoost-based cross-language emotional resource data identification method according to claim 1, characterized in that step 2 comprises:
Step 2-1: construct multiple cooperating weak classifiers with the AdaBoost emotional resource data identification algorithm, continually adjusting the sample distribution to train new weak classifiers and, through repeated iteration, producing a vector containing the weight of each weak classifier;
Step 2-2: train on the source-language training set and the target-language training set with the AdaBoost emotional resource data identification algorithm.
4. The cross-language emotional resource data identification method based on AdaBoost according to claim 1, characterized in that said step 2 further comprises:
Step 2-3: initialize by setting the iteration round k = 1;
Step 2-4: build the joint training set CR_k = R ∪ T_s, as follows:
CR_k = {d_i(y_i, w_i(k))}, i = 1, …, |T_s| + |R|; for every d_i ∈ CR_k, y_i is its class label and w_i(k) is the weight of d_i in the k-th iteration round, where d_i is a raw datum; the source-language training set is denoted R; the target-language training set is denoted T, with |T| << |R|, i.e. T has far fewer samples than R; the source-language training set obtained by translating T is denoted T_s; the emotion classes are denoted Y = {0, 1}; the number of AdaBoost iterations is denoted K; and the weak-classifier weight vector is denoted W;
Step 2-5: initialize the weights: for k = 1, w_i(k) = 1/(|T_s| + |R|);
Step 2-6: for k = 1, …, K, train the optimal weak classifier h_k: CR_k → Y under the current weight distribution of CR_k; classify all samples of CR_k with h_k; compute the classification error ε_k as follows:
ε_k = Σ_{i: h_k(d_i) ≠ y_i} w_i(k);
Step 2-7: if ε_k > 1/2, then {k = k − 1; break;}; otherwise compute the weight α_k of the weak classifier h_k as follows:
α_k = (1/2) × ln((1 − ε_k)/ε_k);
Step 2-8: record the weak-classifier weight by setting W(k) = α_k, and update the weight of each sample as follows:
w_i(k+1) = w_i(k) × exp(−α_k × y_i × h_k(d_i)) / Z_k,
where exp is the exponential function with base e, Z_k is the normalization factor that makes the updated weights sum to 1, and the class labels are mapped to ±1;
Step 2-9: the classification result is as follows:
H(d_i) = sign(Σ_{k=1}^{K} α_k × h_k(d_i)).
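Steps 2-3 to 2-9 describe a standard AdaBoost round: uniform initial weights, a weighted-error check against 1/2, the weak-classifier weight α_k, a multiplicative sample-weight update with normalization, and a sign-of-weighted-vote combination. A minimal Python sketch, assuming ±1 class labels and a caller-supplied weak learner (both assumptions for illustration, not fixed by the claim):

```python
import math

def adaboost(samples, labels, train_weak, K):
    """AdaBoost training loop (steps 2-3 to 2-9).
    labels must be +/-1 so the sign() combination applies;
    train_weak(samples, labels, weights) returns a callable h: sample -> +/-1."""
    n = len(samples)
    w = [1.0 / n] * n                       # step 2-5: uniform initial weights
    hs, alphas = [], []
    for k in range(K):                      # step 2-6: iterate up to K rounds
        h = train_weak(samples, labels, w)
        err = sum(wi for wi, x, y in zip(w, samples, labels) if h(x) != y)
        if err > 0.5:                       # step 2-7: stop on worse-than-chance rounds
            break
        err = max(err, 1e-10)               # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        hs.append(h)
        alphas.append(alpha)                # step 2-8: record weak-classifier weight
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, samples, labels)]
        z = sum(w)                          # normalization factor Z_k
        w = [wi / z for wi in w]
    def strong(x):                          # step 2-9: sign of the weighted vote
        return 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1
    return strong
```

A usage sketch with a trivial threshold stump as the weak learner converges in one round on linearly separable 1-D data, since the stump's weighted error is already zero.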
5. The cross-language emotional resource data identification method based on AdaBoost according to claim 1, characterized in that said step 3 comprises:
Step 3-1: adopt the sliding-window method of updating the training set, training the optimal weak classifier through repeated iteration;
Step 3-2: to train the optimal weak classifier, sort the joint training set CR_k in descending order of sample weight, where the classifier of the k-th iteration round is h_k; label CR_k with h_k; the number of correctly classified samples is cnum and the number of misclassified samples is enum; the sliding-window size is scale; the window sliding step is step; the set of samples inside the window is TN_r; H is the set of candidate weak classifiers; and CN_r is the training set used to generate the optimal weak classifier h_{k+1};
Step 3-3: set pos = |CR_k| − cnum and H = {Φ}, where pos denotes the starting position of the sliding window;
Step 3-4: build the training set:
CN_r = {d_i(y_i, w_i(k))}, i = |CR_k| − cnum, …, |CR_k|,
TN_r = {d_i(y_i, w_i(k))}, i = pos − scale, …, pos,
CN_r = CN_r ∪ TN_r;
Step 3-5: train a weak classifier h_r on the training set CN_r; classify CR_k with h_r and compute the classification error rate e_r as follows:
e_r = Σ_{i: h_r(d_i) ≠ y_i} w_i(k);
Step 3-6: update the candidate weak classifier set: H = H ∪ {h_r(e_r)};
pos = pos − step;
if (pos − scale) < 0 then {break;}, i.e. stop training when the number of samples remaining is less than one sliding-window size;
Step 3-7: the optimal weak classifier is h_{k+1} = argmin_{h_r ∈ H} e_r, with h_r: CR_k → Y.
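The sliding-window procedure of steps 3-2 to 3-7 can be sketched as follows. The reading that the fixed part of CN_r holds the cnum lowest-weight (correctly classified) samples while the window slides over the high-weight region follows from the index ranges in step 3-4; that interpretation, the ±1 labels, and the caller-supplied weak learner are assumptions for illustration:

```python
def sliding_window_select(samples, labels, weights, cnum, scale, step, train_weak):
    """Slide a window over the high-weight region of the weight-sorted joint set,
    train one candidate weak classifier per window, and keep the candidate with
    the lowest weighted error on the whole set (steps 3-2 to 3-7)."""
    # step 3-2: sort the joint set by sample weight, descending
    order = sorted(range(len(samples)), key=lambda i: weights[i], reverse=True)
    n = len(order)
    fixed = order[n - cnum:]        # the cnum lowest-weight (correct) samples
    pos = n - cnum                  # step 3-3: window starts at the boundary
    candidates = []                 # H: (error, classifier) pairs to choose from
    while pos - scale >= 0:         # step 3-6: stop when a full window no longer fits
        window = order[pos - scale:pos]
        idx = window + fixed        # step 3-4: CN_r = window samples + fixed part
        h = train_weak([samples[i] for i in idx],
                       [labels[i] for i in idx],
                       [weights[i] for i in idx])
        # step 3-5: weighted error of h over the full joint set
        e = sum(weights[i] for i in range(len(samples)) if h(samples[i]) != labels[i])
        candidates.append((e, h))
        pos -= step
    # step 3-7: the optimal weak classifier minimizes the weighted error
    return min(candidates, key=lambda t: t[0])[1]
```

Training each candidate on a window of hard (high-weight) samples plus a fixed pool of easy ones, then validating on the full set, is what lets the next round's weak classifier focus on the samples the previous round misclassified.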
CN201410766618.9A 2014-12-12 2014-12-12 Cross-language emotional resource data identification method based on AdaBoost Expired - Fee Related CN104462409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410766618.9A CN104462409B (en) 2014-12-12 2014-12-12 Cross-language emotional resource data identification method based on AdaBoost


Publications (2)

Publication Number Publication Date
CN104462409A true CN104462409A (en) 2015-03-25
CN104462409B CN104462409B (en) 2017-08-25

Family

ID=52908444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410766618.9A Expired - Fee Related CN104462409B (en) 2014-12-12 2014-12-12 Cross-language emotional resource data identification method based on AdaBoost

Country Status (1)

Country Link
CN (1) CN104462409B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231702A * 2008-01-25 2008-07-30 华中科技大学 Classifier integration method
US20100217595A1 (en) * 2009-02-24 2010-08-26 Korea Institute Of Science And Technology Method For Emotion Recognition Based On Minimum Classification Error
CN103617245A (en) * 2013-11-27 2014-03-05 苏州大学 Bilingual sentiment classification method and device
CN103761311A (en) * 2014-01-23 2014-04-30 中国矿业大学 Sentiment classification method based on multi-source field instance migration


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709829A * 2015-08-03 2017-05-24 科大讯飞股份有限公司 Learning situation diagnosis method and system based on an online question bank
CN106709829B * 2015-08-03 2020-06-02 科大讯飞股份有限公司 Learning situation diagnosis method and system based on online question bank
CN105938565A * 2016-06-27 2016-09-14 西北工业大学 Color image emotion classification method based on multi-layer classifiers and Internet-image-assisted training
CN108090040B * 2016-11-23 2021-08-17 北京国双科技有限公司 Text information classification method and system
CN108090040A * 2016-11-23 2018-05-29 北京国双科技有限公司 Text information classification method and system
US11151182B2 * 2017-07-24 2021-10-19 Huawei Technologies Co., Ltd. Classification model training method and apparatus
CN107564580B * 2017-09-11 2019-02-12 合肥工业大学 Gastroscope visual-aid processing system and method based on ensemble learning
CN107564580A * 2017-09-11 2018-01-09 合肥工业大学 Gastroscope visual-aid processing system and method based on ensemble learning
CN107360200A * 2017-09-20 2017-11-17 广东工业大学 Phishing detection method based on classification confidence and website features
CN108376133A * 2018-03-21 2018-08-07 北京理工大学 Short-text sentiment classification method based on emotion-word expansion
CN110222181A * 2019-06-06 2019-09-10 福州大学 Film review sentiment analysis method based on Python
CN110222181B * 2019-06-06 2021-08-31 福州大学 Film review sentiment analysis method based on Python
CN112559685A * 2020-12-11 2021-03-26 芜湖汽车前瞻技术研究院有限公司 Spam comment identification method for automobile forums



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170825

Termination date: 20201212