CN104462409A - Cross-language emotional resource data identification method based on AdaBoost - Google Patents
- Publication number
- CN104462409A CN104462409A CN201410766618.9A CN201410766618A CN104462409A CN 104462409 A CN104462409 A CN 104462409A CN 201410766618 A CN201410766618 A CN 201410766618A CN 104462409 A CN104462409 A CN 104462409A
- Authority
- CN
- China
- Prior art keywords
- weak classifier
- language
- training set
- adaboost
- data identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a cross-language emotional resource data identification method based on AdaBoost. The method comprises: (1) building an emotional resource data identification model that judges the category of a raw datum d by estimating its posterior probability for each category from the prior probability and the conditional probabilities; (2) translating the target-language training set into the source language, then training on the joint training set with the AdaBoost-based emotional resource data identification algorithm and constructing weak classifiers; (3) updating the training set with a sliding window to train the best weak classifier in each round, finally obtaining the classifier best suited to target-language emotional resource data identification. Emotional resource data in the specific language can thereby be identified.
Description
Technical field
The present invention relates to the field of computing, and in particular to an AdaBoost-based cross-language emotional resource data identification method.
Background technology
With the rapid development of social network platforms such as microblogs, text sentiment classification has become a focus of text information processing. Annotated sentiment resources provide the foundation for research on text sentiment recognition. At present, English-language resources include SentiWordNet and the fine-grained sentiment analysis corpus MPQA; Chinese-language resources include the HowNet sentiment dictionary and the Dalian University of Technology sentiment vocabulary ontology. However, annotated corpora are unevenly distributed across languages. When a language lacks an annotated corpus, using the annotated corpora of other languages to assist sentiment recognition becomes a hot topic.
Cross-lingual sentiment analysis (CLSA) uses the annotated corpora of an existing language to assist sentiment-polarity analysis in another language. Some existing CLSA techniques use bilingual dictionaries or parallel corpora to establish correspondences between the two languages, then apply similar techniques to perform sentiment analysis in the target language. Others rely on machine translation: different languages are first translated into a single language, and monolingual sentiment analysis methods are then applied. Wan et al. used machine translation to inter-translate annotated English texts and unannotated Chinese texts, then applied a Co-Training algorithm for Chinese sentiment recognition. Xu Jun proposed a transfer self-learning algorithm to address the inaccuracy of machine translation, iteratively training the classifier by automatically labeling high-confidence translated samples in the training set. The studies above all assume different corpus backgrounds, and when the backgrounds of the available corpus resources differ, the CLSA strategies differ as well. Moreover, the sentiment-resource transfer strategy is closely tied to the sentiment recognition method; the transfer strategy cannot be studied in isolation from the recognition method.
The present invention proposes a sentiment-resource transfer method based on the AdaBoost algorithm. First, a small-scale target-language training set is translated into the source language and merged with a large-scale source-language training set to build initial weak classifiers; then multiple classifiers are trained with the AdaBoost algorithm; cross-language sentiment recognition is achieved through the cooperation of the multiple classifiers.
Summary of the invention
The present invention aims at least to solve the technical problems existing in the prior art, and in particular innovatively proposes an AdaBoost-based cross-language emotional resource data identification method.
To achieve the above purpose, the invention provides an AdaBoost-based cross-language emotional resource data identification method, whose key point is that it comprises the following steps:
Step 1: build an emotional resource data identification model; estimate the posterior probability of a raw datum d for each category from the prior probability and the conditional probabilities, and thereby judge the category of d;
Step 2: translate the target-language training set into the source language, then use the AdaBoost emotional resource data identification algorithm to perform emotional resource data training on the joint training set and construct weak classifiers;
Step 3: update the training set with a sliding window and train the best weak classifier in each round; finally obtain the classifier suited to target-language emotional resource data identification, forming the optimal classifier and thereby identifying emotional resource data in the specific language.
In the aforementioned AdaBoost-based cross-language emotional resource data identification method, preferably, step 1 comprises:
first calculating the prior probabilities of the raw data; then extracting the affective features of the raw data and calculating the conditional probabilities of the features; finally taking the category with the maximum posterior probability as the preliminary result of emotional resource data identification.
In the aforementioned AdaBoost-based cross-language emotional resource data identification method, preferably, step 2 comprises:
Step 2-1: constructing multiple cooperating weak classifiers; the AdaBoost emotional resource data identification algorithm continually adjusts the sample distribution to train new weak classifiers, and through repeated iteration generates a vector containing the weight of each weak classifier;
Step 2-2: training on the source-language training set and the target-language training set with the AdaBoost emotional resource data identification algorithm.
In the aforementioned AdaBoost-based cross-language emotional resource data identification method, preferably, step 2 further comprises:
Step 2-3: initializing, letting the iteration round k=1;
Step 2-4: establishing the joint training set, letting the joint training set CR_k = R ∪ T_s, where R is the source-language training set and T_s is the source-language training set obtained by translating the target-language training set;
Step 2-5: initializing the weights, letting w_i(k) = 1/(|T_s|+|R|) for k=1;
Step 2-6: for (k=1…K), training the best weak classifier h_k: CR_k → Y under the current weight distribution on CR_k; classifying all samples of CR_k with h_k; calculating the classification error ε_k = Σ_{i: h_k(d_i)≠y_i} w_i(k);
Step 2-7: if (ε_k > 1/2), then {k = k-1; break;}; otherwise calculating the weight α_k of the weak classifier h_k as α_k = (1/2) × ln((1-ε_k)/ε_k);
Step 2-8: recording the weak-classifier weight, letting W(k) = α_k; updating the weight of each sample as w_i(k+1) = w_i(k) × exp(-α_k)/Z_k when h_k classifies d_i correctly and w_i(k+1) = w_i(k) × exp(α_k)/Z_k otherwise, where exp is the exponential function with base e and Z_k is the normalization factor;
Step 2-9: taking as the classification result the category receiving the largest α-weighted vote of the weak classifiers selected from the candidate weak classifier set.
In the aforementioned AdaBoost-based cross-language emotional resource data identification method, preferably, step 3 comprises:
Step 3-1: adopting a sliding-window method of updating the training set, and training the best weak classifier through repeated iteration;
Step 3-2: to train the best weak classifier, sorting the joint training set CR_k by sample weight in descending order, wherein the classifier of the k-th iteration round is h_k; labelling CR_k with h_k, the number of correctly classified samples being cnum and the number of misclassified samples being enum; the sliding-window size being scale; the window sliding step being step; the set of samples inside the window being TN_k; H being the candidate weak classifier set; and CN_r being the training set for generating the best weak classifier h_{k+1};
Step 3-3: letting pos = |CR_k| - cnum and H = {Φ}, pos denoting the starting position of the sliding window;
Step 3-4: updating the training set CN_r = CN_r ∪ TN_r;
Step 3-5: training a weak classifier h_r on the training set CN_r; classifying CR_k with h_r and calculating its classification error rate e_r;
Step 3-6: adding to the candidate weak classifier set, H = H ∪ {h_r(e_r)}; pos = pos - step; if (pos - scale) < 0 then {break;}, i.e. stopping training when the number of samples in the remaining data set is less than one sliding-window size;
Step 3-7: taking as the best weak classifier the candidate in H with the lowest error rate e_r.
In summary, owing to the adoption of the above technical scheme, the beneficial effects of the invention are as follows:
An AdaBoost-based cross-language emotional resource data identification method is proposed. The target-language training set is first translated into the source language; the AdaBoost algorithm is then applied on the joint training set; the training set is updated with a sliding window to train the best weak classifier in each round; finally a classifier suited to target-language sentiment recognition is obtained. The AdaBoost-based classification strategy of the invention achieves accuracy and recall better than the baseline, demonstrating the validity of the method.
Additional aspects and advantages of the invention will be given in part in the following description; in part they will become apparent from the description, or will be learned through practice of the invention.
Accompanying drawing explanation
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic comparison of the classification performance of UNB-A and UNB-B according to the invention;
Fig. 2 is a schematic diagram of the change in accuracy and recall over successive AdaBoost iterations according to the invention;
Fig. 3 is a flow chart of the AdaBoost-based cross-language emotional resource data identification method of the invention.
Embodiment
Embodiments of the invention are described in detail below; examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, intended only to explain the invention, and are not to be construed as limiting it.
In the description of the invention, it should be understood that terms indicating orientation or positional relationships, such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer", are based on the orientations or positional relationships shown in the drawings, are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the invention.
In the description of the invention, unless otherwise specified and limited, it should be noted that the terms "mounted", "linked" and "connected" are to be understood broadly: a connection may, for example, be mechanical or electrical, may be internal to two elements, and may be direct or indirect through an intermediary. For those of ordinary skill in the art, the specific meanings of the above terms can be understood according to the circumstances.
Sentiment-resource transfer means moving sentiment resources from the source language to the target language: using large-scale annotated source-language samples together with small-scale annotated target-language samples, the positive/negative polarity of target-language texts is identified. In the research of the present invention, the resource used is a source-language sentiment dictionary. For cross-language sentiment-resource transfer, the invention mainly solves three problems: (1) building the emotional resource data identification model; (2) choosing a suitable machine-translation scheme; (3) designing a multi-classifier cooperation strategy based on AdaBoost.
1. Build the emotional resource data identification model: estimate the posterior probability of a raw datum d for category c_k from the prior probability and the conditional probabilities, and thereby judge the category of d. Formula (1) is the classification formula of the multinomial naive Bayes model:
c_NB = argmax_{c_k ∈ C} [ P(c_k) × Π_{i=1…n} P(w_i|c_k)^{wt(w_i)} ]   (1)
where c_NB denotes the category predicted by the multinomial naive Bayes model and n (a positive integer) is the number of features; P(c_k) is the prior probability of each category of the category set C in the training set D; w_i denotes the i-th feature of the raw datum d; and wt(w_i) is the weight of feature w_i in d.
The invention performs emotional resource data identification with the multinomial naive Bayes model, taking the annotated source-language and target-language samples as the training set. First the prior probability of a text is calculated; then the affective features of the text are extracted and the conditional probabilities of the features are calculated; finally the category with the maximum posterior probability is taken as the sentiment recognition result.
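As a concrete illustration of this step, the multinomial naive Bayes classifier of formula (1) can be sketched as follows. This is a minimal sketch, not the patent's implementation: the class and method names are illustrative, Laplace smoothing is an added assumption, and term frequency stands in for the feature weight wt(w_i):

```python
from collections import Counter, defaultdict
import math

class MultinomialNB:
    """Minimal multinomial naive Bayes for sentiment features (sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha          # Laplace smoothing (assumed, not in the patent)
        self.prior = {}             # log P(c_k)
        self.cond = {}              # log P(w_i | c_k)
        self.default = {}           # smoothed log-probability for unseen features
        self.vocab = set()

    def fit(self, docs, labels):
        counts = defaultdict(Counter)           # per-class feature counts
        n_docs = Counter(labels)
        for words, y in zip(docs, labels):
            counts[y].update(words)
            self.vocab.update(words)
        for c in n_docs:
            self.prior[c] = math.log(n_docs[c] / len(labels))   # prior P(c_k)
            denom = sum(counts[c].values()) + self.alpha * len(self.vocab)
            self.cond[c] = {w: math.log((counts[c][w] + self.alpha) / denom)
                            for w in self.vocab}
            self.default[c] = math.log(self.alpha / denom)

    def predict(self, words):
        # category with the maximum posterior:
        # log P(c) + sum_i wt(w_i) * log P(w_i|c), wt = term frequency
        wt = Counter(words)
        scores = {c: self.prior[c] +
                  sum(n * self.cond[c].get(w, self.default[c])
                      for w, n in wt.items())
                  for c in self.prior}
        return max(scores, key=scores.get)
```

For example, fitting on a handful of labeled word lists and calling `predict` on a new word list returns the label with the highest posterior, exactly the argmax of formula (1) in log space.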
2. When the only existing resource is a source-language sentiment dictionary, translating the source-language training set would also require translating the sentiment dictionary at the same time. Deduplication after translation sharply shrinks the sentiment dictionary, reducing the recall of emotional-resource features. The invention therefore translates the target-language training set into the source language and performs training there.
3. In cross-language sentiment recognition, because the target-language training set is very small, a sentiment classifier built on the joint training set formed from the source and target languages is weak. The invention therefore constructs multiple cooperating weak classifiers. AdaBoost is an algorithmic framework that integrates weak classifiers into a strong classifier: it trains new weak classifiers by continually adjusting the sample distribution, and through repeated iteration generates a vector of weak-classifier weights, the weight of each weak classifier representing its importance in classification.
Let the source-language training set be R and the target-language training set be T, with |T| << |R|, i.e. the sample number of T is far smaller than that of R. Let T_s denote the source-language training set obtained by translating T; let the sentiment category set be Y = {0, 1}; let the number of AdaBoost iterations be K; and let W denote the weak-classifier weight vector. The AdaBoost-based cross-language emotional resource data identification algorithm is described as Algorithm 1:
Algorithm 1:
(1) Initialization: let the iteration round k=1;
(2) Establish the joint training set: let CR_k = R ∪ T_s, as in formula (2): CR_k = {(d_i, y_i, w_i(k))} (2), where d_i is a raw datum, y_i is its category label, and w_i(k) is the weight of d_i in the k-th iteration round;
(3) Initialize the weights: for k=1, let w_i(k) = 1/(|T_s|+|R|);
(4) for (k=1…K)
① train the best weak classifier h_k: CR_k → Y under the current weight distribution on CR_k;
② classify all samples of CR_k with h_k; calculate the classification error ε_k, as in formula (3): ε_k = Σ_{i: h_k(d_i)≠y_i} w_i(k) (3);
③ if (ε_k > 1/2), then {k = k-1; break;}
④ calculate the weight α_k of weak classifier h_k, as in formula (4): α_k = (1/2) × ln((1-ε_k)/ε_k) (4);
⑤ record the weak-classifier weight: let W(k) = α_k;
⑥ update the weight of each sample, as in formula (5): w_i(k+1) = w_i(k) × exp(-α_k)/Z_k if h_k classifies d_i correctly, and w_i(k+1) = w_i(k) × exp(α_k)/Z_k otherwise (5), where exp is the exponential function with base e and Z_k is the normalization factor;
(5) The final classification result, as in formula (6), is the category receiving the largest α-weighted vote of the weak classifiers: H(d) = argmax_{y∈Y} Σ_{k: h_k(d)=y} α_k (6).
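The loop of Algorithm 1 can be sketched in Python. This is a hedged sketch, not the patent's code: `train_weak` stands in for the best-weak-classifier training described later, the toy `stump` learner is purely for demonstration, and labels are taken in {-1, +1} rather than Y = {0, 1} so that the weight update and final vote take the compact exponential form:

```python
import math

def adaboost(X, y, train_weak, K):
    """AdaBoost training loop (sketch of Algorithm 1).

    X: samples; y: labels in {-1, +1};
    train_weak(X, y, w) -> weak classifier h with h(x) in {-1, +1}.
    Returns the strong classifier H: the sign of the alpha-weighted vote.
    """
    n = len(X)
    w = [1.0 / n] * n                       # step (3): uniform initial weights
    hs, alphas = [], []
    for _ in range(K):
        h = train_weak(X, y, w)             # best weak classifier under current weights
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)   # formula (3)
        if err > 0.5:                       # discard the round and stop, as in the patent
            break
        err = max(err, 1e-10)               # numerical guard for a perfect classifier
        alpha = 0.5 * math.log((1 - err) / err)                       # formula (4)
        # formula (5): down-weight correct samples, up-weight errors, renormalize
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)                          # normalization factor Z_k
        w = [wi / z for wi in w]
        hs.append(h)
        alphas.append(alpha)
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1

def stump(X, y, w):
    """Toy weak learner: the best weighted threshold stump on scalar inputs."""
    best = None
    for t in set(X):
        for sgn in (1, -1):
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if (sgn if xi <= t else -sgn) != yi)
            if best is None or err < best[0]:
                best = (err, t, sgn)
    _, t, sgn = best
    return lambda x, t=t, sgn=sgn: sgn if x <= t else -sgn
```

For example, `adaboost([1, 2, 3, 4, 5, 6], [1, 1, 1, -1, -1, -1], stump, 5)` yields a strong classifier that reproduces the threshold at 3.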
Training the best weak classifier based on the sliding window:
For the best-weak-classifier training of the AdaBoost algorithm, a sliding window is used to update the training set, and the best weak classifier is trained through repeated iteration.
Let the classifier of the k-th iteration round be h_k; label CR_k with h_k, the number of correctly classified samples being cnum and the number of misclassified samples being enum. Let the sliding-window size be scale; the window sliding step be step; the set of samples inside the window be TN_k; H the candidate weak classifier set; and CN_r the training set for generating the best weak classifier h_{k+1}. The algorithm for training the best weak classifier h_{k+1} is described as Algorithm 2:
Algorithm 2:
(1) Sort CR_k by sample weight in descending order;
(2) let pos = |CR_k| - cnum; H = {Φ};
(3) while (true)
{
CN_r = CN_r ∪ TN_r;
train a weak classifier h_r on CN_r;
classify CR_k with h_r and calculate the classification error rate e_r, as in formula (7): e_r = Σ_{i: h_r(d_i)≠y_i} w_i(k) (7);
H = H ∪ {h_r(e_r)};
pos = pos - step;
if (pos - scale) < 0 then {break;}
}
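The sliding-window search of Algorithm 2 can be sketched as follows. This is a sketch under stated assumptions: `train` and `evaluate` are placeholders for the weak-classifier training and the error rate e_r of formula (7), and all names are illustrative, not from the patent text:

```python
def best_weak_classifier(CR, weights, cnum, scale, step, train, evaluate):
    """Sliding-window search for the best weak classifier (sketch of Algorithm 2).

    CR: joint training set; weights: per-sample weights aligned with CR;
    cnum: number of samples the current classifier h_k classified correctly;
    train(samples) -> candidate classifier; evaluate(h) -> error rate of h on CR.
    """
    # (1) sort samples by weight in descending order: heavier (harder) samples first
    order = sorted(range(len(CR)), key=lambda i: weights[i], reverse=True)
    ranked = [CR[i] for i in order]
    # (2) the window starts at the boundary of the misclassified region
    pos = len(CR) - cnum
    candidates = []                            # H: candidate (error, classifier) pairs
    CN = []                                    # accumulated training set CN_r
    while True:
        TN = ranked[max(pos - scale, 0):pos]   # samples inside the current window
        CN = CN + TN                           # CN_r = CN_r ∪ TN_r
        h = train(CN)                          # train candidate h_r on CN_r
        candidates.append((evaluate(h), h))    # H = H ∪ {h_r(e_r)}
        pos -= step                            # slide the window toward heavier samples
        if pos - scale < 0:                    # fewer samples left than one window
            break
    # the best weak classifier is the candidate with the lowest error rate
    return min(candidates, key=lambda c: c[0])[1]
```

Plugging in a dummy `train`/`evaluate` pair shows the mechanics: the window slides by `step` until fewer than `scale` samples remain, and the lowest-error candidate is returned.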
Regarding the sliding-window size: if scale << cnum, the training sets are small and the similarity between CN_r and CN_{r+1} increases. On the one hand this depresses recall; on the other, it reduces the differences among the candidate weak classifiers, lowering the discriminability of the best weak classifier. The research of the invention shows that when scale takes values in [cnum/7, cnum/3], classification performance shows no significant difference; the invention sets the window size to scale = cnum/4. To increase the differences among weak classifiers, the window sliding step is set with step ≥ scale, so that repeated samples are not selected for two trainings.
Experimental results:
The experimental corpus comes from the NLP&CC2013 Chinese microblog evaluation task. The source language is English and the target language is Chinese. The source-language training set is review corpus from the MUSIC domain, comprising 4000 review texts in total, 2000 each positive and negative. The target-language training set is review corpus from the MUSIC domain, comprising 40 texts in total, 20 each positive and negative (the numbers of source- and target-language training samples are consistent with the NLP&CC2013 evaluation standard). The annotated result set (MUSIC domain) of the NLP&CC2013 evaluation is taken as the test sample set, containing 814 target-language samples: 412 positive-class samples and 402 negative-class samples. The corpus distribution is shown in Table 1.
Table 1: Corpora used in the experiments
The MPQA dictionary is adopted, containing 2789 positive words and 6079 negative words. Microsoft Translator is selected as the machine translation tool. Accuracy and recall are the evaluation indexes of the experimental results.
Comparison of machine-translation strategies
Four groups of test samples are drawn at random from the test sample set, each group containing 500 samples (the groups may overlap). Method 1 translates the target-language training set into the source language and builds classifier UNB-A on the joint training set; method 2 translates the source-language training set and the MPQA dictionary into the target language and builds classifier UNB-B on the joint training set. The performance of the two classifiers is shown in Fig. 1, where accuracy and recall are averages over the 4 groups of test samples.
Fig. 1 contrasts the classification performance of UNB-A and UNB-B.
The recall of UNB-B is markedly lower than that of UNB-A; the main causes are the limited accuracy of machine translation and the differences between Chinese and Western modes of expression. For example:
Original: Bread's music was more interesting than this!
Original: Antony has a very interesting voice.
In the translated MPQA dictionary, "interesting" maps to a single Chinese word, but the machine translations of the two texts above render "interesting" with two different Chinese synonyms, neither matching the dictionary entry. These two affective features therefore cannot be extracted from the Chinese texts, lowering the classification recall.
Sentiment-resource transfer results
To verify the validity of the AdaBoost-based method, two classifiers serve as baselines, and a contrast experiment is run on all 814 samples of the test sample set: (1) classifier SNB, obtained on the source-language training set; (2) classifier UNB, trained on the joint source- and target-language training set; (3) classifier AdaNB, obtained with the method of the invention. AdaNB is run for 1-20 iterations with sliding window scale = cnum/4 and step = scale; the classification performance is shown in Fig. 2, which plots the change in accuracy and recall over successive AdaBoost iterations.
The overall classification performance of AdaNB rises and then remains stable once the iteration count reaches 7; at 15 iterations the overall performance approaches its optimum. With further iterations, however, the performance of AdaNB does not keep improving: positive-class accuracy and negative-class recall decline slightly, then hold steady at a certain level. The likely reasons are that each weak classifier of the invention selects only part of the training set, so its training set is small, it achieves only locally good performance, and its recall is somewhat low; and that as the iteration count grows, the classifier begins to overfit. Table 2 gives the accuracy and recall of each classifier, with the AdaNB classifier run for 15 iterations.
Table 2: Classification results of the various classifiers
The positive-class F value of AdaNB reaches 0.749789, better than the SNB classifier by 0.11 and the UNB classifier by 0.10; the negative-class F value reaches 0.693136, better than SNB by 0.06 and UNB by 0.02. Overall, within a certain number of iterations, the overall performance of AdaNB is better than the baselines, with relatively balanced accuracy and recall. In the NLP&CC2013 evaluation, sentiment recognition accuracy on MUSIC-domain reviews was at most 0.76 and at least 0.50. The experiments of the invention take the result set of that evaluation as the test set, i.e. the same test sample set as the evaluation; the average accuracy over positive and negative classes reaches 0.71, better than the average level of the evaluation, which also shows the validity of the method.
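For reference, the F values reported in Table 2 follow the standard F1 definition, the harmonic mean of precision (accuracy, in the paper's terminology) and recall. A minimal sketch:

```python
def f_value(precision, recall):
    """F1 measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean penalizes imbalance (e.g. precision 0.8 with recall 0.6 gives F ≈ 0.686, below their arithmetic mean of 0.7), the relatively balanced accuracy and recall of AdaNB work in favor of its F values.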
Beneficial effects of the invention: based on machine translation, multiple weak classifiers are combined into a strong classifier by weighting. In each iteration round of AdaBoost, a sliding-window method of updating the training set is proposed to train the best weak classifier. Experiments show that the adopted machine-translation scheme is feasible. The AdaBoost-based cross-language sentiment classification performance is better than the baseline; on the MUSIC-domain reviews of the NLP&CC2013 evaluation, the average accuracy of the method reaches 0.69, indicating its validity.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example, and the particular features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, those of ordinary skill in the art will appreciate that various changes, amendments, replacements and modifications can be made to these embodiments without departing from the principle and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.
Claims (5)
1. An AdaBoost-based cross-language emotional resource data identification method, characterized in that it comprises the following steps:
Step 1: building an emotional resource data identification model; estimating the posterior probability of a raw datum d for each category from the prior probability and the conditional probabilities, and thereby judging the category of d;
Step 2: translating the target-language training set into the source language, then using the AdaBoost emotional resource data identification algorithm to perform emotional resource data training on the joint training set and construct weak classifiers;
Step 3: updating the training set with a sliding window and training the best weak classifier in each round; finally obtaining the classifier suited to target-language emotional resource data identification, forming the optimal classifier and thereby identifying emotional resource data in the specific language.
2. The AdaBoost-based cross-language emotional resource data identification method according to claim 1, characterized in that said step 1 comprises:
first calculating the prior probabilities of the raw data; then extracting the affective features of the raw data and calculating the conditional probabilities of the features; finally taking the category with the maximum posterior probability as the preliminary result of emotional resource data identification.
3. The AdaBoost-based cross-language emotional resource data identification method according to claim 1, characterized in that said step 2 comprises:
Step 2-1: constructing multiple cooperating weak classifiers; the AdaBoost emotional resource data identification algorithm continually adjusts the sample distribution to train new weak classifiers, and through repeated iteration generates a vector containing the weight of each weak classifier;
Step 2-2: training on the source-language training set and the target-language training set with the AdaBoost emotional resource data identification algorithm.
4. The AdaBoost-based cross-language affective resource data identification method according to claim 1, wherein said step 2 further comprises:
Step 2-3, performing initialization and setting the iteration round k = 1;
Step 2-4, establishing the joint training set CRk, as in the following formula:
CRk = R ∪ Ts;
Step 2-5, initializing the weights: when k = 1, wi(k) = 1/(|Ts| + |R|);
Step 2-6, for (k = 1 ... K), training the optimal weak classifier hk: CRk → Y on the current weight distribution of CRk; classifying all samples of CRk with hk; calculating the classification error εk as the total weight of the misclassified samples, as in the following formula:
εk = Σ wi(k), summed over the samples i for which hk(xi) ≠ yi;
Step 2-7, if (εk > 1/2), then {k = k - 1; break;}; otherwise calculating the weight αk of weak classifier hk, as in the following formula:
αk = (1/2) × ln((1 - εk)/εk);
Step 2-8, recording the weak classifier weight: letting W(k) = αk; updating the weight of each sample, as in the following formula:
wi(k+1) = (wi(k)/Zk) × exp(-αk·yi·hk(xi)),
wherein exp is the exponential function with base e and Zk is the normalization factor that makes the updated weights sum to 1;
Step 2-9, the classification result is given by the following formula:
H(x) = sign(Σ W(k)·hk(x)), summed over k = 1 ... K.
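Steps 2-3 to 2-9 are the standard AdaBoost training loop: uniform initial weights, a weak learner minimizing the weighted error, the classifier weight αk, a multiplicative sample-weight update, and a weighted-vote decision. A compact sketch in Python, using a single-feature threshold stump as the weak learner purely for illustration (the claims do not prescribe a particular weak learner), with an error clamp added as an assumption to avoid log(0):

```python
import math

def train_stump(X, y, w):
    """Weak learner: pick the single-feature threshold stump with the
    lowest weighted classification error under sample weights w."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted(set(x[j] for x in X)):
            for sign in (1, -1):
                pred = [sign if x[j] >= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, (lambda x: sign if x[j] >= thr else -sign)

def adaboost(X, y, K):
    n = len(X)
    w = [1.0 / n] * n                          # step 2-5: uniform initial weights
    ensemble = []                              # pairs (W(k), h_k)
    for _ in range(K):
        eps, h = train_stump(X, y, w)          # step 2-6: optimal weak classifier
        if eps > 0.5:                          # step 2-7: discard and stop
            break
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # clamp (assumption, avoids log(0))
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))            # step 2-8: record W(k) = alpha_k
        w = [wi * math.exp(-alpha * yi * h(x)) for wi, yi, x in zip(w, y, X)]
        Z = sum(w)                             # normalization factor Z_k
        w = [wi / Z for wi in w]
    return ensemble

def predict(ensemble, x):                      # step 2-9: sign of the weighted vote
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
model = adaboost(X, y, K=5)
print(predict(model, [2.5]), predict(model, [0.5]))  # expected: 1 -1
```

The early-exit when εk > 1/2 mirrors the claim's {k = k - 1; break;}: a weak classifier no better than chance would receive a negative weight, so training stops instead.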
5. The AdaBoost-based cross-language affective resource data identification method according to claim 1, wherein said step 3 comprises:
Step 3-1, updating the training set with a sliding window and training the optimal weak classifier through repeated iterations;
Step 3-2, to train the optimal weak classifier, sorting the joint training set CRk in descending order of sample weight, wherein hk is the classifier of the k-th iteration round; labeling CRk with hk, with cnum being the number of correctly classified samples and enum the number of misclassified samples; scale is the sliding window size; step is the sliding step length of the window; TNk is the set of samples inside the window; H is the candidate weak classifier set; CNr is the training set used to generate the optimal weak classifier hk+1;
Step 3-3, letting pos = |CRk| - cnum and H = {Φ}, wherein pos denotes the start position of the sliding window;
Step 3-4, updating the training set:
CNr = CNr ∪ TNr;
Step 3-5, training the weak classifier hr on training set CNr; classifying CRk with hr and calculating the classification error rate er of hr on CRk;
Step 3-6, updating the candidate weak classifier set: H = H ∪ {hr(er)};
pos = pos - step;
if (pos - scale) < 0 then {break;}, i.e. training stops when the number of remaining samples is smaller than one sliding window;
Step 3-7, selecting as the optimal weak classifier hk+1 the candidate in H with the smallest classification error rate er.
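The sliding-window procedure of steps 3-2 to 3-7 can be sketched as follows. The descending weight sort, the window of size scale sliding by step, the accumulated training set CNr, and the minimum-error selection follow the claim; the exact window indexing convention, the majority-vote placeholder weak learner, and the toy data are assumptions for illustration:

```python
from collections import Counter

def sliding_window_select(CR, weights, cnum, scale, step, train_fn, error_fn):
    """Return the candidate weak classifier with the lowest error on CR.
    CR: list of (x, y) samples; weights: parallel sample weights."""
    # Step 3-2: sort the joint training set by descending sample weight.
    order = sorted(range(len(CR)), key=lambda i: -weights[i])
    ranked = [CR[i] for i in order]
    pos = len(CR) - cnum               # step 3-3: start position of the window
    candidates = []                    # H, the candidate weak classifier set
    CN = []                            # accumulated training set CN_r
    while pos - scale >= 0:            # stop: fewer samples left than one window
        TN = ranked[pos - scale:pos]   # samples inside the current window
        CN = CN + TN                   # step 3-4: CN_r = CN_r ∪ TN_r
        h = train_fn(CN)               # step 3-5: train weak classifier h_r
        candidates.append((error_fn(h, CR), h))  # step 3-6: H = H ∪ {h_r(e_r)}
        pos -= step                    # slide the window toward the front
    # Step 3-7: optimal weak classifier = candidate with minimal error rate.
    return min(candidates, key=lambda t: t[0])[1] if candidates else None

def train_majority(samples):           # placeholder weak learner (assumption)
    label = Counter(y for _, y in samples).most_common(1)[0][0]
    return lambda x: label

def err_rate(h, samples):
    return sum(1 for x, y in samples if h(x) != y) / len(samples)

CR = [((i,), 1 if i >= 4 else -1) for i in range(8)]
w = [0.3, 0.3, 0.05, 0.05, 0.05, 0.05, 0.1, 0.1]
h = sliding_window_select(CR, w, cnum=6, scale=2, step=2,
                          train_fn=train_majority, error_fn=err_rate)
print(err_rate(h, CR))
```

Because the heaviest (hardest) samples sit at the front of the sorted list, sliding the window toward the front gradually exposes candidate classifiers to the most difficult examples.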
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410766618.9A CN104462409B (en) | 2014-12-12 | 2014-12-12 | Across language affection resources data identification method based on AdaBoost |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104462409A true CN104462409A (en) | 2015-03-25 |
CN104462409B CN104462409B (en) | 2017-08-25 |
Family
ID=52908444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410766618.9A Expired - Fee Related CN104462409B (en) | 2014-12-12 | 2014-12-12 | Across language affection resources data identification method based on AdaBoost |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104462409B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101231702A (en) * | 2008-01-25 | 2008-07-30 | 华中科技大学 | Classifier integration method |
US20100217595A1 (en) * | 2009-02-24 | 2010-08-26 | Korea Institute Of Science And Technology | Method For Emotion Recognition Based On Minimum Classification Error |
CN103617245A (en) * | 2013-11-27 | 2014-03-05 | 苏州大学 | Bilingual sentiment classification method and device |
CN103761311A (en) * | 2014-01-23 | 2014-04-30 | 中国矿业大学 | Sentiment classification method based on multi-source domain instance transfer |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709829A (en) * | 2015-08-03 | 2017-05-24 | 科大讯飞股份有限公司 | On-line-question-database-based learning condition diagnosis method and system |
CN106709829B (en) * | 2015-08-03 | 2020-06-02 | 科大讯飞股份有限公司 | Learning situation diagnosis method and system based on online question bank |
CN105938565A (en) * | 2016-06-27 | 2016-09-14 | 西北工业大学 | Multi-layer classifier and Internet image aided training-based color image emotion classification method |
CN108090040B (en) * | 2016-11-23 | 2021-08-17 | 北京国双科技有限公司 | Text information classification method and system |
CN108090040A (en) * | 2016-11-23 | 2018-05-29 | 北京国双科技有限公司 | Text information classification method and system |
US11151182B2 (en) * | 2017-07-24 | 2021-10-19 | Huawei Technologies Co., Ltd. | Classification model training method and apparatus |
CN107564580B (en) * | 2017-09-11 | 2019-02-12 | 合肥工业大学 | Gastroscope visual aid processing system and method based on ensemble learning |
CN107564580A (en) * | 2017-09-11 | 2018-01-09 | 合肥工业大学 | Gastroscope visual aid processing system and method based on ensemble learning |
CN107360200A (en) * | 2017-09-20 | 2017-11-17 | 广东工业大学 | Phishing detection method based on classification confidence and website features |
CN108376133A (en) * | 2018-03-21 | 2018-08-07 | 北京理工大学 | Short text sentiment classification method based on emotion word expansion |
CN110222181A (en) * | 2019-06-06 | 2019-09-10 | 福州大学 | Python-based film review sentiment analysis method |
CN110222181B (en) * | 2019-06-06 | 2021-08-31 | 福州大学 | Python-based film evaluation emotion analysis method |
CN112559685A (en) * | 2020-12-11 | 2021-03-26 | 芜湖汽车前瞻技术研究院有限公司 | Automobile forum spam comment identification method |
Also Published As
Publication number | Publication date |
---|---|
CN104462409B (en) | 2017-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104462409A (en) | Cross-language emotional resource data identification method based on AdaBoost | |
Tymann et al. | GerVADER-A German Adaptation of the VADER Sentiment Analysis Tool for Social Media Texts. | |
Grönroos et al. | Morfessor FlatCat: An HMM-based method for unsupervised and semi-supervised learning of morphology | |
CN105205124B (en) | Semi-supervised text sentiment classification method based on random character subspace | |
CN109977413A (en) | Sentiment analysis method based on an improved CNN-LDA model | |
CN110378409A (en) | Chinese-Vietnamese news document summary generation method based on an element-association attention mechanism | |
CN105808525A (en) | Domain concept hypernym-hyponym relation extraction method based on similar concept pairs | |
CN106095996A (en) | Method for text classification | |
CN106484675A (en) | Character relation extraction method fusing distributed semantics and sentence-meaning features | |
CN107391486A (en) | Domain new word recognition method based on statistical information and sequence labelling | |
CN107122349A (en) | Text feature word extraction method based on the word2vec-LDA model | |
CN102541838B (en) | Method and equipment for optimizing emotional classifier | |
CN110717341B (en) | Method and device for constructing a Lao-Chinese bilingual corpus with Thai as the pivot language | |
CN106257455A (en) | Bootstrapping algorithm for extracting opinion evaluation objects based on dependency templates | |
CN103020167B (en) | Computer-based Chinese text classification method | |
CN104346326A (en) | Method and device for determining emotional characteristics of emotional texts | |
Rustamov et al. | Sentence-level subjectivity detection using neuro-fuzzy models | |
CN110110116A (en) | Trademark image retrieval method integrating deep convolutional networks and semantic analysis | |
Zubarev et al. | Cross-language text alignment for plagiarism detection based on contextual and context-free models | |
Al Awaida et al. | Automated arabic essay grading system based on f-score and arabic worldnet | |
CN108681532A (en) | Sentiment analysis method for Chinese microblogs | |
Tong et al. | Multi-Task Learning for Mispronunciation Detection on Singapore Children's Mandarin Speech. | |
CN111368035A (en) | Neural network-based Chinese-Uyghur organization name dictionary mining system | |
Seva et al. | Multi-lingual ICD-10 Coding using a Hybrid rule-based and Supervised Classification Approach at CLEF eHealth 2017. | |
Mahdaouy et al. | Cs-um6p at semeval-2022 task 6: Transformer-based models for intended sarcasm detection in english and arabic |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20170825 | Termination date: 20201212