CN103886097A

CN103886097A - Chinese microblog viewpoint sentence recognition feature extraction method based on adaptive boosting algorithm

Info

Publication number: CN103886097A
Application number: CN201410135746.3A
Authority: CN
Inventors: 陈锻生; 吴扬扬; 方圆
Original assignee: Huaqiao University
Current assignee: Huaqiao University
Priority date: 2014-04-04
Filing date: 2014-04-04
Publication date: 2014-06-25

Abstract

The invention discloses a method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting (AdaBoost) algorithm. First, the features relevant to recognizing microblog viewpoint sentences are defined. Weak classifiers, each built on a single feature, are then combined into a strong classifier over multiple features, and the key recognition features are selected while the strong classifier is constructed. Finally, an effective set of subjective-sentence recognition features and the strong classifier built from that set are output. The feature set provides an effective basis for recognizing Chinese microblog viewpoint sentences.

Description

Method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm

Technical field

The present invention relates to a method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm.

Background art

Effectively determining whether a Chinese microblog post contains a person's viewpoint, opinion, or attitude toward something is an important foundation for automatically collecting and analyzing Chinese online public-opinion data. From a text-mining perspective, identifying subjective sentences improves the accuracy of viewpoint classification and reduces the interference of non-subjective sentences with downstream natural language processing tasks such as viewpoint summarization, tendency statistics, and sentiment analysis.

With the rapid development of the Internet and the spread of Web 2.0, publishing information is no longer the exclusive province of newspapers, periodical publishers, television stations, and news websites; microblog sites have become a medium for publishing public information. Compared with a traditional blog, the most distinctive feature of a microblog is its brevity: a single post is generally limited to 140 characters. A microblog post may contain not only news but also the user's personal viewpoints or opinions about things.

A viewpoint sentence expresses personal emotion and intent through assertion or comment. Viewpoint-sentence classification can be traced back to the classification of subjective and objective sentences in opinion mining, which has mostly been carried out on product-review data. What most distinguishes viewpoint-sentence recognition on microblogs is the length limit and the freedom of linguistic structure. Because of the length limit, the frequencies of words, parts of speech, and dependency relations are much lower than in ordinary text; because of the free structure, syntactic analysis is comparatively difficult. For recognizing the subjective components of short texts such as Chinese microblog posts, a systematic and effective classification method combined with an optimized feature extraction method is still lacking.

The adaptive boosting algorithm (AdaBoost) combines multiple weak classifiers into a strong classifier. A weak classifier is a binary classifier whose error probability is less than 0.5, so on a two-class problem it performs better than random guessing, whose error probability is 0.5; the error probability of the strong classifier H can be made arbitrarily small. Drawing on the idea of combining multiple classifiers in the adaptive boosting algorithm, we propose an effective feature selection method for recognizing subjective sentences in Chinese microblogs.
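The statement that the error of H can be made arbitrarily small can be illustrated with the standard AdaBoost training-error bound. This is a textbook result, not part of the source text. It assumes n labeled training samples (x_i, y_i), T boosting rounds, weak-classifier errors ε_j measured under the reweighted sample distributions, and the usual weights α_j = ½ ln((1 − ε_j)/ε_j):

```latex
\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\left[H(x_i)\neq y_i\right]
\;\le\; \prod_{j=1}^{T} 2\sqrt{\varepsilon_j\,(1-\varepsilon_j)}
\;\le\; \exp\!\Big(-2\sum_{j=1}^{T}\big(\tfrac{1}{2}-\varepsilon_j\big)^{2}\Big)
```

Each factor of the product is below 1 whenever ε_j differs from 0.5, so the bound on the training error shrinks as more weak classifiers with a gap from 0.5 are combined.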

Summary of the invention

The invention provides a method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm, which overcomes the deficiencies described in the background art.

The technical solution adopted by the invention to solve its technical problem is:

A method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm, comprising:

Step 1: given microblog training samples labeled as to whether they are viewpoint sentences, input the microblog training sample set S = {(x_i, y_i), i = 1, …, n}, where x_i ∈ X, y_i ∈ Y, Y = {−1, +1}; X holds the m features of the n microblog training samples and Y is the classification result for each microblog training sample. If microblog training sample x_i is a viewpoint sentence, it is labeled y_i = +1, otherwise y_i = −1;

Set the termination condition for the feature selection iteration: the gap between the classification error ε_j and 0.5 is less than a threshold β, where β can be set as the situation requires;

Set the initial weight distribution D_1 of the microblog training sample set to the uniform distribution, D_1(i) = 1/n for i = 1, …, n;

Set the initial selected feature set F to the empty set;

Set the initial value of the iteration variable j = 1; the maximum number of iterations is m;

Step 2: iterate steps 21 to 27 as follows:

Step 21: on the microblog training sample set with weight distribution D_j, find the weak classifier h_j built on a single feature f_j whose classification error ε_j on the microblog training sample set differs most from 0.5, where ε_j is the classification error of the weak classifier on the weighted microblog training sample set and h ranges over all single-feature weak classifiers with output in Y;

Step 22: record the parameters of the weak classifier h_j: the feature f_j, the threshold that splits the weighted microblog training sample set in two, and the binary relational operator;

Step 23: update the selected feature set F = F ∪ {f_j}; the feature f_j selected in this iteration is not used again in later iterations;

Step 24: compute the weight α_j of the weak classifier h_j in the strong classifier H;

Step 25: if the classification error satisfies |ε_j − 0.5| ≤ β, set the number of iterations T = j, exit the iteration, and end feature selection; otherwise proceed to step 26;

Step 26: increment the iteration variable j by 1; if j is greater than m, all features have been selected and the iteration exits; otherwise proceed to step 27;

Step 27: update the weight distribution of the microblog training sample set for each sample i = 1, …, n, then return to step 21;

Step 3: output the selected feature set F = {f_j | j = 1, …, T} and the strong classifier

H(x) = sign[Σ_{j=1}^{T} α_j h_j(x)].
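The formulas attached to steps 21, 24 and 27 are not reproduced in this text. As a sketch, assuming the method uses the standard AdaBoost definitions (an assumption, not confirmed by the source), they would take the following form, where the indicator equals 1 when its argument is true and 0 otherwise, and Z_j is a normalization factor introduced here so that the updated weights sum to 1:

```latex
\varepsilon_j = \sum_{i=1}^{n} D_j(i)\,\mathbb{1}\left[h_j(x_i)\neq y_i\right]
\qquad
\alpha_j = \frac{1}{2}\ln\frac{1-\varepsilon_j}{\varepsilon_j}

D_{j+1}(i) = \frac{D_j(i)\,\exp\!\big(-\alpha_j\,y_i\,h_j(x_i)\big)}{Z_j},
\qquad
Z_j = \sum_{i=1}^{n} D_j(i)\,\exp\!\big(-\alpha_j\,y_i\,h_j(x_i)\big)
```

Under these definitions α_j is positive when ε_j < 0.5 and negative when ε_j > 0.5, so a weak classifier whose error lies far above 0.5 still contributes to H through its inverted vote. This is consistent with selecting on the gap |ε_j − 0.5| rather than on ε_j alone.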

In one embodiment: the threshold β on the gap between the classification error ε_j and 0.5 can be set.

In one embodiment: the recognition features of Chinese microblog viewpoint sentences include the parts of speech in the Chinese microblog sentence.

In one embodiment: the recognition features of Chinese microblog viewpoint sentences include the sentiment word sets in a sentiment dictionary.

In one embodiment: the recognition features of Chinese microblog viewpoint sentences include dependency features between words.

In one embodiment: the recognition features of Chinese microblog viewpoint sentences include position features between words.

Compared with the background art, this technical solution has the following advantages:

The invention provides an effective feature selection solution for recognizing subjective sentences in Chinese microblogs. Weak classifiers with a single feature are built into a strong classifier with multiple features, the key recognition features are selected while the strong classifier is being built, and an effective set of subjective-sentence recognition features is output. This feature set provides an effective basis for recognizing Chinese microblog viewpoint sentences.

Brief description of the drawings

The invention is further described below with reference to the drawing and the embodiments.

Fig. 1 is a flowchart of the method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm.

Embodiment

Referring to Fig. 1, to recognize Chinese microblog viewpoint sentences, the features relevant to recognizing microblog viewpoint sentences are first defined; the features with the greatest discriminative power are then extracted from the many candidate features using the extraction method based on the adaptive boosting algorithm.

In the present embodiment, the parts of speech in the Chinese microblog and the sentiment word sets in a sentiment dictionary are the basic recognition features. The parts of speech include adjectives, verbs, interjections, and so on; 23 parts of speech are taken from the ICTPOS Chinese part-of-speech tag set of the Institute of Computing Technology, Chinese Academy of Sciences, excluding punctuation marks and prefix and suffix components. During extraction of the viewpoint-sentence recognition features, a weak classifier is built for each part of speech, and each part-of-speech classifier matches one part of speech. The sentiment words use the several sentiment word sets that a sentiment dictionary provides for judging the sentiment category of a word; likewise, a weak classifier is built for each sentiment word set. The dependency features use the Stanford University dependency relation set, 57 relations in total; likewise, a weak classifier is built for each dependency relation and matches one relation. Finally, position features describing where words occur relative to one another can be added; in the present embodiment, two classes of word position features are constructed to describe the syntactic structure of Chinese microblogs.

In total, weak classifiers corresponding to these more than 100 features can be built to recognize microblog viewpoint sentences. The adaptive boosting algorithm further selects among the more than 100 features to find the recognition features, and thus the weak classifiers, that are effective for recognizing microblog viewpoint sentences. A strong classifier formed as a linear combination of the selected weak classifiers then recognizes microblog viewpoint sentences, and its recognition performance is better than classification with a single feature, with several arbitrarily chosen features, or with all features.
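The feature construction described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the tag subset, the tiny sentiment word sets, the function name `extract_features`, and the use of counts as feature values are all assumptions, and word segmentation and part-of-speech tagging are assumed to have been done by an external tool.

```python
# Illustrative subset of part-of-speech tags (hypothetical selection;
# the embodiment uses 23 tags from the ICTPOS tag set).
POS_TAGS = ["a", "v", "e"]

# Hypothetical sentiment word sets standing in for a sentiment dictionary.
SENTIMENT_SETS = {
    "positive": {"喜欢", "好"},
    "negative": {"讨厌", "差"},
}


def extract_features(tagged_sentence):
    """Map a list of (word, pos_tag) pairs to feature names and values.

    Each feature is a count: how many words carry a given tag, or how
    many words fall in a given sentiment word set. A single-feature weak
    classifier can then threshold any one of these values.
    """
    names = ["pos:" + tag for tag in POS_TAGS]
    names += ["senti:" + name for name in SENTIMENT_SETS]

    values = [sum(1 for _, tag in tagged_sentence if tag == wanted)
              for wanted in POS_TAGS]
    values += [sum(1 for word, _ in tagged_sentence if word in words)
               for words in SENTIMENT_SETS.values()]
    return names, values


if __name__ == "__main__":
    sentence = [("我", "r"), ("喜欢", "v"), ("这个", "r"), ("手机", "n")]
    for name, value in zip(*extract_features(sentence)):
        print(name, value)
```

Dependency and position features would add further columns in the same way, one per dependency relation or position pattern, so that every column corresponds to exactly one candidate weak classifier.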

Let the weak classifier for a given feature be h_j: X → Y. Feature extraction follows steps 1 to 3 of the method described above:

Set the initial selected feature set F to the empty set;

Set the initial value of the iteration variable j = 1; the maximum number of iterations is m;

In step 21, h ranges over all single-feature weak classifiers with output in Y;

Step 24: compute the weight α_j of the weak classifier h_j in the strong classifier H;

Step 25: if the classification error satisfies |ε_j − 0.5| ≤ β, set the number of iterations T = j, exit the iteration, and end feature selection; otherwise continue with step 26;

Step 26: increment the iteration variable j by 1; if j is greater than m, all features have been selected and the iteration exits; otherwise continue with step 27;

After the weight distribution is updated for each sample i = 1, …, n in step 27, return to step 21;

Step 3: output the selected feature set F = {f_j | j = 1, …, T} and the strong classifier H(x) = sign[Σ_{j=1}^{T} α_j h_j(x)];

The selected feature set finally output is the set of effective recognition features for subjective microblog sentences. T is usually smaller than m, which achieves the purpose of feature selection.
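The selection procedure of steps 1 to 3 can be sketched in Python as below. This is a sketch under stated assumptions, not the patented implementation: the weight formula α_j = ½ ln((1 − ε_j)/ε_j) and the exponential weight update are the standard AdaBoost choices and are assumed here, the weak classifier is assumed to be a threshold test with operator `>=` or `<`, the clamping constant 1e-10 and the tie rule sign(0) = +1 are illustrative, and all function names are hypothetical.

```python
import math


def predict_stump(row, f, theta, op):
    """Single-feature weak classifier: +1 or -1 from one threshold test."""
    hit = row[f] >= theta if op == ">=" else row[f] < theta
    return 1 if hit else -1


def best_stump(X, y, D, f):
    """Return (weighted_error, theta, op) for feature f under weights D,
    choosing the candidate whose error is farthest from 0.5."""
    best = None
    for theta in sorted({row[f] for row in X}):
        err = sum(w for row, label, w in zip(X, y, D)
                  if predict_stump(row, f, theta, ">=") != label)
        op = ">="
        if err > 0.5:  # the complementary test "<" has error 1 - err
            err, op = 1.0 - err, "<"
        if best is None or abs(err - 0.5) > abs(best[0] - 0.5):
            best = (err, theta, op)
    return best


def select_features(X, y, beta=0.05):
    """Return a list of (feature, theta, op, alpha), one per iteration."""
    n, m = len(X), len(X[0])
    D = [1.0 / n] * n                 # step 1: uniform initial weights
    remaining = set(range(m))
    selected = []                     # step 1: empty selected set
    while remaining:
        # Step 21: unused feature whose best stump error is farthest from 0.5.
        cands = [(best_stump(X, y, D, f), f) for f in sorted(remaining)]
        (err, theta, op), f = max(cands, key=lambda c: abs(c[0][0] - 0.5))
        # Steps 22-23: record the stump and retire the feature.
        remaining.discard(f)
        # Step 24: classifier weight (assumed standard AdaBoost formula);
        # the error is clamped to keep the logarithm finite.
        e = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - e) / e)
        selected.append((f, theta, op, alpha))
        # Step 25: stop once the error is within beta of 0.5.
        if abs(err - 0.5) <= beta:
            break
        # Step 26: stop when every feature has been used.
        if not remaining:
            break
        # Step 27: reweight samples and renormalize to a distribution.
        D = [w * math.exp(-alpha * label * predict_stump(row, f, theta, op))
             for row, label, w in zip(X, y, D)]
        total = sum(D)
        D = [w / total for w in D]
    return selected


def strong_classify(row, selected):
    """Step 3: sign of the weighted vote (a tie is assigned +1 here)."""
    score = sum(alpha * predict_stump(row, f, theta, op)
                for f, theta, op, alpha in selected)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    X = [[1, 0, 1], [1, 1, 1], [1, 0, 0], [0, 1, 1], [0, 0, 0], [0, 1, 0]]
    y = [1, 1, 1, -1, -1, -1]
    for f, theta, op, alpha in select_features(X, y):
        print(f"feature {f}: x[{f}] {op} {theta}, alpha = {alpha:.3f}")
```

Raising β ends the iteration earlier and yields a smaller feature set; with the toy data above, β = 0.5 stops after the first feature because every error is within 0.5 of 0.5.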

In the present embodiment, precision, recall, and F-measure (F1) are used as statistical indicators to analyze how well the features discriminate viewpoint sentences. In a preferred embodiment, the data of the NLP&CC 2012 Chinese microblog sentiment analysis evaluation are used as the microblog training sample set for training and testing; selecting 6 part-of-speech features with this method achieves an overall F1 value of 80%. When the strong classifier obtained by selecting from part-of-speech features plus position features is used to screen subjective sentences, the classification results are better still.
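For reference, the three indicators can be computed as in the following sketch. The function name `precision_recall_f1` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration; the labels follow the +1 (viewpoint sentence) and −1 convention of step 1.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (viewpoint sentence) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Zero denominators are mapped to 0.0 (an illustrative convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    y_true = [1, 1, 1, -1, -1]
    y_pred = [1, 1, -1, 1, -1]
    print(precision_recall_f1(y_true, y_pred))
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.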

The above is only a preferred embodiment of the invention and does not limit the scope of its implementation; equivalent changes and modifications made in accordance with the claims and the description of the invention shall still fall within the scope covered by the invention.

Claims

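1. A method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm, characterized by comprising:

Step 1: given microblog training samples labeled as to whether they are viewpoint sentences, input the microblog training sample set S = {(x_i, y_i), i = 1, …, n}, where x_i ∈ X, y_i ∈ Y, Y = {−1, +1}; X holds the m features of the n microblog training samples and Y is the classification result for each microblog training sample. If microblog training sample x_i is a viewpoint sentence, it is labeled y_i = +1, otherwise y_i = −1;

Set the termination condition for the feature selection iteration: the gap between the classification error ε_j and 0.5 is less than a threshold β;

Set the initial weight distribution D_1 of the microblog training sample set to the uniform distribution, D_1(i) = 1/n for i = 1, …, n;

Set the initial selected feature set F to the empty set;

Set the initial value of the iteration variable j = 1; the maximum number of iterations is m;

Step 2: iterate steps 21 to 27 as follows:

Step 21: on the microblog training sample set with weight distribution D_j, find the weak classifier h_j built on a single feature f_j whose classification error ε_j on the microblog training sample set differs most from 0.5, where ε_j is the classification error of the weak classifier on the weighted microblog training sample set and h ranges over all single-feature weak classifiers with output in Y;

Step 22: record the parameters of the weak classifier h_j: the feature f_j, the threshold that splits the weighted microblog training sample set in two, and the binary relational operator;

Step 23: update the selected feature set F = F ∪ {f_j}; the feature f_j selected in this iteration is not used again in later iterations;

Step 24: compute the weight α_j of the weak classifier h_j in the strong classifier H;

Step 25: if the classification error satisfies |ε_j − 0.5| ≤ β, set the number of iterations T = j, exit the iteration, and end feature selection; otherwise proceed to step 26;

Step 26: increment the iteration variable j by 1; if j is greater than m, all features have been selected and the iteration exits; otherwise proceed to step 27;

Step 27: update the weight distribution of the microblog training sample set for each sample i = 1, …, n, then return to step 21;

Step 3: output the selected feature set F = {f_j | j = 1, …, T} and the strong classifier

H(x) = sign[Σ_{j=1}^{T} α_j h_j(x)].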

2. The method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm according to claim 1, characterized in that: the threshold β on the gap between the classification error ε_j and 0.5 can be set.

3. The method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm according to claim 1, characterized in that: the recognition features of Chinese microblog viewpoint sentences include the parts of speech in the Chinese microblog sentence.

4. The method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm according to claim 1, characterized in that: the recognition features of Chinese microblog viewpoint sentences include the sentiment word sets in a sentiment dictionary.

5. The method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm according to claim 1, characterized in that: the recognition features of Chinese microblog viewpoint sentences include dependency features between words.

6. The method for extracting recognition features of Chinese microblog viewpoint sentences based on the adaptive boosting algorithm according to claim 1, characterized in that: the recognition features of Chinese microblog viewpoint sentences include position features between words.