CN108920545A

CN108920545A - The Chinese affective characteristics selection method of sentiment dictionary and Ka Fang model based on extension

Info

Publication number: CN108920545A
Application number: CN201810610017.7A
Authority: CN
Inventors: 孙界平; 胡思才; 琚生根; 李兴国; 袁宵; 汪嘉伟; 王婧妍
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2018-06-13
Filing date: 2018-06-13
Publication date: 2018-11-30
Anticipated expiration: 2038-06-13
Also published as: CN108920545B

Abstract

The invention discloses the Chinese affective characteristics selection method of a kind of sentiment dictionary based on extension and Ka Fang model, the present invention is directed to current Research Challenges and deficiency, proposes the Chinese affective characteristics selection method of sentiment dictionary and Ka Fang model based on extension.First combine Hownet and it is improved based on the polarity value calculating method of word frequency to emotion word each in dictionary carry out polarity number calculating, establish have feeling polarities value dictionary；Then by negative word in comment text clause and its number occur and detecting, antisense conversion process carried out to the affective characteristic words in negative word confining spectrum, in effective solution emotion text the problem of polarity reversion；Finally the feeling polarities value of characteristic item and its two factor of frequency feature in class are dissolved into card side's feature selection module (CHI), substantially have modified the defect of card side's model, so that there are improved model better affective characteristics to select performance, text emotion classifying quality can be effectively improved.

Description

The Chinese affective characteristics selection method of sentiment dictionary and Ka Fang model based on extension

Technical field

The present invention relates to a kind of natural language processing technique more particularly to a kind of sentiment dictionaries and Ka Fangmo based on extension The Chinese affective characteristics selection method of type.

Background technique

In recent years, sentiment analysis oneself become natural language processing in a hot issue, in Market Forecast Analysis, the people The numerous areas such as meaning investigation, intelligent shopping guide, public comment suffer from wide application space and development prospect.Emotion comment text After feature vector, various dimensions disaster may be generated, is unfavorable for model training, therefore, text feature selection is especially heavy It wants.Emotion text feature selecting is generally divided into two classes, and one kind is directly to pass through look-up table using sentiment dictionary to extract affective characteristics Word [1], advantage be simple and efficient, the disadvantage is that do not consider affective characteristics to the influence degree of model, meanwhile, have ignored sentiment dictionary Outer feature；Another kind of is the feature selection approach based on statistics, such as card side's model, IG method, WLLR method, MI method, TF- IDF method etc. calculates training text characteristic item word frequency and document frequency using statistical method, so as to find out each feature The weight coefficient of item, using the high characteristic item of coefficient as the characteristic item of final choice.Currently, second method is more commonly used, it is preceding People obtains card release side's model by experimental study and IG method is the conclusion of one of current most effective feature selecting algorithm, especially Other methods are substantially better than in category distribution relative equilibrium.Therefore the defect and deficiency for inquiring into and correcting both methods, mention The efficiency of high feature selecting has very important practical significance.

In recent years, some scholars have made some improvements [2-4] for the deficiency of IG algorithm and Ka Fang model algorithm. TFIG algorithm [2] using the characteristic item frequency of occurrences (including characteristic item do not occur, occur it is primary, there are multiple three kinds of situations) carry out The calculating of information gain, this method are pretty good to long text classifying quality, but apply in short text sentiment analysis, then effect one As.CHI_LF algorithm [3] is by the information such as frequency disribution and characteristic item and class between position, class in frequency in the class of characteristic item, class The other positive and negative degree of correlation incorporates in card side's model, and effect is preferable under class deflection condition, but applies relatively equal in category distribution Performance boost is limited in the short text of weighing apparatus.

There are one emotion comment texts, and distinguishing feature is：Generally existing negative word in comment text.The appearance of negative word The feeling polarities in its confining spectrum are inverted, can be brought to the supervised learning sorting algorithm based on bag of words certain negative It influences.Therefore it needs to handle negative word before carrying out feature selecting.Xia [5] et al. is detected using negative word and feature Word statistical method obtains corresponding converting text to the affective characteristic words conversion of text, and the text after making full use of extension carries out Pairs of training and test.The method reduce the dependences to additional corpus data, to the emotion text adaptability in each field It is relatively strong, but this method, because to invert to most of affective characteristic words outside negative word confining spectrum, there are some features Noise, if the emotion text bad to Chinese word segmentation effect, has certain negative effect.

Summary of the invention

The object of the invention is that providing a kind of sentiment dictionary and Ka Fang based on extension to solve the above-mentioned problems The Chinese affective characteristics selection method of model.

The present invention is achieved through the following technical solutions above-mentioned purpose：

1, the present invention includes the following steps：

(1) card side's feature selection module：Assume initially that Feature Words with classification are directly incoherent, if divided using card side It is bigger that the calculated test value of cloth deviates threshold value, then more confident negative null hypothesis, receives the standby of null hypothesis and then assume：Feature Word and classification have the very high degree of association；The calculation formula of card side's model is as follows：

Wherein, A_ijIt indicates to include characteristic item t_iAnd belong to class c_jNumber of documents, B_ijIt indicates to include characteristic item t_iBut do not belong to In class c_jNumber of documents, C_ijIt indicates not including characteristic item t_iBut belong to class c_jNumber of documents, D_ijIt indicates not including characteristic item t_i And it is not belonging to class c_jNumber of documents, N indicate total number of documents amount；

(2) polarity number of affective characteristics item is incorporated into card side's model：One by one by the affective characteristics item in emotion comment text It is compared with the dictionary with feeling polarities value, if in dictionary, the value of this feature item in dictionary is made for the affective characteristics item For the feeling polarities value of this feature item, such as " outstanding " corresponding value in dictionary is 0.919, then is used as " outstanding " for 0.919 Feeling polarities value；If this feature item does not appear in dictionary, calculated using formula (6), such as " bright in luster ", is led to Crossing formula and calculating its feeling polarities value is 0.705；The final polarity number for obtaining all affective characteristics items in text；

From the definition of card side's model, it can be found that, matrix A indicates that the document comprising characteristic item belongs to the quantity of each class, is special Most important one in sign selection, its value directly determines the final value of feature selecting；In all identical situation of other situations, It include characteristic item t in matrix A_iAnd belong to class c_jValue A_ijIt is bigger, then characteristic item t_iTo class c_jContribution margin it is higher, characteristic item t_iWith classification c_jJust there is the deeper degree of association, at this time by characteristic item t_iChoose the accuracy rate that can be improved text classification；Together When, other matrix Bs, C, D are calculated by matrix A；

But matrix A only considered number of documents, not consider to the feeling polarities of characteristic item itself, and feeling polarities value is high Characteristic item in emotional semantic classification have biggish effect；The present invention considers this influence of two parts factor to classification, to feelings Sense polarity number is merged with matrix A, and the number of documents comprising characteristic item is enclosed to the feeling polarities value of this feature item, matrix A develops into A'_ij=A_ij×E_ij；Such as in all similar situation of other situations, characteristic item " adjusting ", " liking " is in commendation class The number of files of middle appearance is respectively 5,5, and original matrix A judges the two words to the important journey of commendation class simply by document frequency Degree, such case are then difficult to distinguish which word more important to commendation class；And for modified matrix A ', due to the pole of the two words Property value be respectively 0.1,0.9, then its matrix A ' in it is corresponding value be 5.5,9.5, be equivalent to the feature high for feeling polarities value Item provides an increment, larger can will possibly occur less but feeling polarities value in training corpus in the feature selecting stage Higher feature is chosen；

Wherein, E_ijIt is an emotional intensity matrix, row indicates that affective characteristics item, column indicate class, and first row indicates derogatory sense class, Secondary series indicates commendation class, e_iIt is the characteristic item emotional value obtained using sentiment dictionary and formula (6), wherein working as e_iIt, should when >=0 Characteristic item is commendatory term, then E_ijIn value in two classes be respectively：

From formula (8) as can be seen that when this feature item is commendatory term, A'_ijBe equivalent to in matrix A include characteristic item t_i And belong to class c₁The value A of commendation class_i1It has been multiplied by a big weight 1+e_i, include characteristic item t in matrix A_iAnd belong to class c₀Derogatory sense The value A of class_i0It remains unchanged；

Work as e_iWhen < 0, this feature item is derogatory term, then E_ijTwo class intermediate values be respectively：

From formula (9) as can be seen that when this feature item is derogatory term, A'_ijBe equivalent to in matrix A include characteristic item t_i And belong to class c₀The value A of derogatory sense class_i0It has been multiplied by a big weight 1+ | e_i|, it include characteristic item t in matrix A_iAnd belong to class c₁It praises The value A of adopted class_i1It remains unchanged；

After finding out A' according to above-mentioned formula, in conjunction with feature quantity N, B', C', D' are found out, finally obtains new characteristic value：

(3) frequency information and card side's models coupling between characteristic item class：Card side's model consider always comprising characteristic item or Number of documents without characteristic item, not in view of the frequency that characteristic item occurs in class, but characteristic frequency is also feature selecting One key factor；Such as characteristic item t_iAnd t_kIt is worth relatively by card side's model is calculated, if but characteristic item t_iIn class c_jThe frequency of middle appearance is very big, and t_kThe frequency of appearance is seldom, then t_iIn class c_jExpressive ability obviously than t_jBy force, but card side Model does not reflect the difference of this respect but；TF (t is utilized in the present invention_i,c_j) indicate characteristic item t_iIn classification c_jIn word frequency Number, formula are：

Wherein tf_k(t_i,c_j) indicate feature t_iBelonging to class c_jDocument d_kIn word frequency number, tf (t_i,c_j)_minIndicate feature t_iBelonging to class c_jSingle document in the minimum word frequency number that occurs, f (t_i,c_j)_maxIndicate feature t_iBelonging to class c_jSingle document in The maximum word frequency number of appearance, | c_j| expression belongs to class c_jNumber of documents；

In view of influence of the number of documents to it between different classes of, now it is normalized：

As characteristic item normalizes word frequency in each class；

(4) affective characteristics select formula：Frequency between the polarity number of affective characteristics item and the class of characteristic item is incorporated into Ka Fangmo In type, final formula is as follows：

According to formula (13), the amendment chi-square value of each affective characteristics can be acquired, to all values by descending Sequence arranges, and takes out affective characteristics corresponding to certain amount of maximum modified chi-square value as final affective characteristics, then Feature vector is carried out to emotion text according to selected affective characteristics, text classifier is recycled to classify.

The beneficial effects of the present invention are：

The present invention is the Chinese affective characteristics selection method of a kind of sentiment dictionary based on extension and Ka Fang model, and existing Technology is compared, and the present invention is directed to current Research Challenges and deficiency, is proposed in sentiment dictionary and Ka Fang model based on extension Literary affective characteristics selection method.First combine Hownet and it is improved based on the polarity value calculating method of word frequency to feelings each in dictionary Feel word and carry out polarity number calculating, establishes the dictionary for having feeling polarities value；Then by negative word in comment text clause and There is number and is detected in it, and handles the emotion word in negative word confining spectrum, to effectively limit negative The negative effect of word bring；Finally word frequency between characteristic item feeling polarities value and relevant feature class is melted with card side's model It closes, substantially has modified the defect of card side's model, so that there are improved model better affective characteristics to select performance, it can It is effective to improve text emotion classifying quality.

Detailed description of the invention

Fig. 1 is the dictionary Establishing process figure with feeling polarities value of the invention；

Fig. 2 is feeling polarities judging nicety rate curve graph of the invention；

Fig. 3 is hotel's data classification precision curve graph of the invention；

Fig. 3 (a) is to be classified using Nae Bayesianmethod to hotel's Chinese comment data, and Fig. 3 (b) is to utilize support Vector machine classifies to hotel's data；

Fig. 4 is book data nicety of grading curve graph of the invention；

Fig. 4 (a) is to be classified using Nae Bayesianmethod to books Chinese comment data, and Fig. 4 (b) is to utilize support Vector machine classifies to book data；

Fig. 5 is notebook data nicety of grading curve graph of the invention；

Fig. 5 (a) is to be classified using Nae Bayesianmethod to notebook Chinese comment data, and Fig. 5 (b) is to utilize branch Vector machine is held to classify to notebook data.

Specific embodiment

The present invention will be further explained below with reference to the attached drawings：

Establish the dictionary with feeling polarities value

The polarity number calculating of affective characteristics item is mainly the following method：One is semantic-based method, using knowing Adopted elite tree is netted to calculate affective characteristics item and pass judgement on the similarity between adopted benchmark word, using the similarity as the affective characteristics item Polarity number^[8-9], the feeling polarities value based on Hownet, which calculates, depends on benchmark word, can not calculate simultaneously for some unregistered words Its feeling polarities value；One is Statistics-Based Methods^[10], pass through the co-occurrence information between statistics affective characteristics item and benchmark word Obtaining its polarity number, this method calculates simple, but the support of Big-corpus is needed, while also relying on benchmark word, when Between complexity it is high；There are also one is the affection computation methods based on word frequency^[11], the individual character using affective characteristics item passing judgement on adopted feelings The word frequency information that occurs obtains its feeling polarities value in sense dictionary, and this method does not depend on benchmark word, while not needing a large amount of Corpus carries out co-occurrence search, calculates simple, and time complexity is low, but it compares the vocabulary dependent on sentiment dictionary.

Present invention combination Hownet and affection computation method based on word frequency, and related improvement has been carried out to it, it obtains final Feeling polarities value.

Lexicon extension and pretreatment：

(1) present invention combines the sentiment dictionary of Hownet publication and Taiwan Univ.'s natural language processing laboratory to provide simplified Chinese sentiment dictionary (NTUSD) is pre-processed (merging and deletion including emotion word), the emotion being expanded to the dictionary Dictionary.

(2) word each in sentiment dictionary is resolved into individual character, then counts all words and occurs in passing judgement on adopted sentiment dictionary Quantity, formed have word frequency dictionary.

Similarity calculation based on Hownet：

Hownet be one it is fairly perfect, using concept representated by Chinese and english as object, to disclose concept and general The commonsense knowledge base based on relation on attributes between thought.

Chinese vocabulary description in Hownet is all based on " justice is former " this basic conception.Due in Chinese a word not Different concepts may be given expression in same context, therefore, word is organized into the set of several concepts by Hownet, each general Read all by one group of description justice it is former represented by.As shown in table 1, vocabulary " disagreeable " is there are three concept, and two of them are adjectives, and one It is verb, each concept is indicated by multiple justice are former, such as " aValue | attribute value ", " easiness | difficulty or ease ".Liu Qun^[6]Et al. The calculation method of two semantic similarities of the word based on Hownet is described in detail.

The former representation of the justice of 1 vocabulary of table

Polarity value calculating method of the emotion word based on Hownet is a kind of method based on sentiment dictionary, it is to choose some praise Then derogatory sense benchmark word calculates emotion word with the tightness degree for passing judgement on adopted benchmark word according to emotion word.The present invention uses 40 pairs are passed judgement on adopted benchmark word^[8], wherein commendation benchmark vocabulary is shown as p_i(i takes 1~40), derogatory sense benchmark vocabulary are shown as n_j(j takes 1~ 40).Specific formula for calculation is as follows：

Wherein, sim (p_i, w) and indicate the emotion word and commendation benchmark word p_iSimilarity, sim (n_j, w) and indicate the emotion word With derogatory sense benchmark word n_jSimilarity, e_hownet(w) polarity number of the emotion word is indicated.Work as e_hownet(w) 0 >, then by the emotion Word is classified as commendatory term, e_hownet(w) 0 <, then be classified as derogatory term for the emotion word, and the emotion that numerical values recited represents emotion word is strong Degree.

Again to e_hownet(w) minimax normalized, the as polarity number based on Hownet of the emotion word are carried out.2.3 Similarity calculation based on word frequency

The meaning of Chinese terms is expressed by forming the Chinese character of the word, therefore can be by the polarity number group of emotion word It is indicated at the tendency degree function of all Chinese characters of the word.And the tendency degree of Chinese character, by calculating the Chinese character in sentiment dictionary The frequency of appearance obtains^[11]。

If emotion word W is made of k Chinese character, it is expressed as W=w₁w₂...w_k, Chinese character w_iThe Sentiment orientation degree of (i takes 1~k) For T_i, then the polarity number e of emotion word_tf(w) shown in calculating such as formula (2).

Wherein, λ_iIndicate composition Chinese character w_iWeight,The meaning of emotion word is codetermined by each Chinese character that forms , here equal weight can be set by each Chinese character.T_iIndicate Chinese character w_iSentiment orientation degree, can be by formula (3) (4) (5) It obtains jointly.

Wherein,Indicate the initial tendency degree of Chinese character, f_Pi、f_NiRespectively indicate Chinese character w_iIn commendation dictionary and derogatory sense dictionary The frequency of appearance, n, m are illustrated respectively in the quantity of different Chinese character in commendation dictionary and derogatory sense dictionary.σ is the normalized factor, Its formula is：

According to the calculation method of formula (3), if the Chinese character occurs in commendation dictionary but number is indefinite, but in derogatory term Do not occur in allusion quotation, then the Sentiment orientation degree calculated is all 1, this is not obviously consistent with practical, Chinese character Sentiment orientation degree be usually with Pass judgement on the increase of frequency of occurrences difference in adopted dictionary and increase, therefore need to be to the T in formula (3)_iIt is weighted amendment, such as Shown in formula (5)：

Wherein, λ indicates the absolute value for the frequency difference that the Chinese character occurs in passing judgement on adopted dictionary.

According to above-mentioned formula, polarity number of the emotion word based on word frequency can be obtained.For any one emotion word W, tendency Spend e_tf(w) value range is -1~1, e_tf(w) being greater than 0 indicates that the vocabulary reaches positive emotion, and what is expressed less than 0 is negative Emotion, e_tf(w) absolute value is bigger, and the emotion for illustrating that the vocabulary reaches is stronger.

Establish the dictionary for having feeling polarities value：

Present invention combination Hownet and emotion value calculating method based on word frequency, and both are obtained by linear regression model (LRM) The best weights weight values of method, obtain the final dictionary with feeling polarities value, and specific establishment process is as shown in Figure 1.

Wherein, feature of present invention word polarity number can be obtained by formula (1) and formula (2) weighting.Calculation formula such as formula (6) institute Show.

E=α × e_hownet+β×e_tf (6)

Wherein, the value range of α and β is 0~1, and alpha+beta=1.The present invention is using linear regression model (LRM) to sentiment dictionary All posting terms of middle known polarity are tested, and curve graph such as Fig. 2 institute that its feeling polarities judging nicety rate changes with α is obtained Show.As α=0, the polarity of emotion word is completely by e_tfIt determines, as α=1, the polarity of emotion word is completely by e_hownetIt determines, pole Property judging nicety rate is lower.As α=0.451, β=0.549, accuracy rate reaches peak.

All Feature Words polarity numbers in sentiment dictionary are calculated using formula (6), if Feature Words polarity judges incorrectly, Its polarity number is then set as 0, if polarity correct judgment, calculated result is the feeling polarities value of the word.

Processing to negative word：

In the machine learning classification to emotion comment text, bag of words are current most widely used feature vectors Mode, but bag of words ignore the sequence of word in sentence, destroy syntactic structure, cannot be disposed to negative word.Such as There is negative word in fruit comment text, the sentence feeling polarities in its confining spectrum can be converted to its opposite polarity, such as： " I/like/this/book " and " I do not like/or not/this/book ", there is negative word in second comment sentence, by the comment sentence Polarity inverted, and in bag of words, this phenomenon is not reflected, thus by machine learning classification, easily by second Sentence is mistakenly classified as commendation class.In order to reduce such case, the present invention is before carrying out feature selecting, first to negative word And its text in confining spectrum is handled.

Establish antisense dictionary：

In emotion comment text, the emotion description to things is generally described with verb, adjective, adverbial word.Therefore, greatly Part verb, adjective, adverbial word feeling polarities with respect to other parts of speech for, it is obvious.Based on this feature, the present invention couple All emotion word carries out part of speech judgement in training set, by belong to verb, adjective, adverbial word emotion word extract and be included in word In allusion quotation, the emotion word polarity in dictionary is calculated by formula (6), then presses the commendation word set of acquisition and derogatory sense word set It is ranked up according to the sequence of emotional intensity from big to small.If P_iFor the commendation emotion word of commendation polarity number ranking i-th bit, N_iTo demote The derogatory sense emotion word of adopted polarity number ranking i-th bit can form and pass judgement on emotion word one-to-one relationship (P_i,N_i), wherein i≤min (len(P_i),len(N_i)), len (P_i) indicate commendation word set quantity, len (N_i) indicate derogatory sense word set quantity.For not having The emotion word for forming corresponding relationship is substantially negligible its polarity number, so in antonym due to these word polarity number very littles Directly this partial feeling word is deleted in allusion quotation.

Detection and processing negative word：

It is detected using all negative words of the look-up table to emotion comment text training set and test set, such as table 2.It determines The position that negative word occurs.

Table 2 negates vocabulary

For the range that negative word defines, negative word is directly started that (emotion comment text is by making pauses in reading unpunctuated ancient writings to clause by the present invention Punctuate is separated into multiple clauses) confining spectrum of the range as negative word between ending word^[5].In the range that negative word defines Interior, the emotion word that will appear in antisense dictionary is all converted to its corresponding antonym, other emotion words remain unchanged, and will Negative word removes.Such as：Become after " I do not like/or not/this/book " conversion " I/disagreeable/this/book ".

Occurs even number negative word in if clause, such as：" I do not think/or not/this/hotel/or not clean ", " there is no/mono- A/people/or not satisfied " etc., this generic clause is double negative sentences^[12], there is no quilts for the emotion word polarity in negative word confining spectrum Reversion, therefore to the clause for even number negative word occur, directly negative word is removed, the emotion word in negative word confining spectrum is protected It holds constant；To the clause for odd number negative word occur, the emotion word in negative word confining spectrum is inverted, while will negative Word removes.

Feature selecting is carried out by negative word after treatment, then to all affective characteristic words.

Improved feature selection approach：

Card side's feature selection module：

The basic thought of " hypothesis testing " in statistics is utilized in card side's model：Assume initially that Feature Words are direct with classification It is incoherent, if bigger using the calculated test value deviation threshold value of chi square distribution, more confident negative null hypothesis, Receive the standby of null hypothesis then to assume：Feature Words and classification have the very high degree of association.The calculation formula of card side's model is as follows：

Wherein, A_ijIt indicates to include characteristic item t_iAnd belong to class c_jNumber of documents, B_ijIt indicates to include characteristic item t_iBut do not belong to In class c_jNumber of documents, C_ijIt indicates not including characteristic item t_iBut belong to class c_jNumber of documents, D_ijIt indicates not including characteristic item t_i And it is not belonging to class c_jNumber of documents, N indicate total number of documents amount.

The polarity number of affective characteristics item is incorporated into card side's model：

Affective characteristics item in emotion comment text is compared with the dictionary with feeling polarities value one by one, if the emotion Characteristic item is in dictionary, then using the value of this feature item in dictionary as the feeling polarities value of this feature item, such as " outstanding " in word Corresponding value is 0.919 in allusion quotation, then by the 0.919 feeling polarities value as " outstanding ".If this feature item does not appear in dictionary In, then it is calculated using formula (6), such as " bright in luster ", calculating its feeling polarities value by formula is 0.705.Finally obtain Obtain the polarity number of all affective characteristics items in text.

From the definition of card side's model, it can be found that, matrix A indicates that the document comprising characteristic item belongs to the quantity of each class, is special Most important one in sign selection, its value directly determines the final value of feature selecting.In all identical situation of other situations, It include characteristic item t in matrix A_iAnd belong to class c_jValue A_ijIt is bigger, then characteristic item t_iTo class c_jContribution margin it is higher, characteristic item t_iWith classification c_jJust there is the deeper degree of association, at this time by characteristic item t_iChoose the accuracy rate that can be improved text classification.Together When, other matrix Bs, C, D are calculated by matrix A.

But matrix A only considered number of documents, not consider to the feeling polarities of characteristic item itself, and feeling polarities value is high Characteristic item in emotional semantic classification have biggish effect.The present invention considers this influence of two parts factor to classification, to feelings Sense polarity number is merged with matrix A, and the number of documents comprising characteristic item is enclosed to the weight (feeling polarities of this feature item Value), matrix A develops into A'_ij=A_ij×E_ij.Such as in all similar situation of other situations, characteristic item " adjusting ", " liking " The number of files occurred in commendation class is respectively 5,5, and original matrix A judges the two words to commendation class simply by document frequency Significance level, such case is then difficult to distinguish which word more important to commendation class.And for modified matrix A ', due to this two The polarity number of a word is respectively 0.1,0.9, then its matrix A ' in it is corresponding value be 5.5,9.5, be equivalent to as feeling polarities value High characteristic item provides an increment, larger can will possibly occur less but feelings in training corpus in the feature selecting stage The sense higher feature of polarity number is chosen.

From formula (8) as can be seen that when this feature item is commendatory term, A'_ijBe equivalent to in matrix A include characteristic item t_i And belong to class c₁The value A of (commendation class)_i1It has been multiplied by a big weight (1+e_i), it include characteristic item t in matrix A_iAnd belong to class c₀ The value A of (derogatory sense class)_i0It remains unchanged.

From formula (9) as can be seen that when this feature item is derogatory term, A'_ijBe equivalent to in matrix A include characteristic item t_i And belong to class c₀The value A of (derogatory sense class)_i0Be multiplied by a big weight (1+ | e_i|), it include characteristic item t in matrix A_iAnd belong to class c₁The value A of (commendation class)_i1It remains unchanged.

Frequency information and card side's models coupling between characteristic item class：

What card side's model considered always includes characteristic item or the number of documents without characteristic item, not in view of characteristic item exists The frequency occurred in class, but characteristic frequency is also a key factor of feature selecting.Such as characteristic item t_iAnd t_kBy card side's model It is calculated to be worth relatively, if but characteristic item t_iIn class c_jThe frequency of middle appearance is very big, and t_kThe frequency of appearance is seldom, Then t_iIn class c_jExpressive ability obviously than t_jBy force, but card side's model does not reflect the difference of this respect.It is sharp in the present invention With TF (t_i,c_j) indicate characteristic item t_iIn classification c_jIn word frequency number, formula is：

Wherein tf_k(t_i,c_j) indicate feature t_iBelonging to class c_jDocument d_kIn word frequency number, tf (t_i,c_j)_minIndicate feature t_iBelonging to class c_jSingle document in the minimum word frequency number that occurs, f (t_i,c_j)_maxIndicate feature t_iBelonging to class c_jSingle document in The maximum word frequency number of appearance, | c_j| expression belongs to class c_jNumber of documents.

As characteristic item normalizes word frequency in each class.

Affective characteristics of the present invention select formula：

The present invention incorporates frequency between the polarity number of affective characteristics item and the class of characteristic item in card side's model, final formula As follows：

Description of test：

Experimental data：

The present invention using Tan Song wave acquisition hotel, books, three fields of notebook Chinese comment data, each field It respectively include 2000 fronts and 2000 negative reviews

3 experimental data of table

Algorithm experimental：

(base is calculated for main contrast forefathers' algorithm document [3], document [5] and basic IG feature selecting algorithm of the present invention Method), to verify the validity of inventive algorithm.

Five folding cross validations are carried out to each data set：It is 5 equal portions by data set random division, wherein four parts are used as training Data, it is in addition a to be used as test data, data set is trained and is tested by this mode, and utilizes naive Bayesian point Two kinds of classification methods of class and support vector machines are classified.Obtain 5 numerical value are averaged and obtain final result.Experiment As a result as follows：

1. pair hotel's Chinese comment data carry out the result of naive Bayesian and support vector cassification as shown in figure 3, from Fig. 3 can be seen that inventive algorithm during feature quantity is 200~3000 all better than other algorithms.

2. pair books Chinese comment data carries out naive Bayesian and the result of support vector cassification is as shown in Figure 4：

From fig. 4, it can be seen that inventive algorithm is better than other algorithms using effect when Naive Bayes Classifier, utilize When support vector machine classifier, inventive algorithm reaches F maximum value 0.917 when feature quantity is 2600, and document [5], text [3], base algorithm are offered when feature quantity is 2600, F value is 0.911,0.906,0.893 respectively.Inventive algorithm is than other Algorithm will be got well.

3. pair notebook Chinese comment data carries out naive Bayesian and the result of support vector cassification is as shown in Figure 5：

From fig. 5, it can be seen that inventive algorithm is substantially better than other algorithms during feature quantity is 200~3000. When inventive algorithm utilizes Naive Bayes Classification, when feature quantity is 1800, reach F maximum value 0.922, document [5], text Offer [3], base algorithm is 0.907,0.897,0.896 when feature quantity is 1800, inventive algorithm is obviously better than other Algorithm.When using support vector cassification, the fluctuation of document [5] algorithm is too big, document [3] although, base algorithm fluctuation it is small, Maximum F value is smaller, respectively 0.909,0.882, and inventive algorithm lies substantially in the peak of all algorithms, and F value is up to 0.92, it is better than other algorithms.

The present invention is tested by the Chinese comment data to hotel, books, notebook three fields and verifies this hair Bright algorithm.Naive Bayesian is utilized from being can be seen that after the present invention carries out affective characteristics selection to each data set in Fig. 3~5 When classification, relative to basic algorithm, F value improves 4% or so, and relative to existing innovatory algorithm, F value improves 2% or so；Benefit When with support vector cassification, relative to basic algorithm, F value improves 3% or so, and relative to existing innovatory algorithm, F value is improved 1% or so.Illustrate that inventive algorithm further improves the effect of text emotion feature selecting compared to existing algorithm.

Conclusion：

Feeling polarities value and word frequency information are dissolved into card side's model on the basis of forefathers study and carry out feelings by the present invention The selection for feeling characteristic item, is optimized feature selection approach, this method is additionally contemplates that negative word to text emotion pole simultaneously The influence of property, is detected and is judged to negative word, and handled the emotion word in negative word confining spectrum, further mentioned The high accuracy of emotion text classification.Future can continue to explore the pass of feature vector weight in feeling polarities value and classifier Connection optimizes emotion text classifier using feeling polarities value, improves the nicety of grading that classifier is directed to emotion text.

Bibliography：

[1] emotional orientation analysis computer age .2017,3 of Yang Kui, the Duan Qiongjin based on sentiment dictionary method:10.

[2]Xu Y,Chen L.Term-frequency Based Feature Selection Methods for Text Categorization[C]//Fourth International Conference on Genetic and Evolutionary Computing.IEEE Computer Society,2010:280-283.

[3] Ah antelope Song, Liu Haifeng, Liu keep optimization CHI text feature selection method of the raw based on position and word frequency information [J].Computer Science\s&\sapplication,2015,05(9):322-330.

[4] Shi Hui, Jia Daiping, seedling train improvement information gain text feature selection algorithm computer of the based on word frequency information Using 2014,34 (11):3279.

[5]Xia R,Xu F,Zong C,et al.Dual Sentiment Analysis:Considering Two Sides of One Review[J].IEEE Transactions on Knowledge&Data Engineering,2015, 27(8):2120-2133.

[6] Liu Qun, Li Su build and are based on《Hownet》Similarity of Words calculate [J] Chinese computing linguistics, 2002.

[7]Lee S Y M.Sentiment Classification and Polarity Shifting.[C]// International Conference on Computational Linguistics.Association for Computational Linguistics,2010:635-643.

[8] Wang Zhenyu, Wu Zeheng, Hu Fangtao calculate [J] computer work based on the word feeling polarities of HowNet and PMI Journey, 2012,38 (15):187-189.

[9] a kind of Beijing semantic similarity calculation method Information technology based on HowNet of Fan Hongyi, Zhang Yangsen College journal (natural science edition), 2014,29 (4):42.

[10] Wu Jinyuan, Ji Junzhong, Zhao Xuewu, Wu Chensheng, Du Fang China are based on the emotion word weight of Feature Selection Calculate Beijing University of Technology journal, 2016,42 (1):142.

[11] Xu Xiaodan, Duan Zhengjie, it is old in educate emotion method for digging mountain of the based on expanding sentiment dictionary and characteristic weighing Eastern college journal (engineering version), 2014,6 (44):15-19.

[12] Liu Yujiao, jade pendant are taken root, Wu Shaomei, Chinese text emotion point of the Soviet Union Chong based on emotion dictionary in conjunction with conjunction Class Sichuan University journal (natural science edition), 2015,52 (1):57-62.

Basic principles and main features and advantages of the present invention of the invention have been shown and described above.The technology of the industry Personnel are it should be appreciated that the present invention is not limited to the above embodiments, and the above embodiments and description only describe this The principle of invention, without departing from the spirit and scope of the present invention, various changes and improvements may be made to the invention, these changes Change and improvement all fall within the protetion scope of the claimed invention.The claimed scope of the invention by appended claims and its Equivalent thereof.

Claims

1. a kind of Chinese affective characteristics selection method of sentiment dictionary and Ka Fang model based on extension, which is characterized in that including Following steps：

(1) card side's feature selection module：Assume initially that Feature Words with classification are directly incoherent, if utilizing chi square distribution meter It is bigger that the test value of calculating deviates threshold value, then more confident negative null hypothesis, receives the standby of null hypothesis and then assume：Feature Words with Classification has the very high degree of association；The calculation formula of card side's model is as follows：

Wherein, A_ijIt indicates to include characteristic item t_iAnd belong to class c_jNumber of documents, B_ijIt indicates to include characteristic item t_iBut it is not belonging to class c_j Number of documents, C_ijIt indicates not including characteristic item t_iBut belong to class c_jNumber of documents, D_ijIt indicates not including characteristic item t_iAnd not Belong to class c_jNumber of documents, N indicate total number of documents amount；

(2) polarity number of affective characteristics item is incorporated into card side's model：By the affective characteristics item in emotion comment text one by one with band Have emotion polarity number dictionary compare, if the affective characteristics item in dictionary, using the value of this feature item in dictionary be used as this The feeling polarities value of characteristic item, such as " outstanding " corresponding value in dictionary is 0.919, then by 0.919 feelings as " outstanding " Feel polarity number；If this feature item does not appear in dictionary, calculated using formula (6), such as " bright in luster ", passes through public affairs It is 0.705 that formula, which calculates its feeling polarities value,；The final polarity number for obtaining all affective characteristics items in text；

From the definition of card side's model, it can be found that, matrix A indicates that the document comprising characteristic item belongs to the quantity of each class, is feature choosing Most important one in selecting, its value directly determines the final value of feature selecting；In all identical situation of other situations, matrix It include characteristic item t in A_iAnd belong to class c_jValue A_ijIt is bigger, then characteristic item t_iTo class c_jContribution margin it is higher, characteristic item t_iWith Classification c_jJust there is the deeper degree of association, at this time by characteristic item t_iChoose the accuracy rate that can be improved text classification；Meanwhile Other matrix Bs, C, D are calculated by matrix A；

But matrix A only considered number of documents, not consider to the feeling polarities of characteristic item itself, and the spy that feeling polarities value is high Levy item has biggish effect in emotional semantic classification；This influence of two parts factor to classification is considered herein, to feeling polarities Value is merged with matrix A, and the number of documents comprising characteristic item is enclosed to the feeling polarities value of this feature item, and matrix A develops For A'_ij=A_ij×E_ij；Such as in all similar situation of other situations, characteristic item " adjusting ", " liking " occurs in commendation class Number of files be respectively 5,5, original matrix A judge simply by document frequency the two words to the significance level of commendation class, this Kind of situation is then difficult to distinguish which word more important to commendation class；And for modified matrix A ', due to the polarity number of the two words Respectively 0.1,0.9, then its matrix A ' in it is corresponding value be 5.5,9.5, be equivalent to and mentioned for the high characteristic item of feeling polarities value For an increment, it is higher larger can will possibly to occur less but feeling polarities value in training corpus in the feature selecting stage Feature to being chosen；

Wherein, E_ijIt is an emotional intensity matrix, row indicates affective characteristics item, and arranging indicates class, first row expression derogatory sense class, second Column indicate commendation class, e_iIt is the characteristic item emotional value obtained using sentiment dictionary and formula (6), wherein working as e_iWhen >=0, this feature Item is commendatory term, then E_ijIn value in two classes be respectively：

From formula (8) as can be seen that when this feature item is commendatory term, A'_ijBe equivalent to in matrix A include characteristic item t_iAnd belong to In class c₁The value A of commendation class_i1It has been multiplied by a big weight 1+e_i, include characteristic item t in matrix A_iAnd belong to class c₀Derogatory sense class Value A_i0It remains unchanged；

From formula (9) as can be seen that when this feature item is derogatory term, A'_ijBe equivalent to in matrix A include characteristic item t_iAnd belong to In class c₀The value A of derogatory sense class_i0It has been multiplied by a big weight 1+ | e_i|, it include characteristic item t in matrix A_iAnd belong to class c₁Commendation class Value A_i1It remains unchanged；

(3) frequency information and card side's models coupling between characteristic item class：What card side's model considered always includes characteristic item or is free of The number of documents of characteristic item, not in view of the frequency that characteristic item occurs in class, but characteristic frequency is also one of feature selecting Key factor；Such as characteristic item t_iAnd t_kIt is worth relatively by card side's model is calculated, if but characteristic item t_iIn class c_jIn The frequency of appearance is very big, and t_kThe frequency of appearance is seldom, then t_iIn class c_jExpressive ability obviously than t_jBy force, but Ka Fangmo Type does not reflect the difference of this respect but；TF (t is utilized herein_i,c_j) indicate characteristic item t_iIn classification c_jIn word frequency number, it is public Formula is：

Wherein tf_k(t_i,c_j) indicate feature t_iBelonging to class c_jDocument d_kIn word frequency number, tf (t_i,c_j)_minIndicate feature t_i? Belong to class c_jSingle document in the minimum word frequency number that occurs, f (t_i,c_j)_maxIndicate feature t_iBelonging to class c_jSingle document in occur Maximum word frequency number, | c_j| expression belongs to class c_jNumber of documents；

As characteristic item normalizes word frequency in each class；

(4) affective characteristics select formula：Frequency between the polarity number of affective characteristics item and the class of characteristic item is incorporated in card side's model, Final formula is as follows：

According to formula (13), the amendment chi-square value of each affective characteristics can be acquired, descending sequence is pressed to all values Arrangement, takes out affective characteristics corresponding to certain amount of maximum modified chi-square value as final affective characteristics, then basis Selected affective characteristics carry out feature vector to emotion text, and text classifier is recycled to classify.