CN110096597A

CN110096597A - A text TF-IDF feature reconstruction method incorporating emotional intensity

Info

Publication number: CN110096597A
Application number: CN201910224082.0A
Authority: CN
Inventors: 邓修齐; 康琦; 张量
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2019-03-22
Filing date: 2019-03-22
Publication date: 2019-08-06
Anticipated expiration: 2039-03-22
Also published as: CN110096597B

Abstract

The present invention relates to a text TF-IDF feature reconstruction method that incorporates emotional intensity. Emoticons and user names are extracted and separated by regular-expression matching; word intensity is corrected according to an intensity dictionary and the positional relationships of negation words, degree adverbs, and repeated words; and new words are replaced by a Word2Vec-based near-synonym replacement method, so that the TF-IDF feature vector of the text is reconstructed. Compared with the prior art, the invention takes negation words, degree adverbs, and repeated words into account when correcting the TF-IDF features of words, retaining information such as word intensity and position; it replaces new words in the test set with known words that occur in the training set, improving generalization; and it accepts the original sentence directly as input, with no manual word segmentation required.

Description

A text TF-IDF feature reconstruction method incorporating emotional intensity

Technical field

The invention belongs to the field of classification in natural language processing and relates to a text classification preprocessing method, in particular to a text TF-IDF feature reconstruction method incorporating emotional intensity.

Background

In natural language processing and machine learning, term frequency-inverse document frequency (TF-IDF) is commonly used to construct the feature vector of a text. Internet language, typified by microblog posts, contains special components such as emoticons and user names; existing methods do not handle them, which confuses information. Elements of Chinese text such as negation words, degree adverbs, and repeated words directly affect the emotional intensity and polarity of the text, but the feature vectors produced by existing methods cannot retain this information, which distorts it. New words that appear in the test set or in actual use but not in the training set are discarded by existing methods, which loses information.

Summary of the invention

The object of the present invention is to overcome the above defects of the prior art by providing a preprocessing method for text sentiment analysis: emoticons and user names are extracted and separated by regular-expression matching, word intensity is corrected according to an intensity dictionary and the positional relationships of negation words, degree adverbs, and repeated words, and new words are replaced by a Word2Vec-based near-synonym replacement method, so that the TF-IDF feature vector of the text is reconstructed.

The purpose of the present invention can be achieved through the following technical solutions:

A text TF-IDF feature reconstruction method incorporating emotional intensity, comprising the following steps:

S1: construct a stop-word dictionary, a degree dictionary, and a negation dictionary, where the words in the degree dictionary are degree adverbs with emotional intensity levels and the words in the negation dictionary are negation words;

S2: obtain the text to be analyzed and split it into clauses, using punctuation marks as delimiters;

S3: traverse each word in a clause and record the number of times it occurs and its positions, remove the stop words, correct the emotional intensity of the words following a degree adverb, and flip the emotional polarity of the words following a negation word;

S4: create a blank dictionary for each text to be analyzed, keyed by word, with the word's emotional intensity and count as the values; traverse each word; if the current word is a stop word, degree adverb, or negation word, skip it and do nothing; if the dictionary does not yet contain the current word, store it in the dictionary; if the dictionary already contains the current word, update the emotional intensity and count of that word in the dictionary;

S5: extract the TF-IDF feature values of the text and multiply the TF-IDF value of each word by the corresponding emotional intensity in the dictionary to obtain the reconstructed feature value:

$$\text{TF-IDF}_{new,w} = \text{TF-IDF}_{w} \times deg_w$$

where $\text{TF-IDF}_{new,w}$ is the TF-IDF feature value of word w after reconstruction, $\text{TF-IDF}_{w}$ is the original TF-IDF feature value of word w, and $deg_w$ is the emotional intensity of word w.

The stop-word dictionary includes English characters, digits, and mathematical symbols.

The text to be analyzed is microblog text containing user names and emoticons. In step S2, regular-expression matching is first used to match and extract the user names (the text following the "@" symbol) and the emoticons (the text within "[]" symbols), so that they are distinguished from the plain text and the sentiment-bearing words inside them do not affect the emotion of the whole text.

In step S2, the delimiters between clauses are punctuation marks.

These punctuation marks do not include the pause mark (enumeration comma), quotation marks, dashes, single quotation marks, or colons.

In step S3, the emotional intensity of a word is calculated as:

$$deg_w = (-1)^n \prod_{i=1}^{m} pow_i$$

where $deg_w$ is the emotional intensity of word w, the word is preceded by m degree adverbs and n negation words, and $pow_i$ is the intensity value of the i-th degree adverb.

In step S3, if a negation word appears before a degree adverb, the emotional intensity of the corresponding word is corrected by taking its negative reciprocal:

$$deg_w \leftarrow -\frac{1}{deg_w}$$

When the method is initialized, a list is first built to store all words that occur during training. In step S4, each word is compared with the list; when the word is a new word that is not in the list, the near-synonym replacement method is applied and the new word is replaced with the word in the list that has the highest similarity.

In step S4, after all words in the text have been stored in the dictionary, the emotional intensity of the words is also weighted, specifically through the following steps:

1) Divide the total emotional intensity Dict[w][deg] of the word by its number of occurrences Dict[w][count] to obtain the average emotional intensity of word w in the text:

$$\overline{deg}_w = \frac{Dict[w][deg]}{Dict[w][count]}$$

2) Apply an excitation to the total emotional intensity Dict[w][deg] of word w: if $\overline{deg}_w > 0$, Dict[w][deg] is updated to Dict[w][deg] + deg_w + M; if $\overline{deg}_w < 0$, Dict[w][deg] is updated to Dict[w][deg] + deg_w - M. After simplification, the actual emotional intensity of word w is:

$$Deg_w = \begin{cases} \overline{deg}_w + \dfrac{N-1}{N} M, & \overline{deg}_w > 0 \\ \overline{deg}_w - \dfrac{N-1}{N} M, & \overline{deg}_w < 0 \end{cases}$$

where M is the excitation value and N = Dict[w][count] is the number of times word w occurs in the text.

Compared with prior art, the invention has the following advantages that

(1) A degree dictionary is built, and the intensity of a word modified by degree words in a sentence is corrected by weighting; the correction stacks when several degree words occur consecutively. A negation dictionary is built, and the polarity of a word modified by negation words is reversed; the reversal stacks when several negation words occur consecutively.

(2) The original sentence is segmented at non-morpheme elements such as punctuation marks and user names, and the corrective effect of negation words and degree adverbs is valid only within a segment, which avoids the emotional ambiguity and confusion that long sentences and complex sentence patterns tend to produce.

(3) User names and emoticons in the text are matched and extracted by regular-expression matching and kept apart from the plain-text words, which avoids confusing information.

(4) New words in the test set are replaced with known words that occur in the training set, which improves generalization.

(5) The original sentence can be used directly as input, with no manual word segmentation required.

(6) Negation words, degree adverbs, and repeated words are taken into account when correcting the TF-IDF features of words, so information such as word intensity and position is retained.

Description of the drawings

Fig. 1 is a schematic flow chart of the method of this embodiment;

Fig. 2(a) and 2(b) show one example, in which TF-IDF features are extracted with the conventional method (Fig. 2(a)) and with the method of the invention (Fig. 2(b)) and the results are visualized as histograms;

Fig. 3(a) and 3(b) show another example, in which TF-IDF features are extracted with the conventional method (Fig. 3(a)) and with the method of the invention (Fig. 3(b)) and the results are visualized as histograms;

Fig. 4 is a schematic diagram of the hyperplane of the conventional method;

Fig. 5 is a schematic diagram of the hyperplane after feature reconstruction in this embodiment;

Fig. 6 compares the performance of the SVM classification models obtained with the conventional method and with the method of this embodiment, trained on a training set of 10,000 texts and tested on a test set of 500 texts.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The embodiment is implemented on the premise of the technical solution of the invention, and a detailed implementation and specific operating procedure are given, but the scope of protection of the invention is not limited to the following embodiment.

Embodiment

A text TF-IDF feature reconstruction method incorporating emotional intensity first builds a stop-word dictionary, a degree dictionary, and a negation dictionary. The stop words mainly include English characters, digits, mathematical symbols, punctuation marks, and Chinese words with extremely high usage frequency, such as "you", "I", "because", and "and". The degree dictionary contains a series of adverbs that modify the strength of adjectives and adverbs, together with their corresponding emotional intensities. This embodiment uses the relatively comprehensive "degree-level words" lexicon released by HowNet in 2007 as the degree dictionary and divides its degree adverbs into six levels, "extremely", "super", "very", "relatively", "slightly", and "insufficiently", which correspond to the six intensities 1.7, 1.5, 1.3, 1.1, 0.8, and 0.5, respectively. The negation dictionary contains common negation words, rendered in translation as "no", "not", and "non-". As shown in Fig. 1, the method includes the following steps:

1. Regular-expression matching

First, user names and emoticons in the text are matched and extracted by regular-expression matching so that they are kept apart from the plain-text words. Each time an emoticon or user name is matched, its content and its position in the original sentence are recorded.

User name regular expression: "@{1}\w{1,30}$|@{1}\w{1,30}\s". Meaning: "@" followed by a string of 1 to 30 characters (Chinese or English), followed by a user-name terminator (end of sentence or whitespace).

Microblog emoticon regular expression: "\[\w{1,5}\]". Meaning: a string of 1 to 5 characters enclosed in square brackets "[]".
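The two patterns can be tried out with Python's re module. The following is a minimal sketch rather than the implementation used in the patent: the function name extract_special_tokens, the tuple layout, and the choice to replace each match with a space are assumptions made for illustration.

```python
import re

# Patterns as described in the text: "@" + 1..30 word characters + terminator,
# and 1..5 word characters enclosed in square brackets.
USERNAME_RE = re.compile(r"@{1}\w{1,30}$|@{1}\w{1,30}\s")
EMOTICON_RE = re.compile(r"\[\w{1,5}\]")


def extract_special_tokens(text):
    """Return (tokens, plain_text).

    tokens is a list of (kind, content, position) sorted by position in the
    original sentence; plain_text is the sentence with the matches blanked out.
    (Hypothetical helper, not named in the source.)
    """
    tokens = []
    for kind, pattern in (("user", USERNAME_RE), ("emoticon", EMOTICON_RE)):
        for match in pattern.finditer(text):
            tokens.append((kind, match.group().strip(), match.start()))
    tokens.sort(key=lambda item: item[2])
    # Assumption: matches are replaced by a space so neighbouring words stay apart.
    plain_text = EMOTICON_RE.sub(" ", USERNAME_RE.sub(" ", text))
    return tokens, plain_text


tokens, plain = extract_special_tokens("@miuky you look great today [happy][happy]")
```

In Python 3, \w also matches Chinese characters, so the same patterns apply to microblog text without modification.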

2. Punctuation segmentation (coarse segmentation)

Using punctuation marks as delimiters, the original sentence is divided into clauses (coarse segmentation). The pause mark, quotation marks, dashes, single quotation marks, and colons are excluded from the delimiters, because they do not interrupt the semantic logic or emotional continuity of a sentence. The modifying effect of degree adverbs and negation words applies only inside their own clause and does not affect other clauses. This step simplifies the feature reconstruction rules that follow and avoids mixed-up or redundant information.
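Coarse segmentation can be sketched with a single regular expression. The delimiter set below (commas, full stops, exclamation marks, question marks, and semicolons, in both full-width and half-width forms) is an assumption; the source only states which marks are excluded. The name split_clauses is likewise hypothetical.

```python
import re

# Assumed clause delimiters. The pause mark, quotation marks, dashes,
# single quotation marks and colons are deliberately left out.
CLAUSE_DELIMS = "，。！？；,.!?;"
_SPLIT_RE = re.compile("[" + re.escape(CLAUSE_DELIMS) + "]+")


def split_clauses(text):
    """Split a sentence into non-empty clauses at the assumed delimiters."""
    return [part.strip() for part in _SPLIT_RE.split(text) if part.strip()]


clauses = split_clauses('I am tired, but happy: really! "Yes"')
```

The colon and the quotation marks stay inside their clauses, so modifiers on either side of them can still interact.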

3. Information mining

On the basis of the coarse segmentation, each clause is segmented into words (fine segmentation, for example with the jieba word segmentation library in Python). Each word in the clause is traversed, and the number of times it occurs and its positions are recorded. All stop words are then removed, the emotional degree of the words following a degree adverb is corrected, and the emotional polarity of the words following a negation word is flipped.

Let the modified word be w, preceded by m degree adverbs and n negation words, where $pow_i$ is the intensity value of the i-th degree adverb. The "emotional degree" attribute of word w is then calculated as:

$$deg_w = (-1)^n \prod_{i=1}^{m} pow_i$$

If several degree adverbs precede the modified word, its emotional intensity is the product of the intensities of all of them; that is, the effects of degree adverbs stack. If several negation words precede the modified word, then by the "double negation is affirmation" principle an odd number of negation words is equivalent to a single negation and an even number is equivalent to none.
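The stacking rule for degree adverbs and the odd/even rule for negation words can be sketched as a single pass over the tokens of one clause. This is an illustrative sketch under stated assumptions: the tiny English dictionaries stand in for the real degree and negation dictionaries, modifiers are assumed to apply to every later word in the same clause, and stop-word removal is omitted.

```python
# Hypothetical stand-ins for the degree and negation dictionaries.
DEGREE = {"extremely": 1.7, "very": 1.3, "slightly": 0.8}
NEGATION = {"not", "no"}


def word_intensities(tokens, degree=DEGREE, negation=NEGATION):
    """Return (word, position, intensity) for each content word of a clause.

    Intensity is the product of the preceding degree-adverb values, with the
    sign flipped once per preceding negation word.
    """
    adverb_product = 1.0   # product of pow_i seen so far in this clause
    negation_count = 0     # n, number of negation words seen so far
    result = []
    for position, token in enumerate(tokens):
        if token in degree:
            adverb_product *= degree[token]
        elif token in negation:
            negation_count += 1
        else:
            result.append((token, position, (-1) ** negation_count * adverb_product))
    return result


print(word_intensities(["not", "like", "driving"]))
```

Because the counters are local to one call, running the function once per clause keeps modifiers from leaking into neighbouring clauses.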

4. Data correction

This step corrects the intensity value of a word for special cases in Chinese. One of them arises when a word is modified by both a negation word and a degree adverb, in which case the relative position of the degree adverb and the negation word must be considered. For example, "I am not especially happy" expresses a positive emotion: the negation word "not" modifies "especially", turning its original strengthening effect into a weakening one. The overall emotion is still the positive "happy", but its degree is weaker than that of plain "happy". Therefore, when a negation word appears before a degree adverb, the negative reciprocal of the emotional intensity obtained in the previous step is taken, so that the word is corrected to a positive emotion and its intensity is weakened at the same time. That is:

$$deg_w \leftarrow -\frac{1}{deg_w}$$
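The correction for a negation word that precedes a degree adverb can be illustrated on the example sentence. The sketch below assumes that the modifiers in front of the word are available as a list and that "especially" carries an intensity of 1.5; both the helper names and that value are hypothetical.

```python
def negation_precedes_adverb(modifiers, degree_words, negation_words):
    """True if some negation word appears before some degree adverb."""
    seen_negation = False
    for token in modifiers:
        if token in negation_words:
            seen_negation = True
        elif token in degree_words and seen_negation:
            return True
    return False


def correct_intensity(deg_w, modifiers, degree_words, negation_words):
    """Take the negative reciprocal of deg_w when negation precedes a degree adverb."""
    if negation_precedes_adverb(modifiers, degree_words, negation_words):
        return -1.0 / deg_w   # deg_w is never zero: it is a signed product of non-zero values
    return deg_w


# "not especially happy": one negation and one adverb give -1.5 before correction.
corrected = correct_intensity(-1.5, ["not", "especially"], {"especially"}, {"not"})
```

With these numbers the intensity of "happy" moves from -1.5 to about 0.67: positive, but weaker than the unmodified value of 1.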

5. Intensity weighting

A blank dictionary is created for each text to be analyzed, keyed by word, with the word's intensity and count as the values.

All words are traversed again. If the current word is a stop word, degree adverb, or negation word, it is skipped and nothing is done. If the dictionary does not yet contain the current word, the word is stored in the dictionary. If the dictionary already contains the current word, the "intensity" (deg) and "count" (count) entries of that word are updated; in addition to this update, the intensity attribute is given an extra excitation value M to increase the emotional intensity of repeated words:

Dict[w][deg] += deg_w

Dict[w][count] += 1

After all words of the text have been stored in the dictionary, the word intensities are weighted. First, the total intensity Dict[w][deg] of the word is divided by its number of occurrences Dict[w][count] to obtain the mean intensity of word w in the document:

$$\overline{deg}_w = \frac{Dict[w][deg]}{Dict[w][count]}$$

Because occurrences of the same word within one text may have positive or negative emotional intensity, the averaged intensity $\overline{deg}_w$ may also be positive or negative. The excitation M is intended to strengthen words that are repeated in the sentence, but adding a positive excitation to a word whose emotional intensity is negative would weaken its original emotion or even reverse its polarity. The mean intensity $\overline{deg}_w$ is therefore calculated first, and a positive or negative excitation is then added according to its sign.

If word w occurs N times in the text, that is, the final Dict[w][count] = N, the actual emotional intensity of word w is:

$$Deg_w = \begin{cases} \overline{deg}_w + \dfrac{N-1}{N} M, & \overline{deg}_w > 0 \\ \overline{deg}_w - \dfrac{N-1}{N} M, & \overline{deg}_w < 0 \end{cases}$$

As N increases, $\frac{N-1}{N}$ approaches 1, so the excitation $\frac{N-1}{N} M$ saturates as the number of occurrences N grows. In other words, a word that occurs repeatedly in the text has its emotional intensity strengthened, but the strengthening has an upper limit, and each additional occurrence contributes less. The specific value of the excitation M is chosen by the user and serves as an adjustable parameter of the method.
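The accumulation and weighting steps can be sketched as follows. The sketch assumes one excitation per repeated occurrence, so a word seen N times receives N - 1 excitations, which after averaging gives a boost of (N - 1) / N times M in the direction of the mean. Treating a mean of exactly zero as "no excitation" is a further assumption, as are the function names and the value M = 0.3.

```python
def accumulate(word_intensities):
    """Build the per-text dictionary: word -> {"deg": total intensity, "count": occurrences}."""
    table = {}
    for word, deg_w in word_intensities:
        entry = table.setdefault(word, {"deg": 0.0, "count": 0})
        entry["deg"] += deg_w
        entry["count"] += 1
    return table


def real_intensity(entry, excitation):
    """Mean intensity plus a saturating boost whose sign follows the mean."""
    n = entry["count"]
    mean = entry["deg"] / n
    boost = (n - 1) / n * excitation   # approaches `excitation` as n grows
    if mean > 0:
        return mean + boost
    if mean < 0:
        return mean - boost
    return mean                        # assumption: no boost when the mean is exactly zero


table = accumulate([("happy", 1.3), ("happy", 1.3), ("sad", -1.0)])
weighted = {word: real_intensity(entry, excitation=0.3) for word, entry in table.items()}
```

Here "happy" occurs twice, so its mean of 1.3 is raised by half of M, while "sad" occurs once and keeps its mean of -1.0.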

6. New word replacement

When the method is initialized, a list is built to store all words that occur during training. During testing and practical use, each word is compared with the words in the list. When a new word that is not in the list appears, the method applies a Word2Vec-based near-synonym replacement (for example, the synonyms package in Python): it compares the cosine similarity of word pairs in the word-vector space and replaces the new word with the known word in the list that has the highest similarity, which improves the generalization of the model.
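The replacement rule reduces to a nearest-neighbour search by cosine similarity. The sketch below uses hand-written two-dimensional toy vectors in place of real Word2Vec embeddings, so the vectors, the vocabulary, and the function names are all illustrative assumptions; in practice the vectors would come from a trained embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def replace_new_word(word, known_words, vectors):
    """Map an unseen word to the most similar training-set word.

    Words already known, or words without a vector, are returned unchanged.
    """
    if word in known_words or word not in vectors:
        return word
    candidates = [k for k in known_words if k in vectors]
    if not candidates:
        return word
    return max(candidates, key=lambda k: cosine(vectors[word], vectors[k]))


# Toy vectors standing in for Word2Vec embeddings (hypothetical values).
vectors = {"sad": [0.9, 0.1], "happy": [-0.8, 0.3], "grief": [0.85, 0.2]}
print(replace_new_word("grief", ["sad", "happy"], vectors))
```

Returning the word unchanged when no vector is available is one possible fallback; the source does not say how such words are handled.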

7. Feature reconstruction

The TF-IDF features of the text are extracted by an existing method (for example, sklearn.feature_extraction.text.TfidfVectorizer in Python), and the TF-IDF value of each word is multiplied by the corresponding emotional intensity in the dictionary to obtain the reconstructed feature value:

$$\text{TF-IDF}_{new,w} = \text{TF-IDF}_{w} \times Deg_w$$

$\text{TF-IDF}_{new,w}$: the TF-IDF feature value of word w after reconstruction;

$\text{TF-IDF}_{w}$: the original TF-IDF feature value of word w obtained from TfidfVectorizer;

$Deg_w$: the emotional intensity value of word w.

In this way, the TF-IDF values in the feature vector are reconstructed and a new feature vector is obtained.
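The reconstruction itself is an element-wise product. The sketch below works on plain dictionaries so that it runs without third-party packages; the TF-IDF values would normally come from a vectorizer such as TfidfVectorizer. The function name and the default intensity of 1.0 for words without a dictionary entry are assumptions.

```python
def reconstruct_features(tfidf, intensity, default=1.0):
    """Multiply each word's TF-IDF value by its emotional intensity.

    tfidf:     word -> original TF-IDF value
    intensity: word -> emotional intensity from the per-text dictionary
    default:   intensity assumed for words missing from `intensity`
    """
    return {word: value * intensity.get(word, default) for word, value in tfidf.items()}


original = {"like": 0.45, "drive": 0.89}
flipped = reconstruct_features(original, {"like": -1.0, "drive": -1.0})
```

With a negation word in front of both words, the magnitudes are unchanged and only the signs flip, which is what keeps a negated sentence from receiving the same feature vector as its affirmative counterpart.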

Fig. 2 and Fig. 3 show a specific example in which TF-IDF features are extracted with the conventional method and with the new method proposed by the invention, and the results are visualized. The comparison shows that the conventional method loses the position information and degree information of the sentence, so that two texts with opposite meanings end up with identical features; the TF-IDF feature reconstruction method proposed by the invention retains the position and degree information of the text and reflects them in the TF-IDF features.

Fig. 4 and Fig. 5 illustrate the effect of the proposed TF-IDF feature reconstruction method: Fig. 4 is a schematic diagram of the hyperplane of the conventional method, and Fig. 5 is the schematic diagram after feature reconstruction. In the conventional method the degree adverbs lose their modifying effect, whereas after reconstruction the corresponding features are retained, which improves classification accuracy.

Fig. 6 compares the performance of SVM classification models trained on a training set of 10,000 texts and tested on a test set of 500 texts. The comparison shows that the proposed TF-IDF feature reconstruction method outperforms the conventional method in accuracy, recall, precision, and F1 score.

The following four comparative examples illustrate the method:

Example 1:@also like today miuky today you are very good-looking [happy] [happy]

Compared with the conventional method, the user name "@also like today miuky" and the emoticon "[happy]" are segmented correctly.

Example 2: I do not like driving

Method	Feature vector
Conventional method	driving: 0.89, like: 0.45
Method proposed by the present invention	like: -0.46, driving: -0.89

Compared with the conventional method, the reconstructed feature vector retains the information carried by the negation word "not".

Example 3: I do not like driving very much

Method	Feature vector
Method proposed by the present invention	like: -0.79, driving: -1.51

The reconstructed feature vector retains the intensity information of the degree adverb "very".

Example 4: Today my heart is full of grief and indignation (note: the vocabulary of the trained model contains the word "sad" but not "grief and indignation")

Method	Feature vector
Conventional method	Today: 0.86 today: 0.52
Method proposed by the present invention	Heart: 0.52 grief and indignation: 1.11

Compared with the conventional method, the new word "grief and indignation" is recognized, and the word information is retained more completely.

Claims

1. A text TF-IDF feature reconstruction method incorporating emotional intensity, characterized by comprising the following steps:

S1: construct a stop-word dictionary, a degree dictionary, and a negation dictionary, where the words in the degree dictionary are degree adverbs with emotional intensity levels and the words in the negation dictionary are negation words;

S2: obtain the text to be analyzed and split it into clauses, using punctuation marks as delimiters;

S3: traverse each word in a clause and record the number of times it occurs and its positions, remove the stop words, correct the emotional intensity of the words following a degree adverb, and flip the emotional polarity of the words following a negation word;

S4: create a blank dictionary for each text to be analyzed, keyed by word, with the word's emotional intensity and count as the values; traverse each word; if the current word is a stop word, degree adverb, or negation word, skip it and do nothing; if the dictionary does not yet contain the current word, store it in the dictionary; if the dictionary already contains the current word, update the emotional intensity and count of that word in the dictionary;

S5: extract the TF-IDF feature values of the text and multiply the TF-IDF value of each word by the corresponding emotional intensity in the dictionary to obtain the reconstructed feature value:

$$\text{TF-IDF}_{new,w} = \text{TF-IDF}_{w} \times deg_w$$

where $\text{TF-IDF}_{new,w}$ is the TF-IDF feature value of word w after reconstruction, $\text{TF-IDF}_{w}$ is the original TF-IDF feature value of word w, and $deg_w$ is the emotional intensity of word w.

2. The text TF-IDF feature reconstruction method incorporating emotional intensity according to claim 1, characterized in that the stop-word dictionary includes English characters, digits, and mathematical symbols.

3. The text TF-IDF feature reconstruction method incorporating emotional intensity according to claim 1, characterized in that the text to be analyzed is microblog text containing user names and emoticons, and in step S2, regular-expression matching is first used to match and extract the user names and emoticons in the text, so that they are distinguished from the plain text and the sentiment-bearing words inside them do not affect the emotion of the whole text.

4. The text TF-IDF feature reconstruction method incorporating emotional intensity according to claim 3, characterized in that a user name in the text is the text following the "@" symbol, and an emoticon is the text within "[]" symbols.

5. The text TF-IDF feature reconstruction method incorporating emotional intensity according to claim 1, characterized in that in step S2, the delimiters between clauses are punctuation marks.

6. The text TF-IDF feature reconstruction method incorporating emotional intensity according to claim 5, characterized in that the punctuation marks do not include the pause mark (enumeration comma), quotation marks, dashes, single quotation marks, or colons.

7. The text TF-IDF feature reconstruction method incorporating emotional intensity according to claim 1, characterized in that in step S3, the emotional intensity of a word is calculated as:

$$deg_w = (-1)^n \prod_{i=1}^{m} pow_i$$

where $deg_w$ is the emotional intensity of word w, the word is preceded by m degree adverbs and n negation words, and $pow_i$ is the intensity value of the i-th degree adverb.

8. The text TF-IDF feature reconstruction method incorporating emotional intensity according to claim 7, characterized in that in step S3, if a negation word appears before a degree adverb, the emotional intensity of the corresponding word is corrected by taking its negative reciprocal:

$$deg_w \leftarrow -\frac{1}{deg_w}$$

9. The text TF-IDF feature reconstruction method incorporating emotional intensity according to claim 1, characterized in that when the method is initialized, a list is first built to store all words that occur during training; in step S4, each word is compared with the list, and when the word is a new word that is not in the list, the near-synonym replacement method is applied and the new word is replaced with the word in the list that has the highest similarity.

10. The text TF-IDF feature reconstruction method incorporating emotional intensity according to claim 1, characterized in that in step S4, after all words in the text have been stored in the dictionary, the emotional intensity of the words is also weighted, specifically through the following steps:

1) divide the total emotional intensity Dict[w][deg] of the word by its number of occurrences Dict[w][count] to obtain the average emotional intensity of word w in the text:

$$\overline{deg}_w = \frac{Dict[w][deg]}{Dict[w][count]}$$

2) calculate the actual emotional intensity of word w:

$$Deg_w = \begin{cases} \overline{deg}_w + \dfrac{N-1}{N} M, & \overline{deg}_w > 0 \\ \overline{deg}_w - \dfrac{N-1}{N} M, & \overline{deg}_w < 0 \end{cases}$$

where M is the excitation value and N = Dict[w][count] is the number of times word w occurs in the text.