CN106547866B - A kind of fine granularity sensibility classification method based on the random co-occurrence network of emotion word - Google Patents
- Publication number: CN106547866B (also published as CN106547866A)
- Application number: CN201610936655.9A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

- G06F16/374 — Information retrieval of unstructured textual data; creation of semantic tools, e.g. ontology or thesauri; thesaurus
- G06F16/355 — Information retrieval of unstructured textual data; clustering; classification; class or cluster creation or modification
- G06F40/284 — Handling natural language data; natural language analysis; lexical analysis, e.g. tokenisation or collocates
Abstract

A fine-grained sentiment classification method based on a random co-occurrence network of emotion words. Using random network theory and the word co-occurrence phenomenon, and through marking with the emotional ontology vocabulary dictionary, a word-order-based random network model built with affective features is formed, i.e. an emotion word co-occurrence network model. Model reduction is carried out on this basis; the emotion word longest matching method is combined with the TC algorithm for SWLM-TC unsupervised-learning classification, or the emotion word longest matching method is further combined with the HMM machine-learning algorithm to establish a fine-grained sentiment classification model with which classification prediction is realized. The invention achieves fine-grained emotion classification of paragraph-level text and improves the precision of the plain TC algorithm, making classification more accurate; using SWLM-TC to train the HMM model on the sample set and then performing mood classification on the library of samples under test improves the automation of the plain machine-learning algorithm.
Description
Technical field
The invention belongs to the technical field of information retrieval, and more particularly to a fine-grained sentiment classification method based on a random co-occurrence network of emotion words.
Background technology
In recent years, with the rapid development of the economy and information technology, the Internet has profoundly influenced the pattern of social development and generated a huge driving force for the economy. Internet users generate a vast sea of information, and as the mobile Internet takes hold, the popularization of all kinds of intelligent mobile devices lets information spread across the Internet at lower cost and higher speed. Different types of information produce different effects: negative speech can affect netizens negatively, and the spread of malicious rumors and public incidents can not only influence individuals' emotions but even cause huge economic losses. Mining emotional information has therefore become an urgent problem. Regarding the construction of text emotion corpora, existing corpora include the Pang corpus, the Whissell corpus, the Berardinelli film review corpus, and product review corpora, while resources for Chinese emotion corpus annotation are fewer: Tsinghua University has annotated emotion material for some tourism attraction descriptions, intended to assist speech synthesis, but its scale is small. Blogs, forum texts and online news with commentary are internationally called new text; such new text on the network provides a data source for sentiment analysis, and its analysis has become a focus of current research. In today's information age the network has become a part of people's lives, mood parsing has become an important reference for understanding netizens' true thoughts, and in the contingency management of public incidents, studying network public sentiment through new text has become a new direction.
Orientation research on text is at present relatively deep and comparatively successful for product reviews and film reviews. Given the complexity of language, differences in individual expression, and the lack of a systematic description of how human emotion forms, fine-grained sentiment analysis remains rare. Chinese has evolved with free grammar, a large vocabulary, free word forms and other traits that set it apart from English sentiment analysis; the semantic analyses commonly used for English are much harder to apply to Chinese, causing many difficulties. Sentiment analysis is inseparable from psychology. Psychological research has found that the relation between vocabulary and human emotion can be measured, and that the semantic tendency of individual words or phrases is important for conveying human emotion. Research shows the semantic tendency of words and phrases exhibits two main phenomena: 1) emotion terms of the same tendency often occur together; 2) emotion terms of opposite tendencies generally occur at different times. These two phenomena allow sentiment analysis to be simplified considerably. Research has shown that word co-occurrence networks built from plain English and Chinese text both satisfy the small-world property, and text segmentation and topic extraction have been studied on the basis of such networks; random network models have been used for text topic analysis and for sentiment orientation classification, but no published work so far applies random network theory to fine-grained sentiment analysis of text.
Summary of the invention

To overcome the shortcomings of the above prior art, an object of the present invention is to provide a fine-grained sentiment classification method based on a random co-occurrence network of emotion words, which combines emotion word longest matching SWLM (Sentimental Word Longest Match) with machine-learning algorithms to achieve fine-grained emotion classification of paragraph-level text.
To achieve these objects, the technical solution adopted by the present invention is as follows:
A fine-grained sentiment classification method based on a random co-occurrence network of emotion words uses random network theory and the word co-occurrence phenomenon: through marking with the emotional ontology vocabulary dictionary, a word-order-based random network model built with affective features is formed, i.e. the emotion word co-occurrence network model. Model reduction is carried out on this basis, and the emotion word longest matching method (SWLM, Sentimental Word Longest Match) is combined with the TC algorithm for SWLM-TC unsupervised-learning classification, or the emotion word longest matching method is further combined with the HMM machine-learning algorithm to establish a fine-grained sentiment classification model and perform classification prediction with the model.
The building process of the emotion word co-occurrence network model is as follows:

1) Split each text into an ordered group of sentences S1→S2→…→Sn;
2) Segment each sentence Si, filter out stop words and meaningless notional words, and mark emotion words with the emotion vocabulary ontology, obtaining an ordered group of emotion words W1→W2→…→Wn;
3) For each sentence, extract word pairs <wi, wj> from the sentence with a sliding window of width WL (window length; typically 2). If wi ∉ W, add a new node wi to W and set its weight nwi to an initial value of 1; otherwise increment nwi by 1. If (wi, wj) ∉ E, add a new edge (wi, wj) to E and set its weight nwi,wj to an initial value of 1; otherwise increment nwi,wj by 1;
4) Once all texts have been processed, the network model G is complete.

Here S denotes a sequence composed of multiple sentences; w denotes an extracted emotion word, w ∈ Σ, where Σ is the Chinese word set, i.e. the emotional ontology word set obtained after removing stop words and meaningless notional words and then marking with the emotion vocabulary ontology; W is the node set of the network model G, W = {wi | i ∈ [1, N]}, where N is the number of nodes of G; E is the edge set of the network model G, with M edges: E = {(wi, wj) | wi, wj ∈ W, and a sequential co-occurrence relation exists between wi and wj}, where (wi, wj) denotes a directed edge from node wi to node wj; NW is the set of node weights in G, NW = {nwi | wi ∈ W}; NE is the set of edge weights in G, where nwi,wj is the weight of the edge between wi and wj: NE = {nwi,wj | (wi, wj) ∈ E}.
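As a concrete illustration of steps 1)-4), the following is a minimal Python sketch of the construction, assuming pre-segmented sentences and a dictionary lookup standing in for the emotion vocabulary ontology (both are hypothetical stand-ins, not the patent's actual tooling):

```python
from collections import defaultdict

def build_cooccurrence_network(texts, emotion_lexicon, wl=2):
    """Build the directed emotion-word co-occurrence network (steps 1-4).

    texts: iterable of texts, each a list of segmented sentences,
        each sentence a list of words (pre-segmented stand-in for the
        patent's sentence splitting and word segmentation).
    emotion_lexicon: dict mapping emotion words to ontology entries;
        a stand-in for the emotion vocabulary ontology lookup.
    wl: sliding-window width WL (the patent typically uses 2).
    Returns node weights NW and directed edge weights NE.
    """
    nw = defaultdict(int)   # NW: node weights
    ne = defaultdict(int)   # NE: edge weights, keyed by (wi, wj)
    for sentences in texts:
        for sentence in sentences:
            # step 2: keep only marked emotion words, preserving order
            words = [w for w in sentence if w in emotion_lexicon]
            # step 3: slide a WL-wide window over the emotion words
            for i in range(len(words) - wl + 1):
                wi, wj = words[i], words[i + wl - 1]
                nw[wi] += 1           # first occurrence initialises to 1
                nw[wj] += 1
                ne[(wi, wj)] += 1     # directed edge wi -> wj
    return nw, ne
```

With WL = 2 the extracted pairs are simply adjacent emotion words, which matches the close sequential co-occurrence the model relies on.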
In the emotion vocabulary ontology, emotion is divided into 7 major classes and 21 subclasses. The classes are: joy {happy (PA), relieved (PE)}; goodness {respect (PD), praise (PH), belief (PG), liking (PB), wish (PK)}; anger {angry (NA)}; sorrow {sad (NB), disappointed (NJ), remorseful (NH), longing (PE)}; fear {panicked (NI), frightened (NC), shy (NG)}; disgust {annoyed (NE), loathing (ND), blaming (NN), jealous (NK), suspicious (NL)}; and surprise {amazed (PC)}. Emotion intensity (power) takes five levels, 1, 3, 5, 7 and 9, with 9 the strongest and 1 the weakest. The parts of speech in the emotion vocabulary ontology fall into 7 classes: noun, verb, adjective (adj), adverb (adv), network word (nw), idiom, and prepositional phrase (prep). The ontology contains 27,466 emotion words in total.
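For concreteness, one ontology entry can be pictured as a small record; the sketch below shows the hypothetical shape assumed by the construction code above (the real ontology's file layout differs):

```python
# Hypothetical shape of one ontology entry, with the fields the patent
# describes: part of speech, subclass code, major class, intensity
# (1/3/5/7/9) and polarity. Real DUTIR entries differ in layout.
emotion_lexicon = {
    "快乐": {"pos": "noun", "subclass": "PA", "class": "joy",
             "power": 5, "polarity": 1},
}
```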
The network model G is divided into 7 sub-networks according to the seven moods: joy, goodness, anger, sorrow, fear, disgust and surprise. If a sub-network fractures during splitting, the broken sub-blocks are reconnected through the node with the highest weight. This builds the seven sub-networks Gx | x = {1,2,3,4,5,6,7}, i.e. G1, G2, G3, G4, G5, G6, G7, usable for fine-grained calculation.
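A minimal sketch of this splitting and fracture-repair step, assuming the edge and node weights built earlier, a `class` field in each lexicon entry, and the networkx library (all assumptions, not named in the patent):

```python
import networkx as nx

def split_subnetworks(ne, nw, emotion_lexicon):
    """Split G into one sub-network per major emotion class, repairing
    fractures through the highest-weight node as described above."""
    subnets = {}
    for cls in ("joy", "goodness", "anger", "sorrow",
                "fear", "disgust", "surprise"):
        g = nx.DiGraph()
        for (wi, wj), w in ne.items():
            if (emotion_lexicon[wi]["class"] == cls
                    and emotion_lexicon[wj]["class"] == cls):
                g.add_edge(wi, wj, weight=w)
        # fracture repair: reconnect every broken sub-block through the
        # highest-weight node of the sub-network
        comps = list(nx.weakly_connected_components(g))
        if len(comps) > 1:
            hub = max(g.nodes, key=nw.get)
            for comp in comps:
                if hub not in comp:
                    g.add_edge(hub, max(comp, key=nw.get), weight=1)
        subnets[cls] = g
    return subnets
```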
The emotion word longest matching method performs longest matching through the maximum-weight vocabulary of emotion words, so that texts can be accurately classified under the relevant emotion topics without word-sense disambiguation or denoising; weight calculation is carried out through the seven sub-classification models, producing parameters with which machine-learning classification can be performed.
When classifying, the following definitions are used:

Longest weight matching path length dmax(S): in network Gx | x = {1,2,3,4,5,6,7}, if two emotion words are covered sequentially, they are matched through the edge that directly joins them; if two emotion words are separated in network Gx, the path passing through the maximum-weight nodes is chosen. The resulting length is taken as the length of S:

d_max(S) = Σ_{i=1}^{n-1} d_max(w_i, w_{i+x})

where dmax(wi, wi+x) is the maximum-weight matching path from the i-th word to the (i+x)-th word in the network.

Emotion weight coefficient SW (Sentimental weight): the proportion of emotion polarity held by each of the seven sub-networks within network G. Using this coefficient makes the classification more pronounced and reduces classification problems caused by fuzzy boundaries. With freq the number of recurrences of a word in the emotion word network and P its polarity intensity, the calculation formulas are:

WC_i = freq × P
W_y = Σ_{i=1}^{n} WC_i
SW_x = W_{yx} / Σ_{i=1}^{7} W_{yi}

where WCi is the emotion value of each word in a sub-network, Wy is the emotion value of the sub-network, and SWx is the SW value, i.e. the emotion weight coefficient, of sub-network x.

Classification coefficient CC (Classification coefficient): after the maximum matching word path is determined, with Re the reproduction degree of a word on this path and power its emotion intensity, and assuming n words, the calculation formulas are:

CC_i = Re × power
CC = Σ_{i=1}^{n} CC_i

where CCi is the classification coefficient of a single word.

Classification prediction coefficient CPC (Classification prediction coefficient): the prediction mechanism adopted for samples whose class cannot be determined when classifying with a machine-learning algorithm. The SWx values are sorted in descending order: if SW1 + SW2 > 80% and SW1/SW2 > 1.5, the sample is assigned under SW1; if SW1 + SW2 > 80% and SW1/SW2 <= 1.5, the sample is in this case assigned under both the SW1 and SW2 attributes; if SW1 + SW2 < 80%, the article's classification is considered complex and the sample is assigned under the corresponding class according to the classification coefficient.
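The CPC rule reduces to a small decision function; a sketch under the assumption that the SW values are held as fractions summing to 1 (names hypothetical):

```python
def cpc_assign(sw, cc):
    """Sketch of the CPC rule above. sw: dict emotion class -> SW value
    (fractions summing to 1); cc: dict emotion class -> classification
    coefficient CC (both assumed computed beforehand)."""
    ranked = sorted(sw, key=sw.get, reverse=True)
    c1, c2 = ranked[0], ranked[1]
    if sw[c1] + sw[c2] > 0.8:
        if sw[c2] == 0 or sw[c1] / sw[c2] > 1.5:
            return [c1]          # one class clearly dominates
        return [c1, c2]          # otherwise keep the top two attributes
    # complex emotion: fall back to the classification coefficient
    return [max(cc, key=cc.get)]
```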
The SWLM-TC method comprises the following steps:

1) Split the article to be classified into sentences, giving the ordered sentence sequence S'1→S'2→…→S'n;
2) Segment each sentence in order, remove meaningless notional words and auxiliary words, mark the words against the emotion vocabulary ontology dictionary, and keep the marked words in sequence, i.e. W'1→W'2→…→W'n;
3) Search the corresponding sub-network according to the class of each marked word;
4) Perform path selection over the words in the network: if two words are adjacent, use the directly joining path; if two words are not adjacent, choose the maximum-weight path among the paths connecting them, finding the maximum-weight path according to the above steps and obtaining dmax(S);
5) Calculate the classification coefficient CC on the maximum-weight path dmax(S);
6) Calculate the classification coefficient CC under each home sub-network and compare: if the coefficients are equal, use CC × SW, where SW is the emotion weight coefficient (Sentimental weight), i.e. the weight of the emotion class within the 7 sub-networks; if they differ, sort the final classification coefficients CC: if the first weight accounts for 80 percent, the text belongs under the corresponding mood network, and if not, the text is assigned to the two mood networks ranked first and second by weight;
7) If a classification still cannot be assured, perform classification prediction on the text according to the classification prediction coefficient CPC.
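Step 4's path selection can be sketched as a greedy walk through maximum-weight neighbours; this is a simplified reading of the matching rule, with an assumed hop bound, not the patent's exact procedure:

```python
def dmax_pair(adj, node_w, wi, wj, max_hops=6):
    """Greedy sketch of d_max(wi, wj): direct edge if the words co-occur,
    otherwise walk through maximum-weight neighbours. adj: node ->
    successors; node_w: node -> weight; max_hops is an assumption."""
    if wj in adj.get(wi, ()):
        return node_w[wi] + node_w[wj]      # sequentially covered
    total, cur = node_w[wi], wi
    for _ in range(max_hops):
        nxt = max(adj.get(cur, ()), key=node_w.get, default=None)
        if nxt is None:
            return 0                        # no path found
        total += node_w[nxt]
        if nxt == wj:
            return total
        cur = nxt
    return 0

def dmax_sentence(adj, node_w, words):
    # d_max(S): sum of pairwise maximum-weight matching paths over the
    # ordered emotion words of the text
    return sum(dmax_pair(adj, node_w, a, b)
               for a, b in zip(words, words[1:]))
```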
The method for establishing the fine-grained sentiment classification model and performing classification prediction with it is:

1) Use SWLM-TC to perform fine-grained classification on part of the texts among all samples, and calculate the weight coefficient SWx of the emotion to which each text in the sample set belongs; the remaining texts serve as the classification verification experiment;
2) For all samples classified with SWLM-TC: calculate the classification coefficient CC of each text and classify it according to step 6 of the SWLM-TC algorithm, then add the sample under the corresponding emotion classification set TSx (Train Set) of x; if the classification coefficient is indeterminable with SWLM-TC, predict the class using step 7 of the SWLM-TC algorithm and assign the sample to the corresponding class;
3) After the sample data's text emotion has been calculated with the SWLM-TC algorithm, train an HMM classification model with the texts of each class, and then classify with the trained HMM models:
a) For a text under test, classify it with the HMM algorithm; if it can be classified correctly, that determines the text's sub-emotion class;
b) For texts without classification results, perform classification prediction using the classification prediction coefficient CPC.

HMM is a machine-learning method. Emotion calculation is first performed on the sample set with the SWLM-TC algorithm, dividing it into 7 sub-emotion text sample libraries; HMM models are trained with these sample libraries, and the trained HMM models can then be used to run classification test verification on the remaining part of the texts in the text library.
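As an illustration of this final stage, the sketch below trains one sequence model per emotion class on the SWLM-TC-labelled emotion-word strings and classifies a new text by highest likelihood; a first-order Markov chain is used here as a simplified stand-in for the patent's HMM:

```python
import math
from collections import defaultdict

class ClassSequenceModel:
    """Bigram (first-order Markov) model over emotion-word strings; a
    simplified stand-in for the per-class HMM the patent trains."""
    def __init__(self):
        self.trans = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def fit(self, word_strings):
        for words in word_strings:
            for a, b in zip(words, words[1:]):
                self.trans[a][b] += 1
                self.totals[a] += 1

    def log_likelihood(self, words, vocab_size=27466):
        # add-one smoothing keeps unseen transitions from zeroing scores;
        # vocab_size defaults to the ontology's word count
        return sum(math.log((self.trans[a][b] + 1)
                            / (self.totals[a] + vocab_size))
                   for a, b in zip(words, words[1:]))

def classify_hmm(models, words):
    """models: dict emotion class -> fitted ClassSequenceModel."""
    return max(models, key=lambda c: models[c].log_likelihood(words))
```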
Compared with the prior art, the beneficial effects of the invention are:

1. The method can perform fine-grained classification of text emotion; unlike traditional orientation calculation, it provides finer-grained classes.
2. It improves the precision of the plain TC algorithm, making classification more accurate.
3. Using SWLM-TC to train the HMM model on the sample set and then performing mood classification on the library of samples under test improves the automation of the plain machine-learning algorithm.
Brief description of the drawings

Fig. 1 is the overall flow chart of the algorithm of the invention.
Fig. 2 is the flow chart of the SWLM-TC algorithm of the invention.
Fig. 3 is the flow chart of the SWLM-HMM algorithm of the invention.
Fig. 4 is the line chart of experimental data for the word-frequency-based labeling algorithm TC.
Fig. 5 is the line chart of experimental data for the SWLM-TC heuristic algorithm.
Fig. 6 is the line chart of experimental data for the SWLM-HMM algorithm.
Fig. 7 is a schematic diagram of the micro-averaged data in the experiments of the invention.
Fig. 8 is a schematic diagram of the macro-averaged data in the experiments of the invention.
Fig. 9 is a schematic diagram of the classified data distribution in the experiments of the invention (correctly classified).
Fig. 10 is a schematic diagram of the classified data distribution in the experiments of the invention (wrongly assigned to the class).
Fig. 11 is a schematic diagram of the classified data distribution in the experiments of the invention (belonging to the class but wrongly divided).
Embodiments

Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples.

As shown in Fig. 1, the fine-grained sentiment classification method based on a random co-occurrence network of emotion words of the present invention first uses random network theory and the word co-occurrence phenomenon: through marking with the emotional ontology vocabulary dictionary, a word-order-based random network model built with affective features is formed, i.e. the emotion word co-occurrence network model. Model reduction is carried out on this basis; the emotion word longest matching method (SWLM, Sentimental Word Longest Match) is combined with the TC algorithm for SWLM-TC unsupervised-learning classification, or the emotion word longest matching method is further combined with the HMM machine-learning algorithm to establish a fine-grained sentiment classification model and perform classification prediction with the model. The details are as follows:
1 The emotion word co-occurrence model based on random networks

To facilitate fine-grained study of paragraph-level text and discover the inherent laws among emotion words, the present invention improves on the method proposed in [YANG Feng, PENG Qin-ke, XU Tao, Sentiment Classification for Comments Based on Random Network Theory, Acta Automatica Sinica, 2010.6, Vol. 36, No. 6] to build an emotion word co-occurrence network model suited to fine-grained sentiment analysis.
1.1 The emotion vocabulary ontology dictionary

The Chinese emotion vocabulary ontology is a Chinese ontological resource arranged and annotated by the Information Retrieval Laboratory of Dalian University of Technology, under the guidance of Professor Lin Hongfei, through the effort of all laboratory members. Its emotion classification system is built on the basis of Ekman's influential six-class emotion classification system. On Ekman's basis, the vocabulary ontology adds the emotion category "goodness" and divides commendatory emotion more finely. Emotion in the final vocabulary ontology is divided into 7 major classes and 21 subclasses: joy {happy (PA), relieved (PE)}, goodness {respect (PD), praise (PH), belief (PG), liking (PB), wish (PK)}, anger {angry (NA)}, sorrow {sad (NB), disappointed (NJ), remorseful (NH), longing (PE)}, fear {panicked (NI), frightened (NC), shy (NG)}, disgust {annoyed (NE), loathing (ND), blaming (NN), jealous (NK), suspicious (NL)}, and surprise {amazed (PC)}. Emotion intensity (power) takes five levels, 1, 3, 5, 7 and 9, with 9 the strongest and 1 the weakest. The parts of speech in the emotion vocabulary ontology fall into 7 classes: noun, verb, adjective (adj), adverb (adv), network word (nw), idiom, and prepositional phrase (prep). The ontology contains 27,466 emotion words in total.
1.2 The network model

The small-world network model introduced by Watts and Strogatz [Watts D J, Strogatz S H. Collective dynamics of 'small-world' networks. Nature, 1998, 393(6684): 440-442] and the scale-free model proposed by Barabási and Albert [Barabási A L, Albert R. Emergence of scaling in random networks. Science, 1999, 286(5439): 509-512] opened the pioneering work on complex networks. Compared with regular networks and random networks, small-world and scale-free networks have a small average path length and a large clustering coefficient. The co-occurrence network constructed from the associations between words has the characteristic properties of a small-world network. The small average length and large clustering coefficient of the constructed word network can be exploited to classify emotion granularity features rapidly.
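The two small-world statistics this passage relies on can be checked directly on the built network; a brief sketch using the networkx library (an assumed tooling choice, not named in the patent):

```python
import networkx as nx

def smallworld_stats(ne):
    """Check small-world properties of the co-occurrence network.
    ne: the directed edge weights built earlier (dict keyed by word pairs)."""
    g = nx.Graph(list(ne))                  # undirected structural view
    giant = g.subgraph(max(nx.connected_components(g), key=len))
    return (nx.average_shortest_path_length(giant),  # small if small-world
            nx.average_clustering(giant))            # large if small-world
```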
1.3 Emotion word co-occurrence network models

In [Shi Jing, Hu Ming, Dai Guo-Zhong. Topic analysis of Chinese text based on small world model. Journal of Chinese Information Processing, 2007, 21(3): 69-75], a random network model is established according to the ordinary co-occurrence relations between words. In [YANG Feng, PENG Qin-ke, XU Tao, Sentiment Classification for Comments Based on Random Network Theory, Acta Automatica Sinica, 2010.6, Vol. 36, No. 6], a random network model is established according to the sequential co-occurrence relations between words; that network model is used to incrementally create a word-order co-occurrence network for short news comments in cooperation with the SCP algorithm, which brings great benefit to short comments lacking a large term network. That algorithm is well suited to orientation calculation but not to multi-granularity emotion classification.

To enable fine-grained sentiment analysis, the present invention uses emotion words to build an ordered co-occurrence random network model. The co-occurrence order of emotion words embodies semantic information about them; for example, front and rear modification and the co-occurrence distance of emotion words bear a strong relation to the words' semantic relations. The present invention establishes the model according to the close sequential co-occurrence relations of emotion words, i.e. with a small co-occurrence-region window length WL (typically 2), while taking the order relation of word co-occurrence into account.

After the emotion word co-occurrence network is constructed, this large network model is divided into seven small emotion word networks according to the seven emotion major classes, and the related operations are then carried out.
To describe the construction method of the emotion word co-occurrence network model, the relevant mathematical definitions are used:

Σ: the Chinese word set; the word set used by the present invention is the emotional ontology word set obtained after removing stop words and meaningless notional words and then marking with the emotion vocabulary ontology;
w: an extracted emotion word, w ∈ Σ;
S: a sequence composed of multiple sentences;
N: the number of nodes of G;
M: the number of edges of G;
W = {wi | i ∈ [1, N]}: the node set of G;
E = {(wi, wj) | wi, wj ∈ W, and a sequential co-occurrence relation exists between wi and wj}: the edge set of G, where (wi, wj) denotes a directed edge from node wi to node wj;
NW = {nwi | wi ∈ W}: the node weights of G;
NE = {nwi,wj | (wi, wj) ∈ E}: the edge weights of G, where nwi,wj is the weight of the edge between nodes wi and wj.

The method for establishing the emotion word co-occurrence network model G is given below:

1) Split each text into an ordered group of sentences S1→S2→…→Sn;
2) Segment each sentence Si, filter out stop words and meaningless notional words, and mark emotion words with the emotion vocabulary ontology, obtaining an ordered group of emotion words W1→W2→…→Wn;
3) For each sentence Si, extract word pairs <wi, wj> from the sentence with a sliding window of width WL (typically 2). If wi ∉ W, add a new node wi to W and set its weight nwi to an initial value of 1; otherwise increment nwi by 1 (the operation on wj is similar). If (wi, wj) ∉ E, add a new edge (wi, wj) to E and set its weight nwi,wj to an initial value of 1; otherwise increment nwi,wj by 1.
4) After all texts have been processed, the network model G is complete.
5) Divide the network model G into 7 sub-networks according to the seven moods (joy, goodness, anger, sorrow, fear, disgust, surprise). If a sub-network fractures during splitting, reconnect the broken sub-blocks through the highest-weight node. The seven sub-networks (G1, G2, G3, G4, G5, G6, G7) usable for fine-grained calculation are then complete.
2 Text-oriented emotion fine-granularity feature classification

Previous research took the sentiment orientation of text as its focus. As research in this field deepens, the value and uses of fine granularity come to the fore. Fine-grained classification and orientation classification differ in emphasis: fine granularity is a multi-class problem, whereas orientation only needs to calculate the tendency of a text. The manually annotated dictionaries used also differ: orientation research only needs the tendency of words marked, while the fine-grained annotation dictionary, the emotion vocabulary ontology, marks correlated features of the emotion words involved, such as part-of-speech class, intensity and polarity. Fine-grained emotion classification is carried out in conjunction with the HMM machine-learning algorithm.

The emotion word longest matching classification method SWLM performs longest matching through the maximum-weight vocabulary of emotion words, so that texts can be classified fairly accurately under the relevant emotion topics without disambiguation or denoising; weights are calculated through the seven sub-classification models, yielding parameters usable for HMM machine-learning classification.
The present invention uses the following definitions:

Definition 1 (longest weight matching path length dmax(S))

In network Gx, if two emotion words are covered sequentially, they are matched using the edge that directly joins them; if two emotion words are separated in network Gx, the path through the maximum-weight nodes is selected, giving the length of the longest weight matching path S. The calculation formula is:

d_max(S) = Σ_{i=1}^{n-1} d_max(w_i, w_{i+x})

where dmax(wi, wi+x) is the maximum-weight matching path from the i-th word to the (i+x)-th word in the network.

Definition 2 (emotion weight coefficient SW (Sentimental weight))

In the word network G, SW is the proportion of emotion polarity held by each of the seven sub-networks. Using this coefficient makes the classification more pronounced and reduces classification problems caused by fuzzy boundaries. Let the number of recurrences of a word in the emotion word network be freq and its polarity intensity be P. The calculation formulas are:

WC_i = freq × P
W_y = Σ_{i=1}^{n} WC_i
SW_x = W_{yx} / Σ_{i=1}^{7} W_{yi}

where WCi is the emotion value of each word in a sub-network, Wy is the emotion value of the sub-network, and SWx is the SW value of sub-network x, i.e. the emotion weight coefficient.
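Definition 2 translates into a few lines; a sketch assuming the sub-networks built earlier and using the ontology's intensity field as the polarity intensity P (an assumption):

```python
def emotion_weight_coefficients(subnets, emotion_lexicon, freq):
    """Definition 2 sketch. subnets: the seven sub-networks; freq maps a
    word to its recurrence count in the emotion word network; the
    'power' field stands in for the polarity intensity P (assumed)."""
    w_y = {cls: sum(freq.get(w, 0) * emotion_lexicon[w]["power"]
                    for w in g.nodes)        # W_y = sum of WC_i
           for cls, g in subnets.items()}
    total = sum(w_y.values()) or 1
    return {cls: v / total for cls, v in w_y.items()}   # SW_x
```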
2.1 The SWLM-TC unsupervised-learning classification method using the TC algorithm

Definition 3 (classification coefficient CC (Classification coefficient))

The classification coefficient is the coefficient defined for SWLM-TC unsupervised classification. After the maximum matching word path is determined, with Re the reproduction degree of a word on this path and power its emotion intensity, and assuming n words, the calculation formulas are:

CC_i = Re × power
CC = Σ_{i=1}^{n} CC_i

where CCi is the classification coefficient of a single word.

Definition 4 (classification prediction coefficient CPC (Classification prediction coefficient))

The classification prediction coefficient is the prediction mechanism adopted when a sample's class cannot be determined while classifying with a machine-learning algorithm. The SWx values are sorted: if SW1 + SW2 > 80% and SW1/SW2 > 1.5, the sample is assigned under SW1, otherwise under both SW1 and SW2; if SW1 + SW2 < 80%, the article's classification is considered complex and the sample is assigned under the corresponding class according to the classification coefficient.

Because the emotion words appearing in a paragraph-level text express the main emotional thread of the article, and the use of the co-occurrence random network preserves this emotional thread, the ordered random co-occurrence network performs well.
With reference to Fig. 2, the processing steps of the TC algorithm with emotion word marking based on SWLM are as follows:

1) Split the article to be classified into sentences, S'1→S'2→…→S'n;
2) Segment each sentence in order, remove meaningless notional words and auxiliary words, mark with the emotion vocabulary ontology dictionary, and keep the marked words in sequence, i.e. W'1→W'2→…→W'n;
3) Search the corresponding sub-network according to the class of each marked word;
4) Perform path selection over the words in the network: (I) if two words are adjacent, use the directly joining path; (II) if two words are not adjacent, select the maximum-weight path among the paths connecting them, finding the maximum-weight path according to the above steps and obtaining dmax(S);
5) Calculate the classification coefficient CC on the maximum-weight path dmax(S); the calculation follows Definition 3;
6) Calculate the classification coefficient CC under each home sub-network and compare: (I) if the coefficients are equal, use CC × SW; (II) if the coefficients differ, go to step (III); (III) sort the final CC values: if the first weight accounts for 80 percent, the text belongs under the corresponding mood network, and if not, the text is assigned, by classification coefficient CC, under the top two classes.
7) If a classification still cannot be assured, perform classification prediction on the text according to Definition 4.
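Putting the steps together, a condensed end-to-end sketch of SWLM-TC, reusing the dmax_sentence and cpc_assign sketches given earlier and approximating the reproduction degree Re by the accumulated path weight (all data shapes assumed):

```python
def swlm_tc_classify(sentences, subnets, nw, sw, emotion_lexicon):
    """End-to-end sketch of the SWLM-TC steps above (helper names refer
    to the earlier sketches; this is an approximation, not the patent's
    exact step 6 arithmetic)."""
    cc = {}
    for cls, g in subnets.items():
        # steps 2-3: the text's marked words that fall in sub-network cls
        words = [w for s in sentences for w in s
                 if w in emotion_lexicon
                 and emotion_lexicon[w]["class"] == cls and w in g]
        # steps 4-5: the maximum-weight path length stands in for the
        # reproduction degree Re; CC_i = Re * power, summed over words
        re_ = dmax_sentence(g.adj, nw, words)
        cc[cls] = sum(re_ * emotion_lexicon[w]["power"] for w in words)
    ranked = sorted(cc, key=cc.get, reverse=True)
    total = sum(cc.values()) or 1
    if cc[ranked[0]] / total >= 0.8:      # step 6: dominant class wins
        return [ranked[0]]
    if cc[ranked[0]] > cc[ranked[1]]:     # otherwise keep the top two
        return ranked[:2]
    return cpc_assign(sw, cc)             # step 7: CPC prediction
```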
2.2 The SWLM-HMM classification algorithm based on supervised machine learning

Machine learning plays a large role in text classification, and the HMM algorithm performs very well in NLP. Given the terseness of the HMM algorithm, its small computational load, and its ability to train on sample sequences of arbitrary length, HMM is applied to learning fine-grained emotion classification so as to improve the accuracy of SWLM-HMM classification.

When classifying with SWLM-HMM, the corpus cannot be trained directly with HMM; it is first processed with SWLM and then trained with the HMM algorithm, which improves classification accuracy and speeds up classification.
With reference to Fig. 3, the method for training the corpus is as follows:

(1) Using part of the texts in the sample library, perform fine-grained classification on the sample set with SWLM-TC, the SWLM-TC classification process being as described above, and calculate the weight coefficient SWx of the emotion to which each sample belongs;
(2) For all samples classified with SWLM-TC: calculate the classification coefficient CC of each text and classify according to step 6 of the SWLM-TC algorithm, then add the sample under the corresponding emotion classification set TSx (Train Set); if the classification coefficient is indeterminable with SWLM-TC, predict with step 7 of the SWLM-TC algorithm and assign the sample to the corresponding class;
(3) After the sampled part of the texts in the sample library has been classified into sub-emotions by the SWLM-TC algorithm, use the resulting training texts of each class to train the HMM classification models. The classification feature is each text's marked emotion words, formed into a word chain in text order; during HMM training, each text's emotion word string and its classified sub-emotion are passed to the HMM model as parameters, and all samples are input to the HMM algorithm for training.
a) For the remaining texts in the sample library, classify with the HMM algorithm; if a text can be classified correctly, carry out the corresponding categorization calculation.
b) For texts without classification results, perform classification prediction using Definition 4.
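A sketch of this training pipeline, with ClassSequenceModel from the earlier sketch standing in for the HMM and swlm_tc_classify supplying the labels (all names hypothetical):

```python
from collections import defaultdict

def train_swlm_hmm(sample_texts, subnets, nw, sw, emotion_lexicon):
    """Sketch of the Fig. 3 pipeline: label samples with SWLM-TC, pool
    them into the train sets TS_x, then fit one sequence model per
    class as a stand-in for the per-class HMM training."""
    train_sets = defaultdict(list)                   # TS_x
    for sentences in sample_texts:
        # the classification feature: the text's marked emotion words,
        # chained in text order
        word_string = [w for s in sentences for w in s
                       if w in emotion_lexicon]
        for cls in swlm_tc_classify(sentences, subnets, nw, sw,
                                    emotion_lexicon):
            train_sets[cls].append(word_string)
    models = {}
    for cls, strings in train_sets.items():
        models[cls] = ClassSequenceModel()
        models[cls].fit(strings)
    return models
```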
3 Emotion fine-granularity feature classification experiments

3.1 Classification data

The experimental data of the present invention uses collected microblog data together with the NLP&CC2014 data from the CCF Natural Language Processing and Chinese Computing evaluation. 7,000 microblog posts were crawled; 4,000 of them were chosen and merged with 2,000 posts on similar topics from the NLP&CC2014 data set, giving about 6,000 annotated samples in the final corpus. Only microblogs containing emotion were selected; emotionless microblog posts and data were rejected. The data comprises:

1) TrainDataNet: the 6,000 microblog posts;
2) TrainDataHMM: 5,000 of the 6,000 microblog posts, sampled so as to include microblogs from all 7 emotion sets;
3) TrainDataTest: the remaining 1,000 posts outside TrainDataHMM.

Table 1. Data distribution

Accuracy and recall are the two most commonly used metrics in the fields of information retrieval and statistical classification, used to evaluate the quality of results.

The experiments of the present invention use the collected data and the Chinese NLP&CC2014 orientation evaluation data set; after systematic testing, the experimental results are as follows.
3.2 Classification results

A. Raw data experiments

1) SWLM-TC emotion classification experiment

The validity of the algorithm is verified with SWLM-TC; accuracy, recall and F-value measure the validity of the classification. The seven-class results are shown in Table 2:

Table 2. SWLM-TC algorithm classification results

| Emotion class | Correctly classified | Wrongly assigned to the class | Belonging to the class but wrongly divided |
| --- | --- | --- | --- |
| Joy | 641 | 378 | 242 |
| Goodness | 654 | 341 | 227 |
| Anger | 621 | 412 | 225 |
| Sorrow | 610 | 384 | 239 |
| Fear | 609 | 351 | 245 |
| Disgust | 619 | 362 | 222 |
| Surprise | 627 | 359 | 219 |
2) SWLM-HMM emotion classification experiment

The SWLM-HMM algorithm experimental results are shown in Table 3.

Table 3. SWLM-HMM algorithm classification results

3) TC algorithm emotion classification experiment

The verification test results of the TC algorithm are shown in Table 4 below.

Table 4. TC algorithm classification results
| Emotion class | Correctly classified | Wrongly assigned to the class | Belonging to the class but wrongly divided |
| --- | --- | --- | --- |
| Joy | 531 | 468 | 352 |
| Goodness | 547 | 478 | 334 |
| Anger | 511 | 426 | 335 |
| Sorrow | 521 | 437 | 328 |
| Fear | 534 | 456 | 320 |
| Disgust | 508 | 434 | 333 |
| Surprise | 519 | 471 | 327 |
B. Accuracy, recall and F1 values

Calculating on the above experimental data, the accuracy, recall and F1 values are as follows:

1) SWLM-TC emotion classification experiment

The P, R and F1 values of the SWLM-TC algorithm are shown in Table 5 below.

Table 5. P, R and F1 values of SWLM-TC
| Emotion class | Accuracy | Recall | F1 |
| --- | --- | --- | --- |
| Joy | 62.90% | 72.59% | 67.40% |
| Goodness | 65.73% | 74.23% | 69.72% |
| Anger | 60.12% | 73.40% | 66.10% |
| Sorrow | 61.37% | 71.85% | 66.20% |
| Fear | 63.44% | 71.31% | 67.14% |
| Disgust | 63.10% | 73.60% | 67.95% |
| Surprise | 63.59% | 74.11% | 68.45% |
2) SWLM-HMM emotion classification experiment

The P, R and F1 values of the SWLM-HMM algorithm are shown in Table 6 below.

Table 6. P, R and F1 values of SWLM-HMM

3) TC algorithm emotion classification experiment

The P, R and F1 values of the TC algorithm are shown in Table 7 below.

Table 7. P, R and F1 values of the TC algorithm
| Emotion class | Accuracy | Recall | F1 |
| --- | --- | --- | --- |
| Joy | 53.15% | 60.14% | 56.43% |
| Goodness | 53.37% | 62.09% | 57.40% |
| Anger | 54.54% | 60.40% | 57.32% |
| Sorrow | 54.38% | 61.37% | 57.66% |
| Fear | 53.94% | 62.53% | 57.92% |
| Disgust | 53.93% | 60.40% | 56.98% |
| Surprise | 52.42% | 61.35% | 56.54% |
C. Macro-average and micro-average

The macro- and micro-averaged P, R and F1 values of each algorithm are shown in Table 8 below.

Table 8. Macro- and micro-averages of the algorithms
The comparison line charts of the experimental data for the word-frequency-based labeling algorithm TC, the SWLM-TC heuristic algorithm and the SWLM-HMM algorithm are shown in Fig. 4, Fig. 5 and Fig. 6.

Fig. 4, Fig. 5 and Fig. 6 show results analyzed with the same sentiment dictionary, the TC algorithm representing the baseline trend. Comparing the most basic TC sentiment-analysis algorithm with the SWLM-TC and SWLM-HMM algorithms gives the following:

1) In accuracy, SWLM-HMM > SWLM-TC > TC. The accuracy range of the TC algorithm over the 7 granularities is 52.42%-54.54%, that of SWLM-TC is 60.12%-65.73%, and that of SWLM-HMM is 69.60%-73.21%. It can thus be seen that TC lags behind SWLM-TC and SWLM-HMM by whole percentage points, and SWLM-TC lags behind SWLM-HMM.
2) In recall, SWLM-HMM > SWLM-TC > TC. The recall range of the TC algorithm over the 7 granularities is 60.14%-62.09%, that of SWLM-TC is 71.31%-74.23%, and that of SWLM-HMM is 79.39%-83.88%; TC lags behind SWLM-TC and SWLM-HMM by whole percentage points, and SWLM-TC lags behind SWLM-HMM.
3) In F1 value, SWLM-HMM > SWLM-TC > TC. On all of these evaluation criteria, SWLM-TC and SWLM-HMM perform better than the TC algorithm: the F1 range of TC over the 7 granularities is 56.43%-57.66%, that of SWLM-TC is 66.10%-69.72%, and that of SWLM-HMM is 75.26%-77.90%; TC lags behind SWLM-TC and SWLM-HMM by whole percentage points, and SWLM-TC lags behind SWLM-HMM.

From the above comparison, whether in accuracy, recall or F1 value, the results of the SWLM-TC and SWLM-HMM algorithms are better than those of the TC algorithm, which demonstrates that the traditional TC algorithm alone performs comparatively poorly in fine-grained calculation.
The micro-averaged and macro-averaged data are shown in Fig. 7 and Fig. 8.

From the macro-average and micro-average line charts it can be seen that the SWLM-HMM and SWLM-TC algorithms outperform the TC algorithm, and that of the two, SWLM-HMM performs better than SWLM-TC. The performance gaps between the three algorithms on the same data set are on the order of 10%.
From the data set distributions, the bar charts of the three algorithms' classifications are drawn as shown in Fig. 9, Fig. 10 and Fig. 11.

It can be seen from Fig. 9, Fig. 10 and Fig. 11 that the TC algorithm has the fewest correctly classified texts, a comparatively high number of texts wrongly assigned to each class, and the highest number of texts belonging to a class but wrongly divided; in the corresponding data, TC trails SWLM-TC and SWLM-HMM throughout.
3.3 Experimental conclusions

From the raw experimental data and the calculation of accuracy, recall and F1 values, the following observations and conclusions can be drawn:

1) The SWLM-TC algorithm is 7.7-13.31 percentage points more accurate than the traditional TC algorithm, and the SWLM-HMM algorithm is 9.48-13.09 percentage points more accurate than the SWLM-TC algorithm. The SWLM-X algorithms are thus more accurate on this data than the traditional algorithm because, before classifying, they pass through the computational stage proposed by the present invention: both SWLM-TC and SWLM-HMM go through the emotion word random co-occurrence network, so the role the co-occurrence word network plays in the present invention is made manifest, confirming the original intent of the proposed algorithms based on the emotion word co-occurrence network. In the comparison of SWLM-TC and SWLM-HMM, the accuracy of SWLM-TC falls below that of SWLM-HMM because SWLM-HMM, before classifying, trains on the training set with the SWLM-TC algorithm and classifies later with the trained model, and additionally performs emotion prediction on texts whose emotion is still ambiguous at the later stage; these two strategies let SWLM-HMM naturally surpass SWLM-TC and TC in accuracy.

The emotion word co-occurrence network completes texts lacking emotion and highlights the emotion words of greater weight appearing in a text; on these two points the SWLM-X algorithms show their advantage in fine-grained emotion calculation.
2) In recall, the SWLM-TC algorithm is 11.17-14.09 percentage points higher than the traditional TC algorithm, and the SWLM-HMM algorithm is 8.08-12.57 percentage points higher than the SWLM-TC algorithm, showing that when SWLM-TC and SWLM-HMM classify a given class, their ability to classify correctly exceeds the traditional algorithm's. This is because the algorithm and framework proposed by the present invention, when classifying a class, demarcate the specific emotion words belonging to that class more efficiently and accurately than the traditional algorithm. The proposed algorithm is better at highlighting important emotion words, and the present invention employs the emotion polarity from the emotion vocabulary ontology, a parameter the text exploits well: multiplying the co-occurrence count by the emotion polarity makes strongly polarized emotion words more prominent, reducing interference from other emotion words and making classification more accurate.
3) In the comparison of F1 values, the SWLM-TC algorithm is 9.67-13.29 percentage points higher than the traditional algorithm, and the SWLM-HMM algorithm is 9.16-11.8 percentage points higher than the SWLM-TC algorithm. In overall evaluation, the SWLM-TC and SWLM-HMM algorithms proposed by the present invention are better than the traditional algorithm; in this respect, the proposed algorithms exceed fine-grained emotion classification with traditional algorithms in comprehensive performance.
4) The experimental data as a whole shows that the classification performance of the proposed computational framework and algorithms is good, and that fine-grained analysis differs greatly from emotion orientation analysis: fine-grained calculation not only requires the algorithm to classify at fine granularity in every respect, it also demands higher noise resistance. For the emotion words appearing in an article, a purely calibration-based approach is easily disturbed and causes unnecessary over-classification. Since the emotional orientation within an article is stable, polarity and co-occurrence are important properties, which ordinary classification algorithms do not handle; when processing such texts, the algorithms proposed by the present invention are therefore more advantageous. For ordinary texts in which strongly polarized words co-occur only weakly, the present algorithm, unlike calibration-based algorithms, employs a prediction mechanism for texts whose emotion is fuzzy, so its handling of such texts is much better than that of methods without a prediction mechanism. The present invention employs the SWLM algorithm in its fine-grained computing framework and absorbs relevant knowledge of complex networks in its processing mechanism; completing the emotion words of texts to be classified gives relatively good performance, and with the above mechanisms in place, the fine-grained computing framework's calculation has been studied and verified experimentally.
Claims (3)
1. A fine-grained sentiment classification method based on a random co-occurrence network of emotion words, which uses random network theory and the word co-occurrence phenomenon: through marking with the emotional ontology vocabulary dictionary, a word-order-based random network model built with affective features is formed, i.e. an emotion word co-occurrence network model; model reduction is carried out on this basis, and the emotion word longest matching method (SWLM, Sentimental Word Longest Match) is combined with the TC algorithm for SWLM-TC unsupervised-learning classification, or the emotion word longest matching method is further combined with the HMM machine-learning algorithm to establish a fine-grained sentiment classification model and realize classification prediction with the model, wherein the building process of the emotion word co-occurrence network model is as follows:

1) perform sentence splitting on each text to obtain an ordered group of sentences S1→S2→…→Sn;
2) segment each sentence Si, filter out stop words and meaningless notional words, and carry out emotion word marking with the emotion vocabulary ontology, obtaining an ordered group of emotion words W1→W2→…→Wn;
3) for each sentence, extract word pairs <wi, wj> from the sentence with a WL-width sliding window; if wi ∉ W, add a new node wi to W and set its weight nwi to an initial value of 1, otherwise increment nwi by 1; if (wi, wj) ∉ E, add a new edge (wi, wj) to E and set its weight nwi,wj to an initial value of 1, otherwise increment nwi,wj by 1;
4) after all texts have been processed, the network model G is complete;

wherein S denotes a sequence composed of multiple sentences, w denotes an extracted emotion word, w ∈ Σ, and Σ is the Chinese word set, namely the emotional ontology word set obtained after removing stop words and meaningless notional words and then marking with the emotion vocabulary ontology; W is the node set of the network model G, W = {wi | i ∈ [1, N]}, N being the number of nodes of G; E is the edge set of the network model G, the number of edges of G being M, E = {(wi, wj) | wi, wj ∈ W, and a sequential co-occurrence relation exists between wi and wj}, where (wi, wj) denotes a directed edge from node wi to node wj; NW is the set of node weights in the network model G, NW = {nwi | wi ∈ W}; NE is the set of edge weights in the network model G, nwi,wj denoting the weight of the edge between nodes wi and wj, NE = {nwi,wj | (wi, wj) ∈ E};

the network model G is divided into 7 sub-networks according to the seven moods joy, goodness, anger, sorrow, fear, disgust and surprise; if a sub-network fractures during splitting, the broken sub-blocks are connected through the highest-weight node, building the seven sub-networks Gx | x = {1,2,3,4,5,6,7}, i.e. G1, G2, G3, G4, G5, G6, G7, usable for fine-grained calculation;

characterized in that, when classifying, the following definitions are used:

longest weight matching path length dmax(S): in network Gx | x = {1,2,3,4,5,6,7}, if two emotion words are covered sequentially, they are matched using the directly joining edge; if two emotion words are separated in network Gx, the path through the maximum-weight nodes is selected, the result being taken as the length of S, calculated as:
d_max(S) = Σ_{i=1}^{n-1} d_max(w_i, w_{i+x})
where dmax(wi, wi+x) is the maximum-weight matching path from the i-th word to the (i+x)-th word in the network;

emotion weight coefficient SW (Sentimental weight): the proportion of emotion polarity held by each of the seven sub-networks in network G; using this coefficient makes classification more pronounced and reduces classification problems caused by fuzzy boundaries; with freq the number of recurrences of a word in the emotion word network and P its polarity intensity, the calculation formulas are:

WC_i = freq × P
W_y = Σ_{i=1}^{n} WC_i
SW_x = W_{yx} / Σ_{i=1}^{7} W_{yi}
where WCi is the emotion value of each word in a sub-network, Wy is the emotion value of the sub-network, and SWx is the SW value, i.e. the emotion weight coefficient, of sub-network x;

classification coefficient CC (Classification coefficient): after the maximum matching word path is determined, with Re the reproduction degree of a word on this path and power its emotion intensity, and assuming n words, the calculation formulas are:

CC_i = Re × power
CC = Σ_{i=1}^{n} CC_i
Wherein CCiIt is the classification factor of single word;
Classification prediction coefficient CPC (Classification prediction coefficient): the prediction mechanism adopted, when classifying with the machine learning algorithm, for samples whose class cannot be decided. The SW_x values are sorted in descending order; if SW_1 + SW_2 > 80% and SW_1/SW_2 > 1.5, the sample is assigned under SW_1; if SW_1 + SW_2 > 80% and SW_1/SW_2 <= 1.5, the sample is assigned under both the SW_1 and SW_2 attributes; if SW_1 + SW_2 < 80%, the classification of the article is more complex, and the sample is assigned to the corresponding class according to the classification factor.
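A minimal sketch of this decision rule, assuming sw maps each mood label to its SW value and cc_by_class optionally supplies per-class classification factors for the complex case (both parameter names are illustrative):

```python
def predict_classes(sw, cc_by_class=None):
    """Return the predicted mood label(s) for an undecidable sample."""
    ranked = sorted(sw.items(), key=lambda kv: kv[1], reverse=True)
    (m1, sw1), (m2, sw2) = ranked[0], ranked[1]
    if sw1 + sw2 > 0.8:
        if sw2 == 0 or sw1 / sw2 > 1.5:
            return [m1]                    # one clearly dominant class
        return [m1, m2]                    # ambiguous: keep both attributes
    # SW_1 + SW_2 < 80%: fall back on the classification factor per class
    if cc_by_class:
        return [max(cc_by_class, key=cc_by_class.get)]
    return [m1]
```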
2. The fine-grained sentiment classification method based on the random co-occurrence network of emotion words according to claim 1, characterized in that, in the emotion vocabulary ontology library, emotions are divided into 7 major classes and 21 groups; the emotion classes are joy {happy (PA), at ease (PE)}, like {respect (PD), praise (PH), trust (PG), fondness (PB), wish (PK)}, anger {angry (NA)}, sorrow {sad (NB), disappointed (NJ), remorseful (NH), longing (PF)}, fear {flustered (NI), frightened (NC), shy (NG)}, dislike {vexed (NE), abhorring (ND), censuring (NN), envious (NK), suspicious (NL)} and surprise {amazed (PC)}; the emotion intensity power is divided into the five grades 1, 3, 5, 7 and 9, where 9 denotes the strongest intensity and 1 the weakest; the parts of speech in the emotion vocabulary ontology are divided into 7 classes, namely noun, verb, adjective (adj), adverb (adv), network vocabulary (nw), idiom and prepositional phrase (prep); the library contains 27,466 emotion words in total.
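For reference, an illustrative Python fragment of the ontology layout described in this claim; the English group names above map onto the 21 group codes, and the constants below are taken directly from the claim:

```python
ONTOLOGY_CLASSES = {  # 7 major classes over 21 groups
    "joy":      ["PA", "PE"],
    "like":     ["PD", "PH", "PG", "PB", "PK"],
    "anger":    ["NA"],
    "sorrow":   ["NB", "NJ", "NH", "PF"],
    "fear":     ["NI", "NC", "NG"],
    "dislike":  ["NE", "ND", "NN", "NK", "NL"],
    "surprise": ["PC"],
}
INTENSITY_GRADES = (1, 3, 5, 7, 9)   # 9 strongest, 1 weakest
POS_CLASSES = ("noun", "verb", "adj", "adv", "nw", "idiom", "prep")

assert sum(len(g) for g in ONTOLOGY_CLASSES.values()) == 21
```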
3. The fine-grained sentiment classification method based on the random co-occurrence network of emotion words according to claim 1, characterized in that the longest matching process for emotion words performs longest matching over the maximum-weight vocabulary of the emotion words, so that noise is suppressed without word-sense disambiguation and the text can be accurately classified under the related emotion topic; weight calculation is then carried out through the seven sub-classification models to derive the parameters with which machine learning classification can be performed.
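Tying the definitions together, a minimal sketch of deriving per-class machine learning parameters from the seven sub-classification models; it reuses dmax_sentence and the SW values from the sketches above, and the particular combination shown is an assumption rather than the patent's exact parameterisation:

```python
def classification_features(subgraphs, sw, words):
    """subgraphs: mood -> adjacency dict; sw: mood -> SW value;
    words: the emotion words of the text after longest matching.
    Returns a per-mood feature: the SW-weighted d_max(S) of the text."""
    return {mood: sw.get(mood, 0.0) * dmax_sentence(g, words)
            for mood, g in subgraphs.items()}
```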
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610936655.9A CN106547866B (en) | 2016-10-24 | 2016-10-24 | A kind of fine granularity sensibility classification method based on the random co-occurrence network of emotion word |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106547866A CN106547866A (en) | 2017-03-29 |
CN106547866B (en) | 2017-12-26
Family
ID=58392940
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610936655.9A Active CN106547866B (en) | 2016-10-24 | 2016-10-24 | A kind of fine granularity sensibility classification method based on the random co-occurrence network of emotion word |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106547866B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992667B (en) * | 2019-03-26 | 2021-06-08 | 新华三大数据技术有限公司 | Text classification method and device |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107766331A (en) * | 2017-11-10 | 2018-03-06 | 云南大学 | The method that automatic Calibration is carried out to word emotion value |
CN111239812A (en) * | 2019-05-17 | 2020-06-05 | 北京市地震局 | Social media big data and machine learning-based seismic intensity rapid evaluation method |
CN112101033B (en) * | 2020-09-01 | 2021-06-15 | 广州威尔森信息科技有限公司 | Emotion analysis method and device for automobile public praise |
CN112883145A (en) * | 2020-12-24 | 2021-06-01 | 浙江万里学院 | Emotion multi-tendency classification method for Chinese comments |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8417713B1 (en) * | 2007-12-05 | 2013-04-09 | Google Inc. | Sentiment detection as a ranking signal for reviewable entities |
CN104899231A (en) * | 2014-03-07 | 2015-09-09 | 上海市玻森数据科技有限公司 | Sentiment analysis engine based on fine-granularity attributive classification |
Also Published As
Publication number | Publication date |
---|---|
CN106547866A (en) | 2017-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106547866B (en) | A kind of fine granularity sensibility classification method based on the random co-occurrence network of emotion word | |
CN106919673B (en) | Text mood analysis system based on deep learning | |
CN104137102B (en) | Non- true type inquiry response system and method | |
CN108363753A (en) | Comment text sentiment classification model is trained and sensibility classification method, device and equipment | |
CN102789498B (en) | Method and system for carrying out sentiment classification on Chinese comment text on basis of ensemble learning | |
CN105022805B (en) | A kind of sentiment analysis method based on SO-PMI information on commodity comment | |
CN105005553B (en) | Short text Sentiment orientation analysis method based on sentiment dictionary | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
CN107066446A (en) | A kind of Recognition with Recurrent Neural Network text emotion analysis method of embedded logic rules | |
CN107609132B (en) | Semantic ontology base based Chinese text sentiment analysis method | |
CN108536870B (en) | Text emotion classification method fusing emotional features and semantic features | |
CN107403017A (en) | A kind of method that real-time news of intellectual analysis influences on financial market | |
CN106227756A (en) | A kind of stock index forecasting method based on emotional semantic classification and system | |
CN106598950A (en) | Method for recognizing named entity based on mixing stacking model | |
Yan et al. | An improved single-pass algorithm for chinese microblog topic detection and tracking | |
US20160170993A1 (en) | System and method for ranking news feeds | |
Wahid et al. | Cricket sentiment analysis from Bangla text using recurrent neural network with long short term memory model | |
Rustamov | A hybrid system for subjectivity analysis | |
Nair et al. | Comparative study of twitter sentiment on covid-19 tweets | |
CN108763402A (en) | Class center vector Text Categorization Method based on dependence, part of speech and semantic dictionary | |
CN109670169A (en) | A kind of deep learning sensibility classification method based on feature extraction | |
Kim et al. | CNN based sentence classification with semantic features using word clustering | |
CN107967337A (en) | A kind of cross-cutting sentiment analysis method semantic based on feeling polarities enhancing | |
CN107122471A (en) | A kind of method that hotel's characteristic comment is extracted | |
Sedighi et al. | Opinion spam detection with attention-based neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||