CN112699240A

CN112699240A - Intelligent dynamic mining and classifying method for Chinese emotional characteristic words

Info

Publication number: CN112699240A
Application number: CN202011641702.XA
Authority: CN
Inventors: 刘文平; 高宏松
Original assignee: Jingmen Huiyijia Information Technology Co ltd
Current assignee: Jingmen Huiyijia Information Technology Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-04-23

Abstract

The invention provides an intelligent dynamic mining and classifying method for Chinese emotional characteristic words. An improved CRF model learns features of a word such as its language environment, context and part of speech, and directly labels and learns the new emotional characteristic words in a sentence without a separate new-word discovery process, which makes the process simpler and more efficient. The idea of the Z-label LDA optimization model algorithm is extended to classify the emotion types of the emotional characteristic words; this depends less on additional resources and does not require training corpus resources. The method intelligently labels the new words appearing in the text together with their emotional polarities, is efficient enough to be used for online judgment, and can mine the new emotional characteristic words appearing in sentences. It helps the intelligent expansion of an emotional characteristic dictionary and also serves as a good reference for emotional tendency analysis at the phrase, sentence and chapter levels. The method has good feasibility, high accuracy and high efficiency.

Description

Intelligent dynamic mining and classifying method for Chinese emotional characteristic words

Technical Field

The invention relates to a Chinese emotional word mining and classifying method, in particular to an intelligent dynamic mining and classifying method for Chinese emotional characteristic words, and belongs to the technical field of intelligent mining and classifying of emotional words.

Background

With the rapid development of social informatization technology, internet technologies including social networks, electronic commerce and mobile communication are changing day by day, and the era of big data has formally arrived. Among the characteristics of big data, high speed, mass and diversity are only surface features; the key is the value contained in the data, and only by analyzing the data can really useful, valuable and intelligent information be obtained. The core value of big data is that the required information is acquired quickly by processing and mining massive data.

In data analysis and mining, emotional tendency analysis has become a very important part, because it can detect subjective emotions in a text such as viewpoints, emotional tendencies and preferences. The earliest emotional tendency analysis processed movie review data, judged the positive and negative polarity of other review corpora through supervised learning, and was then extended to judging the positive and negative polarity of texts in general. Emotional tendency analysis is now receiving more and more attention. In electronic commerce, recommendation systems analyze customers' comments on products to provide targeted product recommendations for potential customers, and aggregated product comment data provides a decision reference for other customers who intend to purchase.

Effectively processing and mining the information generated on the Internet is gradually showing very high commercial and social value, and the emotional tendency analysis of network texts such as microblogs and online comments has attracted wide attention from business, academia and even government departments. Valuable information is obtained by mining and analyzing these texts: enterprises can use emotional tendency analysis to discover the preferences, interests and consumption habits of customers and make targeted product recommendations and planning decisions, while government departments can use it to understand online public opinion, which benefits social governance.

At present, various information technologies are developing rapidly and the crossover between different fields is increasingly evident, which provides a good opportunity for the development of emotional tendency analysis. Social networks represented by Weibo and WeChat and e-commerce platforms represented by Taobao and Tmall generate a large amount of information containing emotional tendencies every day, and progress in emotional tendency analysis can in turn promote the development of other related fields. Although emotional tendency analysis has made some achievements, many problems remain, such as excessive dependence on an emotional characteristic dictionary and the difficulty of capturing emotional turns and comment objects. In addition, customers increasingly demand more accurate, faster, more comprehensive and more intelligent emotional tendency analysis. Emotional tendency analysis, and Chinese emotional tendency analysis in particular, urgently needs improvement in many respects; the intelligent mining and classification of emotional characteristic words is an important part of it and is the main problem solved by the invention.

Although the prior art has made certain achievements in mining emotional characteristic words, a large number of problems remain. The invention mainly addresses these problems, which include the following aspects:

firstly, the prior art depends excessively on an emotional characteristic dictionary, lacks industry-recognized evaluation standards, cannot discover newly appearing emotional characteristic words, and involves a large amount of corpora and manual labor; the capture of emotional turns and comment objects also remains to be solved. In addition, customers increasingly demand more accurate, faster, more comprehensive and more intelligent emotional tendency analysis. The intelligent mining of emotional characteristic words and their classification at a finer granularity is an important part of this and is the main problem solved by the invention;

secondly, Chinese emotional tendency analysis started late and has to deal with the particularities of the Chinese language, such as the word segmentation problem, ambiguity and puns. Text emotion classification in the prior art is mainly divided into methods based on an emotional characteristic dictionary and methods based on machine learning. Constantly changing network language also brings challenges to research, and the emergence of new words affects the overall effect of Chinese emotional tendency analysis;

thirdly, the prior art treats the mining of new emotional characteristic words as a part of new word discovery. New word discovery has always been a difficult problem in Chinese information processing; it involves large-scale corpus statistical analysis, language structure, semantics, word segmentation, filtering and so on, and no good algorithm has been provided to solve it;

fourthly, the processes of emotional characteristic word mining and polarity identification in the prior art are complex. Both stages involve a large amount of manual work, and manual selection is highly subjective and arbitrary. Mining new emotional characteristic words from large-scale microblog data requires large-scale data calculation, the models need to be trained with a large amount of training data, the procedure is complex and difficult to operate, and the emotion judgment is not accurate. The prior art also depends heavily on extra resources, needs training corpus resources and involves large-scale corpus calculation, so the efficiency of discovering new emotional characteristic words is low and the feasibility in practical application is weak;

fifthly, in terms of word segmentation, a dictionary-based word segmentation system in the prior art performs poorly on new words (unknown words) and ambiguous words, cannot take the frequency of occurrence of words into account, and cannot learn the context information of words well. In terms of part-of-speech labeling, the great disadvantage of the HMM in sequence labeling is that the extraction of context information is limited by the independence assumption of the algorithm, although context information is important for part-of-speech labeling. The MEMM can take context information into account, but because of its normalization principle it can only find a locally optimal solution, which causes the label bias problem.

Disclosure of Invention

Aiming at the defects of the prior art, the intelligent dynamic mining and classifying method for Chinese emotional characteristic words provided by the invention simplifies the whole process of mining and polarity identification of emotional characteristic words. In both the mining stage and the polarity identification stage, everything that is required, including the training corpus and the seed words needed for supervision, is obtained by algorithmic statistical identification, so that as little manual work as possible is involved.

In order to achieve the technical effects, the technical scheme adopted by the invention is as follows:

the intelligent dynamic mining and classifying method for Chinese emotional characteristic words is divided into two main parts: intelligent mining of emotional characteristic words based on an improved CRF algorithm, and polarity identification and classification of emotional characteristic words based on a Z-label LDA optimization model. The first part comprises the improvement of the CRF algorithm, the selection of emotional features, and the mining of emotional characteristic words with the improved CRF algorithm; the second part comprises the Z-label LDA optimization model and the classification of the emotional characteristic words;

firstly, emotional characteristic words are mined with an improved CRF (conditional random field) algorithm: the mining of emotional characteristic words is converted into a sequence labeling task, a model is trained on the part-of-speech and context features of emotional characteristic words, and the model is then used to mine emotional characteristic words from unknown texts;

secondly, in the polarity identification stage, emotional characteristic words are classified with a Z-label LDA optimization model algorithm: the topics of the LDA algorithm are put in correspondence with word categories, seed words are used to define the topic information so that learning is partially supervised, and the polarity category of a target emotional characteristic word is then judged from the topic-word distribution;

to classify the emotional characteristic words with the Z-label LDA optimization model, the vocabulary size of the training data is calculated in advance, the words at the corresponding positions of the text are labeled with the configured topic numbers according to predefined rules, and the matrix of vocabulary entries corresponding to the text is calculated at the same time. A topic-word model and a document-topic model are obtained, statistical analysis is performed on the topic-word model, and the classification of words is obtained from the predefined topics;

in the polarity identification of emotional characteristic words, the polarity is judged from the context information of the sentence in which the emotional characteristic word occurs, or from the co-occurrence of emotional characteristic words. Polarity identification is regarded as a classification process that divides emotional characteristic words into two classes, positive and negative. The generation process of a topic model is regarded as follows: each word in a text selects a topic with a certain probability, and a word is then selected from that topic with a certain probability; this process is put in correspondence with the classification of words;

some topic-word pairs are predefined; then, as the model learns from the input corpus, the words in the training corpus are assigned to the predefined topics with a certain probability, which expands the topics. The definition of the topic-word pairs is the rule file.

In the intelligent dynamic mining and classifying method for Chinese emotional characteristic words, further, the CRF algorithm is improved: CRF is a statistics-based undirected graph model algorithm for sequence labeling and sequence segmentation, which labels serialized data through a probability model. Probability models fall into two types. A generative model constructs the joint distribution Q(y, c) of the observation sequence y and the label sequence c over all possible observation sequences to obtain a probability density model, and then predicts the label sequence. A discriminative model calculates the probability of the label sequence given the observation sequence, that is, it computes the conditional probability distribution Q(c | y) and uses the resulting discriminant function to predict the target sequence;

the probability graph model is a model combining probability theory and graph theory, the probability dependency relationship between random variables is represented by a graph, if an edge exists between two points, the two points are considered to have the dependency relationship, otherwise, the two points are mutually independent;

the improved CRF algorithm is a discriminative model that improves on the Markov random field (MRF) and is based on an undirected graph model. Let F = (U, B) be an undirected graph and O = {O_u | u ∈ U} a set of random variables indexed by the nodes of F. Given an observation set Z, each variable in O satisfies the Markov property:

Q(O_u | Z, O_v, v ≠ u) = Q(O_u | Z, O_v, v ~ u)

where v ~ u means that v and u are adjacent nodes. Under this condition, (Z, O) forms a conditional random field. The improved CRF algorithm adds an observation set and, given the observation sequence, calculates the probability of the whole label sequence, that is, the conditional probability Q(O | Z); adding the observation set turns the MRF into a discriminative model.

In the intelligent dynamic mining and classifying method for Chinese emotional characteristic words, further, the improved CRF (conditional random field) algorithm has no strict independence requirements and can incorporate arbitrary context information; its loss function is convex; it calculates the probability distribution of sequence data by maximum likelihood estimation; and it is a state model that is not constrained by local normalization;

learning and testing process of the improved CRF algorithm: first, the data set is segmented into words, and part-of-speech labels and other features are added; the model is then trained with the improved CRF algorithm; the test set is formatted in the same way and tested with the trained model; and the set of emotional characteristic words is obtained after post-processing.
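The data preparation step described above can be sketched as follows. This is a minimal illustration, assuming a B/I/O tag scheme and a tab-separated "word, part of speech, tag" row layout; neither is specified in the original, and the function names `bio_tags` and `to_crf_rows` are hypothetical.

```python
def bio_tags(tokens, emotion_words):
    """Label each segmented token as B (begins an emotion word), I (inside) or O."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest run of consecutive tokens that spells a known emotion word.
        for j in range(len(tokens), i, -1):
            if "".join(tokens[i:j]) in emotion_words:
                matched = j - i
                break
        if matched:
            tags[i] = "B"
            for k in range(i + 1, i + matched):
                tags[k] = "I"
            i += matched
        else:
            i += 1
    return tags


def to_crf_rows(tokens, pos_tags, emotion_words):
    """Build one 'word<TAB>pos<TAB>tag' row per token (assumed training format)."""
    tags = bio_tags(tokens, emotion_words)
    return ["\t".join(row) for row in zip(tokens, pos_tags, tags)]
```

Labeling at the token level means that an emotional characteristic word split by the segmenter into several tokens can still be recovered as one unit.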

In the intelligent dynamic mining and classifying method for Chinese emotional characteristic words, further, the features of emotional characteristic words used in the mining process based on the improved CRF algorithm include:

1) character features, an important aspect of the features of emotional characteristic words: some characters themselves carry an emotional tendency, and the context of emotional characteristic words also shows certain regularities;

2) part-of-speech features, an obvious feature of emotional characteristic words: they are mostly adjectives, interjections and nouns; in their context, nouns and pronouns appear more often before the word, while auxiliary words, interjections and punctuation appear more often after it, so the part of speech of the context is taken as an important feature;

3) degree adverb features: degree adverbs are used before emotional characteristic words to strengthen the emotion, so degree adverbs are also used as part of the extracted features.
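The three feature groups can be illustrated with a small extraction function. This is only a sketch: the degree adverb list `DEGREE_ADVERBS`, the feature names and the boundary markers "BOS" and "EOS" are assumptions introduced here, not values from the original.

```python
# Hypothetical degree adverb list, for illustration only.
DEGREE_ADVERBS = {"很", "太", "非常", "特别", "十分"}


def token_features(tokens, pos_tags, i):
    """Collect character, part-of-speech and degree adverb features for token i."""
    word = tokens[i]
    return {
        # 1) character features of the word itself
        "word": word,
        "first_char": word[0],
        "last_char": word[-1],
        # 2) part of speech of the word and of its immediate context
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
        "next_pos": pos_tags[i + 1] if i + 1 < len(tokens) else "EOS",
        # 3) whether a degree adverb directly precedes the word
        "after_degree_adverb": i > 0 and tokens[i - 1] in DEGREE_ADVERBS,
    }
```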

In the intelligent dynamic mining and classifying method for Chinese emotional characteristic words, further, emotional characteristic words are mined with the improved CRF algorithm: the CRF++ toolkit is used together with a training set, a test set and a feature template, following the standard structure of the toolkit. A prediction model is built from the feature template and the training set, and the test set is then predicted and evaluated; the feature template directly determines how the features of emotional words are obtained;

the features are specifically: the character features of the word itself; whether a degree word or another emotional characteristic word prefix appears before the word; the characters at the beginning and the end of the emotional characteristic word; and part-of-speech features, namely whether an adverb, adjective or pronoun appears in the context of the word. The feature window is extended to a length of three positions.
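To show how a feature template turns token rows into features, the sketch below expands `%x[row,col]` macros in the style of CRF++ templates, where `row` is an offset relative to the current token and `col` is a column index. The template lines are illustrative rather than the template used in the original, and the `_PAD` placeholder for out-of-range rows is a simplification of this sketch (CRF++ has its own boundary markers).

```python
import re

# Illustrative template covering a window of three positions (-1, 0, +1).
# Column 0 is assumed to hold the word and column 1 its part of speech.
TEMPLATE = [
    "U00:%x[-1,0]",
    "U01:%x[0,0]",
    "U02:%x[1,0]",
    "U03:%x[-1,1]/%x[0,1]",
    "U04:%x[0,1]/%x[1,1]",
]

_MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template_line, rows, pos):
    """Replace every %x[row,col] macro with the value found relative to pos."""
    def lookup(match):
        r = pos + int(match.group(1))
        c = int(match.group(2))
        return rows[r][c] if 0 <= r < len(rows) else "_PAD"
    return _MACRO.sub(lookup, template_line)


def features_at(rows, pos, template=TEMPLATE):
    """Expand every template line for the token at index pos."""
    return [expand(line, rows, pos) for line in template]
```

Adding or removing template lines is then enough to compare word features with combined word and part-of-speech features.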

In the intelligent dynamic mining and classifying method for Chinese emotional characteristic words, further, new and old emotional characteristic words cannot be distinguished from the labels alone, so all emotional characteristic words are mined first and the old dictionary is then used to filter out the known words, leaving the new emotional characteristic words. The training set uses the emotion mining corpus of NLP&CC: based on the combined word and part-of-speech features, 4000 items of Chinese microblog emotion mining corpus data are used for training and 2632 items of evaluation corpus are used as the test set. Part-of-speech features and word features are compared by adding and removing entries in the template set, and word features alone are also compared with the combined word and part-of-speech features;

after word segmentation, part-of-speech labeling and emotional characteristic word labeling, features are extracted from the training set and a model file is obtained by learning with the improved CRF (conditional random field) algorithm. The test set is segmented and labeled in the same way and tested with the obtained model, which yields the emotional characteristic word labeling result; the emotional characteristic words contained in each sentence are then obtained by processing and restoring this result.
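The final "processing and restoring" step and the dictionary filter can be sketched as below, again assuming B/I/O tags (an assumption of this sketch) and hypothetical function names.

```python
def extract_emotion_words(tokens, tags):
    """Rebuild emotion words from per-token B/I/O tags."""
    words, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                words.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def new_emotion_words(words, old_dictionary):
    """Keep mined words that are absent from the old dictionary, without duplicates."""
    seen, result = set(), []
    for word in words:
        if word not in old_dictionary and word not in seen:
            seen.add(word)
            result.append(word)
    return result
```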

In the intelligent dynamic mining and classifying method for Chinese emotional characteristic words, further, the Z-label LDA optimization model is as follows: LDA is a modeling method for discrete data sets based on the following assumption: there is a set of topics R such that every text in the document set A contains one or more topics, each topic is a probability distribution over a series of words, and the latent topics in the documents can be found through probabilistic learning. From the perspective of words, a topic in R is understood as:

1) a common related subject matter of at least one document content;

2) some or all of the semantic information expressed by the document;

3) a pattern of related word co-occurrence;

4) clustering co-occurrence words;

5) one classification of the word;

from a probabilistic perspective, one topic is understood as:

1) probability distribution of words;

2) a set of words arranged according to a probability of word co-occurrence;

LDA is a three-layer Bayesian probabilistic topic model for discrete data sets. It places Dirichlet distributions on the topic and word parameters, which prevents the topic parameters from growing uncontrollably as the document set changes, and it uses a fully probabilistic generative model for the generation of a document.

In the intelligent dynamic mining and classifying method for Chinese emotional characteristic words, further, the LDA model assumes W latent topics, each of which is a multinomial distribution over words. A document is regarded as a sample of word-frequency sequences over the topics, and each word depends on a specific topic: each document selects a topic with a certain probability and then selects a word from that topic with a certain probability. The document set A is represented as {k_1, k_2, …, k_N} and a document as k = {j_1, j_2, …, j_M}. The generation process of a document in a specific document set is:

1. Select g ~ Dir(b);

2. For each of the M words j_m:

1) select a topic x_m ~ Mult(g);

2) select a word j_m ~ q(j_m | x_m, a) from the multinomial probability distribution of topic x_m.
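The generative process above can be simulated in a few lines. This is a sketch under stated assumptions: it uses only the Python standard library, draws the Dirichlet sample by normalizing gamma variates, and the helper names (`sample_dirichlet`, `sample_categorical`, `generate_document`) and the toy parameters in the usage note are hypothetical.

```python
import random


def sample_dirichlet(b, rng):
    """Draw g ~ Dir(b) by normalising independent gamma variates."""
    draws = [rng.gammavariate(b_w, 1.0) for b_w in b]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(b, a, vocabulary, M, rng):
    """Generate M words: g ~ Dir(b), then x_m ~ Mult(g) and j_m ~ q(j_m | x_m, a)."""
    g = sample_dirichlet(b, rng)                 # step 1: topic mixture of the document
    topics, words = [], []
    for _ in range(M):                           # step 2: one topic and one word per position
        x_m = sample_categorical(g, rng)
        j_m = vocabulary[sample_categorical(a[x_m], rng)]
        topics.append(x_m)
        words.append(j_m)
    return g, topics, words
```

For example, with `a = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]` and `vocabulary = ["好", "棒", "差", "烂"]`, every word drawn under topic 0 comes from the first two entries and every word drawn under topic 1 from the last two.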

where g is a prior parameter that controls the mixing proportions of the different topics in a document, and b is the parameter of the Dirichlet distribution, whose form is as follows:
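Assuming the standard Dirichlet density over a W-dimensional topic mixture is intended here, with b = (b_1, ..., b_W) and the components of g summing to 1, the form would be:

```latex
q(g \mid b) = \frac{\Gamma\left(\sum_{w=1}^{W} b_w\right)}{\prod_{w=1}^{W} \Gamma(b_w)}
\prod_{w=1}^{W} g_w^{\,b_w - 1}
```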

this ensures that each document contains all topics. b controls the mean and the sparsity of g, and g is calculated from the document set in order to evaluate b. a is the parameter of the topic-word multinomial distributions and ensures that each topic contains all words; it is expressed as a W × U matrix, where W is the number of possible topics and U is the number of possible words, and each row is a U-dimensional vector whose entries sum to 1. The generation process is represented as a graphical model;

the joint probability of LDA is obtained as:
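Assuming the usual LDA factorization applies to the generative steps listed earlier, the joint probability of the topic mixture g, the topics x = (x_1, ..., x_M) and the words j = (j_1, ..., j_M) of one document would be:

```latex
q(g, x, j \mid b, a) = q(g \mid b) \prod_{m=1}^{M} q(x_m \mid g)\, q(j_m \mid x_m, a)
```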

the key in LDA is to calculate the posterior probability of the implicit part:
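Under the same assumption, the posterior of the hidden variables follows from Bayes' rule. Its denominator requires integrating over g and summing over all topic assignments, which is why an approximate method such as Gibbs sampling is used:

```latex
q(g, x \mid j, b, a) = \frac{q(g, x, j \mid b, a)}{q(j \mid b, a)}
```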

parameter estimation uses Gibbs sampling, which calculates the values of the parameters directly by using samples of the joint distribution of several variables to approximate that joint distribution. Let the set of variables be Z = {Z_1, Z_2, …, Z_m}; the sampling process is as follows:

1) assign an initial value to each variable in the set Z;

2) sample each variable from its conditional probability distribution given all the other variables. The calculation uses the variable values already obtained in the current round together with the values of the variables that have not yet been sampled, and the value just sampled is in turn used when calculating the conditional probability for the next variable;

Let the word vector and the topic vector of the documents be given. For a word j_i, the probability that it belongs to a topic r_i depends on the topic assignments of all the other words, and it is computed as the product of two ratios. In the first ratio, the numerator is the number of times word w is assigned to topic r, not counting the i-th word, and the denominator is the total number of words assigned to that topic. In the second ratio, the numerator is the number of words in the n-th document that are assigned to topic r, not counting the i-th word, and the denominator is the total number of words in the n-th document, not counting the i-th word. After the iteration is completed, the formulas for g and e are obtained as follows:
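Assuming the standard collapsed Gibbs sampler for LDA with symmetric scalar priors b and a is what is meant, the update and the final estimates would take the following form. The count symbols n are introduced here for illustration and are not taken from the original; e is assumed to denote the topic-word distribution and g the document-topic distribution.

```latex
q(r_i = r \mid r_{-i}, j) \propto
\frac{n^{(w)}_{-i,r} + a}{\sum_{w'=1}^{U} n^{(w')}_{-i,r} + U a}
\cdot
\frac{n^{(n)}_{-i,r} + b}{\sum_{r'=1}^{W} n^{(n)}_{-i,r'} + W b}

e_{r,w} = \frac{n^{(w)}_{r} + a}{\sum_{w'=1}^{U} n^{(w')}_{r} + U a},
\qquad
g_{n,r} = \frac{n^{(n)}_{r} + b}{\sum_{r'=1}^{W} n^{(n)}_{r'} + W b}
```

Here n^{(w)}_{-i,r} counts the assignments of word w to topic r and n^{(n)}_{-i,r} counts the words of document n assigned to topic r, both excluding the i-th word.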

the Z-label LDA optimization model is an improvement of LDA for supervised requirements. It meets the need for classification by strengthening the one-to-one relation between topics and the required categories, fuses the unsupervised algorithm with supervised knowledge, and learns in a semi-supervised manner when some features are labeled; here the features are words. For the latent topic x_i of the i-th word, a set S⁽ⁱ⁾ of allowed topic labels is introduced, and a hard constraint f(u ∈ S⁽ⁱ⁾) is imposed on Gibbs sampling, which takes the value 1 if u ∈ S⁽ⁱ⁾ and 0 otherwise:

q(x_i = u | x_-i, k, b, a) ∝ q(x_i = u | x_-i, k) f(u ∈ S⁽ⁱ⁾)

If x_i is to be constrained to the two particular values 1 and 2, it is only necessary to set S⁽ⁱ⁾ = {1, 2}. The above formula makes the inference of latent topics flexible when target knowledge is available, and a predefined topic set S⁽ⁱ⁾ can be set independently for each word in the corpus. To allow the strength of the constraint to vary, an additional variable h with 0 ≤ h ≤ 1 is added; h = 1 gives the hard constraint and h = 0 gives unconstrained sampling:

q(x_i = u | x_-i, k, b, a) ∝ q(x_i = u | x_-i, k) (h f(u ∈ S⁽ⁱ⁾) + 1 - h)

This completes the optimization of the Z-label LDA model and the construction of the emotional characteristic word mining and classifying model.
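A compact sketch of a collapsed Gibbs sampler with this constraint factor is given below. It is an illustration under stated assumptions rather than the implementation of the original: the priors b and a are taken as symmetric scalars, constraints are keyed by word id, only the topic-word matrix is returned, and the function name `zlabel_lda` and its default values are hypothetical.

```python
import random


def zlabel_lda(docs, num_topics, vocab_size, constraints, h=1.0,
               b=0.1, a=0.5, sweeps=100, seed=0):
    """Collapsed Gibbs sampling for LDA with z-label style topic constraints.

    docs: list of documents, each a list of word ids.
    constraints: dict mapping a word id to its set of allowed topic ids.
    Returns the topic-word matrix (one row per topic, each row sums to 1).
    """
    rng = random.Random(seed)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assign = []

    # Initialise; constrained words start inside their allowed topic set.
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.choice(sorted(constraints.get(w, range(num_topics))))
            row.append(t)
            topic_word[t][w] += 1
            topic_total[t] += 1
            doc_topic[d][t] += 1
        assign.append(row)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]
                # Remove the current token from all counts.
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                doc_topic[d][t] -= 1
                allowed = constraints.get(w)
                weights = []
                for u in range(num_topics):
                    p = ((topic_word[u][w] + a) / (topic_total[u] + vocab_size * a)
                         * (doc_topic[d][u] + b))
                    if allowed is not None:
                        f = 1.0 if u in allowed else 0.0
                        p *= h * f + 1.0 - h   # soft or hard constraint factor
                    weights.append(p)
                t = rng.choices(range(num_topics), weights=weights, k=1)[0]
                # Add the token back under its newly sampled topic.
                assign[d][i] = t
                topic_word[t][w] += 1
                topic_total[t] += 1
                doc_topic[d][t] += 1

    return [[(topic_word[u][w] + a) / (topic_total[u] + vocab_size * a)
             for w in range(vocab_size)] for u in range(num_topics)]
```

With h = 1 a seed word can never leave its predefined topic, so unlabeled words that co-occur with it are pulled toward the same topic through the document-topic counts.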

In the intelligent dynamic mining and classifying method for Chinese emotional characteristic words, further, in the classification of emotional characteristic words, the polarity identification of emotional characteristic words by the Z-label LDA optimization model proceeds as follows:

Input: a document set A, an emotional characteristic word set K, and emotional seed word sets Q and M.

Output: the words in the emotional characteristic word set K and their corresponding polarities.

The algorithm is as follows:

1) add the words in the emotional characteristic word set K to the user dictionary, perform word segmentation and deduplication on the document set A, and obtain a word list KR;

2) organize the segmented documents into a document set A' in which every document consists of words in the word list;

3) generate the numeric input document matrix ax of the Z-label LDA optimization model from A' and KR;

4) generate the supervised word definition matrix ac from A', KR, Q and M;

5) train with ax and ac to obtain a topic-word matrix and a topic-document matrix;

6) for each word k in K, retrieve its probability values under the different topics in the topic-word matrix, compare them, and give the polarity of k.
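Steps 1) to 4) and 6) can be sketched with plain data structures. The sketch assumes the documents are already segmented, that Q holds the positive seed words and M the negative ones, and that topic 0 is the positive topic and topic 1 the negative topic; none of these choices is stated in the original, and all function names are hypothetical. Step 5), the training itself, is left to whatever Z-label LDA trainer is used.

```python
def build_vocabulary(segmented_docs):
    """Steps 1-2: deduplicated word list KR as a word -> id mapping."""
    vocab = {}
    for doc in segmented_docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_id_matrix(segmented_docs, vocab):
    """Step 3: numeric document matrix ax (one list of word ids per document)."""
    return [[vocab[word] for word in doc] for doc in segmented_docs]


def seed_constraints(vocab, Q, M, pos_topic=0, neg_topic=1):
    """Step 4: supervised word definition ac, mapping seed word ids to topics."""
    ac = {}
    for word in Q:
        if word in vocab:
            ac[vocab[word]] = {pos_topic}
    for word in M:
        if word in vocab:
            ac[vocab[word]] = {neg_topic}
    return ac


def classify(words, vocab, topic_word, pos_topic=0, neg_topic=1):
    """Step 6: compare each word's probability under the two polarity topics."""
    result = {}
    for word in words:
        if word not in vocab:
            continue  # the word never occurred in the document set
        col = vocab[word]
        positive = topic_word[pos_topic][col] >= topic_word[neg_topic][col]
        result[word] = "positive" if positive else "negative"
    return result
```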

In the intelligent dynamic mining and classifying method for Chinese emotional characteristic words, further, word categories are judged in two ways. The first is based on the topic-word distribution: when a topic model is trained, the number of topics is far smaller than the number of words, so words are clustered by co-occurrence. Combined with the supervised method, this gives a weak classification of emotional characteristic word categories, and the category of an unknown word is obtained by judging its distribution probabilities. The rule file added to the algorithm contains the predefined topics, defined according to the emotional characteristic categories and their seed words. After training on the input data, target emotional characteristic words are clustered into the predefined categories, and their emotional tendency is judged according to the strength of the clustering;

the second way relies on the fact that the polarity of a sentence is determined by the polarity of its emotional characteristic words, so the polarity of the words can be inferred from the polarity of the sentence: the trained model is used to judge the topic and polarity of an input sentence, and the polarity of the emotional characteristic words in the sentence is then deduced in reverse from the polarity of the sentence.

Compared with the prior art, the invention has the following contributions and innovation points:

firstly, the intelligent dynamic mining and classifying method for Chinese emotional characteristic words provided by the invention can intelligently identify the new words appearing in input sentences or chapters together with their emotional polarities. It does not need dictionaries or other resources during the mining of emotional characteristic words, is efficient enough to be used even for online judgment, and can mine the new emotional characteristic words appearing in sentences. It therefore helps the intelligent expansion of emotional characteristic dictionaries and also serves as a good reference for the emotional tendency analysis of phrases, sentences and chapters. The method has good feasibility, high accuracy and high efficiency;

secondly, the method provided by the invention treats the mining of new emotional characteristic words as a sequence labeling problem and mines them directly with the improved CRF algorithm. Many studies treat the mining of new emotional characteristic words as a part of new word discovery, but new word discovery has always been a difficult problem in Chinese information processing and involves large-scale corpus statistical analysis, language structure, semantics, word segmentation, filtering and so on, so mining emotional characteristic words on that basis gives unsatisfactory results. If the mining of new emotional characteristic words is instead regarded as a sequence labeling problem rather than a word classification problem, the emotional characteristic words are identified directly and treated as a whole, which avoids many complex processing steps. With the improved CRF model, the method better learns features of words such as their language environment, context and part of speech, needs no new word discovery process, and directly labels and learns the new emotional characteristic words in sentences, so the process is simpler and more efficient;

thirdly, the method provided by the invention adopts a Z-label LDA optimization model for word-level classification. LDA is generally an unsupervised algorithm used for text clustering; it performs topic classification on large-scale documents and has achieved good results in the field of text classification;

fourthly, the method simplifies the whole process of mining and polarity identification of emotional characteristic words. In both stages, everything that is required, including the training corpus and the seed words needed for supervision, is obtained by algorithmic statistical identification, and existing dictionaries and evaluation data resources are used as little as possible, which avoids the subjectivity and arbitrariness of manual selection. Mining new emotional characteristic words from microblog data does not need large-scale data calculation, and only a small amount of training data is needed to train the model. Compared with other complex algorithms, the method is simple and easy to operate, and its emotion judgment is more accurate;

fifthly, the idea of the Z-label LDA optimization model algorithm is extended to classify the emotion types of emotional characteristic words. The method depends less on extra resources: it uses only some known emotional characteristic words and classifies the polarity of emotional characteristic words well without training corpus resources. For both the mining of new emotional characteristic words and polarity identification, the method approaches the problem from a different angle. It converts a large-scale statistical mining problem into a labeling problem and classifies emotional characteristic words with the Z-label LDA optimization model, without any calculation over large-scale corpora. This greatly improves the efficiency of discovering new emotional characteristic words and makes the method more feasible in practical application, so that it is a simple, efficient and highly practical intelligent dynamic mining and classifying method for Chinese emotional characteristic words.

Drawings

FIG. 1 is a schematic diagram of the learning and testing process of a CRF improvement algorithm.

FIG. 2 is an exemplary diagram of a topic model of the present invention.

FIG. 3 is a graphical model diagram of the LDA model and its variable distributions according to the present invention.

FIG. 4 is a flow chart of an algorithm based on a CRF improved algorithm and a Z-label LDA optimization model.

Detailed Description

The technical scheme of the intelligent dynamic Chinese emotion characteristic word mining and classifying method provided by the invention is further described below with reference to the accompanying drawings, so that the technical scheme can be better understood and implemented by those skilled in the art.

The rapid development of network technology generates a large amount of data, and mass internet data has very high potential value, so analyzing and processing this data is very important. Emotion analysis of internet information is particularly important: it can provide decision support for enterprises based on customers' preferences, give customers more effective product recommendations, and let governments understand public opinion more effectively from netizens' feedback on events. Emotion analysis in the prior art relies to a great extent on ready-made emotion feature dictionaries to judge the emotional tendency of sentences and chapters. However, the internet language environment changes rapidly and a large number of new network words appear every day; a static emotion feature dictionary cannot cover these new words well, which seriously affects the overall emotion analysis effect. Aiming at these defects of the prior art, the invention provides a method for mining and classifying the emotional feature words appearing in Chinese text without depending on an existing emotional feature dictionary. The method is mainly divided into two parts: intelligent mining of emotional feature words based on a CRF improved algorithm, and polarity identification classification of emotional feature words based on a Z-label LDA optimization model.

secondly, in the emotion feature word polarity identification stage, the emotion feature words are classified with the Z-label LDA optimization model algorithm: the topics in the LDA algorithm are mapped to word category information, the topic information is defined by seed words to realize partially supervised learning, and the polarity or category of a target emotion feature word is then judged from the topic-word distribution.

Experiments and comparison with other methods show that the method of the invention is clearly superior to other prior-art methods in emotion feature word mining, needs no large amount of word frequency statistics, and is more concise and faster. The method also greatly improves polarity identification of emotional feature words. Further experiments on emotion classification of the emotional feature words with the Z-label LDA optimization model show that the method is practical and efficient.

The emotion feature word mining of the invention works at the word level, and the specific requirements are as follows. For a given large-scale microblog data set, new emotion words must be intelligently mined and their polarity (positive, negative or neutral) identified. In the evaluation, a new word is defined as a word that is not in the dictionary. The input of the task is a microblog sentence, and the output is the new emotion words appearing in that sentence together with their polarity; a set of new emotion words is thus extracted from the input text. A complete solution for the task comprises the following parts. The first is the input text, namely the data set. The second is the new emotion word mining algorithm, which can either be split into two steps, one mining new words and the other mining emotion feature words, or be completed directly by a single algorithm. The third is polarity identification (polarity classification) of the emotion feature words. The last is external resources, including an existing emotion feature dictionary, stop words used for filtering, semantic rules, a word segmentation system and the like; after being organized, these resources can be used by the word mining and classification algorithms.

The invention aims to intelligently identify new words appearing in input sentences or chapters and their emotional polarity. In the experiments the task is divided into two parts: the first is mining the emotional feature words, and the second is identifying their polarity. The mining process needs no other resources such as dictionaries, the algorithm is efficient, and the method can even be used for online judgment.

First, intelligent mining of emotional feature words based on the CRF improved algorithm

(I) CRF algorithm improvement

CRF is a statistics-based undirected graph model algorithm for sequence identification (labeling) and sequence segmentation, which identifies serialized data through a probability model. Such probability models are of two types. One is the generative model, which constructs the joint distribution Q(y, c) of the observation sequence y and the identification sequence c over all possible observation sequences to obtain a probability density model, and then predicts the identification sequence. The other is the discriminative model, which calculates the probability of each possible identification sequence given the observation sequence, that is, it computes the conditional probability distribution Q(c | y) of c given y and uses the resulting discriminant function to predict the target sequence. CRF belongs to the second type.

The probability graph model combines probability theory and graph theory and represents the probabilistic dependencies between random variables with a graph: if an edge exists between two nodes, the two variables are considered dependent; otherwise they are independent of each other. For a CRF this gives the following Markov property, in which Z is taken to denote the observations, O_u the identification variable at node u, and v ~ u means that v and u are adjacent in the graph:

Q(O_u | Z, O_v, v ≠ u) = Q(O_u | Z, O_v, v ~ u)

The CRF improved algorithm has no strict independence requirement and can incorporate arbitrary context information, so feature design is more flexible. Its loss function is convex, so the computed conditional probability can reach the global optimum. It estimates a probability distribution over the whole sequence by maximum likelihood and is not restricted by per-state normalization. The CRF improved algorithm of the invention has advantages in several respects:

firstly, in word segmentation, prior-art dictionary-based word segmentation systems perform poorly on new words (unknown words) and ambiguous words. The CRF improved algorithm identifies characters directly and then forms words from the identified characters, so it takes the frequency of characters into account and learns their context well, which gives it a great advantage in mining unknown words;

secondly, in part-of-speech identification, a major disadvantage of the HMM in sequence identification is that its extraction of context information is limited by the independence assumption of the algorithm itself, while context information is crucial to part-of-speech identification. The MEMM can take context information into account, but because it normalizes locally it can only find a locally optimal solution and may suffer from the label bias problem. The CRF improved algorithm solves both problems well;

the invention therefore adopts the CRF improved algorithm to mine emotional feature words: its ability to discover unknown words lets it mine them better, and its context-based identification can take the features of emotional words into account, so the method has clear advantages in mining emotional feature words, including new ones.

The learning and testing process of the CRF improved algorithm is shown in FIG. 1: first, the data set is segmented into words, and part-of-speech tags and other features are added; the model is then trained with the CRF (conditional random field) improved algorithm; the test set is formatted in the same way and tested with the trained model; and the emotional feature word set is obtained after post-processing.

(II) selecting emotional characteristics

In the prior art, emotion tendency analysis focuses mainly on judging the emotional tendency of sentences and on emotion analysis of chapters; there is less research on constructing emotion feature lexicons. Compared with the sentence and chapter levels, word-level emotion analysis is finer-grained and has fewer usable context rules. In the emotional feature word mining process based on the CRF improved algorithm, the following features of emotional feature words are used:

(1) Character features. The features of the characters themselves are an important aspect of emotional feature words: some characters carry an emotional tendency in themselves, and the context in which emotional feature words appear also has certain characteristic features.

(2) Part-of-speech features. Part of speech is an obvious feature of emotional feature words, which are mostly adjectives, interjections and nouns. In their context, nouns and pronouns appear more often before them, and auxiliary words, interjections and punctuation appear more often after them, so the part of speech of the context is also taken as an important feature.

(3) Degree adverb features. Degree adverbs are often used before emotional feature words to intensify the emotion, so they are also used as part of the extracted features.
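The three feature types above can be illustrated with a minimal sketch that turns one segmented, part-of-speech-tagged sentence into per-character rows. The B/I/O tag set, the tiny degree adverb list, the part-of-speech codes and the function name `char_rows` are assumptions made for illustration; the source does not specify them.

```python
# Hypothetical degree-adverb list; a real system would load a fuller lexicon.
DEGREE_ADVERBS = {"很", "非常", "太", "特别"}

def char_rows(tagged_words, emotion_words):
    """Turn a POS-tagged, segmented sentence into per-character feature rows.

    tagged_words: list of (word, pos) pairs.
    emotion_words: set of known emotional feature words (training labels).
    Each row is (char, pos_of_word, follows_degree_adverb, tag), where tag is
    B (first character of an emotion word), I (later character) or O (other).
    """
    rows = []
    prev_is_degree = False
    for word, pos in tagged_words:
        is_emotion = word in emotion_words
        for i, ch in enumerate(word):
            if not is_emotion:
                tag = "O"
            elif i == 0:
                tag = "B"
            else:
                tag = "I"
            rows.append((ch, pos, "Y" if prev_is_degree else "N", tag))
        # Remember whether this word is a degree adverb for the next word.
        prev_is_degree = word in DEGREE_ADVERBS
    return rows

# Illustrative sentence: "I am very happy!"
sentence = [("我", "r"), ("非常", "d"), ("开心", "a"), ("！", "w")]
rows = char_rows(sentence, {"开心"})
```

Here the two characters of 开心 receive the tags B and I and are flagged as following a degree adverb, while every other character is tagged O.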

In the process of mining emotional feature words with the CRF algorithm, a method based on character identification and a method based on combined character and word identification are both adopted, and the effects of the two methods are compared.

Both groups of compared algorithms are therefore character-based identification aimed at mining new emotional feature words, and the training corpora are labeled by intelligent identification methods without manual participation. The first, purely character-based, method works like word segmentation: the characters are identified directly, and emotional feature words are mined entirely from character features. The second method in the comparison experiment combines character and word identification and adds part-of-speech features. The identification adopted by the invention is character-based with part-of-speech features added. In word-based identification, the corpus is first segmented into words, with the word segmentation system assumed to be able to segment unknown words directly; after segmentation, the segmented new emotional feature words are identified directly, and the model is then obtained through training.

(III) emotional feature word mining of CRF improved algorithm

The invention uses the CRF++ toolkit, an implementation of the CRF algorithm in C++. The training set, test set and feature template are constructed in the formats required by the toolkit; a prediction model is generated from the feature template and the training set, and the test set is then predicted and evaluated. The feature template directly determines how the emotional word features are obtained.
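As a hedged illustration of the toolkit formats mentioned above, the sketch below serializes labeled rows into the CRF++ column layout (one token per line, whitespace-separated columns, tag in the last column, an empty line between sentences) and defines a small feature template. The column choice, the template contents and the helper name `to_crfpp` are assumptions; the source does not give the actual templates.

```python
def to_crfpp(sentences):
    """Serialize sentences into CRF++ column format.

    sentences: list of sentences, each a list of rows (tuples of strings)
    whose last element is the tag. Columns are joined with tabs and
    sentences are separated by one empty line.
    """
    blocks = []
    for rows in sentences:
        blocks.append("\n".join("\t".join(row) for row in rows))
    return "\n\n".join(blocks) + "\n"

# Illustrative feature template: unigram features over the character column
# (column 0) and the part-of-speech column (column 1) in a small window,
# plus the bigram-of-tags feature "B".
TEMPLATE = "\n".join([
    "U00:%x[-1,0]",
    "U01:%x[0,0]",
    "U02:%x[1,0]",
    "U03:%x[0,1]",
    "U04:%x[-1,0]/%x[0,0]",
    "B",
])

data = to_crfpp([
    [("开", "a", "B"), ("心", "a", "I")],
    [("你", "r", "O")],
])
```

With the data and template written to files, the toolkit's command-line tools would typically be run as `crf_learn template train.data model` and `crf_test -m model test.data`; the file names here are placeholders.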

Several comparison experiments were carried out. Character-based identification cannot distinguish new emotional feature words from old ones, so the training set uses the emotion mining corpus of NLP&CC: all emotional feature words are mined first, and the old dictionary is then used to filter them, leaving the new emotional feature words. For the word and part-of-speech combined features, the training set uses 4000 items of mined corpus data, and 2632 items of evaluation corpus data are used as the test set. Comparisons of part-of-speech and word features were also added: the experimental effect is compared by adding or removing templates from the template set, and the character-based features and the combined word and part-of-speech features are also compared experimentally.
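The dictionary filtering step described above amounts to removing every mined word that the old dictionary already contains. A minimal sketch, with a hypothetical function name and made-up example words:

```python
def new_emotion_words(mined_words, old_dictionary):
    """Keep mined emotion words that the existing dictionary does not contain,
    preserving first-occurrence order and dropping duplicates."""
    seen = set()
    result = []
    for word in mined_words:
        if word not in old_dictionary and word not in seen:
            seen.add(word)
            result.append(word)
    return result

# Illustrative data only: words mined by the model and an old dictionary.
mined = ["开心", "给力", "伤心", "给力", "坑爹"]
old = {"开心", "伤心"}
fresh = new_emotion_words(mined, old)
```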

The above gives the algorithmic basis and processing flow for mining emotional feature words with the CRF (conditional random field) improved algorithm: it first introduces the CRF improved algorithm, then the features selected for use with it, and finally the specific method for identifying and mining emotional feature words.

Second, sentiment feature word polarity identification classification based on Z-label LDA optimization model

After the emotional feature words are obtained, the corpus containing them is preprocessed, the Z-label LDA optimization model is trained on it, and the emotional feature words are classified. The polarity identification algorithm of the invention based on the Z-label LDA optimization model is a classification process, and it is also extended to classifying emotional feature words by emotion category. The algorithm is described in detail below.

(I) Z-label LDA optimization model

LDA is a modeling method for discrete data sets, based on the following assumption: there is a set R of topics such that each text in the document set A contains one or more topics, each topic is a probability distribution over a series of words, and the latent topics in the documents can be discovered through probabilistic learning, as shown in FIG. 2. From the perspective of words, a topic in R is understood as:

1) a common related subject matter of at least one document content;

2) some or all of the semantic information expressed by the document;

3) a pattern of related word co-occurrence;

4) clustering co-occurrence words;

5) a sort of classification of words.

From a probabilistic perspective, one topic is understood as:

1) probability distribution of words;

2) a set of words ranked according to probability of word co-occurrence.

LDA is a three-layer Bayesian probabilistic topic model for discrete data sets. The topic and word distributions are given Dirichlet-distributed parameters, which keeps the topic parameters from becoming uncontrollable as the document set changes, and a fully probabilistic generative model is used for the generation of a document, so the hierarchy is clearer.

The LDA model assumes W latent topics, each of which is a multinomial distribution over words. A document is regarded as a sample of word sequences drawn from the topics, and each word depends on a specific topic: the document selects a topic with a certain probability and then selects a word from that topic with a certain probability, and this process is iterated. A document set A is denoted {k_1, k_2, …, k_N}, a document is denoted k = {j_1, j_2, …, j_M}, and the generation process of a document in the document set is expressed as:

1. Select g ~ Dir(b);

2. For each of the M words j_m:

1) Select a topic x_m ~ Mult(g);

2) Select a word j_m from the multinomial distribution of topic x_m, with parameter a.

Here b is the parameter of the Dirichlet prior; it ensures that each document contains all topics and controls the mean and sparsity of g, and g is calculated from the document set in order to estimate b. a is the multinomial distribution parameter of topic-words, which ensures that each topic contains all words; it is expressed as a W × U matrix, where W is the number of possible topics and U is the number of possible words, and each row is a U-dimensional vector whose entries sum to 1. The graph model of the generation process is shown in FIG. 3.
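The generation process can be sketched in a few lines using only the Python standard library. This is an illustrative sketch rather than the patent's implementation: the word-drawing step follows the standard LDA formulation, and the function names, the toy matrix `a` and the parameter values are assumptions.

```python
import random

def sample_dirichlet(b, rng):
    """Draw g ~ Dir(b) by normalizing independent Gamma(b_w, 1) draws."""
    draws = [rng.gammavariate(b_w, 1.0) for b_w in b]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(b, a, num_words, rng):
    """Generate one document under LDA.

    b: Dirichlet parameter, one positive value per topic (length W).
    a: topic-word matrix, W rows of U probabilities, each row summing to 1.
    Returns (topics, words): topic index x_m and word index j_m per position.
    """
    g = sample_dirichlet(b, rng)                      # step 1: g ~ Dir(b)
    topics, words = [], []
    for _ in range(num_words):                        # step 2: each word
        # 1) choose a topic x_m ~ Mult(g)
        x_m = rng.choices(range(len(g)), weights=g)[0]
        # 2) choose a word from the word distribution of topic x_m
        j_m = rng.choices(range(len(a[x_m])), weights=a[x_m])[0]
        topics.append(x_m)
        words.append(j_m)
    return topics, words

rng = random.Random(0)
a = [[0.7, 0.2, 0.1, 0.0],    # topic 0 prefers word 0
     [0.0, 0.1, 0.2, 0.7]]    # topic 1 prefers word 3
topics, words = generate_document([0.5, 0.5], a, num_words=8, rng=rng)
```

Small values of b (below 1) make g sparse, so most words in a document come from a few topics, which matches the role of b described above.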

the joint probability of LDA obtained from the graph is:
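In the notation above, the standard LDA joint probability for a single document takes the following form; this is the textbook expression, given here as an assumption about the formula intended, with g the topic proportions of the document, x_m the topic of word j_m, b the Dirichlet parameter and a the topic-word matrix:

```latex
q(g, x, j \mid b, a) = q(g \mid b) \prod_{m=1}^{M} q(x_m \mid g)\, q(j_m \mid x_m, a)
```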

the key in LDA is to calculate the posterior probability of the latent variables:

(1) Initialize an assignment for each variable in the set Z;

(2) Sample each variable in turn from its conditional probability distribution given all the other variables, which is computed jointly from the variable values obtained in previous sampling and the values of the variables not yet sampled;

the value just sampled is in turn used in calculating the conditional probability of the next sample, to estimate the other variable values.

Is provided with

And

wherein

q(x_i = u | x_-i, k, b, a) ∝ q(x_i = u | x_-i, k) f(u ∈ S⁽ⁱ⁾)

Here f(u ∈ S⁽ⁱ⁾) is taken to be an indicator that equals 1 if u ∈ S⁽ⁱ⁾ and 0 otherwise. If x_i is to be constrained to the two particular values 1 and 2, only S⁽ⁱ⁾ = {1, 2} needs to be set. The above formula gives latent topic inference a flexible way of using existing knowledge about the target: a predefined topic set S⁽ⁱ⁾ is set independently for each word in the corpus. To relax the strength of the constraint, an additional variable h with 0 ≤ h ≤ 1 is added; h = 1 is a hard constraint and h = 0 is unconstrained sampling:

q(x_i = u | x_-i, k, b, a) ∝ q(x_i = u | x_-i, k)(h f(u ∈ S⁽ⁱ⁾) + 1 - h)
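The effect of h can be seen in a small sketch that reweights one word's unconstrained sampling probabilities by the factor h·f(u ∈ S⁽ⁱ⁾) + 1 - h and renormalizes. The function names are hypothetical, f is assumed to be a 0/1 indicator, and a word without a predefined topic set is assumed to be unconstrained.

```python
import random

def z_label_weights(base_probs, allowed, h):
    """Apply the soft Z-label constraint to one word's sampling distribution.

    base_probs: unconstrained values q(x_i = u | x_-i, k) for each topic u.
    allowed: the predefined topic set S(i) for this word, or None if the
             word carries no label (assumed unconstrained).
    h: constraint strength, 0 <= h <= 1.
    Returns normalized probabilities proportional to
    base_probs[u] * (h * f(u in S(i)) + 1 - h).
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must lie in [0, 1]")
    weights = []
    for u, p in enumerate(base_probs):
        indicator = 1.0 if allowed is None or u in allowed else 0.0
        weights.append(p * (h * indicator + 1.0 - h))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("constraint leaves no topic with positive probability")
    return [w / total for w in weights]

def sample_topic(base_probs, allowed, h, rng=random):
    """Draw one topic index from the constrained distribution."""
    probs = z_label_weights(base_probs, allowed, h)
    return rng.choices(range(len(probs)), weights=probs)[0]

base = [0.5, 0.3, 0.2]
hard = z_label_weights(base, {1, 2}, h=1.0)   # topic 0 is excluded
soft = z_label_weights(base, {1, 2}, h=0.5)   # topic 0 is down-weighted
free = z_label_weights(base, {1, 2}, h=0.0)   # constraint ignored
```

With h = 1 the word can only be assigned to topics in its predefined set, while intermediate values let strong contextual evidence override a seed label.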

(II) classifying emotional characteristic words

With the Z-label LDA optimization model, the vocabulary size of the training data is calculated in advance, the words at the corresponding positions of the text are identified with the predefined topic numbers according to the predefined rules, and the vocabulary-index matrix corresponding to the text is computed. Training yields a topic-word model and a document-topic model; statistical analysis of the topic-word model then gives the classification of words under the predefined topics.

In polarity identification of the emotional feature words, the context of the sentence in which an emotional feature word occurs, that is, its co-occurrence with other words, is first used to judge its polarity. Polarity identification is regarded as a classification process in which emotional feature words are divided into two classes, positive and negative. The generation process of a topic model can be regarded as each word in the text selecting a topic with a certain probability and a word then being selected from that topic with a certain probability. Corresponding to this process, the invention classifies the emotional feature words with the supervised Z-label LDA optimization method, in which the topics of the Z-label LDA optimization model correspond to the categories into which words are classified.

The emotion characteristic word polarity identification step of the Z-label LDA optimization model comprises the following steps:

The algorithm is as follows:

The invention judges the polarity of the emotional feature words with the Z-label LDA optimization model, that is, divides them into two classes; the emotional feature words are also classified into multiple classes. Examples of the seven classes of emotional feature words are as follows:

1) joy: happy, graceful, relaxed, happy face, smiling, comedy, happy and college;

2) liking: magical, crisp, excellent, beautiful, straight, reliable, worship, intelligibility, acknowledged, treasure and German high-looking;

3) anger: anger, rage, irritation, anger, fierce discolouration, council, liver fire, and fire thriving;

4) sorrow: poor, sour, mental retardation, bitter, solitary, pain and pithy, ashen and cold mind, downy mildew, desperate and grieve;

5) fear: panic, fear, panic, uneasy, fear, palpitation, charming, embarrassment, loss of one's life, tooth arousing, and difficulty in feeling;

6) disgust: depressed, boring, taking up, unknown and jeopardy, jealousy, suspicion, hesitation and wise behavior;

7) surprise: surprisal, starry, strange, unexpected, incredible, surprise, touching the eyes and surprise;

The category of a word can be judged in two ways, one of which is based on the topic-word distribution. In training the topic model, the number of topics is far smaller than the number of words, so words show a co-occurrence clustering effect; combined with supervision, this gives a weak classification of emotion feature word categories, and the category of an unknown word is obtained by comparing its distribution probabilities. The rule file added to the algorithm contains the predefined topics, defined according to the types of emotional feature words, together with their seed words. Through training on the input data, the target emotional feature words are clustered into the predefined types, and their emotional tendency is judged from the clustering strength.
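Judging a category from the topic-word distribution can be sketched as picking, for a given word, the predefined topic under which that word is most probable. The function name, the toy distribution and the use of the winning topic's share as a "clustering strength" are assumptions made for illustration, and equal topic weights are assumed.

```python
def classify_word(word_id, topic_word, topic_names):
    """Assign a word to the predefined topic under which it is most probable.

    topic_word: one row per topic; topic_word[t][w] is the probability of
                word w under topic t (the learned topic-word distribution).
    topic_names: category label of each predefined topic.
    Returns (label, strength), where strength is the share of the word's
    probability mass that falls under the winning topic.
    """
    scores = [row[word_id] for row in topic_word]
    total = sum(scores)
    if total == 0.0:
        return None, 0.0
    best = max(range(len(scores)), key=scores.__getitem__)
    return topic_names[best], scores[best] / total

# Toy topic-word distribution over a four-word vocabulary.
topic_word = [
    [0.50, 0.30, 0.15, 0.05],   # topic 0: positive
    [0.05, 0.10, 0.25, 0.60],   # topic 1: negative
]
label, strength = classify_word(3, topic_word, ["positive", "negative"])
```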

Third, examples and evaluation of effects

(I) Design of the embodiment

As a typical representative of social networks, the microblog spreads rapidly, carries a large amount of information, reaches a wide audience and is colloquial in style; it is also an important channel through which netizens obtain news and entertainment information, publish opinions and interact with other users. Microblog corpora therefore carry strong emotional tendencies. Microblogs are also one of the main places where new network words arise, and discovering new words and analyzing their polarity is of great significance for Chinese emotional tendency analysis. Microblog data is therefore taken as the evaluation data set of this task, whose aim is to intelligently mine the new emotion words (defined as words that do not appear in the dictionary) appearing in microblogs and their emotional tendency (positive, negative or neutral).

For given microblog data, the emotional feature words in it are intelligently mined and their polarity identified. On this basis the emotional feature words are classified in more detail and tested, using the seven-class emotion classification standard of the emotion vocabulary ontology, and practical verification shows the word classification effect of the algorithm.

The input of the Z-label LDA optimization model must be formatted as the algorithm requires, namely as words and word frequencies. First, the identified emotional feature words are added to the word segmentation dictionary so that they are not split during segmentation. The invention uses the NLPIR word segmentation tool to segment the text and construct a vocabulary, and then converts the input into a numeric matrix. An example of the processing is as follows:

Word list: honor and friend abundantly, that is, enemy becomes friend and trusts others to obtain

Input: The enemy can become a friend; trust others as well, and also obtain the trust of others.

Output document: [1,2,3,4,5,6,7,8,9,10,11,9]
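The conversion in this example can be sketched as follows. The 1-based indexing is taken from the example output; the token list, the marker `trust_n` for the noun use of "trust" and the function names are illustrative assumptions, since the original segmentation is not recoverable from the text.

```python
def build_vocabulary(tokens, vocabulary=None):
    """Assign each distinct token a 1-based index in order of first appearance."""
    vocabulary = {} if vocabulary is None else vocabulary
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary) + 1
    return vocabulary

def encode(tokens, vocabulary):
    """Replace each token by its vocabulary index."""
    return [vocabulary[token] for token in tokens]

# A segmented sentence in which one token ("others") occurs twice.
tokens = ["trust", "others", "and", "gain", "trust_n", "of", "others"]
vocab = build_vocabulary(tokens)
doc = encode(tokens, vocab)
```

A token that occurs twice maps to the same index both times, which is why the index 9 appears twice in the output document above.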

After the input file is formatted, a supervised rule file must be defined. According to the purpose, a matrix corresponding to each document is output and identified: at the positions in the matrix that correspond to emotional feature words in the document, the topic is identified as the predefined emotional feature word topic. The topic rules are of two kinds, two-class and multi-class.

The Z-label LDA optimization model is a supervised topic model algorithm. Each category in the rule file appears as a topic in the algorithm, and the supervision lies in the fact that part of the topic-word assignments are defined before the model is trained. Through training, the algorithm extends the topics on the basis of the existing topic categories, adding new topics, and assigns each mined emotional feature word to a predefined rule with a certain probability. The words in a rule are ultimately represented numerically by their sequence numbers in the vocabulary. Alongside the matrix corresponding to each sentence, a second, topic supervision matrix is formed, in which the numeric identifier of the topic category of each emotional feature word is placed at the position where that word appears.
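The topic supervision matrix described above can be sketched as a row aligned with each encoded document. The function name, the use of `None` for unconstrained positions and the seed mapping are assumptions for illustration; the source does not give the rule file format.

```python
def topic_labels(doc, seed_topics):
    """Build the topic supervision row for one encoded document.

    doc: list of vocabulary indices, one per token position.
    seed_topics: maps a seed word's vocabulary index to its predefined
                 topic number.
    Returns a list aligned with doc holding the topic number at positions
    occupied by seed words and None elsewhere (unconstrained positions).
    """
    return [seed_topics.get(word_id) for word_id in doc]

# Hypothetical rule file content: topic 0 = positive, topic 1 = negative.
seed_topics = {3: 0, 7: 1}
doc = [1, 2, 3, 4, 7, 3]
labels = topic_labels(doc, seed_topics)
```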

The overall algorithm is tested, and the test results are analyzed and evaluated. Following the overall framework of the algorithm, the test is divided into two parts, mining of emotional feature words and polarity judgment; each part is tested and its results are analyzed against the evaluation standard. The flow chart of the overall algorithm is shown in FIG. 4.

In the emotion feature word mining stage, the data is preprocessed into the input format required by the CRF (conditional random field) improved algorithm and a model is trained; test data is then input, and the emotion feature words are collected from the identification results. In the classification stage, the corpus and the emotional feature word set are input, preprocessed, and passed to the Z-label LDA optimization model algorithm for learning; the resulting topic-word distribution is compared with the vocabulary to obtain the category of each emotional feature word.

(II) evaluation of Effect

With the rapid development of internet big data, algorithms for analyzing and processing data have become popular applications, and emotional tendency analysis is an important part of them. Analyzing the large number of personal views on the internet, including commodity comments, microblog data, news comments and customer feedback, plays an important role in practice: for example, an e-commerce platform makes targeted product plans according to commodity comments, and a government department needs to respond in time to the emotional tendency of online public opinion about social events. In the prior art, emotion analysis for Chinese started relatively late, and the particularity of Chinese word formation poses certain challenges to research and application. The ability to intelligently mine new emotional feature words in text plays an important role in emotional tendency analysis. Prior-art Chinese emotional tendency analysis depends heavily on dictionary resources, and dictionaries collect the new emotional feature words appearing on the network incompletely, which directly affects the judgment of sentence- and chapter-level emotional tendency. Against this background, the invention makes the following main contributions to mining and polarity identification of new emotional feature words appearing in network environments such as microblogs:

firstly, emotional feature words, including new ones, appearing in microblogs are intelligently mined with the CRF improved algorithm. By combining the context information, part-of-speech features and degree adverb features of the words, the problem of mining emotional feature words is reduced directly to a sequence identification (labeling) problem, which realizes intelligent mining; comparison experiments on the effects of the various features demonstrate the effectiveness of the selected features for mining new emotional feature words;

secondly, judging the polarity of emotional feature words is regarded as a classification problem, and the words are classified with the Z-label LDA optimization model. The topic model idea of the algorithm uses the context information of the words and existing dictionary resources to realize supervised learning and to infer the polarity of emotional feature words whose polarity is unknown.

Thirdly, the idea of the Z-label LDA optimization model algorithm is extended to other word classification problems, namely classifying the emotion types of the emotional feature words, and experiments demonstrate the effectiveness of the algorithm. The method depends little on extra resources: it uses only a part of the known emotional feature words and can classify the polarity of emotional feature words well without training corpus resources.

For mining new emotional feature words and identifying their polarity, the method approaches the problem from a different angle: it converts a large-scale statistical mining problem into an identification problem, and for polarity identification it classifies the emotional feature words with the Z-label LDA optimization model.

Claims

1. The intelligent dynamic mining and classifying method for the Chinese emotional characteristic words is characterized by mainly comprising two parts, namely intelligent mining of the emotional characteristic words based on a CRF improved algorithm and polarity identification classification of the emotional characteristic words based on a Z-label LDA optimization model; the intelligent mining of the emotional feature words based on the CRF improved algorithm comprises the improvement of the CRF algorithm, the selection of emotional features and the mining of the emotional feature words of the CRF improved algorithm, and the polarity identification classification of the emotional feature words based on the Z-label LDA optimization model comprises the Z-label LDA optimization model and the classification of the emotional feature words;

2. The intelligent dynamic Chinese emotion feature word mining and classifying method according to claim 1, wherein the improvement of the CRF algorithm: CRF is a statistics-based sequence identification and sequence segmentation undirected graph model algorithm, serialized data is identified through a probability model, a probability model generating model constructs combined distribution Q (y, c) of an observation sequence y and an identification sequence c through all possible observation sequences to obtain a probability density model, and finally the identification sequence is predicted, the other type is a discriminant model, the possible identification probability of the identification sequence under the condition of the given observation sequence is calculated, and a discriminant function prediction model is generated to predict a target sequence through calculating the conditional probability distribution Q (c | y) of y and c;

$$Q(O_u \mid Z, O_v, v \neq u) = Q(O_u \mid Z, O_v, v \sim u)$$
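As an illustration of the discriminative view, the conditional probability Q(c | y) of a linear-chain CRF can be sketched by scoring every candidate identification sequence and normalizing. This is a minimal toy example, not the patented improvement: the tag names, feature weights, and helper functions below are assumptions introduced for illustration, and brute-force enumeration is only practical for very short sequences.

```python
import itertools
import math

def sequence_score(tags, words, emit, trans):
    """Unnormalized log-score of one tag sequence for the observed words."""
    score = 0.0
    for i, (tag, word) in enumerate(zip(tags, words)):
        score += emit.get((tag, word), 0.0)              # state feature weight
        if i > 0:
            score += trans.get((tags[i - 1], tag), 0.0)  # transition feature weight
    return score

def conditional_prob(tags, words, labels, emit, trans):
    """Q(c | y): the score of `tags` normalized over all possible tag sequences."""
    z = sum(
        math.exp(sequence_score(cand, words, emit, trans))
        for cand in itertools.product(labels, repeat=len(words))
    )
    return math.exp(sequence_score(tags, words, emit, trans)) / z

# Hypothetical tags and weights: E = emotion word, O = other.
labels = ["E", "O"]
words = ["很", "给力"]
emit = {("E", "给力"): 2.0, ("O", "很"): 1.0}
trans = {("O", "E"): 0.5}

probs = {
    cand: conditional_prob(cand, words, labels, emit, trans)
    for cand in itertools.product(labels, repeat=len(words))
}
best = max(probs, key=probs.get)  # highest-probability identification sequence
```

Because the normalization runs over whole sequences rather than individual positions, the probabilities of all candidate sequences sum to one for the given observation sequence.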

3. The intelligent dynamic Chinese emotion feature word mining and classifying method according to claim 2, wherein the CRF improved algorithm imposes no strict independence requirement and can incorporate arbitrary context information; the loss function of the CRF improved algorithm is a convex function; and the CRF improved algorithm computes the probability distribution over sequence data by maximum likelihood estimation, and is a state model which is not normalized and limited;

4. The intelligent dynamic Chinese emotion feature word mining and classifying method according to claim 1, wherein the emotion feature word features used in the emotion feature word mining process based on the CRF improved algorithm include:

5. The intelligent dynamic Chinese emotion feature word mining and classifying method according to claim 1, wherein, in the emotion feature word mining of the CRF improved algorithm: the CRF++ toolkit is used together with a training set, a test set, and a feature template; following the standard workflow of the toolkit, a prediction model is built from the feature template and the training set and is then used to predict and evaluate the test set; and the feature template directly determines how the emotion word features are obtained;
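The CRF++ workflow described here can be sketched as follows. The column layout (one token per line, label in the last column, blank line between sentences), the `%x[row,col]` template macros, and the `crf_learn` / `crf_test` commands are standard CRF++ conventions; the tag set, the part-of-speech codes, and the particular template lines are assumptions chosen for illustration and are not the template used in the patent.

```python
def to_crfpp_lines(sentences):
    """Render tagged sentences in CRF++ column format.

    Each sentence is a list of (word, pos, tag) triples. CRF++ expects one
    token per line, whitespace-separated columns with the label last, and a
    blank line between sentences.
    """
    lines = []
    for sentence in sentences:
        for word, pos, tag in sentence:
            lines.append(f"{word}\t{pos}\t{tag}")
        lines.append("")  # sentence boundary
    return "\n".join(lines) + "\n"

# Hypothetical template: unigram features over words (column 0), part of
# speech (column 1), and a word/POS combination, plus a bigram over labels.
TEMPLATE = """\
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[0,0]/%x[0,1]
B
"""

# Hypothetical training sentence with an assumed B-EMO / O tag scheme.
train = [[("这", "r", "O"), ("手机", "n", "O"), ("很", "d", "O"), ("给力", "a", "B-EMO")]]
data = to_crfpp_lines(train)

# Typical CRF++ commands once the files are written to disk:
#   crf_learn template train.data model
#   crf_test -m model test.data
```

Adding or removing template lines changes which context windows and feature combinations the model sees, which is how the template determines the way features are obtained.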

6. The intelligent dynamic Chinese emotion feature word mining and classifying method according to claim 5, wherein, because new and old emotion feature words cannot be distinguished on the basis of word identifiers, all emotion feature words are mined first and the old dictionary is then used as a filter to obtain the new emotion feature words; the training set uses the emotion mining corpora of NLP&CC, with 4000 items of the Chinese microblog emotion mining corpus used as the training set and 2632 items used as the test set; a comparison of part-of-speech and word features is further added, in which effects are compared by adding or removing template sets, and word-based features are compared with combined word and part-of-speech features;
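The filtering step, in which all emotion feature words are mined and the old dictionary is then removed, amounts to a set difference. A minimal sketch, assuming the mined words arrive as a list and the existing dictionary as a set (the function name and sample words are hypothetical):

```python
def new_emotion_words(mined_words, old_lexicon):
    """Keep mined words absent from the existing lexicon, in first-seen order."""
    known = set(old_lexicon)
    seen = set()
    result = []
    for word in mined_words:
        if word not in known and word not in seen:
            seen.add(word)
            result.append(word)
    return result

# Hypothetical mined output and existing dictionary.
mined = ["开心", "给力", "坑爹", "给力", "难过"]
old = {"开心", "难过"}
fresh = new_emotion_words(mined, old)
```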

7. The intelligent dynamic Chinese emotion feature word mining and classifying method according to claim 1, wherein, regarding the Z-label LDA optimization model: LDA is a method for modeling discrete data sets, based on the following assumption: there is a set of topics R such that every text in the document set A contains one or more topics; each topic is a probability distribution over a series of words, and the latent topics in the documents can be discovered through probabilistic learning; from the perspective of words, a topic in R is understood as:

1) a common related subject matter of at least one document content;

2) some or all of the semantic information expressed by the document;

3) a pattern of related word co-occurrence;

4) a cluster of co-occurring words;

5) a classification of words;

from a probabilistic perspective, one topic is understood as:

1) probability distribution of words;

2) a set of words arranged according to a probability of word co-occurrence;

8. The method of claim 7, wherein the LDA model assumes W latent topics, each topic being a multinomial distribution over words; a document is regarded as a sampling of the document's word-frequency sequence over the topics, and each word depends on a specific topic, that is, each document is produced by an iterative process of selecting a topic with a certain probability and then selecting a word from that topic with a certain probability; a document set A is represented as A = {k_1, k_2, …, k_N}, a document as k = {j_1, j_2, …, j_M}, and the generation process of a document in a specific document set is represented as:

1. Select g ~ Dir(b);

2. For each word j_m, m = 1, …, M:

1) Select a topic x_m ~ Mult(g);
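The generation process can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the word-sampling step follows the description of selecting a word from the chosen topic with a certain probability, the Dirichlet draw is implemented by normalizing Gamma samples, and the vocabulary, the topic-word table, and the function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(num_words, b, topic_word, vocab, rng):
    """Generate one document: g ~ Dir(b), then for each word a topic and a word."""
    g = sample_dirichlet(b, rng)                           # document-topic proportions
    topics, words = [], []
    for _ in range(num_words):
        x = rng.choices(range(len(g)), weights=g)[0]       # topic x_m ~ Mult(g)
        j = rng.choices(vocab, weights=topic_word[x])[0]   # word drawn from topic x_m
        topics.append(x)
        words.append(j)
    return g, topics, words

# Hypothetical vocabulary and two topic-word distributions (rows sum to 1).
vocab = ["开心", "给力", "难过", "坑爹"]
topic_word = [
    [0.5, 0.4, 0.05, 0.05],
    [0.05, 0.05, 0.5, 0.4],
]
rng = random.Random(0)
g, topics, words = generate_document(6, [0.5, 0.5], topic_word, vocab, rng)
```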

the joint probability of LDA is obtained as:

the key in LDA is to calculate the posterior probability of the latent (hidden) variables:

1) initializing assignment to each variable in the set Z;

is provided with

And

wherein

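$$q(X_i = u \mid X_{-i}, k, b, a) \propto q(X_i = u \mid X_{-i}, k)\, f(u \in S^{(i)})$$

$$q(X_i = u \mid X_{-i}, k, b, a) \propto q(X_i = u \mid X_{-i}, k)\,\bigl(h\, f(u \in S^{(i)}) + 1 - h\bigr)$$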
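The effect of the two proportionality relations on a single sampling step can be sketched as a reweighting of the unconstrained topic probabilities. The sketch assumes that f is an indicator function equal to 1 when topic u belongs to the allowed set S^(i) and 0 otherwise, and that h lies in [0, 1]; the function name and sample values are hypothetical.

```python
def zlabel_weights(base, allowed, h=1.0):
    """Reweight unnormalized topic probabilities for one word position.

    base[u] stands for q(X_i = u | X_-i, k); `allowed` is the set S^(i) of
    topics permitted for this word; h is the constraint strength.
    h = 1 gives the hard constraint and h = 0 leaves `base` unconstrained.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must lie in [0, 1]")
    weighted = [
        p * (h * (1.0 if u in allowed else 0.0) + 1.0 - h)
        for u, p in enumerate(base)
    ]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("no allowed topic has positive probability")
    return [w / total for w in weighted]

base = [0.2, 0.3, 0.5]                         # hypothetical unconstrained probabilities
hard = zlabel_weights(base, {0, 1}, h=1.0)     # topic 2 is excluded
soft = zlabel_weights(base, {0, 1}, h=0.5)     # topic 2 is down-weighted
plain = zlabel_weights(base, {0, 1}, h=0.0)    # constraint ignored
```

With h strictly between 0 and 1, topics outside S^(i) remain possible but are penalized, which tolerates noisy seed-word assignments.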

9. The intelligent dynamic Chinese emotion feature word mining and classifying method according to claim 1, wherein in the emotion feature word classification, the emotion feature word polarity identification step of the Z-label LDA optimization model is as follows:

The algorithm is as follows:

10. The intelligent dynamic mining and classifying method for Chinese emotional characteristic words according to claim 9, wherein word categories are determined in two ways, one of which is based on the topic-word distribution: when a topic model is trained, the number of topics is much smaller than the number of words, so the words exhibit a co-occurrence clustering effect; combining this with a supervised method achieves a weak classification of the emotional feature word categories, and once the distribution probabilities are determined, the categories of unknown words are obtained; the rule file added to the algorithm consists of predefined topics, defined according to the emotional feature word categories, together with their seed words; after training on the input data, the target emotional feature words are clustered into the predefined categories, and emotional tendency is judged according to the clustering strength;
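The determination of a word's category from the topic-word distribution can be sketched as choosing the predefined topic under which the word is most probable. In this minimal illustration the clustering strength is taken to be that topic's share of the word's probability mass across topics; this definition, the function name, and the sample probabilities are assumptions, since the claim does not fix a formula.

```python
def classify_word(word, topic_word_prob, topic_names):
    """Return (category, strength) for the topic under which `word` is most probable.

    topic_word_prob[t] maps words to their probability under topic t.
    Strength is the winning topic's share of the word's probability mass
    across topics; unseen words return (None, 0.0).
    """
    scores = [dist.get(word, 0.0) for dist in topic_word_prob]
    total = sum(scores)
    if total == 0.0:
        return None, 0.0
    best = max(range(len(scores)), key=scores.__getitem__)
    return topic_names[best], scores[best] / total

# Hypothetical predefined topics and trained topic-word probabilities.
topic_names = ["positive", "negative"]
topic_word_prob = [
    {"开心": 0.4, "给力": 0.3},
    {"难过": 0.5, "给力": 0.05},
]
category, strength = classify_word("给力", topic_word_prob, topic_names)
unknown = classify_word("未见词", topic_word_prob, topic_names)
```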