CN114722835A - Text emotion recognition method based on LDA and BERT fusion improved model - Google Patents

Text emotion recognition method based on LDA and BERT fusion improved model

Info

Publication number
CN114722835A
CN114722835A (application CN202210447516.5A)
Authority
CN
China
Prior art keywords
text
model
word
lda
bert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210447516.5A
Other languages
Chinese (zh)
Inventor
朱李玥
戴梦瑶
刘文强
邢莉娟
柏雪嫣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202210447516.5A priority Critical patent/CN114722835A/en
Publication of CN114722835A publication Critical patent/CN114722835A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/163Handling of whitespace
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/53Processing of non-Latin text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a text emotion recognition method based on an LDA and BERT fusion improved model, which comprises the following steps: (1) acquiring social network text and preprocessing it; (2) fusing the semantic features and topic features of the text and outputting a word vector matrix; (3) inputting the features into a bidirectional Transformer encoder, connecting it to a Softmax layer improved by gradient optimization, and outputting a classification model; (4) feeding formal corpora into the classification model and fine-tuning its parameters to improve the model. Performing emotion recognition on social network text with the resulting final classification model yields a more accurate recognition result.

Description

Text emotion recognition method based on LDA and BERT fusion improved model
Technical Field
The invention relates to a text emotion recognition method based on an LDA and BERT fusion improved model, and belongs to the technical field of text data recognition.
Background
With the arrival of the big data era and the rapid development of 5G networks, the Internet has gradually moved toward an open, user-centred architecture, and the release of network information has increasingly shifted from "timely" to "real time". Internet users have been transformed from recipients of information into publishers. Social networks, as platforms where information can easily be published and retrieved, attract more and more users to post emotional text about their personal lives and about news and current events. Therefore, accurately, timely and effectively acquiring the emotion information of social network text has important practical value. There are currently three common approaches to text sentiment analysis: emotion analysis based on an emotion dictionary, emotion analysis based on machine learning, and emotion analysis based on deep learning.
The analysis method based on an emotion dictionary is the earliest sentiment analysis approach: the preprocessed text is matched against the words in an emotion dictionary, an emotion score is computed from the matching degree, and the emotion polarity is judged; the calculation is simple but the accuracy is low. The key lies in the construction of the emotion dictionary. Most traditional construction methods are based on semantic similarity, the core idea being to measure the distance between a candidate word and positive and negative emotion labels, typically with the pointwise mutual information (PMI) measure. In recent years, with the rapid development of artificial intelligence, dictionary construction methods based on machine learning and deep neural networks have also been proposed. Although such methods are flexible and convenient, the constructed emotion dictionary is generally restricted to its own field, so the universality of dictionary-based methods is poor.
The emotion analysis method based on machine learning screens features from a large amount of corpus data, mostly by manual selection, then represents the whole text with the selected features, and finally classifies the text with a machine learning method. Machine-learning-based emotion analysis can be divided into supervised and unsupervised methods. Common supervised methods include naive Bayes (NB), the support vector machine (SVM) and the conditional random field (CRF); these methods achieve high learning accuracy but need a large amount of manually labelled data, which demands considerable human effort. Common unsupervised methods, such as probabilistic latent semantic analysis (PLSA) and the latent Dirichlet allocation (LDA) model, get rid of the dependence on manual labelling, but their accuracy is generally lower.
The emotion analysis method based on deep learning uses a neural network to autonomously learn, extract and combine text features into high-level features and then automatically performs the classification task, overcoming the shortcomings of classical machine learning. The neural network model commonly used for emotion classification is the LSTM, which alleviates the gradient explosion problem of ordinary RNNs to a certain extent, but still has problems such as low parallel computing efficiency and slow running speed. With the advent of the Transformer model, the BERT model built on it performs excellently on many NLP tasks, but because no large-scale emotion corpus is input in the pre-training stage, a certain bottleneck still exists when it carries out emotion analysis tasks.
Therefore, there is a need for a method for emotion analysis of text data with better performance.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems and the defects in the prior art, in order to improve the accuracy of text emotion extraction, the invention provides a text emotion recognition method based on an LDA and BERT fusion improved model.
The technical scheme is as follows: in a text emotion recognition method based on an LDA and BERT fusion improved model, the topic features of social network text are obtained through LDA topic analysis and the semantic features of the text are obtained through a BERT model; the word vectors of the two are spliced and put into an improved emotion classification model so that the model can accurately recognize text emotion, and the optimized classification model is output. The method comprises the following steps:
step 1: acquiring a social network text corpus and preprocessing the text;
step 2: putting the preprocessed text corpus into a BERT pre-training model to extract semantic features, and obtaining a word vector matrix of the semantic features;
step 3: putting the preprocessed text corpus into an LDA model to extract expanded topic features, and splicing them with the word vectors of the semantic features obtained in step 2 to obtain a word vector matrix fusing semantic and topic features;
step 4: building an emotion classifier, i.e. a model for identifying positive and negative emotions of a text: the word vectors fusing semantic and topic features are fed again into a bidirectional Transformer encoder, the vectors output by the Transformer are connected to a Softmax regression model optimized with stochastic gradient descent so as to adapt to multiple tasks, and a classification model is output after training;
step 5: putting the social text corpus used for in-depth testing into the classifier (classification model) for in-depth pre-training, evaluating the performance of the classification model, fine-tuning the parameters to obtain the trained classification model, and classifying the emotional tendency of the text.
With this technical scheme, BERT is used to obtain the vector representation of the short text, so the semantic features of the short text are better extracted; extracting topic features with the LDA topic model and fusing them with the semantic features enriches the feature types used in training and makes up for the weakness of the LDA topic model on short texts. These features are fed as high-quality input into the Transformer model, and the Transformer output vectors are connected by a single-layer neural network, making full use of GPU resources, so that the emotion analysis is more detailed and efficient and the emotion recognition of text data is more accurate.
The specific steps of preprocessing the text in step 1 include:
Step S11: text cleaning: mainly 3 steps, namely removing uncommon Chinese symbols, removing redundant whitespace, and converting traditional Chinese into simplified Chinese.
Step S12: word segmentation and stop-word removal: meaningless words are removed according to a Chinese stop-word list, and the text is then segmented with jieba.
Step S13: text filtering: texts whose length is not within the set range are deleted from the social network text corpus. The social network consists mainly of short texts, but as training corpus for the LDA model the text must not be too short, so samples shorter than 20 or longer than 200 characters are screened out.
In step 2, semantic feature extraction is carried out on the preprocessed text corpus with the BERT pre-training model; each word is mapped into 3 vectors, giving the representation w_ij(ω + δ + ρ), where the 3 vectors are the word vector, the text vector and the position vector of the text, and a word vector matrix of semantic features is obtained. Since the BERT pre-training model only uses a feed-forward neural network and a multi-head attention mechanism, the BERT model adds a self-learned position vector in addition to the word vector and the text vector.
In step 3, the preprocessed text corpus is put into the LDA model to extract expanded topic features, which are spliced with the semantic-feature word vectors obtained in step 2 to obtain a word vector matrix w_ij(ω + δ + ρ + μ′) fusing semantic and topic features, where μ′ is the topic vector; hereinafter this matrix is referred to simply as the word vector. The specific steps are:
Step S31: count the words in the text corpus to generate a dictionary;
Step S32: train the corpus with the LDA model in the Gensim module, and weight the obtained matrix with the tf-idf algorithm to obtain the expanded topic feature vector;
Step S33: after the expanded topic feature vector is obtained, splice it with the semantic-feature word vector obtained in step 2; the expansion of the text in the topic dimension is completed by vector splicing, so that the semantic features extracted by BERT and the topic features extracted by LDA are fused.
Step 4 transmits the word vectors output in step 3, which fuse the semantic and topic features, into a Transformer encoder, and the Transformer output is connected to a Softmax layer improved by gradient-descent optimization. This neural network is mainly used to execute the emotion analysis task while not affecting BERT's original MLM and NSP tasks, because the network is also connected to the outputs of those two tasks after they are executed.
The method specifically comprises the following steps:
Step S1: the word vectors w_ij(ω + δ + ρ + μ′) are fed into a bidirectional Transformer encoder;
Step S2: the word vectors pass through a Self-Attention layer, in which the Query, Key and Value matrices are first calculated;
Step S3: Attention is computed with the self-attention formula Attention(Q, K, V) = Softmax(QK^T / √d_k)·V, where Softmax is the normalized exponential function, so that the output feature elements sum to 1;
Step S4: the number of Attention heads is set; with head = n, the n Self-Attention matrices are spliced horizontally and finally multiplied by an additional weight matrix to compress them into one matrix;
Step S5: the emotion classification task is executed: a single-layer Softmax neural network is attached to obtain and output the word vectors of each sentence in the corpus and the corresponding sample categories;
Step S6: the Masked LM task is executed: for each sentence in the training sample, words are randomly masked at a set proportion, and the output of the masked positions is predicted from the remaining words according to the set proportion;
Step S7: the NSP task is executed: for each sentence in the training sample, two sentences A and B are selected, where A is the correct next sentence and B is a wrong next sentence, and the binary classification loss is obtained from the CLS token output;
Step S8: the preliminary classification model is output.
In step 5, the formal corpus is put into the classifier built in step 4 for training, the initial parameters are set, the accuracy and recall are calculated, the threshold for positive and negative emotion classification is searched with the F1 score, and a loss function is calculated as the index of model evaluation.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a diagram of an LDA model according to an embodiment of the present invention;
FIG. 3 is a diagram of a emotion classifier model structure according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 1, a text emotion recognition method based on an LDA and BERT fusion improved model includes the following steps:
Step 1: acquiring a social network text corpus. Short social-network texts are crawled with a web crawler using keywords such as 'kennels' (antidepressant drugs) to obtain posts containing the keywords, and an initial corpus is constructed; the documents are then cleaned, segmented, stripped of stop words, and so on; the processed documents are filtered to reduce the data scale and the experimental cost.
The text preprocessing comprises the following specific steps (a minimal sketch follows these steps):
Step S11: text cleaning: mainly 3 steps, namely removing special symbols, removing redundant whitespace, and converting traditional Chinese characters into simplified ones.
Step S12: word segmentation and stop-word removal: a high-frequency Chinese stop-word list is obtained to remove meaningless words, and the text is then segmented with jieba.
Step S13: text filtering: texts whose length is not within the set range are deleted from the social network text corpus. The social network consists mainly of short texts, but as training corpus for the LDA model the text must not be too short, so samples shorter than 20 or longer than 200 characters are screened out.
Step 2: and putting the preprocessed corpus into a BERT pre-training model to extract semantic features, and obtaining a word vector matrix of the semantic features.
Features are extracted based on the BERT model; suppose ω, δ and ρ denote the word vector, text vector and position vector of the text, respectively. The position-encoding vector is specific to Transformer-based models and determines the position of the current word in the sequence. Unlike the Transformer model, however, BERT's position vector is not computed by trigonometric functions but is learned.
BERT is based on a bidirectional Transformer encoder module; the multi-head attention mechanism enhances the attention capability of the model, and the number of heads is set to 8. The 8 self-attention matrices are spliced horizontally and then multiplied by an additional weight matrix W_o, compressing them into a matrix with the same dimension as the input sequence. The calculation formula is as follows:

MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) · W_o
where head_i = Attention(Q_i, K_i, V_i)

Q, K, V denote the query, key and value vectors of each word in the input sequence, respectively, and are calculated as follows:

Q = X · W_Q
K = X · W_K
V = X · W_V

where X is the input word vector matrix, and Q, K, V are the new matrices obtained by multiplying the word vector matrix by the weight matrices corresponding to the respective attention heads.

The segmented document d_i = {w_ij | j ∈ {1, 2, ..., N_i}} is input into the BERT model to extract semantic features; after this computation, each word is mapped into the 3 vectors represented as w_ij(ω + δ + ρ).
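A minimal sketch of this semantic-feature extraction, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (both are assumptions; the patent does not name a specific implementation):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
bert = BertModel.from_pretrained('bert-base-chinese')
bert.eval()

def semantic_vectors(sentence: str) -> torch.Tensor:
    # BERT internally sums the token (word), segment (text) and learned position
    # embeddings, i.e. the w_ij(ω + δ + ρ) representation described above.
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, max_length=200)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)   # (num_tokens, 768) word vector matrix

vecs = semantic_vectors('这款药让我重新找回了生活的希望')
print(vecs.shape)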
Step 3: perform topic-feature expansion on the semantic feature vectors obtained in step 2. The LDA topic model is selected: each document topic in the preprocessed corpus is given in the form of a probability distribution, and topic clustering or text classification optimization is carried out according to the topic distribution. Suppose the corpus consists of M documents; for the i-th document there are N_i words, d_i = {w_ij | j ∈ {1, 2, ..., N_i}}, and each of these words corresponds to a latent topic. The variables of the LDA model are distributed as follows:

p(w_ij, z_ij, θ_i, Φ | α, β) = p(θ_i | α) · p(z_ij | θ_i) · p(Φ | β) · p(w_ij | φ_{z_ij})

where α and β obey prior Dirichlet distributions; θ_i is the topic distribution of the text; p(θ_i | α) is the topic distribution generated by the Dirichlet prior parameter α; p(Φ | β) is the "topic-word" distribution matrix generated by the prior parameter β for topic z_ij; p(z_ij | θ_i) is the topic probability corresponding to the j-th word of document d_i, sampled from the topic distribution θ_i; p(w_ij | φ_{z_ij}) is the probability of generating the word w_ij from the word distribution φ_{z_ij}; Φ is the overall distribution.
The LDA module in the Gensim package is used to complete the probability distribution over topic words: the corpus is put into the LDA model to train the topic vector μ, a better topic result μ′ is obtained after iterative computation, and it is combined with the semantic vector obtained in step 2 to obtain the feature vector w_ij(ω + δ + ρ + μ′) fusing the topic vector.
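A minimal sketch of this topic-feature extraction and fusion, assuming Gensim and NumPy; the number of topics, the priors, the placement of the tf-idf weighting and the per-token tiling of the document topic vector are illustrative interpretations, not a definitive implementation:

import numpy as np
from gensim import corpora, models

def topic_vectors(docs, num_topics=10):
    """docs: list of token lists produced by the preprocessing step."""
    dictionary = corpora.Dictionary(docs)                    # S31: build dictionary
    bow = [dictionary.doc2bow(d) for d in docs]
    tfidf = models.TfidfModel(bow)                           # S32: tf-idf weighting
    lda = models.LdaModel(corpus=tfidf[bow], id2word=dictionary,
                          num_topics=num_topics,
                          alpha=50.0 / num_topics, eta=0.1)
    # Dense per-document topic distribution μ' of shape (num_docs, num_topics)
    return np.array([
        [p for _, p in lda.get_document_topics(tfidf[b], minimum_probability=0.0)]
        for b in bow
    ])

def fuse(bert_token_vectors, doc_topic_vec):
    # Splice the document's topic vector onto every token vector,
    # giving the fused representation w_ij(ω + δ + ρ + μ').
    mu = np.tile(doc_topic_vec, (bert_token_vectors.shape[0], 1))
    return np.concatenate([bert_token_vectors, mu], axis=1)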
Step 4: construct the classifier. The feature vectors w_ij(ω + δ + ρ + μ′) obtained in step 3 are fed again into the bidirectional Transformer encoder. The Transformer encoder achieves parallel operation through a series of self-attention layers and therefore runs fast; it connects the multi-head attention mechanism and the feed-forward layer through a residual network structure, and the multi-head mechanism applies several linear transformations to the input vectors to obtain different linear values and compute the attention weights. The calculation formula is as follows:

MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) · W_o
where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

Q, K, V denote the query, key and value vectors of each word in the input sequence; Attention is calculated after the mapping by the parameter matrices, this is repeated h times, and the results are spliced. head_i is the i-th attention head, W_o is the additional weight matrix, and W_i^Q, W_i^K and W_i^V are the weight matrices corresponding to the i-th attention head. The specific process was introduced in step 2 and is not repeated here. A numerical sketch of this multi-head attention computation is given below.
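The following NumPy sketch illustrates the scaled dot-product self-attention and the multi-head splicing described above; the dimensions and random weight matrices are purely illustrative:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, heads=8, d_model=64):
    rng = np.random.default_rng(0)
    d_k = d_model // heads
    outputs = []
    for _ in range(heads):
        W_Q, W_K, W_V = (rng.standard_normal((X.shape[1], d_k)) for _ in range(3))
        outputs.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_o = rng.standard_normal((heads * d_k, d_model))   # additional weight matrix
    return np.concatenate(outputs, axis=1) @ W_o        # splice horizontally, compress

X = np.random.randn(12, 64)      # 12 fused word vectors of dimension 64
print(multi_head(X).shape)       # (12, 64): same length as the input sequence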
Whereby the encoder learns and stores the document diSemantic relationship and syntactic structure information. Also because of the improved diThe theme feature vectors are fused, so that the modified document is connected with an output layer of Softmax to adapt to learning under multiple tasks, namely, MLM and NSP tasks are reserved while an emotion classification task (SC) is executed, and the Softmax output layer is respectively connected with a MASK word ([ MASK ] of a Transformer]) And a text start ([ CLS)]) After the corresponding output vector. And optimizing a loss function of the Softmax regression model by adopting a random gradient descent method, wherein the loss function J (theta) and the optimized gradient calculation formula are as follows:
J(θ) = -(1/m) · Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y_i = j} · log( exp(θ_j^T x_i) / Σ_{l=1}^{k} exp(θ_l^T x_i) )

∇_{θ_j} J(θ) = -(1/m) · Σ_{i=1}^{m} x_i · ( 1{y_i = j} − p(y_i = j | x_i; θ) )

where θ is the overall parameter of the model, θ_j is the classifier parameter corresponding to class j, m is the number of samples, k is the number of classes, i indexes a sample, x_i is the vector representation of the i-th sample, j indexes a class, and 1{·} is the indicator function (the logarithmic term is the log-likelihood). Each iteration of the gradient descent method updates the parameters, so that Softmax realizes prediction and classification of the input classes.
Therefore, the topic features and the semantic features are combined, an improved classification model M is output, and the emotion recognition accuracy of the model for different texts is improved.
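A NumPy sketch of the softmax regression loss J(θ) and its stochastic-gradient update given above, assuming a generic dense feature matrix; in the actual model this classifier head sits on top of the Transformer outputs:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(theta, X, y, k, lr=0.1):
    """One stochastic gradient descent update of the softmax parameters theta (d, k)."""
    m = X.shape[0]
    probs = softmax(X @ theta)                 # p(y_i = j | x_i; theta)
    onehot = np.eye(k)[y]                      # indicator 1{y_i = j}
    loss = -np.mean(np.sum(onehot * np.log(probs + 1e-12), axis=1))   # J(theta)
    grad = -(X.T @ (onehot - probs)) / m       # gradient of J w.r.t. theta
    return theta - lr * grad, loss

# toy usage: 2-class (positive / negative emotion) on random 768-dimensional features
rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 768)), rng.integers(0, 2, 32)
theta = np.zeros((768, 2))
theta, loss = sgd_step(theta, X, y, k=2)
print(loss)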
Step 5: step 4 outputs the classification model M. The corpus is input into M, the initial batch-size and epoch parameters are set, Recall and Precision are calculated, and the model is measured with the F1-score. The calculation formulas are as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 · Precision · Recall / (Precision + Recall)

Finally, the sum of the losses of the three tasks SC, MLM and NSP is taken as the total loss function Loss, calculated as:

Loss = λ1·Loss_SC + λ2·Loss_MLM + λ3·Loss_NSP

where λ1, λ2 and λ3 are the weights assigned to the three tasks. Model performance can be evaluated by calculating the F1-score and the loss function. In this way, the LDA-BERT fused model of this embodiment can better model the emotion of social network text and obtain a more accurate emotion analysis result.
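A small sketch of this evaluation step, computing precision, recall, F1 and the weighted three-task loss; the λ weights used here are illustrative assumptions:

def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

def total_loss(loss_sc: float, loss_mlm: float, loss_nsp: float,
               lambdas=(1.0, 0.5, 0.5)):      # illustrative task weights λ1, λ2, λ3
    l1, l2, l3 = lambdas
    return l1 * loss_sc + l2 * loss_mlm + l3 * loss_nsp

print(precision_recall_f1(tp=85, fp=10, fn=15))   # roughly (0.89, 0.85, 0.87)
print(total_loss(0.32, 0.45, 0.12))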
The LDA topic analysis model is shown in fig. 2, and the specific steps of the LDA training algorithm are as follows (a count-accumulation sketch is given after these steps):
Step S1: randomly initialize α and β (generally, α is taken as 50 / number of topics and β as 0.1);
Step S2: take the documents in the training set in turn;
Step S3: calculate the topic distribution of each document and the topic distribution of each word using the current values of α and β; check whether the document is the last one, and if so, proceed to S4; otherwise, return to S2;
Step S4: accumulate over all documents the number of words belonging to topic k to obtain the vector γ, and the number of times word i belongs to topic k to obtain the matrix β;
Step S5: obtain the current optimal α value with Newton-Raphson iteration according to the current γ;
Step S6: normalize the columns of the matrix β to directly obtain the current β value, i.e. the word distribution of each topic; check whether convergence is reached, and if so, go to S7; otherwise, return to S2;
Step S7: output the values of α and β at convergence.
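A minimal NumPy sketch of the count accumulation and normalization in steps S4 and S6, assuming hypothetical arrays docs (lists of token ids) and z (the current topic assignment of every token); the Newton-Raphson update of α in step S5 is omitted:

import numpy as np

def accumulate_counts(docs, z, num_topics, vocab_size):
    gamma = np.zeros(num_topics)                 # S4: words assigned to each topic k
    beta = np.zeros((vocab_size, num_topics))    # S4: times word i belongs to topic k
    for d, doc in enumerate(docs):
        for j, word_id in enumerate(doc):
            k = z[d][j]
            gamma[k] += 1
            beta[word_id, k] += 1
    # S6: normalize the columns of beta -> per-topic word distribution
    beta = beta / np.maximum(beta.sum(axis=0, keepdims=True), 1)
    return gamma, beta

# toy usage: two tiny documents over a 5-word vocabulary and 2 topics
docs = [[0, 1, 2], [2, 3, 4]]
z = [[0, 0, 1], [1, 1, 0]]
gamma, beta = accumulate_counts(docs, z, num_topics=2, vocab_size=5)
print(gamma, beta.sum(axis=0))    # column sums of beta are 1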
The structure of the emotion classifier model of the embodiment of the invention is shown in fig. 3: the vector fusing the BERT semantic features and the LDA topic features is put into a bidirectional Transformer encoder and, after passing through a fully connected layer, is connected by a softmax network to the corresponding output vector so as to execute the classification task. The specific steps are as follows:
Step S1: the word vectors w_ij(ω + δ + ρ + μ′) are fed into a bidirectional Transformer encoder;
Step S2: the word vectors pass through a Self-Attention layer, in which the Query, Key and Value matrices are first calculated;
Step S3: Attention is computed according to the formula Attention(Q, K, V) = Softmax(QK^T / √d_k)·V, where Softmax is the normalized exponential function, so that the output feature elements sum to 1;
Step S4: the Attention head number parameter is set; with head = 8, the 8 Self-Attention matrices are spliced horizontally and finally multiplied by an additional weight matrix to compress them into one matrix;
Step S5: the emotion classification task is executed: a single-layer Softmax neural network is attached to obtain and output the word vectors and corresponding categories of each sentence in the training set;
Step S6: the Masked LM task is executed: for each sentence in the training sample, 15% of the words are randomly masked for prediction, and their outputs are predicted according to the 8:1:1 ratio (replaced by [MASK], replaced by a random word, or left unchanged);
Step S7: the NSP task is executed: for each sentence in the training sample, two sentences A and B are selected, where A is the correct next sentence and B is a wrong one, and the binary classification loss is obtained from the CLS token output;
Step S8: through the above steps, a preliminary classification model is output. The formal corpus is put into the model, the initial Epoch and Batch parameters are set, and the F1-score and Loss function are adopted to evaluate the effect of the model. A minimal sketch of the masking strategy in step S6 follows.

Claims (6)

1. A text emotion recognition method based on an LDA and BERT fusion improved model, characterized in that social network text topic features are obtained through LDA topic analysis, text semantic features are obtained through a BERT model, the word vectors of the two are spliced and put into an emotion classification model so that the model can accurately recognize text emotion, and an optimized classification model is output and used for recognizing text emotion, the method comprising the following steps:
step 1: acquiring a social network text corpus and preprocessing the text;
step 2: putting the preprocessed text corpus into a BERT pre-training model to extract semantic features, and obtaining a word vector matrix of the semantic features;
step 3: putting the preprocessed text corpus into an LDA model to extract expanded topic features, and splicing them with the word vectors of the semantic features obtained in step 2 to obtain a word vector matrix fusing semantic and topic features;
step 4: building an emotion classifier, transmitting the word vectors fusing semantic and topic features into a bidirectional Transformer encoder again, connecting the vectors output by the Transformer with a gradient-optimized Softmax regression model so as to adapt to multiple tasks, and outputting a classification model after training;
step 5: putting the social text corpus used for in-depth testing into the classifier for in-depth pre-training, evaluating the performance of the model, fine-tuning the parameters to obtain the trained classification model, and classifying the emotional tendency of the text.
2. The text emotion recognition method based on LDA and BERT fusion improvement model as claimed in claim 1, wherein the specific step of preprocessing the text in step 1 comprises:
step S11: text cleaning;
step S12: word segmentation and stop word removal: removing nonsense words according to the Chinese inactive word list, and then performing word segmentation processing on the text by using jieba;
step S13: text filtering: and deleting the text with the text length not within the set length range in the social network text corpus set.
3. The method as claimed in claim 1, wherein in step 2 semantic feature extraction is performed on the preprocessed text corpus data through the BERT pre-training model, and each word is mapped into 3 vectors, represented as w_ij(ω + δ + ρ), the 3 vectors being the word vector, text vector and position vector of the text.
4. The method for recognizing emotion of text based on the LDA and BERT fusion improvement model as claimed in claim 1, wherein in step 3 the preprocessed text corpus is put into the LDA model to extract expanded topic features, which are spliced with the word vectors of the semantic features obtained in step 2 to obtain a word vector matrix w_ij(ω + δ + ρ + μ′) fusing semantic and topic features, μ′ being the topic vector, this matrix hereinafter being referred to simply as the word vector; the specific steps are as follows:
step S31: counting the words in the text corpus to generate a dictionary;
step S32: training the corpus with the LDA model in the Gensim module, and weighting the obtained matrix with the tf-idf algorithm to obtain the expanded topic feature vector;
step S33: after the expanded topic feature vector is obtained, splicing it with the semantic-feature word vector obtained in step 2, and completing the expansion of the text in the topic dimension by vector splicing, so that the semantic features extracted by BERT and the topic features extracted by LDA are fused.
5. The method for recognizing emotion of text based on the LDA and BERT fusion improvement model as claimed in claim 1, wherein step 4 transmits the word vectors fusing semantic and topic features output in step 3 into a Transformer encoder, and the output of the Transformer is connected by a gradient-descent-optimized Softmax layer for adapting to the execution of multiple tasks, comprising the following steps:
step S1: the word vectors w_ij(ω + δ + ρ + μ′) are fed into a bidirectional Transformer encoder;
step S2: the word vectors pass through a Self-Attention layer, in which the Query, Key and Value matrices are first calculated;
step S3: Attention is calculated with the self-attention formula Attention(Q, K, V) = Softmax(QK^T / √d_k)·V, where Softmax is the normalized exponential function, and the sum of the output feature elements is 1;
step S4: the number of Attention heads is set; with head = n, the n Self-Attention matrices are spliced horizontally and finally multiplied by an additional weight matrix to compress them into one matrix;
step S5: the emotion classification task is executed: a single-layer Softmax neural network is attached to obtain and output the word vectors of each sentence in the corpus and the corresponding sample categories;
step S6: the Masked LM task is executed: for each sentence in the training sample, words are randomly masked at a set proportion, and the output of the masked positions is predicted from the remaining words according to the set proportion;
step S7: the NSP task is executed: for each sentence in the training sample, two sentences A and B are selected, where A is the correct next sentence and B is a wrong next sentence, and the binary classification loss is obtained from the CLS token output;
step S8: the preliminary classification model is output.
6. The text emotion recognition method based on the LDA and BERT fusion improvement model as claimed in claim 1, wherein in step 5, a formal corpus is put into the classifier built in step 4 for training, initial parameters are set, the accuracy rate and the recall rate are calculated, meanwhile, the F1 score is adopted to search the threshold value of positive and negative emotion classification, and a Loss function is calculated and used as an index of model evaluation.
CN202210447516.5A 2022-04-26 2022-04-26 Text emotion recognition method based on LDA and BERT fusion improved model Pending CN114722835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210447516.5A CN114722835A (en) 2022-04-26 2022-04-26 Text emotion recognition method based on LDA and BERT fusion improved model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210447516.5A CN114722835A (en) 2022-04-26 2022-04-26 Text emotion recognition method based on LDA and BERT fusion improved model

Publications (1)

Publication Number Publication Date
CN114722835A true CN114722835A (en) 2022-07-08

Family

ID=82245573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210447516.5A Pending CN114722835A (en) 2022-04-26 2022-04-26 Text emotion recognition method based on LDA and BERT fusion improved model

Country Status (1)

Country Link
CN (1) CN114722835A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563284A (en) * 2022-10-24 2023-01-03 重庆理工大学 Deep multi-instance weak supervision text classification method based on semantics
CN115563284B (en) * 2022-10-24 2023-06-23 重庆理工大学 Deep multi-instance weak supervision text classification method based on semantics
CN116992026A (en) * 2023-07-12 2023-11-03 华南师范大学 Text clustering method and device, electronic equipment and storage medium
CN117633239A (en) * 2024-01-23 2024-03-01 中国科学技术大学 End-to-end face emotion recognition method combining combined category grammar

Similar Documents

Publication Publication Date Title
CN110134757B (en) Event argument role extraction method based on multi-head attention mechanism
CN108984526B (en) Document theme vector extraction method based on deep learning
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN111160037B (en) Fine-grained emotion analysis method supporting cross-language migration
CN109766277B (en) Software fault diagnosis method based on transfer learning and DNN
CN109325231B (en) Method for generating word vector by multitasking model
CN112015863B (en) Multi-feature fusion Chinese text classification method based on graphic neural network
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN111985247B (en) Microblog user interest identification method and system based on multi-granularity text feature representation
CN114722835A (en) Text emotion recognition method based on LDA and BERT fusion improved model
CN110427616B (en) Text emotion analysis method based on deep learning
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111753058B (en) Text viewpoint mining method and system
CN112163089B (en) High-technology text classification method and system integrating named entity recognition
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN112395417A (en) Network public opinion evolution simulation method and system based on deep learning
CN113705238B (en) Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN112070139A (en) Text classification method based on BERT and improved LSTM
Wang et al. Convolutional Poisson gamma belief network
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
Chan et al. Applying and optimizing NLP model with CARU
CN116050419B (en) Unsupervised identification method and system oriented to scientific literature knowledge entity
Alisamir et al. An end-to-end deep learning model to recognize Farsi speech from raw input

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination