CN114722835A - Text emotion recognition method based on LDA and BERT fusion improved model - Google Patents

Text emotion recognition method based on LDA and BERT fusion improved model

Info

Publication number
CN114722835A
CN114722835A (application CN202210447516.5A)
Authority
CN
China
Prior art keywords
text
model
word
lda
bert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210447516.5A
Other languages
Chinese (zh)
Inventor
朱李玥
戴梦瑶
刘文强
邢莉娟
柏雪嫣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202210447516.5A priority Critical patent/CN114722835A/en
Publication of CN114722835A publication Critical patent/CN114722835A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/163Handling of whitespace
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/53Processing of non-Latin text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a text emotion recognition method based on an LDA and BERT fusion improved model, which comprises the following steps: (1) acquiring social network text and preprocessing it; (2) fusing the semantic features and topic features of the text and outputting a word vector matrix; (3) inputting the features into a bidirectional Transformer encoder, connecting it to a Softmax layer improved by gradient optimization, and outputting a classification model; (4) feeding formal corpora into the classification model and fine-tuning its parameters to improve the model. Performing emotion recognition on social network text with the resulting final classification model yields a more accurate recognition result.

Description

Text emotion recognition method based on LDA and BERT fusion improved model
Technical Field
The invention relates to a text emotion recognition method based on an LDA and BERT fusion improved model, and belongs to the technical field of text data recognition.
Background
With the arrival of the big data era and the rapid development of 5G networks, the Internet has gradually moved toward an open, user-centred architecture, and the release of network information has increasingly shifted from "timely" to "real time". Internet users have been transformed from recipients of information into publishers. Social networks, as platforms where information can easily be published and retrieved, attract more and more users to post emotional text about their personal lives and about news and current events. Therefore, accurately, timely and effectively acquiring the emotion information of social network text has important practical value. There are currently three common approaches to text sentiment analysis: emotion analysis based on an emotion dictionary, emotion analysis based on machine learning, and emotion analysis based on deep learning.
The analysis method based on an emotion dictionary is the earliest sentiment analysis approach: the preprocessed text is matched against the words in an emotion dictionary, an emotion score is computed from the matching degree, and the emotion polarity is judged; the calculation is simple but the accuracy is low. The key lies in the construction of the emotion dictionary. Most traditional construction methods are based on semantic similarity, the core idea being to measure the distance between a candidate word and positive and negative emotion labels, typically with the pointwise mutual information (PMI) measure. In recent years, with the rapid development of artificial intelligence, dictionary construction methods based on machine learning and deep neural networks have also been proposed. Although such methods are flexible and convenient, the constructed emotion dictionary is generally restricted to its own field, so the universality of dictionary-based methods is poor.
The emotion analysis method based on machine learning screens features from a large amount of corpus data, mostly by manual selection, then represents the whole text with the selected features, and finally classifies the text with a machine learning method. Machine-learning-based emotion analysis can be divided into supervised and unsupervised methods. Common supervised methods include naive Bayes (NB), the support vector machine (SVM) and the conditional random field (CRF); these methods achieve high learning accuracy but need a large amount of manually labelled data, which demands considerable human effort. Common unsupervised methods, such as probabilistic latent semantic analysis (PLSA) and the latent Dirichlet allocation (LDA) model, get rid of the dependence on manual labelling, but their accuracy is generally lower.
The emotion analysis method based on deep learning uses a neural network to autonomously learn, extract and combine text features into high-level features and then automatically performs the classification task, overcoming the shortcomings of classical machine learning. The neural network model commonly used for emotion classification is the LSTM, which alleviates the gradient explosion problem of ordinary RNNs to a certain extent, but still has problems such as low parallel computing efficiency and slow running speed. With the advent of the Transformer model, the BERT model built on it performs excellently on many NLP tasks, but because no large-scale emotion corpus is input in the pre-training stage, a certain bottleneck still exists when it carries out emotion analysis tasks.
Therefore, there is a need for a method for emotion analysis of text data with better performance.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems and the defects in the prior art, in order to improve the accuracy of text emotion extraction, the invention provides a text emotion recognition method based on an LDA and BERT fusion improved model.
The technical scheme is as follows: in a text emotion recognition method based on an LDA and BERT fusion improved model, the topic features of social network text are obtained through LDA topic analysis and the semantic features of the text are obtained through a BERT model; the word vectors of the two are spliced and put into an improved emotion classification model so that the model can accurately recognize text emotion, and the optimized classification model is output. The method comprises the following steps:
step 1: acquiring a social network text corpus and preprocessing the text;
step 2: putting the preprocessed text corpus into a BERT pre-training model to extract semantic features, and obtaining a word vector matrix of the semantic features;
step 3: putting the preprocessed text corpus into an LDA model to extract expanded topic features, and splicing them with the word vectors of the semantic features obtained in step 2 to obtain a word vector matrix fusing semantic and topic features;
step 4: building an emotion classifier, i.e. a model for identifying positive and negative emotions of a text: the word vectors fusing semantic and topic features are fed again into a bidirectional Transformer encoder, the vectors output by the Transformer are connected to a Softmax regression model optimized with stochastic gradient descent so as to adapt to multiple tasks, and a classification model is output after training;
step 5: putting the social text corpus used for in-depth testing into the classifier (classification model) for in-depth pre-training, evaluating the performance of the classification model, fine-tuning the parameters to obtain the trained classification model, and classifying the emotional tendency of the text.
With this technical scheme, BERT is used to obtain the vector representation of the short text, so the semantic features of the short text are better extracted; extracting topic features with the LDA topic model and fusing them with the semantic features enriches the feature types used in training and makes up for the weakness of the LDA topic model on short texts. These features are fed as high-quality input into the Transformer model, and the Transformer output vectors are connected by a single-layer neural network, making full use of GPU resources, so that the emotion analysis is more detailed and efficient and the emotion recognition of text data is more accurate.
The specific steps of preprocessing the text in step 1 include:
Step S11: text cleaning: mainly 3 steps, namely removing uncommon Chinese symbols, removing redundant whitespace, and converting traditional Chinese into simplified Chinese.
Step S12: word segmentation and stop-word removal: meaningless words are removed according to a Chinese stop-word list, and the text is then segmented with jieba.
Step S13: text filtering: texts whose length is not within the set range are deleted from the social network text corpus. The social network consists mainly of short texts, but as training corpus for the LDA model the text must not be too short, so samples shorter than 20 or longer than 200 characters are screened out.
In step 2, semantic feature extraction is carried out on the preprocessed text corpus with the BERT pre-training model; each word is mapped into 3 vectors, giving the representation w_ij(ω + δ + ρ), where the 3 vectors are the word vector, the text vector and the position vector of the text, and a word vector matrix of semantic features is obtained. Since the BERT pre-training model only uses a feed-forward neural network and a multi-head attention mechanism, the BERT model adds a self-learned position vector in addition to the word vector and the text vector.
In step 3, the preprocessed text corpus is put into the LDA model to extract expanded topic features, which are spliced with the semantic-feature word vectors obtained in step 2 to obtain a word vector matrix w_ij(ω + δ + ρ + μ′) fusing semantic and topic features, where μ′ is the topic vector; hereinafter this matrix is referred to simply as the word vector. The specific steps are:
Step S31: count the words in the text corpus to generate a dictionary;
Step S32: train the corpus with the LDA model in the Gensim module, and weight the obtained matrix with the tf-idf algorithm to obtain the expanded topic feature vector;
Step S33: after the expanded topic feature vector is obtained, splice it with the semantic-feature word vector obtained in step 2; the expansion of the text in the topic dimension is completed by vector splicing, so that the semantic features extracted by BERT and the topic features extracted by LDA are fused.
Step 4 transmits the word vectors output in step 3, which fuse the semantic and topic features, into a Transformer encoder, and the Transformer output is connected to a Softmax layer improved by gradient-descent optimization. This neural network is mainly used to execute the emotion analysis task while not affecting BERT's original MLM and NSP tasks, because the network is also connected to the outputs of those two tasks after they are executed.
The method specifically comprises the following steps:
Step S1: the word vectors w_ij(ω + δ + ρ + μ′) are fed into a bidirectional Transformer encoder;
Step S2: the word vectors pass through a Self-Attention layer, in which the Query, Key and Value matrices are first calculated;
Step S3: Attention is computed with the self-attention formula Attention(Q, K, V) = Softmax(QK^T / √d_k)·V, where Softmax is the normalized exponential function, so that the output feature elements sum to 1;
Step S4: the number of Attention heads is set; with head = n, the n Self-Attention matrices are spliced horizontally and finally multiplied by an additional weight matrix to compress them into one matrix;
Step S5: the emotion classification task is executed: a single-layer Softmax neural network is attached to obtain and output the word vectors of each sentence in the corpus and the corresponding sample categories;
Step S6: the Masked LM task is executed: for each sentence in the training sample, words are randomly masked at a set proportion, and the output of the masked positions is predicted from the remaining words according to the set proportion;
Step S7: the NSP task is executed: for each sentence in the training sample, two sentences A and B are selected, where A is the correct next sentence and B is a wrong next sentence, and the binary classification loss is obtained from the CLS token output;
Step S8: the preliminary classification model is output.
In step 5, the formal corpus is put into the classifier built in step 4 for training, the initial parameters are set, the accuracy and recall are calculated, the threshold for positive and negative emotion classification is searched with the F1 score, and a loss function is calculated as the index of model evaluation.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a diagram of an LDA model according to an embodiment of the present invention;
FIG. 3 is a diagram of a emotion classifier model structure according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 1, a text emotion recognition method based on an LDA and BERT fusion improved model includes the following steps:
Step 1: acquiring a social network text corpus. Short social-network texts are crawled with a web crawler using keywords such as 'kennels' (antidepressant drugs) to obtain posts containing the keywords, and an initial corpus is constructed; the documents are then cleaned, segmented, stripped of stop words, and so on; the processed documents are filtered to reduce the data scale and the experimental cost.
The text preprocessing comprises the following specific steps (a minimal sketch follows these steps):
Step S11: text cleaning: mainly 3 steps, namely removing special symbols, removing redundant whitespace, and converting traditional Chinese characters into simplified ones.
Step S12: word segmentation and stop-word removal: a high-frequency Chinese stop-word list is obtained to remove meaningless words, and the text is then segmented with jieba.
Step S13: text filtering: texts whose length is not within the set range are deleted from the social network text corpus. The social network consists mainly of short texts, but as training corpus for the LDA model the text must not be too short, so samples shorter than 20 or longer than 200 characters are screened out.
Step 2: and putting the preprocessed corpus into a BERT pre-training model to extract semantic features, and obtaining a word vector matrix of the semantic features.
Features are extracted based on the BERT model; suppose ω, δ and ρ denote the word vector, text vector and position vector of the text, respectively. The position-encoding vector is specific to Transformer-based models and determines the position of the current word in the sequence. Unlike the Transformer model, however, BERT's position vector is not computed by trigonometric functions but is learned.
BERT is based on a bidirectional Transformer encoder module; the multi-head attention mechanism enhances the attention capability of the model, and the number of heads is set to 8. The 8 self-attention matrices are spliced horizontally and then multiplied by an additional weight matrix W_o, compressing them into a matrix with the same dimension as the input sequence. The calculation formula is as follows:

MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) · W_o
where head_i = Attention(Q_i, K_i, V_i)

Q, K, V denote the query, key and value vectors of each word in the input sequence, respectively, and are calculated as follows:

Q = X · W_Q
K = X · W_K
V = X · W_V

where X is the input word vector matrix, and Q, K, V are the new matrices obtained by multiplying the word vector matrix by the weight matrices corresponding to the respective attention heads.

The segmented document d_i = {w_ij | j ∈ {1, 2, ..., N_i}} is input into the BERT model to extract semantic features; after this computation, each word is mapped into the 3 vectors represented as w_ij(ω + δ + ρ).
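A minimal sketch of this semantic-feature extraction, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (both are assumptions; the patent does not name a specific implementation):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
bert = BertModel.from_pretrained('bert-base-chinese')
bert.eval()

def semantic_vectors(sentence: str) -> torch.Tensor:
    # BERT internally sums the token (word), segment (text) and learned position
    # embeddings, i.e. the w_ij(ω + δ + ρ) representation described above.
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, max_length=200)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)   # (num_tokens, 768) word vector matrix

vecs = semantic_vectors('这款药让我重新找回了生活的希望')
print(vecs.shape)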
Step 3: perform topic-feature expansion on the semantic feature vectors obtained in step 2. The LDA topic model is selected: each document topic in the preprocessed corpus is given in the form of a probability distribution, and topic clustering or text classification optimization is carried out according to the topic distribution. Suppose the corpus consists of M documents; for the i-th document there are N_i words, d_i = {w_ij | j ∈ {1, 2, ..., N_i}}, and each of these words corresponds to a latent topic. The variables of the LDA model are distributed as follows:

p(w_ij, z_ij, θ_i, Φ | α, β) = p(θ_i | α) · p(z_ij | θ_i) · p(Φ | β) · p(w_ij | φ_{z_ij})

where α and β obey prior Dirichlet distributions; θ_i is the topic distribution of the text; p(θ_i | α) is the topic distribution generated by the Dirichlet prior parameter α; p(Φ | β) is the "topic-word" distribution matrix generated by the prior parameter β for topic z_ij; p(z_ij | θ_i) is the topic probability corresponding to the j-th word of document d_i, sampled from the topic distribution θ_i; p(w_ij | φ_{z_ij}) is the probability of generating the word w_ij from the word distribution φ_{z_ij}; Φ is the overall distribution.
The LDA module in the Gensim package is used to complete the probability distribution over topic words: the corpus is put into the LDA model to train the topic vector μ, a better topic result μ′ is obtained after iterative computation, and it is combined with the semantic vector obtained in step 2 to obtain the feature vector w_ij(ω + δ + ρ + μ′) fusing the topic vector.
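A minimal sketch of this topic-feature extraction and fusion, assuming Gensim and NumPy; the number of topics, the priors, the placement of the tf-idf weighting and the per-token tiling of the document topic vector are illustrative interpretations, not a definitive implementation:

import numpy as np
from gensim import corpora, models

def topic_vectors(docs, num_topics=10):
    """docs: list of token lists produced by the preprocessing step."""
    dictionary = corpora.Dictionary(docs)                    # S31: build dictionary
    bow = [dictionary.doc2bow(d) for d in docs]
    tfidf = models.TfidfModel(bow)                           # S32: tf-idf weighting
    lda = models.LdaModel(corpus=tfidf[bow], id2word=dictionary,
                          num_topics=num_topics,
                          alpha=50.0 / num_topics, eta=0.1)
    # Dense per-document topic distribution μ' of shape (num_docs, num_topics)
    return np.array([
        [p for _, p in lda.get_document_topics(tfidf[b], minimum_probability=0.0)]
        for b in bow
    ])

def fuse(bert_token_vectors, doc_topic_vec):
    # Splice the document's topic vector onto every token vector,
    # giving the fused representation w_ij(ω + δ + ρ + μ').
    mu = np.tile(doc_topic_vec, (bert_token_vectors.shape[0], 1))
    return np.concatenate([bert_token_vectors, mu], axis=1)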
Step 4: construct the classifier. The feature vectors w_ij(ω + δ + ρ + μ′) obtained in step 3 are fed again into the bidirectional Transformer encoder. The Transformer encoder achieves parallel operation through a series of self-attention layers and therefore runs fast; it connects the multi-head attention mechanism and the feed-forward layer through a residual network structure, and the multi-head mechanism applies several linear transformations to the input vectors to obtain different linear values and compute the attention weights. The calculation formula is as follows:

MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) · W_o
where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

Q, K, V denote the query, key and value vectors of each word in the input sequence; Attention is calculated after the mapping by the parameter matrices, this is repeated h times, and the results are spliced. head_i is the i-th attention head, W_o is the additional weight matrix, and W_i^Q, W_i^K and W_i^V are the weight matrices corresponding to the i-th attention head. The specific process was introduced in step 2 and is not repeated here. A numerical sketch of this multi-head attention computation is given below.
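The following NumPy sketch illustrates the scaled dot-product self-attention and the multi-head splicing described above; the dimensions and random weight matrices are purely illustrative:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, heads=8, d_model=64):
    rng = np.random.default_rng(0)
    d_k = d_model // heads
    outputs = []
    for _ in range(heads):
        W_Q, W_K, W_V = (rng.standard_normal((X.shape[1], d_k)) for _ in range(3))
        outputs.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_o = rng.standard_normal((heads * d_k, d_model))   # additional weight matrix
    return np.concatenate(outputs, axis=1) @ W_o        # splice horizontally, compress

X = np.random.randn(12, 64)      # 12 fused word vectors of dimension 64
print(multi_head(X).shape)       # (12, 64): same length as the input sequence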
Whereby the encoder learns and stores the document diSemantic relationship and syntactic structure information. Also because of the improved diThe theme feature vectors are fused, so that the modified document is connected with an output layer of Softmax to adapt to learning under multiple tasks, namely, MLM and NSP tasks are reserved while an emotion classification task (SC) is executed, and the Softmax output layer is respectively connected with a MASK word ([ MASK ] of a Transformer]) And a text start ([ CLS)]) After the corresponding output vector. And optimizing a loss function of the Softmax regression model by adopting a random gradient descent method, wherein the loss function J (theta) and the optimized gradient calculation formula are as follows:
J(θ) = -(1/m) · Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y_i = j} · log( exp(θ_j^T x_i) / Σ_{l=1}^{k} exp(θ_l^T x_i) )

∇_{θ_j} J(θ) = -(1/m) · Σ_{i=1}^{m} x_i · ( 1{y_i = j} − p(y_i = j | x_i; θ) )

where θ is the overall parameter of the model, θ_j is the classifier parameter corresponding to class j, m is the number of samples, k is the number of classes, i indexes a sample, x_i is the vector representation of the i-th sample, j indexes a class, and 1{·} is the indicator function (the logarithmic term is the log-likelihood). Each iteration of the gradient descent method updates the parameters, so that Softmax realizes prediction and classification of the input classes.
Therefore, the topic features and the semantic features are combined, an improved classification model M is output, and the emotion recognition accuracy of the model for different texts is improved.
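A NumPy sketch of the softmax regression loss J(θ) and its stochastic-gradient update given above, assuming a generic dense feature matrix; in the actual model this classifier head sits on top of the Transformer outputs:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(theta, X, y, k, lr=0.1):
    """One stochastic gradient descent update of the softmax parameters theta (d, k)."""
    m = X.shape[0]
    probs = softmax(X @ theta)                 # p(y_i = j | x_i; theta)
    onehot = np.eye(k)[y]                      # indicator 1{y_i = j}
    loss = -np.mean(np.sum(onehot * np.log(probs + 1e-12), axis=1))   # J(theta)
    grad = -(X.T @ (onehot - probs)) / m       # gradient of J w.r.t. theta
    return theta - lr * grad, loss

# toy usage: 2-class (positive / negative emotion) on random 768-dimensional features
rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 768)), rng.integers(0, 2, 32)
theta = np.zeros((768, 2))
theta, loss = sgd_step(theta, X, y, k=2)
print(loss)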
Step 5: step 4 outputs the classification model M. The corpus is input into M, the initial batch-size and epoch parameters are set, Recall and Precision are calculated, and the model is measured with the F1-score. The calculation formulas are as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 · Precision · Recall / (Precision + Recall)

Finally, the sum of the losses of the three tasks SC, MLM and NSP is taken as the total loss function Loss, calculated as:

Loss = λ1·Loss_SC + λ2·Loss_MLM + λ3·Loss_NSP

where λ1, λ2 and λ3 are the weights assigned to the three tasks. Model performance can be evaluated by calculating the F1-score and the loss function. In this way, the LDA-BERT fused model of this embodiment can better model the emotion of social network text and obtain a more accurate emotion analysis result.
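A small sketch of this evaluation step, computing precision, recall, F1 and the weighted three-task loss; the λ weights used here are illustrative assumptions:

def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

def total_loss(loss_sc: float, loss_mlm: float, loss_nsp: float,
               lambdas=(1.0, 0.5, 0.5)):      # illustrative task weights λ1, λ2, λ3
    l1, l2, l3 = lambdas
    return l1 * loss_sc + l2 * loss_mlm + l3 * loss_nsp

print(precision_recall_f1(tp=85, fp=10, fn=15))   # roughly (0.89, 0.85, 0.87)
print(total_loss(0.32, 0.45, 0.12))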
The LDA topic analysis model is shown in fig. 2, and the specific steps of the LDA training algorithm are as follows (a count-accumulation sketch is given after these steps):
Step S1: randomly initialize α and β (generally, α is taken as 50 / number of topics and β as 0.1);
Step S2: take the documents in the training set in turn;
Step S3: calculate the topic distribution of each document and the topic distribution of each word using the current values of α and β; check whether the document is the last one, and if so, proceed to S4; otherwise, return to S2;
Step S4: accumulate over all documents the number of words belonging to topic k to obtain the vector γ, and the number of times word i belongs to topic k to obtain the matrix β;
Step S5: obtain the current optimal α value with Newton-Raphson iteration according to the current γ;
Step S6: normalize the columns of the matrix β to directly obtain the current β value, i.e. the word distribution of each topic; check whether convergence is reached, and if so, go to S7; otherwise, return to S2;
Step S7: output the values of α and β at convergence.
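A minimal NumPy sketch of the count accumulation and normalization in steps S4 and S6, assuming hypothetical arrays docs (lists of token ids) and z (the current topic assignment of every token); the Newton-Raphson update of α in step S5 is omitted:

import numpy as np

def accumulate_counts(docs, z, num_topics, vocab_size):
    gamma = np.zeros(num_topics)                 # S4: words assigned to each topic k
    beta = np.zeros((vocab_size, num_topics))    # S4: times word i belongs to topic k
    for d, doc in enumerate(docs):
        for j, word_id in enumerate(doc):
            k = z[d][j]
            gamma[k] += 1
            beta[word_id, k] += 1
    # S6: normalize the columns of beta -> per-topic word distribution
    beta = beta / np.maximum(beta.sum(axis=0, keepdims=True), 1)
    return gamma, beta

# toy usage: two tiny documents over a 5-word vocabulary and 2 topics
docs = [[0, 1, 2], [2, 3, 4]]
z = [[0, 0, 1], [1, 1, 0]]
gamma, beta = accumulate_counts(docs, z, num_topics=2, vocab_size=5)
print(gamma, beta.sum(axis=0))    # column sums of beta are 1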
The structure of the emotion classifier model of the embodiment of the invention is shown in fig. 3: the vector fusing the BERT semantic features and the LDA topic features is put into a bidirectional Transformer encoder and, after passing through a fully connected layer, is connected by a softmax network to the corresponding output vector so as to execute the classification task. The specific steps are as follows:
Step S1: the word vectors w_ij(ω + δ + ρ + μ′) are fed into a bidirectional Transformer encoder;
Step S2: the word vectors pass through a Self-Attention layer, in which the Query, Key and Value matrices are first calculated;
Step S3: Attention is computed according to the formula Attention(Q, K, V) = Softmax(QK^T / √d_k)·V, where Softmax is the normalized exponential function, so that the output feature elements sum to 1;
Step S4: the Attention head number parameter is set; with head = 8, the 8 Self-Attention matrices are spliced horizontally and finally multiplied by an additional weight matrix to compress them into one matrix;
Step S5: the emotion classification task is executed: a single-layer Softmax neural network is attached to obtain and output the word vectors and corresponding categories of each sentence in the training set;
Step S6: the Masked LM task is executed: for each sentence in the training sample, 15% of the words are randomly masked for prediction, and their outputs are predicted according to the 8:1:1 ratio (replaced by [MASK], replaced by a random word, or left unchanged);
Step S7: the NSP task is executed: for each sentence in the training sample, two sentences A and B are selected, where A is the correct next sentence and B is a wrong one, and the binary classification loss is obtained from the CLS token output;
Step S8: through the above steps, a preliminary classification model is output. The formal corpus is put into the model, the initial Epoch and Batch parameters are set, and the F1-score and Loss function are adopted to evaluate the effect of the model. A minimal sketch of the masking strategy in step S6 follows.

Claims (6)

1. A text emotion recognition method based on an LDA and BERT fusion improved model, characterized in that social network text topic features are obtained through LDA topic analysis, text semantic features are obtained through a BERT model, the word vectors of the two are spliced and put into an emotion classification model so that the model can accurately recognize text emotion, and an optimized classification model is output and used for recognizing text emotion, the method comprising the following steps:
step 1: acquiring a social network text corpus and preprocessing the text;
step 2: putting the preprocessed text corpus into a BERT pre-training model to extract semantic features, and obtaining a word vector matrix of the semantic features;
step 3: putting the preprocessed text corpus into an LDA model to extract expanded topic features, and splicing them with the word vectors of the semantic features obtained in step 2 to obtain a word vector matrix fusing semantic and topic features;
step 4: building an emotion classifier, transmitting the word vectors fusing semantic and topic features into a bidirectional Transformer encoder again, connecting the vectors output by the Transformer with a gradient-optimized Softmax regression model so as to adapt to multiple tasks, and outputting a classification model after training;
step 5: putting the social text corpus used for in-depth testing into the classifier for in-depth pre-training, evaluating the performance of the model, fine-tuning the parameters to obtain the trained classification model, and classifying the emotional tendency of the text.
2. The text emotion recognition method based on LDA and BERT fusion improvement model as claimed in claim 1, wherein the specific step of preprocessing the text in step 1 comprises:
step S11: text cleaning;
step S12: word segmentation and stop word removal: removing nonsense words according to the Chinese inactive word list, and then performing word segmentation processing on the text by using jieba;
step S13: text filtering: and deleting the text with the text length not within the set length range in the social network text corpus set.
3. The method as claimed in claim 1, wherein in step 2 semantic feature extraction is performed on the preprocessed text corpus data through the BERT pre-training model, and each word is mapped into 3 vectors, represented as w_ij(ω + δ + ρ), the 3 vectors being the word vector, text vector and position vector of the text.
4. The method for recognizing emotion of text based on the LDA and BERT fusion improvement model as claimed in claim 1, wherein in step 3 the preprocessed text corpus is put into the LDA model to extract expanded topic features, which are spliced with the word vectors of the semantic features obtained in step 2 to obtain a word vector matrix w_ij(ω + δ + ρ + μ′) fusing semantic and topic features, μ′ being the topic vector, this matrix hereinafter being referred to simply as the word vector; the specific steps are as follows:
step S31: counting the words in the text corpus to generate a dictionary;
step S32: training the corpus with the LDA model in the Gensim module, and weighting the obtained matrix with the tf-idf algorithm to obtain the expanded topic feature vector;
step S33: after the expanded topic feature vector is obtained, splicing it with the semantic-feature word vector obtained in step 2, and completing the expansion of the text in the topic dimension by vector splicing, so that the semantic features extracted by BERT and the topic features extracted by LDA are fused.
5. The method for recognizing emotion of text based on the LDA and BERT fusion improvement model as claimed in claim 1, wherein step 4 transmits the word vectors fusing semantic and topic features output in step 3 into a Transformer encoder, and the output of the Transformer is connected by a gradient-descent-optimized Softmax layer for adapting to the execution of multiple tasks, comprising the following steps:
step S1: the word vectors w_ij(ω + δ + ρ + μ′) are fed into a bidirectional Transformer encoder;
step S2: the word vectors pass through a Self-Attention layer, in which the Query, Key and Value matrices are first calculated;
step S3: Attention is calculated with the self-attention formula Attention(Q, K, V) = Softmax(QK^T / √d_k)·V, where Softmax is the normalized exponential function, and the sum of the output feature elements is 1;
step S4: the number of Attention heads is set; with head = n, the n Self-Attention matrices are spliced horizontally and finally multiplied by an additional weight matrix to compress them into one matrix;
step S5: the emotion classification task is executed: a single-layer Softmax neural network is attached to obtain and output the word vectors of each sentence in the corpus and the corresponding sample categories;
step S6: the Masked LM task is executed: for each sentence in the training sample, words are randomly masked at a set proportion, and the output of the masked positions is predicted from the remaining words according to the set proportion;
step S7: the NSP task is executed: for each sentence in the training sample, two sentences A and B are selected, where A is the correct next sentence and B is a wrong next sentence, and the binary classification loss is obtained from the CLS token output;
step S8: the preliminary classification model is output.
6. The text emotion recognition method based on the LDA and BERT fusion improvement model as claimed in claim 1, wherein in step 5, a formal corpus is put into the classifier built in step 4 for training, initial parameters are set, the accuracy rate and the recall rate are calculated, meanwhile, the F1 score is adopted to search the threshold value of positive and negative emotion classification, and a Loss function is calculated and used as an index of model evaluation.
CN202210447516.5A 2022-04-26 2022-04-26 Text emotion recognition method based on LDA and BERT fusion improved model Pending CN114722835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210447516.5A CN114722835A (en) 2022-04-26 2022-04-26 Text emotion recognition method based on LDA and BERT fusion improved model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210447516.5A CN114722835A (en) 2022-04-26 2022-04-26 Text emotion recognition method based on LDA and BERT fusion improved model

Publications (1)

Publication Number Publication Date
CN114722835A true CN114722835A (en) 2022-07-08

Family

ID=82245573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210447516.5A Pending CN114722835A (en) 2022-04-26 2022-04-26 Text emotion recognition method based on LDA and BERT fusion improved model

Country Status (1)

Country Link
CN (1) CN114722835A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563284A (en) * 2022-10-24 2023-01-03 重庆理工大学 Deep multi-instance weak supervision text classification method based on semantics
CN115563284B (en) * 2022-10-24 2023-06-23 重庆理工大学 Deep multi-instance weak supervision text classification method based on semantics
CN116992026A (en) * 2023-07-12 2023-11-03 华南师范大学 Text clustering method and device, electronic equipment and storage medium
CN117633239A (en) * 2024-01-23 2024-03-01 中国科学技术大学 End-to-end face emotion recognition method combining combined category grammar

Similar Documents

Publication Publication Date Title
CN110134757B (en) Event argument role extraction method based on multi-head attention mechanism
CN108984526B (en) Document theme vector extraction method based on deep learning
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN111160037B (en) Fine-grained emotion analysis method supporting cross-language migration
CN109766277B (en) Software fault diagnosis method based on transfer learning and DNN
CN109325231B (en) Method for generating word vector by multitasking model
CN112015863B (en) Multi-feature fusion Chinese text classification method based on graphic neural network
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN111985247B (en) Microblog user interest identification method and system based on multi-granularity text feature representation
CN114722835A (en) Text emotion recognition method based on LDA and BERT fusion improved model
CN110427616B (en) Text emotion analysis method based on deep learning
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111753058B (en) Text viewpoint mining method and system
CN112163089B (en) High-technology text classification method and system integrating named entity recognition
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN112395417A (en) Network public opinion evolution simulation method and system based on deep learning
CN113705238B (en) Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN112070139A (en) Text classification method based on BERT and improved LSTM
Wang et al. Convolutional Poisson gamma belief network
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
Chan et al. Applying and optimizing NLP model with CARU
CN116050419B (en) Unsupervised identification method and system oriented to scientific literature knowledge entity
Alisamir et al. An end-to-end deep learning model to recognize Farsi speech from raw input

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination