CN114757182A - BERT short text sentiment analysis method for improving training mode


Info

Publication number: CN114757182A
Authority: CN (China)
Prior art keywords: model, short text, word, text, layer
Legal status: Pending
Application number: CN202210354141.8A
Other languages: Chinese (zh)
Inventors: 魏泽阳, 张文博, 姬红兵
Assignees: Shaanxi Fangcun Jihui Intelligent Technology Co ltd; Xidian University
Application filed by Shaanxi Fangcun Jihui Intelligent Technology Co ltd and Xidian University
Priority/filing date: 2022-04-06
Publication date: 2022-07-15 (CN114757182A)


Classifications

    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/30 Semantic analysis
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

A BERT short text sentiment analysis method with an improved training mode. A short text sentiment analysis model is constructed, comprising an input layer, a semantic feature extraction layer, a pooling layer, a fully connected layer and a classification output layer. A data set is collected and preprocessed; the input layer encodes the input text to obtain its word vector representation; a perturbation is added to the word vector to obtain an adversarial example; the semantic feature extraction layer, based on a BERT model, extracts semantic features from the adversarial example and outputs feature vectors; after the pooling layer and the fully connected layer, Softmax normalization yields the final sentiment polarity classification result. The short text sentiment analysis model is trained in an adversarial training mode, which resolves the sentiment misclassification caused by Chinese polysemy, new internet slang and the like, overcomes the inability of traditional models to extract context information and local key information simultaneously, strengthens the robustness of the model, and alleviates the problems of poor training efficiency and degraded model performance.

Description

BERT short text sentiment analysis method for improving training mode
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to natural language processing with artificial intelligence, and in particular relates to a BERT short text sentiment analysis method with an improved training mode.
Background
With the rapid development of information technology and the rise of social networks, more and more users publish opinions and comments online. Internet platforms such as Weibo microblogs, Facebook and Twitter generate massive text comments every day, and mining and analyzing the latent sentiment tendencies in them is of great value in helping governments, enterprises and other organizations make decisions.
The core of sentiment analysis technology lies in the construction of sentiment classification models. Traditional sentiment analysis methods include sentiment-dictionary-based methods and machine learning methods. Dictionary-based methods depend mainly on the construction of the sentiment dictionary. Because the internet now develops rapidly and information is updated ever faster, new words keep emerging online; if the dictionary cannot be updated in time, sentiment tendencies are misjudged and the analysis results drift. The sentiment dictionary therefore needs to be expanded continuously to meet the requirements of sentiment analysis, and expanding it costs substantial time and resources.
Building models with machine learning means training a model on specific data and predicting results with that model; common models include naive Bayes (NB), the support vector machine (SVM) and maximum entropy (ME). A machine-learning sentiment analysis model extracts features from a large labeled or unlabeled corpus with statistical machine learning algorithms and finally outputs a sentiment polarity judgment for the text. Although machine learning reduces the manual workload to some extent, constructing features by hand still consumes much time and effort, and the models generalize poorly.
Deep-learning-based sentiment analysis is currently the most widely used approach: deep learning extracts deep features of a text through text representation, learns the textual features well and improves classification accuracy. For text representation, word vectors are trained with Word2Vec or GloVe and then fed as input to a neural network that learns deep semantic features; common deep learning models include Text-CNN, RNN and LSTM. However, the word vectors trained by the Word2Vec and GloVe models are static: they map the feature words of a sample to vectors of the same dimension, and the trained vectors are fixed, i.e., each word has exactly one numeric vector. Static vectors therefore struggle with the polysemy of Chinese words and with semantically rich new internet slang, so this traditional word embedding is ill-suited to present-day online-comment sentiment classification.
BERT, by contrast, provides a dynamic word vector model capable of transfer learning: the whole BERT model is transferred and further trained in the training stage, generating word vectors for text in the specific scenario.
Deep learning models are vulnerable to carefully crafted input samples, which are referred to as adversarial examples. An adversarial example adds to the original sample a perturbation imperceptible to the human eye that causes the model to make a wrong judgment. Handling the model vulnerability caused by unseen adversarial examples and strengthening model robustness has become an important task.
In multi-class problems, a class-balanced data set has evenly distributed target labels. If the samples of one target class quantitatively dominate the other classes, the data set can be considered unbalanced. This imbalance leads to two problems: low training efficiency and degraded model performance.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a BERT short text sentiment analysis method with an improved training mode. It performs adversarial training with generated adversarial examples to strengthen the robustness and generalization capability of the sentiment analysis model; it extracts features from text automatically via deep learning, removing the heavy manual involvement that dictionary construction and feature engineering require in traditional sentiment analysis models; and it replaces the cross-entropy loss commonly used in classification problems with the Focal Loss function, further alleviating the low training efficiency and degraded model performance caused by data imbalance.
To this end, the invention adopts the following technical scheme:
a BERT short text sentiment analysis method for improving a training mode comprises the following steps:
step 1: constructing a short text sentiment analysis model, wherein the short text sentiment analysis model comprises an input layer, a semantic feature extraction layer, a pooling layer, a fully connected layer and a classification output layer;
step 2: collecting a data set, wherein the data set is a public data set or a data set constructed by collecting short text comment data; when the data set is a self-constructed data set, labeling each piece of collected short text data with a sentiment polarity label, wherein the sentiment polarities comprise six emotions: happy, sad, angry, surprise, neutral and fear;
step 3: preprocessing the short text data in the data set, removing characters useless for sentiment analysis, and converting non-simplified-Chinese content into simplified Chinese to obtain a cleaned short text data set convenient for the subsequent construction of the short text sentiment analysis model;
step 4: in the input layer, first segmenting the input simplified Chinese text, then encoding the segmented text to obtain the word vector representation of the input text, wherein the word vector is obtained by adding a character vector, a text vector and a position vector;
step 5: adding a perturbation to the word vector to obtain an adversarial example;
step 6: the semantic feature extraction layer, based on a BERT model, performs semantic feature extraction on the adversarial example and outputs feature vectors, obtaining a feature vector matrix B ∈ R^{s×e}, where s is the text length (number of tokens) and e is the dimension of the feature vector;
step 7: the pooling layer pools the feature vectors, reducing dimensionality, removing redundant information, compressing features and simplifying network complexity, and outputs the pooled feature vector to the fully connected layer;
step 8: the fully connected layer extracts semantic features from the pooled feature vector and captures sentiment information; finally, the feature vector output by the fully connected layer is normalized with a Softmax classification function to obtain the final sentiment polarity classification result;
step 9: training the short text sentiment analysis model, wherein the adversarial training process is:
step 1: compute the loss of the word vector x on a forward pass through the model, then backpropagate to obtain the gradient g of the loss function with respect to the input word vector x:

g = ∇_x L(f_θ(x), y)

where f_θ(·) is the neural network function through which the predicted value is obtained, y is the true sentiment polarity label of the sample, and L(·) is the loss function;
step 2: compute the perturbation r_adv from the formula

r_adv = ε · g / ||g||_2

where ε represents the perturbation space;
step 3: add the perturbation r_adv to the word vector x, giving x + r_adv; compute the loss of x + r_adv on a forward pass through the model, then backpropagate to obtain the gradient g' of the loss function with respect to x + r_adv; iterate to find the r_adv that maximizes the loss between the prediction for x + r_adv and the true sentiment polarity label y; this r_adv is the optimal perturbation;
step 4: add the optimal perturbation to the word vector and train the model so that, when the perturbation is fixed, the loss between the prediction for the input sample and the true sentiment polarity label y is minimized; when the loss value of the loss function stabilizes over two consecutive iterations, the training process of the model ends, yielding the short text sentiment analysis model, with which short text sentiment analysis is performed.
In one embodiment, in step 4, the word segmentation and encoding method is:

word segmentation is performed through a WordPiece model, directly taking single characters as the basic units of the text, i.e., one character per token; each character of the text is converted into a one-dimensional character vector according to a character vector lookup table, and the value of the text vector is learned automatically during model training, describing the global semantic information of the text and fused with the semantic information of the individual characters; a different position vector is added to characters at different positions to distinguish them; finally, the character vector, the text vector and the position vector are added to obtain the word vector representation of the input text.
In one embodiment, in step 5, the perturbation added to the word vector is the optimal perturbation.
In one embodiment, in step 6, the BERT model is constructed from a deep bidirectional Transformer encoder, thereby structurally maximizing the use of context information; the Transformer encoder comprises word vector and position encoding, a multi-head attention mechanism, residual connection with layer normalization, and a feed-forward network;

the word vector and position encoding provide the position information of each word in the short text, so that the dependency and temporal relations of the words in the short text can be recognized;

the multi-head self-attention mechanism computes the correlation between each word in the short text and the remaining words of the sentence, so that each word vector contains the information of all word vectors in the short text;

the word vectors obtained by the multi-head self-attention mechanism are input into the feed-forward network, which has two layers: the first layer is a ReLU activation function and the second layer is a linear activation function.
In one embodiment, all outputs of the last Transformer layer of the BERT model constitute the feature vector matrix B ∈ R^{s×e}.

In one embodiment, in step 7, the pooling layer pools the feature vectors using max-average pooling. The pooling layer reduces the dimensionality of the feature vectors, removes redundant information, compresses features and simplifies network complexity, thereby alleviating overfitting, reducing computation and lowering memory consumption.

In one embodiment, in step 9, the hyper-parameters of the short text sentiment analysis model are adjusted with a multi-parameter tuning method, and a Dropout strategy and L2 regularization are used during parameter adjustment to avoid overfitting of the model.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, by improving the training mode, the FGM algorithm is used to generate adversarial examples for adversarial training, greatly enhancing the robustness and generalization capability of the model. The BERT pre-trained model resolves the misclassification caused by Chinese polysemy and new internet slang, and using a pre-trained model reduces the time and difficulty of training a deep learning model from scratch. On top of the original BERT, a pooling layer is added, mitigating the overfitting that arises when a BERT model is applied directly to a sentiment analysis task. Model iteration with the Focal Loss function alleviates, to a certain extent, the low training efficiency and degraded model performance caused by class imbalance. Compared with traditional sentiment analysis models, the invention offers strong robustness, high accuracy and easy training.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 shows the BERT model structure.

FIG. 3 shows the Transformer encoder structure.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings; obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art without creative effort based on the embodiments of the present invention shall fall within the protection scope of the invention.
As shown in FIG. 1, the invention relates to a BERT short text sentiment analysis method for improving a training mode, which comprises the following steps:
step 1: and (3) constructing a short text sentiment analysis model, namely a BERT-AdvFL model for short.
The short text sentiment analysis model mainly comprises an input layer, a semantic feature extraction layer, a pooling layer, a fully connected layer and a classification output layer.
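As a concrete illustration, the following PyTorch sketch shows one way such a model could be assembled; it assumes the HuggingFace transformers library and the bert-base-chinese checkpoint, and the class name, dropout rate and pooling details (the max-average pooling described in step 7 below) are illustrative assumptions, not the patent's exact implementation.

```python
# Minimal sketch of the BERT-AdvFL architecture: BERT encoder,
# max-average pooling, a fully connected layer, and Softmax applied by
# the loss. Class name, checkpoint and dropout rate are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class BertAdvFL(nn.Module):
    def __init__(self, num_labels: int = 6, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        hidden = self.bert.config.hidden_size      # e = 768 for the Base version
        self.dropout = nn.Dropout(dropout)
        # mean-pooled and max-pooled vectors are concatenated, hence 2 * hidden
        self.fc = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        h = out.last_hidden_state                  # feature matrix B, shape (batch, s, e)
        mean_pool = h.mean(dim=1)                  # average along the text length s
        max_pool, _ = h.max(dim=1)                 # maximum along the text length s
        pooled = torch.cat([mean_pool, max_pool], dim=-1)
        return self.fc(self.dropout(pooled))       # logits; Softmax is applied in the loss
```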
Step 2: a data set is collected.
The data set can be an open data set or a self-constructed data set built by collecting short text data. When the data set is self-constructed, each piece of collected short text data is labeled with a sentiment polarity label, wherein the sentiment polarities comprise six emotions: happy, sad, angry, surprise, neutral and fear.
Step 3: preprocess the short text data in the data set.

Taking simplified Chinese as an example, the data preprocessing mainly comprises: removing characters useless for sentiment analysis and converting non-simplified-Chinese content into simplified Chinese, yielding a cleaned short text data set that facilitates the subsequent construction of the short text sentiment analysis model.
Step 4: taking simplified Chinese as an example, for a piece of short text data (i.e., a simplified Chinese text), the input layer first performs word segmentation on it and then encodes the segmented text to obtain the word vector representation of the input text, wherein the word vector is obtained by adding a character vector, a text vector and a position vector.
Specifically, the word segmentation and encoding method of the invention is as follows:

word segmentation is performed through a WordPiece model, directly taking single characters as the basic units of the text, i.e., one character per token, and each character of the text is converted into a one-dimensional character vector according to a character vector lookup table, serving as the model input. Besides the character vector, the model input also includes a text vector and a position vector. The value of the text vector is learned automatically during model training; it describes the global semantic information of the text and is fused with the semantic information of the individual characters. Because characters appearing at different positions of the text carry different semantic information, different position vectors are added to characters at different positions to distinguish them; finally, the character vector, the text vector and the position vector are added to obtain the word vector representation of the input text.

By adopting the context-dependent dynamic encoding model WordPiece, the invention can dynamically encode the semantics of a word of the input text under different contexts and encode the logical relations among the clauses of the input text.
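The following sketch illustrates this input-layer encoding with the HuggingFace tokenizer for Chinese BERT, which likewise tokenizes at the character level; the example sentence and all API names are assumptions for illustration, not part of the patent.

```python
# Sketch of the input-layer encoding: character-level tokenization, and
# the character + text (segment) + position vector sum performed inside
# BERT's embedding layer. An assumed realization, for illustration only.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
enc = tokenizer("我爱我的家乡西安", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# ['[CLS]', '我', '爱', '我', '的', '家', '乡', '西', '安', '[SEP]']  (one character per token)

bert = BertModel.from_pretrained("bert-base-chinese")
emb = bert.embeddings
char_vec = emb.word_embeddings(enc["input_ids"])              # character vector
text_vec = emb.token_type_embeddings(enc["token_type_ids"])   # text (segment) vector
pos_ids = torch.arange(enc["input_ids"].size(1)).unsqueeze(0)
pos_vec = emb.position_embeddings(pos_ids)                    # position vector
x = char_vec + text_vec + pos_vec                             # word vector representation
```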
Step 5: add a perturbation to the word vector to obtain an adversarial example.

The specific way of obtaining the perturbation is presented in step 9; the final goal is to add the so-called optimal perturbation to the word vector.
Step 6: the semantic feature extraction layer, based on a BERT pre-trained model, extracts semantic features from the adversarial example to obtain feature vectors, finally outputting a feature vector matrix B ∈ R^{s×e}, where s is the text length (number of tokens) and e is the dimension of the feature vector.

The invention adopts a context-dependent BERT pre-trained model with sentence-vector-level encoding, which can accurately quantify the meaning of the same word of a short text under different contexts and encode the relations between the clauses of the text.

Referring to FIG. 2, the BERT pre-trained model is constructed from deep bidirectional Transformer encoders. The invention uses the Base version, which has 12 Transformer layers: a multi-layer bidirectional encoder with the Transformer encoder as its basic unit, structurally maximizing the use of context information. All outputs of the last Transformer layer of the BERT model constitute the feature vector matrix B ∈ R^{s×e}.
Referring to FIG. 3, the Transformer encoder includes word vector and position encoding, a multi-head attention mechanism, residual connection with layer normalization, and a feed-forward network. The word vector and position encoding provide the position information of each word in the short text, so that the dependency and temporal relations of the words in the short text can be recognized. The multi-head self-attention mechanism computes the correlation between each word in the short text and the remaining words of the sentence, so that each word vector contains the information of all word vectors in the short text; what the multi-head self-attention mechanism produces for a short text is a vector matrix, and each word vector in this matrix, having been processed by multi-head self-attention, contains the information of the other word vectors. The resulting word vectors are input into a feed-forward network to increase generalization capability; the feed-forward network has two layers, the first layer being a ReLU activation function and the second layer a linear activation function.

To address the vanishing-gradient problem of traditional deep learning and to accelerate model training, the Transformer encoder also uses residual connection and layer normalization: layer normalization speeds up model training and convergence by normalizing the hidden layers of the neural network toward a standard normal distribution, while residual connection addresses the gradient vanishing and network degradation problems.
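A minimal sketch of one such encoder block is given below, assuming PyTorch; the dimensions follow BERT-Base and the block is an illustration of the described structure, not the patent's code.

```python
# Sketch of one Transformer encoder block: multi-head self-attention,
# residual connection + layer normalization, and a two-layer
# feed-forward network (ReLU, then linear). Dimensions are assumptions.
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                      # first layer: ReLU activation
            nn.Linear(d_ff, d_model),       # second layer: linear activation
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)           # each word attends to all other words
        x = self.norm1(x + a)               # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))     # residual connection + layer normalization
        return x
```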
Step 7: the pooling layer pools the feature vectors, reducing dimensionality, removing redundant information, compressing features and simplifying network complexity, and outputs the pooled feature vector to the fully connected layer.

In the invention, the pooling layer pools the feature vectors with max-average pooling: the mean and the maximum are computed along the text length for each embedding dimension and concatenated into one vector, converting the hidden sequence into a vector. The pooling layer reduces the dimensionality of the feature vectors, removes redundant information, compresses features and simplifies network complexity, thereby alleviating overfitting, reducing computation and lowering memory consumption.
Step 8: finally, the feature vector output by the fully connected layer is normalized with the Softmax classification function to obtain the final sentiment polarity classification result.

Step 9: the short text sentiment analysis model is trained in an adversarial training mode.

As a method of defending against adversarial attacks, the idea of adversarial training is to add generated adversarial examples to the training set for data augmentation, so that the model learns from adversarial examples during training. In adversarial training, the samples are mixed with small perturbations (changes that are slight but can cause misclassification) and the neural network is then adapted to this change, making it robust to adversarial examples. In essence, adversarial training performs two gradient updates in one step: first gradient ascent, finding the optimal perturbation that maximizes the loss value; then gradient descent, finding the optimal model parameters that minimize the loss value. In the invention, the tuned training parameters at least include the learning rate, the maximum length of the input text and the number of training epochs.
Adversarial training can be summarized as the following max-min formula:

min_θ E_(x,y) [ max_{||r_adv|| ≤ ε} L(f_θ(x + r_adv), y) ]

The inner maximization

max_{||r_adv|| ≤ ε} L(f_θ(x + r_adv), y)

aims to find the perturbation that maximizes the loss function, i.e., the added perturbation tries its best to confuse the neural network. Here x denotes the word vector representation of the input sample, r_adv the perturbation added to x, ε the perturbation space, f_θ(·) the neural network function through which the predicted value is obtained, y the true sentiment polarity label of the sample, and L(·) the loss function; L(f_θ(x + r_adv), y) then denotes the loss, compared against y, obtained by superimposing the perturbation r_adv on the word vector x and passing it through the neural network function f_θ(·).
The invention adopts the FGM algorithm to compute the perturbation. FGM applies L2 normalization, i.e., each dimension of the gradient is divided by the L2 norm of the gradient, expressed by the formula

r_adv = ε · g / ||g||_2

where g = ∇_x L(f_θ(x), y), i.e., the gradient of the loss function L(·) with respect to the input word vector x.
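A common implementation pattern for this FGM step perturbs the word-embedding weights after the clean backward pass; the sketch below assumes PyTorch, and the class and parameter names are illustrative.

```python
# Sketch of the FGM step r_adv = ε · g / ||g||_2 applied to the word
# embedding weights; a common implementation pattern, shown here as an
# assumed illustration rather than the patent's exact code.
import torch

class FGM:
    def __init__(self, model, epsilon: float = 1.0, emb_name: str = "word_embeddings"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        # call after loss.backward(): the gradients g are already populated
        for name, p in self.model.named_parameters():
            if p.requires_grad and self.emb_name in name and p.grad is not None:
                self.backup[name] = p.data.clone()
                norm = torch.norm(p.grad)                      # ||g||_2
                if norm != 0:
                    p.data.add_(self.epsilon * p.grad / norm)  # superimpose r_adv

    def restore(self):
        # remove the perturbation, restoring the original embeddings
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}
```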
The outer minimization min_θ E_(x,y) optimizes the neural network so that, when the perturbation is fixed, the model is trained to minimize the loss on the training data, i.e., the model acquires a certain robustness and adapts to the perturbation.
the present invention confrontation training process can be described as:
step1, calculating the loss value of the word vector x propagating forward along the model, and then propagating backward to obtain the gradient g of the loss function about the input word vector x;
step2, calculating disturbance radv
step3, will disturb radvAdded to the word vector x, i.e. x + radvCalculating x + radvThe loss value of forward propagation along the model and then backward propagation are used to obtain the loss function about x + radvThe gradient g' of the challenge sample is added to the gradient of the original word vector x since a backward propagation is made after step 1; iterate continuously to find a function which makes the loss function about x + radvThe loss value of the maximum of the true emotion polarity label y, r at this timeadvThe disturbance is the optimal disturbance;
step 4: adding the optimal disturbance into the word vector, namely training the model to enable the loss function to be related to the loss value which is the minimum between the loss function and the real emotion polarity label y under the input sample when the disturbance is fixed, ending the training process of the model when the loss value of the loss function tends to be stable in two continuous iteration processes to obtain a short text emotion analysis model, and performing short text emotion analysis by using the short text emotion analysis model.
By continuously repeating step 2 and step 3, the invention finds the optimal perturbation, computed by the formula

r_adv = ε · g / ||g||_2

That is, through repeated forward computation and backpropagation, the perturbation that maximizes the loss is found; this perturbation is called the optimal perturbation, and the process is called attacking the word vector. It can be understood as follows: the first round computes a gradient and obtains a perturbation, the perturbation is added to the word vector, and this iterative process continues until the optimal perturbation with the largest loss is found.
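One adversarial training step may then be sketched as follows, reusing the FGM class above; the names model, loader, optimizer and criterion (e.g., the Focal Loss described below) are assumptions carried over from the earlier sketches.

```python
# Sketch of one adversarial training step: gradient ascent on the
# perturbation, then gradient descent on the model parameters; the
# surrounding objects are assumed from the other sketches.
fgm = FGM(model, epsilon=1.0)
for input_ids, attention_mask, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(input_ids, attention_mask), labels)
    loss.backward()                  # step 1: gradient g of the clean loss
    fgm.attack()                     # step 2: add r_adv to the embeddings
    adv_loss = criterion(model(input_ids, attention_mask), labels)
    adv_loss.backward()              # step 3: gradient g' on x + r_adv is
                                     # accumulated onto the clean gradient
    fgm.restore()                    # remove the perturbation
    optimizer.step()                 # step 4: descend on the combined gradient
```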
The invention uses the Focal Loss function to compute the loss value, measuring the gap between the true probability distribution and the predicted probability distribution of the sentiment labels. The calculation formula is:

FL(p_t) = -α (1 - p_t)^γ log(p_t)

In this loss function, a modulating factor (1 - p_t)^γ is added on top of the conventional cross entropy, with γ ∈ [0, 5] as the focusing parameter; different values of γ affect the result differently. When γ = 0, FL = CE, identical to the conventional cross-entropy function; when γ > 0, the relative loss of easy samples (those whose predictions are close to the true labels) is reduced, and attention is paid to hard samples and misclassified samples. Training therefore concentrates on the hard samples (those whose predictions have a large error with respect to the true labels) and spends less effort on the easy samples. α acts as a balancing weight that controls the share of positive and negative samples in the total loss and adjusts the scaling, with α ∈ [0, 1].
Based on this, the problems of low training efficiency, model performance degradation and the like caused by data imbalance are effectively relieved by the aid of the Focal loss function compared with the cross entropy loss function commonly used by classification problems, and no matter what kind of data is less, misjudgment is easier to perform in the actual training process due to less samples, type features are not learned enough, confidence coefficient is low, and loss is increased accordingly. Meanwhile, simple samples are gradually abandoned in the learning process, so that difficult samples of various categories are left, and the same training optimization purpose can be achieved.
Further, in the adversarial training, the hyper-parameters of the model can be adjusted in a multi-parameter tuning mode, with a Dropout strategy and L2 regularization used during tuning to avoid overfitting of the model.

Further, in the adversarial training, an Adam optimizer can be used for gradient updates; Adam is an extension of the gradient descent optimization algorithm that adapts the learning rate and corrects the gradient.

Furthermore, in the adversarial training, the invention can adopt Warmup strategy training. Warmup is a learning-rate optimization method: a small learning rate is used when training begins, the preset learning rate is used after a certain number of training steps, and a small learning rate is used again as the model approaches the convergence point; such a learning rate schedule keeps the model from overfitting at the start of training.
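As one possible realization, the sketch below combines an Adam optimizer with the linear warmup schedule from HuggingFace transformers; the learning rate, weight decay and warmup proportion are illustrative assumptions.

```python
# Sketch of Adam updates with a warmup schedule, assuming HuggingFace's
# get_linear_schedule_with_warmup as one realization of the Warmup
# strategy described above.
import torch
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5,
                             weight_decay=1e-2)        # weight decay acts as L2 regularization
num_training_steps = len(loader) * num_epochs          # loader and num_epochs assumed
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),    # small LR while training starts
    num_training_steps=num_training_steps,             # LR decays toward 0 near convergence
)
# inside the training loop, call scheduler.step() after optimizer.step()
```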
When verifying the performance of the trained model, accuracy, precision, recall, the F1 value and the like can be used as evaluation indexes.
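These metrics can be computed, for example, with scikit-learn; the macro averaging over the six classes in the sketch below is an illustrative choice, and y_true, y_pred are assumed label arrays.

```python
# Sketch of the evaluation metrics (accuracy, precision, recall, F1),
# assuming scikit-learn; averaging mode is an illustrative assumption.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

acc = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```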
In one embodiment of the invention, taking sentiment analysis of microblog comments on a Chinese website as an example, the data preprocessing step removes user IDs, forwarding marks, URLs, @-mentions and other content irrelevant to the sentiment of the comment; non-simplified-Chinese content such as traditional Chinese characters and English words in the comments is converted into simplified Chinese, and emoji expressions are converted into text output.
The microblog comment texts are divided into short-sentence comments and long-sentence comments according to their sentence structure. For short-sentence comments, a language model is built with the MLM method: 15% of the Chinese characters in a comment are randomly masked or replaced, and the model predicts the masked or replaced part by understanding the context. The replacement rule is: with 80% probability the character is replaced by [MASK], e.g., "I love my hometown Xi'an" -> "I love my hometown [MASK]"; with 10% probability it is replaced by another character, e.g., "I love my hometown Xi'an" -> "I love my hometown Beijing"; with 10% probability the original content is kept unchanged, e.g., "I love my hometown Xi'an" -> "I love my hometown Xi'an". For long-sentence comments, besides MLM, [CLS] and [SEP] are added at the semantic-logic boundaries of the comment text to mark the starting positions of the preceding and following sentences, and context-dependent and context-independent comment texts are fed into the input token layer at a ratio of 1:1 so that the model understands the relations between the sentences of a text.
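The 15% masking with the 80/10/10 replacement rule can be sketched as follows; the function and its arguments are illustrative assumptions, not the patent's implementation.

```python
# Sketch of the 15% masking with the 80/10/10 replacement rule
# described above; names and argument handling are assumptions.
import random
import torch

def mlm_mask(input_ids: torch.Tensor, mask_id: int, vocab_size: int, special_ids: set):
    labels = input_ids.clone()
    for i, tok in enumerate(input_ids.tolist()):
        if tok in special_ids or random.random() >= 0.15:
            labels[i] = -100                                 # position is not predicted
            continue
        r = random.random()
        if r < 0.8:
            input_ids[i] = mask_id                           # 80%: replace with [MASK]
        elif r < 0.9:
            input_ids[i] = random.randrange(vocab_size)      # 10%: replace with another character
        # remaining 10%: keep the original character unchanged
    return input_ids, labels                                 # labels hold the original characters
```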
The perturbation r is computed from the adversarial training formula and the gradient and added to the word vector; what BERT finally takes as input is the word vector with the perturbation added.
The data set consists of microblog comment texts crawled under topics related to the COVID-19 epidemic; after manual labeling, it contains 27768 training samples, 2000 validation samples and 5000 test samples. The model trained on the training set was evaluated on the test set; the comparison experiment is shown in the table below, with the accuracy ACC and the comprehensive evaluation index F1 as evaluation metrics. The table compares the analysis results of the BERT-adv model of the invention with the existing BIGRU and BILSTM models.
[Table: ACC and F1 comparison of BERT-adv, BIGRU and BILSTM (values rendered only as images in the source)]
As can be seen from the table, the accuracy and the comprehensive evaluation index of the invention are far higher than those of the existing models. The improved training mode thus effectively strengthens the robustness of the model against malicious adversarial examples, reduces overfitting and improves the model's predictive ability on unseen samples. Compared with traditional short text sentiment analysis models, the model effectively resolves sentiment misclassification caused by Chinese polysemy, new internet slang and the like, overcomes the traditional models' inability to extract context information and local key information effectively, and strengthens robustness, while also alleviating, to a certain extent, the poor training efficiency and degraded model performance caused by class imbalance; it offers high accuracy and is easy to train.
The above disclosure describes only preferred embodiments of the present invention; the embodiments of the invention are not limited thereto, and any variation conceivable to those skilled in the art falls within the protection scope of the invention.

Claims (7)

1. A BERT short text sentiment analysis method for improving a training mode is characterized by comprising the following steps:
step 1: constructing a short text sentiment analysis model, wherein the short text sentiment analysis model comprises an input layer, a semantic feature extraction layer, a pooling layer, a fully connected layer and a classification output layer;

step 2: collecting a data set, wherein the data set is a public data set or a data set constructed by collecting short text comment data; when the data set is a self-constructed data set, labeling each piece of collected short text data with a sentiment polarity label, wherein the sentiment polarities comprise six emotions: happy, sad, angry, surprise, neutral and fear;

step 3: preprocessing the short text data in the data set, removing characters useless for sentiment analysis, and converting non-simplified-Chinese content into simplified Chinese to obtain a cleaned short text data set convenient for the subsequent construction of the short text sentiment analysis model;

step 4: in the input layer, first segmenting the input simplified Chinese text, then encoding the segmented text to obtain the word vector representation of the input text, wherein the word vector is obtained by adding a character vector, a text vector and a position vector;

step 5: adding a perturbation to the word vector to obtain an adversarial example;

step 6: performing, by the semantic feature extraction layer based on a BERT model, semantic feature extraction on the adversarial example and outputting feature vectors, obtaining a feature vector matrix B ∈ R^{s×e}, wherein s is the text length (number of tokens) and e is the dimension of the feature vector;

step 7: pooling, by the pooling layer, the feature vectors, reducing dimensionality, removing redundant information, compressing features, simplifying network complexity, and outputting the pooled feature vector to the fully connected layer;

step 8: extracting, by the fully connected layer, semantic features from the pooled feature vector and capturing sentiment information, and finally normalizing the feature vector output by the fully connected layer with a Softmax classification function to obtain a final sentiment polarity classification result;

step 9: training the short text sentiment analysis model, wherein the adversarial training process comprises:
step 1: computing the loss of the word vector x on a forward pass through the model, then backpropagating to obtain the gradient g of the loss function with respect to the input word vector x:

g = ∇_x L(f_θ(x), y)

wherein f_θ(·) is the neural network function through which the predicted value is obtained, y is the true sentiment polarity label of the sample, and L(·) is the loss function;

step 2: computing the perturbation r_adv from the formula

r_adv = ε · g / ||g||_2

wherein ε represents the perturbation space;

step 3: adding the perturbation r_adv to the word vector x to give x + r_adv; computing the loss of x + r_adv on a forward pass through the model, then backpropagating to obtain the gradient g' of the loss function with respect to x + r_adv; iterating to find the r_adv that maximizes the loss between the prediction for x + r_adv and the true sentiment polarity label y, this r_adv being the optimal perturbation;

step 4: adding the optimal perturbation to the word vector and training the model so that, when the perturbation is fixed, the loss between the prediction for the input sample and the true sentiment polarity label y is minimized; when the loss value of the loss function stabilizes over two consecutive iterations, the training process of the model ends, yielding a short text sentiment analysis model with which short text sentiment analysis can be performed.
2. The BERT short text sentiment analysis method for improving a training mode according to claim 1, wherein in step 4 the word segmentation and encoding method is:

performing word segmentation through a WordPiece model, directly taking single characters as the basic units of the text, i.e., one character per token, converting each character of the text into a one-dimensional character vector according to a character vector lookup table, and automatically learning the value of the text vector during model training, the value describing the global semantic information of the text and being fused with the semantic information of the individual characters; adding a different position vector to characters at different positions to distinguish them; and finally adding the character vector, the text vector and the position vector to obtain the word vector representation of the input text.
3. The BERT short text sentiment analysis method for improving a training mode according to claim 1, wherein in step 5 the perturbation added to the word vector is the optimal perturbation.
4. The BERT short text sentiment analysis method for improving a training mode according to claim 1, wherein in step 6 the BERT model is constructed from a deep bidirectional Transformer encoder, thereby structurally maximizing the use of context information; the Transformer encoder comprises word vector and position encoding, a multi-head attention mechanism, residual connection with layer normalization, and a feed-forward network;

the word vector and position encoding provide the position information of each word in the short text, so that the dependency and temporal relations of the words in the short text can be recognized;

the multi-head self-attention mechanism computes the correlation between each word in the short text and the remaining words of the sentence, so that each word vector contains the information of all word vectors in the short text;

the word vectors obtained by the multi-head self-attention mechanism are input into the feed-forward network, wherein the feed-forward network has two layers, the first layer being a ReLU activation function and the second layer being a linear activation function.
5. The BERT short text sentiment analysis method for improving a training mode according to claim 4, wherein all outputs of the last Transformer layer of the BERT model constitute the feature vector matrix B ∈ R^{s×e}.
6. The BERT short text sentiment analysis method for improving a training mode according to claim 1, wherein in step 7 the pooling layer pools the feature vectors using max-average pooling.
7. The BERT short text sentiment analysis method for improving a training mode according to claim 1, wherein in step 9 the hyper-parameters of the short text sentiment analysis model are adjusted with a multi-parameter tuning method, and a Dropout strategy and L2 regularization are used during parameter adjustment to avoid overfitting of the model.
CN202210354141.8A (priority date 2022-04-06, filing date 2022-04-06) BERT short text sentiment analysis method for improving training mode. Status: Pending. Publication: CN114757182A (en).

Priority Applications (1)

CN202210354141.8A, priority and filing date 2022-04-06: BERT short text sentiment analysis method for improving training mode

Publications (1)

CN114757182A, published 2022-07-15

Family

ID=82329286

Country Status (1)

CN: CN114757182A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662435A (en) * 2022-10-24 2023-01-31 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
US11727915B1 (en) 2022-10-24 2023-08-15 Fujian TQ Digital Inc. Method and terminal for generating simulated voice of virtual teacher
CN115392237A (en) * 2022-10-27 2022-11-25 平安科技(深圳)有限公司 Emotion analysis model training method, device, equipment and storage medium
CN115392237B (en) * 2022-10-27 2023-04-07 平安科技(深圳)有限公司 Emotion analysis model training method, device, equipment and storage medium
CN115392259A (en) * 2022-10-27 2022-11-25 暨南大学 Microblog text sentiment analysis method and system based on confrontation training fusion BERT
CN115687625B (en) * 2022-11-14 2024-01-09 五邑大学 Text classification method, device, equipment and medium
CN115687625A (en) * 2022-11-14 2023-02-03 五邑大学 Text classification method, device, equipment and medium
CN115759119A (en) * 2023-01-10 2023-03-07 暨南大学 Financial text emotion analysis method, system, medium and equipment
CN115759119B (en) * 2023-01-10 2023-04-21 暨南大学 Financial text emotion analysis method, system, medium and equipment
CN117217261A (en) * 2023-11-08 2023-12-12 江苏云幕智造科技有限公司 Guitar chord generation model method based on chord and lyric structure
CN117217261B (en) * 2023-11-08 2024-02-09 江苏云幕智造科技有限公司 Guitar chord generation model method based on chord and lyric structure
CN117252154A (en) * 2023-11-20 2023-12-19 北京语言大学 Chinese simplified and complex character conversion method and system based on pre-training language model
CN117252154B (en) * 2023-11-20 2024-01-23 北京语言大学 Chinese simplified and complex character conversion method and system based on pre-training language model
CN117574916A (en) * 2023-12-12 2024-02-20 合肥工业大学 Temporary report semantic analysis method and system
CN117574916B (en) * 2023-12-12 2024-05-10 合肥工业大学 Temporary report semantic analysis method and system
CN117574981A (en) * 2024-01-16 2024-02-20 城云科技(中国)有限公司 Training method of information analysis model and information analysis method
CN117574981B (en) * 2024-01-16 2024-04-26 城云科技(中国)有限公司 Training method of information analysis model and information analysis method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination