
CN114722797A - Multimodal evaluation object emotion classification method based on a grammar-guided network

Info

Publication number: CN114722797A
Application number: CN202210352422.XA
Authority: CN
Inventors: 李露; 李昕玮; 吴国威; 华梓萱; 魏素忠; 周爱华; 吴含前; 陈锦铭; 叶迪卓然; 陈烨; 焦昊; 郭雅娟
Original assignee: Southeast University; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Southeast University; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2022-04-05
Filing date: 2022-04-05
Publication date: 2022-07-08

Abstract

The invention discloses a grammar-guided network, built on pre-trained models, for the multimodal evaluation object emotion classification task. The network performs end-to-end fine-grained emotion analysis: it extracts evaluation objects and judges their emotion polarity at the same time. First, a pre-trained model aligns and fuses the modalities of a multimodal social media corpus to obtain multimodal features enriched with external information. Second, noise in the multimodal feature matrix is filtered on the basis of the selected pre-trained model. Third, attention over the fused sequence is computed along the syntactic dependency tree to capture a context representation informed by syntactic information. Finally, a decoding layer is constructed and a loss function is optimized for the evaluation object extraction and evaluation object emotion classification tasks. The proposed network performs well on the multimodal fine-grained end-to-end emotion analysis task and improves on the baseline method across the metrics of the evaluation object emotion classification task.

Description

Multimodal evaluation object emotion classification method based on a grammar-guided network

Technical Field

The invention relates to fine-grained emotion classification, and in particular to a multimodal evaluation object emotion classification method based on a grammar-guided network.

Background

At present, most fine-grained emotion analysis methods do not treat emotion analysis as a whole but model its subtasks separately. Solving each subtask separately cannot exploit what the subtasks have in common, and training separate models consumes considerable extra resources. For fine-grained emotion classification in particular, most methods predict emotion polarity for a given evaluation object. In practice, however, the evaluation objects in a corpus are hidden in the text and are not given explicitly. Emotion classification methods that depend on a given evaluation object therefore have limited practical value.

Disclosure of Invention

Object of the invention: in view of the defects of the prior art, the invention provides a multimodal evaluation object emotion classification method based on a grammar-guided network. The model effectively encodes multimodal features and performs targeted attention calculation over grammatical features.

Technical scheme: a multimodal evaluation object emotion classification method based on a grammar-guided network. The model framework of the method is divided into four parts: a coding layer, a noise filtering layer, a grammar attention layer, and a decoding layer.

The coding layer is the input to the overall model. In the coding layer, LXMERT is selected to perform multimodal feature encoding of the corpus. Besides the LXMERT layer, the coding layer also has a character-level word vector and an independent BERT encoding block. For text, the LXMERT model first pads or truncates sentences to a uniform maximum length n, and then represents them as position-aware word vectors through word encoding and position encoding. Finally, these vectors pass through R Transformer encoding layers and are sent to the cross-modal coding layer. For the picture, the model first identifies the foreground objects with a Faster R-CNN model and obtains a 2048-dimensional feature matrix of the foreground objects. The model also takes the position of each object into account and expresses its vertex coordinates as a position vector. Finally, the image features and position features are layer-normalized and averaged to give the feature representation of the object. This step can be formulated as:

where W_F, W_P, b_F, and b_P are parameters and v_j is the final feature representation of the object. Similarly, the model sends the object representations to a stacked Transformer for encoding, obtains object feature representations that capture the dependencies between objects, and then sends them to the cross-modal coding layer.
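One form consistent with this description and with the parameters W_F, W_P, b_F, and b_P is the object embedding of the published LXMERT model. The sketch below assumes that form; the symbols f_j (the 2048-dimensional region feature of object j) and p_j (its position vector) are assumed names that do not appear in the surviving text.

```latex
\hat{f}_j = \mathrm{LayerNorm}(W_F f_j + b_F), \qquad
\hat{p}_j = \mathrm{LayerNorm}(W_P p_j + b_P), \qquad
v_j = \frac{\hat{f}_j + \hat{p}_j}{2}
```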

Each cross-modal coding layer of LXMERT has two modules: a bidirectional cross-attention sublayer and a feedforward attention sublayer. The bidirectional cross-attention sublayer in turn has two parts. First, time step i of the text sequence in layer k attends to all m objects of layer k-1; likewise, the j-th object attends to the entire text sequence:

To further establish internal connections, the model then applies self-attention to the text sequence and to the object features:

Finally, the vectors of the two modalities each pass through a Transformer feedforward network and serve as the final output of LXMERT.

Meanwhile, the average of the output vectors of the 12 Transformer layers in BERT is taken as the final output of BERT:

Character-level word vectors are computed for the text with Char-CNN and fused separately with the text encoding obtained by LXMERT and with the BERT encoding. For fusion with BERT, the BERT output is directly concatenated with the character-level word vectors:

A different strategy is adopted for fusion with LXMERT. The two vectors are first compressed to the same dimension, and the character-level word vector and the LXMERT output vector are then summed, with the smaller weight on the character-level word vector:

w_i = a_c (W_c c_i + b_c) + (1 - a_c)(W_l l_i + b_l)

where W_c, W_l, b_c, and b_l are parameters and a_c is the weight of the character-level word vector.
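The weighted fusion can be illustrated with a minimal pure-Python sketch. It assumes that c_i is the character-level word vector and l_i the LXMERT output vector at position i; the toy matrices and the default value a_c = 0.2 are illustrative assumptions, since the text only says that the character-level weight is the smaller one.

```python
def affine(weight, bias, vec):
    """Return weight @ vec + bias for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def fuse_char_lxmert(c_i, l_i, W_c, b_c, W_l, b_l, a_c=0.2):
    """Compute w_i = a_c (W_c c_i + b_c) + (1 - a_c) (W_l l_i + b_l)."""
    char_part = affine(W_c, b_c, c_i)    # character vector projected to the shared dimension
    lxmert_part = affine(W_l, b_l, l_i)  # LXMERT vector projected to the shared dimension
    return [a_c * c + (1.0 - a_c) * l
            for c, l in zip(char_part, lxmert_part)]


# Toy example (hypothetical sizes): 3-dim character vector, 2-dim LXMERT vector.
W_c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_c = [0.0, 0.0]
W_l = [[1.0, 0.0], [0.0, 1.0]]
b_l = [0.0, 0.0]
w_i = fuse_char_lxmert([1.0, 2.0, 3.0], [10.0, 20.0], W_c, b_c, W_l, b_l, a_c=0.25)
```

In a trained model the projections would be learned and a_c kept below 0.5 so that the LXMERT vector dominates the sum.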

In the noise filtering layer, the method uniformly encodes the degree of correlation between the picture and the text in each <text-picture> pair, thereby filtering the noise in the coding layer. The noise filtering layer obtains from the coding layer the LXMERT-based multimodal text encoding W = {w_1, w_2, ..., w_n} and the BERT-based single-modal text encoding.

The first position of the feature matrix output for the multimodal text is passed through a fully connected layer, a Softmax function then outputs the weights of the two encodings, and these weights finally guide the fusion of the two encodings, namely:

α = Softmax(W_α w_CLS + b_α)

where α is the weight, w_CLS is the first position of the feature matrix output for the multimodal text, d is the feature dimension, W_α and b_α are parameters, the fused result is the multimodal feature encoding after noise filtering, and N is the length of the text sequence.
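The gating step can be sketched in a few lines of pure Python. The Softmax weighting follows the formula for α; the way the two encodings are combined (an element-wise weighted sum with α[0] on the multimodal encoding and α[1] on the single-modal one) is an assumption, because the fusion formula itself is not present in this text. The function and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def noise_filter(multimodal, unimodal, W_alpha, b_alpha):
    """Fuse two [n][d] encodings with weights predicted from w_CLS."""
    w_cls = multimodal[0]  # first position of the multimodal feature matrix
    logits = [sum(w * x for w, x in zip(row, w_cls)) + b
              for row, b in zip(W_alpha, b_alpha)]
    alpha = softmax(logits)  # alpha[0]: multimodal weight, alpha[1]: text-only weight
    # Assumed fusion rule: element-wise weighted sum of the two encodings.
    fused = [[alpha[0] * m + alpha[1] * u for m, u in zip(m_row, u_row)]
             for m_row, u_row in zip(multimodal, unimodal)]
    return alpha, fused


multimodal = [[1.0, 0.0], [2.0, 2.0]]
unimodal = [[0.0, 1.0], [4.0, 0.0]]
W_alpha = [[0.0, 0.0], [0.0, 0.0]]  # zero weights give equal logits
b_alpha = [0.0, 0.0]
alpha, fused = noise_filter(multimodal, unimodal, W_alpha, b_alpha)
```

With equal logits the two encodings are averaged; a trained gate would push α toward the text-only encoding when the picture is unrelated to the text.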

Within the grammar attention layer, the model enhances its grammatical interpretability by introducing a syntactic dependency tree to model the dependency relationships within a sentence. For the input multimodal sequence, there is a graph G(V, E) in which each unit of the multimodal sequence is a vertex of G.

The edge set E is the set of all edges with a dependency relationship in the adjacency matrix M. Each edge has a corresponding weight, calculated from the correlation coefficient between the vertices at its two ends. Specifically, when node i of the multimodal sequence is updated, the attention weight of each unit that has a dependency relationship with i is calculated by the following formula:

where the neighborhood of i consists of all neighbor nodes of i; LeakyReLU is an activation function that improves on ReLU by mapping activations below 0 to a small negative value, which avoids dead units; W and ω are trainable parameters; and || denotes vector concatenation.
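The description (LeakyReLU, trainable W and ω, concatenation, normalization over the neighbors of i) matches the standard graph attention network formulation. A sketch under that assumption follows; the symbols h_i (feature of node i), e_{ij}, α_{ij}, and N_i (neighbors of i in M) are assumed names, not recovered from the patent.

```latex
e_{ij} = \mathrm{LeakyReLU}\!\left(\omega^{\top}\,[\,W h_i \,\|\, W h_j\,]\right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}, \quad j \in \mathcal{N}_i
```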

After obtaining the attention weights of node i over its neighbor nodes, the model updates the multimodal feature representation of node i from this set of attention weights, as given by the following formula:

where W and b are parameters. To make the layer more stable, the model uses multiple attention heads whose parameters are independent of one another. At output, the average of the attention heads is taken as the final representation, namely:

Abbreviations and grammatical errors often cause syntactic analysis to fail completely. To limit the large loss this causes, the model introduces a grammar gate that directly decides whether an input sequence passes through the grammar attention layer. The first position (CLS) of the input single-modal text encoding represents the text sequence; it is passed through a fully connected layer for binary classification to judge whether the syntax of the text sequence is disordered:

z^★ = argmax(z_CLS)

where W_CLS and b_CLS are parameters. If the sequence is classified as disordered, the input multimodal sequence is output directly without passing through the grammar attention layer; otherwise it passes through the grammar attention layer.
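The gate logic can be sketched as follows. This is a minimal illustration, not the patented implementation: the class order (index 1 meaning "disordered") and the stand-in attention layer are assumptions, and the argmax is taken over the logits, which selects the same class as an argmax over their softmax.

```python
def grammar_gate(cls_vec, W_cls, b_cls):
    """Binary classification on the CLS vector; returns 0 (well-formed) or 1 (disordered)."""
    z = [sum(w * x for w, x in zip(row, cls_vec)) + b
         for row, b in zip(W_cls, b_cls)]
    return max(range(len(z)), key=lambda k: z[k])


def apply_grammar_layer(sequence, cls_vec, W_cls, b_cls, grammar_attention):
    """Skip grammar_attention when the gate flags the syntax as disordered."""
    if grammar_gate(cls_vec, W_cls, b_cls) == 1:
        return sequence                  # bypass: output the input sequence unchanged
    return grammar_attention(sequence)   # otherwise apply the grammar attention layer


W_cls = [[1.0, 0.0], [0.0, 1.0]]
b_cls = [0.0, 0.0]
double = lambda seq: [[2 * x for x in row] for row in seq]  # stand-in for the real layer
seq = [[1.0, 2.0]]
passed = apply_grammar_layer(seq, [0.9, 0.1], W_cls, b_cls, double)
skipped = apply_grammar_layer(seq, [0.1, 0.9], W_cls, b_cls, double)
```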

The grammar attention layer determines the dependency relationships among words through syntactic analysis and, through the graph attention network, computes attention only among words that are related to each other, thereby strengthening the grammatical interpretability of the model. In addition, the layer accounts for the syntactic analysis failures that the characteristics of social media text may cause: a grammar gate decides whether an input sequence passes through the graph attention network, which makes the layer more effective and reliable.

To let the model maximize the output probability of the true label sequence and avoid interference from invalid sequences, a CRF model is adopted as the decoding layer, and the negative log-likelihood of the CRF model is used as the loss function of the model. The specific formula is as follows:

where N is the size of the sample set, X is the input text sequence, Y is the true label sequence corresponding to the text, and Y_X is the set of all possible outputs conforming to the dependency relationship.
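A CRF negative log-likelihood consistent with these definitions can be written as below. This is a sketch of the standard form: the sequence score function s(X, Y) and the sample index i are assumed notation, and whether the patent averages the sum over N cannot be determined from the surviving text.

```latex
\mathcal{L} = -\sum_{i=1}^{N} \log p(Y_i \mid X_i), \qquad
p(Y \mid X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{Y' \in Y_X} \exp\big(s(X, Y')\big)}
```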

During training, the model is initialized with the Xavier method. The bias of each network layer is initialized to a zero vector, and the parameter matrices are initialized according to

where n is the number of parameters. This initialization mitigates the vanishing gradient problem by keeping the output values of each layer Gaussian-distributed, which prevents the variance of the activations from decaying. The model is optimized with the Adam optimizer, which adjusts the learning rate dynamically for different parameters: frequently updated parameters take smaller steps, while sparsely updated parameters take larger steps.

Advantageous effects:

1) The invention provides the GOPREM model for the evaluation object attribute classification task on social media corpora; the model does not require the evaluation objects to be given in advance.

2) The model introduces the LXMERT pre-trained model for multimodal feature encoding. This overcomes the limited performance of shallow networks, and the pre-trained model carries rich prior knowledge.

3) The model uses a noise filtering layer to remove noise from the fused multimodal features. To address the errors that remain in the self-attention mechanism of the Transformer model, the model introduces a grammar attention layer: the input text is parsed into a syntactic dependency tree, and a graph attention network then computes attention so that each word attends only to the words in the sequence with which it has a grammatical dependency, reducing model errors as far as possible.

4) For the syntactic analysis failures that the informality of social media text may cause, the model introduces a grammar gate that decides whether the input passes through the grammar attention layer. The model judges emotion polarity while extracting the evaluation object and therefore has high practical value.

Drawings

FIG. 1 is an example of a corpus used in the training of the present invention;

FIG. 2 is a diagram of the model architecture of the present invention;

FIG. 3 is a block diagram of the model coding layer architecture of the present invention;

FIG. 4 is a block diagram of the model noise filtering layer of the present invention;

FIG. 5 is an example of a syntactic dependency tree and corresponding adjacency matrix.

Detailed Description

The technical scheme of the invention is further explained by combining the attached drawings.

The invention discloses a grammar-guided network model, built on pre-trained models, for the multimodal evaluation object emotion classification task. Evaluation object extraction (ATE) and evaluation object emotion classification (ATP) are two important subtasks of fine-grained emotion analysis, yet most current work focuses on only one of them. The present model solves both subtasks simultaneously, and the joint task is defined as ATEP: the evaluation objects in a sentence are recognized and, at the same time, the emotion polarity of the sentence toward each evaluation object is judged. By definition, the input of the ATEP task is a natural language sequence and the output is a label sequence with one label per word, so the task is also treated in this document as a sequence labeling task. For an input sequence X = {w_1, w_2, ..., w_n}, where n is the sequence length, the corresponding output sequence is Y = {y_1, y_2, ..., y_n}. The labels of the output sequence follow a modified BIO-2 standard, i.e. y_i ∈ {B-POS, B-NEU, B-NEG, I-POS, I-NEU, I-NEG, O}, where the suffix of a B or I label gives the emotion polarity: POS is positive, NEG is negative, and NEU is neutral. For example, for the sentence "The size is ideal while the quality is bad", the label sequence is Y = {O, B-POS, O, O, O, O, B-NEG, O, O}. FIG. 1 shows an example from the corpus.
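To make the labeling scheme concrete, the following sketch decodes a label sequence of this form back into (evaluation object, polarity) pairs. The function name and the handling of stray I- labels (treated like O) are illustrative assumptions, not part of the patent.

```python
def decode_bio(tokens, tags):
    """Return a list of (evaluation object text, polarity) from BIO-polarity tags."""
    spans, current, polarity = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close the previous span
                spans.append((" ".join(current), polarity))
            current, polarity = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == polarity:
            current.append(token)            # continue the open span
        else:
            if current:                      # O or inconsistent I- tag ends the span
                spans.append((" ".join(current), polarity))
            current, polarity = [], None
    if current:
        spans.append((" ".join(current), polarity))
    return spans


tokens = "The size is ideal while the quality is bad".split()
tags = ["O", "B-POS", "O", "O", "O", "O", "B-NEG", "O", "O"]
result = decode_bio(tokens, tags)
```

For the example sentence, `result` contains the two evaluation objects "size" (POS) and "quality" (NEG).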

The model framework of the invention is divided into four parts: a coding layer, a noise filtering layer, a grammar attention layer, and a decoding layer. The overall structure is shown in FIG. 2.

In the coding layer, LXMERT is selected to perform multimodal feature encoding of the corpus. Besides the LXMERT layer, the coding layer also has a character-level word vector and a separate BERT encoding block; the structure is shown in FIG. 3. Visual encoding, text encoding, and multimodal encoding in the LXMERT model are stacked from 5, 9, and 5 Transformer encoder layers respectively, and the output vector dimension is 768. BERT adopts the BERT-base pre-trained model officially provided by Google, which contains 12 Transformer layers and yields word vectors of dimension 768. The dimension of the character-level word vector is set to 30, and it is initialized from a uniform distribution over (-0.25, 0.25). Because the tokenizers used by LXMERT and BERT may split a word into several word pieces, during alignment the model concatenates the character-level word vector of a word with every word piece corresponding to that word. The syntax tree is generated with the API provided by spaCy. The sentence length and word length are set to 40 and 30, respectively. The initial learning rate of the model is set to 0.001, and the learning rate for fine-tuning BERT and LXMERT is set to 0.0001. Given the characteristics of social media corpora, the picture does not always supplement the text usefully, and in some cases it acts as noise. Therefore, a separate BERT model encodes the text on its own as input to the subsequent network. The average of the output vectors of the 12 Transformer layers in BERT is taken as the final output of BERT:

Meanwhile, character-level word vectors are computed for the text with Char-CNN and fused separately with the text encoding obtained by LXMERT and with the BERT encoding. For fusion with BERT, the BERT output is directly concatenated with the character-level word vectors. For fusion with LXMERT, the two vectors are compressed to the same dimension and then summed, with the smaller weight on the character-level word vector:

w_i = a_c (W_c c_i + b_c) + (1 - a_c)(W_l l_i + b_l)

where W_c, W_l, b_c, and b_l are parameters and a_c is the weight of the character-level word vector.

After the coding layer, the input enters the noise filtering layer, which uniformly encodes the degree of correlation between the picture and the text in each <text-picture> pair and thereby filters the noise generated in the coding layer. The structure of the noise filtering layer is shown in FIG. 4. The noise filtering layer obtains from the coding layer the LXMERT-based multimodal text encoding W = {w_1, w_2, ..., w_n} and the BERT-based single-modal text encoding. The first position of the feature matrix output for the multimodal text is passed through a fully connected layer, and a Softmax function then outputs the weights of the two encodings, which guide their fusion:

α = Softmax(W_α w_CLS + b_α)

where α is the weight, w_CLS is the first position of the feature matrix output for the multimodal text, d is the feature dimension, and W_α and b_α are parameters.

After noise is filtered by the noise filtering layer, the grammar attention layer models the dependency relationships within a sentence by introducing a syntactic dependency tree, thereby enhancing the grammatical interpretability of the model. It also reduces the residual grammar-level errors that the LXMERT and BERT models may carry, which would otherwise affect the accuracy of the evaluation object emotion classification task. The grammar attention layer analyzes the sentence at the grammar level using the syntactic dependency tree and a graph attention network. Specifically, for the sentence T = {t_1, t_2, ..., t_n}, an adjacency matrix M is first generated from the syntactic dependency tree D,

where N is the sentence length. In the adjacency matrix M, a cell is set to 1 for each pair of words that have a dependency relationship in the syntactic dependency tree D (a word is also treated as dependent on itself), and all other cells are set to 0. FIG. 5 illustrates an example of a syntactic dependency tree and the corresponding adjacency matrix. The grammar attention layer aims to reinforce grammatical connections in the final attention representation. However, as emphasized throughout, social media corpora contain many abbreviations, grammatical errors, and misspelled words, which makes accurate syntactic analysis extremely challenging. Moreover, syntactic analysis of abbreviated text makes serious errors both in part-of-speech tagging and in judging the relations between words, which causes the subsequent attention network to fail and greatly increases the loss of the model. Therefore, to limit the large loss caused by syntactic analysis failure, the model introduces a grammar gate that directly decides whether the input sequence passes through the grammar attention layer. The first position (CLS) of the input single-modal text encoding represents the text sequence; it is passed through a fully connected layer for binary classification to judge whether the syntax of the text sequence is disordered:

z^★＝argmax(z_CLS)
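The adjacency matrix M described in this section can be built from a list of head indices, as in the sketch below. It assumes that heads[i] is the index of the syntactic head of word i and that the root points to itself (the convention spaCy uses for `token.head`); treating each dependency as an undirected edge is also an assumption, since the text only says that cells between dependent words are set to 1. The parse in the example is written by hand.

```python
def adjacency_from_heads(heads):
    """Build a symmetric 0/1 adjacency matrix from dependency head indices."""
    n = len(heads)
    M = [[0] * n for _ in range(n)]
    for i, h in enumerate(heads):
        M[i][i] = 1  # a word is treated as dependent on itself
        M[i][h] = 1  # dependent -> head
        M[h][i] = 1  # head -> dependent (undirected edge, an assumption)
    return M


# "The size is ideal": The -> size, size -> is, is = root, ideal -> is
heads = [1, 2, 2, 2]
M = adjacency_from_heads(heads)
```

Row i of M then lists the positions that word i may attend to in the graph attention network.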

Finally, the label sequence is obtained from the output of the decoding layer. To let the model maximize the output probability of the true label sequence and avoid interference from invalid sequences, a CRF model is adopted as the decoding layer, and the negative log-likelihood of the CRF model is used as the loss function of the model. The specific formula is as follows:

To verify the advantages of the present invention over other models, a series of comparative experiments was performed. The CPU of the experimental computer is an 8-core, 16-thread Intel Core™ i9-9900K, and the GPU is a Gigabyte RTX 2080 Ti with 11 GB of video memory. The experiment has three main steps: data preparation, model training, and testing with the trained model to show its effect.

1) Data preparation

The data sets used in the experiment are those published by Zhang et al. in "Adaptive Co-attention Network for Named Entity Recognition in Tweets" and by Lu et al. in "Visual Attention Model for Name Tagging in Multimodal Social Media". When the corpus is screened, tweets whose text is not in English and tweets without a picture are first deleted from the data sets. In the remaining data, if a text has more than one picture, one picture is selected at random as its representative. Finally, items that contain no evaluation object, whose text length is less than 3, or whose text is difficult to understand are deleted. Corpus annotation follows the BIO-2 standard. The corpus is divided into a training set, a validation set, and a test set; the corresponding statistics are shown in the following table:

2) Model training

The model is initialized with the Xavier method, in which n is the number of parameters. This initialization mitigates the vanishing gradient problem by keeping the output values of each layer Gaussian-distributed, which prevents the variance of the activations from decaying. The sentence length and word length are set to 40 and 30, respectively. The initial learning rate of the model is set to 0.001, and the learning rate for fine-tuning BERT and LXMERT is set to 0.0001. The training batch size is 8 samples.

To better evaluate the performance of the GOPREM model, it was compared with the following models under the same experimental environment, model settings, and corpus:

BERT+BiLSTM+CRF: This method encodes text with the BERT pre-trained model, extracts the contextual semantic relations of the text sequence with a bidirectional LSTM, and decodes with a CRF model; it is the basic framework of many text sequence labeling models. During decoding, the evaluation object and the emotion polarity are labeled jointly.

GRACE: This method is an end-to-end emotion analysis model that proposes a gradient harmonized loss function and virtual adversarial training. The model handles only single-modal corpora.

GOPREM-LXMERT: This method removes the LXMERT model and the subsequent noise filtering layer from GOPREM, i.e. no multimodal fusion is performed and only the text is used to predict the emotion polarity of the evaluation object.

RAN: This method is a multimodal evaluation object extraction method.

GOPREM-FG: This method removes the noise filtering layer from the GOPREM model and uses the multimodal encoding directly as input to the grammar attention layer.

GOPREM-GAT: This method removes the grammar attention layer from the GOPREM model and uses the output of the noise filtering layer directly as input to the decoding layer.

GOPREM-FT_BERT: This method does not fine-tune the BERT model in GOPREM and uses only its encoding as network input.

GOPREM-FT_LXMERT: This method does not fine-tune the LXMERT model in GOPREM and uses only its encoding as network input.

3) Results of the experiment

Applying the prepared data to the models above gives the results shown in Table 1. The results report the precision, recall, and F1-measure of the trained models on the test set; the larger these evaluation metrics, the better the model.

TABLE 1

Table 1 shows that the GOPREM model proposed by the present invention obtains the best results. First, among the first three single-modal models, BERT+BiLSTM+CRF performs worst, which shows that although this structure has strong semantic capture, sequence modeling, and rule judgment capabilities, it is a general-purpose architecture with considerable room for optimization in a specific field. The GRACE model is one of the best end-to-end emotion analysis models at present, and its performance greatly exceeds that of the plain BERT+BiLSTM+CRF model, while the GOPREM model with LXMERT multimodal interaction removed comes close to the performance of GRACE. GRACE improves performance through its gradient harmonized loss function and virtual adversarial training, so its innovation lies mainly in model training; the GOPREM model instead models the characteristics of the corpus in a targeted way and achieves similar or even better performance.

Second, the ablation experiments on the GOPREM model show the following. (a) After the noise filtering layer is removed, F1-measure drops by 9.4% on the ATE task and by 9.6% on the ATP task, which does not even match the single-modal GOPREM-LXMERT model. Presumably noise strongly affects the fusion of text and pictures: because LXMERT fuses text and pictures deeply through multi-layer Transformer encoders, model performance suffers greatly without noise filtering, and the noise filtering layer in GOPREM filters the noise in the pictures effectively. (b) Without the grammar attention layer, F1-measure drops by 0.83% on the ATE task and by 0.74% on the ATP task; the grammar attention layer analyzes the grammatical structure of the sentence and places more attention on grammatically related word pairs, which improves model performance. (c) Without fine-tuning of the pre-trained models, the effectiveness of the model drops considerably: without fine-tuning BERT, F1-measure on the ATE and ATP tasks drops by 2.5% and 3.3% respectively, and without fine-tuning LXMERT it drops by 8.9% and 9.8% respectively. This shows that task-oriented fine-tuning of the pre-trained models is important, particularly because the model relies heavily on their CLS features, which need task-specific fine-tuning to work well. The ablation experiments show that every part of the GOPREM model contributes to its performance, and the experimental results show that the GOPREM model proposed by the invention performs well.

Claims

1. A multimodal evaluation object emotion classification method based on a grammar-guided network, characterized in that the model of the method comprises a coding layer, a noise filtering layer, a grammar attention layer, and a decoding layer; a multimodal fused text vector matrix and a single-modal text vector matrix W are obtained through the coding layer as the input of the subsequent model; the noise of the coding layer is filtered by the noise filtering layer; in the grammar attention layer, a syntactic dependency tree is introduced to model the dependency relationships within sentences, thereby enhancing the grammatical interpretability of the model; and finally a label sequence is obtained through the decoding layer.

2. The multimodal evaluation object emotion classification method based on a grammar-guided network as claimed in claim 1, wherein LXMERT is selected in the coding layer for multimodal feature encoding of the corpus; besides the LXMERT layer, the coding layer further has a character-level word vector and an independent BERT encoding block; LXMERT first processes the single-modal inputs before cross-modal encoding; BERT takes the average of the output vectors of its 12 Transformer layers as the final output of BERT; and the character-level word vector is computed for the text with Char-CNN, thereby alleviating the negative effects of frequent abbreviations, grammatical errors, and misspelled words.

3. The multimodal evaluation object emotion classification method based on a grammar-guided network as claimed in claim 2, wherein the LXMERT model used for multimodal encoding is from Hugging Face, wherein visual encoding, text encoding, and multimodal encoding are stacked from 5, 9, and 5 Transformer encoder layers respectively, and the output vector dimension is 768.

4. The multimodal evaluation object emotion classification method based on a grammar-guided network as claimed in claim 1, wherein BERT adopts the BERT-base pre-trained model officially provided by Google, the model comprises 12 Transformer layers, the dimension of the obtained word vector is 768, the dimension of the character-level word vector is set to 30, and its initialization follows a uniform distribution over (-0.25, 0.25).

5. The method of claim 1, wherein in the model training phase the initial learning rate of the model is set to 0.001, the learning rate for fine-tuning BERT and LXMERT is set to 0.0001, and the training batch size is 8 samples.

6. The multimodal evaluation object emotion classification method based on a grammar-guided network as claimed in claim 1, wherein the degree of correlation between the picture and the text in each <text-picture> pair is uniformly encoded in the noise filtering layer so as to filter the noise in the coding layer; after training, the noise filtering layer judges the relationship between the picture and the text more accurately and distributes the weights accordingly to determine the proportion of the picture in the output features, thereby alleviating the noise problem in social media corpora.

7. The method as claimed in claim 6, wherein the noise filtering layer obtains from the coding layer the LXMERT-based multimodal text encoding W = {w_1, w_2, …, w_n} and the BERT-based single-modal text encoding; the first position of the feature matrix output for the multimodal text is passed through a fully connected layer, a Softmax function then outputs the weights of the two encodings, and the weights finally guide the fusion of the two encodings, namely:

α = Softmax(W_α w_CLS + b_α)

where α is the weight, the fused result is the multimodal feature encoding after noise filtering, and N is the text sequence length.

8. The multimodal evaluation object emotion classification method based on a grammar-guided network as claimed in claim 1, wherein the grammar attention layer determines the dependency relationships between words through syntactic analysis and, through the graph attention network, computes attention only among words that are related to each other, thereby enhancing the grammatical interpretability of the model.

9. The method as claimed in claim 8, wherein in the grammar attention layer a syntactic dependency tree is introduced to model the dependency relationships within sentences, thereby enhancing the grammatical interpretability of the model; for the input multimodal sequence, the edge set E is the set of all edges with a dependency relationship in the adjacency matrix M, and each edge has a corresponding weight calculated from the correlation coefficient between the vertices at its two ends; specifically, when node i of the multimodal sequence is updated, the attention weight of each unit that has a dependency relationship with i is calculated by the following formula:

where the neighborhood of i consists of all neighbor nodes of i; LeakyReLU is an activation function that improves on ReLU by mapping activations below 0 to a small negative value, which avoids dead units; W and ω are trainable parameters; and || denotes vector concatenation.

10. The multimodal evaluation object emotion classification method based on a grammar-guided network as claimed in claim 1, wherein a CRF model is used as the decoding layer and the negative log-likelihood of the CRF model is used as the loss function, the specific formula being as follows: