CN114722797A - Multi-mode evaluation object emotion classification method based on grammar guide network

Publication number
CN114722797A
CN114722797A CN202210352422.XA CN202210352422A CN114722797A CN 114722797 A CN114722797 A CN 114722797A CN 202210352422 A CN202210352422 A CN 202210352422A CN 114722797 A CN114722797 A CN 114722797A
Authority
CN
China
Prior art keywords
model
layer
coding
modal
grammar
Legal status
Pending
Application number
CN202210352422.XA
Other languages
Chinese (zh)
Inventor
李露
李昕玮
吴国威
华梓萱
魏素忠
周爱华
吴含前
陈锦铭
叶迪卓然
陈烨
焦昊
郭雅娟
Current Assignee
Southeast University
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Southeast University
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Application filed by Southeast University and Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202210352422.XA
Publication of CN114722797A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention discloses a grammar guide network, built on pre-trained models, for the multi-modal evaluation object emotion classification task. The network performs end-to-end fine-grained emotion analysis, judging the emotion polarity of each evaluation object while extracting it. First, a pre-trained model aligns and fuses the modalities of a multi-modal social media corpus to obtain multi-modal features based on external information. Second, noise in the multi-modal feature matrix is filtered using the selected pre-trained model. Third, attention is computed over the fused sequence according to the syntactic dependency tree, which captures a context attention representation based on syntactic information. Finally, a decoding layer is constructed and a loss function optimized for the evaluation object extraction and evaluation object emotion classification tasks. The proposed network performs well on the multi-modal, fine-grained, end-to-end emotion analysis task. Compared with the baseline methods, it improves to some degree on all aspects of the evaluation object emotion classification task.

Description

Multi-mode evaluation object emotion classification method based on grammar guide network
Technical Field
The invention relates to a fine-grained emotion classification task, in particular to a multi-modal evaluation object emotion classification method based on a grammar guide network.
Background
At present, most fine-grained emotion analysis methods do not treat the emotion analysis task as a whole but model specific subtasks separately. Solving each subtask separately cannot exploit what the subtasks have in common, and training the models separately consumes considerable extra resources. For fine-grained emotion classification in particular, most methods predict emotion polarity for a given evaluation object. In practice, however, the evaluation objects are embedded in the text and are not given explicitly. Current emotion classification methods that rely on a given evaluation object therefore have limited practical value.
Disclosure of Invention
Purpose of the invention: to address the above shortcomings of the prior art, the invention provides a multi-modal evaluation object emotion classification method based on a grammar guide network. The model effectively encodes multi-modal features and computes attention targeted at grammatical features.
Technical scheme: a multi-modal evaluation object emotion classification method based on a grammar guide network. The model framework of the method is divided into four parts: a coding layer, a noise filtering layer, a grammar attention layer and a decoding layer.
The coding layer is the input to the overall model. In the coding layer, LXMERT is selected to perform multi-modal feature coding on the corpus. Besides the LXMERT layer, the coding layer also has a character-level word vector and an independent BERT coding block. For text, the LXMERT model first pads or truncates every sentence to a maximum length n and then represents it as position-aware word vectors through word coding and position coding. These vectors pass through R Transformer coding layers and are then sent to the cross-modal coding layer. For the picture, the model first detects the foreground objects in the picture with a Faster R-CNN model and obtains a 2048-dimensional feature matrix of the foreground objects. The model also takes the position of each object into account and expresses its vertex coordinates as a position vector. Finally, the image features and the position features are each layer-normalized and then averaged to give the feature representation of the object. This step can be formulated as:
$$\hat{f}_j = \mathrm{LayerNorm}(W_F f_j + b_F)$$
$$\hat{p}_j = \mathrm{LayerNorm}(W_P p_j + b_P)$$
$$v_j = (\hat{f}_j + \hat{p}_j) / 2$$
where $W_F$, $W_P$, $b_F$ and $b_P$ are parameters, $f_j$ and $p_j$ are the image feature and the position vector of object $j$, and $v_j$ is the final feature representation of the object. Similarly, the model sends the obtained object representations to a stacked Transformer for encoding, obtains object feature representations that capture the dependencies between objects, and then sends them to the cross-modal coding layer.
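The object representation step above (project the image feature and the position vector, layer-normalize each, then average) can be sketched in plain Python. This is only an illustrative sketch: the helper names `layer_norm`, `linear` and `object_embedding`, the output size `d = 8`, the random weights and the 4-value box vector are assumptions, and the learned gain and bias that a full LayerNorm usually carries are omitted.

```python
import math
import random


def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(W, b, x):
    """Compute W x + b for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def object_embedding(f, p, W_F, b_F, W_P, b_P):
    """Average the layer-normalized feature projection and position projection."""
    f_hat = layer_norm(linear(W_F, b_F, f))
    p_hat = layer_norm(linear(W_P, b_P, p))
    return [(a + c) / 2 for a, c in zip(f_hat, p_hat)]


rng = random.Random(0)
d = 8                                          # assumed output dimension
f = [rng.gauss(0, 1) for _ in range(2048)]     # 2048-dim detector feature
p = [0.1, 0.2, 0.6, 0.9]                       # assumed box coordinates
W_F = [[rng.gauss(0, 0.02) for _ in range(2048)] for _ in range(d)]
W_P = [[rng.gauss(0, 0.02) for _ in range(4)] for _ in range(d)]
v = object_embedding(f, p, W_F, [0.0] * d, W_P, [0.0] * d)
```

Because both inputs to the average are normalized, neither the 2048-dimensional feature nor the short position vector dominates the result by scale alone.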
Each cross-modal coding layer of LXMERT has two modules: a bidirectional cross-attention sublayer and a feed-forward attention sublayer. The bidirectional cross-attention sublayer in turn has two parts. First, each time step i of the k-th layer text sequence attends to all m objects of layer k-1; likewise, the j-th object attends to the entire text sequence:
Figure BDA0003581384510000021
Figure BDA0003581384510000022
To further establish internal connections, the model then applies self-attention to the text sequence and to the object features:
Figure BDA0003581384510000023
Figure BDA0003581384510000024
Finally, the vectors of the two modalities each pass through the Transformer feed-forward network to give the final output of LXMERT.
Meanwhile, the output layer takes the average of the output vectors of the 12 Transformer layers in BERT as the final output of BERT:
Figure BDA0003581384510000025
Character-level word vectors for the text are produced with Char-CNN and are fused separately with the text coding obtained from LXMERT and with the BERT coding. For fusion with BERT, the BERT output is concatenated directly with the character-level word vectors:
Figure BDA0003581384510000026
A different strategy is adopted for fusion with LXMERT. The two vectors are first projected to the same dimension, and the character-level word vector, given the smaller weight, is then summed with the LXMERT output vector:
$$w_i = a_c (W_c c_i + b_c) + (1 - a_c)(W_l l_i + b_l)$$
where $W_c$, $W_l$, $b_c$ and $b_l$ are parameters, $c_i$ and $l_i$ are the character-level word vector and the LXMERT output vector at position $i$, and $a_c$ is the weight given to the character-level word vectors.
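As a small worked illustration of the weighted fusion with LXMERT, the sketch below projects a character-level vector and an LXMERT vector to a common dimension and mixes them with weight a_c. The function names, the toy matrices and the value a_c = 0.25 are assumptions chosen for readability; the text only states that the character-level vector receives the smaller weight.

```python
def linear(W, b, x):
    """Compute W x + b for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def fuse_char_lxmert(c_i, l_i, W_c, b_c, W_l, b_l, a_c):
    """w_i = a_c (W_c c_i + b_c) + (1 - a_c) (W_l l_i + b_l)."""
    proj_c = linear(W_c, b_c, c_i)   # character-level vector in the shared space
    proj_l = linear(W_l, b_l, l_i)   # LXMERT vector in the shared space
    return [a_c * x + (1 - a_c) * y for x, y in zip(proj_c, proj_l)]


# Toy example: a 3-dim character vector and a 2-dim LXMERT vector, fused in 2 dims.
c_i = [1.0, 2.0, 3.0]
l_i = [3.0, 4.0]
W_c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_l = [[1.0, 0.0], [0.0, 1.0]]
w_i = fuse_char_lxmert(c_i, l_i, W_c, [0.0, 0.0], W_l, [1.0, 1.0], a_c=0.25)
```

Here the projections are [1, 2] and [4, 5], so the fused vector is 0.25 * [1, 2] + 0.75 * [4, 5].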
In the noise filtering layer, the method uniformly encodes the degree of correlation between the picture and the text in each <text-picture> pair, thereby filtering the noise introduced in the coding layer. The noise filtering layer obtains from the coding layer the LXMERT-based multi-modal text coding $W = \{w_1, w_2, \cdots, w_n\}$ and the BERT-based single-modal text coding
Figure BDA0003581384510000031
The first position of the feature matrix output for the multi-modal text is passed through a fully connected layer, a Softmax function then outputs the weights of the two codings, and these weights finally guide the fusion of the two codings, namely:
$$\alpha = \mathrm{Softmax}(W_\alpha w_{\mathrm{CLS}} + b_\alpha)$$
Figure BDA0003581384510000032
where $\alpha$ is the weight,
Figure BDA0003581384510000033
is the first position of the feature matrix output for the multi-modal text, $d$ is the feature dimension, $W_\alpha$ and $b_\alpha$ are parameters,
Figure BDA0003581384510000034
is the multi-modal feature coding after noise filtering, and $N$ is the length of the text sequence.
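The gating idea of the noise filtering layer can be illustrated with a short sketch. The Softmax over a fully connected projection of the first (CLS) position follows the description; the fusion step itself is not recoverable from the text here, so the convex combination `alpha[0] * multi-modal + alpha[1] * single-modal` below is an assumption, as are all function and variable names.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def noise_filter(multi_modal, single_modal, W_alpha, b_alpha):
    """Weight the two codings using the first position of the multi-modal coding.

    multi_modal, single_modal: lists of N vectors with the same dimension.
    W_alpha: 2 x d matrix, b_alpha: 2 biases (one score per coding).
    """
    w_cls = multi_modal[0]
    scores = [sum(a * x for a, x in zip(row, w_cls)) + b
              for row, b in zip(W_alpha, b_alpha)]
    alpha = softmax(scores)
    # Assumed fusion: per-position convex combination of the two codings.
    fused = [[alpha[0] * m + alpha[1] * s for m, s in zip(m_row, s_row)]
             for m_row, s_row in zip(multi_modal, single_modal)]
    return alpha, fused


multi_modal = [[2.0, 4.0], [6.0, 8.0]]
single_modal = [[0.0, 0.0], [2.0, 0.0]]
alpha, fused = noise_filter(multi_modal, single_modal,
                            W_alpha=[[0.0, 0.0], [0.0, 0.0]], b_alpha=[0.0, 0.0])
```

With zero gate parameters both codings receive weight 0.5; a trained gate would shift weight toward the text-only coding when the picture is judged to be noise.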
Within the grammar attention layer, the model introduces a syntactic dependency tree to model the dependency relationships between the words of a sentence, which strengthens the grammatical interpretability of the model. For the input multi-modal sequence
Figure BDA0003581384510000035
there is a graph $G(V, E)$ in which each unit of the multi-modal sequence is a vertex of the graph, i.e.
Figure BDA0003581384510000036
The edge set $E$ contains all edges that have a dependency relationship in the adjacency matrix $M$. Each edge carries a weight, which is calculated from the correlation coefficient between the vertices at its two ends. Specifically, when node $i$ of the multi-modal sequence is updated, the attention weight of each unit that has a dependency relationship with $i$ is calculated by the following formula:
Figure BDA0003581384510000037
where
Figure BDA0003581384510000038
is the set of all neighbor nodes of $i$, and LeakyReLU is an activation function that improves on ReLU: it avoids dead units by scaling inputs smaller than 0 by a small slope instead of setting them to zero. $W$ and $\omega$ are trainable parameters, and $\|$ denotes vector concatenation.
After obtaining the attention weights of node i over its neighbor nodes, the model updates the multi-modal feature representation of node i from this set of attention weights, as given by the following formula:
Figure BDA0003581384510000039
where W and b are parameters. To make the layer more stable, the model uses multiple attention heads whose parameters are independent of one another. At output, the average over the attention heads is taken as the final representation, namely:
Figure BDA00035813845100000310
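The grammar attention computation described above can be sketched as a masked graph attention step in plain Python. The sketch assumes the standard graph attention form (score LeakyReLU(ω·[W h_i ‖ W h_j]), Softmax over the neighbors of i, weighted sum of W h_j, heads averaged); the exact formulas are in the equation images, so this form, the omitted bias term, the symmetric adjacency matrix, and the `heads` input convention (index of each word's syntactic head, root pointing to itself) are assumptions.

```python
import math


def leaky_relu(x, slope=0.2):
    """Pass positive inputs through; scale negative inputs by a small slope."""
    return x if x > 0 else slope * x


def adjacency_from_heads(heads):
    """Build a symmetric 0/1 adjacency matrix with self-loops from head indices."""
    n = len(heads)
    M = [[0] * n for _ in range(n)]
    for i, h in enumerate(heads):
        M[i][i] = 1
        M[i][h] = M[h][i] = 1
    return M


def gat_head(H, M, W, omega):
    """One attention head: each node attends only to its neighbors in M."""
    Z = [[sum(w * x for w, x in zip(row, h)) for row in W] for h in H]
    out = []
    for i in range(len(H)):
        nbrs = [j for j in range(len(H)) if M[i][j]]
        scores = [leaky_relu(sum(o * v for o, v in zip(omega, Z[i] + Z[j])))
                  for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        alpha = [e / sum(exps) for e in exps]
        out.append([sum(a * Z[j][k] for a, j in zip(alpha, nbrs))
                    for k in range(len(W))])
    return out


def multi_head(H, M, params):
    """Average the outputs of several independent heads."""
    outs = [gat_head(H, M, W, omega) for W, omega in params]
    return [[sum(o[i][k] for o in outs) / len(outs) for k in range(len(outs[0][i]))]
            for i in range(len(H))]


# "the size is ideal": the -> size, size -> is, is = root, ideal -> is
M = adjacency_from_heads([1, 2, 2, 2])
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = multi_head(H, M, [(identity, [0.0] * 4), (identity, [0.0] * 4)])
```

With ω set to zero every neighbor receives equal attention, so each updated node is simply the mean of its syntactic neighbors; trained parameters would reweight those neighbors.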
Because abbreviations and ungrammatical sentences often cause syntactic analysis to fail completely, the model introduces a grammar gate that directly decides whether an input sequence passes through the grammar attention layer, which mitigates the large loss caused by such failures. The first position (CLS) of the input single-modal text coding represents the text sequence, and a fully connected layer performs binary classification on it to judge whether the syntax of the text sequence is disordered:
Figure BDA0003581384510000041
$$z = \arg\max(z_{\mathrm{CLS}})$$
where $W_{\mathrm{CLS}}$ and $b_{\mathrm{CLS}}$ are parameters. If the sequence is classified as disordered, the input multi-modal sequence is output directly without passing through the grammar attention layer; otherwise it passes through the grammar attention layer.
The grammar attention layer determines the dependency relationships among words through syntactic analysis and, through the graph attention network, computes attention only between words that are related to each other, which strengthens the grammatical interpretability of the model. In addition, because the characteristics of social media text may cause syntactic analysis to fail, the grammar gate decides whether an input sequence passes through the graph network, which improves the effectiveness and reliability of the layer.
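The grammar gate amounts to a two-class decision that either bypasses or applies the grammar attention layer. A minimal sketch follows; which class index denotes a disordered sentence (`disordered_class=1`), the function names and the stand-in attention function are assumptions. Taking the argmax of the raw scores gives the same decision as taking it after a Softmax.

```python
def grammar_gate(h_cls, W_cls, b_cls, sequence, grammar_attention, disordered_class=1):
    """Route the multi-modal sequence around or through the grammar attention layer."""
    scores = [sum(w * x for w, x in zip(row, h_cls)) + b
              for row, b in zip(W_cls, b_cls)]
    z = max(range(len(scores)), key=scores.__getitem__)   # argmax over two classes
    if z == disordered_class:
        return sequence                  # syntax judged disordered: bypass the layer
    return grammar_attention(sequence)   # syntax judged well-formed: apply the layer


def toy_attention(seq):
    """Stand-in for the grammar attention layer; here it just doubles every value."""
    return [[2 * x for x in row] for row in seq]


W_cls = [[1.0, 0.0], [0.0, 1.0]]
b_cls = [0.0, 0.0]
sequence = [[1.0, 2.0], [3.0, 4.0]]
passed = grammar_gate([1.0, -1.0], W_cls, b_cls, sequence, toy_attention)
bypassed = grammar_gate([-1.0, 1.0], W_cls, b_cls, sequence, toy_attention)
```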
So that the model assigns the maximum output probability to the true label sequence and avoids interference from invalid sequences, a CRF model is adopted as the decoding layer, and the negative log-likelihood of the CRF model is used as the loss function of the model. The specific formulas are as follows:
Figure BDA0003581384510000042
Figure BDA0003581384510000043
where $N$ is the size of the sample set, $X$ is the input text sequence, $Y$ is the true label sequence corresponding to the text, and $Y_X$ is the set of all possible output sequences that satisfy the dependency relationship.
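For a single sequence, the CRF negative log-likelihood can be sketched as the log-partition function minus the score of the true label sequence. The sketch assumes a linear-chain CRF with per-position emission scores and a tag-to-tag transition matrix, and it sums over all label sequences rather than a restricted valid set; these choices and all names are assumptions, since the formulas themselves are in the equation images.

```python
import math


def logsumexp(values):
    """Stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, tags):
    """Sum of emission scores and transition scores along one label path."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exponentiated scores of all paths."""
    k = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [logsumexp([alpha[i] + transitions[i][j] for i in range(k)])
                 + emissions[t][j] for j in range(k)]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of the true label sequence."""
    return log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, tags)


emissions = [[0.5, -1.0], [0.2, 0.3], [-0.4, 1.1]]   # 3 positions, 2 tags
transitions = [[0.1, -0.2], [0.7, 0.0]]
loss = crf_nll(emissions, transitions, [0, 1, 1])
```

Averaging this quantity over the N training samples gives a batch loss; minimizing it raises the probability of the true sequence relative to every competing sequence.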
During training, the model is initialized with the Xavier method. This initialization sets the bias of each network layer to a zero vector and draws the parameter matrices from
Figure BDA0003581384510000044
where n is the number of parameters. This initialization mitigates the vanishing gradient problem: it keeps the output values of each layer Gaussian-distributed so that the variance of the activation values does not decay. The model is optimized with the Adam optimizer, which adjusts the learning rate dynamically for each parameter, so that frequently updated parameters take smaller steps while sparsely updated parameters take larger steps.
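As an illustration of this kind of initialization, the sketch below uses the commonly cited Xavier (Glorot) uniform form with limit sqrt(6 / (fan_in + fan_out)) and zero biases. The patent's own distribution is given only in the equation image, so this particular form, the function names and the layer sizes are assumptions rather than the patent's formula.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in matrix from U(-limit, limit), limit = sqrt(6/(fan_in+fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


def zero_bias(size):
    """Biases start as a zero vector."""
    return [0.0] * size


rng = random.Random(0)
W = xavier_uniform(fan_in=4, fan_out=3, rng=rng)
b = zero_bias(3)
```

Scaling the range by the layer width keeps the variance of activations roughly constant from layer to layer, which is the property the paragraph above relies on.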
Advantageous effects:
1) The invention provides the GOPREM model for the evaluation object emotion classification task on social media corpora; the model does not need the evaluation object to be provided in advance.
2) The model introduces the LXMERT pre-trained model for multi-modal feature coding, which overcomes the limited performance of shallow networks; the pre-trained model also carries rich prior knowledge.
3) The model uses a noise filtering layer to remove noise from the multi-modal fusion features. To address the errors that remain in the self-attention mechanism of the Transformer model, the model introduces a grammar attention layer: the input text is parsed into a syntactic dependency tree, and a graph attention network then computes attention so that each word attends only to the words in the sequence with which it has a grammatical dependency, reducing model error as far as possible.
4) For the syntactic analysis failures that the informality of social media text may cause, the model introduces a grammar gate to decide whether the input passes through the grammar attention layer. The model judges the emotion polarity while extracting the evaluation object, and has high practical value.
Drawings
FIG. 1 is an example of a corpus used in the training of the present invention;
FIG. 2 is a diagram of the model architecture of the present invention;
FIG. 3 is a block diagram of the model coding layer architecture of the present invention;
FIG. 4 is a block diagram of a model noise filter layer of the present invention;
FIG. 5 is an example of a syntactic dependency tree and corresponding adjacency matrix.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
The invention discloses a grammar guide network model, built on pre-trained models, for the multi-modal evaluation object emotion classification task. Evaluation object extraction (ATE) and evaluation object emotion classification (ATP) are two important subtasks of fine-grained emotion analysis, yet most current work on fine-grained emotion analysis focuses on only one of them. The present model solves both subtasks simultaneously, and the joint task is defined as ATEP: the model recognizes the evaluation objects in a sentence and at the same time judges the emotion polarity of the sentence toward each evaluation object. In this definition, the input of the ATEP task is a natural language sequence and the output is a label sequence with one label per word, i.e. the task is treated here as a sequence labeling task. For an input sequence $X = \{w_1, w_2, \cdots, w_n\}$, where $n$ is the sequence length, the corresponding output sequence is $Y = \{y_1, y_2, \cdots, y_n\}$. The labels of the output sequence follow a modified BIO-2 standard, i.e. $y_i \in \{\text{B-POS}, \text{B-NEU}, \text{B-NEG}, \text{I-POS}, \text{I-NEU}, \text{I-NEG}, \text{O}\}$. The suffix of a B or I label gives its emotion polarity: POS is positive, NEG is negative and NEU is neutral. For the example sentence "The size is ideal while the quality is bad", the label sequence is $Y = \{\text{O}, \text{B-POS}, \text{O}, \text{O}, \text{O}, \text{O}, \text{B-NEG}, \text{O}, \text{O}\}$. Fig. 1 shows an example triplet from the corpus.
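The labeling scheme can be made concrete with a small decoder that turns a polarity-suffixed BIO label sequence back into (evaluation object, polarity) pairs. The function name and the handling of stray I- labels (an I- label that does not continue an open span of the same polarity is treated like O) are assumptions made for this sketch.

```python
def decode_bio(tokens, tags):
    """Extract (evaluation object, polarity) pairs from BIO tags with polarity suffixes."""
    spans = []
    current = None                      # (list of tokens, polarity) of the open span
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current and current[1] == tag[2:]:
            current[0].append(token)    # continue the open span
        else:
            if current:
                spans.append(current)
            current = None              # O label or inconsistent I- label
    if current:
        spans.append(current)
    return [(" ".join(words), polarity) for words, polarity in spans]


tokens = "The size is ideal while the quality is bad".split()
tags = ["O", "B-POS", "O", "O", "O", "O", "B-NEG", "O", "O"]
pairs = decode_bio(tokens, tags)
```

For the example sentence this yields the pair ("size", "POS") followed by ("quality", "NEG"), which shows how one label sequence carries both the extraction result and the polarity.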
The model framework of the invention is divided into four parts: a coding layer, a noise filtering layer, a grammar attention layer and a decoding layer. The overall structure is shown in FIG. 2.
In the coding layer, LXMERT is selected to perform multi-modal feature coding on the corpus. Besides the LXMERT layer, the coding layer also has a character-level word vector and a separate BERT coding block; the structure is shown in FIG. 3. The visual, text and multi-modal encoders in the LXMERT model are stacks of 5, 9 and 5 Transformer encoder layers respectively, and the output vector dimension is 768. BERT uses the BERT-base pre-trained model officially released by Google, which contains 12 Transformer layers and yields 768-dimensional word vectors. The dimension of the character-level word vector is set to 30, and the vectors are initialized from a uniform distribution over (-0.25, 0.25). Because the tokenizers used by LXMERT and BERT may split a word into several word pieces, during alignment the model concatenates the character-level word vector of a word with every word piece of that word. The syntactic dependency tree is generated with the API provided by spaCy. The sentence length and word length are set to 40 and 30, respectively. The initial learning rate of the model is set to 0.001, and the learning rate for fine-tuning BERT and LXMERT is set to 0.0001.
Given the characteristics of social media corpora, the picture does not always act as a useful supplement to the text; in some cases it is noise. Therefore, a separate BERT model encodes the text on its own as input to the subsequent network. The average of the output vectors of the 12 Transformer layers in BERT is taken as the final output of BERT:
Figure BDA0003581384510000061
Meanwhile, character-level word vectors for the text are produced with Char-CNN and are fused separately with the text coding obtained from LXMERT and with the BERT coding. For fusion with BERT, the BERT output is concatenated directly with the character-level word vectors:
Figure BDA0003581384510000062
A different strategy is adopted for fusion with LXMERT. The two vectors are first projected to the same dimension, and the character-level word vector, given the smaller weight, is then summed with the LXMERT output vector:
$$w_i = a_c (W_c c_i + b_c) + (1 - a_c)(W_l l_i + b_l)$$
where $W_c$, $W_l$, $b_c$ and $b_l$ are parameters, $c_i$ and $l_i$ are the character-level word vector and the LXMERT output vector at position $i$, and $a_c$ is the weight given to the character-level word vectors.
The output of the coding layer enters the noise filtering layer, which uniformly encodes the degree of correlation between the picture and the text in each <text-picture> pair and thereby filters the noise generated in the coding layer. The structure of the noise filtering layer is shown in FIG. 4. The noise filtering layer obtains from the coding layer the LXMERT-based multi-modal text coding $W = \{w_1, w_2, \cdots, w_n\}$ and the BERT-based single-modal text coding
Figure BDA0003581384510000071
The first position of the feature matrix output for the multi-modal text is passed through a fully connected layer, a Softmax function then outputs the weights of the two codings, and these weights finally guide the fusion of the two codings, namely:
$$\alpha = \mathrm{Softmax}(W_\alpha w_{\mathrm{CLS}} + b_\alpha)$$
Figure BDA0003581384510000076
where $\alpha$ is the weight,
Figure BDA0003581384510000073
is the first position of the feature matrix output for the multi-modal text, $d$ is the feature dimension, $W_\alpha$ and $b_\alpha$ are parameters,
Figure BDA0003581384510000074
is the multi-modal feature coding after noise filtering, and $N$ is the length of the text sequence.
After noise has been filtered by the noise filtering layer, the grammar attention layer introduces a syntactic dependency tree to model the dependency relationships between the words of a sentence, which strengthens the grammatical interpretability of the model. It also reduces the residual grammar-level errors of the LXMERT and BERT models, which would otherwise affect the accuracy of the evaluation object emotion classification task. The grammar attention layer parses the sentence at the grammatical level using the syntactic dependency tree and a graph attention network. Specifically, for a sentence $T = \{t_1, t_2, \cdots, t_n\}$, an adjacency matrix is first generated from the syntactic dependency tree $D$:
Figure BDA0003581384510000075
where $N$ is the sentence length. In the adjacency matrix $M$, an entry is set to 1 for every pair of words that have a dependency relationship in the syntactic dependency tree $D$ (a word is also treated as dependent on itself), and all other entries are set to 0. FIG. 5 shows an example of a syntactic dependency tree and the corresponding adjacency matrix. The grammar attention layer aims to strengthen grammatical relations in the final attention representation.
However, social media corpora are characterized by frequent abbreviations, ungrammatical sentences and misspelled words, which makes accurate syntactic analysis extremely challenging. Moreover, syntactic analysis of abbreviated text is known to make serious errors in both part-of-speech tagging and the judgment of word relations, which causes the subsequent attention network to fail and greatly increases the loss of the model. Therefore, to mitigate the large loss caused by syntactic analysis failure, the model introduces a grammar gate that directly decides whether the input sequence passes through the grammar attention layer. The first position (CLS) of the input single-modal text coding represents the text sequence, and a fully connected layer performs binary classification on it to judge whether the syntax of the text sequence is disordered:
Figure BDA0003581384510000081
$$z = \arg\max(z_{\mathrm{CLS}})$$
where $W_{\mathrm{CLS}}$ and $b_{\mathrm{CLS}}$ are parameters. If the sequence is classified as disordered, the input multi-modal sequence is output directly without passing through the grammar attention layer; otherwise it passes through the grammar attention layer.
Finally, the label sequence is obtained from the output of the decoding layer. So that the model assigns the maximum output probability to the true label sequence and avoids interference from invalid sequences, a CRF model is adopted as the decoding layer, and the negative log-likelihood of the CRF model is used as the loss function of the model. The specific formulas are as follows:
Figure BDA0003581384510000082
Figure BDA0003581384510000083
where $N$ is the size of the sample set, $X$ is the input text sequence, $Y$ is the true label sequence corresponding to the text, and $Y_X$ is the set of all possible output sequences that satisfy the dependency relationship.
To verify the advantages of the present invention over other models, a series of comparative experiments was performed. The experimental machine has an 8-core, 16-thread Intel Core i9-9900K CPU and a Gigabyte RTX 2080 Ti GPU with 11 GB of video memory. The experiment has three main steps: data preparation, model training, and testing with the trained model to show its effect.
1) Data preparation
The data sets used in the experiment are those released with "Adaptive Co-attention Network for Named Entity Recognition in Tweets" by Zhang et al. and "Visual Attention Model for Name Tagging in Multimodal Social Media" by Lu et al. When screening the corpus, tweets whose text is not in English and tweets without a picture are first deleted from the data sets. In the remaining data, if a text has more than one picture, one picture is randomly selected as its representative. Finally, items that contain no evaluation object, whose text length is less than 3, or whose text is difficult to understand are deleted. Corpus annotation follows the BIO-2 standard. The whole corpus is divided into a training set, a validation set and a test set, and the corresponding corpus statistics are shown in the following table:
Figure BDA0003581384510000091
2) Model training
During training, the model is initialized with the Xavier method. This initialization sets the bias of each network layer to a zero vector and draws the parameter matrices from
Figure BDA0003581384510000092
where n is the number of parameters. This initialization mitigates the vanishing gradient problem: it keeps the output values of each layer Gaussian-distributed so that the variance of the activation values does not decay. The sentence length and word length are set to 40 and 30, respectively. The initial learning rate of the model is set to 0.001, and the learning rate for fine-tuning BERT and LXMERT is set to 0.0001. The training batch size is 8 samples.
To better evaluate the performance of the GOPREM model, it was compared with the following models under the same experimental environment, model settings and corpus:
BERT + BiLSTM + CRF: this method uses a BERT pre-trained model for text encoding, a bidirectional LSTM to extract the contextual semantic relations of the text sequence, and a CRF model for decoding. It is the basic framework of many text sequence labeling models. During decoding, the evaluation object and the emotion polarity are labeled jointly.
GRACE: this method is an end-to-end emotion analysis model that proposes a gradient harmonized loss function and virtual adversarial training. The model handles single-modal corpora only.
GOPREM-LXMERT: this method removes the LXMERT model and the subsequent noise filtering layer from GOPREM, i.e. no multi-modal fusion is performed, and only text is used to predict the emotion polarity of the evaluation object.
RAN: this method is a multi-modal evaluation object extraction method.
GOPREM-FG: this method removes the noise filtering layer from the GOPREM model and uses the multi-modal coding directly as input to the grammar attention layer.
GOPREM-GAT: this method removes the grammar attention layer from the GOPREM model and uses the output of the noise filtering layer directly as input to the decoding layer.
GOPREM-FT_BERT: this method does not fine-tune the BERT model in GOPREM and uses only its coding as network input.
GOPREM-FT_LXMERT: this method does not fine-tune the LXMERT model in GOPREM and uses only its coding as network input.
3) Results of the experiment
Applying the prepared data to the above models gives the results shown in Table 1. The results report the precision, recall and F1-measure of the trained models on the test set; the larger these values, the better the model.
TABLE 1
Figure BDA0003581384510000101
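For reference, precision, recall and F1-measure over extracted evaluation objects are commonly computed by exact matching of predicted items against gold items, as in the sketch below. The patent does not state its matching rule, so exact matching, the item format (sentence id, evaluation object, polarity) and the function name are assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of labeled items."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Items are (sentence id, evaluation object, polarity); the second prediction
# finds the right object but the wrong polarity, so it does not count as a match.
gold = [(0, "size", "POS"), (0, "quality", "NEG")]
predicted = [(0, "size", "POS"), (0, "quality", "POS")]
p, r, f1 = precision_recall_f1(gold, predicted)
```

Dropping the polarity field from each item would score extraction alone, while keeping it scores extraction and polarity together.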
From table 1 it can be seen that the best results are obtained with the goperm model proposed by the present invention. First, comparing the first three single-mode models, finding that BERT + BilSTM + CRF performs the worst in the three single-mode models, which shows that although the structure has strong semantic capture, timing modeling and rule judgment capabilities, the structure is a general-purpose architecture and still has a large optimization space in a specific field. GRACE model is one of the most excellent end-to-end emotion analysis models at present, and the performance of the GRACE model greatly exceeds that of the simple BERT + BilSTM + CRF model, while the GOPREM model with LXMERT multi-modal interaction removed is closer to the performance of the GRACE model. The GRACE model enhances the performance of the model through innovative gradient balance loss functions and virtual confrontation training, and is mainly innovated in the aspect of model training, while the GOPREM model provided by the invention carries out targeted modeling through analyzing the characteristics of corpora, and achieves similar or even better effect on performance.
Second, ablation experiments on the GOPREM model show the following:
(a) After the noise filtering layer is removed, the F1-measure drops by 9.4% on the ATE task and by 9.6% on the ATP task. The resulting model does not even match the unimodal GOPREM-LXMERT model. Presumably noise strongly affects the fusion of text and pictures: the LXMERT model deeply fuses text and pictures through multi-layer Transformer encoders, so without noise filtering the model performance suffers greatly. This indicates that the noise filtering layer in GOPREM effectively filters the noise in the pictures.
(b) After the grammar attention layer is removed, the F1-measure drops by 0.83% on the ATE task and by 0.74% on the ATP task. This indicates that the grammar attention layer can effectively analyze the grammatical structure of a sentence and apply greater attention between grammar-related word pairs, thereby improving model performance.
(c) If the pre-trained models are not fine-tuned, the model effect drops considerably: without fine-tuning BERT, the F1-measure on the ATE and ATP tasks drops by 2.5% and 3.3% respectively, and without fine-tuning LXMERT it drops by 8.9% and 9.8% respectively. This shows that task-oriented fine-tuning of the pre-trained models is important; in particular, the CLS features of the pre-trained model are used heavily in GOPREM and must be fine-tuned on the task to work well.
The ablation experiments show that each part of the GOPREM model contributes to its performance, and the experimental results show that the GOPREM model provided by the invention performs excellently.
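The two layers removed in the ablation variants can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the gate follows α = Softmax(W_α w_CLS + b_α) from the claims, but the fusion rule (a convex combination of the two encodings) and the attention score (the standard graph-attention form with LeakyReLU over concatenated projections) are assumptions, because the source gives both formulas only as images. All function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def noise_filter(multi, text, w_alpha, b_alpha):
    """Gate two encodings with alpha = Softmax(W_alpha w_CLS + b_alpha).

    ASSUMPTION: each fused row is alpha[0] * multimodal + alpha[1] * text.
    """
    w_cls = multi[0]  # first position of the multimodal encoding
    logits = [x + b for x, b in zip(matvec(w_alpha, w_cls), b_alpha)]
    alpha = softmax(logits)
    fused = [[alpha[0] * m + alpha[1] * t for m, t in zip(m_row, t_row)]
             for m_row, t_row in zip(multi, text)]
    return alpha, fused


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def syntax_attention(hidden, adjacency, weight, omega):
    """Attention weights restricted to dependency neighbours.

    ASSUMPTION: e_ij = LeakyReLU(omega . [W h_i || W h_j]), normalised with
    a softmax over the neighbours of i given by the adjacency matrix.
    """
    projected = [matvec(weight, h) for h in hidden]
    attention = []
    for i, row in enumerate(adjacency):
        neighbours = [j for j, linked in enumerate(row) if linked]
        # List addition concatenates the two projected vectors.
        scores = [leaky_relu(dot(omega, projected[i] + projected[j]))
                  for j in neighbours]
        weights = softmax(scores) if scores else []
        full = [0.0] * len(hidden)
        for j, w in zip(neighbours, weights):
            full[j] = w
        attention.append(full)
    return attention


# Toy inputs: three tokens, feature dimension 2 (hypothetical values).
multi = [[0.5, -1.0], [1.0, 0.0], [0.0, 2.0]]
text = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
identity = [[1.0, 0.0], [0.0, 1.0]]
alpha, fused = noise_filter(multi, text, identity, [0.0, 0.0])

# Hypothetical dependency adjacency with self-loops; tokens 0 and 2 are unrelated.
adjacency = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
attn = syntax_attention(fused, adjacency, identity, [0.5, -0.5, 1.0, 0.25])
print(alpha)
print(attn)
```

Word pairs without a dependency edge receive exactly zero attention, which is the property the grammar attention layer relies on; dropping `noise_filter` or `syntax_attention` from this pipeline corresponds loosely to the GOPREM-FG and GOPREM-GAT variants.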

Claims (10)

1. A multi-modal evaluation object emotion classification method based on a grammar oriented network, characterized in that a model of the method comprises a coding layer, a noise filtering layer, a grammar attention layer and a decoding layer; the coding layer produces a multi-modal fusion text vector matrix
Figure FDA0003581384500000011
and a single-modal text vector matrix W, which serve as the input of the subsequent model; the noise of the coding layer is filtered by the noise filtering layer; in the grammar attention layer, a syntactic dependency tree is introduced to model the dependency relationships within sentences, so that the grammatical interpretability of the model is enhanced; and finally a label sequence is obtained through the decoding layer.
2. The multi-modal evaluation object emotion classification method based on the grammar-oriented network as claimed in claim 1, wherein LXMERT is selected in the coding layer for multi-modal feature coding of the corpus; in addition to the LXMERT layer, the coding layer is further provided with character-level word vectors and an independent BERT coding block; LXMERT first processes the single-modal inputs before cross-modal coding; BERT takes the average of the output vectors of its 12 Transformer layers as its final output; and the character-level word vectors are obtained by encoding the text with Char-CNN, thereby alleviating the negative effects of the many abbreviations, grammatical errors and misspelled words in the corpus.
3. The multi-modal evaluation object emotion classification method based on the grammar-oriented network as claimed in claim 2, wherein the LXMERT model used for multi-modal coding is taken from Hugging Face, wherein the visual coding, text coding and multi-modal coding are stacks of 5, 9 and 5 Transformer encoder layers respectively, and the output vector dimension is 768.
4. The multi-modal evaluation object emotion classification method based on the grammar-oriented network as claimed in claim 1, wherein BERT adopts the BERT-base pre-trained model officially provided by Google, the model comprises 12 Transformer layers, the dimension of the obtained word vectors is 768, the dimension of the character-level word vectors is set to 30, and the character-level word vectors are initialized from a uniform distribution over (-0.25, 0.25).
5. The method of claim 1, wherein in the model training phase the initial learning rate of the model is set to 0.001, the learning rate for fine-tuning BERT and LXMERT is set to 0.0001, and the training batch size is 8 samples.
6. The multi-modal evaluation object emotion classification method based on the grammar-oriented network as claimed in claim 1, wherein the degree of correlation between the picture and the text in each <text-picture> pair is uniformly coded in the noise filter layer so as to filter the noise in the coding layer; after training, the noise filter layer judges the relationship between the picture and the text more accurately and assigns weights accordingly to determine the proportion of the picture in the output features, thereby alleviating the noise problem in the social media corpus.
7. The method as claimed in claim 6, wherein the noise filter layer obtains from the coding layer the LXMERT-based multi-modal text coding $W = \{w_1, w_2, \dots, w_n\}$ and the BERT-based single-modality text coding
Figure FDA0003581384500000021
The first position of the feature matrix output for the multi-modal text is passed through a fully connected layer, a Softmax function then outputs the weights of the two codings, and finally these weights guide the fusion of the two codings, namely:
$\alpha = \mathrm{Softmax}(W_\alpha w_{CLS} + b_\alpha)$
Figure FDA0003581384500000022
wherein $\alpha$ is the weight,
Figure FDA0003581384500000023
is the first-position feature of the multi-modal text output, d is the feature dimension, $W_\alpha$ and $b_\alpha$ are parameters,
Figure FDA0003581384500000024
is the multi-modal feature coding after noise filtering, and N is the text sequence length.
8. The multi-modal sentiment classification method based on the grammar guide network as claimed in claim 1, wherein the grammar attention layer determines the dependency relationships between words through syntactic analysis and, through a graph attention network, performs attention calculation only between words that are related to each other, thereby enhancing the grammatical interpretability of the model.
9. The method as claimed in claim 8, wherein in the grammar attention layer a syntactic dependency tree is introduced to model the dependency relationships within sentences so as to enhance the grammatical interpretability of the model; for the multi-modal sequence input
Figure FDA0003581384500000025
there is a graph G(V, E) in which each unit of the multi-modal sequence is a vertex of the graph G, i.e.
Figure FDA0003581384500000026
The edge set E is the set of all edges having a dependency relationship in the adjacency matrix M, and each edge has a corresponding weight calculated from a correlation coefficient between the vertices at its two ends; specifically, when node i of the multi-modal sequence is updated, the attention weight of each unit having a dependency relationship with i is calculated by the following formula:
Figure FDA0003581384500000027
wherein
Figure FDA0003581384500000028
is the set of all neighbor nodes of i; LeakyReLU is an activation function that improves on ReLU by mapping activations below 0 to small negative values, which avoids dead units; W and ω are trainable parameters; and ‖ denotes vector concatenation.
10. The method for classifying multi-modal evaluation object emotions based on the grammar guide network as claimed in claim 1, wherein a CRF model is used as the decoding layer and the negative log-likelihood of the CRF model is used as its loss function, the specific formula being as follows:
Figure FDA0003581384500000031
Figure FDA0003581384500000032
wherein N is the size of the sample set, X is the input text sequence, Y is the true label sequence corresponding to the text, and $Y_X$ is the set of all possible output sequences conforming to the dependency relationship.
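The negative log-likelihood used as the decoding-layer loss can be illustrated with a minimal linear-chain CRF sketch. It assumes the usual decomposition of a sequence score into per-token emission scores and tag-to-tag transition scores, and it normalises over all tag sequences without any constraint on admissible outputs; neither detail is confirmed by the source, whose formulas are available only as images. The score tables, tag indices and function names are hypothetical.

```python
import math
from itertools import product


def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def path_score(emissions, transitions, tags):
    """Unnormalised score of one tag sequence."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores of all tag sequences."""
    num_tags = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[prev] + transitions[prev][cur]
                         for prev in range(num_tags)]) + emissions[t][cur]
            for cur in range(num_tags)
        ]
    return log_sum_exp(alpha)


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of the gold tag sequence for one sample."""
    return log_partition(emissions, transitions) - path_score(
        emissions, transitions, tags)


# Hypothetical scores for a 3-token sentence and 3 tags.
emissions = [[1.0, 0.2, -0.5], [0.1, 1.5, 0.3], [-0.2, 0.4, 2.0]]
transitions = [[0.5, 0.1, -1.0], [-0.3, 0.8, 0.2], [0.0, -0.6, 0.7]]
gold_tags = [0, 1, 2]

loss = crf_nll(emissions, transitions, gold_tags)

# Cross-check the forward algorithm by enumerating all 3**3 tag sequences.
brute = log_sum_exp([path_score(emissions, transitions, list(seq))
                     for seq in product(range(3), repeat=3)])
print(loss, brute)
```

For a training set, the per-sample values would be summed or averaged over the N samples; the forward recursion keeps the cost linear in sequence length, whereas the brute-force check is only feasible for toy inputs.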
CN202210352422.XA 2022-04-05 2022-04-05 Multi-mode evaluation object emotion classification method based on grammar guide network Pending CN114722797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210352422.XA CN114722797A (en) 2022-04-05 2022-04-05 Multi-mode evaluation object emotion classification method based on grammar guide network

Publications (1)

Publication Number Publication Date
CN114722797A true CN114722797A (en) 2022-07-08

Family

ID=82242616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210352422.XA Pending CN114722797A (en) 2022-04-05 2022-04-05 Multi-mode evaluation object emotion classification method based on grammar guide network

Country Status (1)

Country Link
CN (1) CN114722797A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117390141A (en) * 2023-12-11 2024-01-12 江西农业大学 Agricultural socialization service quality user evaluation data analysis method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117390141A (en) * 2023-12-11 2024-01-12 江西农业大学 Agricultural socialization service quality user evaluation data analysis method
CN117390141B (en) * 2023-12-11 2024-03-08 江西农业大学 Agricultural socialization service quality user evaluation data analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination