CN115631504B

CN115631504B - Emotion identification method based on bimodal graph network information bottleneck

Info

Publication number: CN115631504B
Application number: CN202211645853.1A
Authority: CN
Inventors: 李丽; 李平; 苟丽
Original assignee: Southwest Petroleum University
Current assignee: Southwest Petroleum University
Priority date: 2022-12-21
Filing date: 2022-12-21
Publication date: 2023-04-07
Anticipated expiration: 2042-12-21
Also published as: CN115631504A

Abstract

The invention provides an emotion recognition method based on bimodal graph network information bottleneck, which comprises the steps of preprocessing data, and respectively coding pictures and texts through corresponding pre-training models; respectively extracting the characteristics of the text and the image by using a long-short term memory network and a feedforward neural network; constructing a topological graph in the modes based on the grammar dependency relationship and the adjacent position relationship of the visual blocks, and constructing a bimodal topological graph based on a complete bipartite graph; designing a modal interaction module based on a bimodal graph network, and realizing information interaction in and among the modalities by utilizing a graph convolution network; converting node representation of the bimodal topological graph into graph representation through a graph pooling technology; and (4) performing bimodal emotion recognition by adopting a multilayer perceptron. In addition, an information bottleneck module is established, and the generalization capability of the method is improved. The emotion recognition method based on the bimodal graph network information bottleneck can effectively fuse modal information and is used for guiding emotion recognition.

Description

Emotion identification method based on bimodal graph network information bottleneck

Technical Field

The invention belongs to the field of bimodal emotion recognition in the fields of natural language processing and vision intersection, and particularly relates to an emotion recognition method based on bimodal graph network information bottleneck.

Background

The emotion recognition aims at mining subjective information in data by using a natural language processing technology, and is widely applied to various fields, such as: financial market forecasting, business review analysis, and the like. With the rapid development of internet technology, information in the internet gradually changes from plain text to bimodal, so that the existing emotion analysis method faces new challenges and opportunities. How to effectively extract and fuse features from bimodal data is key to bimodal emotion characterization.

General bimodal emotion recognition can be realized by splicing, adding and calculating Hadamard products of all monomodal features, but correlation among the modals cannot be obtained in the mode. Recently, a cross attention mechanism method is introduced to enhance the feature fusion of bimodal data; however, cross-attention merely establishes the association of global semantics of one modality with local features on another modality, and is not sufficient to reflect the alignment relationship of the modalities on the local features, and using a global feature representation of a modality for semantic alignment may generate a large noise. Furthermore, attention-based methods have another drawback, and such methods typically require careful attention patterns, such as: multi-layer/multi-pass attention, multi-layer attention will introduce more parameters, increasing the likelihood of overfitting.

Disclosure of Invention

The invention aims to overcome the defects in the prior art, provides an emotion recognition method based on bimodal graph network information bottleneck, and decomposes data of each modality into semantic units with fine granularity, such as: the text word and image visual block establishes the relation between the bimodal fine-grained semantic units by utilizing the relevance in each modality and among the modalities, so that bimodal feature fusion is directly performed among the fine-grained semantic units, namely, a mapping relation is established for the representation information of each modality by adopting a local alignment local mode, and the semantic information of a text and the local information of an image can be fully fused. In addition, an information bottleneck mechanism is added, so that the generalization capability of the method can be effectively improved.

In order to realize the purpose, the invention adopts the following technical scheme:

s1: preprocessing data, processing the text by adopting a word embedding technology Glove to obtain a text embedding matrix

(ii) a The image is processed using an image processing technique ResNet152, whereby the image is cut to ≦ before processing>

A vision block resulting in an image representation matrix->

(ii) a Wherein it is present>

Indicating the number of visual blocks.

S2: extracting the features of the preprocessed embedded expression, and extracting the text features by using a bidirectional long-short term memory network

Extracting image features using a feed-forward neural network>

。

S3: and constructing a topological graph by using the grammar dependency relationship in the text and the spatial position relationship in the image. The specific operation is as follows:

s31, constructing a topological graph in a text mode by taking words in the text as nodes and grammatical dependency relationship in a dependency tree as undirected edges

。/>

S32: constructing a topological graph in an image mode by taking visual blocks in an image as nodes and taking spatial position relations among the visual blocks as undirected edges

。

S33: taking words in a text and a visual block in an image as two groups of nodes, forming a non-directional edge by any node in the words and each node in the visual block, and constructing a complete bipartite graph as a dual-mode topological graph

。

S4: and designing a modal interaction module based on a bimodal graph network, and performing representation learning by using a message transfer mechanism of the graph convolution network to realize information interaction and feature fusion in and among the modes. The specific operation is as follows:

s41: topological graph in text mode

For an adjacency matrix, S2The obtained text features are word node feature vectors, expression learning of word nodes is carried out through a graph convolution network, information interaction in a text mode is achieved, and the calculation formula is as follows:

in the above-mentioned formula, the compound has the following structure,

is trainable and takes place for a parameter>

The function is activated for sigmoid.

S42: in topological graph in image mode

The image features extracted in the step 2 are visual block node feature vectors, the visual block nodes are represented and learned through a graph convolution network, information interaction in an image mode is achieved, and the calculation formula is as follows:

in the above-mentioned formula, the compound has the following structure,

is trainable and takes place for a parameter>

The function is activated for sigmoid.

S43: in a bimodal topology

As an adjacency matrix, the text and image features extracted by splicing S2 are node feature vectors ^ and ^>

Information aggregation is carried out through a graph convolution network, information fusion between modes is achieved, and a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

is trainable and takes place for a parameter>

The function is activated for sigmoid.

S44: loops S41-S43 are set according to the specific parameters of the model.

S5: and an information bottleneck module is established, and the generalization capability of the method is improved. The specific operation is as follows:

s51: splicing the text embedding and the image embedding after the S1 data preprocessing to obtain the input characteristics of the information bottleneck module

。

S52: splicing the text features and the image features extracted in the step S2 to obtain intermediate features of the information bottleneck module

。

S53: splicing the text representation and the image representation after the modal interaction based on the bimodal graph network S4 to serve as the output characteristic of the information bottleneck module

。/>

S54: the goal of information bottlenecks is to reduce

And/or>

In between, increase->

And/or>

The calculation formula is as follows:

in the above-mentioned formula, the compound has the following structure,

target for which optimization is required for the information bottleneck module, <' >>

For the parameters of the emotion recognition method based on the bimodal graph network information bottleneck, be->

Is->

And &>

The mutual information between the two groups is obtained,

is->

And/or>

In between, based on mutual information->

Is an adjustable factor.

S6: obtaining a graph representation vector by adopting a graph pooling technology represented by all nodes in the spliced bimodal topological graph, wherein a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

representation of a graph representation resulting from the representation of all nodes of a stitched text and visual block, and->

For all nodes in the bimodal topology map, ->

Is node after S4->

Is shown.

S7: and identifying bimodal emotional tendency by using a multi-layer perceptron as a classifier.

S8: the model is trained through bimodal data, a cross entropy loss function and an information bottleneck objective function are used as model training targets, and an Adam optimizer with hot start is used for training the model. The training goals for the model are as follows:

in the above formula, the first and second carbon atoms are,

for a sample in the training set, ->

For the set of all training samples, <' >>

Is adjustable factor>

Is the true value of the sample or samples,

is a predicted value.

S9: and classifying the bimodal data to be classified through the trained model to obtain an emotion recognition result.

Compared with the existing bimodal emotion recognition method, the emotion recognition method based on the bimodal graph network information bottleneck has the following beneficial effects:

1. forming a bimodal topological graph by the text words and the visual blocks, and utilizing grammatical information of the text and spatial position information of the image;

2. the bi-modal topological graph establishes the relation between the bi-modal fine-grained semantic units, so that the multi-modal feature fusion is directly carried out between the fine-grained semantic units, the semantic information of texts and the local information of images can be fully fused, and the defects of the existing method are greatly supplemented;

3. by utilizing an information bottleneck mechanism, the generalization capability of the method is effectively improved.

Drawings

FIG. 1 is an overall flow diagram of the present invention;

FIG. 2 is a diagram of a system model of the present invention;

FIG. 3 is a building block of a bimodal topology of the present invention.

Detailed Description

In order that the public may better understand the present invention, specific embodiments thereof will be described below with reference to the accompanying drawings. Wherein the drawings are for illustrative purposes only and are not to be construed as limiting the invention; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The invention provides an emotion recognition method based on bimodal graph network information bottleneck, which comprises the following steps:

s1: and (3) data preprocessing, namely preprocessing the text and the image respectively through corresponding pre-training models.

As shown in FIG. 1, the text and image in the bimodal data are first separated and then separatedText and images are preprocessed. For texts, the representation of words is searched in pre-trained Glove, each word is mapped to a 300-dimensional vector, and a text embedding matrix is obtained

(ii) a For images, cut first into->

Processing each visual block by adopting an image processing technology ResNet152, processing each visual block into a representation vector with 1024 dimensions, and finally obtaining an image embedding matrix ^ and ^ according to the image>

(ii) a Wherein +>

Indicating the number of visual blocks.

S2: and performing feature extraction on the preprocessed embedded representation.

As shown in fig. 1, feature extraction is performed on the text embedding and the image embedding obtained in S1, respectively.

Because the text has a front-back order relation, in order to integrate more context information into word embedding, a bidirectional long-short early-stage memory network is adopted to carry out context semantic dependency learning, and text characteristics are extracted

. The specific calculation formula is as follows:

in the above-mentioned formula, the compound has the following structure,

for forgetting to close the door>

Is a input door, is>

Is an output gate, which is arranged in the interior of the housing>

Is a candidate value vector, is->

Is a memory cell at the previous moment>

For memory cells at the present moment>

Is a representation of the hidden state at the last moment,

for a hidden status representation at the present time>

、/>

、/>

、/>

And &>

、/>

、/>

、/>

Indicating a trainable parameter, subscript @, of a long and short term memory network>

Representing the index of the position of the current word in the text.

Because no sequence features exist among visual blocks of the image, the feedforward neural network is adopted to extract the image features

. The specific calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

trainable parameters representing a feed forward neural network.

To facilitate the implementation of subsequent feature fusion, text features

And an image feature>

Is set to 128.

S3: and constructing a topological graph by using the grammar dependency relationship in the text and the spatial position relationship in the image.

In order to solve the defects of the prior art, the alignment relation of each modality on local features is reflected. As shown in fig. 3, this step will construct three topologies, namely: two intra-modal topographies and one bi-modal topographies, the operation is as follows.

S31: for the text modality, complex grammatical dependencies exist between words, and modeling grammatical dependencies facilitate learning of text information. Therefore, a topological graph in a text mode is constructed by taking words in the text as nodes and grammar dependence relations in the dependence tree as undirected edges

。

。

S33: establishing the relation between bimodal fine-grained semantic units, so that bimodal feature fusion can be directly carried out between the fine-grained semantic units, namely: and establishing a mapping relation for the representation information of each mode by adopting a local alignment local mode, so that the semantic information of the text and the local information of the image are fully fused. Therefore, the words in the text and the visual blocks in the images are used as two groups of nodes, any node in the words and each node in the visual blocks form a non-directional edge, and a complete bipartite graph is constructed to be used as a dual-mode topological graph

。

S4: and designing a modal interaction module based on a bimodal graph network, and performing representation learning by using a message transmission mechanism of the graph convolution network to realize information interaction and feature fusion in and among the modes.

As shown in fig. 2, the text features extracted in S2

And an image feature>

And sending the data to a bimodal graph network, and carrying out information interaction and feature fusion through a graph volume network on the basis of a topological graph constructed in the S3.

S41: topological graph in text mode

Is adjacent to the matrix, is>

The expression learning of the word nodes is carried out for the word node feature vectors through a graph convolution network, each word node transmits information to a neighbor word node with a grammar dependency relationship, and the information interaction in a text mode is realized, wherein the calculation formula is as follows: />

In the above formula, the first and second carbon atoms are,

for trainable parameters>

The function is activated for sigmoid.

S42: topology map in image modality

In the vicinity of a matrix>

For the feature vectors of the visual block nodes, the representation learning of the visual block nodes is carried out through a graph convolution network, and the information transmission is carried out between the adjacent visual blocks, so as to realize the information interaction in the image modality, and the calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

for trainable parameters>

The function is activated for sigmoid.

S43: in a bimodal topology

Information aggregation is carried out through a graph convolution network, all neighbor nodes of each node belong to another mode node, so that information fusion between modes is realized, and a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

for trainable parameters>

The function is activated for sigmoid.

S44: as shown in fig. 2, S41 to S43 form a convolutional network block, and after the parameter adjustment is performed on the model, a better parameter value of the layer number of the convolutional network block is obtained, and S41 to S43 are cycled according to the specific parameter value.

S5: and an information bottleneck module is established, and the generalization capability of the method is improved.

The information bottleneck module runs through the whole process of the method, and the specific operation is as follows.

。

S52: extracting the text feature and the image feature from S2Performing splicing to obtain intermediate characteristics of the information bottleneck module

。

S53: s4, splicing the text representation and the image representation after modal interaction based on the bimodal graph network, wherein the text representation and the image representation are used as the output characteristics of the information bottleneck module

。

S54: the goal of information bottlenecks is to reduce

And/or>

In between, increase->

And/or>

The calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

Is->

And/or>

In betweenMutual information->

Is->

And/or>

In between, based on the mutual information->

Is an adjustable coefficient.

S6: a graph pooling technique is employed to convert the node representation of the bimodal topology graph into a graph representation.

The bimodal emotion recognition is to classify the overall emotional tendency of the data, and needs to combine the feature information of all nodes in the bimodal topological graph. Therefore, a graph pooling technology represented by all nodes in the spliced bimodal topological graph is adopted to obtain a graph representation vector, and a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

representing a graph representation vector represented by all nodes of the stitched text and the visual block, and->

For all nodes in the bimodal topology map, ->

Is node after S4->

Is shown.

S7: and expressing the vector through the graph obtained in the S6, and identifying the bimodal emotional tendency by using a multilayer perceptron as a classifier, wherein a calculation formula is as follows:

in the above formula, the first and second carbon atoms are,

for a final learned bimodal characterization, the>

For the emotional tendency predicted by the model, < >>

And

represents a trainable weight, <' > asserted>

And &>

Is a trainable bias.

S8: the model is trained through the bimodal data.

In the training process, a cross entropy loss function and an information bottleneck objective function are used as model training targets, and an Adam optimizer with hot start is used for training the model. Wherein the training targets of the model are as follows:

in the above formula, the first and second carbon atoms are,

for a sample in the training set, ->

For the set of all training samples, <' >>

Is adjustable factor>

Is the true value of the sample or samples,

is a predicted value.

The embodiments of the present invention are described only for the preferred embodiments of the present invention, and not for the limitation of the concept and scope of the present invention, and various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the design concept of the present invention shall fall within the protection scope of the present invention, and the technical contents of the present invention which are claimed are all described in the claims.

Claims

1. An emotion recognition method based on bimodal graph network information bottleneck is characterized by comprising the following steps:

s1: data preprocessing, namely preprocessing a text and an image respectively through corresponding pre-training models;

s2: extracting the character of the embedded expression after the pretreatment, and extracting the text character H by using a bidirectional long-short term memory network ^t Extracting image features H using a feed-forward neural network ^v ；

S3: constructing a topological graph by using a syntax dependency relationship in a text and a spatial position relationship in an image;

s31, constructing a topological graph G in a text mode by taking words in the text as nodes and grammatical dependency relationship in a dependency tree as undirected edges ^t ；

S32: by visual blocks in the imageAs nodes, the spatial position relation between the visual blocks is used as a non-directional edge to construct a topological graph G in an image modality ^v ；

S33: taking words in a text and a visual block in an image as two groups of nodes, forming a non-directional edge by any node in the words and each node in the visual block, and constructing a complete bipartite graph as a bimodal topological graph G ^m ；

S4: designing a modal interaction module based on a bimodal graph network, and performing representation learning by using a message transfer mechanism of a graph convolution network to realize information interaction and feature fusion in and among the modes;

s41: topology G in text modality ^t The extracted text features are word node feature vectors, and the expression learning of word nodes is carried out through a graph convolution network, so that information interaction in a text mode is realized;

s42: in a topological graph G within an image modality ^v The image features extracted in S2 are visual block node feature vectors, and the representation learning of the visual block nodes is carried out through a graph convolution network, so that information interaction in an image mode is realized;

s43: in a bimodal topology G ^m As an adjacency matrix, splicing the text and image features extracted by S2 into a node feature vector H ^m ＝[H ^t ，H ^v ]Information aggregation is carried out through a graph convolution network, and information fusion between modes is realized;

s44: setting loops S41-S43 according to the specific parameters of the model;

s5: an information bottleneck module is established, and the generalization capability of the method is improved;

s51: splicing the text embedding and the image embedding after the data preprocessing of the S1 to obtain an input characteristic X of an information bottleneck module;

s52: splicing the text features and the image features extracted in the step S2 to obtain intermediate features Z of the information bottleneck module;

s53: splicing the text representation and the image representation after the modal interaction based on the bimodal graph network in the S4 to serve as the output characteristic Y of the information bottleneck module;

s54: the information bottleneck aims to reduce mutual information between X and Z and increase mutual information between Z and Y, and the calculation formula is as follows:

R(θ)＝I(Z，Y；θ)-βI(Z，X；θ)

in the formula, R (theta) is a target to be optimized for an information bottleneck module, theta is a parameter of an emotion identification method based on the bimodal graph network information bottleneck, I (Z, Y; theta) is mutual information between Z and Y, I (Z, X; theta) is mutual information between X and Z, and beta is an adjustable coefficient;

s6: converting the node representation of the bimodal topological graph into a graph representation by adopting a graph pooling technology;

s7: identifying bimodal emotional tendency by taking a multilayer perceptron as a classifier;

s8: training the model through the bimodal data;

2. The emotion recognition method based on bimodal graph network information bottleneck, according to claim 1, wherein the S1 specifically is: processing the text by adopting a word embedding technology Glove to obtain a text embedding matrix E ^t (ii) a Processing the image by using an image processing technology ResNet152, cutting the image into n visual blocks before processing to obtain an image representation matrix E ^v (ii) a Where n represents the number of visual blocks.

3. The emotion recognition method based on bimodal graph network information bottleneck, as claimed in claim 1, wherein said S6 specifically is: obtaining a graph representation vector by adopting a graph pooling technology of all node representations in the spliced bimodal topological graph, wherein a calculation formula is as follows:

g＝concat(a _k |k∈G ^m )

in the above formula, g represents a graph representation vector obtained by splicing the text and all the nodes of the visual block, k is all the nodes in the bimodal topological graph, and a _k Is the representation of node k after S4.

4. The emotion recognition method based on bimodal graph network information bottleneck, according to claim 1, wherein the S8 specifically is: using a cross entropy loss function and an information bottleneck objective function as model training objectives, and using an Adam optimizer with hot start to train a model; wherein the training targets of the model are as follows:

in the above formula, J is a sample in the training set, J is a set of all training samples, β is an adjustable coefficient, θ is a parameter of the emotion recognition method based on the bimodal graph network information bottleneck, y _j Is the true value of the sample or samples,

is a predicted value. />