CN116245102B - Multi-mode emotion recognition method based on multi-head attention and graph neural network


Info

Publication number
CN116245102B
Authority
CN
China
Prior art keywords
entity
matrix
semantic
background
words
Prior art date
Legal status
Active
Application number
CN202310524378.0A
Other languages
Chinese (zh)
Other versions
CN116245102A (en)
Inventor
牟昊
黄于晏
何宇轩
徐亚波
李旭日
Current Assignee
Guangzhou Datastory Information Technology Co ltd
Original Assignee
Guangzhou Datastory Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Datastory Information Technology Co ltd filed Critical Guangzhou Datastory Information Technology Co ltd
Priority to CN202310524378.0A
Publication of CN116245102A
Application granted
Publication of CN116245102B
Status: Active
Anticipated expiration

Classifications

    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a multi-modal emotion recognition method based on a multi-head attention and graph neural network, which comprises the following steps: acquiring multi-modal data related to emotion classification, wherein the multi-modal data comprise text data and image data; performing feature fusion on the associated semantics, syntactic dependencies and similar semantics of the text data by using a graph convolution (GCN) neural network model and a weighted attention mechanism to obtain GCN text features; respectively performing feature extraction on the entities and the background of the image data to obtain GCN entity features and GCN background features; and inputting these, together with the RNN text features, ResNet entity features and ResNet background features, into a multi-head attention module, which outputs the final emotion classification result and completes the multi-modal emotion recognition. The invention can process text and image data simultaneously, handles the relations between data of different modalities through a multi-channel graph neural network, mines and fuses the multi-modal data features more comprehensively, and improves the accuracy of emotion recognition.

Description

Multi-mode emotion recognition method based on multi-head attention and graph neural network
Technical Field
The invention relates to the technical field of multi-modal emotion classification, in particular to a multi-modal emotion recognition method based on a multi-head attention and graph neural network.
Background
Target-oriented emotion recognition has received considerable attention in recent years because it can provide more complete and in-depth results. However, with the proliferation of multi-modal data, emotion recognition is no longer limited to text content and has shifted toward the interactive analysis of multi-modal information, which is of great significance for improving the accuracy and reliability of emotion analysis.
Most existing multi-modal emotion recognition methods mainly analyze the content of each modality independently and then perform feature interaction through a multi-head attention mechanism, so the degree of interaction is shallow. A multi-layer multi-head attention mechanism with residual connections can apply finer attention weighting to multiple inputs, better extract key features from complex multi-modal data, and improve the expressive capacity of a model; it can also adaptively weight the features of different modalities as well as information at different levels, so that the model handles data of different types and scales better and its generalization capability is improved. It is therefore commonly used for multi-modal emotion recognition.
The prior art discloses a multi-modal emotion recognition method based on attribute feature fusion, which uses data from three modalities (text, speech and video) for the final emotion recognition: features are first extracted from the three modalities, feature fusion is then performed at the attribute feature layer, and the emotion recognition result is finally obtained. Such prior-art methods usually attend only to a single data type and ignore the mutual influence and complementarity among multi-modal data, so feature extraction is not comprehensive enough. Moreover, existing methods for multi-modal data generally rely only on an attention mechanism when fusing features; with such a simple weighting approach it is difficult to make full use of the interaction between modalities, information fusion is insufficient, and the predictive capability of the model is therefore reduced. In addition, the prior art neglects the use of syntactic information when processing text and the associations between features when processing images.
Disclosure of Invention
In order to overcome the defects of the prior art, namely the shallow interaction, insufficient feature extraction and poor emotion classification performance on multi-modal data during emotion recognition, the invention provides a multi-modal emotion recognition method based on a multi-head attention and graph neural network, which can process text and image data simultaneously, handles the relations between data of different modalities through a multi-channel graph neural network, mines and fuses the multi-modal data features more comprehensively, and improves the accuracy of emotion recognition.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a multi-mode emotion recognition method based on a multi-head attention and graph neural network comprises the following steps:
s1: acquiring multi-modal data related to emotion classification, wherein the multi-modal data comprises text data and image data;
s2: converting the text data into word vectors; respectively carrying out entity extraction and background extraction on the image data to obtain entity image data and background image data;
s3: constructing an associated semantic matrix and a syntactic dependency matrix by using text data, and constructing a similar semantic matrix by using word vectors of the text data; simultaneously inputting word vectors of the text data into a trained RNN neural network model to obtain RNN text characteristics;
Setting a conditional probability threshold, constructing an entity conditional probability matrix by using entity image data, and constructing a background conditional probability matrix by using background image data; simultaneously, respectively inputting entity image data and background image data into a trained ResNet neural network model to obtain ResNet entity characteristics and ResNet background characteristics;
s4: performing semantic-syntactic interactive representation on the associated semantic matrix, the syntactic dependency matrix and the similar semantic matrix, and inputting the three matrices after interaction into a three-channel graph convolution GCN neural network model to obtain an associated semantic feature vector, a syntactic dependency feature vector and a similar semantic feature vector;
respectively inputting the entity conditional probability matrix and the background conditional probability matrix into a graph convolution GCN neural network model to obtain entity characteristic vectors and background characteristic vectors;
s5: respectively calculating the attention weights of the associated semantic feature vectors, the syntactic dependency feature vectors and the similar semantic feature vectors by using a weighted attention mechanism, and merging to obtain semantic-syntactic features of the text data;
acquiring semantic-syntactic characteristics of emotion classification target words according to the semantic-syntactic characteristics of the text data, and taking the semantic-syntactic characteristics of the emotion classification target words as GCN text characteristics;
Calculating the attention weights of the entity feature vector and the background feature vector respectively by utilizing a multi-head attention mechanism, and acquiring GCN entity features and GCN background features;
s6: the GCN text feature, the RNN text feature, the GCN entity feature, the GCN background feature, the ResNet entity feature and the ResNet background feature are input into a preset multi-head attention module together, a final emotion classification result is output, and multi-mode emotion recognition is completed.
Preferably, the specific method of step S2 is as follows:
acquiring multi-mode data related to emotion classification, and converting text data into GloVe word vectors;
respectively carrying out entity extraction and background extraction on the image data to obtain entity image data and background image data;
the entity extraction method comprises the following steps: performing entity extraction by using any one of the trained Mask R-CNN, Fast R-CNN, SSD and YOLO models, or a model of the same type;
the background extraction method comprises the following steps: performing background extraction by using any one of the trained VGG-Places, Places-CNN and SceneNet models, or a model of the same type.
Preferably, in the step S3, the specific method for constructing the association semantic matrix and the syntax dependency matrix by using the text data and constructing the similarity semantic matrix by using the word vector of the text data is as follows:
The specific method for constructing the associated semantic matrix by using the text data comprises the following steps:
calculating the point mutual information value of each word and other words in sentences of the text data, and carrying out normalization processing to obtain the association value N of each word and other words; setting an association value threshold and acquiring the association degree of each word with other words, wherein if the association value N is larger than the threshold, the association degree of the two words is N, otherwise the association degree of the two words is 0, and constructing an association semantic matrix, wherein the value range of the association degree is [0,1], and an association degree of 1 means that the two words always appear together in the sentences of the text data;
the specific method for constructing the syntactic dependency matrix by using the text data comprises the following steps:
dividing all sentences in the text data into words, and marking the parts of speech of all words obtained by dividing the words; analyzing all words marked by parts of speech by using a natural language processing tool or a manual analysis method, obtaining grammar information of each word and other words in a sentence, constructing a syntactic dependency matrix according to the dependency relationship between the two words, and if the dependency relationship exists between the two words, setting the corresponding position element in the syntactic dependency matrix to be 1, otherwise, setting the corresponding position element to be 0;
The specific method for constructing the similar semantic matrix by using the word vector of the text data comprises the following steps:
and calculating cosine similarity C between each word vector of the text data and other word vectors, setting a similarity threshold value, acquiring the similarity between each word and other words, if the cosine similarity C is larger than the similarity threshold value, the similarity of the two words is C, otherwise, the similarity of the two words is 0, and constructing a similarity semantic matrix, wherein the value range of the similarity is [0,1], and when the similarity is 1, the two words are completely identical.
Preferably, in the step S3, a conditional probability threshold is set, an entity conditional probability matrix is constructed by using entity image data, and a specific method for constructing a background conditional probability matrix by using background image data is as follows:
the specific method for constructing the entity conditional probability matrix by using the entity image data comprises the following steps:
acquiring an entity co-occurrence matrix, wherein elements represent the frequency of simultaneous occurrence of two entities; calculating the conditional probability between each entity and other entities in the entity image data according to the entity co-occurrence matrix; setting a conditional probability threshold, setting the conditional probability as 1 if the calculated conditional probability is larger than the conditional probability threshold, otherwise setting 0, and constructing an entity conditional probability matrix;
The specific method for constructing the background condition probability matrix by using the background image data comprises the following steps:
acquiring a background co-occurrence matrix, wherein elements represent the frequency of simultaneous occurrence of two backgrounds; calculating the conditional probability between each background and other backgrounds in the background image data according to the background co-occurrence matrix; setting a conditional probability threshold, setting the conditional probability to 1 if the calculated conditional probability is larger than the conditional probability threshold, otherwise setting 0, and constructing a background conditional probability matrix.
Preferably, in the step S3, the trained RNN neural network model includes: transformer, LSTM, BI-LSTM, BI-GRU, GRU and PQRNN;
the trained ResNet neural network model is specifically any one of ResNet series models.
Preferably, in the step S4, the semantic-syntactic feature of the emotion classification target word is obtained according to the semantic-syntactic feature of the text data, and the specific method is as follows:
setting emotion classification target words in text data, using other words as non-target words, performing 0 setting operation on non-target word vectors in semantic-syntactic characteristics of the text data, obtaining semantic-syntactic characteristics of the emotion classification target words, and using the semantic-syntactic characteristics of the target words as GCN text characteristics.
Preferably, the number of layers of the graph convolution GCN neural network model in the step S4 is at least two.
Preferably, the multi-head attention module preset in the step S6 includes a plurality of multi-head attention layers and full connection layers with residual connection, which are sequentially connected;
the multi-head attention layers are used for calculating weights of all the features and carrying out residual weighting, and the full-connection layer is used for outputting a final emotion classification result.
Preferably, the step S6 further includes: and setting a loss function, optimizing all the graph convolution GCN neural network models by using an Adam optimizer, and obtaining the optimal graph convolution GCN neural network model when the loss function value is minimum.
Preferably, the multi-mode data further includes audio data and video data, and according to steps S2 to S6, feature extraction fusion is performed on any one multi-mode data combination of text data and audio data, text data and video data, image data and audio data, image data and video data, and emotion recognition is performed.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a multi-mode emotion recognition method based on a multi-head attention and graph neural network, which comprises the following steps: acquiring multi-modal data related to emotion classification, wherein the multi-modal data comprises text data and image data; converting the text data into word vectors; respectively carrying out entity extraction and background extraction on the image data to obtain entity image data and background image data; constructing an associated semantic matrix and a syntactic dependency matrix by using text data, and constructing a similar semantic matrix by using word vectors of the text data; simultaneously inputting word vectors of the text data into a trained RNN neural network model to obtain RNN text characteristics; setting a conditional probability threshold, constructing an entity conditional probability matrix by using entity image data, and constructing a background conditional probability matrix by using background image data; simultaneously, respectively inputting entity image data and background image data into a trained ResNet neural network model to obtain ResNet entity characteristics and ResNet background characteristics; performing semantic-syntactic interactive representation on the associated semantic matrix, the syntactic dependency matrix and the similar semantic matrix, and inputting the three matrices after interaction into a three-channel graph convolution GCN neural network model to obtain an associated semantic feature vector, a syntactic dependency feature vector and a similar semantic feature vector; respectively inputting the entity conditional probability matrix and the background conditional probability matrix into a graph convolution GCN neural network model to obtain entity characteristic vectors and background characteristic vectors; respectively calculating the attention weights of the associated semantic feature vectors, the syntactic dependency feature vectors and the similar semantic feature vectors by using a weighted attention mechanism, and merging to obtain semantic-syntactic features of the text data; acquiring semantic-syntactic characteristics of emotion classification target words according to the semantic-syntactic characteristics of the text data, and taking the semantic-syntactic characteristics of the emotion classification target words as GCN text characteristics; calculating the attention weights of the entity feature vector and the background feature vector respectively by utilizing a multi-head attention mechanism, and acquiring GCN entity features and GCN background features; the GCN text feature, the RNN text feature, the GCN entity feature, the GCN background feature, the ResNet entity feature and the ResNet background feature are input into a preset multi-head attention module together, a final emotion classification result is output, and multi-mode emotion recognition is completed;
The invention first extracts the semantic information in the text by using the similar semantic matrix and the co-occurrence (associated) semantic matrix, and then fuses this semantic information with the syntactic features of the text, providing more emotion-related information for the multi-modal emotion classification task. When processing images, the invention does not, as in the prior art, concentrate only on feature extraction from the whole picture, but also supplements the relations between the features in the picture and captures the global co-occurrence relations. It can process text and image data simultaneously, handles the relations between data of different modalities through a multi-channel graph neural network, mines and fuses the multi-modal data features more comprehensively, and improves the accuracy of emotion recognition. In addition, the multi-head attention mechanism with residual connections used in the invention can apply finer attention weighting to multiple inputs and better extract key features from complex multi-modal data, thereby improving the expressive capability of the model.
Drawings
Fig. 1 is a flowchart of a multi-modal emotion recognition method based on a multi-head attention and graph neural network according to embodiment 1.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, the embodiment provides a multi-modal emotion recognition method based on multi-head attention and graph neural network, which includes the following steps:
s1: acquiring multi-modal data related to emotion classification, wherein the multi-modal data comprises text data and image data;
s2: converting the text data into word vectors; respectively carrying out entity extraction and background extraction on the image data to obtain entity image data and background image data;
s3: constructing an associated semantic matrix and a syntactic dependency matrix by using text data, and constructing a similar semantic matrix by using word vectors of the text data; simultaneously inputting word vectors of the text data into a trained RNN neural network model to obtain RNN text characteristics;
setting a conditional probability threshold, constructing an entity conditional probability matrix by using entity image data, and constructing a background conditional probability matrix by using background image data; simultaneously, respectively inputting entity image data and background image data into a trained ResNet neural network model to obtain ResNet entity characteristics and ResNet background characteristics;
S4: performing semantic-syntactic interactive representation on the associated semantic matrix, the syntactic dependency matrix and the similar semantic matrix, and inputting the three matrices after interaction into a three-channel graph convolution GCN neural network model to obtain an associated semantic feature vector, a syntactic dependency feature vector and a similar semantic feature vector;
respectively inputting the entity conditional probability matrix and the background conditional probability matrix into a graph convolution GCN neural network model to obtain entity characteristic vectors and background characteristic vectors;
s5: respectively calculating the attention weights of the associated semantic feature vectors, the syntactic dependency feature vectors and the similar semantic feature vectors by using a weighted attention mechanism, and merging to obtain semantic-syntactic features of the text data;
acquiring semantic-syntactic characteristics of emotion classification target words according to the semantic-syntactic characteristics of the text data, and taking the semantic-syntactic characteristics of the emotion classification target words as GCN text characteristics;
calculating the attention weights of the entity feature vector and the background feature vector respectively by utilizing a multi-head attention mechanism, and acquiring GCN entity features and GCN background features;
s6: the GCN text feature, the RNN text feature, the GCN entity feature, the GCN background feature, the ResNet entity feature and the ResNet background feature are input into a preset multi-head attention module together, a final emotion classification result is output, and multi-mode emotion recognition is completed.
In the specific implementation process, firstly, multi-mode data related to emotion classification, including text data and image data, is obtained; converting the text data into word vectors; respectively carrying out entity extraction and background extraction on the image data to obtain entity image data and background image data;
constructing an associated semantic matrix and a syntactic dependency matrix by using text data, and constructing a similar semantic matrix by using word vectors of the text data; setting a conditional probability threshold, constructing an entity conditional probability matrix by using entity image data, and constructing a background conditional probability matrix by using background image data;
in the present embodiment, setting the conditional probability threshold has the following advantages:
1) Noise removal: in practical applications, owing to factors such as data noise and sparsity, some conditional probabilities may be erroneously estimated as non-zero values, so that a large number of noise edges exist in the binary co-occurrence matrix; by setting a hyperparameter β, the edges with small conditional probability can be filtered out, removing this noise and yielding a more accurate and stable conditional probability matrix;
2) Improved performance: filtering out the edges with small, possibly noisy conditional probabilities yields a more accurate and stable conditional probability matrix; this makes it easier to judge whether a close relation exists between entities and improves the performance of subsequent tasks (such as emotion classification);
3) Flexibility: the hyperparameter β can be adjusted according to the specific application scenario; if the data is denser or less noisy, the value of β can be reduced appropriately, and if the data is sparser or noisier, the value of β needs to be increased appropriately; in this way the method adapts better to data sets of different types and scales;
performing semantic-syntactic interaction on the associated semantic matrix, the syntactic dependency matrix and the similar semantic matrix, and inputting the three matrices after interaction into a graph convolution GCN neural network model together to obtain an associated semantic feature vector, a syntactic dependency feature vector and a similar semantic feature vector;
respectively inputting the entity conditional probability matrix and the background conditional probability matrix into a graph convolution GCN neural network model to obtain entity characteristic vectors and background characteristic vectors; respectively calculating the attention weights of the associated semantic feature vectors, the syntactic dependency feature vectors and the similar semantic feature vectors by using a weighted attention mechanism, and merging to obtain semantic-syntactic features of the text data; acquiring semantic-syntactic characteristics of a target word to be emotionally classified according to the semantic-syntactic characteristics of the text data, and taking the semantic-syntactic characteristics of the target word as GCN text characteristics;
Calculating the attention weights of the entity feature vector and the background feature vector respectively by utilizing a multi-head attention mechanism, and acquiring GCN entity features and GCN background features; inputting word vectors of the text data into a trained RNN neural network model to obtain RNN text characteristics; respectively inputting the entity image data and the background image data into a trained ResNet neural network model to obtain ResNet entity characteristics and ResNet background characteristics; the GCN text feature, the RNN text feature, the GCN entity feature, the GCN background feature, the ResNet entity feature and the ResNet background feature are input into a preset multi-head attention module together, a final emotion classification result is output, and multi-mode emotion recognition is completed;
in this method, the semantic information in the text is first extracted by using the similar semantic matrix and the co-occurrence semantic matrix, the syntactic information is then fused in, and the semantic and syntactic features of the text are extracted comprehensively, providing more emotion-related information for the multi-modal emotion classification task; for image processing, the method does not, as in the prior art, concentrate only on feature extraction from the whole image, but also supplements the relations between the features in the image and captures its global co-occurrence relations; it can process text and image data simultaneously, handles the relations between data of different modalities through a multi-channel graph neural network, mines and fuses the multi-modal data features more comprehensively, and improves the accuracy of emotion recognition; in addition, the multi-head attention mechanism with residual connections used in the method can apply finer attention weighting to multiple inputs and better extract key features from complex multi-modal data, thereby improving the expressive capability of the model.
Example 2
The embodiment provides a multi-mode emotion recognition method based on a multi-head attention and graph neural network, which comprises the following steps:
s1: acquiring multi-modal data related to emotion classification, wherein the multi-modal data comprises text data and image data;
s2: converting the text data into word vectors; respectively carrying out entity extraction and background extraction on the image data to obtain entity image data and background image data;
s3: constructing an associated semantic matrix and a syntactic dependency matrix by using text data, and constructing a similar semantic matrix by using word vectors of the text data; simultaneously inputting word vectors of the text data into a trained RNN neural network model to obtain RNN text characteristics;
setting a conditional probability threshold, constructing an entity conditional probability matrix by using entity image data, and constructing a background conditional probability matrix by using background image data; simultaneously, respectively inputting entity image data and background image data into a trained ResNet neural network model to obtain ResNet entity characteristics and ResNet background characteristics;
s4: performing semantic-syntactic interactive representation on the associated semantic matrix, the syntactic dependency matrix and the similar semantic matrix, and inputting the three matrices after interaction into a three-channel graph convolution GCN neural network model to obtain an associated semantic feature vector, a syntactic dependency feature vector and a similar semantic feature vector;
Respectively inputting the entity conditional probability matrix and the background conditional probability matrix into a graph convolution GCN neural network model to obtain entity characteristic vectors and background characteristic vectors;
s5: respectively calculating the attention weights of the associated semantic feature vectors, the syntactic dependency feature vectors and the similar semantic feature vectors by using a weighted attention mechanism, and merging to obtain semantic-syntactic features of the text data;
acquiring semantic-syntactic characteristics of emotion classification target words according to the semantic-syntactic characteristics of the text data, and taking the semantic-syntactic characteristics of the emotion classification target words as GCN text characteristics;
calculating the attention weights of the entity feature vector and the background feature vector respectively by utilizing a multi-head attention mechanism, and acquiring GCN entity features and GCN background features;
s6: the GCN text feature, the RNN text feature, the GCN entity feature, the GCN background feature, the ResNet entity feature and the ResNet background feature are input into a preset multi-head attention module together, a final emotion classification result is output, and multi-mode emotion recognition is completed;
the specific method of the step S2 is as follows:
acquiring multi-mode data related to emotion classification, and converting text data into GloVe word vectors;
respectively carrying out entity extraction and background extraction on the image data to obtain entity image data and background image data;
The entity extraction method comprises the following steps: performing entity extraction by using any one of the trained Mask R-CNN, Fast R-CNN, SSD and YOLO models, or a model of the same type;
the background extraction method comprises the following steps: performing background extraction by using any one of the trained VGG-Places, Places-CNN and SceneNet models, or a model of the same type;
in the step S3, the specific method for constructing the associated semantic matrix and the syntactic dependency matrix by using the text data and constructing the similar semantic matrix by using the word vector of the text data is as follows:
the specific method for constructing the associated semantic matrix by using the text data comprises the following steps:
calculating the point mutual information value of each word and other words in sentences of the text data, and carrying out normalization processing to obtain the association value N of each word and other words; setting an association value threshold and acquiring the association degree of each word with other words, wherein if the association value N is larger than the threshold, the association degree of the two words is N, otherwise the association degree of the two words is 0, and constructing an association semantic matrix, wherein the value range of the association degree is [0,1], and an association degree of 1 means that the two words always appear together in the sentences of the text data;
the specific method for constructing the syntactic dependency matrix by using the text data comprises the following steps:
Dividing all sentences in the text data into words, and marking the parts of speech of all words obtained by dividing the words; analyzing all words marked by parts of speech by using a natural language processing tool or a manual analysis method, obtaining grammar information of each word and other words in a sentence, constructing a syntactic dependency matrix according to the dependency relationship between the two words, and if the dependency relationship exists between the two words, setting the corresponding position element in the syntactic dependency matrix to be 1, otherwise, setting the corresponding position element to be 0;
the specific method for constructing the similar semantic matrix by using the word vector of the text data comprises the following steps:
calculating cosine similarity C between each word vector of the text data and other word vectors, setting a similarity threshold value, acquiring similarity between each word and other words, if the cosine similarity C is larger than the similarity threshold value, the similarity of the two words is C, otherwise, the similarity of the two words is 0, and constructing a similarity semantic matrix, wherein the value range of the similarity is [0,1], and when the similarity is 1, the two words are completely identical;
in the step S3, a conditional probability threshold is set, and an entity conditional probability matrix is constructed by using entity image data, and a specific method for constructing a background conditional probability matrix by using background image data is as follows:
The specific method for constructing the entity conditional probability matrix by using the entity image data comprises the following steps:
acquiring an entity co-occurrence matrix, wherein elements represent the frequency of simultaneous occurrence of two entities; calculating the conditional probability between each entity and other entities in the entity image data according to the entity co-occurrence matrix; setting a conditional probability threshold, setting the conditional probability as 1 if the calculated conditional probability is larger than the conditional probability threshold, otherwise setting 0, and constructing an entity conditional probability matrix;
the specific method for constructing the background condition probability matrix by using the background image data comprises the following steps:
acquiring a background co-occurrence matrix, wherein elements represent the frequency of simultaneous occurrence of two backgrounds; calculating the conditional probability between each background and other backgrounds in the background image data according to the background co-occurrence matrix; setting a conditional probability threshold, setting the conditional probability as 1 if the calculated conditional probability is larger than the conditional probability threshold, otherwise setting 0, and constructing a background conditional probability matrix;
in the step S3, the trained RNN neural network model includes: transformer, LSTM, BI-LSTM, BI-GRU, GRU and PQRNN;
the trained ResNet neural network model is specifically any one of ResNet series models, and in the embodiment, the ResNet neural network model is a ResNet-152 neural network model;
In the step S4, the semantic-syntactic characteristics of the emotion classification target word are obtained according to the semantic-syntactic characteristics of the text data, and the specific method is as follows:
setting emotion classification target words in text data, using other words as non-target words, performing 0 setting operation on non-target word vectors in semantic-syntactic characteristics of the text data, obtaining semantic-syntactic characteristics of the emotion classification target words, and using the semantic-syntactic characteristics of the target words as GCN text characteristics;
the number of layers of the graph convolution GCN neural network model in the step S4 is at least two;
the multi-head attention module preset in the step S6 comprises a plurality of multi-head attention layers and full-connection layers which are connected in sequence and provided with residual connection;
the multi-head attention layers are used for calculating the weights of all the features and carrying out residual weighting, and the full-connection layer is used for outputting a final emotion classification result;
the step S6 further includes: setting a loss function, optimizing all the graph convolution GCN neural network models by using an Adam optimizer, and obtaining the optimal graph convolution GCN neural network model when the loss function value is minimum;
the multi-mode data also comprises audio data and video data, and any multi-mode data combination of text data and audio data, text data and video data, image data and audio data, image data and video data is subjected to feature extraction fusion and emotion recognition according to the steps S2-S6.
In the specific implementation process, multi-modal data related to emotion classification, including text data and image data, is first acquired; the text data is converted into GloVe word vectors, which capture the relations and context information among the words; entity extraction is performed with a trained YOLOv3 model and background extraction with a trained VGG-Places model, obtaining entity image data and background image data;
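For illustration, the entity-extraction step can be sketched as follows. The embodiment names a trained YOLOv3 detector; in this sketch a torchvision Faster R-CNN (a detector of the same type) stands in for it, and the score threshold of 0.7 is an assumed value. Detected boxes are simply cropped out of the image as the entity image data.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Assumed stand-in for the trained YOLOv3 detector named in this embodiment.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_entities(pil_image, score_threshold=0.7):
    """Return one cropped image tensor per confidently detected entity."""
    img = to_tensor(pil_image)
    with torch.no_grad():
        pred = detector([img])[0]                    # dict with "boxes", "labels", "scores"
    crops = []
    for box, score in zip(pred["boxes"], pred["scores"]):
        if score >= score_threshold:
            x1, y1, x2, y2 = box.int().tolist()
            crops.append(img[:, y1:y2, x1:x2])       # one entity image per detection
    return crops
```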
constructing an associated semantic matrix and a syntactic dependency matrix by using text data, and constructing a similar semantic matrix by using word vectors of the text data;
the specific method for constructing the associated semantic matrix by using the text data comprises the following steps:
firstly, point mutual information based on word co-occurrence frequency is used for constructing a text word point mutual information table, and the specific method comprises the following steps:
1) Counting the frequency of each word in all texts, namely dividing the number of occurrences of the word by the total number of words;
2) Counting the frequency of simultaneous occurrence of each pair of words in all texts, namely dividing the number of times of occurrence of each pair of words in the same text by the total number of all texts;
3) Calculating the point mutual information value (PMI) of each pair of words, namely the logarithm of their joint probability divided by the product of their individual probabilities:

PMI(w1, w2) = log( P(w1, w2) / ( P(w1) · P(w2) ) )

wherein w1 and w2 represent two different words;
4) Constructing the pointwise mutual information table, which records the PMI value of every word with every other word; each row represents one word and each column represents another word, and the values on the diagonal can be ignored or filled with 0;
calculating the point mutual information value of each word and other words in sentences of the text data, and carrying out normalization processing to obtain the association value N of each word and other words; setting an association value threshold and acquiring the association degree of each word with other words, wherein if the association value N is larger than the threshold, the association degree of the two words is N, otherwise it is 0, and constructing an association semantic matrix as shown in Table 1, wherein the elements on the diagonal represent the association degree between each word and itself, usually 1, and the remaining elements represent the association degree between two different words with a value range of [0,1]; an association degree of 1 means that the two words always appear together in the sentences of the text data;
Table 1. Example of a constructed association semantic matrix (rendered as an image in the original; it lists the pairwise association degrees of the words "basketball", "football", "tennis" and "swimming")
For example, in table 1, the degree of association between "basketball" and "football" is 0.4, the degree of association between "basketball" and "tennis" is 0.3, the degree of association between "football" and "swimming" is 0.1, and the higher the degree of association, the higher the probability that two words appear simultaneously;
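A minimal sketch of this construction is given below. The corpus layout, the normalization step (scaling the positive PMI values into [0,1] by their maximum) and the association threshold of 0.1 are assumptions made for illustration; the embodiment only requires normalized PMI values thresholded as described above.

```python
import math
from collections import Counter
from itertools import combinations

import numpy as np

def association_matrix(sentences, vocab, assoc_threshold=0.1):
    """sentences: list of tokenized sentences; vocab: list of words of interest."""
    word_count = Counter()   # number of sentences containing each word
    pair_count = Counter()   # number of sentences containing each word pair
    for sent in sentences:
        uniq = set(sent)
        word_count.update(uniq)
        pair_count.update(frozenset(p) for p in combinations(sorted(uniq), 2))

    n_sent = len(sentences)
    idx = {w: i for i, w in enumerate(vocab)}
    pmi = np.zeros((len(vocab), len(vocab)))
    for pair, c in pair_count.items():
        w1, w2 = tuple(pair)
        if w1 in idx and w2 in idx:
            p12 = c / n_sent
            p1, p2 = word_count[w1] / n_sent, word_count[w2] / n_sent
            pmi[idx[w1], idx[w2]] = pmi[idx[w2], idx[w1]] = math.log(p12 / (p1 * p2))

    pos = np.clip(pmi, 0.0, None)                        # keep only positive PMI values
    assoc = pos / pos.max() if pos.max() > 0 else pos    # normalize into [0, 1]
    assoc[assoc <= assoc_threshold] = 0.0                # drop associations below the threshold
    np.fill_diagonal(assoc, 1.0)                         # each word is fully associated with itself
    return assoc
```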
The specific method for constructing the syntactic dependency matrix by using the text data comprises the following steps:
dividing all sentences in the text data into words, and marking parts of speech of all words obtained by dividing the words, wherein the parts of speech comprise verbs, nouns, adjectives and the like;
analyzing all the part-of-speech-tagged words by using a natural language processing tool or manual analysis, obtaining the grammatical information between each word and the other words in the sentence, and constructing the syntactic dependency matrix according to the dependency relationships between word pairs: if a dependency relationship exists between two words, the corresponding element in the syntactic dependency matrix is 1, otherwise it is 0; a dependency relationship refers to a grammatical relation between a word and the other words in the sentence, such as subject, predicate or object;
table 2 constructed syntax dependency matrix example
Figure SMS_3
For example, in Table 2 there is a grammatical dependency between "I" and "like", between "like" and "eat", and between "eat" and "apple"; "apple" has no dependency with any of the remaining words, and the whole sentence is "I like to eat apples";
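A minimal sketch of the dependency-matrix construction follows, using spaCy as one possible natural language processing tool; neither the tool nor the English model used here is specified in the text, so both are assumptions.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model; the original sentences are Chinese

def dependency_matrix(sentence: str):
    """Return the words of the sentence and their binary syntactic dependency matrix."""
    doc = nlp(sentence)
    n = len(doc)
    dep = np.zeros((n, n), dtype=int)
    for token in doc:
        if token.head.i != token.i:          # skip the root's self-reference
            dep[token.i, token.head.i] = 1   # a dependency exists between the pair
            dep[token.head.i, token.i] = 1   # kept symmetric, as in the matrix above
    return [t.text for t in doc], dep

words, dep = dependency_matrix("I like to eat apples")
```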
the specific method for constructing the similar semantic matrix by using the word vector of the text data comprises the following steps:
Calculating cosine similarity C between each word vector of the text data and other word vectors, setting a similarity threshold value, acquiring the similarity between each word and other words, if the cosine similarity C is larger than the similarity threshold value, the similarity of the two words is C, otherwise, the similarity of the two words is 0, and constructing a similarity semantic matrix, wherein elements on diagonal lines represent the similarity between each word and the word per se, and the similarity is usually 1; the rest elements of the matrix represent the similarity between two different words, the value range is [0,1], and when the similarity is 1, the two words are identical;
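A sketch of the similar semantic matrix follows; the similarity threshold of 0.3 is an assumed value.

```python
import numpy as np

def similarity_matrix(word_vectors: np.ndarray, sim_threshold: float = 0.3):
    """word_vectors: (num_words, dim) array of GloVe vectors for one sentence."""
    norms = np.linalg.norm(word_vectors, axis=1, keepdims=True)
    unit = word_vectors / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                      # cosine similarity C for every word pair
    sim[sim <= sim_threshold] = 0.0          # keep only similarities above the threshold
    np.fill_diagonal(sim, 1.0)               # each word is identical to itself
    return sim
```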
setting a conditional probability threshold, constructing an entity conditional probability matrix by using entity image data, and constructing a background conditional probability matrix by using background image data;
the specific method for constructing the entity conditional probability matrix by using the entity image data comprises the following steps:
obtaining an entity co-occurrence matrix, modeling interdependencies (global co-occurrence characteristics) among different entities and among different scenes in a picture, wherein the entity and the scene are identical in construction method, and elements represent the frequency of simultaneous occurrence of the two entities; the association degree between the entities can be better represented by constructing the entity co-occurrence matrix;
Calculating the conditional probability between each entity and other entities in the entity image data according to the entity co-occurrence matrix; setting a conditional probability threshold, setting the conditional probability as 1 if the calculated conditional probability is larger than the conditional probability threshold, otherwise setting 0, and constructing an entity conditional probability matrix;
the specific method for constructing the background condition probability matrix by using the background image data comprises the following steps:
acquiring a background co-occurrence matrix, wherein elements represent the frequency of simultaneous occurrence of two backgrounds; calculating the conditional probability between each background and other backgrounds in the background image data according to the background co-occurrence matrix; setting a conditional probability threshold, setting the conditional probability as 1 if the calculated conditional probability is larger than the conditional probability threshold, otherwise setting 0, and constructing a background conditional probability matrix;
the conditional probability P(p, q) is defined as the number of times p and q occur together in the images divided by the total number of occurrences of p in the images; for a given entity or background p and another entity or background q, the conditional probability is calculated from the co-occurrence matrix and represents the degree of association between the two;
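A minimal sketch of turning a co-occurrence matrix into the binary conditional probability matrix described above (the same code serves entities and backgrounds); β is the hyperparameter discussed in Embodiment 1, and its default value here is an assumption.

```python
import numpy as np

def conditional_probability_matrix(cooc: np.ndarray, occur: np.ndarray, beta: float = 0.3):
    """cooc[i, j]: number of images in which entities i and j appear together;
    occur[i]: total number of images in which entity i appears."""
    cond = cooc / np.clip(occur[:, None], 1, None)   # P(q | p) = count(p, q) / count(p)
    return (cond > beta).astype(np.float32)          # 1 above the threshold, otherwise 0
```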
performing semantic-syntactic interaction on the associated semantic matrix, the syntactic dependency matrix and the similar semantic matrix, and inputting the three matrices after interaction into a graph convolution GCN neural network model together to obtain an associated semantic feature vector, a syntactic dependency feature vector and a similar semantic feature vector;
The method comprises the steps of simultaneously inputting a similar semantic matrix, an associated semantic matrix and a syntactic dependency matrix into a three-channel graph-convolution GCN neural network model, obtaining hidden layer representations of similar semantic, associated semantic and syntactic dependency at the last layer of the model, and designing a semantic syntactic interaction module to carry out interaction representation on three kinds of information in order to fully obtain information of texts in terms of semantic and syntactic structures; the semantics are fused with the syntactic information of the text, and the syntactic tree is utilized to guide the learning of text characteristics, so that the structural information in the text can be better utilized, and the accuracy of text representation is further improved;
the three-channel graph convolution GCN network completes the aggregation of the semantic and syntactic information, yielding the semantic-syntactic representation of the text; semantic information alone only captures word meaning and does not consider the effect of syntactic structure on the text, so the information expressed by the text cannot be understood as a whole; through this information aggregation, the similar semantic feature vectors fused with syntactic information, the associated semantic feature vectors fused with syntactic information, and the syntactic dependency feature vectors fused with semantic information are finally obtained, as sketched below;
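The following PyTorch sketch illustrates this three-channel graph convolution step under stated assumptions: the exact form of the semantic-syntactic interaction module is not given in the text, so it is approximated here by mixing each channel's normalized adjacency with the other two, and each channel uses the two-layer GCN mentioned later in this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(a: torch.Tensor) -> torch.Tensor:
    a = a + torch.eye(a.size(0))                     # add self-loops
    d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ a @ d_inv_sqrt               # D^-1/2 (A + I) D^-1/2

class GCNChannel(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, out_dim)

    def forward(self, x, adj):
        h = F.relu(adj @ self.w1(x))                 # first graph convolution layer
        return adj @ self.w2(h)                      # second graph convolution layer

class ThreeChannelGCN(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, mix=0.2):
        super().__init__()
        self.channels = nn.ModuleList(GCNChannel(in_dim, hid_dim, out_dim) for _ in range(3))
        self.mix = mix                               # interaction strength (assumed)

    def forward(self, x, adjs):                      # adjs: [associated, syntactic, similar]
        adjs = [normalize_adj(a) for a in adjs]
        mixed = []
        for i, a in enumerate(adjs):                 # assumed semantic-syntactic interaction
            others = (adjs[(i + 1) % 3] + adjs[(i + 2) % 3]) / 2
            mixed.append((1 - self.mix) * a + self.mix * others)
        return [ch(x, a) for ch, a in zip(self.channels, mixed)]
```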
Respectively inputting the entity conditional probability matrix and the background conditional probability matrix into a graph convolution GCN neural network model to obtain entity characteristic vectors and background characteristic vectors;
respectively calculating the attention weights of the associated semantic feature vectors, the syntactic dependency feature vectors and the similar semantic feature vectors by using a weighted attention mechanism, and merging to obtain semantic-syntactic features of the text data;
setting target words to be emotionally classified in text data, taking other words as non-target words, performing 0 setting operation on non-target word vectors in semantic-syntactic characteristics of the text data, acquiring semantic-syntactic characteristics of the target words, and taking the semantic-syntactic characteristics of the target words as GCN text characteristics;
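A sketch of the weighted-attention merge and of the target-word masking described above; it assumes the three channel outputs share one dimension, and the single-query scoring used for the channel weights is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedChannelAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, channel_feats):                    # list of three (num_words, dim) tensors
        stacked = torch.stack(channel_feats, dim=0)      # (3, num_words, dim)
        weights = F.softmax(self.score(stacked), dim=0)  # attention weight per channel
        return (weights * stacked).sum(dim=0)            # merged semantic-syntactic feature

def gcn_text_feature(sem_syn: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """target_mask: (num_words,) tensor, 1 at emotion-classification target words, else 0."""
    return sem_syn * target_mask.unsqueeze(-1)           # zero out the non-target word vectors
```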
calculating the attention weights of the entity feature vector and the background feature vector respectively by utilizing a multi-head attention mechanism, and acquiring GCN entity features and GCN background features;
inputting word vectors of the text data into a trained RNN neural network model to obtain RNN text characteristics; respectively inputting the entity image data and the background image data into a trained ResNet-152 neural network model to obtain ResNet entity characteristics and ResNet background characteristics;
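A minimal sketch of the ResNet feature extraction: a pretrained ResNet-152 whose classification head is removed, so the pooled 2048-dimensional output serves as the ResNet entity or background feature (the ImageNet-style preprocessing is assumed).

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet152(weights="DEFAULT")
backbone.fc = nn.Identity()                      # drop the classifier, keep the pooled feature
backbone.eval()

def resnet_feature(image_batch: torch.Tensor) -> torch.Tensor:
    """image_batch: (N, 3, 224, 224), already normalized as for ImageNet."""
    with torch.no_grad():
        return backbone(image_batch)             # (N, 2048) entity or background features
```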
the GCN text feature, the RNN text feature, the GCN entity feature, the GCN background feature, the ResNet entity feature and the ResNet background feature are input into a preset multi-head attention module together, a final emotion classification result is output, and multi-mode emotion recognition is completed;
Introducing residual connection between the image and text features of each layer can avoid the gradient vanishing problem;
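A sketch of the preset multi-head attention module under stated assumptions: the six features are treated as a length-6 sequence (assumed to have been projected to a common dimension), passed through stacked multi-head self-attention layers with residual connections, and classified by a fully connected layer; the layer count, head count, class count and mean pooling are all assumptions.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, dim, num_heads=4, num_layers=3, num_classes=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats):                       # feats: (batch, 6, dim), the six features
        h = feats
        for attn, norm in zip(self.layers, self.norms):
            out, _ = attn(h, h, h)
            h = norm(h + out)                       # residual connection around each layer
        return self.classifier(h.mean(dim=1))       # pool the six positions, then classify
```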
in the embodiment, the graph convolution GCN neural network model is divided into two layers, a loss function is set, an Adam optimizer is utilized to optimize all graph convolution GCN neural network models, and when the loss function value is minimum, the optimal graph convolution GCN neural network model is obtained;
the multi-mode data also comprises audio data and video data, wherein the audio data can replace text data, the video data can replace image data, and the characteristic extraction fusion and emotion recognition are carried out according to the steps;
in this embodiment, the picture data may be replaced by, or supplemented with, a video modality: a certain number of key frames are extracted from the video, each key frame is treated as a picture and processed through the same pipeline, and the extracted features are fused through the attention mechanism; the text data may be replaced by an audio modality by switching to an audio feature-extraction network, where each word corresponds to an audio segment and an audio co-occurrence matrix and an audio similarity matrix are calculated; feature extraction, fusion and emotion recognition are then carried out according to the above steps;
The method in the embodiment can also be used for carrying out feature extraction fusion and emotion recognition on any multi-mode data combination of text data and audio data, text data and video data, image data and audio data, image data and video data;
in this method, the semantic information in the text is first extracted by using the similar semantic matrix and the co-occurrence semantic matrix, the syntactic information is then fused in, and the semantic and syntactic features of the text are extracted comprehensively, providing more emotion-related information for the multi-modal emotion classification task; for image processing, the method does not, as in the prior art, concentrate only on feature extraction from the whole image, but also supplements the relations between the features in the image and captures its global co-occurrence relations; it can process text and image data simultaneously, handles the relations between data of different modalities through a multi-channel graph neural network, mines and fuses the multi-modal data features more comprehensively, and improves the accuracy of emotion recognition; in addition, the multi-head attention mechanism with residual connections used in the method can apply finer attention weighting to multiple inputs and better extract key features from complex multi-modal data, thereby improving the expressive capability of the model.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. The multi-mode emotion recognition method based on the multi-head attention and graph neural network is characterized by comprising the following steps of:
s1: acquiring multi-modal data related to emotion classification, wherein the multi-modal data comprises text data and image data;
s2: converting the text data into word vectors; respectively carrying out entity extraction and background extraction on the image data to obtain entity image data and background image data;
s3: constructing an associated semantic matrix and a syntactic dependency matrix by using text data, and constructing a similar semantic matrix by using word vectors of the text data; simultaneously inputting word vectors of the text data into a trained RNN neural network model to obtain RNN text characteristics;
Setting a conditional probability threshold, constructing an entity conditional probability matrix by using entity image data, and constructing a background conditional probability matrix by using background image data; simultaneously, respectively inputting entity image data and background image data into a trained ResNet neural network model to obtain ResNet entity characteristics and ResNet background characteristics;
S4: performing semantic-syntactic interactive representation on the associated semantic matrix, the syntactic dependency matrix and the similar semantic matrix, and inputting the three matrices after interaction into a three-channel graph convolution GCN neural network model to obtain an associated semantic feature vector, a syntactic dependency feature vector and a similar semantic feature vector;
respectively inputting the entity conditional probability matrix and the background conditional probability matrix into a graph convolution GCN neural network model to obtain entity characteristic vectors and background characteristic vectors;
S5: respectively calculating the attention weights of the associated semantic feature vectors, the syntactic dependency feature vectors and the similar semantic feature vectors by using a weighted attention mechanism, and merging to obtain semantic-syntactic features of the text data;
acquiring semantic-syntactic characteristics of emotion classification target words according to the semantic-syntactic characteristics of the text data, and taking the semantic-syntactic characteristics of the emotion classification target words as GCN text characteristics;
Calculating the attention weights of the entity feature vector and the background feature vector respectively by utilizing a multi-head attention mechanism, and acquiring GCN entity features and GCN background features;
S6: the GCN text feature, the RNN text feature, the GCN entity feature, the GCN background feature, the ResNet entity feature and the ResNet background feature are input into a preset multi-head attention module together, a final emotion classification result is output, and multi-modal emotion recognition is completed.
2. The multi-modal emotion recognition method based on the multi-head attention and graph neural network according to claim 1, wherein the specific method of step S2 is as follows:
acquiring multi-mode data related to emotion classification, and converting text data into GloVe word vectors;
respectively carrying out entity extraction and background extraction on the image data to obtain entity image data and background image data;
the entity extraction method comprises the following steps: performing entity extraction by using any one of, or a model of the same type as, the trained Mask R-CNN, Fast R-CNN, SSD and YOLO models;
the background extraction method comprises the following steps: performing background extraction by using any one of, or a model of the same type as, the trained VGG-Place, Places-CNN and SceneNet models.
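As one concrete (assumed) way to realize the entity-extraction step of claim 2, the sketch below uses torchvision's pretrained Faster R-CNN detector; the 0.8 score threshold and the helper name extract_entities are illustrative choices, not requirements of the patent.

```python
# Hedged sketch: extract entity regions from an image with a pretrained Faster R-CNN.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_entities(image_tensor, score_threshold=0.8):
    """image_tensor: float tensor of shape (3, H, W) with values in [0, 1]."""
    with torch.no_grad():
        pred = detector([image_tensor])[0]
    keep = pred["scores"] > score_threshold              # keep confident detections
    return pred["boxes"][keep], pred["labels"][keep]     # entity regions and class ids
```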
3. The multi-modal emotion recognition method based on multi-head attention and graph neural network according to claim 1, wherein in the step S3, the specific method for constructing the associated semantic matrix and the syntactic dependency matrix by using text data and constructing the similar semantic matrix by using word vectors of the text data is as follows:
the specific method for constructing the associated semantic matrix by using the text data comprises the following steps:
calculating the pointwise mutual information (PMI) value between each word and the other words in the sentences of the text data, and normalizing it to obtain the association value N between each word and the other words; setting an association value threshold and obtaining the degree of association between each word and the other words: if the association value N is larger than the threshold, the degree of association between the two words is N, otherwise it is 0; constructing the associated semantic matrix from these values, wherein the degree of association ranges over [0,1], and a degree of association of 1 means the two words co-occur in every sentence of the text data;
the specific method for constructing the syntactic dependency matrix by using the text data comprises the following steps:
segmenting all sentences in the text data into words, and tagging each word obtained by segmentation with its part of speech; analyzing the part-of-speech-tagged words with a natural language processing tool or by manual analysis to obtain the grammatical relation between each word and the other words in a sentence, and constructing the syntactic dependency matrix from the dependency relations between word pairs: if a dependency relation exists between two words, the corresponding element of the syntactic dependency matrix is set to 1, otherwise it is set to 0;
The specific method for constructing the similar semantic matrix by using the word vector of the text data comprises the following steps:
calculating the cosine similarity C between each word vector of the text data and the other word vectors; setting a similarity threshold and obtaining the similarity between each word and the other words: if the cosine similarity C is larger than the threshold, the similarity between the two words is C, otherwise it is 0; constructing the similar semantic matrix from these values, wherein the similarity ranges over [0,1], and a similarity of 1 means the two words are identical.
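A minimal sketch of the three text-side matrices in claim 3 is given below, under the assumptions that sentence-level counts are available, that spaCy stands in for "a natural language processing tool", and that the PMI normalization divides by the maximum positive PMI value; the function names are illustrative.

```python
# Building the associated semantic (PMI), syntactic dependency, and similar
# semantic (cosine) matrices for the words of a text.
import numpy as np
import spacy                                     # one possible dependency parser
from sklearn.metrics.pairwise import cosine_similarity

def association_matrix(cooccur, word_count, total_sentences, threshold=0.1):
    """cooccur[i, j]: number of sentences containing both word i and word j."""
    p_word = word_count / total_sentences
    p_pair = cooccur / total_sentences
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_pair / np.outer(p_word, p_word))
    pmi = np.maximum(np.nan_to_num(pmi, neginf=0.0, posinf=0.0), 0.0)
    pmi = pmi / (pmi.max() + 1e-9)               # normalize association values to [0, 1]
    return np.where(pmi > threshold, pmi, 0.0)   # keep values above the threshold

def dependency_matrix(sentence, nlp=None):
    nlp = nlp or spacy.load("en_core_web_sm")    # any dependency parser would do
    doc = nlp(sentence)
    dep = np.zeros((len(doc), len(doc)))
    for tok in doc:
        if tok.head.i != tok.i:                  # a dependency arc exists
            dep[tok.i, tok.head.i] = dep[tok.head.i, tok.i] = 1.0
    return dep

def similarity_matrix(word_vectors, threshold=0.5):
    sim = cosine_similarity(word_vectors)        # pairwise cosine similarity
    return np.where(sim > threshold, sim, 0.0)
```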
4. The multi-modal emotion recognition method based on multi-head attention and graph neural network according to claim 1, wherein in the step S3, a conditional probability threshold is set, an entity conditional probability matrix is constructed by using entity image data, and a specific method for constructing a background conditional probability matrix by using background image data is as follows:
the specific method for constructing the entity conditional probability matrix by using the entity image data comprises the following steps:
acquiring an entity co-occurrence matrix, wherein each element represents the frequency with which two entities appear together; calculating the conditional probability between each entity and the other entities in the entity image data according to the entity co-occurrence matrix; setting a conditional probability threshold, setting the corresponding element to 1 if the calculated conditional probability is larger than the conditional probability threshold and to 0 otherwise, and constructing the entity conditional probability matrix;
The specific method for constructing the background condition probability matrix by using the background image data comprises the following steps:
acquiring a background co-occurrence matrix, wherein each element represents the frequency with which two backgrounds appear together; calculating the conditional probability between each background and the other backgrounds in the background image data according to the background co-occurrence matrix; setting a conditional probability threshold, setting the corresponding element to 1 if the calculated conditional probability is larger than the conditional probability threshold and to 0 otherwise, and constructing the background conditional probability matrix.
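A hedged sketch of claim 4 follows: a co-occurrence matrix over detected entities (or backgrounds) is converted into a binary conditional-probability adjacency matrix; the threshold value of 0.4 is an assumption, not a value given by the patent.

```python
# Turn an entity/background co-occurrence matrix into a thresholded
# conditional-probability matrix: P(j | i) = cooccur[i, j] / counts[i].
import numpy as np

def conditional_probability_matrix(cooccur, counts, threshold=0.4):
    """
    cooccur[i, j]: how often labels i and j appear in the same image.
    counts[i]:     how often label i appears overall.
    """
    cond = cooccur / np.maximum(counts[:, None], 1)   # row-wise conditional probabilities
    return (cond > threshold).astype(np.float32)      # 1 above the threshold, else 0
```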
5. The multi-modal emotion recognition method based on multi-head attention and graph neural network according to claim 3 or 4, wherein in step S3, the trained RNN neural network model includes: transformer, LSTM, BI-LSTM, BI-GRU, GRU and PQRNN;
the trained ResNet neural network model is specifically any one of ResNet series models.
6. The multi-modal emotion recognition method based on multi-head attention and graph neural network according to claim 5, wherein in step S4, semantic-syntactic features of emotion classification target words are obtained according to semantic-syntactic features of text data, and the specific method is as follows:
Setting emotion classification target words in the text data and treating the other words as non-target words; performing a zero-setting operation on the non-target word vectors in the semantic-syntactic features of the text data to obtain the semantic-syntactic features of the emotion classification target words, and using the semantic-syntactic features of the target words as the GCN text features.
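The zero-setting operation in claim 6 amounts to masking: only the feature rows of the target words are kept. A minimal sketch, assuming a PyTorch tensor of per-word features and a 0/1 target mask, is shown below.

```python
# Keep only the semantic-syntactic feature rows that belong to the target words.
import torch

def mask_target_words(features, target_mask):
    """
    features:    (num_words, dim) semantic-syntactic features of the text.
    target_mask: (num_words,) tensor with 1 for target words, 0 for non-target words.
    """
    return features * target_mask.unsqueeze(-1)   # zero out non-target rows
```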
7. The multi-modal emotion recognition method based on multi-head attention and graph neural network of claim 6, wherein the number of graph convolution GCN neural network model layers in step S4 is at least two.
8. The multi-modal emotion recognition method based on multi-head attention and graph neural network according to claim 7, wherein the multi-head attention module preset in step S6 comprises a plurality of multi-head attention layers with residual connections and a fully connected layer, connected in sequence;
the multi-head attention layers are used for calculating the weights of all the features and performing residual weighting, and the fully connected layer is used for outputting the final emotion classification result.
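A minimal sketch of such a module is given below, assuming the six modality features are stacked into a short sequence and classified after mean pooling; the class name, head count, layer count, and pooling choice are all assumptions rather than details fixed by the claim.

```python
# Multi-head attention layers with residual connections, followed by a
# fully connected layer that outputs the emotion classification result.
import torch
import torch.nn as nn

class ResidualAttentionClassifier(nn.Module):
    def __init__(self, dim, num_heads=4, num_layers=2, num_classes=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_layers)])
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, features):
        # features: (batch, 6, dim) -- GCN text, RNN text, GCN entity,
        # GCN background, ResNet entity, ResNet background features.
        x = features
        for attn in self.layers:
            out, _ = attn(x, x, x)                 # self-attention over the six features
            x = x + out                            # residual connection
        return self.classifier(x.mean(dim=1))      # pool, then fully connected layer
```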
9. The multi-modal emotion recognition method based on multi-head attention and graph neural network of claim 8, further comprising, after step S6: setting a loss function, optimizing all the graph convolution GCN neural network models by using an Adam optimizer, and taking the model obtained when the loss function value is minimum as the optimal graph convolution GCN neural network model.
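The sketch below illustrates one way the optimization step of claim 9 could look; cross-entropy is assumed as the loss function (the claim only requires that some loss function be set), and the training-loop structure and checkpoint file name are illustrative.

```python
# Optimize the model with Adam and keep the parameters with the lowest loss.
import torch
import torch.nn as nn

def train(model, dataloader, num_epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()                       # assumed loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss = float("inf")
    for _ in range(num_epochs):
        for features, labels in dataloader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:                     # track the minimum loss
                best_loss = loss.item()
                torch.save(model.state_dict(), "best_model.pt")
    return best_loss
```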
10. The multi-modal emotion recognition method based on a multi-head attention and graph neural network according to claim 9, wherein the multi-modal data further comprises audio data and video data, and feature extraction, fusion, and emotion recognition are performed according to steps S2 to S6 on any one of the multi-modal data combinations of text data and audio data, text data and video data, image data and audio data, or image data and video data.
CN202310524378.0A 2023-05-11 2023-05-11 Multi-mode emotion recognition method based on multi-head attention and graph neural network Active CN116245102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310524378.0A CN116245102B (en) 2023-05-11 2023-05-11 Multi-mode emotion recognition method based on multi-head attention and graph neural network

Publications (2)

Publication Number Publication Date
CN116245102A CN116245102A (en) 2023-06-09
CN116245102B true CN116245102B (en) 2023-07-04

Family

ID=86629932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310524378.0A Active CN116245102B (en) 2023-05-11 2023-05-11 Multi-mode emotion recognition method based on multi-head attention and graph neural network

Country Status (1)

Country Link
CN (1) CN116245102B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801219A (en) * 2021-03-22 2021-05-14 华南师范大学 Multi-mode emotion classification method, device and equipment
CN112818861A (en) * 2021-02-02 2021-05-18 南京邮电大学 Emotion classification method and system based on multi-mode context semantic features
WO2021134277A1 (en) * 2019-12-30 2021-07-08 深圳市优必选科技股份有限公司 Emotion recognition method, intelligent device, and computer-readable storage medium
CN115099234A (en) * 2022-07-15 2022-09-23 哈尔滨工业大学 Chinese multi-mode fine-grained emotion analysis method based on graph neural network
CN116089619A (en) * 2023-04-06 2023-05-09 华南师范大学 Emotion classification method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN116245102A (en) 2023-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant