CN115272908A

CN115272908A - Multi-modal emotion recognition method and system based on improved Transformer

Info

Publication number: CN115272908A
Application number: CN202210707463.6A
Authority: CN
Inventors: 丁俊丰; 闫静杰
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2022-06-21
Filing date: 2022-06-21
Publication date: 2022-11-01

Abstract

The invention provides a multi-modal emotion recognition method based on an improved Transformer and a system for implementing the method. The method comprises the following steps: preprocessing each modality in a video, voice and text database, extracting data characteristics of each sample, and generating a two-dimensional characteristic vector for each data sample; acquiring the characteristics of global interaction between two modes through a cross-mode attention model; obtaining the characteristics of global interaction in a single mode through a self-attention model; constructing an improved Transformer model with a BiGRU2D substituted for a multi-head attention module, and extracting deep-level features; and training the constructed network model by using the processed data samples, and using the trained model for the classification of the multivariate emotion. The invention not only extracts the interactive characteristics among the modes, but also considers the interactive characteristic information in the modes, and extracts the advanced characteristics through the improved lightweight transform coder, thereby solving the emotion classification problem more quickly and efficiently.

Description

Multi-modal emotion recognition method and system based on improved Transformer

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a multi-modal emotion recognition method based on an improved Transformer and a system for implementing the method.

Background

With the progress of science and technology and the development of computer technology, artificial intelligence gradually enters the daily life of people. In recent years, the appearance of various intelligent devices improves the quality of life of human beings. However, these smart devices at present cannot achieve a real man-machine conversation, and need to rely on emotion recognition technology to achieve barrier-free communication between human beings and computers. The traditional emotion recognition technology is mainly established on single-mode data, and although the recognition mode is better realized, the problems of low accuracy, low resource sample utilization rate and the like exist. At present, advanced scientific research equipment can extract data of various modes, such as videos, voices, texts, postures, electroencephalograms and the like, and the multi-mode emotion recognition technology can be widely applied to the fields of smart homes, intelligent transportation, smart cities, front-end medical treatment and the like, so that the multi-mode emotion recognition technology is one of hot spots of artificial intelligence research.

Through search discovery, the Chinese patent with the publication number CN112784730A provides a multi-mode emotion recognition method based on a time domain convolution network, and the method mainly utilizes video and audio modal data to carry out emotion recognition. Firstly, the video is sampled at equal intervals, a gray level face image sequence is generated through face monitoring and key point positioning, and audio data are input into a Mel filter bank to obtain a Mel spectrogram. And then respectively sending the face image and the spectrogram into a convolutional neural network for feature fusion. And finally, inputting the fusion characteristic sequence into a time domain convolution network to obtain a high-grade characteristic vector, and finally predicting the multi-modal multi-element emotion recognition through regression of a full connection layer and Softmax.

More and more researchers expect to be able to construct a robust emotion recognition model by utilizing the complementarity between various modal information so as to achieve higher emotion classification accuracy. However, while inter-modal feature complementation is considered, the importance of features in a single modality is mostly ignored, and the complexity and the operation efficiency of the algorithm are also considered, and a place to be improved still exists.

In view of the above, it is necessary to design a method for recognizing multi-modal emotion of speech, expression and text based on improved Transformer and a system for implementing the method to solve the above problems.

Disclosure of Invention

The invention aims to provide a multi-modal emotion recognition method and system based on an improved Transformer aiming at the defects of the existing multi-modal emotion classification technology.

In order to achieve the aim, the invention provides a multi-modal emotion recognition method based on an improved Transformer, which comprises the following steps of:

s1, preprocessing each modality in a video, voice and text database, extracting data characteristics of each sample, and generating a two-dimensional characteristic vector by each data sample;

s2, acquiring the characteristics of global interaction between two modes through a cross-mode attention model;

s3, acquiring the characteristics of global interaction in a single mode through a self-attention model;

s4, constructing an improved Transformer model with a BiGRU2D substituted for a multi-head attention module, and extracting deep features;

and S5, training the constructed network model by using the processed data sample, and using the trained model for multi-element emotion classification.

A further improvement of the present invention is that the step S1 further comprises the steps of:

s1-1, performing framing processing on each video data sample, intercepting a k frame image sequence according to a time sequence, performing feature extraction on each intercepted frame image, and generating a two-dimensional feature vector Z for each video data sample_V；

S1-2, performing segmentation processing on each voice data sample, intercepting k voice sequences according to a time sequence, performing feature extraction on each intercepted voice, and generating a two-dimensional feature vector Z for each voice data sample_A；

S1-3, performing word level processing on each text data sample, intercepting k words according to a time sequence, performing feature extraction on each intercepted voice, and generating a two-dimensional feature vector Z for each text data sample_T。

A further improvement of the present invention is that in said step S1-1, sample data features are extracted by a Facet tool; low-level acoustic features are extracted by covanep, the voice recognition method comprises the following steps of (1) including 74 acoustic features representing voice features, such as 12 Mel cepstral coefficients, glottal source parameters, peak slope parameters, pitch tracking and voiced/unvoiced segmentation features, maximum dispersion quotient and the like; and generating a word vector with dimension 300 for each word through a pre-trained Glove model.

The further improvement of the invention is that a cross-modal attention model is constructed in the step S2, the feature data of three modal features after being arranged and combined pairwise is processed, and the global interaction feature vector between two modalities is obtained

A further improvement of the present invention is that step S2 mainly comprises the steps of:

step S2-1: sample data Z after processing_V、Z_A、Z_TBy convolution kernel size ofCarrying out one-dimensional convolution on the n multiplied by n Conv1D, wherein the sample data dimensions are unified to D dimensions; then, the sample data with unified dimensionality is continuously processed through sine and cosine position coding, and finally the three-mode data sample is obtained

Step S2-2: obtaining interactive feature vector by global interaction between two modalities through cross-modality attention network

The present invention is further improved in that in step S2-2, the voice modality is taken as the target modality, the video modality is taken as the auxiliary modality, and the voice modality characteristic data is taken as the example

Video modality feature data

Inputting the data into a multi-layer cross-modal attention network unit, and obtaining a feature vector through multi-round global feature interactive calculation

The calculation steps are as follows:

step S2-2-1: respectively carrying out layer normalization processing on the target modal and auxiliary modal data characteristics, wherein the calculation process is as follows:

wherein,

representing multiple head through i-1 layersThe attention network carries out feature vectors after feature interaction between modalities;

step S2-2-2: will be provided with

Inputting a multi-head attention network to carry out interaction of global features and carry out residual calculation, wherein the calculation process is as follows:

wherein,

a matrix of weights representing the different tensors,

representing low-level feature data using auxiliary modalities

And target modal characteristic data output after passing through i-1 layer multi-head attention network unit

Performing global feature interaction; and

step S2-2-3: after normalizing the feature data obtained by adding the residual errors, inputting the feature data into a feedforward neural network and calculating the residual errors, wherein the calculation process is as follows:

wherein i =0,1, …, D,

Representing a result obtained by carrying out layer normalization after the ith round of voice and video modal characteristics are interacted, and then inputting the result into a feedforward neural network; through the feature interaction of the D1 layer cross-modal attention network, a feature vector which takes a voice modality as a target modality and a video modality as an auxiliary modality and carries out global feature interaction through the cross-modal attention network is obtained

In a further improvement of the present invention, the method for constructing the self-attention model in step 3 and acquiring the global interaction features in the single modality includes: respectively encoding the voice, video and text modal characteristic data processed by Conv1D and sine and cosine position coding

Performing characteristic information interactive coding in the mode through a D2 layer self-attention module unit, performing residual calculation after the characteristic information interactive coding is performed through a feedforward neural network, and obtaining characteristics of voice, video and text in the mode after the characteristic information interactive coding is performed through the self-attention network

The further improvement of the invention lies in that an improved Transformer model with a BiGRU2D substituted for a multi-head attention module is constructed in the step 4, data obtained by splicing the characteristics of global interaction in a single mode and the characteristics of global interaction between modes after two modes are arranged and combined are processed, and deep features are extracted, and the method further comprises the following steps:

step S4-1: respectively splicing features of global interaction in single mode with features of global interaction between two combined modes, i.e. spliced feature data

Z_A、Z_V、Z_TRespectively representing voice modality inter-modality feature data, video modality inter-modality feature data and text modality inter-modality feature data;

step S4-2: the inter-modal-intra-modal characteristics Z spliced in the step S4-1_A、Z_V、Z_TAnd inputting an improved transform encoder with a BiGRU2D substituted for a multi-head attention module to extract deep features.

The further improvement of the present invention is that, in step S4-2, taking processing inter-modality feature data as an example, the specific steps are as follows:

step S4-2-1: firstly, Z is_AThe input layer normalizes the network of the network,

step S4-2-2: will be provided with

Inputting into a BiGRU2D network module, extracting effective information of the two-dimensional feature vector through BiGRUs in the vertical and horizontal directions,

wherein

Representing division of feature vectors in the horizontal direction

Then, inputting the sequence into the feature information in the horizontal direction extracted from the BiGRU network

Representing feature vectors divided in the vertical direction

Then, inputting the sequence into the BiGRU network to extract the characteristic information in the vertical direction

Then the two eigenvectors are spliced and residual error calculation is carried out,

step S4-2-3: will be provided with

After layer normalization, the layer is sent to a feedforward neural network and residual calculation is introduced,

characteristics obtained by the last step

The method is used for classifying the multi-modal multi-emotion.

In order to achieve the above object, the present invention further provides an improved Transformer-based multi-modal emotion recognition system, which can implement the method according to any one of the above mentioned.

The invention has the following beneficial effects:

the method is based on an attention mechanism and an improved transform mode which replaces a multi-head attention mechanism by BiGRU2D to extract voice, video and text emotion characteristics and perform multi-mode emotion classification. By constructing the attention mechanism module, the global interactive coding features between two modes can be acquired, the global interactive coding features in a single mode can be acquired, the two feature data are integrated and spliced, the feature dimension and the information can be enriched, and therefore the recognition rate of multi-mode emotion classification is improved. Meanwhile, the network module of the improved Transformer constructed by the invention extracts high-level characteristic information, and the BiGRU2D modules in the horizontal and vertical directions replace complex multi-head attention modes, so that network parameters are greatly reduced, model training time is saved, and the operation efficiency of the multi-mode emotion recognition system is improved while high accuracy is maintained.

Drawings

FIG. 1 is a flow chart of a method for recognizing multi-modal emotion based on an improved Transformer in the invention.

FIG. 2 is a diagram of an inter-modal attention mechanism network.

FIG. 3 is a diagram of a single-mode internal attention mechanism network.

FIG. 4 is a network architecture diagram of a transform module.

Fig. 5 is a network structure diagram of the BiGRU2D module.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

It should be emphasized that in describing the present invention, various formulas and constraints are identified with consistent labels, but the use of different labels to identify the same formula and/or constraint is not precluded and is provided for the purpose of more clearly illustrating the features of the present invention.

As shown in FIG. 1, the invention provides a multi-modal emotion recognition method based on an improved Transformer. The method comprises the following steps: preprocessing each mode in a video, voice and text database, and extracting the data characteristics of each sample by a traditional method; then, acquiring the characteristics of global interaction between two modes through a 3-layer cross-mode attention network, and acquiring the characteristics of global interaction in a single mode through a 3-layer self-attention model; further extracting the characteristics extracted from the attention network through the constructed improved Transformer network; and finally, inputting the data into a deep neural network for training, and applying the trained model to a multi-modal emotion classification task.

The following will describe the multi-modal emotion recognition method based on improved Transformer provided by the present invention in detail with reference to fig. 1. The method comprises the following steps:

step S1: preprocessing each mode in a video, voice and text database, extracting data characteristics of each sample by a traditional method, and generating a two-dimensional characteristic vector by each data sample. The present embodiment selects an IEMOCAP multimodal emotion sample library. The IEMOCAP multimodal emotion library is a database of one-motion, multimodal and multi-talker recorded by the SAIL laboratory at the university of southern california from facial, head and hand markers of ten actors, including video, voice, facial motion capture, etc. The performers performed selected episodes of emotions, i.e. created hypothetical scenes, aiming to elicit 5 specific types of discrete emotions (happy, angry, sad, depressed and neutral), and the corpus contained approximately 12 hours of data. Four multimodal mood samples of happiness, anger, sadness and depression were processed in the experiment, with a sample number of 973. In the experiment, the trimodal data were preprocessed by conventional methods. The specific processing steps of the IEMOCAP multi-modal database are as follows: :

step S1-1: performing frame processing on each video data sample, intercepting 20 frame image sequences according to a time sequence, extracting motion information of 35 face action units of a human face in each frame image through a Facet tool, and finally generating a two-dimensional characteristic vector Z for each video data sample_V；

Step S1-2: each voice data sample is segmented, 20 voice sequences are intercepted according to the time sequence, then low-level acoustic features are extracted through COVAREP, wherein the low-level acoustic features comprise 74 acoustic features representing voice features such as 12 Mel cepstral coefficients, glottal source parameters, peak slope parameters, pitch tracking and voiced/unvoiced segmentation features, maximum dispersion quotient and the like, and finally, each voice data sample generates a two-dimensional feature vector Z_A；

Step S1-3: performing word level processing on each text data sample, generating a word vector with the dimension of 300 for each word through a pre-trained Glove model, and finally generating a two-dimensional feature vector Z for each text data sample_T。

Step S2: as shown in fig. 2, firstly, after the three modal features are arranged and combined two by two, the combined features are input into a constructed cross-modal attention network, there are two modes for the overall interaction between the modalities, that is, the two modalities are respectively used as a target modality and an auxiliary modality for interaction, the auxiliary modality is used as a low-level feature, the feature codes output by the previous layer of cross-modal attention network are encoded in each layer of cross-modal attention network, the feature interaction is performed again, and finally, the overall interaction feature vector between the two modalities is obtained through the three layers of cross-modal attention network

The specific steps of the acquisition process are as follows:

step S2-1: sample data Z after processing_V、Z_A、Z_TBy performing one-dimensional convolution with Conv1D having a convolution kernel size of 3 × 3, the sample data dimensions are unified to 40 dimensions. Then, the sample data with unified dimensionality is continuously processed through sine and cosine position coding, and finally the three-mode data sample is obtained

Step S2-2: global interaction between two modes is carried out through a 3-layer cross-mode attention network to obtain a feature vector

Taking the voice mode as the target mode and the video mode as the auxiliary mode as an example, the voice mode characteristic data is obtained

Video modal feature data

Inputting the data into a 3-layer cross-modal attention network unit, and obtaining a feature vector through multi-round global feature interactive calculation

Specific calculation stepsThe method comprises the following steps:

step S2-2-1: respectively carrying out layer normalization processing on the target modal and auxiliary modal data characteristics, wherein the specific calculation process is as follows:

wherein,

representing the feature vectors after inter-modal feature interaction through an i-1 level multi-head attention network.

Step S2-2-2: will be provided with

Inputting a multi-head attention network to carry out interaction of global features and carry out residual calculation, wherein the specific calculation process is as follows:

wherein,

weight matrices representing different tensors, in particular d_A、d_V、d_k、d_SRespectively 74, 35, 40 and 40,

representing low-level feature data using auxiliary modalities

Performing global feature interaction;

step S2-2-3: after normalizing the feature data obtained by adding the residual errors, inputting the feature data into a feedforward neural network and calculating the residual errors, wherein the specific calculation process is as follows:

wherein i =0,1, …, D,

Representing the result of carrying out layer normalization after the ith round of voice and video modal characteristics are interacted, and then inputting the result into a feedforward neural network. Finally, through feature interaction of a 3-layer cross-modal attention network, a feature vector which takes a voice modality as a target modality and a video modality as an auxiliary modality and carries out global feature interaction through the cross-modal attention network is obtained

And step S3: as shown in fig. 3, a specific method for constructing a self-attention model and acquiring a global interaction feature in a single modality includes: respectively encoding the voice, video and text modal characteristic data processed by Conv1D and sine and cosine position coding

Through a 3-layer self-attention module unit, characteristic information in the mode is interactively coded, residual error calculation is carried out after the characteristic information passes through a feedforward neural network, and a self-attention network is obtainedInter-coded intra-modal speech, video, text features

And step S4: constructing an improved Transformer network with a BiGRU2D substituted for a multi-head attention network, and specifically comprising the following steps:

step S4-1: respectively splicing the features of global interaction in the single mode with the features of global interaction between the modes combined in pairs to obtain spliced feature data

Z_A、Z_V、Z_TRespectively representing voice modality inter-modality-intra-modality feature data, video modality inter-modality-intra-modality feature data, text modality inter-modality-intra-modality feature data;

step S4-2: the inter-modal-intra-modal characteristics Z spliced in the step S4-1_A、Z_V、Z_TAnd inputting the improved Transformer encoder to extract deep-level features. Taking processing of inter-modal-intra-modal feature data as an example, the specific steps are as follows:

step S4-2-2: will be provided with

wherein

Representing division of feature vectors in the horizontal direction

Then, inputting the sequence into the characteristic information in the horizontal direction extracted from the BiGRU network

Representing feature vectors divided in the vertical direction

Then, inputting the sequence into the extracted feature information in the vertical direction in the BiGRU network

step S4-2-3: will be provided with

characteristics obtained by the last step

The method is used for classifying the multi-modal multi-emotion.

Step S5: and inputting the processed data into a deep neural network for training, and applying the trained model to a multi-modal emotion classification task.

Based on the above inventive concept, the invention also discloses an improved Transformer-based multi-modal emotion recognition system, which comprises at least one computing device, wherein the computing device comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, and when the computer program is loaded into the processor, the improved Transformer-based multi-modal emotion recognition method can be realized.

According to the invention, an attention mechanism module is constructed, so that not only can the global interactive coding features between two modes be obtained, but also the global interactive coding features in a single mode can be obtained, the two feature data are integrated and spliced, the feature dimension and information can be enriched, and the recognition rate of multi-mode emotion classification is improved. Meanwhile, the network module of the improved Transformer constructed by the invention extracts high-level characteristic information, and the BiGRU2D module in the horizontal and vertical directions replaces a complex multi-head attention mode, so that network parameters are greatly reduced, model training time is saved, and the operation efficiency of a multi-mode emotion recognition system is improved while high accuracy is maintained.

Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims

1. A multi-modal emotion recognition method based on an improved Transformer is characterized by comprising the following steps: the method comprises the following steps:

2. The method of claim 1, wherein: the step S1 further includes the steps of:

3. The method of claim 2, wherein: in the step S1-1, sample data features are extracted through a Facet tool; extracting low-level acoustic features through covrep, wherein the low-level acoustic features comprise 74 acoustic features representing voice features, such as 12 Mel cepstral coefficients, glottal source parameters, peak slope parameters, pitch tracking and voiced/unvoiced segmentation features, maximum dispersion quotient and the like; and generating a word vector with dimension 300 for each word through a pre-trained Glove model.

4. The method of claim 2, wherein: in the step S2, a cross-modal attention model is constructed, feature data of three modal features after two-by-two arrangement and combination are processed, and a global interaction feature vector between two modalities is obtained

5. The method of claim 4, wherein: the step S2 mainly comprises the following steps:

step S2-1: sample data Z after processing_V、Z_A、Z_TCarrying out one-dimensional convolution through Conv1D with convolution kernel size of n multiplied by n, wherein the sample data dimensions are unified to D dimensions; then, the sample data with unified dimensionality is continuously processed through sine and cosine position coding, and finally the three-mode data sample is obtained

6. The method of claim 5, wherein: in step S2-2, taking the voice modality as the target modality and the video modality as the auxiliary modality as an example, the voice modality feature data is analyzed

Video modality feature data

The calculation steps are as follows:

wherein,

representing a feature vector after performing inter-modal feature interaction through an i-1 layer multi-head attention network;

step S2-2-2: will be provided with

wherein,

a matrix of weights representing the different tensors,

low-level feature data representing use of auxiliary modalities

Performing global feature interaction; and

wherein i =0,1, …, D,

7. The method of claim 6, wherein: the method for constructing the self-attention model in the step 3 and acquiring the global interaction features in the single modality comprises the following steps: respectively encoding the voice, video and text modal characteristic data processed by Conv1D and sine and cosine position coding

Performing characteristic information interactive coding in a mode through a D2 layer self-attention module unit, performing residual calculation after the characteristic information interactive coding passes through a feedforward neural network, and obtaining a self-attention networkInter-coded intra-modal speech, video, text features

8. The method of claim 7, wherein: in the step 4, an improved transform model with BiGRU2D replacing a multi-head attention module is constructed, the data obtained by splicing the features of the global interaction in the single mode and the features of the global interaction between the modes after the arrangement and combination of every two modes is processed, and the deep features are extracted, and the method further comprises the following steps:

step S4-1: respectively splicing features of global interaction in single mode and features of global interaction between two combined modes, namely spliced feature data