CN115272908A - Multi-modal emotion recognition method and system based on improved Transformer - Google Patents

Multi-modal emotion recognition method and system based on improved Transformer

Info

Publication number
CN115272908A
CN115272908A
Authority
CN
China
Prior art keywords
data
modality
modal
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210707463.6A
Other languages
Chinese (zh)
Inventor
丁俊丰
闫静杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210707463.6A priority Critical patent/CN115272908A/en
Publication of CN115272908A publication Critical patent/CN115272908A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/02 Neural networks; G06N 3/08 Learning methods
    • G06V 10/40 Extraction of image or video features
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/93 Discriminating between voiced and unvoiced parts of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal emotion recognition method based on an improved Transformer and a system for implementing the method. The method comprises the following steps: preprocessing each modality in a video, speech and text database, extracting the data features of each sample, and generating a two-dimensional feature vector for each data sample; acquiring global inter-modal interaction features for each pair of modalities through a cross-modal attention model; acquiring global intra-modal interaction features through a self-attention model; constructing an improved Transformer model in which a BiGRU2D replaces the multi-head attention module, and extracting deep-level features; and training the constructed network model with the processed data samples and using the trained model for multi-class emotion classification. The invention not only extracts the interaction features between modalities but also takes the interaction feature information within each modality into account, and extracts high-level features through the improved lightweight Transformer encoder, thereby solving the emotion classification problem more quickly and efficiently.

Description

Multi-modal emotion recognition method and system based on improved Transformer
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a multi-modal emotion recognition method based on an improved Transformer and a system for implementing the method.
Background
With the progress of science and technology and the development of computer technology, artificial intelligence has gradually entered people's daily lives. In recent years, the emergence of various intelligent devices has improved the quality of human life. However, current smart devices cannot yet hold a genuine human-machine conversation and must rely on emotion recognition technology to enable barrier-free communication between humans and computers. Traditional emotion recognition technology is mainly built on single-modality data; although such recognition is relatively easy to implement, it suffers from low accuracy and poor utilization of sample resources. At present, advanced scientific research equipment can collect data in multiple modalities, such as video, speech, text, posture and electroencephalogram, and multi-modal emotion recognition technology can be widely applied in fields such as smart homes, intelligent transportation, smart cities and front-end medical care, making it one of the hot topics in artificial intelligence research.
A prior-art search shows that Chinese patent publication CN112784730A provides a multi-modal emotion recognition method based on a temporal convolutional network, which mainly uses video and audio modality data for emotion recognition. First, the video is sampled at equal intervals, a grayscale face image sequence is generated through face detection and key-point localization, and the audio data are fed into a Mel filter bank to obtain a Mel spectrogram. The face images and the spectrograms are then each sent into a convolutional neural network for feature fusion. Finally, the fused feature sequence is input into a temporal convolutional network to obtain high-level feature vectors, and multi-modal multi-class emotion prediction is performed through a fully connected layer and Softmax.
More and more researchers hope to exploit the complementarity between different modalities to construct robust emotion recognition models and thereby achieve higher emotion classification accuracy. However, while inter-modal feature complementarity is considered, the importance of the features within a single modality is mostly ignored; the complexity and computational efficiency of the algorithm must also be taken into account, so there is still room for improvement.
In view of the above, it is necessary to design a multi-modal emotion recognition method for speech, facial expression and text based on an improved Transformer, and a system for implementing the method, to solve the above problems.
Disclosure of Invention
The invention aims to provide a multi-modal emotion recognition method and system based on an improved Transformer, addressing the shortcomings of existing multi-modal emotion classification technology.
In order to achieve this aim, the invention provides a multi-modal emotion recognition method based on an improved Transformer, which comprises the following steps:
S1, preprocessing each modality in a video, speech and text database, extracting the data features of each sample, and generating a two-dimensional feature vector for each data sample;
S2, acquiring global inter-modal interaction features for each pair of modalities through a cross-modal attention model;
S3, acquiring global intra-modal interaction features through a self-attention model;
S4, constructing an improved Transformer model in which a BiGRU2D replaces the multi-head attention module, and extracting deep-level features;
and S5, training the constructed network model with the processed data samples, and using the trained model for multi-class emotion classification.
A further improvement of the present invention is that step S1 further comprises the following steps:
S1-1, framing each video data sample, intercepting a sequence of k frame images in time order, performing feature extraction on each intercepted frame image, and generating a two-dimensional feature vector Z_V for each video data sample;
S1-2, segmenting each speech data sample, intercepting k speech segments in time order, performing feature extraction on each intercepted segment, and generating a two-dimensional feature vector Z_A for each speech data sample;
S1-3, performing word-level processing on each text data sample, intercepting k words in time order, performing feature extraction on each intercepted word, and generating a two-dimensional feature vector Z_T for each text data sample.
A further improvement of the present invention is that in step S1-1 the sample data features are extracted with the Facet tool; low-level acoustic features are extracted with COVAREP, comprising 74 acoustic features that characterize the speech, such as 12 Mel-frequency cepstral coefficients, glottal source parameters, peak slope parameters, pitch tracking and voiced/unvoiced segmentation features, and the maximum dispersion quotient; and a 300-dimensional word vector is generated for each word with a pre-trained GloVe model.
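As an illustration of this preprocessing, the sketch below assembles the three two-dimensional feature vectors for one sample; the helper names (preprocess_sample, facet, covarep, glove) and the Facet feature dimensionality are hypothetical placeholders rather than details from the patent.

```python
import numpy as np

def preprocess_sample(frames, speech_segments, words, facet, covarep, glove, k=20):
    """Build the two-dimensional feature vectors Z_V, Z_A, Z_T for one sample.

    `facet`, `covarep` and `glove` are assumed callables wrapping the Facet tool,
    COVAREP and a pre-trained GloVe lookup; they return per-unit feature vectors
    (COVAREP: 74 dims, GloVe: 300 dims; the Facet dimensionality is not stated here).
    k = 20 frames/segments/words follows the embodiment described later.
    """
    z_v = np.stack([facet(f) for f in frames[:k]])             # (k, d_facet) facial action-unit features
    z_a = np.stack([covarep(s) for s in speech_segments[:k]])  # (k, 74) low-level acoustic features
    z_t = np.stack([glove(w) for w in words[:k]])              # (k, 300) GloVe word vectors
    return z_v, z_a, z_t
```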
A further improvement of the invention is that in step S2 a cross-modal attention model is constructed, the feature data of the three modalities arranged and combined pairwise are processed, and the global inter-modal interaction feature vectors Z_{β→α} between each pair of modalities are obtained.
A further improvement of the present invention is that step S2 mainly comprises the following steps:
Step S2-1: the processed sample data Z_V, Z_A, Z_T are subjected to one-dimensional convolution through a Conv1D with a convolution kernel of size n × n, and the sample data dimensions are unified to D dimensions; the dimension-unified sample data are then processed with sine-cosine positional encoding, finally yielding the tri-modal data samples Z_V^(0), Z_A^(0), Z_T^(0);
Step S2-2: global interaction between each pair of modalities is performed through the cross-modal attention network to obtain the interaction feature vectors Z_{β→α}.
The present invention is further improved in that in step S2-2, taking the speech modality as the target modality and the video modality as the auxiliary modality as an example, the speech modality feature data Z_A^(0) and the video modality feature data Z_V^(0) are input into a multi-layer cross-modal attention network unit, and the feature vector Z_{V→A} is obtained through multiple rounds of global feature interaction calculation.
The calculation steps are as follows:
Step S2-2-1: layer normalization is applied to the target-modality and auxiliary-modality data features respectively, and the calculation process is as follows:
Ẑ_A^(i-1) = LayerNorm(Z_{V→A}^(i-1)),
Ẑ_V^(0) = LayerNorm(Z_V^(0)),
where Z_{V→A}^(i-1) denotes the feature vector after inter-modal feature interaction through i-1 layers of the multi-head attention network;
Step S2-2-2: Ẑ_A^(i-1) and Ẑ_V^(0) are input into the multi-head attention network for global feature interaction, followed by a residual calculation, and the calculation process is as follows:
Q_A = Ẑ_A^(i-1) W_Q,  K_V = Ẑ_V^(0) W_K,  V_V = Ẑ_V^(0) W_V,
CM_{V→A}^(i) = softmax(Q_A K_V^T / sqrt(d_k)) V_V,
Z'_{V→A}^(i) = CM_{V→A}^(i) + Z_{V→A}^(i-1),
where W_Q, W_K and W_V are the weight matrices of the different tensors, and CM_{V→A}^(i) denotes the global feature interaction performed between the low-level feature data Ẑ_V^(0) of the auxiliary modality and the target-modality feature data Ẑ_A^(i-1) output after i-1 layers of the multi-head attention network unit; and
Step S2-2-3: the feature data obtained from the residual addition are layer-normalized, input into a feedforward neural network, and a further residual calculation is performed, as follows:
Z_{V→A}^(i) = FFN(LayerNorm(Z'_{V→A}^(i))) + Z'_{V→A}^(i),
where i = 0, 1, …, D, and LayerNorm(Z'_{V→A}^(i)) denotes the result obtained by layer normalization after the i-th round of interaction between the speech and video modality features, which is then input into the feedforward neural network; through the feature interaction of the D1-layer cross-modal attention network, the feature vector Z_{V→A}, which takes the speech modality as the target modality and the video modality as the auxiliary modality and has undergone global feature interaction through the cross-modal attention network, is obtained.
In a further improvement of the present invention, the method of constructing the self-attention model in step S3 and acquiring the global interaction features within a single modality is as follows: the speech, video and text modality feature data Z_A^(0), Z_V^(0), Z_T^(0) processed by Conv1D and sine-cosine positional encoding are each passed through D2 layers of self-attention module units for interactive encoding of the intra-modal feature information, followed by a feedforward neural network with residual calculation, yielding the intra-modal speech, video and text features after interactive encoding by the self-attention network.
A further improvement of the invention is that in step S4 an improved Transformer model in which a BiGRU2D replaces the multi-head attention module is constructed, the data obtained by splicing the intra-modal global interaction features with the pairwise inter-modal global interaction features are processed, and deep-level features are extracted, further comprising the following steps:
Step S4-1: the intra-modal global interaction feature of each modality is spliced with the inter-modal global interaction features of the corresponding pairwise combinations, giving the spliced feature data Z_A, Z_V, Z_T, which respectively represent the combined inter-modal and intra-modal feature data of the speech, video and text modalities;
Step S4-2: the spliced inter-modal and intra-modal features Z_A, Z_V, Z_T of step S4-1 are input into the improved Transformer encoder in which a BiGRU2D replaces the multi-head attention module, and deep-level features are extracted.
A further improvement of the present invention is that in step S4-2, taking the processing of the inter-modal and intra-modal feature data Z_A as an example, the specific steps are as follows:
Step S4-2-1: Z_A is first input into the layer normalization network,
Ẑ_A = LayerNorm(Z_A);
Step S4-2-2: Ẑ_A is input into the BiGRU2D network module, and the effective information of the two-dimensional feature vector is extracted by BiGRUs in the vertical and horizontal directions,
H_A^h = BiGRU_h(Ẑ_A^h),
H_A^v = BiGRU_v(Ẑ_A^v),
where Ẑ_A^h denotes the sequence obtained by dividing the feature vector along the horizontal direction, which is input into the BiGRU network to extract the horizontal-direction feature information H_A^h, and Ẑ_A^v denotes the sequence obtained by dividing the feature vector along the vertical direction, which is input into the BiGRU network to extract the vertical-direction feature information H_A^v; the two feature vectors are then spliced and a residual calculation is performed,
Z'_A = [H_A^h ; H_A^v] + Z_A;
Step S4-2-3: Z'_A is layer-normalized, fed into the feedforward neural network, and a residual calculation is introduced,
Z''_A = FFN(LayerNorm(Z'_A)) + Z'_A.
The features Z''_A, Z''_V, Z''_T obtained in this last step are used for multi-modal multi-class emotion classification.
In order to achieve the above object, the present invention further provides an improved-Transformer-based multi-modal emotion recognition system capable of implementing any of the methods described above.
The invention has the following beneficial effects:
the method is based on an attention mechanism and an improved transform mode which replaces a multi-head attention mechanism by BiGRU2D to extract voice, video and text emotion characteristics and perform multi-mode emotion classification. By constructing the attention mechanism module, the global interactive coding features between two modes can be acquired, the global interactive coding features in a single mode can be acquired, the two feature data are integrated and spliced, the feature dimension and the information can be enriched, and therefore the recognition rate of multi-mode emotion classification is improved. Meanwhile, the network module of the improved Transformer constructed by the invention extracts high-level characteristic information, and the BiGRU2D modules in the horizontal and vertical directions replace complex multi-head attention modes, so that network parameters are greatly reduced, model training time is saved, and the operation efficiency of the multi-mode emotion recognition system is improved while high accuracy is maintained.
Drawings
FIG. 1 is a flow chart of a method for recognizing multi-modal emotion based on an improved Transformer in the invention.
FIG. 2 is a diagram of an inter-modal attention mechanism network.
FIG. 3 is a diagram of a single-mode internal attention mechanism network.
FIG. 4 is a network architecture diagram of the improved Transformer module.
Fig. 5 is a network structure diagram of the BiGRU2D module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
It should be emphasized that in describing the present invention, various formulas and constraints are identified with consistent labels, but the use of different labels to identify the same formula and/or constraint is not precluded and is provided for the purpose of more clearly illustrating the features of the present invention.
As shown in FIG. 1, the invention provides a multi-modal emotion recognition method based on an improved Transformer. The method comprises the following steps: each modality in a video, speech and text database is preprocessed, and the data features of each sample are extracted by conventional methods; then, the global inter-modal interaction features between each pair of modalities are acquired through a 3-layer cross-modal attention network, and the global intra-modal interaction features are acquired through a 3-layer self-attention model; the features extracted by the attention networks are further processed by the constructed improved Transformer network; finally, the data are input into a deep neural network for training, and the trained model is applied to the multi-modal emotion classification task.
The following describes the multi-modal emotion recognition method based on an improved Transformer provided by the present invention in detail with reference to FIG. 1. The method comprises the following steps:
Step S1: each modality in a video, speech and text database is preprocessed, the data features of each sample are extracted by conventional methods, and a two-dimensional feature vector is generated for each data sample. This embodiment uses the IEMOCAP multimodal emotion sample library. The IEMOCAP multimodal emotion library is an acted, multimodal and multi-speaker database recorded by the SAIL laboratory at the University of Southern California from facial, head and hand markers of ten actors, and includes video, speech, facial motion capture and other data. The performers acted out selected emotional episodes, i.e. improvised hypothetical scenes, designed to elicit 5 specific types of discrete emotion (happy, angry, sad, depressed and neutral), and the corpus contains approximately 12 hours of data. Four classes of multimodal emotion samples, happiness, anger, sadness and depression, were processed in the experiment, with 973 samples in total. In the experiment, the data of the three modalities were preprocessed by conventional methods. The specific processing steps for the IEMOCAP multi-modal database are as follows:
Step S1-1: each video data sample is divided into frames, a sequence of 20 frame images is intercepted in time order, the motion information of 35 facial action units is extracted from the face in each frame image with the Facet tool, and finally a two-dimensional feature vector Z_V is generated for each video data sample.
Step S1-2: each speech data sample is segmented, 20 speech segments are intercepted in time order, and low-level acoustic features are then extracted with COVAREP, comprising 74 acoustic features that characterize the speech, such as 12 Mel-frequency cepstral coefficients, glottal source parameters, peak slope parameters, pitch tracking and voiced/unvoiced segmentation features, and the maximum dispersion quotient; finally a two-dimensional feature vector Z_A is generated for each speech data sample.
Step S1-3: word-level processing is performed on each text data sample, a 300-dimensional word vector is generated for each word with a pre-trained GloVe model, and finally a two-dimensional feature vector Z_T is generated for each text data sample.
Step S2: as shown in FIG. 2, the three modal features are first arranged and combined pairwise, and each combination is input into the constructed cross-modal attention network. For each pair there are two modes of global inter-modal interaction, i.e. each of the two modalities in turn serves as the target modality while the other serves as the auxiliary modality. The auxiliary modality provides the low-level features, and in every layer of the cross-modal attention network the feature codes output by the previous layer undergo feature interaction again; finally, after three layers of the cross-modal attention network, the global inter-modal interaction feature vectors Z_{β→α} between each pair of modalities are obtained.
The specific steps of the acquisition process are as follows:
Step S2-1: the processed sample data Z_V, Z_A, Z_T are subjected to one-dimensional convolution through a Conv1D with a convolution kernel of size 3 × 3, and the sample data dimensions are unified to 40 dimensions. The dimension-unified sample data are then processed with sine-cosine positional encoding, finally yielding the tri-modal data samples Z_V^(0), Z_A^(0), Z_T^(0).
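As a reading aid, the following PyTorch sketch shows one plausible implementation of this dimension-unification step (a Conv1d projection to 40 channels followed by sine-cosine positional encoding); the module name and exact layer settings are assumptions for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class TemporalProjection(nn.Module):
    """Project a (batch, seq_len, feat_dim) modality sequence to a common dimension
    and add sinusoidal positional encoding, as described in step S2-1."""
    def __init__(self, in_dim: int, d_model: int = 40, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, d_model, kernel_size, padding=kernel_size // 2)
        self.d_model = d_model

    def positional_encoding(self, seq_len: int, device) -> torch.Tensor:
        pos = torch.arange(seq_len, device=device).unsqueeze(1).float()
        i = torch.arange(0, self.d_model, 2, device=device).float()
        angle = pos / torch.pow(10000.0, i / self.d_model)
        pe = torch.zeros(seq_len, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(angle)   # sine on even feature indices
        pe[:, 1::2] = torch.cos(angle)   # cosine on odd feature indices
        return pe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feat_dim); Conv1d expects (batch, channels, seq_len)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)        # (batch, seq_len, 40)
        return z + self.positional_encoding(z.size(1), z.device)

# Example: unify the speech features (20 steps x 74 COVAREP features) to 40 dimensions
z_a0 = TemporalProjection(74)(torch.randn(8, 20, 74))            # (8, 20, 40)
```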
Step S2-2: global interaction between two modes is carried out through a 3-layer cross-mode attention network to obtain a feature vector
Figure BDA0003705933180000093
Taking the voice mode as the target mode and the video mode as the auxiliary mode as an example, the voice mode characteristic data is obtained
Figure BDA0003705933180000094
Video modal feature data
Figure BDA0003705933180000095
Inputting the data into a 3-layer cross-modal attention network unit, and obtaining a feature vector through multi-round global feature interactive calculation
Figure BDA0003705933180000096
The specific calculation steps are as follows:
Step S2-2-1: layer normalization is applied to the target-modality and auxiliary-modality data features respectively, and the specific calculation process is as follows:
Ẑ_A^(i-1) = LayerNorm(Z_{V→A}^(i-1)),
Ẑ_V^(0) = LayerNorm(Z_V^(0)),
where Z_{V→A}^(i-1) denotes the feature vector after inter-modal feature interaction through i-1 layers of the multi-head attention network.
Step S2-2-2: Ẑ_A^(i-1) and Ẑ_V^(0) are input into the multi-head attention network for global feature interaction, followed by a residual calculation, and the specific calculation process is as follows:
Q_A = Ẑ_A^(i-1) W_Q,  K_V = Ẑ_V^(0) W_K,  V_V = Ẑ_V^(0) W_V,
CM_{V→A}^(i) = softmax(Q_A K_V^T / sqrt(d_k)) V_V,
Z'_{V→A}^(i) = CM_{V→A}^(i) + Z_{V→A}^(i-1),
where W_Q, W_K and W_V are the weight matrices of the different tensors; specifically, d_A, d_V, d_k and d_S are 74, 35, 40 and 40 respectively; CM_{V→A}^(i) denotes the global feature interaction performed between the low-level feature data Ẑ_V^(0) of the auxiliary modality and the target-modality feature data Ẑ_A^(i-1) output after i-1 layers of the multi-head attention network unit;
Step S2-2-3: the feature data obtained from the residual addition are layer-normalized, input into a feedforward neural network, and a further residual calculation is performed; the specific calculation process is as follows:
Z_{V→A}^(i) = FFN(LayerNorm(Z'_{V→A}^(i))) + Z'_{V→A}^(i),
where i = 0, 1, …, D, and LayerNorm(Z'_{V→A}^(i)) denotes the result obtained by layer normalization after the i-th round of interaction between the speech and video modality features, which is then input into the feedforward neural network. Finally, through the feature interaction of the 3-layer cross-modal attention network, the feature vector Z_{V→A}, which takes the speech modality as the target modality and the video modality as the auxiliary modality and has undergone global feature interaction through the cross-modal attention network, is obtained.
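To make the block structure above concrete, the following PyTorch sketch shows one plausible cross-modal attention layer of the kind described in steps S2-2-1 to S2-2-3 (queries from the target modality, keys and values from the auxiliary modality, pre-layer-norm, residual connections and a feedforward sub-layer). The class name, head count and hidden sizes are illustrative assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class CrossModalAttentionLayer(nn.Module):
    """One cross-modal attention layer: the target modality attends to the auxiliary
    modality (queries from the target, keys and values from the auxiliary)."""
    def __init__(self, d_model: int = 40, num_heads: int = 4, d_ff: int = 160):
        super().__init__()
        self.norm_target = nn.LayerNorm(d_model)
        self.norm_aux = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, target: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        q, kv = self.norm_target(target), self.norm_aux(aux)    # step S2-2-1: layer normalization
        attended, _ = self.attn(q, kv, kv)                      # step S2-2-2: multi-head cross-attention
        x = attended + target                                   # residual connection
        return self.ffn(self.norm_ffn(x)) + x                   # step S2-2-3: feedforward with residual

def cross_modal_encode(target: torch.Tensor, aux: torch.Tensor, layers: nn.ModuleList) -> torch.Tensor:
    """Stack several layers; the low-level auxiliary features are reused at every layer."""
    for layer in layers:
        target = layer(target, aux)
    return target

# Example: speech as target (Z_A^(0)) and video as auxiliary (Z_V^(0)), 3 layers of width 40
layers = nn.ModuleList(CrossModalAttentionLayer() for _ in range(3))
z_v_to_a = cross_modal_encode(torch.randn(8, 20, 40), torch.randn(8, 20, 40), layers)   # (8, 20, 40)
```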
Step S3: as shown in FIG. 3, the specific method of constructing the self-attention model and acquiring the global interaction features within a single modality is as follows: the speech, video and text modality feature data Z_A^(0), Z_V^(0), Z_T^(0) processed by Conv1D and sine-cosine positional encoding are each passed through 3 layers of self-attention module units for interactive encoding of the intra-modal feature information, followed by a feedforward neural network with residual calculation, yielding the intra-modal speech, video and text features after interactive encoding by the self-attention network.
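A minimal sketch of the intra-modal case, reusing the hypothetical CrossModalAttentionLayer and cross_modal_encode helpers from the previous example: self-attention is obtained by letting a modality attend to itself.

```python
# Intra-modal self-attention: target and auxiliary are the same tensor.
self_layers = nn.ModuleList(CrossModalAttentionLayer() for _ in range(3))
z_a0 = torch.randn(8, 20, 40)                              # speech features after Conv1D + positional encoding
z_a_intra = cross_modal_encode(z_a0, z_a0, self_layers)    # (8, 20, 40) intra-modal speech features
```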
Step S4: an improved Transformer network in which a BiGRU2D replaces the multi-head attention network is constructed, specifically comprising the following steps:
Step S4-1: the intra-modal global interaction feature of each modality is spliced with the inter-modal global interaction features of the corresponding pairwise combinations, giving the spliced feature data Z_A, Z_V, Z_T, which respectively represent the combined inter-modal and intra-modal feature data of the speech, video and text modalities;
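A short sketch of the splicing in step S4-1, reusing the hypothetical tensors from the previous examples; the concatenation axis and the stand-in for the text-to-speech features are assumptions, since the patent does not state them.

```python
# Splice the speech-targeted features along the feature dimension:
# intra-modal speech features plus the video->speech and text->speech cross-modal features.
z_t_to_a = torch.randn(8, 20, 40)      # stand-in for the text->speech cross-modal features
z_a_spliced = torch.cat([z_a_intra, z_v_to_a, z_t_to_a], dim=-1)   # (8, 20, 120) = Z_A
```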
step S4-2: the inter-modal-intra-modal characteristics Z spliced in the step S4-1A、ZV、ZTAnd inputting the improved Transformer encoder to extract deep-level features. Taking processing of inter-modal-intra-modal feature data as an example, the specific steps are as follows:
step S4-2-1: firstly, Z isAThe input layer normalizes the network of the network,
Figure BDA0003705933180000114
step S4-2-2: will be provided with
Figure BDA0003705933180000115
Inputting into a BiGRU2D network module, extracting effective information of the two-dimensional feature vector through BiGRUs in the vertical and horizontal directions,
Figure BDA0003705933180000116
Figure BDA0003705933180000117
wherein
Figure BDA0003705933180000118
Representing division of feature vectors in the horizontal direction
Figure BDA0003705933180000119
Then, inputting the sequence into the characteristic information in the horizontal direction extracted from the BiGRU network
Figure BDA00037059331800001110
Figure BDA00037059331800001111
Representing feature vectors divided in the vertical direction
Figure BDA00037059331800001112
Then, inputting the sequence into the extracted feature information in the vertical direction in the BiGRU network
Figure BDA00037059331800001113
Then the two eigenvectors are spliced and residual error calculation is carried out,
Figure BDA00037059331800001114
step S4-2-3: will be provided with
Figure BDA00037059331800001115
After layer normalization, the layer is sent to a feedforward neural network and residual calculation is introduced,
Figure BDA00037059331800001116
characteristics obtained by the last step
Figure BDA00037059331800001117
The method is used for classifying the multi-modal multi-emotion.
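The following PyTorch sketch shows one way the improved encoder block of step S4-2 could be realized, with a two-direction BiGRU module standing in for multi-head attention; how the horizontal and vertical outputs are spliced back to the input width, and the small classification head at the end, are assumptions rather than details given in the patent.

```python
import torch
import torch.nn as nn

class BiGRU2D(nn.Module):
    """Bidirectional GRUs over the horizontal (row) and vertical (column)
    directions of a (batch, rows, cols) two-dimensional feature vector."""
    def __init__(self, rows: int, cols: int):
        super().__init__()
        # Hidden sizes chosen so each bidirectional output matches the input width.
        self.gru_h = nn.GRU(cols, cols // 2, batch_first=True, bidirectional=True)
        self.gru_v = nn.GRU(rows, rows // 2, batch_first=True, bidirectional=True)
        # Hypothetical projection: splice the two directions, then map back to `cols`
        # so the residual addition with the input is well-defined.
        self.proj = nn.Linear(2 * cols, cols)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h_h, _ = self.gru_h(x)                                # (batch, rows, cols), rows as sequence
        h_v, _ = self.gru_v(x.transpose(1, 2))                # (batch, cols, rows), cols as sequence
        h_v = h_v.transpose(1, 2)                             # back to (batch, rows, cols)
        return self.proj(torch.cat([h_h, h_v], dim=-1))       # spliced directional features

class ImprovedTransformerEncoderBlock(nn.Module):
    """Improved Transformer encoder block: BiGRU2D in place of multi-head attention,
    followed by a layer-normalized feedforward sub-layer, both with residuals."""
    def __init__(self, rows: int, cols: int, d_ff: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(cols)
        self.bigru2d = BiGRU2D(rows, cols)
        self.norm2 = nn.LayerNorm(cols)
        self.ffn = nn.Sequential(nn.Linear(cols, d_ff), nn.ReLU(), nn.Linear(d_ff, cols))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = self.bigru2d(self.norm1(z)) + z    # steps S4-2-1 and S4-2-2
        return self.ffn(self.norm2(z)) + z     # step S4-2-3

# Example on the spliced speech features (batch of 8, 20 time steps, 120 spliced dimensions)
z_a_deep = ImprovedTransformerEncoderBlock(rows=20, cols=120)(torch.randn(8, 20, 120))

# Hypothetical 4-class emotion head on the time-pooled deep features
# (happy, angry, sad, depressed in the IEMOCAP experiment); trained with cross-entropy in step S5.
head = nn.Sequential(nn.Linear(120, 64), nn.ReLU(), nn.Linear(64, 4))
logits = head(z_a_deep.mean(dim=1))                               # (8, 4) class scores
```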
Step S5: and inputting the processed data into a deep neural network for training, and applying the trained model to a multi-modal emotion classification task.
Based on the above inventive concept, the invention also discloses an improved-Transformer-based multi-modal emotion recognition system, which comprises at least one computing device; the computing device comprises a memory, a processor and a computer program stored in the memory and executable on the processor, and when the computer program is loaded into the processor, the above multi-modal emotion recognition method based on an improved Transformer is implemented.
By constructing the attention mechanism modules, the invention acquires both the global inter-modal interaction coding features between each pair of modalities and the global intra-modal interaction coding features within each single modality; integrating and splicing these two kinds of feature data enriches the feature dimensions and information, thereby improving the recognition rate of multi-modal emotion classification. Meanwhile, the improved Transformer network module constructed by the invention extracts high-level feature information, and the BiGRU2D modules operating in the horizontal and vertical directions replace the complex multi-head attention mechanism, which greatly reduces the network parameters and saves model training time, improving the operating efficiency of the multi-modal emotion recognition system while maintaining high accuracy.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (10)

1. A multi-modal emotion recognition method based on an improved Transformer, characterized by comprising the following steps:
S1, preprocessing each modality in a video, speech and text database, extracting the data features of each sample, and generating a two-dimensional feature vector for each data sample;
S2, acquiring global inter-modal interaction features for each pair of modalities through a cross-modal attention model;
S3, acquiring global intra-modal interaction features through a self-attention model;
S4, constructing an improved Transformer model in which a BiGRU2D replaces the multi-head attention module, and extracting deep-level features;
and S5, training the constructed network model with the processed data samples, and using the trained model for multi-class emotion classification.
2. The method of claim 1, wherein step S1 further comprises the following steps:
S1-1, framing each video data sample, intercepting a sequence of k frame images in time order, performing feature extraction on each intercepted frame image, and generating a two-dimensional feature vector Z_V for each video data sample;
S1-2, segmenting each speech data sample, intercepting k speech segments in time order, performing feature extraction on each intercepted segment, and generating a two-dimensional feature vector Z_A for each speech data sample;
S1-3, performing word-level processing on each text data sample, intercepting k words in time order, performing feature extraction on each intercepted word, and generating a two-dimensional feature vector Z_T for each text data sample.
3. The method of claim 2, wherein: in step S1-1, the sample data features are extracted with the Facet tool; low-level acoustic features are extracted with COVAREP, comprising 74 acoustic features that characterize the speech, such as 12 Mel-frequency cepstral coefficients, glottal source parameters, peak slope parameters, pitch tracking and voiced/unvoiced segmentation features, and the maximum dispersion quotient; and a 300-dimensional word vector is generated for each word with a pre-trained GloVe model.
4. The method of claim 2, wherein: in step S2, a cross-modal attention model is constructed, the feature data of the three modalities arranged and combined pairwise are processed, and the global inter-modal interaction feature vectors Z_{β→α} between each pair of modalities are obtained.
5. The method of claim 4, wherein step S2 mainly comprises the following steps:
Step S2-1: the processed sample data Z_V, Z_A, Z_T are subjected to one-dimensional convolution through a Conv1D with a convolution kernel of size n × n, and the sample data dimensions are unified to D dimensions; the dimension-unified sample data are then processed with sine-cosine positional encoding, finally yielding the tri-modal data samples Z_V^(0), Z_A^(0), Z_T^(0);
Step S2-2: global interaction between each pair of modalities is performed through the cross-modal attention network to obtain the interaction feature vectors Z_{β→α}.
6. The method of claim 5, wherein: in step S2-2, taking the speech modality as the target modality and the video modality as the auxiliary modality as an example, the speech modality feature data Z_A^(0) and the video modality feature data Z_V^(0) are input into a multi-layer cross-modal attention network unit, and the feature vector Z_{V→A} is obtained through multiple rounds of global feature interaction calculation; the calculation steps are as follows:
Step S2-2-1: layer normalization is applied to the target-modality and auxiliary-modality data features respectively, and the calculation process is as follows:
Ẑ_A^(i-1) = LayerNorm(Z_{V→A}^(i-1)),
Ẑ_V^(0) = LayerNorm(Z_V^(0)),
where Z_{V→A}^(i-1) denotes the feature vector after inter-modal feature interaction through i-1 layers of the multi-head attention network;
Step S2-2-2: Ẑ_A^(i-1) and Ẑ_V^(0) are input into the multi-head attention network for global feature interaction, followed by a residual calculation, and the calculation process is as follows:
Q_A = Ẑ_A^(i-1) W_Q,  K_V = Ẑ_V^(0) W_K,  V_V = Ẑ_V^(0) W_V,
CM_{V→A}^(i) = softmax(Q_A K_V^T / sqrt(d_k)) V_V,
Z'_{V→A}^(i) = CM_{V→A}^(i) + Z_{V→A}^(i-1),
where W_Q, W_K and W_V are the weight matrices of the different tensors, and CM_{V→A}^(i) denotes the global feature interaction performed between the low-level feature data Ẑ_V^(0) of the auxiliary modality and the target-modality feature data Ẑ_A^(i-1) output after i-1 layers of the multi-head attention network unit; and
Step S2-2-3: the feature data obtained from the residual addition are layer-normalized, input into a feedforward neural network, and a further residual calculation is performed, as follows:
Z_{V→A}^(i) = FFN(LayerNorm(Z'_{V→A}^(i))) + Z'_{V→A}^(i),
where i = 0, 1, …, D, and LayerNorm(Z'_{V→A}^(i)) denotes the result obtained by layer normalization after the i-th round of interaction between the speech and video modality features, which is then input into the feedforward neural network; through the feature interaction of the D1-layer cross-modal attention network, the feature vector Z_{V→A}, which takes the speech modality as the target modality and the video modality as the auxiliary modality and has undergone global feature interaction through the cross-modal attention network, is obtained.
7. The method of claim 6, wherein: the method of constructing the self-attention model in step S3 and acquiring the global interaction features within a single modality is as follows: the speech, video and text modality feature data Z_A^(0), Z_V^(0), Z_T^(0) processed by Conv1D and sine-cosine positional encoding are each passed through D2 layers of self-attention module units for interactive encoding of the intra-modal feature information, followed by a feedforward neural network with residual calculation, yielding the intra-modal speech, video and text features after interactive encoding by the self-attention network.
8. The method of claim 7, wherein: in step S4, an improved Transformer model in which a BiGRU2D replaces the multi-head attention module is constructed, the data obtained by splicing the intra-modal global interaction features with the pairwise inter-modal global interaction features are processed, and deep-level features are extracted, further comprising the following steps:
Step S4-1: the intra-modal global interaction feature of each modality is spliced with the inter-modal global interaction features of the corresponding pairwise combinations, giving the spliced feature data Z_A, Z_V, Z_T, which respectively represent the combined inter-modal and intra-modal feature data of the speech, video and text modalities;
Step S4-2: the spliced inter-modal and intra-modal features Z_A, Z_V, Z_T of step S4-1 are input into the improved Transformer encoder in which a BiGRU2D replaces the multi-head attention module, and deep-level features are extracted.
9. The method of claim 8, wherein: in step S4-2, taking the processing of the inter-modal and intra-modal feature data Z_A as an example, the specific steps are as follows:
Step S4-2-1: Z_A is first input into the layer normalization network,
Ẑ_A = LayerNorm(Z_A);
Step S4-2-2: Ẑ_A is input into the BiGRU2D network module, and the effective information of the two-dimensional feature vector is extracted by BiGRUs in the vertical and horizontal directions,
H_A^h = BiGRU_h(Ẑ_A^h),
H_A^v = BiGRU_v(Ẑ_A^v),
where Ẑ_A^h denotes the sequence obtained by dividing the feature vector along the horizontal direction, which is input into the BiGRU network to extract the horizontal-direction feature information H_A^h, and Ẑ_A^v denotes the sequence obtained by dividing the feature vector along the vertical direction, which is input into the BiGRU network to extract the vertical-direction feature information H_A^v; the two feature vectors are then spliced and a residual calculation is performed,
Z'_A = [H_A^h ; H_A^v] + Z_A;
Step S4-2-3: Z'_A is layer-normalized, fed into the feedforward neural network, and a residual calculation is introduced,
Z''_A = FFN(LayerNorm(Z'_A)) + Z'_A;
the features Z''_A, Z''_V, Z''_T obtained in this last step are used for multi-modal multi-class emotion classification.
10. An improved-Transformer-based multi-modal emotion recognition system capable of implementing the method of any one of claims 1 to 9.
CN202210707463.6A 2022-06-21 2022-06-21 Multi-modal emotion recognition method and system based on improved Transformer Pending CN115272908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210707463.6A CN115272908A (en) 2022-06-21 2022-06-21 Multi-modal emotion recognition method and system based on improved Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210707463.6A CN115272908A (en) 2022-06-21 2022-06-21 Multi-modal emotion recognition method and system based on improved Transformer

Publications (1)

Publication Number Publication Date
CN115272908A true CN115272908A (en) 2022-11-01

Family

ID=83761836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210707463.6A Pending CN115272908A (en) 2022-06-21 2022-06-21 Multi-modal emotion recognition method and system based on improved Transformer

Country Status (1)

Country Link
CN (1) CN115272908A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496077A (en) * 2022-11-18 2022-12-20 之江实验室 Multimode emotion analysis method and device based on modal observation and grading
CN116070169A (en) * 2023-01-28 2023-05-05 天翼云科技有限公司 Model training method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
Latif et al. Deep representation learning in speech processing: Challenges, recent advances, and future trends
CN112560503B (en) Semantic emotion analysis method integrating depth features and time sequence model
CN110751208B (en) Criminal emotion recognition method for multi-mode feature fusion based on self-weight differential encoder
Cho et al. Describing multimedia content using attention-based encoder-decoder networks
Ning et al. Semantics-consistent representation learning for remote sensing image–voice retrieval
CN113822192A (en) Method, device and medium for identifying emotion of escort personnel based on Transformer multi-modal feature fusion
Huang et al. Multimodal continuous emotion recognition with data augmentation using recurrent neural networks
CN111898670B (en) Multi-mode emotion recognition method, device, equipment and storage medium
CN115272908A (en) Multi-modal emotion recognition method and system based on improved Transformer
Zhang et al. Multi-modal multi-label emotion detection with modality and label dependence
Praveen et al. Audio–visual fusion for emotion recognition in the valence–arousal space using joint cross-attention
CN112151030A (en) Multi-mode-based complex scene voice recognition method and device
CN114676234A (en) Model training method and related equipment
CN113392265A (en) Multimedia processing method, device and equipment
CN114386515A (en) Single-mode label generation and multi-mode emotion distinguishing method based on Transformer algorithm
Li et al. Voice Interaction Recognition Design in Real-Life Scenario Mobile Robot Applications
Yoon Can we exploit all datasets? Multimodal emotion recognition using cross-modal translation
Parvin et al. Transformer-based local-global guidance for image captioning
Boukdir et al. Character-level Arabic text generation from sign language video using encoder–decoder model
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system
Hafeth et al. Semantic representations with attention networks for boosting image captioning
CN112541541B (en) Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion
Yu et al. Multimodal fusion method with spatiotemporal sequences and relationship learning for valence-arousal estimation
Peng et al. Mixture factorized auto-encoder for unsupervised hierarchical deep factorization of speech signal
US11810598B2 (en) Apparatus and method for automated video record generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination