CN115272908A - Multi-modal emotion recognition method and system based on improved Transformer - Google Patents
- Publication number: CN115272908A (application CN202210707463.6A)
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06N3/02—Neural networks; G06N3/08—Learning methods
- G06V10/40—Extraction of image or video features
- G06V10/764—Image or video recognition or understanding using classification, e.g. of video objects
- G06V10/82—Image or video recognition or understanding using neural networks
- G10L15/02—Feature extraction for speech recognition; selection of recognition unit
- G10L15/04—Segmentation; word boundary detection
- G10L15/08—Speech classification or search
- G10L25/24—Speech or voice analysis, the extracted parameters being the cepstrum
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
Abstract
The invention provides a multi-modal emotion recognition method based on an improved Transformer and a system for implementing the method. The method comprises the following steps: preprocessing each modality in a video, voice and text database, extracting the data features of each sample, and generating a two-dimensional feature vector for each data sample; acquiring the features of global interaction between two modalities through a cross-modal attention model; acquiring the features of global interaction within a single modality through a self-attention model; constructing an improved Transformer model in which a BiGRU2D replaces the multi-head attention module, and extracting deep-level features; and training the constructed network model with the processed data samples and using the trained model for multi-class emotion classification. The invention not only extracts the interactive features between modalities but also considers the interactive feature information within each modality, and extracts high-level features through the improved lightweight Transformer encoder, thereby solving the emotion classification problem more quickly and efficiently.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a multi-modal emotion recognition method based on an improved Transformer and a system for implementing the method.
Background
With the progress of science and the development of computer technology, artificial intelligence has gradually entered people's daily lives. In recent years, the emergence of various intelligent devices has improved the quality of human life. However, current smart devices cannot yet achieve a truly natural human-machine dialogue; emotion recognition technology is needed to enable barrier-free communication between humans and computers. Traditional emotion recognition technology is mainly built on single-modality data; although recognition can be achieved in this mode, it suffers from low accuracy, low utilization of sample resources, and similar problems. At present, advanced scientific research equipment can extract data of many modalities, such as video, voice, text, posture and electroencephalogram, and multi-modal emotion recognition technology can be widely applied in fields such as smart homes, intelligent transportation, smart cities and frontline healthcare, making it one of the hot topics of artificial intelligence research.
A search reveals that Chinese patent publication CN112784730A provides a multi-modal emotion recognition method based on a temporal convolutional network, which mainly uses video and audio modality data for emotion recognition. First, the video is sampled at equal intervals, a gray-scale face image sequence is generated through face detection and key-point localization, and the audio data are passed through a Mel filter bank to obtain a Mel spectrogram. The face images and the spectrogram are then fed into convolutional neural networks for feature fusion. Finally, the fused feature sequence is input into a temporal convolutional network to obtain high-level feature vectors, and the multi-modal multi-class emotion is predicted through a fully connected layer and Softmax.
More and more researchers hope to exploit the complementarity between the information of different modalities to construct a robust emotion recognition model and achieve higher emotion classification accuracy. However, while inter-modal feature complementarity is considered, the importance of features within a single modality is mostly ignored, and the complexity and computational efficiency of the algorithms also leave room for improvement.
In view of the above, it is necessary to design a method for recognizing multi-modal emotion of speech, expression and text based on improved Transformer and a system for implementing the method to solve the above problems.
Disclosure of Invention
The invention aims to provide a multi-modal emotion recognition method and system based on an improved Transformer aiming at the defects of the existing multi-modal emotion classification technology.
In order to achieve this aim, the invention provides a multi-modal emotion recognition method based on an improved Transformer, which comprises the following steps:
s1, preprocessing each modality in a video, voice and text database, extracting the data features of each sample, and generating a two-dimensional feature vector for each data sample;
s2, acquiring the features of global interaction between two modalities through a cross-modal attention model;
s3, acquiring the features of global interaction within a single modality through a self-attention model;
s4, constructing an improved Transformer model in which a BiGRU2D replaces the multi-head attention module, and extracting deep-level features;
and S5, training the constructed network model with the processed data samples, and using the trained model for multi-class emotion classification.
A further improvement of the present invention is that the step S1 further comprises the steps of:
s1-1, performing framing processing on each video data sample, intercepting a sequence of k frame images in time order, performing feature extraction on each intercepted frame image, and generating a two-dimensional feature vector Z_V for each video data sample;
S1-2, performing segmentation processing on each voice data sample, intercepting k voice sequences in time order, performing feature extraction on each intercepted voice segment, and generating a two-dimensional feature vector Z_A for each voice data sample;
S1-3, performing word-level processing on each text data sample, intercepting k words in time order, performing feature extraction on each intercepted word, and generating a two-dimensional feature vector Z_T for each text data sample.
A further improvement of the present invention is that, in steps S1-1 to S1-3, the video sample data features are extracted with the Facet tool; the low-level acoustic features are extracted with COVAREP, comprising 74 acoustic features that characterize the voice, such as 12 Mel cepstral coefficients, glottal source parameters, peak slope parameters, pitch tracking and voiced/unvoiced segmentation features, and the maximum dispersion quotient; and a word vector of dimension 300 is generated for each word through a pre-trained GloVe model.
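The word-level text features described above can be sketched as follows. This is a minimal illustration under stated assumptions: a tiny random embedding table stands in for the pre-trained 300-dimensional GloVe vectors, and the vocabulary, `k=20` word budget and `<pad>` token are hypothetical choices, not specified by the patent.

```python
import numpy as np

# Hypothetical mini embedding table standing in for pre-trained GloVe vectors.
rng = np.random.default_rng(0)
vocab = {"i": 0, "feel": 1, "happy": 2, "<pad>": 3}
glove_like = rng.standard_normal((len(vocab), 300))  # one 300-d vector per word

def text_to_feature(words, k=20):
    """Truncate/pad to k words and stack their 300-d vectors into a k x 300 matrix."""
    ids = [vocab.get(w, vocab["<pad>"]) for w in words[:k]]
    ids += [vocab["<pad>"]] * (k - len(ids))
    return glove_like[ids]  # two-dimensional feature vector Z_T

Z_T = text_to_feature("i feel happy".split())
print(Z_T.shape)  # (20, 300)
```

The same k x d layout (k time steps, d features per step) is what the Facet and COVAREP extractors would produce for the video and voice modalities, with d = 35 and d = 74 respectively.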
A further improvement of the invention is that in step S2 a cross-modal attention model is constructed, the feature data of the three modalities, arranged and combined pairwise, are processed, and the global interaction feature vector between each pair of modalities is obtained.
A further improvement of the present invention is that step S2 mainly comprises the steps of:
step S2-1: the processed sample data Z_V, Z_A, Z_T are subjected to one-dimensional convolution by a Conv1D with convolution kernel size n, unifying the sample data dimensions to D dimensions; then the dimension-unified sample data are further processed through sine and cosine position coding, finally obtaining the trimodal data samples Z_V(0), Z_A(0), Z_T(0);
Step S2-2: obtaining interactive feature vector by global interaction between two modalities through cross-modality attention network
The present invention is further improved in that in step S2-2, taking the voice modality as the target modality and the video modality as the auxiliary modality as an example, the voice modality feature data Z_A(0) and the video modality feature data Z_V(0) are input into a multi-layer cross-modal attention network unit, and the feature vector Z_{V→A} is obtained through multiple rounds of global feature interaction calculation. The calculation steps are as follows:
step S2-2-1: layer normalization is performed on the target-modality and auxiliary-modality data features respectively; the calculation process is as follows:
Z'_A(i-1) = LN(Z_A(i-1)), Z'_V(0) = LN(Z_V(0))
wherein Z_A(i-1) represents the feature vector after inter-modal feature interaction through i-1 layers of the multi-head attention network;
step S2-2-2: Z'_A(i-1) and Z'_V(0) are input into the multi-head attention network for interaction of global features, and residual calculation is performed; the calculation process is as follows:
Y_A(i) = softmax( (Z'_A(i-1) W_Q)(Z'_V(0) W_K)^T / sqrt(d_k) ) (Z'_V(0) W_V) + Z_A(i-1)
wherein W_Q, W_K, W_V represent the weight matrices of the different tensors, and the formula performs global feature interaction using the low-level feature data Z'_V(0) of the auxiliary modality and the target-modality feature data Z_A(i-1) output after i-1 layers of multi-head attention network units; and
step S2-2-3: the feature data obtained by the residual addition are normalized and then input into the feedforward neural network, and residual calculation is performed; the calculation process is as follows:
Z_A(i) = FFN( LN(Y_A(i)) ) + Y_A(i)
wherein i = 1, 2, …, D1, and LN(Y_A(i)) represents the result obtained by layer normalization after the i-th round of voice and video modality feature interaction, which is then input into the feedforward neural network; through the feature interaction of the D1-layer cross-modal attention network, the feature vector Z_{V→A}, which takes the voice modality as the target modality and the video modality as the auxiliary modality and performs global feature interaction through the cross-modal attention network, is obtained.
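The per-layer computation in steps S2-2-1 to S2-2-3 can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: a single attention head stands in for the multi-head network, the weight matrices are shared across layers and randomly initialized, and the dimensions k = 20, D = 40 and the layer count 3 follow the embodiment described later.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_modal_layer(Z_tgt, Z_aux, W_q, W_k, W_v):
    """One layer: LN -> attention (query from target, key/value from auxiliary) -> residual."""
    Q = layer_norm(Z_tgt) @ W_q
    K = layer_norm(Z_aux) @ W_k
    V = layer_norm(Z_aux) @ W_v
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    return attn + Z_tgt  # residual connection

def ffn_block(Z, W1, W2):
    """LN -> two-layer feed-forward (ReLU) -> residual."""
    h = np.maximum(layer_norm(Z) @ W1, 0.0)
    return h @ W2 + Z

k, D = 20, 40
Z_A = rng.standard_normal((k, D))  # voice (target modality)
Z_V = rng.standard_normal((k, D))  # video (auxiliary modality)
W_q, W_k, W_v = (0.1 * rng.standard_normal((D, D)) for _ in range(3))
W1, W2 = 0.1 * rng.standard_normal((D, D)), 0.1 * rng.standard_normal((D, D))

Z = Z_A
for _ in range(3):  # 3 cross-modal attention layers, as in the embodiment
    Z = cross_modal_layer(Z, Z_V, W_q, W_k, W_v)
    Z = ffn_block(Z, W1, W2)
print(Z.shape)  # (20, 40)
```

Swapping the roles of `Z_tgt` and `Z_aux` gives the second interaction direction of the same modality pair.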
In a further improvement of the present invention, the method for constructing the self-attention model in step S3 and acquiring the global interaction features within a single modality comprises: the voice, video and text modality feature data Z_A(0), Z_V(0), Z_T(0), processed by Conv1D and sine-cosine position coding, are each passed through D2 layers of self-attention module units for interactive coding of the feature information within the modality, residual calculation is performed after a feedforward neural network, and the intra-modal voice, video and text features interactively coded through the self-attention network are obtained.
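The intra-modal case differs from the cross-modal case only in that query, key and value all come from the same modality. A minimal single-head NumPy sketch, under the same illustrative assumptions as above (random shared weights, k = 20, D = 40, 3 layers):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(Z, W_q, W_k, W_v):
    """Intra-modal attention: query, key and value all come from the same modality."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    out = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    return out + Z  # residual connection

k, D = 20, 40
Z_T = rng.standard_normal((k, D))  # position-coded text modality features
W_q, W_k, W_v = (0.1 * rng.standard_normal((D, D)) for _ in range(3))

Z = Z_T
for _ in range(3):  # D2 = 3 self-attention layers in the embodiment
    Z = self_attention(Z, W_q, W_k, W_v)
print(Z.shape)  # (20, 40)
```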
A further improvement of the invention is that in step S4 an improved Transformer model in which a BiGRU2D replaces the multi-head attention module is constructed, the data obtained by splicing the features of global interaction within a single modality with the features of global interaction between the pairwise-combined modalities are processed, and deep-level features are extracted; the method further comprises the following steps:
step S4-1: the features of global interaction within each single modality are respectively spliced with the features of global interaction between the pairwise-combined modalities; the spliced feature data Z_A, Z_V, Z_T respectively represent the inter-modal-intra-modal feature data of the voice modality, the video modality and the text modality;
step S4-2: the inter-modal-intra-modal features Z_A, Z_V, Z_T spliced in step S4-1 are input into the improved Transformer encoder in which a BiGRU2D replaces the multi-head attention module, to extract deep-level features.
A further improvement of the present invention is that, in step S4-2, taking the processing of one modality's inter-modal-intra-modal feature data as an example, the specific steps are as follows:
step S4-2-2: the feature data are input into a BiGRU2D network module, and the effective information of the two-dimensional feature vector is extracted by BiGRUs in the vertical and horizontal directions: the feature vector is divided in the horizontal direction into a sequence that is input into a BiGRU network to extract the feature information in the horizontal direction; the feature vector is divided in the vertical direction into a sequence that is input into a BiGRU network to extract the feature information in the vertical direction; the two feature vectors are then spliced and residual calculation is performed;
step S4-2-3: the spliced features are layer-normalized and then sent into a feedforward neural network, and residual calculation is introduced;
the features obtained in the last step are used for multi-modal multi-class emotion classification.
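The splicing of step S4-1 can be sketched as a concatenation along the feature dimension. The shapes here are illustrative assumptions (k = 20 time steps, D = 40 features per branch, and for the voice modality one intra-modal branch plus its two cross-modal branches); the patent does not fix the concatenation axis explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
k, D = 20, 40
Z_A_intra = rng.standard_normal((k, D))  # intra-modal voice features (self-attention)
Z_VA      = rng.standard_normal((k, D))  # video -> voice cross-modal features
Z_TA      = rng.standard_normal((k, D))  # text  -> voice cross-modal features

# Spliced inter-modal-intra-modal feature data for the voice modality.
Z_A = np.concatenate([Z_A_intra, Z_VA, Z_TA], axis=-1)
print(Z_A.shape)  # (20, 120)
```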
In order to achieve the above object, the present invention further provides an improved Transformer-based multi-modal emotion recognition system capable of implementing any of the methods described above.
The invention has the following beneficial effects:
the method is based on an attention mechanism and an improved transform mode which replaces a multi-head attention mechanism by BiGRU2D to extract voice, video and text emotion characteristics and perform multi-mode emotion classification. By constructing the attention mechanism module, the global interactive coding features between two modes can be acquired, the global interactive coding features in a single mode can be acquired, the two feature data are integrated and spliced, the feature dimension and the information can be enriched, and therefore the recognition rate of multi-mode emotion classification is improved. Meanwhile, the network module of the improved Transformer constructed by the invention extracts high-level characteristic information, and the BiGRU2D modules in the horizontal and vertical directions replace complex multi-head attention modes, so that network parameters are greatly reduced, model training time is saved, and the operation efficiency of the multi-mode emotion recognition system is improved while high accuracy is maintained.
Drawings
FIG. 1 is a flow chart of a method for recognizing multi-modal emotion based on an improved Transformer in the invention.
FIG. 2 is a diagram of an inter-modal attention mechanism network.
FIG. 3 is a diagram of a single-mode internal attention mechanism network.
FIG. 4 is a network architecture diagram of the improved Transformer module.
Fig. 5 is a network structure diagram of the BiGRU2D module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
It should be emphasized that in describing the present invention, various formulas and constraints are identified with consistent labels, but the use of different labels to identify the same formula and/or constraint is not precluded and is provided for the purpose of more clearly illustrating the features of the present invention.
As shown in FIG. 1, the invention provides a multi-modal emotion recognition method based on an improved Transformer. The method comprises the following steps: preprocessing each modality in a video, voice and text database and extracting the data features of each sample by conventional methods; then acquiring the features of global interaction between two modalities through a 3-layer cross-modal attention network, and acquiring the features of global interaction within a single modality through a 3-layer self-attention model; further extracting the features produced by the attention networks through the constructed improved Transformer network; and finally inputting the data into a deep neural network for training and applying the trained model to the multi-modal emotion classification task.
The following will describe the multi-modal emotion recognition method based on improved Transformer provided by the present invention in detail with reference to fig. 1. The method comprises the following steps:
step S1: preprocessing each mode in a video, voice and text database, extracting data characteristics of each sample by a traditional method, and generating a two-dimensional characteristic vector by each data sample. The present embodiment selects an IEMOCAP multimodal emotion sample library. The IEMOCAP multimodal emotion library is a database of one-motion, multimodal and multi-talker recorded by the SAIL laboratory at the university of southern california from facial, head and hand markers of ten actors, including video, voice, facial motion capture, etc. The performers performed selected episodes of emotions, i.e. created hypothetical scenes, aiming to elicit 5 specific types of discrete emotions (happy, angry, sad, depressed and neutral), and the corpus contained approximately 12 hours of data. Four multimodal mood samples of happiness, anger, sadness and depression were processed in the experiment, with a sample number of 973. In the experiment, the trimodal data were preprocessed by conventional methods. The specific processing steps of the IEMOCAP multi-modal database are as follows: :
Step S1-1: framing processing is performed on each video data sample, a sequence of 20 frame images is intercepted in time order, the motion information of 35 facial action units is extracted from the face in each frame image through the Facet tool, and finally each video data sample generates a two-dimensional feature vector Z_V;
Step S1-2: each voice data sample is segmented, 20 voice sequences are intercepted in time order, and low-level acoustic features are then extracted through COVAREP, comprising 74 acoustic features that characterize the voice, such as 12 Mel cepstral coefficients, glottal source parameters, peak slope parameters, pitch tracking and voiced/unvoiced segmentation features, and the maximum dispersion quotient; finally, each voice data sample generates a two-dimensional feature vector Z_A;
Step S1-3: word-level processing is performed on each text data sample, a word vector of dimension 300 is generated for each word through a pre-trained GloVe model, and finally each text data sample generates a two-dimensional feature vector Z_T.
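The time-ordered interception in steps S1-1 and S1-2 can be sketched as cutting a signal into 20 equal-length segments; this is a minimal illustration under stated assumptions (a synthetic 1-D signal, remainder samples dropped), after which the real feature extractors (Facet per frame, COVAREP per segment) would be applied.

```python
import numpy as np

def segment(signal, k=20):
    """Cut a 1-D signal into k equal-length segments in time order, dropping the tail."""
    n = len(signal) // k
    return np.asarray(signal[:n * k]).reshape(k, n)

audio = np.arange(2000, dtype=float)  # stand-in for raw audio samples
chunks = segment(audio)
print(chunks.shape)  # (20, 100)
```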
Step S2: as shown in FIG. 2, the three modality features are first arranged and combined pairwise, and the combined features are input into the constructed cross-modal attention network. There are two modes of global interaction for each pair of modalities, that is, the two modalities serve in turn as the target modality and the auxiliary modality. The auxiliary modality serves as the low-level features; in each layer of the cross-modal attention network, feature interaction is performed again with the feature codes output by the previous cross-modal attention layer, and finally the global interaction feature vector between the two modalities is obtained through the 3-layer cross-modal attention network. The specific acquisition steps are as follows:
Step S2-1: the processed sample data Z_V, Z_A, Z_T are subjected to one-dimensional convolution by a Conv1D with convolution kernel size 3, unifying the sample data dimensions to 40 dimensions. Then the dimension-unified sample data are further processed through sine and cosine position coding, finally obtaining the trimodal data samples Z_V(0), Z_A(0), Z_T(0).
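The sine and cosine position coding in step S2-1 above can be sketched with the standard Transformer formulation (PE[t, 2j] = sin(t / 10000^(2j/D)), PE[t, 2j+1] = cos(t / 10000^(2j/D))); whether the patent uses exactly this base and interleaving is an assumption. A zero matrix stands in for the Conv1D output with dimensions unified to D = 40.

```python
import numpy as np

def sinusoidal_pe(k, D):
    """Standard sine/cosine positional encoding for k positions and D (even) dimensions."""
    t = np.arange(k)[:, None]
    j = np.arange(0, D, 2)[None, :]
    angle = t / np.power(10000.0, j / D)
    pe = np.zeros((k, D))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

k, D = 20, 40
Z = np.zeros((k, D))          # stand-in for the Conv1D output (dimensions unified to 40)
Z0 = Z + sinusoidal_pe(k, D)  # position-coded trimodal data sample
print(Z0.shape)  # (20, 40)
```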
Step S2-2: the feature vector is obtained through global interaction between two modalities via the 3-layer cross-modal attention network. Taking the voice modality as the target modality and the video modality as the auxiliary modality as an example, the voice modality feature data Z_A(0) and the video modality feature data Z_V(0) are input into the 3-layer cross-modal attention network unit, and the feature vector Z_{V→A} is obtained through multiple rounds of global feature interaction calculation. The specific calculation steps are as follows:
Step S2-2-1: layer normalization is performed on the target-modality and auxiliary-modality data features respectively; the specific calculation process is as follows:
Z'_A(i-1) = LN(Z_A(i-1)), Z'_V(0) = LN(Z_V(0))
wherein Z_A(i-1) represents the feature vector after inter-modal feature interaction through i-1 layers of the multi-head attention network.
Step S2-2-2: will be provided withInputting a multi-head attention network to carry out interaction of global features and carry out residual calculation, wherein the specific calculation process is as follows:
wherein,weight matrices representing different tensors, in particular dA、dV、dk、dSRespectively 74, 35, 40 and 40,representing low-level feature data using auxiliary modalitiesAnd target modal characteristic data output after passing through i-1 layer multi-head attention network unitPerforming global feature interaction;
Step S2-2-3: the feature data obtained by the residual addition are normalized and then input into the feedforward neural network, and residual calculation is performed; the specific calculation process is as follows:
Z_A(i) = FFN( LN(Y_A(i)) ) + Y_A(i)
wherein i = 1, 2, 3, and LN(Y_A(i)) represents the result obtained by layer normalization after the i-th round of voice and video modality feature interaction, which is then input into the feedforward neural network. Finally, through the feature interaction of the 3-layer cross-modal attention network, the feature vector Z_{V→A}, which takes the voice modality as the target modality and the video modality as the auxiliary modality and performs global feature interaction through the cross-modal attention network, is obtained.
Step S3: as shown in FIG. 3, the specific method for constructing the self-attention model and acquiring the global interaction features within a single modality is as follows: the voice, video and text modality feature data Z_A(0), Z_V(0), Z_T(0), processed by Conv1D and sine-cosine position coding, are each passed through a 3-layer self-attention module unit for interactive coding of the feature information within the modality, residual calculation is performed after a feedforward neural network, and the intra-modal voice, video and text features interactively coded through the self-attention network are obtained.
And step S4: constructing an improved Transformer network with a BiGRU2D substituted for a multi-head attention network, and specifically comprising the following steps:
Step S4-1: the features of global interaction within each single modality are respectively spliced with the features of global interaction between the pairwise-combined modalities, obtaining spliced feature data Z_A, Z_V, Z_T, which respectively represent the inter-modal-intra-modal feature data of the voice modality, the video modality and the text modality;
Step S4-2: the inter-modal-intra-modal features Z_A, Z_V, Z_T spliced in step S4-1 are input into the improved Transformer encoder to extract deep-level features. Taking the processing of one modality's inter-modal-intra-modal feature data as an example, the specific steps are as follows:
Step S4-2-2: the feature data are input into a BiGRU2D network module, and the effective information of the two-dimensional feature vector is extracted by BiGRUs in the vertical and horizontal directions: the feature vector is divided in the horizontal direction into a sequence that is input into a BiGRU network to extract the feature information in the horizontal direction; the feature vector is divided in the vertical direction into a sequence that is input into a BiGRU network to extract the feature information in the vertical direction; the two feature vectors are then spliced and residual calculation is performed;
Step S4-2-3: the spliced features are layer-normalized and then sent into a feedforward neural network, and residual calculation is introduced.
The features obtained in the previous step are used for multi-modal emotion classification.
Step S5: and inputting the processed data into a deep neural network for training, and applying the trained model to a multi-modal emotion classification task.
Based on the above inventive concept, the invention also discloses a multi-modal emotion recognition system based on the improved Transformer, comprising at least one computing device that includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the computer program is loaded into the processor, the improved Transformer-based multi-modal emotion recognition method is implemented.
By constructing the attention-mechanism module, the invention obtains not only the global interactive coding features between pairs of modalities but also those within a single modality; integrating and splicing these two kinds of feature data enriches the feature dimensions and information and improves the recognition rate of multi-modal emotion classification. Meanwhile, the improved Transformer network module constructed by the invention extracts high-level feature information: the horizontal-and-vertical BiGRU2D module replaces the complex multi-head attention mechanism, greatly reducing the network parameters and saving model training time, so that the multi-modal emotion recognition system operates more efficiently while maintaining high accuracy.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.
Claims (10)
1. A multi-modal emotion recognition method based on an improved Transformer, characterized by comprising the following steps:
s1, preprocessing each modality in a video, voice and text database, extracting data characteristics of each sample, and generating a two-dimensional characteristic vector by each data sample;
s2, acquiring the characteristics of global interaction between two modes through a cross-mode attention model;
s3, acquiring the characteristics of global interaction in a single mode through a self-attention model;
s4, constructing an improved Transformer model with a BiGRU2D substituted for a multi-head attention module, and extracting deep features;
and S5, training the constructed network model with the processed data samples, and using the trained model for multi-class emotion classification.
2. The method of claim 1, wherein: the step S1 further includes the steps of:
s1-1, performing framing processing on each video data sample, intercepting a k-frame image sequence in time order, performing feature extraction on each intercepted frame image, and generating a two-dimensional feature vector Z_V for each video data sample;
S1-2, performing segmentation processing on each voice data sample, intercepting k voice segments in time order, performing feature extraction on each intercepted segment, and generating a two-dimensional feature vector Z_A for each voice data sample;
S1-3, performing word-level processing on each text data sample, intercepting k words in time order, performing feature extraction on each intercepted word, and generating a two-dimensional feature vector Z_T for each text data sample.
3. The method of claim 2, wherein: in the step S1-1, sample data features are extracted through the Facet tool; low-level acoustic features are extracted through COVAREP, comprising 74 acoustic features representing speech characteristics, such as 12 Mel-frequency cepstral coefficients, glottal source parameters, peak slope parameters, pitch tracking and voiced/unvoiced segmentation features, and the maximum dispersion quotient; and a 300-dimensional word vector is generated for each word through a pre-trained GloVe model.
5. The method of claim 4, wherein: the step S2 mainly comprises the following steps:
step S2-1: sample data Z after processingV、ZA、ZTCarrying out one-dimensional convolution through Conv1D with convolution kernel size of n multiplied by n, wherein the sample data dimensions are unified to D dimensions; then, the sample data with unified dimensionality is continuously processed through sine and cosine position coding, and finally the three-mode data sample is obtained
6. The method of claim 5, wherein: in step S2-2, taking the speech modality as the target modality and the video modality as the auxiliary modality as an example, the speech modality feature data and the video modality feature data are input into a multi-layer cross-modal attention network unit, and a feature vector is obtained through multiple rounds of global feature interaction calculation; the calculation steps are as follows:
step S2-2-1: respectively carrying out layer normalization processing on the target modal and auxiliary modal data characteristics, wherein the calculation process is as follows:
wherein,representing a feature vector after performing inter-modal feature interaction through an i-1 layer multi-head attention network;
step S2-2-2: will be provided withInputting a multi-head attention network to carry out interaction of global features and carry out residual calculation, wherein the calculation process is as follows:
wherein the weight matrices of the different tensors are denoted accordingly; the low-level feature data of the auxiliary modality and the target-modality feature data output by the i-1 layer multi-head attention network unit perform global feature interaction; and
step S2-2-3: the feature data obtained by residual addition are normalized, then input into a feedforward neural network, and residual calculation is performed; the calculation process is as follows:
wherein i = 0, 1, …, D1; the term denotes the result obtained by layer normalization after the i-th round of speech and video modality feature interaction, which is then input into the feedforward neural network; through the feature interaction of the D1-layer cross-modal attention network, the feature vector that takes the speech modality as the target modality and the video modality as the auxiliary modality and undergoes global feature interaction through the cross-modal attention network is obtained.
7. The method of claim 6, wherein: the method for constructing the self-attention model in step S3 and acquiring the global interaction features within a single modality is as follows: the speech, video and text modality feature data processed by Conv1D and sine-cosine position encoding are each subjected to intra-modal interactive feature encoding through a D2-layer self-attention module unit; residual calculation is performed after a feedforward neural network, and the intra-modal speech, video and text features encoded by the self-attention network are obtained.
8. The method of claim 7, wherein: in step S4, an improved Transformer model in which a BiGRU2D replaces the multi-head attention module is constructed, the data obtained by splicing the intra-modal global-interaction features with the pairwise inter-modal global-interaction features are processed, and the deep features are extracted; the method further comprises the following steps:
step S4-1: the global-interaction features within each single modality are spliced with the pairwise inter-modal global-interaction features, i.e. the spliced feature data Z_A, Z_V, Z_T respectively represent the inter-modal/intra-modal feature data of the speech, video and text modalities;
step S4-2: the inter-modal/intra-modal features Z_A, Z_V, Z_T spliced in step S4-1 are input into an improved Transformer encoder in which a BiGRU2D replaces the multi-head attention module, to extract deep features.
9. The method of claim 8, wherein: in step S4-2, taking the processing of one modality's inter-modal/intra-modal feature data as an example, the specific steps are as follows:
step S4-2-2: the spliced feature is input into a BiGRU2D network module, and the effective information of the two-dimensional feature vector is extracted by BiGRUs in the vertical and horizontal directions. The feature vector is divided into a sequence along the horizontal direction and input into a BiGRU network to extract the horizontal feature information; the feature vector is likewise divided into a sequence along the vertical direction and input into a BiGRU network to extract the vertical feature information. The two resulting feature vectors are then spliced and residual calculation is performed.
step S4-2-3: after layer normalization, the result is sent to a feedforward neural network and residual calculation is introduced.
10. An improved Transformer-based multi-modal emotion recognition system, which can implement the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210707463.6A CN115272908A (en) | 2022-06-21 | 2022-06-21 | Multi-modal emotion recognition method and system based on improved Transformer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115272908A true CN115272908A (en) | 2022-11-01 |
Family
ID=83761836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210707463.6A Pending CN115272908A (en) | 2022-06-21 | 2022-06-21 | Multi-modal emotion recognition method and system based on improved Transformer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115272908A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115496077A (en) * | 2022-11-18 | 2022-12-20 | 之江实验室 | Multimode emotion analysis method and device based on modal observation and grading |
CN116070169A (en) * | 2023-01-28 | 2023-05-05 | 天翼云科技有限公司 | Model training method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||