CN116682463A - Multi-mode emotion recognition method and system - Google Patents


Info

Publication number
CN116682463A
Authority
CN
China
Prior art keywords
voice
vector
mfcc
spectrogram
text
Prior art date
Legal status
Pending
Application number
CN202310632363.6A
Other languages
Chinese (zh)
Inventor
肖红
张嘉柠
姜文超
黄子豪
Current Assignee
Guangzhou Fansha Intelligent Technology Co ltd
Guangdong University of Technology
Original Assignee
Guangzhou Fansha Intelligent Technology Co ltd
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangzhou Fansha Intelligent Technology Co ltd, Guangdong University of Technology filed Critical Guangzhou Fansha Intelligent Technology Co ltd
Priority to CN202310632363.6A priority Critical patent/CN116682463A/en
Publication of CN116682463A publication Critical patent/CN116682463A/en
Pending legal-status Critical Current


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/63 — Speech or voice analysis specially adapted for estimating an emotional state
    • G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L25/24 — Speech or voice analysis in which the extracted parameters are the cepstrum
    • G10L25/30 — Speech or voice analysis using neural networks
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a multi-modal emotion recognition method and system in the technical field of emotion recognition. S10: acquire the MFCC, spectrogram, voice waveform and voice text of a voice signal; S20: acquire the high-level feature vectors corresponding to the MFCC, spectrogram, voice waveform and voice text; S30: fuse the high-level feature vectors corresponding to the MFCC, spectrogram, voice waveform and voice text; S40: perform emotion recognition on the MFCC, spectrogram and fused high-level feature vectors. The multi-modal emotion recognition method and system provided by the application can fuse multi-level acoustic features and strengthen the information interaction and connection among multiple acoustic features within the voice modality; at the same time, a cross-modal convolutional attention mechanism captures emotion information between the two modalities more comprehensively, solving the problems that single acoustic information in the voice modality and insufficient voice-text inter-modal information interaction leave the information underused and reduce emotion recognition accuracy.

Description

Multi-mode emotion recognition method and system
Technical Field
The application relates to the technical field of emotion recognition, in particular to a multi-mode emotion recognition method and system based on a cross-mode convolution attention and multi-acoustic feature fusion module.
Background
Speech emotion recognition is a natural language processing technology that aims to recognize the emotional state of a human being from the speech signal. It can be used to evaluate a speaker's emotion, attitude and emotional tendency, and has many applications in real life, such as: (1) natural language understanding: helping computers better understand and interpret emotional states in human speech; (2) intelligent customer service: speech emotion recognition can help customer-service staff better understand the emotional state and needs of customers and provide better service and support; (3) emotion diagnosis and treatment: speech emotion recognition can be used for emotion diagnosis and treatment, helping doctors and therapists better understand the emotional state and needs of patients and provide more effective treatment plans; (4) social media: speech emotion recognition can be applied to social media, for example emotion analysis of voice messages, voice comments and live video, so that user feedback and emotional states can be better understood and responded to.
Speech is a signal, and the earliest speech emotion feature extraction used general signal feature extraction methods, such as the moving average (MA), the Fourier transform (FFT) and the wavelet transform (WT); there are also feature extraction methods tailored to the characteristics of the speech signal itself, such as short-time energy (STE), zero-crossing rate (ZCR), linear predictive coding coefficients (LPC) and Mel-frequency cepstral coefficients (MFCC). The performance of these hand-crafted feature extraction methods is limited by the quality and quantity of signal acquisition; mixed-in noise is difficult to filter out, the computation is cumbersome, and robustness is poor.
With the progress of deep learning, modern speech emotion recognition has made significant progress. Deep learning techniques are widely used for speech signal feature extraction, for example convolutional neural networks (CNN), long short-term memory networks (LSTM) and self-attention models (Transformers). Deep learning has many advantages in speech emotion recognition: a deep network can automatically learn high-level feature representations from the original speech signal without manually designed features, so that emotion information in the speech signal can be represented more accurately; it also has stronger robustness and can handle noise and speech distortion to a certain extent, so that the emotional state in the speech signal can be recognized more stably.
Zhao Li et al. of the Radio Engineering Department of Southeast University first proposed research on emotion recognition in speech signals in 2001 and conducted in-depth research on speech emotion recognition using principal component analysis; Cai Yonglian of the computer department of Tsinghua University also studied emotion recognition of Mandarin Chinese, investigating the role of prosodic features in Mandarin emotion recognition and adopting a Gaussian mixture model and a probabilistic neural network model as classifiers, achieving a satisfactory emotion recognition rate. However, in the existing voice-text bimodal emotion recognition technology, the voice modality uses only the acoustic feature of the voice waveform and no other acoustic features; the features are single and carry little information, and especially when the data set is small, using only the voice waveform leaves the voice modality information insufficient and the training effect poor. The voice and text modalities are strongly correlated: for the same text, pitch, speaking rate and rhythm are all related to emotion. However, in the existing voice-text emotion recognition technology there is no sufficient information-interaction process between the voice modality and the text modality; the feature vectors of the two modalities are simply concatenated or their inner product is taken, so the correlation between voice and text is not fully exploited to capture key information, and there is no information interaction in the high-dimensional subspace.
In summary, the prior art has the following drawbacks: (1) the voice modality uses only the acoustic feature of the voice waveform and no other acoustic features; the features are single and carry little information, and especially when the data set is small, using only the voice waveform leaves the voice modality information insufficient and the training effect poor; (2) under the existing voice-text emotion recognition technology, information interaction between the voice modality and the text modality is insufficient: the feature vectors of the two modalities are merely concatenated directly or their inner product is taken, and the correlation between voice and text is not fully exploited to capture key information or to interact in the high-dimensional subspace, so emotion recognition accuracy is reduced.
Disclosure of Invention
Aiming at the technical problems in the prior art of single acoustic features and low emotion recognition accuracy, the invention provides a multi-modal emotion recognition method and system that strengthen the information interaction and connection among multiple acoustic features within the voice modality and, by capturing and fusing the interaction information between the voice modality and the text modality, capture emotion information between the two modalities more comprehensively, thereby obtaining better performance.
In order to achieve the purpose of the invention, the invention adopts the following technical scheme:
a multi-modal emotion recognition method comprising the steps of:
s10: acquiring an MFCC, a spectrogram, a voice waveform and a voice text of a voice signal;
s20: acquiring high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
s30: fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
s40: carrying out emotion recognition on the MFCC, the spectrogram and the advanced feature vector obtained after fusion in the step S30;
wherein MFCC represents mel-frequency cepstral coefficients.
According to the technical scheme, through acquiring the MFCC, the spectrogram, the voice waveform, the voice text and the corresponding high-level feature vectors thereof, and fusing the high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text, the information interaction and the connection of various acoustic features in the voice mode are enhanced, and the interaction information between the voice mode and the text mode is fused, so that emotion information between the two modes can be acquired more comprehensively, and better emotion recognition performance is obtained.
Further, the step S10 specifically includes the following steps:
S11: pre-emphasis processing is carried out on the voice signal;
s12: carrying out framing treatment on the voice signal subjected to pre-emphasis treatment;
s13: and windowing the voice signal subjected to framing.
In step S11, the pre-emphasis process needs to be added by a high-pass filter equation, and the pre-emphasis process meets the following requirements:
y(t)=x(t)-αx(t-1);
in the formula, t represents the current time, α represents a filter coefficient, x (t) represents an input value of a voice signal, and y (t) represents an output value of the voice signal.
In step S13, windowing is applied with a Hamming window function, and the windowing process satisfies:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
where N denotes the window length of each frame of the voice signal, a denotes the Hamming window coefficient of each frame, n denotes the sampling point of the window function over the framed voice signal, and W(n, a) denotes the value of the Hamming window in the time domain.
S14: carrying out energy distribution transformation on each frame of voice signal subjected to windowing treatment on a frequency domain to obtain an energy spectrum of the voice signal, wherein the energy distribution transformation process meets the formula:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^(-j2πkn/N);
where S_i(k) denotes the complex value of the k-th frequency component of the i-th frame in the frequency domain, s_i(n) denotes the real value of the n-th sampling point in the time domain, k denotes the index of the frequency component, and N denotes the total number of sampling points.
According to the technical scheme, the voice signal can be better amplified to high frequency through pre-emphasis processing, the voice signal after the high frequency amplification is subjected to framing processing, the whole stable voice signal is framed into the short-time stable signal, and then the short-time stable signal is subjected to windowing processing, so that frequency spectrum leakage is reduced, the voice signal can be better subjected to energy distribution transformation on a frequency domain, and then the energy spectrum corresponding to the MFCC and the spectrogram is obtained.
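As an illustration of steps S11-S14, the following is a minimal NumPy sketch of the preprocessing pipeline; the filter coefficient α = 0.97, the 16 kHz sampling rate, the 512-point FFT and the Hamming coefficient a = 0.46 are assumed example values, while the 20 ms window and 10 ms shift follow the embodiment described later.

```python
import numpy as np

def preprocess(signal, sample_rate=16000, alpha=0.97,
               frame_ms=20, shift_ms=10, n_fft=512):
    """Pre-emphasis, framing, Hamming windowing and per-frame energy spectrum."""
    # S11: pre-emphasis, y(t) = x(t) - alpha * x(t-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # S12: framing (20 ms window, 10 ms shift -> 50% overlap);
    # assumes the signal is at least one frame long
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])

    # S13: Hamming window W(n, a) = (1 - a) - a * cos(2*pi*n / (N - 1))
    a = 0.46
    n = np.arange(frame_len)
    hamming = (1 - a) - a * np.cos(2 * np.pi * n / (frame_len - 1))
    frames = frames * hamming

    # S14: FFT of each frame and its energy (power) spectrum
    spectrum = np.fft.rfft(frames, n=n_fft)           # S_i(k)
    power_spec = (np.abs(spectrum) ** 2) / n_fft      # energy distribution per frame
    return power_spec
```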
Further, the step S20 specifically includes the following steps:
s21: extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal;
s22: extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text comprises the following steps:
inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor with an added Flatten layer for deep learning, and obtaining the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));
where x_MFCC denotes the low-level feature vector of the MFCC and x_m denotes the resulting high-level feature vector;
inputting the low-level feature vector corresponding to the spectrogram into an AlexNet processor with an added Flatten layer for deep learning, and obtaining the high-level feature vector corresponding to the spectrogram:
x_s = Flatten(AlexNet(x_Spec));
where x_Spec denotes the low-level feature vector of the spectrogram and x_s denotes the resulting high-level feature vector;
inputting the voice waveform into a Wav2Vec processor for preprocessing, and performing deep learning on the preprocessed voice waveform through the Wav2Vec processor to obtain the high-level feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);
where x_Wav denotes the preprocessed voice waveform and x_w denotes the resulting high-level feature vector;
inputting the voice text into a BERT processor for preprocessing, and performing deep learning on the preprocessed voice text through the BERT processor to obtain the high-level feature vector corresponding to the voice text:
x_t = BERT(x_Text);
where x_Text denotes the preprocessed voice text and x_t denotes the resulting high-level feature vector.
according to the technical scheme, the low-level feature vectors, the voice waveforms and the voice texts corresponding to the MFCCs and the spectrograms are transmitted to the corresponding processors, and deep learning is performed through the deep learning model, so that the high-level sound features and the high-level feature vectors of the voice emotion of the voice signals can be obtained.
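A minimal PyTorch sketch of the two acoustic branches described above follows; the input sizes, hidden size and dropout placement are assumptions, while the two stacked bidirectional LSTM layers and the Flatten layer with dropout 0.5 follow the embodiment described later.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class MFCCBranch(nn.Module):
    """Two stacked bidirectional LSTM layers followed by a Flatten layer (dropout 0.5)."""
    def __init__(self, n_mfcc=40, hidden=128, dropout=0.5):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden,
                              num_layers=2, bidirectional=True, batch_first=True)
        self.flatten = nn.Sequential(nn.Flatten(), nn.Dropout(dropout))

    def forward(self, x_mfcc):                 # (batch, frames, n_mfcc)
        out, _ = self.bilstm(x_mfcc)           # (batch, frames, 2 * hidden)
        return self.flatten(out)               # x_m

class SpectrogramBranch(nn.Module):
    """AlexNet applied to the spectrogram (treated as an image), then Flatten + dropout."""
    def __init__(self, dropout=0.5):
        super().__init__()
        self.alexnet = alexnet()               # randomly initialised AlexNet backbone
        self.flatten = nn.Sequential(nn.Flatten(), nn.Dropout(dropout))

    def forward(self, x_spec):                 # (batch, 3, H, W), e.g. 224 x 224
        return self.flatten(self.alexnet(x_spec))   # x_s
```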
Further, the step S30 specifically includes the following steps:
vector fusion is carried out on the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform; first, the MFCC and spectrogram vectors are fused through a linear layer with an activation function:
x_ms = f_l(x_m ⊕ x_s);
where f_l(·) denotes the linear layer function, x_m denotes the high-level feature vector of the MFCC, x_s denotes the high-level feature vector of the spectrogram, ⊕ denotes vector concatenation, and x_ms denotes the fusion vector output by the linear layer;
the fusion vector x_ms then undergoes vector fusion with the waveform feature through a linear layer with an activation function:
x'_w = f_l(x_ms) ⊗ x_w;
where ⊗ denotes matrix multiplication and x'_w denotes the multi-acoustic fusion feature vector obtained after fusing the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform;
The high-level feature vector corresponding to the voice text is decompressed and mapped to a fixed dimension N, and a corresponding high-dimension vector is obtained:
x_d = f_De(x_t);
where f_De(·) denotes the decompression (decoding) mapping and x_d denotes the resulting N-dimensional vector;
compressing and mapping the advanced feature vector corresponding to the voice waveform to a fixed dimension N to obtain a corresponding dimension vector:
x_e = f_En(x_w);
where f_En(·) denotes the compression (encoding) mapping and x_e denotes the resulting N-dimensional vector;
the N-dimensional feature vectors corresponding to the voice text and the voice waveform are subjected to enhanced feature extraction through a cross-convolution attention mechanism, and a corresponding cross-mode single-head attention formula is obtained through self-adaptive weight distribution:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;
where W denotes the weight matrix of each layer, b denotes the bias coefficient, Q_w is obtained from the voice modality, and K_t and V_b are obtained from the text modality;
acquiring an embedded expression according to a cross-mode single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_t^T/√d)·V_b;
where d denotes the embedding size;
the embedded representation is added to x_t to obtain the fusion vector of the voice waveform and the voice text:
x'_t = LN(Attention(Q_w, K_t, V_b) + x_t);
where LN(·) is the layer normalization function and x'_t denotes the fusion vector of the voice waveform and the voice text.
according to the technical scheme, the advantages of the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform can be integrated by fusing the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform, information complementation is realized, and the correlation between the MFCC, the spectrogram and the voice waveform can be utilized to strengthen the feature output of the voice waveform path; by fusing the high-level feature vectors of the voice waveform and the voice text, the precise modeling and fusion of the voice and text features can be realized, the performance of the multi-mode learning task can be improved, and the generalization capability and the interpretability of the model can be enhanced.
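A hedged PyTorch sketch of the multi-acoustic fusion just described is given below; the feature dimensions are assumptions, the dropout value of 0.5 mirrors the later embodiment, and the interpretation of the ⊗ operation as an element-wise product of equal-length vectors is an assumption (the text describes it as a matrix multiplication).

```python
import torch
import torch.nn as nn

class MAFF(nn.Module):
    """Multi-Acoustic Feature Fusion: fuses MFCC, spectrogram and waveform features."""
    def __init__(self, d_m, d_s, d_w, dropout=0.5):
        super().__init__()
        # linear layer with ReLU activation applied to the concatenated (x_m, x_s)
        self.fuse = nn.Sequential(nn.Linear(d_m + d_s, d_w), nn.ReLU())
        # linear layer with dropout 0.5 applied to x_ms before combining with x_w
        self.proj = nn.Sequential(nn.Linear(d_w, d_w), nn.Dropout(dropout))

    def forward(self, x_m, x_s, x_w):
        x_ms = self.fuse(torch.cat([x_m, x_s], dim=-1))
        # the patent's multiplication with x_w is realised here as an
        # element-wise product of equal-length vectors (an assumption)
        x_w_fused = self.proj(x_ms) * x_w
        return x_w_fused            # x'_w, the multi-acoustic fusion feature vector
```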
Further, the step S40 specifically includes the following steps:
performing shape normalization on the MFCC, the spectrogram and the fused high-level feature vectors to obtain the concatenated vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x'_w), f_FFN(x'_t));
where f_FFN(·) denotes the feed-forward (FFN) module function, Concat(·) denotes the vector concatenation function, and x_concat denotes the concatenation result;
emotion recognition is then performed on the concatenated vector x_concat, and the process satisfies:
ŷ = f_CLS(x_concat);
where f_CLS(·) denotes the emotion classification function and ŷ denotes the emotion recognition result.
According to the above technical scheme, performing shape normalization on the MFCC, the spectrogram and the fused high-level feature vectors brings feature vectors of different dimensions to a uniform shape, eliminating the influence of differing vector dimensions on emotion recognition accuracy.
A multi-modal emotion recognition system, comprising:
the acquisition module is used for acquiring the MFCC, the spectrogram, the voice waveform and the voice text of the voice signal;
the vector extraction module is used for extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
the vector fusion module is used for fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
and the emotion recognition module is used for performing emotion recognition on the high-level feature vector obtained by fusing the MFCC, the spectrogram and the vector fusion module.
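A minimal sketch of how the four modules might be wired together is shown below; the callables passed in (acquisition, extraction, fusion, recognition) are placeholders for the components detailed elsewhere in this description, not an exact implementation.

```python
import torch.nn as nn

class MultiModalEmotionRecognizer(nn.Module):
    """Top-level wiring of the acquisition, vector extraction, vector fusion and
    emotion recognition modules (all four are placeholder components)."""
    def __init__(self, acquire, extract, fuse, classify):
        super().__init__()
        self.acquire, self.extract = acquire, extract
        self.fuse, self.classify = fuse, classify

    def forward(self, speech_signal, speech_text):
        mfcc, spec, wave = self.acquire(speech_signal)                     # S10
        x_m, x_s, x_w, x_t = self.extract(mfcc, spec, wave, speech_text)   # S20
        x_w_fused, x_t_fused = self.fuse(x_m, x_s, x_w, x_t)               # S30
        return self.classify(x_m, x_s, x_w_fused, x_t_fused)               # S40
```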
Further, the emotion recognition system further comprises a preprocessing module, wherein the preprocessing module is used for pre-emphasis, framing and windowing of the voice signal, the pre-emphasis processing is added through a high-pass filter equation, and the pre-emphasis processing process meets the following conditions:
y(t)=x(t)-αx(t-1);
in the formula, t represents the current moment, alpha represents a filter coefficient, x (t) represents an input value of a voice signal, and y (t) represents an output value of the voice signal;
the windowing process needs to be added through a Hamming window function, and the process of the windowing process meets the following conditions:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
where N denotes the window length of each frame of the voice signal, a denotes the Hamming window coefficient of each frame, n denotes the sampling point of the window function over the framed voice signal, and W(n, a) denotes the value of the Hamming window in the time domain;
carrying out energy distribution transformation on each frame of voice signal subjected to windowing processing on a frequency domain to obtain energy spectrums corresponding to the MFCC and the spectrogram, wherein the energy distribution transformation process meets the formula:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^(-j2πkn/N);
where S_i(k) denotes the complex value of the k-th frequency component of the i-th frame in the frequency domain, s_i(n) denotes the real value of the n-th sampling point in the time domain, k denotes the index of the frequency component, and N denotes the total number of sampling points.
Further, the vector extraction module carries out low-level feature vector extraction according to the energy spectrums corresponding to the MFCC and the spectrogram;
Inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor with an added Flatten layer for deep learning, and extracting the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));
where x_MFCC denotes the low-level feature vector of the MFCC and x_m denotes the extracted high-level feature vector;
inputting the low-level feature vector corresponding to the spectrogram into an AlexNet processor with an added Flatten layer for deep learning, and extracting the high-level feature vector corresponding to the spectrogram:
x_s = Flatten(AlexNet(x_Spec));
where x_Spec denotes the low-level feature vector of the spectrogram and x_s denotes the extracted high-level feature vector;
inputting the voice waveform into a Wav2Vec processor for preprocessing, performing deep learning on the preprocessed voice waveform through the Wav2Vec processor, and extracting the high-level feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);
where x_Wav denotes the preprocessed voice waveform and x_w denotes the extracted high-level feature vector;
inputting the voice text into a BERT processor for preprocessing, performing deep learning on the preprocessed voice text through the BERT processor, and extracting the high-level feature vector corresponding to the voice text:
x_t = BERT(x_Text);
where x_Text denotes the preprocessed voice text and x_t denotes the extracted high-level feature vector.
the vector extraction module decompresses and maps the advanced feature vector corresponding to the voice text to a fixed dimension N, and extracts the corresponding high-dimension vector:
x_d = f_De(x_t);
where f_De(·) denotes the decompression (decoding) mapping and x_d denotes the extracted N-dimensional high-dimensional vector;
the vector extraction module compresses and maps the high-level feature vector corresponding to the voice waveform to a fixed dimension N, and extracts a corresponding low-dimension vector:
x_e = f_En(x_w);
where f_En(·) denotes the compression (encoding) mapping and x_e denotes the extracted N-dimensional low-dimensional vector;
the vector extraction module performs enhanced feature extraction on the N-dimensional feature vectors corresponding to the voice text and the voice waveform through a cross-convolutional attention mechanism, and obtains the corresponding cross-modal single-head attention terms through adaptive weight assignment:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;
where W denotes the weight matrix of each layer, b denotes the bias coefficient, Q_w is obtained from the voice modality, and K_t and V_b are obtained from the text modality;
extracting an embedded expression according to a cross-mode single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_t^T/√d)·V_b;
where d denotes the embedding size;
the vector extraction module adds the embedded representation to x_t and extracts the fusion vector of the voice waveform and the voice text:
x'_t = LN(Attention(Q_w, K_t, V_b) + x_t);
where LN(·) is the layer normalization function and x'_t denotes the fusion vector of the voice waveform and the voice text.
further, the emotion recognition module performs shape normalization on the MFCC, the spectrogram and the fused corresponding advanced feature vector to obtain a spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x'_w), f_FFN(x'_t));
where f_FFN(·) denotes the feed-forward (FFN) module function, Concat(·) denotes the vector concatenation function, and x_concat denotes the concatenation result;
emotion recognition is performed on the concatenated vector x_concat, and the process satisfies:
ŷ = f_CLS(x_concat);
where f_CLS(·) denotes the emotion classification function and ŷ denotes the emotion recognition result.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a multi-modal emotion recognition method and system. By acquiring the MFCC, spectrogram, voice waveform and voice text of a voice signal together with their corresponding high-level feature vectors, and fusing the high-level feature vectors corresponding to the MFCC, spectrogram, voice waveform and voice text, information interaction and connection among multiple acoustic features within the voice modality are strengthened; by capturing and fusing the interaction information between the voice modality and the text modality, emotion information between the two modalities and information interaction in the high-dimensional subspace can be captured more comprehensively, so better emotion recognition performance is obtained, emotion recognition accuracy is significantly improved, and the training effect is further enhanced. This effectively solves the problems that single acoustic information in the voice modality and insufficient voice-text inter-modal information interaction leave the information underused and reduce emotion recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a multi-modal emotion recognition method provided by an embodiment of the present application;
FIG. 2 is a flow chart of extracting MFCCs provided by an embodiment of the present application;
FIG. 3 is a flowchart of generating a spectrogram according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a MAFF module according to an embodiment of the present application;
fig. 5 is a schematic diagram of a CMCA module according to an embodiment of the present application;
FIG. 6 is a general framework for multimodal emotion recognition provided by an embodiment of the present application;
wherein MAFF represents multi-acoustic feature fusion; CMCA represents cross-modal convolution attention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without inventive faculty, are intended to fall within the scope of the application.
Embodiment one:
the embodiment proposes a multi-mode emotion recognition method, referring to fig. 1, including the following steps:
s10: acquiring an MFCC, a spectrogram, a voice waveform and a voice text of a voice signal;
s20: acquiring high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
s30: fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
s40: carrying out emotion recognition on the MFCC, the spectrogram and the advanced feature vector obtained after fusion in the step S30;
wherein MFCC represents mel-frequency cepstral coefficients.
It can be understood that by acquiring the MFCC, the spectrogram, the voice waveform, the voice text and the corresponding advanced feature vectors thereof, and fusing the MFCC, the spectrogram, the voice waveform and the corresponding advanced feature vectors of the voice text, information interaction and connection of various acoustic features in the voice mode are enhanced, and interaction information between the voice mode and the text mode is fused, emotion information between the two modes can be acquired more comprehensively, and better performance is obtained.
In this embodiment, the step S10 specifically includes the following steps:
referring to fig. 2 and 3, the process of acquiring MFCCs and spectrograms includes the steps of:
S11: pre-emphasis processing is carried out on the voice signal;
in step S11, the pre-emphasis process needs to be added by a high-pass filter equation, and the pre-emphasis process meets the following requirements:
y(t)=x(t)-αx(t-1);
in the formula, t represents the current time, α represents a filter coefficient, x (t) represents an input value of a voice signal, and y (t) represents an output value of the voice signal.
It will be appreciated that by pre-emphasis processing, the speech signal is passed through a high pass filter, and the speech signal is amplified at high frequencies using the pre-emphasis filter.
S12: carrying out framing treatment on the voice signal subjected to pre-emphasis treatment;
it will be appreciated that after pre-emphasis, the signal needs to be split into short time frames. In most cases, the speech signal is non-stationary and fourier transforming the whole signal is meaningless, as the frequency profile of the signal is lost over time. The speech signal subjected to framing processing is a short-time stationary signal. We therefore perform a fourier transform on the short-term frames to obtain a good approximation of the signal frequency profile by concatenating adjacent frames. The window is 20 ms, the frame shift is 10 ms, and the overlap between two adjacent frames is 50%.
S13: and windowing the voice signal subjected to framing.
In step S13, windowing is applied with a Hamming window function, and the windowing process satisfies:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
where N denotes the window length of each frame of the voice signal, a denotes the Hamming window coefficient of each frame, n denotes the sampling point of the window function over the framed voice signal, and W(n, a) denotes the value of the Hamming window in the time domain;
it can be appreciated that after the speech signal is subjected to framing, in order to reduce spectrum leakage, each frame needs to be multiplied by a window function, so that spectrum leakage is effectively reduced.
S14: carrying out energy distribution transformation on each frame of voice signal subjected to windowing treatment on a frequency domain to obtain an energy spectrum of the voice signal, wherein the energy distribution transformation process meets the formula:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^(-j2πkn/N);
where S_i(k) denotes the complex value of the k-th frequency component of the i-th frame in the frequency domain, s_i(n) denotes the real value of the n-th sampling point in the time domain, k denotes the index of the frequency component, and N denotes the total number of sampling points.
It can be understood that for each frame of voice signal after pre-emphasis, framing and windowing, in order to convert the voice signal into energy distribution on a frequency domain, FFT (fast fourier transform) is required to be performed, so as to obtain an energy spectrum corresponding to the MFCC and the spectrogram; wherein, the MFCC can reflect the energy, frequency spectrum and formant information of the voice signal, and is commonly used for the characteristic analysis of the voice signal; the spectrogram can reflect acoustic information such as frequency spectrum, frequency change, tone and the like, and is commonly used for extracting voice emotion analysis characteristics.
According to the technical scheme provided in steps S11-S14, by pre-emphasis processing of the voice signal, the voice signal can be better amplified at high frequency, the voice signal after the high frequency amplification is subjected to framing processing, the whole stable voice signal is framed into a short-time stable signal, and then windowing processing is performed on the short-time stable signal to reduce frequency spectrum leakage, so that the voice signal can be better subjected to energy distribution transformation on a frequency domain, and further energy spectrums corresponding to the MFCC and the spectrogram are obtained.
In this embodiment, the step S20 specifically includes the following steps:
s21: extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal;
it can be appreciated that, extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal according to the energy spectrum of the voice signal;
referring to fig. 2, low-level feature vectors corresponding to MFCCs are extracted, and energy spectrum is passed through a set of Mel-scale triangular filter banks to obtain filter bank coefficients, which are intended to simulate the human auditory system, and since the filter bank coefficients have high correlation, a decorrelation process using DCT (discrete cosine transform) is required to be performed, so as to obtain the low-level feature vectors corresponding to MFCCs, and the DCT process satisfies the formula:
C(n) = Σ_{m=1}^{M} s(m)·cos(πn(m - 0.5)/M), n = 1, 2, …, L;
where L is the order of the MFCC coefficients, M is the number of triangular filters, C(n) is the n-th coefficient in the frequency domain, and s(m) is the m-th sample value in the time domain;
referring to fig. 3, in the process of extracting the low-level feature vector of the spectrogram, the logarithmic power calculation is performed on the energy spectrum obtained after the FFT transformation, so as to obtain the low-level feature vector corresponding to the spectrogram.
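The following NumPy/SciPy sketch illustrates the two low-level feature paths just described (Mel filter bank plus DCT for the MFCC, logarithmic power for the spectrogram); the filter-bank size, MFCC order and use of librosa's Mel filters are assumed example choices.

```python
import numpy as np
from scipy.fftpack import dct
import librosa

def low_level_features(power_spec, sample_rate=16000, n_fft=512,
                       n_mels=26, n_mfcc=13):
    """Mel filter bank + log + DCT for the MFCC; log power for the spectrogram."""
    # Mel-scale triangular filter bank applied to the per-frame energy spectrum
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    filterbank_energy = np.dot(power_spec, mel_fb.T)          # (frames, n_mels)
    log_fb = np.log(filterbank_energy + 1e-10)

    # DCT decorrelation: C(n) = sum_m s(m) * cos(pi*n*(m - 0.5)/M), n = 1..L
    x_mfcc = dct(log_fb, type=2, axis=-1, norm='ortho')[:, :n_mfcc]

    # spectrogram: logarithmic power of the FFT energy spectrum
    x_spec = 10.0 * np.log10(power_spec + 1e-10)
    return x_mfcc, x_spec
```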
S22: extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text comprises the following steps:
inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor with an added Flatten layer for deep learning, and obtaining the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));
where x_MFCC denotes the low-level feature vector of the MFCC and x_m denotes the resulting high-level feature vector.
It can be understood that the BiLSTM processor includes a BiLSTM module formed by stacking two layers of bidirectional LSTM networks; a Flatten layer with a dropout of 0.5 is added after the BiLSTM module, and the low-level feature vector corresponding to the MFCC is then input into the BiLSTM module for high-level feature vector extraction.
Inputting the low-level feature vector corresponding to the spectrogram into an AlexNet processor with an added Flatten layer for deep learning, and obtaining the high-level feature vector corresponding to the spectrogram:
x_s = Flatten(AlexNet(x_Spec));
where x_Spec denotes the low-level feature vector of the spectrogram and x_s denotes the resulting high-level feature vector.
It can be understood that the AlexNet processor comprises an AlexNet module; a Flatten layer with a dropout of 0.5 is added after the AlexNet module, and the low-level feature vector corresponding to the spectrogram is then input into the AlexNet module for high-level feature vector extraction.
Inputting the voice waveform into a Wav2Vec processor for preprocessing, and performing deep learning on the preprocessed voice waveform through the Wav2Vec processor to obtain the high-level feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);
where x_Wav denotes the preprocessed voice waveform and x_w denotes the resulting high-level feature vector.
it can be understood that the Wav2Vec processor comprises a Wav2Vec module, and the voice waveform preprocessed by the Wav2Vec module can conform to the input form of the Wav2Vec model, so that the advanced feature vector corresponding to the voice waveform can be better extracted.
Inputting the voice text into a BERT processor for preprocessing, and performing deep learning on the preprocessed voice text through the BERT processor to obtain the high-level feature vector corresponding to the voice text:
x_t = BERT(x_Text);
where x_Text denotes the preprocessed voice text and x_t denotes the resulting high-level feature vector.
It can be understood that the BERT processor comprises a BERT module; the voice text preprocessed by the BERT module conforms to the input form of the BERT model, so the high-level feature vector corresponding to the voice text can be better extracted.
According to the technical scheme provided in steps S21-S22, the low-level feature vectors, the voice waveforms and the voice texts corresponding to the MFCCs and the spectrograms are transmitted to the corresponding processors, and deep learning is performed through the deep learning model, so that the high-level feature vectors of the voice features and the voice emotions of the voice signals at higher levels can be obtained.
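A hedged sketch of the two pretrained branches (Wav2Vec for the waveform, BERT for the text) using the Hugging Face transformers library is given below; the specific checkpoints ("facebook/wav2vec2-base", "bert-base-chinese") are illustrative assumptions, as the text does not name them.

```python
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor, BertModel, BertTokenizer

# checkpoint names are illustrative assumptions
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
wav_prep = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
bert = BertModel.from_pretrained("bert-base-chinese")
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def extract_pretrained_features(waveform, text, sample_rate=16000):
    # waveform branch: last hidden state of the Wav2Vec model
    wav_inputs = wav_prep(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        x_w = wav2vec(**wav_inputs).last_hidden_state        # (1, T', hidden)

    # text branch: the CLS representation of the BERT model
    text_inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        x_t = bert(**text_inputs).last_hidden_state[:, 0]    # (1, hidden), CLS vector
    return x_w, x_t
```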
In this embodiment, the step S30 specifically includes the following steps:
Referring to fig. 4, in the MAFF module (Multi-Acoustic Feature Fusion Module), the MFCC, the spectrogram and the speech waveform come from the same source but have different information characterization capabilities and can capture richer, more complete emotion information from different angles; using only the feature vector output by Wav2Vec easily loses part of the feature information, so the MAFF module is provided. The feature x_w of the voice waveform output by the Wav2Vec model, the feature x_m of the MFCC output by BiLSTM and the feature x_s of the spectrogram output by AlexNet, three different high-level acoustic features, are fused, and the correlation among them is used to strengthen the feature output of the voice waveform path. Vector fusion is first performed on the high-level feature vectors corresponding to the MFCC and the spectrogram through a linear layer whose activation function is ReLU:
x_ms = f_l(x_m ⊕ x_s);
where f_l(·) denotes the linear (Linear) layer function, x_m denotes the high-level feature vector of the MFCC, x_s denotes the high-level feature vector of the spectrogram, ⊕ denotes vector concatenation, and x_ms denotes the fusion vector output by the linear layer;
the fusion vector x_ms is passed through a linear layer with a dropout of 0.5, and the result is multiplied with x_w to complete the vector fusion:
x'_w = f_l(x_ms) ⊗ x_w;
where ⊗ denotes matrix multiplication and x'_w denotes the multi-acoustic fusion feature vector obtained after fusing the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform;
it can be appreciated that the MAFF module can effectively fuse the MFCCs, the spectrograms, and the advanced feature vectors corresponding to the speech waveforms, integrate the advantages between them, implement information complementation, and output the final result as a speech waveform path.
Referring to fig. 5, the CMCA module (Cross-Modal Convolutional Attention Module) processes information from different modalities and accepts two inputs: one is the last-layer state output feature vector of the Wav2Vec model, and the other is the CLS output vector of the BERT model. To merge these two sources of information, the CMCA module contains a convolutional encoder, a convolutional decoder and a cross-modal attention mechanism.
The high-level feature vector corresponding to the voice text is decompressed and mapped to a fixed dimension N through a convolution decoder, and a corresponding high-dimension vector is obtained:
x_d = f_De(x_t);
where f_De(·) denotes the convolutional decoder and x_d denotes the resulting N-dimensional vector.
It will be appreciated that text feature extraction uses the CLS-layer output of the BERT model, which can be regarded as an overall semantic representation of the input text that integrates context information such as vocabulary, grammar and sentence structure. However, it compresses the various features into a one-dimensional vector, which hinders information-interaction fusion between different modalities. A convolutional decoder is therefore designed to decompress the one-dimensional CLS vector output by the BERT model into a higher-dimensional vector (i.e., the one-dimensional high-level feature vector of the voice text is decompressed into a feature vector of dimension N, with N ranging, for example, from 20 to 30). This facilitates extraction of the key information in the text and makes the text feature more suitable for fusion with the voice modality feature. The convolutional decoder consists of two one-dimensional convolutions, a MaxPooling layer and a Flatten layer.
Compressing and mapping the advanced feature vector corresponding to the voice waveform to a fixed dimension N through a convolution encoder to obtain a corresponding low-dimension vector:
x_e = f_En(x_w);
where f_En(·) denotes the convolutional encoder and x_e denotes the resulting N-dimensional vector.
it will be appreciated that the last layer of the output vector of the Wav2Vec model contains the timing structure information and advanced features of the audio signal, such as intonation, pronunciation, sound texture, etc. of the speaker, but the vector dimension is too high, and is directly used for cross-modal attention, so that the vector redundancy is caused, the calculation amount is too large, and the dimensions are not matched, therefore, a convolutional encoder module is designed, and the convolutional encoder can compress the features output by the Wav2Vec model into a lower-dimension representation (i.e. compress and map the feature vector into an N-dimension feature vector, for example, compress and map the 149-dimension advanced feature vector of the speech waveform into an N-dimension feature vector through the convolutional encoder, and the range of N is 20-30), while retaining the key information in the audio signal, which helps to improve the calculation efficiency, and make the audio feature more suitable for being fused with the text feature, and the composition structure is similar to that of the convolutional decoder.
The N-dimensional feature vectors corresponding to the voice text and the voice waveform are subjected to enhanced feature extraction through a cross-convolution attention mechanism, and a corresponding cross-mode single-head attention formula is obtained through self-adaptive weight distribution:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;
where W denotes the weight matrix of each layer, b denotes the bias coefficient, Q_w is obtained from the voice modality, and K_t and V_b are obtained from the text modality;
It can be understood that the same text spoken with different intonation and rhythm expresses different emotions. To integrate the voice and text modalities, a cross-modal attention mechanism is designed: the two vectors of different dimensions for the voice text and the voice waveform are mapped to the same dimension N, and multi-head attention is used to enhance feature extraction. The cross-modal attention mechanism computes the correlation between the audio features and the text features and assigns a weight to each feature. In this way, the attention mechanism can effectively capture the association between data of different modalities, automatically attend to key information, assign weights, and finally fuse the weighted features. The mechanism strengthens information interaction between the two modalities, focuses on key features through adaptive weight assignment, explicitly models the association between modalities, and achieves accurate modeling and effective fusion of the multiple modalities. Its structure is similar to a cross-modal multi-head attention mechanism, and, without loss of generality, the corresponding cross-modal single-head attention formula is obtained.
Acquiring an embedded expression according to a cross-mode single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_t^T/√d)·V_b;
where d denotes the embedding size. Multi-head attention performs this process multiple times to obtain information from different representation subspaces; a three-head attention mechanism is used here to ensure that the output of each branch contains enough information from the other modality.
The embedded representation is added to x_t to obtain the fusion vector of the voice waveform and the voice text:
x'_t = LN(Attention(Q_w, K_t, V_b) + x_t);
where LN(·) is the layer normalization function and x'_t denotes the fusion vector of the voice waveform and the voice text.
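The following PyTorch sketch assembles the CMCA components described above (convolutional decoder, convolutional encoder, cross-modal attention with a residual layer normalization); it shows a single attention head for brevity, whereas the text uses three heads, and all layer sizes (for example N = 24 tokens and the kernel sizes) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMCA(nn.Module):
    """Cross-Modal Convolutional Attention: a sketch under assumed layer sizes."""
    def __init__(self, d_text=768, d_wave=768, n_tokens=24, d_attn=64):
        super().__init__()
        # decoder: two 1-D convolutions + max-pooling + flatten, reshaped to N tokens
        self.decoder = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2), nn.Flatten(),
            nn.Linear(8 * (d_text // 2), n_tokens * d_attn))
        # encoder: compresses the Wav2Vec feature sequence to N tokens
        self.encoder = nn.Sequential(
            nn.Conv1d(d_wave, d_attn, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(n_tokens))
        self.W_q = nn.Linear(d_attn, d_attn)   # Q from the voice modality
        self.W_k = nn.Linear(d_attn, d_attn)   # K from the text modality
        self.W_v = nn.Linear(d_attn, d_attn)   # V from the text modality
        self.out = nn.Linear(n_tokens * d_attn, d_text)
        self.ln = nn.LayerNorm(d_text)
        self.n_tokens, self.d_attn = n_tokens, d_attn

    def forward(self, x_w, x_t):
        # x_w: (batch, T, d_wave) Wav2Vec sequence; x_t: (batch, d_text) CLS vector
        x_d = self.decoder(x_t.unsqueeze(1)).view(-1, self.n_tokens, self.d_attn)
        x_e = self.encoder(x_w.transpose(1, 2)).transpose(1, 2)   # (batch, N, d_attn)
        q, k, v = self.W_q(x_e), self.W_k(x_d), self.W_v(x_d)
        attn = F.softmax(q @ k.transpose(1, 2) / self.d_attn ** 0.5, dim=-1)
        fused = self.out((attn @ v).flatten(1))                   # back to d_text
        return self.ln(fused + x_t)                               # x'_t with residual + LN
```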
it can be understood that the CMCA module serves as an effective cross-modal convolution attention module, and through the organic combination of the convolution encoder, the convolution decoder and the cross-modal attention mechanism, the precise modeling and fusion of the voice waveform and the voice text characteristics are realized, the performance of the multi-modal learning task can be improved, and the generalization capability and the interpretability of the model are enhanced.
Further, the step S40 includes the steps of:
Referring to FIG. 6, the overall framework includes a classifier (i.e., a classifier module) for performing emotion recognition on the MFCC, the spectrogram and the fused high-level feature vectors, and an FFN module for passing the MFCC, the spectrogram and the fused high-level feature vectors to the classifier. Because each feature vector has a different dimension before being input into the classifier module, all feature vectors must pass through the FFN module and be shaped into a regular form, which eliminates the influence of differing vector dimensions on emotion recognition accuracy and makes it convenient to place the feature vectors into the classifier for classification. The FFN module consists of a Flatten layer and one or more Linear layers; the vectors passing through the FFN are put into the classifier for classification. The classifier consists of a concatenation layer, two Linear layers and a Dropout layer; the concatenation layer splices the feature vectors passed in by the FFN module, namely:
the MFCC, the spectrogram and the fused high-level feature vectors are shape-normalized to obtain the concatenated vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x'_w), f_FFN(x'_t));
where f_FFN(·) denotes the feed-forward (FFN) module function, Concat(·) denotes the vector concatenation function, and x_concat denotes the concatenation result;
emotion recognition is performed on the concatenated vector x_concat, and the process satisfies:
ŷ = f_CLS(x_concat);
where f_CLS(·) denotes the emotion classification function and ŷ denotes the emotion recognition result;
a cross-entropy function is used as the loss function for emotion classification:
Loss = -Σ_i y_i·log(ŷ_i);
where y_i denotes the true label of emotion class i and ŷ_i denotes the predicted probability of class i.
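A minimal PyTorch sketch of the FFN module, the classifier and the cross-entropy loss described above follows; the common branch dimension and the four-class output are assumed example values.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Flatten + Linear: shapes each branch vector into a common dimension."""
    def __init__(self, d_in, d_out=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(d_in, d_out))

    def forward(self, x):
        return self.net(x)

class Classifier(nn.Module):
    """Concatenation layer, two Linear layers and a Dropout layer."""
    def __init__(self, d_branch=128, n_classes=4, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4 * d_branch, d_branch), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_branch, n_classes))

    def forward(self, x_m, x_s, x_w_fused, x_t_fused):
        # inputs are assumed to have already passed through their FFN modules
        x_concat = torch.cat([x_m, x_s, x_w_fused, x_t_fused], dim=-1)
        return self.net(x_concat)          # emotion logits, i.e. the prediction ŷ

# training uses cross-entropy as the emotion-classification loss
criterion = nn.CrossEntropyLoss()
# loss = criterion(logits, labels)
```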
In the present embodiment as a whole, through step S10: acquiring the MFCC, spectrogram, voice waveform and voice text of a voice signal; S20: acquiring the high-level feature vectors corresponding to the MFCC, spectrogram, voice waveform and voice text; S30: fusing the high-level feature vectors corresponding to the MFCC, spectrogram, voice waveform and voice text; and S40: performing emotion recognition on the MFCC, spectrogram and fused high-level feature vectors, this embodiment fully combines three acoustic features of the voice modality (the waveform, the MFCC and the spectrogram) with the text modality and proposes a multi-modal emotion recognition method. A multi-acoustic feature fusion module is provided to fuse the three acoustic features of the MFCC, the spectrogram and the voice waveform, and a cross-modal convolutional attention is provided for interactive fusion of voice-modality and text-modality information. With the multi-modal emotion recognition method provided by this embodiment, multi-level acoustic features can be fused, strengthening the information interaction and connection among multiple acoustic features within the voice modality; the interaction information between the voice modality and the text modality can be captured and fused, so that emotion information between the two modalities and information interaction in the high-dimensional subspace are captured more comprehensively, better emotion recognition performance is obtained, emotion recognition accuracy is significantly improved, and the training effect is further enhanced. This effectively solves the problems that single acoustic information in the voice modality and insufficient voice-text inter-modal information interaction leave the information underused and reduce emotion recognition accuracy.
Embodiment two:
a multi-modal emotion recognition system, comprising:
the acquisition module is used for acquiring the MFCC, the spectrogram, the voice waveform and the voice text of the voice signal;
the vector extraction module is used for extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
the vector fusion module is used for fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
and the emotion recognition module is used for performing emotion recognition on the high-level feature vector obtained by fusing the MFCC, the spectrogram and the vector fusion module.
The emotion recognition system also comprises a preprocessing module.
Further, the preprocessing module is configured to perform pre-emphasis, framing and windowing processing on a voice signal, where the pre-emphasis processing needs to be added through a high-pass filter equation, and a process of the pre-emphasis processing satisfies:
y(t)=x(t)-αx(t-1);
in the formula, t represents the current moment, alpha represents a filter coefficient, x (t) represents an input value of a voice signal, and y (t) represents an output value of the voice signal;
the windowing process needs to be added through a Hamming window function, and the process of the windowing process meets the following conditions:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
where N denotes the window length of each frame of the voice signal, a denotes the Hamming window coefficient of each frame, n denotes the sampling point of the window function over the framed voice signal, and W(n, a) denotes the value of the Hamming window in the time domain;
The preprocessing module performs energy distribution transformation on each frame of voice signal subjected to windowing processing on a frequency domain to obtain energy spectrums corresponding to the MFCC and the spectrogram, and the energy distribution transformation process meets the formula:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^(-j2πkn/N);
where S_i(k) denotes the complex value of the k-th frequency component of the i-th frame in the frequency domain, s_i(n) denotes the real value of the n-th sampling point in the time domain, k denotes the index of the frequency component, and N denotes the total number of sampling points.
Further, the vector extraction module is used for extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal;
the vector extraction module inputs the low-level feature vector corresponding to the MFCC into the BiLSTM processor added with the flat layer for deep learning, and extracts the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));
where x_MFCC denotes the low-level feature vector of the MFCC and x_m denotes the extracted high-level feature vector;
the vector extraction module inputs low-level feature vectors corresponding to the spectrogram into an AlexNet processor added with a flat layer for deep learning, and extracts high-level feature vectors corresponding to the spectrogram:
x_s = Flatten(AlexNet(x_Spec));
where x_Spec denotes the low-level feature vector of the spectrogram and x_s denotes the extracted high-level feature vector;
the preprocessing module inputs the voice waveform into the Wav2Vec processor for preprocessing, and the vector extraction module carries out deep learning on the preprocessed voice waveform through the Wav2Vec processor to extract an advanced feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);
where x_Wav denotes the preprocessed voice waveform and x_w denotes the extracted high-level feature vector;
the preprocessing module inputs the voice text into the BERT processor for preprocessing, and the vector extraction module carries out deep learning on the preprocessed voice text through the BERT processor to extract advanced feature vectors corresponding to the voice text:
x_t = BERT(x_Text);
where x_Text denotes the preprocessed voice text and x_t denotes the extracted high-level feature vector;
the vector extraction module decompresses and maps the advanced feature vector corresponding to the voice text to a fixed dimension N, and extracts the corresponding high-dimension vector:
x_d = f_De(x_t);
where f_De(·) denotes the decompression (decoding) mapping and x_d denotes the extracted N-dimensional high-dimensional vector;
it can be understood that the vector extraction module includes a decoder module, where the decoder module maps a low-dimensional advanced feature vector corresponding to the phonetic text to a higher dimension for decompressing feature information and extracting a corresponding high-dimensional vector;
the vector extraction module compresses and maps the high-level feature vector corresponding to the voice waveform to a fixed dimension N, and extracts a corresponding low-dimension vector:
x_e = f_En(x_w);
where f_En(·) denotes the encoder function and x_e denotes the corresponding N-dimensional low-dimensional vector;
it can be understood that the vector extraction module further includes an encoder module, where the encoder module maps the high-dimensional high-level feature vector corresponding to the voice waveform to a lower dimension, so as to compress the feature information and extract the corresponding low-dimensional vector;
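A sketch of the two mappings f_De and f_En is given below, assuming illustrative input sizes of 768 for x_t and 1024 for x_w and a shared target dimension N = 512 (none of which are specified here); simple linear projections with an activation stand in for the decoder and encoder modules.

```python
import torch
import torch.nn as nn

N = 512                                                 # assumed shared target dimension

f_de = nn.Sequential(nn.Linear(768, N), nn.ReLU())      # decoder: text vector up to N dims
f_en = nn.Sequential(nn.Linear(1024, N), nn.ReLU())     # encoder: waveform vector down to N dims

x_t = torch.randn(2, 768)                               # high-level text features (illustrative size)
x_w = torch.randn(2, 1024)                              # high-level waveform features (illustrative size)
x_d = f_de(x_t)                                         # (2, N) decompressed text vector
x_e = f_en(x_w)                                         # (2, N) compressed waveform vector
```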
the vector extraction module performs enhanced feature extraction on the N-dimensional feature vectors corresponding to the voice text and the voice waveform through a cross-modal attention mechanism, and obtains the query, key and value terms of the cross-modal single-head attention through adaptive weight assignment:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;
where W denotes the weight matrix of each layer and b denotes the bias coefficient; Q_w is obtained from the speech modality, while K_t and V_b are obtained from the text modality;
the embedded representation is extracted according to the cross-modal single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_tᵀ/√d)·V_b;
where d denotes the embedding dimension;
the vector extraction module adds the embedded representation to x_t and extracts the fusion vector of the voice waveform and the voice text:
x′_t = LN(Attention(Q_w, K_t, V_b) + x_t);
where LN(·) is the layer normalization function and x′_t denotes the fusion vector of the voice waveform and the voice text.
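The cross-modal attention and the subsequent residual layer normalization can be sketched as follows; the dimension of 512 and the assumption that the speech and text sequences are already aligned to the same length are illustrative choices, not requirements of this disclosure.

```python
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Single-head cross-attention: queries from speech, keys/values from text, then residual + LayerNorm."""
    def __init__(self, dim=512):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)      # Q_w = W_w * x_e + b_w
        self.w_k = nn.Linear(dim, dim)      # K_t = W_t * x_d + b_t
        self.w_v = nn.Linear(dim, dim)      # V_b = W_b * x_d + b_b
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_e, x_d, x_t):
        # x_e, x_d, x_t: (batch, seq, dim); speech and text sequences assumed equal in length
        q, k, v = self.w_q(x_e), self.w_k(x_d), self.w_v(x_d)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # scaled dot products
        attended = torch.softmax(scores, dim=-1) @ v               # Attention(Q_w, K_t, V_b)
        return self.norm(attended + x_t)                           # x'_t = LN(Attention + x_t)

# x_t_fused = CrossModalAttention()(torch.randn(2, 8, 512), torch.randn(2, 8, 512), torch.randn(2, 8, 512))
```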
Further, the emotion recognition module performs shape normalization on the high-level feature vectors corresponding to the MFCC and the spectrogram and on the fused feature vectors to obtain a spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x′_w), f_FFN(x′_t));
where f_FFN(·) represents the forward propagation module function, Concat(·) represents the vector concatenation function, and x_concat represents the vector splicing result;
emotion recognition is carried out according to the spliced vector x_concat, and the process satisfies:
ŷ = f_CLS(x_concat);
where f_CLS(·) represents the emotion classification function and ŷ represents the emotion recognition result.
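A hedged end-to-end sketch of the shape normalization, splicing and classification stage follows; the per-modality dimension of 512, the common output size of 256 and the seven emotion classes are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

def make_ffn(in_dim, out_dim=256):
    """f_FFN: a small feed-forward block mapping one modality vector to a common shape."""
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

# Illustrative sizes; the real dimensions of x_m, x_s, x'_w and x'_t are not fixed here
ffn_m, ffn_s, ffn_w, ffn_t = make_ffn(512), make_ffn(512), make_ffn(512), make_ffn(512)
f_cls = nn.Linear(4 * 256, 7)                   # classifier over x_concat; 7 emotion classes assumed

x_m, x_s, x_w_fused, x_t_fused = (torch.randn(2, 512) for _ in range(4))
x_concat = torch.cat([ffn_m(x_m), ffn_s(x_s), ffn_w(x_w_fused), ffn_t(x_t_fused)], dim=-1)
y_hat = f_cls(x_concat).argmax(dim=-1)          # predicted emotion label for each utterance
```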
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (10)

1. A multi-modal emotion recognition method, comprising the steps of:
s10: acquiring an MFCC, a spectrogram, a voice waveform and a voice text of a voice signal;
s20: acquiring high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
s30: fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
s40: carrying out emotion recognition on the MFCC, the spectrogram and the advanced feature vector obtained after fusion in the step S30;
wherein MFCC represents mel-frequency cepstral coefficients.
2. The method of claim 1, wherein the step S10 further comprises the steps of:
s11: pre-emphasis processing is carried out on the voice signal;
s12: carrying out framing treatment on the voice signal subjected to pre-emphasis treatment;
s13: windowing the voice signal subjected to framing treatment;
in step S11, the pre-emphasis process needs to be added by a high-pass filter equation, and the pre-emphasis process meets the following requirements:
y(t)=x(t)-αx(t-1);
wherein t represents the current moment, α represents the filter coefficient, x(t) represents the input value of the voice signal, and y(t) represents the output value of the voice signal;
In step S13, the windowing process needs to be added through a Hamming window function, and the process of the windowing process satisfies:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
wherein N represents the window length of each frame of the voice signal, a represents the Hamming window coefficient, n indexes the sampling points of the framed voice signal within the window, and W(n, a) represents the value of the Hamming window in the time domain;
s14: carrying out energy distribution transformation on each frame of voice signal subjected to windowing treatment on a frequency domain to obtain an energy spectrum of the voice signal, wherein the energy distribution transformation process meets the formula:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^{-j2πnk/N}, 1 ≤ k ≤ N;
wherein S_i(k) represents the complex value of the kth frequency component of the voice signal in the frequency domain, s_i(n) represents the real value of the nth sampling point in the time domain, k is the index of the frequency component, and N is the total number of sampling points.
3. The method for identifying multi-modal emotion according to claim 2, wherein said step S20 comprises the steps of:
s21: extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal;
s22: extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text comprises the following steps:
inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor added with a flat layer for deep learning, and obtaining the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));
wherein x_MFCC denotes the low-level feature vector of the MFCC and x_m denotes the obtained high-level feature vector of the MFCC;
inputting the low-level feature vector corresponding to the spectrogram into an AlexNet processor added with a flat layer for deep learning, and obtaining the high-level feature vector corresponding to the spectrogram:
x_s = Flatten(AlexNet(x_Spec));
wherein x_Spec denotes the low-level feature vector of the spectrogram and x_s denotes the obtained high-level feature vector of the spectrogram;
inputting the voice waveform into a Wav2Vec processor for preprocessing, and performing deep learning on the preprocessed voice waveform through the Wav2Vec processor to obtain an advanced feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);
wherein x_Wav denotes the raw voice waveform and x_w denotes the obtained high-level feature vector of the voice waveform;
inputting the voice text into a BERT processor for preprocessing, and performing deep learning on the preprocessed voice text through the BERT processor to obtain a high-level feature vector corresponding to the voice text:
x_t = BERT(x_Text);
wherein x_Text denotes the voice text and x_t denotes the obtained high-level feature vector of the voice text;
4. A multi-modal emotion recognition method as claimed in claim 3, wherein said step S30 specifically includes the steps of:
performing vector fusion on the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform through linear layers with activation functions:
x_ms = f_l(x_m ⊕ x_s);
wherein f_l(·) represents a linear layer function with an activation function, x_m represents the high-level feature vector of the MFCC, x_s represents the high-level feature vector of the spectrogram, ⊕ represents vector concatenation, and x_ms represents the fusion vector output by the linear layer;
the fusion vector x_ms is then fused with the high-level feature vector of the voice waveform through a linear layer with an activation function:
x′_w = f_l(x_ms ⊗ x_w);
wherein ⊗ represents matrix multiplication, and x′_w represents the multi-acoustic fusion feature vector obtained after fusing the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform.
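A hedged sketch of this two-stage acoustic fusion is given below; the shared dimension of 512 and the Tanh activation are assumptions, and because the shapes involved in the matrix multiplication ⊗ are not spelled out in the claim, the sketch substitutes an elementwise (Hadamard) product between the two utterance-level vectors.

```python
import torch
import torch.nn as nn

dim = 512                                                   # assumed shared feature dimension
f_l1 = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())    # linear layer with activation
f_l2 = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

x_m, x_s, x_w = torch.randn(2, dim), torch.randn(2, dim), torch.randn(2, dim)
x_ms = f_l1(torch.cat([x_m, x_s], dim=-1))                  # x_ms = f_l(x_m ⊕ x_s)
x_w_fused = f_l2(x_ms * x_w)                                # Hadamard product used here in place of ⊗
```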
5. The multi-modal emotion recognition method of claim 4, wherein step S30 further includes the steps of:
the high-level feature vector corresponding to the voice text is decompressed and mapped to a fixed dimension N, and a corresponding high-dimension vector is obtained:
x_d = f_De(x_t);
wherein f_De(·) denotes the decoder function and x_d denotes the corresponding N-dimensional high-dimensional vector;
compressing and mapping the advanced feature vector corresponding to the voice waveform to a fixed dimension N to obtain a corresponding low-dimension vector:
x_e = f_En(x_w);
wherein f_En(·) denotes the encoder function and x_e denotes the corresponding N-dimensional low-dimensional vector;
the N-dimensional feature vectors corresponding to the voice text and the voice waveform are subjected to enhanced feature extraction through a cross-modal attention mechanism, and the query, key and value terms of the corresponding cross-modal single-head attention are obtained through adaptive weight assignment:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;
wherein W denotes the weight matrix of each layer and b denotes the bias coefficient; Q_w is obtained from the speech modality, while K_t and V_b are obtained from the text modality;
the embedded representation is acquired according to the cross-modal single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_tᵀ/√d)·V_b;
wherein d denotes the embedding dimension;
the embedded representation is added to x_t to obtain the fusion vector of the voice waveform and the voice text:
x′_t = LN(Attention(Q_w, K_t, V_b) + x_t);
wherein LN(·) is the layer normalization function and x′_t denotes the fusion vector of the voice waveform and the voice text.
6. the method of claim 5, wherein the step S40 further comprises the steps of:
performing shape normalization on the high-level feature vectors corresponding to the MFCC and the spectrogram and on the fused feature vectors to obtain a spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x′_w), f_FFN(x′_t));
wherein f_FFN(·) represents the forward propagation module function, Concat(·) represents the vector concatenation function, and x_concat represents the vector splicing result;
emotion recognition is carried out according to the spliced vector x_concat, and the process satisfies:
ŷ = f_CLS(x_concat);
wherein f_CLS(·) represents the emotion classification function and ŷ represents the emotion recognition result.
7. A multi-modal emotion recognition system, comprising:
the acquisition module is used for acquiring the MFCC, the spectrogram, the voice waveform and the voice text of the voice signal;
the vector extraction module is used for extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
the vector fusion module is used for fusing the high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
and the emotion recognition module is used for performing emotion recognition on the high-level feature vectors corresponding to the MFCC and the spectrogram and on the fused feature vector output by the vector fusion module.
8. The multi-modal emotion recognition system as recited in claim 7 further comprising a preprocessing module,
the preprocessing module is used for carrying out pre-emphasis, framing and windowing on the voice signal, wherein the pre-emphasis is applied through a high-pass filter and the pre-emphasis process satisfies:
y(t)=x(t)-αx(t-1);
wherein t represents the current moment, α represents the filter coefficient, x(t) represents the input value of the voice signal, and y(t) represents the output value of the voice signal;
the windowing is applied through a Hamming window function, and the windowing process satisfies:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
wherein N represents the window length of each frame of the voice signal, a represents the Hamming window coefficient, n indexes the sampling points of the framed voice signal within the window, and W(n, a) represents the value of the Hamming window in the time domain;
the preprocessing module performs an energy distribution transformation in the frequency domain on each windowed frame of the voice signal to obtain the energy spectra corresponding to the MFCC and the spectrogram, and the transformation satisfies the formula:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^{-j2πnk/N}, 1 ≤ k ≤ N;
wherein S_i(k) represents the complex value of the kth frequency component of the voice signal in the frequency domain, s_i(n) represents the real value of the nth sampling point in the time domain, k is the index of the frequency component, and N is the total number of sampling points.
9. The multi-modal emotion recognition system of claim 8, comprising:
the vector extraction module is used for extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signals;
inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor added with a flat layer for deep learning, and extracting the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));
wherein x_MFCC denotes the low-level feature vector of the MFCC and x_m denotes the extracted high-level feature vector of the MFCC;
inputting low-level feature vectors corresponding to the spectrograms into an AlexNet processor added with a flat layer for deep learning, and extracting high-level feature vectors corresponding to the spectrograms:
x_s = Flatten(AlexNet(x_Spec));
wherein x_Spec denotes the low-level feature vector of the spectrogram and x_s denotes the extracted high-level feature vector of the spectrogram;
inputting the voice waveform into a Wav2Vec processor for preprocessing, performing deep learning on the preprocessed voice waveform through the Wav2Vec processor, and extracting a high-level feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);
wherein x_Wav denotes the raw voice waveform and x_w denotes the extracted high-level feature vector of the voice waveform;
inputting the voice text into a BERT processor for preprocessing, performing deep learning on the preprocessed voice text through the BERT processor, and extracting high-level feature vectors corresponding to the voice text:
x_t = BERT(x_Text);
wherein x_Text denotes the voice text and x_t denotes the extracted high-level feature vector of the voice text;
the vector extraction module decompresses and maps the advanced feature vector corresponding to the voice text to a fixed dimension N, and extracts the corresponding high-dimension vector:
x_d = f_De(x_t);
wherein f_De(·) denotes the decoder function and x_d denotes the corresponding N-dimensional high-dimensional vector;
the vector extraction module compresses and maps the high-level feature vector corresponding to the voice waveform to a fixed dimension N, and extracts a corresponding low-dimension vector:
x_e = f_En(x_w);
wherein f_En(·) denotes the encoder function and x_e denotes the corresponding N-dimensional low-dimensional vector;
the vector extraction module performs enhanced feature extraction on the N-dimensional feature vectors corresponding to the voice text and the voice waveform through a cross-modal attention mechanism, and extracts the query, key and value terms of the corresponding cross-modal single-head attention through adaptive weight assignment:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;
wherein W denotes the weight matrix of each layer and b denotes the bias coefficient; Q_w is obtained from the speech modality, while K_t and V_b are obtained from the text modality;
the embedded representation is extracted according to the cross-modal single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_tᵀ/√d)·V_b;
wherein d denotes the embedding dimension;
the vector extraction module adds the embedded representation to x_t and extracts the fusion vector of the voice waveform and the voice text:
x′_t = LN(Attention(Q_w, K_t, V_b) + x_t);
wherein LN(·) is the layer normalization function and x′_t denotes the fusion vector of the voice waveform and the voice text.
10. the multi-modal emotion recognition system of claim 9, comprising:
the emotion recognition module is used for performing shape normalization on the high-level feature vectors corresponding to the MFCC and the spectrogram and on the fused feature vectors to obtain a spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x′_w), f_FFN(x′_t));
wherein f_FFN(·) represents the forward propagation module function, Concat(·) represents the vector concatenation function, and x_concat represents the vector splicing result;
emotion recognition is carried out according to the spliced vector x_concat, and the process satisfies:
ŷ = f_CLS(x_concat);
wherein f_CLS(·) represents the emotion classification function and ŷ represents the emotion recognition result.
CN202310632363.6A 2023-05-30 2023-05-30 Multi-mode emotion recognition method and system Pending CN116682463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310632363.6A CN116682463A (en) 2023-05-30 2023-05-30 Multi-mode emotion recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310632363.6A CN116682463A (en) 2023-05-30 2023-05-30 Multi-mode emotion recognition method and system

Publications (1)

Publication Number Publication Date
CN116682463A true CN116682463A (en) 2023-09-01

Family

ID=87783006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310632363.6A Pending CN116682463A (en) 2023-05-30 2023-05-30 Multi-mode emotion recognition method and system

Country Status (1)

Country Link
CN (1) CN116682463A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116913258A (en) * 2023-09-08 2023-10-20 鹿客科技(北京)股份有限公司 Speech signal recognition method, device, electronic equipment and computer readable medium
CN116913258B (en) * 2023-09-08 2023-11-24 鹿客科技(北京)股份有限公司 Speech signal recognition method, device, electronic equipment and computer readable medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination