CN116682463A - Multi-mode emotion recognition method and system - Google Patents
Multi-mode emotion recognition method and system
- Publication number: CN116682463A
- Application number: CN202310632363.6A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G10L25/63: Speech or voice analysis specially adapted for estimating an emotional state
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/16: Speech classification or search using artificial neural networks
- G10L25/24: Speech or voice analysis characterised by the extracted parameters being the cepstrum
- G10L25/30: Speech or voice analysis characterised by the analysis technique using neural networks
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a multi-mode emotion recognition method and system in the technical field of emotion recognition. S10: acquire the MFCC, spectrogram, voice waveform and voice text of a voice signal; S20: acquire the high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text; S30: fuse the high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text; S40: perform emotion recognition on the MFCC, the spectrogram and the fused high-level feature vectors. The multi-mode emotion recognition method and system provided by the application can fuse multi-level acoustic features, strengthening the information interaction and connection among multiple acoustic features within the voice modality, and use a cross-modal convolutional attention mechanism to capture the emotion information between the two modalities more comprehensively. This solves the problems that single acoustic information in the voice modality and insufficient voice-text information interaction leave the information under-used and reduce emotion recognition accuracy.
Description
Technical Field
The application relates to the technical field of emotion recognition, in particular to a multi-mode emotion recognition method and system based on a cross-mode convolution attention and multi-acoustic feature fusion module.
Background
Speech emotion recognition is a natural language processing technology that aims to recognize the emotional state of a human being from the speech signal. It can be used to evaluate a speaker's emotion, attitude and emotional tendency, and it has many applications in real life, such as: (1) natural language understanding: helping computers better understand and interpret the emotional states in human speech; (2) intelligent customer service: helping customer-service staff better understand the emotional state and needs of customers so as to provide better service and support; (3) emotion diagnosis and treatment: helping doctors and therapists better understand the emotional state and needs of patients so as to provide more effective treatment plans; (4) social media: emotion analysis of voice messages, voice comments, live video and the like, so that user feedback and emotional states can be better understood and responded to.
Speech is a signal, and the earliest speech emotion feature extraction used general signal feature extraction methods such as the moving average (MA), the Fourier transform (FFT) and the wavelet transform (WT), as well as feature extraction methods specific to the characteristics of speech signals, such as short-time energy (STE), zero-crossing rate (ZCR), linear predictive coding coefficients (LPC) and Mel-frequency cepstral coefficients (MFCC). The performance of these hand-crafted feature extraction methods is limited by the quality and quantity of the acquired signals; mixed-in noise is difficult to filter out, the computation is cumbersome, and the robustness is poor.
With the progress of deep learning, modern speech emotion recognition technology has made significant advances. Deep learning techniques are widely used for speech signal feature extraction, for example convolutional neural networks (CNN), long short-term memory networks (LSTM) and self-attention models (Transformers). Deep learning has many advantages in speech emotion recognition: a deep network can automatically learn high-level feature representations from the raw speech signal without manually designed features, so the emotion information in the speech signal can be represented more accurately; it is also more robust and can handle noise and speech deformation to a certain extent, so the emotional state in the speech signal can be recognized more stably.
Zhao Li et al. of the Department of Radio Engineering at Southeast University first proposed emotion recognition research on speech signals in 2001 and conducted in-depth research on speech emotion recognition using principal component analysis; Cai Yonglian of the computer department of Tsinghua University also studied emotion recognition for Mandarin Chinese, investigating the role of prosodic features in Mandarin emotion recognition and adopting a Gaussian mixture model and a probabilistic neural network as classifiers, achieving satisfactory emotion recognition rates. However, the existing voice-text bimodal emotion recognition technology uses only the acoustic feature of the voice waveform in the voice modality and no other acoustic features, so the features are single and the information is limited; especially when the data set is small, using only the voice waveform leaves the voice modality short of information and the training effect poor. The voice and text modalities are strongly correlated: for the same text, the pitch, speaking rate, rhythm and the like are all related to emotion. Yet under the existing voice-text emotion recognition technology there is no sufficient information interaction between the voice modality and the text modality; the feature vectors of the two modalities are simply concatenated or their inner product is taken, so the correlation between voice and text is not fully exploited to capture key information, and there is no information interaction in the high-dimensional subspace.
In summary, the prior art has the following drawbacks: (1) the voice modality uses only the acoustic feature of the voice waveform and no other acoustic features, so the features are single and the information is limited; especially when the data set is small, using only the voice waveform leaves the voice modality short of information and the training effect poor; (2) under the existing voice-text emotion recognition technology, information interaction between the voice modality and the text modality is insufficient: the feature vectors of the two modalities are simply concatenated or their inner product is taken, so the correlation between voice and text is not fully exploited to capture key information, there is no information interaction in the high-dimensional subspace, and the emotion recognition accuracy is reduced.
Disclosure of Invention
Aiming at the technical problems of single acoustic features and low emotion recognition accuracy in the prior art, the invention provides a multi-modal emotion recognition method and system, which strengthen the information interaction and connection among multiple acoustic features within the voice modality and, by capturing and fusing the interaction information between the voice modality and the text modality, capture the emotion information between the two modalities more comprehensively, thereby obtaining better performance.
In order to achieve the purpose of the invention, the invention adopts the following technical scheme:
a multi-modal emotion recognition method comprising the steps of:
s10: acquiring an MFCC, a spectrogram, a voice waveform and a voice text of a voice signal;
s20: acquiring high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
s30: fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
s40: carrying out emotion recognition on the MFCC, the spectrogram and the advanced feature vector obtained after fusion in the step S30;
wherein MFCC represents mel-frequency cepstral coefficients.
According to the technical scheme, through acquiring the MFCC, the spectrogram, the voice waveform, the voice text and the corresponding high-level feature vectors thereof, and fusing the high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text, the information interaction and the connection of various acoustic features in the voice mode are enhanced, and the interaction information between the voice mode and the text mode is fused, so that emotion information between the two modes can be acquired more comprehensively, and better emotion recognition performance is obtained.
Further, the step S10 specifically includes the following steps:
S11: pre-emphasis processing is carried out on the voice signal;
s12: carrying out framing treatment on the voice signal subjected to pre-emphasis treatment;
s13: and windowing the voice signal subjected to framing.
In step S11, the pre-emphasis is applied through a high-pass filter, and the pre-emphasis process satisfies:
y(t)=x(t)-αx(t-1);
in the formula, t represents the current time, α represents a filter coefficient, x (t) represents an input value of a voice signal, and y (t) represents an output value of the voice signal.
In step S13, the windowing is applied through a Hamming window function, and the windowing process satisfies:

W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;

where N represents the window length of each frame of the voice signal, a represents the Hamming window coefficient of each frame, n indexes the sampling points of the framed voice signal inside the window, and W(n, a) represents the value of the Hamming window in the time domain.
S14: carrying out an energy distribution transformation in the frequency domain on each windowed frame of the voice signal to obtain the energy spectrum of the voice signal, where the energy distribution transformation satisfies the formula:

S_i(k) = Σ_{n=1}^{N} s_i(n)·e^{-j2πkn/N}, 1 ≤ k ≤ N;

where S_i(k) represents the complex value of the k-th frequency component of the i-th frame in the frequency domain, s_i(n) represents the real value of the n-th sampling point in the time domain, k is the index of the frequency component, and N is the total number of sampling points.
According to this technical scheme, pre-emphasis boosts the high-frequency part of the voice signal; the boosted signal is then framed, turning the overall non-stationary voice signal into short-time stationary segments, and windowing each segment reduces spectral leakage, so that the energy distribution transformation in the frequency domain can be carried out better and the energy spectra corresponding to the MFCC and the spectrogram can be obtained.
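As an illustrative aid (not part of the patent disclosure), the pre-emphasis, framing, windowing and energy-spectrum steps S11-S14 can be sketched in Python roughly as follows; the 512-point FFT and the filter coefficient α = 0.97 are assumed values from common practice, while the 20 ms window and 10 ms frame shift follow the embodiment described later.

```python
import numpy as np

def preprocess(signal, sr, alpha=0.97, frame_ms=20, shift_ms=10, nfft=512):
    """Sketch of steps S11-S14: pre-emphasis, framing, Hamming windowing, power spectrum."""
    # S11: pre-emphasis, y(t) = x(t) - alpha * x(t-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # S12: split into overlapping short-time frames (assumes the signal is at least one frame long)
    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(sr * shift_ms / 1000)
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(num_frames)[:, None])
    frames = emphasized[idx]

    # S13: multiply each frame by a Hamming window to reduce spectral leakage
    frames = frames * np.hamming(frame_len)

    # S14: FFT of each frame and energy (power) spectrum
    magnitude = np.abs(np.fft.rfft(frames, nfft))
    return magnitude ** 2 / nfft          # shape: (num_frames, nfft // 2 + 1)
```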
Further, the step S20 specifically includes the following steps:
s21: extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal;
s22: extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text comprises the following steps:
inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor followed by a Flatten layer for deep learning, to obtain the high-level feature vector corresponding to the MFCC:

x_m = Flatten(BiLSTM(x_MFCC));

where x_m denotes the high-level MFCC feature vector and x_MFCC denotes the low-level MFCC feature vector;
inputting the low-level feature vector corresponding to the spectrogram into an AlexNet processor followed by a Flatten layer for deep learning, to obtain the high-level feature vector corresponding to the spectrogram:

x_s = Flatten(AlexNet(x_Spec));

where x_s denotes the high-level spectrogram feature vector and x_Spec denotes the low-level spectrogram feature vector;
inputting the voice waveform into a Wav2Vec processor for preprocessing, and performing deep learning on the preprocessed voice waveform through the Wav2Vec processor to obtain the high-level feature vector corresponding to the voice waveform:

x_w = Wav2Vec(x_Wav);

where x_w denotes the high-level waveform feature vector and x_Wav denotes the voice waveform;
inputting the voice text into a BERT processor for preprocessing, and performing deep learning on the preprocessed voice text through the BERT processor to obtain the high-level feature vector corresponding to the voice text:

x_t = BERT(x_Text);

where x_t denotes the high-level text feature vector and x_Text denotes the voice text;
according to this technical scheme, the low-level feature vectors corresponding to the MFCC and the spectrogram, the voice waveform and the voice text are each transmitted to the corresponding processor, and deep learning models extract the higher-level acoustic and speech-emotion feature vectors of the voice signal.
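For illustration only, a minimal PyTorch/Hugging Face sketch of the four feature branches in step S22 might look as follows; the hidden sizes, the pretrained checkpoints ("facebook/wav2vec2-base-960h", "bert-base-uncased") and the use of a 3-channel image-like spectrogram input for AlexNet are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet
from transformers import Wav2Vec2Model, BertModel

class FeatureBranches(nn.Module):
    """Illustrative sketch of the four high-level feature extractors of step S22."""
    def __init__(self, mfcc_dim=40):
        super().__init__()
        # MFCC branch: two-layer BiLSTM followed by a Flatten layer
        self.bilstm = nn.LSTM(mfcc_dim, 128, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Spectrogram branch: AlexNet backbone (spectrogram assumed rendered as a 3-channel image)
        self.alexnet = alexnet(weights=None)
        # Waveform branch: pretrained Wav2Vec model (checkpoint name is an assumption)
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        # Text branch: pretrained BERT model (checkpoint name is an assumption)
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.flatten = nn.Flatten()

    def forward(self, x_mfcc, x_spec, x_wav, text_ids, text_mask):
        x_m, _ = self.bilstm(x_mfcc)                    # x_m = Flatten(BiLSTM(x_MFCC))
        x_m = self.flatten(x_m)
        x_s = self.flatten(self.alexnet(x_spec))        # x_s = Flatten(AlexNet(x_Spec))
        x_w = self.wav2vec(x_wav).last_hidden_state     # x_w = Wav2Vec(x_Wav)
        x_t = self.bert(text_ids, attention_mask=text_mask).last_hidden_state[:, 0]  # CLS vector
        return x_m, x_s, x_w, x_t
```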
Further, the step S30 specifically includes the following steps:
vector fusion is carried out on the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform through a linear layer of an activation function:
x_ms = f_l(x_m ⊕ x_s);

where f_l(·) denotes the linear layer function, x_m denotes the high-level MFCC feature vector, x_s denotes the high-level spectrogram feature vector, ⊕ denotes vector concatenation, and x_ms denotes the fusion vector output by the linear layer;
The fusion vector x_ms then undergoes vector fusion through a linear layer with an activation function:
x′_w = f_l(x_ms) ⊙ x_w;

where ⊙ denotes matrix multiplication and x′_w denotes the multi-acoustic fusion feature vector obtained after fusing the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform;
The high-level feature vector corresponding to the voice text is decompressed and mapped to a fixed dimension N, and a corresponding high-dimension vector is obtained:
x_d = f_De(x_t);

where f_De(·) denotes the decompression (decoder) mapping and x_d denotes the corresponding N-dimensional vector obtained from the voice-text features;
compressing and mapping the advanced feature vector corresponding to the voice waveform to a fixed dimension N to obtain a corresponding dimension vector:
x_e = f_En(x_w);

where f_En(·) denotes the compression (encoder) mapping and x_e denotes the corresponding N-dimensional vector obtained from the voice-waveform features;
the N-dimensional feature vectors corresponding to the voice text and the voice waveform are subjected to enhanced feature extraction through a cross-convolution attention mechanism, and a corresponding cross-mode single-head attention formula is obtained through self-adaptive weight distribution:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;

where W denotes the weight matrix of each layer, b denotes the bias coefficient, Q_w is obtained from the voice modality, and K_t and V_b are obtained from the text modality;
acquiring an embedded expression according to a cross-mode single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_t^T / √d)·V_b;

where d denotes the embedding dimension;
embedding expressions and x t Adding to obtain a fusion vector of the voice waveform and the voice text:
x′_t = LN(Attention(Q_w, K_t, V_b) + x_t);

where LN(·) is the layer normalization function and x′_t denotes the fusion vector of the voice waveform and the voice text.
According to this technical scheme, fusing the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform integrates their respective advantages, realizes information complementation, and uses the correlation among them to strengthen the feature output of the voice-waveform path; fusing the high-level feature vectors of the voice waveform and the voice text enables precise modeling and fusion of the speech and text features, improves the performance of the multi-modal learning task, and enhances the generalization ability and interpretability of the model.
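The fusion equations of step S30 can be illustrated numerically with the following hedged sketch; the batch size, the mapping dimension N and the embedding size d are arbitrary assumed values, and the residual connection is simplified so that all tensors share the same shape.

```python
import torch
import torch.nn.functional as F

# Illustrative numbers only: batch size B, mapping dimension N and embedding size d are assumptions.
B, N, d = 4, 25, 64

x_e = torch.randn(B, N, d)     # compressed waveform features (speech modality)
x_d = torch.randn(B, N, d)     # decompressed text features (text modality)
x_t = torch.randn(B, N, d)     # text vector used on the residual path (simplified to the same shape)

W_w, W_t, W_b = (torch.nn.Linear(d, d) for _ in range(3))
Q_w = W_w(x_e)                 # query from the voice modality
K_t = W_t(x_d)                 # key from the text modality
V_b = W_b(x_d)                 # value from the text modality

# Attention(Q_w, K_t, V_b) = softmax(Q_w K_t^T / sqrt(d)) V_b
attn = F.softmax(Q_w @ K_t.transpose(-2, -1) / d ** 0.5, dim=-1) @ V_b
x_t_fused = F.layer_norm(attn + x_t, (d,))     # x'_t = LN(Attention(Q_w, K_t, V_b) + x_t)
print(x_t_fused.shape)                         # torch.Size([4, 25, 64])
```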
Further, the step S40 specifically includes the following steps:
performing shape regularity on the MFCC, the spectrogram and the corresponding advanced feature vector after fusion to obtain a spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x′_w), f_FFN(x′_t));

where f_FFN(·) denotes the forward-propagation (FFN) module function, Concat(·) denotes the vector concatenation function, and x_concat denotes the concatenation result;
according to the splice vector x concat Emotion recognition is carried out, and the process is as follows:
ŷ = f_CLS(x_concat);

where f_CLS(·) denotes the emotion recognition (classifier) function and ŷ denotes the emotion recognition result.
According to this technical scheme, shape regularization of the MFCC, the spectrogram and the fused high-level feature vectors brings feature vectors of different dimensions to a uniform shape, eliminating the influence of mismatched vector dimensions on emotion recognition accuracy.
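A minimal sketch of the shape-regularization idea in step S40, assuming each FFN is simply a Flatten layer followed by a Linear projection to a common (assumed) dimension of 128:

```python
import torch
import torch.nn as nn

def make_ffn(in_dim, out_dim=128):
    # One FFN branch: Flatten followed by Linear, so every modality ends up with the same shape
    return nn.Sequential(nn.Flatten(), nn.Linear(in_dim, out_dim))

# Hypothetical per-modality shapes before regularization
x_m, x_s = torch.randn(4, 10, 32), torch.randn(4, 256)
ffn_m, ffn_s = make_ffn(10 * 32), make_ffn(256)
x_concat = torch.cat([ffn_m(x_m), ffn_s(x_s)], dim=-1)   # Concat(f_FFN(x_m), f_FFN(x_s), ...)
print(x_concat.shape)                                    # torch.Size([4, 256])
```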
A multi-modal emotion recognition system, comprising:
the acquisition module is used for acquiring the MFCC, the spectrogram, the voice waveform and the voice text of the voice signal;
the vector extraction module is used for extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
the vector fusion module is used for fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
and the emotion recognition module is used for performing emotion recognition on the high-level feature vector obtained by fusing the MFCC, the spectrogram and the vector fusion module.
Further, the emotion recognition system further comprises a preprocessing module, wherein the preprocessing module is used for pre-emphasis, framing and windowing of the voice signal, the pre-emphasis processing is added through a high-pass filter equation, and the pre-emphasis processing process meets the following conditions:
y(t)=x(t)-αx(t-1);
in the formula, t represents the current moment, alpha represents a filter coefficient, x (t) represents an input value of a voice signal, and y (t) represents an output value of the voice signal;
the windowing process needs to be added through a Hamming window function, and the process of the windowing process meets the following conditions:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;

where N represents the window length of each frame of the voice signal, a represents the Hamming window coefficient of each frame, n indexes the sampling points of the framed voice signal inside the window, and W(n, a) represents the value of the Hamming window in the time domain;
carrying out energy distribution transformation on each frame of voice signal subjected to windowing processing on a frequency domain to obtain energy spectrums corresponding to the MFCC and the spectrogram, wherein the energy distribution transformation process meets the formula:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^{-j2πkn/N}, 1 ≤ k ≤ N;

where S_i(k) represents the complex value of the k-th frequency component of the i-th frame in the frequency domain, s_i(n) represents the real value of the n-th sampling point in the time domain, k is the index of the frequency component, and N is the total number of sampling points.
Further, the vector extraction module carries out low-level feature vector extraction according to the energy spectrums corresponding to the MFCC and the spectrogram;
Inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor added with a flat layer for deep learning, and extracting the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));

where x_m denotes the high-level MFCC feature vector and x_MFCC denotes the low-level MFCC feature vector;
inputting low-level feature vectors corresponding to the spectrograms into an AlexNet processor added with a flat layer for deep learning, and extracting high-level feature vectors corresponding to the spectrograms:
x_s = Flatten(AlexNet(x_Spec));

where x_s denotes the high-level spectrogram feature vector and x_Spec denotes the low-level spectrogram feature vector;
inputting the voice waveform into a Wav2Vec processor for preprocessing, performing deep learning on the preprocessed voice waveform through the Wav2Vec processor, and extracting a high-level feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);

where x_w denotes the high-level waveform feature vector and x_Wav denotes the voice waveform;
inputting the voice text into a BERT processor for preprocessing, performing deep learning on the preprocessed voice text through the BERT processor, and extracting high-level feature vectors corresponding to the voice text:
x_t = BERT(x_Text);

where x_t denotes the high-level text feature vector and x_Text denotes the voice text;
the vector extraction module decompresses and maps the advanced feature vector corresponding to the voice text to a fixed dimension N, and extracts the corresponding high-dimension vector:
x_d = f_De(x_t);

where f_De(·) denotes the decompression (decoder) mapping and x_d denotes the corresponding N-dimensional vector obtained from the voice-text features;
the vector extraction module compresses and maps the high-level feature vector corresponding to the voice waveform to a fixed dimension N, and extracts a corresponding low-dimension vector:
x_e = f_En(x_w);

where f_En(·) denotes the compression (encoder) mapping and x_e denotes the corresponding N-dimensional vector obtained from the voice-waveform features;
the vector extraction module performs enhanced feature extraction on N-dimensional feature vectors corresponding to the voice text and the voice waveform through a cross-convolution attention mechanism, and extracts a corresponding cross-mode single-head attention formula through self-adaptive weight distribution:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;

where W denotes the weight matrix of each layer, b denotes the bias coefficient, Q_w is obtained from the voice modality, and K_t and V_b are obtained from the text modality;
extracting an embedded expression according to a cross-mode single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_t^T / √d)·V_b;

where d denotes the embedding dimension;
the vector extraction module combines the embedded expression with x t Adding, extracting a fusion vector of the voice waveform and the voice text:
x′_t = LN(Attention(Q_w, K_t, V_b) + x_t);

where LN(·) is the layer normalization function and x′_t denotes the fusion vector of the voice waveform and the voice text.
further, the emotion recognition module performs shape normalization on the MFCC, the spectrogram and the fused corresponding advanced feature vector to obtain a spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x′_w), f_FFN(x′_t));

where f_FFN(·) denotes the forward-propagation (FFN) module function, Concat(·) denotes the vector concatenation function, and x_concat denotes the concatenation result;
according to the splice vector x concat Emotion recognition is carried out, and the process is as follows:
ŷ = f_CLS(x_concat);

where f_CLS(·) denotes the emotion recognition (classifier) function and ŷ denotes the emotion recognition result.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a multi-modal emotion recognition method and system. By acquiring the MFCC, spectrogram, voice waveform and voice text of a voice signal together with their corresponding high-level feature vectors, and by fusing the high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text, the information interaction and connection among multiple acoustic features within the voice modality are strengthened; by capturing and fusing the interaction information between the voice modality and the text modality, the emotion information between the two modalities and the information interaction in the high-dimensional subspace can be captured more comprehensively, so that better emotion recognition performance is obtained, the emotion recognition accuracy is significantly improved, and the training effect is further enhanced. This effectively solves the problems that single acoustic information in the voice modality and insufficient voice-text information interaction leave the information under-used and reduce the emotion recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a multi-modal emotion recognition method provided by an embodiment of the present application;
FIG. 2 is a flow chart of extracting MFCCs provided by an embodiment of the present application;
FIG. 3 is a flowchart of generating a spectrogram according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a MAFF module according to an embodiment of the present application;
fig. 5 is a schematic diagram of a CMCA module according to an embodiment of the present application;
FIG. 6 is a general framework for multimodal emotion recognition provided by an embodiment of the present application;
wherein MAFF represents multi-acoustic feature fusion; CMCA represents cross-modal convolution attention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without inventive faculty, are intended to fall within the scope of the application.
Embodiment one:
the embodiment proposes a multi-mode emotion recognition method, referring to fig. 1, including the following steps:
s10: acquiring an MFCC, a spectrogram, a voice waveform and a voice text of a voice signal;
s20: acquiring high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
s30: fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
s40: carrying out emotion recognition on the MFCC, the spectrogram and the advanced feature vector obtained after fusion in the step S30;
wherein MFCC represents mel-frequency cepstral coefficients.
It can be understood that by acquiring the MFCC, the spectrogram, the voice waveform, the voice text and the corresponding advanced feature vectors thereof, and fusing the MFCC, the spectrogram, the voice waveform and the corresponding advanced feature vectors of the voice text, information interaction and connection of various acoustic features in the voice mode are enhanced, and interaction information between the voice mode and the text mode is fused, emotion information between the two modes can be acquired more comprehensively, and better performance is obtained.
In this embodiment, the step S10 specifically includes the following steps:
referring to fig. 2 and 3, the process of acquiring MFCCs and spectrograms includes the steps of:
S11: pre-emphasis processing is carried out on the voice signal;
in step S11, the pre-emphasis process needs to be added by a high-pass filter equation, and the pre-emphasis process meets the following requirements:
y(t)=x(t)-αx(t-1);
in the formula, t represents the current time, α represents a filter coefficient, x (t) represents an input value of a voice signal, and y (t) represents an output value of the voice signal.
It will be appreciated that by pre-emphasis processing, the speech signal is passed through a high pass filter, and the speech signal is amplified at high frequencies using the pre-emphasis filter.
S12: carrying out framing treatment on the voice signal subjected to pre-emphasis treatment;
it will be appreciated that after pre-emphasis, the signal needs to be split into short time frames. In most cases, the speech signal is non-stationary and fourier transforming the whole signal is meaningless, as the frequency profile of the signal is lost over time. The speech signal subjected to framing processing is a short-time stationary signal. We therefore perform a fourier transform on the short-term frames to obtain a good approximation of the signal frequency profile by concatenating adjacent frames. The window is 20 ms, the frame shift is 10 ms, and the overlap between two adjacent frames is 50%.
S13: and windowing the voice signal subjected to framing.
In step S13, the windowing process needs to be added through a Hamming window function, and the process of the windowing process satisfies:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;

where N represents the window length of each frame of the voice signal, a represents the Hamming window coefficient of each frame, n indexes the sampling points of the framed voice signal inside the window, and W(n, a) represents the value of the Hamming window in the time domain;
it can be appreciated that after the speech signal is subjected to framing, in order to reduce spectrum leakage, each frame needs to be multiplied by a window function, so that spectrum leakage is effectively reduced.
S14: carrying out energy distribution transformation on each frame of voice signal subjected to windowing treatment on a frequency domain to obtain an energy spectrum of the voice signal, wherein the energy distribution transformation process meets the formula:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^{-j2πkn/N}, 1 ≤ k ≤ N;

where S_i(k) represents the complex value of the k-th frequency component of the i-th frame in the frequency domain, s_i(n) represents the real value of the n-th sampling point in the time domain, k is the index of the frequency component, and N is the total number of sampling points.
It can be understood that for each frame of voice signal after pre-emphasis, framing and windowing, in order to convert the voice signal into energy distribution on a frequency domain, FFT (fast fourier transform) is required to be performed, so as to obtain an energy spectrum corresponding to the MFCC and the spectrogram; wherein, the MFCC can reflect the energy, frequency spectrum and formant information of the voice signal, and is commonly used for the characteristic analysis of the voice signal; the spectrogram can reflect acoustic information such as frequency spectrum, frequency change, tone and the like, and is commonly used for extracting voice emotion analysis characteristics.
According to the technical scheme provided in steps S11-S14, pre-emphasis boosts the high-frequency part of the voice signal; the boosted signal is framed, turning the overall non-stationary voice signal into short-time stationary segments, and windowing each segment reduces spectral leakage, so that the energy distribution transformation in the frequency domain can be carried out better and the energy spectra corresponding to the MFCC and the spectrogram can be obtained.
In this embodiment, the step S20 specifically includes the following steps:
s21: extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal;
it can be appreciated that, extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal according to the energy spectrum of the voice signal;
referring to fig. 2, low-level feature vectors corresponding to MFCCs are extracted, and energy spectrum is passed through a set of Mel-scale triangular filter banks to obtain filter bank coefficients, which are intended to simulate the human auditory system, and since the filter bank coefficients have high correlation, a decorrelation process using DCT (discrete cosine transform) is required to be performed, so as to obtain the low-level feature vectors corresponding to MFCCs, and the DCT process satisfies the formula:
C(n) = Σ_{m=1}^{M} s(m)·cos(πn(m - 0.5)/M), n = 1, 2, …, L;

where L is the order of the MFCC coefficients, M is the number of triangular filters, C(n) is the n-th coefficient in the frequency domain, s(m) is the m-th sample value (filter-bank output), and N is the number of samples in the time domain;
referring to fig. 3, in the process of extracting the low-level feature vector of the spectrogram, the logarithmic power calculation is performed on the energy spectrum obtained after the FFT transformation, so as to obtain the low-level feature vector corresponding to the spectrogram.
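As a non-authoritative sketch, the low-level MFCC and spectrogram features of step S21 could be computed from the per-frame power spectrum roughly as follows; the filter count, the number of retained coefficients and the use of librosa/SciPy helpers are assumptions for illustration.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def low_level_features(power_spec, sr, nfft=512, n_mels=26, n_mfcc=13):
    """Sketch of step S21: low-level MFCC and spectrogram features from the per-frame power spectrum."""
    # Spectrogram branch: logarithmic power of each frame
    log_spec = 10 * np.log10(power_spec + 1e-10)

    # MFCC branch: Mel-scale triangular filter bank, logarithm, then DCT de-correlation
    mel_fb = librosa.filters.mel(sr=sr, n_fft=nfft, n_mels=n_mels)   # (n_mels, nfft // 2 + 1)
    fbank = np.log(power_spec @ mel_fb.T + 1e-10)                    # filter-bank energies
    mfcc = dct(fbank, type=2, axis=-1, norm="ortho")[:, :n_mfcc]     # keep the first L coefficients
    return mfcc, log_spec
```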
S22: extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text comprises the following steps:
inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor added with a flat layer for deep learning, and obtaining the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));

where x_m denotes the high-level MFCC feature vector and x_MFCC denotes the low-level MFCC feature vector;
it can be understood that the BiLSTM processor includes a BiLSTM module, which is formed by stacking two layers of bidirectional LSTM networks, and a bitten layer with a dropout of 0.5 is added after the BiLSTM module, and then a low-level feature vector corresponding to the MFCC is input into the BiLSTM module for high-level feature vector extraction.
Inputting the low-level feature vector corresponding to the spectrogram into an AlexNet processor added with a flat layer for deep learning, and obtaining the high-level feature vector corresponding to the spectrogram:
x_s = Flatten(AlexNet(x_Spec));

where x_s denotes the high-level spectrogram feature vector and x_Spec denotes the low-level spectrogram feature vector;
it can be understood that the AlexNet processor comprises an AlexNet module, a dropout 0.5 flatten layer is added after the AlexNet module, and then the low-level feature vector corresponding to the spectrogram is input into the AlexNet module for high-level feature vector extraction.
Inputting the voice waveform into a Wav2Vec processor for preprocessing, and performing deep learning on the preprocessed voice waveform through the Wav2Vec processor to obtain an advanced feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);

where x_w denotes the high-level waveform feature vector and x_Wav denotes the voice waveform;
it can be understood that the Wav2Vec processor comprises a Wav2Vec module, and the voice waveform preprocessed by the Wav2Vec module can conform to the input form of the Wav2Vec model, so that the advanced feature vector corresponding to the voice waveform can be better extracted.
Inputting the voice text into a BERT processor for preprocessing, and performing deep learning on the preprocessed voice text through the BERT processor to obtain a high-level feature vector corresponding to the voice text:
x_t = BERT(x_Text);

where x_t denotes the high-level text feature vector and x_Text denotes the voice text;
it can be understood that the BERT processor comprises a BERT module, and the voice waveform preprocessed by the BERT module can conform to the input form of the BERT model, so that the advanced feature vector corresponding to the voice waveform can be better extracted.
According to the technical scheme provided in steps S21-S22, the low-level feature vectors corresponding to the MFCC and the spectrogram, the voice waveform and the voice text are transmitted to the corresponding processors, and deep learning is performed through the deep learning models, so that higher-level acoustic and speech-emotion feature vectors of the voice signal can be obtained.
In this embodiment, the step S30 specifically includes the following steps:
Referring to fig. 4, in the MAFF module (Multi-Acoustic Feature Fusion Module), the MFCC, the spectrogram and the voice waveform come from the same source but have different information-characterization abilities and can capture richer and more complete emotion information from different angles; using only the feature vector output by Wav2Vec easily loses part of the feature information, so the MAFF module is provided. The feature x_w of the voice waveform output by the Wav2Vec model, the feature x_m of the MFCC output by the BiLSTM and the feature x_s of the spectrogram output by AlexNet, three different high-level acoustic features, are fused, and the correlation among the three is used to strengthen the feature output of the voice waveform. Vector fusion is carried out on the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform through the linear layer with the ReLU activation function:
x_ms = f_l(x_m ⊕ x_s);

where f_l(·) denotes the linear layer function, x_m denotes the high-level MFCC feature vector, x_s denotes the high-level spectrogram feature vector, ⊕ denotes vector concatenation, and x_ms denotes the fusion vector output by the linear layer;
The fusion vector x_ms is passed through a linear layer with a dropout of 0.5, and the result is multiplied with x_w; vector fusion is performed through the linear layer with the activation function:
x′_w = f_l(x_ms) ⊙ x_w;

where ⊙ denotes matrix multiplication and x′_w denotes the multi-acoustic fusion feature vector obtained after fusing the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform;
it can be appreciated that the MAFF module can effectively fuse the MFCCs, the spectrograms, and the advanced feature vectors corresponding to the speech waveforms, integrate the advantages between them, implement information complementation, and output the final result as a speech waveform path.
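A possible reading of the MAFF module is sketched below; all dimensions are assumptions, and the element-wise product used for ⊙ is an interpretation (the description calls the operator a matrix multiplication), so this is an illustrative sketch rather than the patented implementation.

```python
import torch
import torch.nn as nn

class MAFF(nn.Module):
    """Illustrative sketch of the multi-acoustic feature fusion module (dimensions are assumptions)."""
    def __init__(self, d_m, d_s, d_w):
        super().__init__()
        # Linear layer with ReLU applied to the concatenated MFCC/spectrogram vectors
        self.fuse = nn.Sequential(nn.Linear(d_m + d_s, d_w), nn.ReLU())
        # Linear layer with dropout 0.5 before combining with the waveform features
        self.project = nn.Sequential(nn.Linear(d_w, d_w), nn.Dropout(0.5))

    def forward(self, x_m, x_s, x_w):
        x_ms = self.fuse(torch.cat([x_m, x_s], dim=-1))   # x_ms = f_l(x_m (+) x_s)
        # Element-wise product is an assumption for the operator described as a matrix multiplication
        return self.project(x_ms) * x_w                   # x'_w

maff = MAFF(d_m=256, d_s=256, d_w=512)
x_w_fused = maff(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 512))
print(x_w_fused.shape)    # torch.Size([4, 512])
```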
Referring to fig. 5, the CMCA module (Cross-Modal Convolutional Attention Module) processes information from different modalities and accepts two inputs: one is the last-layer state output feature vector of the Wav2Vec model, and the other is the CLS output vector of the BERT model. To merge these two sources of information, the CMCA module contains a convolutional encoder, a convolutional decoder and a cross-modal attention mechanism.
The high-level feature vector corresponding to the voice text is decompressed and mapped to a fixed dimension N through a convolution decoder, and a corresponding high-dimension vector is obtained:
x_d = f_De(x_t);

where f_De(·) denotes the convolutional decoder function and x_d denotes the corresponding N-dimensional vector obtained from the voice-text features;
it will be appreciated that text feature extraction uses the CLS layer output of the BERT model, which can be regarded as an overall semantic representation of the input text, which integrates context information, including vocabulary, grammar, sentence structure, etc., but which compresses various features into one-dimensional vectors, which are detrimental to information interaction fusion between different modalities, a convolutional decoder is designed that can decompress the one-dimensional CLS vector output by the BERT model into higher-dimensional vectors (i.e., to decompress the 1-dimensional advanced feature vector of the speech text into a feature vector of one dimension N, e.g., N ranging from 20 to 30) by means of a convolutional decoder, which facilitates extraction of key information in the text, making the text feature more suitable for fusion with speech modality features, which consists of two one-dimensional convolutions, a Maxpooling layer and a flame layer.
Compressing and mapping the advanced feature vector corresponding to the voice waveform to a fixed dimension N through a convolution encoder to obtain a corresponding low-dimension vector:
x_e = f_En(x_w);

where f_En(·) denotes the convolutional encoder function and x_e denotes the corresponding N-dimensional vector obtained from the voice-waveform features;
it will be appreciated that the last layer of the output vector of the Wav2Vec model contains the timing structure information and advanced features of the audio signal, such as intonation, pronunciation, sound texture, etc. of the speaker, but the vector dimension is too high, and is directly used for cross-modal attention, so that the vector redundancy is caused, the calculation amount is too large, and the dimensions are not matched, therefore, a convolutional encoder module is designed, and the convolutional encoder can compress the features output by the Wav2Vec model into a lower-dimension representation (i.e. compress and map the feature vector into an N-dimension feature vector, for example, compress and map the 149-dimension advanced feature vector of the speech waveform into an N-dimension feature vector through the convolutional encoder, and the range of N is 20-30), while retaining the key information in the audio signal, which helps to improve the calculation efficiency, and make the audio feature more suitable for being fused with the text feature, and the composition structure is similar to that of the convolutional decoder.
The N-dimensional feature vectors corresponding to the voice text and the voice waveform are subjected to enhanced feature extraction through a cross-convolution attention mechanism, and a corresponding cross-mode single-head attention formula is obtained through self-adaptive weight distribution:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;

where W denotes the weight matrix of each layer, b denotes the bias coefficient, Q_w is obtained from the voice modality, and K_t and V_b are obtained from the text modality;
It can be understood that the same text spoken with different intonation and rhythm expresses different emotions. To integrate the voice and text modalities, a cross-modal attention mechanism needs to be designed: the two vectors of different dimensions for the voice text and the voice waveform are mapped to the same dimension N, and multi-head attention is used to enhance feature extraction. The cross-modal attention mechanism is responsible for computing the correlation between the audio features and the text features and assigning a weight to each feature. In this way, the attention mechanism can effectively capture the association between data of different modalities, automatically attend to the key information, assign weights, and finally fuse the weighted features. This mechanism strengthens the information interaction between the two modalities, focuses on key features through adaptive weight assignment, explicitly models the association between modalities, and realizes precise modeling and effective fusion of the multiple modalities. Its structure is similar to a cross-modal multi-head attention mechanism, and without loss of generality the corresponding cross-modal single-head attention formula is obtained.
Acquiring an embedded expression according to a cross-mode single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_t^T / √d)·V_b;

where d denotes the embedding dimension; multi-head attention performs this process multiple times to obtain information from the various representation subspaces, and a three-head attention mechanism is used to ensure that the output of each branch contains enough information from the other modality.
Embedding expressions and x t Adding to obtain a fusion vector of the voice waveform and the voice text:
x′_t = LN(Attention(Q_w, K_t, V_b) + x_t);

where LN(·) is the layer normalization function and x′_t denotes the fusion vector of the voice waveform and the voice text.
it can be understood that the CMCA module serves as an effective cross-modal convolution attention module, and through the organic combination of the convolution encoder, the convolution decoder and the cross-modal attention mechanism, the precise modeling and fusion of the voice waveform and the voice text characteristics are realized, the performance of the multi-modal learning task can be improved, and the generalization capability and the interpretability of the model are enhanced.
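The following sketch shows one way the CMCA module could be assembled from a convolutional encoder, a convolutional decoder and cross-modal multi-head attention; the kernel sizes, the pooling to N = 25, the embedding size of 96 (chosen only so it divides evenly among the three attention heads mentioned above) and the projection on the residual path are all assumptions.

```python
import torch
import torch.nn as nn

class CMCA(nn.Module):
    """Illustrative sketch of the cross-modal convolutional attention module (all sizes are assumptions)."""
    def __init__(self, d_wav=768, d_txt=768, n=25, d=96, heads=3):
        super().__init__()
        # Convolutional encoder: compresses the Wav2Vec sequence into an n x d representation
        self.encoder = nn.Sequential(nn.Conv1d(d_wav, d, kernel_size=3, padding=1),
                                     nn.AdaptiveMaxPool1d(n))
        # Convolutional decoder: decompresses the 1-D BERT CLS vector into an n x d representation
        self.decoder = nn.Sequential(nn.Conv1d(1, d, kernel_size=3, padding=1),
                                     nn.AdaptiveMaxPool1d(n))
        self.attn = nn.MultiheadAttention(d, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.txt_proj = nn.Linear(d_txt, d)   # assumed projection so the residual shapes match

    def forward(self, x_w, x_t):
        # x_w: (B, T, d_wav) Wav2Vec last hidden states; x_t: (B, d_txt) BERT CLS vector
        x_e = self.encoder(x_w.transpose(1, 2)).transpose(1, 2)    # speech side, (B, n, d)
        x_d = self.decoder(x_t.unsqueeze(1)).transpose(1, 2)       # text side, (B, n, d)
        attn_out, _ = self.attn(query=x_e, key=x_d, value=x_d)     # Q from speech, K and V from text
        return self.norm(attn_out + self.txt_proj(x_t).unsqueeze(1))   # LN(Attention + x_t)

cmca = CMCA()
out = cmca(torch.randn(2, 149, 768), torch.randn(2, 768))
print(out.shape)    # torch.Size([2, 25, 96])
```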
Further, the step S40 includes the steps of:
See FIG. 6: the framework includes a classifier (i.e., a classifier module) for performing emotion recognition on the MFCC, the spectrogram and the fused high-level feature vectors, and FFN modules for passing the MFCC, the spectrogram and the fused high-level feature vectors to the classifier. Because the feature vectors have different dimensions before being input into the classifier module, all of them are first passed through the FFN modules and regularized into a uniform shape, which eliminates the influence of mismatched vector dimensions on emotion recognition accuracy and makes them convenient to classify. Each FFN module consists of a Flatten layer and one or more Linear layers; the vectors output by the FFN modules are put into the classifier for classification. The classifier consists of a Concat (concatenation) layer, two Linear layers and a Dropout layer, where the Concat layer is used for splicing the feature vectors passed in by the FFN modules, namely:
the MFCC, the spectrogram and the fused corresponding high-level feature vectors are subjected to shape normalization to obtain the spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x′_w), f_FFN(x′_t));

where f_FFN(·) denotes the forward-propagation (FFN) module function, Concat(·) denotes the vector concatenation function, and x_concat denotes the concatenation result;
according to the splice vector x concat Emotion recognition is carried out, and the process is as follows:
ŷ = f_CLS(x_concat);

where f_CLS(·) denotes the emotion recognition (classifier) function and ŷ denotes the emotion recognition result;
using the cross-entropy function as the loss function for emotion classification:

Loss = -Σ_i y_i·log(ŷ_i);

where y_i denotes the true label of the i-th emotion class and ŷ_i denotes the predicted probability of the i-th class.
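For illustration, a hedged sketch of the FFN + classifier head with a cross-entropy training step; the hidden size, the dropout placement and the assumption of four emotion classes are illustrative choices, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Sketch of the FFN + classifier head; dimensions and the 4 emotion classes are assumptions."""
    def __init__(self, dims, hidden=256, n_classes=4):
        super().__init__()
        # One FFN (Flatten + Linear) per branch so every feature vector gets the same regular shape
        self.ffns = nn.ModuleList(
            [nn.Sequential(nn.Flatten(), nn.Linear(d, hidden)) for d in dims])
        # Classifier: concatenation followed by two Linear layers and a Dropout layer
        self.classifier = nn.Sequential(
            nn.Linear(hidden * len(dims), hidden), nn.Dropout(0.5),
            nn.Linear(hidden, n_classes))

    def forward(self, branches):
        x_concat = torch.cat([ffn(x) for ffn, x in zip(self.ffns, branches)], dim=-1)
        return self.classifier(x_concat)       # logits; argmax gives the predicted emotion

model = EmotionClassifier(dims=[512, 512, 512, 512])
logits = model([torch.randn(8, 512) for _ in range(4)])
labels = torch.randint(0, 4, (8,))
loss = nn.CrossEntropyLoss()(logits, labels)   # cross-entropy loss for emotion classification
loss.backward()
```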
In the present embodiment as a whole, through step S10: acquiring the MFCC, the spectrogram, the voice waveform and the voice text of a voice signal; S20: acquiring the high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text; S30: fusing the high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text; and S40: performing emotion recognition on the MFCC, the spectrogram and the fused high-level feature vectors. This embodiment fully combines three acoustic features of the voice modality (the waveform, the MFCC and the spectrogram) with the text modality and proposes a multi-modal emotion recognition method; at the same time it proposes a multi-acoustic feature fusion module for fusing the three acoustic features of the MFCC, the spectrogram and the voice waveform, and a cross-modal convolutional attention for the interactive fusion of voice-modality and text-modality information. With the multi-modal emotion recognition method provided by this embodiment, multi-level acoustic features can be fused and the information interaction and connection of multiple acoustic features within the voice modality are strengthened; the interaction information between the voice modality and the text modality can be captured and fused, so that the emotion information between the two modalities and the information interaction in the high-dimensional subspace can be captured more comprehensively, better emotion recognition performance can be obtained, the emotion recognition accuracy is significantly improved, and the training effect is further enhanced. This effectively solves the problems that single acoustic information in the voice modality and insufficient voice-text information interaction leave the information under-used and reduce the emotion recognition accuracy.
Embodiment two:
a multi-modal emotion recognition system, comprising:
the acquisition module is used for acquiring the MFCC, the spectrogram, the voice waveform and the voice text of the voice signal;
the vector extraction module is used for extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
the vector fusion module is used for fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
and the emotion recognition module is used for performing emotion recognition on the high-level feature vector obtained by fusing the MFCC, the spectrogram and the vector fusion module.
The emotion recognition system also comprises a preprocessing module.
Further, the preprocessing module is configured to perform pre-emphasis, framing and windowing processing on a voice signal, where the pre-emphasis processing needs to be added through a high-pass filter equation, and a process of the pre-emphasis processing satisfies:
y(t)=x(t)-αx(t-1);
in the formula, t represents the current moment, alpha represents a filter coefficient, x (t) represents an input value of a voice signal, and y (t) represents an output value of the voice signal;
the windowing process needs to be added through a Hamming window function, and the process of the windowing process meets the following conditions:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;

where N represents the window length of each frame of the voice signal, a represents the Hamming window coefficient of each frame, n indexes the sampling points of the framed voice signal inside the window, and W(n, a) represents the value of the Hamming window in the time domain;
The preprocessing module performs an energy distribution transformation on each windowed frame of the voice signal in the frequency domain to obtain the energy spectrums corresponding to the MFCC and the spectrogram; the energy distribution transformation satisfies:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^(-j2πkn/N), 1 ≤ k ≤ N;
in the formula, S_i(k) represents the complex value of the k-th frequency component of the voice signal in the frequency domain, s_i(n) represents the real value of the n-th sampling point in the time domain, k represents the index of the frequency component, and N represents the total number of sampling points.
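For illustration only, the following is a minimal NumPy sketch of the preprocessing chain described above (pre-emphasis, framing, Hamming windowing and per-frame energy spectrum); the frame length of 400 samples, hop size of 160 samples and pre-emphasis coefficient α = 0.97 are assumed values, not parameters fixed by this embodiment.

```python
import numpy as np

def preprocess(x, frame_len=400, hop=160, alpha=0.97, ham_a=0.46):
    """Pre-emphasis, framing, Hamming windowing and per-frame energy spectrum.
    Assumes the input signal x is at least frame_len samples long."""
    # Pre-emphasis: y(t) = x(t) - alpha * x(t-1)
    y = np.append(x[0], x[1:] - alpha * x[:-1])

    # Framing into overlapping frames of length frame_len with hop size hop
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop:i * hop + frame_len] for i in range(n_frames)])

    # Hamming window: W(n, a) = (1 - a) - a * cos(2*pi*n / (N - 1))
    n = np.arange(frame_len)
    window = (1 - ham_a) - ham_a * np.cos(2 * np.pi * n / (frame_len - 1))
    frames = frames * window

    # DFT of each frame (S_i(k)) and energy spectrum |S_i(k)|^2 / N
    spectrum = np.fft.rfft(frames, axis=1)
    energy = np.abs(spectrum) ** 2 / frame_len
    return energy
```

Calling preprocess on a 16 kHz waveform yields a (frames, frame_len/2 + 1) energy matrix from which both the MFCC (after mel filtering and a DCT) and the spectrogram can be derived.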
Further, the vector extraction module is used for extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal;
the vector extraction module inputs the low-level feature vector corresponding to the MFCC into a BiLSTM processor followed by a Flatten layer for deep learning, and extracts the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));
wherein x_MFCC represents the low-level feature vector of the MFCC and x_m represents the high-level feature vector corresponding to the MFCC;
the vector extraction module inputs the low-level feature vector corresponding to the spectrogram into an AlexNet processor followed by a Flatten layer for deep learning, and extracts the high-level feature vector corresponding to the spectrogram:
x_s = Flatten(AlexNet(x_Spec));
wherein x_Spec represents the low-level feature vector of the spectrogram and x_s represents the high-level feature vector corresponding to the spectrogram;
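For illustration, a minimal PyTorch sketch of the two acoustic branches just described (BiLSTM plus Flatten for the MFCC, AlexNet plus Flatten for the spectrogram) is given below; the hidden size, input shapes and the use of torchvision's AlexNet backbone are assumptions rather than values fixed by this embodiment.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class MFCCBranch(nn.Module):
    """x_m = Flatten(BiLSTM(x_MFCC)): BiLSTM over the MFCC frames, then a Flatten layer."""
    def __init__(self, n_mfcc=40, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.flatten = nn.Flatten()

    def forward(self, x_mfcc):              # x_mfcc: (batch, frames, n_mfcc)
        out, _ = self.bilstm(x_mfcc)        # (batch, frames, 2 * hidden)
        return self.flatten(out)            # x_m: (batch, frames * 2 * hidden)

class SpectrogramBranch(nn.Module):
    """x_s = Flatten(AlexNet(x_Spec)): AlexNet backbone over the spectrogram image, then Flatten."""
    def __init__(self):
        super().__init__()
        self.backbone = alexnet(weights=None)
        self.backbone.classifier = nn.Identity()   # keep only the convolutional feature extractor
        self.flatten = nn.Flatten()

    def forward(self, x_spec):              # x_spec: (batch, 3, 224, 224) spectrogram image
        return self.flatten(self.backbone(x_spec))   # x_s: (batch, 9216)
```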
the preprocessing module inputs the voice waveform into a Wav2Vec processor for preprocessing, and the vector extraction module performs deep learning on the preprocessed voice waveform through the Wav2Vec processor to extract the high-level feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);
wherein x_Wav represents the raw voice waveform and x_w represents the high-level feature vector corresponding to the voice waveform;
the preprocessing module inputs the voice text into a BERT processor for preprocessing, and the vector extraction module performs deep learning on the preprocessed voice text through the BERT processor to extract the high-level feature vector corresponding to the voice text:
x_t = BERT(x_Text);
wherein x_Text represents the voice text and x_t represents the high-level feature vector corresponding to the voice text.
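For illustration, a sketch of the waveform and text branches using the Hugging Face transformers library follows; the checkpoint names ("facebook/wav2vec2-base-960h", "bert-base-chinese") are illustrative assumptions, since the embodiment only names Wav2Vec and BERT.

```python
import torch
from transformers import (BertModel, BertTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Checkpoint names are illustrative; the embodiment only names Wav2Vec and BERT.
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
wav_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
bert = BertModel.from_pretrained("bert-base-chinese")
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def waveform_features(waveform, sr=16000):
    """x_w = Wav2Vec(x_Wav): frame-level hidden states of the raw waveform."""
    inputs = wav_extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        return wav2vec(**inputs).last_hidden_state     # (1, frames, 768)

def text_features(text):
    """x_t = BERT(x_Text): token-level hidden states of the transcript."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return bert(**inputs).last_hidden_state        # (1, tokens, 768)
```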
the vector extraction module decompresses and maps the high-level feature vector corresponding to the voice text to a fixed dimension N, and extracts the corresponding high-dimensional vector:
x_d = f_De(x_t);
wherein f_De(·) represents the decoder mapping function and x_d represents the high-dimensional vector of the voice text;
it can be understood that the vector extraction module includes a decoder module, which maps the low-dimensional high-level feature vector corresponding to the voice text to a higher dimension, thereby decompressing the feature information and extracting the corresponding high-dimensional vector;
the vector extraction module compresses and maps the high-level feature vector corresponding to the voice waveform to the fixed dimension N, and extracts the corresponding low-dimensional vector:
x_e = f_En(x_w);
wherein f_En(·) represents the encoder mapping function and x_e represents the low-dimensional vector of the voice waveform;
it can be understood that the vector extraction module further includes an encoder module, which maps the high-dimensional high-level feature vector corresponding to the voice waveform to a lower dimension, thereby compressing the feature information and extracting the corresponding low-dimensional vector;
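For illustration, a minimal sketch of the decoder f_De and encoder f_En as simple linear projections to a common dimension N is given below; the input widths and the value N = 512 are assumed, since the embodiment does not fix them.

```python
import torch.nn as nn

class TextDecoder(nn.Module):
    """f_De: decompress the lower-dimensional text vector x_t up to the fixed dimension N."""
    def __init__(self, in_dim=256, n_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, n_dim), nn.ReLU())

    def forward(self, x_t):
        return self.proj(x_t)      # x_d, the high-dimensional text vector

class WaveEncoder(nn.Module):
    """f_En: compress the higher-dimensional waveform vector x_w down to the fixed dimension N."""
    def __init__(self, in_dim=1024, n_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, n_dim), nn.ReLU())

    def forward(self, x_w):
        return self.proj(x_w)      # x_e, the low-dimensional waveform vector
```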
the vector extraction module performs enhanced feature extraction on the N-dimensional feature vectors corresponding to the voice text and the voice waveform through a cross-modal convolution attention mechanism, and obtains the query, key and value of a cross-modal single-head attention through adaptive weight assignment:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;
in the formula, W represents the weight matrix of each layer, b represents the bias coefficient, Q_w is obtained from the voice modality, and K_t and V_b are obtained from the text modality;
the embedded representation is extracted according to the cross-modal single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_t^T / √d)·V_b;
wherein d represents the dimension of the embedding;
the vector extraction module adds the embedded representation to x_t and extracts the fusion vector of the voice waveform and the voice text:
x′_t = LN(Attention(Q_w, K_t, V_b) + x_t);
in the formula, LN(·) is the layer normalization function and x′_t represents the fusion vector of the voice waveform and the voice text.
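For illustration, a minimal PyTorch sketch of the cross-modal single-head attention with the residual connection and layer normalization described above follows; it assumes the speech-side vector x_e, the text-side vector x_d and the text feature x_t have already been brought to a common shape (batch, sequence, dim), which the embodiment does not specify.

```python
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Single-head cross-modal attention: the query comes from the speech branch (x_e),
    the key/value from the text branch (x_d), followed by a residual connection to x_t
    and layer normalization. Assumes all inputs share the shape (batch, seq, dim)."""
    def __init__(self, dim=512):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)   # Q_w = W_w * x_e + b_w
        self.w_k = nn.Linear(dim, dim)   # K_t = W_t * x_d + b_t
        self.w_v = nn.Linear(dim, dim)   # V_b = W_b * x_d + b_b
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_e, x_d, x_t):
        q, k, v = self.w_q(x_e), self.w_k(x_d), self.w_v(x_d)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # Q K^T / sqrt(d)
        attended = torch.softmax(scores, dim=-1) @ v               # Attention(Q_w, K_t, V_b)
        return self.norm(attended + x_t)                           # x'_t = LN(Attention + x_t)
```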
Further, the emotion recognition module performs shape normalization on the high-level feature vectors of the MFCC and the spectrogram and on the fused feature vectors, and obtains the spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x′_w), f_FFN(x′_t));
wherein f_FFN(·) represents the forward propagation (feed-forward) module function, Concat(·) represents the vector concatenation function, and x_concat represents the vector splicing result;
emotion recognition is carried out according to the spliced vector x_concat, and the process satisfies:
ŷ = f_CLS(x_concat);
wherein f_CLS(·) represents the emotion recognition function and ŷ represents the emotion recognition result.
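For illustration, a minimal sketch of the shape normalization, concatenation and classification step follows; the per-branch feed-forward width, the branch dimensions and the number of emotion classes are assumed values not given by the embodiment.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Shape-normalize each branch with a feed-forward module f_FFN, concatenate,
    and classify: y_hat = f_CLS(x_concat)."""
    def __init__(self, dims=(256, 256, 512, 512), hidden=128, n_classes=4):
        super().__init__()
        # one f_FFN per branch: x_m, x_s, x'_w, x'_t
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims]
        )
        self.cls = nn.Linear(hidden * len(dims), n_classes)   # f_CLS

    def forward(self, x_m, x_s, x_w_fused, x_t_fused):
        branches = [f(x) for f, x in zip(self.ffn, (x_m, x_s, x_w_fused, x_t_fused))]
        x_concat = torch.cat(branches, dim=-1)                # Concat(...)
        return self.cls(x_concat)                             # emotion logits y_hat
```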
The foregoing description is only illustrative of the present invention and is not intended to limit its patent scope; any equivalent structure or equivalent process transformation, and any direct or indirect application in other related technical fields, is likewise included within the protection scope of the present invention.
Claims (10)
1. A multi-modal emotion recognition method, comprising the steps of:
s10: acquiring an MFCC, a spectrogram, a voice waveform and a voice text of a voice signal;
s20: acquiring high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
s30: fusing the MFCC, the spectrogram, the voice waveform and the advanced feature vector corresponding to the voice text;
s40: carrying out emotion recognition on the MFCC, the spectrogram and the advanced feature vector obtained after fusion in the step S30;
wherein MFCC represents mel-frequency cepstral coefficients.
2. The method of claim 1, wherein the step S10 further comprises the steps of:
s11: pre-emphasis processing is carried out on the voice signal;
s12: carrying out framing treatment on the voice signal subjected to pre-emphasis treatment;
s13: windowing the voice signal subjected to framing treatment;
in step S11, the pre-emphasis is applied through a high-pass filter, and the pre-emphasis process satisfies:
y(t) = x(t) - αx(t-1);
in the formula, t represents the current moment, α represents the filter coefficient, x(t) represents the input value of the voice signal, and y(t) represents the output value of the voice signal;
in step S13, the windowing is applied through a Hamming window function, and the windowing process satisfies:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
in the formula, N represents the window length of each frame of the voice signal, a represents the Hamming window coefficient of each frame of the voice signal, n indexes the sampling points of the window function over the framed voice signal, and W(n, a) represents the value of the Hamming window in the time domain;
s14: performing an energy distribution transformation on each windowed frame of the voice signal in the frequency domain to obtain the energy spectrum of the voice signal, wherein the energy distribution transformation satisfies:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^(-j2πkn/N), 1 ≤ k ≤ N;
in the formula, S_i(k) represents the complex value of the k-th frequency component of the voice signal in the frequency domain, s_i(n) represents the real value of the n-th sampling point in the time domain, k represents the index of the frequency component, and N represents the total number of sampling points.
3. The method for identifying multi-modal emotion according to claim 2, wherein said step S20 comprises the steps of:
s21: extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signal;
s22: extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text comprises the following steps:
inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor followed by a Flatten layer for deep learning, and obtaining the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));
wherein x_MFCC represents the low-level feature vector of the MFCC and x_m represents the high-level feature vector corresponding to the MFCC;
inputting the low-level feature vector corresponding to the spectrogram into an AlexNet processor followed by a Flatten layer for deep learning, and obtaining the high-level feature vector corresponding to the spectrogram:
x_s = Flatten(AlexNet(x_Spec));
wherein x_Spec represents the low-level feature vector of the spectrogram and x_s represents the high-level feature vector corresponding to the spectrogram;
inputting the voice waveform into a Wav2Vec processor for preprocessing, and performing deep learning on the preprocessed voice waveform through the Wav2Vec processor to obtain the high-level feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);
wherein x_Wav represents the raw voice waveform and x_w represents the high-level feature vector corresponding to the voice waveform;
inputting the voice text into a BERT processor for preprocessing, and performing deep learning on the preprocessed voice text through the BERT processor to obtain the high-level feature vector corresponding to the voice text:
x_t = BERT(x_Text);
wherein x_Text represents the voice text and x_t represents the high-level feature vector corresponding to the voice text.
4. A multi-modal emotion recognition method as claimed in claim 3, wherein said step S30 specifically includes the steps of:
performing vector fusion on the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform through linear layers with activation functions, the MFCC and spectrogram vectors being fused first:
x_ms = f_l(x_m ⊕ x_s);
wherein f_l(·) represents the linear layer function, x_m represents the high-level feature vector of the MFCC, x_s represents the high-level feature vector of the spectrogram, ⊕ represents vector concatenation, and x_ms represents the fusion vector output by the linear layer;
the fusion vector x_ms is then fused with the high-level feature vector x_w of the voice waveform through a linear layer with an activation function, the two being combined by matrix multiplication (⊗), yielding x′_w, the multi-acoustic fusion feature vector obtained after fusing the high-level feature vectors corresponding to the MFCC, the spectrogram and the voice waveform.
5. The multi-modal emotion recognition method of claim 4, wherein step S30 further includes the steps of:
decompressing and mapping the high-level feature vector corresponding to the voice text to a fixed dimension N to obtain the corresponding high-dimensional vector:
x_d = f_De(x_t);
wherein f_De(·) represents the decoder mapping function and x_d represents the high-dimensional vector of the voice text;
compressing and mapping the high-level feature vector corresponding to the voice waveform to the fixed dimension N to obtain the corresponding low-dimensional vector:
x_e = f_En(x_w);
wherein f_En(·) represents the encoder mapping function and x_e represents the low-dimensional vector of the voice waveform;
performing enhanced feature extraction on the N-dimensional feature vectors corresponding to the voice text and the voice waveform through a cross-modal convolution attention mechanism, and obtaining the query, key and value of the cross-modal single-head attention through adaptive weight assignment:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;
in the formula, W represents the weight matrix of each layer, b represents the bias coefficient, Q_w is obtained from the voice modality, and K_t and V_b are obtained from the text modality;
obtaining the embedded representation according to the cross-modal single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_t^T / √d)·V_b;
wherein d represents the dimension of the embedding;
adding the embedded representation to x_t to obtain the fusion vector of the voice waveform and the voice text:
x′_t = LN(Attention(Q_w, K_t, V_b) + x_t);
in the formula, LN(·) is the layer normalization function and x′_t represents the fusion vector of the voice waveform and the voice text.
6. The method of claim 5, wherein the step S40 further comprises the steps of:
performing shape normalization on the high-level feature vectors of the MFCC and the spectrogram and on the fused feature vectors to obtain a spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x′_w), f_FFN(x′_t));
wherein f_FFN(·) represents the forward propagation (feed-forward) module function, Concat(·) represents the vector concatenation function, and x_concat represents the vector splicing result;
performing emotion recognition according to the spliced vector x_concat, the process satisfying:
ŷ = f_CLS(x_concat);
wherein f_CLS(·) represents the emotion recognition function and ŷ represents the emotion recognition result.
7. A multi-modal emotion recognition system, comprising:
the acquisition module is used for acquiring the MFCC, the spectrogram, the voice waveform and the voice text of the voice signal;
the vector extraction module is used for extracting high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
the vector fusion module is used for fusing the high-level feature vectors corresponding to the MFCC, the spectrogram, the voice waveform and the voice text;
and the emotion recognition module is used for performing emotion recognition on the high-level feature vectors of the MFCC and the spectrogram together with the feature vectors fused by the vector fusion module.
8. The multi-modal emotion recognition system as recited in claim 7 further comprising a preprocessing module,
the preprocessing module is used for performing pre-emphasis, framing and windowing on the voice signal, the pre-emphasis is applied through a high-pass filter, and the pre-emphasis process satisfies:
y(t) = x(t) - αx(t-1);
in the formula, t represents the current moment, α represents the filter coefficient, x(t) represents the input value of the voice signal, and y(t) represents the output value of the voice signal;
the windowing is applied through a Hamming window function, and the windowing process satisfies:
W(n, a) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1;
in the formula, N represents the window length of each frame of the voice signal, a represents the Hamming window coefficient of each frame of the voice signal, n indexes the sampling points of the window function over the framed voice signal, and W(n, a) represents the value of the Hamming window in the time domain;
the preprocessing module performs an energy distribution transformation on each windowed frame of the voice signal in the frequency domain to obtain the energy spectrums corresponding to the MFCC and the spectrogram, the energy distribution transformation satisfying:
S_i(k) = Σ_{n=1}^{N} s_i(n)·e^(-j2πkn/N), 1 ≤ k ≤ N;
in the formula, S_i(k) represents the complex value of the k-th frequency component of the voice signal in the frequency domain, s_i(n) represents the real value of the n-th sampling point in the time domain, k represents the index of the frequency component, and N represents the total number of sampling points.
9. The multi-modal emotion recognition system of claim 8, comprising:
the vector extraction module is used for extracting low-level feature vectors corresponding to the MFCC and the spectrogram in the voice signals;
inputting the low-level feature vector corresponding to the MFCC into a BiLSTM processor followed by a Flatten layer for deep learning, and extracting the high-level feature vector corresponding to the MFCC:
x_m = Flatten(BiLSTM(x_MFCC));
wherein x_MFCC represents the low-level feature vector of the MFCC and x_m represents the high-level feature vector corresponding to the MFCC;
inputting the low-level feature vector corresponding to the spectrogram into an AlexNet processor followed by a Flatten layer for deep learning, and extracting the high-level feature vector corresponding to the spectrogram:
x_s = Flatten(AlexNet(x_Spec));
wherein x_Spec represents the low-level feature vector of the spectrogram and x_s represents the high-level feature vector corresponding to the spectrogram;
inputting the voice waveform into a Wav2Vec processor for preprocessing, performing deep learning on the preprocessed voice waveform through the Wav2Vec processor, and extracting the high-level feature vector corresponding to the voice waveform:
x_w = Wav2Vec(x_Wav);
wherein x_Wav represents the raw voice waveform and x_w represents the high-level feature vector corresponding to the voice waveform;
inputting the voice text into a BERT processor for preprocessing, performing deep learning on the preprocessed voice text through the BERT processor, and extracting the high-level feature vector corresponding to the voice text:
x_t = BERT(x_Text);
wherein x_Text represents the voice text and x_t represents the high-level feature vector corresponding to the voice text;
the vector extraction module decompresses and maps the high-level feature vector corresponding to the voice text to a fixed dimension N, and extracts the corresponding high-dimensional vector:
x_d = f_De(x_t);
wherein f_De(·) represents the decoder mapping function and x_d represents the high-dimensional vector of the voice text;
the vector extraction module compresses and maps the high-level feature vector corresponding to the voice waveform to the fixed dimension N, and extracts the corresponding low-dimensional vector:
x_e = f_En(x_w);
wherein f_En(·) represents the encoder mapping function and x_e represents the low-dimensional vector of the voice waveform;
the vector extraction module performs enhanced feature extraction on the N-dimensional feature vectors corresponding to the voice text and the voice waveform through a cross-modal convolution attention mechanism, and obtains the query, key and value of the cross-modal single-head attention through adaptive weight assignment:
Q_w = W_w·x_e + b_w;
K_t = W_t·x_d + b_t;
V_b = W_b·x_d + b_b;
in the formula, W represents the weight matrix of each layer, b represents the bias coefficient, Q_w is obtained from the voice modality, and K_t and V_b are obtained from the text modality;
the embedded representation is extracted according to the cross-modal single-head attention formula:
Attention(Q_w, K_t, V_b) = softmax(Q_w·K_t^T / √d)·V_b;
wherein d represents the dimension of the embedding;
the vector extraction module adds the embedded representation to x_t and extracts the fusion vector of the voice waveform and the voice text:
x′_t = LN(Attention(Q_w, K_t, V_b) + x_t);
in the formula, LN(·) is the layer normalization function and x′_t represents the fusion vector of the voice waveform and the voice text.
10. The multi-modal emotion recognition system of claim 9, comprising:
the emotion recognition module is used for performing shape normalization on the high-level feature vectors of the MFCC and the spectrogram and on the fused feature vectors to obtain a spliced vector:
x_concat = Concat(f_FFN(x_m), f_FFN(x_s), f_FFN(x′_w), f_FFN(x′_t));
wherein f_FFN(·) represents the forward propagation (feed-forward) module function, Concat(·) represents the vector concatenation function, and x_concat represents the vector splicing result;
emotion recognition is carried out according to the spliced vector x_concat, the process satisfying:
ŷ = f_CLS(x_concat);
wherein f_CLS(·) represents the emotion recognition function and ŷ represents the emotion recognition result.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310632363.6A | 2023-05-30 | 2023-05-30 | Multi-mode emotion recognition method and system

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310632363.6A | 2023-05-30 | 2023-05-30 | Multi-mode emotion recognition method and system

Publications (1)

Publication Number | Publication Date
---|---
CN116682463A | 2023-09-01

Family

ID=87783006

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310632363.6A (Pending) | Multi-mode emotion recognition method and system | 2023-05-30 | 2023-05-30

Country Status (1)

Country | Link
---|---
CN | CN116682463A (en)
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN116913258A * | 2023-09-08 | 2023-10-20 | 鹿客科技(北京)股份有限公司 | Speech signal recognition method, device, electronic equipment and computer readable medium
CN116913258B * | 2023-09-08 | 2023-11-24 | 鹿客科技(北京)股份有限公司 | Speech signal recognition method, device, electronic equipment and computer readable medium
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination