CN110534133B - Voice emotion recognition system and voice emotion recognition method - Google Patents

Voice emotion recognition system and voice emotion recognition method

Info

Publication number
CN110534133B
CN110534133B CN201910803429.7A
Authority
CN
China
Prior art keywords
module
speech
layer
time step
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910803429.7A
Other languages
Chinese (zh)
Other versions
CN110534133A (en)
Inventor
殷绪成
曹秒
杨春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Eeasy Electronic Tech Co ltd
Original Assignee
Zhuhai Eeasy Electronic Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Eeasy Electronic Tech Co ltd filed Critical Zhuhai Eeasy Electronic Tech Co ltd
Priority to CN201910803429.7A priority Critical patent/CN110534133B/en
Publication of CN110534133A publication Critical patent/CN110534133A/en
Application granted granted Critical
Publication of CN110534133B publication Critical patent/CN110534133B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a speech emotion recognition system comprising an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module, and an output module connected in sequence. The invention also discloses a speech emotion recognition method applied to the speech emotion recognition system, comprising the following steps: 1. preprocess the speech to obtain a spectrogram feature map; 2. operate on the spectrogram feature map to construct a feature map containing shallow audio information; 3. further process the feature map containing the shallow audio information to obtain deeper semantic and contextual information; 4. process the feature map carrying the deeper semantic and contextual information to obtain the feature vector most strongly correlated with the speaker's emotion over the whole utterance; 5. output the emotion category corresponding to the whole utterance. Compared with the prior art, the speech emotion recognition performance of the invention is greatly improved.

Description

Voice emotion recognition system and voice emotion recognition method
Technical Field
The invention relates to the technical field of artificial intelligence and speech recognition, in particular to a speech emotion recognition system and a speech emotion recognition method, which use an end-to-end deep neural network and are improvements built on DFSMN as the base network.
Background
With the continuous progress of speech recognition technology and the wide deployment of speech recognition devices, human-computer interaction has become increasingly common in daily life. However, most of these devices can only recognize the textual content of human language and cannot recognize the emotional state of the speaker. Speech emotion recognition has many useful applications in human-centric services and human-computer interaction, such as intelligent service robots, automated call centers, and distance education; it has therefore attracted considerable research attention, and many methods have been proposed. With the rapid development of machine learning (deep neural networks such as CNNs) in recent years, such methods have shown excellent performance after repeated attempts and improvements in many fields. How to apply deep learning to this field, however, is still being explored, and in practical applications many problems of this challenging task remain to be solved. Putting speech emotion recognition into practice is challenging: massive amounts of complex audio data must be collected for research, and how to make audio recorded in a clean environment closer to speech in real scenes is a major unsolved problem in the prior art.
A typical Speech Emotion Recognition (SER) system takes a speech waveform as input and outputs one of the target emotion classes. Conventional SER systems use Gaussian Mixture Models (GMMs) (Neiberg D, Elenius K, Laskowski K. Emotion recognition in spontaneous speech using GMMs [C]// Ninth International Conference on Spoken Language Processing, 2006.), Hidden Markov Models (HMMs) (Nwe T L, Foo S W, De Silva L C. Speech emotion recognition using hidden Markov models [J]. Speech Communication, 2003, 41(4): 603-623.), and Support Vector Machines (SVMs) (Yang N, Yuan J, Zhou Y, et al.; see also IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018: 2906-2910), all of which depend on mature hand-crafted speech features, and the choice of features has a great influence on model performance. These features typically include spectral, cepstral, pitch, and energy features of the frame-level speech signal. Statistical functionals of these features are then applied across multiple frames to obtain an utterance-level feature vector.
With the explosive development of deep learning techniques, some researchers have explored deep learning approaches to build more robust SER models. Zhang Z et al. (Zhang Z, Ringeval F, Han J, et al. Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with LSTM neural networks [C]// Proceedings INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association (ISCA). 2016: 3593-3597.) propose a feature-enhancement autoencoder based on long short-term memory (LSTM) neural networks for extracting emotion information from speech. Recurrent Neural Networks (RNNs) have indeed proven to have strong sequence modeling capabilities, especially in speech recognition tasks. However, RNN training relies on Backpropagation Through Time (BPTT), which, due to its computational complexity, causes long training times as well as vanishing and exploding gradients. To solve these problems, the Feedforward Sequential Memory Network (hereinafter abbreviated FSMN) was proposed. In recent years, a great deal of research has shown that FSMN can model long-term dependencies without any recurrent feedback in tasks such as speech recognition and language modeling. Furthermore, Zhang S et al. (Zhang S, Lei M, Yan Z, et al. Deep-FSMN for large vocabulary continuous speech recognition [C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 5869-5873.) apply a skip-connection structure to FSMN and greatly improve on the earlier model.
Research activity on SER dates back to the 1980s. SER remains challenging in practical applications due to factors such as gender, speaker, language, and changes in the recording environment. Many researchers have attempted to solve these problems by designing sophisticated hand-crafted speech features to strengthen the connection to human emotion. However, these manually extracted speech features are only suitable for specific tasks and generalize poorly. Different speech features therefore have to be designed for different speech-related tasks, which runs counter to the original intent of deep learning techniques.
Disclosure of Invention
In view of the shortcomings of the prior art, it is an object of the present invention to provide a speech emotion recognition system which is an end-to-end feedforward deep neural network structure.
In view of the defects of the prior art, another object of the present invention is to provide a speech emotion recognition method applied to a speech emotion recognition system.
In order to realize the purpose of the invention, the following technical scheme is adopted: a speech emotion recognition system comprising: the system comprises an audio preprocessing module, a CNN module, a pyramid FSMN module, a time step attention module and an output module which are connected in sequence; the CNN module is provided with a convolution layer, and the pyramid FSMN module is provided with a pyramid memory block structure;
the audio preprocessing module converts the received original audio data into a speech spectrum characteristic diagram;
the CNN module performs primary processing on the spectrum characteristic diagram to construct a characteristic diagram containing shallow information;
the pyramid FSMN module further processes the characteristic diagram containing the shallow information to acquire deeper semantic information and context information;
the time step attention module is used for attending to specific regions among the time steps and calculating the influence weight of each time step on the final emotion recognition;
the output module is provided with a plurality of emotion categories and is used for outputting the emotion category which is most matched with the original audio data.
The time step attention module may specifically be represented by the following formulas:

a_t = Average(h_t),

s = σ( W_2 · f(W_1 · a + b_1) + b_2 ),

y = X · s,

wherein a_t is the mean value of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, σ is the softmax activation function, W_1 is the weight parameter of the first layer in the time step attention module, W_2 is the weight parameter of the second layer in the time step attention module, b_1 is the bias parameter of the first layer in the time step attention module, b_2 is the bias parameter of the second layer in the time step attention module, f is an arbitrary activation function, and a is the feature vector formed by all a_t; y is the output of the output module and X is the input to the time step attention module.
When the convolutional layer performs a convolution operation with a kernel of size k and stride s, the output size of the convolutional layer is calculated by the following formulas:

W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

wherein W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
The pyramid memory block structure encodes, for each time step h_t, the N_1 forward (historical) time steps and the N_2 backward (future) time steps into a fixed-size representation, and then takes the sum of the two encodings as the current output, as shown by the following formula:

h̃_t = f( Σ_{i=0..N_1} a_i ⊙ h_{t-i} + Σ_{j=1..N_2} b_j ⊙ h_{t+j} ),

wherein h̃_t is the output of the memory module at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
the pyramid memory block structure may adopt skip connections, and the input-output relationship of a layer with skip connections is shown by the following formulas:

h̃_t^l = f( h̃_t^(l-1) + h_t^l + Σ_{i=0..N_1^l} a_i^l ⊙ h^l_{t-s_1·i} + Σ_{j=1..N_2^l} b_j^l ⊙ h^l_{t+s_2·j} ),

h_t^(l+1) = f( W^l · h̃_t^l + b^(l+1) ),

wherein h̃_t^l is the output of the memory block of the l-th layer at time t, f is an arbitrary activation function, h̃_t^(l-1) is the memory block output of layer l-1 at time t, h_t^l is the input of the memory block of the l-th layer at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, h^l_{t-s_1·i} is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, h^l_{t+s_2·j} is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^(l+1) is the output of the hidden layer of the (l+1)-th layer at time t, W^l is the weight parameter of the l-th layer memory block, and b^(l+1) is the bias of the (l+1)-th layer memory block.
There may be two convolutional layers, the shallow information may be audio loudness or frequency, and the plurality of emotion categories may be four emotion categories, which may be happy, sad, angry, and neutral.
In order to realize another purpose of the invention, the following technical scheme is adopted: a speech emotion recognition method applied to a speech emotion recognition system comprises the following steps:
step 1, an audio preprocessing module performs primary feature extraction and regularization operation on received voice to obtain a voice spectrum feature map, and inputs the voice spectrum feature map into a CNN module;
step 2, the CNN module performs convolution calculation operation on the received speech spectrum characteristic diagram to construct a speech spectrum characteristic diagram containing audio shallow layer information (such as audio loudness, frequency and the like);
step 3, the pyramid FSMN module further processes the speech spectrum feature map containing the audio shallow information, and obtains deeper semantic information and context information in the speech spectrum feature map through a pyramid memory block structure, such as the gender and emotion of a speaker contained in a section of speech;
step 4, the time step attention module processes the speech spectrum characteristic graph with deeper semantic information and context information, firstly, the attention scores of different time steps are calculated, then the scores are used for carrying out weighted summation on the whole speech spectrum characteristic graph in the time step dimension, and therefore the characteristic vector with the highest emotion correlation degree between the whole section of speech and the speaker is obtained; the time step attention module can be utilized to enable the model to pay more attention to the part related to the emotion of the speaker, so that the robustness of the model is improved;
step 5, the feature vector is input into the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e., the model classifies the predicted emotion; the output module is a fully connected layer, and its output is a feature vector of length 4.
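As a non-authoritative illustration of step 5, the minimal Python/PyTorch sketch below maps the length-4 output vector to an emotion category; the label ordering is a hypothetical choice, since the invention only fixes the four categories.

    import torch

    # Hypothetical label order; the patent only specifies four categories.
    EMOTIONS = ["happy", "sad", "angry", "neutral"]

    def decode_emotion(logits: torch.Tensor) -> str:
        """Pick the emotion whose dimension has the highest probability."""
        probs = torch.softmax(logits, dim=-1)      # length-4 vector -> probabilities
        return EMOTIONS[int(torch.argmax(probs))]

    print(decode_emotion(torch.tensor([0.1, 2.3, -0.5, 0.7])))  # -> "sad"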
The technical problems solved by the invention are as follows: the speech emotion recognition problem is addressed with deep learning techniques. A segment of speech contains a great deal of information, such as the gender of the speaker, background noise, the content of the speech, and the emotional state of the speaker, which makes speech emotion recognition difficult and challenging. Moreover, although deep-learning-based speech emotion recognition has been studied to some extent, most of the research is based on LSTM, which itself suffers from a huge number of parameters and difficult training; in summary, existing speech emotion recognition technology still faces many problems that have not been well solved.
The invention has the advantages and beneficial effects that:
1. In tasks such as speech recognition and language modeling, FSMN can model long-term dependencies without any recurrent feedback; based on these research results, the invention provides an end-to-end feedforward deep neural network for the speech emotion recognition task. With the LSTM eliminated, the recognition speed of the model is greatly improved and the training time is effectively reduced. Compared with traditional methods, the method does not use various hand-crafted audio features as model input but directly uses the original spectrogram as input, which contains more of the original speech information, so the generalization ability of the model is stronger. At the same time, the complexity of model construction is reduced, and different input features do not need to be designed for different models.
2. Unlike most deep-learning-based speech emotion recognition research, the method does not use a recurrent neural network or its variants as the base network, but adopts a standard feedforward fully connected neural network, namely DFSMN, as the base network, and proposes a pyramid-shaped memory block structure on top of DFSMN, so that the whole model is more robust and higher-level semantic information can be extracted as the network deepens.
3. The bottom of the model of the invention consists of 2 convolutional layers instead of directly using DFSMN layers; in addition, as the network deepens, a down-sampling method is adopted to extract more robust features, and the size of the features is significantly reduced.
4. To make the model pay more attention to emotion-related information without being disturbed by other information, the invention also proposes a time-step-based attention mechanism, applied to the output of the pyramid FSMN. With the attention mechanism, each element in the output sequence depends on specific elements in the input sequence; this increases the computational burden of the model but yields a more accurate, better-performing model. The effectiveness of the end-to-end network is verified on the IEMOCAP speech emotion dataset.
5. The end-to-end deep neural network structure provided by the invention, together with the methods designed for each problem, operates effectively; experiments verify this well, showing that the model is 3.3 times faster than the original model, and analysis confirms that speech emotion recognition performance is greatly improved.
Drawings
Fig. 1 is a block diagram of a speech emotion recognition system, in which pFSMN is a pyramid FSMN.
Fig. 2a is a block diagram of the FSMN.
Fig. 2b is a block diagram of DFSMN.
FIG. 3 is a block diagram of a time step attention module.
FIG. 4 is a flow chart of a speech emotion recognition method.
Detailed Description
Examples
The present invention will be further described with reference to the following embodiments.
As shown in FIG. 1, a speech emotion recognition system includes: the system comprises an audio preprocessing module, a CNN module, a pyramid FSMN module, a time step attention module and an output module which are connected in sequence; the CNN module has a convolution layer, and the pyramid FSMN module has a pyramid memory block structure.
The speech emotion recognition system in this embodiment is an end-to-end feedforward deep neural network structure; the invention improves on the base network of the classic FSMN and DFSMN structures and adds convolutional layers to the base network to perform lower-level feature extraction.
The audio data of the invention is in the mainstream wav format with a sampling rate of 16,000 Hz. The original audio data is further divided into frames and Fourier transformed, with a frame length of 25 ms and a frame shift of 10 ms. Through this preliminary processing, the audio data is converted into a 2-dimensional spectrogram feature used as the model input. In detail, referring to fig. 1, the model is based on 2 convolutional layers instead of using DFSMN layers directly. In addition, as the network deepens, a down-sampling method is adopted to extract more robust features and significantly reduce their size, and this module clearly improves the accuracy of the model. When performing a convolution (or pooling) operation using a kernel of size k and stride s, the output size can be calculated by the following formulas:
W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

wherein W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
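The output-size formulas above can be wrapped in a small helper; the sketch below is only illustrative (the 201×300 input and the 5×5 kernel with stride 2 are assumed values, not the patented configuration):

    def conv_output_size(w_in: int, h_in: int, k: int, s: int):
        """W_out = (W_in - k) // s + 1, H_out = (H_in - k) // s + 1 (no padding)."""
        return (w_in - k) // s + 1, (h_in - k) // s + 1

    # Example: a 201 x 300 spectrogram through a 5 x 5 kernel with stride 2.
    print(conv_output_size(201, 300, k=5, s=2))  # -> (99, 148)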
As shown in FIG. 2a, FSMN is a standard feedforward fully connected neural network that adds an additional memory module to a hidden layer. Using a tapped-delay structure, for each time step h_t it encodes the N_1 forward (historical) time steps and the N_2 backward (future) time steps into a fixed-size representation and then takes their sum as the current output, as shown in the following formula:

h̃_t = f( Σ_{i=0..N_1} a_i ⊙ h_{t-i} + Σ_{j=1..N_2} b_j ⊙ h_{t+j} ),

wherein h̃_t is the output of the memory module at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step.
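A minimal PyTorch sketch of such a bidirectional memory block is given below, assuming ReLU for the arbitrary activation f and elementwise (vector) tap weights; it follows the formula above rather than the exact patented layer:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FSMNMemoryBlock(nn.Module):
        """Scalar-FSMN-style memory: weighted sum of N1 past and N2 future steps."""
        def __init__(self, dim: int, n1: int, n2: int):
            super().__init__()
            self.n1, self.n2 = n1, n2
            self.a = nn.Parameter(torch.randn(n1 + 1, dim) * 0.01)  # taps for h_t ... h_{t-N1}
            self.b = nn.Parameter(torch.randn(n2, dim) * 0.01)      # taps for h_{t+1} ... h_{t+N2}

        def forward(self, h):                         # h: (batch, time, dim)
            T = h.size(1)
            out = torch.zeros_like(h)
            for i in range(self.n1 + 1):              # past taps h_{t-i}, i = 0..N1
                out = out + self.a[i] * F.pad(h, (0, 0, i, 0))[:, :T]
            for j in range(1, self.n2 + 1):           # future taps h_{t+j}, j = 1..N2
                out = out + self.b[j - 1] * F.pad(h, (0, 0, 0, j))[:, j:]
            return torch.relu(out)                    # f(.) taken to be ReLU here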
In order to increase the depth of the FSMN, as shown in fig. 2b, and unlike the original FSMN architecture, DFSMN removes the direct feedforward connections between hidden layers and takes only the memory module output as the input of the next layer; at the same time, skip connections are introduced to overcome the vanishing and exploding gradient problems. The relationship between input and output is shown in the following formulas:

h̃_t^l = f( h̃_t^(l-1) + h_t^l + Σ_{i=0..N_1^l} a_i^l ⊙ h^l_{t-s_1·i} + Σ_{j=1..N_2^l} b_j^l ⊙ h^l_{t+s_2·j} ),

h_t^(l+1) = f( W^l · h̃_t^l + b^(l+1) ),

wherein h̃_t^l is the output of the memory block of the l-th layer at time t, f is an arbitrary activation function, h̃_t^(l-1) is the memory block output of layer l-1 at time t, h_t^l is the input of the memory block of the l-th layer at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, h^l_{t-s_1·i} is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, h^l_{t+s_2·j} is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^(l+1) is the output of the hidden layer of the (l+1)-th layer at time t, W^l is the weight parameter of the l-th layer memory block, and b^(l+1) is the bias of the (l+1)-th layer memory block.
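Extending the sketch above, one possible (assumed, not authoritative) PyTorch reading of a DFSMN-style layer with strided taps, a skip connection from the previous layer's memory output, and a linear projection is:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DFSMNLayer(nn.Module):
        """One DFSMN-style layer: strided memory taps, skip connection, projection."""
        def __init__(self, dim: int, n1: int, n2: int, s1: int = 1, s2: int = 1):
            super().__init__()
            self.n1, self.n2, self.s1, self.s2 = n1, n2, s1, s2
            self.a = nn.Parameter(torch.randn(n1 + 1, dim) * 0.01)
            self.b = nn.Parameter(torch.randn(n2, dim) * 0.01)
            self.proj = nn.Linear(dim, dim)           # W^l and b^{l+1}

        def forward(self, h, prev_mem=None):          # h: (batch, time, dim)
            T = h.size(1)
            mem = h.clone()                           # h_t^l term
            if prev_mem is not None:                  # skip connection from layer l-1
                mem = mem + prev_mem
            for i in range(self.n1 + 1):              # past taps at stride s1
                d = i * self.s1
                mem = mem + self.a[i] * F.pad(h, (0, 0, d, 0))[:, :T]
            for j in range(1, self.n2 + 1):           # future taps at stride s2
                d = j * self.s2
                mem = mem + self.b[j - 1] * F.pad(h, (0, 0, 0, d))[:, d:]
            mem = torch.relu(mem)                     # memory output of this layer
            return torch.relu(self.proj(mem)), mem    # hidden output, memory for next layer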
In the FSMN and DFSMN described above, the length of the memory module is the same in every layer, which means that in the above formulas the same N_2^l and s_2 are used in all hidden layers. In such a memory block structure, the bottom layers already extract the context information around a specific time step t, and their outputs carry that context, so repeating the same long-term context at the top layers only introduces unnecessary information. The invention proposes a pyramid memory block structure in which the model extracts more context information at deeper layers by increasing N_2^l and s_2 with depth. As a result, the bottom layers extract features from fine-grained information such as speaking rate and rhythm, while the top layers extract features from higher-level information such as emotion and gender; the pyramid memory block structure improves accuracy while reducing the number of parameters.
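Reusing the DFSMNLayer sketch above, the pyramid idea can be expressed as a stack whose backward order and stride grow with depth; the concrete orders and strides below are assumptions for illustration, not the patented settings:

    import torch
    import torch.nn as nn

    class PyramidFSMN(nn.Module):
        """Stack of DFSMN-style layers with N2 and s2 increasing with depth."""
        def __init__(self, dim: int, n1s=(4, 8, 16, 32), n2s=(4, 8, 16, 32), s2s=(1, 1, 2, 2)):
            super().__init__()
            self.layers = nn.ModuleList(
                DFSMNLayer(dim, n1, n2, s1=1, s2=s2)
                for n1, n2, s2 in zip(n1s, n2s, s2s)
            )

        def forward(self, h):                          # h: (batch, time, dim)
            mem = None
            for layer in self.layers:
                h, mem = layer(h, mem)                 # deeper layers see wider context
            return h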
The present invention also adds an attention mechanism, applied to the output of the pyramid FSMN, by which each element in the output sequence depends on specific elements in the input sequence. This increases the computational burden of the model but yields a more accurate, better-performing model. In most implementations, attention is realized as a weight vector (usually the output of a softmax function) whose dimension equals the length of the input sequence. In this embodiment, a segment of speech is divided into many small segments, referred to as time steps in the neural network. Obviously, when a speech segment contains long stretches of silence, not every time step is useful for the SER task, so the model needs to focus on specific regions. On this basis, a time step attention module is constructed, as shown in fig. 3, and it can be described by the following formulas:
a_t = Average(h_t),

s = σ( W_2 · f(W_1 · a + b_1) + b_2 ),

y = X · s,

wherein a_t is the mean value of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, σ is the softmax activation function, W_1 is the weight parameter of the first layer in the time step attention module, W_2 is the weight parameter of the second layer in the time step attention module, b_1 is the bias parameter of the first layer in the time step attention module, b_2 is the bias parameter of the second layer in the time step attention module, f is an arbitrary activation function, and a is the feature vector formed by all a_t; y is the output of the output module and X is the input to the time step attention module.
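A minimal PyTorch sketch of this time-step attention, under the assumption that the number of time steps T is fixed (so that W_1 and W_2 can act on the length-T vector a) and that f is ReLU, could look like this:

    import torch
    import torch.nn as nn

    class TimeStepAttention(nn.Module):
        """a_t = mean(h_t); s = softmax(W2 f(W1 a + b1) + b2); y = sum_t s_t x_t."""
        def __init__(self, num_steps: int, hidden: int = 64):
            super().__init__()
            self.fc1 = nn.Linear(num_steps, hidden)    # W1, b1
            self.fc2 = nn.Linear(hidden, num_steps)    # W2, b2

        def forward(self, x):                          # x: (batch, T, dim)
            a = x.mean(dim=-1)                         # per-time-step mean, (batch, T)
            s = torch.softmax(self.fc2(torch.relu(self.fc1(a))), dim=-1)
            y = torch.bmm(s.unsqueeze(1), x).squeeze(1)   # weighted sum over time, (batch, dim)
            return y, s

    # Usage: pooled, weights = TimeStepAttention(num_steps=100)(torch.randn(8, 100, 128))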
Finally, the output of the whole network is obtained through a fully connected layer, and the optimization target of the model is the standard cross-entropy loss function. The length of the model's output vector matches the number of emotion categories, the value at each position of the output vector gives the probability of the corresponding emotion category, and the category with the highest probability is selected as the output.
As shown in fig. 4, the speech emotion recognition process in this embodiment specifically includes the following steps:
step 1, an audio preprocessing module performs primary feature extraction and regularization operation on received voice to obtain a voice spectrum feature map, and inputs the voice spectrum feature map into a CNN module;
step 2, the CNN module performs convolution calculation operation on the received speech spectrum characteristic graph to construct a speech spectrum characteristic graph containing audio shallow layer information;
step 3, the pyramid FSMN module further processes the speech spectrum feature map containing the audio shallow information, and obtains deeper semantic information and context information in the speech spectrum feature map through a pyramid memory block structure, such as the gender and emotion of a speaker contained in a section of speech;
step 4, the time step attention module processes the speech spectrum characteristic graph with deeper semantic information and context information, firstly, the attention scores of different time steps are calculated, then the scores are used for carrying out weighted summation on the whole speech spectrum characteristic graph in the time step dimension, and therefore the characteristic vector with the highest emotion correlation degree between the whole section of speech and the speaker is obtained; the time step attention module can be utilized to enable the model to pay more attention to the part related to the emotion of the speaker, so that the robustness of the model is improved;
step 5, the feature vector is input into the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e., the model classifies the predicted emotion; the output module is a fully connected layer, and its output is a feature vector of length 4.
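Putting the five steps together, a hedged end-to-end sketch (reusing the PyramidFSMN and TimeStepAttention sketches above; all channel counts, kernel sizes, and the fixed 201x300 input are illustrative assumptions) might look like:

    import torch
    import torch.nn as nn

    class SERModelSketch(nn.Module):
        """CNN front end -> pyramid FSMN -> time-step attention -> FC of length 4."""
        def __init__(self, freq_bins: int = 201, time_steps: int = 300, dim: int = 128):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.BatchNorm2d(16), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.BatchNorm2d(32), nn.ReLU(),
            )
            f = ((freq_bins - 5) // 2 + 1 - 5) // 2 + 1     # frequency bins after the two convs
            t = ((time_steps - 5) // 2 + 1 - 5) // 2 + 1    # time steps after the two convs
            self.proj = nn.Linear(32 * f, dim)
            self.pfsmn = PyramidFSMN(dim)                   # sketch defined earlier
            self.attn = TimeStepAttention(num_steps=t)      # fixed-length input assumed
            self.fc = nn.Linear(dim, 4)                     # four emotion classes

        def forward(self, spec):                            # spec: (batch, 1, freq, time)
            z = self.cnn(spec)                              # (batch, 32, f, t)
            z = z.permute(0, 3, 1, 2).flatten(2)            # (batch, t, 32 * f)
            z = self.proj(z)
            z = self.pfsmn(z)
            y, _ = self.attn(z)
            return self.fc(y)                               # logits; train with cross-entropy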
According to the technical scheme, after the LSTM is eliminated, the recognition speed of the model is greatly improved and the training time is effectively reduced. In addition, unlike traditional speech emotion recognition systems, the method does not use hand-crafted features as model input but directly uses the original spectrogram as input, which contains more of the original information, so the generalization ability of the model is stronger. To make the model focus more on emotion-related information without being disturbed by other information, the invention proposes a time-step-based attention mechanism and integrates it into the model.
(1) Data sets used by the invention;
the SER model of the present invention was evaluated using an IEMOCAP corpus, which contains several sessions in each of which two participants communicate to present a particular type of emotion. These utterances are classified as anger, fear, excitement, neutrality, disgust, surprise, sadness, happiness, depression, others, and XXX. XXX is the case where the annotator cannot agree on the tag. In this example, only 5 classes of anger, excitement, happiness, neutrality and sadness were selected, and the total number of voices used was 5531. To balance the number of samples per emotion category, happy and excited emotions are merged into a happy category. In addition, 10% of the total data is randomly selected as a test object, the rest data is used as training data, and 10% of the training data is used as verification data to check whether the test object needs to be stopped in advance.
The corpus has both video and audio channels; the invention uses only the audio data. The audio was captured with a high-quality microphone (Schoeps CMIT 5U) at a sampling rate of 48 kHz. It is downsampled to 16 kHz, and a 201-dimensional acoustic feature is extracted. Unlike other technical solutions, this embodiment uses only the spectrogram as input, extracted with a 25 ms window and a 10 ms hop (100 fps). Meanwhile, each whole utterance is normalized.
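One possible front end matching these numbers (a sketch assuming librosa; the choice of log magnitude and the 201 bins via n_fft = 400 are our reading of "201-dimensional", not a statement of the patented pipeline) is:

    import numpy as np
    import librosa

    def log_spectrogram(wav_path: str) -> np.ndarray:
        """16 kHz audio, 25 ms window, 10 ms hop, 201 bins, per-utterance normalization."""
        y, sr = librosa.load(wav_path, sr=16000)
        spec = np.abs(librosa.stft(y, n_fft=400, hop_length=160, win_length=400))
        spec = np.log(spec + 1e-6)                      # log-magnitude spectrogram, (201, frames)
        return (spec - spec.mean()) / (spec.std() + 1e-6)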
(2) Describing a test process;
using the pytorech framework as a training tool, the network architecture is shown in fig. 1, two 5 × 5conv layers are used in front, and the hidden layer and memory block of the 4 FSMN blocks have 256 and 128 nodes, respectively. In order to avoid overfitting, a batch normalization layer is arranged behind the CNN layer and the pFSMN layer, the time sequence is 4 to 32, the step length is 1 to 2, the model of the embodiment is trained on the basis of an Adam optimizer of a Pythrch, the batch size is set to 32, the learning rate is fixed to 0.003, preset 4470 pieces of training audio data are used for iterative training, the effect of the model on the verification set is tested at the same time in each iteration, and the training is stopped in advance when the identification accuracy on the verification set is unchanged for 3 continuous iteration rounds. All experiments were performed on a workstation at station 1 NVIDIATITAN XP.
(3) Testing results;
to measure the performance of the system, the overall accuracy (weighted accuracy, WA) and the average recall rate (unweighted accuracy, UA) of the different mood categories for the test sample were calculated, as well as the corresponding recall rate for each category.
The test results show that, compared with the LSTM, the improved sequence model performance improves by 2.47%, indicating that the FSMN has better sequence modeling performance in this task. HSF-CRNN (Luo D, Zou Y, Huang D. Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition [C]. Proc. Interspeech 2018, 2018: 152-156.) is an improved CNN-plus-RNN method proposed by Luo that uses hand-crafted speech features as input; the model of this embodiment achieves absolute improvements of 0.53% and 3.99% on UA and WA, respectively. The experiments prove that useful information can be extracted automatically from the spectrogram without using the commonly used hand-crafted speech features. The invention also builds a baseline C-BiLSTM model for comparison; its accuracy on the "sad" samples is better than that of other methods, but its recognition accuracy on the other categories is much worse. To illustrate the working principle of the attention mechanism, a C-pFSMN model is built that is identical to the proposed model except that the attention mechanism is removed. The results show that, compared with C-pFSMN, the attention mechanism proposed by the invention performs well in the SER task, with an absolute UA improvement of 6.3%; in addition, the front-end CNN layers can extract more complex features, so the model performance improves as expected.
C-BiLSTM is constructed from 2 CNN layers and 2 Bi-LSTM layers, with 256 nodes in the hidden layer. It is similar to the model of this embodiment and is widely applied to sequence modeling tasks, so the computational resources of C-BiLSTM were also compared with those of the method of the present invention. The results show that the model of the invention has only 1.85M parameters and a training time of 64 minutes, much faster than the C-BiLSTM model. This means the invention achieves better performance while requiring fewer computing resources.
The technical scheme of the invention solves the speech emotion recognition problem well, greatly improves recognition speed, and effectively reduces training time. In addition, unlike traditional speech emotion recognition systems, the method does not use hand-crafted features as model input but directly uses the original spectrogram as input, which contains more of the original information, so the generalization ability of the model is stronger. To make the model focus more on emotion-related information without being disturbed by other information, the invention proposes a time-step-based attention mechanism and incorporates it into the model; experiments show that the model of the invention performs well while requiring fewer computing resources.
The above detailed description is specific to possible embodiments of the present invention, and the embodiments are not intended to limit the scope of the present invention, and all equivalent implementations or modifications that do not depart from the scope of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A speech emotion recognition system, comprising:
the audio frequency preprocessing module, the CNN module, the pyramid FSMN module, the time step attention module and the output module are connected in sequence, and the CNN module is provided with a convolution layer;
the audio preprocessing module converts the received original audio data into a speech spectrum characteristic diagram;
the CNN module performs primary processing on the spectrum characteristic diagram to construct a characteristic diagram containing shallow information;
the pyramid FSMN module further processes the characteristic diagram containing the shallow information to acquire deeper semantic information and context information;
the time step attention module is used for paying attention to a specific area in a time step and calculating influence weights of different time step lengths on final emotion recognition;
the output module is provided with a plurality of emotion categories and is used for outputting the emotion category which best matches the original audio data; wherein,
when the convolutional layer performs a convolution operation with a kernel of size k and stride s, the output size of the convolutional layer is calculated by the following formulas:

W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

wherein W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel;
the pyramid FSMN module has a pyramid memory block structure; the pyramid memory block structure encodes, for each time step h_t, the N_1 forward time steps and the N_2 backward time steps into a fixed-size representation, and then takes the sum of the two encodings as the current output, as shown by the following formula:

h̃_t = f( Σ_{i=0..N_1} a_i ⊙ h_{t-i} + Σ_{j=1..N_2} b_j ⊙ h_{t+j} ),

wherein h̃_t is the output of the memory module at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
the pyramid memory block structure adopts skip connections, and the input-output relationship of a layer with skip connections is shown by the following formulas:

h̃_t^l = f( h̃_t^(l-1) + h_t^l + Σ_{i=0..N_1^l} a_i^l ⊙ h^l_{t-s_1·i} + Σ_{j=1..N_2^l} b_j^l ⊙ h^l_{t+s_2·j} ),

h_t^(l+1) = f( W^l · h̃_t^l + b^(l+1) ),

wherein h̃_t^l is the output of the memory block of the l-th layer at time t, f is an arbitrary activation function, h̃_t^(l-1) is the memory block output of layer l-1 at time t, h_t^l is the input of the memory block of the l-th layer at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, h^l_{t-s_1·i} is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, h^l_{t+s_2·j} is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^(l+1) is the output of the hidden layer of the (l+1)-th layer at time t, W^l is the weight parameter of the l-th layer memory block, and b^(l+1) is the bias of the (l+1)-th layer memory block.
2. The speech emotion recognition system of claim 1, wherein the time step attention module is specifically represented by the following formulas:

a_t = Average(h_t),

s = σ( W_2 · f(W_1 · a + b_1) + b_2 ),

y = X · s,

wherein a_t is the mean value of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, σ is the softmax activation function, W_1 is the weight parameter of the first layer in the time step attention module, W_2 is the weight parameter of the second layer in the time step attention module, b_1 is the bias parameter of the first layer in the time step attention module, b_2 is the bias parameter of the second layer in the time step attention module, f is an arbitrary activation function, and a is the feature vector formed by all a_t; y is the output of the output module and X is the input to the time step attention module.
3. The speech emotion recognition system of any one of claims 1 to 2, wherein the convolutional layer is two layers.
4. The speech emotion recognition system of any one of claims 1 to 2, wherein the shallow information is audio loudness or frequency.
5. The speech emotion recognition system of any one of claims 1 to 2, wherein the plurality of emotion categories are four emotion categories.
6. The speech emotion recognition system of claim 5, wherein the four emotion categories are happiness, sadness, anger, and neutrality.
7. A speech emotion recognition method applied to the speech emotion recognition system of claim 1, characterized by comprising the steps of:
step 1, an audio preprocessing module performs primary feature extraction and regularization operation on received voice to obtain a voice spectrum feature map, and inputs the voice spectrum feature map into a CNN module;
step 2, the CNN module performs convolution calculation operation on the received speech spectrum characteristic graph to construct a speech spectrum characteristic graph containing audio shallow layer information;
step 3, the pyramid FSMN module further processes the speech spectrum characteristic diagram containing the audio shallow layer information, and obtains semantic information and context information of a deeper layer in the speech spectrum characteristic diagram through a pyramid memory block structure;
step 4, the time step attention module processes the speech spectrum characteristic graph with deeper semantic information and context information, firstly, the attention scores of different time steps are calculated, then the scores are used for carrying out weighted summation on the whole speech spectrum characteristic graph in the time step dimension, and therefore the characteristic vector with the highest emotion correlation degree between the whole section of speech and the speaker is obtained;
and 5, inputting the feature vector into an output module, wherein the dimensionality of the feature vector represents the probability of the corresponding emotion type, and the emotion type corresponding to the dimensionality with the highest probability is taken as a final output result, so that the emotion type corresponding to the whole voice is output.
8. The method of claim 7, wherein in step 5, the output module is a fully-connected layer, and the fully-connected layer outputs a feature vector with a length of 4.
CN201910803429.7A 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method Active CN110534133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910803429.7A CN110534133B (en) 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910803429.7A CN110534133B (en) 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method

Publications (2)

Publication Number Publication Date
CN110534133A CN110534133A (en) 2019-12-03
CN110534133B true CN110534133B (en) 2022-03-25

Family

ID=68664896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910803429.7A Active CN110534133B (en) 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method

Country Status (1)

Country Link
CN (1) CN110534133B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143567B (en) * 2019-12-30 2023-04-07 成都数之联科技股份有限公司 Comment emotion analysis method based on improved neural network
CN111539458B (en) * 2020-04-02 2024-02-27 咪咕文化科技有限公司 Feature map processing method and device, electronic equipment and storage medium
CN112053007B (en) * 2020-09-18 2022-07-26 国网浙江兰溪市供电有限公司 Distribution network fault first-aid repair prediction analysis system and method
CN112634947B (en) * 2020-12-18 2023-03-14 大连东软信息学院 Animal voice and emotion feature set sequencing and identifying method and system
CN113255800B (en) * 2021-06-02 2021-10-15 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090063202A (en) * 2009-05-29 2009-06-17 포항공과대학교 산학협력단 Method for apparatus for providing emotion speech recognition
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090063202A (en) * 2009-05-29 2009-06-17 포항공과대학교 산학협력단 Method for apparatus for providing emotion speech recognition
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DEEP-FSMN FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION; Shiliang Zhang et al.; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018-09-30; pp. 5869-5873 *
Research on multimodal emotion recognition methods based on deep learning; Zhang Yuanyuan; China Master's Theses Full-text Database, Information Science and Technology; 2019-08-15 (No. 08); pp. 23-25 *
Speech emotion recognition algorithm based on spectrograms for extracting deep spatial attention features; Wang Jinhua et al.; Telecommunications Science; 2019-03-18; Vol. 35 (No. 07); pp. 100-108 *

Also Published As

Publication number Publication date
CN110534133A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN110534133B (en) Voice emotion recognition system and voice emotion recognition method
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
CN112784798B (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
Wang et al. Data augmentation using deep generative models for embedding based speaker recognition
Sultana et al. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
Han et al. Speech emotion recognition with a ResNet-CNN-Transformer parallel neural network
Liu et al. Learning salient features for speech emotion recognition using CNN
Kaur et al. Speech recognition system; challenges and techniques
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
Barkur et al. EnsembleWave: an ensembled approach for automatic speech emotion recognition
Ashrafidoost et al. Recognizing Emotional State Changes Using Speech Processing
Utomo et al. Spoken word and speaker recognition using MFCC and multiple recurrent neural networks
CN117115312B (en) Voice-driven facial animation method, device, equipment and medium
Djeffal et al. Noise-Robust Speech Recognition: A Comparative Analysis of LSTM and CNN Approaches
Kaewprateep et al. Evaluation of small-scale deep learning architectures in Thai speech recognition
CN114863939B (en) Panda attribute identification method and system based on sound
Kalita et al. Use of Bidirectional Long Short Term Memory in Spoken Word Detection with reference to the Assamese language
Mirhassani et al. Fuzzy decision fusion of complementary experts based on evolutionary cepstral coefficients for phoneme recognition
Song et al. Speech emotion recognition and intensity estimation
Ma et al. Fine-grained Dynamical Speech Emotion Analysis Utilizing Networks Customized for Acoustic Data
Ashrafidoost et al. A Method for Modelling and Simulation the Changes Trend of Emotions in Human Speech
Saini et al. Audio emotion recognition using machine learning
Zhang et al. Spoken emotion recognition using radial basis function neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant