CN110534133B - Voice emotion recognition system and voice emotion recognition method - Google Patents
- Publication number
- CN110534133B (application CN201910803429.7A)
- Authority
- CN
- China
- Prior art keywords
- module
- speech
- layer
- time step
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention discloses a speech emotion recognition system comprising an audio preprocessing module, a CNN module, a pyramid FSMN module, a time step attention module and an output module connected in sequence. The invention also discloses a speech emotion recognition method applied to the speech emotion recognition system, which comprises the following steps: 1. preprocess the speech to obtain a spectrogram feature map; 2. operate on the spectrogram feature map to construct a spectrogram feature map containing shallow audio information; 3. further process the spectrogram feature map containing shallow audio information to obtain deeper semantic information and context information; 4. process the spectrogram feature map carrying the deeper semantic and context information to obtain the feature vector most strongly correlated with the speaker's emotion over the whole utterance; 5. output the emotion category corresponding to the whole utterance. Compared with the prior art, the speech emotion recognition performance of the invention is greatly improved.
Description
Technical Field
The invention relates to the technical field of artificial intelligence and speech recognition, and in particular to a speech emotion recognition system and a speech emotion recognition method based on an end-to-end deep neural network that improves upon DFSMN as its basic network.
Background
With the continuous progress of speech recognition technology and the wide deployment of speech recognition devices, human-computer interaction has become increasingly common in daily life. However, most of these devices can only recognize the textual content of human language and cannot recognize the emotional state of the speaker. Speech emotion recognition has many useful applications in human-centric services and human-computer interaction, such as intelligent service robots, automated call centers and distance education; it has therefore attracted considerable research attention, and many methods have been proposed. With the rapid development of machine learning (in particular deep neural networks such as CNNs) in recent years, such methods have shown excellent performance through repeated attempts and improvements in many fields. How to apply deep learning to this field, however, is still being explored, and in practical applications many problems of this challenging task remain to be solved. Putting speech emotion recognition into practice is challenging: massive amounts of complex audio data must be collected for research, and making audio recorded in a clean environment resemble speech in real scenes is a major open problem for the prior art.
A typical Speech Emotion Recognition (SER) system takes a speech waveform as input and outputs one of the target emotion classes. Conventional SER systems use Gaussian mixture models (GMMs) (Neiberg D, Elenius K, Laskowski K. Emotion recognition in spontaneous speech using GMMs [C]// Ninth International Conference on Spoken Language Processing, 2006), hidden Markov models (HMMs) (Nwe T L, Foo S W, De Silva L C. Speech emotion recognition using hidden Markov models [J]. Speech Communication, 2003, 41(4): 603-623), support vector machines (SVMs) (Yang N, Yuan J, Zhou Y, et al.) and deep neural network classifiers (IEEE ICASSP 2018: 2906-2910), all of which depend on mature hand-crafted speech features, and the choice of features has a great influence on model performance. These features typically include spectral, cepstral, pitch and energy features of the frame-level speech signal. Statistical functionals of these features are then applied across multiple frames, resulting in an utterance-level feature vector.
With the explosive development of deep learning techniques, some researchers have explored deep learning approaches to build more robust SER models. Zhang Z et al. (Zhang Z, Ringeval F, Han J, et al. Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with LSTM neural networks [C]// Proceedings of INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association (ISCA), 2016: 3593-3597) proposed an autoencoder-based feature enhancement approach built on long short-term memory (LSTM) neural networks for extracting emotion information from speech. Recurrent neural networks (RNNs) have indeed proven to have strong sequence-modeling capability, especially in speech recognition tasks. However, RNN training relies on backpropagation through time (BPTT), which, due to its computational complexity, is time-consuming and prone to gradient vanishing and explosion. To address these problems, the feedforward sequential memory network (hereinafter FSMN) was proposed. In recent years, a great deal of research has shown that FSMN can model long-term dependencies without any recurrent feedback in tasks such as speech recognition and language modeling. Furthermore, Zhang S et al. (Zhang S, Lei M, Yan Z, et al. Deep-FSMN for large vocabulary continuous speech recognition [C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 5869-5873) applied a skip-connection structure to FSMN and achieved a substantial improvement over the earlier model.
Research on SER dates back to the 1980s. SER is still challenging in practical applications due to factors such as gender, speaker, language and changes in the recording environment. Many researchers have attempted to solve these problems by designing sophisticated hand-crafted speech features that strengthen the connection to human emotion. However, such manually extracted speech features are only suitable for specific tasks and generalize poorly. Different speech features must therefore be designed for different speech-related tasks, which runs contrary to the original intent of deep learning techniques.
Disclosure of Invention
In view of the shortcomings of the prior art, it is an object of the present invention to provide a speech emotion recognition system which is an end-to-end feedforward deep neural network structure.
In view of the defects of the prior art, another object of the present invention is to provide a speech emotion recognition method applied to a speech emotion recognition system.
In order to realize the purpose of the invention, the following technical scheme is adopted: a speech emotion recognition system comprising: the system comprises an audio preprocessing module, a CNN module, a pyramid FSMN module, a time step attention module and an output module which are connected in sequence; the CNN module is provided with a convolution layer, and the pyramid FSMN module is provided with a pyramid memory block structure;
the audio preprocessing module converts the received original audio data into a spectrogram feature map;
the CNN module performs preliminary processing on the spectrogram feature map to construct a feature map containing shallow information;
the pyramid FSMN module further processes the feature map containing the shallow information to extract deeper semantic information and context information;
the time step attention module attends to specific regions across time steps and calculates the influence weight of each time step on the final emotion recognition;
the output module is provided with a plurality of emotion categories and outputs the emotion category that best matches the original audio data.
The time step attention module may specifically be represented by the following formulas:
a_t = Average(h_t),
s = softmax(W_2·f(W_1·a + b_1) + b_2),
y = X·s,
where a_t is the mean value of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, softmax is the softmax activation function, W_1 is the weight parameter of the first layer in the time step attention module, W_2 is the weight parameter of the second layer, b_1 is the bias parameter of the first layer, b_2 is the bias parameter of the second layer, f is an arbitrary activation function, and a is the feature vector constructed from all a_t; y is the output of the output module and X is the input to the time step attention module.
When the convolutional layer performs a convolution operation with a kernel of size k and stride s, the output of the convolutional layer is calculated by the following formulas:
W_out = (W_in − k)/s + 1,
H_out = (H_in − k)/s + 1,
where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, k is the convolution kernel size, and s is the stride of the convolution kernel; H_out is the height of the output feature map and H_in is the height of the input feature map.
The pyramid memory block structure encodes, for each time step h_t, a window of N_1 forward (past) time steps and N_2 backward (future) time steps into a fixed-size representation, and the weighted sum over the N_1 and N_2 steps is taken as the current output, as shown by the following formula:
h̃_t = f( Σ_{i=0}^{N_1} a_i·h_{t−i} + Σ_{j=1}^{N_2} b_j·h_{t+j} ),
where h̃_t is the output of the memory block at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t−i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
the pyramid memory block structure may adopt skip connections, whose input-output relationship is shown by the following formulas:
p̃_t^l = H(p̃_t^{l−1}) + p_t^l + Σ_{i=0}^{N_1^l} a_i^l·p_{t−s_1·i}^l + Σ_{j=1}^{N_2^l} b_j^l·p_{t+s_2·j}^l,
h_t^{l+1} = f(W^l·p̃_t^l + b^{l+1}),
where p̃_t^l is the output of the memory block at time t in layer l, H is an arbitrary activation function, p̃_t^{l−1} is the memory block output at time t in layer l−1, p_t^l is the input of the memory block at time t in layer l, N_1^l is the forward time-step order of layer l, a_i^l is the weight of the i-th forward time step of layer l, p_{t−s_1·i}^l is the i-th forward time step of layer l, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of layer l, p_{t+s_2·j}^l is the j-th backward time step of layer l, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer at time t in layer l+1, W^l is the weight parameter of the layer-l memory block, and b^{l+1} is the bias of the layer-l memory block.
The convolutional layer may be two layers, the shallow information may be audio loudness or frequency, and the plurality of emotion categories may be four emotion categories, which may be happy, sad, angry, and neutral.
In order to realize another purpose of the invention, the following technical scheme is adopted: a speech emotion recognition method applied to a speech emotion recognition system comprises the following steps:
step 1, the audio preprocessing module performs preliminary feature extraction and regularization on the received speech to obtain a spectrogram feature map, and inputs it into the CNN module;
step 2, the CNN module performs convolution operations on the received spectrogram feature map to construct a spectrogram feature map containing shallow audio information;
step 3, the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and obtains, through the pyramid memory block structure, the deeper semantic information and context information in the spectrogram feature map;
step 4, the time step attention module processes the spectrogram feature map carrying the deeper semantic and context information: it first calculates the attention scores of the different time steps and then uses these scores to perform a weighted sum over the time-step dimension of the whole spectrogram feature map, obtaining the feature vector most strongly correlated with the speaker's emotion over the whole utterance;
step 5, the feature vector is input into the output module, where each dimension of the feature vector represents the probability of the corresponding emotion category, and the emotion category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e. the model outputs its predicted emotion class; the output module is a fully connected layer whose output is a feature vector of length 4.
The technical problems solved by the invention are as follows. The speech emotion recognition problem is addressed with deep learning. A segment of speech contains a great deal of information, such as the gender of the speaker, background noise, the spoken content and the emotional state of the speaker, which makes speech emotion recognition difficult and challenging. Moreover, although speech emotion recognition based on deep learning has been studied to some extent, most of that research is based on LSTM, which itself suffers from problems such as a huge number of parameters and difficult training. In summary, existing speech emotion recognition technology still faces many problems that have not been solved well.
The invention has the advantages and beneficial effects that:
1. In tasks such as speech recognition and language modeling, FSMN can model long-term dependencies without any recurrent feedback. Building on these research results, the invention provides an end-to-end feedforward deep neural network for the speech emotion recognition task. With LSTM eliminated, the recognition speed of the model is greatly improved and the training time is effectively reduced. Unlike traditional methods, the invention does not use various manually extracted audio features as model input but directly uses the original spectrogram, which contains more of the original speech information, giving the model stronger generalization ability. At the same time, the complexity of model construction is reduced, since different input features no longer need to be designed for different models.
2. Unlike most deep-learning-based speech emotion recognition research, the invention does not use a recurrent neural network or its variants as the basic network; instead it adopts a standard feedforward fully connected neural network, DFSMN, as the basic network and proposes a pyramid-shaped memory block structure on top of it, making the whole model more robust and able to extract higher-level semantic information as the network deepens.
3. The bottom of the model consists of 2 convolutional layers instead of using DFSMN layers directly; in addition, as the network deepens, down-sampling is used to extract more robust features while significantly reducing the feature size.
4. To make the model pay more attention to emotion-related information without being disturbed by other information, the invention also provides a time-step-based attention mechanism applied to the output of the pyramid FSMN. With this attention mechanism, each element of the output sequence depends on specific elements of the input sequence; this increases the computational cost of the model but yields a more accurate, better-performing model. The validity of the end-to-end network is verified on the IEMOCAP speech emotion dataset.
5. The end-to-end deep neural network structure provided by the invention, together with the methods designed for each of the above problems, operates effectively. Experiments confirm this: the model is 3.3 times faster than the original model, and analysis and verification show that the speech emotion recognition performance is greatly improved.
Drawings
Fig. 1 is a block diagram of a speech emotion recognition system, in which pFSMN is a pyramid FSMN.
Fig. 2a is a block diagram of the FSMN.
Fig. 2b is a block diagram of DFSMN.
FIG. 3 is a block diagram of a time step attention module.
FIG. 4 is a flow chart of a speech emotion recognition method.
Detailed Description
Examples
The present invention will be further described with reference to the following embodiments.
As shown in FIG. 1, a speech emotion recognition system includes: the system comprises an audio preprocessing module, a CNN module, a pyramid FSMN module, a time step attention module and an output module which are connected in sequence; the CNN module has a convolution layer, and the pyramid FSMN module has a pyramid memory block structure.
The speech emotion recognition system in this embodiment is an end-to-end feedforward deep neural network. The invention improves upon the basic networks of the classic FSMN and DFSMN structures and adds convolutional layers to the basic network to perform lower-level feature extraction.
The audio data of the invention is in the mainstream WAV format with a sampling frequency of 16000 Hz. The raw audio is split into frames and Fourier-transformed, with a frame length of 25 ms and a frame shift of 10 ms. Through this preliminary processing, the audio data is converted into a 2-dimensional spectrogram feature used as the model input. In detail, referring to fig. 1, the model is based on 2 convolutional layers instead of using DFSMN layers directly. In addition, as the network deepens, down-sampling is used to extract more robust features while significantly reducing the feature size; this module significantly improves the accuracy of the model. When performing a convolution (pooling) operation using a kernel of size k and stride s, the output size can be calculated by the following formulas:
W_out = (W_in − k)/s + 1,
H_out = (H_in − k)/s + 1,
where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, k is the convolution kernel size, and s is the stride of the convolution kernel; H_out is the height of the output feature map and H_in is the height of the input feature map.
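As a concrete illustration of the output-size formula and the two-layer convolutional front end described above, the following is a minimal PyTorch sketch (not taken from the patent); the channel counts, the 5×5 kernels and the stride of 2 are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_output_size(w_in: int, h_in: int, k: int, s: int):
    """Apply W_out = (W_in - k)/s + 1 and H_out = (H_in - k)/s + 1 (no padding)."""
    return (w_in - k) // s + 1, (h_in - k) // s + 1

# Two-layer CNN front end over a 1-channel spectrogram feature map.
cnn_front_end = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, stride=2),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5, stride=2),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

x = torch.randn(8, 1, 300, 201)               # (batch, channel, time, frequency)
print(conv_output_size(201, 300, k=5, s=2))   # (frequency, time) size after one conv: (99, 148)
print(cnn_front_end(x).shape)                 # torch.Size([8, 64, 72, 48])
```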
As shown in fig. 2a, FSMN is a standard feedforward fully connected neural network that adds an extra memory block to a hidden layer. Using a tapped-delay structure, for each time step h_t it encodes a window of N_1 forward (past) time steps and N_2 backward (future) time steps into a fixed-size representation and takes their weighted sum as the current output, as shown by the following formula:
h̃_t = f( Σ_{i=0}^{N_1} a_i·h_{t−i} + Σ_{j=1}^{N_2} b_j·h_{t+j} ),
where h̃_t is the output of the memory block at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t−i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step.
To allow greater depth, as shown in fig. 2b, DFSMN differs from the original FSMN architecture in that it removes the direct feedforward connections between hidden layers and takes only the memory block output as input; at the same time it introduces skip connections to overcome the gradient vanishing and explosion problems. The input-output relationship is shown by the following formulas:
p̃_t^l = H(p̃_t^{l−1}) + p_t^l + Σ_{i=0}^{N_1^l} a_i^l·p_{t−s_1·i}^l + Σ_{j=1}^{N_2^l} b_j^l·p_{t+s_2·j}^l,
h_t^{l+1} = f(W^l·p̃_t^l + b^{l+1}),
where p̃_t^l is the output of the memory block at time t in layer l, H is an arbitrary activation function, p̃_t^{l−1} is the memory block output at time t in layer l−1, p_t^l is the input of the memory block at time t in layer l, N_1^l is the forward time-step order of layer l, a_i^l is the weight of the i-th forward time step of layer l, p_{t−s_1·i}^l is the i-th forward time step of layer l, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of layer l, p_{t+s_2·j}^l is the j-th backward time step of layer l, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer at time t in layer l+1, W^l is the weight parameter of the layer-l memory block, and b^{l+1} is the bias of the layer-l memory block.
In the FSMN and DFSMN described above, the memory block order is the same in every layer, meaning that in the above formulas all hidden layers share the same N_1, N_2, s_1 and s_2. In such a memory block structure the bottom layers already extract the context information around a specific time step t, so the higher layers repeatedly model the same long-term dependencies and introduce unnecessary information. The invention therefore proposes a pyramid memory block structure, in which the model extracts more context information at deeper layers by increasing N_1, N_2, s_1 and s_2 with depth. In this way the bottom layers extract features from fine-grained information such as speaking rate and rhythm, while the top layers extract features from higher-level information such as emotion and gender. The pyramid memory block structure improves accuracy and reduces the number of parameters.
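The following is a minimal PyTorch sketch of such a bidirectional memory block with configurable look-back/look-ahead orders and strides; stacking blocks whose orders grow with depth gives the pyramid arrangement described above. The tap initialization, the ReLU activation, the simple residual wiring and the particular orders/strides are illustrative assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class MemoryBlock(nn.Module):
    """FSMN-style memory block: a learned, strided, per-dimension weighted sum
    over n_back past and n_ahead future time steps."""
    def __init__(self, dim, n_back, n_ahead, s_back=1, s_ahead=1):
        super().__init__()
        self.n_back, self.n_ahead = n_back, n_ahead
        self.s_back, self.s_ahead = s_back, s_ahead
        self.a = nn.Parameter(torch.randn(n_back + 1, dim) * 0.01)  # taps i = 0..N1
        self.b = nn.Parameter(torch.randn(n_ahead, dim) * 0.01)     # taps j = 1..N2

    def forward(self, h):                       # h: (batch, time, dim)
        T = h.size(1)
        out = torch.zeros_like(h)
        for i in range(self.n_back + 1):        # past taps (i = 0 is the current frame)
            shift = i * self.s_back
            if shift >= T:
                break
            out[:, shift:, :] = out[:, shift:, :] + self.a[i] * h[:, :T - shift, :]
        for j in range(1, self.n_ahead + 1):    # future taps
            shift = j * self.s_ahead
            if shift >= T:
                break
            out[:, :T - shift, :] = out[:, :T - shift, :] + self.b[j - 1] * h[:, shift:, :]
        return torch.relu(out)                  # stand-in for the arbitrary activation f

# Pyramid arrangement: memory orders and strides grow with depth (values are assumptions).
pyramid = nn.ModuleList([
    MemoryBlock(128, n_back=4,  n_ahead=4,  s_back=1, s_ahead=1),
    MemoryBlock(128, n_back=16, n_ahead=16, s_back=2, s_ahead=2),
    MemoryBlock(128, n_back=32, n_ahead=32, s_back=2, s_ahead=2),
])
x = torch.randn(2, 100, 128)                    # (batch, time, dim)
for blk in pyramid:
    x = x + blk(x)                              # simplified skip connection around each block
print(x.shape)
```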
The invention also adds an attention mechanism, applied to the output of the pyramid FSMN, through which each element of the output sequence depends on specific elements of the input sequence. This increases the computational cost of the model but yields a more accurate, better-performing model. In most implementations, attention is realized as a weight vector (usually the output of a softmax function) whose dimension equals the length of the input sequence. In this embodiment, a segment of speech is divided into many small segments, referred to as time steps in the neural network. Clearly, when a speech segment contains many silent portions, not every time step is useful for the SER task, so the model needs to focus on specific regions. On this basis, a time step attention module is constructed, as shown in fig. 3, which can be described by the following formulas:
a_t = Average(h_t),
s = softmax(W_2·f(W_1·a + b_1) + b_2),
y = X·s,
where a_t is the mean value of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, softmax is the softmax activation function, W_1 and b_1 are the weight and bias parameters of the first layer in the time step attention module, W_2 and b_2 are the weight and bias parameters of the second layer, f is an arbitrary activation function, and a is the feature vector constructed from all a_t; y is the output of the output module and X is the input to the time step attention module.
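A minimal PyTorch sketch of the time step attention module described by the formulas above, assuming a fixed number of time steps (so that W_1 can act on the vector a of all a_t); the hidden width and the choice of tanh for the arbitrary activation f are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeStepAttention(nn.Module):
    """Score each time step from its feature mean, then pool the sequence
    with the softmax-normalised scores: a_t = mean(h_t), s = softmax(W2 f(W1 a + b1) + b2), y = X s."""
    def __init__(self, n_steps: int, hidden: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(n_steps, hidden)    # W_1, b_1
        self.fc2 = nn.Linear(hidden, n_steps)    # W_2, b_2

    def forward(self, x):                        # x: (batch, time, dim), time == n_steps
        a = x.mean(dim=-1)                       # a_t = Average(h_t) -> (batch, time)
        s = F.softmax(self.fc2(torch.tanh(self.fc1(a))), dim=-1)  # attention weights over time
        y = torch.bmm(s.unsqueeze(1), x).squeeze(1)               # y = X·s -> (batch, dim)
        return y, s

attn = TimeStepAttention(n_steps=148)
pooled, weights = attn(torch.randn(8, 148, 128))
print(pooled.shape, weights.shape)               # torch.Size([8, 128]) torch.Size([8, 148])
```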
Finally, the output of the whole network is obtained through a fully connected layer, and the training objective of the model is the standard cross-entropy loss. The length of the model's output vector matches the number of emotion categories, the value at each position of the output vector gives the probability of the corresponding emotion category, and the category with the highest probability is selected as the output.
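A small sketch of this output stage (the feature dimension and batch size are placeholder assumptions): a fully connected layer maps the pooled feature vector to the 4 emotion classes, cross-entropy is used for training, and the argmax gives the predicted emotion.

```python
import torch
import torch.nn as nn

feat_dim, n_classes = 128, 4            # illustrative dimensions
output_layer = nn.Linear(feat_dim, n_classes)
criterion = nn.CrossEntropyLoss()

pooled = torch.randn(8, feat_dim)       # stand-in for the attention module output
labels = torch.randint(0, n_classes, (8,))
logits = output_layer(pooled)
loss = criterion(logits, labels)        # cross-entropy training objective
pred = logits.argmax(dim=-1)            # emotion category with the highest probability
```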
As shown in fig. 4, the speech emotion recognition process in this embodiment specifically includes the following steps:
step 1, the audio preprocessing module performs preliminary feature extraction and regularization on the received speech to obtain a spectrogram feature map, and inputs it into the CNN module;
step 2, the CNN module performs convolution operations on the received spectrogram feature map to construct a spectrogram feature map containing shallow audio information;
step 3, the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and obtains, through the pyramid memory block structure, the deeper semantic information and context information in the spectrogram feature map;
step 4, the time step attention module processes the spectrogram feature map carrying the deeper semantic and context information: it first calculates the attention scores of the different time steps and then uses these scores to perform a weighted sum over the time-step dimension of the whole spectrogram feature map, obtaining the feature vector most strongly correlated with the speaker's emotion over the whole utterance;
step 5, the feature vector is input into the output module, where each dimension of the feature vector represents the probability of the corresponding emotion category, and the emotion category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e. the model outputs its predicted emotion class; the output module is a fully connected layer whose output is a feature vector of length 4.
With the above technical solution, after LSTM is eliminated, the recognition speed of the model is greatly improved and the training time is effectively reduced. In addition, unlike traditional speech emotion recognition systems, the method does not use hand-crafted features as model input but directly uses the original spectrogram, which contains more of the original information, so the generalization ability of the model is stronger. To make the model focus more on emotion-related information without interference from other information, the invention proposes a time-step-based attention mechanism and integrates it into the model.
(1) Data sets used by the invention;
the SER model of the present invention was evaluated using an IEMOCAP corpus, which contains several sessions in each of which two participants communicate to present a particular type of emotion. These utterances are classified as anger, fear, excitement, neutrality, disgust, surprise, sadness, happiness, depression, others, and XXX. XXX is the case where the annotator cannot agree on the tag. In this example, only 5 classes of anger, excitement, happiness, neutrality and sadness were selected, and the total number of voices used was 5531. To balance the number of samples per emotion category, happy and excited emotions are merged into a happy category. In addition, 10% of the total data is randomly selected as a test object, the rest data is used as training data, and 10% of the training data is used as verification data to check whether the test object needs to be stopped in advance.
The corpus provides both video and audio channels; the invention uses only the audio data. The audio was captured with a high-quality microphone (Schoeps CMIT 5U) at a sampling rate of 48 kHz. It is down-sampled to 16 kHz and a 201-dimensional spectral feature is extracted. Unlike other technical solutions, this embodiment uses only the spectrogram as input; the extraction uses a 25 ms window with a 10 ms hop (100 fps). The speech data is also normalized over the whole utterance.
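A sketch of this spectrogram front end using torchaudio, assuming a 400-sample (25 ms) window and 160-sample (10 ms) hop at 16 kHz, which yields 201 frequency bins per frame; the use of a log-power spectrogram and the exact normalisation are assumptions rather than details given in the patent.

```python
import torch
import torchaudio

def utterance_spectrogram(path: str) -> torch.Tensor:
    """Load an utterance, resample to 16 kHz, compute a 201-bin spectrogram
    with a 25 ms window and 10 ms hop, and normalise over the whole utterance."""
    wav, sr = torchaudio.load(path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)     # down-sample to 16 kHz
    spec = torchaudio.transforms.Spectrogram(
        n_fft=400, win_length=400, hop_length=160, power=2.0)(wav)
    spec = spec.clamp(min=1e-10).log()                           # log-power spectrogram (assumption)
    spec = (spec - spec.mean()) / (spec.std() + 1e-5)            # whole-utterance normalisation
    return spec.squeeze(0).transpose(0, 1)                       # (time, 201)
```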
(2) Describing a test process;
The PyTorch framework is used as the training tool. The network architecture is shown in fig. 1: two 5×5 convolutional layers are used at the front, and the hidden layers and memory blocks of the 4 FSMN blocks have 256 and 128 nodes, respectively. To avoid overfitting, batch normalization layers are placed after the CNN layers and the pFSMN layers. The memory-block time-step orders range from 4 to 32 and the strides from 1 to 2. The model of this embodiment is trained with PyTorch's Adam optimizer, the batch size is set to 32, and the learning rate is fixed at 0.003. The preset 4470 training audio clips are used for iterative training; at each iteration the model is also evaluated on the validation set, and training is stopped early when the recognition accuracy on the validation set does not change for 3 consecutive iterations. All experiments were performed on a workstation with 1 NVIDIA TITAN XP.
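A sketch of a training loop under the settings above (Adam, learning rate 0.003, batch size 32 set in the data loaders, early stopping after 3 epochs without improvement on the validation set); the epoch cap, device handling and data-loader interface are assumptions.

```python
import torch

def train(model, train_loader, val_loader, device="cuda"):
    """Train with Adam (lr=0.003) and cross-entropy; stop early when validation
    accuracy has not improved for 3 consecutive epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
    criterion = torch.nn.CrossEntropyLoss()
    best_acc, patience = 0.0, 0
    for epoch in range(100):                       # epoch cap is an assumption
        model.train()
        for spec, label in train_loader:
            spec, label = spec.to(device), label.to(device)
            optimizer.zero_grad()
            loss = criterion(model(spec), label)
            loss.backward()
            optimizer.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for spec, label in val_loader:
                pred = model(spec.to(device)).argmax(dim=-1).cpu()
                correct += (pred == label).sum().item()
                total += label.numel()
        acc = correct / total
        patience = 0 if acc > best_acc else patience + 1
        best_acc = max(best_acc, acc)
        if patience >= 3:                          # early stopping criterion
            break
```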
(3) Testing results;
to measure the performance of the system, the overall accuracy (weighted accuracy, WA) and the average recall rate (unweighted accuracy, UA) of the different mood categories for the test sample were calculated, as well as the corresponding recall rate for each category.
The test results show that, compared with LSTM, the improved sequence model improves performance by 2.47%, indicating that FSMN performs better as a sequence model in this task. HSF-CRNN (Luo D, Zou Y, Huang D. Investigation on joint representation learning for robust feature extraction in speech emotion recognition [J]. Proc. Interspeech 2018, 2018: 152-156) is an improved CNN-plus-RNN method proposed by Luo that uses hand-crafted speech features as input; the model of this embodiment achieves absolute improvements of 0.53% on UA and 3.99% on WA over it, and the experiments prove that useful information can be extracted automatically from the spectrogram without the commonly used hand-crafted speech features. The invention also builds a baseline C-BiLSTM model for comparison; its accuracy on the 'sad' samples is better than the other methods, but its recognition accuracy on the other categories is much worse. To illustrate how the attention mechanism works, a C-pFSMN model is built that is identical to the model of the invention except that the attention mechanism is removed. The results show that, compared with C-pFSMN, the attention mechanism proposed by the invention performs well in the SER task, improving UA by 6.3% absolute; in addition, the front-end CNN layers can extract more complex features, so the model performance improves as expected.
C-BiLSTM is constructed from 2 CNN layers and 2 Bi-LSTM layers with 256 nodes in the hidden layer. It is similar to the model of this embodiment and is widely applied to sequence modeling tasks. The computational resources of C-BiLSTM were therefore also compared with the method of the invention. The results show that the model of the invention has only 1.85M parameters and a training time of 64 minutes, much faster than the C-LSTM model. This means the invention achieves better performance while requiring fewer computing resources.
The technical solution of the invention solves the speech emotion recognition problem well, greatly increases the recognition speed, and effectively reduces the training time. In addition, unlike traditional speech emotion recognition systems, the method does not use hand-crafted features as model input but directly uses the original spectrogram, which contains more of the original information, so the generalization ability of the model is stronger. To make the model focus more on emotion-related information without interference from other information, the invention proposes a time-step-based attention mechanism and integrates it into the model; experiments show that the model of the invention performs well and requires fewer computing resources.
The above detailed description is specific to possible embodiments of the present invention, and the embodiments are not intended to limit the scope of the present invention, and all equivalent implementations or modifications that do not depart from the scope of the present invention are intended to be included within the scope of the present invention.
Claims (8)
1. A speech emotion recognition system, comprising:
the audio frequency preprocessing module, the CNN module, the pyramid FSMN module, the time step attention module and the output module are connected in sequence, and the CNN module is provided with a convolution layer;
the audio preprocessing module converts the received original audio data into a spectrogram feature map;
the CNN module performs preliminary processing on the spectrogram feature map to construct a feature map containing shallow information;
the pyramid FSMN module further processes the feature map containing the shallow information to extract deeper semantic information and context information;
the time step attention module attends to specific regions across time steps and calculates the influence weight of each time step on the final emotion recognition;
the output module is provided with a plurality of emotion categories and outputs the emotion category that best matches the original audio data; wherein,
when the convolutional layer performs a convolution operation with a kernel having a size of k and a step size of s, the output of the convolutional layer is calculated by the following formula:
W_out = (W_in − k)/s + 1,
H_out = (H_in − k)/s + 1,
where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, k is the convolution kernel size, and s is the stride of the convolution kernel; H_out is the height of the output feature map and H_in is the height of the input feature map;
the pyramid FSMN module has a pyramid memory block structure; the pyramid memory block structure encodes, for each time step h_t, a window of N_1 forward time steps and N_2 backward time steps into a fixed-size representation, and the weighted sum over the N_1 and N_2 steps is taken as the current output, as shown by the following formula:
h̃_t = f( Σ_{i=0}^{N_1} a_i·h_{t−i} + Σ_{j=1}^{N_2} b_j·h_{t+j} ),
where h̃_t is the output of the memory block at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t−i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
the pyramid memory block structure adopts skip connections, whose input-output relationship is shown by the following formulas:
p̃_t^l = H(p̃_t^{l−1}) + p_t^l + Σ_{i=0}^{N_1^l} a_i^l·p_{t−s_1·i}^l + Σ_{j=1}^{N_2^l} b_j^l·p_{t+s_2·j}^l,
h_t^{l+1} = f(W^l·p̃_t^l + b^{l+1}),
where p̃_t^l is the output of the memory block at time t in layer l, H is an arbitrary activation function, p̃_t^{l−1} is the memory block output at time t in layer l−1, p_t^l is the input of the memory block at time t in layer l, N_1^l is the forward time-step order of layer l, a_i^l is the weight of the i-th forward time step of layer l, p_{t−s_1·i}^l is the i-th forward time step of layer l, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of layer l, p_{t+s_2·j}^l is the j-th backward time step of layer l, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer at time t in layer l+1, W^l is the weight parameter of the layer-l memory block, and b^{l+1} is the bias of the layer-l memory block.
2. The speech emotion recognition system of claim 1, wherein the time step attention module is specifically represented by the following formula:
a_t = Average(h_t),
s = softmax(W_2·f(W_1·a + b_1) + b_2),
y = X·s,
where a_t is the mean value of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, softmax is the softmax activation function, W_1 and b_1 are the weight and bias parameters of the first layer in the time step attention module, W_2 and b_2 are the weight and bias parameters of the second layer, f is an arbitrary activation function, and a is the feature vector constructed from all a_t; y is the output of the output module and X is the input to the time step attention module.
3. The speech emotion recognition system of any one of claims 1 to 2, wherein the convolutional layer is two layers.
4. The speech emotion recognition system of any one of claims 1 to 2, wherein the shallow information is audio loudness or frequency.
5. The speech emotion recognition system of any one of claims 1 to 2, wherein the plurality of emotion categories are four emotion categories.
6. The speech emotion recognition system of claim 5, wherein the four emotion categories are happy, sad, angry and neutral.
7. A speech emotion recognition method applied to the speech emotion recognition system of claim 1, characterized by comprising the steps of:
step 1, an audio preprocessing module performs preliminary feature extraction and regularization on received speech to obtain a spectrogram feature map, and inputs the spectrogram feature map into a CNN module;
step 2, the CNN module performs convolution operations on the received spectrogram feature map to construct a spectrogram feature map containing shallow audio information;
step 3, the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and obtains, through a pyramid memory block structure, the deeper semantic information and context information in the spectrogram feature map;
step 4, the time step attention module processes the spectrogram feature map carrying the deeper semantic and context information: it first calculates the attention scores of the different time steps and then uses these scores to perform a weighted sum over the time-step dimension of the whole spectrogram feature map, thereby obtaining the feature vector most strongly correlated with the speaker's emotion over the whole utterance;
and step 5, the feature vector is input into an output module, where each dimension of the feature vector represents the probability of the corresponding emotion category, and the emotion category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole speech is output.
8. The method of claim 7, wherein in step 5, the output module is a fully-connected layer, and the fully-connected layer outputs a feature vector with a length of 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910803429.7A CN110534133B (en) | 2019-08-28 | 2019-08-28 | Voice emotion recognition system and voice emotion recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910803429.7A CN110534133B (en) | 2019-08-28 | 2019-08-28 | Voice emotion recognition system and voice emotion recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110534133A (en) | 2019-12-03
CN110534133B (en) | 2022-03-25
Family
ID=68664896
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910803429.7A Active CN110534133B (en) | 2019-08-28 | 2019-08-28 | Voice emotion recognition system and voice emotion recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110534133B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111143567B (en) * | 2019-12-30 | 2023-04-07 | 成都数之联科技股份有限公司 | Comment emotion analysis method based on improved neural network |
CN111539458B (en) * | 2020-04-02 | 2024-02-27 | 咪咕文化科技有限公司 | Feature map processing method and device, electronic equipment and storage medium |
CN112053007B (en) * | 2020-09-18 | 2022-07-26 | 国网浙江兰溪市供电有限公司 | Distribution network fault first-aid repair prediction analysis system and method |
CN112634947B (en) * | 2020-12-18 | 2023-03-14 | 大连东软信息学院 | Animal voice and emotion feature set sequencing and identifying method and system |
CN113255800B (en) * | 2021-06-02 | 2021-10-15 | 中国科学院自动化研究所 | Robust emotion modeling system based on audio and video |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN115512693B (en) * | 2021-06-23 | 2024-08-09 | 中移(杭州)信息技术有限公司 | Audio recognition method, acoustic model training method, device and storage medium |
CN113903327B (en) * | 2021-09-13 | 2024-06-28 | 北京卷心菜科技有限公司 | Voice environment atmosphere recognition method based on deep neural network |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20090063202A (en) * | 2009-05-29 | 2009-06-17 | 포항공과대학교 산학협력단 | Method for apparatus for providing emotion speech recognition |
CN108010514A (en) * | 2017-11-20 | 2018-05-08 | 四川大学 | A kind of method of speech classification based on deep neural network |
CN109036465A (en) * | 2018-06-28 | 2018-12-18 | 南京邮电大学 | Speech-emotion recognition method |
CN109285562A (en) * | 2018-09-28 | 2019-01-29 | 东南大学 | Speech-emotion recognition method based on attention mechanism |
CN109460737A (en) * | 2018-11-13 | 2019-03-12 | 四川大学 | A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network |
CN109637522A (en) * | 2018-12-26 | 2019-04-16 | 杭州电子科技大学 | A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph |
CN109767790A (en) * | 2019-02-28 | 2019-05-17 | 中国传媒大学 | A kind of speech-emotion recognition method and system |
Non-Patent Citations (3)
Title |
---|
DEEP-FSMN FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION; Shiliang Zhang et al.; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018-09-30; pages 5869-5873 *
Research on multimodal emotion recognition methods based on deep learning; Zhang Yuanyuan; China Master's Theses Full-text Database, Information Science and Technology; 2019-08-15 (No. 08); pages 23-25 *
Speech emotion recognition algorithm based on deep spatial attention features extracted from spectrograms; Wang Jinhua et al.; Telecommunications Science; 2019-03-18; Vol. 35, No. 07; pages 100-108 *
Also Published As
Publication number | Publication date |
---|---|
CN110534133A (en) | 2019-12-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110534133B (en) | Voice emotion recognition system and voice emotion recognition method | |
CN112348075B (en) | Multi-mode emotion recognition method based on contextual attention neural network | |
CN110400579B (en) | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network | |
CN111583964B (en) | Natural voice emotion recognition method based on multimode deep feature learning | |
Wang et al. | Data augmentation using deep generative models for embedding based speaker recognition | |
Han et al. | Speech emotion recognition with a resnet-cnn-transformer parallel neural network | |
CN110610534B (en) | Automatic mouth shape animation generation method based on Actor-Critic algorithm | |
CN110956953A (en) | Quarrel identification method based on audio analysis and deep learning | |
CN108962247A (en) | Based on gradual neural network multidimensional voice messaging identifying system and its method | |
Liu et al. | Learning salient features for speech emotion recognition using CNN | |
Djeffal et al. | Noise-robust speech recognition: A comparative analysis of LSTM and CNN approaches | |
CN117115312B (en) | Voice-driven facial animation method, device, equipment and medium | |
Tang et al. | Speech Emotion Recognition Via CNN-Transforemr and Multidimensional Attention Mechanism | |
Kaur et al. | Speech recognition system; challenges and techniques | |
Barkur et al. | EnsembleWave: an ensembled approach for automatic speech emotion recognition | |
Sunny et al. | Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam | |
Kalita et al. | Use of bidirectional long short term memory in spoken word detection with reference to the Assamese language | |
Utomo et al. | Spoken word and speaker recognition using MFCC and multiple recurrent neural networks | |
Ashrafidoost et al. | Recognizing Emotional State Changes Using Speech Processing | |
Song et al. | Speech emotion recognition and intensity estimation | |
CN114863939B (en) | Panda attribute identification method and system based on sound | |
Kaewprateep et al. | Evaluation of small-scale deep learning architectures in Thai speech recognition | |
Cao et al. | Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition. | |
Mirhassani et al. | Fuzzy decision fusion of complementary experts based on evolutionary cepstral coefficients for phoneme recognition | |
Ma et al. | Fine-grained Dynamical Speech Emotion Analysis Utilizing Networks Customized for Acoustic Data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |