CN109036467B

CN109036467B - CFFD extraction method, speech emotion recognition method and system based on TF-LSTM

Info

Publication number: CN109036467B
Application number: CN201811258369.7A
Authority: CN
Inventors: 卫伟; 李晓飞; 吴聪; 柴磊
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2021-04-16
Anticipated expiration: 2038-10-26
Also published as: CN109036467A

Abstract

The invention discloses a TF-LSTM-based CFFD extraction method, a speech emotion recognition method and a system, wherein the TF-LSTM-based speech emotion recognition system includes a CFTD generation module, which is used for generating CFTD; Hybrid Deep Neural Network Model Construction Module, used to construct a Hybrid Deep Neural Network Model; CFFD Extraction Module, used to input pre-extracted 256x256-dimensional frequency domain features into the constructed Hybrid Deep Neural Network Model Extraction; Classifier Training Module : used to fuse the two features of CFTD and CFFD to train a linear SVM classifier to obtain the final speech emotion recognition result. The invention integrates two kinds of deep feature information including time domain feature and frequency domain feature to improve the accuracy of speech emotion recognition; adopts one-dimensional convolutional neural network to extract time domain bottom layer features, and learns speech emotion information through multiple LSTM modules. The contextual features of temporal emotional information are better obtained.

Description

TF-LSTM-based CFFD extraction method, voice emotion recognition method and system

Technical Field

The invention relates to a TF-LSTM-based CFFD extraction method, a voice emotion recognition method and a system, and belongs to the technical field of voice emotion recognition.

Background

The concept of emotion calculation has become a research hotspot in recent years, and has attracted the attention of many emotion analysis experts at home and abroad. The voice signal of the speaker often contains rich emotional information, which helps him to better transfer information. When the same person expresses the same sentence with different emotions, the transmitted information is not very same. In order for a computer to better understand human emotion, the accuracy of speech emotion recognition must be improved. Speech emotion recognition is increasingly used in man-machine interaction fields such as man-made customer service, distance education, medical assistance, and automobile driving.

At present, the traditional speech emotion recognition at home and abroad is greatly developed in the aspects of introduction of emotion description models, construction of emotion speech libraries, emotion characteristic analysis and the like. The speech emotion recognition accuracy rate has great relation with the extraction of speech emotion characteristics, because the traditional speech emotion recognition technology is established on the basis of emotion acoustic characteristics. Deep neural networks have made major breakthroughs in the field of speech emotion recognition in recent years, and have achieved better results in the large vocabulary continuous speech emotion recognition task (LVCSR) than the gaussian mixture model/hidden markov model (GMM/HMM) system. Although a Convolutional Neural Network (CNN) is excellent in image recognition and can obtain a good effect in speech emotion recognition, the problem of low speech emotion recognition efficiency due to an unsatisfactory effect of a frequency domain CFFD feature extraction method in the prior art exists; on the other hand, with the advancement of science and technology, the voice information is explosively increased, and massive data needs to be processed, so that training a high-efficiency and high-recognition-rate voice emotion recognition system becomes a practical problem to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a CFFD extraction method based on TF-LSTM and a speech emotion recognition method and system, wherein a deep hybrid deep neural network model is constructed to extract CFFD, so that the recognition efficiency is improved, and Time Domain Context Features (CFTD) and Frequency Domain context Features (CFFD) are fused, so that the two depth Features have good complementarity in speech and the recognition accuracy is improved.

In order to solve the technical problems, the invention firstly provides a CFFD extraction method based on TF-LSTM, which comprises the following steps:

constructing a hybrid deep neural network model;

inputting pre-extracted 256x256 dimensional frequency domain features into a constructed mixed depth neural network model to extract CFFD;

the hybrid deep neural network model includes: an input layer, five convolutional layers, three maximum pooling layers, two full-connection layers and an LSTM module;

the first convolution layer C1 is followed by a first maximum pooling layer; the second convolutional layer C2 is followed again by the second largest pool layer; followed by third, fourth and fifth convolutional layers followed by a third max pool layer; a fifth convolutional layer C5 is followed by a fully connected layer with 4096 dimensions in both dimensions;

and splicing an LSTM module after the full connection layer, wherein the LSTM module has an implicit layer, the input of the implicit layer is 4096 dimensions, the output of the implicit layer is also 4096 dimensions, the output of the LSTM is used as the output of the network, the output of the whole network is 4096 dimensions, and the CFFD is obtained.

In another aspect, the present invention provides a speech emotion recognition method based on TF-LSTM, comprising the steps of:

generating a CFTD according to pre-extracted time domain context information of the voice signal;

constructing a hybrid deep neural network model;

inputting pre-extracted 256x 256-dimensional frequency domain features into the constructed mixed deep neural network model to extract CFFD;

splicing an LSTM module behind the full connection layer, wherein the LSTM module has a hidden layer, the input of the hidden layer is 4096 dimensions, the output of the hidden layer is also 4096 dimensions, the output of the LSTM is used as the output of the network, the output of the whole network is 4096 dimensions, and the CFFD is obtained;

and fusing the CFTD characteristic and the CFFD characteristic, training a linear SVM classifier, and obtaining a final speech emotion recognition result.

In a third aspect, the present invention provides a TF-LSTM based speech emotion recognition system, comprising:

the CFTD generating module is used for generating CFTD according to the pre-extracted time domain context information of the voice signal;

the hybrid deep neural network model construction module is used for constructing a hybrid deep neural network model;

the CFFD extraction module is used for inputting pre-extracted 256x 256-dimensional frequency domain features into the constructed mixed depth neural network model for extraction;

a classifier training module: the method is used for fusing the CFTD characteristic and the CFFD characteristic, training the linear SVM classifier and obtaining a final speech emotion recognition result.

The invention achieves the following beneficial effects:

1) the method integrates two kinds of depth feature information including time domain features and frequency domain features to improve the accuracy of speech emotion recognition;

2) the method adopts the one-dimensional convolutional neural network to extract the bottom layer characteristics of the time domain, learns the speech emotion information through a plurality of LSTM modules, and well obtains the context characteristics of the time domain emotion information;

3) the invention designs a mixed deep learning network structure consisting of a 2-dimensional convolutional neural network and an LSTM (least squares metric) to extract context phase information of speech emotion information in a frequency domain.

Drawings

FIG. 1 is a flowchart of a speech emotion recognition method according to an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

FIG. 1 is an overall flow chart of the method, and the voice emotion recognition method based on TF-LSTM comprises three steps of feature extraction, feature fusion and SVM classification.

The detailed process is as follows:

step A: extracting the Time Domain context information of the speech emotion signal and generating a Time Domain Context Feature (CFTD).

In a specific embodiment, the detailed extraction process of the CFTD extraction method is preferably as follows:

a.1, inputting a native Speech signal of Berlin Speech emission Database into a one-dimensional convolutional neural network, preprocessing, and extracting bottom layer features;

a.2, the structure of the one-dimensional preprocessing convolution neural network is as follows: one input, 13 convolutional layers, 4 max pooling layers, 2 full link layers. Wherein the input is an EMO-DB voice signal;

a.3, the first 19 layers are respectively: 2 convolutional layers, 1 pooling layer, 3 convolutional layers, 2 full-connection layers. The convolution kernel size for all convolutional layers is 3 × 3, the step size of the convolution is 1. The pooling layer employs a2 x2 convolution kernel with a step size of 2. Space filling is adopted for input of the convolution layer, so that the resolution ratio is kept unchanged after convolution, and the dimension of the output feature of the last full-connection layer is 3072;

and A.4, splicing two LSTM modules respectively behind the convolution layers of the 13 th layer and the 17 th layer, wherein each LSTM module has an implicit layer, the input of the implicit layer is 512 dimensions, the output of the implicit layer is also 512 dimensions, and the output of the two LSTMs and the output of the fully-connected layer of the last layer are directly connected in series to be used as the output of the network. The output of the whole network is 4096 dimensions, namely CFTD;

and B: extracting context information of a speech emotion signal Frequency Domain to generate Frequency Domain context Features (CFFD);

in the embodiment, if the voice signal is preprocessed, the step can be omitted; if no pretreatment is performed, the pretreatment of CFFD is preferably performed by the following method, as follows:

b.1, re-adopting the EMO-DB signal, wherein the sampling frequency is 16 khz;

b.2, framing the voice signal, overlapping and framing the voice signal to ensure smooth transition between frames, wherein the frame length is 512 points, the frame overlapping is 256 points, and adding a Hamming window to obtain a short-time signal x (n) of a single frame;

b.3, performing Fast Fourier Transform (FFT) on each frame of signal to obtain frequency domain data X (i, k);

and B.4, calculating 65-dimensional frequency domain characteristics, which are respectively as follows: 1-dimensional smooth fundamental frequency, 1-dimensional voiced probability, 1-dimensional zero crossing rate, 14-dimensional MFCC, 1-dimensional mean square energy, 28-dimensional acoustic spectrum filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio. The effect of introducing the traditional frequency domain characteristics is better than the effect of simply inputting the frequency domain information into the neural network;

and B.5, after the directly obtained FFT result and the extracted frequency domain feature are spliced, adjusting the dimension to be 256x256 dimensions to obtain the preprocessed frequency domain feature.

The detailed extraction process for constructing the hybrid deep neural network is as follows:

and B.6, the mixed deep neural network model comprises an input layer, 5 convolutional layers, 3 maximum pooling layers, 2 full-link layers and an LSTM module. Wherein the input is the 256x256 frequency domain characteristics obtained in step B.5;

b.7, convolutional layer C1, 96 (15 × 3) convolutional kernels, step 3 × 1, max pooling layer (3 × 1), step 2; convolutional layer C2, 256 (9 × 3) convolutional kernels, step size 1, maximum pool layer (3 × 1), step size 1; convolutional layer C3, 384 (7 × 3) convolutional kernels, convolutional layer C4 has 384 (7 × 1) convolutional kernels; convolutional layer C5, 256 (7 × 1) convolution kernels; two fully connected layers 4096 dimensions;

b.8, splicing an LSTM module after fully connecting the layers. LSTM can learn the context characteristics of speech emotion well. The LSTM module has an implied layer with 4096 dimensions for input and 4096 dimensions for output.

Inputting pre-extracted 256x 256-dimensional frequency domain characteristics into a constructed mixed deep neural network model to extract CFFD, taking the output of LSTM in the mixed deep neural network model as the output of the network, and taking the output of the whole network as 4096 dimensions to obtain CFFD;

and C: fusing the two characteristics of CFTD and CFFD, training a linear SVM (support vector machine) classifier, and obtaining a final speech emotion classification recognition result;

another embodiment is a TF-LSTM based speech emotion recognition system comprising:

a CFTD generating module, configured to generate a CFTD according to the pre-extracted time domain context information of the speech signal, where the implementation method corresponds to step a of the previous embodiment;

a hybrid deep neural network model constructing module for constructing a hybrid deep neural network model, which performs a method corresponding to steps B6-B8 of the previous embodiment;

a CFFD extraction module, which is used for inputting the pre-extracted 256x256 dimensional frequency domain features into the constructed mixed depth neural network model extraction, and the execution method corresponds to the step B of the previous embodiment;

a classifier training module: and C, fusing the two characteristics of CFTD and CFFD, training a linear SVM classifier, and obtaining a final speech emotion recognition result, wherein the execution method corresponds to the step C of the previous embodiment.

Preferably, the CFTD generation module includes a one-dimensional convolutional neural network construction module for preprocessing the input native speech signal, and the execution method corresponds to steps a2 to a4 of the previous embodiment.

Preferably, a frequency domain feature extraction module is further included for extracting the frequency domain features of the speech signal and adjusting to 256 × 256 dimensional frequency domain features, which is performed by a method corresponding to steps B1-B5 of the previous embodiment.

The robot speech emotion analysis is taken as a target, and the recognition rate and the robustness of a speech signal emotion algorithm are improved.

The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can understand that the modifications or substitutions within the technical scope of the present invention are included in the scope of the present invention, and therefore, the scope of the present invention should be subject to the protection scope of the claims.

Claims

1. The CFFD extraction method based on the TF-LSTM is characterized by comprising the following steps:

constructing a hybrid deep neural network model;

the method for extracting the frequency domain features comprises the following steps:

step B.1), the voice signals are adopted again, and the sampling frequency is 16 khz;

step B.2), framing the voice signal, overlapping and framing the voice signal to ensure smooth transition between frames, wherein the frame length is 512 points, the frame overlapping is 256 points, and a Hamming window is added to obtain a short-time signal x (n) of a single frame;

b.3), carrying out fast Fourier transform on each frame of signal to obtain frequency domain data X (i, k);

step B.4), obtaining 65-dimensional frequency domain characteristics which are respectively as follows: 1-dimensional smooth fundamental frequency, 1-dimensional voiced probability, 1-dimensional zero crossing rate, 14-dimensional MFCC, 1-dimensional mean square energy, 28-dimensional acoustic spectrum filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio;

and B.5), adjusting the frequency domain characteristics to 256x256 dimensions.

2. The TF-LSTM based CFFD extraction method of claim 1,

the first convolutional layer C1 of the hybrid deep neural network employs 96 convolutional kernels of size 15 × 3, with the step size set to 3 × 1, followed by a largest pooling layer of size 3 × 1 with a step size of 2; the second convolutional layer C2 has 256 convolutional kernels with size of 9 × 3 and step size of 1; the second convolutional layer C2 is followed again by a largest pool layer of size 3 × 1 and with a step size of 1;

the third convolutional layer C3 has 384 convolution kernels of size 7 × 3, and C4 has 384 kernels of size 7 × 1;

the last convolutional layer C5 contains 256 convolutional kernels of size 7 × 1, followed by the largest pool layer of size 3 × 1; convolutional layer C5 is followed by a fully connected layer with 4096 dimensions in both dimensions.

3. The voice emotion recognition method based on TF-LSTM is characterized by comprising the following steps:

constructing a hybrid deep neural network model;

fusing the two characteristics of CFTD and CFFD, training a linear SVM classifier, and obtaining a final speech emotion recognition result;

and B.5), adjusting the frequency domain characteristics to 256x256 dimensions.

4. The TF-LSTM based speech emotion recognition method of claim 3, wherein extracting the temporal context information of the speech signal and generating CFTD includes the steps of:

inputting a native voice signal into a one-dimensional convolution-based neural network for preprocessing;

the structure of the one-dimensional convolutional neural network is one input, 13 convolutional layers, 4 maximum pooling layers and 2 full-connection layers; the input is a one-dimensional native speech signal;

the first 19 layers are: 2 convolutional layers, 1 pooling layer, 3 convolutional layers, 2 full-connection layers;

and respectively splicing two LSTM modules behind the convolution layers of the 13 th layer and the 17 th layer, wherein each LSTM module has an implied layer, the input of the implied layer is 512 dimensions, the output of the implied layer is also 512 dimensions, the outputs of the two LSTMs and the output of the full-connection layer of the last layer are directly connected in series to be used as the output of the network, the output of the whole network is 4096 dimensions, and the CFTD is obtained.

5. The TF-LSTM based speech emotion recognition method of claim 4, wherein the convolution kernel size of all convolution layers of said one-dimensional convolutional neural network is 3 x 3, the step size of convolution is 1; the pooling layer adopts 2 multiplied by 2 convolution kernels, and the step length is 2; the convolutional layers are input with space filling, so that the resolution remains unchanged after convolution, and the dimension of the output feature of the last fully-connected layer is 3072.

6. The TF-LSTM based speech emotion recognition method of claim 3,

7. The voice emotion recognition system based on TF-LSTM is characterized in that: the method comprises the following steps:

a classifier training module: the system is used for fusing the CFTD characteristic and the CFFD characteristic, training a linear SVM classifier and obtaining a final speech emotion recognition result;

the system also comprises a frequency domain feature extraction module, which is used for extracting the frequency domain features of the voice signals and adjusting the frequency domain features into 256x 256-dimensional frequency domain features, wherein the specific method for extracting the frequency domain features comprises the following steps:

and B.5), adjusting the frequency domain characteristics to 256x256 dimensions.

8. The TF-LSTM based speech emotion recognition system of claim 7, wherein:

the CFTD generating module comprises a one-dimensional convolution neural network constructing module which is used for preprocessing the input native voice signal.