Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a TF-LSTM-based CFFD extraction method, together with a speech emotion recognition method and system. A hybrid deep neural network model is constructed to extract the CFFD, which improves recognition efficiency, and Time Domain Context Features (CFTD) are fused with Frequency Domain Context Features (CFFD); the two deep features are strongly complementary in speech, which improves recognition accuracy.
In order to solve the above technical problems, the invention first provides a TF-LSTM-based CFFD extraction method, comprising the following steps:
constructing a hybrid deep neural network model;
inputting the pre-extracted 256×256-dimensional frequency domain features into the constructed hybrid deep neural network model to extract the CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers come next and are followed by the third max pooling layer; the fifth convolutional layer C5 is then followed by two fully connected layers, each 4096-dimensional;
and an LSTM module is spliced after the fully connected layers, wherein the LSTM module has one hidden layer whose input and output are both 4096-dimensional; the output of the LSTM serves as the output of the network, so the output of the whole network is 4096-dimensional, yielding the CFFD.
In another aspect, the present invention provides a speech emotion recognition method based on TF-LSTM, comprising the steps of:
generating a CFTD according to the pre-extracted time domain context information of the speech signal;
constructing a hybrid deep neural network model;
inputting the pre-extracted 256×256-dimensional frequency domain features into the constructed hybrid deep neural network model to extract the CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers come next and are followed by the third max pooling layer; the fifth convolutional layer C5 is then followed by two fully connected layers, each 4096-dimensional;
splicing an LSTM module after the fully connected layers, wherein the LSTM module has one hidden layer whose input and output are both 4096-dimensional; the output of the LSTM serves as the output of the network, so the output of the whole network is 4096-dimensional, yielding the CFFD;
and fusing the CFTD and CFFD features, training a linear SVM classifier, and obtaining the final speech emotion recognition result.
In a third aspect, the present invention provides a TF-LSTM based speech emotion recognition system, comprising:
the CFTD generation module is used for generating the CFTD according to the pre-extracted time domain context information of the speech signal;
the hybrid deep neural network model construction module is used for constructing a hybrid deep neural network model;
the CFFD extraction module is used for inputting the pre-extracted 256×256-dimensional frequency domain features into the constructed hybrid deep neural network model for extraction;
the classifier training module is used for fusing the CFTD and CFFD features, training the linear SVM classifier and obtaining the final speech emotion recognition result.
The invention achieves the following beneficial effects:
1) the method fuses two kinds of deep feature information, time domain features and frequency domain features, to improve the accuracy of speech emotion recognition;
2) the method adopts a one-dimensional convolutional neural network to extract low-level time domain features and learns speech emotion information through multiple LSTM modules, capturing the context features of the time domain emotion information well;
3) the invention designs a hybrid deep learning network structure consisting of a two-dimensional convolutional neural network and an LSTM (Long Short-Term Memory) network to extract the context information of speech emotion in the frequency domain.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
FIG. 1 is the overall flow chart of the method. The TF-LSTM-based speech emotion recognition method comprises three steps: feature extraction, feature fusion and SVM classification.
The detailed process is as follows:
Step A: extracting the time domain context information of the speech emotion signal and generating the Time Domain Context Feature (CFTD).
In a specific embodiment, the detailed CFTD extraction process is preferably as follows:
A.1, inputting a native speech signal from the Berlin Emotional Speech Database (EMO-DB) into a one-dimensional convolutional neural network for preprocessing and extraction of low-level features;
A.2, the structure of the one-dimensional preprocessing convolutional neural network is: one input layer, 13 convolutional layers, 4 max pooling layers and 2 fully connected layers, where the input is the EMO-DB speech signal;
A.3, the 19 layers are, in order: 2 convolutional layers, 1 pooling layer, 2 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers, 1 pooling layer, 3 convolutional layers and 2 fully connected layers (13 convolutional, 4 pooling and 2 fully connected layers in total). The convolution kernel size of all convolutional layers is 3×3 and the convolution stride is 1. The pooling layers use a 2×2 kernel with a stride of 2. The convolutional layer inputs are spatially padded so that the resolution remains unchanged after convolution, and the output feature of the last fully connected layer is 3072-dimensional;
A.4, splicing an LSTM module after each of the 13th-layer and 17th-layer convolutions, wherein each LSTM module has one hidden layer whose input and output are both 512-dimensional, and the outputs of the two LSTMs are concatenated directly with the output of the last fully connected layer to form the output of the network. The output of the whole network is therefore 512 + 512 + 3072 = 4096-dimensional, namely the CFTD. A code sketch of this network is given below.
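The following is a minimal PyTorch sketch of the time-domain network of steps A.2 to A.4, for illustration only. The channel widths (64/128/256/512) and the 2048-sample input length are assumptions not stated in the embodiment; the layer counts, small kernels, 3072-dimensional final output and the two 512-dimensional LSTM taps after layers 13 and 17 follow the text.

```python
# Sketch of the CFTD network (steps A.2-A.4). Channel widths and the
# input length are assumptions; layer counts and tap points follow the text.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """n_convs 1-D convolutions (kernel 3, stride 1, padded so the
    resolution is unchanged), each followed by ReLU."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv1d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class CFTDNet(nn.Module):
    def __init__(self, fc_in=512 * 128):  # assumes a 2048-sample input
        super().__init__()
        # Layers 1-14: 13 convolutional layers and 4 max pooling layers.
        self.stage1 = nn.Sequential(conv_block(1, 64, 2), nn.MaxPool1d(2, 2))
        self.stage2 = nn.Sequential(conv_block(64, 128, 2), nn.MaxPool1d(2, 2))
        self.stage3 = nn.Sequential(conv_block(128, 256, 3), nn.MaxPool1d(2, 2))
        self.stage4 = conv_block(256, 512, 3)   # ends at layer 13
        self.pool4 = nn.MaxPool1d(2, 2)
        self.stage5 = conv_block(512, 512, 3)   # ends at layer 17
        # Two LSTM taps, each with one 512-dimensional hidden layer.
        self.lstm13 = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)
        self.lstm17 = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)
        # Layers 18-19: two fully connected layers, the last 3072-dimensional.
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(fc_in, 3072), nn.ReLU(inplace=True),
                                nn.Linear(3072, 3072))

    def forward(self, x):                       # x: (batch, 1, samples)
        f13 = self.stage4(self.stage3(self.stage2(self.stage1(x))))
        f17 = self.stage5(self.pool4(f13))
        # Feed the time axis to the LSTMs as the sequence axis.
        _, (h13, _) = self.lstm13(f13.permute(0, 2, 1))
        _, (h17, _) = self.lstm17(f17.permute(0, 2, 1))
        fc_out = self.fc(f17)
        # 512 + 512 + 3072 = 4096-dimensional CFTD.
        return torch.cat([h13[-1], h17[-1], fc_out], dim=1)
```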
Step B: extracting the context information of the speech emotion signal in the frequency domain to generate the Frequency Domain Context Features (CFFD).
In this embodiment, if the speech signal has already been preprocessed, this step can be omitted; if no preprocessing has been performed, the preprocessing for CFFD extraction is preferably carried out as follows:
B.1, resampling the EMO-DB signal at a sampling frequency of 16 kHz;
B.2, framing the speech signal with overlapping frames to ensure smooth transitions between frames, using a frame length of 512 points and a frame overlap of 256 points, and applying a Hamming window to obtain the short-time signal x(n) of a single frame;
B.3, performing a Fast Fourier Transform (FFT) on each frame of the signal to obtain the frequency domain data X(i, k);
B.4, calculating 65-dimensional frequency domain features, namely: 1-dimensional smoothed fundamental frequency, 1-dimensional voicing probability, 1-dimensional zero-crossing rate, 14-dimensional MFCCs, 1-dimensional root-mean-square energy, 28-dimensional acoustic spectrum filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation (shimmer) and 1-dimensional harmonic-to-noise ratio. Introducing these traditional frequency domain features gives better results than simply feeding the raw frequency domain information into the neural network;
B.5, splicing the directly obtained FFT result with the extracted frequency domain features, and adjusting the dimension to 256×256 to obtain the preprocessed frequency domain features, as sketched in the code below.
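The preprocessing of steps B.1 to B.5 can be sketched in numpy as follows. The function extract_65dim_features is a hypothetical placeholder for the handcrafted features of step B.4, and cropping/zero-padding to 256×256 is one possible reading of the dimension adjustment in step B.5.

```python
# Sketch of the frequency-domain preprocessing (steps B.1-B.5).
import numpy as np

FRAME_LEN, HOP = 512, 256   # 512-point frames, 256-point overlap (16 kHz signal)

def frame_signal(signal):
    """Split the signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    window = np.hamming(FRAME_LEN)
    return np.stack([signal[i * HOP: i * HOP + FRAME_LEN] * window
                     for i in range(n_frames)])      # (n_frames, 512)

def preprocess(signal, extract_65dim_features):
    frames = frame_signal(signal)
    # FFT of each windowed frame -> X(i, k); keep the one-sided magnitude
    # spectrum (257 bins for a 512-point real FFT).
    spectra = np.abs(np.fft.rfft(frames, axis=1))    # (n_frames, 257)
    feats = extract_65dim_features(frames)           # (n_frames, 65), hypothetical
    # Splice the FFT result with the handcrafted features (step B.5) ...
    combined = np.concatenate([spectra, feats], axis=1)
    # ... then adjust to a fixed 256x256 input by cropping/zero-padding.
    out = np.zeros((256, 256), dtype=np.float32)
    rows, cols = min(256, combined.shape[0]), min(256, combined.shape[1])
    out[:rows, :cols] = combined[:rows, :cols]
    return out
```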
The detailed process of constructing the hybrid deep neural network and extracting the CFFD is as follows:
B.6, the hybrid deep neural network model comprises an input layer, 5 convolutional layers, 3 max pooling layers, 2 fully connected layers and an LSTM module, where the input is the 256×256 frequency domain features obtained in step B.5;
B.7, convolutional layer C1 has 96 convolution kernels of size 15×3 with a stride of 3×1, followed by a 3×1 max pooling layer with a stride of 2; convolutional layer C2 has 256 convolution kernels of size 9×3 with a stride of 1, followed by a 3×1 max pooling layer with a stride of 1; convolutional layer C3 has 384 convolution kernels of size 7×3, convolutional layer C4 has 384 convolution kernels of size 7×1 and convolutional layer C5 has 256 convolution kernels of size 7×1, followed by the third max pooling layer; both fully connected layers are 4096-dimensional;
B.8, an LSTM module is spliced after the fully connected layers; the LSTM can learn the context features of speech emotion well. The LSTM module has one hidden layer with a 4096-dimensional input and a 4096-dimensional output.
The pre-extracted 256×256-dimensional frequency domain features are input into the constructed hybrid deep neural network model to extract the CFFD: the output of the LSTM serves as the output of the network, and the 4096-dimensional output of the whole network is the CFFD. A code sketch of this model follows.
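The following is a minimal PyTorch sketch of the hybrid model of steps B.6 to B.8, for illustration only. Kernel sizes, strides and channel counts follow step B.7; the parameters of the third max pooling layer (taken here as 3×1 with a stride of 2) and the reading of the pooling strides as applying to both axes are assumptions, which is why a lazily initialized linear layer absorbs the flattened feature size.

```python
# Sketch of the CFFD network (steps B.6-B.8) for a (batch, 1, 256, 256) input.
import torch
import torch.nn as nn

class CFFDNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # C1: 96 kernels of 15x3, stride 3x1; then 3x1 max pool, stride 2.
            nn.Conv2d(1, 96, kernel_size=(15, 3), stride=(3, 1)), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=(3, 1), stride=2),
            # C2: 256 kernels of 9x3, stride 1; then 3x1 max pool, stride 1.
            nn.Conv2d(96, 256, kernel_size=(9, 3), stride=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=(3, 1), stride=1),
            # C3-C5: 384 (7x3), 384 (7x1) and 256 (7x1) kernels.
            nn.Conv2d(256, 384, kernel_size=(7, 3)), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=(7, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=(7, 1)), nn.ReLU(inplace=True),
            # Third max pooling layer; its parameters are assumed.
            nn.MaxPool2d(kernel_size=(3, 1), stride=2),
        )
        # Two 4096-dimensional fully connected layers.
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.LazyLinear(4096), nn.ReLU(inplace=True),
                                nn.Linear(4096, 4096), nn.ReLU(inplace=True))
        # One LSTM hidden layer, 4096-dim input and output; its output is the CFFD.
        self.lstm = nn.LSTM(input_size=4096, hidden_size=4096, batch_first=True)

    def forward(self, x):               # x: (batch, 1, 256, 256)
        v = self.fc(self.features(x))   # (batch, 4096)
        # Each 256x256 input yields one 4096-dim vector; it is fed to the LSTM
        # as a length-1 sequence here (longer sequences per utterance in practice).
        out, _ = self.lstm(v.unsqueeze(1))
        return out.squeeze(1)           # 4096-dimensional CFFD
```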
Step C: fusing the CFTD and CFFD features, training a linear SVM (Support Vector Machine) classifier, and obtaining the final speech emotion recognition result, for example as sketched below.
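A minimal scikit-learn sketch of step C, assuming the fusion is a simple concatenation of the two 4096-dimensional features (the embodiment does not specify the fusion operator):

```python
# Sketch of step C: feature fusion and linear SVM training.
import numpy as np
from sklearn.svm import LinearSVC

def train_emotion_svm(cftd, cffd, labels):
    """cftd, cffd: (n_samples, 4096) arrays; labels: emotion class ids."""
    fused = np.hstack([cftd, cffd])   # (n_samples, 8192) fused feature
    clf = LinearSVC(C=1.0)            # linear SVM classifier
    clf.fit(fused, labels)
    return clf
```

At prediction time the same concatenation is applied to the test features before calling clf.predict.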
Another embodiment is a TF-LSTM-based speech emotion recognition system, comprising:
a CFTD generation module, configured to generate the CFTD according to the pre-extracted time domain context information of the speech signal; its implementation corresponds to step A of the previous embodiment;
a hybrid deep neural network model construction module, configured to construct the hybrid deep neural network model; its implementation corresponds to steps B.6 to B.8 of the previous embodiment;
a CFFD extraction module, configured to input the pre-extracted 256×256-dimensional frequency domain features into the constructed hybrid deep neural network model for extraction; its implementation corresponds to step B of the previous embodiment;
a classifier training module, configured to fuse the CFTD and CFFD features, train the linear SVM classifier and obtain the final speech emotion recognition result; its implementation corresponds to step C of the previous embodiment.
Preferably, the CFTD generation module includes a one-dimensional convolutional neural network construction module for preprocessing the input native speech signal; its implementation corresponds to steps A.2 to A.4 of the previous embodiment.
Preferably, the system further includes a frequency domain feature extraction module for extracting the frequency domain features of the speech signal and adjusting them to 256×256-dimensional features; its implementation corresponds to steps B.1 to B.5 of the previous embodiment.
Targeting robot speech emotion analysis, the invention improves the recognition rate and robustness of speech emotion recognition algorithms.
The above description is only one embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention, which shall therefore be subject to the protection scope of the claims.