CN109036467B - TF-LSTM-based CFFD extraction method, voice emotion recognition method and system - Google Patents
- Publication number
- CN109036467B (application CN201811258369.7A)
- Authority
- CN
- China
- Prior art keywords
- layer
- dimensional
- convolutional
- lstm
- dimensions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a TF-LSTM-based CFFD extraction method, a TF-LSTM-based speech emotion recognition method, and a TF-LSTM-based speech emotion recognition system. The system comprises a CFTD generation module for generating CFTD from pre-extracted time-domain context information of a speech signal; a hybrid deep neural network model construction module for constructing a hybrid deep neural network model; a CFFD extraction module for inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract CFFD; and a classifier training module for fusing the CFTD and CFFD features and training a linear SVM classifier to obtain the final speech emotion recognition result. The method fuses two kinds of deep feature information, time-domain features and frequency-domain features, to improve the accuracy of speech emotion recognition; a one-dimensional convolutional neural network extracts low-level time-domain features, and several LSTM modules learn the speech emotion information, so that the context features of the time-domain emotion information are captured well.
Description
Technical Field
The invention relates to a TF-LSTM-based CFFD extraction method and to a speech emotion recognition method and system, and belongs to the technical field of speech emotion recognition.
Background
Affective computing has become a research hotspot in recent years and has attracted the attention of many emotion analysis experts at home and abroad. A speaker's voice signal often carries rich emotional information, which helps the speaker convey the message better. When the same person expresses the same sentence with different emotions, the information conveyed is not the same. For a computer to better understand human emotion, the accuracy of speech emotion recognition must be improved. Speech emotion recognition is increasingly used in human-computer interaction fields such as customer service, distance education, medical assistance, and automobile driving.
At present, traditional speech emotion recognition at home and abroad has made great progress in the introduction of emotion description models, the construction of emotional speech databases, emotion feature analysis, and related areas. The accuracy of speech emotion recognition is closely related to the extraction of speech emotion features, because traditional speech emotion recognition technology is built on acoustic emotion features. Deep neural networks have made major breakthroughs in the speech field in recent years, achieving better results on the large-vocabulary continuous speech recognition (LVCSR) task than Gaussian mixture model / hidden Markov model (GMM/HMM) systems. Although convolutional neural networks (CNN) excel at image recognition and can also obtain good results in speech emotion recognition, the frequency-domain CFFD feature extraction methods of the prior art are unsatisfactory, which leads to low speech emotion recognition efficiency. Moreover, as technology advances, voice data is growing explosively and massive amounts of data must be processed, so training an efficient speech emotion recognition system with a high recognition rate has become a practical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a TF-LSTM-based CFFD extraction method and a speech emotion recognition method and system. A hybrid deep neural network model is constructed to extract CFFD, which improves recognition efficiency, and time-domain context features (CFTD) are fused with frequency-domain context features (CFFD); the two deep features are highly complementary in speech, which improves recognition accuracy.
In order to solve the technical problems, the invention firstly provides a CFFD extraction method based on TF-LSTM, which comprises the following steps:
constructing a hybrid deep neural network model;
inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers come next, followed by the third max pooling layer; the fifth convolutional layer C5 is followed by two fully connected layers, each of 4096 dimensions;
an LSTM module is spliced after the fully connected layers; the LSTM module has a hidden layer whose input is 4096-dimensional and whose output is also 4096-dimensional; the output of the LSTM serves as the output of the network, so the output of the whole network is 4096-dimensional, which is the CFFD.
In another aspect, the present invention provides a speech emotion recognition method based on TF-LSTM, comprising the steps of:
generating a CFTD according to pre-extracted time domain context information of the voice signal;
constructing a hybrid deep neural network model;
inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers come next, followed by the third max pooling layer; the fifth convolutional layer C5 is followed by two fully connected layers, each of 4096 dimensions;
an LSTM module is spliced after the fully connected layers; the LSTM module has a hidden layer whose input is 4096-dimensional and whose output is also 4096-dimensional; the output of the LSTM serves as the output of the network, so the output of the whole network is 4096-dimensional, which is the CFFD;
and fusing the CFTD and CFFD features, training a linear SVM classifier, and obtaining the final speech emotion recognition result.
In a third aspect, the present invention provides a TF-LSTM based speech emotion recognition system, comprising:
the CFTD generating module is used for generating CFTD according to the pre-extracted time domain context information of the voice signal;
the hybrid deep neural network model construction module is used for constructing a hybrid deep neural network model;
the CFFD extraction module is used for inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract CFFD;
a classifier training module, used for fusing the CFTD and CFFD features, training the linear SVM classifier, and obtaining the final speech emotion recognition result.
The invention achieves the following beneficial effects:
1) the method fuses two kinds of deep feature information, time-domain features and frequency-domain features, to improve the accuracy of speech emotion recognition;
2) the method uses a one-dimensional convolutional neural network to extract low-level time-domain features and learns the speech emotion information through several LSTM modules, thereby capturing the context features of the time-domain emotion information well;
3) the invention designs a hybrid deep learning network structure consisting of a 2-dimensional convolutional neural network and an LSTM (long short-term memory) network to extract the context information of the speech emotion information in the frequency domain.
Drawings
FIG. 1 is a flowchart of a speech emotion recognition method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
FIG. 1 is an overall flow chart of the method. The TF-LSTM-based speech emotion recognition method comprises three stages: feature extraction, feature fusion and SVM classification.
The detailed process is as follows:
step A: extracting the Time Domain context information of the speech emotion signal and generating a Time Domain Context Feature (CFTD).
In a specific embodiment, the detailed CFTD extraction process is preferably as follows:
A.1, inputting a raw speech signal from the Berlin Emotional Speech Database (EMO-DB) into a one-dimensional convolutional neural network, preprocessing it, and extracting low-level features;
A.2, the structure of the one-dimensional preprocessing convolutional neural network is: one input layer, 13 convolutional layers, 4 max pooling layers and 2 fully connected layers, where the input is the EMO-DB speech signal;
A.3, the 19 layers begin with 2 convolutional layers and 1 pooling layer and end with 3 convolutional layers and the 2 fully connected layers. The convolution kernel size of all convolutional layers is 3 × 3 and the convolution stride is 1. The pooling layers use a 2 × 2 window with a stride of 2. The convolutional layer inputs are spatially padded so that the resolution is kept unchanged after convolution, and the output feature dimension of the last fully connected layer is 3072;
A.4, two LSTM modules are spliced after the convolutional layers at layer 13 and layer 17, respectively. Each LSTM module has a hidden layer whose input is 512-dimensional and whose output is also 512-dimensional. The outputs of the two LSTMs and the output of the last fully connected layer are concatenated directly as the output of the network. The output of the whole network is 4096-dimensional, namely the CFTD;
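By way of illustration only, the following PyTorch sketch shows one way such a time-domain network could be assembled: a VGG-style 1-D convolutional stack whose 13th-layer and 17th-layer feature maps each feed a 512-unit LSTM, with the two LSTM outputs concatenated to the 3072-dimensional output of the last fully connected layer to form a 4096-dimensional CFTD vector. The channel widths, the assumed 2-2-3-3-3 grouping of the 13 convolutional layers, and the way feature maps are flattened for the fully connected layers are assumptions not fixed by the description.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    # n_convs 1-D convolutions (kernel 3, stride 1, padded to keep resolution), each with ReLU
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv1d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    return layers

class CFTDNet(nn.Module):
    """Hypothetical time-domain branch: 13 conv + 4 pool + 2 FC layers plus two 512-unit LSTMs."""
    def __init__(self):
        super().__init__()
        # Channel widths are assumed; the patent text does not specify them.
        self.stage1 = nn.Sequential(*conv_block(1,   64, 2), nn.MaxPool1d(2, 2))   # layers 1-3
        self.stage2 = nn.Sequential(*conv_block(64, 128, 2), nn.MaxPool1d(2, 2))   # layers 4-6
        self.stage3 = nn.Sequential(*conv_block(128, 256, 3), nn.MaxPool1d(2, 2))  # layers 7-10
        self.stage4 = nn.Sequential(*conv_block(256, 512, 3))                      # layers 11-13
        self.pool4  = nn.MaxPool1d(2, 2)                                           # layer 14
        self.stage5 = nn.Sequential(*conv_block(512, 512, 3))                      # layers 15-17
        self.lstm13 = nn.LSTM(512, 512, batch_first=True)   # spliced after layer 13
        self.lstm17 = nn.LSTM(512, 512, batch_first=True)   # spliced after layer 17
        self.fc     = nn.Sequential(nn.Flatten(), nn.LazyLinear(3072), nn.ReLU(inplace=True),
                                    nn.Linear(3072, 3072))                         # layers 18-19

    def forward(self, x):                                   # x: (batch, 1, n_samples)
        f13 = self.stage4(self.stage3(self.stage2(self.stage1(x))))   # (batch, 512, T1)
        f17 = self.stage5(self.pool4(f13))                            # (batch, 512, T2)
        _, (h13, _) = self.lstm13(f13.transpose(1, 2))   # treat channels as LSTM features
        _, (h17, _) = self.lstm17(f17.transpose(1, 2))
        fc_out = self.fc(f17)                                         # 3072-dim
        return torch.cat([fc_out, h13[-1], h17[-1]], dim=1)          # 3072 + 512 + 512 = 4096

# cftd = CFTDNet()(torch.randn(2, 1, 16000))   # -> tensor of shape (2, 4096)
```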
Step B: extracting the context information of the speech emotion signal in the frequency domain to generate the frequency-domain context features (CFFD);
In this embodiment, if the speech signal has already been preprocessed, this step can be omitted; if it has not, the preprocessing for CFFD is preferably performed as follows:
B.1, re-sampling the EMO-DB signal at a sampling frequency of 16 kHz;
B.2, framing the speech signal with overlapping frames to ensure a smooth transition between frames, with a frame length of 512 points and a frame overlap of 256 points, and applying a Hamming window to obtain the short-time signal x(n) of each frame;
B.3, performing a fast Fourier transform (FFT) on each frame of the signal to obtain the frequency-domain data X(i, k);
B.4, computing 65-dimensional frequency-domain features, namely: 1-dimensional smoothed fundamental frequency, 1-dimensional voicing probability, 1-dimensional zero-crossing rate, 14-dimensional MFCC, 1-dimensional mean-square energy, 28-dimensional acoustic spectrum filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio. Introducing these traditional frequency-domain features gives better results than simply feeding the raw frequency-domain information into the neural network;
B.5, concatenating the directly obtained FFT result with the extracted frequency-domain features and adjusting the dimensions to 256x256 to obtain the preprocessed frequency-domain features.
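As a non-authoritative illustration of steps B.1-B.5, the following Python/NumPy sketch frames a 16 kHz signal into overlapping 512-point Hamming-windowed frames, takes the FFT of each frame, computes two of the 65 hand-crafted dimensions (zero-crossing rate and energy) as examples, and packs everything into a 256x256 map. How the FFT result and the 65-dimensional features are actually combined and resized is not fully specified in the text, so the simple crop/zero-pad used here is an assumption.

```python
import numpy as np

def frame_signal(y, frame_len=512, hop=256):
    """Overlapping frames (512-point frames, 256-point overlap) with a Hamming window."""
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return y[idx] * np.hamming(frame_len)          # short-time signals x(n)

def spectral_frames(frames):
    return np.abs(np.fft.rfft(frames, n=512))      # |X(i, k)|, shape (n_frames, 257)

def example_frame_features(frames):
    # Two example dimensions of the 65-dim hand-crafted set: zero-crossing rate and RMS energy.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1, keepdims=True)
    rms = np.sqrt(np.mean(frames ** 2, axis=1, keepdims=True))
    return np.concatenate([zcr, rms], axis=1)

def to_256x256(spec, handcrafted):
    """Concatenate spectrum and hand-crafted features per frame, then crop/zero-pad to 256x256."""
    feat = np.concatenate([spec, handcrafted], axis=1)
    out = np.zeros((256, 256), dtype=np.float32)
    n = min(256, feat.shape[0])
    out[:n] = feat[:n, :256]                       # placeholder for the unspecified 256x256 adjustment
    return out

# y: mono 16 kHz signal as a 1-D numpy array
# frames = frame_signal(y)
# cffd_input = to_256x256(spectral_frames(frames), example_frame_features(frames))
```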
The detailed process of constructing the hybrid deep neural network and extracting with it is as follows:
B.6, the hybrid deep neural network model comprises an input layer, 5 convolutional layers, 3 max pooling layers, 2 fully connected layers and an LSTM module, where the input is the 256x256-dimensional frequency-domain features obtained in step B.5;
B.7, convolutional layer C1 has 96 (15 × 3) convolution kernels with a stride of 3 × 1, followed by a max pooling layer (3 × 1) with a stride of 2; convolutional layer C2 has 256 (9 × 3) convolution kernels with a stride of 1, followed by a max pooling layer (3 × 1) with a stride of 1; convolutional layer C3 has 384 (7 × 3) convolution kernels and convolutional layer C4 has 384 (7 × 1) convolution kernels; convolutional layer C5 has 256 (7 × 1) convolution kernels; the two fully connected layers are each 4096-dimensional;
B.8, an LSTM module is spliced after the fully connected layers; the LSTM can learn the context features of speech emotion well. The LSTM module has a hidden layer whose input is 4096-dimensional and whose output is 4096-dimensional.
The pre-extracted 256x256-dimensional frequency-domain features are input into the constructed hybrid deep neural network model to extract the CFFD: the output of the LSTM in the hybrid deep neural network model serves as the output of the network, and the output of the whole network is 4096-dimensional, giving the CFFD;
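A minimal PyTorch sketch of the hybrid 2-D CNN + LSTM described in steps B.6-B.8 is given below, assuming a single-channel 256x256 input. The kernel counts, sizes and strides follow the text; the padding (none), the stride of the third pooling layer, and the choice to feed the 4096-dimensional fully connected output to the LSTM as a one-step sequence are assumptions.

```python
import torch
import torch.nn as nn

class CFFDNet(nn.Module):
    """Hypothetical hybrid deep network: 5 conv layers, 3 max pooling layers, 2 FC layers, 1 LSTM."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1,   96, kernel_size=(15, 3), stride=(3, 1)), nn.ReLU(inplace=True),  # C1
            nn.MaxPool2d(kernel_size=(3, 1), stride=2),                                     # pool 1
            nn.Conv2d(96, 256, kernel_size=(9, 3), stride=1), nn.ReLU(inplace=True),        # C2
            nn.MaxPool2d(kernel_size=(3, 1), stride=1),                                     # pool 2
            nn.Conv2d(256, 384, kernel_size=(7, 3), stride=1), nn.ReLU(inplace=True),       # C3
            nn.Conv2d(384, 384, kernel_size=(7, 1), stride=1), nn.ReLU(inplace=True),       # C4
            nn.Conv2d(384, 256, kernel_size=(7, 1), stride=1), nn.ReLU(inplace=True),       # C5
            nn.MaxPool2d(kernel_size=(3, 1), stride=2),                                     # pool 3
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(4096), nn.ReLU(inplace=True),
                                nn.Linear(4096, 4096), nn.ReLU(inplace=True))  # two 4096-dim FC layers
        self.lstm = nn.LSTM(input_size=4096, hidden_size=4096, batch_first=True)

    def forward(self, x):                      # x: (batch, 1, 256, 256) frequency-domain features
        h = self.fc(self.features(x))          # 4096-dim fully connected output
        out, _ = self.lstm(h.unsqueeze(1))     # presented to the LSTM as a one-step sequence (assumption)
        return out.squeeze(1)                  # 4096-dim CFFD

# cffd = CFFDNet()(torch.randn(2, 1, 256, 256))   # -> tensor of shape (2, 4096)
```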
Step C: fusing the two features, CFTD and CFFD, training a linear SVM (support vector machine) classifier, and obtaining the final speech emotion classification and recognition result;
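The fusion and classification stage of step C can be sketched with scikit-learn as follows, assuming the CFTD and CFFD have already been extracted as arrays of shape (n_utterances, 4096) with integer emotion labels y. LinearSVC is used here as the linear SVM; the patent does not name a particular SVM implementation, and the feature scaling step is an added assumption.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def train_emotion_classifier(cftd, cffd, y):
    """Fuse CFTD and CFFD by concatenation and train a linear SVM on the fused features."""
    fused = np.concatenate([cftd, cffd], axis=1)          # (n_utterances, 8192) fused feature
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
    clf.fit(fused, y)
    return clf

# clf = train_emotion_classifier(cftd_train, cffd_train, y_train)
# predictions = clf.predict(np.concatenate([cftd_test, cffd_test], axis=1))
```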
another embodiment is a TF-LSTM based speech emotion recognition system comprising:
a CFTD generation module, configured to generate CFTD according to the pre-extracted time-domain context information of the speech signal; its implementation corresponds to step A of the previous embodiment;
a hybrid deep neural network model construction module, configured to construct the hybrid deep neural network model; its implementation corresponds to steps B.6-B.8 of the previous embodiment;
a CFFD extraction module, configured to input the pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract CFFD; its implementation corresponds to step B of the previous embodiment;
a classifier training module, configured to fuse the two features, CFTD and CFFD, train a linear SVM classifier, and obtain the final speech emotion recognition result; its implementation corresponds to step C of the previous embodiment.
Preferably, the CFTD generation module includes a one-dimensional convolutional neural network construction module for preprocessing the input raw speech signal; its implementation corresponds to steps A.2 to A.4 of the previous embodiment.
Preferably, the system further includes a frequency-domain feature extraction module for extracting the frequency-domain features of the speech signal and adjusting them to 256x256-dimensional frequency-domain features; its implementation corresponds to steps B.1-B.5 of the previous embodiment.
Taking robot speech emotion analysis as the target, the invention improves the recognition rate and robustness of speech emotion recognition algorithms.
The above description is only an embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention falls within the protection scope of the present invention; therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (8)
1. A TF-LSTM-based CFFD extraction method, characterized by comprising the following steps:
constructing a hybrid deep neural network model;
inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers come next, followed by the third max pooling layer; the fifth convolutional layer C5 is followed by two fully connected layers, each of 4096 dimensions;
an LSTM module is spliced after the fully connected layers; the LSTM module has a hidden layer whose input is 4096-dimensional and whose output is also 4096-dimensional; the output of the LSTM serves as the output of the network, so the output of the whole network is 4096-dimensional, which is the CFFD;
the frequency-domain features are extracted by the following method:
step B.1), re-sampling the speech signal at a sampling frequency of 16 kHz;
step B.2), framing the speech signal with overlapping frames to ensure a smooth transition between frames, with a frame length of 512 points and a frame overlap of 256 points, and applying a Hamming window to obtain the short-time signal x(n) of each frame;
step B.3), performing a fast Fourier transform on each frame of the signal to obtain the frequency-domain data X(i, k);
step B.4), obtaining 65-dimensional frequency-domain features, namely: 1-dimensional smoothed fundamental frequency, 1-dimensional voicing probability, 1-dimensional zero-crossing rate, 14-dimensional MFCC, 1-dimensional mean-square energy, 28-dimensional acoustic spectrum filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio;
step B.5), adjusting the frequency-domain features to 256x256 dimensions.
2. The TF-LSTM based CFFD extraction method of claim 1,
the first convolutional layer C1 of the hybrid deep neural network employs 96 convolution kernels of size 15 × 3 with the stride set to 3 × 1, followed by a max pooling layer of size 3 × 1 with a stride of 2; the second convolutional layer C2 has 256 convolution kernels of size 9 × 3 with a stride of 1; the second convolutional layer C2 is followed by a max pooling layer of size 3 × 1 with a stride of 1;
the third convolutional layer C3 has 384 convolution kernels of size 7 × 3, and C4 has 384 kernels of size 7 × 1;
the last convolutional layer C5 contains 256 convolution kernels of size 7 × 1, followed by a max pooling layer of size 3 × 1; convolutional layer C5 is then followed by two fully connected layers, each of 4096 dimensions.
3. A TF-LSTM-based speech emotion recognition method, characterized by comprising the following steps:
generating a CFTD according to pre-extracted time domain context information of the voice signal;
constructing a hybrid deep neural network model;
inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract CFFD;
the hybrid deep neural network model includes: an input layer, five convolutional layers, three max pooling layers, two fully connected layers and an LSTM module;
the first convolutional layer C1 is followed by the first max pooling layer; the second convolutional layer C2 is followed by the second max pooling layer; the third, fourth and fifth convolutional layers come next, followed by the third max pooling layer; the fifth convolutional layer C5 is followed by two fully connected layers, each of 4096 dimensions;
an LSTM module is spliced after the fully connected layers; the LSTM module has a hidden layer whose input is 4096-dimensional and whose output is also 4096-dimensional; the output of the LSTM serves as the output of the network, so the output of the whole network is 4096-dimensional, which is the CFFD;
fusing the two features, CFTD and CFFD, training a linear SVM classifier, and obtaining the final speech emotion recognition result;
the frequency-domain features are extracted by the following method:
step B.1), re-sampling the speech signal at a sampling frequency of 16 kHz;
step B.2), framing the speech signal with overlapping frames to ensure a smooth transition between frames, with a frame length of 512 points and a frame overlap of 256 points, and applying a Hamming window to obtain the short-time signal x(n) of each frame;
step B.3), performing a fast Fourier transform on each frame of the signal to obtain the frequency-domain data X(i, k);
step B.4), obtaining 65-dimensional frequency-domain features, namely: 1-dimensional smoothed fundamental frequency, 1-dimensional voicing probability, 1-dimensional zero-crossing rate, 14-dimensional MFCC, 1-dimensional mean-square energy, 28-dimensional acoustic spectrum filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio;
step B.5), adjusting the frequency-domain features to 256x256 dimensions.
4. The TF-LSTM-based speech emotion recognition method of claim 3, wherein extracting the time-domain context information of the speech signal and generating the CFTD comprises the following steps:
inputting a raw speech signal into a one-dimensional convolutional neural network for preprocessing;
the structure of the one-dimensional convolutional neural network is one input layer, 13 convolutional layers, 4 max pooling layers and 2 fully connected layers; the input is a one-dimensional raw speech signal;
the 19 layers begin with 2 convolutional layers and 1 pooling layer and end with 3 convolutional layers and the 2 fully connected layers;
two LSTM modules are spliced after the convolutional layers at layer 13 and layer 17, respectively; each LSTM module has a hidden layer whose input is 512-dimensional and whose output is also 512-dimensional; the outputs of the two LSTMs and the output of the last fully connected layer are concatenated directly as the output of the network; the output of the whole network is 4096-dimensional, giving the CFTD.
5. The TF-LSTM-based speech emotion recognition method of claim 4, wherein the convolution kernel size of all convolutional layers of the one-dimensional convolutional neural network is 3 × 3 and the convolution stride is 1; the pooling layers use a 2 × 2 window with a stride of 2; the convolutional layer inputs are spatially padded so that the resolution remains unchanged after convolution, and the output feature dimension of the last fully connected layer is 3072.
6. The TF-LSTM based speech emotion recognition method of claim 3,
the first convolutional layer C1 of the hybrid deep neural network employs 96 convolution kernels of size 15 × 3 with the stride set to 3 × 1, followed by a max pooling layer of size 3 × 1 with a stride of 2; the second convolutional layer C2 has 256 convolution kernels of size 9 × 3 with a stride of 1; the second convolutional layer C2 is followed by a max pooling layer of size 3 × 1 with a stride of 1;
the third convolutional layer C3 has 384 convolution kernels of size 7 × 3, and C4 has 384 kernels of size 7 × 1;
the last convolutional layer C5 contains 256 convolution kernels of size 7 × 1, followed by a max pooling layer of size 3 × 1; convolutional layer C5 is then followed by two fully connected layers, each of 4096 dimensions.
7. A TF-LSTM-based speech emotion recognition system, characterized by comprising:
the CFTD generating module is used for generating CFTD according to the pre-extracted time domain context information of the voice signal;
the hybrid deep neural network model construction module is used for constructing a hybrid deep neural network model;
the CFFD extraction module is used for inputting pre-extracted 256x256-dimensional frequency-domain features into the constructed hybrid deep neural network model to extract CFFD;
a classifier training module, used for fusing the two features, CFTD and CFFD, training a linear SVM classifier, and obtaining the final speech emotion recognition result;
the system further comprises a frequency-domain feature extraction module, used for extracting the frequency-domain features of the speech signal and adjusting them to 256x256-dimensional frequency-domain features, wherein the frequency-domain features are extracted by the following method:
step B.1), re-sampling the speech signal at a sampling frequency of 16 kHz;
step B.2), framing the speech signal with overlapping frames to ensure a smooth transition between frames, with a frame length of 512 points and a frame overlap of 256 points, and applying a Hamming window to obtain the short-time signal x(n) of each frame;
step B.3), performing a fast Fourier transform on each frame of the signal to obtain the frequency-domain data X(i, k);
step B.4), obtaining 65-dimensional frequency-domain features, namely: 1-dimensional smoothed fundamental frequency, 1-dimensional voicing probability, 1-dimensional zero-crossing rate, 14-dimensional MFCC, 1-dimensional mean-square energy, 28-dimensional acoustic spectrum filtering, 15-dimensional spectral energy, 1-dimensional local frequency jitter, 1-dimensional inter-frame frequency jitter, 1-dimensional local amplitude perturbation and 1-dimensional harmonic-to-noise ratio;
step B.5), adjusting the frequency-domain features to 256x256 dimensions.
8. The TF-LSTM based speech emotion recognition system of claim 7, wherein:
the CFTD generation module comprises a one-dimensional convolutional neural network construction module for preprocessing the input raw speech signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811258369.7A CN109036467B (en) | 2018-10-26 | 2018-10-26 | TF-LSTM-based CFFD extraction method, voice emotion recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811258369.7A CN109036467B (en) | 2018-10-26 | 2018-10-26 | TF-LSTM-based CFFD extraction method, voice emotion recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109036467A CN109036467A (en) | 2018-12-18 |
CN109036467B true CN109036467B (en) | 2021-04-16 |
Family
ID=64614086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811258369.7A Active CN109036467B (en) | 2018-10-26 | 2018-10-26 | TF-LSTM-based CFFD extraction method, voice emotion recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036467B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110010153A (en) * | 2019-03-25 | 2019-07-12 | 平安科技(深圳)有限公司 | A kind of mute detection method neural network based, terminal device and medium |
RU2720359C1 (en) * | 2019-04-16 | 2020-04-29 | Хуавэй Текнолоджиз Ко., Лтд. | Method and equipment for recognizing emotions in speech |
US20220240824A1 (en) * | 2019-05-16 | 2022-08-04 | Tawny Gmbh | System and method for recognising and measuring affective states |
CN110222748B (en) * | 2019-05-27 | 2022-12-20 | 西南交通大学 | OFDM radar signal identification method based on 1D-CNN multi-domain feature fusion |
CN110490892A (en) * | 2019-07-03 | 2019-11-22 | 中山大学 | A kind of Thyroid ultrasound image tubercle automatic positioning recognition methods based on USFaster R-CNN |
CN112447187B (en) * | 2019-09-02 | 2024-09-06 | 富士通株式会社 | Device and method for identifying sound event |
CN113449569B (en) * | 2020-03-27 | 2023-04-25 | 威海北洋电气集团股份有限公司 | Mechanical signal health state classification method and system based on distributed deep learning |
CN113314151A (en) * | 2021-05-26 | 2021-08-27 | 中国工商银行股份有限公司 | Voice information processing method and device, electronic equipment and storage medium |
CN114387977B (en) * | 2021-12-24 | 2024-06-11 | 深圳大学 | Voice cutting trace positioning method based on double-domain depth feature and attention mechanism |
CN114822737A (en) * | 2022-03-31 | 2022-07-29 | 浙江大学 | Electronic medical record-based Li's artificial liver preoperative diagnosis and treatment system and use method |
CN114882906A (en) * | 2022-06-30 | 2022-08-09 | 广州伏羲智能科技有限公司 | Novel environmental noise identification method and system |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10783900B2 (en) * | 2014-10-03 | 2020-09-22 | Google Llc | Convolutional, long short-term memory, fully connected deep neural networks |
CN105469065B (en) * | 2015-12-07 | 2019-04-23 | 中国科学院自动化研究所 | A kind of discrete emotion identification method based on recurrent neural network |
KR102033411B1 (en) * | 2016-08-12 | 2019-10-17 | 한국전자통신연구원 | Apparatus and Method for Recognizing speech By Using Attention-based Context-Dependent Acoustic Model |
CN106782602B (en) * | 2016-12-01 | 2020-03-17 | 南京邮电大学 | Speech emotion recognition method based on deep neural network |
CN107863111A (en) * | 2017-11-17 | 2018-03-30 | 合肥工业大学 | The voice language material processing method and processing device of interaction |
CN108154879B (en) * | 2017-12-26 | 2021-04-09 | 广西师范大学 | Non-specific human voice emotion recognition method based on cepstrum separation signal |
CN108597539B (en) * | 2018-02-09 | 2021-09-03 | 桂林电子科技大学 | Speech emotion recognition method based on parameter migration and spectrogram |
CN108447490B (en) * | 2018-02-12 | 2020-08-18 | 阿里巴巴集团控股有限公司 | Voiceprint recognition method and device based on memorability bottleneck characteristics |
- 2018-10-26: CN application CN201811258369.7A filed; granted as patent CN109036467B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN109036467A (en) | 2018-12-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109036467B (en) | TF-LSTM-based CFFD extraction method, voice emotion recognition method and system | |
CN108717856B (en) | Speech emotion recognition method based on multi-scale deep convolution cyclic neural network | |
CN108597539B (en) | Speech emotion recognition method based on parameter migration and spectrogram | |
US11488586B1 (en) | System for speech recognition text enhancement fusing multi-modal semantic invariance | |
CN110491382B (en) | Speech recognition method and device based on artificial intelligence and speech interaction equipment | |
Vashisht et al. | Speech recognition using machine learning | |
CN110033758B (en) | Voice wake-up implementation method based on small training set optimization decoding network | |
CN105139864B (en) | Audio recognition method and device | |
CN108777140A (en) | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus | |
CN112101045B (en) | Multi-mode semantic integrity recognition method and device and electronic equipment | |
CN110853656B (en) | Audio tampering identification method based on improved neural network | |
CN116110405B (en) | Land-air conversation speaker identification method and equipment based on semi-supervised learning | |
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
CN111009235A (en) | Voice recognition method based on CLDNN + CTC acoustic model | |
CN114842835B (en) | Voice interaction system based on deep learning model | |
CN114566189A (en) | Speech emotion recognition method and system based on three-dimensional depth feature fusion | |
CN112562725A (en) | Mixed voice emotion classification method based on spectrogram and capsule network | |
Xue et al. | Cross-modal information fusion for voice spoofing detection | |
CN114360584A (en) | Phoneme-level-based speech emotion layered recognition method and system | |
Hu et al. | Speech emotion recognition based on attention mcnn combined with gender information | |
CN107564546A (en) | A kind of sound end detecting method based on positional information | |
CN111009236A (en) | Voice recognition method based on DBLSTM + CTC acoustic model | |
Aggarwal et al. | Application of genetically optimized neural networks for hindi speech recognition system | |
CN113763939B (en) | Mixed voice recognition system and method based on end-to-end model | |
Mouaz et al. | A new framework based on KNN and DT for speech identification through emphatic letters in Moroccan dialect |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||