CN111326178A - Multi-mode speech emotion recognition system and method based on convolutional neural network - Google Patents

Multi-mode speech emotion recognition system and method based on convolutional neural network

Info

Publication number
CN111326178A
Authority
CN
China
Prior art keywords: voice, neural network, processing module, convolutional neural, signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010122988.4A
Other languages
Chinese (zh)
Inventor
叶吉祥
王东杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202010122988.4A priority Critical patent/CN111326178A/en
Publication of CN111326178A publication Critical patent/CN111326178A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1807 Speech classification or search using natural language modelling using prosody or stress
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal speech emotion recognition system and method based on a convolutional neural network. The system comprises a speech extraction module, a framing processing module, a frequency domain signal processing module, a spectrogram processing module, a convolutional neural network processing module, a feature extraction module and a speech emotion classification module. The method comprises the following steps: 1. extracting the voice signal from the voice file; 2. performing framing processing on the voice signal; 3. converting the voice signal into a frequency domain signal; 4. processing the frequency domain signal through a spectrogram; 5. processing the spectrogram-processed voice signal through a convolutional neural network; 6. extracting voice features from the voice signal; 7. performing emotion recognition and analysis on the voice signal. The invention aims to provide a multi-modal speech emotion recognition system and method based on a convolutional neural network that are highly intelligent and can accurately recognize emotion information from speech information.

Description

Multi-mode speech emotion recognition system and method based on convolutional neural network
Technical Field
The invention relates to the technical field of voice recognition analysis, in particular to a multi-modal voice emotion recognition system and method based on a convolutional neural network.
Background
The speech emotion recognition system generally consists of the following three parts: voice signal acquisition, emotion feature extraction and emotion recognition. Generally, the voice signal acquisition part obtains the original voice signal through a voice sensor (for example, a voice recording device such as a mobile phone microphone); in order to obtain a high-quality and stable voice signal, a voice preprocessing operation is usually required to lay a good data-quality foundation for subsequent emotion recognition.
The preprocessing of the speech signal usually includes pre-filtering, sampling and quantization, pre-emphasis, framing and windowing, and endpoint detection. The processed voice signal is then passed to the emotion feature extraction module to extract the acoustic features closely related to the speaker's emotion, and finally the extracted acoustic features are passed to the emotion recognition module to complete the emotion judgment.
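As an illustration of these preprocessing steps, the following minimal Python/NumPy sketch shows pre-emphasis and a simple short-time-energy endpoint detection; the pre-emphasis coefficient, frame length and energy threshold are illustrative assumptions rather than values taken from the invention.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Pre-emphasis filter y[n] = x[n] - alpha * x[n-1] to boost high frequencies."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def energy_endpoint_detection(x, frame_len=400, threshold_ratio=0.02):
    """Keep only frames whose short-time energy exceeds a fraction of the peak energy."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(float) ** 2).sum(axis=1)
    voiced = energy > threshold_ratio * energy.max()
    return frames[voiced].reshape(-1)
```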
Speech emotion recognition is a challenging task. Traditional classification models rely heavily on audio features to build well-performing classifiers: in the first step, speech features usable for model training are extracted from the original sound waveform; a suitable speech emotion recognition model is then established to obtain, from the extracted audio features, information that can distinguish different emotion categories; finally, a suitable classifier is selected to obtain emotion predictions on the test data set. However, human emotion is often multi-modal, comprising the three modalities of vision, speech and text, and each modality carries abundant information: the text modality includes basic language structure, syntax and speech acts, while the speech modality includes voice, intonation and vocal expression, so that the same utterance, such as "you can really be commander?", can convey different emotions depending on how it is spoken. Current speech emotion recognition systems and methods, however, are not accurate enough in the emotion analysis of the speech signal produced by the person being recognized, and the analysis is coarse, so that the emotion expressed by the speech signal cannot be fully analyzed and presented; further improvement is therefore needed.
Chinese patent application No. 201710172622.6, filed on March 21, 2017 and published on June 23, 2017, discloses an acoustic feature extraction method, an acoustic feature extraction device and terminal equipment based on a convolutional neural network, wherein the acoustic feature extraction method based on the convolutional neural network comprises the following steps: arranging the speech to be recognized into a spectrogram with a preset number of dimensions; and recognizing the spectrogram of the preset number of dimensions through a convolutional neural network to obtain the acoustic features of the speech to be recognized. The method and device can extract the acoustic features in speech through the convolutional neural network, better represent the acoustic features in speech, and improve the accuracy of speech recognition.
The above patent document discloses an acoustic feature extraction method, apparatus and terminal device based on a convolutional neural network, but that invention is not accurate in the emotion analysis and judgment of speech signals, cannot judge emotion from the speech information uttered by a person, and cannot meet the needs of current social development.
Disclosure of Invention
In view of the above, the present invention provides a multimodal speech emotion recognition system and method based on a convolutional neural network, which have a high intelligence level and can accurately recognize emotion information through speech information.
In order to achieve the first object of the present invention, the following technical solutions may be adopted:
A multi-modal speech emotion recognition system based on a convolutional neural network comprises a speech extraction module, a framing processing module, a frequency domain signal processing module, a spectrogram processing module, a convolutional neural network processing module, a feature extraction module and a speech emotion classification module; the voice extraction module is used for extracting a voice file, the framing processing module is used for performing framing and windowing processing on the voice file, and the frequency domain signal processing module is used for converting a voice time domain signal into a frequency domain signal; the spectrogram processing module is used for detecting the frequency change of a voice signal; the convolutional neural network processing module is used for extracting high-level frequency characteristics of the voice signal; the feature extraction module is used for extracting prosodic characteristics of the voice signal; the voice emotion classification module is used for carrying out emotion recognition and classification on the voice signal through the voice signal prosodic features;
the voice extracting module extracts a voice file and then inputs a voice signal to the frequency domain processing module through the framing processing module, the frequency domain processing module converts the voice signal into a frequency domain signal and then inputs the voice signal to the convolutional neural network processing module through the spectrogram processing module, and the convolutional neural network processing module extracts high-level frequency characteristics of the voice signal and then inputs the voice signal to the voice emotion classifying module through the characteristic extracting module to perform voice emotion recognition.
The voice file includes a voice file whose file format has the .wav suffix.
The framing processing module comprises a Hamming window framing processing module.
The Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
The conversion of the voice time domain signal into the frequency domain signal is performed through the Fourier transform formula:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
In order to achieve the second object of the present invention, the following technical solutions may be adopted:
a multi-mode speech emotion recognition method based on a convolutional neural network comprises the following steps:
1) extracting voice signals of the voice file;
2) performing framing processing on the voice signals;
3) converting the voice signal into a frequency domain signal;
4) processing the frequency domain signal through a spectrogram;
5) processing the voice signal processed by the spectrogram through a convolutional neural network;
6) carrying out voice feature extraction on the voice signal;
7) performing emotion recognition and analysis on the voice signal.
The voice file includes a voice file whose file format has the .wav suffix.
The framing processing module comprises a Hamming window framing processing module.
The Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
The conversion of the voice signal into a frequency domain signal is performed through the Fourier transform formula:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
The technical solution provided by the invention has the following beneficial effects: 1) the voice signal is processed through the convolutional neural network, so that emotion information in the voice is analyzed and recognized by means of the extracted voice features, which greatly improves the accuracy of analysis and recognition; 2) the invention has a high degree of intelligence, low maintenance cost and a wide application range; 3) the invention recognizes emotion information from voice information, so that intelligent applications are upgraded and updated.
Drawings
FIG. 1 is a block diagram of a multi-modal speech emotion recognition system based on a convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a flowchart of a multi-modal speech emotion recognition method based on a convolutional neural network according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments thereof.
Example 1
Artificial intelligence technology is developing rapidly in all industries, and various human-computer interaction products have emerged accordingly. Emotion recognition is indispensable in intelligent question answering robots, intelligent medical care systems, automatic driving and the like.
When a person chats with a robot, the robot can respond with a conversation that shares the person's emotion according to the person's facial expression; when a patient has special needs, the nursing system can respond in time and notify doctors and nurses; when a driver is fatigued or production personnel are overly tired, emotion detection and monitoring can discover this and give a prompt in time to prevent traffic accidents and production safety accidents. The application scenarios of emotion recognition are very close to daily life; once a machine can "read a person's words and expressions", it can interact with people harmoniously and serve them better. Therefore, designing an efficient and practical speech emotion recognition system is an inevitable requirement of the new technological revolution.
Referring to fig. 1, a multi-modal speech emotion recognition system based on a convolutional neural network comprises a speech extraction module 1, a framing processing module 2, a frequency domain signal processing module 3, a speech spectrogram processing module 4, a convolutional neural network processing module 5, a feature extraction module 6 and a speech emotion classification module 7; the voice extraction module 1 is used for extracting a voice file, the framing processing module 2 is used for performing framing window processing on the voice file, and the frequency domain signal processing module 3 is used for converting a voice time domain signal into a frequency domain signal; the spectrogram processing module 4 is configured to detect a frequency change of a speech signal; the convolutional neural network processing module 5 is used for extracting high-level frequency characteristics of the voice signal; the feature extraction module 6 is configured to extract prosodic features of the voice signal; the speech emotion classification module 7 is used for performing emotion recognition and classification on the speech signals through the prosodic features of the speech signals;
the voice extracting module 1 extracts a voice file and then inputs a voice signal into the frequency domain processing module 3 through the framing processing module 2, the frequency domain processing module 3 converts the voice signal into a frequency domain signal and then inputs the voice signal into the convolutional neural network processing module 5 through the spectrogram processing module 4, and the convolutional neural network processing module 5 extracts high-level frequency characteristics of the voice signal and then inputs the voice signal into the voice emotion classifying module 7 through the characteristic extracting module 6 to perform voice emotion recognition.
In this embodiment, the voice reading module mainly reads an original voice file; after the computer system reads the original voice file, the voice analog signal is converted into a digital signal, and subsequent analysis operations are then performed.
Preferably, the voice file includes a voice file whose file format has the .wav suffix.
The voice file with the .wav suffix is essentially an uncompressed original audio file; its file structure is not very complex, and metadata such as the audio sampling rate and sampling precision make it convenient to operate on.
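For illustration, an uncompressed .wav file of this kind can be read into a digital signal array with a few lines of Python; the file name is hypothetical and SciPy is only one of several libraries that could be used for this step.

```python
from scipy.io import wavfile

# Read the .wav file: returns the sampling rate in Hz and the digitized
# signal as a NumPy array (integer or float samples).
sample_rate, signal = wavfile.read("speech_sample.wav")  # hypothetical file name
print(sample_rate, signal.shape, signal.dtype)
```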
In this embodiment, after the waveform file of the voice signal is read, the waveform file is framed and windowed. Because speech signals are short-time stationary, the analysis and processing of any speech signal must be carried out on a "short-time" basis: the speech signal is divided into segments and the characteristic parameters of each segment are analyzed. Each segment is called a "frame", typically 10-30 ms long. For the whole speech signal, the analyzed parameters then form a time series composed of the feature parameters of each frame.
The framing is performed by weighting with a movable finite-length window, i.e. multiplying the speech signal by a certain window function w(m). The windowed speech signal is represented as: x_n(m) = w(m)·x(n+m)
Common window functions are the rectangular window and the Hamming window. Preferably, in this embodiment, the framing processing module performs framing on the voice signal through the Hamming window framing processing module 21.
The Hamming window framing processing module acquires the framed signal through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
The main lobe of the Hamming-windowed frame signal is about twice as wide as that of the rectangular-windowed frame signal, i.e. the bandwidth roughly doubles, while the out-of-band attenuation is far greater than that of the rectangular window. The rectangular window has good spectral smoothing performance but loses high-frequency components and hence waveform detail; the Hamming window behaves in the opposite way and is therefore preferred.
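A minimal sketch of the framing and Hamming windowing described above, assuming 25 ms frames with a 10 ms shift at a 16 kHz sampling rate (400 and 160 samples); these values are assumptions, not figures stated in the patent.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, frame_shift=160):
    """Split the signal into overlapping frames and apply the Hamming window
    w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) from the formula above."""
    n = np.arange(frame_len)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    starts = np.arange(0, len(signal) - frame_len + 1, frame_shift)
    frames = np.stack([signal[s:s + frame_len] for s in starts])
    return frames * hamming  # shape: (num_frames, frame_len)
```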
The voice file that has been framed and windowed is converted from a time domain signal to a frequency domain signal through the Fourier transform; the Fourier transform formula is:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}
the speech signal is converted into a frequency domain signal and then processed by the spectrogram processing module 4, and the spectrogram processing module 4 uses the spectrogram as a speech representation function, so that background non-speech signals, such as music (excluding singing) and crowd noise, can be effectively processed even under a noise level equivalent to the speech signal level. Using harmonic modeling to remove non-speech components from the spectrogram, we demonstrate a significant improvement in emotion recognition accuracy in the presence of unknown background non-speech signals.
The spectrogram processing module 4 reflects the frequency change of each frame of the speech signal and the energy fluctuation on the frame through a spectrogram.
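As a rough sketch of this step, the windowed frames from the previous stage can be turned into a log-magnitude spectrogram with a per-frame FFT; the FFT size of 512 is an assumed value.

```python
import numpy as np

def log_spectrogram(frames, n_fft=512):
    """Each row of the result is the log magnitude spectrum of one frame,
    so time runs along axis 0 and frequency along axis 1."""
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(magnitude + 1e-10)  # log compression, as in a spectrogram image
```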
And the voice signal passing through the spectrogram processing module 4 passes through the convolutional neural network processing module 5 to extract high-level frequency characteristics of the voice signal.
The Convolutional Neural Network (CNN) is a feedforward neural network that introduces local receptive fields, convolution, pooling and similar mechanisms into the neural network. The convolutional neural network is composed of convolution layers, fully connected layers, pooling layers and the like.
The convolution layer extracts different features from the input two-dimensional time-frequency map through convolution operations. Shallow convolution layers can only extract low-level features such as edge, contour and horizontal information, while more convolution layers can be designed to iteratively extract more complex features from the low-level ones. T two-dimensional colour pictures with height H, width W and three channels can be represented as T × H × W. T filters perform convolution on the corresponding feature matrices, and the outputs at corresponding positions are summed to obtain one output matrix; K groups of filters therefore yield K output matrices. Denoting the k-th output matrix by O_k, it can be described by the following formula:

O_k = Σ_{t=1}^{T} ω_{k,t} * X_t + b_k

where * is the convolution operation, ω represents the parameters of the convolution filters, X_t is the t-th input feature matrix, and b_k is the offset (bias) of the k-th output matrix.
The pooling layer mainly performs downsampling (subsampling) on the feature maps learned by the convolution layers. Its main functions are to reduce the input dimensionality of subsequent network layers, reduce the size of the model and improve the robustness of the model. By computing aggregation statistics over features at different positions of the image, it reduces the number of model parameters while retaining the effective information of the image.
The convolutional neural network is structured with 2 convolution layers and 2 maximum pooling layers: the convolution kernel size is 3 × 3 and the maximum pooling size is 2 × 2; the first convolution layer contains 32 convolution filters and the second convolution layer contains 64 convolution filters, both with 3 × 3 kernels; the first maximum pooling layer has 32 channels and the second has 64. High-level frequency features in the voice signal are extracted through this convolutional neural network processing.
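One possible realization of this architecture, sketched with tf.keras; the input shape, activations and number of emotion classes are assumptions that the patent does not specify.

```python
import tensorflow as tf

def build_cnn(input_shape=(128, 128, 1), num_emotions=6):
    """Two 3x3 convolution layers (32 then 64 filters), each followed by
    2x2 max pooling, with a softmax classification head."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_emotions, activation="softmax"),
    ])
```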
In this embodiment, after the high-level frequency features of the speech signal are extracted through the convolutional neural network processing, the prosodic features of the speech signal are extracted through the feature extraction module 6.
In this embodiment, the prosodic features of the speech signal may be extracted with the pyAudioAnalysis tool, which can extract features such as the fundamental frequency, voicing probability and loudness curve of the speech signal; the extracted prosodic features of the voice signal and the frequency features from the spectrogram are then transformed to obtain voice features of the same dimensionality.
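A hedged sketch of this step, assuming a recent pyAudioAnalysis release that exposes ShortTermFeatures.feature_extraction (older releases used audioFeatureExtraction.stFeatureExtraction instead); the call returns a general short-term feature matrix rather than exactly the fundamental frequency, voicing probability and loudness curve named above, and the random projection stands in for whatever learned transformation aligns the prosodic features with the spectrogram features.

```python
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

# Read the audio and extract short-term features (50 ms window, 25 ms step assumed).
fs, x = audioBasicIO.read_audio_file("speech_sample.wav")  # hypothetical file name
x = audioBasicIO.stereo_to_mono(x)
features, feature_names = ShortTermFeatures.feature_extraction(
    x, fs, int(0.050 * fs), int(0.025 * fs))  # shape: (num_features, num_frames)

# Placeholder projection to bring the prosodic features to the same
# dimensionality as the CNN frequency features before fusing them.
target_dim = 64
rng = np.random.default_rng(0)
projection = rng.standard_normal((features.shape[0], target_dim))
aligned = features.T @ projection  # shape: (num_frames, target_dim)
```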
In this embodiment, the speech signal is input to the speech emotion classification module 7 through the feature extraction module 6 to perform speech emotion recognition. The speech emotion classification module 7 is used for performing emotion recognition and classification on the speech signals through the prosodic features of the speech signals;
and transmitting the voice feature vectors extracted by the feature extraction module 6 into softmax for emotion category prediction. The probability distribution of the predicted values is as follows:
Figure BDA0002393560020000111
where e is the output of the full link layer,
Figure BDA0002393560020000112
and the deviation b is a parameter value for model learning. Wherein, softhe formula for tmax is:
Figure BDA0002393560020000113
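A small NumPy illustration of this prediction step; the feature vector, weights and number of emotion classes are made up for the example.

```python
import numpy as np

def softmax(z):
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j), computed in a numerically stable way."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

e_vec = np.array([0.8, -0.3, 1.5, 0.1])   # assumed output e of the fully connected layer
W = np.full((6, 4), 0.1)                  # assumed learned weights for 6 emotion classes
b = np.zeros(6)                           # assumed learned bias
probabilities = softmax(W @ e_vec + b)    # predicted emotion probability distribution
```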
because the emotion information is weak information contained in the voice, the recognition model is required to have the strong feature learning capability, the learning capability of the traditional method is very limited, and the deep learning-based method disclosed by the invention shows considerable classification and feature learning capabilities by virtue of the strong feature learning capability. The convolutional neural network is used for extracting the frequency characteristics contained in the spectrogram, and the characteristics of each frequency segment in the voice are extracted through the strong learning ability, so that the accuracy of voice emotion classification is improved.
Example 2
Referring to fig. 2, the difference from the above embodiment is that in this embodiment, a method for multi-modal speech emotion recognition based on convolutional neural network includes the following steps:
1) extracting the voice signal from the voice file (S1);
2) framing the voice signal (S2);
3) converting the voice signal into a frequency domain signal (S3);
4) processing the frequency domain signal through a spectrogram (S4);
5) processing the spectrogram-processed voice signal through a convolutional neural network (S5);
6) performing voice feature extraction on the voice signal (S6);
7) performing emotion recognition and analysis on the voice signal (S7).
In this embodiment, preferably, the voice file includes a voice file whose file format has the .wav suffix.
The step 1) of extracting the voice signal from the voice file may be realized by converting the voice analog signal into a digital signal after the voice signal is read by the computer system, and then performing a subsequent analysis operation.
The step 2) of framing the voice signal is to perform framing and windowing on the waveform file after the waveform file of the voice signal is read. Preferably, the framing is performed by a hamming window framing processing module 21.
The Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
Step 3) converts the time domain signal to the frequency domain signal through the Fourier transform of the framed and windowed voice file; the Fourier transform formula is:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
And 4) processing the frequency domain signal through a spectrogram, namely reflecting the frequency change of each frame of signal of the voice signal and the energy fluctuation on the frame of signal through the spectrogram.
And 5) processing the voice signal processed by the spectrogram through a convolutional neural network, namely extracting high-level frequency characteristics of the voice signal through the convolutional neural network.
And 6) further extracting the voice characteristics of the voice signal according to the extracted high-level frequency characteristics of the voice signal, namely actually extracting the prosodic characteristics of the voice signal.
The prosodic features of the voice signal may be extracted with the pyAudioAnalysis tool, yielding features such as the fundamental frequency, voicing probability and loudness curve of the voice signal; the extracted prosodic features of the voice signal and the frequency features from the spectrogram are then transformed to obtain voice features of the same dimensionality.
The step 7) is used for carrying out emotion recognition and analysis on the voice signals and carrying out emotion recognition and classification on the voice signals through voice signal prosody characteristics;
and transmitting the voice feature vectors extracted by the feature extraction module 6 into softmax for emotion category prediction. The probability distribution of the predicted values is as follows:
Figure BDA0002393560020000141
where e is the output of the full link layer,
Figure BDA0002393560020000142
and the deviation b is a parameter value for model learning. Wherein, the formula of softmax is:
Figure BDA0002393560020000143
the invention fully combines the characteristics of human emotion multimode and identifies the emotional characteristics of the speaker through the voice characteristics. The method comprises the steps of extracting voice features, normalizing a two-dimensional spectrogram by using a model (CNN + BilSTM) of a mixed neural network, inputting the normalized two-dimensional spectrogram into the CNN model to extract high-dimensional features, and extracting context information contained in a voice fragment by using the BilSTM network based on a time sequence. The method has the advantages of accurate and efficient emotion feature identification, and can be applied to the wider artificial intelligence field.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A multi-mode speech emotion recognition system based on a convolutional neural network is characterized in that: the system comprises a voice extraction module, a framing processing module, a frequency domain signal processing module, a spectrogram processing module, a convolutional neural network processing module, a feature extraction module and a voice emotion classification module; the voice extraction module is used for extracting a voice file, the framing processing module is used for performing framing and windowing processing on the voice file, and the frequency domain signal processing module is used for converting a voice time domain signal into a frequency domain signal; the spectrogram processing module is used for detecting the frequency change of a voice signal; the convolutional neural network processing module is used for extracting high-level frequency characteristics of the voice signal; the feature extraction module is used for extracting prosodic characteristics of the voice signal; the voice emotion classification module is used for carrying out emotion recognition and classification on the voice signal through the voice signal prosodic features;
the voice extracting module extracts a voice file and then inputs a voice signal to the frequency domain processing module through the framing processing module, the frequency domain processing module converts the voice signal into a frequency domain signal and then inputs the voice signal to the convolutional neural network processing module through the spectrogram processing module, and the convolutional neural network processing module extracts high-level frequency characteristics of the voice signal and then inputs the voice signal to the voice emotion classifying module through the characteristic extracting module to perform voice emotion recognition.
2. The convolutional neural network based multimodal speech emotion recognition system of claim 1, wherein: the voice file includes a voice file whose file format has the .wav suffix.
3. The convolutional neural network based multimodal speech emotion recognition system of claim 1, wherein: the framing processing module comprises a Hamming window framing processing module.
4. The convolutional neural network based multimodal speech emotion recognition system of claim 3, wherein: the Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
5. The convolutional neural network based multimodal speech emotion recognition system of claim 1, wherein: the conversion of the voice time domain signal into the frequency domain signal is performed through the Fourier transform formula:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
6. A multi-mode speech emotion recognition method based on a convolutional neural network is characterized by comprising the following steps: the method comprises the following steps:
1) extracting voice signals of the voice file;
2) performing framing processing on the voice signals;
3) converting the voice signal into a frequency domain signal;
4) processing the frequency domain signal through a spectrogram;
5) processing the voice signal processed by the spectrogram through a convolutional neural network;
6) carrying out voice feature extraction on the voice signal;
7) performing emotion recognition and analysis on the voice signal.
7. The method for multi-modal emotion speech recognition based on convolutional neural network as claimed in claim 6, wherein: the voice file includes a voice file whose file format has the .wav suffix.
8. The method for multi-modal emotion speech recognition based on convolutional neural network as claimed in claim 6, wherein: the framing processing module comprises a Hamming window framing processing module.
9. The method for multi-modal emotion speech recognition based on convolutional neural network as claimed in claim 8, wherein: the Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
10. The method for multi-modal emotion speech recognition based on convolutional neural network as claimed in claim 6, wherein: the conversion of the voice signal into a frequency domain signal is performed through the Fourier transform formula:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
CN202010122988.4A 2020-02-27 2020-02-27 Multi-mode speech emotion recognition system and method based on convolutional neural network Pending CN111326178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010122988.4A CN111326178A (en) 2020-02-27 2020-02-27 Multi-mode speech emotion recognition system and method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010122988.4A CN111326178A (en) 2020-02-27 2020-02-27 Multi-mode speech emotion recognition system and method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN111326178A true CN111326178A (en) 2020-06-23

Family

ID=71168243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010122988.4A Pending CN111326178A (en) 2020-02-27 2020-02-27 Multi-mode speech emotion recognition system and method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN111326178A (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN106887225A (en) * 2017-03-21 2017-06-23 百度在线网络技术(北京)有限公司 Acoustic feature extracting method, device and terminal device based on convolutional neural networks
CN108427670A (en) * 2018-04-08 2018-08-21 重庆邮电大学 A kind of sentiment analysis method based on context word vector sum deep learning
CN108899049A (en) * 2018-05-31 2018-11-27 中国地质大学(武汉) A kind of speech-emotion recognition method and system based on convolutional neural networks
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110059188A (en) * 2019-04-11 2019-07-26 四川黑马数码科技有限公司 A kind of Chinese sentiment analysis method based on two-way time convolutional network
CN110175641A (en) * 2019-05-22 2019-08-27 中国科学院苏州纳米技术与纳米仿生研究所 Image-recognizing method, device, equipment and storage medium
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech
CN110502753A (en) * 2019-08-23 2019-11-26 昆明理工大学 A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN110806736A (en) * 2019-11-19 2020-02-18 北京工业大学 Method for detecting quality information of forge pieces of die forging forming intelligent manufacturing production line

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11495216B2 (en) 2020-09-09 2022-11-08 International Business Machines Corporation Speech recognition using data analysis and dilation of interlaced audio input
US11538464B2 (en) 2020-09-09 2022-12-27 International Business Machines Corporation . Speech recognition using data analysis and dilation of speech content from separated audio input
CN112669874A (en) * 2020-12-16 2021-04-16 西安电子科技大学 Voice feature extraction method based on quantum Fourier transform
CN112669874B (en) * 2020-12-16 2023-08-15 西安电子科技大学 Speech feature extraction method based on quantum Fourier transform
CN112784730A (en) * 2021-01-20 2021-05-11 东南大学 Multi-modal emotion recognition method based on time domain convolutional network
CN112784730B (en) * 2021-01-20 2022-03-29 东南大学 Multi-modal emotion recognition method based on time domain convolutional network
CN112562657A (en) * 2021-02-23 2021-03-26 成都启英泰伦科技有限公司 Personalized language offline learning method based on deep neural network
CN113283331A (en) * 2021-05-20 2021-08-20 长沙融创智胜电子科技有限公司 Multi-class target identification method and system for unattended sensor system
CN113283331B (en) * 2021-05-20 2023-11-14 长沙融创智胜电子科技有限公司 Multi-class target identification method and system for unattended sensor system
CN113438368A (en) * 2021-06-22 2021-09-24 上海翰声信息技术有限公司 Method, device and computer readable storage medium for realizing ring back tone detection
CN113438368B (en) * 2021-06-22 2023-01-24 上海翰声信息技术有限公司 Method, device and computer readable storage medium for realizing ring back tone detection

Similar Documents

Publication Publication Date Title
CN111326178A (en) Multi-mode speech emotion recognition system and method based on convolutional neural network
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN109979436B (en) BP neural network voice recognition system and method based on spectrum self-adaption method
Patni et al. Speech emotion recognition using MFCC, GFCC, chromagram and RMSE features
Alghifari et al. On the use of voice activity detection in speech emotion recognition
US20230298616A1 (en) System and Method For Identifying Sentiment (Emotions) In A Speech Audio Input with Haptic Output
CN115881164A (en) Voice emotion recognition method and system
Hamsa et al. An enhanced emotion recognition algorithm using pitch correlogram, deep sparse matrix representation and random forest classifier
Hamsa et al. Speaker identification from emotional and noisy speech using learned voice segregation and speech VGG
Jie Speech emotion recognition based on convolutional neural network
CN116013371A (en) Neurodegenerative disease monitoring method, system, device and storage medium
Hu et al. Speech emotion recognition based on attention mcnn combined with gender information
CN116543797A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
Shin et al. Speaker-invariant psychological stress detection using attention-based network
CN116230020A (en) Speech emotion recognition and classification method
CN114881668A (en) Multi-mode-based deception detection method
CN111883178B (en) Double-channel voice-to-image-based emotion recognition method
Yousfi et al. Isolated Iqlab checking rules based on speech recognition system
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
Lashkari et al. NMF-based cepstral features for speech emotion recognition
Tomar et al. CNN-MFCC model for speaker recognition using emotive speech
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200623)