CN111326178A - Multi-mode speech emotion recognition system and method based on convolutional neural network - Google Patents

Multi-mode speech emotion recognition system and method based on convolutional neural network

Info

Publication number
CN111326178A
Authority
CN
China
Prior art keywords: voice, neural network, processing module, convolutional neural, signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010122988.4A
Other languages
Chinese (zh)
Inventor
叶吉祥
王东杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202010122988.4A priority Critical patent/CN111326178A/en
Publication of CN111326178A publication Critical patent/CN111326178A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1807 Speech classification or search using natural language modelling using prosody or stress
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal speech emotion recognition system and method based on a convolutional neural network. The system comprises a speech extraction module, a framing processing module, a frequency domain signal processing module, a spectrogram processing module, a convolutional neural network processing module, a feature extraction module and a speech emotion classification module. The method comprises the following steps: 1. extracting the voice signal from the voice file; 2. performing framing processing on the voice signal; 3. converting the voice signal into a frequency domain signal; 4. processing the frequency domain signal through a spectrogram; 5. processing the spectrogram-processed voice signal through a convolutional neural network; 6. extracting voice features from the voice signal; 7. performing emotion recognition and analysis on the voice signal. The invention aims to provide a multi-modal speech emotion recognition system and method based on a convolutional neural network that are highly intelligent and can accurately recognize emotion information from speech information.

Description

Multi-mode speech emotion recognition system and method based on convolutional neural network
Technical Field
The invention relates to the technical field of voice recognition analysis, in particular to a multi-modal voice emotion recognition system and method based on a convolutional neural network.
Background
The speech emotion recognition system generally consists of the following three parts: voice signal acquisition, emotion feature extraction and emotion recognition. Generally, the voice signal acquisition part obtains the original voice signal through a voice sensor (for example, a voice recording device such as a mobile phone microphone); in order to obtain a high-quality and stable voice signal, a voice preprocessing operation is usually required to lay a good data-quality foundation for subsequent emotion recognition.
The preprocessing of the speech signal usually includes pre-filtering, sampling and quantization, pre-emphasis, framing and windowing, and endpoint detection. The processed voice signal is then passed to the emotion feature extraction module to extract the acoustic features closely related to the speaker's emotion, and finally the extracted acoustic features are passed to the emotion recognition module to complete the emotion judgment.
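As an illustration of these preprocessing steps, the following minimal Python/NumPy sketch shows pre-emphasis and a simple short-time-energy endpoint detection; the pre-emphasis coefficient, frame length and energy threshold are illustrative assumptions rather than values taken from the invention.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Pre-emphasis filter y[n] = x[n] - alpha * x[n-1] to boost high frequencies."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def energy_endpoint_detection(x, frame_len=400, threshold_ratio=0.02):
    """Keep only frames whose short-time energy exceeds a fraction of the peak energy."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(float) ** 2).sum(axis=1)
    voiced = energy > threshold_ratio * energy.max()
    return frames[voiced].reshape(-1)
```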
Speech emotion recognition is a challenging task. Traditional classification models rely heavily on audio features to build well-performing classifiers: in the first step, speech features usable for model training are extracted from the original sound waveform; a suitable speech emotion recognition model is then established to obtain, from the extracted audio features, information that can distinguish different emotion categories; finally, a suitable classifier is selected to obtain emotion predictions on the test data set. However, human emotion is often multi-modal, comprising the three modalities of vision, speech and text, and each modality carries abundant information: the text modality includes basic language structure, syntax and speech acts, while the speech modality includes voice, intonation and vocal expression, so that the same utterance, such as "you can really be commander?", can convey different emotions depending on how it is spoken. Current speech emotion recognition systems and methods, however, are not accurate enough in the emotion analysis of the speech signal produced by the person being recognized, and the analysis is coarse, so that the emotion expressed by the speech signal cannot be fully analyzed and presented; further improvement is therefore needed.
Chinese patent application No. 201710172622.6, filed on March 21, 2017 and published on June 23, 2017, discloses an acoustic feature extraction method, an acoustic feature extraction device and terminal equipment based on a convolutional neural network, wherein the acoustic feature extraction method based on the convolutional neural network comprises the following steps: arranging the speech to be recognized into a spectrogram with a preset number of dimensions; and recognizing the spectrogram of the preset number of dimensions through a convolutional neural network to obtain the acoustic features of the speech to be recognized. The method and device can extract the acoustic features in speech through the convolutional neural network, better represent the acoustic features in speech, and improve the accuracy of speech recognition.
The above patent document discloses an acoustic feature extraction method, apparatus and terminal device based on a convolutional neural network, but that invention is not accurate in the emotion analysis and judgment of speech signals, cannot judge emotion from the speech information uttered by a person, and cannot meet the needs of current social development.
Disclosure of Invention
In view of the above, the present invention provides a multimodal speech emotion recognition system and method based on a convolutional neural network, which have a high intelligence level and can accurately recognize emotion information through speech information.
In order to achieve the first object of the present invention, the following technical solutions may be adopted:
A multi-modal speech emotion recognition system based on a convolutional neural network comprises a speech extraction module, a framing processing module, a frequency domain signal processing module, a spectrogram processing module, a convolutional neural network processing module, a feature extraction module and a speech emotion classification module; the voice extraction module is used for extracting a voice file, the framing processing module is used for performing framing and windowing processing on the voice file, and the frequency domain signal processing module is used for converting a voice time domain signal into a frequency domain signal; the spectrogram processing module is used for detecting the frequency change of a voice signal; the convolutional neural network processing module is used for extracting high-level frequency characteristics of the voice signal; the feature extraction module is used for extracting prosodic characteristics of the voice signal; the voice emotion classification module is used for carrying out emotion recognition and classification on the voice signal through the voice signal prosodic features;
the voice extracting module extracts a voice file and then inputs a voice signal to the frequency domain processing module through the framing processing module, the frequency domain processing module converts the voice signal into a frequency domain signal and then inputs the voice signal to the convolutional neural network processing module through the spectrogram processing module, and the convolutional neural network processing module extracts high-level frequency characteristics of the voice signal and then inputs the voice signal to the voice emotion classifying module through the characteristic extracting module to perform voice emotion recognition.
The voice file includes a voice file whose file format has the .wav suffix.
The framing processing module comprises a Hamming window framing processing module.
The Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
The conversion of the voice time domain signal into the frequency domain signal is performed through the Fourier transform formula:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
In order to achieve the second object of the present invention, the following technical solutions may be adopted:
a multi-mode speech emotion recognition method based on a convolutional neural network comprises the following steps:
1) extracting voice signals of the voice file;
2) performing framing processing on the voice signals;
3) converting the voice signal into a frequency domain signal;
4) processing the frequency domain signal through a spectrogram;
5) processing the voice signal processed by the spectrogram through a convolutional neural network;
6) carrying out voice feature extraction on the voice signal;
7) performing emotion recognition and analysis on the voice signal.
The voice file includes a voice file whose file format has the .wav suffix.
The framing processing module comprises a Hamming window framing processing module.
The Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
The conversion of the voice signal into a frequency domain signal is performed through the Fourier transform formula:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
The technical solution provided by the invention has the following beneficial effects: 1) the voice signal is processed through the convolutional neural network, so that emotion information in the voice is analyzed and recognized by means of the extracted voice features, which greatly improves the accuracy of analysis and recognition; 2) the invention has a high degree of intelligence, low maintenance cost and a wide application range; 3) the invention recognizes emotion information from voice information, so that intelligent applications are upgraded and updated.
Drawings
FIG. 1 is a block diagram of a multi-modal speech emotion recognition system based on a convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a flowchart of a multi-modal speech emotion recognition method based on a convolutional neural network according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments thereof.
Example 1
Artificial intelligence technology is developing rapidly in all industries, and various human-computer interaction products have emerged accordingly. Emotion recognition is indispensable in intelligent question answering robots, intelligent medical care systems, automatic driving and the like.
When a person chats with a robot, the robot can respond with a conversation that shares the person's emotion according to the person's facial expression; when a patient has special needs, the nursing system can respond in time and notify doctors and nurses; when a driver is fatigued or production personnel are overly tired, emotion detection and monitoring can discover this and give a prompt in time to prevent traffic accidents and production safety accidents. The application scenarios of emotion recognition are very close to daily life; once a machine can "read a person's words and expressions", it can interact with people harmoniously and serve them better. Therefore, designing an efficient and practical speech emotion recognition system is an inevitable requirement of the new technological revolution.
Referring to fig. 1, a multi-modal speech emotion recognition system based on a convolutional neural network comprises a speech extraction module 1, a framing processing module 2, a frequency domain signal processing module 3, a speech spectrogram processing module 4, a convolutional neural network processing module 5, a feature extraction module 6 and a speech emotion classification module 7; the voice extraction module 1 is used for extracting a voice file, the framing processing module 2 is used for performing framing window processing on the voice file, and the frequency domain signal processing module 3 is used for converting a voice time domain signal into a frequency domain signal; the spectrogram processing module 4 is configured to detect a frequency change of a speech signal; the convolutional neural network processing module 5 is used for extracting high-level frequency characteristics of the voice signal; the feature extraction module 6 is configured to extract prosodic features of the voice signal; the speech emotion classification module 7 is used for performing emotion recognition and classification on the speech signals through the prosodic features of the speech signals;
the voice extracting module 1 extracts a voice file and then inputs a voice signal into the frequency domain processing module 3 through the framing processing module 2, the frequency domain processing module 3 converts the voice signal into a frequency domain signal and then inputs the voice signal into the convolutional neural network processing module 5 through the spectrogram processing module 4, and the convolutional neural network processing module 5 extracts high-level frequency characteristics of the voice signal and then inputs the voice signal into the voice emotion classifying module 7 through the characteristic extracting module 6 to perform voice emotion recognition.
In this embodiment, the voice reading module mainly reads an original voice file; after the computer system reads the original voice file, the voice analog signal is converted into a digital signal, and subsequent analysis operations are then performed.
Preferably, the voice file includes a voice file whose file format has the .wav suffix.
The voice file with the .wav suffix is essentially an uncompressed original audio file; its file structure is not very complex, and metadata such as the audio sampling rate and sampling precision make it convenient to operate on.
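For illustration, an uncompressed .wav file of this kind can be read into a digital signal array with a few lines of Python; the file name is hypothetical and SciPy is only one of several libraries that could be used for this step.

```python
from scipy.io import wavfile

# Read the .wav file: returns the sampling rate in Hz and the digitized
# signal as a NumPy array (integer or float samples).
sample_rate, signal = wavfile.read("speech_sample.wav")  # hypothetical file name
print(sample_rate, signal.shape, signal.dtype)
```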
In this embodiment, after the waveform file of the voice signal is read, the waveform file is framed and windowed. Because speech signals are short-time stationary, the analysis and processing of any speech signal must be carried out on a "short-time" basis: the speech signal is divided into segments and the characteristic parameters of each segment are analyzed. Each segment is called a "frame", typically 10-30 ms long. For the whole speech signal, the analyzed parameters then form a time series composed of the feature parameters of each frame.
The framing is performed by weighting with a movable finite-length window, i.e. multiplying the speech signal by a certain window function w(m). The windowed speech signal is represented as: x_n(m) = w(m)·x(n+m)
Common window functions are the rectangular window and the Hamming window. Preferably, in this embodiment, the framing processing module performs framing on the voice signal through the Hamming window framing processing module 21.
The Hamming window framing processing module acquires the framed signal through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
The main lobe of the Hamming-windowed frame signal is about twice as wide as that of the rectangular-windowed frame signal, i.e. the bandwidth roughly doubles, while the out-of-band attenuation is far greater than that of the rectangular window. The rectangular window has good spectral smoothing performance but loses high-frequency components and hence waveform detail; the Hamming window behaves in the opposite way and is therefore preferred.
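A minimal sketch of the framing and Hamming windowing described above, assuming 25 ms frames with a 10 ms shift at a 16 kHz sampling rate (400 and 160 samples); these values are assumptions, not figures stated in the patent.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, frame_shift=160):
    """Split the signal into overlapping frames and apply the Hamming window
    w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) from the formula above."""
    n = np.arange(frame_len)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    starts = np.arange(0, len(signal) - frame_len + 1, frame_shift)
    frames = np.stack([signal[s:s + frame_len] for s in starts])
    return frames * hamming  # shape: (num_frames, frame_len)
```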
The voice file that has been framed and windowed is converted from a time domain signal to a frequency domain signal through the Fourier transform; the Fourier transform formula is:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}
the speech signal is converted into a frequency domain signal and then processed by the spectrogram processing module 4, and the spectrogram processing module 4 uses the spectrogram as a speech representation function, so that background non-speech signals, such as music (excluding singing) and crowd noise, can be effectively processed even under a noise level equivalent to the speech signal level. Using harmonic modeling to remove non-speech components from the spectrogram, we demonstrate a significant improvement in emotion recognition accuracy in the presence of unknown background non-speech signals.
The spectrogram processing module 4 reflects the frequency change of each frame of the speech signal and the energy fluctuation on the frame through a spectrogram.
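As a rough sketch of this step, the windowed frames from the previous stage can be turned into a log-magnitude spectrogram with a per-frame FFT; the FFT size of 512 is an assumed value.

```python
import numpy as np

def log_spectrogram(frames, n_fft=512):
    """Each row of the result is the log magnitude spectrum of one frame,
    so time runs along axis 0 and frequency along axis 1."""
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(magnitude + 1e-10)  # log compression, as in a spectrogram image
```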
And the voice signal passing through the spectrogram processing module 4 passes through the convolutional neural network processing module 5 to extract high-level frequency characteristics of the voice signal.
The Convolutional Neural Network (CNN) is a feedforward neural network that introduces local receptive fields, convolution, pooling and similar mechanisms into the neural network. The convolutional neural network is composed of convolution layers, fully connected layers, pooling layers and the like.
The convolution layer extracts different features from the input two-dimensional time-frequency map through convolution operations. Shallow convolution layers can only extract low-level features such as edge, contour and horizontal information, while more convolution layers can be designed to iteratively extract more complex features from the low-level ones. T two-dimensional colour pictures with height H, width W and three channels can be represented as T × H × W. T filters perform convolution on the corresponding feature matrices, and the outputs at corresponding positions are summed to obtain one output matrix; K groups of filters therefore yield K output matrices. Denoting the k-th output matrix by O_k, it can be described by the following formula:

O_k = Σ_{t=1}^{T} ω_{k,t} * X_t + b_k

where * is the convolution operation, ω represents the parameters of the convolution filters, X_t is the t-th input feature matrix, and b_k is the offset (bias) of the k-th output matrix.
The pooling layer mainly performs downsampling (subsampling) on the feature maps learned by the convolution layers. Its main functions are to reduce the input dimensionality of subsequent network layers, reduce the size of the model and improve the robustness of the model. By computing aggregation statistics over features at different positions of the image, it reduces the number of model parameters while retaining the effective information of the image.
The convolutional neural network is structured with 2 convolution layers and 2 maximum pooling layers: the convolution kernel size is 3 × 3 and the maximum pooling size is 2 × 2; the first convolution layer contains 32 convolution filters and the second convolution layer contains 64 convolution filters, both with 3 × 3 kernels; the first maximum pooling layer has 32 channels and the second has 64. High-level frequency features in the voice signal are extracted through this convolutional neural network processing.
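One possible realization of this architecture, sketched with tf.keras; the input shape, activations and number of emotion classes are assumptions that the patent does not specify.

```python
import tensorflow as tf

def build_cnn(input_shape=(128, 128, 1), num_emotions=6):
    """Two 3x3 convolution layers (32 then 64 filters), each followed by
    2x2 max pooling, with a softmax classification head."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_emotions, activation="softmax"),
    ])
```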
In this embodiment, after the high-level frequency features of the speech signal are extracted through the convolutional neural network processing, the prosodic features of the speech signal are extracted through the feature extraction module 6.
In this embodiment, the prosodic features of the speech signal may be extracted with the pyAudioAnalysis tool, which can extract features such as the fundamental frequency, voicing probability and loudness curve of the speech signal; the extracted prosodic features of the voice signal and the frequency features from the spectrogram are then transformed to obtain voice features of the same dimensionality.
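A hedged sketch of this step, assuming a recent pyAudioAnalysis release that exposes ShortTermFeatures.feature_extraction (older releases used audioFeatureExtraction.stFeatureExtraction instead); the call returns a general short-term feature matrix rather than exactly the fundamental frequency, voicing probability and loudness curve named above, and the random projection stands in for whatever learned transformation aligns the prosodic features with the spectrogram features.

```python
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

# Read the audio and extract short-term features (50 ms window, 25 ms step assumed).
fs, x = audioBasicIO.read_audio_file("speech_sample.wav")  # hypothetical file name
x = audioBasicIO.stereo_to_mono(x)
features, feature_names = ShortTermFeatures.feature_extraction(
    x, fs, int(0.050 * fs), int(0.025 * fs))  # shape: (num_features, num_frames)

# Placeholder projection to bring the prosodic features to the same
# dimensionality as the CNN frequency features before fusing them.
target_dim = 64
rng = np.random.default_rng(0)
projection = rng.standard_normal((features.shape[0], target_dim))
aligned = features.T @ projection  # shape: (num_frames, target_dim)
```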
In this embodiment, the speech signal is input to the speech emotion classification module 7 through the feature extraction module 6 to perform speech emotion recognition. The speech emotion classification module 7 is used for performing emotion recognition and classification on the speech signals through the prosodic features of the speech signals;
and transmitting the voice feature vectors extracted by the feature extraction module 6 into softmax for emotion category prediction. The probability distribution of the predicted values is as follows:
Figure BDA0002393560020000111
where e is the output of the full link layer,
Figure BDA0002393560020000112
and the deviation b is a parameter value for model learning. Wherein, softhe formula for tmax is:
Figure BDA0002393560020000113
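A small NumPy illustration of this prediction step; the feature vector, weights and number of emotion classes are made up for the example.

```python
import numpy as np

def softmax(z):
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j), computed in a numerically stable way."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

e_vec = np.array([0.8, -0.3, 1.5, 0.1])   # assumed output e of the fully connected layer
W = np.full((6, 4), 0.1)                  # assumed learned weights for 6 emotion classes
b = np.zeros(6)                           # assumed learned bias
probabilities = softmax(W @ e_vec + b)    # predicted emotion probability distribution
```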
because the emotion information is weak information contained in the voice, the recognition model is required to have the strong feature learning capability, the learning capability of the traditional method is very limited, and the deep learning-based method disclosed by the invention shows considerable classification and feature learning capabilities by virtue of the strong feature learning capability. The convolutional neural network is used for extracting the frequency characteristics contained in the spectrogram, and the characteristics of each frequency segment in the voice are extracted through the strong learning ability, so that the accuracy of voice emotion classification is improved.
Example 2
Referring to fig. 2, the difference from the above embodiment is that in this embodiment, a method for multi-modal speech emotion recognition based on convolutional neural network includes the following steps:
1) extracting the voice signal from the voice file (S1);
2) framing the voice signal (S2);
3) converting the voice signal into a frequency domain signal (S3);
4) processing the frequency domain signal through a spectrogram (S4);
5) processing the spectrogram-processed voice signal through a convolutional neural network (S5);
6) performing voice feature extraction on the voice signal (S6);
7) performing emotion recognition and analysis on the voice signal (S7).
In this embodiment, preferably, the voice file includes a voice file whose file format has the .wav suffix.
The step 1) of extracting the voice signal from the voice file may be realized by converting the voice analog signal into a digital signal after the voice signal is read by the computer system, and then performing a subsequent analysis operation.
The step 2) of framing the voice signal is to perform framing and windowing on the waveform file after the waveform file of the voice signal is read. Preferably, the framing is performed by a hamming window framing processing module 21.
The Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
Step 3) converts the time domain signal to the frequency domain signal through the Fourier transform of the framed and windowed voice file; the Fourier transform formula is:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
And 4) processing the frequency domain signal through a spectrogram, namely reflecting the frequency change of each frame of signal of the voice signal and the energy fluctuation on the frame of signal through the spectrogram.
And 5) processing the voice signal processed by the spectrogram through a convolutional neural network, namely extracting high-level frequency characteristics of the voice signal through the convolutional neural network.
And 6) further extracting the voice characteristics of the voice signal according to the extracted high-level frequency characteristics of the voice signal, namely actually extracting the prosodic characteristics of the voice signal.
The prosodic features of the voice signal may be extracted with the pyAudioAnalysis tool, yielding features such as the fundamental frequency, voicing probability and loudness curve of the voice signal; the extracted prosodic features of the voice signal and the frequency features from the spectrogram are then transformed to obtain voice features of the same dimensionality.
The step 7) is used for carrying out emotion recognition and analysis on the voice signals and carrying out emotion recognition and classification on the voice signals through voice signal prosody characteristics;
and transmitting the voice feature vectors extracted by the feature extraction module 6 into softmax for emotion category prediction. The probability distribution of the predicted values is as follows:
Figure BDA0002393560020000141
where e is the output of the full link layer,
Figure BDA0002393560020000142
and the deviation b is a parameter value for model learning. Wherein, the formula of softmax is:
Figure BDA0002393560020000143
the invention fully combines the characteristics of human emotion multimode and identifies the emotional characteristics of the speaker through the voice characteristics. The method comprises the steps of extracting voice features, normalizing a two-dimensional spectrogram by using a model (CNN + BilSTM) of a mixed neural network, inputting the normalized two-dimensional spectrogram into the CNN model to extract high-dimensional features, and extracting context information contained in a voice fragment by using the BilSTM network based on a time sequence. The method has the advantages of accurate and efficient emotion feature identification, and can be applied to the wider artificial intelligence field.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A multi-mode speech emotion recognition system based on a convolutional neural network is characterized in that: the system comprises a voice extraction module, a framing processing module, a frequency domain signal processing module, a spectrogram processing module, a convolutional neural network processing module, a feature extraction module and a voice emotion classification module; the voice extraction module is used for extracting a voice file, the framing processing module is used for performing framing and windowing processing on the voice file, and the frequency domain signal processing module is used for converting a voice time domain signal into a frequency domain signal; the spectrogram processing module is used for detecting the frequency change of a voice signal; the convolutional neural network processing module is used for extracting high-level frequency characteristics of the voice signal; the feature extraction module is used for extracting prosodic characteristics of the voice signal; the voice emotion classification module is used for carrying out emotion recognition and classification on the voice signal through the voice signal prosodic features;
the voice extracting module extracts a voice file and then inputs a voice signal to the frequency domain processing module through the framing processing module, the frequency domain processing module converts the voice signal into a frequency domain signal and then inputs the voice signal to the convolutional neural network processing module through the spectrogram processing module, and the convolutional neural network processing module extracts high-level frequency characteristics of the voice signal and then inputs the voice signal to the voice emotion classifying module through the characteristic extracting module to perform voice emotion recognition.
2. The convolutional neural network based multimodal speech emotion recognition system of claim 1, wherein: the voice file includes a voice file whose file format has the .wav suffix.
3. The convolutional neural network based multimodal speech emotion recognition system of claim 1, wherein: the framing processing module comprises a Hamming window framing processing module.
4. The convolutional neural network based multimodal speech emotion recognition system of claim 3, wherein: the Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
5. The convolutional neural network based multimodal speech emotion recognition system of claim 1, wherein: the conversion of the voice time domain signal into the frequency domain signal is performed through the Fourier transform formula:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
6. A multi-mode speech emotion recognition method based on a convolutional neural network is characterized by comprising the following steps: the method comprises the following steps:
1) extracting voice signals of the voice file;
2) performing framing processing on the voice signals;
3) converting the voice signal into a frequency domain signal;
4) processing the frequency domain signal through a spectrogram;
5) processing the voice signal processed by the spectrogram through a convolutional neural network;
6) carrying out voice feature extraction on the voice signal;
7) performing emotion recognition and analysis on the voice signal.
7. The method for multi-modal emotion speech recognition based on convolutional neural network as claimed in claim 6, wherein: the voice file includes a voice file whose file format has the .wav suffix.
8. The method for multi-modal emotion speech recognition based on convolutional neural network as claimed in claim 6, wherein: the framing processing module comprises a Hamming window framing processing module.
9. The method for multi-modal emotion speech recognition based on convolutional neural network as claimed in claim 8, wherein: the Hamming window framing processing module obtains its result through the formula

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

where N represents the frame length, w(n) is the expression of the Hamming window in digital speech signal processing, and n represents the position within the current frame.
10. The method for multi-modal emotion speech recognition based on convolutional neural network as claimed in claim 6, wherein: the conversion of the voice signal into a frequency domain signal is performed through the Fourier transform formula:

X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m)·w(n−m)·e^{−jωm}

where X_n(e^{jω}) combines the discrete-time Fourier transform X(e^{jω}) of x(m) with the discrete-time Fourier transform W(e^{jω}) of w(m); the window function w(n−m) represents a sliding window that slides along the sequence x(m) as n varies; e^{−jωm} represents a set of orthogonal bases, obtained by transformation from Euler's formula cos θ + i·sin θ = e^{iθ}, where ω is the angular frequency; and m is the position of the current frame.
CN202010122988.4A 2020-02-27 2020-02-27 Multi-mode speech emotion recognition system and method based on convolutional neural network Pending CN111326178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010122988.4A CN111326178A (en) 2020-02-27 2020-02-27 Multi-mode speech emotion recognition system and method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010122988.4A CN111326178A (en) 2020-02-27 2020-02-27 Multi-mode speech emotion recognition system and method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN111326178A true CN111326178A (en) 2020-06-23

Family

ID=71168243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010122988.4A Pending CN111326178A (en) 2020-02-27 2020-02-27 Multi-mode speech emotion recognition system and method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN111326178A (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN106887225A (en) * 2017-03-21 2017-06-23 百度在线网络技术(北京)有限公司 Acoustic feature extracting method, device and terminal device based on convolutional neural networks
CN108427670A (en) * 2018-04-08 2018-08-21 重庆邮电大学 A kind of sentiment analysis method based on context word vector sum deep learning
CN108899049A (en) * 2018-05-31 2018-11-27 中国地质大学(武汉) A kind of speech-emotion recognition method and system based on convolutional neural networks
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110059188A (en) * 2019-04-11 2019-07-26 四川黑马数码科技有限公司 A kind of Chinese sentiment analysis method based on two-way time convolutional network
CN110175641A (en) * 2019-05-22 2019-08-27 中国科学院苏州纳米技术与纳米仿生研究所 Image-recognizing method, device, equipment and storage medium
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech
CN110502753A (en) * 2019-08-23 2019-11-26 昆明理工大学 A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN110806736A (en) * 2019-11-19 2020-02-18 北京工业大学 Method for detecting quality information of forge pieces of die forging forming intelligent manufacturing production line

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11495216B2 (en) 2020-09-09 2022-11-08 International Business Machines Corporation Speech recognition using data analysis and dilation of interlaced audio input
US11538464B2 (en) 2020-09-09 2022-12-27 International Business Machines Corporation . Speech recognition using data analysis and dilation of speech content from separated audio input
CN112669874A (en) * 2020-12-16 2021-04-16 西安电子科技大学 Voice feature extraction method based on quantum Fourier transform
CN112669874B (en) * 2020-12-16 2023-08-15 西安电子科技大学 Speech feature extraction method based on quantum Fourier transform
CN112784730A (en) * 2021-01-20 2021-05-11 东南大学 Multi-modal emotion recognition method based on time domain convolutional network
CN112784730B (en) * 2021-01-20 2022-03-29 东南大学 Multi-modal emotion recognition method based on time domain convolutional network
CN112562657A (en) * 2021-02-23 2021-03-26 成都启英泰伦科技有限公司 Personalized language offline learning method based on deep neural network
CN113283331A (en) * 2021-05-20 2021-08-20 长沙融创智胜电子科技有限公司 Multi-class target identification method and system for unattended sensor system
CN113283331B (en) * 2021-05-20 2023-11-14 长沙融创智胜电子科技有限公司 Multi-class target identification method and system for unattended sensor system
CN113438368A (en) * 2021-06-22 2021-09-24 上海翰声信息技术有限公司 Method, device and computer readable storage medium for realizing ring back tone detection
CN113438368B (en) * 2021-06-22 2023-01-24 上海翰声信息技术有限公司 Method, device and computer readable storage medium for realizing ring back tone detection

Similar Documents

Publication Publication Date Title
CN111326178A (en) Multi-mode speech emotion recognition system and method based on convolutional neural network
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN109979436B (en) BP neural network voice recognition system and method based on spectrum self-adaption method
Patni et al. Speech emotion recognition using MFCC, GFCC, chromagram and RMSE features
Alghifari et al. On the use of voice activity detection in speech emotion recognition
US20230298616A1 (en) System and Method For Identifying Sentiment (Emotions) In A Speech Audio Input with Haptic Output
CN115881164A (en) Voice emotion recognition method and system
Hamsa et al. An enhanced emotion recognition algorithm using pitch correlogram, deep sparse matrix representation and random forest classifier
Hamsa et al. Speaker identification from emotional and noisy speech using learned voice segregation and speech VGG
Jie Speech emotion recognition based on convolutional neural network
CN116013371A (en) Neurodegenerative disease monitoring method, system, device and storage medium
Hu et al. Speech emotion recognition based on attention mcnn combined with gender information
CN116543797A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
Shin et al. Speaker-invariant psychological stress detection using attention-based network
CN116230020A (en) Speech emotion recognition and classification method
CN114881668A (en) Multi-mode-based deception detection method
CN111883178B (en) Double-channel voice-to-image-based emotion recognition method
Yousfi et al. Isolated Iqlab checking rules based on speech recognition system
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
Lashkari et al. NMF-based cepstral features for speech emotion recognition
Tomar et al. CNN-MFCC model for speaker recognition using emotive speech
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200623)