CN111785301B - Residual error network-based 3DACRNN speech emotion recognition method and storage medium - Google Patents

Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Info

Publication number
CN111785301B
CN111785301B (application CN202010597012.2A)
Authority
CN
China
Prior art keywords
spectrogram
residual error
model
speech
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010597012.2A
Other languages
Chinese (zh)
Other versions
CN111785301A (en)
Inventor
胡章芳
唐珊珊
罗元
张昊
诸海渝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202010597012.2A priority Critical patent/CN111785301B/en
Publication of CN111785301A publication Critical patent/CN111785301A/en
Application granted granted Critical
Publication of CN111785301B publication Critical patent/CN111785301B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention claims a 3DACRNN speech emotion recognition method based on a residual error network, and a storage medium, wherein the method comprises the following steps: S1, converting the speech signal into a spectrogram and stacking several consecutive frames to form three-dimensional data used as the input of Res3DCNN; S2, extracting short-term spatio-temporal features of emotional speech from the spectrogram with Res3DCNN, and using the residual error network to compensate for the features lost by a traditional CNN during convolution; S3, extracting the long-term dependencies of the spatio-temporal features with an ARNN to alleviate the weak spatio-temporal correlation, and, to reduce the computational complexity, improving the traditional LSTM with a novel rear forgetting gate structure; S4, updating the model parameters during training to minimize the loss, and optimizing the model through continuous iteration; and S5, finally performing emotion classification with a Softmax layer. The method effectively alleviates the severe loss of original features and the weak spatio-temporal correlation, and improves the recognition accuracy.

Description

Residual error network-based 3DACRNN speech emotion recognition method and storage medium
Technical Field
The invention belongs to the field of speech signal processing and pattern recognition, and particularly relates to a residual error network-based 3DACRNN speech emotion recognition method.
Background
The continuous development of artificial intelligence has brought humans and computers increasingly close together. Affective computing is an important research field, and emotional interaction is of great significance in human-computer interaction. Since speech is a direct medium of human information exchange, speech emotion recognition (SER) is the most representative and most widely applied of the emotion recognition techniques. A key step in emotion recognition is extracting from the speech signal a feature set capable of representing human emotion, and so far no systematic feature set exists.
Many previous studies extracted low-level descriptors (LLDs) directly from speech and then classified emotion with traditional machine learning methods. However, because of factors such as context and the different ways in which emotions are expressed, the performance of feature sets selected from LLDs for SER is not particularly satisfactory. As image processing has become easy to implement, a new focus of SER research is converting the speech signal into a spectrogram and using it as the recognition object. This avoids the tedious manual feature-extraction process and reduces the workload of modeling and training; the spectrogram also reflects the energy characteristics of the speech signal and the texture features of rhythm changes, so many researchers have begun to study spectrogram-based speech emotion recognition and have achieved good results. Tarronika et al. extracted high-level emotional feature representations from the magnitude spectrum using a deep neural network (DNN) and showed better performance than traditional acoustic features. Han et al. proposed a DNN-ELM deep network model for SER, trained on the most energetic segments to extract effective emotional information.
In recent years, CNNs and RNNs have been widely used in the SER field: deep convolutional models preserve the spectral time-shift invariance of speech signals, and RNNs excel at processing temporal information, so both are often used to extract high-level features of emotional speech. Neumann et al. integrated representations learned by an unsupervised autoencoder into a CRNN emotion classifier, improving recognition accuracy. However, when a CNN learns features from a spectrogram it only fuses CNN features of single-frame images, so the relation between adjacent consecutive speech frames is often ignored; some studies therefore proposed three-dimensional convolution models for SER, which better capture the short-term spatio-temporal relationships of the feature representation. Peng et al. fed spectrogram information directly into a three-dimensional CRNN, using the convolutional layers to extract high-level representations and the recurrent layers to extract long-term dependencies for emotion recognition. To address the interference of silent and emotion-irrelevant frames, Chen et al. proposed an attention-based 3D convolutional recurrent neural network (ACRNN) model for learning discriminative SER features; the attention mechanism effectively reduces the influence of redundant information such as silent frames. However, as the number of CNN convolutional layers increases, the original features are gradually lost and the number of parameters to be trained grows, resulting in a very large amount of computation. To address the large computational cost, the invention proposes a rear forgetting gate structure to replace the traditional LSTM forgetting gate, reducing computation by reducing parameters.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. The residual error network-based 3DACRNN speech emotion recognition method and storage medium can achieve a higher recognition rate, compensate for lost features, and reduce the amount of computation. The technical scheme of the invention is as follows:
A 3DACRNN speech emotion recognition method based on a residual error network comprises the following steps:
S1, preprocessing the speech signal, including pre-emphasis, framing, and windowing;
S2, converting the speech signal processed in step S1 into a two-dimensional spectrogram, and stacking the spectrograms of several consecutive frames to obtain three-dimensional spectrogram data;
S3, extracting short-term spatio-temporal features of emotional speech from the three-dimensional spectrogram with the residual-network-based three-dimensional convolutional neural network Res3DCNN, and using the residual error network to compensate for the features lost by a traditional convolutional neural network (CNN) during convolution, which effectively alleviates gradient vanishing or explosion;
S4, taking the output of Res3DCNN as the input of an attention-based recurrent neural network (ARNN) model. RNN refers to the recurrent neural network, which performs well on time-series signals, and LSTM is one type of RNN; because redundant information exists, an attention mechanism is added to reduce the weight of useless information, speed up training, extract the long-term dependencies of the spatio-temporal features, and alleviate the weak spatio-temporal correlation. The traditional LSTM forgetting gate is improved with a rear forgetting gate structure; the LSTM contains three gate structures, namely a forgetting gate, an input gate, and an output gate, and to address the large amount of computation the forgetting gate of the traditional long short-term memory (LSTM, a special RNN structure) network is improved;
S5, performing 10-fold cross-validation on the trained model with a validation set, taking the cross entropy as the loss function, and optimizing the model parameters with the RMSProp algorithm;
and S6, verifying the trained model with the validation set and adjusting the hyper-parameters of the model to obtain the final network model, and finally performing speech emotion classification with a Softmax layer.
Further, the step S1 performs preprocessing including pre-emphasis and windowing framing on the speech signal according to its short-time stationarity, and includes the following specific steps:
step A1: using a first-order high-pass filter, i.e. a pre-emphasis filter, whose transfer function in the Z domain is H(z) = 1 - a·z^{-1}, where a is the pre-emphasis coefficient with value 0.95, z is the Z-domain variable, and H(z) is the transfer function; the pre-emphasized signal is x(t);
step A2: framing the pre-emphasized signal into x(m, n), where n is the frame length and m is the number of frames, and windowing with a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
x(m, n) represents the framed speech signal and w(n) the window function of the Hamming window; the windowed, framed speech signal is s_w(m, n) = x(m, n) * w(n), where each frame contains N sample points.
Further, step S2 is to convert the processed speech signal into a two-dimensional spectrogram, and process the two-dimensional spectrogram into three-dimensional data by stacking a plurality of spectrograms of consecutive frames, and the processing steps are as follows:
step B1: transforming the signal processed in step A2 from the time domain to the frequency domain by a fast Fourier transform (FFT) to obtain X(m, n);
step B2: making the periodogram Y(m, n) = X(m, n) · X(m, n)', where X(m, n)' denotes the complex conjugate of X(m, n); taking 10·log10 Y(m, n), scaling m to time and n to frequency, and drawing a two-dimensional spectrogram from (m, n, 10·log10 Y(m, n));
step B3: stacking the spectrograms of a plurality of consecutive frames into a cube and then performing the convolution operation between the cube and a 3D convolution kernel, wherein the input data are set to Time × Frequency × C, Time and Frequency respectively represent the horizontal (time) and vertical (frequency) axes of the spectrogram, and C represents the number of spectrograms.
Further, in step S3, the designed Res3DCNN is used to extract short-term spatio-temporal features of emotion speech from the three-dimensional spectrogram, and the residual error formula is:
F(x)=y-x
wherein x is the input, y is the output, and F(x) represents the residual; the dimensions of x and F(x) must be consistent during the calculation, and if they are not consistent the calculation is carried out by the following algorithm:
y = w_k * x + F(x)
wherein w_k represents a weight matrix that adjusts the dimension of the input x to be consistent with F(x); the designed Res3DCNN model is composed of four residual blocks, each residual block comprising 4 convolutional layers and 1 pooling layer; the convolution kernel size of the first layer is 1 × 1 × 1, the kernel sizes of the other three convolutional layers are 3 × 3 × 3, the pooling size is 2 × 2 × 1, and the stride is 1 × 1 × 1; a batch normalization layer BN and a ReLU activation function layer are added after each convolutional layer;
BN normalizes activation of a deep neural network middle layer, and the key of the algorithm is that two learnable parameters gamma and beta are introduced:
y^(k) = γ^(k)·x̂^(k) + β^(k)
where x̂^(k) represents the variable to be fed into the activation function and k the index of the activation function. Within one batch, BN operates on each feature; with m training samples and j dimensions (j neuron nodes), the j-th dimension is normalized as:
μ_j = (1/m) Σ_{i=1..m} x_i^(j)
σ_j² = (1/m) Σ_{i=1..m} (x_i^(j) - μ_j)²
x̂_i^(j) = (x_i^(j) - μ_j) / √(σ_j² + ε)
where x_i^(j) is the result of the linear (pre-activation) calculation for the i-th sample in the j-th dimension, μ_j represents the mean of each mini-batch of training data, σ_j² represents the variance of each mini-batch of training data, x̂_i^(j) represents the normalized result of the batch of training data, and ε prevents the variance from being 0;
the formula for ReLU is as follows:
f(x) = max(0, x)
further, in step S4, the output of Res3DCNN is used as the input of the ARNN model, the long-term dependency relationship of these spatio-temporal features is extracted, the conventional LSTM unit is composed of three gate structures, which are a forgetting gate, input gates, and an output gate, the forgetting gate is used to determine which information should be discarded in the unit state at the previous time, and directly participates in updating the unit state, the updating algorithm of the unit state is related to the hidden layer output at the previous time and the input at the current time, and the unit state at the previous time is used as a parameter for updating the current state;
Forgetting gate algorithm: f_t = σ(W_f × [h_{t-1}, x_t] + b_f)
The unit state updating algorithm: i_t = σ(W_i × [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C × [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
Wherein C_{t-1} and h_{t-1} are respectively the cell state and hidden-layer output at the previous moment, f_t denotes the output of the forgetting gate, i_t denotes the input-gate activation, x_t is the input at the current moment, C̃_t is the candidate value to be added to the memory cell, W_f, W_i and W_C are respectively the weights of the forgetting gate, the input gate and the candidate cell obtained by training, b_f, b_i and b_C are their biases, and i_t determines how much of C̃_t is written into the cell state; σ represents the logistic sigmoid function:
σ(x) = 1 / (1 + e^{-x})
aiming at the problem of large calculation amount, the forgetting gate of the traditional LSTM is improved, and a novel rear forgetting gate structure is provided, wherein the algorithm is as follows:
f_t = σ(W_f × C_{t-1} + b_f)
The modified W_f has a smaller dimension because x_t and h_{t-1} are not used in the formula, and the modified gate structure is called the rear forgetting gate.
Further, the improved ARNN model of step S4 sets a BLSTM with 512 bidirectional hidden units, creates a new sequence of shape L × 1024, feeds it into the attention layer, and finally generates a new sequence h.
Further, the step S5 is to train the model with a training set, use cross entropy as a loss function, and optimize an objective function with the RMSProp algorithm, which specifically includes:
the cross entropy algorithm is defined as follows:
C = -Σ_j ŷ_j · ln(y_j)
where ŷ_j is the true label of the j-th sample, y_j is the predicted output of the network model for the j-th sample, and C represents the loss value.
The RMSprop algorithm is defined as follows:
r ← w·r + (1 - w)·g²
θ ← θ - α·g / (√r + ε)
where g denotes the gradient of the loss with respect to the parameters θ, r: sliding average of the squared gradient, w: attenuation rate, α: learning rate, ε: constant term to prevent the denominator from being zero, η: a constant hyper-parameter.
Further, in step S6, performing emotion speech classification by using a Softmax layer, where the formula of the Softmax function is as follows:
S_i = e^{V_i} / Σ_j e^{V_j}
The formula represents the Softmax value of the i-th element V_i in the array; S_i represents the classification probability of the i-th element, and j represents the accumulation (summation) variable.
A storage medium, which is a computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform any of the methods described above.
The invention has the following advantages and beneficial effects:
In summary, by adopting the above technical scheme, the invention has the following beneficial effects: under the same experimental environment, the residual error network-based 3DACRNN speech emotion recognition method better alleviates the feature loss of a deep CNN during convolution and the weak spatio-temporal correlation, and thus extracts deeper features that represent speech emotion. The spectrogram is extracted from the preprocessed speech signal and combined into three-dimensional spectrogram data; short-term spatio-temporal features are extracted by the residual-network-based three-dimensional convolutional neural network, and the long-term dependencies of these features are extracted by the attention-based recurrent neural network. Because the three-dimensional convolutional neural network is complex and computationally expensive, the LSTM network is improved: a new forgetting gate structure replaces the traditional one, which greatly reduces the amount of computation and speeds up model training and testing. In short, the residual error network-based 3DACRNN speech emotion recognition method considerably improves the performance of a speech emotion recognition system.
Drawings
FIG. 1 is a general block diagram of the residual network-based 3DACRNN speech emotion recognition method according to the preferred embodiment of the present invention;
FIG. 2 is a spectrogram extraction process;
FIG. 3 is a diagram of a residual block in a convolutional neural network;
FIG. 4 is a diagram of the traditional LSTM and the improved LSTM network structures.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme of the invention for solving the above technical problems is as follows:
As shown in FIG. 1, the present invention provides a residual error network-based 3DACRNN speech emotion recognition method, comprising the following steps:
S1: preprocess the speech signal by pre-emphasis, framing, and windowing, as follows:
step A1: use a first-order high-pass filter, i.e. a pre-emphasis filter, with transfer function H(z) = 1 - a·z^{-1}, where a is the pre-emphasis coefficient, set to 0.95 in the invention; the pre-emphasized signal is x(t);
step A2: divide the pre-emphasized signal into frames x(m, n) (n is the frame length and m is the number of frames), and apply a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
x(m, n) represents the framed speech signal and w(n) the window function of the Hamming window; the windowed, framed speech signal is s_w(m, n) = x(m, n) * w(n), where each frame contains N sample points.
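For illustration, a minimal NumPy sketch of this preprocessing step is given below. The pre-emphasis coefficient a = 0.95 follows step A1, while the frame length of 400 samples and the 50% frame overlap are assumed values chosen only for the example.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=200, a=0.95):
    """Pre-emphasis, framing and Hamming windowing (sketch; frame_len/hop are assumed)."""
    # Step A1: pre-emphasis x(t) = s(t) - a * s(t-1), with a = 0.95
    x = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Step A2: split into overlapping frames x(m, n)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[m * hop: m * hop + frame_len] for m in range(num_frames)])
    # Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), applied to every frame
    w = np.hamming(frame_len)
    return frames * w  # s_w(m, n) = x(m, n) * w(n)
```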
S2: converting the processed voice signal into a two-dimensional spectrogram (the spectrogram extraction process is shown in fig. 2), and processing the two-dimensional spectrogram into three-dimensional data by a method of stacking a plurality of continuous frames of spectrograms, wherein the processing steps are as follows:
step B1: transform the signal processed by A2 from the time domain to the frequency domain by a fast Fourier transform (FFT) to obtain X(m, n);
step B2: make the periodogram Y(m, n) = X(m, n) · X(m, n)', where X(m, n)' denotes the complex conjugate of X(m, n); take 10·log10 Y(m, n), scale m to time and n to frequency, and draw the two-dimensional spectrogram from (m, n, 10·log10 Y(m, n)).
step B3: stack the spectrograms of a plurality of consecutive frames into a cube and then perform the convolution operation between the cube and the 3D convolution kernel; the input data of the 3D convolution must be three-dimensional and are set to Time × Frequency × C, where Time and Frequency respectively represent the horizontal (time) and vertical (frequency) axes of the spectrogram and C represents the number of spectrograms.
S3: the Res3DCNN designed by the invention is used for extracting short-term space-time characteristics of emotional voice from the three-dimensional spectrogram. The schematic diagram of the residual error network is shown in fig. 3, and its formula is:
F(x) = y - x
where x is the input, y is the output, and F(x) represents the residual. The dimensions of x and F(x) must be consistent during the calculation; if they are not, the calculation is carried out by the following algorithm:
y = w_k * x + F(x)
w_k represents a weight matrix that adjusts the dimension of the input x to be consistent with F(x). The designed Res3DCNN model consists of four residual blocks, each containing 4 convolutional layers and 1 pooling layer. The convolution kernel size of the first layer is 1 × 1 × 1, the kernel sizes of the remaining three convolutional layers are 3 × 3 × 3, the pooling size is 2 × 2 × 1, and the stride is 1 × 1 × 1. After each convolutional layer, a batch normalization (BN) layer and a ReLU activation function layer are added.
The BN normalizes activation of a deep neural network middle layer, and the key of the algorithm is that two learnable parameters gamma and beta are introduced:
y^(k) = γ^(k)·x̂^(k) + β^(k)
where x̂^(k) represents the variable to be fed into the activation function and k the index of the activation function. Within one batch, BN operates on each feature; with m training samples and j dimensions (j neuron nodes), the j-th dimension is normalized as:
μ_j = (1/m) Σ_{i=1..m} x_i^(j)
σ_j² = (1/m) Σ_{i=1..m} (x_i^(j) - μ_j)²
x̂_i^(j) = (x_i^(j) - μ_j) / √(σ_j² + ε)
where x_i^(j) is the result of the linear (pre-activation) calculation for the i-th sample in the j-th dimension of the layer, μ_j represents the mean of each mini-batch of training data, σ_j² represents the variance of each mini-batch of training data, x̂_i^(j) represents the normalized result of the batch of training data, and ε prevents the variance from being 0.
The formula for ReLU is as follows:
f(x) = max(0, x)
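For illustration, a PyTorch sketch of one residual block with the layer sizes described above (a 1 × 1 × 1 convolution followed by three 3 × 3 × 3 convolutions, BN and ReLU after every convolution, a 2 × 2 × 1 pooling with stride 1, and a 1 × 1 × 1 projection w_k on the shortcut when dimensions differ) is given below; the channel counts and the use of max-pooling are assumptions made only for the example.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """One Res3DCNN residual block (sketch; in_ch/out_ch are assumed values)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        layers, ch = [], in_ch
        for k in (1, 3, 3, 3):                          # kernel sizes 1x1x1, then three 3x3x3
            layers += [nn.Conv3d(ch, out_ch, kernel_size=k, padding=k // 2, stride=1),
                       nn.BatchNorm3d(out_ch),          # BN after every convolution layer
                       nn.ReLU(inplace=True)]           # ReLU activation layer
            ch = out_ch
        self.body = nn.Sequential(*layers)              # F(x)
        # Shortcut w_k * x adjusts the input dimension when it differs from F(x)
        self.shortcut = (nn.Conv3d(in_ch, out_ch, kernel_size=1)
                         if in_ch != out_ch else nn.Identity())
        self.pool = nn.MaxPool3d(kernel_size=(2, 2, 1), stride=1)   # 2x2x1 pooling, stride 1

    def forward(self, x):
        y = self.body(x) + self.shortcut(x)             # y = F(x) + w_k * x
        return self.pool(y)

# A Res3DCNN could then stack four such blocks, e.g. (assumed channel counts):
# net = nn.Sequential(ResidualBlock3D(3, 32), ResidualBlock3D(32, 64),
#                     ResidualBlock3D(64, 128), ResidualBlock3D(128, 256))
```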
S4: the output of Res3DCNN is used as the input of the ARNN model, and the long-term dependencies of the spatio-temporal features are extracted. The traditional LSTM unit consists of three gate structures, namely a forgetting gate, an input gate, and an output gate. The forgetting gate determines which information in the unit state at the previous moment should be discarded and directly participates in updating the unit state; the updating algorithm of the unit state is related to the hidden-layer output at the previous moment and the input at the current moment, and the unit state at the previous moment is used as a parameter for updating the current state.
Forgetting gate algorithm: f_t = σ(W_f × [h_{t-1}, x_t] + b_f)
The unit state updating algorithm: i_t = σ(W_i × [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C × [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
Wherein C_{t-1} and h_{t-1} are respectively the cell state and hidden-layer output at the previous moment, f_t denotes the output of the forgetting gate, i_t denotes the input-gate activation, x_t is the input at the current moment, C̃_t is the candidate value to be added to the memory cell, W_f, W_i and W_C are respectively the weights of the forgetting gate, the input gate and the candidate cell obtained from training, b_f, b_i and b_C are their biases, and i_t determines how much of C̃_t is written into the cell state; σ represents the logistic sigmoid function:
σ(x) = 1 / (1 + e^{-x})
aiming at the problem of large calculation amount, the forgetting gate of the traditional LSTM is improved, a novel rear forgetting gate structure is provided, and the algorithm is as follows:
f_t = σ(W_f × C_{t-1} + b_f)
The modified W_f has a smaller dimension because x_t and h_{t-1} no longer participate in the calculation, which reduces the number of parameters to be trained and the amount of computation. The invention refers to the modified gate structure as the rear forgetting gate; the traditional LSTM and the improved LSTM network structures are shown in FIG. 4.
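A hedged PyTorch sketch of a single recurrent step with the rear forgetting gate is given below: f_t depends only on the previous cell state C_{t-1}, so W_f maps hidden_size to hidden_size instead of (input_size + hidden_size) to hidden_size. The output gate follows the standard LSTM form, which the patent does not restate, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class RearForgetGateLSTMCell(nn.Module):
    """LSTM cell whose forgetting gate uses only C_{t-1} (rear forgetting gate) - sketch."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_f = nn.Linear(hidden_size, hidden_size)               # f_t = sigma(W_f * C_{t-1} + b_f)
        self.W_i = nn.Linear(input_size + hidden_size, hidden_size)  # input gate
        self.W_C = nn.Linear(input_size + hidden_size, hidden_size)  # candidate value
        self.W_o = nn.Linear(input_size + hidden_size, hidden_size)  # standard output gate (assumed)

    def forward(self, x_t, h_prev, C_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)            # [h_{t-1}, x_t]
        f_t = torch.sigmoid(self.W_f(C_prev))            # rear forgetting gate
        i_t = torch.sigmoid(self.W_i(hx))                # input-gate activation i_t
        C_tilde = torch.tanh(self.W_C(hx))               # candidate cell value
        C_t = f_t * C_prev + i_t * C_tilde               # cell state update
        o_t = torch.sigmoid(self.W_o(hx))
        h_t = o_t * torch.tanh(C_t)
        return h_t, C_t
```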
For the improved ARNN model, the invention sets a BLSTM with 512 bidirectional hidden units, creates a new sequence of shape L × 1024, feeds it into the attention layer, and finally generates a new sequence h.
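The sketch below illustrates this ARNN stage under stated assumptions: a bidirectional LSTM with 512 units per direction yields an L × 1024 sequence, and a simple soft-attention layer re-weights it to produce the sequence h. The exact attention formulation is an assumption, since the patent does not spell it out, and the sketch uses the standard PyTorch LSTM rather than the rear-forgetting-gate cell sketched above.

```python
import torch
import torch.nn as nn

class ARNN(nn.Module):
    """BLSTM (512 units per direction -> L x 1024) followed by soft attention (sketch)."""
    def __init__(self, input_size, hidden_size=512):
        super().__init__()
        self.blstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_size, 1)         # one attention score per time step

    def forward(self, x):                                 # x: (batch, L, input_size)
        seq, _ = self.blstm(x)                            # (batch, L, 1024)
        alpha = torch.softmax(self.attn(seq), dim=1)      # attention weights over the L steps
        h = alpha * seq                                   # re-weighted sequence h
        return h, h.sum(dim=1)                            # sequence and pooled utterance vector
```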
Step S5, training the model by using the training set, adopting the cross entropy as the loss function, and optimizing the objective function by using the RMSProp algorithm.
The cross entropy algorithm is defined as follows:
C = -Σ_j ŷ_j · ln(y_j)
where ŷ_j is the true label of the j-th sample, y_j is the predicted output of the network model for the j-th sample, and C represents the loss value.
The RMSprop algorithm is defined as follows:
r ← w·r + (1 - w)·g²
θ ← θ - α·g / (√r + ε)
where g denotes the gradient of the loss with respect to the parameters θ, r: sliding average of the squared gradient, w: attenuation rate, α: learning rate, ε: constant term to prevent the denominator from being zero, η: a constant hyper-parameter.
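A minimal training-loop sketch is given below, assuming a PyTorch model that outputs class logits; nn.CrossEntropyLoss combines the Softmax layer of step S6 with the cross-entropy loss, and the RMSprop hyper-parameters (learning rate, smoothing constant alpha, eps) are illustrative values rather than values fixed by the invention.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=1e-4, alpha=0.9, eps=1e-8):
    """Train with the cross-entropy loss and the RMSprop optimizer (sketch)."""
    criterion = nn.CrossEntropyLoss()          # cross entropy as the loss function
    optimizer = torch.optim.RMSprop(model.parameters(), lr=lr, alpha=alpha, eps=eps)
    for _ in range(epochs):
        for cubes, labels in loader:           # 3-D spectrogram cubes and emotion labels
            optimizer.zero_grad()
            loss = criterion(model(cubes), labels)   # C: loss value for the batch
            loss.backward()
            optimizer.step()                   # RMSprop parameter update
    return model
```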
Step S6, performing emotion speech classification using the Softmax layer, where the formula of the Softmax function is as follows:
S_i = e^{V_i} / Σ_j e^{V_j}
The formula represents the Softmax value of the i-th element V_i in the array; S_i represents the classification probability of the i-th element.
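For reference, a tiny NumPy illustration of the Softmax formula above; subtracting the maximum before exponentiation is a numerical-stability detail of the implementation, not part of the patent formula.

```python
import numpy as np

def softmax(v):
    """S_i = exp(V_i) / sum_j exp(V_j)."""
    e = np.exp(v - np.max(v))     # subtract the max for numerical stability
    return e / e.sum()

# Example: softmax(np.array([2.0, 1.0, 0.1])) is approximately [0.659, 0.242, 0.099]
```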
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (8)

1. A 3DACRNN speech emotion recognition method based on a residual error network, characterized by comprising the following steps:
S1, preprocessing the speech signal, including pre-emphasis, framing, and windowing;
S2, converting the speech signal processed in step S1 into a two-dimensional spectrogram, and stacking the spectrograms of several consecutive frames to obtain three-dimensional spectrogram data;
S3, extracting short-term spatio-temporal features of emotional speech from the three-dimensional spectrogram with the residual-network-based three-dimensional convolutional neural network Res3DCNN, and using the residual error network to compensate for the features lost by a traditional convolutional neural network CNN during convolution;
s4, taking the output of Res3DCNN as the input of an attention-based recurrent neural network (ARNN) model, wherein the Recurrent Neural Network (RNN) is LSTM; the forgetting gate of the LSTM adopts a rear forgetting gate, wherein the algorithm of the rear forgetting gate is as follows:
f_t = σ(W_f × C_{t-1} + b_f),
the unit state updating algorithm: i.e. i t =σ(W i ×[h t-1 ,x t ]+b i )
C̃_t = tanh(W_C × [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein C_{t-1} and h_{t-1} are respectively the cell state and hidden-layer output at the previous moment, f_t denotes the output of the forgetting gate, i_t denotes the input-gate activation, x_t is the input at the current moment, C̃_t is the candidate value to be added to the memory cell, W_f, W_i and W_C are respectively the weights of the forgetting gate, the input gate and the candidate cell obtained by training, b_f, b_i and b_C are their biases, and i_t determines how much of C̃_t is written into the cell state; σ represents the logistic sigmoid function:
σ(x) = 1 / (1 + e^{-x});
s5, performing 10-time cross validation on the trained model by using a validation set, taking cross entropy as a loss function, and optimizing model parameters by using a RMSProp algorithm;
and S6, verifying the trained model by using a verification set, adjusting the hyper-parameters of the RMSProp algorithm in the model to obtain a final network model, and finally performing speech emotion classification by using a Softmax layer.
2. The residual error network-based 3DACRNN speech emotion recognition method according to claim 1, wherein step S1 preprocesses the speech signal, which is short-time stationary, by pre-emphasis, framing, and windowing, with the following specific steps:
step A1: using a first-order high-pass filter, i.e. a pre-emphasis filter, whose transfer function in the Z domain is H(z) = 1 - a·z^{-1}, where a is the pre-emphasis coefficient with value 0.95, z is the Z-domain variable, and H(z) is the transfer function; the pre-emphasized signal is x(t);
step A2: framing the pre-emphasized signal into x(m, n), where n is the frame length and m is the number of frames, and windowing with a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
x(m, n) represents the framed speech signal and w(n) the window function of the Hamming window; the windowed, framed speech signal is s_w(m, n) = x(m, n) * w(n), where each frame contains N sample points.
3. The residual error network-based 3DACRNN speech emotion recognition method according to claim 2, wherein step S2 converts the processed speech signal into a two-dimensional spectrogram, which is processed into three-dimensional data by stacking the spectrograms of several consecutive frames, with the following processing steps:
step B1: transforming the signal processed in step A2 from the time domain to the frequency domain by a fast Fourier transform (FFT) to obtain X(m, n);
step B2: making the periodogram Y(m, n) = X(m, n) · X(m, n)', where X(m, n)' denotes the complex conjugate of X(m, n); taking 10·log10 Y(m, n), scaling m to time and n to frequency, and drawing the two-dimensional spectrogram from (m, n, 10·log10 Y(m, n));
step B3: stacking the spectrograms of a plurality of consecutive frames into a cube and convolving the cube with a 3D convolution kernel, wherein the input data are set to Time × Frequency × C, Time and Frequency respectively denote the horizontal (time) and vertical (frequency) axes of the spectrogram, and C denotes the number of spectrograms.
4. The residual error network-based 3DACRNN speech emotion recognition method according to claim 3, wherein step S3 uses the designed Res3DCNN to extract short-term spatio-temporal features of emotional speech from the three-dimensional spectrogram, and the residual formula is:
F(x) = y - x
wherein x is the input, y is the output, and F(x) denotes the residual; the dimensions of x and F(x) must be consistent during the calculation, and if they are not consistent the following algorithm is used:
y = w_k * x + F(x)
wherein w_k represents a weight matrix that adjusts the dimension of the input x to be consistent with F(x); the designed Res3DCNN model consists of four residual blocks, each containing 4 convolutional layers and 1 pooling layer; the convolution kernel size of the first layer is 1 × 1 × 1, the kernel sizes of the remaining three convolutional layers are 3 × 3 × 3, the pooling size is 2 × 2 × 1, and the stride is 1 × 1 × 1; a batch normalization layer BN and a ReLU activation layer are added after each convolutional layer;
BN normalizes activation of a deep neural network middle layer, and the key of the algorithm is that two learnable parameters gamma and beta are introduced:
y^(k) = γ^(k)·x̂^(k) + β^(k)
wherein x̂^(k) represents the variable to be fed into the activation function and k the index of the activation function; within one batch, BN operates on each feature, and with m training samples and j dimensions (j neuron nodes), the j-th dimension is normalized as:
μ_j = (1/m) Σ_{i=1..m} x_i^(j)
σ_j² = (1/m) Σ_{i=1..m} (x_i^(j) - μ_j)²
x̂_i^(j) = (x_i^(j) - μ_j) / √(σ_j² + ε)
wherein x_i^(j) is the result of the linear (pre-activation) calculation for the i-th sample in the j-th dimension, μ_j represents the mean of each mini-batch of training data, σ_j² represents the variance of each mini-batch of training data, x̂_i^(j) represents the normalized result of the batch of training data, and ε prevents the variance from being 0;
the calculation formula for ReLU is as follows:
f(x) = max(0, x).
5. The residual error network-based 3DACRNN speech emotion recognition method according to claim 1, wherein the improved ARNN model of step S4 has a BLSTM with 512 bidirectional hidden units, a new sequence of shape L × 1024 is created and fed into the attention layer, and finally a new sequence h is generated.
6. The residual error network-based 3DACRNN speech emotion recognition method according to claim 5, wherein step S5 trains the model with a training set, uses the cross entropy as the loss function, and optimizes the objective function with the RMSProp algorithm, specifically comprising:
the cross entropy is defined as follows:
C = -Σ_j ŷ_j · ln(y_j)
wherein ŷ_j is the true label of the j-th sample, y_j is the predicted output of the network model for the j-th sample, and C represents the loss value;
the RMSprop algorithm is defined as follows:
r ← w·r + (1 - w)·g²
θ ← θ - α·g / (√r + ε)
wherein g denotes the gradient of the loss with respect to the parameters θ, r: sliding average of the squared gradient, w: attenuation rate, α: learning rate, ε: constant term to prevent the denominator from being zero, η: a constant hyper-parameter.
7. The residual error network-based 3DACRNN speech emotion recognition method according to claim 6, wherein step S6 utilizes a Softmax layer for speech emotion classification, and the formula of the Softmax function is as follows:
S_i = e^{V_i} / Σ_j e^{V_j}
wherein the formula represents the Softmax value of the i-th element V_i in the array, S_i represents the classification probability of the i-th element, and j represents the accumulation (summation) variable.
8. A storage medium, the storage medium being a computer readable storage medium storing one or more programs which, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the method of any of claims 1-7.
CN202010597012.2A 2020-06-28 2020-06-28 Residual error network-based 3DACRNN speech emotion recognition method and storage medium Active CN111785301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010597012.2A CN111785301B (en) 2020-06-28 2020-06-28 Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010597012.2A CN111785301B (en) 2020-06-28 2020-06-28 Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Publications (2)

Publication Number Publication Date
CN111785301A CN111785301A (en) 2020-10-16
CN111785301B true CN111785301B (en) 2022-08-23

Family

ID=72761637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010597012.2A Active CN111785301B (en) 2020-06-28 2020-06-28 Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Country Status (1)

Country Link
CN (1) CN111785301B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780610A (en) * 2020-12-02 2021-12-10 北京沃东天骏信息技术有限公司 Customer service portrait construction method and device
CN112783327B (en) * 2021-01-29 2022-08-30 中国科学院计算技术研究所 Method and system for gesture recognition based on surface electromyogram signals
CN113362857A (en) * 2021-06-15 2021-09-07 厦门大学 Real-time speech emotion recognition method based on CapcNN and application device
CN113229810A (en) * 2021-06-22 2021-08-10 西安超越申泰信息科技有限公司 Human behavior recognition method and system and computer readable storage medium
CN113643723B (en) * 2021-06-29 2023-07-25 重庆邮电大学 Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
CN113327631B (en) * 2021-07-15 2023-03-21 广州虎牙科技有限公司 Emotion recognition model training method, emotion recognition method and emotion recognition device
CN113570156A (en) * 2021-08-18 2021-10-29 中国农业大学 Thermal environment prediction model and thermal environment prediction method based on agricultural facilities
CN113808620B (en) * 2021-08-27 2023-03-21 西藏大学 Tibetan language emotion recognition method based on CNN and LSTM
CN114202746B (en) * 2021-11-10 2024-04-12 深圳先进技术研究院 Pavement state identification method, device, terminal equipment and storage medium
WO2023082103A1 (en) * 2021-11-10 2023-05-19 深圳先进技术研究院 Road surface state recognition method and apparatus, and terminal device, storage medium and product
CN114219005B (en) * 2021-11-17 2023-04-18 太原理工大学 Depression classification method based on high-order spectrum voice features

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180358003A1 (en) * 2017-06-09 2018-12-13 Qualcomm Incorporated Methods and apparatus for improving speech communication and speech interface quality using neural networks
CN108597539B (en) * 2018-02-09 2021-09-03 桂林电子科技大学 Speech emotion recognition method based on parameter migration and spectrogram
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN112581979B (en) * 2020-12-10 2022-07-12 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN113450830B (en) * 2021-06-23 2024-03-08 东南大学 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms

Also Published As

Publication number Publication date
CN111785301A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111785301B (en) Residual error network-based 3DACRNN speech emotion recognition method and storage medium
Razlighi et al. Looknn: Neural network with no multiplication
Xu et al. Investigation on the Chinese text sentiment analysis based on convolutional neural networks in deep learning.
Cao et al. Urban noise recognition with convolutional neural network
Huang et al. Unsupervised domain adaptation for speech emotion recognition using PCANet
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
Zhang et al. Platon: Pruning large transformer models with upper confidence bound of weight importance
CN110782008B (en) Training method, prediction method and device of deep learning model
Deng et al. Deep learning for signal and information processing
Elleuch et al. Arabic handwritten characters recognition using deep belief neural networks
Kilimci et al. The evaluation of word embedding models and deep learning algorithms for Turkish text classification
Rathor et al. A robust model for domain recognition of acoustic communication using Bidirectional LSTM and deep neural network.
SG182933A1 (en) A data structure and a method for using the data structure
Wei et al. A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model
CN112861522B (en) Aspect-level emotion analysis method, system and model based on dual-attention mechanism
CN111968652B (en) Speaker identification method based on 3DCNN-LSTM and storage medium
Chen et al. Deep neural networks for multi-class sentiment classification
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Swain et al. A DCRNN-based ensemble classifier for speech emotion recognition in Odia language
CN111144500A (en) Differential privacy deep learning classification method based on analytic Gaussian mechanism
Zhang et al. Pulsar candidate recognition with deep learning
Newatia et al. Convolutional neural network for ASR
Chen et al. Towards a universal continuous knowledge base
Islam et al. DCNN-LSTM based audio classification combining multiple feature engineering and data augmentation techniques
Sun et al. Audio-video based multimodal emotion recognition using SVMs and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant