CN111785301B - Residual error network-based 3DACRNN speech emotion recognition method and storage medium - Google Patents

Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Info

Publication number
CN111785301B
CN111785301B (application CN202010597012.2A)
Authority
CN
China
Prior art keywords
spectrogram
residual error
model
speech
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010597012.2A
Other languages
Chinese (zh)
Other versions
CN111785301A (en)
Inventor
胡章芳
唐珊珊
罗元
张昊
诸海渝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202010597012.2A priority Critical patent/CN111785301B/en
Publication of CN111785301A publication Critical patent/CN111785301A/en
Application granted granted Critical
Publication of CN111785301B publication Critical patent/CN111785301B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention claims a 3DACRNN speech emotion recognition method based on a residual error network, and a storage medium, wherein the method comprises the following steps: S1, converting the speech signal into a spectrogram and stacking several consecutive frames to form three-dimensional data used as the input of Res3DCNN; S2, extracting short-term spatio-temporal features of emotional speech from the spectrogram with Res3DCNN, and using the residual error network to compensate for the features lost by a traditional CNN during convolution; S3, extracting the long-term dependencies of the spatio-temporal features with an ARNN to alleviate the weak spatio-temporal correlation, and, to reduce the computational complexity, improving the traditional LSTM with a novel rear forgetting gate structure; S4, updating the model parameters during training to minimize the loss, and optimizing the model through continuous iteration; and S5, finally performing emotion classification with a Softmax layer. The method effectively alleviates the severe loss of original features and the weak spatio-temporal correlation, and improves the recognition accuracy.

Description

Residual error network-based 3DACRNN speech emotion recognition method and storage medium
Technical Field
The invention belongs to the field of speech signal processing and pattern recognition, and particularly relates to a residual error network-based 3DACRNN speech emotion recognition method.
Background
The continuous development of artificial intelligence has brought humans and computers increasingly close together. Affective computing is an important research field, and emotional interaction is of great significance in human-computer interaction. Since speech is a direct medium of human information exchange, speech emotion recognition (SER) is the most representative and most widely applied of the emotion recognition techniques. A key step in emotion recognition is extracting from the speech signal a feature set capable of representing human emotion, and so far no systematic feature set exists.
Many previous studies extracted low-level descriptors (LLDs) directly from speech and then classified emotion with traditional machine learning methods. However, because of factors such as context and the different ways in which emotions are expressed, the performance of feature sets selected from LLDs for SER is not particularly satisfactory. As image processing has become easy to implement, a new focus of SER research is converting the speech signal into a spectrogram and using it as the recognition object. This avoids the tedious manual feature-extraction process and reduces the workload of modeling and training; the spectrogram also reflects the energy characteristics of the speech signal and the texture features of rhythm changes, so many researchers have begun to study spectrogram-based speech emotion recognition and have achieved good results. Tarronika et al. extracted high-level emotional feature representations from the magnitude spectrum using a deep neural network (DNN) and showed better performance than traditional acoustic features. Han et al. proposed a DNN-ELM deep network model for SER, trained on the most energetic segments to extract effective emotional information.
In recent years, CNNs and RNNs have been widely used in the SER field: deep convolutional models preserve the spectral time-shift invariance of speech signals, and RNNs excel at processing temporal information, so both are often used to extract high-level features of emotional speech. Neumann et al. integrated representations learned by an unsupervised autoencoder into a CRNN emotion classifier, improving recognition accuracy. However, when a CNN learns features from a spectrogram it only fuses CNN features of single-frame images, so the relation between adjacent consecutive speech frames is often ignored; some studies therefore proposed three-dimensional convolution models for SER, which better capture the short-term spatio-temporal relationships of the feature representation. Peng et al. fed spectrogram information directly into a three-dimensional CRNN, using the convolutional layers to extract high-level representations and the recurrent layers to extract long-term dependencies for emotion recognition. To address the interference of silent and emotion-irrelevant frames, Chen et al. proposed an attention-based 3D convolutional recurrent neural network (ACRNN) model for learning discriminative SER features; the attention mechanism effectively reduces the influence of redundant information such as silent frames. However, as the number of CNN convolutional layers increases, the original features are gradually lost and the number of parameters to be trained grows, resulting in a very large amount of computation. To address the large computational cost, the invention proposes a rear forgetting gate structure to replace the traditional LSTM forgetting gate, reducing computation by reducing parameters.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. The residual error network-based 3DACRNN speech emotion recognition method and storage medium can achieve a higher recognition rate, compensate for lost features, and reduce the amount of computation. The technical scheme of the invention is as follows:
A 3DACRNN speech emotion recognition method based on a residual error network comprises the following steps:
S1, preprocessing the speech signal, including pre-emphasis, framing, and windowing;
S2, converting the speech signal processed in step S1 into a two-dimensional spectrogram, and stacking the spectrograms of several consecutive frames to obtain three-dimensional spectrogram data;
S3, extracting short-term spatio-temporal features of emotional speech from the three-dimensional spectrogram with the residual-network-based three-dimensional convolutional neural network Res3DCNN, and using the residual error network to compensate for the features lost by a traditional convolutional neural network (CNN) during convolution, which effectively alleviates gradient vanishing or explosion;
S4, taking the output of Res3DCNN as the input of an attention-based recurrent neural network (ARNN) model. RNN refers to the recurrent neural network, which performs well on time-series signals, and LSTM is one type of RNN; because redundant information exists, an attention mechanism is added to reduce the weight of useless information, speed up training, extract the long-term dependencies of the spatio-temporal features, and alleviate the weak spatio-temporal correlation. The traditional LSTM forgetting gate is improved with a rear forgetting gate structure; the LSTM contains three gate structures, namely a forgetting gate, an input gate, and an output gate, and to address the large amount of computation the forgetting gate of the traditional long short-term memory (LSTM, a special RNN structure) network is improved;
S5, performing 10-fold cross-validation on the trained model with a validation set, taking the cross entropy as the loss function, and optimizing the model parameters with the RMSProp algorithm;
and S6, verifying the trained model with the validation set and adjusting the hyper-parameters of the model to obtain the final network model, and finally performing speech emotion classification with a Softmax layer.
Further, the step S1 performs preprocessing including pre-emphasis and windowing framing on the speech signal according to its short-time stationarity, and includes the following specific steps:
step A1: using a first-order high-pass filter, i.e. a pre-emphasis filter, whose transfer function in the Z domain is H(z) = 1 - a·z^{-1}, where a is the pre-emphasis coefficient with value 0.95, z is the Z-domain variable, and H(z) is the transfer function; the pre-emphasized signal is x(t);
step A2: framing the pre-emphasized signal into x(m, n), where n is the frame length and m is the number of frames, and windowing with a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
x(m, n) represents the framed speech signal and w(n) the window function of the Hamming window; the windowed, framed speech signal is s_w(m, n) = x(m, n) * w(n), where each frame contains N sample points.
Further, step S2 is to convert the processed speech signal into a two-dimensional spectrogram, and process the two-dimensional spectrogram into three-dimensional data by stacking a plurality of spectrograms of consecutive frames, and the processing steps are as follows:
step B1: transforming the signal processed in step A2 from the time domain to the frequency domain by a fast Fourier transform (FFT) to obtain X(m, n);
step B2: making the periodogram Y(m, n) = X(m, n) · X(m, n)', where X(m, n)' denotes the complex conjugate of X(m, n); taking 10·log10 Y(m, n), scaling m to time and n to frequency, and drawing a two-dimensional spectrogram from (m, n, 10·log10 Y(m, n));
step B3: stacking the spectrograms of a plurality of consecutive frames into a cube and then performing the convolution operation between the cube and a 3D convolution kernel, wherein the input data are set to Time × Frequency × C, Time and Frequency respectively represent the horizontal (time) and vertical (frequency) axes of the spectrogram, and C represents the number of spectrograms.
Further, in step S3, the designed Res3DCNN is used to extract short-term spatio-temporal features of emotion speech from the three-dimensional spectrogram, and the residual error formula is:
F(x)=y-x
wherein x is the input, y is the output, and F(x) represents the residual; the dimensions of x and F(x) must be consistent during the calculation, and if they are not consistent the calculation is carried out by the following algorithm:
y = w_k * x + F(x)
wherein w_k represents a weight matrix that adjusts the dimension of the input x to be consistent with F(x); the designed Res3DCNN model is composed of four residual blocks, each residual block comprising 4 convolutional layers and 1 pooling layer; the convolution kernel size of the first layer is 1 × 1 × 1, the kernel sizes of the other three convolutional layers are 3 × 3 × 3, the pooling size is 2 × 2 × 1, and the stride is 1 × 1 × 1; a batch normalization layer BN and a ReLU activation function layer are added after each convolutional layer;
BN normalizes activation of a deep neural network middle layer, and the key of the algorithm is that two learnable parameters gamma and beta are introduced:
y^(k) = γ^(k)·x̂^(k) + β^(k)
where x̂^(k) represents the variable to be fed into the activation function and k the index of the activation function. Within one batch, BN operates on each feature; with m training samples and j dimensions (j neuron nodes), the j-th dimension is normalized as:
μ_j = (1/m) Σ_{i=1..m} x_i^(j)
σ_j² = (1/m) Σ_{i=1..m} (x_i^(j) - μ_j)²
x̂_i^(j) = (x_i^(j) - μ_j) / √(σ_j² + ε)
where x_i^(j) is the result of the linear (pre-activation) calculation for the i-th sample in the j-th dimension, μ_j represents the mean of each mini-batch of training data, σ_j² represents the variance of each mini-batch of training data, x̂_i^(j) represents the normalized result of the batch of training data, and ε prevents the variance from being 0;
the formula for ReLU is as follows:
f(x) = max(0, x)
further, in step S4, the output of Res3DCNN is used as the input of the ARNN model, the long-term dependency relationship of these spatio-temporal features is extracted, the conventional LSTM unit is composed of three gate structures, which are a forgetting gate, input gates, and an output gate, the forgetting gate is used to determine which information should be discarded in the unit state at the previous time, and directly participates in updating the unit state, the updating algorithm of the unit state is related to the hidden layer output at the previous time and the input at the current time, and the unit state at the previous time is used as a parameter for updating the current state;
Forgetting gate algorithm: f_t = σ(W_f × [h_{t-1}, x_t] + b_f)
The unit state updating algorithm: i_t = σ(W_i × [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C × [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
Wherein C_{t-1} and h_{t-1} are respectively the cell state and hidden-layer output at the previous moment, f_t denotes the output of the forgetting gate, i_t denotes the input-gate activation, x_t is the input at the current moment, C̃_t is the candidate value to be added to the memory cell, W_f, W_i and W_C are respectively the weights of the forgetting gate, the input gate and the candidate cell obtained by training, b_f, b_i and b_C are their biases, and i_t determines how much of C̃_t is written into the cell state; σ represents the logistic sigmoid function:
σ(x) = 1 / (1 + e^{-x})
aiming at the problem of large calculation amount, the forgetting gate of the traditional LSTM is improved, and a novel rear forgetting gate structure is provided, wherein the algorithm is as follows:
f_t = σ(W_f × C_{t-1} + b_f)
The modified W_f has a smaller dimension because x_t and h_{t-1} are not used in the formula, and the modified gate structure is called the rear forgetting gate.
Further, the improved ARNN model of step S4 sets a BLSTM with 512 bidirectional hidden units, creates a new sequence of shape L × 1024, feeds it into the attention layer, and finally generates a new sequence h.
Further, the step S5 is to train the model with a training set, use cross entropy as a loss function, and optimize an objective function with the RMSProp algorithm, which specifically includes:
the cross entropy algorithm is defined as follows:
C = -Σ_j ŷ_j · ln(y_j)
where ŷ_j is the true label of the j-th sample, y_j is the predicted output of the network model for the j-th sample, and C represents the loss value.
The RMSprop algorithm is defined as follows:
r ← w·r + (1 - w)·g²
θ ← θ - α·g / (√r + ε)
where g denotes the gradient of the loss with respect to the parameters θ, r: sliding average of the squared gradient, w: attenuation rate, α: learning rate, ε: constant term to prevent the denominator from being zero, η: a constant hyper-parameter.
Further, in step S6, performing emotion speech classification by using a Softmax layer, where the formula of the Softmax function is as follows:
S_i = e^{V_i} / Σ_j e^{V_j}
The formula represents the Softmax value of the i-th element V_i in the array; S_i represents the classification probability of the i-th element, and j represents the accumulation (summation) variable.
A storage medium, which is a computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform any of the methods described above.
The invention has the following advantages and beneficial effects:
In summary, by adopting the above technical scheme, the invention has the following beneficial effects: under the same experimental environment, the residual error network-based 3DACRNN speech emotion recognition method better alleviates the feature loss of a deep CNN during convolution and the weak spatio-temporal correlation, and thus extracts deeper features that represent speech emotion. The spectrogram is extracted from the preprocessed speech signal and combined into three-dimensional spectrogram data; short-term spatio-temporal features are extracted by the residual-network-based three-dimensional convolutional neural network, and the long-term dependencies of these features are extracted by the attention-based recurrent neural network. Because the three-dimensional convolutional neural network is complex and computationally expensive, the LSTM network is improved: a new forgetting gate structure replaces the traditional one, which greatly reduces the amount of computation and speeds up model training and testing. In short, the residual error network-based 3DACRNN speech emotion recognition method considerably improves the performance of a speech emotion recognition system.
Drawings
FIG. 1 is a general block diagram of the residual network-based 3DACRNN speech emotion recognition method according to the preferred embodiment of the present invention;
FIG. 2 is a spectrogram extraction process;
FIG. 3 is a diagram of a residual block in a convolutional neural network;
FIG. 4 is a diagram of the traditional LSTM and the improved LSTM network structures.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme of the invention for solving the above technical problems is as follows:
As shown in FIG. 1, the present invention provides a residual error network-based 3DACRNN speech emotion recognition method, comprising the following steps:
S1: preprocess the speech signal by pre-emphasis, framing, and windowing, as follows:
step A1: use a first-order high-pass filter, i.e. a pre-emphasis filter, with transfer function H(z) = 1 - a·z^{-1}, where a is the pre-emphasis coefficient, set to 0.95 in the invention; the pre-emphasized signal is x(t);
step A2: divide the pre-emphasized signal into frames x(m, n) (n is the frame length and m is the number of frames), and apply a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
x(m, n) represents the framed speech signal and w(n) the window function of the Hamming window; the windowed, framed speech signal is s_w(m, n) = x(m, n) * w(n), where each frame contains N sample points.
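For illustration, a minimal NumPy sketch of this preprocessing step is given below. The pre-emphasis coefficient a = 0.95 follows step A1, while the frame length of 400 samples and the 50% frame overlap are assumed values chosen only for the example.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=200, a=0.95):
    """Pre-emphasis, framing and Hamming windowing (sketch; frame_len/hop are assumed)."""
    # Step A1: pre-emphasis x(t) = s(t) - a * s(t-1), with a = 0.95
    x = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Step A2: split into overlapping frames x(m, n)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[m * hop: m * hop + frame_len] for m in range(num_frames)])
    # Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), applied to every frame
    w = np.hamming(frame_len)
    return frames * w  # s_w(m, n) = x(m, n) * w(n)
```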
S2: converting the processed voice signal into a two-dimensional spectrogram (the spectrogram extraction process is shown in fig. 2), and processing the two-dimensional spectrogram into three-dimensional data by a method of stacking a plurality of continuous frames of spectrograms, wherein the processing steps are as follows:
step B1: transform the signal processed by A2 from the time domain to the frequency domain by a fast Fourier transform (FFT) to obtain X(m, n);
step B2: make the periodogram Y(m, n) = X(m, n) · X(m, n)', where X(m, n)' denotes the complex conjugate of X(m, n); take 10·log10 Y(m, n), scale m to time and n to frequency, and draw the two-dimensional spectrogram from (m, n, 10·log10 Y(m, n)).
step B3: stack the spectrograms of a plurality of consecutive frames into a cube and then perform the convolution operation between the cube and the 3D convolution kernel; the input data of the 3D convolution must be three-dimensional and are set to Time × Frequency × C, where Time and Frequency respectively represent the horizontal (time) and vertical (frequency) axes of the spectrogram and C represents the number of spectrograms.
S3: the Res3DCNN designed by the invention is used for extracting short-term space-time characteristics of emotional voice from the three-dimensional spectrogram. The schematic diagram of the residual error network is shown in fig. 3, and its formula is:
F(x) = y - x
where x is the input, y is the output, and F(x) represents the residual. The dimensions of x and F(x) must be consistent during the calculation; if they are not, the calculation is carried out by the following algorithm:
y = w_k * x + F(x)
w_k represents a weight matrix that adjusts the dimension of the input x to be consistent with F(x). The designed Res3DCNN model consists of four residual blocks, each containing 4 convolutional layers and 1 pooling layer. The convolution kernel size of the first layer is 1 × 1 × 1, the kernel sizes of the remaining three convolutional layers are 3 × 3 × 3, the pooling size is 2 × 2 × 1, and the stride is 1 × 1 × 1. After each convolutional layer, a batch normalization (BN) layer and a ReLU activation function layer are added.
The BN normalizes activation of a deep neural network middle layer, and the key of the algorithm is that two learnable parameters gamma and beta are introduced:
y^(k) = γ^(k)·x̂^(k) + β^(k)
where x̂^(k) represents the variable to be fed into the activation function and k the index of the activation function. Within one batch, BN operates on each feature; with m training samples and j dimensions (j neuron nodes), the j-th dimension is normalized as:
μ_j = (1/m) Σ_{i=1..m} x_i^(j)
σ_j² = (1/m) Σ_{i=1..m} (x_i^(j) - μ_j)²
x̂_i^(j) = (x_i^(j) - μ_j) / √(σ_j² + ε)
where x_i^(j) is the result of the linear (pre-activation) calculation for the i-th sample in the j-th dimension of the layer, μ_j represents the mean of each mini-batch of training data, σ_j² represents the variance of each mini-batch of training data, x̂_i^(j) represents the normalized result of the batch of training data, and ε prevents the variance from being 0.
The formula for ReLU is as follows:
f(x) = max(0, x)
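For illustration, a PyTorch sketch of one residual block with the layer sizes described above (a 1 × 1 × 1 convolution followed by three 3 × 3 × 3 convolutions, BN and ReLU after every convolution, a 2 × 2 × 1 pooling with stride 1, and a 1 × 1 × 1 projection w_k on the shortcut when dimensions differ) is given below; the channel counts and the use of max-pooling are assumptions made only for the example.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """One Res3DCNN residual block (sketch; in_ch/out_ch are assumed values)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        layers, ch = [], in_ch
        for k in (1, 3, 3, 3):                          # kernel sizes 1x1x1, then three 3x3x3
            layers += [nn.Conv3d(ch, out_ch, kernel_size=k, padding=k // 2, stride=1),
                       nn.BatchNorm3d(out_ch),          # BN after every convolution layer
                       nn.ReLU(inplace=True)]           # ReLU activation layer
            ch = out_ch
        self.body = nn.Sequential(*layers)              # F(x)
        # Shortcut w_k * x adjusts the input dimension when it differs from F(x)
        self.shortcut = (nn.Conv3d(in_ch, out_ch, kernel_size=1)
                         if in_ch != out_ch else nn.Identity())
        self.pool = nn.MaxPool3d(kernel_size=(2, 2, 1), stride=1)   # 2x2x1 pooling, stride 1

    def forward(self, x):
        y = self.body(x) + self.shortcut(x)             # y = F(x) + w_k * x
        return self.pool(y)

# A Res3DCNN could then stack four such blocks, e.g. (assumed channel counts):
# net = nn.Sequential(ResidualBlock3D(3, 32), ResidualBlock3D(32, 64),
#                     ResidualBlock3D(64, 128), ResidualBlock3D(128, 256))
```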
S4: the output of Res3DCNN is used as the input of the ARNN model, and the long-term dependencies of the spatio-temporal features are extracted. The traditional LSTM unit consists of three gate structures, namely a forgetting gate, an input gate, and an output gate. The forgetting gate determines which information in the unit state at the previous moment should be discarded and directly participates in updating the unit state; the updating algorithm of the unit state is related to the hidden-layer output at the previous moment and the input at the current moment, and the unit state at the previous moment is used as a parameter for updating the current state.
Forgetting gate algorithm: f_t = σ(W_f × [h_{t-1}, x_t] + b_f)
The unit state updating algorithm: i_t = σ(W_i × [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C × [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
Wherein C_{t-1} and h_{t-1} are respectively the cell state and hidden-layer output at the previous moment, f_t denotes the output of the forgetting gate, i_t denotes the input-gate activation, x_t is the input at the current moment, C̃_t is the candidate value to be added to the memory cell, W_f, W_i and W_C are respectively the weights of the forgetting gate, the input gate and the candidate cell obtained from training, b_f, b_i and b_C are their biases, and i_t determines how much of C̃_t is written into the cell state; σ represents the logistic sigmoid function:
σ(x) = 1 / (1 + e^{-x})
aiming at the problem of large calculation amount, the forgetting gate of the traditional LSTM is improved, a novel rear forgetting gate structure is provided, and the algorithm is as follows:
f_t = σ(W_f × C_{t-1} + b_f)
The modified W_f has a smaller dimension because x_t and h_{t-1} no longer participate in the calculation, which reduces the number of parameters to be trained and the amount of computation. The invention refers to the modified gate structure as the rear forgetting gate; the traditional LSTM and the improved LSTM network structures are shown in FIG. 4.
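A hedged PyTorch sketch of a single recurrent step with the rear forgetting gate is given below: f_t depends only on the previous cell state C_{t-1}, so W_f maps hidden_size to hidden_size instead of (input_size + hidden_size) to hidden_size. The output gate follows the standard LSTM form, which the patent does not restate, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class RearForgetGateLSTMCell(nn.Module):
    """LSTM cell whose forgetting gate uses only C_{t-1} (rear forgetting gate) - sketch."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_f = nn.Linear(hidden_size, hidden_size)               # f_t = sigma(W_f * C_{t-1} + b_f)
        self.W_i = nn.Linear(input_size + hidden_size, hidden_size)  # input gate
        self.W_C = nn.Linear(input_size + hidden_size, hidden_size)  # candidate value
        self.W_o = nn.Linear(input_size + hidden_size, hidden_size)  # standard output gate (assumed)

    def forward(self, x_t, h_prev, C_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)            # [h_{t-1}, x_t]
        f_t = torch.sigmoid(self.W_f(C_prev))            # rear forgetting gate
        i_t = torch.sigmoid(self.W_i(hx))                # input-gate activation i_t
        C_tilde = torch.tanh(self.W_C(hx))               # candidate cell value
        C_t = f_t * C_prev + i_t * C_tilde               # cell state update
        o_t = torch.sigmoid(self.W_o(hx))
        h_t = o_t * torch.tanh(C_t)
        return h_t, C_t
```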
For the improved ARNN model, the invention sets a BLSTM with 512 bidirectional hidden units, creates a new sequence of shape L × 1024, feeds it into the attention layer, and finally generates a new sequence h.
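The sketch below illustrates this ARNN stage under stated assumptions: a bidirectional LSTM with 512 units per direction yields an L × 1024 sequence, and a simple soft-attention layer re-weights it to produce the sequence h. The exact attention formulation is an assumption, since the patent does not spell it out, and the sketch uses the standard PyTorch LSTM rather than the rear-forgetting-gate cell sketched above.

```python
import torch
import torch.nn as nn

class ARNN(nn.Module):
    """BLSTM (512 units per direction -> L x 1024) followed by soft attention (sketch)."""
    def __init__(self, input_size, hidden_size=512):
        super().__init__()
        self.blstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_size, 1)         # one attention score per time step

    def forward(self, x):                                 # x: (batch, L, input_size)
        seq, _ = self.blstm(x)                            # (batch, L, 1024)
        alpha = torch.softmax(self.attn(seq), dim=1)      # attention weights over the L steps
        h = alpha * seq                                   # re-weighted sequence h
        return h, h.sum(dim=1)                            # sequence and pooled utterance vector
```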
Step S5, training the model by using the training set, adopting the cross entropy as the loss function, and optimizing the objective function by using the RMSProp algorithm.
The cross entropy algorithm is defined as follows:
C = -Σ_j ŷ_j · ln(y_j)
where ŷ_j is the true label of the j-th sample, y_j is the predicted output of the network model for the j-th sample, and C represents the loss value.
The RMSprop algorithm is defined as follows:
r ← w·r + (1 - w)·g²
θ ← θ - α·g / (√r + ε)
where g denotes the gradient of the loss with respect to the parameters θ, r: sliding average of the squared gradient, w: attenuation rate, α: learning rate, ε: constant term to prevent the denominator from being zero, η: a constant hyper-parameter.
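A minimal training-loop sketch is given below, assuming a PyTorch model that outputs class logits; nn.CrossEntropyLoss combines the Softmax layer of step S6 with the cross-entropy loss, and the RMSprop hyper-parameters (learning rate, smoothing constant alpha, eps) are illustrative values rather than values fixed by the invention.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=1e-4, alpha=0.9, eps=1e-8):
    """Train with the cross-entropy loss and the RMSprop optimizer (sketch)."""
    criterion = nn.CrossEntropyLoss()          # cross entropy as the loss function
    optimizer = torch.optim.RMSprop(model.parameters(), lr=lr, alpha=alpha, eps=eps)
    for _ in range(epochs):
        for cubes, labels in loader:           # 3-D spectrogram cubes and emotion labels
            optimizer.zero_grad()
            loss = criterion(model(cubes), labels)   # C: loss value for the batch
            loss.backward()
            optimizer.step()                   # RMSprop parameter update
    return model
```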
Step S6, performing emotion speech classification using the Softmax layer, where the formula of the Softmax function is as follows:
S_i = e^{V_i} / Σ_j e^{V_j}
The formula represents the Softmax value of the i-th element V_i in the array; S_i represents the classification probability of the i-th element.
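For reference, a tiny NumPy illustration of the Softmax formula above; subtracting the maximum before exponentiation is a numerical-stability detail of the implementation, not part of the patent formula.

```python
import numpy as np

def softmax(v):
    """S_i = exp(V_i) / sum_j exp(V_j)."""
    e = np.exp(v - np.max(v))     # subtract the max for numerical stability
    return e / e.sum()

# Example: softmax(np.array([2.0, 1.0, 0.1])) is approximately [0.659, 0.242, 0.099]
```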
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (8)

1. A 3DACRNN speech emotion recognition method based on a residual error network, characterized by comprising the following steps:
S1, preprocessing the speech signal, including pre-emphasis, framing, and windowing;
S2, converting the speech signal processed in step S1 into a two-dimensional spectrogram, and stacking the spectrograms of several consecutive frames to obtain three-dimensional spectrogram data;
S3, extracting short-term spatio-temporal features of emotional speech from the three-dimensional spectrogram with the residual-network-based three-dimensional convolutional neural network Res3DCNN, and using the residual error network to compensate for the features lost by a traditional convolutional neural network CNN during convolution;
s4, taking the output of Res3DCNN as the input of an attention-based recurrent neural network (ARNN) model, wherein the Recurrent Neural Network (RNN) is LSTM; the forgetting gate of the LSTM adopts a rear forgetting gate, wherein the algorithm of the rear forgetting gate is as follows:
f_t = σ(W_f × C_{t-1} + b_f),
the unit state updating algorithm: i.e. i t =σ(W i ×[h t-1 ,x t ]+b i )
C̃_t = tanh(W_C × [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
wherein C_{t-1} and h_{t-1} are respectively the cell state and hidden-layer output at the previous moment, f_t denotes the output of the forgetting gate, i_t denotes the input-gate activation, x_t is the input at the current moment, C̃_t is the candidate value to be added to the memory cell, W_f, W_i and W_C are respectively the weights of the forgetting gate, the input gate and the candidate cell obtained by training, b_f, b_i and b_C are their biases, and i_t determines how much of C̃_t is written into the cell state; σ represents the logistic sigmoid function:
σ(x) = 1 / (1 + e^{-x});
s5, performing 10-time cross validation on the trained model by using a validation set, taking cross entropy as a loss function, and optimizing model parameters by using a RMSProp algorithm;
and S6, verifying the trained model by using a verification set, adjusting the hyper-parameters of the RMSProp algorithm in the model to obtain a final network model, and finally performing speech emotion classification by using a Softmax layer.
2. The residual error network-based 3DACRNN speech emotion recognition method according to claim 1, wherein step S1 preprocesses the speech signal, which is short-time stationary, by pre-emphasis, framing, and windowing, with the following specific steps:
step A1: using a first-order high-pass filter, i.e. a pre-emphasis filter, whose transfer function in the Z domain is H(z) = 1 - a·z^{-1}, where a is the pre-emphasis coefficient with value 0.95, z is the Z-domain variable, and H(z) is the transfer function; the pre-emphasized signal is x(t);
step A2: framing the pre-emphasized signal into x(m, n), where n is the frame length and m is the number of frames, and windowing with a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
x(m, n) represents the framed speech signal and w(n) the window function of the Hamming window; the windowed, framed speech signal is s_w(m, n) = x(m, n) * w(n), where each frame contains N sample points.
3. The residual error network-based 3DACRNN speech emotion recognition method according to claim 2, wherein step S2 converts the processed speech signal into a two-dimensional spectrogram, which is processed into three-dimensional data by stacking the spectrograms of several consecutive frames, with the following processing steps:
step B1: transforming the signal processed in step A2 from the time domain to the frequency domain by a fast Fourier transform (FFT) to obtain X(m, n);
step B2: making the periodogram Y(m, n) = X(m, n) · X(m, n)', where X(m, n)' denotes the complex conjugate of X(m, n); taking 10·log10 Y(m, n), scaling m to time and n to frequency, and drawing the two-dimensional spectrogram from (m, n, 10·log10 Y(m, n));
step B3: stacking the spectrograms of a plurality of consecutive frames into a cube and convolving the cube with a 3D convolution kernel, wherein the input data are set to Time × Frequency × C, Time and Frequency respectively denote the horizontal (time) and vertical (frequency) axes of the spectrogram, and C denotes the number of spectrograms.
4. The residual error network-based 3DACRNN speech emotion recognition method according to claim 3, wherein step S3 uses the designed Res3DCNN to extract short-term spatio-temporal features of emotional speech from the three-dimensional spectrogram, and the residual formula is:
F(x) = y - x
wherein x is the input, y is the output, and F(x) denotes the residual; the dimensions of x and F(x) must be consistent during the calculation, and if they are not consistent the following algorithm is used:
y = w_k * x + F(x)
wherein w_k represents a weight matrix that adjusts the dimension of the input x to be consistent with F(x); the designed Res3DCNN model consists of four residual blocks, each containing 4 convolutional layers and 1 pooling layer; the convolution kernel size of the first layer is 1 × 1 × 1, the kernel sizes of the remaining three convolutional layers are 3 × 3 × 3, the pooling size is 2 × 2 × 1, and the stride is 1 × 1 × 1; a batch normalization layer BN and a ReLU activation layer are added after each convolutional layer;
BN normalizes activation of a deep neural network middle layer, and the key of the algorithm is that two learnable parameters gamma and beta are introduced:
y^(k) = γ^(k)·x̂^(k) + β^(k)
wherein x̂^(k) represents the variable to be fed into the activation function and k the index of the activation function; within one batch, BN operates on each feature, and with m training samples and j dimensions (j neuron nodes), the j-th dimension is normalized as:
μ_j = (1/m) Σ_{i=1..m} x_i^(j)
σ_j² = (1/m) Σ_{i=1..m} (x_i^(j) - μ_j)²
x̂_i^(j) = (x_i^(j) - μ_j) / √(σ_j² + ε)
wherein x_i^(j) is the result of the linear (pre-activation) calculation for the i-th sample in the j-th dimension, μ_j represents the mean of each mini-batch of training data, σ_j² represents the variance of each mini-batch of training data, x̂_i^(j) represents the normalized result of the batch of training data, and ε prevents the variance from being 0;
the calculation formula for ReLU is as follows:
f(x) = max(0, x).
5. The residual error network-based 3DACRNN speech emotion recognition method according to claim 1, wherein the improved ARNN model of step S4 has a BLSTM with 512 bidirectional hidden units, a new sequence of shape L × 1024 is created and fed into the attention layer, and finally a new sequence h is generated.
6. The residual error network-based 3DACRNN speech emotion recognition method according to claim 5, wherein step S5 trains the model with a training set, uses the cross entropy as the loss function, and optimizes the objective function with the RMSProp algorithm, specifically comprising:
the cross entropy is defined as follows:
C = -Σ_j ŷ_j · ln(y_j)
wherein ŷ_j is the true label of the j-th sample, y_j is the predicted output of the network model for the j-th sample, and C represents the loss value;
the RMSprop algorithm is defined as follows:
r ← w·r + (1 - w)·g²
θ ← θ - α·g / (√r + ε)
wherein g denotes the gradient of the loss with respect to the parameters θ, r: sliding average of the squared gradient, w: attenuation rate, α: learning rate, ε: constant term to prevent the denominator from being zero, η: a constant hyper-parameter.
7. The residual error network-based 3DACRNN speech emotion recognition method according to claim 6, wherein step S6 utilizes a Softmax layer for speech emotion classification, and the formula of the Softmax function is as follows:
S_i = e^{V_i} / Σ_j e^{V_j}
wherein the formula represents the Softmax value of the i-th element V_i in the array, S_i represents the classification probability of the i-th element, and j represents the accumulation (summation) variable.
8. A storage medium, the storage medium being a computer readable storage medium storing one or more programs which, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the method of any of claims 1-7.
CN202010597012.2A 2020-06-28 2020-06-28 Residual error network-based 3DACRNN speech emotion recognition method and storage medium Active CN111785301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010597012.2A CN111785301B (en) 2020-06-28 2020-06-28 Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010597012.2A CN111785301B (en) 2020-06-28 2020-06-28 Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Publications (2)

Publication Number Publication Date
CN111785301A CN111785301A (en) 2020-10-16
CN111785301B true CN111785301B (en) 2022-08-23

Family

ID=72761637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010597012.2A Active CN111785301B (en) 2020-06-28 2020-06-28 Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Country Status (1)

Country Link
CN (1) CN111785301B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780610A (en) * 2020-12-02 2021-12-10 北京沃东天骏信息技术有限公司 Customer service portrait construction method and device
CN112783327B (en) * 2021-01-29 2022-08-30 中国科学院计算技术研究所 Method and system for gesture recognition based on surface electromyogram signals
CN113362857A (en) * 2021-06-15 2021-09-07 厦门大学 Real-time speech emotion recognition method based on CapcNN and application device
CN113229810A (en) * 2021-06-22 2021-08-10 西安超越申泰信息科技有限公司 Human behavior recognition method and system and computer readable storage medium
CN113643723B (en) * 2021-06-29 2023-07-25 重庆邮电大学 Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
CN113327631B (en) * 2021-07-15 2023-03-21 广州虎牙科技有限公司 Emotion recognition model training method, emotion recognition method and emotion recognition device
CN113570156A (en) * 2021-08-18 2021-10-29 中国农业大学 Thermal environment prediction model and thermal environment prediction method based on agricultural facilities
CN113808620B (en) * 2021-08-27 2023-03-21 西藏大学 Tibetan language emotion recognition method based on CNN and LSTM
CN114202746B (en) * 2021-11-10 2024-04-12 深圳先进技术研究院 Pavement state identification method, device, terminal equipment and storage medium
WO2023082103A1 (en) * 2021-11-10 2023-05-19 深圳先进技术研究院 Road surface state recognition method and apparatus, and terminal device, storage medium and product
CN114219005B (en) * 2021-11-17 2023-04-18 太原理工大学 Depression classification method based on high-order spectrum voice features

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180358003A1 (en) * 2017-06-09 2018-12-13 Qualcomm Incorporated Methods and apparatus for improving speech communication and speech interface quality using neural networks
CN108597539B (en) * 2018-02-09 2021-09-03 桂林电子科技大学 Speech emotion recognition method based on parameter migration and spectrogram
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN112581979B (en) * 2020-12-10 2022-07-12 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN113450830B (en) * 2021-06-23 2024-03-08 东南大学 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms

Also Published As

Publication number Publication date
CN111785301A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111785301B (en) Residual error network-based 3DACRNN speech emotion recognition method and storage medium
Razlighi et al. Looknn: Neural network with no multiplication
Xu et al. Investigation on the Chinese text sentiment analysis based on convolutional neural networks in deep learning.
Cao et al. Urban noise recognition with convolutional neural network
Huang et al. Unsupervised domain adaptation for speech emotion recognition using PCANet
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
Zhang et al. Platon: Pruning large transformer models with upper confidence bound of weight importance
CN110782008B (en) Training method, prediction method and device of deep learning model
Deng et al. Deep learning for signal and information processing
Elleuch et al. Arabic handwritten characters recognition using deep belief neural networks
Kilimci et al. The evaluation of word embedding models and deep learning algorithms for Turkish text classification
Rathor et al. A robust model for domain recognition of acoustic communication using Bidirectional LSTM and deep neural network.
SG182933A1 (en) A data structure and a method for using the data structure
Wei et al. A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model
CN112861522B (en) Aspect-level emotion analysis method, system and model based on dual-attention mechanism
CN111968652B (en) Speaker identification method based on 3DCNN-LSTM and storage medium
Chen et al. Deep neural networks for multi-class sentiment classification
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Swain et al. A DCRNN-based ensemble classifier for speech emotion recognition in Odia language
CN111144500A (en) Differential privacy deep learning classification method based on analytic Gaussian mechanism
Zhang et al. Pulsar candidate recognition with deep learning
Newatia et al. Convolutional neural network for ASR
Chen et al. Towards a universal continuous knowledge base
Islam et al. DCNN-LSTM based audio classification combining multiple feature engineering and data augmentation techniques
Sun et al. Audio-video based multimodal emotion recognition using SVMs and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant