CN110534133B - Voice emotion recognition system and voice emotion recognition method - Google Patents

Voice emotion recognition system and voice emotion recognition method

Info

Publication number
CN110534133B
CN110534133B CN201910803429.7A
Authority
CN
China
Prior art keywords
module
speech
layer
time step
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910803429.7A
Other languages
Chinese (zh)
Other versions
CN110534133A (en)
Inventor
殷绪成
曹秒
杨春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Eeasy Electronic Tech Co ltd
Original Assignee
Zhuhai Eeasy Electronic Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Eeasy Electronic Tech Co ltd filed Critical Zhuhai Eeasy Electronic Tech Co ltd
Priority to CN201910803429.7A priority Critical patent/CN110534133B/en
Publication of CN110534133A publication Critical patent/CN110534133A/en
Application granted granted Critical
Publication of CN110534133B publication Critical patent/CN110534133B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a speech emotion recognition system comprising an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module, and an output module connected in sequence. The invention also discloses a speech emotion recognition method applied to the speech emotion recognition system, comprising the following steps: 1. preprocess the speech to obtain a spectrogram feature map; 2. operate on the spectrogram feature map to construct a feature map containing shallow audio information; 3. further process the feature map containing the shallow audio information to obtain deeper semantic and contextual information; 4. process the feature map carrying the deeper semantic and contextual information to obtain the feature vector most strongly correlated with the speaker's emotion over the whole utterance; 5. output the emotion category corresponding to the whole utterance. Compared with the prior art, the speech emotion recognition performance of the invention is greatly improved.

Description

Voice emotion recognition system and voice emotion recognition method
Technical Field
The invention relates to the technical field of artificial intelligence and speech recognition, in particular to a speech emotion recognition system and a speech emotion recognition method, which use an end-to-end deep neural network and are improvements built on DFSMN as the base network.
Background
With the continuous progress of speech recognition technology and the wide deployment of speech recognition devices, human-computer interaction has become increasingly common in daily life. However, most of these devices can only recognize the textual content of human language and cannot recognize the emotional state of the speaker. Speech emotion recognition has many useful applications in human-centric services and human-computer interaction, such as intelligent service robots, automated call centers, and distance education; it has therefore attracted considerable research attention, and many methods have been proposed. With the rapid development of machine learning (deep neural networks such as CNNs) in recent years, such methods have shown excellent performance after repeated attempts and improvements in many fields. How to apply deep learning to this field, however, is still being explored, and in practical applications many problems of this challenging task remain to be solved. Putting speech emotion recognition into practice is challenging: massive amounts of complex audio data must be collected for research, and how to make audio recorded in a clean environment closer to speech in real scenes is a major unsolved problem in the prior art.
A typical Speech Emotion Recognition (SER) system takes a speech waveform as input and outputs one of the target emotion classes. Conventional SER systems use Gaussian Mixture Models (GMMs) (Neiberg D, Elenius K, Laskowski K. Emotion recognition in spontaneous speech using GMMs [C]// Ninth International Conference on Spoken Language Processing, 2006.), Hidden Markov Models (HMMs) (Nwe T L, Foo S W, De Silva L C. Speech emotion recognition using hidden Markov models [J]. Speech Communication, 2003, 41(4): 603-623.), and Support Vector Machines (SVMs) (Yang N, Yuan J, Zhou Y, et al.; see also IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018: 2906-2910), all of which depend on mature hand-crafted speech features, and the choice of features has a great influence on model performance. These features typically include spectral, cepstral, pitch, and energy features of the frame-level speech signal. Statistical functionals of these features are then applied across multiple frames to obtain an utterance-level feature vector.
With the explosive development of deep learning techniques, some researchers have explored deep learning approaches to build more robust SER models. Zhang Z et al. (Zhang Z, Ringeval F, Han J, et al. Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with LSTM neural networks [C]// Proceedings INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association (ISCA). 2016: 3593-3597.) propose a feature-enhancement autoencoder based on long short-term memory (LSTM) neural networks for extracting emotion information from speech. Recurrent Neural Networks (RNNs) have indeed proven to have strong sequence modeling capabilities, especially in speech recognition tasks. However, RNN training relies on Backpropagation Through Time (BPTT), which, due to its computational complexity, causes long training times as well as vanishing and exploding gradients. To solve these problems, the Feedforward Sequential Memory Network (hereinafter abbreviated FSMN) was proposed. In recent years, a great deal of research has shown that FSMN can model long-term dependencies without any recurrent feedback in tasks such as speech recognition and language modeling. Furthermore, Zhang S et al. (Zhang S, Lei M, Yan Z, et al. Deep-FSMN for large vocabulary continuous speech recognition [C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 5869-5873.) apply a skip-connection structure to FSMN and greatly improve on the earlier model.
Research activity on SER dates back to the 1980s. SER remains challenging in practical applications due to factors such as gender, speaker, language, and changes in the recording environment. Many researchers have attempted to solve these problems by designing sophisticated hand-crafted speech features to strengthen the connection to human emotion. However, these manually extracted speech features are only suitable for specific tasks and generalize poorly. Different speech features therefore have to be designed for different speech-related tasks, which runs counter to the original intent of deep learning techniques.
Disclosure of Invention
In view of the shortcomings of the prior art, it is an object of the present invention to provide a speech emotion recognition system which is an end-to-end feedforward deep neural network structure.
In view of the defects of the prior art, another object of the present invention is to provide a speech emotion recognition method applied to a speech emotion recognition system.
In order to realize the purpose of the invention, the following technical scheme is adopted: a speech emotion recognition system comprising: the system comprises an audio preprocessing module, a CNN module, a pyramid FSMN module, a time step attention module and an output module which are connected in sequence; the CNN module is provided with a convolution layer, and the pyramid FSMN module is provided with a pyramid memory block structure;
the audio preprocessing module converts the received original audio data into a speech spectrum characteristic diagram;
the CNN module performs primary processing on the spectrum characteristic diagram to construct a characteristic diagram containing shallow information;
the pyramid FSMN module further processes the characteristic diagram containing the shallow information to acquire deeper semantic information and context information;
the time step attention module is used for attending to specific regions among the time steps and calculating the influence weight of each time step on the final emotion recognition;
the output module is provided with a plurality of emotion categories and is used for outputting the emotion category which is most matched with the original audio data.
The time step attention module may specifically be represented by the following formulas:

a_t = Average(h_t),

s = σ( W_2 · f(W_1 · a + b_1) + b_2 ),

y = X · s,

wherein a_t is the mean value of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, σ is the softmax activation function, W_1 is the weight parameter of the first layer in the time step attention module, W_2 is the weight parameter of the second layer in the time step attention module, b_1 is the bias parameter of the first layer in the time step attention module, b_2 is the bias parameter of the second layer in the time step attention module, f is an arbitrary activation function, and a is the feature vector formed by all a_t; y is the output of the output module and X is the input to the time step attention module.
When the convolutional layer performs a convolution operation with a kernel of size k and stride s, the output size of the convolutional layer is calculated by the following formulas:

W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

wherein W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
The pyramid memory block structure encodes, for each time step h_t, the N_1 forward (historical) time steps and the N_2 backward (future) time steps into a fixed-size representation, and then takes the sum of the two encodings as the current output, as shown by the following formula:

h̃_t = f( Σ_{i=0..N_1} a_i ⊙ h_{t-i} + Σ_{j=1..N_2} b_j ⊙ h_{t+j} ),

wherein h̃_t is the output of the memory module at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
the pyramid memory block structure may adopt skip connections, and the input-output relationship of a layer with skip connections is shown by the following formulas:

h̃_t^l = f( h̃_t^(l-1) + h_t^l + Σ_{i=0..N_1^l} a_i^l ⊙ h^l_{t-s_1·i} + Σ_{j=1..N_2^l} b_j^l ⊙ h^l_{t+s_2·j} ),

h_t^(l+1) = f( W^l · h̃_t^l + b^(l+1) ),

wherein h̃_t^l is the output of the memory block of the l-th layer at time t, f is an arbitrary activation function, h̃_t^(l-1) is the memory block output of layer l-1 at time t, h_t^l is the input of the memory block of the l-th layer at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, h^l_{t-s_1·i} is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, h^l_{t+s_2·j} is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^(l+1) is the output of the hidden layer of the (l+1)-th layer at time t, W^l is the weight parameter of the l-th layer memory block, and b^(l+1) is the bias of the (l+1)-th layer memory block.
There may be two convolutional layers, the shallow information may be audio loudness or frequency, and the plurality of emotion categories may be four emotion categories, which may be happy, sad, angry, and neutral.
In order to realize another purpose of the invention, the following technical scheme is adopted: a speech emotion recognition method applied to a speech emotion recognition system comprises the following steps:
step 1, an audio preprocessing module performs primary feature extraction and regularization operation on received voice to obtain a voice spectrum feature map, and inputs the voice spectrum feature map into a CNN module;
step 2, the CNN module performs convolution calculation operation on the received speech spectrum characteristic diagram to construct a speech spectrum characteristic diagram containing audio shallow layer information (such as audio loudness, frequency and the like);
step 3, the pyramid FSMN module further processes the speech spectrum feature map containing the audio shallow information, and obtains deeper semantic information and context information in the speech spectrum feature map through a pyramid memory block structure, such as the gender and emotion of a speaker contained in a section of speech;
step 4, the time step attention module processes the speech spectrum characteristic graph with deeper semantic information and context information, firstly, the attention scores of different time steps are calculated, then the scores are used for carrying out weighted summation on the whole speech spectrum characteristic graph in the time step dimension, and therefore the characteristic vector with the highest emotion correlation degree between the whole section of speech and the speaker is obtained; the time step attention module can be utilized to enable the model to pay more attention to the part related to the emotion of the speaker, so that the robustness of the model is improved;
step 5, the feature vector is input into the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e., the model classifies the predicted emotion; the output module is a fully connected layer, and its output is a feature vector of length 4.
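As a non-authoritative illustration of step 5, the minimal Python/PyTorch sketch below maps the length-4 output vector to an emotion category; the label ordering is a hypothetical choice, since the invention only fixes the four categories.

    import torch

    # Hypothetical label order; the patent only specifies four categories.
    EMOTIONS = ["happy", "sad", "angry", "neutral"]

    def decode_emotion(logits: torch.Tensor) -> str:
        """Pick the emotion whose dimension has the highest probability."""
        probs = torch.softmax(logits, dim=-1)      # length-4 vector -> probabilities
        return EMOTIONS[int(torch.argmax(probs))]

    print(decode_emotion(torch.tensor([0.1, 2.3, -0.5, 0.7])))  # -> "sad"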
The technical problems solved by the invention are as follows: the speech emotion recognition problem is addressed with deep learning techniques. A segment of speech contains a great deal of information, such as the gender of the speaker, background noise, the content of the speech, and the emotional state of the speaker, which makes speech emotion recognition difficult and challenging. Moreover, although deep-learning-based speech emotion recognition has been studied to some extent, most of the research is based on LSTM, which itself suffers from a huge number of parameters and difficult training; in summary, existing speech emotion recognition technology still faces many problems that have not been well solved.
The invention has the advantages and beneficial effects that:
1. In tasks such as speech recognition and language modeling, FSMN can model long-term dependencies without any recurrent feedback; based on these research results, the invention provides an end-to-end feedforward deep neural network for the speech emotion recognition task. With the LSTM eliminated, the recognition speed of the model is greatly improved and the training time is effectively reduced. Compared with traditional methods, the method does not use various hand-crafted audio features as model input but directly uses the original spectrogram as input, which contains more of the original speech information, so the generalization ability of the model is stronger. At the same time, the complexity of model construction is reduced, and different input features do not need to be designed for different models.
2. Unlike most deep-learning-based speech emotion recognition research, the method does not use a recurrent neural network or its variants as the base network, but adopts a standard feedforward fully connected neural network, namely DFSMN, as the base network, and proposes a pyramid-shaped memory block structure on top of DFSMN, so that the whole model is more robust and higher-level semantic information can be extracted as the network deepens.
3. The bottom of the model of the invention consists of 2 convolutional layers instead of directly using DFSMN layers; in addition, as the network deepens, a down-sampling method is adopted to extract more robust features, and the size of the features is significantly reduced.
4. To make the model pay more attention to emotion-related information without being disturbed by other information, the invention also proposes a time-step-based attention mechanism, applied to the output of the pyramid FSMN. With the attention mechanism, each element in the output sequence depends on specific elements in the input sequence; this increases the computational burden of the model but yields a more accurate, better-performing model. The effectiveness of the end-to-end network is verified on the IEMOCAP speech emotion dataset.
5. The end-to-end deep neural network structure provided by the invention, together with the methods designed for each problem, operates effectively; experiments verify this well, showing that the model is 3.3 times faster than the original model, and analysis confirms that speech emotion recognition performance is greatly improved.
Drawings
Fig. 1 is a block diagram of a speech emotion recognition system, in which pFSMN is a pyramid FSMN.
Fig. 2a is a block diagram of the FSMN.
Fig. 2b is a block diagram of DFSMN.
FIG. 3 is a block diagram of a time step attention module.
FIG. 4 is a flow chart of a speech emotion recognition method.
Detailed Description
Examples
The present invention will be further described with reference to the following embodiments.
As shown in FIG. 1, a speech emotion recognition system includes: the system comprises an audio preprocessing module, a CNN module, a pyramid FSMN module, a time step attention module and an output module which are connected in sequence; the CNN module has a convolution layer, and the pyramid FSMN module has a pyramid memory block structure.
The speech emotion recognition system in this embodiment is an end-to-end feedforward deep neural network structure; the invention improves on the base network of the classic FSMN and DFSMN structures and adds convolutional layers to the base network to perform lower-level feature extraction.
The audio data of the invention is in the mainstream wav format with a sampling rate of 16,000 Hz. The original audio data is further divided into frames and Fourier transformed, with a frame length of 25 ms and a frame shift of 10 ms. Through this preliminary processing, the audio data is converted into a 2-dimensional spectrogram feature used as the model input. In detail, referring to fig. 1, the model is based on 2 convolutional layers instead of using DFSMN layers directly. In addition, as the network deepens, a down-sampling method is adopted to extract more robust features and significantly reduce their size, and this module clearly improves the accuracy of the model. When performing a convolution (or pooling) operation using a kernel of size k and stride s, the output size can be calculated by the following formulas:
W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

wherein W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
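The output-size formulas above can be wrapped in a small helper; the sketch below is only illustrative (the 201×300 input and the 5×5 kernel with stride 2 are assumed values, not the patented configuration):

    def conv_output_size(w_in: int, h_in: int, k: int, s: int):
        """W_out = (W_in - k) // s + 1, H_out = (H_in - k) // s + 1 (no padding)."""
        return (w_in - k) // s + 1, (h_in - k) // s + 1

    # Example: a 201 x 300 spectrogram through a 5 x 5 kernel with stride 2.
    print(conv_output_size(201, 300, k=5, s=2))  # -> (99, 148)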
As shown in FIG. 2a, FSMN is a standard feedforward fully connected neural network that adds an additional memory module to a hidden layer. Using a tapped-delay structure, for each time step h_t it encodes the N_1 forward (historical) time steps and the N_2 backward (future) time steps into a fixed-size representation and then takes their sum as the current output, as shown in the following formula:

h̃_t = f( Σ_{i=0..N_1} a_i ⊙ h_{t-i} + Σ_{j=1..N_2} b_j ⊙ h_{t+j} ),

wherein h̃_t is the output of the memory module at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step.
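A minimal PyTorch sketch of such a bidirectional memory block is given below, assuming ReLU for the arbitrary activation f and elementwise (vector) tap weights; it follows the formula above rather than the exact patented layer:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FSMNMemoryBlock(nn.Module):
        """Scalar-FSMN-style memory: weighted sum of N1 past and N2 future steps."""
        def __init__(self, dim: int, n1: int, n2: int):
            super().__init__()
            self.n1, self.n2 = n1, n2
            self.a = nn.Parameter(torch.randn(n1 + 1, dim) * 0.01)  # taps for h_t ... h_{t-N1}
            self.b = nn.Parameter(torch.randn(n2, dim) * 0.01)      # taps for h_{t+1} ... h_{t+N2}

        def forward(self, h):                         # h: (batch, time, dim)
            T = h.size(1)
            out = torch.zeros_like(h)
            for i in range(self.n1 + 1):              # past taps h_{t-i}, i = 0..N1
                out = out + self.a[i] * F.pad(h, (0, 0, i, 0))[:, :T]
            for j in range(1, self.n2 + 1):           # future taps h_{t+j}, j = 1..N2
                out = out + self.b[j - 1] * F.pad(h, (0, 0, 0, j))[:, j:]
            return torch.relu(out)                    # f(.) taken to be ReLU here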
In order to increase the depth of the FSMN, as shown in fig. 2b, and unlike the original FSMN architecture, DFSMN removes the direct feedforward connections between hidden layers and takes only the memory module output as the input of the next layer; at the same time, skip connections are introduced to overcome the vanishing and exploding gradient problems. The relationship between input and output is shown in the following formulas:

h̃_t^l = f( h̃_t^(l-1) + h_t^l + Σ_{i=0..N_1^l} a_i^l ⊙ h^l_{t-s_1·i} + Σ_{j=1..N_2^l} b_j^l ⊙ h^l_{t+s_2·j} ),

h_t^(l+1) = f( W^l · h̃_t^l + b^(l+1) ),

wherein h̃_t^l is the output of the memory block of the l-th layer at time t, f is an arbitrary activation function, h̃_t^(l-1) is the memory block output of layer l-1 at time t, h_t^l is the input of the memory block of the l-th layer at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, h^l_{t-s_1·i} is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, h^l_{t+s_2·j} is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^(l+1) is the output of the hidden layer of the (l+1)-th layer at time t, W^l is the weight parameter of the l-th layer memory block, and b^(l+1) is the bias of the (l+1)-th layer memory block.
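Extending the sketch above, one possible (assumed, not authoritative) PyTorch reading of a DFSMN-style layer with strided taps, a skip connection from the previous layer's memory output, and a linear projection is:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DFSMNLayer(nn.Module):
        """One DFSMN-style layer: strided memory taps, skip connection, projection."""
        def __init__(self, dim: int, n1: int, n2: int, s1: int = 1, s2: int = 1):
            super().__init__()
            self.n1, self.n2, self.s1, self.s2 = n1, n2, s1, s2
            self.a = nn.Parameter(torch.randn(n1 + 1, dim) * 0.01)
            self.b = nn.Parameter(torch.randn(n2, dim) * 0.01)
            self.proj = nn.Linear(dim, dim)           # W^l and b^{l+1}

        def forward(self, h, prev_mem=None):          # h: (batch, time, dim)
            T = h.size(1)
            mem = h.clone()                           # h_t^l term
            if prev_mem is not None:                  # skip connection from layer l-1
                mem = mem + prev_mem
            for i in range(self.n1 + 1):              # past taps at stride s1
                d = i * self.s1
                mem = mem + self.a[i] * F.pad(h, (0, 0, d, 0))[:, :T]
            for j in range(1, self.n2 + 1):           # future taps at stride s2
                d = j * self.s2
                mem = mem + self.b[j - 1] * F.pad(h, (0, 0, 0, d))[:, d:]
            mem = torch.relu(mem)                     # memory output of this layer
            return torch.relu(self.proj(mem)), mem    # hidden output, memory for next layer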
In the FSMN and DFSMN described above, the length of the memory module is the same in every layer, which means that in the above formulas the same N_2^l and s_2 are used in all hidden layers. In such a memory block structure, the bottom layers already extract the context information around a specific time step t, and their outputs carry that context, so repeating the same long-term context at the top layers only introduces unnecessary information. The invention proposes a pyramid memory block structure in which the model extracts more context information at deeper layers by increasing N_2^l and s_2 with depth. As a result, the bottom layers extract features from fine-grained information such as speaking rate and rhythm, while the top layers extract features from higher-level information such as emotion and gender; the pyramid memory block structure improves accuracy while reducing the number of parameters.
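Reusing the DFSMNLayer sketch above, the pyramid idea can be expressed as a stack whose backward order and stride grow with depth; the concrete orders and strides below are assumptions for illustration, not the patented settings:

    import torch
    import torch.nn as nn

    class PyramidFSMN(nn.Module):
        """Stack of DFSMN-style layers with N2 and s2 increasing with depth."""
        def __init__(self, dim: int, n1s=(4, 8, 16, 32), n2s=(4, 8, 16, 32), s2s=(1, 1, 2, 2)):
            super().__init__()
            self.layers = nn.ModuleList(
                DFSMNLayer(dim, n1, n2, s1=1, s2=s2)
                for n1, n2, s2 in zip(n1s, n2s, s2s)
            )

        def forward(self, h):                          # h: (batch, time, dim)
            mem = None
            for layer in self.layers:
                h, mem = layer(h, mem)                 # deeper layers see wider context
            return h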
The present invention also adds an attention mechanism, applied to the output of the pyramid FSMN, by which each element in the output sequence depends on specific elements in the input sequence. This increases the computational burden of the model but yields a more accurate, better-performing model. In most implementations, attention is realized as a weight vector (usually the output of a softmax function) whose dimension equals the length of the input sequence. In this embodiment, a segment of speech is divided into many small segments, referred to as time steps in the neural network. Obviously, when a speech segment contains long stretches of silence, not every time step is useful for the SER task, so the model needs to focus on specific regions. On this basis, a time step attention module is constructed, as shown in fig. 3, and it can be described by the following formulas:
a_t = Average(h_t),

s = σ( W_2 · f(W_1 · a + b_1) + b_2 ),

y = X · s,

wherein a_t is the mean value of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, σ is the softmax activation function, W_1 is the weight parameter of the first layer in the time step attention module, W_2 is the weight parameter of the second layer in the time step attention module, b_1 is the bias parameter of the first layer in the time step attention module, b_2 is the bias parameter of the second layer in the time step attention module, f is an arbitrary activation function, and a is the feature vector formed by all a_t; y is the output of the output module and X is the input to the time step attention module.
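A minimal PyTorch sketch of this time-step attention, under the assumption that the number of time steps T is fixed (so that W_1 and W_2 can act on the length-T vector a) and that f is ReLU, could look like this:

    import torch
    import torch.nn as nn

    class TimeStepAttention(nn.Module):
        """a_t = mean(h_t); s = softmax(W2 f(W1 a + b1) + b2); y = sum_t s_t x_t."""
        def __init__(self, num_steps: int, hidden: int = 64):
            super().__init__()
            self.fc1 = nn.Linear(num_steps, hidden)    # W1, b1
            self.fc2 = nn.Linear(hidden, num_steps)    # W2, b2

        def forward(self, x):                          # x: (batch, T, dim)
            a = x.mean(dim=-1)                         # per-time-step mean, (batch, T)
            s = torch.softmax(self.fc2(torch.relu(self.fc1(a))), dim=-1)
            y = torch.bmm(s.unsqueeze(1), x).squeeze(1)   # weighted sum over time, (batch, dim)
            return y, s

    # Usage: pooled, weights = TimeStepAttention(num_steps=100)(torch.randn(8, 100, 128))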
Finally, the output of the whole network is obtained through a fully connected layer, and the optimization target of the model is the standard cross-entropy loss function. The length of the model's output vector matches the number of emotion categories, the value at each position of the output vector gives the probability of the corresponding emotion category, and the category with the highest probability is selected as the output.
As shown in fig. 4, the speech emotion recognition process in this embodiment specifically includes the following steps:
step 1, an audio preprocessing module performs primary feature extraction and regularization operation on received voice to obtain a voice spectrum feature map, and inputs the voice spectrum feature map into a CNN module;
step 2, the CNN module performs convolution calculation operation on the received speech spectrum characteristic graph to construct a speech spectrum characteristic graph containing audio shallow layer information;
step 3, the pyramid FSMN module further processes the speech spectrum feature map containing the audio shallow information, and obtains deeper semantic information and context information in the speech spectrum feature map through a pyramid memory block structure, such as the gender and emotion of a speaker contained in a section of speech;
step 4, the time step attention module processes the speech spectrum characteristic graph with deeper semantic information and context information, firstly, the attention scores of different time steps are calculated, then the scores are used for carrying out weighted summation on the whole speech spectrum characteristic graph in the time step dimension, and therefore the characteristic vector with the highest emotion correlation degree between the whole section of speech and the speaker is obtained; the time step attention module can be utilized to enable the model to pay more attention to the part related to the emotion of the speaker, so that the robustness of the model is improved;
step 5, the feature vector is input into the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e., the model classifies the predicted emotion; the output module is a fully connected layer, and its output is a feature vector of length 4.
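Putting the five steps together, a hedged end-to-end sketch (reusing the PyramidFSMN and TimeStepAttention sketches above; all channel counts, kernel sizes, and the fixed 201x300 input are illustrative assumptions) might look like:

    import torch
    import torch.nn as nn

    class SERModelSketch(nn.Module):
        """CNN front end -> pyramid FSMN -> time-step attention -> FC of length 4."""
        def __init__(self, freq_bins: int = 201, time_steps: int = 300, dim: int = 128):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.BatchNorm2d(16), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.BatchNorm2d(32), nn.ReLU(),
            )
            f = ((freq_bins - 5) // 2 + 1 - 5) // 2 + 1     # frequency bins after the two convs
            t = ((time_steps - 5) // 2 + 1 - 5) // 2 + 1    # time steps after the two convs
            self.proj = nn.Linear(32 * f, dim)
            self.pfsmn = PyramidFSMN(dim)                   # sketch defined earlier
            self.attn = TimeStepAttention(num_steps=t)      # fixed-length input assumed
            self.fc = nn.Linear(dim, 4)                     # four emotion classes

        def forward(self, spec):                            # spec: (batch, 1, freq, time)
            z = self.cnn(spec)                              # (batch, 32, f, t)
            z = z.permute(0, 3, 1, 2).flatten(2)            # (batch, t, 32 * f)
            z = self.proj(z)
            z = self.pfsmn(z)
            y, _ = self.attn(z)
            return self.fc(y)                               # logits; train with cross-entropy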
According to the technical scheme, after the LSTM is eliminated, the recognition speed of the model is greatly improved and the training time is effectively reduced. In addition, unlike traditional speech emotion recognition systems, the method does not use hand-crafted features as model input but directly uses the original spectrogram as input, which contains more of the original information, so the generalization ability of the model is stronger. To make the model focus more on emotion-related information without being disturbed by other information, the invention proposes a time-step-based attention mechanism and integrates it into the model.
(1) Data sets used by the invention;
the SER model of the present invention was evaluated using an IEMOCAP corpus, which contains several sessions in each of which two participants communicate to present a particular type of emotion. These utterances are classified as anger, fear, excitement, neutrality, disgust, surprise, sadness, happiness, depression, others, and XXX. XXX is the case where the annotator cannot agree on the tag. In this example, only 5 classes of anger, excitement, happiness, neutrality and sadness were selected, and the total number of voices used was 5531. To balance the number of samples per emotion category, happy and excited emotions are merged into a happy category. In addition, 10% of the total data is randomly selected as a test object, the rest data is used as training data, and 10% of the training data is used as verification data to check whether the test object needs to be stopped in advance.
The corpus has both video and audio channels; the invention uses only the audio data. The audio was captured with a high-quality microphone (Schoeps CMIT 5U) at a sampling rate of 48 kHz. It is downsampled to 16 kHz, and a 201-dimensional acoustic feature is extracted. Unlike other technical solutions, this embodiment uses only the spectrogram as input, extracted with a 25 ms window and a 10 ms hop (100 fps). Meanwhile, each whole utterance is normalized.
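One possible front end matching these numbers (a sketch assuming librosa; the choice of log magnitude and the 201 bins via n_fft = 400 are our reading of "201-dimensional", not a statement of the patented pipeline) is:

    import numpy as np
    import librosa

    def log_spectrogram(wav_path: str) -> np.ndarray:
        """16 kHz audio, 25 ms window, 10 ms hop, 201 bins, per-utterance normalization."""
        y, sr = librosa.load(wav_path, sr=16000)
        spec = np.abs(librosa.stft(y, n_fft=400, hop_length=160, win_length=400))
        spec = np.log(spec + 1e-6)                      # log-magnitude spectrogram, (201, frames)
        return (spec - spec.mean()) / (spec.std() + 1e-6)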
(2) Describing a test process;
using the pytorech framework as a training tool, the network architecture is shown in fig. 1, two 5 × 5conv layers are used in front, and the hidden layer and memory block of the 4 FSMN blocks have 256 and 128 nodes, respectively. In order to avoid overfitting, a batch normalization layer is arranged behind the CNN layer and the pFSMN layer, the time sequence is 4 to 32, the step length is 1 to 2, the model of the embodiment is trained on the basis of an Adam optimizer of a Pythrch, the batch size is set to 32, the learning rate is fixed to 0.003, preset 4470 pieces of training audio data are used for iterative training, the effect of the model on the verification set is tested at the same time in each iteration, and the training is stopped in advance when the identification accuracy on the verification set is unchanged for 3 continuous iteration rounds. All experiments were performed on a workstation at station 1 NVIDIATITAN XP.
(3) Testing results;
to measure the performance of the system, the overall accuracy (weighted accuracy, WA) and the average recall rate (unweighted accuracy, UA) of the different mood categories for the test sample were calculated, as well as the corresponding recall rate for each category.
The test results show that, compared with the LSTM, the improved sequence model performance improves by 2.47%, indicating that the FSMN has better sequence modeling performance in this task. HSF-CRNN (Luo D, Zou Y, Huang D. Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition [C]. Proc. Interspeech 2018, 2018: 152-156.) is an improved CNN-plus-RNN method proposed by Luo that uses hand-crafted speech features as input; the model of this embodiment achieves absolute improvements of 0.53% and 3.99% on UA and WA, respectively. The experiments prove that useful information can be extracted automatically from the spectrogram without using the commonly used hand-crafted speech features. The invention also builds a baseline C-BiLSTM model for comparison; its accuracy on the "sad" samples is better than that of other methods, but its recognition accuracy on the other categories is much worse. To illustrate the working principle of the attention mechanism, a C-pFSMN model is built that is identical to the proposed model except that the attention mechanism is removed. The results show that, compared with C-pFSMN, the attention mechanism proposed by the invention performs well in the SER task, with an absolute UA improvement of 6.3%; in addition, the front-end CNN layers can extract more complex features, so the model performance improves as expected.
C-BiLSTM is constructed from 2 CNN layers and 2 Bi-LSTM layers, with 256 nodes in the hidden layer. It is similar to the model of this embodiment and is widely applied to sequence modeling tasks, so the computational resources of C-BiLSTM were also compared with those of the method of the present invention. The results show that the model of the invention has only 1.85M parameters and a training time of 64 minutes, much faster than the C-BiLSTM model. This means the invention achieves better performance while requiring fewer computing resources.
The technical scheme of the invention solves the speech emotion recognition problem well, greatly improves recognition speed, and effectively reduces training time. In addition, unlike traditional speech emotion recognition systems, the method does not use hand-crafted features as model input but directly uses the original spectrogram as input, which contains more of the original information, so the generalization ability of the model is stronger. To make the model focus more on emotion-related information without being disturbed by other information, the invention proposes a time-step-based attention mechanism and incorporates it into the model; experiments show that the model of the invention performs well while requiring fewer computing resources.
The above detailed description is specific to possible embodiments of the present invention, and the embodiments are not intended to limit the scope of the present invention, and all equivalent implementations or modifications that do not depart from the scope of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A speech emotion recognition system, comprising:
the audio frequency preprocessing module, the CNN module, the pyramid FSMN module, the time step attention module and the output module are connected in sequence, and the CNN module is provided with a convolution layer;
the audio preprocessing module converts the received original audio data into a speech spectrum characteristic diagram;
the CNN module performs primary processing on the spectrum characteristic diagram to construct a characteristic diagram containing shallow information;
the pyramid FSMN module further processes the characteristic diagram containing the shallow information to acquire deeper semantic information and context information;
the time step attention module is used for paying attention to a specific area in a time step and calculating influence weights of different time step lengths on final emotion recognition;
the output module is provided with a plurality of emotion categories and is used for outputting the emotion category which best matches the original audio data; wherein,
when the convolutional layer performs a convolution operation with a kernel of size k and stride s, the output size of the convolutional layer is calculated by the following formulas:

W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

wherein W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel;
the pyramid FSMN module has a pyramid memory block structure; the pyramid memory block structure encodes, for each time step h_t, the N_1 forward time steps and the N_2 backward time steps into a fixed-size representation, and then takes the sum of the two encodings as the current output, as shown by the following formula:

h̃_t = f( Σ_{i=0..N_1} a_i ⊙ h_{t-i} + Σ_{j=1..N_2} b_j ⊙ h_{t+j} ),

wherein h̃_t is the output of the memory module at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
the pyramid memory block structure adopts skip connections, and the input-output relationship of a layer with skip connections is shown by the following formulas:

h̃_t^l = f( h̃_t^(l-1) + h_t^l + Σ_{i=0..N_1^l} a_i^l ⊙ h^l_{t-s_1·i} + Σ_{j=1..N_2^l} b_j^l ⊙ h^l_{t+s_2·j} ),

h_t^(l+1) = f( W^l · h̃_t^l + b^(l+1) ),

wherein h̃_t^l is the output of the memory block of the l-th layer at time t, f is an arbitrary activation function, h̃_t^(l-1) is the memory block output of layer l-1 at time t, h_t^l is the input of the memory block of the l-th layer at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, h^l_{t-s_1·i} is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, h^l_{t+s_2·j} is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^(l+1) is the output of the hidden layer of the (l+1)-th layer at time t, W^l is the weight parameter of the l-th layer memory block, and b^(l+1) is the bias of the (l+1)-th layer memory block.
2. The speech emotion recognition system of claim 1, wherein the time step attention module is specifically represented by the following formulas:

a_t = Average(h_t),

s = σ( W_2 · f(W_1 · a + b_1) + b_2 ),

y = X · s,

wherein a_t is the mean value of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, σ is the softmax activation function, W_1 is the weight parameter of the first layer in the time step attention module, W_2 is the weight parameter of the second layer in the time step attention module, b_1 is the bias parameter of the first layer in the time step attention module, b_2 is the bias parameter of the second layer in the time step attention module, f is an arbitrary activation function, and a is the feature vector formed by all a_t; y is the output of the output module and X is the input to the time step attention module.
3. The speech emotion recognition system of any one of claims 1 to 2, wherein the convolutional layer is two layers.
4. The speech emotion recognition system of any one of claims 1 to 2, wherein the shallow information is audio loudness or frequency.
5. The speech emotion recognition system of any one of claims 1 to 2, wherein the plurality of emotion categories are four emotion categories.
6. The speech emotion recognition system of claim 5, wherein the four emotion categories are happiness, sadness, anger, and neutrality.
7. A speech emotion recognition method applied to the speech emotion recognition system of claim 1, characterized by comprising the steps of:
step 1, an audio preprocessing module performs primary feature extraction and regularization operation on received voice to obtain a voice spectrum feature map, and inputs the voice spectrum feature map into a CNN module;
step 2, the CNN module performs convolution calculation operation on the received speech spectrum characteristic graph to construct a speech spectrum characteristic graph containing audio shallow layer information;
step 3, the pyramid FSMN module further processes the speech spectrum characteristic diagram containing the audio shallow layer information, and obtains semantic information and context information of a deeper layer in the speech spectrum characteristic diagram through a pyramid memory block structure;
step 4, the time step attention module processes the speech spectrum characteristic graph with deeper semantic information and context information, firstly, the attention scores of different time steps are calculated, then the scores are used for carrying out weighted summation on the whole speech spectrum characteristic graph in the time step dimension, and therefore the characteristic vector with the highest emotion correlation degree between the whole section of speech and the speaker is obtained;
and 5, inputting the feature vector into an output module, wherein the dimensionality of the feature vector represents the probability of the corresponding emotion type, and the emotion type corresponding to the dimensionality with the highest probability is taken as a final output result, so that the emotion type corresponding to the whole voice is output.
8. The method of claim 7, wherein in step 5, the output module is a fully-connected layer, and the fully-connected layer outputs a feature vector with a length of 4.
CN201910803429.7A 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method Active CN110534133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910803429.7A CN110534133B (en) 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910803429.7A CN110534133B (en) 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method

Publications (2)

Publication Number Publication Date
CN110534133A CN110534133A (en) 2019-12-03
CN110534133B true CN110534133B (en) 2022-03-25

Family

ID=68664896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910803429.7A Active CN110534133B (en) 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method

Country Status (1)

Country Link
CN (1) CN110534133B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143567B (en) * 2019-12-30 2023-04-07 成都数之联科技股份有限公司 Comment emotion analysis method based on improved neural network
CN111539458B (en) * 2020-04-02 2024-02-27 咪咕文化科技有限公司 Feature map processing method and device, electronic equipment and storage medium
CN112053007B (en) * 2020-09-18 2022-07-26 国网浙江兰溪市供电有限公司 Distribution network fault first-aid repair prediction analysis system and method
CN112634947B (en) * 2020-12-18 2023-03-14 大连东软信息学院 Animal voice and emotion feature set sequencing and identifying method and system
CN113255800B (en) * 2021-06-02 2021-10-15 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090063202A (en) * 2009-05-29 2009-06-17 포항공과대학교 산학협력단 Method for apparatus for providing emotion speech recognition
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090063202A (en) * 2009-05-29 2009-06-17 포항공과대학교 산학협력단 Method for apparatus for providing emotion speech recognition
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DEEP-FSMN FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION; Shiliang Zhang et al.; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018-09-30; pp. 5869-5873 *
Research on multimodal emotion recognition methods based on deep learning; Zhang Yuanyuan; China Master's Theses Full-text Database, Information Science and Technology; 2019-08-15 (No. 08); pp. 23-25 *
Speech emotion recognition algorithm based on spectrograms for extracting deep spatial attention features; Wang Jinhua et al.; Telecommunications Science; 2019-03-18; Vol. 35 (No. 07); pp. 100-108 *

Also Published As

Publication number Publication date
CN110534133A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN110534133B (en) Voice emotion recognition system and voice emotion recognition method
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
CN112784798B (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
Wang et al. Data augmentation using deep generative models for embedding based speaker recognition
Sultana et al. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
Han et al. Speech emotion recognition with a ResNet-CNN-Transformer parallel neural network
Liu et al. Learning salient features for speech emotion recognition using CNN
Kaur et al. Speech recognition system; challenges and techniques
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
Barkur et al. EnsembleWave: an ensembled approach for automatic speech emotion recognition
Ashrafidoost et al. Recognizing Emotional State Changes Using Speech Processing
Utomo et al. Spoken word and speaker recognition using MFCC and multiple recurrent neural networks
CN117115312B (en) Voice-driven facial animation method, device, equipment and medium
Djeffal et al. Noise-Robust Speech Recognition: A Comparative Analysis of LSTM and CNN Approaches
Kaewprateep et al. Evaluation of small-scale deep learning architectures in Thai speech recognition
CN114863939B (en) Panda attribute identification method and system based on sound
Kalita et al. Use of Bidirectional Long Short Term Memory in Spoken Word Detection with reference to the Assamese language
Mirhassani et al. Fuzzy decision fusion of complementary experts based on evolutionary cepstral coefficients for phoneme recognition
Song et al. Speech emotion recognition and intensity estimation
Ma et al. Fine-grained Dynamical Speech Emotion Analysis Utilizing Networks Customized for Acoustic Data
Ashrafidoost et al. A Method for Modelling and Simulation the Changes Trend of Emotions in Human Speech
Saini et al. Audio emotion recognition using machine learning
Zhang et al. Spoken emotion recognition using radial basis function neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant