CN112885378B - Speech emotion recognition method and device and storage medium - Google Patents

Speech emotion recognition method and device and storage medium

Info

Publication number
CN112885378B
Authority
CN
China
Prior art keywords
neural network
deep neural
training
voice emotion
deep
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110086550.XA
Other languages
Chinese (zh)
Other versions
CN112885378A (en)
Inventor
刘振焘
吴保晗
佘锦华
吴敏
熊永华
周莉
赵兴旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN202110086550.XA priority Critical patent/CN112885378B/en
Publication of CN112885378A publication Critical patent/CN112885378A/en
Application granted granted Critical
Publication of CN112885378B publication Critical patent/CN112885378B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention provides a speech emotion recognition method, a speech emotion recognition device and a storage medium, and relates to the technical field of artificial intelligence. Current speech emotion recognition methods fall mainly into two categories: methods based on deep neural networks and traditional machine learning methods, with deep learning remaining the mainstream. Deep-learning-based methods mainly extract features from preprocessed voice signals, feed them into a deep neural network for training, and then classify them with methods such as support vector machines and decision trees. These methods have their advantages, but in practical applications they require a large amount of labeled data; when sample data are insufficient, overfitting is likely to occur and accurate recognition becomes impossible. The provided method effectively alleviates the overfitting caused by an insufficient number of samples in the deep neural network and improves training efficiency and accuracy.

Description

Speech emotion recognition method and device and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a speech emotion recognition method, a speech emotion recognition device and a storage medium.
Background
Currently, the detection information applied to human emotion recognition research includes voice, facial expressions, physiological signals, body language, and the like. Voice is the fastest and most natural means of communication between people, so research on voice emotion recognition is significant for promoting harmonious human-computer interaction.
With the advent of large-scale labeled data sets and deep neural network structures, deep learning algorithms have achieved relatively high recognition accuracy in speech emotion recognition. However, these achievements all rely on training deep models on large amounts of labeled data to iteratively update model parameters. The real-world environment is often more complex than published experimental data sets; constructing a speech data set that covers the complete sample distribution requires a great deal of manpower and financial resources for data collection and labeling, and for some languages it is difficult to collect sufficient corpora.
Disclosure of Invention
The technical problem solved by the present disclosure is how to train a model with a small amount of sample data and how to mitigate the overfitting caused by an insufficient number of samples in a deep neural network.
According to one aspect of the present disclosure, there is provided a speech emotion recognition method including: acquiring a large-scale voice emotion data set; constructing a deep reinforcement learning model; inputting the large-scale voice emotion data set into the deep reinforcement learning model to pre-train a deep neural network; migrating the pre-trained deep neural network to a meta-learning model to form a deep neural network model; acquiring a small sample voice emotion data set; inputting the small sample voice emotion data set into the deep neural network model to meta-train the deep neural network; and performing a meta-test on the meta-trained deep neural network model and outputting a speech emotion recognition result.
In some embodiments, constructing the deep reinforcement learning model comprises: the deep reinforcement learning framework comprises an agent and an environment, and the deep neural network is embedded into the agent to form the deep reinforcement learning model.
in some embodiments, inputting the large-scale speech emotion data set into the deep reinforcement learning model to pre-train the deep neural network comprises: acquiring a first voice emotion signal in a large-scale voice emotion data set; preprocessing the first voice emotion signal; extracting a first voice emotion characteristic from the preprocessed first voice emotion signal; and inputting the first speech emotion characteristics into a deep reinforcement learning model for training.
In some embodiments, inputting a small sample speech emotion data set into the deep neural network model to meta-train the deep neural network model comprises: replacing the first classification layer of the deep neural network model with a second classification layer that conforms to the small sample data categories; dividing the small sample voice emotion data set into a training set and a test set; acquiring a second voice emotion signal in the training set and the test set; preprocessing the second voice emotion signal; extracting a second voice emotion feature from the preprocessed second voice emotion signal; and inputting the second speech emotion features extracted from the training set into the deep neural network model to meta-train the deep neural network.
In some embodiments, the recognition method comprises: dividing the training set into a first support set and a first query set; updating the parameters of the second classification layer by using the loss values obtained on the first support set, and updating the scaling and shifting parameters by using the loss values obtained on the first query set; dividing the test set into a second support set and a second query set, and fine-tuning the parameters of the second classification layer by using the second support set; and outputting a speech emotion recognition result by using the second query set and evaluating the deep neural network model.
According to another aspect of the present disclosure, there is provided a speech emotion recognition apparatus including: the deep reinforcement learning module is used for inputting a large-scale speech emotion data set and pre-training a deep neural network; the transfer learning module is used for transferring the pre-trained deep neural network to the meta-learning model to form a deep neural network model; and the deep neural network module is used for inputting a small sample voice emotion data set to perform meta-training on the deep neural network model, performing meta-testing on the meta-trained deep neural network model, and outputting a voice emotion recognition result.
In some embodiments, the deep reinforcement learning module mainly comprises an agent and an environment, wherein the deep neural network is embedded in the agent; the module obtains a first voice emotion signal in the large-scale voice emotion data set, preprocesses the first voice emotion signal, extracts a first voice emotion feature from the preprocessed first voice emotion signal, and inputs the first voice emotion feature into the deep reinforcement learning framework for training.
In some embodiments, the deep neural network module is configured to: replacing the first classification layer of the deep neural network model with a second classification layer which conforms to the small sample data category; dividing a small sample speech emotion data set into a training set and a test set; acquiring a second voice emotion signal in the training set and the test set; preprocessing the second voice emotion signal; extracting a second voice emotion characteristic from the preprocessed second voice emotion signal; and inputting the second speech emotion characteristics extracted from the training set into the deep neural network model to perform meta-training on the deep neural network model.
In some embodiments, the deep neural network module is configured to: dividing the training set into a first support set and a first query set; updating the parameters of the second classification layer by using the loss values obtained on the first support set, and updating the scaling and shifting parameters by using the loss values obtained on the first query set; dividing the test set into a second support set and a second query set, and fine-tuning the parameters of the second classification layer by using the second support set; and outputting a speech emotion recognition result by using the second query set and evaluating the deep neural network model.
According to another aspect of the disclosure, a storage medium having computer instructions stored thereon is provided, wherein the computer instructions, when executed by a processor, implement a speech emotion recognition method.
The technical scheme provided by the invention has the following beneficial effects:
1. The deep neural network is pre-trained with reinforcement learning on large-scale voice emotion data so that it better adapts to the sample environment, which improves training efficiency and accuracy.
2. The reinforcement-pre-trained neural network is migrated to small-sample learning, so fast convergence can be achieved with fewer tasks, effectively alleviating the overfitting caused by an insufficient number of samples in the deep neural network.
3. During meta-learning, only a small number of parameters of the pre-trained neural network are updated, which avoids the problem of "catastrophic forgetting" when learning a specific new task.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a speech emotion recognition method in an embodiment of the present invention;
FIG. 2 is a diagram illustrating a deep reinforcement learning model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a speech emotion recognition apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only some embodiments of the present disclosure, not all embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without undue experimentation, are within the scope of the present disclosure.
Some embodiments of the present disclosure are described in detail below in conjunction with fig. 1-3.
Fig. 1 illustrates the recognition method of some embodiments of the present disclosure.
As shown in fig. 1, the method comprises the steps of:
Step S101, constructing a deep reinforcement learning model, wherein:
The reinforcement learning framework mainly comprises an agent and an environment, and a deep neural network is embedded into the agent part of the reinforcement learning framework to form the deep reinforcement learning model.
the learning framework is represented by a Markov decision model, which can be represented as a quadruple:
(S,A,P,R)
wherein S represents a state space, A represents a motion space, P represents a state transition strategy, and R represents a reward function;
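For illustration, the quadruple can be held in a simple structure; the following is a minimal sketch in Python, where the concrete types and the emotion-label action space are assumptions, since the patent names only the four components:

    from dataclasses import dataclass
    from typing import Callable, Sequence

    @dataclass
    class MDP:
        states: Sequence        # S: state space
        actions: Sequence       # A: action space, e.g. candidate emotion labels
        transition: Callable    # P: state transition strategy P(s' | s, a)
        reward: Callable        # R: reward function R(s, a)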
the action decision is given by the "agent", which uses the following reward function to obtain the reward w t
Figure BDA0002910969800000041
Wherein x is t Is the basic truth value of the speech emotional characteristic, and the following formula is adopted to calculate the action selection probability n t
n t =Softmax(Q a gS t +p b )
Wherein n is t Is the action selection probability, Q a Is the weight, p, of the deep neural network b Is a heavy weight, S t Is the output of the previous hidden layer.
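A minimal sketch of this action-selection step, assuming Q_a, S_t, and p_b are NumPy arrays and that an action is sampled from n_t (the shapes and the sampling are illustrative assumptions, not given by the patent):

    import numpy as np

    def softmax(z):
        z = z - np.max(z)                  # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def select_action(Q_a, S_t, p_b, rng=np.random.default_rng()):
        """Sample an action from n_t = Softmax(Q_a . S_t + p_b)."""
        n_t = softmax(Q_a @ S_t + p_b)     # action selection probabilities
        action = rng.choice(len(n_t), p=n_t)
        return action, n_t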
Step S102, inputting the large-scale speech emotion data set into the deep reinforcement learning model to pre-train the deep neural network, wherein:
acquiring an input speech signal x(t) from the speech database, and preprocessing the speech signal x(t), wherein the preprocessing comprises pre-emphasis, framing and windowing;
performing a discrete Fourier transform on the preprocessed speech signal to obtain the discrete spectrum X(k), where the transform formula is:
X(k) = Σ_{n=0}^{N-1} x(n) · e^(-j2πnk/N)
wherein x(n) represents the preprocessed speech signal, X(k) represents the discretely transformed speech signal, N is the number of discrete points, n = 0, 1, …, N-1, and k = 0, 1, …, N-1;
inputting the spectrum obtained by the discrete Fourier transform into a Mel filter bank, and taking the logarithm to obtain the following log spectrum:
S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² · H_m(k) ),  0 ≤ m < M
wherein S(m) is the log spectrum, H_m(k) is the transfer function of the m-th triangular filter, M is the number of triangular filters (generally 24 to 40), and f(m) is the center frequency of the m-th filter;
applying a discrete cosine transform to map the log spectrum into the cepstral domain, obtaining the Mel-frequency cepstral coefficient signal:
C(n) = Σ_{m=0}^{M-1} S(m) · cos( πn(m + 0.5) / M )
wherein C(n) represents the Mel-frequency cepstral coefficients;
and sending the obtained Mel-frequency cepstral coefficient speech emotion features into the deep reinforcement learning model for pre-training.
Step S103, migrating the pre-trained deep neural network to the meta-learning model to form the deep neural network model. The classification layer of the deep neural network model is determined by the number of sample classes in the large-scale voice emotion data set; since the classes of the large-scale voice emotion data set and of the small-sample voice emotion data set are not necessarily the same, the classification layer of the deep neural network model needs to be replaced with a classification layer that conforms to the small-sample classes.
Step S104, inputting the small sample speech emotion data set into the deep neural network model to meta-train the deep neural network model, wherein:
(1) Firstly, dividing the small-sample speech data into a training set and a test set in a set proportion, for example 7:1 or 3:1, wherein the training set is further divided into a support set and a query set. The specific division process is as follows:
Assume the small-sample emotion speech data set contains k1 classes of samples. First, n1 samples are randomly drawn from each of the k1 classes in the training set as the support set; then x1 samples are randomly drawn from the remaining samples as the query set. The support set and the query set of the test set are divided in the same way as in the training set. This constructs one k-way n-shot task, i.e., completes one episode of meta-training; several episodes can be constructed according to the amount of data in the training set, with each episode corresponding to one task (see the code sketch after step (3) below).
(2) Then, preprocessing and voice emotion feature extraction are carried out on the small sample voice emotion data set by using the method in the step S102;
(3) Inputting the small sample speech emotion data set into the deep neural network model to meta-train the deep neural network model, wherein:
the loss values obtained on the support set of the training set are used to update the parameters of the replaced classification layer with a gradient descent method, and the loss values obtained on the query set of the training set are used to update the scaling and shifting parameters, so that most of the parameters in the network remain unchanged;
Each task is meta-trained in the deep neural network, and the loss values of the support set and the query set are obtained according to the cross-entropy loss function:
L = -Σ_{c=1}^{O} y_ic · log(p_ic)
wherein O is the number of sample classes; y_ic is an indicator variable (0 or 1) that equals 1 if class c is the true class of sample i and 0 otherwise; and p_ic is the predicted probability that observation sample i belongs to class c;
Let the loss value of the support set be L_s and the loss value of the query set be L_q. L_s is used to update and optimize the parameters θ of the replaced classification layer with a gradient descent algorithm:
θ ← θ - β · ∇_θ L_s
wherein β is the learning rate;
L_q is used to update and optimize the scaling and shifting parameters Φ_1 and Φ_2 with a gradient descent algorithm, wherein the scaling parameter Φ_1 is initialized to 1 and the shifting parameter Φ_2 is initialized to 0:
Φ_i ← Φ_i - γ · ∇_{Φ_i} L_q,  i = 1, 2
wherein γ is the learning rate;
The scaling and shifting operation allows the parameters of the pre-trained neurons to be kept fixed, so that only the classification-layer parameters and the scaling/shifting parameters need to be updated in the whole meta-training process, and an update of all the network parameters is avoided:
SS(X; W, b) = (W ⊙ Φ_1) · X + (b + Φ_2)
where X is the input, W and b are the frozen pre-trained weights and bias, and ⊙ denotes element-wise multiplication of the array elements.
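The sketch below ties steps (1) and (3) together: episode construction and one meta-training episode, written in PyTorch. The single frozen linear layer, the ReLU, the layer sizes, and the learning rates β and γ are illustrative assumptions; only the update rule (L_s updates the classification layer, L_q updates Φ_1 and Φ_2 while the pre-trained W and b stay frozen) follows the description above.

    import random
    import torch
    import torch.nn.functional as F

    feat_dim, hid_dim, k_way = 39, 128, 5                 # illustrative sizes
    W = torch.randn(hid_dim, feat_dim)                    # frozen pre-trained weights
    b = torch.zeros(hid_dim)                              # frozen pre-trained bias
    Phi1 = torch.ones(hid_dim, feat_dim, requires_grad=True)    # scaling, init 1
    Phi2 = torch.zeros(hid_dim, requires_grad=True)             # shifting, init 0
    Wc = torch.randn(k_way, hid_dim, requires_grad=True)        # replaced classifier
    bc = torch.zeros(k_way, requires_grad=True)

    def forward(X):
        h = F.relu(X @ (W * Phi1).T + (b + Phi2))   # SS(X) = (W ⊙ Φ1)·X + (b + Φ2)
        return h @ Wc.T + bc                        # second classification layer

    def sample_episode(data_by_class, n_shot, n_query):
        """Step (1): build one k-way n-shot task with per-class support/query samples."""
        support, query = [], []
        for label, samples in data_by_class.items():
            drawn = random.sample(samples, n_shot + n_query)
            support += [(s, label) for s in drawn[:n_shot]]
            query += [(s, label) for s in drawn[n_shot:]]
        return support, query

    def meta_train_episode(sx, sy, qx, qy, beta=0.01, gamma=0.001):
        """Step (3): L_s updates the classifier; L_q updates Phi1 and Phi2."""
        L_s = F.cross_entropy(forward(sx), sy)            # support loss L_s
        g_Wc, g_bc = torch.autograd.grad(L_s, [Wc, bc])
        with torch.no_grad():                             # θ ← θ - β·∇θ L_s
            Wc.sub_(beta * g_Wc)
            bc.sub_(beta * g_bc)
        L_q = F.cross_entropy(forward(qx), qy)            # query loss L_q
        g1, g2 = torch.autograd.grad(L_q, [Phi1, Phi2])
        with torch.no_grad():                             # Φ ← Φ - γ·∇Φ L_q
            Phi1.sub_(gamma * g1)
            Phi2.sub_(gamma * g2)
        return L_s.item(), L_q.item()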
Step S105, performing a meta-test on the deep neural network model after meta-training, and outputting the speech emotion recognition result, including:
because several episodes optimize the whole neural network model during meta-training, the model's ability to adapt quickly to unknown tasks needs to be tested; the meta-test mainly comprises the following two parts:
1. fine-tuning the parameters of the second classification layer by using the support set of the test set, i.e., first computing the cross-entropy loss function on the support set and then updating a small number of parameters in the neural network with a gradient descent algorithm;
2. outputting the final recognition result by using the query set and evaluating the whole model.
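A matching sketch of this meta-test, reusing forward, Wc, and bc from the meta-training sketch above; the number of fine-tuning steps and the learning rate are assumptions:

    def meta_test(sx, sy, qx, qy, beta=0.01, steps=10):
        """Fine-tune the classification layer on the test support set, then
        recognize emotions on the test query set and report accuracy."""
        for _ in range(steps):
            loss = F.cross_entropy(forward(sx), sy)       # cross-entropy on support
            g_Wc, g_bc = torch.autograd.grad(loss, [Wc, bc])
            with torch.no_grad():                         # update only Wc, bc
                Wc.sub_(beta * g_Wc)
                bc.sub_(beta * g_bc)
        with torch.no_grad():
            pred = forward(qx).argmax(dim=1)              # predicted emotion labels
            accuracy = (pred == qy).float().mean().item()
        return pred, accuracy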
Some embodiments of the disclosed speech emotion recognition apparatus based on reinforced meta-transfer learning are described below with reference to fig. 3.

Fig. 3 is a schematic structural diagram of a speech emotion recognition apparatus based on reinforced meta-transfer learning according to some embodiments of the present disclosure.
The device 30 comprises:
the deep reinforcement learning module 301 is used for inputting a large-scale speech emotion data set and pre-training a deep neural network; a migration learning module 302, configured to migrate the pre-trained deep neural network to a meta learning model to form a deep neural network model; and the deep neural network module 303 is configured to input the small sample speech emotion data set to perform meta-training on the deep neural network model, perform meta-testing on the deep neural network model subjected to the meta-training, and output a speech emotion recognition result.
In some embodiments, the deep reinforcement learning module 301 mainly comprises an agent 304, in which a deep neural network is embedded, and an environment 305; the module acquires a first voice emotion signal in the large-scale voice emotion data set, preprocesses the first voice emotion signal, extracts a first voice emotion feature from the preprocessed first voice emotion signal, and inputs the first voice emotion feature into the deep reinforcement learning module for training.
In some embodiments, the deep neural network module 303 is configured to: replacing the first classification layer of the deep neural network model with a second classification layer that conforms to a small sample data category; dividing a small sample voice emotion data set into a training set and a test set; acquiring a second voice emotion signal in the training set and the test set; preprocessing the second voice emotion signal; extracting a second voice emotion characteristic from the preprocessed second voice emotion signal; and inputting the second speech emotion characteristics extracted from the training set into the deep neural network model to perform meta-training on the deep neural network model.
In some embodiments, the deep neural network module 303 is further configured to: dividing the training set into a first support set and a first query set; updating the parameters of the second classification layer by using the loss values obtained on the first support set, and updating the scaling and shifting parameters by using the loss values obtained on the first query set; dividing the test set into a second support set and a second query set, and fine-tuning the parameters of the second classification layer by using the second support set; and outputting a speech emotion recognition result by using the second query set and evaluating the deep neural network model.
The present disclosure further includes a computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the speech emotion recognition method based on reinforced meta-transfer learning of any of the foregoing embodiments.
The present disclosure is described in terms of flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the accompanying drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, such that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The above description covers only the preferred embodiments of the present disclosure and is not intended to limit the present disclosure; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present disclosure are intended to fall within its scope of protection.

Claims (6)

1. A speech emotion recognition method includes:
constructing a deep reinforcement learning model;
inputting a large-scale voice emotion data set into a deep reinforcement learning model to pre-train a deep neural network; wherein the pre-training comprises acquiring a first speech emotion signal in a large-scale speech emotion data set;
preprocessing the first voice emotion signal;
extracting a first voice emotion feature from the preprocessed first voice emotion signal;
inputting the first voice emotion feature into the deep reinforcement learning model for training;
migrating the pre-trained deep neural network to a meta-learning model to form a deep neural network model;
inputting a small sample voice emotion data set into the deep neural network model to perform meta-training on the pre-trained deep neural network; wherein the meta-training comprises replacing a first classification layer of the deep neural network model with a second classification layer that conforms to a small sample data category;
dividing the small sample voice emotion data set into a training set and a test set;
acquiring a second voice emotion signal in the training set and the test set;
preprocessing the second voice emotion signal;
extracting a second voice emotion feature from the preprocessed second voice emotion signal;
inputting the second speech emotion characteristics extracted from the training set into the deep neural network model to perform meta-training on the deep neural network model;
and performing a meta-test on the deep neural network model after the meta-training, and outputting a speech emotion recognition result.
2. The method for recognizing speech emotion according to claim 1, wherein said constructing a deep reinforcement learning model includes:
the reinforcement learning framework mainly comprises an intelligent agent and an environment, and the deep neural network is embedded into the intelligent agent to form a deep reinforcement learning model.
3. A speech emotion recognition method as claimed in claim 1, comprising:
dividing the training set into a first support set and a first query set;
and updating the parameters of the second classification layer by using the loss values obtained by the first support set, and updating the scaling and shifting parameters by using the loss values obtained by the first query set.
4. A speech emotion recognition method as claimed in claim 3, wherein meta-testing said deep neural network model after said meta-training comprises:
dividing the test set into a second support set and a second query set;
fine-tuning parameters of the second classification layer by using the second support set;
and outputting a speech emotion recognition result by utilizing the second query set and evaluating the deep neural network model.
5. A speech emotion recognition device based on reinforced meta-transfer learning, comprising:
the deep reinforcement learning module is used for inputting a large-scale speech emotion data set and pre-training a deep neural network; wherein the deep reinforcement learning module is configured to:
the reinforcement learning module mainly comprises an intelligent agent and an environment, and the deep neural network is embedded into the intelligent agent to form a deep reinforcement learning module;
acquiring a first voice emotion signal in a large-scale voice emotion data set, preprocessing the first voice emotion signal, extracting a first voice emotion feature from the preprocessed first voice emotion signal, and inputting the first voice emotion feature into the deep reinforcement learning module for pre-training;
the transfer learning module is used for transferring the pre-trained deep neural network to a meta-learning model to form a deep neural network model;
the deep neural network module is used for inputting a small sample voice emotion data set to perform meta-training on the deep neural network, performing a meta-test on the deep neural network model after the meta-training, and outputting a voice emotion recognition result;
wherein the deep neural network module is configured to:
replacing the first classification layer of the deep neural network module with a second classification layer that conforms to a small sample data category;
dividing the small sample voice emotion data set into a training set and a test set;
acquiring a second voice emotion signal in the training set and the test set;
preprocessing the second voice emotion signal;
extracting a second voice emotion feature from the preprocessed second voice emotion signal;
inputting the second speech emotion features extracted from the training set into the deep neural network module to perform meta-training on the deep neural network module;
dividing the training set into a first support set and a first query set;
updating the parameters of the second classification layer by using the loss values obtained by the first support set, and updating the scaling and shifting parameters by using the loss values obtained by the first query set;
dividing the test set into a second support set and a second query set;
fine-tuning parameters of the second classification layer by using the second support set;
and outputting a speech emotion recognition result by utilizing the second query set and evaluating the deep neural network model.
6. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, are adapted to implement the steps of the method of any of claims 1-4.
CN202110086550.XA 2021-01-22 2021-01-22 Speech emotion recognition method and device and storage medium Active CN112885378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110086550.XA CN112885378B (en) 2021-01-22 2021-01-22 Speech emotion recognition method and device and storage medium


Publications (2)

Publication Number Publication Date
CN112885378A CN112885378A (en) 2021-06-01
CN112885378B true CN112885378B (en) 2023-03-24

Family

ID=76050003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110086550.XA Active CN112885378B (en) 2021-01-22 2021-01-22 Speech emotion recognition method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112885378B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113611326B (en) * 2021-08-26 2023-05-12 中国地质大学(武汉) Real-time voice emotion recognition method and device


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210019642A1 (en) * 2019-07-17 2021-01-21 Wingman AI Agents Limited System for voice communication with ai agents in an environment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018236674A1 (en) * 2017-06-23 2018-12-27 Bonsai Al, Inc. For hiearchical decomposition deep reinforcement learning for an artificial intelligence model
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Speech emotion recognition based on parameter transfer and convolutional recurrent neural networks; Miao Yuqing et al.; Computer Engineering and Applications; 2019-05-15 (No. 10); full text *

Also Published As

Publication number Publication date
CN112885378A (en) 2021-06-01

Similar Documents

Publication Publication Date Title
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
CN107239529B (en) Public opinion hotspot category classification method based on deep learning
Wang et al. Research on Web text classification algorithm based on improved CNN and SVM
CN109766277A (en) A kind of software fault diagnosis method based on transfer learning and DNN
CN109523994A (en) A kind of multitask method of speech classification based on capsule neural network
CN111354338B (en) Parkinson speech recognition system based on PSO convolution kernel optimization sparse transfer learning
CN113887643B (en) New dialogue intention recognition method based on pseudo tag self-training and source domain retraining
CN110992988B (en) Speech emotion recognition method and device based on domain confrontation
CN115689008A (en) CNN-BilSTM short-term photovoltaic power prediction method and system based on ensemble empirical mode decomposition
CN112580588A (en) Intelligent flutter signal identification method based on empirical mode decomposition
CN111461201A (en) Sensor data classification method based on phase space reconstruction
CN108009571A (en) A kind of semi-supervised data classification method of new direct-push and system
CN110853630A (en) Lightweight speech recognition method facing edge calculation
CN111950295A (en) Method and system for training natural language processing model
CN113673242A (en) Text classification method based on K-neighborhood node algorithm and comparative learning
CN112885378B (en) Speech emotion recognition method and device and storage medium
CN113516097B (en) Plant leaf disease identification method based on improved EfficentNet-V2
CN114329124A (en) Semi-supervised small sample classification method based on gradient re-optimization
CN110532380A (en) A kind of text sentiment classification method based on memory network
CN106448660A (en) Natural language fuzzy boundary determining method with introduction of big data analysis
CN116050419B (en) Unsupervised identification method and system oriented to scientific literature knowledge entity
CN114757310B (en) Emotion recognition model and training method, device, equipment and readable storage medium thereof
CN116013407A (en) Method for generating property decoupling protein based on language model
CN113297376A (en) Legal case risk point identification method and system based on meta-learning
CN112863549A (en) Voice emotion recognition method and device based on meta-multitask learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant