CN116524962A - Speech emotion recognition method based on Conformer structure and multitask learning framework - Google Patents
- Publication number
- CN116524962A CN116524962A CN202310552901.0A CN202310552901A CN116524962A CN 116524962 A CN116524962 A CN 116524962A CN 202310552901 A CN202310552901 A CN 202310552901A CN 116524962 A CN116524962 A CN 116524962A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a speech emotion recognition method based on a Conformer structure and a multi-task learning framework, comprising the following steps. S1, extracting features from the audio data of the training set and the verification set in a speech emotion data set to generate triplet samples, wherein each triplet sample comprises FBank features, an emotion label and a text label. S2, constructing a network model for multi-task learning, wherein the network model comprises a shared encoder, an emotion recognition decoder and a speech recognition decoder, and the shared encoder comprises a Conformer model for deep feature extraction. S3, training the network model with the triplet samples of the training set. S4, tuning the parameters of the network model with the triplet samples of the verification set. S5, inputting the audio data to be recognized into the network model to obtain both speech recognition and emotion recognition results for that audio. The method builds a multi-task learning model on the Conformer structure and can improve the generalization performance and recognition accuracy of the model.
Description
Technical Field
The invention relates to the technical field of computer voice recognition and voice emotion recognition, in particular to a voice emotion recognition method based on a Conformer structure and a multi-task learning framework.
Background
In recent years, with the development of the internet and artificial intelligence, human-computer interaction applications have become widespread. Among the many interaction modalities, voice interaction is one of the most common and direct. The first step of human-computer voice interaction is for the computer to recognize and understand human speech, and speech recognition is one of the principal means by which a computer performs semantic understanding. In practice, however, the text obtained from speech recognition alone is not sufficient for semantic understanding: people generally speak with emotion, and the same sentence uttered with different emotions may express entirely different meanings, so emotion information is also important for semantic understanding. Introducing speech emotion recognition into human-computer voice interaction provides emotion information for downstream semantic understanding tasks and reduces, as far as possible, the computer's misunderstanding of the user.
Early speech emotion recognition models were mostly based on traditional machine learning algorithms, most typically hidden Markov models, Gaussian mixture models and support vector machines. These models are trained to learn the relationship between speech features and emotion classes, but they perform poorly in both recognition accuracy and robustness. With the development of deep learning and the growth of computing power in recent years, deep neural network models have been widely applied in the speech field, and research shows that deep-learning-based speech emotion recognition models outperform traditional machine learning models.
At present, many application scenarios combine speech recognition with speech emotion recognition, including intelligent customer service, intelligent voice assistants, intelligent healthcare and intelligent cockpits. Integrated voice products on smart terminals, such as XiaoAI and Baidu's Xiaodu, incorporate multiple speech technologies including speech recognition, speech emotion recognition and speech synthesis. Most applications, however, run speech recognition and speech emotion recognition as two separate models, which leads to system redundancy and wasted computational resources.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a speech emotion recognition method based on a Conformer structure and a multi-task learning framework.
The invention provides a voice emotion recognition method based on a Conformer structure and a multitask learning framework, which comprises the following steps:
S1, acquiring a speech emotion data set, and extracting features from the audio data of the training set and the verification set in the speech emotion data set to generate triplet samples, wherein each triplet sample comprises FBank features, an emotion label and a text label;
S2, constructing a network model for multi-task learning, wherein the network model comprises a shared encoder, an emotion recognition decoder and a speech recognition decoder; the shared encoder comprises a Conformer model for deep feature extraction; the emotion recognition decoder comprises an averaging layer and a linear fully-connected layer and is used for outputting an emotion vector; the speech recognition decoder comprises an averaging layer and a CTC decoder and is used for outputting a text vector;
s3, training the network model by using the triplet sample corresponding to the training set;
s4, carrying out parameter adjustment on the network model by using the triplet sample corresponding to the verification set;
s5, inputting the audio data to be identified into the network model to realize voice identification and emotion identification of the audio data to be identified.
By means of the above steps, a multi-task network model that performs speech recognition and emotion recognition simultaneously is built. Combining the speech recognition subtask with the emotion recognition subtask exploits the correlation between the tasks, so that additional information is learned during training and the performance of each subtask improves. Because the shared encoder contains a Conformer model for deep feature extraction, the feature expression capability of the network model is effectively improved, raising the accuracy of both speech recognition and emotion recognition.
Further, the Conformer model comprises a plurality of stacked Conformer modules, wherein each Conformer module is a Transformer module into which a CNN module has been introduced.
Hidden-layer features are generated by the stacked Conformer modules. Each Conformer module combines the CNN module's ability to capture local features and its time-domain invariance with the Transformer module's advantage in long-range modeling, linking similar extracted features together, so that the Conformer model can both exploit local features and establish the relationships between them.
Further, the Conformer module comprises a first feedforward module, a self-attention module, a convolution module, a second feedforward module and a normalization module, and the calculation of the Conformer module is as follows:

x̃_i = x_i + ½FFN(x_i)

x′_i = x̃_i + MHSA(x̃_i)

x″_i = x′_i + Conv(x′_i)

y_i = LN(x″_i + ½FFN(x″_i))

wherein x_i is the input of the first feedforward module, y_i is the output of the normalization module, FFN is a feedforward module, MHSA is the self-attention module, Conv is the convolution module, and LN is the normalization module.
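The four equations above can be sketched as a single PyTorch module. This is a minimal sketch, not the patent's implementation: the model dimension, number of attention heads and convolution kernel size below are illustrative choices, and the sub-module internals follow the standard Conformer design.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN sub-module: LayerNorm, expansion, activation, projection back."""
    def __init__(self, dim, mult=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * mult), nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(dim * mult, dim), nn.Dropout(dropout))

    def forward(self, x):
        return self.net(x)

class ConvModule(nn.Module):
    """Conv sub-module: pointwise conv + GLU, depthwise conv, pointwise conv."""
    def __init__(self, dim, kernel=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                               # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)                # -> (batch, dim, time)
        y = nn.functional.glu(self.pw1(y), dim=1)
        y = nn.functional.silu(self.bn(self.dw(y)))
        return self.drop(self.pw2(y)).transpose(1, 2)

class ConformerBlock(nn.Module):
    """FFN (half-step) -> MHSA -> Conv -> FFN (half-step) -> LayerNorm,
    each sub-module wrapped in a residual connection."""
    def __init__(self, dim=144, heads=4, kernel=31, dropout=0.1):
        super().__init__()
        self.ffn1 = FeedForward(dim, dropout=dropout)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.conv = ConvModule(dim, kernel, dropout)
        self.ffn2 = FeedForward(dim, dropout=dropout)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)                          # x~ = x + 1/2 FFN(x)
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # x' = x~ + MHSA(x~)
        x = x + self.conv(x)                                # x'' = x' + Conv(x')
        return self.out_norm(x + 0.5 * self.ffn2(x))        # y = LN(x'' + 1/2 FFN(x''))

block = ConformerBlock(dim=144, heads=4)
block.eval()                                # keep BatchNorm deterministic for a demo
with torch.no_grad():
    out = block(torch.randn(2, 50, 144))    # (batch=2, frames=50, dim=144)
```

Because every sub-module is residual, the block preserves the (batch, time, dim) shape, which is what lets such blocks be stacked (15 deep in the embodiment) with further residual connections between them.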
Further, the first feedforward module, the self-attention module, the convolution module and the second feedforward module in the Conformer module are connected through residual connections.
Further, the plurality of stacked Conformer modules are connected through residual connections.
Further, the shared encoder includes a convolutional downsampling layer, a fully-connected linear layer, a Dropout layer and the Conformer model.
Further, the step S3 includes: and calculating a loss function of the network model, wherein the loss function comprises emotion recognition task loss, voice recognition task loss and sentence embedding loss.
The generalization of the speech emotion recognition model can be improved by introducing sentence embedding loss.
Further, calculating the loss function of the network model includes: processing the triplet sample with the network model to obtain the emotion vector f_e(X_n), the text vector f_t(X_n) and the shared encoder output g(X_n), wherein X_n denotes the FBank feature vector, and calculating the loss function L = L_1 + αL_2 + βL_3, wherein α and β are hyper-parameters;

emotion recognition task loss L_1 = (1/N)Σ_n MSELoss(f_e(X_n), e_n), wherein MSELoss denotes the mean square error loss function and e_n denotes the emotion label in the triplet sample;

speech recognition task loss L_2 = (1/N)Σ_n CTCLoss(f_t(X_n), t_n), wherein CTCLoss denotes the CTC loss function and t_n denotes the text label in the triplet sample;

sentence embedding loss L_3 = (1/N)Σ_n CrossEntropy(Pool(g(X_n)), Pool(f_t(X_n))), wherein CrossEntropy denotes the cross-entropy loss function, Pool denotes an average pooling calculation, and N is the number of triplet samples; the sentence embedding loss constrains the integrity of the deep information in the speech at the sentence level.
The integrity of emotion information at a sentence level is constrained through sentence embedding loss, so that the recognition accuracy and generalization of the model are further improved.
Further, the step S1 includes: changing the speed of the audio data by a speed factor, adding noise interference to the audio data, and adding the resulting new audio data to the training set.
The data volume of the original audio data can be increased by the method.
The invention also provides an electronic device comprising a memory, a processor and a computer program, wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the speech emotion recognition method based on the Conformer structure and the multi-task learning framework.
The present invention also provides a computer-readable storage medium having a computer program stored thereon; the computer program is executed by the processor to implement the speech emotion recognition method based on the Conformer structure and the multitasking learning framework.
The beneficial effects of the invention are as follows: according to the invention, a multi-task network model for simultaneously carrying out voice recognition and emotion recognition is constructed, and the voice recognition subtasks and the emotion recognition subtasks are combined to utilize the correlation between tasks, so that additional information is learned during model training, and the performance of each subtask is improved. The shared encoder comprises a Conformer model for deep feature extraction, so that the feature expression capability of the network model can be effectively improved, and the precision of voice recognition and emotion recognition can be improved.
Drawings
FIG. 1 is a training flow chart of the network model of the present invention;
FIG. 2 is a block diagram of a Conformer module of the present invention;
FIG. 3 is a block diagram of a shared encoder of the present invention;
FIG. 4 is a flow chart of speech characterization extraction in accordance with the present invention;
FIG. 5 is a training flow chart of the emotion recognition decoder of the present invention;
fig. 6 is a training flow chart of the speech recognition decoder of the present invention.
Detailed Description
In order to make the technical problems, technical schemes and beneficial effects to be solved by the present application more clear, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The environment adopted in this embodiment is a GeForce GTX Titan X GPU, an Intel Core i7-5930K 3.50 GHz CPU, 64 GB RAM and a Linux operating system; development is carried out in Python with the open-source library PyTorch.
A speech emotion recognition method based on a Conformer structure and a multitask learning framework comprises the following steps:
step S1: the voice emotion data set is divided into a training set, a testing set and a verification set, the audio data of the training set and the verification set are subjected to feature extraction, a triplet sample is generated, the triplet sample comprises FBank features, emotion tags and text tags, and fig. 4 is a flow chart of voice characterization extraction.
This embodiment uses the open-source data set IEMOCAP (Interactive Emotional Dyadic Motion Capture), an emotion recognition data set consisting of 151 recorded conversational videos with 2 speakers per conversation, for a total of 302 videos across the whole data set. Each segment is annotated with one of 9 emotion classes (anger, excitement, fear, sadness, surprise, frustration, happiness, disgust and neutral) and the text corresponding to the speech; four emotion classes (anger, sadness, happiness and neutral) are selected as training data. The data set is integrated and divided into training, validation and test sets, with the data sample distribution shown in Table 1 below.
Then, extracting the characteristics of the data set, wherein the specific operation is as follows:
framing, pre-emphasis and windowing the original audio: in the embodiment, the original wav file is divided into a plurality of small fragments with fixed length by using a frame window size of 25ms and a frame displacement size of 10ms; enhancing the signal of the high frequency part of each frame of the voice signal through pre-emphasis to improve the resolution of the high frequency signal; the time domain signal can better meet the periodicity requirement of Fast Fourier Transform (FFT) through windowing operation, and frequency leakage is reduced.
The speech signal is a non-stationary, time-varying signal, but over a short interval it can be regarded as stationary and time-invariant. This short interval is typically 10 ms to 30 ms, so to reduce the effect of the overall non-stationarity during speech signal processing, the signal is segmented; each segment is called a frame, and the frame length is typically 25 ms. To ensure smooth transitions and continuity between frames, overlapping segmentation is generally used so that adjacent frames partially overlap; the time difference between the start positions of two adjacent frames is called the frame shift, and the frame shift is typically 10 ms. During transmission, high-frequency components attenuate more easily, and the pronunciation of certain phonemes contains a larger proportion of high-frequency components; losing these components makes the formants of those phonemes less distinct and weakens the acoustic model's ability to model them. Pre-emphasis is a first-order high-pass filter that increases the energy of the high-frequency part of the signal. In the real world it is impossible to acquire a signal extending from −∞ to +∞; only signals of finite duration can be acquired. Because a framed signal is non-periodic, frequency leakage occurs after the FFT; to minimize this leakage error, a weighting function, also called a window function, is applied. Windowing mainly makes the time-domain signal better satisfy the periodicity assumption of FFT processing and reduces frequency leakage.
Extracting FBank features as input to the model: in this embodiment, a fast Fourier transform is performed on the speech signal to obtain its frequency-domain representation, and the linear frequency f is converted to the Mel frequency of the cepstral domain by the formula:

Mel(f) = 2595·log₁₀(1 + f/700)
setting 80 triangular band-pass filters with equal bandwidth in the Mel frequency spectrum range, filtering the spectrum characteristics after discrete Fourier transformation to obtain 80 filter bank energies, and obtaining 80-dimensional Fbank characteristics after log operation.
The acoustic features are then augmented: this embodiment perturbs the speed of the original audio with factors of 0.9 and 1.1, expanding the amount of original audio data to 3 times. In addition, noise interference is added to the original audio, and the resulting new audio is added to the training set. This process may use the open-source speech processing framework Kaldi, the audio processing tool librosa, or similar tools.
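The augmentation step can be sketched in NumPy. Both helpers below are illustrative stand-ins, not the patent's implementation: linear-interpolation resampling replaces the Kaldi/sox speed perturbation, and white Gaussian noise at a chosen SNR replaces the unspecified noise interference.

```python
import numpy as np

def speed_perturb(signal, factor):
    """Resample by linear interpolation so the audio plays `factor` times
    faster (a simple stand-in for Kaldi/sox speed perturbation)."""
    n_out = int(round(len(signal) / factor))
    return np.interp(np.arange(n_out) * factor,
                     np.arange(len(signal)), signal)

def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(signal))
    scale = np.sqrt(np.mean(signal ** 2) /
                    (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

sr = 16000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s of a 440 Hz tone
# original + 0.9x + 1.1x speed copies -> 3x the original data amount
train_audio = [sig, speed_perturb(sig, 0.9), speed_perturb(sig, 1.1)]
train_audio.append(add_noise(sig, snr_db=20))        # plus a noisy copy
```

The 0.9x copy is longer and the 1.1x copy shorter than the original, which is why augmented utterances must carry their own frame counts downstream.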
Finally, the features are processed into data files for model training: after feature extraction, the feature files are stored on disk, each associated with its emotion label, and the feature file paths together with their corresponding labels are collected into a text file. This text file contains the triplet samples and serves as the input for training the network model.
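The triplet data file can be sketched as a tab-separated manifest, one sample per line. The file names and labels below are invented for illustration; the patent does not specify a file format.

```python
import os
import tempfile

def write_manifest(samples, path):
    """One line per triplet: <FBank feature file path> TAB <emotion label>
    TAB <text label>."""
    with open(path, "w", encoding="utf-8") as f:
        for feat_path, emotion, text in samples:
            f.write(f"{feat_path}\t{emotion}\t{text}\n")

def read_manifest(path):
    """Parse the manifest back into (feature path, emotion, text) triplets."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t", 2)) for line in f]

samples = [
    ("feats/ses01_f_001.npy", "angry",   "why did you do that"),
    ("feats/ses01_f_002.npy", "neutral", "i am going home"),
]
manifest = os.path.join(tempfile.mkdtemp(), "train.tsv")
write_manifest(samples, manifest)
triplets = read_manifest(manifest)
```

A training loader would then read each feature file by path and pair it with the emotion and text labels from the same line.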
And S2, constructing a network model for multi-task learning.
As shown in fig. 3, the shared encoder includes a convolutional downsampling layer, a linear layer, a Dropout layer and a Conformer model, which together form the shared encoder of the multi-task model. This embodiment stacks 15 Conformer modules to generate hidden features from the acoustic features.
The construction method of the Conformer model specifically comprises the following steps:
the Conformer module is used for connecting the extracted similar features together by introducing the CNN module into the transducer module and utilizing the capturing capacity of the CNN module on the local features and the invariance characteristic of the CNN module on the time domain and the modeling advantage of the transducer module in a long distance, so that the Conformer model can utilize the local features and establish the relationship between the local features, and further deep feature extraction is realized.
In this embodiment, the Conformer module places a self-attention module and a convolution module between the first feedforward module and the second feedforward module, and finally applies normalization through the normalization module; the first four modules, i.e. the first feedforward module, the self-attention module, the convolution module and the second feedforward module, are connected through residual connections. The Conformer module is calculated as follows:

x̃_i = x_i + ½FFN(x_i)

x′_i = x̃_i + MHSA(x̃_i)

x″_i = x′_i + Conv(x′_i)

y_i = LN(x″_i + ½FFN(x″_i))

wherein x_i is the input of the first feedforward module, y_i is the output of the normalization module, FFN is a feedforward module, MHSA is the self-attention module, Conv is the convolution module, and LN (layer normalization) is the normalization module.
The constructed Conformer modules are combined into a Conformer encoder, with residual connections between the Conformer modules.
And S3, training the network model by utilizing the triplet samples corresponding to the training set.
By using the network model, the voice recognition task and the emotion recognition task are respectively carried out based on the Conformer model, and the method specifically comprises the following steps:
as shown in fig. 5 and 6, the emotion recognition decoder includes an averaging layer and a linear full-connection layer for outputting emotion vectors; the speech recognition decoder includes an averaging layer and a CTC decoder for outputting text vectors.
As shown in fig. 1, each triplet sample is processed by the shared encoder to obtain a speech representation, which is then processed by the emotion recognition decoder, the speech recognition decoder and average pooling to obtain the emotion vector f_e(X_n), the text vector f_t(X_n) and the shared encoder output g(X_n).
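The two decoder heads can be sketched in PyTorch as follows. This is a hedged sketch: the encoder dimension, number of emotion classes and vocabulary size are illustrative choices, the averaging layer is realized as a mean over the time axis, and the CTC decoder is reduced to the per-frame projection that feeds a CTC loss or beam search.

```python
import torch
import torch.nn as nn

ENC_DIM, N_EMOTIONS, VOCAB = 256, 4, 29   # illustrative sizes, not from the patent

class EmotionDecoder(nn.Module):
    """Averaging layer over time + linear fully-connected layer -> emotion vector."""
    def __init__(self, dim, n_emotions):
        super().__init__()
        self.fc = nn.Linear(dim, n_emotions)

    def forward(self, g):                  # g: (batch, frames, dim), i.e. g(X_n)
        return self.fc(g.mean(dim=1))      # (batch, n_emotions)

class SpeechDecoder(nn.Module):
    """Per-frame projection to the vocabulary (+1 for the CTC blank) with
    log-softmax, suitable as input to a CTC loss or CTC decoder."""
    def __init__(self, dim, vocab):
        super().__init__()
        self.proj = nn.Linear(dim, vocab + 1)

    def forward(self, g):
        return self.proj(g).log_softmax(dim=-1)   # (batch, frames, vocab+1)

g = torch.randn(2, 50, ENC_DIM)               # stand-in for the shared encoder output
f_e = EmotionDecoder(ENC_DIM, N_EMOTIONS)(g)  # emotion vector f_e(X_n)
f_t = SpeechDecoder(ENC_DIM, VOCAB)(g)        # per-frame text posteriors f_t(X_n)
```

The emotion head collapses time before classifying, while the speech head keeps the frame axis, since CTC needs a per-frame distribution.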
Emotion recognition task loss L_1 = (1/N)Σ_n MSELoss(f_e(X_n), e_n), wherein MSELoss denotes the mean square error loss function and e_n denotes the emotion label in the triplet sample.

Speech recognition task loss L_2 = (1/N)Σ_n CTCLoss(f_t(X_n), t_n), wherein CTCLoss denotes the CTC loss function and t_n denotes the text label in the triplet sample.

A sentence embedding loss is applied to the output of the speech recognition task: L_3 = (1/N)Σ_n CrossEntropy(Pool(g(X_n)), Pool(f_t(X_n))), wherein CrossEntropy denotes the cross-entropy loss function, Pool denotes an average pooling calculation, and N is the number of triplet samples; the sentence embedding loss constrains the integrity of the deep information in the speech at the sentence level.

Total loss function L = L_1 + αL_2 + βL_3, wherein α and β are hyper-parameters.
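The three loss terms and their combination can be sketched as follows. The exact form of the sentence embedding term is not fully fixed by the text, so the linear projection `proj` that makes the pooled encoder output comparable to the pooled ASR output is an assumption of this sketch, as are all the sizes and the α, β values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, D, E, V = 2, 50, 256, 4, 30   # batch, frames, encoder dim, emotions, vocab incl. blank
alpha, beta = 1.0, 0.1              # hyper-parameter values chosen for illustration

emo_pred = torch.randn(B, E).softmax(-1)                # f_e(X_n), emotion vector
emo_label = F.one_hot(torch.tensor([0, 2]), E).float()  # e_n as a one-hot target
ctc_logp = torch.randn(B, T, V).log_softmax(-1)         # f_t(X_n), per-frame log-probs
targets = torch.randint(1, V, (B, 5))                   # t_n (index 0 reserved for blank)
enc_out = torch.randn(B, T, D)                          # g(X_n), shared encoder output
proj = nn.Linear(D, V)                                  # assumed projection for the sentence term

# L1: emotion recognition loss, mean square error against the emotion label
L1 = F.mse_loss(emo_pred, emo_label)

# L2: speech recognition loss; F.ctc_loss expects (T, B, V) log-probabilities
L2 = F.ctc_loss(ctc_logp.transpose(0, 1), targets,
                input_lengths=torch.full((B,), T),
                target_lengths=torch.full((B,), 5))

# L3: sentence embedding loss, a cross-entropy between the average-pooled
# encoder output and the average-pooled ASR output at sentence level
# (this pooling/projection scheme is our reading of the patent text)
sent_asr = ctc_logp.exp().mean(dim=1)                  # Pool(f_t(X_n)) as probabilities
sent_enc = proj(enc_out.mean(dim=1)).log_softmax(-1)   # Pool(g(X_n)) projected to vocab
L3 = -(sent_asr * sent_enc).sum(dim=-1).mean()

L = L1 + alpha * L2 + beta * L3     # total loss, back-propagated through both branches
```

A single backward pass through `L` updates the shared encoder from all three terms, which is what realizes the joint optimization of the two tasks.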
And S4, carrying out parameter adjustment on the network model by using the triplet sample corresponding to the verification set.
Joint optimization of the two tasks is achieved through gradient computation on the total loss function L. During training, hyper-parameters such as the batch size, the number of network layers and the learning rate can be adjusted to fine-tune the model. This embodiment obtains the following set of hyper-parameter values through training and tuning:
and predicting emotion types corresponding to the audio in the test set by using the trained network, and calculating F1-score (a commonly used classification problem evaluation index, wherein the larger the numerical value is, the better the performance is) according to the real label of the test set.
To evaluate the effectiveness of the method of the present invention, this embodiment compares its recognition performance with several state-of-the-art multi-task speech emotion recognition methods, including MTL with self-attention, MTL with wav2vec, and TIM-Net. These models are trained on the same training set under the same training environment and conditions, and the trained models' F1-scores are calculated on the IEMOCAP test set. To compare the generalization of the different methods, noise is also added to the test-set audio, yielding "noiseless" and "noisy" test conditions; the specific test results are shown in the following table.
The encoder in the network model extracts and outputs an intermediate-layer embedding of the input speech data, and a back-end classifier then identifies the corresponding emotion class from that embedding. For a multi-task learning model containing a speech recognition task and an emotion recognition task, the embedding obtained from the encoder must contain both semantic information and emotion information, and the model must generalize well enough to recognize data from different scenarios. The invention provides an improved encoder based on the Conformer structure, introducing residual connections and a Dropout layer so that more global information is retained when emotion information is extracted, thereby improving the generalization performance and recognition accuracy of the model.
In practical applications, speech emotion recognition is used together with speech recognition in many scenarios, so the two tasks can be combined to exploit their correlation. Based on multi-task learning theory, the speech recognition (ASR) and speech emotion recognition (SER) tasks are combined into one multi-task learning model: a hidden-layer embedding is obtained through the front-end shared encoder and fed simultaneously into the speech recognition module and the speech emotion recognition module. During training, the information shared between the tasks helps the network learn better internal representations, so the shared encoder learns more shared features and the performance of each task improves. In addition, the sentence embedding loss computes the distance between the hidden-layer embedding and the output of the speech recognition module, constraining the sentence-level integrity of the information in the hidden-layer embedding.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (10)
1. A speech emotion recognition method based on a Conformer structure and a multitask learning framework, characterized in that it comprises the following steps:
S1, acquiring a speech emotion data set, and performing feature extraction on the audio data of the training set and the verification set in the speech emotion data set to generate triplet samples, wherein each triplet sample comprises FBank features, an emotion label and a text label;
S2, constructing a network model for multi-task learning, wherein the network model comprises a shared encoder, an emotion recognition decoder and a speech recognition decoder; the shared encoder comprises a Conformer model for deep feature extraction; the emotion recognition decoder comprises an averaging layer and a linear fully-connected layer and is used for outputting an emotion vector; the speech recognition decoder comprises an averaging layer and a CTC decoder and is used for outputting a text vector;
S3, training the network model by using the triplet samples corresponding to the training set;
S4, adjusting the parameters of the network model by using the triplet samples corresponding to the verification set;
S5, inputting the audio data to be recognized into the network model to perform speech recognition and emotion recognition on the audio data to be recognized.
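The flow of steps S1–S5 can be sketched as a minimal skeleton. Every helper below is a hypothetical placeholder (real FBank extraction involves framing, windowing, an FFT and Mel filter banks; the real model is the Conformer network of step S2):

```python
# Skeleton of steps S1-S5; each helper is a toy placeholder, not the
# patent's implementation.

def extract_fbank(audio):
    # Stand-in for FBank feature extraction (step S1).
    return [[float(a)] for a in audio]

def model(fbank):
    # Stand-in for the shared encoder plus two decoders of step S2:
    # returns (emotion scores, text scores).
    flat = [x for frame in fbank for x in frame]
    avg = sum(flat) / len(flat)
    return [avg, 1.0 - avg], [0.5 * avg]

# S1: build triplet samples (FBank features, emotion label, text label).
train_set = [(extract_fbank([0.2, 0.4]), "happy", "hi")]

# S3 / S4: training on train_set and tuning on a verification set would
# update the model parameters here (omitted in this sketch).

# S5: inference on new audio yields both recognition outputs at once.
emotion_scores, text_scores = model(extract_fbank([0.1, 0.3]))
```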
2. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 1, characterized in that: the Conformer model comprises a plurality of stacked Conformer modules, wherein each Conformer module is formed by introducing a CNN module into a Transformer module.
3. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 2, characterized in that: the Conformer module comprises a first feed-forward module, a self-attention module, a convolution module, a second feed-forward module and a normalization module, and the calculation formulas of the Conformer module are as follows:

x̃_i = x_i + (1/2)FFN(x_i)

x′_i = x̃_i + MHSA(x̃_i)

x″_i = x′_i + Conv(x′_i)

y_i = LN(x″_i + (1/2)FFN(x″_i))

where x_i is the input of the first feed-forward module, y_i is the output of the normalization module, FFN denotes the feed-forward module, MHSA denotes the self-attention module, Conv denotes the convolution module, and LN denotes the normalization module.
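The residual structure of one Conformer block can be sketched in a few lines. The sub-modules `ffn`, `mhsa`, `conv` and `ln` below are simple placeholders for the trained layers, and the half-step feed-forward residuals follow the standard Conformer formulation:

```python
# Toy sketch of one Conformer block's residual structure; ffn, mhsa and
# conv are placeholder scalings, ln is a simple layer normalization.

def ffn(v):  return [0.1 * x for x in v]   # placeholder feed-forward module
def mhsa(v): return [0.2 * x for x in v]   # placeholder self-attention module
def conv(v): return [0.3 * x for x in v]   # placeholder convolution module

def ln(v):
    # Layer normalization: zero mean, unit variance over the vector.
    m = sum(v) / len(v)
    var = sum((x - m) ** 2 for x in v) / len(v)
    return [(x - m) / (var + 1e-5) ** 0.5 for x in v]

def conformer_block(x):
    x1 = [a + 0.5 * b for a, b in zip(x, ffn(x))]          # x~ = x + 1/2 FFN(x)
    x2 = [a + b for a, b in zip(x1, mhsa(x1))]             # x' = x~ + MHSA(x~)
    x3 = [a + b for a, b in zip(x2, conv(x2))]             # x'' = x' + Conv(x')
    return ln([a + 0.5 * b for a, b in zip(x3, ffn(x3))])  # y = LN(x'' + 1/2 FFN(x''))

y = conformer_block([1.0, 2.0, 3.0])
```

Each step adds its sub-module's output back onto its input, which is exactly the residual connection recited in claim 4.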
4. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 3, characterized in that: the first feed-forward module, the self-attention module, the convolution module and the second feed-forward module in the Conformer module are connected through residual connections.
5. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 2, characterized in that: the plurality of stacked Conformer modules are connected through residual connections.
6. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 1, characterized in that: the shared encoder comprises a convolutional downsampling layer, a fully-connected layer, a Dropout layer and the Conformer model.
7. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 1, characterized in that: step S3 comprises: calculating a loss function of the network model, wherein the loss function comprises an emotion recognition task loss, a speech recognition task loss and a sentence embedding loss.
8. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 7, characterized in that: calculating the loss function of the network model comprises: passing the triplet samples through the network model to obtain an emotion vector f_e(X_n), a text vector f_t(X_n) and a shared-encoder output g(X_n), where X_n denotes the FBank feature vector, and calculating the loss function L = L_1 + αL_2 + βL_3, where α and β are hyper-parameters;

the emotion recognition task loss is L_1 = (1/N) Σ_{n=1}^{N} MSELoss(f_e(X_n), e_n), where MSELoss denotes the mean-square-error loss function and e_n denotes the emotion label in the n-th triplet sample;

the speech recognition task loss is L_2 = (1/N) Σ_{n=1}^{N} CTCLoss(f_t(X_n), t_n), where CTCLoss denotes the CTC loss function and t_n denotes the text label in the n-th triplet sample;

the sentence embedding loss is L_3 = (1/N) Σ_{n=1}^{N} CrossEntropy(Pool(g(X_n)), Pool(f_t(X_n))), where CrossEntropy denotes the cross-entropy loss function, Pool denotes the average pooling calculation, and N is the number of triplet samples; the sentence embedding loss is used to constrain, at sentence level, the completeness of the deep information in the speech.
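The combined loss L = L_1 + αL_2 + βL_3 can be illustrated numerically with a toy sketch. The per-task losses below are simplified stand-ins: a plain mean-square error replaces both the CTC loss (which is far more involved) and the cross-entropy term, and all tensors are small hand-picked lists:

```python
# Toy sketch of the weighted multi-task loss of claim 8, with MSE used
# as a simplified stand-in for all three per-task loss functions.

def mse(pred, target):
    # Mean-square error between two equal-length vectors.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pool(frames):
    # Average pooling over the time axis (the Pool(.) in claim 8).
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

alpha, beta = 0.5, 0.1                     # hypothetical hyper-parameters

emotion_pred, emotion_label = [0.8, 0.2], [1.0, 0.0]
asr_pred, asr_label = [0.4, 0.6], [0.0, 1.0]
encoder_out = [[0.1, 0.3], [0.5, 0.7]]     # shared-encoder output g(X_n)
asr_out = [[0.2, 0.2], [0.4, 0.8]]         # ASR-branch output f_t(X_n)

L1 = mse(emotion_pred, emotion_label)        # emotion task loss
L2 = mse(asr_pred, asr_label)                # stand-in for the CTC loss
L3 = mse(pool(encoder_out), pool(asr_out))   # sentence embedding loss
L = L1 + alpha * L2 + beta * L3
```

Here L3 compares the time-pooled encoder output against the time-pooled ASR output, mirroring how the sentence embedding loss ties the hidden-layer embedding to the speech recognition branch at sentence level.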
9. An electronic device, characterized by comprising a memory, a processor and a computer program, wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the speech emotion recognition method based on a Conformer structure and a multitask learning framework of any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, the computer program being executed by a processor to implement the speech emotion recognition method based on a Conformer structure and a multitask learning framework of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310552901.0A CN116524962A (en) | 2023-05-17 | 2023-05-17 | Speech emotion recognition method based on Conformer structure and multitask learning framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524962A true CN116524962A (en) | 2023-08-01 |
Family
ID=87390192
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524962A (en) |
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116796250A (en) * | 2023-08-22 | 2023-09-22 | Jinan University | Intelligent identification and separation method and system for aliased wireless signals |
CN116796250B (en) * | 2023-08-22 | 2024-03-08 | Jinan University | Intelligent identification and separation method and system for aliased wireless signals |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||