CN114495915A - Voice emotion recognition model training method, emotion recognition method, device and equipment - Google Patents

Voice emotion recognition model training method, emotion recognition method, device and equipment

Info

Publication number
CN114495915A
Authority
CN
China
Prior art keywords: feature, audio, layer, emotion, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210153023.0A
Other languages
Chinese (zh)
Inventor
赵情恩
梁芸铭
张银辉
熊新雷
周羊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210153023.0A priority Critical patent/CN114495915A/en
Publication of CN114495915A publication Critical patent/CN114495915A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a speech emotion recognition model training method, an emotion recognition method, a device, and equipment. It relates to the field of artificial intelligence, and in particular to intelligent speech recognition and intelligent emotion recognition. The specific implementation scheme is as follows: obtain a first feature and a second feature of a sample audio, where the first feature characterizes features related to the waveform of the sample audio and the second feature characterizes features related to the speaker of the sample audio; perform emotional feature decoupling using the first feature and the second feature; and perform emotion recognition training using the decoupled emotional features to obtain a trained speech emotion recognition model. Because emotion recognition training is performed on the decoupled emotional features, the trained speech emotion recognition model can recognize emotion more accurately.

Description

Voice emotion recognition model training method, emotion recognition method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, particularly to the fields of intelligent speech recognition and intelligent emotion recognition, and more particularly to a speech emotion recognition model training method, an emotion recognition method, a device, and equipment.
Background
Speech is an important carrier of emotion in human communication. Speech recognition is mainly concerned with what the speaker says, while emotion recognition is mainly concerned with the mood in which the speaker is speaking. People express speech differently in different emotional states; for example, the tone of speech tends to be more cheerful when a person is happy and duller when a person is sad.
Deep learning techniques have accelerated progress in detecting emotion from speech, but research in this area is still insufficient. A difficulty of speech emotion detection is that the emotion expressed in a sentence may be perceived differently from person to person. Different people may perceive different emotions in the same speech, and there are certain cultural differences, so the accuracy of speech emotion recognition is not high.
Disclosure of Invention
The disclosure provides a speech emotion recognition model training method, an emotion recognition method, a device, equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a speech emotion recognition model training method, including:
obtaining a first feature and a second feature of a sample audio, wherein the first feature is used for characterizing features related to the waveform of the sample audio, and the second feature is used for characterizing features related to a speaker of the sample audio;
performing emotional feature decoupling using the first feature and the second feature;
and performing emotion recognition training by using the emotion characteristics obtained by decoupling to obtain a trained voice emotion recognition model.
According to another aspect of the present disclosure, there is provided an emotion recognition method including:
acquiring a first feature and a second feature of the audio to be recognized, wherein the first feature is used for characterizing the feature related to the waveform of the audio to be recognized, and the second feature is used for characterizing the feature related to a speaker of the audio to be recognized;
inputting the first characteristic and the second characteristic into a speech emotion recognition model for emotion category recognition to obtain a first recognition result;
the speech emotion recognition model is obtained by training through the speech emotion recognition model training method in the embodiment of the disclosure.
According to another aspect of the present disclosure, there is provided a speech emotion recognition model training apparatus including:
an obtaining module, configured to obtain a first feature and a second feature of a sample audio, where the first feature is used to characterize a feature related to a waveform of the sample audio, and the second feature is used to characterize a feature related to a speaker of the sample audio;
a decoupling module for performing emotional feature decoupling using the first feature and the second feature;
and the training module is used for performing emotion recognition training by using the emotion characteristics obtained by decoupling to obtain a trained voice emotion recognition model.
According to another aspect of the present disclosure, there is provided an emotion recognition apparatus including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a first characteristic and a second characteristic of the audio to be recognized, the first characteristic is used for characterizing the characteristic related to the waveform of the audio to be recognized, and the second characteristic is used for characterizing the characteristic related to the speaker of the audio to be recognized;
the first recognition module is used for inputting the first characteristic and the second characteristic into the speech emotion recognition model for emotion category recognition to obtain a first recognition result;
the speech emotion recognition model is obtained by training with the speech emotion recognition model training apparatus in the embodiment of the disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
According to the method, emotion recognition training is performed by adopting the emotion characteristics obtained through decoupling, and the trained voice emotion recognition model can perform emotion recognition more accurately.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow diagram of a speech emotion recognition model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram of a speech emotion recognition model training method according to another embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram of a speech emotion recognition model training method according to another embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram of a speech emotion recognition model training method according to another embodiment of the present disclosure;
FIG. 5 is a schematic flow chart diagram of a speech emotion recognition model training method according to another embodiment of the present disclosure;
FIG. 6 is a flow diagram of a method of emotion recognition according to an embodiment of the present disclosure;
FIG. 7 is a flow diagram of a method of emotion recognition according to another embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a speech emotion recognition model training apparatus according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a speech emotion recognition model training apparatus according to another embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a speech emotion recognition model training apparatus according to another embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of a speech emotion recognition model training apparatus according to another embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of an emotion recognition apparatus according to an embodiment of the present disclosure;
FIG. 13 is a schematic structural diagram of an emotion recognition apparatus according to another embodiment of the present disclosure;
FIG. 14 is a schematic diagram of an example speech emotion recognition procedure;
FIG. 15 is a schematic diagram of a speech emotion recognition framework according to an embodiment of the present disclosure;
FIG. 16 is a schematic diagram of a speech emotion recognition flow according to an embodiment of the present disclosure;
FIG. 17 is a schematic block diagram of an example electronic device used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic flow chart of a speech emotion recognition model training method according to an embodiment of the present disclosure, which includes:
s101, obtaining a first feature and a second feature of a sample audio, wherein the first feature is used for characterizing features related to the waveform of the sample audio, and the second feature is used for characterizing features related to a speaker of the sample audio;
s102, conducting emotional characteristic decoupling by utilizing the first characteristic and the second characteristic;
s103, emotion recognition training is carried out by utilizing the emotion characteristics obtained through decoupling, and a trained voice emotion recognition model is obtained.
In the disclosed embodiments, the emotion of the speaker may be reflected in the audio, for example angry, happy, neutral, sad, excited, fearful, and so on. The mood may also be referred to as emotion. The speaker is typically a human being, but may also be another type of speaker, such as an AI robot. Audio may contain various types of features. In the embodiments of the present disclosure, the first feature may represent features related to the waveform of the audio itself, and may also be referred to as a waveform feature or sound-wave feature. The second feature may characterize speaker-dependent features in the audio, and may also be referred to as a speaker feature, a speaker cepstral feature, or the like. According to the embodiments of the present disclosure, emotional features can be decoupled from the first feature and the second feature of the sample audio, and the decoupled emotional features are purer. Performing emotion recognition training with the decoupled emotional features allows the trained speech emotion recognition model to recognize emotion more accurately. By decoupling the emotional features from the speaker features and the content features, purer emotional features are obtained; the trained speech emotion recognition model is then used for recognition and classification, and the emotion category of the audio is obtained in combination with the text recognition result, so that the judgment result is more accurate.
The speech emotion recognition model training method in the embodiment of the application can be executed by a terminal, a server or other processing equipment in a single-machine, multi-machine or cluster system. The terminal may include, but is not limited to, a user device, a mobile device, a personal digital assistant, a handheld device, a computing device, an in-vehicle device, a wearable device, and the like. The server may include, but is not limited to, an application server, a data server, a cloud server, and the like.
Fig. 2 is a flow diagram of a speech emotion recognition model training method according to another embodiment of the present disclosure, where the method of this embodiment includes one or more features of the above-described speech emotion recognition model training method embodiment. In one possible embodiment, S101 of the method includes: a first feature of sample audio is obtained. In embodiments of the present disclosure, the audio may have a variety of attributes that represent audio waveforms, including but not limited to pitch, volume, pace, stress, and the like. For example, a multi-dimensional vector may be extracted from the audio, and the feature values in the vector may represent a first feature of the audio. The dimension of the vector is not limited, and may be selected according to the actual scene, and may be, for example, 512 dimensions, 1024 dimensions, or other numerical values.
In one possible implementation, obtaining the first feature of the sample audio includes:
s201, extracting the first feature from the sample audio by using a waveform-to-vector (Wav2vec) model. For example, the Wav2vec model may be a pre-trained model. The Wav2vec model may include a multi-Layer Transformer Layer (Transformer Layer), and the specific number of layers of the Transformer Layer may be set according to requirements. And (3) carrying out unsupervised training on the Wav2vec model by adopting a large amount of unlabeled audio data, and obtaining the hierarchical feature extractor after iterative multi-round convergence. The features of the various layers may characterize different properties of the audio from different angles, depths. By utilizing the Wav2vec model, as many characteristics as possible of the waveform of the audio can be extracted, and a rich data basis is provided for training.
In one possible implementation, S201 utilizes a Wav2vec model to extract the first feature from the sample audio, including: framing the sample audio to obtain a plurality of first audio frames; extracting at least one audio segment from the plurality of first audio frames and inputting the audio segment into the Wav2vec model to obtain the first feature, wherein the first feature comprises the Wav2vec feature of the audio segment. Wherein the audio clip may comprise a first number of first audio frames.
For example, after the sample audio is framed to obtain a plurality of first audio frames, an audio segment may be extracted from the first audio frames at a time, and the audio segment may include a plurality of first audio frames. For example, if the entire audio partition results in 960 first audio frames, 10 audio segments may result if each audio segment includes 96 first audio frames. If each audio segment comprises 48 first audio frames, 20 audio segments may result. Each training can input one audio segment into the model to be trained, and the trained model is obtained through multiple iterations. For example, a first audio segment is input for the first time, and a second audio segment is input for the second time until all audio segments are input or the convergence result of the model is satisfactory.
In addition, the sample audio may include a plurality of samples, and the processes of framing, extracting audio pieces, and iteratively training may be repeatedly performed for each sample audio.
In the embodiment of the disclosure, the training is performed by audio segmentation, so that the data volume of each training can be controlled, the purpose of rapid convergence is achieved, and the samples can be more fully utilized.
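A small numpy sketch of the framing-and-segmentation step follows, matching the arithmetic of the example above (960 frames at 96 frames per segment gives 10 segments); the 25 ms frame length and 10 ms frame shift are assumptions, since the disclosure does not fix these values.

```python
# Sketch: framing a sample audio and grouping frames into fixed-size segments.
# Frame length/shift values are assumptions for illustration.
import numpy as np

def frame_audio(samples: np.ndarray, sample_rate: int = 16000,
                frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(samples) - frame_len) // shift)
    return np.stack([samples[i * shift:i * shift + frame_len] for i in range(n_frames)])

def to_segments(frames: np.ndarray, frames_per_segment: int = 96) -> np.ndarray:
    n_segments = len(frames) // frames_per_segment  # drop the incomplete tail
    return frames[:n_segments * frames_per_segment].reshape(
        n_segments, frames_per_segment, -1)

audio = np.random.randn(16000 * 10)      # placeholder 10-second waveform
frames = frame_audio(audio)              # (num_frames, frame_len)
segments = to_segments(frames, 96)       # (num_segments, 96, frame_len)
print(frames.shape, segments.shape)
```

Each training iteration would then feed one such segment into the model, as described above.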
In one possible embodiment, S101 of the method includes: a second feature of the sample audio is obtained. In the disclosed embodiment, the audio can represent the attributes of the speaker's voice characteristics, including but not limited to representing vocal cord vibration frequency, mouth size, lip thickness, tongue position, nasopharyngeal thickness, etc. For example, a multi-dimensional vector may be extracted from the audio, and the feature values in the vector may represent a second feature of the audio. The dimension of the vector is not limited, and can be selected according to the actual scene, and for example, the vector can be 128-dimensional, 256-dimensional or other values.
In one possible implementation, obtaining the second feature of the sample audio includes: S202, extracting the second feature from the sample audio by using a speaker classification model. For example, the speaker classification model may also be referred to as a speaker feature extraction model, and the model may include an LSTM (Long Short-Term Memory) layer, a linear mapping layer (which may simply be referred to as a linear layer), a fully connected layer, and the like. The input to the model may include features of each frame of audio, and the output may include predicted probabilities of individual speakers. For example, the type of the input audio feature may include, but is not limited to, at least one of MFCC (Mel-Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction), Fbank (filter-bank based features), and the like. The output features may include a speaker feature (speaker identification embedding) and the like. The speaker classification model may also be a pre-trained model. Through the speaker classification model, features related to the speaker role can be extracted, which helps to distinguish different speakers and provide personalized, accurate emotion recognition.
In a possible implementation, S202, extracting the second feature from the sample audio by using a speaker classification model, includes: framing the sample audio to obtain a plurality of second audio frames; and inputting the plurality of second audio frames into the speaker classification model to obtain the second characteristic.
For example, the second feature may comprise a multi-dimensional vector that characterizes the speaker: a portion of the vector may represent vocal cord vibration frequency, a portion may represent mouth size, a portion may represent lip thickness, a portion may represent tongue position, and a portion may represent nasopharyngeal cavity thickness. The second feature may also be referred to as a speaker cepstral feature. The speaker classification model may calculate the speaker probability corresponding to each frame, and the second feature is then obtained by accumulating and averaging over multiple frames, for example by averaging the vectors every 50 frames. When calculating the speaker probability corresponding to each frame, a certain amount of context may be provided. For example, frame 1 and frame 2 are input when frame 1 is calculated; frames 1, 2 and 3 are input when frame 2 is calculated; frames 3, 4 and 5 are input when frame 3 is calculated. The specific amount and manner of context can be set according to the practical application, and the embodiments of the present disclosure are not limited in this respect.
In the embodiment of the disclosure, the speaker classification model is trained through a plurality of audio frames in the audio, so that a model which outputs more accurate speaker characteristics can be obtained.
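A hedged PyTorch sketch of such a speaker classification model is shown below, roughly following the LSTM → linear mapping → fully connected description above; the feature dimension, hidden sizes, embedding size, number of speakers, and the simple frame-averaging used to form the speaker embedding are all assumptions for illustration.

```python
# Sketch: a speaker classification model (LSTM -> linear mapping -> fully connected
# classifier). All dimensions and the number of speakers are assumptions.
import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, emb_dim=128, num_speakers=1000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.linear = nn.Linear(hidden, emb_dim)             # linear mapping layer
        self.classifier = nn.Linear(emb_dim, num_speakers)   # fully connected layer

    def forward(self, frames):                 # frames: (batch, num_frames, feat_dim)
        seq, _ = self.lstm(frames)
        frame_emb = self.linear(seq)           # per-frame speaker embeddings
        logits = self.classifier(frame_emb)    # per-frame speaker logits
        # The "second feature" is obtained by accumulating and averaging over frames.
        speaker_embedding = frame_emb.mean(dim=1)    # (batch, emb_dim)
        return logits, speaker_embedding

model = SpeakerClassifier()
mfcc = torch.randn(1, 200, 40)                 # placeholder MFCC frames
logits, spk_emb = model(mfcc)
print(logits.shape, spk_emb.shape)             # (1, 200, 1000) (1, 128)
```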
In the embodiment of the present disclosure, the obtaining of the first feature of the sample audio and the second feature of the sample audio may be performed separately, and the execution timings of the two are not limited in this embodiment, and may have a sequential order, or may be parallel.
Fig. 3 is a flow diagram of a speech emotion recognition model training method according to another embodiment of the present disclosure, where the method of this embodiment includes one or more features of the above-described speech emotion recognition model training method embodiment. In one possible implementation, S102 performs emotional feature decoupling using the first feature and the second feature, further comprising:
s301, inputting the first feature and the second feature into an encoder for encoding processing so as to achieve emotional feature decoupling. The emotion feature can be extracted based on the first feature and the second feature by the encoder, so that training of emotion recognition can be performed more accurately.
In one possible implementation, the encoder includes a weight average layer, a first connection layer, a first convolution regularization layer, a BLSTM (bidirectional LSTM) layer and a down-sampling layer, and inputting the first feature and the second feature into the encoder for encoding processing includes: inputting the first feature into the weight average layer; inputting the second feature into the first connection layer; splicing the output feature of the weight average layer and the second feature at the first connection layer to obtain a first splicing feature; and inputting the first splicing feature into the first convolution regularization layer, the BLSTM layer and the down-sampling layer connected in series for sequential processing to obtain the output feature of the encoder, where the output feature of the encoder includes the emotional feature. Specifically, the first splicing feature may be input into the first convolution regularization layer for processing, the output feature of the first convolution regularization layer may be input into the BLSTM layer for processing, and the output feature of the BLSTM layer may be input into the down-sampling layer for processing, to obtain the output feature of the encoder.
For example, the input to the encoder may comprise features extracted from an audio segment (assumed to comprise H frames) by the Wav2vec model, and the output may comprise emotional features. The per-frame feature vectors from the multiple transformer layers (e.g., 25 layers) of the Wav2vec model are combined by a weighted average with learnable coefficients. The coefficient of each vector may represent the importance of that feature; an initial value may be preset for each coefficient, and the values may be updated during training. After the weighted-average result for each frame is obtained, it is spliced with the second feature at the first connection layer, then input into the convolution regularization layer and the BLSTM layer to obtain H vectors of dimension 2×d. These vectors are then down-sampled.
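The following is a minimal sketch of an encoder in this spirit: a learnable weighted average over the Wav2vec layer outputs, concatenation with the speaker feature, then a convolution with batch normalization (used here as one plausible reading of the "convolution regularization layer"), a BLSTM, and simple stride-based down-sampling. The layer count, feature dimensions, d, and the down-sampling factor are assumptions.

```python
# Sketch: emotion-feature encoder (weighted average -> concat -> Conv+BN -> BLSTM
# -> down-sampling). All dimensions are assumptions.
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    def __init__(self, num_layers=25, feat_dim=768, spk_dim=128, d=128, downsample=2):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learnable coefficients
        self.conv = nn.Sequential(                                  # "convolution regularization" layer
            nn.Conv1d(feat_dim + spk_dim, 2 * d, kernel_size=5, padding=2),
            nn.BatchNorm1d(2 * d),
            nn.ReLU(),
        )
        self.blstm = nn.LSTM(2 * d, d, batch_first=True, bidirectional=True)
        self.downsample = downsample

    def forward(self, wav2vec_feats, spk_emb):
        # wav2vec_feats: (layers, batch, H, feat_dim); spk_emb: (batch, spk_dim)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        x = (w * wav2vec_feats).sum(dim=0)                 # weight average layer
        spk = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)
        x = torch.cat([x, spk], dim=-1)                    # first connection layer (splicing)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.blstm(x)                               # H vectors of dimension 2*d
        return x[:, ::self.downsample]                     # down-sampling layer

enc = EmotionEncoder()
feats = torch.randn(25, 1, 96, 768)     # placeholder: 25 layers x 96 frames
spk = torch.randn(1, 128)
emo = enc(feats, spk)
print(emo.shape)                        # (1, 48, 256)
```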
In the embodiment of the disclosure, after the first feature is weighted and averaged by the weight averaging layer of the encoder, the first connection layer and the second feature are spliced, and then the first convolution regularization layer, the BLSTM layer and the downsampling are used for processing, so that the input feature can be decoupled, the speaker feature and the text content feature are omitted, the emotional feature is decomposed, and the training of emotion recognition is performed more accurately.
Fig. 4 is a flow diagram of a speech emotion recognition model training method according to another embodiment of the present disclosure, where the method of this embodiment includes one or more features of the above-described speech emotion recognition model training method embodiment.
In one possible embodiment, the method further comprises:
s401, extracting phoneme characteristics from the sample audio by using a phoneme alignment model.
In the disclosed embodiments, the phoneme alignment model extracts phoneme features from the audio segments. The audio segment may be similar to the audio segment used by the above-described Wav2vec model. For example, an audio segment comprising a first number of first audio frames is input into both the Wav2vec model to extract the first feature and the phoneme alignment model to extract the phoneme feature. The phoneme alignment model may also be referred to as a phoneme encoder, and may be a pre-trained model. It may employ a network structure such as GMM (Gaussian Mixture Model)-HMM (Hidden Markov Model), DNN (Deep Neural Network)-HMM, LSTM-CTC (Connectionist Temporal Classification), RNN-T (Recurrent Neural Network Transducer), and the like. The phoneme features of the sample audio can be extracted through the phoneme alignment model, which facilitates the subsequent reconstruction of the audio using the phoneme features, the second feature, and so on, further improving the decoupling effect.
In the embodiment of the present disclosure, the execution time sequence of extracting the phoneme feature, the first feature, and the second feature from the sample audio is not limited, and may have a sequential order or may be parallel.
In one possible implementation, S102 performs emotional feature decoupling using the first feature and the second feature, further comprising:
s402, inputting the output characteristic of the encoder, the second characteristic and the phoneme characteristic into a decoder for decoding;
and S403, updating parameters of at least one of the encoder, the decoder and the phoneme alignment model by using the output characteristics of the decoder.
In the embodiment of the present disclosure, the decoder processes the output characteristic, the second characteristic, and the phoneme characteristic of the encoder, so as to reconstruct the audio and update the parameters of the encoder by using the difference between the reconstructed audio and the original sample audio. Further, parameters of the decoder and/or the phoneme alignment model may also be updated. In this way, the decoder and the phoneme alignment model can be used for updating the model parameters, and the purity of the emotional features output by the encoder can be further improved.
In one possible implementation, the decoder includes an up-sampling layer, a second connection layer, an LSTM layer, a first linear layer, and a second convolution regularization layer, and S402 of inputting the output feature of the encoder, the second feature and the phoneme feature into the decoder for decoding processing includes:
inputting the output characteristic of the encoder into the up-sampling layer for up-sampling;
inputting the second feature and the phoneme feature into the second connection layer;
splicing the features obtained by up-sampling, the second features and the phoneme features at the second connecting layer to obtain second splicing features;
inputting the second splicing characteristic into the LSTM layer and the first linear layer which are connected in series for processing in sequence to obtain a first Mel spectral characteristic and a first error; specifically, the second splicing characteristic may be input to the LSTM layer for processing, and the output characteristic of the LSTM layer may be input to the first linear layer for processing, so as to obtain a first mel-frequency spectrum characteristic and a first error;
and inputting the first Mel spectral feature into the second convolution regularization layer for convolution regularization processing to obtain a second Mel spectral feature and a second error.
For example, the input to the decoder may include the output feature of the encoder, and up-sampling may be repeated multiple times to restore the vector dimension, e.g., yielding H vectors of dimension 2×d. The features obtained by up-sampling are then spliced with the second feature and the output feature of the phoneme alignment model, and the spliced feature is input into the subsequent network modules of the decoder. For example, at the first linear layer of the decoder, H M-dimensional features F1 are obtained, where feature F1 may represent a sound spectrum, e.g., a Mel spectrum, of the audio segment (which includes the first number of first audio frames). A first error between feature F1 and the true Mel spectrum, such as the minimum mean square error Lr1, is then calculated. Feature F1 is input into the subsequent convolution layer to obtain a new Mel spectrum, and a second error between it and the true Mel spectrum, such as the minimum mean square error Lr2, is calculated.
In the embodiment of the disclosure, the up-sampling layer of the decoder restores the vector dimension of the encoder's output feature, which is then spliced with the phoneme feature and the second feature at the connection layer, so that the Mel spectrum of the audio segment is reconstructed by the subsequent network modules of the decoder. In this way, the model parameters are updated according to the error between the reconstructed Mel spectrum and the true Mel spectrum of the sample audio, further improving the purity of the emotional feature output by the encoder.
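A hedged sketch of such a decoder follows: up-sampling of the encoder output, splicing with the speaker feature and the phoneme feature, an LSTM and a linear layer producing a Mel spectrum, and a convolutional post-layer producing a refined Mel spectrum, with two mean-squared errors computed against the true Mel spectrum. All dimensions, the up-sampling factor, and the single post-convolution layer are assumptions.

```python
# Sketch: reconstruction decoder (up-sample -> concat -> LSTM -> linear -> Mel,
# then conv refinement), producing the two errors Lr1 and Lr2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionDecoder(nn.Module):
    def __init__(self, emo_dim=256, spk_dim=128, pho_dim=64, hidden=256, n_mels=80, upsample=2):
        super().__init__()
        self.upsample = upsample
        self.lstm = nn.LSTM(emo_dim + spk_dim + pho_dim, hidden, batch_first=True)
        self.linear = nn.Linear(hidden, n_mels)        # first linear layer -> Mel feature F1
        self.postconv = nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2)  # second conv layer

    def forward(self, emo_feat, spk_emb, phoneme_feat, true_mel):
        # emo_feat: (batch, H/upsample, emo_dim); phoneme_feat, true_mel: (batch, H, ...)
        x = emo_feat.repeat_interleave(self.upsample, dim=1)       # up-sampling layer
        spk = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)
        x = torch.cat([x, spk, phoneme_feat], dim=-1)              # second connection layer
        x, _ = self.lstm(x)
        mel1 = self.linear(x)                                      # first Mel spectral feature
        loss1 = F.mse_loss(mel1, true_mel)                         # first error (Lr1)
        mel2 = self.postconv(mel1.transpose(1, 2)).transpose(1, 2) # second Mel spectral feature
        loss2 = F.mse_loss(mel2, true_mel)                         # second error (Lr2)
        return mel2, loss1, loss2

dec = EmotionDecoder()
mel, l1, l2 = dec(torch.randn(1, 48, 256), torch.randn(1, 128),
                  torch.randn(1, 96, 64), torch.randn(1, 96, 80))
print(mel.shape, l1.item(), l2.item())
```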
In one possible implementation, S403 of updating parameters of at least one of the encoder, the decoder and the phoneme alignment model by using the output features of the decoder includes:
updating parameters of at least one of the encoder, the decoder and the phoneme alignment model according to a stochastic gradient descent criterion using the first error and the second error. In the embodiment of the disclosure, based on the stochastic gradient descent criterion, model parameters can be updated by using each training sample, i.e., sample audio, and the model parameters are updated once by executing stochastic gradient descent each time, so that the execution speed is higher, and the model convergence can be quickly and accurately realized. In the embodiment of the present disclosure, parameter updating of the encoder, the decoder, and the phoneme alignment model may be achieved by multiple iterations, and the embodiment of the present disclosure does not limit the specific iteration number, and after the training target is reached, the training may be stopped.
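A brief sketch of one such update step is shown below; it reuses the hypothetical EmotionEncoder and EmotionDecoder instances (enc, dec) from the sketches above, and the learning rate and the equal weighting of the two errors are assumptions.

```python
# Sketch: one joint update with stochastic gradient descent over both errors.
# Assumes enc/dec from the sketches above; a phoneme model's parameters could be added too.
import torch

params = list(enc.parameters()) + list(dec.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3)

def reconstruction_step(wav2vec_feats, spk_emb, phoneme_feat, true_mel):
    emo_feat = enc(wav2vec_feats, spk_emb)
    _, loss1, loss2 = dec(emo_feat, spk_emb, phoneme_feat, true_mel)
    loss = loss1 + loss2                 # combine the first and second errors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # one stochastic gradient descent update
    return loss.item()
```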
Fig. 5 is a flow diagram of a speech emotion recognition model training method according to another embodiment of the present disclosure, where the method of this embodiment includes one or more features of the above-described speech emotion recognition model training method embodiment. In a possible implementation manner, S103 performs emotion recognition training by using the decoupled emotional characteristics, and further includes:
s501, inputting the output characteristics of the encoder into an emotion recognition classifier to perform emotion class recognition, and updating parameters of the encoder and the emotion recognition classifier by using emotion class recognition results.
In the disclosed embodiment, S501 may follow S301. The emotion recognition classifier may also be referred to as an emotion recognition classification model, an emotion recognition classification network, or the like. The output features of the encoder may include the decoupled emotional features. After the output features of the encoder are input into the emotion recognition classifier, the emotion classification result can be obtained through an activation function such as softmax. Using the emotion category recognition result of the purer emotional features to update the parameters of the encoder and the emotion recognition classifier can improve the accuracy of the speech emotion recognition result.
In one possible embodiment, the emotion recognition classifier includes a second linear layer, a discarding layer and a third linear layer, S501 inputs the output characteristics of the encoder to the emotion recognition classifier for emotion category recognition, and updates the parameters of the encoder and the emotion recognition classifier by using the emotion category recognition result, including:
inputting the output characteristics of the encoder into the second linear layer, the discarding layer and the third linear layer which are connected in series for processing in sequence to obtain an emotion recognition result; specifically, the output characteristics of the encoder may be input into the second linear layer for processing, the output characteristics of the second linear layer may be input into the discard layer for processing, and the output characteristics of the discard layer may be input into the third linear layer for processing, so as to obtain an emotion recognition result;
calculating loss by using the cross entropy of the emotion recognition result;
parameters of the encoder and/or the emotion recognition classifier are updated with the loss.
In the embodiment of the present disclosure, the classification result of the emotion may include a probability corresponding to each emotion. Assuming that the audio has N segments, the probabilities of the N segments can be averaged to obtain the emotion classification result of the whole audio. Moreover, a loss function can be constructed by using the cross entropy, and parameters of the classifier are adjusted by using the loss function, so that the probability of the target emotion is increased, and the probabilities of other emotions are reduced.
In the embodiment of the disclosure, the training of the emotion recognition classifier can be realized through multiple iterations, the embodiment of the disclosure does not limit the specific iteration times, and the training can be stopped after the training target is reached. And performing emotion category identification on the purer emotion characteristics by using an emotion identification classifier, and updating parameters of the encoder and the emotion identification classifier by using the cross entropy to obtain a speech emotion identification model with a more accurate identification result.
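A hedged sketch of such an emotion recognition classifier is shown below: a linear layer, a dropout ("discarding") layer and a second linear layer, trained with cross-entropy, with the per-segment probabilities averaged to obtain the utterance-level result. The dimensions, dropout rate, time pooling, and the six emotion classes are assumptions.

```python
# Sketch: emotion recognition classifier (linear -> dropout -> linear) with
# cross-entropy training and segment-level probability averaging.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionClassifier(nn.Module):
    def __init__(self, emo_dim=256, hidden=128, num_emotions=6, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emo_dim, hidden),        # second linear layer
            nn.Dropout(dropout),               # discarding layer
            nn.Linear(hidden, num_emotions),   # third linear layer
        )

    def forward(self, emo_feat):               # emo_feat: (batch, T, emo_dim)
        return self.net(emo_feat.mean(dim=1))  # pool over time, then classify

clf = EmotionClassifier()
segment_feats = torch.randn(10, 48, 256)       # N = 10 segments of one audio
labels = torch.zeros(10, dtype=torch.long)     # target emotion of each segment
loss = F.cross_entropy(clf(segment_feats), labels)                       # training loss
utterance_probs = torch.softmax(clf(segment_feats), dim=-1).mean(dim=0)  # average over segments
print(loss.item(), utterance_probs)
```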
Fig. 6 is a flowchart illustrating an emotion recognition method according to an embodiment of the present disclosure, the method including:
s601, acquiring a first feature and a second feature of the audio to be recognized, wherein the first feature is used for characterizing features related to the waveform of the audio to be recognized, and the second feature is used for characterizing features related to a speaker of the audio to be recognized;
s602, inputting the first characteristic and the second characteristic into a speech emotion recognition model for emotion category recognition to obtain a first recognition result;
the voice emotion recognition model is obtained by training through a voice emotion recognition model training method.
For example, the audio to be identified may include test audio. First features relating to waveforms and second features relating to speakers are extracted from the audio to be recognized. The first characteristic and the second characteristic are input into a speech emotion recognition model, a coder of the speech emotion recognition model can be used for obtaining decoupled emotion characteristics, and a classifier is used for classifying the emotion characteristics, so that an accurate emotion category recognition result can be obtained.
In the embodiment of the present disclosure, the obtaining of the first feature of the audio to be recognized and the obtaining of the second feature of the audio to be recognized may be performed separately, and the execution timings of the two features are not limited in this embodiment, and may have a sequential order or may be parallel.
Fig. 7 is a flow diagram of an emotion recognition method according to another embodiment of the present disclosure, which includes one or more features of the above-described emotion recognition method embodiments. In one possible implementation manner, in S601, the obtaining the first feature of the audio to be recognized includes: and extracting the first feature from the audio to be recognized by using a Wav2vec model. The specific structure and function of the Wav2vec model can be referred to the related description of the Wav2vec model in the above training method embodiment. By utilizing the Wav2vec model, the characteristics of the waveform of the audio can be extracted as much as possible, and a rich data basis is provided for emotion recognition.
In a possible implementation, the extracting the first feature from the audio to be recognized by using the Wav2vec model includes: framing the audio to be identified to obtain a plurality of first audio frames; extracting at least one audio segment from the plurality of first audio frames and inputting the audio segment into the Wav2vec model to obtain the first feature, wherein the first feature comprises the Wav2vec feature of the audio segment. Wherein the audio clip may comprise a first number of first audio frames. For example, after the audio to be identified is framed to obtain a plurality of first audio frames, an audio segment may be extracted from the first audio frames at a time, and the audio segment may include a plurality of first audio frames. And inputting one audio segment into the Wav2vec model each time to obtain the Wav2vec characteristics of the segment. And then inputting the Wav2vec characteristics into an encoder of the model, and carrying out recognition by combining speaker characteristics to obtain a speech emotion recognition result. And averaging the recognition results of the multiple segments of the audio to obtain the emotion recognition result of the whole audio. The emotion recognition result may also be referred to as an emotion classification result. The emotion recognition is carried out by utilizing the plurality of audio segments, and a more accurate recognition result can be obtained by combining the recognition result of each segment.
In one possible implementation, in S601, the obtaining the second feature of the audio to be recognized includes: the second feature is extracted from the audio to be recognized using a speaker classification model. The specific structure and function of the speaker classification model can be referred to the related description of the speaker classification model in the above training method embodiment. The characteristics extracted by the speaker classification model and the Wav2vec characteristics can be used for realizing emotion characteristic decoupling, and the purity of the emotion characteristics is improved, so that a more accurate emotion recognition result is obtained.
In a possible implementation, the extracting the second feature from the audio to be recognized by using a speaker classification model includes: framing the audio to be identified to obtain a plurality of second audio frames; and inputting the plurality of second audio frames into the speaker classification model to obtain the second characteristic. For example, the second feature may comprise a multi-dimensional vector. The vector may characterize the speaker's characteristics. The speaker classification model can respectively calculate the probability of the speaker corresponding to each frame, and then the second feature is obtained after multi-frame accumulation and averaging. The probability of the speaker corresponding to each frame is calculated with a certain amount of context.
In the embodiment of the disclosure, more accurate speaker characteristics can be obtained by performing speaker characteristic classification on a plurality of audio frames in audio.
In one possible embodiment, the method further comprises:
s701, performing text recognition on text content corresponding to the audio to be recognized by using a text emotion recognition model to obtain a second recognition result;
s702, weighting the first recognition result and the second recognition result to obtain a third recognition result.
In the embodiment of the present disclosure, the text emotion recognition model may also be referred to as a text emotion classification model, a text emotion classifier, or the like. The text emotion recognition model may include a CNN (Convolutional Neural Network) layer, a mean pooling layer, a flattening layer, a linear layer, and the like. The input of the text emotion classification model may include feature vectors corresponding to the respective words in the text content of the audio to be recognized. These feature vectors can be obtained by pre-training with a Transformer network. The loss function may employ cross entropy, and training may be performed on the entire text content. Combining the speech emotion recognition result with the text emotion recognition result can yield a more accurate emotion recognition result.
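A minimal sketch of the weighting step in S702 follows: the speech-based result (first recognition result) and the text-based result (second recognition result) are combined into a third result. The 0.7/0.3 weights are assumptions, since the disclosure does not specify the weighting.

```python
# Sketch: weighted fusion of the speech-based and text-based emotion probabilities.
import torch

def fuse_results(speech_probs: torch.Tensor, text_probs: torch.Tensor,
                 speech_weight: float = 0.7) -> torch.Tensor:
    # speech_probs / text_probs: per-emotion probabilities (first / second recognition result)
    fused = speech_weight * speech_probs + (1.0 - speech_weight) * text_probs
    return fused / fused.sum()            # third recognition result, renormalized

speech_probs = torch.tensor([0.10, 0.60, 0.10, 0.10, 0.05, 0.05])
text_probs = torch.tensor([0.20, 0.50, 0.10, 0.10, 0.05, 0.05])
print(fuse_results(speech_probs, text_probs))
```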
The emotion recognition method in the embodiments of the present application, like the speech emotion recognition model training method, may be executed by a terminal, a server, or other processing devices in a single-machine, multi-machine, or cluster system. The terminal may include, but is not limited to, a user device, a mobile device, a personal digital assistant, a handheld device, a computing device, an in-vehicle device, a wearable device, and the like. The server may include, but is not limited to, an application server, a data server, a cloud server, and the like. The device that executes the speech emotion recognition model training method and the device that executes the emotion recognition method may be the same device or different devices, and may be selected according to the requirements of the actual application scenario.
Fig. 8 is a schematic structural diagram of a speech emotion recognition model training device according to an embodiment of the present disclosure, where the device includes:
an obtaining module 801, configured to obtain a first feature and a second feature of a sample audio, where the first feature is used to characterize a feature related to a waveform of the sample audio, and the second feature is used to characterize a feature related to a speaker of the sample audio;
a decoupling module 802 for performing emotional feature decoupling using the first feature and the second feature;
and the training module 803 is configured to perform emotion recognition training by using the emotion characteristics obtained through decoupling to obtain a trained speech emotion recognition model.
Fig. 9 is a schematic structural diagram of a speech emotion recognition model training device according to another embodiment of the present disclosure, where the device of this embodiment includes one or more features of the above-described speech emotion recognition model training device embodiment. In one possible implementation, the obtaining module 801 is configured to extract the first feature from the sample audio by using a waveform-to-vector Wav2vec model.
In one possible implementation, the obtaining module 801 includes:
a first framing submodule 9011, configured to frame the sample audio to obtain a plurality of first audio frames;
a first feature extraction sub-module 9012, configured to extract at least one audio segment from the plurality of first audio frames, input the Wav2vec model to obtain the first feature, where the first feature includes a Wav2vec feature of the audio segment. Wherein the audio clip may comprise a first number of first audio frames.
In a possible implementation, the obtaining module 801 is configured to extract the second feature from the sample audio by using a speaker classification model.
In one possible implementation, the obtaining module 801 includes:
a second framing submodule 9021, configured to frame the sample audio to obtain a plurality of second audio frames;
a second feature extraction sub-module 9022, configured to input the plurality of second audio frames into the speaker classification model to obtain the second feature.
Fig. 10 is a schematic structural diagram of a speech emotion recognition model training device according to another embodiment of the present disclosure, where the device of this embodiment includes one or more features of the above-described speech emotion recognition model training device embodiment. In one possible implementation, the decoupling module 802 includes:
and the encoding submodule 1001 is configured to input the first feature and the second feature into an encoder for encoding processing, so as to implement emotional feature decoupling.
In one possible implementation, the encoder comprises a weight-averaging layer, a first connection layer, a first convolution regularization layer, a BLSTM layer and a down-sampling layer, the encoding sub-module 1001 being configured to:
inputting the first feature into the weighted average layer;
inputting the second characteristic into the first connection layer;
splicing the output characteristic of the weight average layer and the second characteristic at the first connection layer to obtain a first splicing characteristic;
and inputting the first splicing feature into the first convolution regularization layer, the BLSTM layer and the down-sampling layer connected in series for sequential processing to obtain the output feature of the encoder, where the output feature of the encoder includes the emotional feature. Specifically, the first splicing feature may be input into the first convolution regularization layer for processing, the output feature of the first convolution regularization layer may be input into the BLSTM layer for processing, and the output feature of the BLSTM layer may be input into the down-sampling layer for processing, to obtain the output feature of the encoder.
Fig. 11 is a schematic structural diagram of a speech emotion recognition model training device according to another embodiment of the present disclosure, where the device of this embodiment includes one or more features of the above-described speech emotion recognition model training device embodiment.
In one possible embodiment, the apparatus further comprises:
a phoneme feature extraction module 804, configured to extract a phoneme feature from the sample audio by using a phoneme alignment model.
In one possible implementation, the decoupling module 802 includes:
a decoding submodule 1101 for inputting the output characteristic of the encoder, the second characteristic and the phoneme characteristic into a decoder for decoding;
an updating sub-module 1102 for updating parameters of at least one of the encoder, the decoder and the phoneme alignment model using the output characteristics of the decoder.
In one possible embodiment, the decoder includes an up-sampling layer, a second connection layer, an LSTM layer, a first linear layer, and a second convolution regularization layer, and the decoding submodule 1101 is configured to:
inputting the output characteristic of the encoder into an up-sampling layer of the decoder for up-sampling;
inputting the second feature and the phoneme feature into the second connection layer;
splicing the features obtained by up-sampling, the second features and the phoneme features at the second connecting layer to obtain second splicing features;
inputting the second splicing characteristic into the LSTM layer and the first linear layer which are connected in series for processing in sequence to obtain a first Mel spectral characteristic and a first error; specifically, the second splicing characteristic may be input to the LSTM layer for processing, and the output characteristic of the LSTM layer may be input to the first linear layer for processing, so as to obtain a first mel-frequency spectrum characteristic and a first error;
and inputting the first Mel spectral feature into the second convolution regularization layer for convolution regularization processing to obtain a second Mel spectral feature and a second error.
In one possible implementation, the update submodule 1102 is configured to:
updating parameters of at least one of the encoder, the decoder and the phoneme alignment model according to a stochastic gradient descent criterion using the first error and the second error.
In a possible implementation, the training module 803 is configured to: and inputting the output characteristics of the encoder into an emotion recognition classifier to perform emotion class recognition, and updating parameters of the encoder and the emotion recognition classifier by using emotion class recognition results.
In one possible embodiment, the emotion recognition classifier includes a second linear layer, a discarding layer and a third linear layer, and the training module 803 includes:
an emotion recognition sub-module 1103, configured to input the output characteristics of the encoder into the second linear layer, the discarding layer, and the third linear layer connected in series, and sequentially process the output characteristics to obtain an emotion recognition result; specifically, the output characteristics of the encoder may be input into the second linear layer for processing, the output characteristics of the second linear layer may be input into the discard layer for processing, and the output characteristics of the discard layer may be input into the third linear layer for processing, so as to obtain an emotion recognition result;
a loss calculation submodule 1104 for calculating a loss using the cross entropy of the emotion recognition result;
a parameter update sub-module 1105 for updating parameters of the encoder and/or the emotion recognition classifier with the loss.
For the description of the specific functions and examples of each module and sub-module of the speech emotion recognition model training apparatus in the embodiment of the present disclosure, reference may be made to the related description of the corresponding steps in the embodiment of the speech emotion recognition model training method, and details are not repeated here.
Fig. 12 is a schematic structural diagram of an emotion recognition apparatus according to an embodiment of the present disclosure, the apparatus including:
an obtaining module 1201, configured to obtain a first feature and a second feature of the audio to be recognized, where the first feature is used to characterize a feature related to a waveform of the audio to be recognized, and the second feature is used to characterize a feature related to a speaker of the audio to be recognized;
the first recognition module 1202 is configured to input the first feature and the second feature into a speech emotion recognition model for emotion category recognition to obtain a first recognition result;
the voice emotion recognition model is obtained by training through a voice emotion recognition model training device.
Fig. 13 is a schematic structural diagram of an emotion recognition apparatus according to another embodiment of the present disclosure, which includes one or more features of the embodiment of the emotion recognition apparatus described above. In a possible implementation, the obtaining module 1201 is configured to extract the first feature from the audio to be recognized by using a waveform-to-vector Wav2vec model.
In a possible implementation, the obtaining module 1201 includes:
the first framing submodule 1301 is configured to frame the audio to be identified to obtain a plurality of first audio frames;
a first extracting sub-module 1302, configured to extract at least one audio segment from the plurality of first audio frames, and input the at least one audio segment into the Wav2vec model to obtain the first feature, where the first feature includes a Wav2vec feature of the audio segment. The audio segment may comprise a first number of first audio frames.
In a possible implementation, the obtaining module 1201 is configured to extract the second feature from the audio to be recognized by using a speaker classification model.
In a possible implementation, the obtaining module 1201 includes:
the second framing submodule 1303 is configured to frame the audio to be identified to obtain a plurality of second audio frames;
the second extracting sub-module 1304 is configured to input the plurality of second audio frames into the speaker classification model to obtain the second feature.
In one possible embodiment, the apparatus further comprises:
the second recognition module 1203 is configured to perform text recognition on the text content corresponding to the audio to be recognized by using a text emotion recognition model to obtain a second recognition result;
the processing module 1204 is configured to perform weighting processing on the first recognition result and the second recognition result to obtain a third recognition result.
For a description of specific functions and examples of each module and sub-module of the emotion recognition apparatus in the embodiment of the present disclosure, reference may be made to the related description of the corresponding steps in the embodiment of the emotion recognition method, and details are not repeated here.
In some application scenarios, as shown in fig. 14, an example speech emotion recognition procedure includes: extracting front-end features from the speech data, for example MFCC, PLP, Fbank, FFT (Fast Fourier Transform) features, and the like. Next, emotion features are extracted from the front-end features using a speech recognition model such as a GMM, DNN, CNN, ResNet (Residual Neural Network), or SincNet (a convolutional neural network with learnable sinc-based filters). Then, the emotion features are input into a back-end classifier such as a COS (cosine) classifier, an SVM (Support Vector Machine) classifier, or an LDA (Linear Discriminant Analysis) classifier to obtain a recognition result. The accuracy of the recognition result is generally improved by optimizing the front-end feature extraction, optimizing the emotion feature model, adding more training data, and the like.
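As a non-limiting illustration of this front-end stage, the following Python sketch extracts MFCC and log-Mel (Fbank) features with the open-source librosa library; the sampling rate, window length, hop length, and feature dimensions are assumptions chosen for the example rather than values required by this disclosure.

```python
# Illustrative sketch: front-end feature extraction with librosa.
# The 16 kHz rate, 25 ms window, 10 ms hop, and dimensions are assumptions.
import librosa
import numpy as np

def extract_frontend_features(wav_path, sr=16000, n_mfcc=40, n_mels=80):
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms window / 10 ms hop expressed in samples
    win_length, hop_length = int(0.025 * sr), int(0.010 * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=512,
                                win_length=win_length, hop_length=hop_length)
    fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, n_fft=512,
                                           win_length=win_length, hop_length=hop_length)
    # Return frame-major arrays: [num_frames, feature_dim]
    return mfcc.T, np.log(fbank + 1e-6).T
```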
Optimizing front-end feature extraction: different feature combinations may be tried, such as MFCC plus PLP, or different dimensions, for example increasing from 40 to 80. Optimizing the emotion feature model: for example, classification may be performed with a traditional HMM or SVM model, or with a deep learning model such as a DNN, CNN, or LSTM. In addition, adding more training data can improve coverage.
In general, from the perspective of audio, speech emotion is mainly reflected in dimensions such as pitch, volume, speech rate, and stress, and has no direct relationship with the speaker or the content. For example, an angry person may say anything, and may be Zhang San or Li Si.
According to the embodiments of the present disclosure, from the perspective of decoupling, the features of the audio are separated into emotion features, speaker features, content features, and the like, which facilitates accurate extraction of the emotion features.
In addition, the text content of the audio may also convey emotion. For example, from the sentence "I love this down jacket so much, it feels so comfortable to the touch", it can be inferred from the words alone that the speaker is in a happy state. The embodiments of the present disclosure can combine multiple modalities and use the text to assist emotion recognition.
Embodiments of the present disclosure provide a multimodal (audio and text) emotion recognition method and system based on decoupled (disentangled) learning, which can recognize emotion types including, but not limited to, angry, happy, neutral, and sad. Referring to fig. 15, the system mainly comprises: 1) an audio feature decoupling module; 2) an audio emotion classification module (Classifier); and 3) a text emotion classification module. The audio feature decoupling module may include an Encoder, a Decoder, a phoneme encoder (Phone Encoder), and the like, and models the features by separating the emotion features, speaker features, and content features and then synthesizing and reconstructing the original audio. In this way, the three kinds of features can be stripped apart more thoroughly, and purer emotion features can be obtained. Recognition and classification are then performed, and the emotion type of the audio is judged jointly with the assistance of the text information. The function of each module of the speech emotion recognition model (SER model) is described in detail below with reference to fig. 15:
1: audio characteristic decoupling module
a: wav2vec feature extraction module
For example, the Wav2vec features may be obtained from a pre-trained model. The pre-trained model may include multiple Transformer layers; for example, {Transformer Layer × 25} denotes a pre-trained model containing 25 Transformer layers, and the specific number of layers can be set as needed. A large amount of audio data, for example about 60,000 hours of unlabeled audio from LibriVox (an open speech data set), is used to train the model in an unsupervised manner; after multiple rounds of iteration until convergence, a layered feature extractor is obtained. For each frame of audio (40 ms per frame, 20 ms frame shift), a 25 × 1024 dimensional vector is produced (25 layers in total, 1024 dimensions output per layer). The features of different layers characterize different properties of the audio from different angles and depths, so the pre-trained model captures as many properties of the audio waveform as possible. If 96 frames are extracted at a time, the Wav2vec feature dimension can be denoted as [25, 1024, 96]. The values of 1024 dimensions, 96 frames, and so on are merely examples rather than limitations; other numbers are possible.
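As a non-limiting illustration, the following Python sketch obtains per-layer features from a pre-trained wav2vec 2.0 model using the open-source transformers library; the checkpoint name and the 16 kHz input rate are assumptions, and the exact layer count and dimensions depend on the chosen checkpoint.

```python
# Illustrative sketch: layer-wise wav2vec 2.0 features with transformers.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Checkpoint name is an assumption; any pre-trained wav2vec 2.0 large model works similarly.
processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-lv60")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-lv60")
model.eval()

def wav2vec_layer_features(waveform_16k):
    inputs = processor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple of (num_layers + 1) tensors, each [1, T, 1024];
    # stacking gives a representation analogous to the [25, 1024, 96] shape above.
    feats = torch.stack(out.hidden_states, dim=0).squeeze(1)  # [layers, T, 1024]
    return feats.permute(0, 2, 1)                             # [layers, 1024, T]
```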
b: speaker Identity module
For example, the speaker feature extraction model may include multiple LSTM layers, a Linear mapping layer, a Fully Connected (FC) layer, and the like. The output of the speaker feature extraction model includes a predicted probability for each speaker, and the input may include features of each audio frame. The type of input audio feature may include, but is not limited to, at least one of MFCC (Mel-Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction), Fbank (filter-bank based features), and the like. For example, the audio features may be 20-dimensional, with 25 ms per frame and a 10 ms frame shift. The calculation may use a certain amount of context, for example the preceding and following frames: when computing frame 3, frames 2, 3, and 4 may be input, and the result may include the predicted probability that frame 3 is the speech of a certain person. The training loss function may be CE (cross entropy). For example, the desired goal may be expressed through the loss function: the audio is expected to have the highest probability for Li Si, while the probabilities for Zhang San and Wang Wu are small.
The speaker spectrum feature x is extracted at the Linear layer as the speaker feature of the input audio. For example, the speaker feature may be a d-vector or an x-vector. Such a vector characterizes speaker properties such as vocal cord vibration frequency, mouth size, lip thickness, tongue position, and nasopharyngeal cavity thickness; its dimension may be 128 or another value. For example, one part of the vector may represent vocal cord vibration frequency, another part mouth size, another lip thickness, another tongue position, and another nasopharyngeal cavity thickness. The speaker spectrum feature x can be obtained by accumulating multiple frames and averaging, for example computing the mean of the vectors every 50 frames. With d-vectors, the speech can be split into segments and the model outputs of the segments are averaged. With x-vectors, the speech can likewise be split into multiple segments, and the model-processed features of the segments are aggregated in a certain manner (for example, pooling).
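As a non-limiting illustration, the following PyTorch sketch shows a d-vector style speaker encoder (LSTM layers, a linear read-out, and a speaker-classification head used only for pre-training) together with segment-level averaging; the layer counts, hidden sizes, and 128-dimensional embedding are assumptions for the example.

```python
# Illustrative sketch: d-vector style speaker encoder with segment averaging.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=20, hidden=256, emb_dim=128, n_speakers=1000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden, emb_dim)   # the d-vector is read out here
        self.fc = nn.Linear(emb_dim, n_speakers)   # used only during pre-training

    def forward(self, frames):                     # frames: [B, T, feat_dim]
        out, _ = self.lstm(frames)
        dvec = self.linear(out[:, -1])             # last-frame state -> d-vector
        return dvec, self.fc(dvec)                 # (embedding, speaker logits)

def speaker_embedding(encoder, segments):
    # Average the per-segment d-vectors (e.g., one segment every 50 frames).
    with torch.no_grad():
        dvecs = [encoder(seg.unsqueeze(0))[0] for seg in segments]
    return torch.stack(dvecs).mean(dim=0)
```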
The speaker feature extraction module may be pre-trained, for example using the open-source data sets VoxCeleb1, VoxCeleb2, Librispeech, and the like, to obtain a stable model. It is then used in the system as the speaker feature extraction module (speaker identification embedding), and its model parameters may be kept fixed.
When the system extracts the speaker feature, multiple audio recordings of the speaker, for example more than 10, may be selected, and the vectors extracted from the individual recordings are averaged to obtain the final speaker feature, i.e., the speaker embedding.
c: encoder for encoding a video signal
The encoder may include a weighted average layer, a connection layer, convolution regularization layers, BLSTM layers, and a down-sampling layer. For example, an exemplary encoder structure {Weighted Average + Concat + {Conv1D + BatchNormalization} × 3 + BLSTM × 2 + Downsampler} indicates that the encoder includes a weighted average layer, a connection layer (Concat), 3 convolution regularization layers (Conv1D + BatchNormalization), 2 BLSTM layers, and a down-sampling layer (Downsampler).
For example, the input to the encoder may include the features extracted by the Wav2vec model for 96 frames (an audio segment of about 2 s in total). Each frame feature has dimension 25 × 1024 and is weighted by a learnable coefficient vector, as shown in the following equation:
$$h_{avg} = \sum_{i=1}^{25} \alpha_i h_i$$
where h_avg is the weighted average of the 25 layer vectors, h_i is the i-th layer vector, and α_i is the coefficient of each vector, representing the importance of the corresponding feature; there are 25 coefficients in total. An initial value of each coefficient may be preset, for example 0.1 or 0.3, and the coefficient values are updated during training. After h_avg is obtained, it is concatenated with the speaker feature vector (256 dimensions), input into the convolution regularization layers and then into the BLSTM layers, producing 96 vectors of dimension 2 × d (called bottleneck features). The bottleneck features are then down-sampled to obtain 96/f vectors of dimension 2 × d, for example d = 128 with f = 2, or d = 8 with f = 48; that is, a smaller dimension corresponds to a coarser down-sampling granularity. Through down-sampling, the input features can be decoupled: the speaker features and the text content features are removed, and the emotion features are separated out. By training jointly with the subsequent decoder and phoneme encoder, purer decoupled emotion features can be obtained.
d: decoder
The decoder may include an up-sampling layer, a connection layer, an LSTM layer, a linear layer, and convolution regularization layers. For example, an exemplary decoder structure {Upsampler + Concat + LSTM + Linear + {Conv1D + BatchNormalization} × 5} indicates that the decoder includes an up-sampling layer (Upsampler), a connection layer (Concat), an LSTM layer, a linear layer (Linear), and 5 convolution regularization layers (Conv1D + BatchNormalization). The input to the decoder may include the output of the encoder (96/f vectors of dimension 2 × d); by up-sampling, i.e., repeating each vector f times, 96 vectors of dimension 2 × d are obtained. The up-sampled features are then concatenated with the speaker feature vector and the output of the phoneme encoder and input into the subsequent network modules, and 96 80-dimensional features F1, i.e., the Mel spectrum of the short segment (about 2 s), are obtained at the linear layer. The minimum mean square error L_r1 between the feature F1 and the real Mel spectrum is then calculated. The feature F1 is further input into the subsequent convolution layers to obtain a new Mel spectrum, and its error L_r2 with respect to the real Mel spectrum is calculated. L_r1 and L_r2 can both be computed with the following formula:
$$L_r = \frac{1}{M} \sum_{k=1}^{M} (x_k - y_k)^2$$
where L_r denotes the error between the calculated value and the true value, x_k and y_k denote the k-th dimension of the calculated value and of the true value respectively, and M is the number of dimensions (points) of the feature.
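As a non-limiting illustration, the following PyTorch sketch of the decoder up-samples the encoder output by repeating each frame f times, concatenates it with the speaker and phoneme features, produces the first Mel estimate at the linear layer and a refined Mel estimate after five Conv1D + BatchNormalization layers, and computes both errors as mean squared errors; the hidden and kernel sizes are assumptions made for the example.

```python
# Illustrative sketch: decoder reconstructing the Mel spectrum with two MSE losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelDecoder(nn.Module):
    def __init__(self, d=128, spk_dim=256, pho_dim=256, n_mels=80, f=2):
        super().__init__()
        self.f = f
        self.lstm = nn.LSTM(2 * d + spk_dim + pho_dim, 512, batch_first=True)
        self.linear = nn.Linear(512, n_mels)
        post = []
        for _ in range(5):
            post += [nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2),
                     nn.BatchNorm1d(n_mels)]
        self.postnet = nn.Sequential(*post)

    def forward(self, emo, spk_vec, pho, mel_target):
        # emo: [B, T/f, 2d], spk_vec: [B, spk_dim], pho: [B, T, pho_dim], mel_target: [B, T, 80]
        emo_up = emo.repeat_interleave(self.f, dim=1)             # up-sample to T frames
        spk = spk_vec.unsqueeze(1).expand(-1, emo_up.size(1), -1)
        x, _ = self.lstm(torch.cat([emo_up, spk, pho], dim=-1))
        mel1 = self.linear(x)                                     # first Mel estimate (F1)
        mel2 = self.postnet(mel1.transpose(1, 2)).transpose(1, 2)
        loss_r1 = F.mse_loss(mel1, mel_target)                    # L_r1
        loss_r2 = F.mse_loss(mel2, mel_target)                    # L_r2
        return mel2, loss_r1, loss_r2
```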
e: phoneme coder
The phoneme content of the audio may be obtained through an open-source phoneme alignment model, such as GMM-HMM, DNN-HMM, LSTM-CTC, RNN-T, etc., and then each phoneme is mapped to a corresponding vector.
For example, following the framing and audio segment extraction described above for the encoder, 96 frames of data (i.e., one audio segment) are passed through the phoneme alignment model to obtain the phoneme of each frame. The phonemes are then mapped to 128-dimensional vectors and input into the phoneme encoder, yielding 96 vectors of 256 dimensions.
The phoneme encoder may include convolution regularization layers, BLSTM layers, and the like. An exemplary phoneme encoder structure {{Conv1D + BatchNormalization} × 3 + BLSTM × 2} indicates that the phoneme encoder includes 3 convolution regularization layers and 2 BLSTM layers. The phoneme encoder extracts phoneme features from the phoneme sequence (Phone Sequence).
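As a non-limiting illustration, the following PyTorch sketch of the phoneme encoder embeds per-frame phoneme IDs (obtained from an external alignment model) into 128-dimensional vectors and applies Conv1D + BatchNormalization × 3 followed by BLSTM × 2; the phoneme inventory size and kernel sizes are assumptions.

```python
# Illustrative sketch: phoneme encoder over frame-level phoneme IDs.
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    def __init__(self, n_phonemes=100, emb_dim=128, d=128):
        super().__init__()
        self.emb = nn.Embedding(n_phonemes, emb_dim)
        convs = []
        for _ in range(3):
            convs += [nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                      nn.BatchNorm1d(emb_dim), nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        self.blstm = nn.LSTM(emb_dim, d, num_layers=2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):                  # phoneme_ids: [B, T] (int64)
        x = self.emb(phoneme_ids).transpose(1, 2)    # [B, 128, T]
        x = self.convs(x).transpose(1, 2)            # [B, T, 128]
        out, _ = self.blstm(x)                       # [B, T, 256], one vector per frame
        return out
```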
2: audio emotion classification module
The output of the encoder is connected to an emotion recognition classifier, and the final emotion classification result, i.e., the Speech Emotion Recognition (SER) Emotion Class logits, is obtained through an activation function (e.g., softmax). The loss function is computed with respect to the Emotion Class Label and may use cross entropy, for example:
$$L = -\log \frac{e^{z_c}}{\sum_{j} e^{z_j}}$$
where z_j denotes the output of each node (the score of each emotion) and z_c denotes the output for the target emotion (e.g., happy, one of the 4 values z_j). Adjusting the model parameters increases the probability of the target emotion. The index j enumerates the emotions, for example: happy, sad, heartbroken, neutral, and so on.
The classifier may include: a linear layer, a discard layer, and a linear layer. An exemplary classifier structure includes { Linear 2+ Dropout + Linear 4}, indicating that the classifier includes 2 Linear layers, a drop layer (Dropout), and 4 Linear layers.
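As a non-limiting illustration, the following PyTorch sketch implements the linear → dropout → linear classification pattern over the encoder output; the hidden width, dropout rate, number of emotion classes, and the mean pooling over time are assumptions made for the example.

```python
# Illustrative sketch: audio emotion classifier (linear -> dropout -> linear).
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, in_dim=256, hidden=64, n_classes=4, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden),
                                 nn.Dropout(p_drop),
                                 nn.Linear(hidden, n_classes))

    def forward(self, emo_feats):                    # emo_feats: [B, T', in_dim]
        # Pooling over time before classification is an assumption for this sketch.
        return self.net(emo_feats.mean(dim=1))       # [B, n_classes] logits

# Cross-entropy training loss, matching the formula above:
# loss = nn.CrossEntropyLoss()(logits, emotion_labels)
```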
3: text emotion classification module
a: the textual emotion classification model used by this module may include: CNN layer, Mean Pooling layer (Mean Pooling), flattening layer (Flatten), and linear layer. The textual emotion classification model may also be referred to as a textual emotion recognition model (TER model). An exemplary structure of the text emotion classification model includes { CNN × 3+ Mean posing + scatter + Linear }, which indicates that the text emotion classification model includes. The feature vectors corresponding to all the words are input by the text emotion classification model, and the dimension is L. The feature vector can be obtained through the pre-training of a Transformer network, and the output is the predicted emotion type. The loss function may employ cross-entropy computation and may be a whole sentence based training and may not be segment based (96 frames) training as in audio emotion classification.
b: the module is equivalent to mapping the text to a high-dimensional space through a neural network, representing the content information represented by the text, the context associated information and semantic information of the text, and extracting emotion information by integrating the information.
Score Fusion can be performed on the results of speech emotion recognition and text emotion recognition to obtain the Multimodal Emotion Class Probabilities corresponding to the audio.
The specific structures of the various models in the embodiments of the present disclosure are merely examples, and are not limited, and in practical applications, the structures such as the number of layers and the types of layers of the models may be flexibly modified according to specific requirements.
In practical applications, the network architecture can involve the following processes: pre-training, training, and testing. Each process is described below with reference to fig. 16:
s1601: pre-training
S1601 a: pre-training the Wav2vec model: obtained by pre-training a model structure such as { Transformer Layer 25 }. A large amount of audio data such as LibriVox is adopted, no label is given for about 6 ten thousand hours, the model is subjected to unsupervised training, and after iteration and multiple rounds of convergence, a layered feature extractor is obtained. The type and 6 ten thousand hours of the audio data are not labeled, but are only examples, not limitations, and can be flexibly selected according to the requirements of practical applications.
S1601 b: pre-training speaker feature extractor: training a speaker classifier on a certain model structure such as { LSTM + Linear + FC } by adopting open source data such as VoxColeb 1, VoxColeb 2, Librisipeech and the like to obtain an extractor of speaker characteristics (speaker identification embedding) in the system.
S1601 c: pre-training the phoneme alignment model: by using the open source data Librispeech, Aishell, etc., training on models such as GMM-HMM, DNN-HMM, LSTM-CTC, RNN-T, etc., a model for recognizing the phoneme content of audio can be obtained.
S1602: training
S1602 a: each piece of audio is framed (e.g., 40ms in length, 20ms in frame shift) using open source data such as IEMOCAP, with emotion labeling and corresponding content labeling for each piece of audio. And extracting the Wav2vec feature w by taking 96 frames each time. And, speaker characteristics s are extracted for the whole audio, and corresponding phoneme characteristics p are extracted for 96 frames of audio through a phoneme alignment model. Wherein, 96 frames extracted each time can be used as an audio clip processed once.
S1602 b: the Wav2vec feature w is input to the encoder and concatenated with the feature s. The emotional characteristic e is then obtained by downsampling the spliced characteristic. And inputting the features e into an emotion classification network (such as the classifier) to perform emotion class identification calculation, and obtaining the loss Le through cross entropy. Then, the feature e is inputted to a decoder, and a sound spectrum (Mel spectrum) feature is reconstructed together with the phoneme feature and the speaker feature, and the minimum mean square error L is calculated respectivelyr1And Lr2. Thus, the network parameters are updated according to the random gradient descent criterion and the back propagation gradient. For example, updating parameters of at least one of the encoder, the decoder, and the phoneme alignment model.
S1602 c: repeating the training steps such as S1601 and S1602 on each segment, and iterating for multiple rounds until convergence is reached, so as to obtain a speech emotion classification model with emotion, speaker and content decoupling.
S1602 d: and inputting the content of each audio into a text emotion classifier, training a model, calculating cross entropy loss, reversely updating the network, and repeating iteration for multiple times until convergence is achieved, so that an emotion classification model of the text is obtained.
S1603: testing
S1603 a: for the test audio, speaker characteristics s are extracted. Also, the test audio may be framed, extracting Wav2v every 96 framesAnd ec characteristic. Then, combining the speaker characteristics to input into the encoder, the audio emotion classifier, and obtaining the emotion recognition result (e.g. probability in four types of emotions) R of the segment. Assuming that the audio has N96 frames, averaging the N calculated Rs can obtain the emotion classification result p of the whole audios
S1603 b: inputting the text content of the audio into a text emotion classifier to obtain a corresponding emotion recognition result pt
S1603 c: weighting the emotion recognition results of the voice and the text to obtain a final recognition result pfSee the following equation:
pf=w1·ps+w2·pt
wherein, w1And w2May be empirical, and may take, for example, 0.6 and 0.9, respectively.
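As a non-limiting illustration, the following Python sketch averages the N per-segment audio probabilities and fuses them with the text probabilities using the weights given above; the final renormalization is an assumption added for the example.

```python
# Illustrative sketch: score fusion of audio and text emotion probabilities.
import numpy as np

def fuse_scores(segment_probs, text_probs, w1=0.6, w2=0.9):
    p_s = np.mean(np.stack(segment_probs), axis=0)   # average over the N segments
    p_f = w1 * p_s + w2 * text_probs                 # p_f = w1 * p_s + w2 * p_t
    return p_f / p_f.sum()                           # renormalization (assumption)

# Example: fuse_scores([np.array([.1, .2, .3, .4])] * 3, np.array([.25, .25, .25, .25]))
```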
In the embodiments of the present disclosure, the emotion features, speaker features, and content features in the audio are stripped apart and decoupled by reconstructing the sound spectrum and by multi-task learning, so that the correlation of the emotion features with the other features is reduced to the greatest extent and the accuracy of the emotion features is improved. In addition, combining the result of emotion recognition on the text semantics further improves the overall accuracy of emotion recognition.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 17 shows a schematic block diagram of an example electronic device 1700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 17, the device 1700 includes a computing unit 1701 that may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1702 or a computer program loaded from a storage unit 1708 into a random access memory (RAM) 1703. The RAM 1703 may also store various programs and data required for the operation of the device 1700. The computing unit 1701, the ROM 1702, and the RAM 1703 are connected to each other through a bus 1704. An input/output (I/O) interface 1705 is also connected to the bus 1704.
Various components in the device 1700 are connected to the I/O interface 1705, including: an input unit 1706 such as a keyboard, a mouse, and the like; an output unit 1707 such as various types of displays, speakers, and the like; a storage unit 1708 such as a magnetic disk, optical disk, or the like; and a communication unit 1709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1709 allows the device 1700 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1701 executes the various methods and processes described above, such as the speech emotion recognition model training method or the emotion recognition method. For example, in some embodiments, the speech emotion recognition model training method or emotion recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1700 via the ROM 1702 and/or the communication unit 1709. When the computer program is loaded into the RAM 1703 and executed by the computing unit 1701, one or more steps of the speech emotion recognition model training method or emotion recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 1701 may be configured to perform the speech emotion recognition model training method or the emotion recognition method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (39)

1. A speech emotion recognition model training method comprises the following steps:
obtaining a first feature and a second feature of a sample audio, wherein the first feature is used for characterizing features related to a waveform of the sample audio, and the second feature is used for characterizing features related to a speaker of the sample audio;
performing emotional feature decoupling using the first feature and the second feature;
and performing emotion recognition training by using the emotion characteristics obtained by decoupling to obtain a trained voice emotion recognition model.
2. The method of claim 1, obtaining a first feature of sample audio comprising:
extracting the first feature from the sample audio using a waveform-to-vector Wav2vec model.
3. The method of claim 2, wherein extracting the first feature from the sample audio using a Wav2vec model comprises:
framing the sample audio to obtain a plurality of first audio frames;
extracting at least one audio segment from the plurality of first audio frames and inputting the audio segment into the Wav2vec model to obtain the first feature, wherein the first feature comprises Wav2vec features of the audio segment.
4. The method of any of claims 1 to 3, obtaining the second feature of the sample audio comprising:
extracting the second feature from the sample audio using a speaker classification model.
5. The method of claim 4, wherein extracting the second features from the sample audio using a speaker classification model comprises:
framing the sample audio to obtain a plurality of second audio frames;
and inputting the second audio frames into the speaker classification model to obtain the second characteristic.
6. The method of any of claims 1-5, wherein utilizing the first and second features for emotional feature decoupling comprises:
inputting the first characteristic and the second characteristic into an encoder for encoding processing so as to realize emotional characteristic decoupling.
7. The method of claim 6, wherein the encoder comprises a weight average layer, a first connection layer, a first convolution regularization layer, a bidirectional long short-term memory (BLSTM) layer and a down-sampling layer, and inputting the first feature and the second feature into the encoder for encoding processing comprises:
inputting the first feature into the weighted average layer;
inputting the second characteristic into the first connection layer;
splicing the output characteristic of the weight average layer and the second characteristic at the first connection layer to obtain a first splicing characteristic;
and inputting the first splicing characteristic into the first convolution regularization layer, the BLSTM layer and the down-sampling layer which are connected in series, and sequentially processing to obtain the output characteristic of the encoder, wherein the output characteristic of the encoder comprises the decoupled emotion characteristic.
8. The method of claim 6 or 7, wherein the method further comprises: extracting the phoneme features from the sample audio by using a phoneme alignment model;
performing emotional feature decoupling using the first feature and the second feature, further comprising:
inputting the output characteristic of the encoder, the second characteristic and the phoneme characteristic into a decoder for decoding;
updating parameters of at least one of the encoder, the decoder, and the phoneme alignment model using output characteristics of the decoder.
9. The method of claim 8, wherein the decoder comprises an upsampling layer, a second connecting layer, a long short-term memory (LSTM) layer, a first linear layer and a second convolution regularization layer, and inputting the output characteristics of the encoder, the second characteristics and the phoneme characteristics into the decoder for decoding processing comprises the following steps:
inputting the output characteristics of the encoder into the up-sampling layer for up-sampling;
inputting the second feature and the phoneme feature into the second connection layer;
splicing the features obtained by up-sampling, the second features and the phoneme features at the second connecting layer to obtain second splicing features;
inputting the second splicing characteristic into the LSTM layer and the first linear layer which are connected in series for processing in sequence to obtain a first Mel spectral characteristic and a first error;
and inputting the first Mel spectral feature into the second convolution regularization layer to carry out convolution regularization processing, so as to obtain a second Mel spectral feature and a second error.
10. The method of claim 9, wherein updating parameters of at least one of the encoder, the decoder, and the phoneme alignment model using output characteristics of the decoder comprises:
updating parameters of at least one of the encoder, the decoder, and the phoneme alignment model according to a stochastic gradient descent criterion using the first error and the second error.
11. The method of any one of claims 6 to 10, wherein the emotion recognition training using the decoupled emotional features comprises:
and inputting the output characteristics of the encoder into an emotion recognition classifier to recognize emotion types, and updating parameters of the encoder and the emotion recognition classifier by using emotion type recognition results.
12. The method of claim 11, wherein the emotion recognition classifier comprises a second linear layer, a discarding layer and a third linear layer, the output features of the encoder are input to the emotion recognition classifier for emotion class recognition, and parameters of the encoder and the emotion recognition classifier are updated using emotion class recognition results, comprising:
inputting the output characteristics of the encoder into the second linear layer, the discarding layer and the third linear layer which are connected in series for processing in sequence to obtain an emotion recognition result;
calculating loss by using the cross entropy of the emotion recognition result;
updating parameters of the encoder and/or the emotion recognition classifier with the loss.
13. A method of emotion recognition, comprising:
acquiring a first feature and a second feature of the audio to be recognized, wherein the first feature is used for characterizing features related to the waveform of the audio to be recognized, and the second feature is used for characterizing features related to a speaker of the audio to be recognized;
inputting the first characteristic and the second characteristic into a speech emotion recognition model for emotion category recognition to obtain a first recognition result;
the speech emotion recognition model is trained by the training method of any one of claims 1 to 13.
14. The method of claim 13, wherein obtaining the first feature of the audio to be identified comprises:
and extracting the first feature from the audio to be recognized by using a Wav2vec model.
15. The method of claim 14, wherein extracting the first feature from the audio to be recognized using a Wav2vec model comprises:
framing the audio to be identified to obtain a plurality of first audio frames;
extracting at least one audio segment from the plurality of first audio frames and inputting the audio segment into the Wav2vec model to obtain the first feature, wherein the first feature comprises the Wav2vec feature of the audio segment.
16. The method of any of claims 13 to 15, wherein obtaining the second characteristic of the audio to be identified comprises:
extracting the second feature from the audio to be recognized by using a speaker classification model.
17. The method of claim 16, wherein extracting the second features from the audio to be recognized using a speaker classification model comprises:
framing the audio to be identified to obtain a plurality of second audio frames;
and inputting the second audio frames into the speaker classification model to obtain the second characteristic.
18. The method of any of claims 13 to 17, wherein the method further comprises:
performing text recognition on text content corresponding to the audio to be recognized by using a text emotion recognition model to obtain a second recognition result;
and weighting the first recognition result and the second recognition result to obtain a third recognition result.
19. A speech emotion recognition model training apparatus comprising:
an obtaining module, configured to obtain a first feature and a second feature of a sample audio, where the first feature is used to characterize a feature related to a waveform of the sample audio, and the second feature is used to characterize a feature related to a speaker of the sample audio;
a decoupling module for performing emotional feature decoupling using the first feature and the second feature;
and the training module is used for performing emotion recognition training by using the emotion characteristics obtained by decoupling to obtain a trained voice emotion recognition model.
20. The apparatus of claim 19, the acquisition module to extract the first feature from the sample audio using a Wav2vec model.
21. The apparatus of claim 20, wherein the means for obtaining comprises:
the first framing submodule is used for framing the sample audio to obtain a plurality of first audio frames;
a first feature extraction sub-module, configured to extract at least one audio segment from the plurality of first audio frames and input the at least one audio segment into the Wav2vec model to obtain the first feature, where the first feature includes a Wav2vec feature of the audio segment.
22. The apparatus of any of claims 19-21, the obtaining means to extract the second features from the sample audio using a speaker classification model.
23. The apparatus of claim 22, wherein the means for obtaining comprises:
the second framing submodule is used for framing the sample audio to obtain a plurality of second audio frames;
and the second feature extraction submodule is used for inputting the second audio frames into the speaker classification model to obtain the second features.
24. The apparatus of any of claims 19 to 23, wherein the decoupling module comprises:
and the coding submodule is used for inputting the first characteristic and the second characteristic into a coder for coding processing so as to realize emotion characteristic decoupling.
25. The apparatus of claim 24, wherein the encoder comprises a weight averaging layer, a first connection layer, a first convolution regularization layer, a BLSTM layer, and a downsampling layer, the encoding sub-module to:
inputting the first feature into the weighted average layer;
inputting the second characteristic into the first connection layer;
splicing the output characteristic of the weight average layer and the second characteristic at the first connection layer to obtain a first splicing characteristic;
and inputting the first splicing characteristic into the first convolution regularization layer, the BLSTM layer and the down-sampling layer which are connected in series, and sequentially processing to obtain the output characteristic of the encoder, wherein the output characteristic of the encoder comprises an emotional characteristic.
26. The apparatus of claim 24 or 25, wherein the apparatus further comprises: a phoneme feature extraction module, configured to extract the phoneme features from the sample audio using a phoneme alignment model;
the decoupling module further comprises:
the decoding submodule is used for inputting the output characteristics of the encoder, the second characteristics and the phoneme characteristics into a decoder for decoding processing;
an updating sub-module for updating parameters of at least one of the encoder, the decoder and the phoneme alignment model using the output characteristics of the decoder.
27. The apparatus of claim 26, wherein the decoder comprises an upsampling layer, a second connection layer, an LSTM layer, a first linear layer, and a second convolution regularization layer, the decoding sub-module to:
inputting the output characteristics of the encoder into the up-sampling layer for up-sampling;
inputting the second feature and the phoneme feature into the second connection layer;
splicing the features obtained by up-sampling, the second features and the phoneme features at the second connecting layer to obtain second splicing features;
inputting the second splicing characteristic into the LSTM layer and the first linear layer which are connected in series for processing in sequence to obtain a first Mel spectral characteristic and a first error;
and inputting the first Mel spectral feature into the second convolution regularization layer to carry out convolution regularization processing, so as to obtain a second Mel spectral feature and a second error.
28. The apparatus of claim 27, wherein the update submodule is to:
updating parameters of at least one of the encoder, the decoder, and the phoneme alignment model according to a stochastic gradient descent criterion using the first error and the second error.
29. The apparatus of any of claims 24 to 28, wherein the training module is to:
and inputting the output characteristics of the encoder into an emotion recognition classifier to recognize emotion types, and updating parameters of the encoder and the emotion recognition classifier by using emotion type recognition results.
30. The apparatus of claim 29, wherein the emotion recognition classifier comprises a second linear layer, a discard layer, and a third linear layer, the output features of the encoder are input to the emotion recognition classifier for emotion class recognition, the training module comprises:
the emotion recognition sub-module is used for inputting the output characteristics of the encoder into the second linear layer, the discarding layer and the third linear layer which are connected in series to be sequentially processed to obtain an emotion recognition result;
the loss calculation submodule is used for calculating loss by utilizing the cross entropy of the emotion recognition result;
a parameter update sub-module for updating parameters of the encoder and/or the emotion recognition classifier with the loss.
31. An emotion recognition apparatus comprising:
an obtaining module, configured to obtain a first feature and a second feature of an audio to be recognized, where the first feature is used to characterize a feature related to a waveform of the audio to be recognized, and the second feature is used to characterize a feature related to a speaker of the audio to be recognized;
the first recognition module is used for inputting the first characteristic and the second characteristic into a speech emotion recognition model to perform emotion category recognition to obtain a first recognition result;
wherein the speech emotion recognition model is a speech emotion recognition model obtained by training with the training apparatus of any one of claims 20 to 32.
32. The apparatus of claim 31, wherein the obtaining means is configured to extract the first feature from the audio to be recognized using a Wav2vec model.
33. The apparatus of claim 32, wherein the means for obtaining comprises:
the first framing submodule is used for framing the audio to be identified to obtain a plurality of first audio frames;
a first extraction sub-module, configured to extract at least one audio segment from the plurality of first audio frames and input the at least one audio segment into the Wav2vec model to obtain the first feature, where the first feature includes a Wav2vec feature of the audio segment.
34. The apparatus according to any of claims 31-33, wherein the obtaining means is configured to extract the second feature from the audio to be recognized using a speaker classification model.
35. The apparatus of claim 34, wherein the means for obtaining comprises:
the second framing submodule is used for framing the audio to be identified to obtain a plurality of second audio frames;
and the second extraction submodule is used for inputting the second audio frames into the speaker classification model to obtain the second characteristics.
36. The apparatus of any one of claims 31 to 35, wherein the apparatus further comprises:
the second recognition module is used for performing text recognition on text contents corresponding to the audio to be recognized by using the text emotion recognition model to obtain a second recognition result;
and the processing module is used for weighting the first recognition result and the second recognition result to obtain a third recognition result.
37. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-18.
38. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-18.
39. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-18.
CN202210153023.0A 2022-02-18 2022-02-18 Voice emotion recognition model training method, emotion recognition method, device and equipment Pending CN114495915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210153023.0A CN114495915A (en) 2022-02-18 2022-02-18 Voice emotion recognition model training method, emotion recognition method, device and equipment


Publications (1)

Publication Number Publication Date
CN114495915A true CN114495915A (en) 2022-05-13

Family

ID=81481884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210153023.0A Pending CN114495915A (en) 2022-02-18 2022-02-18 Voice emotion recognition model training method, emotion recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN114495915A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination