WO2023009020A1 - Neural network training and segmenting an audio recording for emotion recognition - Google Patents
Neural network training and segmenting an audio recording for emotion recognition
- Publication number
- WO2023009020A1 (PCT/RU2021/000316)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- neural network
- emotion
- utterances
- assessors
- training
- Prior art date
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 50
- 238000012549 training Methods 0.000 title claims abstract description 34
- 230000008909 emotion recognition Effects 0.000 title claims abstract description 25
- 230000008451 emotion Effects 0.000 claims abstract description 82
- 238000000034 method Methods 0.000 claims abstract description 34
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 24
- 230000000306 recurrent effect Effects 0.000 claims abstract description 12
- 239000012634 fragment Substances 0.000 claims description 27
- 239000013598 vector Substances 0.000 claims description 12
- 238000001914 filtration Methods 0.000 claims description 7
- 230000000694 effects Effects 0.000 claims description 4
- 230000008014 freezing Effects 0.000 abstract 1
- 238000007710 freezing Methods 0.000 abstract 1
- 230000006870 function Effects 0.000 description 7
- 230000002269 spontaneous effect Effects 0.000 description 6
- 230000002996 emotional effect Effects 0.000 description 5
- 239000000203 mixture Substances 0.000 description 3
- 230000004913 activation Effects 0.000 description 2
- 238000001994 activation Methods 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 238000002372 labelling Methods 0.000 description 2
- 238000012706 support-vector machine Methods 0.000 description 2
- 230000002776 aggregation Effects 0.000 description 1
- 238000004220 aggregation Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000008921 facial expression Effects 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 238000013526 transfer learning Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000001131 transforming effect Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
This invention relates to a method of training a neural network for the purpose of emotion recognition in speech segments and to a system for segmenting speech and recognizing an emotion in said speech segments; more particularly, the invention is directed to selecting speech segments with a required emotion from long audio recordings. The presented method of training a neural network for the purpose of emotion recognition in a speech segment includes the following steps: freezing an OpenL3 convolutional neural network; forming a labeled utterances database containing utterances not exceeding 10 seconds in length, wherein a corresponding emotion label or a noise label is attributed to each utterance using assessors, and wherein the assessors are a group of assessors excluding those who do not meet the Fleiss' Kappa agreement level of 0.4; training a low-capacity recurrent neural network built on said pre-trained OpenL3 convolutional neural network using the formed labeled utterances database; and unfreezing the upper layers of said pre-trained OpenL3 convolutional neural network for further training of the neural network.
Description
NEURAL NETWORK TRAINING AND SEGMENTING AN AUDIO RECORDING FOR EMOTION RECOGNITION
TECHNICAL FIELD
This invention relates to a method of training a neural network for the purpose of emotion recognition in speech segments and to a system for segmenting an audio recording and recognizing emotions in said speech segments; in particular, the invention is directed to selecting speech segments with a required emotion from long audio recordings.
BACKGROUND OF THE INVENTION
Accounting for the client’s emotional state in voice assistants’ operating scenarios, in particular a robot’s empathic (adequate) response to the client’s (interlocutor’s) emotions, as well as the timely resolution of controversial situations, plays an increasingly important role in the Internet industry.
The most common modelling and classifying methods in the field of speech emotion recognition are Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Support Vector Machines (SVM) and Deep Neural Networks (DNN).
One of the main criteria for the successful training of a deep neural network for speech-based emotion recognition is an abundance of training examples. However, since emotional speech data is hard to obtain and difficult to label, databases containing labeled speech are often limited in volume.
People display emotions through speech significantly less often than through facial expressions, which is conditioned by cultural norms. As a result, the vast majority of processed spontaneous speech material does not contain emotions. Furthermore, displays of spontaneous emotions in speech differ significantly from histrionic “actors’” emotions, which is why neural networks trained on “actors’” emotions will have low accuracy.
The difficulty of speech data labeling arises from the fact that labels for each emotion are set manually, and the assessors are fairly subjective in their interpretation of displays of emotions in speech, i.e., the same display of emotions is often interpreted differently by different assessors. Thus, most data containing labeled speech does not have definitive labeling, because assessors’ opinions often differ.
These problems of creating a database lead to insufficient training of deep neural networks and, consequently, to low precision of emotion evaluation in an audio recording.
Furthermore, controversial situations that arise when a client communicates with a voice assistant are time-consuming to analyze and resolve, since speech emotion detection generally attributes an emotion to the entire audio recording.
Therefore, there is a need for a solution that enables high-accuracy speech emotion recognition and speeds up the listening to controversial sessions for effective interaction with the client.
SUMMARY OF THE INVENTION
In the first aspect of the present invention, a method of training a neural network for the purpose of emotion recognition in a speech segment is provided. The method comprises: freezing an OpenL3 convolutional neural network that was pre-trained on a large amount of unlabeled data in a self-training mode; forming a labeled utterances database containing utterances not exceeding 10 seconds in length, wherein a corresponding emotion label or a noise label is attributed to each utterance using assessors, a hard label technique is used for utterances to which the majority of said assessors attributed the same emotion label, a soft label technique is used for the rest of the utterances, and the assessors are a group of assessors excluding those who do not meet the Fleiss’ Kappa agreement level of 0.4; training a low-capacity recurrent neural network built on said pre-trained OpenL3 convolutional neural network using the formed labeled utterances database, wherein the utterances with hard labels are transmitted successively in batches using the cross-entropy loss function and the utterances with soft labels using the mean squared error (MSE) loss function; and unfreezing the upper layers of said pre-trained OpenL3 convolutional neural network for further training of the neural network.
The presented method of training a neural network can provide a neural network capable of recognizing an emotion for each speech segment selected from an audio recording.
According to one of the embodiments, a hard label technique is used for utterances that have been attributed the same emotion label by 80% of the assessors.
According to another embodiment, an emotion label indicates anger, sadness, happiness or neutrality.
In a second aspect of the present invention, a system for segmenting an audio recording and for emotion recognition in a speech segment is provided. The system comprises: a Voice Activity Detector (VAD) unit configured to select utterances from an audio recording, an utterance splitter unit configured to split utterances into 3-second fragments, an emotion recognition unit comprising a neural network trained according to the method set forth in claim 1, the neural network comprising a mel-spectrogram providing unit configured to provide a mel spectrogram from the fragments, an OpenL3 convolutional neural network unit configured to transform the mel spectrogram into a sequence of vectors with a dimension of 512, and a low-capacity recurrent neural network unit, which is configured to form, from the obtained vector sequence, a vector of the probability of presence of a corresponding emotion or noise in each fragment. The system also comprises a filtering unit configured to determine a probability value of presence of a corresponding emotion in each fragment, wherein the probability value is used for filtering each fragment comprising a corresponding emotion using a threshold to detect intensity of an emotion and to join successive fragments with the same emotion into segments.
The presented system allows segmenting an audio recording into short speech segments and recognizing an emotion for each selected speech segment.
According to one of the embodiments, one threshold value is used for different emotions. According to another embodiment, a corresponding threshold is used for each emotion.
According to a further embodiment, each segment comprising a corresponding emotion has information about its start and end time in the audio recording.
BRIEF DESCRIPTION OF THE DRAWINGS
The subject matter of the invention is further explained through non-limiting embodiments with a reference to the accompanying drawings, wherein:
Fig. 1 is a flow-chart of a method of training a neural network for the purpose of emotion recognition in speech segments,
Fig. 2 is a flow-chart of forming a labeled utterances database,
Fig. 3 is the architecture of a trained neural network for the purpose of emotion recognition in a speech segment,
Fig. 4 is a system for segmenting an audio recording and recognizing an emotion in said speech segments.
DETAILED DESCRIPTION
A method of training a neural network for the purpose of emotion recognition in a speech segment according to various embodiments of the present invention can be implemented using, for example, known computer or multiprocessor systems. In other embodiments, the claimed method can be implemented through customized hardware and software.
Fig. 1 shows a flow-chart 10 which can be followed to implement the method of training a neural network for the purpose of emotion recognition in a speech segment (hereinafter, the method of training a neural network) according to one of the embodiments of the present invention.
Training a neural network requires a training database, namely, a labeled utterances database (Fig. 1, block 100). The flow-chart for forming the labeled utterances database is shown in Fig. 2. For forming the labeled utterances database, call-center audio recordings containing spontaneous emotions in a dialogue are used as source data (Fig. 2, block 101). Also, any other audio recordings containing emotional spontaneous speech can be used, such as audio recordings of interviews with different people and talk show audio recordings. Using audio recordings containing “actors’” (histrionic) emotions for training a neural network will reduce the emotion recognition precision.
Traditional emotion recognition models use the entire audio recording or an audio recording split into rather long utterances, since these audio recordings or utterances contain more properties that enable detecting a corresponding emotion. Therefore, said models detect an emotion of the entire audio recording.
The presented method uses an audio recording split into shorter utterances of a fixed duration. In particular, the presented method of training a neural network is directed to recognizing an emotion of each segment of an audio recording separately. The task of recognizing emotions in short speech segments is more challenging than in traditional methods, since there are fewer properties that enable recognizing a corresponding emotion in a short segment, but the advantage is the gained ability to single out emotions more accurately in cases where there is a mix of emotions. In a preferred embodiment, audio recordings containing emotional spontaneous speech are split into utterances of up to 10 seconds using a Voice Activity Detector (VAD) unit (Fig. 2, block 102). Subsequently, only a 3-second arbitrarily chosen speech fragment is used from each of said utterances.
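As an illustration of this splitting step, the sketch below uses a simplified energy-based detector in place of the VAD unit, groups voiced frames into utterances capped at 10 seconds and draws an arbitrary 3-second fragment from each utterance. The frame length, energy threshold and padding behaviour are assumptions, not details given in the description.

```python
import numpy as np

def split_into_utterances(signal, sr, frame_ms=30, energy_thr=1e-4, max_len_s=10.0):
    """Simplified energy-based stand-in for the VAD unit: groups consecutive voiced
    frames into utterances of at most 10 seconds (threshold and frame size are assumptions)."""
    frame = int(sr * frame_ms / 1000)
    voiced = [float(np.mean(signal[i:i + frame] ** 2)) > energy_thr
              for i in range(0, len(signal) - frame, frame)]
    utterances, start = [], None
    for i, v in enumerate(voiced):
        pos = i * frame
        if v and start is None:
            start = pos                                   # a new utterance begins
        if start is not None and (not v or pos - start >= max_len_s * sr):
            utterances.append(signal[start:pos])          # close the current utterance
            start = pos if v else None
    if start is not None:
        utterances.append(signal[start:])
    return utterances

def random_3s_fragment(utterance, sr, length_s=3.0):
    """Pick an arbitrarily chosen 3-second speech fragment (padded if the utterance is shorter)."""
    need = int(length_s * sr)
    if len(utterance) <= need:
        return np.pad(utterance, (0, need - len(utterance)))
    start = np.random.randint(0, len(utterance) - need)
    return utterance[start:start + need]
```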
As stated above, the used audio recordings contain spontaneous speech interactions in which the boundaries between emotions (emotion classes) are blurred and mixed. Therefore, for recognizing the emotion of each utterance (utterance segment) with high precision, it is important to ensure agreement between the assessors attributing a corresponding emotion label to an utterance. In a preferred embodiment, at least 7 assessors attribute corresponding emotion labels to utterances (Fig. 2, block 103). For a cluster of utterances (at least 100), the at least 5 assessors with the highest Fleiss’ kappa agreement are selected, reaching the agreement level of 0.4, which allows excluding assessors whose idea of the display of a certain emotion in speech differs from that of the majority (Fig. 2, block 104).
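A minimal sketch of this assessor-selection step is shown below: Fleiss’ kappa is computed from the per-category vote counts, and the 5-assessor subset with the highest agreement is kept only if it reaches the 0.4 level. The exhaustive search over subsets is an assumption about how the selection could be carried out, not a detail given in the description.

```python
import numpy as np
from itertools import combinations

def fleiss_kappa(counts):
    """counts: (n_utterances, n_categories) matrix of votes per category."""
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)                       # category prevalence
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1.0 - P_e)

def select_assessors(labels, n_categories, group_size=5, min_kappa=0.4):
    """labels: (n_utterances, n_assessors) matrix of integer label ids
    attributed by at least 7 assessors to a cluster of at least 100 utterances."""
    best_group, best_kappa = None, -1.0
    for group in combinations(range(labels.shape[1]), group_size):
        counts = np.stack([np.bincount(row, minlength=n_categories)
                           for row in labels[:, group]])
        kappa = fleiss_kappa(counts)
        if kappa > best_kappa:
            best_group, best_kappa = group, kappa
    # keep the most agreeing subset only if it reaches the 0.4 agreement level
    return (best_group, best_kappa) if best_kappa >= min_kappa else (None, best_kappa)
```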
Aside from the emotion label, which represents anger, sadness, happiness or neutrality, assessors also attribute a noise label to the corresponding utterances. The noise label is needed for VAD error tolerance. In particular, when splitting an audio recording into speech utterances, the VAD can mistake noise interference (microphone, wind and other noise) for speech. Therefore, if noise is not taken into account in the neural network training, the network will have a high error rate when subsequently used. The presented method of training the neural network provides for the presence of noise between utterances, which allows minimizing the neural network error; more particularly, it helps prevent an emotion label, such as anger, from being attributed to a noise-containing utterance, as their properties share some similarities.
The value of emotion intensity (anger, sadness, happiness, neutrality) in an utterance is determined by the percentage of the assessors who voted for this emotion (Fig. 2, block 105). More particularly, a hard label technique is used for utterances to which the majority of said assessors attributed the same emotion label. In a preferred embodiment, a hard label technique is used for utterances that have been attributed the same emotion label by 80% of the assessors. For other, ambiguously labeled utterances, a soft label technique is used, which assigns the emotion label strength proportionally to the percentage of assessors ascribing the utterance to a corresponding emotion label. Utterances that belong to the noise category are labeled only as the noise category.
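A sketch of this labelling rule is given below, under the assumption that votes are stored as category indices and that the 80% share is taken over the retained assessors; the category ordering is illustrative.

```python
import numpy as np

EMOTIONS = ["anger", "sadness", "happiness", "neutrality", "noise"]

def make_label(votes, hard_share=0.8):
    """votes: category indices given to one utterance by the retained assessors.
    Returns (target_vector, is_hard)."""
    counts = np.bincount(votes, minlength=len(EMOTIONS)).astype(float)
    shares = counts / counts.sum()                      # fraction of assessors per label
    top = int(shares.argmax())
    if EMOTIONS[top] == "noise":
        return np.eye(len(EMOTIONS))[top], True         # noise utterances keep only the noise label
    if shares[top] >= hard_share:
        return np.eye(len(EMOTIONS))[top], True         # hard (one-hot) label at 80% agreement
    return shares, False                                # soft label: strength proportional to votes
```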
The trainable neural network is a deep neural network that uses transfer learning, more particularly, between a convolutional neural network (CNN), namely an OpenL3 network pre-trained on a large amount of unlabeled data in a self-training mode, and a low-capacity recurrent neural network that follows said OpenL3 network.
The neural network training is performed in two steps. At the first step (Fig. 1, block 200), the OpenL3 convolutional neural network is frozen (its weights are not updated) and only a low-capacity recurrent neural network built on said OpenL3 convolutional neural network is trained. The weights of the recurrent neural network are randomly initialized. The recurrent neural network is trained using the formed labeled utterances database, wherein the utterances with hard labels and soft labels are transmitted successively in batches using the cross-entropy and mean squared error (MSE) loss functions, respectively. The Adam optimization algorithm is used during the training. The training continues until the loss function on the validation set stabilizes.
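A PyTorch sketch of this first training step is shown below, assuming `backbone` is the frozen OpenL3 feature extractor (already producing the sequence of 512-dimensional vectors) and `head` is the recurrent network described further below; the epoch count, learning rate and batch pairing are assumptions.

```python
import torch
import torch.nn as nn

def train_stage_one(backbone, head, hard_loader, soft_loader, epochs=20, lr=1e-3):
    """Step one: the OpenL3 backbone is frozen and only the recurrent head is trained,
    with cross-entropy for hard-labelled batches and MSE for soft-labelled batches."""
    for p in backbone.parameters():
        p.requires_grad = False                        # freeze the pre-trained CNN
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    opt = torch.optim.Adam(head.parameters(), lr=lr)   # Adam, as stated in the description
    for _ in range(epochs):                            # in practice: until validation loss stabilizes
        for (x_hard, y_hard), (x_soft, y_soft) in zip(hard_loader, soft_loader):
            opt.zero_grad()
            loss = ce(head(backbone(x_hard)), y_hard)                    # y_hard: class indices
            soft_pred = torch.softmax(head(backbone(x_soft)), dim=-1)
            loss = loss + mse(soft_pred, y_soft)                         # y_soft: vote proportions
            loss.backward()
            opt.step()
```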
At the second step (Fig. 1, block 300), the upper layers of said pre-trained OpenL3 convolutional neural network are unfrozen, and these layers, together with the subsequent recurrent neural network, pass through a limited number of training iterations. The architecture of the neural network trained for the purpose of emotion recognition in speech segments is shown in Fig. 3. A MelSpec unit comprises FFT and MEL blocks and is configured to provide a mel spectrogram from speech segments (a fast Fourier transform (FFT) to transition into the frequency domain, followed by frequency aggregation that takes human ear sensitivity into account); it does not require training. An OpenL3 unit comprises CNN Block1 (1, 64), CNN Block2 (64, 128), CNN Block3 (128, 256) and CNN Block4 (256, 512), wherein each CNN Block comprises the following functions in the specified sequence: Conv2d (in, out), BatchNorm2d, ReLU, Conv2d (out, out), BatchNorm1d, ReLU and MaxPool2d. Using the OpenL3 unit, which is a convolutional neural network pre-trained on millions of audio segments, solves the problem of sound pre-processing; in particular, in the presented method it functions as a feature extractor transforming a mel spectrogram into a sequence of vectors with a dimension of 512. It should be noted that, in the prior art, an OpenL3 convolutional neural network was used only for detecting acoustic events or analyzing music and video.
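The block structure above maps naturally onto PyTorch modules. The sketch below is an illustrative reconstruction rather than the actual OpenL3 weights: kernel sizes, mel-spectrogram parameters, the frequency pooling and the number of unfrozen blocks are assumptions, and BatchNorm2d is used after the second convolution for shape compatibility where the text lists BatchNorm1d.

```python
import torch.nn as nn
import torchaudio

def cnn_block(c_in, c_out):
    # Conv2d -> BatchNorm -> ReLU -> Conv2d -> BatchNorm -> ReLU -> MaxPool2d
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

# MelSpec unit: FFT plus mel aggregation, no trainable parameters (parameter values assumed)
mel_spec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024,
                                                hop_length=160, n_mels=128)

# OpenL3 unit: CNN Block1 (1,64) ... CNN Block4 (256,512)
openl3_backbone = nn.Sequential(
    cnn_block(1, 64), cnn_block(64, 128), cnn_block(128, 256), cnn_block(256, 512))

def to_tokens(feature_map):
    # (batch, 512, freq, time) -> (batch, time, 512): average over the frequency axis
    return feature_map.mean(dim=2).transpose(1, 2)

def unfreeze_upper_blocks(backbone, n_blocks=2):
    """Step two: unfreeze the upper CNN blocks for a limited number of further training
    iterations (how many blocks count as 'upper' is an assumption)."""
    for block in list(backbone.children())[-n_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
```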
The sequence of vectors, each representing the state of the fragment’s sound over one second with a half-second overlap, is transferred as tokens to an Emo unit, which is a recurrent neural network, more particularly a long short-term memory (LSTM) network. The Emo unit comprises the following functions in the specified sequence: BatchNorm1d, LSTM, ReLU, Linear, ReLU and Linear. The tokens, passing through two Dense units with ReLU activation, form a probability vector of the presence of the following emotions in the fragment: anger, sadness, happiness, neutrality or noise. The addition of the noise class allows reducing false activations in the case of specific microphone interference.
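A possible PyTorch rendering of the Emo unit is sketched below, assuming 512-dimensional tokens and using the last LSTM state as the fragment summary; the hidden size and pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class EmoHead(nn.Module):
    """BatchNorm1d -> LSTM -> ReLU -> Linear -> ReLU -> Linear, producing logits for
    anger, sadness, happiness, neutrality and noise."""
    def __init__(self, feat_dim=512, hidden=128, n_classes=5):
        super().__init__()
        self.norm = nn.BatchNorm1d(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, tokens):                        # tokens: (batch, seq_len, 512)
        x = self.norm(tokens.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(x)
        x = torch.relu(out[:, -1, :])                 # last time step summarises the fragment
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                            # softmax of these logits gives the probability vector
```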
Fig. 4 shows a system 400 for segmenting an audio recording and recognizing an emotion in a speech segment (hereinafter, the system) in accordance with one of the embodiments of the present invention.
According to the system, utterances are selected using a Voice Activity Detector (VAD) unit (Fig. 4, block 401), and the utterances are split into 3-second fragments using an utterance splitter unit (Fig. 4, block 402).
Then, in an emotion recognition unit (Fig. 4, block 403), probabilities of the emotional state for each fragment are determined using a neural network trained in accordance with the method described above. In particular, the resulting fragments are sent to the mel-spectrogram providing unit, which provides a mel spectrogram from the fragments for further transformation into a sequence of vectors with a dimension of 512 by the OpenL3 convolutional neural network unit; the low-capacity recurrent neural network unit then forms, from the resulting vector sequence, a vector of probabilities of the presence of a corresponding emotion or noise in each fragment.
Afterwards, in a filtering unit (Fig. 4, block 404), a probability value of presence of a corresponding emotion in each fragment is determined, wherein the probability value is used for filtering each fragment comprising a corresponding emotion using a threshold to detect intensity of an emotion. In other words, the threshold affects the precision and recall of emotions comprised in the obtained fragments, and therefore, for each of the obtained fragments, the system determines a level of confidence in the result, wherein the filtering is performed based on this value.
The threshold is selected in the range of 0 to 1 for each segmentation task and can be set to be the same for all emotion classes or different for each emotion, and therefore, a user can adjust the resulting output of the system.
The obtained successive fragments containing the same emotion are joined into segments; for example, successive fragments containing anger can be joined. Individual fragments are also converted into segments.
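A sketch of this filtering-and-merging step is shown below, assuming a single scalar threshold, 3-second fragments and probability rows ordered as (anger, sadness, happiness, neutrality, noise); per-emotion thresholds would work the same way.

```python
import numpy as np

EMOTIONS = ["anger", "sadness", "happiness", "neutrality", "noise"]

def fragments_to_segments(probs, threshold=0.5, frag_len_s=3.0):
    """probs: (n_fragments, 5) probability rows in chronological order.
    Returns a list of (emotion, start_time, end_time) segments."""
    segments, current = [], None
    for i, p in enumerate(probs):
        cls = int(np.argmax(p))
        emotion = EMOTIONS[cls]
        start, end = i * frag_len_s, (i + 1) * frag_len_s
        keep = emotion != "noise" and p[cls] >= threshold    # threshold filters weak emotions
        if keep and current is not None and current[0] == emotion and current[2] == start:
            current = (emotion, current[1], end)             # extend a running segment
        else:
            if current is not None:
                segments.append(current)
            current = (emotion, start, end) if keep else None
    if current is not None:
        segments.append(current)
    return segments
```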
Thus, for example, when the same threshold is set close to 0, a user will receive more segments to analyze, which reduces the probability of missing a segment with the required intensity (degree of manifestation) of an emotion. When a threshold close to 1 is chosen, the neural network will determine segments with higher precision (with emotions that are more intense); the user will receive fewer segments, which speeds up their analysis, but some segments will be overlooked.
When setting the threshold separately for each emotion, a user can select segments containing the target emotion.
Therefore, the presented system allows a user to avoid listening to the entire audio recording and to listen only to selected segments containing the required intensity of the specifically determined emotion.
The presented system allows for high-precision speech emotion recognition and speeds up the process of listening to controversial conversation sessions, for example between a client and a voice assistant. This allows controlling the call-center operation quality and, in particular, determining the problem of the session correctly and resolving it promptly, thus having a positive impact on the client’s impression of the company.
Furthermore, the presented invention can be used for automatically creating a selection of intense moments of an audio or video program, reducing the viewing time.
Also, since each segment including a corresponding emotion has information about its start and end time in the audio recording, the audio recording can be labeled based on emotions.
The present invention is not limited to the specific embodiments disclosed in the specification for the purpose of illustration and encompasses all possible modifications and alternatives that fall under the scope of the present invention defined in the claims.
Claims
1. A method of training a neural network for the purpose of emotion recognition in a speech segment, the method comprising freezing an OpenL3 convolutional neural network pretrained on a large unlabeled amount of data in a self-training mode, forming a labeled utterances database containing utterances not exceeding 10 seconds in length, wherein a corresponding emotion label or a noise label is attributed to each utterance using assessors, wherein a hard label technique is used for utterances to which the majority of said assessors attributed the same emotion label, and a soft label technique is used for the rest of the utterances, wherein the assessors are a group of assessors excluding assessors that do not meet the Fleiss’ Kappa agreement level of 0.4, training a low-capacity recurrent neural network built on said pre-trained OpenL3 convolutional neural network using the formed labeled utterances database, wherein the utterances with hard labels utilizing the cross-entropy loss function and the utterances with soft labels utilizing the mean squared error (MSE) loss function are transmitted successively in batches, unfreezing upper layers of said pre-trained OpenL3 convolutional neural network for further training of the neural network.
2. The method according to claim 1, wherein a hard label technique is used for utterances that have been attributed the same emotion label by 80% assessors.
3. The method according to claim 1, wherein the emotion label indicates anger, sadness, happiness or neutrality.
4. A system for segmenting an audio recording and emotion recognition in a speech segment, the system comprising a Voice Activity Detector (VAD) unit configured to select utterances from the audio recording, an utterance splitter unit configured to split the utterances into 3-second fragments, an emotion recognition unit comprising a neural network trained according to the method of claim 1, the neural network comprising
a mel-spectrogram providing unit configured to provide a mel spectrogram from the fragments, an OpenL3 convolutional neural network unit configured to transform the mel spectrogram into a sequence of vectors with a dimension of 512, and a low-capacity recurrent neural network unit configured to form, from the obtained vector sequence, a vector of probabilities of presence of a corresponding emotion or noise in each fragment, a filtering unit configured to determine a probability value of presence of a corresponding emotion in each fragment, wherein the probability value is used for filtering each fragment comprising a corresponding emotion using a threshold to detect intensity of an emotion and to join successive fragments with the same emotion into segments.
5. The system according to claim 4, wherein the same threshold is used for different emotions.
6. The system according to claim 4, wherein a corresponding threshold is used for each emotion.
7. The system according to claim 4, wherein each segment containing a corresponding emotion includes information about its start and end time in the audio recording.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2021/000316 WO2023009020A1 (en) | 2021-07-26 | 2021-07-26 | Neural network training and segmenting an audio recording for emotion recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2021/000316 WO2023009020A1 (en) | 2021-07-26 | 2021-07-26 | Neural network training and segmenting an audio recording for emotion recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023009020A1 true WO2023009020A1 (en) | 2023-02-02 |
Family
ID=85087100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/RU2021/000316 WO2023009020A1 (en) | 2021-07-26 | 2021-07-26 | Neural network training and segmenting an audio recording for emotion recognition |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023009020A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116306686A (en) * | 2023-05-22 | 2023-06-23 | 中国科学技术大学 | A Method for Empathic Dialogue Generation Guided by Multiple Emotions |
CN119181383A (en) * | 2024-11-20 | 2024-12-24 | 湖南快乐阳光互动娱乐传媒有限公司 | A method and system for evaluating singing level in multiple dimensions |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210192332A1 (en) * | 2019-12-19 | 2021-06-24 | Sling Media Pvt Ltd | Method and system for analyzing customer calls by implementing a machine learning model to identify emotions |
-
2021
- 2021-07-26 WO PCT/RU2021/000316 patent/WO2023009020A1/en active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210192332A1 (en) * | 2019-12-19 | 2021-06-24 | Sling Media Pvt Ltd | Method and system for analyzing customer calls by implementing a machine learning model to identify emotions |
Non-Patent Citations (5)
Title |
---|
ISSA DIAS; FATIH DEMIRCI M.; YAZICI ADNAN: "Speech emotion recognition with deep convolutional neural networks", BIOMEDICAL SIGNAL PROCESSING AND CONTROL, ELSEVIER, AMSTERDAM, NL, vol. 59, 27 February 2020 (2020-02-27), NL , XP086144914, ISSN: 1746-8094, DOI: 10.1016/j.bspc.2020.101894 * |
KANNAN VENKATARAMANAN; HARESH RENGARAJ RAJAMOHAN: "Emotion Recognition from Speech", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 22 December 2019 (2019-12-22), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081565123 * |
SHANSHAN WANG; TONI HEITTOLA; ANNAMARIA MESAROS; TUOMAS VIRTANEN: "Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 January 1900 (1900-01-01), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081970789 * |
SHUIYANG MAO; P. C. CHING; TAN LEE: "Emotion Profile Refinery for Speech Emotion Classification", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 August 2020 (2020-08-12), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081739195 * |
VLADIMIR CHERNYKH; GRIGORIY STERLING; PAVEL PRIHODKO: "Emotion Recognition From Speech With Recurrent Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 27 January 2017 (2017-01-27), 201 Olin Library Cornell University Ithaca, NY 14853 , XP080752039 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116306686A (en) * | 2023-05-22 | 2023-06-23 | 中国科学技术大学 | A Method for Empathic Dialogue Generation Guided by Multiple Emotions |
CN116306686B (en) * | 2023-05-22 | 2023-08-29 | 中国科学技术大学 | A Method for Empathic Dialogue Generation Guided by Multiple Emotions |
CN119181383A (en) * | 2024-11-20 | 2024-12-24 | 湖南快乐阳光互动娱乐传媒有限公司 | A method and system for evaluating singing level in multiple dimensions |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11276407B2 (en) | Metadata-based diarization of teleconferences | |
El-Moneim et al. | Text-independent speaker recognition using LSTM-RNN and speech enhancement | |
Anguera et al. | Speaker diarization: A review of recent research | |
US11100932B2 (en) | Robust start-end point detection algorithm using neural network | |
US7620547B2 (en) | Spoken man-machine interface with speaker identification | |
CN112233680B (en) | Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium | |
Masumura et al. | Online end-of-turn detection from speech based on stacked time-asynchronous sequential networks. | |
Wang et al. | Multi-source domain adaptation for text-independent forensic speaker recognition | |
US12217760B2 (en) | Metadata-based diarization of teleconferences | |
WO2023009020A1 (en) | Neural network training and segmenting an audio recording for emotion recognition | |
Yella et al. | A comparison of neural network feature transforms for speaker diarization. | |
Markov et al. | Never-ending learning system for on-line speaker diarization | |
Jia et al. | A deep learning system for sentiment analysis of service calls | |
CN109065026B (en) | Recording control method and device | |
Stefanidi et al. | Application of convolutional neural networks for multimodal identification task | |
Agrawal et al. | Prosodic feature based text dependent speaker recognition using machine learning algorithms | |
Berdibayeva et al. | Features of speech commands recognition using an artificial neural network | |
Mathur et al. | Unsupervised domain adaptation under label space mismatch for speech classification | |
Kalita et al. | Use of bidirectional long short term memory in spoken word detection with reference to the Assamese language | |
Fan et al. | Automatic emotion variation detection in continuous speech | |
JPH06266386A (en) | Word spotting method | |
Chen et al. | End-to-end speaker-dependent voice activity detection | |
Salah et al. | Kernel function and dimensionality reduction effects on speaker verification system | |
Mporas et al. | Evaluation of classification algorithms for text dependent and text independent speaker identification | |
Aafaq et al. | Multi-Speaker Diarization using Long-Short Term Memory Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21952040 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202392106 Country of ref document: EA |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21952040 Country of ref document: EP Kind code of ref document: A1 |