WO2023009020A1 - Neural network training and segmenting an audio recording for emotion recognition - Google Patents
Neural network training and segmenting an audio recording for emotion recognition
- Publication number
- WO2023009020A1 (PCT/RU2021/000316)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- neural network
- emotion
- utterances
- assessors
- training
- Prior art date
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 50
- 238000012549 training Methods 0.000 title claims abstract description 34
- 230000008909 emotion recognition Effects 0.000 title claims abstract description 25
- 230000008451 emotion Effects 0.000 claims abstract description 82
- 238000000034 method Methods 0.000 claims abstract description 34
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 24
- 230000000306 recurrent effect Effects 0.000 claims abstract description 12
- 239000012634 fragment Substances 0.000 claims description 27
- 239000013598 vector Substances 0.000 claims description 12
- 238000001914 filtration Methods 0.000 claims description 7
- 230000000694 effects Effects 0.000 claims description 4
- 230000008014 freezing Effects 0.000 abstract 1
- 238000007710 freezing Methods 0.000 abstract 1
- 230000006870 function Effects 0.000 description 7
- 230000002269 spontaneous effect Effects 0.000 description 6
- 230000002996 emotional effect Effects 0.000 description 5
- 239000000203 mixture Substances 0.000 description 3
- 230000004913 activation Effects 0.000 description 2
- 238000001994 activation Methods 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 238000002372 labelling Methods 0.000 description 2
- 238000012706 support-vector machine Methods 0.000 description 2
- 230000002776 aggregation Effects 0.000 description 1
- 238000004220 aggregation Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000008921 facial expression Effects 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 238000013526 transfer learning Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000001131 transforming effect Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
This invention relates to a method of training a neural network for the purpose of emotion recognition in speech segments and to a system for segmenting speech and recognizing an emotion in said speech segments; more particularly, the invention is directed to selecting speech segments with a required emotion from long audio recordings. The presented method of training a neural network for the purpose of emotion recognition in a speech segment includes the following steps: freezing an OpenL3 convolutional neural network; forming a labeled utterances database containing utterances not exceeding 10 seconds in length, wherein a corresponding emotion label or a noise label is attributed to each utterance using assessors, and wherein the assessors are a group of assessors excluding those who do not meet the Fleiss' Kappa agreement level of 0.4; training a low-capacity recurrent neural network built on said pre-trained OpenL3 convolutional neural network using the formed labeled utterances database; and unfreezing the upper layers of said pre-trained OpenL3 convolutional neural network for further training of the neural network.
Description
NEURAL NETWORK TRAINING AND SEGMENTING AN AUDIO RECORDING FOR EMOTION RECOGNITION
TECHNICAL FIELD
This invention relates to a method of training a neural network for the purpose of emotion recognition in speech segments and to a system for segmenting an audio recording and recognizing emotions in said speech segments; in particular, the invention is directed to selecting speech segments with a required emotion from long audio recordings.
BACKGROUND OF THE INVENTION
Accounting for the client’s emotional state in voice assistants’ operating scenarios, in particular a robot’s empathic (adequate) response to the client’s (interlocutor’s) emotions, as well as the timely resolution of controversial situations, plays an increasingly important role in the Internet industry.
The most common modelling and classifying methods in the field of speech emotion recognition are Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Support Vector Machines (SVM) and Deep Neural Networks (DNN).
One of the main criteria for the successful training of a deep neural network for speech-based emotion recognition is an abundance of training examples. However, since emotional speech data is hard to obtain and difficult to label, databases containing labeled speech are often limited in volume.
People display emotions through speech significantly less often than through facial expressions, which is conditioned by cultural norms. As a result, the vast majority of processed spontaneous speech material does not contain emotions. Furthermore, displays of spontaneous emotions in speech differ significantly from histrionic “actors’” emotions, which is why neural networks trained on “actors’” emotions will have low accuracy.
The difficulty of speech data labeling arises from the fact that labels for each emotion are set manually, and the assessors are fairly subjective in their interpretation of displays of emotions in speech, i.e., the same display of emotions is often interpreted differently by different assessors. Thus, most data containing labeled speech does not have definitive labeling, because assessors’ opinions often differ.
These problems of creating a database lead to insufficient training of deep neural networks and, consequently, to low precision of emotion evaluation in an audio recording.
Furthermore, controversial situations that arise when a client communicates with a voice assistant are time-consuming to analyze and resolve, since speech emotion detection generally attributes an emotion to the entire audio recording.
Therefore, there is a need for a solution that enables high-accuracy speech emotion recognition and speeds up the listening to controversial sessions for effective interaction with the client.
SUMMARY OF THE INVENTION
In the first aspect of the present invention, a method of training a neural network for the purpose of emotion recognition in a speech segment is provided. The method comprises: freezing an OpenL3 convolutional neural network that was pre-trained on a large amount of unlabeled data in a self-training mode; forming a labeled utterances database containing utterances not exceeding 10 seconds in length, wherein a corresponding emotion label or a noise label is attributed to each utterance using assessors, a hard label technique is used for utterances to which the majority of said assessors attributed the same emotion label, a soft label technique is used for the rest of the utterances, and the assessors are a group of assessors excluding those who do not meet the Fleiss’ Kappa agreement level of 0.4; training a low-capacity recurrent neural network built on said pre-trained OpenL3 convolutional neural network using the formed labeled utterances database, wherein the utterances with hard labels are transmitted successively in batches using the cross-entropy loss function and the utterances with soft labels using the mean squared error (MSE) loss function; and unfreezing the upper layers of said pre-trained OpenL3 convolutional neural network for further training of the neural network.
The presented method of training a neural network can provide a neural network capable of recognizing an emotion for each speech segment selected from an audio recording.
According to one of the embodiments, a hard label technique is used for utterances that have been attributed the same emotion label by 80% of the assessors.
According to another embodiment, an emotion label indicates anger, sadness, happiness or neutrality.
In a second aspect of the present invention, a system for segmenting an audio recording and for emotion recognition in a speech segment is provided. The system comprises: a Voice Activity Detector (VAD) unit configured to select utterances from an audio recording, an utterance splitter unit configured to split utterances into 3-second fragments, an emotion recognition unit comprising a neural network trained according to the method set forth in claim 1, the neural network comprising a mel-spectrogram providing unit configured to provide a mel spectrogram from the fragments, an OpenL3 convolutional neural network unit configured to transform the mel spectrogram into a sequence of vectors with a dimension of 512, and a low-capacity recurrent neural network unit, which is configured to form, from the obtained vector sequence, a vector of the probability of presence of a corresponding emotion or noise in each fragment. The system also comprises a filtering unit configured to determine a probability value of presence of a corresponding emotion in each fragment, wherein the probability value is used for filtering each fragment comprising a corresponding emotion using a threshold to detect intensity of an emotion and to join successive fragments with the same emotion into segments.
The presented system allows segmenting an audio recording into short speech segments and recognizing an emotion for each selected speech segment.
According to one of the embodiments, one threshold value is used for different emotions. According to another embodiment, a corresponding threshold is used for each emotion.
According to a further embodiment, each segment comprising a corresponding emotion has information about its start and end time in the audio recording.
BRIEF DESCRIPTION OF THE DRAWINGS
The subject matter of the invention is further explained through non-limiting embodiments with a reference to the accompanying drawings, wherein:
Fig. 1 is a flow-chart of a method of training a neural network for the purpose of emotion recognition in speech segments,
Fig. 2 is a flow-chart of forming a labeled utterances database,
Fig. 3 is the architecture of a trained neural network for the purpose of emotion recognition in a speech segment,
Fig. 4 is a system for segmenting an audio recording and recognizing an emotion in said speech segments.
DETAILED DESCRIPTION
A method of training a neural network for the purpose of emotion recognition in a speech segment according to various embodiments of the present invention can be implemented using, for example, known computer or multiprocessor systems. In other embodiments, the claimed method can be implemented through customized hardware and software.
Fig. 1 shows a flow-chart 10 which can be followed to implement the method of training a neural network for the purpose of emotion recognition in a speech segment (hereinafter, the method of training a neural network) according to one of the embodiments of the present invention.
Training a neural network requires a training database, namely, a labeled utterances database (Fig. 1, block 100). The flow-chart for forming the labeled utterances database is shown in Fig. 2. For forming the labeled utterances database, call-center audio recordings containing spontaneous emotions in a dialogue are used as source data (Fig. 2, block 101). Also, any other audio recordings containing emotional spontaneous speech can be used, such as audio recordings of interviews with different people and talk show audio recordings. Using audio recordings containing “actors’” (histrionic) emotions for training a neural network will reduce the emotion recognition precision.
Traditional emotion recognition models use the entire audio recording or an audio recording split into rather long utterances, since these audio recordings or utterances contain more properties that enable detecting a corresponding emotion. Therefore, said models detect an emotion of the entire audio recording.
The presented method uses an audio recording split into shorter utterances of a fixed duration. In particular, the presented method of training a neural network is directed to recognizing an emotion of each segment of an audio recording separately. The task of recognizing emotions in short speech segments is more challenging than in traditional methods, since there are fewer properties that enable recognizing a corresponding emotion in a short segment, but the advantage is the gained ability to single out emotions more accurately in cases where there is a mix of emotions. In a preferred embodiment, audio recordings containing emotional spontaneous speech are split into utterances of up to 10 seconds using a Voice Activity Detector (VAD) unit (Fig. 2, block 102). Subsequently, only a 3-second arbitrarily chosen speech fragment is used from each of said utterances.
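As an illustration of this splitting step, the sketch below uses a simplified energy-based detector in place of the VAD unit, groups voiced frames into utterances capped at 10 seconds and draws an arbitrary 3-second fragment from each utterance. The frame length, energy threshold and padding behaviour are assumptions, not details given in the description.

```python
import numpy as np

def split_into_utterances(signal, sr, frame_ms=30, energy_thr=1e-4, max_len_s=10.0):
    """Simplified energy-based stand-in for the VAD unit: groups consecutive voiced
    frames into utterances of at most 10 seconds (threshold and frame size are assumptions)."""
    frame = int(sr * frame_ms / 1000)
    voiced = [float(np.mean(signal[i:i + frame] ** 2)) > energy_thr
              for i in range(0, len(signal) - frame, frame)]
    utterances, start = [], None
    for i, v in enumerate(voiced):
        pos = i * frame
        if v and start is None:
            start = pos                                   # a new utterance begins
        if start is not None and (not v or pos - start >= max_len_s * sr):
            utterances.append(signal[start:pos])          # close the current utterance
            start = pos if v else None
    if start is not None:
        utterances.append(signal[start:])
    return utterances

def random_3s_fragment(utterance, sr, length_s=3.0):
    """Pick an arbitrarily chosen 3-second speech fragment (padded if the utterance is shorter)."""
    need = int(length_s * sr)
    if len(utterance) <= need:
        return np.pad(utterance, (0, need - len(utterance)))
    start = np.random.randint(0, len(utterance) - need)
    return utterance[start:start + need]
```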
As stated above, the used audio recordings contain spontaneous speech interactions in which the boundaries between emotions (emotion classes) are blurred and mixed. Therefore, for recognizing the emotion of each utterance (utterance segment) with high precision, it is important to ensure agreement between the assessors attributing a corresponding emotion label to an utterance. In a preferred embodiment, at least 7 assessors attribute corresponding emotion labels to utterances (Fig. 2, block 103). For a cluster of utterances (at least 100), the at least 5 assessors with the highest Fleiss’ kappa agreement are selected, reaching the agreement level of 0.4, which allows excluding assessors whose idea of the display of a certain emotion in speech differs from that of the majority (Fig. 2, block 104).
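A minimal sketch of this assessor-selection step is shown below: Fleiss’ kappa is computed from the per-category vote counts, and the 5-assessor subset with the highest agreement is kept only if it reaches the 0.4 level. The exhaustive search over subsets is an assumption about how the selection could be carried out, not a detail given in the description.

```python
import numpy as np
from itertools import combinations

def fleiss_kappa(counts):
    """counts: (n_utterances, n_categories) matrix of votes per category."""
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)                       # category prevalence
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1.0 - P_e)

def select_assessors(labels, n_categories, group_size=5, min_kappa=0.4):
    """labels: (n_utterances, n_assessors) matrix of integer label ids
    attributed by at least 7 assessors to a cluster of at least 100 utterances."""
    best_group, best_kappa = None, -1.0
    for group in combinations(range(labels.shape[1]), group_size):
        counts = np.stack([np.bincount(row, minlength=n_categories)
                           for row in labels[:, group]])
        kappa = fleiss_kappa(counts)
        if kappa > best_kappa:
            best_group, best_kappa = group, kappa
    # keep the most agreeing subset only if it reaches the 0.4 agreement level
    return (best_group, best_kappa) if best_kappa >= min_kappa else (None, best_kappa)
```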
Aside from the emotion label, which represents anger, sadness, happiness or neutrality, assessors also attribute a noise label to the corresponding utterances. The noise label is needed for VAD error tolerance. In particular, when splitting an audio recording into speech utterances, the VAD can mistake noise interference (microphone, wind and other noise) for speech. Therefore, if noise is not taken into account in the neural network training, the network will have a high error rate when subsequently used. The presented method of training the neural network provides for the presence of noise between utterances, which allows minimizing the neural network error; more particularly, it helps prevent an emotion label, such as anger, from being attributed to a noise-containing utterance, as their properties share some similarities.
The value of emotion intensity (anger, sadness, happiness, neutrality) in an utterance is determined by the percentage of the assessors who voted for this emotion (Fig. 2, block 105). More particularly, a hard label technique is used for utterances to which the majority of said assessors attributed the same emotion label. In a preferred embodiment, a hard label technique is used for utterances that have been attributed the same emotion label by 80% of the assessors. For other, ambiguously labeled utterances, a soft label technique is used, which assigns the emotion label strength proportionally to the percentage of assessors ascribing the utterance to a corresponding emotion label. Utterances that belong to the noise category are labeled only as the noise category.
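A sketch of this labelling rule is given below, under the assumption that votes are stored as category indices and that the 80% share is taken over the retained assessors; the category ordering is illustrative.

```python
import numpy as np

EMOTIONS = ["anger", "sadness", "happiness", "neutrality", "noise"]

def make_label(votes, hard_share=0.8):
    """votes: category indices given to one utterance by the retained assessors.
    Returns (target_vector, is_hard)."""
    counts = np.bincount(votes, minlength=len(EMOTIONS)).astype(float)
    shares = counts / counts.sum()                      # fraction of assessors per label
    top = int(shares.argmax())
    if EMOTIONS[top] == "noise":
        return np.eye(len(EMOTIONS))[top], True         # noise utterances keep only the noise label
    if shares[top] >= hard_share:
        return np.eye(len(EMOTIONS))[top], True         # hard (one-hot) label at 80% agreement
    return shares, False                                # soft label: strength proportional to votes
```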
The trainable neural network is a deep neural network that uses transfer learning, more particularly, between a convolutional neural network (CNN), namely an OpenL3 network pre-trained on a large amount of unlabeled data in a self-training mode, and a low-capacity recurrent neural network that follows said OpenL3 network.
The neural network training is performed in two steps. At the first step (Fig. 1, block 200), the OpenL3 convolutional neural network is frozen (its weights are not updated) and only a low-capacity recurrent neural network built on said OpenL3 convolutional neural network is trained. The weights of the recurrent neural network are randomly initialized. The recurrent neural network is trained using the formed labeled utterances database, wherein the utterances with hard labels and soft labels are transmitted successively in batches using the cross-entropy and mean squared error (MSE) loss functions, respectively. The Adam optimization algorithm is used during the training. The training continues until the loss function on the validation set stabilizes.
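A PyTorch sketch of this first training step is shown below, assuming `backbone` is the frozen OpenL3 feature extractor (already producing the sequence of 512-dimensional vectors) and `head` is the recurrent network described further below; the epoch count, learning rate and batch pairing are assumptions.

```python
import torch
import torch.nn as nn

def train_stage_one(backbone, head, hard_loader, soft_loader, epochs=20, lr=1e-3):
    """Step one: the OpenL3 backbone is frozen and only the recurrent head is trained,
    with cross-entropy for hard-labelled batches and MSE for soft-labelled batches."""
    for p in backbone.parameters():
        p.requires_grad = False                        # freeze the pre-trained CNN
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    opt = torch.optim.Adam(head.parameters(), lr=lr)   # Adam, as stated in the description
    for _ in range(epochs):                            # in practice: until validation loss stabilizes
        for (x_hard, y_hard), (x_soft, y_soft) in zip(hard_loader, soft_loader):
            opt.zero_grad()
            loss = ce(head(backbone(x_hard)), y_hard)                    # y_hard: class indices
            soft_pred = torch.softmax(head(backbone(x_soft)), dim=-1)
            loss = loss + mse(soft_pred, y_soft)                         # y_soft: vote proportions
            loss.backward()
            opt.step()
```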
At the second step (Fig. 1, block 300), the upper layers of said pre-trained OpenL3 convolutional neural network are unfrozen, and these layers, together with the subsequent recurrent neural network, pass through a limited number of training iterations. The architecture of the neural network trained for the purpose of emotion recognition in speech segments is shown in Fig. 3. A MelSpec unit comprises FFT and MEL blocks and is configured to provide a mel spectrogram from speech segments (a fast Fourier transform (FFT) to transition into the frequency domain, followed by frequency aggregation that takes human ear sensitivity into account); it does not require training. An OpenL3 unit comprises CNN Block1 (1, 64), CNN Block2 (64, 128), CNN Block3 (128, 256) and CNN Block4 (256, 512), wherein each CNN Block comprises the following functions in the specified sequence: Conv2d (in, out), BatchNorm2d, ReLU, Conv2d (out, out), BatchNorm1d, ReLU and MaxPool2d. Using the OpenL3 unit, which is a convolutional neural network pre-trained on millions of audio segments, solves the problem of sound pre-processing; in particular, in the presented method it functions as a feature extractor transforming a mel spectrogram into a sequence of vectors with a dimension of 512. It should be noted that, in the prior art, an OpenL3 convolutional neural network was used only for detecting acoustic events or analyzing music and video.
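The block structure above maps naturally onto PyTorch modules. The sketch below is an illustrative reconstruction rather than the actual OpenL3 weights: kernel sizes, mel-spectrogram parameters, the frequency pooling and the number of unfrozen blocks are assumptions, and BatchNorm2d is used after the second convolution for shape compatibility where the text lists BatchNorm1d.

```python
import torch.nn as nn
import torchaudio

def cnn_block(c_in, c_out):
    # Conv2d -> BatchNorm -> ReLU -> Conv2d -> BatchNorm -> ReLU -> MaxPool2d
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

# MelSpec unit: FFT plus mel aggregation, no trainable parameters (parameter values assumed)
mel_spec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024,
                                                hop_length=160, n_mels=128)

# OpenL3 unit: CNN Block1 (1,64) ... CNN Block4 (256,512)
openl3_backbone = nn.Sequential(
    cnn_block(1, 64), cnn_block(64, 128), cnn_block(128, 256), cnn_block(256, 512))

def to_tokens(feature_map):
    # (batch, 512, freq, time) -> (batch, time, 512): average over the frequency axis
    return feature_map.mean(dim=2).transpose(1, 2)

def unfreeze_upper_blocks(backbone, n_blocks=2):
    """Step two: unfreeze the upper CNN blocks for a limited number of further training
    iterations (how many blocks count as 'upper' is an assumption)."""
    for block in list(backbone.children())[-n_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
```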
The sequence of vectors, each representing the state of the fragment’s sound over one second with a half-second overlap, is transferred as tokens to an Emo unit, which is a recurrent neural network, more particularly a long short-term memory (LSTM) network. The Emo unit comprises the following functions in the specified sequence: BatchNorm1d, LSTM, ReLU, Linear, ReLU and Linear. The tokens, passing through two Dense units with ReLU activation, form a probability vector of the presence of the following emotions in the fragment: anger, sadness, happiness, neutrality or noise. The addition of the noise class allows reducing false activations in the case of specific microphone interference.
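A possible PyTorch rendering of the Emo unit is sketched below, assuming 512-dimensional tokens and using the last LSTM state as the fragment summary; the hidden size and pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class EmoHead(nn.Module):
    """BatchNorm1d -> LSTM -> ReLU -> Linear -> ReLU -> Linear, producing logits for
    anger, sadness, happiness, neutrality and noise."""
    def __init__(self, feat_dim=512, hidden=128, n_classes=5):
        super().__init__()
        self.norm = nn.BatchNorm1d(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, tokens):                        # tokens: (batch, seq_len, 512)
        x = self.norm(tokens.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(x)
        x = torch.relu(out[:, -1, :])                 # last time step summarises the fragment
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                            # softmax of these logits gives the probability vector
```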
Fig. 4 shows a system 400 for segmenting an audio recording and recognizing an emotion in a speech segment (hereinafter, the system) in accordance with one of the embodiments of the present invention.
According to the system, utterances are selected using a Voice Activity Detector (VAD) unit (Fig. 4, block 401), and the utterances are split into 3-second fragments using an utterance splitter unit (Fig. 4, block 402).
Then, in an emotion recognition unit (Fig. 4, block 403), probabilities of the emotional state for each fragment are determined using a neural network trained in accordance with the method described above. In particular, the resulting fragments are sent to the mel-spectrogram providing unit, which provides a mel spectrogram from the fragments for further transformation into a sequence of vectors with a dimension of 512 by the OpenL3 convolutional neural network unit; the low-capacity recurrent neural network unit then forms, from the resulting vector sequence, a vector of probabilities of the presence of a corresponding emotion or noise in each fragment.
Afterwards, in a filtering unit (Fig. 4, block 404), a probability value of presence of a corresponding emotion in each fragment is determined, wherein the probability value is used for filtering each fragment comprising a corresponding emotion using a threshold to detect intensity of an emotion. In other words, the threshold affects the precision and recall of emotions comprised in the obtained fragments, and therefore, for each of the obtained fragments, the system determines a level of confidence in the result, wherein the filtering is performed based on this value.
The threshold is selected in the range of 0 to 1 for each segmentation task and can be set to be the same for all emotion classes or different for each emotion, and therefore, a user can adjust the resulting output of the system.
The obtained successive fragments containing the same emotion are joined into segments; for example, successive fragments containing anger can be joined. Individual fragments are also converted into segments.
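A sketch of this filtering-and-merging step is shown below, assuming a single scalar threshold, 3-second fragments and probability rows ordered as (anger, sadness, happiness, neutrality, noise); per-emotion thresholds would work the same way.

```python
import numpy as np

EMOTIONS = ["anger", "sadness", "happiness", "neutrality", "noise"]

def fragments_to_segments(probs, threshold=0.5, frag_len_s=3.0):
    """probs: (n_fragments, 5) probability rows in chronological order.
    Returns a list of (emotion, start_time, end_time) segments."""
    segments, current = [], None
    for i, p in enumerate(probs):
        cls = int(np.argmax(p))
        emotion = EMOTIONS[cls]
        start, end = i * frag_len_s, (i + 1) * frag_len_s
        keep = emotion != "noise" and p[cls] >= threshold    # threshold filters weak emotions
        if keep and current is not None and current[0] == emotion and current[2] == start:
            current = (emotion, current[1], end)             # extend a running segment
        else:
            if current is not None:
                segments.append(current)
            current = (emotion, start, end) if keep else None
    if current is not None:
        segments.append(current)
    return segments
```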
Thus, for example, when the same threshold is set close to 0, a user will receive more segments to analyze, which reduces the probability of missing a segment with the required intensity (degree of manifestation) of an emotion. When a threshold close to 1 is chosen, the neural network will determine segments with higher precision (with emotions that are more intense); the user will receive fewer segments, which speeds up their analysis, but some segments will be overlooked.
When setting the threshold separately for each emotion, a user can select segments containing the target emotion.
Therefore, the presented system allows a user to avoid listening to the entire audio recording and to listen only to selected segments containing the required intensity of the specifically determined emotion.
The presented system allows for high-precision speech emotion recognition and speeds up the process of listening to controversial conversation sessions, for example between a client and a voice assistant. This allows controlling the call-center operation quality and, in particular, determining the problem of the session correctly and resolving it promptly, thus having a positive impact on the client’s impression of the company.
Furthermore, the presented invention can be used for automatically creating a selection of intense moments of an audio or video program, reducing the viewing time.
Also, since each segment including a corresponding emotion has information about its start and end time in the audio recording, the audio recording can be labeled based on emotions.
The present invention is not limited to the specific embodiments disclosed in the specification for the purpose of illustration and encompasses all possible modifications and alternatives that fall under the scope of the present invention defined in the claims.
Claims
1. A method of training a neural network for the purpose of emotion recognition in a speech segment, the method comprising freezing an OpenL3 convolutional neural network pretrained on a large unlabeled amount of data in a self-training mode, forming a labeled utterances database containing utterances not exceeding 10 seconds in length, wherein a corresponding emotion label or a noise label is attributed to each utterance using assessors, wherein a hard label technique is used for utterances to which the majority of said assessors attributed the same emotion label, and a soft label technique is used for the rest of the utterances, wherein the assessors are a group of assessors excluding assessors that do not meet the Fleiss’ Kappa agreement level of 0.4, training a low-capacity recurrent neural network built on said pre-trained OpenL3 convolutional neural network using the formed labeled utterances database, wherein the utterances with hard labels utilizing the cross-entropy loss function and the utterances with soft labels utilizing the mean squared error (MSE) loss function are transmitted successively in batches, unfreezing upper layers of said pre-trained OpenL3 convolutional neural network for further training of the neural network.
2. The method according to claim 1, wherein a hard label technique is used for utterances that have been attributed the same emotion label by 80% assessors.
3. The method according to claim 1, wherein the emotion label indicates anger, sadness, happiness or neutrality.
4. A system for segmenting an audio recording and emotion recognition in a speech segment, the system comprising a Voice Activity Detector (VAD) unit configured to select utterances from the audio recording, an utterance splitter unit configured to split the utterances into 3-second fragments, an emotion recognition unit comprising a neural network trained according to the method of claim 1, the neural network comprising
a mel-spectrogram providing unit configured to provide a mel spectrogram from the fragments, an OpenL3 convolutional neural network unit configured to transform the mel spectrogram into a sequence of vectors with a dimension of 512, and a low-capacity recurrent neural network unit configured to form, from the obtained vector sequence, a vector of probabilities of presence of a corresponding emotion or noise in each fragment, a filtering unit configured to determine a probability value of presence of a corresponding emotion in each fragment, wherein the probability value is used for filtering each fragment comprising a corresponding emotion using a threshold to detect intensity of an emotion and to join successive fragments with the same emotion into segments.
5. The system according to claim 4, wherein the same threshold is used for different emotions.
6. The system according to claim 4, wherein a corresponding threshold is used for each emotion.
7. The system according to claim 4, wherein each segment containing a corresponding emotion includes information about its start and end time in the audio recording.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2021/000316 WO2023009020A1 (en) | 2021-07-26 | 2021-07-26 | Neural network training and segmenting an audio recording for emotion recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2021/000316 WO2023009020A1 (en) | 2021-07-26 | 2021-07-26 | Neural network training and segmenting an audio recording for emotion recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023009020A1 true WO2023009020A1 (en) | 2023-02-02 |
Family
ID=85087100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/RU2021/000316 WO2023009020A1 (en) | 2021-07-26 | 2021-07-26 | Neural network training and segmenting an audio recording for emotion recognition |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023009020A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116306686A (en) * | 2023-05-22 | 2023-06-23 | 中国科学技术大学 | A Method for Empathic Dialogue Generation Guided by Multiple Emotions |
CN119181383A (en) * | 2024-11-20 | 2024-12-24 | 湖南快乐阳光互动娱乐传媒有限公司 | A method and system for evaluating singing level in multiple dimensions |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210192332A1 (en) * | 2019-12-19 | 2021-06-24 | Sling Media Pvt Ltd | Method and system for analyzing customer calls by implementing a machine learning model to identify emotions |
-
2021
- 2021-07-26 WO PCT/RU2021/000316 patent/WO2023009020A1/en active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210192332A1 (en) * | 2019-12-19 | 2021-06-24 | Sling Media Pvt Ltd | Method and system for analyzing customer calls by implementing a machine learning model to identify emotions |
Non-Patent Citations (5)
Title |
---|
ISSA DIAS; FATIH DEMIRCI M.; YAZICI ADNAN: "Speech emotion recognition with deep convolutional neural networks", BIOMEDICAL SIGNAL PROCESSING AND CONTROL, ELSEVIER, AMSTERDAM, NL, vol. 59, 27 February 2020 (2020-02-27), NL , XP086144914, ISSN: 1746-8094, DOI: 10.1016/j.bspc.2020.101894 * |
KANNAN VENKATARAMANAN; HARESH RENGARAJ RAJAMOHAN: "Emotion Recognition from Speech", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 22 December 2019 (2019-12-22), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081565123 * |
SHANSHAN WANG; TONI HEITTOLA; ANNAMARIA MESAROS; TUOMAS VIRTANEN: "Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 January 1900 (1900-01-01), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081970789 * |
SHUIYANG MAO; P. C. CHING; TAN LEE: "Emotion Profile Refinery for Speech Emotion Classification", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 August 2020 (2020-08-12), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081739195 * |
VLADIMIR CHERNYKH; GRIGORIY STERLING; PAVEL PRIHODKO: "Emotion Recognition From Speech With Recurrent Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 27 January 2017 (2017-01-27), 201 Olin Library Cornell University Ithaca, NY 14853 , XP080752039 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116306686A (en) * | 2023-05-22 | 2023-06-23 | 中国科学技术大学 | A Method for Empathic Dialogue Generation Guided by Multiple Emotions |
CN116306686B (en) * | 2023-05-22 | 2023-08-29 | 中国科学技术大学 | A Method for Empathic Dialogue Generation Guided by Multiple Emotions |
CN119181383A (en) * | 2024-11-20 | 2024-12-24 | 湖南快乐阳光互动娱乐传媒有限公司 | A method and system for evaluating singing level in multiple dimensions |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11276407B2 (en) | Metadata-based diarization of teleconferences | |
El-Moneim et al. | Text-independent speaker recognition using LSTM-RNN and speech enhancement | |
Anguera et al. | Speaker diarization: A review of recent research | |
US11100932B2 (en) | Robust start-end point detection algorithm using neural network | |
US7620547B2 (en) | Spoken man-machine interface with speaker identification | |
CN112233680B (en) | Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium | |
Masumura et al. | Online end-of-turn detection from speech based on stacked time-asynchronous sequential networks. | |
Wang et al. | Multi-source domain adaptation for text-independent forensic speaker recognition | |
US12217760B2 (en) | Metadata-based diarization of teleconferences | |
WO2023009020A1 (en) | Neural network training and segmenting an audio recording for emotion recognition | |
Yella et al. | A comparison of neural network feature transforms for speaker diarization. | |
Markov et al. | Never-ending learning system for on-line speaker diarization | |
Jia et al. | A deep learning system for sentiment analysis of service calls | |
CN109065026B (en) | Recording control method and device | |
Stefanidi et al. | Application of convolutional neural networks for multimodal identification task | |
Agrawal et al. | Prosodic feature based text dependent speaker recognition using machine learning algorithms | |
Berdibayeva et al. | Features of speech commands recognition using an artificial neural network | |
Mathur et al. | Unsupervised domain adaptation under label space mismatch for speech classification | |
Kalita et al. | Use of bidirectional long short term memory in spoken word detection with reference to the Assamese language | |
Fan et al. | Automatic emotion variation detection in continuous speech | |
JPH06266386A (en) | Word spotting method | |
Chen et al. | End-to-end speaker-dependent voice activity detection | |
Salah et al. | Kernel function and dimensionality reduction effects on speaker verification system | |
Mporas et al. | Evaluation of classification algorithms for text dependent and text independent speaker identification | |
Aafaq et al. | Multi-Speaker Diarization using Long-Short Term Memory Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21952040 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202392106 Country of ref document: EA |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21952040 Country of ref document: EP Kind code of ref document: A1 |