CN116524962A - Speech emotion recognition method based on Conformer structure and multitask learning framework - Google Patents
- Publication number
- CN116524962A CN116524962A CN202310552901.0A CN202310552901A CN116524962A CN 116524962 A CN116524962 A CN 116524962A CN 202310552901 A CN202310552901 A CN 202310552901A CN 116524962 A CN116524962 A CN 116524962A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a speech emotion recognition method based on a Conformer structure and a multi-task learning framework, comprising the following steps. S1, extracting features from the audio data of the training set and the verification set in a speech emotion data set to generate triplet samples, wherein each triplet sample comprises FBank features, an emotion label and a text label. S2, constructing a network model for multi-task learning, wherein the network model comprises a shared encoder, an emotion recognition decoder and a speech recognition decoder, and the shared encoder comprises a Conformer model for deep feature extraction. S3, training the network model with the triplet samples of the training set. S4, tuning the parameters of the network model with the triplet samples of the verification set. S5, inputting the audio data to be recognized into the network model to obtain both speech recognition and emotion recognition results for that audio. The method builds a multi-task learning model on the Conformer structure and can improve the generalization performance and recognition accuracy of the model.
Description
Technical Field
The invention relates to the technical field of computer voice recognition and voice emotion recognition, in particular to a voice emotion recognition method based on a Conformer structure and a multi-task learning framework.
Background
In recent years, with the development of the internet and artificial intelligence, human-computer interaction applications have become widespread. Among the many interaction modalities, voice interaction is one of the most common and direct. The first step of human-computer voice interaction is for the computer to recognize and understand human speech, and speech recognition is one of the principal means by which a computer performs semantic understanding. In practice, however, the text obtained from speech recognition alone is not sufficient for semantic understanding: people generally speak with emotion, and the same sentence uttered with different emotions may express entirely different meanings, so emotion information is also important for semantic understanding. Introducing speech emotion recognition into human-computer voice interaction provides emotion information for downstream semantic understanding tasks and reduces, as far as possible, the computer's misunderstanding of the user.
Early speech emotion recognition models were mostly based on traditional machine learning algorithms, most typically hidden Markov models, Gaussian mixture models and support vector machines. These models are trained to learn the relationship between speech features and emotion classes, but they perform poorly in both recognition accuracy and robustness. With the development of deep learning and the growth of computing power in recent years, deep neural network models have been widely applied in the speech field, and research shows that deep-learning-based speech emotion recognition models outperform traditional machine learning models.
At present, many application scenarios combine speech recognition with speech emotion recognition, including intelligent customer service, intelligent voice assistants, intelligent healthcare and intelligent cockpits. Integrated voice products on smart terminals, such as XiaoAI and Baidu's Xiaodu, incorporate multiple speech technologies including speech recognition, speech emotion recognition and speech synthesis. Most applications, however, run speech recognition and speech emotion recognition as two separate models, which leads to system redundancy and wasted computational resources.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a speech emotion recognition method based on a Conformer structure and a multi-task learning framework.
The invention provides a voice emotion recognition method based on a Conformer structure and a multitask learning framework, which comprises the following steps:
S1, acquiring a speech emotion data set, and extracting features from the audio data of the training set and the verification set in the speech emotion data set to generate triplet samples, wherein each triplet sample comprises FBank features, an emotion label and a text label;
S2, constructing a network model for multi-task learning, wherein the network model comprises a shared encoder, an emotion recognition decoder and a speech recognition decoder; the shared encoder comprises a Conformer model for deep feature extraction; the emotion recognition decoder comprises an averaging layer and a linear fully-connected layer and is used for outputting an emotion vector; the speech recognition decoder comprises an averaging layer and a CTC decoder and is used for outputting a text vector;
s3, training the network model by using the triplet sample corresponding to the training set;
s4, carrying out parameter adjustment on the network model by using the triplet sample corresponding to the verification set;
s5, inputting the audio data to be identified into the network model to realize voice identification and emotion identification of the audio data to be identified.
By means of the above steps, a multi-task network model that performs speech recognition and emotion recognition simultaneously is built. Combining the speech recognition subtask with the emotion recognition subtask exploits the correlation between the tasks, so that additional information is learned during training and the performance of each subtask improves. Because the shared encoder contains a Conformer model for deep feature extraction, the feature expression capability of the network model is effectively improved, raising the accuracy of both speech recognition and emotion recognition.
Further, the Conformer model comprises a plurality of stacked Conformer modules, wherein each Conformer module is a Transformer module into which a CNN module has been introduced.
Hidden-layer features are generated by the stacked Conformer modules. Each Conformer module combines the CNN module's ability to capture local features and its time-domain invariance with the Transformer module's advantage in long-range modeling, linking similar extracted features together, so that the Conformer model can both exploit local features and establish the relationships between them.
Further, the Conformer module comprises a first feedforward module, a self-attention module, a convolution module, a second feedforward module and a normalization module, and the calculation of the Conformer module is as follows:

x̃_i = x_i + ½FFN(x_i)

x′_i = x̃_i + MHSA(x̃_i)

x″_i = x′_i + Conv(x′_i)

y_i = LN(x″_i + ½FFN(x″_i))

wherein x_i is the input of the first feedforward module, y_i is the output of the normalization module, FFN is a feedforward module, MHSA is the self-attention module, Conv is the convolution module, and LN is the normalization module.
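The four equations above can be sketched as a single PyTorch module. This is a minimal sketch, not the patent's implementation: the model dimension, number of attention heads and convolution kernel size below are illustrative choices, and the sub-module internals follow the standard Conformer design.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN sub-module: LayerNorm, expansion, activation, projection back."""
    def __init__(self, dim, mult=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * mult), nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(dim * mult, dim), nn.Dropout(dropout))

    def forward(self, x):
        return self.net(x)

class ConvModule(nn.Module):
    """Conv sub-module: pointwise conv + GLU, depthwise conv, pointwise conv."""
    def __init__(self, dim, kernel=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                               # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)                # -> (batch, dim, time)
        y = nn.functional.glu(self.pw1(y), dim=1)
        y = nn.functional.silu(self.bn(self.dw(y)))
        return self.drop(self.pw2(y)).transpose(1, 2)

class ConformerBlock(nn.Module):
    """FFN (half-step) -> MHSA -> Conv -> FFN (half-step) -> LayerNorm,
    each sub-module wrapped in a residual connection."""
    def __init__(self, dim=144, heads=4, kernel=31, dropout=0.1):
        super().__init__()
        self.ffn1 = FeedForward(dim, dropout=dropout)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.conv = ConvModule(dim, kernel, dropout)
        self.ffn2 = FeedForward(dim, dropout=dropout)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)                          # x~ = x + 1/2 FFN(x)
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # x' = x~ + MHSA(x~)
        x = x + self.conv(x)                                # x'' = x' + Conv(x')
        return self.out_norm(x + 0.5 * self.ffn2(x))        # y = LN(x'' + 1/2 FFN(x''))

block = ConformerBlock(dim=144, heads=4)
block.eval()                                # keep BatchNorm deterministic for a demo
with torch.no_grad():
    out = block(torch.randn(2, 50, 144))    # (batch=2, frames=50, dim=144)
```

Because every sub-module is residual, the block preserves the (batch, time, dim) shape, which is what lets such blocks be stacked (15 deep in the embodiment) with further residual connections between them.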
Further, the first feedforward module, the self-attention module, the convolution module and the second feedforward module in the Conformer module are connected through residual connections.
Further, the plurality of stacked Conformer modules are connected through residual connections.
Further, the shared encoder includes a convolutional downsampling layer, a fully-connected linear layer, a Dropout layer and the Conformer model.
Further, the step S3 includes: and calculating a loss function of the network model, wherein the loss function comprises emotion recognition task loss, voice recognition task loss and sentence embedding loss.
The generalization of the speech emotion recognition model can be improved by introducing sentence embedding loss.
Further, calculating the loss function of the network model includes: processing the triplet sample with the network model to obtain the emotion vector f_e(X_n), the text vector f_t(X_n) and the shared encoder output g(X_n), wherein X_n denotes the FBank feature vector, and calculating the loss function L = L_1 + αL_2 + βL_3, wherein α and β are hyper-parameters;

emotion recognition task loss L_1 = (1/N)Σ_n MSELoss(f_e(X_n), e_n), wherein MSELoss denotes the mean square error loss function and e_n denotes the emotion label in the triplet sample;

speech recognition task loss L_2 = (1/N)Σ_n CTCLoss(f_t(X_n), t_n), wherein CTCLoss denotes the CTC loss function and t_n denotes the text label in the triplet sample;

sentence embedding loss L_3 = (1/N)Σ_n CrossEntropy(Pool(g(X_n)), Pool(f_t(X_n))), wherein CrossEntropy denotes the cross-entropy loss function, Pool denotes an average pooling calculation, and N is the number of triplet samples; the sentence embedding loss constrains the integrity of the deep information in the speech at the sentence level.
The integrity of emotion information at a sentence level is constrained through sentence embedding loss, so that the recognition accuracy and generalization of the model are further improved.
Further, the step S1 includes: changing the speed of the audio data by a speed factor, adding noise interference to the audio data, and adding the resulting new audio data to the training set.
The data volume of the original audio data can be increased by the method.
The invention also provides an electronic device comprising a memory, a processor and a computer program, wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the speech emotion recognition method based on the Conformer structure and the multi-task learning framework.
The present invention also provides a computer-readable storage medium having a computer program stored thereon; the computer program is executed by the processor to implement the speech emotion recognition method based on the Conformer structure and the multitasking learning framework.
The beneficial effects of the invention are as follows: according to the invention, a multi-task network model for simultaneously carrying out voice recognition and emotion recognition is constructed, and the voice recognition subtasks and the emotion recognition subtasks are combined to utilize the correlation between tasks, so that additional information is learned during model training, and the performance of each subtask is improved. The shared encoder comprises a Conformer model for deep feature extraction, so that the feature expression capability of the network model can be effectively improved, and the precision of voice recognition and emotion recognition can be improved.
Drawings
FIG. 1 is a training flow chart of the network model of the present invention;
FIG. 2 is a block diagram of a Conformer module of the present invention;
FIG. 3 is a block diagram of a shared encoder of the present invention;
FIG. 4 is a flow chart of speech characterization extraction in accordance with the present invention;
FIG. 5 is a training flow chart of the emotion recognition decoder of the present invention;
fig. 6 is a training flow chart of the speech recognition decoder of the present invention.
Detailed Description
In order to make the technical problems, technical schemes and beneficial effects to be solved by the present application more clear, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The environment adopted in this embodiment is a GeForce GTX Titan X GPU, an Intel Core i7-5930K 3.50 GHz CPU, 64 GB RAM and a Linux operating system; development is carried out in Python with the open-source library PyTorch.
A speech emotion recognition method based on a Conformer structure and a multitask learning framework comprises the following steps:
step S1: the voice emotion data set is divided into a training set, a testing set and a verification set, the audio data of the training set and the verification set are subjected to feature extraction, a triplet sample is generated, the triplet sample comprises FBank features, emotion tags and text tags, and fig. 4 is a flow chart of voice characterization extraction.
This embodiment uses the open-source data set IEMOCAP (Interactive Emotional Dyadic Motion Capture), an emotion recognition data set consisting of 151 recorded conversational videos with 2 speakers per conversation, for a total of 302 videos across the whole data set. Each segment is annotated with one of 9 emotion classes (anger, excitement, fear, sadness, surprise, frustration, happiness, disgust and neutral) and the text corresponding to the speech; four emotion classes (anger, sadness, happiness and neutral) are selected as training data. The data set is integrated and divided into training, validation and test sets, with the data sample distribution shown in Table 1 below.
Then, extracting the characteristics of the data set, wherein the specific operation is as follows:
framing, pre-emphasis and windowing the original audio: in the embodiment, the original wav file is divided into a plurality of small fragments with fixed length by using a frame window size of 25ms and a frame displacement size of 10ms; enhancing the signal of the high frequency part of each frame of the voice signal through pre-emphasis to improve the resolution of the high frequency signal; the time domain signal can better meet the periodicity requirement of Fast Fourier Transform (FFT) through windowing operation, and frequency leakage is reduced.
The speech signal is a non-stationary, time-varying signal, but over a short interval it can be regarded as stationary and time-invariant. This short interval is typically 10 ms to 30 ms, so to reduce the effect of the overall non-stationarity during speech signal processing, the signal is segmented; each segment is called a frame, and the frame length is typically 25 ms. To ensure smooth transitions and continuity between frames, overlapping segmentation is generally used so that adjacent frames partially overlap; the time difference between the start positions of two adjacent frames is called the frame shift, and the frame shift is typically 10 ms. During transmission, high-frequency components attenuate more easily, and the pronunciation of certain phonemes contains a larger proportion of high-frequency components; losing these components makes the formants of those phonemes less distinct and weakens the acoustic model's ability to model them. Pre-emphasis is a first-order high-pass filter that increases the energy of the high-frequency part of the signal. In the real world it is impossible to acquire a signal extending from −∞ to +∞; only signals of finite duration can be acquired. Because a framed signal is non-periodic, frequency leakage occurs after the FFT; to minimize this leakage error, a weighting function, also called a window function, is applied. Windowing mainly makes the time-domain signal better satisfy the periodicity assumption of FFT processing and reduces frequency leakage.
Extracting FBank features as input to the model: in this embodiment, a fast Fourier transform is performed on the speech signal to obtain its frequency-domain representation, and the linear frequency f is converted to the Mel frequency of the cepstral domain by the formula:

Mel(f) = 2595·log₁₀(1 + f/700)
setting 80 triangular band-pass filters with equal bandwidth in the Mel frequency spectrum range, filtering the spectrum characteristics after discrete Fourier transformation to obtain 80 filter bank energies, and obtaining 80-dimensional Fbank characteristics after log operation.
The acoustic features are then augmented: this embodiment perturbs the speed of the original audio with factors of 0.9 and 1.1, expanding the amount of original audio data to 3 times. In addition, noise interference is added to the original audio, and the resulting new audio is added to the training set. This process may use the open-source speech processing framework Kaldi, the audio processing tool librosa, or similar tools.
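The augmentation step can be sketched in NumPy. Both helpers below are illustrative stand-ins, not the patent's implementation: linear-interpolation resampling replaces the Kaldi/sox speed perturbation, and white Gaussian noise at a chosen SNR replaces the unspecified noise interference.

```python
import numpy as np

def speed_perturb(signal, factor):
    """Resample by linear interpolation so the audio plays `factor` times
    faster (a simple stand-in for Kaldi/sox speed perturbation)."""
    n_out = int(round(len(signal) / factor))
    return np.interp(np.arange(n_out) * factor,
                     np.arange(len(signal)), signal)

def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(signal))
    scale = np.sqrt(np.mean(signal ** 2) /
                    (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

sr = 16000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s of a 440 Hz tone
# original + 0.9x + 1.1x speed copies -> 3x the original data amount
train_audio = [sig, speed_perturb(sig, 0.9), speed_perturb(sig, 1.1)]
train_audio.append(add_noise(sig, snr_db=20))        # plus a noisy copy
```

The 0.9x copy is longer and the 1.1x copy shorter than the original, which is why augmented utterances must carry their own frame counts downstream.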
Finally, the features are processed into data files for model training: after feature extraction, the feature files are stored on disk, each associated with its emotion label, and the feature file paths together with their corresponding labels are collected into a text file. This text file contains the triplet samples and serves as the input for training the network model.
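The triplet data file can be sketched as a tab-separated manifest, one sample per line. The file names and labels below are invented for illustration; the patent does not specify a file format.

```python
import os
import tempfile

def write_manifest(samples, path):
    """One line per triplet: <FBank feature file path> TAB <emotion label>
    TAB <text label>."""
    with open(path, "w", encoding="utf-8") as f:
        for feat_path, emotion, text in samples:
            f.write(f"{feat_path}\t{emotion}\t{text}\n")

def read_manifest(path):
    """Parse the manifest back into (feature path, emotion, text) triplets."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t", 2)) for line in f]

samples = [
    ("feats/ses01_f_001.npy", "angry",   "why did you do that"),
    ("feats/ses01_f_002.npy", "neutral", "i am going home"),
]
manifest = os.path.join(tempfile.mkdtemp(), "train.tsv")
write_manifest(samples, manifest)
triplets = read_manifest(manifest)
```

A training loader would then read each feature file by path and pair it with the emotion and text labels from the same line.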
And S2, constructing a network model for multi-task learning.
As shown in fig. 3, the shared encoder includes a convolutional downsampling layer, a linear layer, a Dropout layer and a Conformer model, which together form the shared encoder of the multi-task model. This embodiment stacks 15 Conformer modules to generate hidden features from the acoustic features.
The construction method of the Conformer model specifically comprises the following steps:
the Conformer module is used for connecting the extracted similar features together by introducing the CNN module into the transducer module and utilizing the capturing capacity of the CNN module on the local features and the invariance characteristic of the CNN module on the time domain and the modeling advantage of the transducer module in a long distance, so that the Conformer model can utilize the local features and establish the relationship between the local features, and further deep feature extraction is realized.
In this embodiment, the Conformer module places a self-attention module and a convolution module between the first feedforward module and the second feedforward module, and finally applies normalization through the normalization module; the first four modules, i.e. the first feedforward module, the self-attention module, the convolution module and the second feedforward module, are connected through residual connections. The Conformer module is calculated as follows:

x̃_i = x_i + ½FFN(x_i)

x′_i = x̃_i + MHSA(x̃_i)

x″_i = x′_i + Conv(x′_i)

y_i = LN(x″_i + ½FFN(x″_i))

wherein x_i is the input of the first feedforward module, y_i is the output of the normalization module, FFN is a feedforward module, MHSA is the self-attention module, Conv is the convolution module, and LN (layer normalization) is the normalization module.
The constructed Conformer modules are combined into a Conformer encoder, with residual connections between the Conformer modules.
And S3, training the network model by utilizing the triplet samples corresponding to the training set.
By using the network model, the voice recognition task and the emotion recognition task are respectively carried out based on the Conformer model, and the method specifically comprises the following steps:
as shown in fig. 5 and 6, the emotion recognition decoder includes an averaging layer and a linear full-connection layer for outputting emotion vectors; the speech recognition decoder includes an averaging layer and a CTC decoder for outputting text vectors.
As shown in fig. 1, each triplet sample is processed by the shared encoder to obtain a speech representation, which is then processed by the emotion recognition decoder, the speech recognition decoder and average pooling to obtain the emotion vector f_e(X_n), the text vector f_t(X_n) and the shared encoder output g(X_n).
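The two decoder heads can be sketched in PyTorch as follows. This is a hedged sketch: the encoder dimension, number of emotion classes and vocabulary size are illustrative choices, the averaging layer is realized as a mean over the time axis, and the CTC decoder is reduced to the per-frame projection that feeds a CTC loss or beam search.

```python
import torch
import torch.nn as nn

ENC_DIM, N_EMOTIONS, VOCAB = 256, 4, 29   # illustrative sizes, not from the patent

class EmotionDecoder(nn.Module):
    """Averaging layer over time + linear fully-connected layer -> emotion vector."""
    def __init__(self, dim, n_emotions):
        super().__init__()
        self.fc = nn.Linear(dim, n_emotions)

    def forward(self, g):                  # g: (batch, frames, dim), i.e. g(X_n)
        return self.fc(g.mean(dim=1))      # (batch, n_emotions)

class SpeechDecoder(nn.Module):
    """Per-frame projection to the vocabulary (+1 for the CTC blank) with
    log-softmax, suitable as input to a CTC loss or CTC decoder."""
    def __init__(self, dim, vocab):
        super().__init__()
        self.proj = nn.Linear(dim, vocab + 1)

    def forward(self, g):
        return self.proj(g).log_softmax(dim=-1)   # (batch, frames, vocab+1)

g = torch.randn(2, 50, ENC_DIM)               # stand-in for the shared encoder output
f_e = EmotionDecoder(ENC_DIM, N_EMOTIONS)(g)  # emotion vector f_e(X_n)
f_t = SpeechDecoder(ENC_DIM, VOCAB)(g)        # per-frame text posteriors f_t(X_n)
```

The emotion head collapses time before classifying, while the speech head keeps the frame axis, since CTC needs a per-frame distribution.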
Emotion recognition task loss L_1 = (1/N)Σ_n MSELoss(f_e(X_n), e_n), wherein MSELoss denotes the mean square error loss function and e_n denotes the emotion label in the triplet sample.

Speech recognition task loss L_2 = (1/N)Σ_n CTCLoss(f_t(X_n), t_n), wherein CTCLoss denotes the CTC loss function and t_n denotes the text label in the triplet sample.

A sentence embedding loss is applied to the output of the speech recognition task: L_3 = (1/N)Σ_n CrossEntropy(Pool(g(X_n)), Pool(f_t(X_n))), wherein CrossEntropy denotes the cross-entropy loss function, Pool denotes an average pooling calculation, and N is the number of triplet samples; the sentence embedding loss constrains the integrity of the deep information in the speech at the sentence level.

Total loss function L = L_1 + αL_2 + βL_3, wherein α and β are hyper-parameters.
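The three loss terms and their combination can be sketched as follows. The exact form of the sentence embedding term is not fully fixed by the text, so the linear projection `proj` that makes the pooled encoder output comparable to the pooled ASR output is an assumption of this sketch, as are all the sizes and the α, β values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, D, E, V = 2, 50, 256, 4, 30   # batch, frames, encoder dim, emotions, vocab incl. blank
alpha, beta = 1.0, 0.1              # hyper-parameter values chosen for illustration

emo_pred = torch.randn(B, E).softmax(-1)                # f_e(X_n), emotion vector
emo_label = F.one_hot(torch.tensor([0, 2]), E).float()  # e_n as a one-hot target
ctc_logp = torch.randn(B, T, V).log_softmax(-1)         # f_t(X_n), per-frame log-probs
targets = torch.randint(1, V, (B, 5))                   # t_n (index 0 reserved for blank)
enc_out = torch.randn(B, T, D)                          # g(X_n), shared encoder output
proj = nn.Linear(D, V)                                  # assumed projection for the sentence term

# L1: emotion recognition loss, mean square error against the emotion label
L1 = F.mse_loss(emo_pred, emo_label)

# L2: speech recognition loss; F.ctc_loss expects (T, B, V) log-probabilities
L2 = F.ctc_loss(ctc_logp.transpose(0, 1), targets,
                input_lengths=torch.full((B,), T),
                target_lengths=torch.full((B,), 5))

# L3: sentence embedding loss, a cross-entropy between the average-pooled
# encoder output and the average-pooled ASR output at sentence level
# (this pooling/projection scheme is our reading of the patent text)
sent_asr = ctc_logp.exp().mean(dim=1)                  # Pool(f_t(X_n)) as probabilities
sent_enc = proj(enc_out.mean(dim=1)).log_softmax(-1)   # Pool(g(X_n)) projected to vocab
L3 = -(sent_asr * sent_enc).sum(dim=-1).mean()

L = L1 + alpha * L2 + beta * L3     # total loss, back-propagated through both branches
```

A single backward pass through `L` updates the shared encoder from all three terms, which is what realizes the joint optimization of the two tasks.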
And S4, carrying out parameter adjustment on the network model by using the triplet sample corresponding to the verification set.
Joint optimization of the two tasks is achieved through gradient computation on the total loss function L. During training, hyper-parameters such as the batch size, the number of network layers and the learning rate can be adjusted to fine-tune the model. This embodiment obtains the following set of hyper-parameter values through training and tuning:
and predicting emotion types corresponding to the audio in the test set by using the trained network, and calculating F1-score (a commonly used classification problem evaluation index, wherein the larger the numerical value is, the better the performance is) according to the real label of the test set.
To evaluate the effectiveness of the method of the present invention, this embodiment compares its recognition performance with several state-of-the-art multi-task speech emotion recognition methods, including MTL with self-attention, MTL with wav2vec, and TIM-Net. These models are trained on the same training set under the same training environment and conditions, and the trained models' F1-scores are calculated on the IEMOCAP test set. To compare the generalization of the different methods, noise is also added to the test-set audio, yielding "noiseless" and "noisy" test conditions; the specific test results are shown in the following table.
The encoder in the network model extracts and outputs an intermediate-layer embedding of the input speech data, and a back-end classifier then identifies the corresponding emotion class from that embedding. For a multi-task learning model containing a speech recognition task and an emotion recognition task, the embedding obtained from the encoder must contain both semantic information and emotion information, and the model must generalize well enough to recognize data from different scenarios. The invention provides an improved encoder based on the Conformer structure, introducing residual connections and a Dropout layer so that more global information is retained when emotion information is extracted, thereby improving the generalization performance and recognition accuracy of the model.
In practical applications, speech emotion recognition is used together with speech recognition in many scenarios, so the two tasks can be combined to exploit their correlation. Based on multi-task learning theory, the speech recognition (ASR) and speech emotion recognition (SER) tasks are combined into one multi-task learning model: a hidden-layer embedding is obtained through the front-end shared encoder and fed simultaneously into the speech recognition module and the speech emotion recognition module. During training, the information shared between the tasks helps the network learn better internal representations, so the shared encoder learns more shared features and the performance of each task improves. In addition, the sentence embedding loss computes the distance between the hidden-layer embedding and the output of the speech recognition module, constraining the sentence-level integrity of the information in the hidden-layer embedding.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (10)
1. A speech emotion recognition method based on a Conformer structure and a multitask learning framework, characterized in that it comprises the following steps:
S1, acquiring a speech emotion data set, and performing feature extraction on the audio data of the training set and the verification set in the speech emotion data set to generate triplet samples, wherein each triplet sample comprises FBank features, an emotion label and a text label;
S2, constructing a network model for multi-task learning, wherein the network model comprises a shared encoder, an emotion recognition decoder and a speech recognition decoder; the shared encoder comprises a Conformer model for deep feature extraction; the emotion recognition decoder comprises an averaging layer and a linear fully-connected layer and is used for outputting an emotion vector; the speech recognition decoder comprises an averaging layer and a CTC decoder and is used for outputting a text vector;
S3, training the network model by using the triplet samples corresponding to the training set;
S4, adjusting the parameters of the network model by using the triplet samples corresponding to the verification set;
S5, inputting the audio data to be recognized into the network model to perform speech recognition and emotion recognition on the audio data to be recognized.
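The flow of steps S1–S5 can be sketched as a minimal skeleton. Every helper below is a hypothetical placeholder (real FBank extraction involves framing, windowing, an FFT and Mel filter banks; the real model is the Conformer network of step S2):

```python
# Skeleton of steps S1-S5; each helper is a toy placeholder, not the
# patent's implementation.

def extract_fbank(audio):
    # Stand-in for FBank feature extraction (step S1).
    return [[float(a)] for a in audio]

def model(fbank):
    # Stand-in for the shared encoder plus two decoders of step S2:
    # returns (emotion scores, text scores).
    flat = [x for frame in fbank for x in frame]
    avg = sum(flat) / len(flat)
    return [avg, 1.0 - avg], [0.5 * avg]

# S1: build triplet samples (FBank features, emotion label, text label).
train_set = [(extract_fbank([0.2, 0.4]), "happy", "hi")]

# S3 / S4: training on train_set and tuning on a verification set would
# update the model parameters here (omitted in this sketch).

# S5: inference on new audio yields both recognition outputs at once.
emotion_scores, text_scores = model(extract_fbank([0.1, 0.3]))
```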
2. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 1, characterized in that: the Conformer model comprises a plurality of stacked Conformer modules, wherein each Conformer module is formed by introducing a CNN module into a Transformer module.
3. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 2, characterized in that: the Conformer module comprises a first feed-forward module, a self-attention module, a convolution module, a second feed-forward module and a normalization module, and the calculation formulas of the Conformer module are as follows:

x̃_i = x_i + (1/2)FFN(x_i)

x′_i = x̃_i + MHSA(x̃_i)

x″_i = x′_i + Conv(x′_i)

y_i = LN(x″_i + (1/2)FFN(x″_i))

where x_i is the input of the first feed-forward module, y_i is the output of the normalization module, FFN denotes the feed-forward module, MHSA denotes the self-attention module, Conv denotes the convolution module, and LN denotes the normalization module.
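The residual structure of one Conformer block can be sketched in a few lines. The sub-modules `ffn`, `mhsa`, `conv` and `ln` below are simple placeholders for the trained layers, and the half-step feed-forward residuals follow the standard Conformer formulation:

```python
# Toy sketch of one Conformer block's residual structure; ffn, mhsa and
# conv are placeholder scalings, ln is a simple layer normalization.

def ffn(v):  return [0.1 * x for x in v]   # placeholder feed-forward module
def mhsa(v): return [0.2 * x for x in v]   # placeholder self-attention module
def conv(v): return [0.3 * x for x in v]   # placeholder convolution module

def ln(v):
    # Layer normalization: zero mean, unit variance over the vector.
    m = sum(v) / len(v)
    var = sum((x - m) ** 2 for x in v) / len(v)
    return [(x - m) / (var + 1e-5) ** 0.5 for x in v]

def conformer_block(x):
    x1 = [a + 0.5 * b for a, b in zip(x, ffn(x))]          # x~ = x + 1/2 FFN(x)
    x2 = [a + b for a, b in zip(x1, mhsa(x1))]             # x' = x~ + MHSA(x~)
    x3 = [a + b for a, b in zip(x2, conv(x2))]             # x'' = x' + Conv(x')
    return ln([a + 0.5 * b for a, b in zip(x3, ffn(x3))])  # y = LN(x'' + 1/2 FFN(x''))

y = conformer_block([1.0, 2.0, 3.0])
```

Each step adds its sub-module's output back onto its input, which is exactly the residual connection recited in claim 4.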
4. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 3, characterized in that: the first feed-forward module, the self-attention module, the convolution module and the second feed-forward module in the Conformer module are connected through residual connections.
5. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 2, characterized in that: the plurality of stacked Conformer modules are connected through residual connections.
6. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 1, characterized in that: the shared encoder comprises a convolutional downsampling layer, a fully-connected layer, a Dropout layer and the Conformer model.
7. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 1, characterized in that: step S3 comprises: calculating a loss function of the network model, wherein the loss function comprises an emotion recognition task loss, a speech recognition task loss and a sentence embedding loss.
8. The speech emotion recognition method based on a Conformer structure and a multitask learning framework of claim 7, characterized in that: calculating the loss function of the network model comprises: passing the triplet samples through the network model to obtain an emotion vector f_e(X_n), a text vector f_t(X_n) and a shared-encoder output g(X_n), where X_n denotes the FBank feature vector, and calculating the loss function L = L_1 + αL_2 + βL_3, where α and β are hyper-parameters;

the emotion recognition task loss is L_1 = (1/N) Σ_{n=1}^{N} MSELoss(f_e(X_n), e_n), where MSELoss denotes the mean-square-error loss function and e_n denotes the emotion label in the n-th triplet sample;

the speech recognition task loss is L_2 = (1/N) Σ_{n=1}^{N} CTCLoss(f_t(X_n), t_n), where CTCLoss denotes the CTC loss function and t_n denotes the text label in the n-th triplet sample;

the sentence embedding loss is L_3 = (1/N) Σ_{n=1}^{N} CrossEntropy(Pool(g(X_n)), Pool(f_t(X_n))), where CrossEntropy denotes the cross-entropy loss function, Pool denotes the average pooling calculation, and N is the number of triplet samples; the sentence embedding loss is used to constrain, at sentence level, the completeness of the deep information in the speech.
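The combined loss L = L_1 + αL_2 + βL_3 can be illustrated numerically with a toy sketch. The per-task losses below are simplified stand-ins: a plain mean-square error replaces both the CTC loss (which is far more involved) and the cross-entropy term, and all tensors are small hand-picked lists:

```python
# Toy sketch of the weighted multi-task loss of claim 8, with MSE used
# as a simplified stand-in for all three per-task loss functions.

def mse(pred, target):
    # Mean-square error between two equal-length vectors.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pool(frames):
    # Average pooling over the time axis (the Pool(.) in claim 8).
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

alpha, beta = 0.5, 0.1                     # hypothetical hyper-parameters

emotion_pred, emotion_label = [0.8, 0.2], [1.0, 0.0]
asr_pred, asr_label = [0.4, 0.6], [0.0, 1.0]
encoder_out = [[0.1, 0.3], [0.5, 0.7]]     # shared-encoder output g(X_n)
asr_out = [[0.2, 0.2], [0.4, 0.8]]         # ASR-branch output f_t(X_n)

L1 = mse(emotion_pred, emotion_label)        # emotion task loss
L2 = mse(asr_pred, asr_label)                # stand-in for the CTC loss
L3 = mse(pool(encoder_out), pool(asr_out))   # sentence embedding loss
L = L1 + alpha * L2 + beta * L3
```

Here L3 compares the time-pooled encoder output against the time-pooled ASR output, mirroring how the sentence embedding loss ties the hidden-layer embedding to the speech recognition branch at sentence level.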
9. An electronic device, characterized by comprising a memory, a processor and a computer program, wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the speech emotion recognition method based on a Conformer structure and a multitask learning framework of any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, the computer program being executed by a processor to implement the speech emotion recognition method based on a Conformer structure and a multitask learning framework of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310552901.0A CN116524962A (en) | 2023-05-17 | 2023-05-17 | Speech emotion recognition method based on Conformer structure and multitask learning framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524962A true CN116524962A (en) | 2023-08-01 |
Family
ID=87390192
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524962A (en) |
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116796250A (en) * | 2023-08-22 | 2023-09-22 | Jinan University | Intelligent identification and separation method and system for aliased wireless signals |
CN116796250B (en) * | 2023-08-22 | 2024-03-08 | Jinan University | Intelligent identification and separation method and system for aliased wireless signals |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||