CN117649861A - Voice emotion recognition method and system based on frame-level emotion state alignment - Google Patents

Voice emotion recognition method and system based on frame-level emotion state alignment

Info

Publication number
CN117649861A
Authority
CN
China
Prior art keywords
emotion
frame
level
model
training
Prior art date
2023-10-31
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311430903.9A
Other languages
Chinese (zh)
Inventor
Li Ya
Li Qifei
Gao Yingming
Wang Cong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2023-10-31
Publication date
2024-03-05
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202311430903.9A priority Critical patent/CN117649861A/en
Publication of CN117649861A publication Critical patent/CN117649861A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a voice emotion recognition method and system based on frame-level emotion state alignment. The method comprises: performing voice emotion recognition on input voice data by using a pre-trained voice emotion recognition model to obtain a sentence-level voice emotion recognition result. In the pre-training process of the voice emotion recognition model, frame-level deep emotion characterizations are extracted from the voice data contained in a training set; a pre-trained clustering model infers frame-level emotion pseudo tags from the frame-level deep emotion characterizations; a frame-level emotion state alignment model is obtained by training on a training set containing the voice data and its frame-level emotion pseudo tags; and the voice emotion recognition model is obtained by transfer learning training on the frame-level emotion state alignment model in combination with emotion labels. The invention can overcome the interference of frames that are inconsistent with the sentence-level label in a voice sample and avoid the high cost of frame-level annotation.

Description

Voice emotion recognition method and system based on frame-level emotion state alignment
Technical Field
The invention relates to the technical field of voice emotion recognition, in particular to a voice emotion recognition method and system based on frame-level emotion state alignment.
Background
Speech emotion recognition is an important component of human-computer interaction systems; its function is to recognize the emotional state contained in the current speaker's speech during human-computer interaction. In human-computer interaction, a machine has mainly two ways to understand human emotion. The first is to transcribe what the person says into text through a speech recognition system and then perform semantic emotion analysis on the text with natural language processing techniques. However, what the text expresses is often inconsistent with the true emotion conveyed by the speech. For example, a person may say "good" while angry: the speaker is in an angry state, but the textual meaning of "good" is understood by the machine as agreement or approval, so the overall interaction experience is poor. The second way is to introduce speech emotion recognition, so that the machine can recognize the speaker's emotional state directly from the speech and then combine it with the textual semantic content; this reduces the machine's misunderstanding of the user's intent and improves the human-computer interaction experience. Speech emotion recognition is therefore an essential component of intelligent perception.
Speech emotion recognition technology has developed rapidly in the last decade. Early speech emotion recognition systems consisted of high-dimensional handcrafted speech features and machine learning algorithms, with commonly used algorithms including support vector machines, random forests, and hidden Markov models. However, these methods have low performance and poor robustness. This is because the dimensionality of the handcrafted features is so high that the model overfits easily and information is lost when the handcrafted features are extracted, and because such machine learning algorithms cannot, or are not good at, modeling frame-level features. Since emotion information has a long-term character, modeling from the frame level is better suited to this task.
At present, all labeled speech emotion data carry only sentence-level labels and have no frame-level emotion annotation, because annotating subjective emotion at the frame level is too costly and time-consuming. All current speech emotion recognition systems are therefore implemented based on sentence-level labels. However, not all frames in an utterance have emotional states consistent with the sentence label of that utterance, so frames that are irrelevant to the sentence-level emotion label interfere with the model's recognition of the true emotion label of the utterance, and the performance of the speech emotion recognition model suffers.
As an easily understood example, assume that the emotion label of an utterance is "happy" and that it has 100 frames in total, of which 50 frames are "happy" and 50 frames carry other emotions. This confuses the model when it tries to recognize the true emotion: only 50% of the utterance expresses "happy", but without frame-level label guidance during learning the model does not know which 50% that is, and with only a 50% chance of the utterance looking "happy" it is prone to misjudgment.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention provide a method and system for speech emotion recognition based on frame-level emotion state alignment, which obviates or mitigates one or more of the disadvantages of the prior art.
One aspect of the present invention provides a speech emotion recognition method based on frame-level emotion state alignment, the method comprising the steps of:
performing voice emotion recognition on the input voice data by utilizing a pre-trained voice emotion recognition model to obtain sentence-level voice emotion recognition results;
in the pre-training process of the voice emotion recognition model, extracting frame-level deep emotion characterizations from the voice data contained in a training set, obtaining frame-level emotion pseudo tags through inference with a pre-trained clustering model based on the frame-level deep emotion characterizations, training on a training set containing the voice data and its frame-level emotion pseudo tags to obtain a frame-level emotion state alignment model, and performing transfer learning training on the frame-level emotion state alignment model in combination with emotion labels to obtain the voice emotion recognition model.
In some embodiments of the present invention, the method further includes a step of pre-training to obtain a speech emotion recognition model, specifically including:
extracting frame-level deep emotion characterization for voice data contained in a training set;
obtaining a frame-level emotion pseudo tag based on frame-level deep emotion characterization reasoning by utilizing a pre-trained clustering model;
training by using a training set containing voice data and frame-level emotion pseudo labels thereof to obtain a frame-level emotion state alignment model;
and performing transfer learning training on the frame-level emotion state alignment model in combination with emotion labels to obtain the voice emotion recognition model.
In some embodiments of the invention, the method further comprises:
and pre-sampling and normalizing the voice data contained in the training set.
In some embodiments of the present invention, the step of extracting a frame-level deep emotion characterization for speech data contained in a training set includes:
inputting the voice data into a pre-training model for voice emotion recognition, wherein the pre-training model comprises a preset number of Transformer layers that sequentially extract features from the voice data, so as to extract the frame-level deep emotion characterization of the voice data.
In some embodiments of the present invention, the method further includes a step of pre-training a cluster model for obtaining a frame-level emotion pseudo tag based on the frame-level deep emotion characterization, including:
presetting the clustering quantity of a clustering model;
and inputting a training set containing the frame-level deep emotion characterization for training the clustering model, and training the clustering model.
In some embodiments of the present invention, the training using a training set including speech data and its frame-level emotion pseudo tags to obtain a frame-level emotion state alignment model includes:
inputting a training set containing voice data and frame-level emotion pseudo tags thereof into a pre-training model;
the pre-training model carries out iterative training through a training set based on an MLM pre-training method, so that the training-completed frame-level emotion state alignment model can align frame-level emotion pseudo labels and frame-level deep emotion characterization.
In some embodiments of the present invention, after being processed by the pre-training model, the training set passes through a label embedding layer, a fully connected layer, and a Softmax layer.
In some embodiments of the present invention, the step of obtaining the speech emotion recognition model by performing transfer learning training on the frame-level emotion state alignment model in combination with emotion labels includes:
adding a layer of attention mechanism on the frame-level emotion alignment model, wherein the type of the attention mechanism comprises any one of a self-attention mechanism, an additive attention mechanism and a hard attention mechanism;
and performing transfer learning training on the frame-level emotion alignment model added with the attention mechanism layer by using a training set containing voice data and emotion labels to obtain a trained voice emotion recognition model.
Another aspect of the present invention provides a speech emotion recognition system based on frame-level emotion state alignment, comprising a processor and a memory, the memory having stored therein computer instructions for executing the computer instructions stored in the memory, which when executed by the processor, implement the steps of the method of any of the above embodiments.
Another aspect of the invention provides a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any of the above embodiments.
According to the voice emotion recognition method and system based on frame-level emotion state alignment, deep emotion characterizations and pseudo tags can be aligned by the pre-trained frame-level emotion state alignment model, and the voice emotion recognition model obtains sentence-level voice emotion recognition results with the help of the frame-level emotion pseudo tags. On the one hand, the voice emotion recognition model obtained by transfer learning training from the frame-level emotion state alignment model can learn emotion features at a finer granularity, weakening the interference of voice frames whose emotion is inconsistent with the sentence label; on the other hand, the frame-level emotion state alignment strategy avoids manual frame-level emotion labeling and thus avoids a large increase in cost.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the above-described specific ones, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a flowchart of a training method for a speech emotion recognition model based on frame-level emotion state alignment in an embodiment of the present invention.
FIG. 2 is a flow chart of training a speech emotion recognition model based on frame-level emotion state alignment in accordance with another embodiment of the present invention.
FIG. 3 is a schematic diagram of a sentence-level speech emotion recognition model according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of a clustering model training process in an embodiment of the invention.
FIG. 5 is a schematic diagram of a frame-level emotion pseudo tag generation flow in an embodiment of the present invention.
FIG. 6 is a schematic diagram of a frame-level emotion state alignment model according to an embodiment of the present invention.
FIG. 7 is a block diagram of a speech emotion recognition model based on frame-level emotion alignment in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.
It should be noted here that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, while other details not greatly related to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted herein that the term "coupled" may refer to not only a direct connection, but also an indirect connection in which an intermediate is present, unless otherwise specified.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar components, or the same or similar steps.
In recent years, with the rapid development of deep learning, convolutional neural networks and various temporal neural networks have emerged, such as long short-term memory networks (LSTM), gated recurrent units (GRU), temporal convolutional networks (TCN), and Transformers. Low-level speech features can be encoded with a convolutional network to obtain deep learning features, and a temporal neural network can then model these deep features at the frame level, which greatly improves the performance of speech emotion recognition systems. Self-supervised speech pre-training large models have also appeared, such as HuBERT (Hidden-unit Bidirectional Encoder Representation from Transformers) and wav2vec 2.0; performing transfer learning on such a pre-trained model, or directly extracting its representations for speech emotion recognition, can further improve performance.
The sentence-level speech emotion recognition methods of the prior art suffer from the problem that frames inconsistent with the sentence-level label exist in the speech and interfere with the model's recognition of the correct emotion, resulting in poor performance. To solve this, the invention provides a speech emotion recognition method based on frame-level emotion state alignment, which first aligns the frame-level emotion states and then performs sentence-level speech emotion recognition. This not only improves the accuracy of speech emotion recognition but also avoids the cost increase that frame-level emotion annotation would bring. Frame-level emotion state alignment is a form of modality alignment: the input of multi-modal learning is multi-modal data from different sources, such as speech, text, and images, and alignment is an important operation that lets a multi-modal model learn information such as mutual representations between different modalities. The goal of this scheme is to align the frame-level deep emotion characterizations obtained from the voice data with the frame-level emotion pseudo tags, and then perform speech emotion recognition based on the aligned frame-level emotion states. A sentence-level label means that one utterance corresponds to one emotion label; for example, a recording 3 seconds in duration whose emotion label is "happy".
In one aspect, the invention provides a training method for a voice emotion recognition model based on frame-level emotion state alignment, which uses a pre-trained voice emotion recognition model to perform voice emotion recognition on input voice data so as to obtain a sentence-level voice emotion recognition result. In the pre-training process of the voice emotion recognition model, frame-level deep emotion characterizations are extracted from the voice data contained in a training set, frame-level emotion pseudo tags are inferred from them by a pre-trained clustering model, a frame-level emotion state alignment model is obtained by training on a training set containing the voice data and its frame-level emotion pseudo tags, and the voice emotion recognition model is obtained by transfer learning training on the frame-level emotion state alignment model in combination with emotion labels.
FIG. 1 is a flowchart of a training method for a speech emotion recognition model based on frame-level emotion state alignment, according to an embodiment of the present invention, the method comprises the following steps:
step S110: and extracting the deep emotion characterization at the frame level for the voice data contained in the training set.
In the specific implementation, the frame-level deep emotion characterization can be extracted with the pre-training model HuBERT, and the pre-training model HuBERT is then trained again based on the frame-level emotion pseudo tags. The frame-level deep emotion characterization of the voice data is extracted by Transformer layers; the expected extraction effect can be achieved by stacking several Transformer layers, and the number of layers to stack is determined according to the scale of the voice data. HuBERT is merely an example pre-training model, and other BERT-like models may be adapted to this step by transfer learning.
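As an illustrative sketch only (not part of the patent text), frame-level deep representations can be pulled from an intermediate Transformer layer of a pre-trained HuBERT model; the Hugging Face transformers library, the "facebook/hubert-base-ls960" checkpoint, and the choice of layer 9 are assumptions made here for illustration.

```python
# Illustrative sketch: extract frame-level representations from an intermediate
# Transformer layer of a pre-trained HuBERT model. Checkpoint and layer index
# are assumed example values, not fixed by the patent.
import torch
import torchaudio
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

waveform, sr = torchaudio.load("sample.wav")                 # (channels, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

with torch.no_grad():
    outputs = model(waveform, output_hidden_states=True)

# hidden_states[0] is the CNN feature-encoder output; hidden_states[9] is the
# output of the 9th Transformer layer: one vector per ~20 ms frame.
frame_repr = outputs.hidden_states[9].squeeze(0)             # (num_frames, 768)
print(frame_repr.shape)
```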
Furthermore, before step S110, the voice data also needs to be preprocessed, where the preprocessing may include steps such as resampling and normalization.
Step S120: and obtaining a frame-level emotion pseudo tag based on frame-level deep emotion characterization reasoning by utilizing a pre-trained cluster model.
Regarding the frame-level emotion pseudo tag: all speech must be divided into frames when it is played, when features are extracted, or when it is used as input to a neural network. For example, a 1-second (1000 ms) audio clip at a 16 kHz sampling rate, with a frame length of 25 ms and a frame shift of 10 ms, is divided into ((1000-25)/10)+1 ≈ 98 frames. The labels of existing data sets are at sentence level, and labeling at the frame level is difficult and costly. However, frame-level tags describe the overall emotion of the audio better: for example, with 100 frames of which 50 frames are happy and 50 frames carry other emotions, the model can more easily recognize the speech sample as happy. If only sentence-level tags are available, the model has difficulty assigning the correct emotion tag to the sample, because the 50 other-emotion frames interfere with its recognition.
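A tiny sketch of that framing arithmetic (illustrative only; frame length and shift are the example values above):

```python
# Illustrative framing arithmetic: 1 s of audio, 25 ms frames, 10 ms shift.
def num_frames(duration_ms: float, frame_ms: float = 25.0, shift_ms: float = 10.0) -> int:
    return int((duration_ms - frame_ms) // shift_ms) + 1

print(num_frames(1000))  # -> 98
```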
Step S130: training by using a training set containing voice data and frame-level emotion pseudo labels thereof to obtain a frame-level emotion state alignment model.
Step S140: and performing transfer learning training on the frame-level emotion state alignment model and emotion labels to obtain the voice emotion recognition model.
According to the embodiment of the invention, deep emotion characterizations and pseudo tags can be aligned by the pre-trained frame-level emotion state alignment model, and the voice emotion recognition model obtains sentence-level voice emotion recognition results with the help of the frame-level emotion pseudo tags. On the one hand, the voice emotion recognition model obtained by transfer learning training from the frame-level emotion state alignment model can learn emotion features at a finer granularity, weakening the interference of voice frames whose emotion is inconsistent with the sentence label; on the other hand, the frame-level emotion state alignment strategy avoids manual frame-level emotion labeling and thus avoids a large increase in cost.
In one embodiment of the present invention, the step of extracting the frame-level deep emotion characterization for the speech data included in the training set in step S110 includes: inputting the voice data into a pre-training model for voice emotion recognition, wherein the pre-training model comprises a preset number of Transformer layers that sequentially extract features from the voice data, so as to obtain the frame-level deep emotion characterization of the voice data. The number of Transformer layers is chosen according to the scale of the training set and the voice data, and the specific parameter range is not limited.
With this embodiment, frame-level deep emotion characterizations can be extracted, so that frame-level emotion labels are generated while bypassing manual frame-level annotation: the frame-level pseudo tags are obtained directly from the deep emotion characterizations of the pre-trained model, and frame-level pseudo emotion labels are obtained without a large increase in computation.
In some embodiments of the present invention, the clustering model in step S120 needs to be trained in advance, and the method further includes a step of pre-training a clustering model that obtains frame-level emotion pseudo tags based on frame-level deep emotion characterizations, including: (1) presetting the number of clusters of the clustering model; (2) inputting a training set containing the frame-level deep emotion characterizations and training the clustering model on it. Specifically, the number of clusters may be 50, 100, 150, 200, 500, etc., and needs to be chosen according to the size of the data set.
Optionally, the type of the clustering model includes any one of a K-means model, spectral clustering, or a Gaussian mixture model. However, the present invention is not limited thereto; the above types are only illustrative, and the purpose is to cluster the extracted deep emotion characterizations into a preset number of clusters that serve as a preset number of pseudo tags. The pseudo-tag technique is suitable for small-sample learning; the notion comes from semi-supervised learning, whose core idea is to use unlabeled data to improve model performance during supervised training.
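A minimal sketch, assuming a scikit-learn K-means implementation and an example cluster count of 50, of training the clustering model on pooled frame-level characterizations and inferring frame-level pseudo tags:

```python
# Illustrative sketch: fit K-means on frame-level deep emotion characterizations
# pooled over the training set, then infer a pseudo tag for every frame of an
# utterance. The cluster count (50) is only an example value.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_cluster_model(all_frame_reprs, n_clusters=50, seed=0):
    # all_frame_reprs: list of (num_frames_i, hidden_dim) arrays, one per utterance
    features = np.concatenate(all_frame_reprs, axis=0)     # (total_frames, hidden_dim)
    kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed)
    kmeans.fit(features)
    return kmeans

def frame_pseudo_tags(kmeans, frame_repr):
    # frame_repr: (num_frames, hidden_dim) for one utterance
    return kmeans.predict(frame_repr)                      # (num_frames,) cluster ids
```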
With this embodiment, pseudo tags can be generated by the clustering model and used to train the frame-level emotion state alignment model, so that frame-level emotion labels are generated while bypassing manual annotation, and the amount of computation is reduced.
In some embodiments of the present invention, the training using the training set including the voice data and its frame-level emotion pseudo tags to obtain the frame-level emotion state alignment model includes: (1) inputting a training set containing voice data and its frame-level emotion pseudo tags into a pre-training model; (2) the pre-training model is iteratively trained on the training set with an MLM pre-training method, so that the trained frame-level emotion state alignment model can align frame-level emotion pseudo tags with frame-level deep emotion characterizations. MLM (Masked Language Modeling) is a pre-training method for natural language processing: a proportion of words or characters in the text is randomly masked, and the model is asked to predict the masked content; in this way the model learns the contextual relationships between words and the statistical properties of the language. In the embodiment of the invention, the MLM method is used to learn the semantic association between the voice data and the frame-level emotion pseudo tags in the training set, thereby achieving frame-level emotion state alignment.
In the implementation, the frame-level emotion state alignment model may be included in the speech emotion recognition model; in other words, speech emotion recognition with frame-level emotion alignment is realized by a speech emotion recognition model after transfer learning training, and a pre-trained speech emotion recognition model may be used to obtain frame-level emotion pseudo tags aligned with the frame-level deep emotion characterizations. After being processed by the pre-training model, the training set passes through a label embedding layer, a fully connected layer, and a Softmax layer.
With this embodiment, generating frame-level labels directly with a frame-level voice emotion recognition model can be avoided, which reduces the amount of computation.
In some embodiments of the present invention, step S140 includes the step of performing transfer learning training on the frame-level emotion state alignment model in combination with emotion labels to obtain the speech emotion recognition model, where the step includes:
(1) Adding an attention mechanism layer on top of the frame-level emotion alignment model, wherein the type of the attention mechanism includes any one of a self-attention mechanism, an additive attention mechanism, and a hard attention mechanism. After the frame-level emotion states are aligned, the last Transformer layer of HuBERT corresponds to one pseudo tag for each frame; an attention mechanism is then introduced to help the model focus on the pseudo-tag frames related to the sentence-level label, and the frame-level emotion characterizations and the sentence-level emotion label are aligned through the attention mechanism.
(2) Performing transfer learning training on the frame-level emotion alignment model with the added attention mechanism layer, using a training set containing voice data and emotion labels, to obtain the trained voice emotion recognition model. Optionally, the attention mechanism layer may use a self-attention mechanism, an additive attention mechanism, a hard attention mechanism, or the like.
With this embodiment of the invention, the frame-level deep emotion characterization of the voice data can be extracted, and the semantic association between the frame-level deep emotion characterizations and the frame-level emotion pseudo tags is transferred to the semantic association between the frame-level deep emotion characterizations and the emotion labels. The voice emotion recognition model itself contains the Transformer layers that extract the frame-level deep emotion characterization, from which a sentence-level voice emotion recognition result containing the emotion label is obtained. Without increasing the algorithmic complexity, the influence of interfering frames on voice emotion recognition is eliminated or reduced through frame-level emotion state alignment, and a more accurate voice emotion recognition result is obtained.
Further, in some embodiments of the present invention, a training set including speech data and emotion labels related to depression diagnosis is used to perform transfer learning training on the frame-level emotion alignment model to which the attention mechanism layer is added, so as to obtain a trained speech emotion recognition model. The emotion labels related to depression diagnosis include labels that classify and grade depressive emotion. Such a speech emotion recognition model can be used for the detection of depression.
Further, on the basis of frame-level emotion alignment, the method also includes a step of weakening inconsistent speech frames: consistent and inconsistent speech frames are distinguished by the frame-level emotion state alignment model, different weights are assigned to different speech frames, and a weighted computation is used when generating the sentence-level emotion recognition result, so that the interference of inconsistent speech frames on the speech emotion recognition result is weakened.
In a specific embodiment of the invention, aiming at the problem that frame-level emotion labels are costly to annotate, the invention provides a method for automatically labeling frame-level emotion pseudo tags for frame-level emotion state alignment. The method mainly comprises three steps: (1) automatic labeling of frame-level emotion pseudo tags; (2) frame-level alignment based on the frame-level emotion pseudo tags; (3) speech emotion recognition based on frame-level emotion state alignment.
Specifically, in the step of automatically labeling frame-level emotion pseudo tags, a sentence-level emotion recognition model is built with transfer learning based on the pre-training model HuBERT and a speech emotion data set (i.e., the voice data), and the deep emotion characterization produced by the ninth Transformer layer of the pre-training model when processing the voice data is extracted. The deep emotion characterizations are then clustered with a k-means clustering algorithm to obtain a frame-level emotion pseudo tag for each frame. The deep emotion characterization is a feature matrix whose first dimension is the number of frames, so each frame corresponds to one feature vector; this matches the input expected by a clustering algorithm, which can then cluster the frames on its own. The clustering algorithm is an unsupervised learning method and essentially a classification algorithm. HuBERT is one of the self-supervised speech representation learning models and is used here only as an example; other BERT-like models may be substituted.
In the frame-level alignment step, a frame-level emotion alignment model is obtained by continuing to pre-train HuBERT with the masked language modeling (MLM) training method based on the frame-level emotion pseudo tags. In plain terms, HuBERT is used to obtain a large amount of speech emotion data annotated with frame-level pseudo tags, and this pseudo-labeled speech emotion data is used to perform frame-level alignment. The pre-training of the HuBERT model is itself a process of aligning frames with pseudo tags, except that its original pseudo tags represent the structure of the speech sequence; the invention achieves frame-level emotion state alignment by pre-training HuBERT with emotion pseudo tags instead.
In the step of speech emotion recognition based on frame-level emotion state alignment, an attention mechanism layer can be added on the frame-level emotion alignment model, and transfer learning training can be performed to obtain a speech emotion recognition model based on frame-level emotion alignment.
FIG. 2 is a flowchart of training a speech emotion recognition model based on frame-level emotion state alignment in accordance with another embodiment of the present invention, comprising the steps of:
step S201: acquiring a voice emotion data set, resampling and normalizing all voice data;
steps S202 to S203: establishing a sentence-level emotion recognition model by using a transfer learning technology based on a pre-training model HuBERT and a voice emotion data set;
step S204: Extracting the deep emotion characterization from the ninth Transformer layer of the system built on the pre-training model HuBERT;
steps S205 to S207: Clustering the deep emotion characterizations of the ninth Transformer layer with a K-means clustering algorithm, for example into 50 clusters, to obtain the emotion pseudo tag of each frame of speech;
steps S208 to S209: Continuing to pre-train HuBERT based on the frame-level emotion pseudo tags and the emotion data set to obtain a frame-level emotion state alignment model;
steps S210 to S211: Performing transfer learning based on the frame-level emotion state alignment model and the attention mechanism to realize a voice emotion recognition model based on frame-level emotion state alignment.
Specifically, the resampling and normalizing in step S201 include: (1) resampling all audio samples, where the sampling rate may be set to 16000 Hz; (2) normalizing the resampled speech. The normalization is expressed as:

X_norm,i = (X_i - X_mean) / X_std

where X represents the set of sampling points of the audio, X_mean represents the mean of all sampling points, X_std represents the standard deviation of all sampling points, X_i represents the i-th sampling point, and X_norm,i represents the normalized i-th sampling point.
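A minimal sketch of step S201, assuming torchaudio is used for loading and resampling; the small epsilon in the denominator is a numerical-safety assumption not stated in the text:

```python
# Illustrative sketch of step S201: resample to 16 kHz and normalize the
# waveform sample points by their mean and standard deviation.
import numpy as np
import torchaudio

def preprocess(path: str, target_sr: int = 16000) -> np.ndarray:
    waveform, sr = torchaudio.load(path)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    x = waveform.squeeze(0).numpy()              # mono waveform as a 1-D array
    return (x - x.mean()) / (x.std() + 1e-8)     # small epsilon assumed for safety
```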
In steps S202 to S204, a sentence-level emotion recognition model is built and the deep emotion characterization of its ninth Transformer layer is extracted; the specific structure is shown in FIG. 3. The steps include: (1) constructing an initial sentence-level emotion recognition model consisting of the pre-trained HuBERT model, a fully connected layer, and a softmax layer as the last layer; (2) performing transfer learning on the initial sentence-level emotion recognition model with an existing emotion speech corpus to predict several emotion categories, using a cross-entropy loss function; (3) extracting the deep emotion characterization from the ninth Transformer layer of the sentence-level emotion recognition model after transfer learning training.
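A minimal PyTorch sketch of such an initial sentence-level model; the checkpoint name, the number of emotion categories, and mean pooling over frames before the fully connected layer are assumptions made for illustration (the text does not specify the pooling):

```python
# Illustrative sketch of the initial sentence-level emotion recognition model:
# pre-trained HuBERT, a fully connected layer, and softmax (applied inside the
# cross-entropy loss). Mean pooling and the checkpoint name are assumptions.
import torch
import torch.nn as nn
from transformers import HubertModel

class SentenceEmotionModel(nn.Module):
    def __init__(self, num_emotions: int = 4, repr_layer: int = 9):
        super().__init__()
        self.hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.fc = nn.Linear(self.hubert.config.hidden_size, num_emotions)
        self.repr_layer = repr_layer

    def forward(self, input_values):
        out = self.hubert(input_values, output_hidden_states=True)
        frame_repr = out.hidden_states[self.repr_layer]   # (B, T, H) deep emotion characterization
        pooled = out.last_hidden_state.mean(dim=1)        # (B, H) sentence-level pooling (assumed)
        return self.fc(pooled), frame_repr

# Transfer learning with a cross-entropy loss over the emotion categories:
#   logits, _ = model(input_values)
#   loss = nn.CrossEntropyLoss()(logits, emotion_labels)
```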
FIG. 4 shows the training step of the clustering model, taking a K-means model as an example: the trained HuBERT and the fully connected layer form the sentence-level speech emotion recognition model, the frame-level emotion characterizations are extracted, and the K-means model is trained by clustering these frame-level emotion characterizations.
In steps S205 to S207, as shown in FIG. 5, the trained HuBERT and the fully connected layer form the sentence-level speech emotion recognition model, the frame-level emotion characterizations are extracted, and the frame-level emotion pseudo tags are inferred by the K-means clustering model.
The step of pre-training HuBERT to obtain the frame-level emotion state alignment model in steps S208 to S209, as shown in FIG. 6, specifically includes: (1) building an initial HuBERT pre-training model, which comprises the HuBERT pre-training model, a label embedding layer, a fully connected layer, and a softmax layer as the last layer; (2) continuing to pre-train the initial HuBERT with the MLM pre-training method and the frame-level emotion pseudo tags to obtain the frame-level emotion state alignment model. The pre-training process of HuBERT is itself a process of aligning pseudo tags with frame-level representations, so continuing to pre-train HuBERT aligns the frame-level pseudo emotion tags with the frame representations. During this continued pre-training, the cross-entropy loss is computed only for the masked frames. Alignment here refers to finding the correspondence between the representation of a speech frame and a pseudo tag. As an analogy, speech recognition transcribes what we say into text: when it works, it judges which sound the current frame is producing, for example "a" or "b". The idea of this scheme is similar: determine which pseudo-tag category the current speech frame belongs to, for example category 0 or category 10.
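A simplified sketch of this continued masked pre-training, under assumptions: the model interface (a frame_mask argument and per-frame logits over the pseudo-tag clusters) and the masking ratio are hypothetical, and only masked frames contribute to the cross-entropy loss as described above:

```python
# Simplified sketch of continued MLM-style pre-training with frame-level emotion
# pseudo tags: mask a fraction of frames, predict a pseudo-tag class for each
# frame, and compute cross-entropy only on the masked frames. The mask_prob
# value and the model's frame_mask argument are hypothetical.
import torch
import torch.nn as nn

def mlm_step(model, input_values, pseudo_tags, mask_prob=0.08):
    # pseudo_tags: (B, T) K-means cluster ids, one per frame
    B, T = pseudo_tags.shape
    mask = torch.rand(B, T, device=pseudo_tags.device) < mask_prob   # frames to occlude
    # The model is assumed to replace masked frame features with a learned mask
    # embedding before its Transformer layers and to emit per-frame logits over
    # the pseudo-tag clusters (HuBERT encoder + label embedding / output head).
    logits = model(input_values, frame_mask=mask)          # (B, T, num_clusters)
    targets = pseudo_tags.masked_fill(~mask, -100)         # ignore unmasked frames
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
    )
```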
The steps of implementing the speech emotion recognition model based on frame-level emotion state alignment by transfer learning in steps S210 to S211, as shown in FIG. 7, specifically include:
(1) Constructing an initial voice emotion recognition model based on frame-level emotion state alignment: first the label embedding layer and the fully connected layer of the frame-level emotion state alignment model are removed, and then an attention mechanism layer, a fully connected layer, and a Softmax layer as the last layer are added. The attention mechanism is used to align the frame-level emotion characterizations with the sentence-level label, and its computation is as follows:
α_i = softmax(tanh(W h_i));    (4)

where W is a learnable attention weight matrix, h_i is the representation of the i-th frame, α_i is the attention weight of the i-th frame, and Z = Σ_i α_i h_i is the output of the attention mechanism, i.e. the attention-weighted sum of the frame representations; softmax and tanh denote the respective activation functions. A minimal code sketch of this attention pooling is given after step (2) below.
(2) Performing transfer learning on the initial voice emotion recognition model based on frame-level emotion state alignment with the speech emotion data set, using a cross-entropy loss function, to obtain the final voice emotion recognition model based on frame-level emotion state alignment; given input voice data, this model produces the voice emotion recognition result based on frame-level emotion state alignment, i.e. the emotion category. Transfer learning means performing other tasks (e.g., speech recognition or emotion recognition) on top of a pre-trained model (e.g., HuBERT); such tasks are downstream tasks of the pre-trained model. Pre-trained models are divided into supervised and unsupervised (or self-supervised) pre-trained models. A supervised pre-trained model is trained on a large labeled data set (e.g., for speech recognition) and then transferred to tasks in other domains (e.g., speech emotion recognition); such a speech recognition model is a pre-trained model. A self-supervised pre-trained model designs proxy tasks and constructs proxy-task labels on a large unlabeled data set, and the model is trained through contrastive learning; this training paradigm learns the commonalities of the samples, and these commonalities can then be used for tasks in other fields (e.g., speech emotion recognition). The HuBERT model used here is a typical self-supervised pre-trained model.
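A minimal PyTorch sketch of the attention pooling of equation (4) placed on top of the frame-level emotion state alignment model after its label embedding and fully connected layers are removed; the hidden size, class count, and encoder interface are illustrative assumptions:

```python
# Illustrative sketch of the attention pooling of equation (4): a learnable
# matrix W scores each frame representation h_i, softmax over the frames gives
# the attention weights alpha_i, and the weighted sum Z feeds the classifier.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.W = nn.Linear(hidden_dim, 1, bias=False)      # learnable attention weights

    def forward(self, h):                                   # h: (B, T, H) frame representations
        scores = torch.tanh(self.W(h)).squeeze(-1)          # (B, T), tanh(W h_i)
        alpha = torch.softmax(scores, dim=1)                # (B, T), attention weights
        z = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)     # (B, H), Z = sum_i alpha_i h_i
        return z, alpha

class FrameAlignedSER(nn.Module):
    def __init__(self, encoder, hidden_dim: int = 768, num_emotions: int = 4):
        super().__init__()
        self.encoder = encoder                               # frame-level emotion state alignment model
        self.pool = AttentionPooling(hidden_dim)
        self.fc = nn.Linear(hidden_dim, num_emotions)        # softmax applied inside the loss

    def forward(self, input_values):
        h = self.encoder(input_values).last_hidden_state     # (B, T, H)
        z, _ = self.pool(h)
        return self.fc(z)
```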
According to the voice emotion recognition method and system based on frame-level emotion state alignment, deep emotion characterizations and pseudo tags can be aligned by the pre-trained frame-level emotion state alignment model, and the voice emotion recognition model obtains sentence-level voice emotion recognition results with the help of the frame-level emotion pseudo tags. On the one hand, the voice emotion recognition model obtained by transfer learning training from the frame-level emotion state alignment model can learn emotion features at a finer granularity, weakening the interference of voice frames whose emotion is inconsistent with the sentence label; on the other hand, the frame-level emotion state alignment strategy avoids manual frame-level emotion labeling and thus avoids a large increase in cost. On the basis of sentence-level emotion recognition, frame-level emotion state alignment solves the problem that the emotion states of some frames in a voice sample are inconsistent with the label and disturb the model's emotion recognition result, improves the emotion recognition performance of the pre-trained model, and greatly improves the accuracy of multi-class speech emotion classification.
The frame-level emotion labels introduced by the invention better guide the model to learn emotion features at a finer granularity. After frame-level alignment, each frame has a corresponding emotion representation, so that in sentence-level multi-class emotion recognition the model strengthens the contribution of voice frames whose pseudo tags are consistent with the sentence-level emotion label and weakens the interference of voice frames with inconsistent emotion, alleviating the problem that the emotion states of many frames in a voice sample are inconsistent with the sentence label. The scheme also effectively addresses the high cost of annotating frame-level emotion labels: previous studies have shown that frame-level emotion information is contained in the representations of sentence-level speech emotion recognition models, so the representations of certain layers of a sentence-level model can be extracted and clustered to form pseudo emotion tags that approximate true emotion labels.
The invention can align the frame-level emotion states, further realize voice emotion recognition based on the alignment of the frame-level emotion states, solve the interference of some frames inconsistent with the tags in the voice sample, and solve the problem of high labeling cost of the frame-level emotion tags. Compared with the prior art, the method has outstanding substantive characteristics and remarkable progress.
Accordingly, the present invention also provides a speech emotion recognition system based on frame-level emotion state alignment, comprising a computer device including a processor and a memory, the memory having computer instructions stored therein; the processor is configured to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the system implements the steps of the method as described above.
Embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method as described above. The computer readable storage medium may be a tangible storage medium such as Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disk, a removable memory disk, a CD-ROM, or any other form of storage medium known in the art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein can be implemented as hardware, software, or a combination of both. Whether a particular implementation uses hardware or software depends on the specific application of the solution and its design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave.
It should be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and shown, and those skilled in the art can make various changes, modifications and additions, or change the order between steps, after appreciating the spirit of the present invention.
In this disclosure, features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for speech emotion recognition based on frame-level emotion state alignment, the method comprising the steps of:
performing voice emotion recognition on the input voice data by utilizing a pre-trained voice emotion recognition model to obtain sentence-level voice emotion recognition results;
in the pre-training process of the voice emotion recognition model, extracting frame-level deep emotion characterizations from the voice data contained in a training set, obtaining frame-level emotion pseudo tags through inference with a pre-trained clustering model based on the frame-level deep emotion characterizations, training on a training set containing the voice data and its frame-level emotion pseudo tags to obtain a frame-level emotion state alignment model, and performing transfer learning training on the frame-level emotion state alignment model in combination with emotion labels to obtain the voice emotion recognition model.
2. The method according to claim 1, further comprising the step of pre-training to obtain a speech emotion recognition model, comprising in particular:
extracting frame-level deep emotion characterization for voice data contained in a training set;
obtaining a frame-level emotion pseudo tag based on frame-level deep emotion characterization reasoning by utilizing a pre-trained clustering model;
training by using a training set containing voice data and frame-level emotion pseudo labels thereof to obtain a frame-level emotion state alignment model;
and performing transfer learning training on the frame-level emotion state alignment model in combination with emotion labels to obtain the voice emotion recognition model.
3. The method according to claim 2, characterized in that the method further comprises:
and pre-sampling and normalizing the voice data contained in the training set.
4. The method of claim 2, wherein the step of extracting frame-level deep emotion characterizations for speech data contained in the training set comprises:
inputting the voice data into a pre-training model for voice emotion recognition, wherein the pre-training model comprises a preset number of Transformer layers that sequentially extract features from the voice data, so as to extract the frame-level deep emotion characterization of the voice data.
5. The method of claim 2, further comprising the step of pre-training a clustering model that obtains frame-level emotion pseudo tags based on frame-level deep emotion characterization, comprising:
presetting the clustering quantity of a clustering model;
and inputting a training set containing the frame-level deep emotion characterizations to train the clustering model.
6. The method of claim 2, wherein the training using a training set comprising speech data and its frame-level emotion pseudo tags results in a frame-level emotion state alignment model comprising:
inputting a training set containing voice data and frame-level emotion pseudo tags thereof into a pre-training model;
the pre-training model carries out iterative training through a training set based on an MLM pre-training method, so that the training-completed frame-level emotion state alignment model can align frame-level emotion pseudo labels and frame-level deep emotion characterization.
7. The method of claim 6, wherein after being processed by the pre-training model, the training set passes through a label embedding layer, a fully connected layer, and a Softmax layer.
8. The method according to claim 2, wherein the step of obtaining the speech emotion recognition model by performing transfer learning training on the frame-level emotion state alignment model in combination with emotion labels comprises:
adding a layer of attention mechanism on the frame-level emotion alignment model, wherein the type of the attention mechanism comprises any one of a self-attention mechanism, an additive attention mechanism and a hard attention mechanism;
and performing transfer learning training on the frame-level emotion alignment model added with the attention mechanism layer by using a training set containing voice data and emotion labels to obtain a trained voice emotion recognition model.
9. A speech emotion recognition system based on frame level emotion state alignment, comprising a processor and a memory, wherein the memory has stored therein computer instructions for executing the computer instructions stored in the memory, which when executed by the processor, implement the steps of the method of any of claims 1 to 8.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
CN202311430903.9A 2023-10-31 2023-10-31 Voice emotion recognition method and system based on frame-level emotion state alignment Pending CN117649861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311430903.9A CN117649861A (en) 2023-10-31 2023-10-31 Voice emotion recognition method and system based on frame-level emotion state alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311430903.9A CN117649861A (en) 2023-10-31 2023-10-31 Voice emotion recognition method and system based on frame-level emotion state alignment

Publications (1)

Publication Number Publication Date
CN117649861A 2024-03-05

Family

ID=90042379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311430903.9A Pending CN117649861A (en) 2023-10-31 2023-10-31 Voice emotion recognition method and system based on frame-level emotion state alignment

Country Status (1)

Country Link
CN (1) CN117649861A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109616105A (en) * 2018-11-30 2019-04-12 江苏网进科技股份有限公司 A kind of noisy speech recognition methods based on transfer learning
US20200135174A1 (en) * 2018-10-24 2020-04-30 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system
US20210312905A1 (en) * 2020-04-03 2021-10-07 Microsoft Technology Licensing, Llc Pre-Training With Alignments For Recurrent Neural Network Transducer Based End-To-End Speech Recognition
CN114822518A (en) * 2022-04-29 2022-07-29 思必驰科技股份有限公司 Knowledge distillation method, electronic device, and storage medium
CN115457982A (en) * 2022-09-06 2022-12-09 平安科技(深圳)有限公司 Pre-training optimization method, device, equipment and medium of emotion prediction model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination