WO2022178942A1 - Emotion recognition method and apparatus, computer device, and storage medium - Google Patents

Emotion recognition method and apparatus, computer device, and storage medium Download PDF

Info

Publication number
WO2022178942A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
feature
feature vector
emotion
classification model
Prior art date
Application number
PCT/CN2021/084252
Other languages
French (fr)
Chinese (zh)
Inventor
顾艳梅
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022178942A1 publication Critical patent/WO2022178942A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to an emotion recognition method, device, computer equipment and storage medium.
  • the inventors realized that the existing emotion recognition models generally predict emotion categories by analyzing and recognizing speech signals.
  • the emotional state expressed by humans is often affected by various factors such as culture, country, and population.
  • the existing emotion recognition models cannot effectively avoid the influence of these factors, so the accuracy of emotion recognition is low.
  • the present application provides an emotion recognition method, device, computer equipment and storage medium: by back-propagating the predicted feature vector output by the speaker classification model to the feature generator to generate an emotion feature vector from which speaker features are eliminated, and training the emotion classification model according to that emotion feature vector, the influence of different speakers on the emotion classification model can be eliminated and the accuracy of emotion recognition improved.
  • the present application provides an emotion recognition method, the method comprising:
  • the training data includes emotional feature information and labeled emotional category labels, as well as speaker feature information and labeled speaker category labels;
  • the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model;
  • the emotion feature vector group that eliminates the speaker feature and the labeled emotion category label are input into the emotion classification model for iterative training, until the emotion classification model converges, and a trained emotion recognition model is obtained;
  • the present application also provides an emotion recognition device, the device comprising:
  • a training data acquisition module used for acquiring training data
  • the training data includes emotional feature information and marked emotional category labels, and speaker feature information and marked speaker category labels;
  • a model calling module used for calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
  • a first feature generation module configured to input the emotional feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotional feature vector group and speaker feature vector group;
  • the first training module is used to input the speaker feature vector group and the labeled speaker category label into the speaker classification model for iterative training until convergence, and to obtain the predicted feature vector corresponding to the trained speaker classification model;
  • the second feature generation module is used for back-propagating the predicted feature vector to the feature generator for feature generation to obtain an emotional feature vector group that eliminates speaker features;
  • the second training module is used for inputting the emotion feature vector group from which speaker features are eliminated and the labeled emotion category label into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model;
  • the emotion recognition module is used for acquiring the speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
  • the present application also provides a computer device, the computer device comprising a memory and a processor;
  • the memory is used for storing a computer program;
  • the processor is configured to execute the computer program and, when executing the computer program, to implement:
  • the training data includes emotional feature information and labeled emotional category labels, as well as speaker feature information and labeled speaker category labels;
  • the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model;
  • the emotion feature vector group that eliminates the speaker feature and the labeled emotion category label are input into the emotion classification model for iterative training, until the emotion classification model converges, and a trained emotion recognition model is obtained;
  • the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor implements:
  • the training data includes emotional feature information and labeled emotional category labels, as well as speaker feature information and labeled speaker category labels;
  • the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model;
  • the emotion feature vector group that eliminates the speaker feature and the labeled emotion category label are input into the emotion classification model for iterative training, until the emotion classification model converges, and a trained emotion recognition model is obtained;
  • the embodiments of the present application have the following beneficial effects: by acquiring the training data, emotional feature information with labeled emotion category labels and speaker feature information with labeled speaker category labels can be obtained; by calling the emotion recognition model to be trained, the emotion classification model and the speaker classification model in the emotion recognition model can be trained separately to obtain the trained emotion recognition model; by inputting the emotion feature information and speaker feature information into the feature generator for feature generation, the corresponding emotion feature vector group and speaker feature vector group can be obtained; by inputting the speaker feature vector group and the labeled speaker category labels into the speaker classification model and iteratively training it to convergence, the predicted feature vector can be obtained from the trained speaker classification model; by back-propagating the predicted feature vector to the feature generator for feature generation, the speaker feature vectors can be unified and the emotion feature vector group from which speaker features are eliminated can then be obtained; and by inputting the emotion feature vector group from which speaker features are eliminated and the labeled emotion category labels into the emotion classification model for iterative training, an emotion recognition model that is not affected by speaker features can be obtained, thereby improving the accuracy of emotion recognition.
  • FIG. 1 is a schematic flowchart of a method for emotion recognition provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a sub-step of acquiring training data provided by an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of an emotion recognition model provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a feature generated by a feature generator provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a sub-step of training a speaker classification model provided by an embodiment of the present application
  • FIG. 6 is a schematic interaction diagram for obtaining an emotion feature vector group that eliminates speaker features provided by an embodiment of the present application
  • FIG. 7 is a schematic flowchart of a sub-step of acquiring an emotion feature vector group that eliminates speaker features according to an embodiment of the present application
  • FIG. 8 is a schematic interaction diagram of invoking an emotion recognition model for emotion recognition provided by an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of an emotion recognition device provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
  • Embodiments of the present application provide an emotion recognition method, apparatus, computer device, and storage medium.
  • the emotion recognition method can be applied to a server or a terminal: by back-propagating the predicted feature vector output by the speaker classification model to the feature generator, an emotion feature vector from which speaker features are eliminated is generated, and training the emotion classification model on that emotion feature vector eliminates the influence of different speakers on the emotion recognition model and improves the accuracy of emotion recognition.
  • the server may be an independent server or a server cluster.
  • Terminals can be electronic devices such as smart phones, tablet computers, notebook computers, and desktop computers.
  • the emotion recognition method includes steps S101 to S107.
  • Step S101 Acquire training data, where the training data includes emotional feature information and labeled emotional category labels, and speaker feature information and labeled speaker category labels.
  • by acquiring the training data, emotional feature information with labeled emotion category labels and speaker feature information with labeled speaker category labels can be obtained; the speaker feature vector group and the labeled speaker category labels are then used to train the speaker classification model and obtain the predicted feature vector corresponding to the trained speaker classification model, so that an emotion feature vector from which speaker features are eliminated can be generated according to the predicted feature vector and used to train the emotion classification model.
  • FIG. 2 is a schematic flowchart of sub-steps of acquiring training data in step S101 , which may specifically include the following steps S1011 to S1014 .
  • Step S1011 Obtain sample voice signals corresponding to a preset number of sample users, and extract useful voice signals in the sample voice signals, wherein the sample voice signals are stored in the blockchain.
  • sample voice signals corresponding to a preset number of sample users may be obtained from the blockchain.
  • the sample users include different speakers.
  • the voices of test subjects from different regions, cultures or age groups and in different emotional states can be collected.
  • the obtained sample speech signal includes speech signals of different emotional categories corresponding to multiple speakers.
  • the emotion categories may include positive emotions and negative emotions.
  • positive emotions may include, but are not limited to, calm, optimistic, happy, etc.
  • negative emotions may include, but are not limited to, complaints, blame, abuse, and the like.
  • the above-mentioned sample voice signal may also be stored in a node of a blockchain.
  • the sample speech signal may also be stored in a local database or an external storage device, which is not specifically limited.
  • since the sample speech signal may include useless signals, in order to improve the recognition accuracy of subsequent speaker categories and emotion categories, it is necessary to extract the useful speech signals in the sample speech signals.
  • the useless signals may include but are not limited to footsteps, silence, horns, and machine noises.
  • the useful speech signal in the sample speech signal may be extracted based on the speech activity endpoint detection model.
  • voice activity detection (Voice Activity Detection, VAD) can be used for echo cancellation, noise suppression, speaker recognition, speech recognition, and the like.
  • extracting the useful speech signal in the sample speech signal based on the voice activity endpoint detection model may include: segmenting the sample speech signal to obtain at least one segmented speech signal corresponding to the sample speech signal; determining the short-term energy of each segmented speech signal; and splicing the segmented speech signals whose short-term energy is greater than a preset energy amplitude to obtain the useful speech signal.
  • the preset energy amplitude can be set according to the actual situation, and the specific value is not limited here.
  • the spectral energy and zero-crossing rate of the sample speech signal may also be used for the judgment; the specific process is not limited here.
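As an illustrative sketch only (not the application's reference implementation), the short-term-energy based extraction of useful speech described above might look as follows in Python; the frame length and energy threshold are assumed values chosen for the example.

```python
import numpy as np

def extract_useful_speech(samples: np.ndarray, frame_len: int = 400,
                          energy_threshold: float = 1e-3) -> np.ndarray:
    """Keep only segments whose short-term energy exceeds a preset amplitude.

    frame_len (in samples) and energy_threshold are illustrative values; the
    application leaves both to be chosen according to the actual situation.
    """
    useful_segments = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        segment = samples[start:start + frame_len]
        short_term_energy = float(np.sum(segment ** 2))  # short-term energy of this segment
        if short_term_energy > energy_threshold:         # drop silence, footsteps, machine noise, etc.
            useful_segments.append(segment)
    # splice the retained segments back into one useful speech signal
    return np.concatenate(useful_segments) if useful_segments else np.array([])
```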
  • Step S1012 perform feature extraction on the useful speech signal to obtain corresponding feature information, where the feature information includes emotional feature information and speaker feature information.
  • the emotional feature information may include, but is not limited to, energy, fundamental frequency, speech rate, frequency spectrum, formant frequency, etc.; speaker feature information may include voiceprint features.
  • pre-emphasis processing, framing and windowing can be performed on the useful speech signal to obtain window data corresponding to the useful speech signal; characteristic parameters of the window data are then calculated, the characteristic parameters including at least one of energy, fundamental frequency, speech rate, frequency spectrum and formant frequency, and the characteristic parameters are determined as the emotional feature information.
  • a windowing function, such as a rectangular window, a Hanning window, or a Hamming window, can be used to apply windowing to the framed signals.
  • the energy, fundamental frequency, speech rate, frequency spectrum, and formant frequency may be calculated according to respective calculation formulas corresponding to energy, fundamental frequency, speech rate, frequency spectrum, and formant frequency.
  • the specific calculation process is not limited here.
  • mel spectral data of the window data may be calculated, and the mel spectral data may be determined as speaker characteristic information.
  • the process of calculating the Mel spectrum data of the window data may include: performing fast Fourier transform processing and squaring on the window data to obtain the spectral line energy corresponding to the window data, and then filtering the spectral line energy through a Mel filter bank to obtain the Mel spectrum data corresponding to the window data.
  • the window data may include multiple pieces, so that Mel spectrum data corresponding to each window data can be obtained.
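A hedged sketch of this feature-extraction step, assuming librosa is available; the sampling rate, frame and hop sizes, pre-emphasis coefficient (0.97) and number of Mel bands are illustrative choices rather than values fixed by the application.

```python
import numpy as np
import librosa

def extract_features(useful_speech: np.ndarray, sr: int = 16000):
    """Return (emotional feature info, speaker feature info) for one useful speech signal."""
    # pre-emphasis to boost high-frequency components
    emphasized = np.append(useful_speech[0], useful_speech[1:] - 0.97 * useful_speech[:-1])

    # framing and Hamming windowing (25 ms frames, 10 ms hop at 16 kHz, assumed)
    frame_len, hop_len = 400, 160
    frames = librosa.util.frame(emphasized, frame_length=frame_len, hop_length=hop_len)
    windows = frames * np.hamming(frame_len)[:, None]

    # emotional feature information: e.g. per-window energy and fundamental frequency
    energy = np.sum(windows ** 2, axis=0)
    f0 = librosa.yin(emphasized, fmin=50, fmax=400, sr=sr,
                     frame_length=frame_len, hop_length=hop_len)
    n = min(energy.shape[0], f0.shape[0])
    emotion_features = np.stack([energy[:n], f0[:n]], axis=1)

    # speaker feature information: Mel spectrum data of the windowed signal
    mel = librosa.feature.melspectrogram(y=emphasized, sr=sr, n_fft=frame_len,
                                         hop_length=hop_len, n_mels=40)
    speaker_features = mel.T        # one Mel spectrum vector per window
    return emotion_features, speaker_features
```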
  • Step S1013 Label the feature information according to the identity information and emotion information of the sample user, and obtain the labeled speaker category label and the labeled emotion category label.
  • for example, if the identity information of sample user 1 is A and the emotional information is positive, the feature information of sample user 1 can be labeled: the emotional feature information of sample user 1 is labeled "positive" and the speaker feature information is labeled "A", so as to obtain the labeled speaker category label and the labeled emotion category label of sample user 1.
  • similarly, if the identity information of sample user 2 is B and the emotional information is negative, the feature information of sample user 2 can be labeled: the emotional feature information of sample user 2 is labeled "negative" and the speaker feature information is labeled "B", so as to obtain the labeled speaker category label and the labeled emotion category label of sample user 2.
  • Step S1014 Determine the emotional feature information, speaker feature information, the marked emotional category label, and the marked speaker category label as the training data.
  • in this way, the emotional feature information, the speaker feature information and their labels are used as training data.
  • the training data includes data sets corresponding to multiple sample users.
  • the training data may include a data set of sample user 1, the data set including emotion feature information, speaker feature information, annotated emotion category label "positive", and annotated speaker category label "A”.
  • the training data may also include a data set of sample user 2, including emotion feature information, speaker feature information, annotated emotion category label "Negative”, and annotated speaker category label "B”.
  • Step S102 calling the emotion recognition model to be trained, where the emotion recognition model includes a feature generator, an emotion classification model, and a speaker classification model.
  • the emotion recognition model may include a Generative Adversarial Network (GAN).
  • the generative adversarial network mainly includes a feature generator and a feature discriminator; the feature generator is used to generate text, image, video and other data from the input data.
  • the feature discriminator is equivalent to a classifier, which is used to judge the authenticity of the input data.
  • FIG. 3 is a schematic structural diagram of an emotion recognition model provided by an embodiment of the present application.
  • the emotion recognition model includes a feature generator, an emotion classification model, and a speaker classification model.
  • the emotion classification model and the speaker classification model are both feature discriminators.
  • the feature generator may use an MLP (Multi-Layer Perceptron) network or a deep neural network to represent the generating function.
  • the emotion classification model and the speaker classification model may include, but are not limited to, convolutional neural networks, restricted Boltzmann machines, or recurrent neural networks, among others.
  • the feature vector required for training can be generated by the feature generator, and then the speaker classification model and the emotion classification model can be trained to convergence according to the feature vector.
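To make the structure of FIG. 3 concrete, a minimal PyTorch sketch is given below; the layer widths, the class counts, the FEATURE_DIM constant, and the use of simple MLP heads for both feature discriminators are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """MLP feature generator: maps feature information to a feature vector."""
    def __init__(self, in_dim: int, hidden: int = 128, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Classifier(nn.Module):
    """Feature discriminator, used as both the emotion and the speaker classification model."""
    def __init__(self, in_dim: int = 64, hidden: int = 64, n_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)   # fully connected layer (its output is reused in step S104)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor, return_fc: bool = False):
        h = torch.relu(self.fc(x))
        logits = self.out(h)
        return (logits, h) if return_fc else logits

FEATURE_DIM = 40                                   # assumed dimension of the extracted feature information
generator = FeatureGenerator(in_dim=FEATURE_DIM)   # shared feature generator
emotion_classifier = Classifier(n_classes=2)       # positive / negative emotions
speaker_classifier = Classifier(n_classes=10)      # assumed number of sample speakers
```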
  • Step S103 Input the emotional feature information and the speaker feature information into the feature generator for feature generation, and obtain a corresponding emotional feature vector group and speaker feature vector group.
  • FIG. 4 is a schematic diagram of generating a feature by a feature generator according to an embodiment of the present application.
  • the emotion feature information and the speaker feature information are input into the feature generator, and the feature generator generates an emotion feature vector group according to the emotion feature information, and generates a speaker feature vector group according to the speaker feature information.
  • the emotion feature vector group includes at least one emotion feature vector; the speaker feature vector group includes at least one speaker feature vector.
  • the feature generator may generate a corresponding feature vector according to the feature information by using a generating function.
  • the corresponding feature vector can be generated according to the feature information through a deep neural network.
  • the specific feature generation process is not limited here.
  • the corresponding emotional feature vector group and speaker feature vector group can be obtained, and the speaker feature vector group can be input into the speaker classification model for training subsequently.
  • Step S104 Input the speaker feature vector group and the marked speaker category label into the speaker classification model for iterative training until convergence, and obtain the predicted feature vector corresponding to the trained speaker classification model.
  • FIG. 5 is a schematic flowchart of sub-steps of training a speaker classification model provided by an embodiment of the present application, which may specifically include the following steps S1041 to S1044.
  • Step S1041 Determine the training sample data for each round of training according to one of the speaker feature vectors in the speaker feature vector group and the speaker category label corresponding to that speaker feature vector.
  • one of the speaker feature vectors and the speaker category label corresponding to the speaker feature vector may be sequentially selected from the speaker feature vector group, and determined as the training sample data for each round of training.
  • Step S1042 Input the current round of training sample data into the speaker classification model for speaker classification training, and obtain the speaker classification prediction result corresponding to the current round of training sample data.
  • the speaker classification prediction result may include the prediction probability corresponding to the speaker prediction category and the speaker prediction category.
  • Step S1043 Determine a loss function value corresponding to the current round according to the speaker category label corresponding to the current round of training sample data and the speaker classification prediction result.
  • the loss function value corresponding to the current round may be determined based on the preset loss function, according to the speaker category label corresponding to the current round of training sample data and the speaker classification prediction result.
  • a loss function such as a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a cross-entropy loss function, a squared loss function, or an exponential loss function can be used to calculate the loss function value.
  • Step S1044 If the loss function value is greater than the preset loss value threshold, adjust the parameters of the speaker classification model and perform the next round of training, until the obtained loss function value is less than or equal to the loss value threshold; then end the training and obtain the trained speaker classification model.
  • the preset loss value threshold may be set according to the actual situation, and the specific value is not limited herein.
  • a convergence algorithm such as a gradient descent algorithm, a Newton algorithm, a conjugate gradient method, or a Cauchy-Newton method can be used to adjust the parameters of the speaker classification model.
  • the speaker classification model can be rapidly converged, thereby improving the training efficiency and accuracy of the speaker classification model.
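A hedged sketch of steps S1041 to S1044, reusing the generator and speaker_classifier names assumed in the architecture sketch above and assuming speaker_training_rounds yields one (speaker feature information, speaker category label) tensor pair per round; the cross-entropy loss, the Adam optimiser and the loss threshold are illustrative choices among the loss functions and convergence algorithms the application allows.

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()                               # one permitted choice of loss function
optimizer = torch.optim.Adam(speaker_classifier.parameters(), lr=1e-3)
loss_threshold = 0.05                                         # preset loss value threshold (assumed)

for speaker_info, speaker_label in speaker_training_rounds:   # one sample pair per training round
    speaker_vec = generator(speaker_info).detach()            # speaker feature vector from the generator
    logits = speaker_classifier(speaker_vec)                  # speaker classification prediction result
    loss = loss_fn(logits, speaker_label)                     # loss function value of the current round
    if loss.item() <= loss_threshold:                         # converged: end the training
        break
    optimizer.zero_grad()
    loss.backward()                                           # back-propagate the error
    optimizer.step()                                          # adjust the speaker classification model
```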
  • by inputting the speaker feature vector group and the speaker category labels into the speaker classification model and iteratively training it to convergence, the speaker classification model learns the speaker features; the learned speaker features can then be back-propagated to the feature generator to generate an emotion feature vector that eliminates the speaker features.
  • obtaining the predicted feature vector corresponding to the trained speaker classification model may include: inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtaining the feature vector output by the fully connected layer of the speaker classification model; and determining the mean value of all the obtained feature vectors as the predicted feature vector.
  • the speaker classification model includes at least a fully connected layer.
  • the speaker classification model may be a convolutional neural network model, including a convolutional layer, a pooling layer, a fully connected layer, a normalization layer, and the like.
  • the feature vector output by the fully connected layer of the speaker classification model can be obtained.
  • one feature vector is output corresponding to the training sample data of each round of training, so multiple feature vectors can be obtained.
  • the mean value of all the obtained feature vectors may be determined as the predicted feature vector; the predicted feature vector can be understood as the speaker features learned by the trained speaker classification model.
  • in this way, the predicted feature vector representing the speaker features learned by the speaker classification model can be obtained.
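A brief sketch of how the predicted feature vector could then be collected from the fully connected layer of the trained speaker classification model, using the return_fc hook assumed in the architecture sketch above.

```python
import torch

fc_outputs = []
with torch.no_grad():
    for speaker_info, _ in speaker_training_rounds:                  # training sample data of each round
        speaker_vec = generator(speaker_info)
        _, fc_out = speaker_classifier(speaker_vec, return_fc=True)  # fully connected layer output
        fc_outputs.append(fc_out)

# the mean of all fully-connected-layer feature vectors is the predicted feature vector
predicted_feature_vector = torch.cat(fc_outputs, dim=0).mean(dim=0)
```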
  • Step S105 back-propagating the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are eliminated.
  • FIG. 6 is a schematic interaction diagram of obtaining an emotion feature vector group with speaker features removed according to an embodiment of the present application.
  • the predicted feature vector is back-propagated to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are eliminated; the emotion feature vector group from which speaker features are eliminated is then sent to the emotion classification model for training.
  • FIG. 7 is a schematic flowchart of the sub-steps of step S105 .
  • the specific step S105 may include the following steps S1051 and S1052 .
  • Step S1051 Adjust the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group, wherein each speaker feature vector in the adjusted speaker feature vector group is the same.
  • the speaker feature vector may be represented by a first distribution function, and the speaker feature vector group includes at least one first distribution function. It can be understood that, since the speaker feature vector includes speaker feature information of multiple sample users, the speaker feature vector group corresponds to a plurality of different first distribution functions.
  • the first distribution function may be a normal distribution function, which may be expressed as f(x) = (1/(σ√(2π))) exp(-(x-μ)²/(2σ²)), where μ represents the mean and σ² represents the variance.
  • adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group may include: determining a second distribution function corresponding to the predicted feature vector, and obtaining the mean and variance of the second distribution function; and updating the mean and variance in each first distribution function according to the obtained mean and variance to obtain updated first distribution functions.
  • the second distribution function corresponding to the predicted feature vector can be expressed as F(x) = (1/(σ√(2π))) exp(-(x-μ)²/(2σ²)); according to the mean μ and the variance σ² of the second distribution function F(x), each first distribution function can be updated, and each updated first distribution function is denoted f'(x).
  • each updated first distribution function f'(x) has the same mean and the same variance, that is, each speaker feature vector in the adjusted speaker feature vector group is the same.
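A minimal sketch of step S1051, under the assumption that each speaker feature vector in the generator is parameterised by the mean and variance of its normal distribution and reusing the predicted_feature_vector computed in the earlier sketch; the concrete numbers and the dictionary representation are illustrative only.

```python
# each first distribution function f_i(x) is described by its own (mean, variance)
speaker_distributions = [
    {"mean": 0.3,  "variance": 1.2},   # speaker A (assumed values)
    {"mean": -0.7, "variance": 0.8},   # speaker B (assumed values)
]

# second distribution function F(x): statistics of the predicted feature vector
predicted_mean = float(predicted_feature_vector.mean())
predicted_variance = float(predicted_feature_vector.var())

# update every first distribution function with F(x)'s mean and variance, so that
# every speaker feature vector in the adjusted group becomes identical
adjusted_distributions = [{"mean": predicted_mean, "variance": predicted_variance}
                          for _ in speaker_distributions]
```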
  • Step S1052 based on the adjusted speaker feature vector group, generate the emotion feature vector group from which the speaker feature is eliminated through the generating function.
  • a generating function in the feature generator can generate an emotion feature vector group that eliminates the speaker feature.
  • the output of the generating function is the emotion feature vector group and the speaker feature vector group, wherein each speaker feature vector in the speaker feature vector group is the same, so the speaker features no longer influence the emotion feature vectors; that is, the obtained emotion feature vector group is the emotion feature vector group from which speaker features are eliminated.
  • by generating the emotion feature vector group from which speaker features are eliminated, the influence of different speaker features on the emotion classification model can be eliminated when the emotion classification model is trained on that vector group.
  • Step S106 Input the emotion feature vector group from which the speaker feature is eliminated and the labeled emotion category label into the emotion classification model for iterative training, until the emotion classification model converges, and a trained emotion recognition model is obtained.
  • in the prior art, the emotion feature information and the speaker feature information are generally input into the feature generator, and the feature generator generates the emotion feature vector group and the speaker feature vector group; the emotion feature vector group and the speaker feature vector group are then input into the feature discriminator for emotion classification training to obtain a trained feature discriminator. Therefore, the emotion recognition models of the prior art cannot eliminate the influence of different speakers on emotion recognition.
  • the emotion feature vector group that eliminates the speaker feature and the labeled emotion category label are input into the emotion classification model for iterative training until the emotion classification model converges.
  • the training process may include: determining the training sample data for each round of training according to the emotion feature vector group from which speaker features are eliminated and the emotion category labels; inputting the current round of training sample data into the emotion classification model for emotion classification training, and obtaining the emotion classification prediction result corresponding to the current round of training sample data; determining the loss function value according to the emotion category label corresponding to the current round of training sample data and the emotion classification prediction result; and if the loss function value is greater than the preset loss value threshold, adjusting the parameters of the emotion classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, ending the training and obtaining the trained emotion classification model.
  • the preset loss value threshold may be set according to the actual situation, and the specific value is not limited herein.
  • a loss function such as a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a cross-entropy loss function, a squared loss function, or an exponential loss function can be used to calculate the loss function value.
  • convergence algorithms such as the gradient descent algorithm, the Newton algorithm, the conjugate gradient method, or the Cauchy-Newton method can be used to adjust the parameters of the emotion classification model.
  • since the emotion recognition model includes the emotion classification model and the speaker classification model, when the emotion classification model converges, it means that the emotion recognition model also converges, and a trained emotion recognition model is obtained.
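The second training stage mirrors the speaker-classifier loop; a hedged sketch is given below, reusing the emotion_classifier, loss_fn and loss_threshold names from the earlier sketches and assuming emotion_training_rounds yields (speaker-feature-eliminated emotion feature vector, emotion category label) tensor pairs produced by the adjusted generator.

```python
import torch

optimizer = torch.optim.Adam(emotion_classifier.parameters(), lr=1e-3)

for emotion_vec, emotion_label in emotion_training_rounds:   # speaker features already eliminated
    logits = emotion_classifier(emotion_vec)                 # emotion classification prediction result
    loss = loss_fn(logits, emotion_label)                    # same loss function as before (assumed)
    if loss.item() <= loss_threshold:                        # emotion classification model has converged
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```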
  • the trained emotion recognition model is not affected by speaker characteristics.
  • the above-mentioned trained emotion recognition model may also be stored in a node of a blockchain.
  • the trained emotion recognition model needs to be used, it can be obtained from the nodes of the blockchain.
  • an emotion recognition model that is not affected by the speaker features can be obtained, thereby improving the accuracy of emotion recognition.
  • Step S107 obtain the speech signal to be recognized, and input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
  • the voice data to be recognized may be a voice signal collected in advance and stored in a database, or may be generated according to a voice signal collected in real time.
  • the voice signal input by the user at the robot terminal may be collected by a voice acquisition device, noise reduction processing is then performed on the voice signal, and the voice signal after noise reduction processing is determined as the voice signal to be recognized.
  • the voice acquisition device may include electronic devices that collect voice, such as a voice recorder and a microphone.
  • the noise reduction processing of the speech signal can be realized according to a spectral subtraction algorithm, a Wiener filtering algorithm, a minimum mean square error algorithm, or a wavelet transform algorithm.
  • the accuracy of subsequent recognition of the emotion category corresponding to the speech signal can be improved.
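For illustration only, a basic spectral-subtraction denoiser is sketched below (the application equally allows Wiener filtering, minimum mean square error or wavelet-transform methods); treating the first few frames as noise-only is an assumption of this example.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy: np.ndarray, sr: int = 16000,
                         noise_frames: int = 10) -> np.ndarray:
    """Subtract an estimated noise magnitude spectrum from the noisy signal."""
    _, _, spec = stft(noisy, fs=sr, nperseg=512)
    magnitude, phase = np.abs(spec), np.angle(spec)
    # estimate the noise spectrum from the first `noise_frames` frames (assumed noise-only)
    noise_estimate = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    cleaned_mag = np.maximum(magnitude - noise_estimate, 0.0)     # floor negative values at zero
    _, cleaned = istft(cleaned_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return cleaned
```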
  • before inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the method may further include: extracting the useful speech signal in the speech signal, and performing feature extraction on the useful speech signal to obtain the emotional feature information and speaker feature information corresponding to the speech signal.
  • the useful voice signal in the voice signal can be extracted based on the voice activity endpoint detection model.
  • for the specific process of extracting the useful speech signal, reference may be made to the detailed description of the foregoing embodiment, which will not be repeated here.
  • performing feature extraction on the useful voice signal to obtain the emotional feature information and speaker feature information corresponding to the voice signal may include: performing pre-emphasis processing, framing and windowing on the useful voice signal to obtain window data corresponding to the useful voice signal; calculating characteristic parameters of the window data, the characteristic parameters including at least one of energy, fundamental frequency, speech rate, frequency spectrum and formant frequency, and determining the characteristic parameters as the emotional feature information; and calculating the Mel spectrum data of the window data and determining the Mel spectrum data as the speaker feature information.
  • inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal may include: inputting the emotion feature information and speaker feature information into the emotion recognition model for emotion recognition, and obtaining an emotion recognition result corresponding to the speech signal. Emotion recognition results.
  • the emotion recognition model is a pre-trained model, which can be stored in the blockchain or in a local database.
  • FIG. 8 is a schematic interaction diagram of invoking an emotion recognition model for emotion recognition provided by an embodiment of the present application.
  • the trained emotion recognition model can be called from the blockchain, and the emotion feature information and speaker feature information can be input into the emotion recognition model for emotion recognition, and the emotion recognition result corresponding to the speech signal can be obtained.
  • the emotion recognition result may include the prediction probability corresponding to the emotion prediction category and the emotion prediction category.
  • the sentiment prediction category can be positive or negative.
  • for example, the emotion recognition result may be "positive, 90%".
  • by inputting the emotion feature information and speaker feature information into the pre-trained emotion recognition model for prediction, the influence of different speaker features on emotion recognition can be eliminated, and the accuracy of emotion recognition can be improved.
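Putting the inference path together, a hedged end-to-end sketch that reuses the helper functions and model objects assumed in the earlier sketches; the label order and the assumption that the pooled emotion feature vector matches the generator's input dimension are illustrative.

```python
import torch

def recognize_emotion(raw_speech, sr: int = 16000):
    """Return (predicted emotion category, prediction probability) for one speech signal."""
    denoised = spectral_subtraction(raw_speech, sr=sr)              # noise reduction
    useful = extract_useful_speech(denoised)                        # keep the useful speech signal
    emotion_info, speaker_info = extract_features(useful, sr=sr)    # feature extraction

    # pool frame-level features to one utterance-level vector; it is assumed here that
    # its dimension matches the in_dim the feature generator was built with
    x = torch.tensor(emotion_info, dtype=torch.float32).mean(dim=0, keepdim=True)
    with torch.no_grad():
        logits = emotion_classifier(generator(x))                   # trained emotion recognition model
        probs = torch.softmax(logits, dim=-1)

    categories = ["negative", "positive"]                           # assumed label order
    idx = int(probs.argmax())
    return categories[idx], float(probs[0, idx])                    # e.g. ("positive", 0.90)
```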
  • the emotion recognition method provided by the above embodiment can improve the recognition accuracy of subsequent speaker categories and emotion categories by extracting the useful speech signals in the sample speech signals; pre-emphasis, framing and windowing of the useful speech signals can boost high-frequency components and reduce leakage in the frequency domain, thereby improving subsequent feature extraction; by calling the emotion recognition model to be trained, the feature vectors required for training can be generated by the feature generator, and the speaker classification model and the emotion classification model can then be trained to convergence according to these feature vectors; by inputting the emotion feature information and speaker feature information into the feature generator for feature generation, the corresponding emotion feature vector group and speaker feature vector group can be obtained, and the speaker feature vector group can subsequently be input into the speaker classification model for training; by updating the parameters of the speaker classification model according to the preset loss function and convergence algorithm, the speaker classification model can converge quickly, which improves its training efficiency and accuracy; by inputting the speaker feature vector group and the speaker category labels into the speaker classification model and iteratively training it to convergence, the speaker classification model learns the speaker features, and the learned speaker features can later be back-propagated to the feature generator; by obtaining the feature vectors output by the fully connected layer of the trained speaker classification model, the predicted feature vector representing the learned speaker features can be obtained; based on the adjusted speaker feature vector group, the emotion feature vector group from which speaker features are eliminated is generated, so that training the emotion classification model on this vector group eliminates the influence of different speaker features on the emotion classification model; by inputting the emotion feature vector group from which speaker features are eliminated and the labeled emotion category labels into the emotion classification model for iterative training, an emotion recognition model that is not affected by speaker features can be obtained; and by inputting the emotion feature information and speaker feature information into the pre-trained emotion recognition model for prediction, the influence of different speaker features on emotion recognition is eliminated and the accuracy of emotion recognition is improved.
  • FIG. 9 is a schematic block diagram of an emotion recognition apparatus 1000 further provided by an embodiment of the present application, and the emotion recognition apparatus is used for executing the foregoing emotion recognition method.
  • the emotion recognition device may be configured in a server or a terminal.
  • the emotion recognition device 1000 includes: a training data acquisition module 1001 , a model calling module 1002 , a first feature generation module 1003 , a first training module 1004 , a second feature generation module 1005 , and a second training module 1006 and emotion recognition module 1007.
  • the training data acquisition module 1001 is configured to acquire training data, where the training data includes emotion feature information and labeled emotion category labels, and speaker feature information and labeled speaker category labels.
  • the model invoking module 1002 is used for invoking an emotion recognition model to be trained, where the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model.
  • the first feature generation module 1003 is configured to input the emotion feature information and the speaker feature information into the feature generator for feature generation, and obtain a corresponding emotion feature vector group and speaker feature vector group.
  • the first training module 1004 is used to input the speaker feature vector group and the labeled speaker category label into the speaker classification model for iterative training to convergence, and to obtain the predicted feature vector corresponding to the trained speaker classification model.
  • the second feature generation module 1005 is configured to back-propagate the predicted feature vector to the feature generator for feature generation, and obtain an emotion feature vector group from which speaker features are eliminated.
  • the second training module 1006 is configured to input the emotion feature vector group from which speaker features are eliminated and the labeled emotion category label into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model.
  • the emotion recognition module 1007 is configured to acquire the speech signal to be recognized, and input the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  • the above-mentioned apparatus can be implemented in the form of a computer program that can be executed on a computer device as shown in FIG. 10 .
  • FIG. 10 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
  • the computer device can be a server or a terminal.
  • the computer device includes a processor and a memory connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the processor is used to provide computing and control capabilities to support the operation of the entire computer equipment.
  • the internal memory provides an environment for running the computer program in the non-volatile storage medium; the computer program, when executed by the processor, causes the processor to execute any of the emotion recognition methods.
  • the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated circuits) Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • the processor is configured to run a computer program stored in the memory to implement the following steps:
  • acquire training data, where the training data includes emotional feature information and labeled emotion category labels, as well as speaker feature information and labeled speaker category labels; call the emotion recognition model to be trained, the emotion recognition model including a feature generator, an emotion classification model and a speaker classification model; input the emotional feature information and the speaker feature information into the feature generator for feature generation to obtain the corresponding emotion feature vector group and speaker feature vector group; input the speaker feature vector group and the labeled speaker category labels into the speaker classification model for iterative training to convergence, and obtain the predicted feature vector corresponding to the trained speaker classification model; back-propagate the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are eliminated; input the emotion feature vector group from which speaker features are eliminated and the labeled emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model; and acquire the speech signal to be recognized and input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
  • the speaker feature vector group includes at least one speaker feature vector; when the processor implements the inputting of the speaker feature vector group and the labeled speaker category label into the speaker classification model for iterative training to convergence, the processor is configured to:
  • the speaker classification model includes at least a fully-connected layer; when the processor obtains the prediction feature vector corresponding to the trained speaker classification model, the processor is configured to:
  • Input the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtain the feature vector output by the fully connected layer of the speaker classification model;
  • determine the mean value of all the obtained feature vectors as the predicted feature vector.
  • the feature generator includes a generating function; when the processor implements back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are eliminated, the processor is configured to:
  • adjust the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group, wherein each speaker feature vector in the adjusted speaker feature vector group is the same; and, based on the adjusted speaker feature vector group, generate, through the generating function, the emotion feature vector group from which speaker features are eliminated.
  • the speaker feature vector group includes at least one first distribution function; when the processor implements adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group, the processor is configured to:
  • before implementing the inputting of the voice signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the voice signal, the processor is further configured to:
  • when the processor implements the inputting of the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the processor is configured to:
  • the emotion feature information and the speaker feature information are input into the emotion recognition model for emotion recognition, and the emotion recognition result corresponding to the speech signal is obtained.
  • when the processor acquires the training data, the processor is configured to:
  • obtain sample voice signals corresponding to a preset number of sample users, and extract the useful voice signals in the sample voice signals, wherein the sample voice signals are stored in the blockchain; perform feature extraction on the useful voice signals to obtain corresponding feature information, the feature information including emotional feature information and speaker feature information; label the feature information according to the identity information and emotional information of the sample users to obtain the labeled speaker category labels and the labeled emotion category labels; and determine the emotional feature information, the speaker feature information, the labeled emotion category labels, and the labeled speaker category labels as the training data.
  • the embodiments of the present application further provide a computer-readable storage medium
  • the computer-readable storage medium may be non-volatile or volatile
  • the computer-readable storage medium stores a computer program
  • the computer program includes program instructions
  • the processor executes the program instructions to implement any emotion recognition method provided by the embodiments of the present application.
  • the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk equipped on the computer device, a smart media card (Smart Media Card, SMC), a secure digital card (Secure Digital Card, SD Card), a flash card (Flash Card), etc.
  • the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function, and the like, and the storage data area may store data created according to the use of the blockchain node, and the like.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Abstract

The present application relates to the field of artificial intelligence, implements elimination of the effect of different speakers on emotion recognition, thereby improving the accuracy of emotion recognition, and relates to an emotion recognition method and apparatus, a device, and a medium. The method comprises: calling an emotion recognition model to be trained, inputting emotion feature information and speaker feature information into a feature generator for feature generation, to obtain an emotion feature vector group and a speaker feature vector group; inputting the speaker feature vector group and a speaker category label into a speaker classification model for training, and acquiring a prediction feature vector corresponding to the speaker classification model; backpropagating the prediction feature vector to the feature generator for feature generation, and inputting the emotion feature vector group of which speaker features are eliminated and an emotion category label into an emotion classification model for training; and acquiring a speech signal to be recognized, and inputting said speech signal into a trained emotion classification model to obtain an emotion recognition result. In addition, the present application further relates to blockchain technology, and the emotion recognition model can be stored in a blockchain.

Description

情绪识别方法、装置、计算机设备和存储介质Emotion recognition method, device, computer equipment and storage medium
本申请要求于2021年02月26日提交中国专利局、申请号为202110218668.3,发明名称为“情绪识别方法、装置、计算机设备和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202110218668.3 and the invention titled "Emotion Recognition Method, Apparatus, Computer Equipment and Storage Medium" filed with the China Patent Office on February 26, 2021, the entire contents of which are incorporated by reference in this application.
技术领域technical field
本申请涉及人工智能领域,尤其涉及一种情绪识别方法、装置、计算机设备和存储介质。The present application relates to the field of artificial intelligence, and in particular, to an emotion recognition method, device, computer equipment and storage medium.
背景技术Background technique
随着人工智能的快速发展,人机交互技术受到人们的高度重视。在人机交互过程中,需要对不同用户、不同任务、不同场景给予不同的情感反馈和支持,并对人的情感做出友好、灵敏以及智能的反应。因此需要训练计算机进行情绪识别,以使计算机学习人类的理解、察觉和反馈情感特征的能力。With the rapid development of artificial intelligence, human-computer interaction technology has been highly valued by people. In the process of human-computer interaction, it is necessary to give different emotional feedback and support to different users, different tasks, and different scenarios, and to make friendly, sensitive and intelligent responses to human emotions. Therefore, it is necessary to train computers for emotion recognition, so that computers can learn the ability of human beings to understand, perceive and feedback emotional characteristics.
技术问题technical problem
综上,发明人意识到,现有的情绪识别模型,一般通过对语音信号进行分析与识别,进而预测情绪类别。但是在实际场景中,人类表达的情感状态常常受到文化、国家、人群等多种因素,现有的情绪识别模型并不能有效地规避这些因素的影响,从而情绪识别的准确度较低。To sum up, the inventors realized that the existing emotion recognition models generally predict emotion categories by analyzing and recognizing speech signals. However, in actual scenarios, the emotional state expressed by humans is often affected by various factors such as culture, country, and population. The existing emotion recognition models cannot effectively avoid the influence of these factors, so the accuracy of emotion recognition is low.
因此如何提高情绪识别模型的准确性成为亟需解决的问题。Therefore, how to improve the accuracy of emotion recognition model has become an urgent problem to be solved.
Technical Solutions
The present application provides an emotion recognition method and apparatus, a computer device, and a storage medium. The predicted feature vector output by the speaker classification model is back-propagated to the feature generator to generate emotion feature vectors from which speaker features are eliminated, and the emotion classification model is trained on these speaker-eliminated emotion feature vectors, which eliminates the influence of different speakers on the emotion classification model and improves the accuracy of emotion recognition.
In a first aspect, the present application provides an emotion recognition method, the method comprising:
acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model, and a speaker classification model;
inputting the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group;
inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and acquiring a predicted feature vector corresponding to the trained speaker classification model;
back-propagating the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated;
inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
In a second aspect, the present application further provides an emotion recognition apparatus, the apparatus comprising:
a training data acquisition module, configured to acquire training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
a model calling module, configured to call an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model, and a speaker classification model;
a first feature generation module, configured to input the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group;
a first training module, configured to input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and to acquire a predicted feature vector corresponding to the trained speaker classification model;
a second feature generation module, configured to back-propagate the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated;
a second training module, configured to input the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
an emotion recognition module, configured to acquire a speech signal to be recognized, and to input the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
In a third aspect, the present application further provides a computer device, the computer device comprising a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to execute the computer program and, when executing the computer program, to implement:
acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model, and a speaker classification model;
inputting the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group;
inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and acquiring a predicted feature vector corresponding to the trained speaker classification model;
back-propagating the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated;
inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement:
acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model, and a speaker classification model;
inputting the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group;
inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and acquiring a predicted feature vector corresponding to the trained speaker classification model;
back-propagating the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated;
inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects: by acquiring training data, emotion feature information with annotated emotion category labels and speaker feature information with annotated speaker category labels can be obtained; by calling the emotion recognition model to be trained, the emotion classification model and the speaker classification model in the emotion recognition model can be trained separately to obtain a trained emotion recognition model; by inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, the corresponding emotion feature vector group and speaker feature vector group can be obtained; by inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, the predicted feature vector can be acquired from the trained speaker classification model; by back-propagating the predicted feature vector to the feature generator for feature generation, the speaker feature vectors can be unified, so that an emotion feature vector group from which speaker features are eliminated is obtained; by inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training, an emotion recognition model that eliminates the influence of different speakers can be obtained; and by inputting the speech signal to be recognized into the trained emotion recognition model for emotion recognition, the accuracy of emotion recognition is improved.
Description of Drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of an emotion recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of sub-steps of acquiring training data provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an emotion recognition model provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of generating features by a feature generator provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of sub-steps of training a speaker classification model provided by an embodiment of the present application;
FIG. 6 is a schematic interaction diagram of obtaining an emotion feature vector group from which speaker features are eliminated, provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of sub-steps of obtaining an emotion feature vector group from which speaker features are eliminated, provided by an embodiment of the present application;
FIG. 8 is a schematic interaction diagram of calling an emotion recognition model for emotion recognition, provided by an embodiment of the present application;
FIG. 9 is a schematic block diagram of an emotion recognition apparatus provided by an embodiment of the present application;
FIG. 10 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
Embodiments of the Present Invention
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The flowcharts shown in the drawings are merely illustrative; they do not necessarily include all contents and operations/steps, nor do the operations/steps have to be performed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
It should be understood that the terms used in the specification of the present application are only for the purpose of describing particular embodiments and are not intended to limit the present application. As used in the specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" used in the specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Embodiments of the present application provide an emotion recognition method and apparatus, a computer device, and a storage medium. The emotion recognition method can be applied to a server or a terminal. By back-propagating the predicted feature vector output by the speaker classification model to the feature generator to generate emotion feature vectors from which speaker features are eliminated, and training the emotion classification model on the speaker-eliminated emotion feature vectors, the influence of different speakers on the emotion recognition model can be eliminated and the accuracy of emotion recognition is improved.
The server may be an independent server or a server cluster. The terminal may be an electronic device such as a smartphone, a tablet computer, a notebook computer, or a desktop computer.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The embodiments described below and the features in those embodiments may be combined with each other without conflict.
As shown in FIG. 1, the emotion recognition method includes steps S101 to S107.
Step S101: acquire training data, the training data including emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels.
In the embodiments of the present application, by acquiring training data, emotion feature information with annotated emotion category labels and speaker feature information with annotated speaker category labels can be obtained. The speaker classification model can then be trained on the speaker feature information and the annotated speaker category labels, and the predicted feature vector corresponding to the trained speaker classification model can be acquired. Emotion feature vectors from which speaker features are eliminated can then be generated according to the predicted feature vector and input into the emotion classification model for training, which eliminates the influence of different speakers on the emotion classification model and improves the accuracy of emotion recognition.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of the sub-steps of acquiring training data in step S101, which may specifically include the following steps S1011 to S1014.
Step S1011: acquire sample speech signals corresponding to a preset number of sample users, and extract the useful speech signals from the sample speech signals, where the sample speech signals are stored in a blockchain.
Exemplarily, the sample speech signals corresponding to the preset number of sample users may be obtained from the blockchain.
The sample users include different speakers. For example, the speech of test subjects from different regions, cultures, or age groups under different emotions may be collected, so that the obtained sample speech signals include speech signals of different emotion categories corresponding to multiple speakers.
Exemplarily, the emotion categories may include positive emotions and negative emotions. For example, positive emotions may include, but are not limited to, calmness, optimism, happiness, and the like; negative emotions may include, but are not limited to, complaining, blaming, verbal abuse, filing complaints, and the like.
It should be emphasized that, to further ensure the privacy and security of the sample speech signals, the sample speech signals may also be stored in a node of a blockchain. Of course, the sample speech signals may also be stored in a local database or an external storage device, which is not specifically limited here.
It should be noted that, since a sample speech signal may contain useless signals, the useful speech signal needs to be extracted from the sample speech signal in order to improve the accuracy of subsequent speaker category and emotion category recognition. The useless signals may include, but are not limited to, footsteps, silence, horn sounds, machine noise, and the like.
In the embodiments of the present application, the useful speech signal in the sample speech signal may be extracted based on a voice activity detection model. It should be noted that, in speech signal processing, voice activity detection (VAD) is used to detect whether speech is present, so as to separate the speech segments from the non-speech segments of a signal. VAD can be used for echo cancellation, noise suppression, speaker recognition, speech recognition, and the like.
In some embodiments, extracting the useful speech signal from the sample speech signal based on the voice activity detection model may include: segmenting the sample speech signal to obtain at least one segmented speech signal corresponding to the sample speech signal; determining the short-time energy of each segmented speech signal; and splicing the segmented speech signals whose short-time energy is greater than a preset energy amplitude to obtain the useful speech signal.
The preset energy amplitude may be set according to the actual situation, and the specific value is not limited here.
Exemplarily, when extracting the useful speech signal from the sample speech signal based on the voice activity detection model, features such as the spectral energy and the zero-crossing rate of the sample speech signal may also be used for the decision in addition to the short-time energy; the specific process is not limited here.
By extracting the useful speech signal from the sample speech signal, the accuracy of subsequent speaker category and emotion category recognition can be improved.
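For illustration only, the following is a minimal sketch of the short-time-energy-based extraction described above, assuming a mono speech signal held in a NumPy array; the frame length and energy threshold are assumed example values, not parameters specified by the present application.

```python
import numpy as np

def extract_useful_speech(signal: np.ndarray,
                          frame_len: int = 400,        # 25 ms at 16 kHz (assumed)
                          energy_threshold: float = 1e-3) -> np.ndarray:
    """Segment the signal, keep segments whose short-time energy exceeds
    the preset energy amplitude, and splice the kept segments together."""
    kept = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        segment = signal[start:start + frame_len].astype(np.float64)
        short_time_energy = float(np.sum(segment ** 2)) / frame_len
        if short_time_energy > energy_threshold:
            kept.append(segment)
    return np.concatenate(kept) if kept else np.array([])
```

Spectral energy or zero-crossing rate could be added to the per-segment decision in the same way.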
Step S1012: perform feature extraction on the useful speech signal to obtain corresponding feature information, the feature information including emotion feature information and speaker feature information.
It should be noted that, in the embodiments of the present application, the emotion feature information may include, but is not limited to, energy, fundamental frequency, speech rate, spectrum, formant frequency, and the like; the speaker feature information may include voiceprint features.
In some embodiments, pre-emphasis, framing, and windowing may be performed on the useful speech signal to obtain window data corresponding to the useful speech signal; characteristic parameters of the window data are then calculated, the characteristic parameters including at least one of energy, fundamental frequency, speech rate, spectrum, and formant frequency, and the characteristic parameters are determined as the emotion feature information.
Exemplarily, the windowing of each framed signal may be implemented with a window function such as a rectangular window, a Hanning window, or a Hamming window.
It can be understood that performing pre-emphasis, framing, and windowing on the useful speech signal boosts the high-frequency components and reduces leakage in the frequency domain, thereby improving subsequent feature extraction.
Exemplarily, the energy, fundamental frequency, speech rate, spectrum, and formant frequency may be calculated according to their respective calculation formulas; the specific calculation process is not limited here.
In some embodiments, the Mel spectrum data of the window data may be calculated and determined as the speaker feature information.
Exemplarily, the process of calculating the Mel spectrum data of the window data is as follows: a fast Fourier transform and a squaring operation are performed on the window data to obtain the spectral line energy corresponding to the window data; the spectral line energy is then processed by a Mel filter bank to obtain the Mel spectrum data corresponding to the window data. There may be multiple pieces of window data, so that the Mel spectrum data corresponding to each piece of window data can be obtained.
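As an illustrative sketch only, and not the application's implementation, the pre-emphasis, framing, windowing, and Mel-spectrum steps described above could be wired together as follows; the use of librosa, the 0.97 pre-emphasis coefficient, and the 25 ms / 10 ms frame settings are assumptions.

```python
import numpy as np
import librosa

def extract_feature_information(useful: np.ndarray, sr: int = 16000) -> dict:
    # Pre-emphasis boosts the high-frequency components.
    emphasized = np.append(useful[0], useful[1:] - 0.97 * useful[:-1])

    # Framing (25 ms frames, 10 ms hop) and Hamming windowing give the window data.
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    frames = librosa.util.frame(emphasized, frame_length=frame_len, hop_length=hop).T
    windows = frames * np.hamming(frame_len)

    # Emotion feature information: short-time energy is shown here as one example
    # characteristic parameter; F0, speech rate, spectrum, and formants would be
    # computed from the same window data by their own formulas.
    energy = np.sum(windows ** 2, axis=1)

    # Speaker feature information: Mel spectrum data (FFT -> power -> Mel filter bank).
    mel = librosa.feature.melspectrogram(y=emphasized, sr=sr,
                                         n_fft=frame_len, hop_length=hop)
    return {"emotion_features": energy, "speaker_features": mel}
```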
Step S1013: annotate the feature information according to the identity information and emotion information of the sample users to obtain the annotated speaker category labels and the annotated emotion category labels.
Exemplarily, for sample user 1, if the identity information of sample user 1 is A and the emotion information is positive, the feature information of sample user 1 can be annotated; for example, the emotion feature information of sample user 1 is annotated as "positive" and the speaker feature information as "A", thereby obtaining the annotated speaker category label and the annotated emotion category label of sample user 1.
Exemplarily, for sample user 2, if the identity information of sample user 2 is B and the emotion information is negative, the feature information of sample user 2 can be annotated; for example, the emotion feature information of sample user 2 is annotated as "negative" and the speaker feature information as "B", thereby obtaining the annotated speaker category label and the annotated emotion category label of sample user 2.
Step S1014: determine the emotion feature information, the speaker feature information, the annotated emotion category labels, and the annotated speaker category labels as the training data.
Exemplarily, the emotion feature information, the speaker feature information, the annotated emotion category labels, and the annotated speaker category labels are used as the training data, where the training data includes data sets corresponding to multiple sample users.
For example, the training data may include a data set of sample user 1, which includes emotion feature information, speaker feature information, the annotated emotion category label "positive", and the annotated speaker category label "A". The training data may also include a data set of sample user 2, which includes emotion feature information, speaker feature information, the annotated emotion category label "negative", and the annotated speaker category label "B".
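Purely as a hypothetical illustration of how one such annotated data set might be organised (the field names and numbers below are invented for readability and are not the application's data format):

```python
sample_user_1 = {
    "emotion_features": [0.62, 210.0, 4.1],            # e.g. energy, F0 in Hz, speech rate
    "speaker_features": [[0.10, 0.32], [0.21, 0.40]],  # e.g. Mel-spectrum frames
    "emotion_label": "positive",
    "speaker_label": "A",
}
training_data = [sample_user_1]  # one such data set per sample user
```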
Step S102: call an emotion recognition model to be trained, the emotion recognition model including a feature generator, an emotion classification model, and a speaker classification model.
It should be noted that the emotion recognition model may include a generative adversarial network (GAN). A generative adversarial network mainly includes a feature generator and a feature discriminator: the feature generator is used to generate data such as text, images, or video from input data, while the feature discriminator acts as a classifier that judges whether the input data is real or generated.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of an emotion recognition model provided by an embodiment of the present application. As shown in FIG. 3, in the embodiments of the present application, the emotion recognition model includes a feature generator, an emotion classification model, and a speaker classification model, where both the emotion classification model and the speaker classification model are feature discriminators.
Exemplarily, the feature generator may use an MLP (Multi-Layer Perceptron) network or a deep neural network to represent the generating function. The emotion classification model and the speaker classification model may include, but are not limited to, a convolutional neural network, a restricted Boltzmann machine, a recurrent neural network, and the like.
By calling the emotion recognition model to be trained, the feature vectors required for training can be generated by the feature generator, and the speaker classification model and the emotion classification model can then be trained to convergence on these feature vectors.
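A minimal PyTorch sketch of this three-part structure follows, assuming an MLP feature generator and two small classifier heads used as the discriminators; the layer sizes, class counts, and the choice of PyTorch are illustrative assumptions, not the application's implementation.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """MLP generator: maps feature information to feature vectors."""
    def __init__(self, in_dim: int = 128, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Classifier(nn.Module):
    """Discriminator head, usable as either the emotion or the speaker classifier."""
    def __init__(self, feat_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(feat_dim, feat_dim)      # fully connected layer
        self.out = nn.Linear(feat_dim, num_classes)  # class scores

    def forward(self, x: torch.Tensor):
        hidden = torch.relu(self.fc(x))
        return self.out(hidden), hidden              # logits and FC-layer feature vector

generator = FeatureGenerator()
emotion_classifier = Classifier(num_classes=2)       # e.g. positive / negative
speaker_classifier = Classifier(num_classes=10)      # e.g. one class per sample speaker
```

Returning the fully connected layer's output alongside the logits is a design convenience for the later step that averages these vectors into the predicted feature vector.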
Step S103: input the emotion feature information and the speaker feature information into the feature generator for feature generation to obtain a corresponding emotion feature vector group and speaker feature vector group.
Referring to FIG. 4, FIG. 4 is a schematic diagram of generating features by a feature generator provided by an embodiment of the present application. As shown in FIG. 4, the emotion feature information and the speaker feature information are input into the feature generator, and the feature generator generates an emotion feature vector group from the emotion feature information and a speaker feature vector group from the speaker feature information. The emotion feature vector group includes at least one emotion feature vector, and the speaker feature vector group includes at least one speaker feature vector.
Exemplarily, the feature generator may generate the corresponding feature vectors from the feature information through a generating function. For example, the corresponding feature vectors may be generated from the feature information through a deep neural network; the specific feature generation process is not limited here.
By inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, the corresponding emotion feature vector group and speaker feature vector group can be obtained, and the speaker feature vector group can subsequently be input into the speaker classification model for training.
Step S104: input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and acquire the predicted feature vector corresponding to the trained speaker classification model.
Referring to FIG. 5, FIG. 5 is a schematic flowchart of the sub-steps of training the speaker classification model provided by an embodiment of the present application, which may specifically include the following steps S1041 to S1044.
Step S1041: determine the training sample data for each round of training from one speaker feature vector in the speaker feature vector group and the speaker category label corresponding to that speaker feature vector.
Exemplarily, one speaker feature vector in the speaker feature vector group and the speaker category label corresponding to that speaker feature vector may be selected in turn and determined as the training sample data for each round of training.
Step S1042: input the current round of training sample data into the speaker classification model for speaker classification training to obtain the speaker classification prediction result corresponding to the current round of training sample data.
Exemplarily, the speaker classification prediction result may include a predicted speaker category and the prediction probability corresponding to the predicted speaker category.
Step S1043: determine the loss function value corresponding to the current round according to the speaker category label corresponding to the current round of training sample data and the speaker classification prediction result.
Exemplarily, the loss function value corresponding to the current round may be determined, based on a preset loss function, from the speaker category label corresponding to the current round of training sample data and the speaker classification prediction result.
Exemplarily, a loss function such as a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a cross-entropy loss function, a squared loss function, or an exponential loss function may be used to calculate the loss function value.
Step S1044: if the loss function value is greater than a preset loss value threshold, adjust the parameters of the speaker classification model and perform the next round of training, until the obtained loss function value is less than or equal to the loss value threshold; then end the training to obtain the trained speaker classification model.
Exemplarily, the preset loss value threshold may be set according to the actual situation, and the specific value is not limited here.
Exemplarily, a convergence algorithm such as a gradient descent algorithm, the Newton method, the conjugate gradient method, or the Cauchy-Newton method may be used to adjust the parameters of the speaker classification model. After the parameters of the speaker classification model are adjusted, the next round of training sample data is input into the speaker classification model for speaker classification training and the corresponding loss function value is determined, until the obtained loss function value is less than or equal to the loss value threshold; the training then ends and the trained speaker classification model is obtained.
By updating the parameters of the speaker classification model according to the preset loss function and convergence algorithm, the speaker classification model can converge quickly, thereby improving the training efficiency and accuracy of the speaker classification model.
By inputting the speaker feature vector group and the speaker category labels into the speaker classification model for iterative training until convergence, the speaker classification model learns the speaker features, and the learned speaker features can subsequently be back-propagated to the feature generator to generate emotion feature vectors from which speaker features are eliminated.
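A minimal sketch of steps S1041 to S1044 is shown below under the assumptions of the earlier PyTorch sketch (a classifier whose forward pass returns logits plus the FC-layer output); cross-entropy loss, plain gradient descent, and the numeric thresholds are illustrative choices among the options listed above.

```python
import torch
import torch.nn as nn

def train_speaker_classifier(model: nn.Module,
                             speaker_vectors: torch.Tensor,   # (N, feat_dim)
                             speaker_labels: torch.Tensor,    # (N,) int64 class ids
                             loss_threshold: float = 0.05,
                             lr: float = 1e-3,
                             max_rounds: int = 1000) -> nn.Module:
    criterion = nn.CrossEntropyLoss()                         # preset loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)    # gradient descent
    n = len(speaker_vectors)
    for round_idx in range(max_rounds):
        # One speaker feature vector and its label form this round's sample data.
        x = speaker_vectors[round_idx % n].unsqueeze(0)
        y = speaker_labels[round_idx % n].unsqueeze(0)
        logits, _ = model(x)
        loss = criterion(logits, y)
        if loss.item() <= loss_threshold:                     # converged: end training
            break
        optimizer.zero_grad()
        loss.backward()                                       # adjust model parameters
        optimizer.step()
    return model
```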
In some embodiments, acquiring the predicted feature vector corresponding to the trained speaker classification model may include: inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and acquiring the feature vector output by the fully connected layer of the speaker classification model; and determining the mean of all the acquired feature vectors as the predicted feature vector.
The speaker classification model includes at least a fully connected layer. Exemplarily, the speaker classification model may be a convolutional neural network model including a convolutional layer, a pooling layer, a fully connected layer, a normalization layer, and the like.
In the embodiments of the present application, the feature vector output by the fully connected layer of the speaker classification model can be acquired. Exemplarily, the training sample data of each round of training yields one output feature vector, so multiple feature vectors can be obtained.
In some implementations, the mean of all the acquired feature vectors may be determined as the predicted feature vector. It can be understood that the predicted feature vector represents the speaker features learned by the trained speaker classification model.
By inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction and acquiring the feature vectors output by the fully connected layer of the speaker classification model, the predicted feature vector in which the speaker classification model has learned the speaker features can be obtained.
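Continuing the same assumed PyTorch setup, the averaging of the fully-connected-layer outputs into a single predicted feature vector could look like this sketch:

```python
import torch

@torch.no_grad()
def predicted_feature_vector(trained_model, speaker_vectors: torch.Tensor) -> torch.Tensor:
    """Feed each round's training sample into the trained speaker classifier,
    collect the fully connected layer's output, and return the mean vector."""
    fc_outputs = []
    for x in speaker_vectors:                      # one sample per training round
        _, fc_feature = trained_model(x.unsqueeze(0))
        fc_outputs.append(fc_feature.squeeze(0))
    return torch.stack(fc_outputs).mean(dim=0)     # mean of all feature vectors
```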
Step S105: back-propagate the predicted feature vector to the feature generator for feature generation to obtain an emotion feature vector group from which speaker features are eliminated.
Referring to FIG. 6, FIG. 6 is a schematic interaction diagram of obtaining an emotion feature vector group from which speaker features are eliminated, provided by an embodiment of the present application. As shown in FIG. 6, the predicted feature vector is back-propagated to the feature generator for feature generation to obtain the speaker-eliminated emotion feature vector group, which is then sent to the emotion classification model for training.
Referring to FIG. 7, FIG. 7 is a schematic flowchart of the sub-steps of step S105. Specifically, step S105 may include the following steps S1051 and S1052.
Step S1051: adjust the speaker feature vector group in the feature generator according to the predicted feature vector to obtain an adjusted speaker feature vector group, where every speaker feature vector in the adjusted speaker feature vector group is the same.
Exemplarily, a speaker feature vector may be represented by a first distribution function, and the speaker feature vector group includes at least one first distribution function. It can be understood that, since the speaker feature vectors contain the speaker feature information of multiple sample users, the speaker feature vector group corresponds to multiple different first distribution functions.
Exemplarily, the first distribution function may be a normal distribution function, which may be expressed as:
f(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))
where μ denotes the mean and σ² denotes the variance.
In some embodiments, adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group may include: determining a second distribution function corresponding to the predicted feature vector and obtaining the mean and variance of the second distribution function; and updating the mean and variance in each first distribution function according to this mean and variance to obtain updated first distribution functions.
Exemplarily, the second distribution function corresponding to the predicted feature vector may likewise be a normal distribution function, expressed as:
F(x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))
where μ and σ² here denote the mean and variance of the predicted feature vector.
Exemplarily, the mean μ and variance σ² in each first distribution function f(x) may be updated according to the mean μ and variance σ² of the second distribution function F(x), so that each updated first distribution function is f′(x).
It can be understood that every updated first distribution function f′(x) has the same mean and the same variance; that is, every speaker feature vector in the adjusted speaker feature vector group is the same.
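A small numerical sketch of this adjustment is given below, assuming each speaker feature vector can be summarised by the mean and variance of its normal distribution; representing the distributions as simple (mean, variance) pairs is an assumption made only for illustration.

```python
import numpy as np

def adjust_speaker_group(speaker_group, predicted_vector):
    """Overwrite the mean and variance of every first distribution f(x) with the
    mean and variance of the second distribution F(x), so that every adjusted
    speaker feature vector becomes the same."""
    mu = float(np.mean(predicted_vector))    # mean of the second distribution
    var = float(np.var(predicted_vector))    # variance of the second distribution
    return [{"mean": mu, "var": var} for _ in speaker_group]

def normal_pdf(x, mean, var):
    """Density of an updated first distribution f'(x)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```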
Step S1052: based on the adjusted speaker feature vector group, generate, through the generating function, the emotion feature vector group from which speaker features are eliminated.
Exemplarily, after the adjusted speaker feature vector group is obtained, the emotion feature vector group from which speaker features are eliminated may be generated by the generating function in the feature generator based on the adjusted speaker feature vector group.
It can be understood that the generating function outputs the emotion feature vector group and the speaker feature vector group. Since every speaker feature vector in the adjusted speaker feature vector group is the same, the speaker feature vectors no longer affect the emotion feature vectors; that is, the obtained emotion feature vector group consists of emotion feature vectors from which speaker features are eliminated.
By generating, based on the adjusted speaker feature vector group, the emotion feature vector group from which speaker features are eliminated, the influence of different speaker features on the emotion classification model can be eliminated when the emotion classification model is trained on the speaker-eliminated emotion feature vector group.
Step S106: input the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model.
It should be noted that, in the training process of emotion recognition models in the prior art, the emotion feature information and the speaker feature information are generally input into a feature generator, which generates an emotion feature vector group and a speaker feature vector group; the emotion feature vector group and the speaker feature vector group are then input into a feature discriminator for emotion classification training to obtain a trained feature discriminator. Prior-art emotion recognition models therefore cannot eliminate the influence of different speakers on emotion recognition.
Exemplarily, the speaker-eliminated emotion feature vector group and the annotated emotion category labels are input into the emotion classification model for iterative training until the emotion classification model converges.
The training process may include: determining the training sample data for each round of training from the speaker-eliminated emotion feature vector group and the emotion category labels; inputting the current round of training sample data into the emotion classification model for emotion classification training to obtain the emotion classification prediction result corresponding to the current round of training sample data; determining the loss function value according to the emotion category label corresponding to the current round of training sample data and the emotion classification prediction result; and, if the loss function value is greater than a preset loss value threshold, adjusting the parameters of the emotion classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, at which point the training ends and the trained emotion classification model is obtained.
Exemplarily, the preset loss value threshold may be set according to the actual situation, and the specific value is not limited here.
Exemplarily, a loss function such as a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a cross-entropy loss function, a squared loss function, or an exponential loss function may be used to calculate the loss function value, and a convergence algorithm such as a gradient descent algorithm, the Newton method, the conjugate gradient method, or the Cauchy-Newton method may be used to adjust the parameters of the emotion classification model.
It should be noted that, since the emotion recognition model includes the emotion classification model and the speaker classification model, when the emotion classification model converges, the emotion recognition model also converges, and the trained emotion recognition model is obtained. The trained emotion recognition model is not affected by speaker features.
In some embodiments, to further ensure the privacy and security of the trained emotion recognition model, the trained emotion recognition model may also be stored in a node of a blockchain. When the trained emotion recognition model needs to be used, it can be obtained from the blockchain node.
By inputting the speaker-eliminated emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training, an emotion recognition model that is not affected by speaker features can be obtained, thereby improving the accuracy of emotion recognition.
Step S107: acquire a speech signal to be recognized, and input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
It should be noted that, in the embodiments of the present application, the speech data to be recognized may be a speech signal collected in advance and stored in a database, or may be generated from a speech signal collected in real time.
Exemplarily, in a human-computer interaction scenario, the speech signal input by a user at a robot terminal may be collected by a speech collection device, noise reduction may then be performed on the speech signal, and the noise-reduced speech signal is determined as the speech signal to be recognized.
The speech collection device may include electronic devices for collecting speech, such as a recorder, a voice recorder pen, or a microphone.
The noise reduction of the speech signal may be implemented according to a spectral subtraction algorithm, a Wiener filtering algorithm, a minimum mean square error algorithm, or a wavelet transform algorithm.
By performing noise reduction on the speech signal, the accuracy of subsequently recognizing the emotion category corresponding to the speech signal can be improved.
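As one example of the options listed above, a bare-bones spectral-subtraction sketch is shown below; treating the first few frames as noise-only and the chosen STFT settings are assumptions for illustration, not the application's noise-reduction implementation.

```python
import numpy as np
import librosa

def spectral_subtraction(noisy: np.ndarray, noise_frames: int = 10) -> np.ndarray:
    stft = librosa.stft(noisy, n_fft=512, hop_length=128)
    magnitude, phase = np.abs(stft), np.angle(stft)
    # Estimate the noise spectrum from frames assumed to contain no speech.
    noise_profile = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    cleaned = np.maximum(magnitude - noise_profile, 0.0)
    return librosa.istft(cleaned * np.exp(1j * phase), hop_length=128)
```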
In some embodiments, before the speech signal is input into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the method may further include: extracting the useful speech signal from the speech signal, and performing feature extraction on the useful speech signal to obtain the emotion feature information and speaker feature information corresponding to the speech signal.
Exemplarily, the useful speech signal in the speech signal may be extracted based on the voice activity detection model. For the specific process of extracting the useful speech signal, reference may be made to the detailed description of the foregoing embodiments, which is not repeated here.
By extracting the useful speech signal from the speech signal, the accuracy of subsequently recognizing the emotion category can be improved.
In some embodiments, performing feature extraction on the useful speech signal to obtain the emotion feature information and speaker feature information corresponding to the speech signal may include: performing pre-emphasis, framing, and windowing on the useful speech signal to obtain window data corresponding to the useful speech signal; calculating characteristic parameters of the window data, the characteristic parameters including at least one of energy, fundamental frequency, speech rate, spectrum, and formant frequency, and determining the characteristic parameters as the emotion feature information; and calculating the Mel spectrum data of the window data and determining the Mel spectrum data as the speaker feature information.
For the specific process of feature extraction, reference may be made to the detailed description of the foregoing embodiments, which is not repeated here.
By performing pre-emphasis, framing, and windowing on the useful speech signal, the high-frequency components can be boosted and leakage in the frequency domain can be reduced, thereby improving subsequent feature extraction.
In some embodiments, inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal may include: inputting the emotion feature information and the speaker feature information into the emotion recognition model for emotion recognition to obtain the emotion recognition result corresponding to the speech signal.
It should be noted that the emotion recognition model is a pre-trained model, which may be stored in a blockchain or in a local database.
Referring to FIG. 8, FIG. 8 is a schematic interaction diagram of calling an emotion recognition model for emotion recognition, provided by an embodiment of the present application. As shown in FIG. 8, the trained emotion recognition model can be called from the blockchain, and the emotion feature information and the speaker feature information are input into the emotion recognition model for emotion recognition to obtain the emotion recognition result corresponding to the speech signal.
Exemplarily, the emotion recognition result may include a predicted emotion category and the prediction probability corresponding to the predicted emotion category, where the predicted emotion category may be positive or negative. For example, the emotion recognition result may be "positive, 90%".
By inputting the emotion feature information and the speaker feature information into the pre-trained emotion recognition model for prediction, the influence of different speaker features on emotion recognition can be eliminated and the accuracy of emotion recognition is improved.
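Tying the earlier sketches together, an assumed end-to-end inference call could look as follows; the category names match the positive/negative example above, while the wiring of generator and classifier is illustrative rather than the application's implementation.

```python
import torch

EMOTION_NAMES = ["negative", "positive"]

@torch.no_grad()
def recognise_emotion(generator, trained_emotion_classifier,
                      emotion_feature_info: torch.Tensor):
    emotion_vector = generator(emotion_feature_info.unsqueeze(0))  # feature generation
    logits, _ = trained_emotion_classifier(emotion_vector)
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    best = int(torch.argmax(probs))
    return EMOTION_NAMES[best], float(probs[best])                 # e.g. ("positive", 0.90)
```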
In the emotion recognition method provided by the above embodiments, extracting the useful speech signal from the sample speech signal improves the subsequent recognition accuracy for both speaker categories and emotion categories. Applying pre-emphasis, framing and windowing to the useful speech signal boosts the high-frequency components and reduces leakage in the frequency domain, which improves the subsequent feature extraction. By calling the emotion recognition model to be trained, the feature generator can produce the feature vectors required for training, so that the speaker classification model and the emotion classification model can be trained to convergence on these feature vectors. Inputting the emotion feature information and the speaker feature information into the feature generator yields the corresponding emotion feature vector group and speaker feature vector group, and the speaker feature vector group can then be fed into the speaker classification model for training. Updating the parameters of the speaker classification model according to the preset loss function and convergence algorithm makes the speaker classification model converge quickly, which improves its training efficiency and accuracy. Iteratively training the speaker classification model on the speaker feature vector group and the speaker category labels until convergence lets the model learn the speaker features, which can later be back-propagated to the feature generator to generate emotion feature vectors from which the speaker features are removed. Feeding the training sample data of each round into the trained speaker classification model for speaker classification prediction, and taking the feature vectors output by its fully connected layer, yields the predicted feature vector that encodes the learned speaker features. Generating the emotion feature vector group with speaker features removed from the adjusted speaker feature vector group means that, when the emotion classification model is trained on this group, the influence of different speakers' features on the emotion classification model is eliminated. Iteratively training the emotion classification model on the speaker-free emotion feature vector group and the annotated emotion category labels produces an emotion recognition model that is not affected by speaker features. Finally, inputting the emotion feature information and the speaker feature information into the pre-trained emotion recognition model for prediction eliminates the influence of different speakers' features on emotion recognition and improves its accuracy.
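As an illustration of the pre-emphasis, framing and windowing steps mentioned above, the following Python sketch shows one common way such preprocessing is implemented. The sample rate, frame length, hop length and pre-emphasis coefficient are assumptions for demonstration and are not values specified by this application.

```python
# Illustrative sketch (not taken from the patent text): pre-emphasis, framing
# and Hamming windowing of a speech signal, assuming 16 kHz audio, 25 ms frames
# and a 10 ms hop.
import numpy as np

def preprocess(signal, sample_rate=16000, alpha=0.97, frame_ms=25, hop_ms=10):
    # Pre-emphasis boosts high-frequency components: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)

    # Split into overlapping frames and apply a Hamming window to each frame
    # to reduce spectral leakage in the frequency domain.
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames

# Example: one second of random audio -> windowed frames
frames = preprocess(np.random.randn(16000))
print(frames.shape)  # (98, 400) with the assumed settings
```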
Please refer to FIG. 9, which is a schematic block diagram of an emotion recognition apparatus 1000 provided by an embodiment of the present application; the emotion recognition apparatus is used to perform the foregoing emotion recognition method. The emotion recognition apparatus may be configured in a server or a terminal.
As shown in FIG. 9, the emotion recognition apparatus 1000 includes: a training data acquisition module 1001, a model calling module 1002, a first feature generation module 1003, a first training module 1004, a second feature generation module 1005, a second training module 1006 and an emotion recognition module 1007.
The training data acquisition module 1001 is configured to acquire training data, where the training data includes emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels.
The model calling module 1002 is configured to call the emotion recognition model to be trained, where the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model.
The first feature generation module 1003 is configured to input the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group.
The first training module 1004 is configured to input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and to obtain the predicted feature vector corresponding to the trained speaker classification model.
The second feature generation module 1005 is configured to back-propagate the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed.
The second training module 1006 is configured to input the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model.
The emotion recognition module 1007 is configured to acquire the speech signal to be recognized, and to input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and of each module described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The above apparatus may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 10.
Please refer to FIG. 10, which is a schematic structural block diagram of a computer device provided by an embodiment of the present application. The computer device may be a server or a terminal.
Referring to FIG. 10, the computer device includes a processor and a memory connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform any of the emotion recognition methods.
It should be understood that the processor may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or it may be any conventional processor.
In one embodiment, the processor is configured to run the computer program stored in the memory to implement the following steps:
Acquire training data, where the training data includes emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels; call the emotion recognition model to be trained, where the emotion recognition model includes a feature generator, an emotion classification model and a speaker classification model; input the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group; input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and obtain the predicted feature vector corresponding to the trained speaker classification model; back-propagate the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed; input the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain the trained emotion recognition model; acquire the speech signal to be recognized, and input the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal.
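The following Python sketch (using PyTorch) outlines a simplified two-stage training skeleton loosely following the steps above: a feature generator feeding a speaker classification model and an emotion classification model. The network sizes, optimizer, number of rounds and toy data are assumptions, and the speaker-feature elimination step itself is sketched separately in the embodiments below; this is not the definitive implementation of the claimed method.

```python
# Hedged sketch of a two-stage training flow: feature generator + speaker
# classifier first, then feature generator + emotion classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
feat_dim, hidden, n_speakers, n_emotions = 40, 64, 4, 2  # assumed sizes

feature_generator = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
speaker_classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_speakers))
emotion_classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_emotions))

# Toy data standing in for the extracted feature information and labels.
features = torch.randn(128, feat_dim)
speaker_labels = torch.randint(0, n_speakers, (128,))
emotion_labels = torch.randint(0, n_emotions, (128,))

# Stage 1: train the speaker classification branch.
opt_spk = torch.optim.Adam(list(feature_generator.parameters()) +
                           list(speaker_classifier.parameters()), lr=1e-3)
for _ in range(50):
    loss = F.cross_entropy(speaker_classifier(feature_generator(features)), speaker_labels)
    opt_spk.zero_grad(); loss.backward(); opt_spk.step()

# Stage 2: train the emotion classification branch on generated feature vectors.
opt_emo = torch.optim.Adam(list(feature_generator.parameters()) +
                           list(emotion_classifier.parameters()), lr=1e-3)
for _ in range(50):
    loss = F.cross_entropy(emotion_classifier(feature_generator(features)), emotion_labels)
    opt_emo.zero_grad(); loss.backward(); opt_emo.step()

print("final emotion loss:", loss.item())
```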
In one embodiment, the speaker feature vector group includes at least one speaker feature vector; when inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, the processor is configured to implement:
Take one of the speaker feature vectors in the speaker feature vector group, together with the speaker category label corresponding to that speaker feature vector, as the training sample data for each round of training; input the current round's training sample data into the speaker classification model for speaker classification training, to obtain the speaker classification prediction result corresponding to the current round's training sample data; determine the loss function value of the current round according to the speaker category label corresponding to the current round's training sample data and the speaker classification prediction result; if the loss function value is greater than a preset loss value threshold, adjust the parameters of the speaker classification model and perform the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, at which point training ends and the trained speaker classification model is obtained.
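A minimal sketch of this round-by-round training loop with a preset loss value threshold might look as follows; the threshold, learning rate and model shape are illustrative assumptions rather than values given in this application.

```python
# Hedged sketch: train the speaker classifier round by round until the loss
# falls to or below a preset threshold.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden, n_speakers = 64, 4  # assumed sizes

speaker_classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_speakers))
optimizer = torch.optim.Adam(speaker_classifier.parameters(), lr=1e-2)

# Each round uses speaker feature vectors and their speaker category labels.
speaker_vectors = torch.randn(32, hidden)
speaker_labels = torch.randint(0, n_speakers, (32,))
loss_threshold = 0.1  # assumed preset loss value threshold

for round_idx in range(1000):                     # cap the number of rounds
    logits = speaker_classifier(speaker_vectors)  # speaker classification prediction
    loss = F.cross_entropy(logits, speaker_labels)
    if loss.item() <= loss_threshold:             # converged: stop training
        break
    optimizer.zero_grad()
    loss.backward()                               # adjust model parameters
    optimizer.step()

print(f"stopped after round {round_idx} with loss {loss.item():.3f}")
```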
In one embodiment, the speaker classification model includes at least a fully connected layer; when obtaining the predicted feature vector corresponding to the trained speaker classification model, the processor is configured to implement:
Input the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtain the feature vectors output by the fully connected layer of the speaker classification model; determine the mean of all the obtained feature vectors as the predicted feature vector.
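The sketch below illustrates one way to collect the fully connected layer's outputs during speaker classification prediction and average them into a predicted feature vector; the forward-hook mechanism and the layer sizes are assumptions used for demonstration.

```python
# Hedged sketch: capture the fully-connected-layer outputs and average them
# into the predicted feature vector.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden, n_speakers = 64, 4  # assumed sizes

fc = nn.Linear(hidden, hidden)        # fully connected layer of interest
head = nn.Linear(hidden, n_speakers)  # classification head
speaker_classifier = nn.Sequential(fc, nn.ReLU(), head)

captured = []
fc.register_forward_hook(lambda module, inp, out: captured.append(out.detach()))

training_samples = torch.randn(32, hidden)
with torch.no_grad():
    speaker_classifier(training_samples)  # speaker classification prediction

fc_outputs = torch.cat(captured, dim=0)   # feature vectors from the FC layer
predicted_feature_vector = fc_outputs.mean(dim=0)
print(predicted_feature_vector.shape)     # torch.Size([64])
```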
In one embodiment, the feature generator includes a generating function; when back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are removed, the processor is configured to implement:
Adjust the speaker feature vector group in the feature generator according to the predicted feature vector, to obtain the adjusted speaker feature vector group, where every speaker feature vector in the adjusted speaker feature vector group is the same; based on the adjusted speaker feature vector group, generate, through the generating function, the emotion feature vector group from which speaker features are removed.
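The following sketch shows one possible reading of this adjustment step: every speaker feature vector in the group is replaced by the shared predicted feature vector before the generating function produces the emotion feature vectors, so speaker variation no longer enters the generated features. The concrete generating function used here (a linear map over concatenated features) is an assumption.

```python
# Hedged sketch: make all speaker feature vectors identical, then regenerate
# the emotion feature vectors from the adjusted group.
import torch
import torch.nn as nn

torch.manual_seed(0)
feat_dim, spk_dim = 40, 64  # assumed sizes

generator = nn.Linear(feat_dim + spk_dim, feat_dim)  # assumed generating function

emotion_features = torch.randn(32, feat_dim)
speaker_vectors = torch.randn(32, spk_dim)           # original speaker feature group
predicted_feature_vector = torch.randn(spk_dim)      # from the trained classifier

# Adjust the group: every speaker feature vector becomes the same vector.
adjusted_speakers = predicted_feature_vector.expand(32, spk_dim)

# Generate emotion feature vectors from which speaker variation is removed.
speaker_free_emotion = generator(torch.cat([emotion_features, adjusted_speakers], dim=1))
print(speaker_free_emotion.shape)                     # torch.Size([32, 40])
```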
In one embodiment, the speaker feature vector group includes at least one first distribution function; when adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group, the processor is configured to implement:
Determine the second distribution function corresponding to the predicted feature vector, and obtain the mean and variance of the second distribution function; update the mean and variance in each first distribution function according to that mean and variance, to obtain the updated first distribution functions.
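A minimal sketch of this distribution update, assuming the first and second distribution functions are characterized only by their mean and variance (e.g. Gaussians), is shown below; the Gaussian assumption and the toy data are illustrative.

```python
# Hedged sketch: overwrite each first distribution's mean/variance with the
# mean/variance of the second distribution fitted to the predicted feature vector.
import numpy as np

rng = np.random.default_rng(0)

# First distribution functions: one (mean, variance) pair per speaker (assumed).
first_distributions = [{"mean": rng.normal(), "var": abs(rng.normal()) + 0.1}
                       for _ in range(4)]

# Second distribution function fitted to the predicted feature vector.
predicted_feature_vector = rng.normal(size=64)
second_mean = predicted_feature_vector.mean()
second_var = predicted_feature_vector.var()

# Update every first distribution with the mean and variance of the second.
updated = [{"mean": second_mean, "var": second_var} for _ in first_distributions]
print(updated[0])
```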
In one embodiment, before inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the processor is further configured to implement:
Extract the useful speech signal from the speech signal, and perform feature extraction on the useful speech signal, to obtain the emotion feature information and speaker feature information corresponding to the speech signal.
In one embodiment, when inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, the processor is configured to implement:
Input the emotion feature information and the speaker feature information into the emotion recognition model for emotion recognition, to obtain the emotion recognition result corresponding to the speech signal.
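The sketch below illustrates this inference path with a stand-in feature extractor and a toy model; the energy-based useful-signal gate, the specific features and the model interface are assumptions for demonstration only.

```python
# Hedged sketch: extract the useful signal and feature information, then query
# a (toy) emotion recognition model for a category and probability.
import numpy as np

def extract_useful_signal(signal, energy_threshold=0.01, frame_len=400):
    # Keep frames whose energy exceeds a threshold (a crude voice-activity gate).
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, frame_len)]
    kept = [f for f in frames if np.mean(f ** 2) > energy_threshold]
    return np.concatenate(kept) if kept else signal

def predict_emotion(model, signal):
    useful = extract_useful_signal(signal)
    emotion_feat = np.array([useful.mean(), useful.std()])        # assumed emotion features
    speaker_feat = np.array([np.abs(useful).max(), len(useful)])  # assumed speaker features
    probs = model(np.concatenate([emotion_feat, speaker_feat]))
    label = "positive" if probs[0] >= 0.5 else "negative"
    return label, float(max(probs))

def toy_model(feats):
    # Stand-in for the trained emotion recognition model: [p_positive, p_negative].
    return np.array([0.9, 0.1])

print(predict_emotion(toy_model, np.random.randn(16000)))  # e.g. ('positive', 0.9)
```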
In one embodiment, when acquiring the training data, the processor is configured to implement:
Acquire sample speech signals corresponding to a preset number of sample users, and extract the useful speech signals from the sample speech signals, where the sample speech signals are stored in a blockchain; perform feature extraction on the useful speech signals to obtain corresponding feature information, where the feature information includes emotion feature information and speaker feature information; annotate the feature information according to the identity information and emotion information of the sample users, to obtain the annotated speaker category labels and annotated emotion category labels; and determine the emotion feature information, the speaker feature information, the annotated emotion category labels and the annotated speaker category labels as the training data.
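The following sketch illustrates assembling such training data for a handful of sample users; the placeholder feature extractor, the random stand-in speech signals and the label fields are assumptions (in practice the sample speech signals would come from the blockchain storage described above).

```python
# Hedged sketch: build training records of feature information plus annotated
# emotion and speaker category labels.
import numpy as np

rng = np.random.default_rng(0)

def extract_features(signal):
    # Placeholder feature extraction returning (emotion_features, speaker_features).
    return np.array([signal.mean(), signal.std()]), np.array([np.abs(signal).max()])

sample_users = [{"id": i, "emotion": rng.choice(["positive", "negative"])}
                for i in range(3)]

training_data = []
for user in sample_users:
    signal = rng.normal(size=16000)          # stands in for the stored sample speech
    emotion_feat, speaker_feat = extract_features(signal)
    training_data.append({
        "emotion_features": emotion_feat,
        "speaker_features": speaker_feat,
        "emotion_label": user["emotion"],    # annotated emotion category label
        "speaker_label": user["id"],         # annotated speaker category label
    })

print(len(training_data), training_data[0]["emotion_label"])
```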
Embodiments of the present application further provide a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and a processor executes the program instructions to implement any of the emotion recognition methods provided by the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital card (SD card) or a flash card equipped on the computer device.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of blockchain nodes, and the like.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer and an application service layer.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art could easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. An emotion recognition method, comprising:
    acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
    calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
    inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group;
    inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and obtaining a predicted feature vector corresponding to the trained speaker classification model;
    back-propagating the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed;
    inputting the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
    acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  2. The emotion recognition method according to claim 1, wherein the speaker feature vector group comprises at least one speaker feature vector, and inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence comprises:
    taking one of the speaker feature vectors in the speaker feature vector group, together with the speaker category label corresponding to that speaker feature vector, as the training sample data for each round of training;
    inputting the current round's training sample data into the speaker classification model for speaker classification training, to obtain a speaker classification prediction result corresponding to the current round's training sample data;
    determining a loss function value of the current round according to the speaker category label corresponding to the current round's training sample data and the speaker classification prediction result; and
    if the loss function value is greater than a preset loss value threshold, adjusting parameters of the speaker classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, whereupon training ends and the trained speaker classification model is obtained.
  3. The emotion recognition method according to claim 2, wherein the speaker classification model comprises at least a fully connected layer, and obtaining the predicted feature vector corresponding to the trained speaker classification model comprises:
    inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtaining the feature vectors output by the fully connected layer of the speaker classification model; and
    determining the mean of all the obtained feature vectors as the predicted feature vector.
  4. The emotion recognition method according to claim 1, wherein the feature generator comprises a generating function, and back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are removed comprises:
    adjusting the speaker feature vector group in the feature generator according to the predicted feature vector, to obtain an adjusted speaker feature vector group, wherein every speaker feature vector in the adjusted speaker feature vector group is the same; and
    based on the adjusted speaker feature vector group, generating, through the generating function, the emotion feature vector group from which speaker features are removed.
  5. The emotion recognition method according to claim 4, wherein the speaker feature vector group comprises at least one first distribution function, and adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group comprises:
    determining a second distribution function corresponding to the predicted feature vector, and obtaining a mean and a variance of the second distribution function; and
    updating the mean and variance in each first distribution function according to said mean and variance, to obtain updated first distribution functions.
  6. The emotion recognition method according to claim 1, further comprising, before inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal:
    extracting a useful speech signal from the speech signal, and performing feature extraction on the useful speech signal, to obtain emotion feature information and speaker feature information corresponding to the speech signal;
    wherein inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal comprises:
    inputting the emotion feature information and the speaker feature information into the emotion recognition model for emotion recognition, to obtain the emotion recognition result corresponding to the speech signal.
  7. The emotion recognition method according to any one of claims 1-6, wherein acquiring the training data comprises:
    acquiring sample speech signals corresponding to a preset number of sample users, and extracting useful speech signals from the sample speech signals, wherein the sample speech signals are stored in a blockchain;
    performing feature extraction on the useful speech signals to obtain corresponding feature information, the feature information comprising emotion feature information and speaker feature information;
    annotating the feature information according to identity information and emotion information of the sample users, to obtain the annotated speaker category labels and the annotated emotion category labels; and
    determining the emotion feature information, the speaker feature information, the annotated emotion category labels and the annotated speaker category labels as the training data.
  8. An emotion recognition apparatus, comprising:
    a training data acquisition module, configured to acquire training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
    a model calling module, configured to call an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
    a first feature generation module, configured to input the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group;
    a first training module, configured to input the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and to obtain a predicted feature vector corresponding to the trained speaker classification model;
    a second feature generation module, configured to back-propagate the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed;
    a second training module, configured to input the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
    an emotion recognition module, configured to acquire a speech signal to be recognized, and to input the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  9. A computer device, comprising a memory and a processor;
    the memory being configured to store a computer program;
    the processor being configured to execute the computer program and, when executing the computer program, to implement:
    acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
    calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
    inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group;
    inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and obtaining a predicted feature vector corresponding to the trained speaker classification model;
    back-propagating the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed;
    inputting the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
    acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  10. The computer device according to claim 9, wherein the speaker feature vector group comprises at least one speaker feature vector, and inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence comprises:
    taking one of the speaker feature vectors in the speaker feature vector group, together with the speaker category label corresponding to that speaker feature vector, as the training sample data for each round of training;
    inputting the current round's training sample data into the speaker classification model for speaker classification training, to obtain a speaker classification prediction result corresponding to the current round's training sample data;
    determining a loss function value of the current round according to the speaker category label corresponding to the current round's training sample data and the speaker classification prediction result; and
    if the loss function value is greater than a preset loss value threshold, adjusting parameters of the speaker classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, whereupon training ends and the trained speaker classification model is obtained.
  11. The computer device according to claim 10, wherein the speaker classification model comprises at least a fully connected layer, and obtaining the predicted feature vector corresponding to the trained speaker classification model comprises:
    inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtaining the feature vectors output by the fully connected layer of the speaker classification model; and
    determining the mean of all the obtained feature vectors as the predicted feature vector.
  12. The computer device according to claim 9, wherein the feature generator comprises a generating function, and back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are removed comprises:
    adjusting the speaker feature vector group in the feature generator according to the predicted feature vector, to obtain an adjusted speaker feature vector group, wherein every speaker feature vector in the adjusted speaker feature vector group is the same; and
    based on the adjusted speaker feature vector group, generating, through the generating function, the emotion feature vector group from which speaker features are removed.
  13. The computer device according to claim 12, wherein the speaker feature vector group comprises at least one first distribution function, and adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group comprises:
    determining a second distribution function corresponding to the predicted feature vector, and obtaining a mean and a variance of the second distribution function; and
    updating the mean and variance in each first distribution function according to said mean and variance, to obtain updated first distribution functions.
  14. The computer device according to claim 9, wherein the processor is further configured, before inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal, to implement:
    extracting a useful speech signal from the speech signal, and performing feature extraction on the useful speech signal, to obtain emotion feature information and speaker feature information corresponding to the speech signal;
    wherein inputting the speech signal into the trained emotion recognition model to obtain the emotion recognition result corresponding to the speech signal comprises:
    inputting the emotion feature information and the speaker feature information into the emotion recognition model for emotion recognition, to obtain the emotion recognition result corresponding to the speech signal.
  15. The computer device according to any one of claims 9-14, wherein acquiring the training data comprises:
    acquiring sample speech signals corresponding to a preset number of sample users, and extracting useful speech signals from the sample speech signals, wherein the sample speech signals are stored in a blockchain;
    performing feature extraction on the useful speech signals to obtain corresponding feature information, the feature information comprising emotion feature information and speaker feature information;
    annotating the feature information according to identity information and emotion information of the sample users, to obtain the annotated speaker category labels and the annotated emotion category labels; and
    determining the emotion feature information, the speaker feature information, the annotated emotion category labels and the annotated speaker category labels as the training data.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement:
    acquiring training data, the training data comprising emotion feature information with annotated emotion category labels, and speaker feature information with annotated speaker category labels;
    calling an emotion recognition model to be trained, the emotion recognition model comprising a feature generator, an emotion classification model and a speaker classification model;
    inputting the emotion feature information and the speaker feature information into the feature generator for feature generation, to obtain a corresponding emotion feature vector group and speaker feature vector group;
    inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence, and obtaining a predicted feature vector corresponding to the trained speaker classification model;
    back-propagating the predicted feature vector to the feature generator for feature generation, to obtain an emotion feature vector group from which speaker features are removed;
    inputting the speaker-free emotion feature vector group and the annotated emotion category labels into the emotion classification model for iterative training until the emotion classification model converges, to obtain a trained emotion recognition model; and
    acquiring a speech signal to be recognized, and inputting the speech signal into the trained emotion recognition model to obtain an emotion recognition result corresponding to the speech signal.
  17. The computer-readable storage medium according to claim 16, wherein the speaker feature vector group comprises at least one speaker feature vector, and inputting the speaker feature vector group and the annotated speaker category labels into the speaker classification model for iterative training until convergence comprises:
    taking one of the speaker feature vectors in the speaker feature vector group, together with the speaker category label corresponding to that speaker feature vector, as the training sample data for each round of training;
    inputting the current round's training sample data into the speaker classification model for speaker classification training, to obtain a speaker classification prediction result corresponding to the current round's training sample data;
    determining a loss function value of the current round according to the speaker category label corresponding to the current round's training sample data and the speaker classification prediction result; and
    if the loss function value is greater than a preset loss value threshold, adjusting parameters of the speaker classification model and performing the next round of training, until the obtained loss function value is less than or equal to the loss value threshold, whereupon training ends and the trained speaker classification model is obtained.
  18. The computer-readable storage medium according to claim 17, wherein the speaker classification model comprises at least a fully connected layer, and obtaining the predicted feature vector corresponding to the trained speaker classification model comprises:
    inputting the training sample data of each round of training into the trained speaker classification model for speaker classification prediction, and obtaining the feature vectors output by the fully connected layer of the speaker classification model; and
    determining the mean of all the obtained feature vectors as the predicted feature vector.
  19. The computer-readable storage medium according to claim 16, wherein the feature generator comprises a generating function, and back-propagating the predicted feature vector to the feature generator for feature generation to obtain the emotion feature vector group from which speaker features are removed comprises:
    adjusting the speaker feature vector group in the feature generator according to the predicted feature vector, to obtain an adjusted speaker feature vector group, wherein every speaker feature vector in the adjusted speaker feature vector group is the same; and
    based on the adjusted speaker feature vector group, generating, through the generating function, the emotion feature vector group from which speaker features are removed.
  20. The computer-readable storage medium according to claim 19, wherein the speaker feature vector group comprises at least one first distribution function, and adjusting the speaker feature vector group in the feature generator according to the predicted feature vector to obtain the adjusted speaker feature vector group comprises:
    determining a second distribution function corresponding to the predicted feature vector, and obtaining a mean and a variance of the second distribution function; and
    updating the mean and variance in each first distribution function according to said mean and variance, to obtain updated first distribution functions.
PCT/CN2021/084252 2021-02-26 2021-03-31 Emotion recognition method and apparatus, computer device, and storage medium WO2022178942A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110218668.3A CN112949708B (en) 2021-02-26 2021-02-26 Emotion recognition method, emotion recognition device, computer equipment and storage medium
CN202110218668.3 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022178942A1 true WO2022178942A1 (en) 2022-09-01

Family

ID=76246480

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084252 WO2022178942A1 (en) 2021-02-26 2021-03-31 Emotion recognition method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112949708B (en)
WO (1) WO2022178942A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780341B (en) * 2021-08-04 2024-02-06 华中科技大学 Multidimensional emotion recognition method and system
CN113889149B (en) * 2021-10-15 2023-08-29 北京工业大学 Speech emotion recognition method and device
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment
CN115482837B (en) * 2022-07-25 2023-04-28 科睿纳(河北)医疗科技有限公司 Emotion classification method based on artificial intelligence


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063427A (en) * 2014-06-06 2014-09-24 北京搜狗科技发展有限公司 Expression input method and device based on semantic understanding
CN110379445A (en) * 2019-06-20 2019-10-25 深圳壹账通智能科技有限公司 Method for processing business, device, equipment and storage medium based on mood analysis
GB2588747B (en) * 2019-06-28 2021-12-08 Huawei Tech Co Ltd Facial behaviour analysis
CN110556129B (en) * 2019-09-09 2022-04-19 北京大学深圳研究生院 Bimodal emotion recognition model training method and bimodal emotion recognition method
CN111523389A (en) * 2020-03-25 2020-08-11 中国平安人寿保险股份有限公司 Intelligent emotion recognition method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
CN112259105A (en) * 2020-10-10 2021-01-22 西南政法大学 Training method of voiceprint recognition model, storage medium and computer equipment
CN112382309A (en) * 2020-12-11 2021-02-19 平安科技(深圳)有限公司 Emotion recognition model training method, device, equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796290A (en) * 2023-08-23 2023-09-22 江西尚通科技发展有限公司 Dialog intention recognition method, system, computer and storage medium
CN116796290B (en) * 2023-08-23 2024-03-29 江西尚通科技发展有限公司 Dialog intention recognition method, system, computer and storage medium
CN117582227A (en) * 2024-01-18 2024-02-23 华南理工大学 fNIRS emotion recognition method and system based on probability distribution labels and brain region characteristics

Also Published As

Publication number Publication date
CN112949708B (en) 2023-10-24
CN112949708A (en) 2021-06-11


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21927376; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21927376; Country of ref document: EP; Kind code of ref document: A1)