CN115376214A - Emotion recognition method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115376214A
Authority
CN
China
Prior art keywords: modality, sample data, emotion recognition, emotion, sample
Legal status
Pending
Application number
CN202210813381.XA
Other languages
Chinese (zh)
Inventor
殷兵
李晋
褚繁
高天
方昕
刘俊华
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN202210813381.XA
Publication of CN115376214A

Classifications

    • G Physics
    • G06 Computing; calculating or counting
    • G06V Image or video recognition or understanding
    • G06V40/70 Multimodal biometrics, e.g. combining information from different biometric modalities
    • G06V10/40 Extraction of image or video features
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an emotion recognition method and device, an electronic device and a storage medium. The method comprises: determining data to be recognized of at least two modalities; determining an emotion probability distribution of the data to be recognized of each modality based on the emotion recognition model of that modality; and determining an emotion recognition result based on the emotion probability distributions of the modalities. Each emotion recognition model extracts features from the data to be recognized of its modality and performs emotion recognition based on the data features obtained by feature extraction. The emotion recognition models of the modalities are obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity between the prediction probability distributions of the sample data of each modality. Training the models with the consistency of the emotion information represented by sample data of different modalities, and with the complementary relation of the same emotion across modalities, improves the generalization capability of the models and the accuracy of the emotion recognition process.

Description

Emotion recognition method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an emotion recognition method and device, electronic equipment and a storage medium.
Background
Emotion is a subjective experience: the psychological response, together with the accompanying physiological response, that people produce in reaction to external stimuli, and it is of great significance in fields such as medical care, education and interrogation. With the rapid development of artificial intelligence and the growing availability of deep learning software and hardware, human-computer interaction has attracted more and more attention, and emotion recognition, as an important branch of human-computer interaction, has naturally become a popular research topic.
Existing emotion recognition technology mostly focuses on single modalities such as speech, facial expression, electroencephalogram signals and text; single-modality emotion recognition has low accuracy, and its results are not very reliable. There are also emotion recognition methods based on deep learning. Such methods usually predict emotion with a multi-task learning network structure and train the model accordingly in a multi-task learning manner, which presupposes that abstract representation information is completely shared between the different modalities. Under that premise, if the model cannot be aggregated into a matched high-dimensional information expression, the training effect of the model is poor and its prediction performance suffers.
Disclosure of Invention
The invention provides an emotion recognition method and device, an electronic device and a storage medium, to overcome the defect in the prior art that the multi-task learning training mode requires abstract representation information to be completely shared between different modalities, so that model training deviates and the training effect is poor when the model cannot be aggregated into a matched high-dimensional information expression.
The invention provides an emotion recognition method, which comprises the following steps:
determining data to be identified of at least two modalities;
determining emotion probability distribution of data to be identified of each modality based on an emotion identification model of each modality;
determining emotion recognition results based on the emotion probability distribution of each modality;
the emotion recognition model is used for extracting features of data to be recognized of corresponding modalities and recognizing emotion based on the data features obtained by feature extraction;
the emotion recognition model of each modality is obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality.
According to the emotion recognition method provided by the invention, the emotion recognition models of all the modes are trained on the basis of the following steps:
determining sample data characteristics and a prediction probability distribution of sample data of each modality based on the initial emotion recognition model of each modality;
mapping the sample data features of the sample data of each mode to the same space to obtain the sample projection features of the sample data of each mode in the same space;
determining a joint training loss based on feature similarity between sample projection features of the sample data of each modality and/or distribution similarity between predicted probability distributions of the sample data of each modality;
and performing parameter iteration on the initial emotion recognition models of the modes based on the joint training loss to obtain the emotion recognition models of the modes.
According to an emotion recognition method provided by the present invention, determining a joint training loss based on feature similarity between sample projection features of sample data of each modality and/or distribution similarity between predicted probability distributions of the sample data of each modality includes:
selecting sample data of at least two modalities with the same sample emotion recognition result from the sample data of each modality as a positive sample data set, and selecting sample data of at least two modalities with different sample emotion recognition results from the sample data of each modality as a negative sample data set;
determining the contrast loss based on the feature similarity between the sample projection features of the sample data in the positive sample data set and the feature similarity between the sample projection features of the sample data in the negative sample data set;
determining distribution loss based on distribution similarity among the prediction probability distributions of the sample data in the positive sample data set;
determining a joint training loss based on the contrast loss and/or the distribution loss.
According to the emotion recognition method provided by the invention, the parameter iteration is performed on the initial emotion recognition model of each modality based on the joint training loss to obtain the emotion recognition model of each modality, and the method comprises the following steps:
determining the prediction loss of the initial emotion recognition model of each modality based on the prediction probability distribution of the sample data of each modality and the sample emotion recognition result corresponding to the sample data of each modality;
and performing parameter iteration on the initial emotion recognition model of each modality based on the prediction loss and the joint training loss to obtain the emotion recognition model of each modality.
According to the emotion recognition method provided by the invention, the initial emotion recognition model of each modality is trained on the basis of the following steps:
determining sample data of at least two modalities;
inputting the sample data of each mode into a first emotion recognition model of a corresponding mode to obtain a first prediction probability distribution of each mode output by the first emotion recognition model;
and performing parameter iteration on the first emotion recognition models of all the modalities based on the first prediction probability distribution and sample emotion recognition results to obtain initial emotion recognition models of all the modalities.
According to an emotion recognition method provided by the invention, the determining sample data of at least two modalities comprises the following steps:
determining initial sample data of at least two modes, wherein the at least two modes comprise at least two of an audio mode, an image mode, a text mode, an electroencephalogram mode, a behavior mode and a genetic mode;
and dividing time intervals of the initial sample data of each mode to obtain the sample data of each mode.
According to the emotion recognition method provided by the invention, the emotion recognition result is determined based on the emotion probability distribution of each modality, and the method comprises the following steps of:
carrying out weighted fusion on the emotion probability distribution of each mode to obtain fused emotion probability distribution;
and determining an emotion recognition result based on the fusion emotion probability distribution.
The present invention also provides an emotion recognition apparatus including:
a to-be-recognized data determining unit, used for determining data to be recognized of at least two modalities;
the probability distribution determining unit is used for determining the emotion probability distribution of the data to be identified of each modality based on the emotion identification model of each modality; the emotion recognition model is used for extracting features of data to be recognized of corresponding modalities and recognizing emotion based on the data features obtained by feature extraction; the emotion recognition model of each modality is obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality;
and the recognition result determining unit is used for determining the emotion recognition result based on the emotion probability distribution of each modality.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the emotion recognition method as described in any of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of emotion recognition as described in any of the above.
In the emotion recognition method and device, electronic device and storage medium provided by the invention, the initial emotion recognition models of the modalities are jointly trained with reference to the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity between the prediction probability distributions of the sample data of each modality. The models can thus fully learn, during training, how near or far the sample data features and/or prediction probability distributions of different modalities lie from each other, which provides key assistance for improving the reliability and accuracy of emotion recognition. This overcomes the defect of the traditional scheme, in which the multi-task learning training mode requires abstract characterization information of different modalities to be completely shared, so that model training deviates and the training effect is poor when the models cannot be aggregated into a matched high-dimensional information expression. Moreover, training the models with the complementary relation of the same emotion across modalities improves their generalization capability, the reliability of the emotion recognition result and the accuracy of the emotion recognition process.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow diagram of a method of emotion recognition provided by the present invention;
FIG. 2 is a schematic diagram of a training process for an emotion recognition model provided by the present invention;
FIG. 3 is a schematic flow chart of step 230 in the emotion recognition method provided by the present invention;
FIG. 4 is a schematic flow chart of step 240 of the emotion recognition method provided by the present invention;
FIG. 5 is a schematic diagram of the training process of the initial emotion recognition model provided by the present invention;
FIG. 6 is a flow chart illustrating step 510 of the emotion recognition method provided by the present invention;
FIG. 7 is a schematic flow chart of step 130 of the emotion recognition method provided by the present invention;
FIG. 8 is an overall framework diagram of the training process of the emotion recognition model provided by the present invention;
FIG. 9 is a schematic structural diagram of an emotion recognition apparatus provided by the present invention;
FIG. 10 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Emotion is an internal subjective experience: the psychological response, together with the accompanying physiological response, that people produce in reaction to external stimuli, and it plays an important role in fields such as medical care, education and interrogation. With the development of artificial intelligence and the growing availability of deep learning software and hardware, human-computer interaction has attracted more and more scholars and researchers, and emotion recognition, as an important branch of human-computer interaction, is naturally an important research subject at present.
Current emotion recognition technology mostly focuses on single modalities such as speech, facial expression, electroencephalogram signals and text. Traditional facial expression recognition is mostly based on hand-designed features or shallow learned features; the recognition process comprises image acquisition, preprocessing, feature extraction and classification, where preprocessing generally includes data augmentation, face recognition, normalization and the like. Speech emotion recognition has likewise developed as data sets have grown richer: speech is the most direct means of daily communication and carries rich emotion information, and changes in a person's emotion show up in speech characteristics. Speech emotion recognition converts a speech signal containing emotion information into readable physical features, extracts from them the speech features related to emotion expression, performs emotion recognition with a trained emotion recognition classifier, and finally outputs the emotion recognition result.
In recent years, deep learning approaches have achieved significant success in many areas by combining and analyzing low-level features to form abstract high-level attribute descriptions and thereby determine distributed feature representations of data. A deep learning method does not require extensive domain expertise or a definition of the actual physical meaning of the extracted feature parameters; it mainly designs a neural network structure, extracts features through the neural network and recognizes the corresponding emotion category. This is an end-to-end recognition method: a mathematical model and algorithm are established through the neural network, and connection weight parameters are obtained by training so that the network can perform data-based pattern recognition, function mapping and the like. Training the model with sample data carrying emotion category labels allows deep mining of the subtle emotional information contained in the sample data. Emotion recognition with deep learning therefore does not require manually designing fine feature-extraction schemes in advance, reduces researchers' dependence on specialist prior knowledge, lowers the research threshold, and has become an advanced technology in the field of emotion recognition.
Currently, deep-learning-based emotion recognition methods generally adopt a multi-task learning network structure: multimodal input signals are fed into corresponding neural networks, for example a TDNN (Time Delay Neural Network), an RNN (Recurrent Neural Network) or a CNN (Convolutional Neural Network), abstractly fused together through multi-layer information connections and high-level information sharing, and the emotion category to which the input signals belong is then predicted. Among these networks, the convolutional neural network's local receptive fields, weight sharing and pooling greatly reduce the network scale and alleviate the overfitting problem caused by large networks. A convolutional neural network with randomly initialized parameters predicts the emotion category of the multimodal input signals, the error between the predicted value and the labeled value of the true emotion category drives the network to update its parameters, and after several rounds of parameter updates more accurate model parameters are obtained, completing model training.
However, the above scheme trains the model in a multi-task learning manner, combining the emotion characterization vectors abstracted from different modalities through shared fusion. Such training presupposes that abstract characterization information is completely shared between different modalities; if the models cannot be aggregated into a matched high-dimensional information expression, model training deviates, the training effect is poor, and prediction performance suffers.
In view of this, the present invention provides an emotion recognition method that performs emotion recognition at the multimodal level and trains the models using the consistency of the emotion information represented by sample data of different modalities, so that the trained models achieve higher emotion recognition accuracy on multimodal data to be recognized, improving both the reliability and the accuracy of the emotion recognition result. FIG. 1 is a schematic flow diagram of the emotion recognition method provided by the present invention; as shown in FIG. 1, the method includes:
step 110, determining data to be identified of at least two modalities;
specifically, before performing emotion recognition, data to be recognized, which is data to be recognized, needs to be determined first, and since emotion recognition is in a multi-modal level, data to be recognized needs to be determined in at least two modalities, where the at least two modalities may be an audio modality, an image modality, a text modality, an electroencephalogram modality, a behavior modality, a genetic modality, and the like.
In addition, in order to ensure consistency of emotion information represented by data to be identified of each mode, in the embodiment of the invention, the data sources of the data to be identified of each mode are required to be the same, that is, the data to be identified of each mode needs to be derived from the same multi-mode data, the multi-mode data can be audio and video data, that is, a section of audio and video data intercepted from an audio and video data stream recorded in real time, for example, the duration of the audio and video data can be preset, and in the recording process, the audio and video data stream is intercepted once every preset duration, so that the latest recorded audio and video data with a section of preset duration is obtained; or a section of audio and video data intercepted from the recorded audio and video data, or the whole recorded audio and video data.
It should be noted that after the audio and video data is obtained, modality separation is performed on the audio and video data to obtain at least two pieces of initial data of a single modality, for example, initial data of an audio modality, initial data of an image modality, initial data of a text modality, and the like, and in order to enable the specification of the portion of initial data to be adapted to the size of an input window of the emotion recognition model, in the embodiment of the present invention, time interval division is performed on the initial data of each modality, that is, the initial data of each modality may be segmented by using a preset time window, so that data to be recognized of each modality may be obtained. Here, the window length of the preset time window is fixed, and may be preset according to actual conditions, for example, may be 4 seconds, 5 seconds, 6 seconds, and the like, which is not specifically limited in the embodiment of the present invention.
In addition, it is worth noting that if the preset duration for intercepting the audio and video data stream is exactly equal to the window length of the preset time window, the specification of the initial data of each modality separated from the intercepted audio and video data can also meet the input requirement of the emotion recognition model, and in this case, the time interval division of the separated initial data of each modality is not needed, and the separated initial data of each modality can be directly used as the data to be recognized.
Correspondingly, if the duration of the initial data of any modality obtained through separation is smaller than the window length of the preset time window, the initial data of the modality needs to be copied and spliced, that is, multiple copies of the initial data of the modality are copied, and the copied initial data is spliced with the original initial data of the modality, so that the duration corresponding to the spliced initial data can be larger than or equal to the window length of the preset time window, and then the time interval division can be performed on the spliced initial data.
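As an illustration of the time-window segmentation and copy-splice padding described above, the following minimal sketch (in Python; the 4-second window, the 16 kHz sampling rate and the helper name are assumptions for illustration, not taken from the patent) splits a single-modality signal into fixed-length windows:

```python
import numpy as np

def segment_modality(signal: np.ndarray, window_len: int) -> list:
    """Split a 1-D single-modality signal into fixed-length windows.

    If the signal is shorter than one window, it is first copied and
    spliced (tiled) until it reaches the window length, as described above.
    """
    if len(signal) < window_len:
        repeats = int(np.ceil(window_len / len(signal)))
        signal = np.tile(signal, repeats)              # copy-splice padding
    n_windows = len(signal) // window_len              # drop any trailing remainder
    return [signal[i * window_len:(i + 1) * window_len] for i in range(n_windows)]

# Hypothetical usage: a 4-second window over 16 kHz audio.
audio = np.random.randn(16000 * 10)                    # 10 s placeholder signal
windows = segment_modality(audio, window_len=16000 * 4)
print(len(windows), windows[0].shape)                  # 2 windows of 64000 samples
```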
In addition, there may be one or more data groups each consisting of data to be recognized of at least two modalities; when there are multiple data groups, the emotion category of each data group needs to be determined, that is, emotion recognition is performed on each data group to determine its emotion recognition result.
Step 120, determining emotion probability distribution of the data to be identified of each modality based on the emotion identification model of each modality; the emotion recognition model is used for extracting the characteristics of the data to be recognized of the corresponding modality and recognizing the emotion based on the data characteristics obtained by the characteristic extraction;
the emotion recognition model of each modality is obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality;
specifically, in step 110, after determining the data to be identified of at least two modalities, step 120 may be performed to apply an emotion recognition model of each modality to determine an emotion probability distribution of the data to be identified of each modality, where the process specifically includes the following steps:
firstly, inputting data to be recognized of each modality into an emotion recognition model of the corresponding modality, then, performing feature extraction on the input data to be recognized of the corresponding modality by the emotion recognition model, extracting features capable of representing emotion information in the data to be recognized of the corresponding modality, so as to obtain data features of the data to be recognized of the corresponding modality, then, performing emotion recognition by the emotion recognition model according to the data features to determine emotion probability distribution of the data to be recognized of the corresponding modality, namely posterior probability that the emotion information represented by the data to be recognized of the corresponding modality belongs to various emotions, and finally obtaining emotion probability distribution of the data to be recognized of each modality output by the emotion recognition model of each modality.
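A minimal sketch of this per-modality inference step is given below; the PyTorch interface, the function name and the two-modality usage are illustrative assumptions, not part of the patent:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def modality_emotion_distribution(model: torch.nn.Module,
                                  data: torch.Tensor) -> torch.Tensor:
    """Run one modality's emotion recognition model and return its emotion
    probability distribution (posterior probability over emotion classes)."""
    model.eval()
    logits = model(data)              # feature extraction + recognition head
    return F.softmax(logits, dim=-1)  # posterior probability of each emotion

# Hypothetical usage with an audio and an image modality:
# p_audio = modality_emotion_distribution(audio_model, audio_window)
# p_image = modality_emotion_distribution(image_model, image_window)
```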
Before the data to be recognized of each modality is input into the emotion recognition model of the corresponding modality, the emotion recognition model of each modality can be obtained by pre-training. Unlike the traditional scheme, which trains the model in a multi-task learning manner that requires abstract representation information to be completely shared between different modalities (and in which, if this condition is not met, the model cannot be aggregated into a matched high-dimensional information expression, training deviates and prediction performance suffers), the embodiment of the invention trains the models using the consistency of the emotion information represented by the sample data of the modalities, thereby obtaining the trained emotion recognition model of each modality.
Specifically, during model training a large amount of sample data of at least two modalities is first collected, and the sample data features and prediction probability distribution of the sample data of each modality are determined through the initial emotion recognition model of that modality: the sample data features are obtained by feature extraction on the sample data of the corresponding modality, and the prediction probability distribution is obtained by emotion recognition based on those features. Then the feature similarity of the sample data features of the modalities in the same space and/or the distribution similarity between the prediction probability distributions of the modalities is applied to jointly train the initial emotion recognition models, yielding the trained emotion recognition model of each modality.
Here, the initial emotion recognition model of each modality may be an original emotion recognition model, that is, the model parameters of the original emotion recognition model are directly generated by a random number generator, or may be a pre-trained emotion recognition model, that is, sample data of a single modality is used for model training to obtain an optimal initial emotion recognition model in the modality.
In the embodiment of the invention, the pre-trained optimal initial emotion recognition model of each modality is loaded directly in the joint training process. This greatly shortens joint training and improves the training effect of the initial emotion recognition models during joint training, so that the trained emotion recognition models have better prediction performance; it thereby refines the joint training process and advances the overall multimodal emotion recognition pipeline.
In the traditional scheme, by contrast, emotion characterization vectors abstracted from different modalities are combined by shared fusion, emotion categories are then predicted jointly, and the error between the predicted value and the labeled value drives parameter updating.
In the joint training process of the embodiment of the invention, which incorporates feature similarity and/or distribution similarity, the initial emotion recognition models judge the feature similarity of the sample data features in the same space and/or the distribution similarity between the prediction probability distributions according to the sample emotion recognition results of the sample data: when the sample emotion recognition results of the sample data of the modalities are the same, that is, the sample data can form a positive sample data set, the feature similarity and/or the distribution similarity is made as high as possible; conversely, when the sample emotion recognition results differ, that is, the sample data can form a negative sample data set, the feature similarity and/or the distribution similarity is made as low as possible.
And step 130, determining emotion recognition results based on the emotion probability distribution of each modality.
Specifically, after the emotion probability distribution of the data to be recognized of each modality has been obtained in step 120, step 130 may be executed to determine the emotion recognition result from the emotion probability distributions of the modalities. The process may include the following steps:
First, the emotion probability distributions of the data to be recognized of the modalities can be fused to obtain a fused emotion probability distribution; the fusion may be addition, concatenation, weighted fusion and the like, which is not specifically limited in the embodiment of the invention.
Preferably, in the embodiment of the invention the fusion is weighted fusion: the posterior probabilities corresponding to the same emotion in the emotion probability distributions of the different modalities are fused with reference to the weight of each modality's data to be recognized, yielding the fused emotion probability distribution, which contains a fused posterior probability for each emotion.
The emotion recognition result can then be determined with reference to the fused emotion probability distribution: the largest fused posterior probability is found, the emotion category corresponding to it is determined, and that category serves as the emotion category to which the data to be recognized of all modalities jointly belongs, i.e., the emotion recognition result; the largest fused posterior probability and its corresponding emotion category can also be output together as the final emotion recognition result.
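The weighted fusion and selection of the largest fused posterior probability can be sketched as follows; the equal modality weights in the usage line are an illustrative assumption:

```python
import torch

def fuse_and_decide(distributions, weights):
    """Weighted fusion of per-modality emotion probability distributions,
    followed by selection of the emotion with the largest fused posterior."""
    fused = sum(w * p for w, p in zip(weights, distributions))  # fused distribution
    prob, idx = torch.max(fused, dim=-1)
    return int(idx), float(prob)   # emotion category index and its fused posterior

# Hypothetical usage with the two distributions from the previous sketch:
# emotion_id, confidence = fuse_and_decide([p_audio, p_image], weights=[0.5, 0.5])
```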
In the emotion recognition method provided by the invention, the initial emotion recognition models of the modalities are jointly trained with reference to the feature similarity of the sample data features of the modalities in the same space and/or the distribution similarity between their prediction probability distributions, so that during training the models fully learn how near or far the sample data features and/or prediction probability distributions of different modalities lie from each other. This provides key assistance for improving the reliability and accuracy of emotion recognition and overcomes the defect of the traditional scheme, in which the multi-task learning training mode requires abstract characterization information of different modalities to be completely shared, so that model training deviates and the training effect is poor when the models cannot be aggregated into a matched high-dimensional information expression. In addition, training the models with the complementary relation of the same emotion across modalities improves their generalization capability, the reliability of the emotion recognition result and the accuracy of the emotion recognition process.
Based on the above embodiment, fig. 2 is a schematic diagram of a training process of the emotion recognition model provided by the present invention, and as shown in fig. 2, the emotion recognition model of each modality is trained based on the following steps:
step 210, determining sample data characteristics and prediction probability distribution of sample data of each modality based on the initial emotion recognition model of each modality;
step 220, mapping the sample data features of the sample data of each modality to the same space to obtain the sample projection features of the sample data of each modality in the same space;
step 230, determining a joint training loss based on the feature similarity between the sample projection features of the sample data of each modality and/or the distribution similarity between the prediction probability distributions of the sample data of each modality;
and 240, performing parameter iteration on the initial emotion recognition model of each modality based on the joint training loss to obtain the emotion recognition model of each modality.
Specifically, the training process of the emotion recognition model of each modality includes the following steps:
firstly, step 210 is executed to determine an initial emotion recognition model of each modality, where the initial emotion recognition model may be an original emotion recognition model, that is, a model parameter thereof is directly generated by a random number generator, or may be a pre-trained emotion recognition model, that is, sample data of a single modality is used for model training to obtain an optimal initial emotion recognition model under the modality, then the sample data of each modality is input into the initial emotion recognition model of the corresponding modality, the initial emotion recognition model performs feature extraction on the input sample data of the corresponding modality, and performs emotion recognition based on the sample data features obtained by the feature extraction, so as to finally obtain a predicted probability distribution of the sample data of each modality output by the initial emotion recognition model of each modality, and the predicted probability distribution includes a predicted posterior probability that the sample data of the corresponding modality belongs to various emotions;
then, step 220 is executed to project the sample data features of the sample data of each modality to map the sample data features of the sample data of each modality to the same space, which can be implemented by means of MLP (multi layer Perceptron), that is, the sample data features of the sample data of each modality can be mapped to the same space by using a multi-layer Perceptron module, so as to obtain the projection features of the sample data features of each modality in the space, that is, the sample projection features of the sample data of each modality;
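A minimal projection-head sketch is given below; the layer sizes and the choice of one small MLP per modality are assumptions made for illustration:

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Maps one modality's sample data features into the shared space,
    producing the sample projection features used for the similarity terms."""
    def __init__(self, in_dim: int, shared_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, shared_dim),
            nn.ReLU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, features):
        return self.mlp(features)

# One head per modality; the input dimensions below are placeholders.
# project_image = ProjectionHead(in_dim=512)
# project_audio = ProjectionHead(in_dim=256)
```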
then, step 230 is executed to determine feature similarities between sample projection features of sample data of each modality, and/or determine distribution similarities between predicted probability distributions of the sample data of each modality, and calculate a loss in a joint training process, that is, a joint training loss, based on the feature similarities and/or the distribution similarities, where the feature similarities and/or the distribution similarities are specifically, the feature similarities between sample projection features of sample data of different modalities with the same sample emotion recognition result, and/or the distribution similarities between predicted probability distributions, and the feature similarities between sample projection features of sample data of different modalities with different sample emotion recognition results, and/or the distribution similarities between predicted probability distributions, and then determine the joint training loss according to the feature similarities and/or the distribution similarities of the sample emotion recognition results with the same sample emotion recognition results and under different conditions with sample emotion recognition results;
the training target of the initial emotion recognition model is to make the feature similarity between sample projection features and/or the distribution similarity between prediction probability distributions of sample data of each modality as high as possible under the condition that sample data of different modalities can form a positive sample data set, namely under the condition that sample emotion recognition results corresponding to the sample data are the same; correspondingly, when the sample data of different modalities form a negative sample data set, that is, when the emotion recognition results of the samples corresponding to the sample data are different, the feature similarity between the sample projection features and/or the distribution similarity between the prediction probability distributions of the sample data of each modality are made as low as possible.
Therefore, under the conditions that the feature similarity between the sample projection features of each sample data in the positive sample data set and/or the distribution similarity between the prediction probability distributions are high, and the feature similarity between the sample projection features of each sample data in the negative sample data set is low, it can be determined that the joint training loss is small; accordingly, in the case where the feature similarity between the sample projection features of each sample data in the positive sample data set and/or the distribution similarity between the prediction probability distributions is low, and/or the feature similarity between the sample projection features of each sample data in the negative sample data set is high, it can be determined that the joint training loss is large.
And then, executing step 240, performing parameter iteration on the initial emotion recognition model of each modality according to the joint training loss, specifically, adjusting parameters of the initial emotion recognition of each modality according to the joint training loss, so that the adjusted initial emotion recognition model of each modality can judge that the feature similarity between the projection features of the sample and/or the distribution similarity between the prediction probability distributions of the sample is as high as possible under the condition that the sample data of each modality belongs to the positive sample data set, judge that the feature similarity between the projection features of the sample is as low as possible under the condition that the sample data of each modality belongs to the negative sample data set, and finally obtain the emotion recognition model of each modality after training.
Based on the above embodiment, fig. 3 is a schematic flowchart of step 230 in the emotion recognition method provided by the present invention, as shown in fig. 3, step 230 includes:
231, selecting sample data of at least two modalities with the same sample emotion recognition result from the sample data of each modality as a positive sample data set, and selecting sample data of at least two modalities with different sample emotion recognition results from the sample data of each modality as a negative sample data set;
step 232, determining the contrast loss based on the feature similarity between the sample projection features of the sample data in the positive sample data set and the feature similarity between the sample projection features of the sample data in the negative sample data set;
step 233, determining distribution loss based on distribution similarity among the predicted probability distributions of each sample data in the positive sample data set;
based on the contrast loss and/or the distribution loss, a joint training loss is determined, step 234.
Specifically, in step 230, the process of determining the joint training loss according to the feature similarity between the sample projection features of the sample data of each modality and/or the distribution similarity between the prediction probability distributions of the sample data of each modality specifically includes the following steps:
Step 231: a positive sample data set is determined from the sample data of the modalities. Specifically, the sample emotion recognition result corresponding to the sample data of each modality is determined first; the sample emotion recognition result can be understood as the label of the sample data. Sample data is then selected across modalities according to these results to form the positive sample data set, i.e., sample data of at least two modalities with the same sample emotion recognition result is selected from the sample data of the modalities to establish the positive sample data set. It should be noted that the sample data in the positive sample data set corresponds to different modalities.
At the same time, sample data of at least two modalities with different sample emotion recognition results is selected from the sample data of the modalities to establish a negative sample data set; likewise, the sample data in the negative sample data set corresponds to different modalities.
Step 232: the feature similarity between the sample projection features of the sample data in the positive sample data set and the feature similarity between the sample projection features of the sample data in the negative sample data set are determined (the determination of the sample projection features is described in detail above and not repeated here), and from the two the contrastive learning loss of the joint training process of the initial emotion recognition models, i.e., the contrast loss, is determined. The feature similarity here may be expressed as cosine similarity, Euclidean distance, Minkowski distance and the like.
Step 233: the distribution similarity between the prediction probability distributions of the sample data in the positive sample data set is determined (the determination of the prediction probability distributions is described in detail above and not repeated here), and the distribution loss of the joint training process is determined from it. Specifically, KLD (Kullback-Leibler divergence, relative entropy) is introduced: the relative entropy between the predicted posterior probabilities, over the emotion categories, of the sample data in the positive sample data set serves as the difference loss between the prediction probability distributions of the sample data in the positive sample data set, i.e., the distribution loss of the joint training process.
Step 234: the joint training loss is determined from the contrast loss or the distribution loss, i.e., the contrast loss or the distribution loss is used directly as the joint training loss. Specifically, when the feature similarity between the sample projection features of the sample data in the positive sample data set is high and the feature similarity between the sample projection features of the sample data in the negative sample data set is low, the contrast loss is small and thus the joint training loss is small; similarly, when the distribution similarity between the prediction probability distributions of the sample data in the positive sample data set is high, the distribution loss is small and thus the joint training loss is small.
Correspondingly, when the feature similarity within the positive sample data set is low and/or the feature similarity within the negative sample data set is high, the contrast loss is large and thus the joint training loss is large; similarly, when the distribution similarity between the prediction probability distributions within the positive sample data set is low, the distribution loss is large and thus the joint training loss is large.
The joint training loss can also be determined from both the contrast loss and the distribution loss, i.e., the two are considered together: when both are small the joint training loss is small, when both are large the joint training loss is large, and when the contrast loss is large while the distribution loss is small, or vice versa, the magnitude of the joint training loss is measured by combining the weights of the contrast loss and the distribution loss.
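The contrast loss and distribution loss of steps 232 and 233 can be sketched as follows; the InfoNCE-style normalization, the temperature and the use of cosine similarity are assumptions consistent with the description above rather than a definitive implementation of the patent:

```python
import torch
import torch.nn.functional as F

def contrast_loss(pos_v, pos_w, neg_v, neg_w, temperature: float = 0.1):
    """Contrastive loss over cross-modal pairs of sample projection features.

    pos_v, pos_w: (d,) projection features of a positive pair (same label).
    neg_v, neg_w: (N, d) projection features forming negative pairs (different labels).
    """
    s_pos = F.cosine_similarity(pos_v, pos_w, dim=-1) / temperature            # S(v0, w0)
    s_neg_w = F.cosine_similarity(pos_v.unsqueeze(0), neg_w, dim=-1) / temperature
    s_neg_v = F.cosine_similarity(neg_v, pos_w.unsqueeze(0), dim=-1) / temperature
    logits = torch.cat([s_pos.view(1), s_neg_w.view(-1), s_neg_v.view(-1)])
    return -F.log_softmax(logits, dim=0)[0]   # the positive pair should dominate

def distribution_loss(q_v: torch.Tensor, q_w: torch.Tensor):
    """KL divergence KL(q_v || q_w) between the prediction probability
    distributions of a positive cross-modal pair (same sample emotion label)."""
    return F.kl_div(q_w.log(), q_v, reduction="sum")
```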
Based on the above embodiment, fig. 4 is a schematic flowchart of step 240 in the emotion recognition method provided by the present invention, and as shown in fig. 4, step 240 includes:
241, determining the prediction loss of the initial emotion recognition model of each modality based on the prediction probability distribution of the sample data of each modality and the sample emotion recognition result corresponding to the sample data of each modality;
and 242, performing parameter iteration on the initial emotion recognition model of each modality based on the prediction loss and the joint training loss to obtain an emotion recognition model of each modality.
In the embodiment of the invention, when parameter iteration is performed on the initial emotion recognition models of the modalities to obtain the emotion recognition models, not only the joint training loss of the initial emotion recognition models in the joint training process but also the prediction loss of each modality's initial emotion recognition model can be considered.
Therefore, in step 240, the process of performing parameter iteration on the initial emotion recognition model of each modality according to the joint training loss to obtain an emotion recognition model of each modality specifically includes the following steps:
Step 241: the prediction loss incurred by the initial emotion recognition model of each modality during emotion recognition is determined, i.e., it is calculated from the prediction probability distribution of the sample data of each modality and the sample emotion recognition result corresponding to that sample data. Specifically, the prediction loss can be computed with the cross-entropy criterion (Cross Entropy, CE): the difference between the predicted posterior probabilities corresponding to the emotions in the prediction probability distribution of the sample data of each modality and the sample emotion recognition result corresponding to the sample data of that modality is calculated, this difference being the error between the predicted value output by the initial emotion recognition model and the true labeled value, and the prediction loss of each modality's initial emotion recognition model is determined from it.
Step 242: parameter iteration is performed on the initial emotion recognition models according to the prediction loss and the joint training loss to obtain the emotion recognition model of each modality. Specifically, with the prediction loss and the joint training loss as reference, the error back propagation (Back Propagation, BP) algorithm is used to update the parameters of each modality's initial emotion recognition model. Joint training with multiple loss functions can greatly improve model performance, so that the trained emotion recognition models have better feature extraction and emotion recognition capability, and adjusting the model parameters according to the prediction loss brings the model output closer to the true labels.
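One joint parameter-iteration step combining the prediction losses with the joint training loss might look as follows; the two-modality setup, the (features, logits) model interface, the label-aligned batch and the reuse of contrast_loss and distribution_loss from the earlier sketch are all assumptions made to keep the example self-contained:

```python
import torch
import torch.nn.functional as F

def training_step(models, heads, optimizer, batch, labels, alpha=1.0, beta=1.0):
    """One joint training step for an image model and an audio model.

    models[m](x) is assumed to return (features, logits) for modality m, and
    heads[m] is that modality's projection MLP into the shared space; the batch
    is assumed to be aligned across modalities, so sample 0 of both modalities
    shares one emotion label and forms a positive pair.
    """
    feats_v, logits_v = models["image"](batch["image"])
    feats_w, logits_w = models["audio"](batch["audio"])
    # Prediction losses L_v and L_w (cross entropy against the sample labels).
    l_v = F.cross_entropy(logits_v, labels)
    l_w = F.cross_entropy(logits_w, labels)
    # Sample projection features in the shared space.
    proj_v, proj_w = heads["image"](feats_v), heads["audio"](feats_w)
    # Negative pairs: cross-modal pairs whose labels differ from sample 0's label.
    neg_mask = labels != labels[0]
    l_contrast = contrast_loss(proj_v[0], proj_w[0], proj_v[neg_mask], proj_w[neg_mask])
    l_kld = distribution_loss(F.softmax(logits_v[0], dim=-1),
                              F.softmax(logits_w[0], dim=-1))
    loss = l_v + l_w + alpha * l_kld + beta * l_contrast  # L = L_v + L_w + αL_KLD + βL_contrast
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```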
In the embodiment of the invention, the loss of the initial emotion recognition models during training is determined at three different levels: the sample projection features, the prediction probability distributions and the predicted recognition results of the sample data. Adjusting the parameters according to these losses optimizes model performance from different angles; the superposition of these optimizations substantially improves the emotion recognition capability of the trained emotion recognition models, refines the emotion recognition process, and provides strong support for improving the reliability and accuracy of the emotion recognition result.
Based on the above embodiment, the prediction loss, the distribution loss, the contrast loss, and the model overall loss of the initial emotion recognition model of each modality can be expressed by the following formulas:
the following describes the loss function using an audio modality and an image modality as examples:
the calculation formula of the prediction loss of the initial emotion recognition model of each modality is as follows:
Figure BDA0003740154750000151
Figure BDA0003740154750000152
wherein L is v The prediction loss for the initial emotion recognition model of the image modality,
Figure BDA0003740154750000153
and v represents the sample data characteristics of the sample data of the image modality.
L w The prediction loss for the initial emotion recognition model of the audio modality,
Figure BDA0003740154750000154
and the predicted posterior probability corresponding to the sample emotion recognition result j corresponding to the sample data representing the audio modality can be obtained through calculation of a Softmax function, and w represents the sample data characteristics of the sample data of the audio modality.
The distribution loss can be expressed as:

L_KLD = KL(q_v || q_w) = Σ_j q_v(j) · log( q_v(j) / q_w(j) )

where L_KLD denotes the distribution loss of the joint training process, q_v is the predicted posterior probability distribution corresponding to the sample emotion recognition result of the image-modality sample data, q_w is that of the audio-modality sample data, and v and w denote the sample data features of the image-modality and audio-modality sample data respectively; the audio-modality and image-modality sample data here have the same sample emotion recognition result, i.e., they are sample data of the positive sample data set.
The loss of contrast can then be expressed by the following equation:
L_contrast = -log( exp(S(v_0, w_0)) / ( exp(S(v_0, w_0)) + Σ_k exp(S(v_0, w_k^-)) + Σ_k exp(S(v_k^-, w_0)) ) )

wherein L_contrast represents the contrast loss in the joint training process, (v_0, w_0) represents a positive sample data pair, (v_0, w_k^-) and (v_k^-, w_0) represent negative sample data pairs, S(v_0, w_0) represents the feature similarity between the sample projection feature v_0 of the sample data of the image modality and the sample projection feature w_0 of the sample data of the audio modality in the positive sample data set, S(v_0, w_k^-) represents the feature similarity between the sample projection feature v_0 of the sample data of the image modality and the sample projection feature w_k^- of the sample data of the audio modality in the negative sample data set, and S(v_k^-, w_0) represents the feature similarity between the sample projection feature v_k^- of the sample data of the image modality in the negative sample data set and the sample projection feature w_0 of the sample data of the audio modality.
The feature similarity between the sample projection features of each sample data in the positive sample data set can be calculated by the following formula:
S(v_0, w_0) = (v_0)^T · w_0 / (||v_0|| · ||w_0||)

wherein (v_0)^T represents the transpose of the sample projection feature v_0 of the sample data of the image modality in the positive sample data set, ||v_0|| denotes the norm of v_0, and ||w_0|| denotes the norm of the sample projection feature w_0 of the sample data of the audio modality.
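A minimal sketch of this cosine similarity and the contrast loss above (PyTorch assumed; the InfoNCE-style form follows the formula as reconstructed here, without a temperature term):

import torch
import torch.nn.functional as F

def contrast_loss(v0: torch.Tensor, w0: torch.Tensor,
                  v_neg: torch.Tensor, w_neg: torch.Tensor) -> torch.Tensor:
    # S(a, b) = a^T b / (||a|| * ||b||), computed along the last dimension
    def S(a, b):
        return F.cosine_similarity(a, b, dim=-1)

    pos = torch.exp(S(v0, w0))                                  # exp(S(v0, w0))
    neg = torch.exp(S(v0.unsqueeze(0), w_neg)).sum() + \
          torch.exp(S(v_neg, w0.unsqueeze(0))).sum()            # all negative pairs
    return -torch.log(pos / (pos + neg))

# toy projection features: one positive pair plus 5 negatives per modality
loss = contrast_loss(torch.randn(128), torch.randn(128),
                     torch.randn(5, 128), torch.randn(5, 128))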
The model overall loss can then be expressed by the following formula:
L = L_v + L_w + α·L_KLD + β·L_contrast

wherein L represents the model overall loss, L_v is the prediction loss of the initial emotion recognition model of the image modality, L_w is the prediction loss of the initial emotion recognition model of the audio modality, L_KLD is the distribution loss, α is the weight corresponding to the distribution loss, L_contrast is the contrast loss, and β is the weight corresponding to the contrast loss.
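A sketch of one joint parameter-iteration step under this overall loss (placeholder scalar values stand in for the four losses computed as in the sketches above; α = β = 0.5 is only an assumed setting):

import torch

# placeholders for L_v, L_w, L_KLD and L_contrast computed as sketched above
L_v = torch.tensor(1.2, requires_grad=True)
L_w = torch.tensor(0.9, requires_grad=True)
L_KLD = torch.tensor(0.3, requires_grad=True)
L_contrast = torch.tensor(0.7, requires_grad=True)
alpha, beta = 0.5, 0.5                              # assumed loss weights

L = L_v + L_w + alpha * L_KLD + beta * L_contrast   # model overall loss
L.backward()   # error back propagation: gradients reach both modalities' models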
Based on the above embodiment, fig. 5 is a schematic diagram of a training process of the initial emotion recognition model provided by the present invention, and as shown in fig. 5, the initial emotion recognition model of each modality is trained based on the following steps:
step 510, determining sample data of at least two modalities;
step 520, inputting the sample data of each modality into the first emotion recognition model of the corresponding modality to obtain first prediction probability distribution of each modality output by the first emotion recognition model;
step 530, performing parameter iteration on the first emotion recognition model of each modality based on the first prediction probability distribution and the sample emotion recognition result to obtain an initial emotion recognition model of each modality.
Specifically, before training to obtain the emotion recognition models of each modality, the initial emotion recognition models of each modality need to be determined, and in order to shorten the time of the joint training and refine the joint training process, in the embodiment of the present invention, the initial emotion recognition models of each modality can be obtained by training in advance, and the training process may include the following steps:
firstly, step 510 is executed to determine sample data of at least two modalities, and the sample data is used as a sample data set, where the sample data set includes sample data of different modalities with the same sample emotion recognition result, and also includes sample data of different modalities with different sample emotion recognition results, and it can also be understood that the sample data set here includes sample data that can constitute a positive sample data set, and also includes sample data that can constitute a negative sample data set;
the sample data capable of forming the positive sample data set can be determined based on the same sample multi-modal data, namely, the sample multi-modal data can be subjected to modal separation and time interval division to obtain the sample data in the same time interval, and the same time interval can ensure that the sample emotion recognition results are the same.
Correspondingly, the sample data forming the negative sample data set can be determined based on the same sample multi-modal data or different sample multi-modal data, that is, the same sample multi-modal data or different sample multi-modal data can be subjected to modal separation and time interval division, so as to screen out sample data of different modes with different sample emotion recognition results.
Then, the sample data of each modality can be input into the first emotion recognition model of the corresponding modality, the first emotion recognition model can be understood as an untrained initial emotion recognition model, the first emotion recognition model performs feature extraction on the input sample data of the corresponding modality, emotion recognition is performed on the basis of the features of the sample data obtained by the feature extraction, and finally, a first prediction probability distribution of the sample data of each modality output by the first emotion recognition model of each modality can be obtained;
after that, parameter iteration is performed on the first emotion recognition model of each modality based on the first prediction probability distribution and the sample emotion recognition result, so as to obtain the initial emotion recognition model of each modality. Specifically, a first prediction loss of the first emotion recognition model of each modality is calculated by using a Cross Entropy Criterion (CE), that is, on the basis of the first prediction probability distribution, the loss of the first emotion recognition model of each modality in the training process is determined by combining the sample emotion recognition result: a first difference between the first predicted posterior probability corresponding to each type of emotion in the first prediction probability distribution and the sample emotion recognition result is determined, where the first difference is the error between the predicted value output by the first emotion recognition model and the real labeled value, and the first prediction loss of the first emotion recognition model of each modality is determined according to the first difference; then, with the first prediction loss as a reference, the parameters of the first emotion recognition model of the corresponding modality are adjusted so that the error between the predicted value output by the adjusted first emotion recognition model and the real labeled value approaches 0, and thus the initial emotion recognition model of each modality can be obtained through training.
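As an illustrative sketch of this pre-training stage (PyTorch assumed; the tiny fully connected network below only stands in for the real per-modality model):

import torch
import torch.nn as nn
import torch.nn.functional as F

# toy "first emotion recognition model" for one modality: a small feature
# extractor followed by a linear classification layer
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(16, 40)            # stand-in sample data of one modality
y = torch.randint(0, 4, (16,))     # sample emotion recognition results

for _ in range(100):               # parameter iteration
    logits = model(x)
    first_prediction_loss = F.cross_entropy(logits, y)  # cross-entropy first prediction loss
    optimizer.zero_grad()
    first_prediction_loss.backward()                    # error back propagation
    optimizer.step()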
In the embodiment of the invention, the initial emotion recognition models of all the modes are obtained through training, and the optimal initial emotion recognition models under all the modes obtained through pre-training can be directly loaded in the subsequent joint training process, so that the time of the joint training is greatly shortened, the training effect of the initial emotion recognition models of all the modes in the joint training process is better, the prediction performance of the emotion recognition models of all the modes obtained through training is better, the joint training process is refined, and the overall process of multi-mode emotion recognition is promoted.
Based on the above embodiment, fig. 6 is a schematic flowchart of step 510 in the emotion recognition method provided by the present invention, and as shown in fig. 6, step 510 includes:
step 511, determining initial sample data of at least two modes, wherein the at least two modes comprise at least two of an audio mode, an image mode, a text mode, an electroencephalogram mode, a behavior mode and a genetic mode;
and 512, performing time interval division on the initial sample data of each mode to obtain the sample data of each mode.
Specifically, in step 510, the process of determining sample data of at least two modalities may specifically include the following two steps:
firstly, step 511 is executed to determine initial sample data of at least two modalities, where the initial sample data may be understood as initial sample data of each modality separated from sample multi-modality data, where the at least two modalities may be at least two of an audio modality, an image modality, a text modality, a behavior modality, a genetic modality, and an electroencephalogram modality, that is, at least two modalities need to be determined from the modalities, and initial sample data of the at least two modalities is determined;
the behavior mode can be behavior action, gesture, psychological behavior and the like, and the corresponding initial sample data can be behavior action data, gesture data, psychological behavior data, corresponding test data and the like; genetic modalities are understood to mean biological characteristics such as blood, heart rate changes, cell activity, hormonal secretion status, muscle contraction and relaxation.
Then, step 512 is executed. Since the initial sample data is obtained by performing modality separation on the sample multi-modal data, only time interval division needs to be performed on the initial sample data of each modality, so that the specification of the divided initial sample data can be adapted to the size of the input window of the first emotion recognition model. Specifically, the initial sample data of each modality can be segmented with a preset time window to obtain the sample data of each modality: the initial sample data of each modality in each time interval obtained by segmentation can be directly used as the sample data of each modality, or the sample data capable of establishing a positive sample data set and a negative sample data set can be selected from the initial sample data of each modality in each time interval and used as the sample data of each modality.
It should be noted that the window length of the preset time window here is fixed, and may be preset according to actual situations, for example, 4 seconds, 5 seconds, 6 seconds, and the like, and preferably, in the embodiment of the present invention, the window length of the preset time window is determined to be 5 seconds, that is, every 5 seconds, the initial sample data of each modality is sliced, so that the sample data of each modality can be obtained.
In addition, it is noted that if the duration of the initial sample data of any modality is less than the window length of the preset time window, the initial sample data of the modality needs to be copied and spliced, that is, multiple copies of the initial sample data of the modality are copied, and the copied initial sample data is spliced with the original initial sample data of the modality, so that the duration corresponding to the spliced initial sample data can be greater than or equal to the window length of the preset time window, and then the spliced initial sample data can be subjected to time interval division.
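A minimal sketch of this time-interval division with copy-and-splice padding (NumPy assumed; the 16 kHz sampling rate and the handling of any trailing remainder are illustrative assumptions, while the 5-second window follows the text above):

import numpy as np

def slice_into_windows(samples: np.ndarray, rate: int, window_sec: int = 5) -> list:
    window = window_sec * rate
    if len(samples) < window:
        # copy and splice the data with itself until it reaches the window length
        repeats = int(np.ceil(window / len(samples)))
        samples = np.tile(samples, repeats)
    n = len(samples) // window
    return [samples[i * window:(i + 1) * window] for i in range(n)]

# hypothetical usage: 12 s of audio at 16 kHz -> two 5-second windows
chunks = slice_into_windows(np.random.randn(12 * 16000), rate=16000)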
Based on the above embodiment, fig. 7 is a schematic flowchart of step 130 in the emotion recognition method provided by the present invention, and as shown in fig. 7, step 130 includes:
step 131, performing weighted fusion on the emotion probability distribution of each modality to obtain fused emotion probability distribution;
and step 132, determining emotion recognition results based on the fusion emotion probability distribution.
Specifically, in step 130, the process of determining the emotion recognition result according to the emotion probability distribution of each modality specifically includes the following steps:
step 131, firstly, the emotion probability distributions of the data to be recognized in each modality can be weighted and fused to obtain a fused emotion probability distribution, that is, posterior probabilities corresponding to the same emotion in the emotion probability distributions of the data to be recognized in different modalities can be weighted and fused, and it can also be understood that the posterior probabilities corresponding to the same emotion in the emotion probability distributions of each modality are fused by taking the weight of the data to be recognized in each modality as a reference to obtain a fused emotion probability distribution, where the fused emotion probability distribution includes the fused posterior probabilities corresponding to various emotions;
step 132, then, the fused emotion probability distribution may be referred to determine an emotion recognition result, that is, the maximum fused posterior probability may be directly selected from the fused emotion probability distribution, and an emotion category corresponding to the maximum fused posterior probability is determined, and the emotion category is used as an emotion category to which the data to be recognized of each modality belongs uniformly, that is, an emotion recognition result, or the maximum fused posterior probability and the emotion category corresponding to the maximum fused posterior probability may be used together as a final emotion recognition result, which is not specifically limited in the embodiment of the present invention.
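An illustrative sketch of this weighted fusion and decision step (NumPy assumed; the emotion categories, weights and probability values are made-up examples):

import numpy as np

def fuse_and_decide(probs_by_modality: dict, weights: dict):
    # weighted fusion of the per-modality posteriors for each emotion
    fused = sum(weights[m] * np.asarray(p) for m, p in probs_by_modality.items())
    best = int(np.argmax(fused))          # emotion with the maximum fused posterior
    return best, float(fused[best])

probs = {"image": [0.6, 0.3, 0.1], "audio": [0.4, 0.5, 0.1]}
weights = {"image": 0.5, "audio": 0.5}    # assumed equal modality weights
emotion_index, fused_posterior = fuse_and_decide(probs, weights)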
Based on the above embodiments, fig. 8 is an overall framework diagram of a training process of an emotion recognition model provided by the present invention, and as shown in fig. 8, an overall flow of the training process of the emotion recognition model is described by taking an image modality and an audio modality as examples:
firstly, an initial emotion recognition model of an audio modality and an initial emotion recognition model of an image modality are determined, which can be obtained through training by the following steps:
the training process of the initial emotion recognition model of the image modality comprises the following steps:
firstly, determining initial sample data of an image modality, and performing time interval division on the initial sample data of the image modality to obtain sample data of the image modality;
then, sample data of the image modality can be input into a first emotion recognition model of the image modality, and then feature extraction is performed on the sample data of the image modality by utilizing a plurality of convolution modules, residual modules and pooling modules in the first emotion recognition model of the image modality to obtain sample data features v of the sample data of the image modality; here, the first emotion recognition model of the image modality is constructed on the basis of a Residual Network (ResNet);
then, emotion recognition can be performed through a linear layer (Fully Connected Layer) with the sample data features v as a reference, so as to obtain the first prediction probability distribution output by the first emotion recognition model of the image modality;
then, a first prediction loss of the first emotion recognition model of the image modality may be determined by combining a sample emotion recognition result corresponding to the sample data of the image modality and a first prediction probability distribution output by the first emotion recognition model of the image modality, and specifically, the first prediction loss may be calculated by using a Cross Entropy criterion (Cross Entropy, CE)
L_v = -log(q_v^i)

wherein q_v^i represents the predicted posterior probability corresponding to the sample emotion recognition result i corresponding to the sample data of the image modality;
finally, with the first prediction loss L_v as a reference, parameter iteration can be performed on the first emotion recognition model of the image modality through an Error Back Propagation (BP) algorithm, so as to obtain the initial emotion recognition model of the image modality.
The training process of the initial emotion recognition model of the audio modality is as follows:
firstly, determining initial sample data of an audio mode, and performing time interval division on the initial sample data of the audio mode to obtain sample data of the audio mode;
then, inputting the sample data of the audio modality into a first emotion recognition model of the audio modality, and then performing feature extraction on the input sample data of the audio modality by the first emotion recognition model of the audio modality to obtain sample data features w of the sample data of the audio modality; here, the first emotion recognition model of the audio modality is constructed on the basis of a Time Delay Neural Network (TDNN);
then, emotion recognition can be performed through a linear layer (Fully Connected Layer) with the sample data features w as a reference, so as to obtain the first prediction probability distribution output by the first emotion recognition model of the audio modality;
then, a first prediction loss of the first emotion recognition model of the audio modality may be determined by combining the sample emotion recognition result corresponding to the sample data of the audio modality and the first prediction probability distribution output by the first emotion recognition model of the audio modality, and specifically, the first prediction loss may be calculated by using a cross entropy criterion
L_w = -log(q_w^j)

wherein q_w^j represents the predicted posterior probability corresponding to the sample emotion recognition result j corresponding to the sample data of the audio modality;
finally, with the first prediction loss L_w as a reference, parameter iteration can be performed on the first emotion recognition model of the audio modality using the error back propagation algorithm to obtain the initial emotion recognition model of the audio modality.
Secondly, determining sample data characteristics and prediction probability distribution of sample data of the image modality by applying an initial emotion recognition model of the image modality, and determining the sample data characteristics and the prediction probability distribution of the sample data of the audio modality by applying an initial emotion recognition model of the audio modality;
thirdly, mapping the sample data features of the sample data of the image modality and the sample data features of the sample data of the audio modality to the same space through an MLP (Multi-Layer Perceptron) module, so as to obtain the sample projection features of the sample data of the image modality and the sample projection features of the sample data of the audio modality in the same space; an illustrative sketch is given below;
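For illustration, the projection of the two modalities' features into a shared space might look like the following (PyTorch assumed; all dimensions are placeholders, not values from the patent):

import torch
import torch.nn as nn

# minimal MLP projection heads mapping image features v and audio features w
# into one shared space so that their similarity can be compared
project_image = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
project_audio = nn.Sequential(nn.Linear(192, 256), nn.ReLU(), nn.Linear(256, 128))

v = torch.randn(8, 512)    # sample data features of image-modality sample data
w = torch.randn(8, 192)    # sample data features of audio-modality sample data
v_proj = project_image(v)  # sample projection features in the shared space
w_proj = project_audio(w)  # sample projection features in the shared space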
fourthly, determining the joint training loss according to the feature similarity between the sample projection features of the sample data in the image mode and the sample projection features of the sample data in the audio mode, and/or the distribution similarity between the prediction probability distribution of the sample data in the image mode and the prediction probability distribution of the sample data in the audio mode;
the process for determining the loss of the joint training specifically comprises the following steps:
firstly, selecting sample data with the same sample emotion recognition result from the sample data of an image modality and the sample data of an audio modality, and using the sample data as a positive sample data set, wherein the sample data in the positive sample data set correspond to different modalities; meanwhile, sample data with different sample emotion recognition results can be selected from the sample data of the image modality and the sample data of the audio modality to serve as a negative sample data set, and the sample data in the negative sample data set also corresponds to different modalities;
then, the Contrast Loss in the joint training process is determined with reference to the feature similarity between the sample projection features of the image-modality sample data and the audio-modality sample data in the positive sample data set, and the feature similarity between the sample projection features of the image-modality sample data and the audio-modality sample data in the negative sample data set;
meanwhile, the distribution Loss (KLD Loss) in the joint training process can be determined by taking the distribution similarity between the prediction probability distribution of the sample data of the image modality in the positive sample data set and the prediction probability distribution of the sample data of the audio modality as a reference;
thereafter, the joint training loss may be determined based on the contrast loss and/or the distribution loss.
Fifthly, the initial emotion recognition model of the image modality and the initial emotion recognition model of the audio modality can be jointly trained according to the joint training loss, namely, the initial emotion recognition model of the image modality and the initial emotion recognition model of the audio modality are subjected to parameter iteration, so that the emotion recognition model of the image modality and the emotion recognition model of the audio modality are obtained.
Further, in the parameter iteration process, parameter adjustment can be performed jointly by combining the prediction loss of the initial emotion recognition model of the image modality and the prediction loss of the initial emotion recognition model of the audio modality on the basis of the joint training loss, so that the trained emotion recognition model of the image modality and the trained emotion recognition model of the audio modality are obtained.
The determining process of the prediction Loss (CE Loss) of the initial emotion recognition model may specifically be determining the prediction Loss of the initial emotion recognition model of the image modality according to the prediction probability distribution of the sample data of the image modality and the sample emotion recognition result corresponding to the sample data of the image modality, and similarly, determining the prediction Loss of the initial emotion recognition model of the audio modality according to the prediction probability distribution of the sample data of the audio modality and the sample emotion recognition result corresponding to the sample data of the audio modality.
According to the method provided by the embodiment of the invention, the training of the emotion recognition model is divided into two stages: the initial emotion recognition model of each modality is obtained through early-stage training, and in the later stage the optimal pre-trained initial emotion recognition model of each modality is directly loaded and jointly trained with the joint training loss and the prediction loss, so that unified optimization of the initial emotion recognition models of all modalities can be realized; combined with the joint training process over multiple objective functions, the models can fully learn, during training, the near-far relationship among the sample data features and/or prediction probability distributions corresponding to sample data of different modalities, thereby providing critical assistance for improving the accuracy and precision of emotion recognition; in addition, training the models with the complementary relationship of the same emotion among different modalities can improve the generalization capability of the models, and improve the reliability of the emotion recognition result and the accuracy of the emotion recognition process.
The emotion recognition apparatus provided by the present invention is described below, and the emotion recognition apparatus described below and the emotion recognition method described above may be referred to in correspondence with each other.
Fig. 9 is a schematic structural diagram of an emotion recognition apparatus provided in the present invention, and as shown in fig. 9, the apparatus includes:
a to-be-identified data determining unit 910, configured to determine to-be-identified data of at least two modalities;
a probability distribution determining unit 920, configured to determine, based on an emotion recognition model of each modality, an emotion probability distribution of to-be-recognized data of each modality; the emotion recognition model is used for extracting features of data to be recognized of corresponding modalities and recognizing emotion based on the data features obtained by feature extraction; the emotion recognition model of each modality is obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality;
a recognition result determining unit 930 configured to determine an emotion recognition result based on the emotion probability distributions of the respective modalities.
In the emotion recognition device provided by the invention, the initial emotion recognition models of each modality are jointly trained with reference to the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality, so that the models can fully learn, during training, the near-far relationship among the sample data features and/or prediction probability distributions corresponding to sample data of different modalities, thereby providing key assistance for improving the accuracy and precision of emotion recognition, and overcoming the defect of the traditional scheme that the multi-task learning training mode requires the abstract representation information of different modalities to be completely shared, which biases model training and degrades the training effect when the models cannot be aggregated into a matched high-dimensional representation; moreover, the models are trained by utilizing the complementary relationship of the same emotion among different modalities, which can improve the generalization capability of the models and improve the reliability of the emotion recognition result and the accuracy of the emotion recognition process.
Based on the above embodiment, the apparatus further includes a joint training unit, configured to:
determining sample data characteristics and a prediction probability distribution of sample data of each modality based on the initial emotion recognition model of each modality;
mapping the sample data features of the sample data of each mode to the same space to obtain the sample projection features of the sample data of each mode in the same space;
determining a joint training loss based on feature similarity between sample projection features of the sample data of each modality and/or distribution similarity between predicted probability distributions of the sample data of each modality;
and performing parameter iteration on the initial emotion recognition models of the modes based on the joint training loss to obtain the emotion recognition models of the modes.
Based on the above embodiments, the joint training unit is configured to:
selecting sample data of at least two modals with the same sample emotion recognition result from the sample data of each modality as a positive sample data set, and selecting sample data of at least two modals with different sample emotion recognition results from the sample data of each modality as a negative sample data set;
determining a contrast loss based on the feature similarity between the sample projection features of the sample data in the positive sample data set and the feature similarity between the sample projection features of the sample data in the negative sample data set;
determining distribution loss based on distribution similarity among the prediction probability distributions of the sample data in the positive sample data set;
determining a joint training loss based on the contrast loss and/or the distribution loss.
Based on the above embodiments, the joint training unit is configured to:
determining the prediction loss of the initial emotion recognition model of each modality based on the prediction probability distribution of the sample data of each modality and the sample emotion recognition result corresponding to the sample data of each modality;
and performing parameter iteration on the initial emotion recognition model of each modality based on the prediction loss and the joint training loss to obtain the emotion recognition model of each modality.
Based on the above embodiment, the apparatus further includes an initial model training unit, configured to:
determining sample data of at least two modalities;
inputting the sample data of each modal into a first emotion recognition model of a corresponding modal to obtain first prediction probability distribution of each modal output by the first emotion recognition model;
and performing parameter iteration on the first emotion recognition models of all the modalities based on the first prediction probability distribution and sample emotion recognition results to obtain initial emotion recognition models of all the modalities.
Based on the above embodiment, the initial model training unit is configured to:
determining initial sample data of at least two modals, wherein the at least two modals comprise at least two of an audio modality, an image modality, a text modality, an electroencephalogram modality, a behavior modality and a genetic modality;
and dividing time intervals of the initial sample data of each mode to obtain the sample data of each mode.
Based on the above embodiment, the recognition result determining unit 930 is configured to:
carrying out weighted fusion on the emotion probability distribution of each mode to obtain fused emotion probability distribution;
and determining an emotion recognition result based on the fusion emotion probability distribution.
Fig. 10 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 10: a processor (processor) 1010, a communication Interface (Communications Interface) 1020, a memory (memory) 1030, and a communication bus 1040, wherein the processor 1010, the communication Interface 1020, and the memory 1030 are in communication with each other via the communication bus 1040. Processor 1010 may invoke logic instructions in memory 1030 to perform a method of emotion recognition, the method comprising: determining data to be identified of at least two modalities; determining emotion probability distribution of data to be identified of each modality based on an emotion identification model of each modality; determining emotion recognition results based on the emotion probability distribution of each modality; the emotion recognition model is used for extracting features of data to be recognized of corresponding modalities and recognizing emotion based on the data features obtained by feature extraction; the emotion recognition model of each modality is obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality.
Furthermore, the logic instructions in the memory 1030 can be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the emotion recognition method provided by the above-mentioned methods, the method comprising: determining data to be identified of at least two modalities; determining emotion probability distribution of data to be identified of each modality based on an emotion identification model of each modality; determining emotion recognition results based on the emotion probability distribution of each modality; the emotion recognition model is used for extracting features of data to be recognized of corresponding modalities and recognizing emotion based on the data features obtained by feature extraction; the emotion recognition model of each modality is obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality.
In still another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the emotion recognition method provided by the above methods, the method including: determining data to be identified of at least two modalities; determining emotion probability distribution of data to be identified of each modality based on an emotion identification model of each modality; determining emotion recognition results based on the emotion probability distribution of each modality; the emotion recognition model is used for extracting features of data to be recognized of corresponding modalities and recognizing emotion based on the data features obtained by feature extraction; the emotion recognition model of each modality is obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on the understanding, the above technical solutions substantially or otherwise contributing to the prior art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of emotion recognition, comprising:
determining data to be identified of at least two modalities;
determining emotion probability distribution of data to be identified of each modality based on an emotion identification model of each modality;
determining emotion recognition results based on the emotion probability distribution of each modality;
the emotion recognition model is used for extracting features of data to be recognized of corresponding modalities and recognizing emotion based on the data features obtained by feature extraction;
the emotion recognition model of each modality is obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality.
2. The emotion recognition method according to claim 1, wherein the emotion recognition models of the respective modalities are trained based on:
determining sample data characteristics and a prediction probability distribution of sample data of each modality based on the initial emotion recognition model of each modality;
mapping the sample data features of the sample data of each mode to the same space to obtain the sample projection features of the sample data of each mode in the same space;
determining a joint training loss based on feature similarity between sample projection features of the sample data of each modality and/or distribution similarity between predicted probability distributions of the sample data of each modality;
and performing parameter iteration on the initial emotion recognition models of all the modalities based on the joint training loss to obtain the emotion recognition models of all the modalities.
3. The emotion recognition method of claim 2, wherein determining a joint training loss based on feature similarities between sample projected features of sample data of the respective modalities and/or distribution similarities between predicted probability distributions of the sample data of the respective modalities comprises:
selecting sample data of at least two modalities with the same sample emotion recognition result from the sample data of each modality as a positive sample data set, and selecting sample data of at least two modalities with different sample emotion recognition results from the sample data of each modality as a negative sample data set;
determining a contrast loss based on the feature similarity between the sample projection features of the sample data in the positive sample data set and the feature similarity between the sample projection features of the sample data in the negative sample data set;
determining distribution loss based on distribution similarity among the prediction probability distributions of the sample data in the positive sample data set;
determining a joint training loss based on the contrast loss and/or the distribution loss.
4. The emotion recognition method of claim 2, wherein the performing parameter iteration on the initial emotion recognition model of each modality based on the joint training loss to obtain the emotion recognition model of each modality comprises:
determining the prediction loss of the initial emotion recognition model of each modality based on the prediction probability distribution of the sample data of each modality and the sample emotion recognition result corresponding to the sample data of each modality;
and performing parameter iteration on the initial emotion recognition models of all the modals on the basis of the prediction loss and the joint training loss to obtain the emotion recognition models of all the modals.
5. The emotion recognition method according to any one of claims 2 to 4, wherein the initial emotion recognition models of the respective modalities are trained based on:
determining sample data of at least two modalities;
inputting the sample data of each modal into a first emotion recognition model of a corresponding modal to obtain first prediction probability distribution of each modal output by the first emotion recognition model;
and performing parameter iteration on the first emotion recognition model of each modality based on the first prediction probability distribution and a sample emotion recognition result to obtain an initial emotion recognition model of each modality.
6. The method of emotion recognition of claim 5, wherein the determining sample data of at least two modalities comprises:
determining initial sample data of at least two modals, wherein the at least two modals comprise at least two of an audio modality, an image modality, a text modality, an electroencephalogram modality, a behavior modality and a genetic modality;
and dividing time intervals of the initial sample data of each mode to obtain the sample data of each mode.
7. The emotion recognition method according to any one of claims 1 to 4, wherein the determining of the emotion recognition result based on the emotion probability distribution of each modality includes:
carrying out weighted fusion on the emotion probability distribution of each mode to obtain fused emotion probability distribution;
and determining an emotion recognition result based on the fusion emotion probability distribution.
8. An emotion recognition apparatus, comprising:
the device comprises a to-be-identified data determining unit, a to-be-identified data determining unit and a recognizing unit, wherein the to-be-identified data determining unit is used for determining to-be-identified data of at least two modals;
the probability distribution determining unit is used for determining the emotion probability distribution of the data to be identified of each modality based on the emotion identification model of each modality; the emotion recognition model is used for extracting features of data to be recognized of corresponding modalities and recognizing emotion based on the data features obtained by feature extraction; the emotion recognition model of each modality is obtained by joint training based on the feature similarity of the sample data features of the sample data of each modality in the same space and/or the distribution similarity among the prediction probability distributions of the sample data of each modality;
and the recognition result determining unit is used for determining the emotion recognition result based on the emotion probability distribution of each modality.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the emotion recognition method as claimed in any of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the emotion recognition method of any of claims 1 to 7.
CN202210813381.XA 2022-07-11 2022-07-11 Emotion recognition method and device, electronic equipment and storage medium Pending CN115376214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210813381.XA CN115376214A (en) 2022-07-11 2022-07-11 Emotion recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210813381.XA CN115376214A (en) 2022-07-11 2022-07-11 Emotion recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115376214A true CN115376214A (en) 2022-11-22

Family

ID=84062011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210813381.XA Pending CN115376214A (en) 2022-07-11 2022-07-11 Emotion recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115376214A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117234369A (en) * 2023-08-21 2023-12-15 华院计算技术(上海)股份有限公司 Digital human interaction method and system, computer readable storage medium and digital human equipment


Similar Documents

Publication Publication Date Title
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Lotfian et al. Curriculum learning for speech emotion recognition from crowdsourced labels
CN110516696B (en) Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression
CN108416065B (en) Hierarchical neural network-based image-sentence description generation system and method
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
Gharavian et al. Audio-visual emotion recognition using FCBF feature selection method and particle swarm optimization for fuzzy ARTMAP neural networks
US11996116B2 (en) Methods and systems for implementing on-device non-semantic representation fine-tuning for speech classification
CN113688244A (en) Text classification method, system, device and storage medium based on neural network
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN111653274A (en) Method, device and storage medium for awakening word recognition
CN111540470B (en) Social network depression tendency detection model based on BERT transfer learning and training method thereof
CN116028846A (en) Multi-mode emotion analysis method integrating multi-feature and attention mechanisms
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
Sridhar et al. Unsupervised personalization of an emotion recognition system: The unique properties of the externalization of valence in speech
CN115376214A (en) Emotion recognition method and device, electronic equipment and storage medium
CN113282840B (en) Comprehensive training acquisition management platform
CN110889505A (en) Cross-media comprehensive reasoning method and system for matching image-text sequences
CN110867225A (en) Character-level clinical concept extraction named entity recognition method and system
CN112529054B (en) Multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data
Huang Elderly Depression Recognition Based on Facial Micro-Expression Extraction.
CN116757195A (en) Implicit emotion recognition method based on prompt learning
CN114757310B (en) Emotion recognition model and training method, device, equipment and readable storage medium thereof
CN116401364A (en) Language model training method, electronic device, storage medium and product
CN115861670A (en) Training method of feature extraction model and data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination