CN113990353B - Emotion recognition method, emotion recognition model training method, emotion recognition device and emotion recognition equipment - Google Patents

Emotion recognition method, emotion recognition model training method, emotion recognition device and emotion recognition equipment

Info

Publication number
CN113990353B
CN113990353B
Authority
CN
China
Prior art keywords
feature
graph
graph convolution
content
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111259447.7A
Other languages
Chinese (zh)
Other versions
CN113990353A (en)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111259447.7A priority Critical patent/CN113990353B/en
Publication of CN113990353A publication Critical patent/CN113990353A/en
Application granted granted Critical
Publication of CN113990353B publication Critical patent/CN113990353B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides an emotion recognition method, relating to the field of artificial intelligence and, in particular, to the field of deep learning. A specific implementation scheme is as follows: acquiring a first content feature and a first audio feature of target data; inputting the first content feature into a first feature extraction model to obtain a second content feature; inputting the first audio feature into the first feature extraction model to obtain a second audio feature; and identifying the emotion of the target object corresponding to the target data according to the second content feature and the second audio feature. The disclosure also provides a method and an apparatus for training the emotion recognition model, an electronic device, and a storage medium.

Description

Emotion recognition method, emotion recognition model training method, emotion recognition device and emotion recognition equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to deep learning techniques. More particularly, the present disclosure provides a method of recognizing emotion, a method of training an emotion recognition model, an apparatus, an electronic device, and a storage medium.
Background
Speech is an important carrier of emotion in human communication. People's verbal expression differs across emotional states. For example, sentences with the same content, when accompanied by different emotions, may express completely different meanings.
Disclosure of Invention
The present disclosure provides a method of recognizing emotion, a method of training emotion recognition model, an apparatus, a device, and a storage medium.
According to a first aspect, there is provided a method of identifying emotion, the method comprising: acquiring a first content feature and a first audio feature of target data; inputting the first content features into a first feature extraction model to obtain second content features; inputting the first audio features into a first feature extraction model to obtain second audio features; and identifying emotion of the target object corresponding to the target data according to the second content feature and the second audio feature.
According to a second aspect, there is provided a method of training an emotion recognition model including a first feature extraction model, the method comprising: acquiring a first content feature and a first audio feature of sample data; inputting the first content features into a first feature extraction model to obtain second content features; inputting the first audio features into a first feature extraction model to obtain second audio features; identifying emotion of a sample object corresponding to the sample data based on the second content feature and the second audio feature; obtaining a loss value according to the emotion of the sample object and the label of the sample data; and training the emotion recognition model according to the loss value.
According to a third aspect, there is provided an apparatus for identifying emotion, the apparatus comprising: the first acquisition module is used for acquiring a first content characteristic and a first audio characteristic of the target data; the first obtaining module is used for inputting the first content characteristics into the first characteristic extraction model to obtain second content characteristics; the second obtaining module is used for inputting the first audio features into the first feature extraction model to obtain second audio features; and the first recognition module is used for recognizing the emotion of the target object corresponding to the target data according to the second content characteristic and the second audio characteristic.
According to a fourth aspect, there is provided an apparatus for training an emotion recognition model including a first feature extraction model, the apparatus comprising: the second acquisition module is used for acquiring the first content characteristics and the first audio characteristics of the sample data; the third obtaining module is used for inputting the first content characteristics into the first characteristic extraction model to obtain second content characteristics; a fourth obtaining module, configured to input the first audio feature into a first feature extraction model to obtain a second audio feature; a second identifying module for identifying emotion of a sample object corresponding to the sample data according to the second content feature and the second audio feature; a fifth obtaining module, configured to obtain a loss value according to the emotion of the sample object and the label of the sample data; and the training module is used for training the emotion recognition model according to the loss value.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method of identifying emotion in accordance with one embodiment of the present disclosure;
FIG. 2A is a schematic diagram of a chain graph structure according to one embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a line graph structure according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a method of identifying emotion in accordance with one embodiment of the present disclosure;
FIG. 4 is a flowchart of a method of training an emotion recognition model according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure;
FIG. 6 is a block diagram of an apparatus for recognizing emotion in accordance with one embodiment of the present disclosure;
FIG. 7 is a block diagram of an apparatus for training emotion recognition models according to one embodiment of the present disclosure; and
Fig. 8 is a block diagram of an electronic device to which a method of recognizing emotion and/or a method of training an emotion recognition model may be applied, according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Verbal expression differs across emotional states. For example, intonation may be relatively cheerful when a speaker is happy. For another example, intonation may be low and dull when a speaker is irritable or sad.
Deep learning techniques have accelerated the development of emotion recognition from speech, but shortcomings remain. For example, different speakers may express different emotions with the same utterance, and related techniques have difficulty distinguishing these emotions.
Currently, to improve the performance of an emotion recognition model, front-end feature extraction may be optimized. For example, MFCC (Mel Frequency Cepstral Coefficient) features of the speech may be extracted, or the feature dimension may be increased, for example from 40 to 80, to improve the accuracy of emotion recognition. However, optimizing front-end feature extraction alone does not significantly improve the accuracy of emotion recognition.
Fig. 1 is a flowchart of a method of identifying emotion according to one embodiment of the present disclosure.
As shown in fig. 1, the method 100 may include operations S110 to S140.
In operation S110, a first content feature and a first audio feature of target data are acquired.
In the disclosed embodiments, the target data may be voice data.
For example, the target data may be a segment of speech derived from the target object.
In the disclosed embodiments, the target data may be voice data in video data.
For example, video data of a target object may be acquired. The target data may be voice data extracted from the video data. In one example, video data for a target object may be collected, with an audio stream in the video data as target data.
In the embodiment of the disclosure, the target data may be input into the second feature extraction model to obtain text information and time information of the target data.
For example, the second feature extraction model may include a forced alignment sub-model. In some examples, the forced alignment sub-model may be a GMM-HMM (Gaussian Mixture Model-Hidden Markov Model), an LSTM-CTC (Long Short-Term Memory with Connectionist Temporal Classification) model, or a Chain model.
For example, the second feature extraction model may be a pre-trained model. In some examples, the second feature extraction model may be pre-trained using open-source datasets such as Aishell or LibriSpeech as training samples.
For example, the text information may include information of phonemes, words, and the like in the target data.
For example, the time information may include a timestamp of the occurrence of a phoneme, a character, or a word. In one example, the time information includes the start time of a phoneme and the duration of the phoneme.
In the embodiment of the disclosure, the first content feature may be obtained according to the text information.
For example, the second feature extraction model further includes a content feature generation sub-model. The text information may be input into the content feature generation sub-model to obtain the first content feature. In one example, one or more of the phonemes and words of the target data may be input into the content feature generation sub-model to obtain the first content feature. The content feature generation sub-model may be a convolutional neural network model.
In the embodiment of the disclosure, the first audio feature may be obtained according to the text information and the time information.
For example, the forced alignment sub-model may output a first audio feature based on the text information and the time information.
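As an illustration only, the sketch below shows one way the forced-alignment output (text information plus time information) could be organized into a first content feature and a first audio feature. The alignment format, the helper functions, the toy phoneme vocabulary, and the 40-dimensional frame features are hypothetical assumptions for this sketch, not details taken from the disclosure.

```python
import numpy as np

# Hypothetical forced-alignment output: (phoneme, start_time_s, duration_s).
# A real system might obtain this from a GMM-HMM, LSTM-CTC, or Chain model.
alignment = [("n", 0.00, 0.08), ("i", 0.08, 0.12), ("h", 0.20, 0.07), ("ao", 0.27, 0.15)]

phoneme_vocab = {"n": 0, "i": 1, "h": 2, "ao": 3}   # toy vocabulary (assumed)
frame_shift = 0.01                                   # 10 ms acoustic frames (assumed)

def first_content_feature(alignment):
    """Token ids of the phonemes, in order -- a minimal 'first content feature'."""
    return np.array([phoneme_vocab[p] for p, _, _ in alignment], dtype=np.int64)

def first_audio_feature(alignment, frames):
    """Average the acoustic frames that fall inside each phoneme's time span."""
    feats = []
    for _, start, dur in alignment:
        lo, hi = int(start / frame_shift), int((start + dur) / frame_shift)
        feats.append(frames[lo:max(hi, lo + 1)].mean(axis=0))
    return np.stack(feats)

frames = np.random.randn(60, 40)                 # 60 frames of 40-dim features (e.g., MFCC)
content = first_content_feature(alignment)       # shape: (num_phonemes,)
audio = first_audio_feature(alignment, frames)   # shape: (num_phonemes, 40)
```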
In operation S120, the first content features are input into the first feature extraction model, resulting in second content features.
In an embodiment of the present disclosure, the first feature extraction model may include a graph convolution sub-model.
For example, the graph convolution sub-model may be a GCN (Graph Convolutional Network) model. The graph structure employed by the GCN model may be an undirected graph structure.
For example, the graph structure adopted by the graph convolution sub-model is a chain graph structure, and the first adjacency matrix corresponding to the chain graph structure is:

$$A_C=\begin{bmatrix}0&a&0&\cdots&0&a\\ a&0&a&\cdots&0&0\\ 0&a&0&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&0&a\\ a&0&0&\cdots&a&0\end{bmatrix}$$

For example, A_C is the first adjacency matrix and a is a real number greater than 0.

For example, the first adjacency matrix is an N×N matrix, N is a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.

For example, each row vector of the first adjacency matrix includes two non-zero entries (namely a). In one example, a=1.
For example, the graph structure adopted by the graph convolution sub-model is a line graph structure, and the second adjacency matrix corresponding to the line graph structure is:

$$A_L=\begin{bmatrix}0&b&0&\cdots&0&0\\ b&0&b&\cdots&0&0\\ 0&b&0&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&0&b\\ 0&0&0&\cdots&b&0\end{bmatrix}$$

For example, A_L is the second adjacency matrix and b is a real number greater than 0.

For example, the second adjacency matrix is an M×M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.

For example, the first row vector of the second adjacency matrix includes one non-zero entry (namely b), and the last row vector of the second adjacency matrix includes one non-zero entry (namely b). The second through (M-1)-th row vectors of the second adjacency matrix each include two non-zero entries. In one example, b=1.
The graph structure of the graph convolution sub-model includes a plurality of nodes. The relationship between adjacent nodes is far more important than the relationship between non-adjacent nodes. By adopting a graph convolution sub-model with a chain graph structure or a line graph structure, the relationship between adjacent nodes can be learned, the amount of computation of the graph convolution sub-model can be reduced, and the accuracy of emotion recognition can still be ensured.
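As a minimal sketch only, the chain-graph and line-graph adjacency matrices described above can be constructed as follows (assuming a = b = 1, as in the examples); the function names and node counts are illustrative, not part of the disclosure.

```python
import numpy as np

def chain_adjacency(n: int, a: float = 1.0) -> np.ndarray:
    """Chain (cyclic) graph: node t is connected to t-1 and t+1, and node n to node 1."""
    A = np.zeros((n, n))
    for t in range(n):
        A[t, (t + 1) % n] = a
        A[t, (t - 1) % n] = a
    return A  # circulant: every row holds two non-zero entries

def line_adjacency(m: int, b: float = 1.0) -> np.ndarray:
    """Line (path) graph: like the chain graph but without the edge between node m and node 1."""
    A = np.zeros((m, m))
    for t in range(m - 1):
        A[t, t + 1] = b
        A[t + 1, t] = b
    return A  # tridiagonal: first and last rows hold one non-zero entry

A_C = chain_adjacency(16)   # e.g., 16 nodes, as in the content-branch example
A_L = line_adjacency(120)   # e.g., 120 nodes, as in the audio-branch example
```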
In an embodiment of the present disclosure, the graph convolution sub-model may include a first graph convolution network, and the first graph convolution network may include H first graph convolution layers.
In the embodiment of the disclosure, the first content feature may be input into the 1 st first graph convolution layer to obtain the 1 st first intermediate feature.
For example, the 1st first intermediate feature can be obtained using the following formula (formula two):

$$X_C^{(1)} = U\,\Theta_C^{(1)}\,U^{T} X_C^{(0)}$$

where $X_C^{(1)}$ is the 1st first intermediate feature, $U$ is the eigenvector matrix of the normalized graph Laplacian matrix $\tilde{L}$, $U^{T}$ is the transpose of $U$, $X_C^{(0)}$ is the first content feature, and $\Theta_C^{(1)}$ is the parameter of the 1st first graph convolution layer.
The graph Laplacian matrix $\tilde{L}$ can be obtained by the following formula (formula three):

$$\tilde{L} = I - D^{-1/2} A D^{-1/2}$$

where $A$ is an adjacency matrix, such as $A_C$ or $A_L$ described above, $D$ is the degree matrix, and $I$ is the identity matrix.

The graph Laplacian matrix $\tilde{L}$ can be decomposed into eigenvalues by the following formula (formula four):

$$\tilde{L} = U \Lambda U^{T}$$

where $\lambda_g$ is the g-th eigenvalue, the corresponding eigenvector is $u_g$, $U=[u_1,u_2,\ldots,u_G]$, and $\Lambda=\mathrm{diag}(\lambda_g)$.

In some examples, $A$ in formula three is $A_C$. The graph Laplacian matrix $\tilde{L}$ is then a circulant matrix, and the corresponding graph Fourier transform is the discrete Fourier transform. Accordingly, $\tilde{L}$ is an N×N matrix.

In some examples, $A$ in formula three is $A_L$, and the corresponding graph Fourier transform is the discrete cosine transform. Accordingly, $\tilde{L}$ is an M×M matrix.
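The normalized graph Laplacian and its eigendecomposition (formula three and formula four) can be computed with standard linear algebra routines, as in the hedged sketch below; the 4-node example adjacency is only for illustration.

```python
import numpy as np

def normalized_laplacian(A: np.ndarray) -> np.ndarray:
    """L_tilde = I - D^{-1/2} A D^{-1/2}, where D is the degree matrix of A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return np.eye(A.shape[0]) - np.diag(d_inv_sqrt) @ A @ np.diag(d_inv_sqrt)

# Small chain-graph adjacency (4 nodes, a = 1), just for illustration.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

L_tilde = normalized_laplacian(A)
eigvals, U = np.linalg.eigh(L_tilde)   # L_tilde = U @ np.diag(eigvals) @ U.T
# U plays the role of the graph Fourier basis; per the disclosure, the chain graph
# corresponds to a discrete Fourier transform and the line graph to a discrete cosine transform.
```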
In the embodiment of the disclosure, the h-th first intermediate feature may be input into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature.

For example, h = 1, ..., H-1.

For example, the (h+1)-th first intermediate feature can be obtained by the following formula (formula five):

$$X_C^{(h+1)} = U\,\Theta_C^{(h+1)}\,U^{T} X_C^{(h)}$$

where $X_C^{(h+1)}$ is the (h+1)-th first intermediate feature, $X_C^{(h)}$ is the h-th first intermediate feature, and $\Theta_C^{(h+1)}$ is the parameter of the (h+1)-th first graph convolution layer.
In the embodiment of the disclosure, the second content feature may be obtained according to the H first intermediate features.
For example, the H first intermediate features may be input into a first pooling layer, which pools them to obtain the second content feature by the following formula (formula six):

$$C = \mathrm{Pooling}\big(X_C^{(1)}, X_C^{(2)}, \ldots, X_C^{(H)}\big)$$

where $C$ is the second content feature and $X_C^{(h)}$ is the h-th first intermediate feature.

In one example, the graph structure adopted by the first graph convolution network includes 16 nodes.
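A minimal PyTorch-style sketch of formulas two, five, and six might look like the following. The layer class names, the random orthogonal placeholder for U, the feature dimension, and the choice of mean pooling are assumptions for illustration and are not the disclosure's exact implementation.

```python
import torch
import torch.nn as nn

class SpectralGraphConv(nn.Module):
    """One graph convolution layer: X_out = U @ Theta @ U.T @ X_in."""
    def __init__(self, U: torch.Tensor):
        super().__init__()
        n = U.shape[0]
        self.register_buffer("U", U)  # graph Fourier basis, kept fixed
        self.theta = nn.Parameter(torch.eye(n) + 0.01 * torch.randn(n, n))  # learnable

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_nodes, feat_dim)
        return self.U @ self.theta @ self.U.T @ x

class GraphConvBranch(nn.Module):
    """H stacked layers whose intermediate outputs are pooled into one feature."""
    def __init__(self, U: torch.Tensor, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([SpectralGraphConv(U) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        intermediates = []
        for layer in self.layers:
            x = layer(x)
            intermediates.append(x)
        # Pool the H intermediate features (assumed here: mean over layers, then over nodes).
        return torch.stack(intermediates).mean(dim=0).mean(dim=0)

# Example: content branch with a 16-node graph, 64-dim node features, H = 3 layers.
# In practice U would come from the eigendecomposition of the normalized graph Laplacian.
U = torch.linalg.qr(torch.randn(16, 16)).Q           # placeholder orthogonal basis
content_branch = GraphConvBranch(U, num_layers=3)
second_content_feature = content_branch(torch.randn(16, 64))   # shape: (64,)
```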
In operation S130, the first audio feature is input into the first feature extraction model to obtain a second audio feature.
In an embodiment of the present disclosure, the first feature extraction model may include a graph convolution sub-model.
For a detailed description of the first feature extraction model, reference may be made, for example, to the description in operation S120, which is not repeated here.
In an embodiment of the present disclosure, the graph convolution sub-model includes a second graph convolution network, and the second graph convolution network includes K second graph convolution layers.
In the embodiment of the disclosure, the first audio feature may be input into the 1 st second graph convolution layer to obtain the 1 st second intermediate feature.
For example, the 1st second intermediate feature can be obtained using the following formula (formula seven):

$$X_A^{(1)} = U\,\Theta_A^{(1)}\,U^{T} X_A^{(0)}$$

where $X_A^{(1)}$ is the 1st second intermediate feature, $U$ is the eigenvector matrix of the normalized graph Laplacian matrix $\tilde{L}$, $U^{T}$ is the transpose of $U$, $X_A^{(0)}$ is the first audio feature, and $\Theta_A^{(1)}$ is the parameter of the 1st second graph convolution layer.

The graph Laplacian matrix $\tilde{L}$ and its eigenvector matrix may be obtained by referring to, for example, formula three and formula four described above, and are not described again here.
In the embodiment of the disclosure, the k-th second intermediate feature may be input into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature.

For example, k = 1, ..., K-1.

For example, the (k+1)-th second intermediate feature may be obtained by the following formula (formula eight):

$$X_A^{(k+1)} = U\,\Theta_A^{(k+1)}\,U^{T} X_A^{(k)}$$

where $X_A^{(k+1)}$ is the (k+1)-th second intermediate feature, $X_A^{(k)}$ is the k-th second intermediate feature, and $\Theta_A^{(k+1)}$ is the parameter of the (k+1)-th second graph convolution layer.
In the embodiment of the disclosure, the second audio feature may be obtained according to K second intermediate features.
For example, the K second intermediate features may be input into a second pooling layer, which pools them to obtain the second audio feature by the following formula (formula nine):

$$Audio = \mathrm{Pooling}\big(X_A^{(1)}, X_A^{(2)}, \ldots, X_A^{(K)}\big)$$

where $Audio$ is the second audio feature and $X_A^{(k)}$ is the k-th second intermediate feature.

In one example, the graph structure adopted by the second graph convolution network includes 120 nodes.
In some examples, H may be equal to K; that is, the first graph convolution network and the second graph convolution network may have the same number of graph convolution layers.
It should be noted that, the graph structure adopted by the graph convolution sub-model may be the graph structure adopted by the first graph convolution network and/or the second graph convolution network.
It should be noted that, if the first graph convolution network adopts a chained graph structure, the graph structure adopted by one or more first graph convolution layers in the H first graph convolution layers may be the chained graph structure.
It should be noted that, if the second graph convolution network adopts a chained graph structure, the graph structure adopted by one or more second graph convolution layers in the K second graph convolution layers may be the chained graph structure.
It should be noted that the first graph convolution network and the second graph convolution network may have the same parameters, except for the number of nodes of the graph structure.

It should be noted that the first graph convolution network and the second graph convolution network may both adopt a chain graph structure, or may both adopt a line graph structure. Alternatively, the first graph convolution network may adopt a chain graph structure and the second graph convolution network a line graph structure, or the first graph convolution network may adopt a line graph structure and the second graph convolution network a chain graph structure.
In operation S140, emotion of the target object corresponding to the target data is recognized according to the second content feature and the second audio feature.
In the embodiment of the disclosure, a fusion operation may be performed on the second content feature and the second audio feature, so as to obtain a fusion feature.
For example, the second content feature and the second audio feature may be concatenated to obtain the fusion feature.
In embodiments of the present disclosure, the emotion of the target object may be identified from the fusion features.
For example, the fusion feature may be input into a fully connected layer to identify the emotion of the target object. The emotion may be happiness, sadness, or the like.

According to the embodiments of the present disclosure, the emotion corresponding to an audio frame is determined by considering both the sequential relationship between audio frames and the association between the audio and its content, so the accuracy of emotion recognition is improved. By adopting the chain graph structure or the line graph structure, the amount of computation can be reduced and the accuracy of emotion recognition can be further improved.
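As an illustrative sketch of the fusion and classification steps (concatenation followed by fully connected layers), one could write the following; the feature sizes, hidden width, and emotion label set are assumptions, not the disclosure's exact configuration.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenate the second content feature and the second audio feature,
    then map the fused vector to emotion logits with fully connected layers."""
    def __init__(self, content_dim: int, audio_dim: int, num_emotions: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(content_dim + audio_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, content_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([content_feat, audio_feat], dim=-1)   # fusion by concatenation
        return self.classifier(fused)                           # emotion logits

emotions = ["happy", "sad", "angry", "neutral"]                 # illustrative label set
model = FusionClassifier(content_dim=64, audio_dim=64, num_emotions=len(emotions))
logits = model(torch.randn(64), torch.randn(64))
predicted = emotions[logits.argmax().item()]
```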
Fig. 2A is a schematic diagram of a chain graph structure according to one embodiment of the present disclosure.
As shown in fig. 2A, the chain graph structure 201 includes N nodes, where the N-th node V_N is connected to the 1st node V_1. In one example, N=120. In one example, N=16.
Fig. 2B is a schematic diagram of a line graph structure according to one embodiment of the present disclosure.
As shown in fig. 2B, the line graph structure 202 includes M nodes, where the M-th node V'_M is not connected to the 1st node V'_1. In one example, M=120. In one example, M=16.
Fig. 3 is a schematic diagram of a method of recognizing emotion according to one embodiment of the present disclosure.
As shown in fig. 3, the input to the second feature extraction model 302 is the target data 301, and it outputs the first content feature and the first audio feature.
The first feature extraction model includes a graph convolution sub-model, which may include a first graph convolution network 303 and a second graph convolution network 305. The first feature extraction model may also include a first pooling layer 304 and a second pooling layer 306.
The input to the first graph convolution network 303 is the first content feature, and it outputs the second content feature. The first graph convolution network 303 includes H first graph convolution layers. The input to the 1st first graph convolution layer 3031 is the first content feature, and it outputs the 1st first intermediate feature. The 1st first intermediate feature serves as the input to the 2nd first graph convolution layer. The input to the h-th first graph convolution layer 3032 is the (h-1)-th first intermediate feature, and it outputs the h-th first intermediate feature. The input to the H-th first graph convolution layer 3033 is the (H-1)-th first intermediate feature, and it outputs the H-th first intermediate feature. The input to the first pooling layer 304 is the H first intermediate features, and it outputs the second content feature. In one example, the graph structure adopted by the first graph convolution network 303 includes 16 nodes.

The input to the second graph convolution network 305 is the first audio feature, and it outputs the second audio feature. The second graph convolution network 305 includes K second graph convolution layers. The input to the 1st second graph convolution layer 3051 is the first audio feature, and it outputs the 1st second intermediate feature. The 1st second intermediate feature serves as the input to the 2nd second graph convolution layer. The input to the k-th second graph convolution layer 3052 is the (k-1)-th second intermediate feature, and it outputs the k-th second intermediate feature. The input to the K-th second graph convolution layer 3053 is the (K-1)-th second intermediate feature, and it outputs the K-th second intermediate feature. The input to the second pooling layer 306 is the K second intermediate features, and it outputs the second audio feature. In one example, the graph structure adopted by the second graph convolution network 305 includes 120 nodes.
The inputs to the fusion model 307 are the second content feature and the second audio feature, and it outputs the fusion feature. The fusion model 307 may concatenate the second content feature and the second audio feature.

The classification model 308 may include one or more fully connected layers. Its input is the fusion feature, and it outputs the emotion of the target object corresponding to the target data 301.
Fig. 4 is a flowchart of a method of training an emotion recognition model according to one embodiment of the present disclosure.
As shown in fig. 4, the method 400 may include operations S410 to S460.
In operation S410, a first content feature and a first audio feature of sample data are acquired.
In an embodiment of the present disclosure, the emotion recognition model may include a second feature extraction model.
In the embodiment of the disclosure, the sample data may be input into the second feature extraction model to obtain text information and time information of the sample data.
In the embodiment of the disclosure, the first content feature may be obtained according to the text information.
In the embodiment of the disclosure, the first audio feature may be obtained according to the text information and the time information.
The embodiment of operation S410 may refer to the embodiment of operation S110 described above, and this disclosure is not repeated here.
In operation S420, the first content features are input into the first feature extraction model, resulting in second content features.
In an embodiment of the disclosure, the first feature extraction model may include a graph convolution sub-model, and the graph structure adopted by the graph convolution sub-model may be a chained graph structure.
For example, the first adjacency matrix corresponding to the chain graph structure is:

$$A_C=\begin{bmatrix}0&a&0&\cdots&0&a\\ a&0&a&\cdots&0&0\\ 0&a&0&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&0&a\\ a&0&0&\cdots&a&0\end{bmatrix}$$

For example, A_C is the first adjacency matrix and a is a real number greater than 0.

For example, the first adjacency matrix is an N×N matrix, N being a positive integer greater than 2; the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
In an embodiment of the disclosure, the first feature extraction model includes a graph convolution sub-model, and a graph structure adopted by the graph convolution sub-model is a line graph structure.
For example, the second adjacency matrix corresponding to the line graph structure is:

$$A_L=\begin{bmatrix}0&b&0&\cdots&0&0\\ b&0&b&\cdots&0&0\\ 0&b&0&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&0&b\\ 0&0&0&\cdots&b&0\end{bmatrix}$$

For example, A_L is the second adjacency matrix and b is a real number greater than 0.

For example, the second adjacency matrix is an M×M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
In an embodiment of the present disclosure, the graph convolution sub-model may include a first graph convolution network, and the first graph convolution network may include H first graph convolution layers.
In the embodiment of the disclosure, the first content feature may be input into the 1 st first graph convolution layer to obtain the 1 st first intermediate feature.
In the embodiment of the disclosure, the h-th first intermediate feature may be input into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature, where h = 1, ..., H-1.
In the embodiment of the disclosure, the second content feature may be obtained according to the H first intermediate features.
The embodiment of operation S420 may refer to the embodiment of operation S120 described above, and this disclosure is not repeated here.
In operation S430, the first audio feature is input into the first feature extraction model, resulting in a second audio feature.
In an embodiment of the present disclosure, the graph convolution sub-model may include a second graph convolution network, which may include K second graph convolution layers.
In the embodiment of the disclosure, the first audio feature may be input into the 1 st second graph convolution layer to obtain the 1 st second intermediate feature.
In the embodiment of the disclosure, the k-th second intermediate feature may be input into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature, where k = 1, ..., K-1.
In the embodiment of the disclosure, the second audio feature may be obtained according to K second intermediate features.
The embodiment of operation S430 may refer to the embodiment of operation S130 described above, and this disclosure is not repeated here.
In operation S440, emotion of the sample object corresponding to the sample data is identified according to the second content feature and the second audio feature.
In embodiments of the present disclosure, the emotion recognition model may include a fusion model and a classification model.
In the embodiment of the disclosure, the second content feature and the second audio feature may be input into a fusion model to obtain a fusion feature.
In embodiments of the present disclosure, the fusion features may be input into a classification model to identify the emotion of the sample object.
The embodiment of operation S440 may refer to the embodiment of operation S140 described above, and this disclosure is not repeated here.
In operation S450, a loss value is obtained according to the emotion of the sample object and the tag of the sample data.
For example, a cross-entropy loss function may be used to obtain the loss value according to the emotion of the sample object and the label of the sample data.
In operation S460, an emotion recognition model is trained according to the loss value.
In embodiments of the present disclosure, parameters of the first feature extraction model may be adjusted according to the loss value to train the emotion recognition model.
For example, according to the loss value, the parameter $\Theta_C^{(1)}$ in formula two may be adjusted; the parameter $\Theta_C^{(h+1)}$ in formula five may also be adjusted; the parameter $\Theta_A^{(1)}$ in formula seven may also be adjusted; and the parameter $\Theta_A^{(k+1)}$ in formula eight may also be adjusted.
For example, the number of nodes of the graph structure adopted by the graph convolution sub-model may also be adjusted according to the loss value. In this way, a model that accurately recognizes the emotion of the target object can be obtained.
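A hedged sketch of operations S450 and S460, computing a cross-entropy loss from the predicted emotion and the sample label and adjusting the model parameters by gradient descent, is given below; the model signature and optimizer are assumptions, not the disclosure's exact training procedure.

```python
import torch
import torch.nn as nn

def train_step(emotion_model: nn.Module,
               optimizer: torch.optim.Optimizer,
               first_content_feat: torch.Tensor,
               first_audio_feat: torch.Tensor,
               label: torch.Tensor) -> float:
    """One update of the emotion recognition model from a single labelled sample."""
    criterion = nn.CrossEntropyLoss()
    logits = emotion_model(first_content_feat, first_audio_feat)  # recognized emotion scores
    loss = criterion(logits.unsqueeze(0), label.view(1))          # compare with the sample label
    optimizer.zero_grad()
    loss.backward()          # gradients flow into the graph convolution parameters
    optimizer.step()         # adjust the parameters of the first feature extraction model
    return loss.item()

# Example usage with the FusionClassifier sketch given earlier (hypothetical):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = train_step(model, optimizer, torch.randn(64), torch.randn(64), torch.tensor(1))
```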
Fig. 5 is a schematic diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure.
As shown in fig. 5, the emotion recognition model may include, for example, a first feature extraction model, a second feature extraction model 302, a fusion model 307, and a classification model 308. The first feature extraction model may include a graph convolution sub-model, which may include a first graph convolution network 303 and a second graph convolution network 305. The first feature extraction model may also include a first pooling layer 304 and a second pooling layer 306.
The processing manner of the sample data 501 may refer to, for example, the processing manner of the target data 301 described in fig. 3, and this disclosure will not be repeated here.
The emotion recognition model processes the sample data 501 and outputs the emotion of the sample object corresponding to the sample data. A loss value may be obtained from the emotion of the sample object and the label of the sample data. Parameters of the first graph convolution network 303 and parameters of the second graph convolution network 305 may be adjusted according to the loss value to train the emotion recognition model.
Fig. 6 is a block diagram of an apparatus for recognizing emotion according to one embodiment of the present disclosure.
As shown in fig. 6, the apparatus 600 may include a first acquisition module 610, a first acquisition module 620, a second acquisition module 630, and a first identification module 640.
A first acquisition module 610 is configured to acquire a first content feature and a first audio feature of the target data.
The first obtaining module 620 is configured to input the first content feature into the first feature extraction model to obtain a second content feature.
The second obtaining module 630 is configured to input the first audio feature into the first feature extraction model to obtain a second audio feature.
The first identifying module 640 is configured to identify an emotion of the target object corresponding to the target data according to the second content feature and the second audio feature.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph structure adopted by the graph convolution sub-model is a chain graph structure, and the first adjacency matrix corresponding to the chain graph structure is:

$$A_C=\begin{bmatrix}0&a&0&\cdots&0&a\\ a&0&a&\cdots&0&0\\ 0&a&0&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&0&a\\ a&0&0&\cdots&a&0\end{bmatrix}$$

wherein A_C is the first adjacency matrix, and a is a real number greater than 0; the first adjacency matrix is an N×N matrix, N is a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph structure adopted by the graph convolution sub-model is a line graph structure, and the second adjacency matrix corresponding to the line graph structure is:

$$A_L=\begin{bmatrix}0&b&0&\cdots&0&0\\ b&0&b&\cdots&0&0\\ 0&b&0&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&0&b\\ 0&0&0&\cdots&b&0\end{bmatrix}$$

wherein A_L is the second adjacency matrix, and b is a real number greater than 0; the second adjacency matrix is an M×M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
In some embodiments, the first obtaining module includes: a first obtaining unit, configured to input the target data into a second feature extraction model, to obtain text information and time information of the target data; the second obtaining unit is used for obtaining the first content characteristics according to the text information; and a third obtaining unit, configured to obtain the first audio feature according to the text information and the time information.
In some embodiments, the graph convolution sub-model includes a first graph convolution network including H first graph convolution layers, and the first obtaining module includes: a fourth obtaining unit, configured to input the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature; a fifth obtaining unit, configured to input the h-th first intermediate feature into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature, where h = 1, ..., H-1; and a sixth obtaining unit, configured to obtain the second content feature according to the H first intermediate features.

In some embodiments, the graph convolution sub-model includes a second graph convolution network including K second graph convolution layers, and the second obtaining module includes: a seventh obtaining unit, configured to input the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature; an eighth obtaining unit, configured to input the k-th second intermediate feature into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature, where k = 1, ..., K-1; and a ninth obtaining unit, configured to obtain the second audio feature according to the K second intermediate features.
In some embodiments, the first identification module includes: the first fusion unit is used for executing fusion operation on the second content characteristics and the second audio characteristics to obtain fusion characteristics; and the first recognition unit is used for recognizing the emotion of the target object according to the fusion characteristics.
Fig. 7 is a block diagram of an apparatus for training an emotion recognition model according to one embodiment of the present disclosure.
As shown in fig. 7, the apparatus 700 includes a second acquisition module 710, a third acquisition module 720, a fourth acquisition module 730, a second identification module 740, a fifth acquisition module 750, and a training module 760. The emotion recognition model includes a first feature extraction model.
The second obtaining module 710 is configured to obtain a first content feature and a first audio feature of the sample data.
The third obtaining module 720 is configured to input the first content feature into the first feature extraction model to obtain a second content feature.
A fourth obtaining module 730, configured to input the first audio feature into the first feature extraction model to obtain a second audio feature.
And a second identifying module 740, configured to identify emotion of the sample object corresponding to the sample data according to the second content feature and the second audio feature.
A fifth obtaining module 750, configured to obtain a loss value according to the emotion of the sample object and the label of the sample data.
The training module 760 is configured to train the emotion recognition model according to the loss value.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph structure adopted by the graph convolution sub-model is a chain graph structure, and the first adjacency matrix corresponding to the chain graph structure is:

$$A_C=\begin{bmatrix}0&a&0&\cdots&0&a\\ a&0&a&\cdots&0&0\\ 0&a&0&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&0&a\\ a&0&0&\cdots&a&0\end{bmatrix}$$

wherein A_C is the first adjacency matrix, and a is a real number greater than 0; wherein the first adjacency matrix is an N×N matrix, N being a positive integer greater than 2; the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph structure adopted by the graph convolution sub-model is a line graph structure, and the second adjacency matrix corresponding to the line graph structure is:

$$A_L=\begin{bmatrix}0&b&0&\cdots&0&0\\ b&0&b&\cdots&0&0\\ 0&b&0&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&0&b\\ 0&0&0&\cdots&b&0\end{bmatrix}$$

wherein A_L is the second adjacency matrix, and b is a real number greater than 0; wherein the second adjacency matrix is an M×M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
In some embodiments, the emotion recognition model includes a second feature extraction model, and the second obtaining module includes: a tenth obtaining unit configured to input the sample data into a second feature extraction model to obtain text information and time information of the sample data; an eleventh obtaining unit configured to obtain the first content feature according to the text information; a twelfth obtaining unit, configured to obtain the first audio feature according to the text information and the time information.
In some embodiments, the graph convolution sub-model includes a first graph convolution network, the first graph convolution network includes H first graph convolution layers, and the third obtaining module includes: a thirteenth obtaining unit, configured to input the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature; a fourteenth obtaining unit, configured to input the h-th first intermediate feature into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature, where h = 1, ..., H-1; and a fifteenth obtaining unit, configured to obtain the second content feature according to the H first intermediate features.

In some embodiments, the graph convolution sub-model includes a second graph convolution network, the second graph convolution network includes K second graph convolution layers, and the fourth obtaining module includes: a sixteenth obtaining unit, configured to input the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature; a seventeenth obtaining unit, configured to input the k-th second intermediate feature into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature, where k = 1, ..., K-1; and an eighteenth obtaining unit, configured to obtain the second audio feature according to the K second intermediate features.
In some embodiments, the emotion recognition model includes a fusion model and a classification model, and the second recognition module includes: the second fusion unit is used for inputting the second content characteristics and the second audio characteristics into the fusion model to obtain fusion characteristics; and the second recognition unit is used for inputting the fusion characteristics into the classification model and recognizing the emotion of the sample object.
In some embodiments, the training module is further configured to adjust parameters of the first feature extraction model according to the loss value to train the emotion recognition model.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as a method of recognizing emotion and/or a method of training an emotion recognition model. For example, in some embodiments, the method of identifying emotion and/or the method of training an emotion recognition model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the above-described method of identifying emotion and/or method of training an emotion recognition model may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method of identifying emotion and/or the method of training an emotion recognition model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (15)

1. A method of identifying emotion comprising:
acquiring a first content feature and a first audio feature of target data;
Inputting the first content features into a first feature extraction model to obtain second content features;
inputting the first audio features into a first feature extraction model to obtain second audio features; and
Identifying an emotion of a target object corresponding to target data based on the second content feature and the second audio feature,
Wherein the first feature extraction model comprises a graph convolution sub-model comprising a first graph convolution network and a second graph convolution network, the first graph convolution network comprising H first graph convolution layers, the second graph convolution network comprising K second graph convolution layers, the number of nodes in the graph structure of the second graph convolution network being greater than the number of nodes in the graph structure of the first graph convolution network, parameters of the first graph convolution layer other than the number of nodes of the graph structure being the same as parameters of the second graph convolution layer other than the number of nodes of the graph structure,
wherein the inputting the first content feature into the first feature extraction model to obtain the second content feature comprises:
inputting the first content feature into a 1st first graph convolution layer to obtain a 1st first intermediate feature;
inputting an h-th first intermediate feature into an (h+1)-th first graph convolution layer to obtain an (h+1)-th first intermediate feature, where h = 1, …, H-1; and
obtaining the second content feature according to the H first intermediate features,
and wherein the inputting the first audio feature into the first feature extraction model to obtain the second audio feature comprises:
inputting the first audio feature into a 1st second graph convolution layer to obtain a 1st second intermediate feature;
inputting a k-th second intermediate feature into a (k+1)-th second graph convolution layer to obtain a (k+1)-th second intermediate feature, where k = 1, …, K-1; and
obtaining the second audio feature according to the K second intermediate features.
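For readers outside patent practice, the following is a minimal sketch of the two-branch graph convolution sub-model recited in claim 1, written in Python with PyTorch. One reading of "parameters ... other than the number of nodes ... being the same" is that the two branches share their layer weights and differ only in the adjacency matrix (graph size); this sketch follows that reading with H = K. The feature width, the ReLU activation, the placeholder chained adjacency matrices, and the mean-pooling used to combine the intermediate features are assumptions made for illustration, not details fixed by the claim.

```python
import torch
import torch.nn as nn


def chained_adjacency(num_nodes: int, a: float = 1.0) -> torch.Tensor:
    """Placeholder chain-style graph: self-loop plus an edge to the next node."""
    return a * (torch.eye(num_nodes) + torch.eye(num_nodes).roll(1, dims=1))


class GraphConvSubModel(nn.Module):
    """Two graph convolution networks whose layers share weights."""

    def __init__(self, content_nodes: int, audio_nodes: int, dim: int, depth: int):
        super().__init__()
        self.shared_layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.register_buffer("content_adj", chained_adjacency(content_nodes))
        self.register_buffer("audio_adj", chained_adjacency(audio_nodes))

    def _branch(self, x: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
        intermediates = []
        for layer in self.shared_layers:
            x = torch.relu(layer(adjacency @ x))  # h-th / k-th intermediate feature
            intermediates.append(x)
        # Combine all intermediate features into the second feature
        # (mean over layers and nodes is one possible aggregation).
        return torch.stack(intermediates).mean(dim=(0, 2))

    def forward(self, first_content: torch.Tensor, first_audio: torch.Tensor):
        second_content = self._branch(first_content, self.content_adj)
        second_audio = self._branch(first_audio, self.audio_adj)
        return second_content, second_audio


# The audio graph has more nodes than the content graph (40 vs 10 here).
model = GraphConvSubModel(content_nodes=10, audio_nodes=40, dim=64, depth=3)
second_content, second_audio = model(torch.randn(2, 10, 64), torch.randn(2, 40, 64))
```

If the two branches instead require independent depths H ≠ K, the shared layer list would be replaced by two lists whose layers are configured identically apart from the graph size.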
2. The method of claim 1, wherein the graph structure adopted by the graph convolution sub-model is a chained graph structure, and a first adjacency matrix corresponding to the chained graph structure is:
wherein A_C is the first adjacency matrix and a is a real number greater than 0; and
wherein the first adjacency matrix is an N×N matrix, N being a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
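The matrix A_C itself appears as a formula image in the original publication and does not survive in this text, so the sketch below only illustrates the textual constraints of claim 2: an N×N matrix built from a real value a > 0 in which each row is the previous row cyclically shifted one position to the right (a circulant matrix). Which entries of the first row carry the value a is an assumption; the claim only constrains the shift relation between rows.

```python
# Illustration of the row-shift property described in claim 2. Placing a on the
# diagonal and on the next node of the chain is an assumption; the claim only
# states that a > 0 and that row i+1 is row i shifted right by one position.
import numpy as np


def chained_adjacency(num_nodes: int, a: float) -> np.ndarray:
    if num_nodes <= 2 or a <= 0:
        raise ValueError("claim 2 requires N > 2 and a > 0")
    first_row = np.zeros(num_nodes)
    first_row[0] = a          # self-connection (assumed)
    first_row[1] = a          # edge to the next node in the chain (assumed)
    return np.stack([np.roll(first_row, shift) for shift in range(num_nodes)])


A_C = chained_adjacency(num_nodes=5, a=1.0)
assert np.allclose(A_C[2], np.roll(A_C[1], 1))   # row i+1 is row i shifted right
```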
3. The method of claim 1, wherein the graph structure adopted by the graph convolution sub-model is a line graph structure, and a second adjacency matrix corresponding to the line graph structure is:
wherein A_L is the second adjacency matrix and b is a real number greater than 0; and
wherein the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
4. The method of claim 1, wherein the acquiring the first content feature and the first audio feature of the target data comprises:
inputting the target data into a second feature extraction model to obtain text information and time information of the target data;
obtaining the first content feature according to the text information; and
obtaining the first audio feature according to the text information and the time information.
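As a rough sketch of claim 4, the second feature extraction model can be thought of as a speech recognizer that returns the transcript (text information) together with word-level timestamps (time information). The recognizer and the two feature extractors below are stand-in stubs returning fixed or random tensors; they only show how the three quantities relate and are not the patent's actual models.

```python
import torch


def second_feature_extraction(waveform: torch.Tensor):
    """Stub recognizer: returns fake text tokens and (start, end) times in seconds."""
    tokens = ["you", "make", "me", "happy"]
    times = [(0.0, 0.2), (0.2, 0.5), (0.5, 0.6), (0.6, 1.0)]
    return tokens, times


def acquire_first_features(waveform: torch.Tensor):
    tokens, times = second_feature_extraction(waveform)   # text info + time info
    first_content = torch.randn(len(tokens), 64)           # from the text alone
    # The first audio feature depends on both the text and the time information,
    # e.g. acoustic statistics of the waveform segment aligned to each word.
    first_audio = torch.randn(len(times), 64)
    return first_content, first_audio


first_content, first_audio = acquire_first_features(torch.randn(16000))
```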
5. The method of any of claims 1 to 4, wherein the identifying the emotion of the target object corresponding to the target data based on the second content feature and the second audio feature comprises:
performing a fusion operation on the second content feature and the second audio feature to obtain a fusion feature; and
identifying the emotion of the target object according to the fusion feature.
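A possible reading of claim 5 is sketched below, with the fusion operation taken to be a simple concatenation followed by a linear classifier over emotion categories; the actual fusion and classification used in the patent are not specified here, so these choices are illustrative.

```python
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    def __init__(self, content_dim: int, audio_dim: int, num_emotions: int):
        super().__init__()
        self.classifier = nn.Linear(content_dim + audio_dim, num_emotions)

    def forward(self, second_content: torch.Tensor, second_audio: torch.Tensor):
        fused = torch.cat([second_content, second_audio], dim=-1)  # fusion feature
        return self.classifier(fused).softmax(dim=-1)              # emotion probabilities


model = FusionClassifier(content_dim=64, audio_dim=64, num_emotions=4)
scores = model(torch.randn(2, 64), torch.randn(2, 64))   # one row per target object
```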
6. A method of training an emotion recognition model, the emotion recognition model comprising a first feature extraction model, the method comprising:
acquiring a first content feature and a first audio feature of sample data;
inputting the first content feature into the first feature extraction model to obtain a second content feature;
inputting the first audio feature into the first feature extraction model to obtain a second audio feature;
identifying an emotion of a sample object corresponding to the sample data based on the second content feature and the second audio feature;
obtaining a loss value according to the emotion of the sample object and a label of the sample data; and
training the emotion recognition model according to the loss value,
Wherein the first feature extraction model comprises a graph convolution sub-model comprising a first graph convolution network and a second graph convolution network, the first graph convolution network comprising H first graph convolution layers, the second graph convolution network comprising K second graph convolution layers, the number of nodes in the graph structure of the second graph convolution network being greater than the number of nodes in the graph structure of the first graph convolution network, parameters of the first graph convolution layer other than the number of nodes of the graph structure being the same as parameters of the second graph convolution layer other than the number of nodes of the graph structure,
wherein the inputting the first content feature into the first feature extraction model to obtain the second content feature comprises:
inputting the first content feature into a 1st first graph convolution layer to obtain a 1st first intermediate feature;
inputting an h-th first intermediate feature into an (h+1)-th first graph convolution layer to obtain an (h+1)-th first intermediate feature, where h = 1, …, H-1; and
obtaining the second content feature according to the H first intermediate features,
and wherein the inputting the first audio feature into the first feature extraction model to obtain the second audio feature comprises:
inputting the first audio feature into a 1st second graph convolution layer to obtain a 1st second intermediate feature;
inputting a k-th second intermediate feature into a (k+1)-th second graph convolution layer to obtain a (k+1)-th second intermediate feature, where k = 1, …, K-1; and
obtaining the second audio feature according to the K second intermediate features.
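For the training method of claim 6, the essential additions over claim 1 are the loss value and the parameter update. The sketch below takes the second content and audio features as given and shows one common realization, assuming a cross-entropy loss between the predicted emotion and the sample label and an Adam optimizer; none of these specifics are fixed by the claim.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the second content/audio features of a batch of samples.
second_content = torch.randn(8, 64)
second_audio = torch.randn(8, 64)
labels = torch.randint(0, 4, (8,))                 # emotion labels of the sample data

recognizer = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(recognizer.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(5):                                 # a few illustrative training steps
    logits = recognizer(torch.cat([second_content, second_audio], dim=-1))
    loss = loss_fn(logits, labels)                 # loss from predicted emotion vs label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # update the emotion recognition model
```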
7. The method of claim 6, wherein the graph structure adopted by the graph convolution sub-model is a chained graph structure, and a first adjacency matrix corresponding to the chained graph structure is:
wherein A_C is the first adjacency matrix and a is a real number greater than 0; and
wherein the first adjacency matrix is an N×N matrix, N being a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
8. The method of claim 6, wherein the graph structure adopted by the graph convolution sub-model is a line graph structure, and a second adjacency matrix corresponding to the line graph structure is:
wherein A_L is the second adjacency matrix and b is a real number greater than 0; and
wherein the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
9. The method of claim 6, wherein the emotion recognition model comprises a second feature extraction model, and
the acquiring the first content feature and the first audio feature of the sample data comprises:
inputting the sample data into the second feature extraction model to obtain text information and time information of the sample data;
obtaining the first content feature according to the text information; and
obtaining the first audio feature according to the text information and the time information.
10. The method according to any one of claims 6 to 9, wherein the emotion recognition model comprises a fusion model and a classification model, and
the identifying, based on the second content feature and the second audio feature, the emotion of the sample object corresponding to the sample data comprises:
inputting the second content feature and the second audio feature into the fusion model to obtain a fusion feature; and
inputting the fusion feature into the classification model to identify the emotion of the sample object.
11. The method of any of claims 6 to 9, wherein the training the emotion recognition model according to the loss value comprises:
adjusting parameters of the first feature extraction model according to the loss value to train the emotion recognition model.
12. An apparatus for recognizing emotion, comprising:
a first acquisition module configured to acquire a first content feature and a first audio feature of target data;
a first obtaining module configured to input the first content feature into a first feature extraction model to obtain a second content feature;
a second obtaining module configured to input the first audio feature into the first feature extraction model to obtain a second audio feature; and
a first recognition module configured to recognize an emotion of a target object corresponding to the target data based on the second content feature and the second audio feature,
Wherein the first feature extraction model comprises a graph convolution sub-model comprising a first graph convolution network and a second graph convolution network, the first graph convolution network comprising H first graph convolution layers, the second graph convolution network comprising K second graph convolution layers, the number of nodes in the graph structure of the second graph convolution network being greater than the number of nodes in the graph structure of the first graph convolution network, parameters of the first graph convolution layer other than the number of nodes of the graph structure being the same as parameters of the second graph convolution layer other than the number of nodes of the graph structure,
wherein the first obtaining module comprises:
a fourth obtaining unit configured to input the first content feature into a 1st first graph convolution layer to obtain a 1st first intermediate feature;
a fifth obtaining unit configured to input an h-th first intermediate feature into an (h+1)-th first graph convolution layer to obtain an (h+1)-th first intermediate feature, where h = 1, …, H-1; and
a sixth obtaining unit configured to obtain the second content feature based on the H first intermediate features,
and wherein the second obtaining module comprises:
a seventh obtaining unit configured to input the first audio feature into a 1st second graph convolution layer to obtain a 1st second intermediate feature;
an eighth obtaining unit configured to input a k-th second intermediate feature into a (k+1)-th second graph convolution layer to obtain a (k+1)-th second intermediate feature, where k = 1, …, K-1; and
a ninth obtaining unit configured to obtain the second audio feature according to the K second intermediate features.
13. An apparatus for training an emotion recognition model, the emotion recognition model comprising a first feature extraction model, the apparatus comprising:
a second acquisition module configured to acquire a first content feature and a first audio feature of sample data;
a third obtaining module configured to input the first content feature into the first feature extraction model to obtain a second content feature;
a fourth obtaining module configured to input the first audio feature into the first feature extraction model to obtain a second audio feature;
a second identifying module configured to identify an emotion of a sample object corresponding to the sample data according to the second content feature and the second audio feature;
a fifth obtaining module configured to obtain a loss value according to the emotion of the sample object and a label of the sample data; and
a training module configured to train the emotion recognition model according to the loss value,
Wherein the first feature extraction model comprises a graph convolution sub-model comprising a first graph convolution network and a second graph convolution network, the first graph convolution network comprising H first graph convolution layers, the second graph convolution network comprising K second graph convolution layers, the number of nodes in the graph structure of the second graph convolution network being greater than the number of nodes in the graph structure of the first graph convolution network, parameters of the first graph convolution layer other than the number of nodes in the graph structure being the same as parameters of the second graph convolution layer other than the number of nodes in the graph structure,
wherein the third obtaining module comprises:
a thirteenth obtaining unit configured to input the first content feature into a 1st first graph convolution layer to obtain a 1st first intermediate feature;
a fourteenth obtaining unit configured to input an h-th first intermediate feature into an (h+1)-th first graph convolution layer to obtain an (h+1)-th first intermediate feature, where h = 1, …, H-1; and
a fifteenth obtaining unit configured to obtain the second content feature from the H first intermediate features,
and wherein the fourth obtaining module comprises:
a sixteenth obtaining unit configured to input the first audio feature into a 1st second graph convolution layer to obtain a 1st second intermediate feature;
a seventeenth obtaining unit configured to input a k-th second intermediate feature into a (k+1)-th second graph convolution layer to obtain a (k+1)-th second intermediate feature, where k = 1, …, K-1; and
an eighteenth obtaining unit configured to obtain the second audio feature according to the K second intermediate features.
14. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor,
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 11.
15. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 11.
CN202111259447.7A 2021-10-27 2021-10-27 Emotion recognition method, emotion recognition model training method, emotion recognition device and emotion recognition equipment Active CN113990353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111259447.7A CN113990353B (en) 2021-10-27 2021-10-27 Emotion recognition method, emotion recognition model training method, emotion recognition device and emotion recognition equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111259447.7A CN113990353B (en) 2021-10-27 2021-10-27 Emotion recognition method, emotion recognition model training method, emotion recognition device and emotion recognition equipment

Publications (2)

Publication Number Publication Date
CN113990353A CN113990353A (en) 2022-01-28
CN113990353B 2024-05-07

Family

ID=79743024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111259447.7A Active CN113990353B (en) 2021-10-27 2021-10-27 Emotion recognition method, emotion recognition model training method, emotion recognition device and emotion recognition equipment

Country Status (1)

Country Link
CN (1) CN113990353B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017218243A2 (en) * 2016-06-13 2017-12-21 Microsoft Technology Licensing, Llc Intent recognition and emotional text-to-speech learning system
CN108520275A (en) * 2017-06-28 2018-09-11 浙江大学 A kind of regular system of link information based on adjacency matrix, figure Feature Extraction System, figure categorizing system and method
CN112015872A (en) * 2019-05-29 2020-12-01 华为技术有限公司 Question recognition method and device
CN112489635A (en) * 2020-12-03 2021-03-12 杭州电子科技大学 Multi-mode emotion recognition method based on attention enhancement mechanism
CN112735404A (en) * 2020-12-18 2021-04-30 平安科技(深圳)有限公司 Ironic detection method, system, terminal device and storage medium
CN112818861A (en) * 2021-02-02 2021-05-18 南京邮电大学 Emotion classification method and system based on multi-mode context semantic features
CN112948541A (en) * 2021-02-01 2021-06-11 华南理工大学 Financial news text emotional tendency analysis method based on graph convolution network
CN113112994A (en) * 2021-04-21 2021-07-13 江苏师范大学 Cross-corpus emotion recognition method based on graph convolution neural network
CN113223560A (en) * 2021-04-23 2021-08-06 平安科技(深圳)有限公司 Emotion recognition method, device, equipment and storage medium
CN113378160A (en) * 2021-06-11 2021-09-10 浙江工业大学 Graph neural network model defense method and device based on generative confrontation network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fei Tao et al., "An ensemble framework of voice-based emotion recognition system for films and TV programs," 2018 ICASSP *

Also Published As

Publication number Publication date
CN113990353A (en) 2022-01-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant