CN113990353A - Method for recognizing emotion, method, device and equipment for training emotion recognition model - Google Patents

Method for recognizing emotion, method, device and equipment for training emotion recognition model

Info

Publication number
CN113990353A
CN113990353A (application CN202111259447.7A)
Authority
CN
China
Prior art keywords
feature
content
audio
graph convolution
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111259447.7A
Other languages
Chinese (zh)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111259447.7A priority Critical patent/CN113990353A/en
Publication of CN113990353A publication Critical patent/CN113990353A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The disclosure provides a method for recognizing emotion, which relates to the field of artificial intelligence and in particular to the field of deep learning. A specific implementation scheme is as follows: acquiring a first content feature and a first audio feature of target data; inputting the first content feature into a first feature extraction model to obtain a second content feature; inputting the first audio feature into the first feature extraction model to obtain a second audio feature; and identifying an emotion of the target object corresponding to the target data according to the second content feature and the second audio feature. The disclosure also provides a method and an apparatus for training an emotion recognition model, an electronic device, and a storage medium.

Description

Method for recognizing emotion, method, device and equipment for training emotion recognition model
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to deep learning techniques. More particularly, the present disclosure provides a method of recognizing emotion, a method of training an emotion recognition model, an apparatus, an electronic device, and a storage medium.
Background
Speech is an important carrier of emotion in human communication. People use different language expressions in different emotional states. For example, sentences with identical content can carry different emotions and thereby express completely different meanings.
Disclosure of Invention
The present disclosure provides a method of recognizing emotion, a method of training an emotion recognition model, an apparatus, a device, and a storage medium.
According to a first aspect, there is provided a method of identifying an emotion, the method comprising: acquiring a first content characteristic and a first audio characteristic of target data; inputting the first content characteristics into a first characteristic extraction model to obtain second content characteristics; inputting the first audio features into a first feature extraction model to obtain second audio features; and identifying the emotion of the target object corresponding to the target data according to the second content feature and the second audio feature.
According to a second aspect, there is provided a method of training an emotion recognition model, the emotion recognition model including a first feature extraction model, the method comprising: acquiring a first content characteristic and a first audio characteristic of sample data; inputting the first content characteristics into a first characteristic extraction model to obtain second content characteristics; inputting the first audio features into a first feature extraction model to obtain second audio features; recognizing the emotion of the sample object corresponding to the sample data according to the second content feature and the second audio feature; obtaining a loss value according to the emotion of the sample object and the label of the sample data; and training the emotion recognition model according to the loss value.
According to a third aspect, there is provided an apparatus for recognizing emotion, the apparatus comprising: the first acquisition module is used for acquiring a first content characteristic and a first audio characteristic of the target data; the first obtaining module is used for inputting the first content characteristics into a first characteristic extraction model to obtain second content characteristics; the second obtaining module is used for inputting the first audio features into the first feature extraction model to obtain second audio features; and the first identification module is used for identifying the emotion of the target object corresponding to the target data according to the second content characteristic and the second audio characteristic.
According to a fourth aspect, there is provided an apparatus for training an emotion recognition model, the emotion recognition model including a first feature extraction model, the apparatus comprising: the second acquisition module is used for acquiring a first content characteristic and a first audio characteristic of the sample data; a third obtaining module, configured to input the first content feature into a first feature extraction model to obtain a second content feature; a fourth obtaining module, configured to input the first audio feature into the first feature extraction model to obtain a second audio feature; a second identification module, configured to identify an emotion of the sample object corresponding to the sample data according to the second content feature and the second audio feature; a fifth obtaining module, configured to obtain a loss value according to the emotion of the sample object and the label of the sample data; and the training module is used for training the emotion recognition model according to the loss value.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided in accordance with the present disclosure.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a method of identifying emotions according to one embodiment of the present disclosure;
FIG. 2A is a schematic diagram of a chain graph structure according to one embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a line graph structure according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a method of recognizing emotion according to one embodiment of the present disclosure;
FIG. 4 is a flow diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure;
FIG. 6 is a block diagram of an apparatus to recognize emotions according to one embodiment of the present disclosure;
FIG. 7 is a block diagram of an apparatus for training emotion recognition models, according to one embodiment of the present disclosure; and
FIG. 8 is a block diagram of an electronic device to which a method of recognizing emotion and/or a method of training an emotion recognition model may be applied, according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Language expression patterns differ across emotional states. For example, when the emotion is happy, the tone of speech is cheerful. For another example, when the emotion is upset or sad, the tone of speech tends to be low and dull.
Deep learning techniques have accelerated the development of emotion recognition from speech, but research in this area remains insufficient. For example, different speakers may express different emotions with utterances of the same content, and the related art has difficulty distinguishing these emotions.
At present, in order to improve the effect of an emotion recognition model, front-end feature extraction, for example MFCC (Mel-Frequency Cepstral Coefficient) features of speech, may be optimized to improve the accuracy of emotion recognition. For another example, the feature dimension may be increased, such as from 40 dimensions to 80 dimensions. However, optimizing front-end feature extraction alone cannot significantly improve the accuracy of emotion recognition.
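To make the front-end feature extraction discussed above concrete, the following sketch computes MFCC features at 40 and 80 dimensions. It is an illustration added here, not part of the patent; librosa, the file name, and the parameter values are assumptions.

```python
# Hedged illustration of front-end MFCC extraction (not part of the patent).
# "speech.wav" and the parameter values are placeholders.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)            # waveform and sample rate
mfcc_40 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)   # 40-dimensional MFCC features
mfcc_80 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=80)   # increased to 80 dimensions
print(mfcc_40.shape, mfcc_80.shape)                     # (40, num_frames), (80, num_frames)
```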
Fig. 1 is a flow diagram of a method of recognizing emotion according to one embodiment of the present disclosure.
As shown in fig. 1, the method 100 may include operations S110 to S140.
In operation S110, a first content feature and a first audio feature of target data are acquired.
In the disclosed embodiment, the target data may be voice data.
For example, the target data may be a segment of speech originating from the target object.
In the disclosed embodiment, the target data may be voice data in video data.
For example, video data of the target object may be acquired. The target data may be voice data extracted from the video data. In one example, video data for a target object may be captured, with an audio stream in the video data as the target data.
In the embodiment of the present disclosure, the target data may be input into the second feature extraction model, so as to obtain text information and time information of the target data.
For example, the second feature extraction model may include a forced alignment submodel. In some examples, the forced alignment submodel may be a GMM-HMM (Gaussian Mixture Model-Hidden Markov Model), an LSTM-CTC (Long Short-Term Memory with Connectionist Temporal Classification), or a chain model.

For example, the second feature extraction model may be a pre-trained model. In some examples, the second feature extraction model may be pre-trained using open-source datasets such as AISHELL or LibriSpeech as training samples.
For example, the text information may include information such as phonemes, words, and the like in the target data.
For example, the time information may include time stamps of occurrences of phonemes, words. In one example, the time information includes a start time of occurrence of a phoneme and a duration of the phoneme.
In the embodiment of the present disclosure, the first content feature may be obtained according to the text information.
For example, the second feature extraction model further includes a content feature generation submodel. The text information may be input into the content feature generation submodel to obtain the first content feature. In one example, one or more of the phonemes and words of the target data may be input into the content feature generation submodel to obtain the first content feature. The content feature generation submodel may be a convolutional neural network model.
In the embodiment of the present disclosure, the first audio feature may be obtained according to the text information and the time information.
For example, the forced alignment submodel may output the first audio feature according to the text information and the time information.
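As a rough illustration of operation S110, the sketch below shows how text information (phonemes or words) and time information (start time and duration) from a forced-alignment submodel could be organized before feeding the content-feature and audio-feature paths. `forced_align`, `AlignedUnit`, and the returned values are hypothetical placeholders, not an API defined by the patent.

```python
# Hedged sketch of operation S110; forced_align is a hypothetical stand-in for
# the forced-alignment submodel (e.g. a GMM-HMM or chain model).
from dataclasses import dataclass
from typing import List

@dataclass
class AlignedUnit:
    unit: str        # text information: a phoneme or word
    start: float     # time information: start time in seconds
    duration: float  # time information: duration in seconds

def forced_align(waveform) -> List[AlignedUnit]:
    """Placeholder returning a fixed alignment so the sketch runs."""
    return [AlignedUnit("h", 0.00, 0.08), AlignedUnit("ay", 0.08, 0.15)]

units = forced_align(waveform=None)
text_info = [u.unit for u in units]                  # input to the content feature submodel
time_info = [(u.start, u.duration) for u in units]   # used with text_info for the first audio feature
```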
In operation S120, the first content feature is input into the first feature extraction model, and a second content feature is obtained.
In an embodiment of the present disclosure, the first feature extraction model may include a graph convolution sub-model.
For example, the graph convolution submodel may be a GCN (Graph Convolutional Network) model. The graph structure adopted by the GCN model can be an undirected graph structure.
For example, the graph structure adopted by the graph convolution sub-model is a chain graph structure, and the first adjacency matrix corresponding to the chain graph structure is:
$$A_C=\begin{pmatrix} 0 & a & & & & a\\ a & 0 & a & & & \\ & a & 0 & \ddots & & \\ & & \ddots & \ddots & a & \\ & & & a & 0 & a\\ a & & & & a & 0 \end{pmatrix}$$

where all unmarked entries are 0.

For example, $A_C$ is the first adjacency matrix, and a is a real number greater than 0.

For example, the first adjacency matrix is an N × N matrix, N is a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.

For example, each row vector of the first adjacency matrix includes two non-zero entries (namely a). In one example, a = 1.
For example, the graph structure adopted by the graph convolution sub-model is a line graph structure, and the second adjacency matrix corresponding to the line graph structure is:
$$A_L=\begin{pmatrix} 0 & b & & & & \\ b & 0 & b & & & \\ & b & 0 & \ddots & & \\ & & \ddots & \ddots & b & \\ & & & b & 0 & b\\ & & & & b & 0 \end{pmatrix}$$

where all unmarked entries are 0.

For example, $A_L$ is the second adjacency matrix, and b is a real number greater than 0.

For example, the second adjacency matrix is an M × M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.

For example, the first row vector of the second adjacency matrix includes one non-zero entry (namely b), and the last row vector of the second adjacency matrix includes one non-zero entry (namely b). The second through (M-1)-th row vectors of the second adjacency matrix include two non-zero entries. In one example, b = 1.
The graph structure of the graph convolution submodel includes a plurality of nodes, and the relationship between adjacent nodes is far more important than the relationship between non-adjacent nodes. By adopting a graph convolution submodel with a chain graph structure or a line graph structure, the relationships between adjacent nodes can be learned, so that the amount of computation of the graph convolution submodel can be reduced while the accuracy of emotion recognition is guaranteed.
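The two adjacency matrices can be built directly from the descriptions above. The NumPy sketch below is an added illustration, with a = b = 1 and node counts taken from the examples: the chain-graph matrix A_C is a ring and the line-graph matrix A_L is a path.

```python
# Sketch of the chain-graph and line-graph adjacency matrices (a = b = 1).
import numpy as np

def chain_adjacency(n: int, a: float = 1.0) -> np.ndarray:
    """Chain graph: each node connects to its two neighbours; node n wraps to node 1."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = a
        A[i, (i - 1) % n] = a
    return A  # every row holds exactly two non-zero entries

def line_adjacency(m: int, b: float = 1.0) -> np.ndarray:
    """Line graph: a path, so the last node does not connect back to the first."""
    A = np.zeros((m, m))
    for j in range(m - 1):
        A[j, j + 1] = b
        A[j + 1, j] = b
    return A  # first and last rows hold one non-zero entry, the others two

A_C = chain_adjacency(16)    # e.g. 16 nodes for the content branch
A_L = line_adjacency(120)    # e.g. 120 nodes for the audio branch
```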
In an embodiment of the present disclosure, the graph convolution submodel may include a first graph convolution network, which may include H first graph convolution layers.
In an embodiment of the present disclosure, the first content feature may be input into the 1 st first graph convolution layer, resulting in the 1 st first intermediate feature.
For example, the 1st first intermediate feature may be obtained using Formula Two:

$$X_c^{(1)} = U\, g_{\theta_1}\, U^{T} X_c^{(0)} \qquad \text{(Formula Two)}$$

where $X_c^{(1)}$ is the 1st first intermediate feature, U is the eigenvector matrix of the normalized graph Laplacian matrix $\tilde{L}$, $U^{T}$ is the transpose of U, $X_c^{(0)}$ is the first content feature, and $g_{\theta_1}$ is the parameter of the 1st first graph convolution layer.
The graph Laplacian matrix $\tilde{L}$ can be obtained by Formula Three:

$$\tilde{L} = I - D^{-1/2} A D^{-1/2} \qquad \text{(Formula Three)}$$

wherein A is an adjacency matrix, e.g., $A_C$ or $A_L$ described above, and D is the degree matrix.

The graph Laplacian matrix $\tilde{L}$ can be decomposed by eigenvalue decomposition according to Formula Four:

$$\tilde{L} = U \Lambda U^{T} \qquad \text{(Formula Four)}$$

where $\lambda_g$ is the g-th eigenvalue, corresponding to the eigenvector $u_g$, $U = [u_1, u_2, \ldots, u_G]$, and $\Lambda = \mathrm{diag}(\lambda_g)$.
In some examples, when A in Formula Three is $A_C$, the graph Laplacian matrix $\tilde{L}$ is a circulant matrix, and the corresponding graph Fourier transform is the discrete Fourier transform. Accordingly, $\tilde{L}$ is an N × N matrix.

In some examples, when A in Formula Three is $A_L$, the corresponding graph Fourier transform is the discrete cosine transform. Accordingly, $\tilde{L}$ is an M × M matrix.
In the disclosed embodiment, the h-th first intermediate feature may be input into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature.

For example, h = 1, …, H-1.
For example, the (h+1)-th first intermediate feature can be obtained by Formula Five:

$$X_c^{(h+1)} = U\, g_{\theta_{h+1}}\, U^{T} X_c^{(h)} \qquad \text{(Formula Five)}$$

where $X_c^{(h+1)}$ is the (h+1)-th first intermediate feature, $X_c^{(h)}$ is the h-th first intermediate feature, and $g_{\theta_{h+1}}$ is the parameter of the (h+1)-th first graph convolution layer.
In the embodiment of the present disclosure, the second content feature may be obtained according to the H first intermediate features.
For example, the H first intermediate features may be input into the first pooling layer and pooled to obtain the second content feature:

$$C = \mathrm{Pool}\left(X_c^{(1)}, X_c^{(2)}, \ldots, X_c^{(H)}\right)$$

where C is the second content feature, $X_c^{(h)}$ is the h-th first intermediate feature, and Pool denotes the pooling operation of the first pooling layer.
In one example, the first graph convolution network employs a graph structure that includes 16 nodes.
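The layer-by-layer computation and the pooling over the H intermediate features could look like the PyTorch sketch below. The diagonal spectral filter, the mean pooling, and PyTorch itself are assumptions of this illustration rather than details fixed by the patent.

```python
# Hedged sketch of the first graph convolution network: H spectral layers
# followed by pooling over the H intermediate features (PyTorch assumed).
import torch
import torch.nn as nn

class SpectralGraphConv(nn.Module):
    """One graph convolution layer: x -> U g_theta U^T x (diagonal filter assumed)."""
    def __init__(self, num_nodes: int):
        super().__init__()
        self.theta = nn.Parameter(torch.ones(num_nodes))

    def forward(self, x: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
        return U @ torch.diag(self.theta) @ U.T @ x

class FirstGraphConvNetwork(nn.Module):
    """H stacked layers; the h-th output is the input of the (h+1)-th layer."""
    def __init__(self, num_nodes: int = 16, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([SpectralGraphConv(num_nodes) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
        intermediates = []
        for layer in self.layers:
            x = layer(x, U)
            intermediates.append(x)                 # keep all H intermediate features
        return torch.stack(intermediates).mean(0)   # pooling (mean is an assumption)

# Usage with a 16-node graph whose eigenvector matrix U has been precomputed:
U, _ = torch.linalg.qr(torch.randn(16, 16))          # placeholder orthogonal matrix
net = FirstGraphConvNetwork()
second_content_feature = net(torch.randn(16, 8), U)  # 8-dim feature per node (illustrative)
```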
In operation S130, the first audio feature is input into the first feature extraction model to obtain a second audio feature.
In an embodiment of the present disclosure, the first feature extraction model may include a graph convolution sub-model.
For a detailed description of the first feature extraction model, reference may be made to the description in operation S120; details are not repeated here.
In an embodiment of the disclosure, the graph convolution submodel includes a second graph convolution network, and the second graph convolution network includes K second graph convolution layers.
In an embodiment of the present disclosure, the first audio feature may be input into the 1 st second graph convolution layer, resulting in the 1 st second intermediate feature.
For example, the 1st second intermediate feature can be obtained using Formula Seven:

$$X_a^{(1)} = U\, g_{\theta'_1}\, U^{T} X_a^{(0)} \qquad \text{(Formula Seven)}$$

where $X_a^{(1)}$ is the 1st second intermediate feature, U is the eigenvector matrix of the normalized graph Laplacian matrix $\tilde{L}$, $U^{T}$ is the transpose of U, $X_a^{(0)}$ is the first audio feature, and $g_{\theta'_1}$ is the parameter of the 1st second graph convolution layer.

For the graph Laplacian matrix $\tilde{L}$ and the eigenvector matrix of the normalized graph Laplacian matrix, reference may be made to Formula Three and Formula Four described above; details are not repeated here.
In the disclosed embodiment, the kth second intermediate feature may be input into the (k + 1) th second graph convolution layer to obtain a (k + 1) th second intermediate feature.
For example, k = 1, …, K-1.
For example, the (k+1)-th second intermediate feature can be obtained by Formula Eight:

$$X_a^{(k+1)} = U\, g_{\theta'_{k+1}}\, U^{T} X_a^{(k)} \qquad \text{(Formula Eight)}$$

where $X_a^{(k+1)}$ is the (k+1)-th second intermediate feature, $X_a^{(k)}$ is the k-th second intermediate feature, and $g_{\theta'_{k+1}}$ is the parameter of the (k+1)-th second graph convolution layer.
In the embodiment of the present disclosure, the second audio feature may be obtained according to the K second intermediate features.
For example, the K second intermediate features may be input into the second pooling layer and pooled to obtain the second audio feature:

$$\mathrm{Audio} = \mathrm{Pool}\left(X_a^{(1)}, X_a^{(2)}, \ldots, X_a^{(K)}\right)$$

where Audio is the second audio feature, $X_a^{(k)}$ is the k-th second intermediate feature, and Pool denotes the pooling operation of the second pooling layer.
In one example, the graph structure employed by the second graph convolution network contains 120 nodes.
In some examples, H may be equal to K. I.e. the first and second graph convolution networks may have the same number of graph convolution layers.
It should be noted that the graph structure adopted by the graph convolution sub-model may be a graph structure adopted by the first graph convolution network and/or the second graph convolution network.
It should be noted that, if the first graph convolution network adopts a chain graph structure, the graph structure adopted by one or more first graph convolution layers in the H first graph convolution layers may be a chain graph structure.
It should be noted that, if the second graph convolution network adopts a chain graph structure, the graph structure adopted by one or more second graph convolution layers in the K second graph convolution layers may be a chain graph structure.
It should be noted that the first graph convolution network and the second graph convolution network may have the same parameters except for the nodes of the graph structure.
It should be noted that the first graph convolution network and the second graph convolution network may both adopt a chain graph structure. Alternatively, the first and second graph convolution networks may both employ a line graph structure. Alternatively, the first graph convolution network may adopt a chain graph structure, and the second graph convolution network adopts a line graph structure. Alternatively, the first graph convolution network may employ a line graph structure, and the second graph convolution network may employ a chain graph structure.
In operation S140, an emotion of the target object corresponding to the target data is identified according to the second content feature and the second audio feature.
In the embodiment of the present disclosure, a fusion operation may be performed on the second content feature and the second audio feature, resulting in a fusion feature.
For example, the second content feature and the second audio feature may be concatenated to obtain the fusion feature.
In the embodiment of the present disclosure, the emotion of the target object may be recognized according to the fusion feature.
For example, the fusion feature may be input into a fully connected layer to identify the emotion of the target object. The emotion may be, for example, happy or sad.
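Operation S140 can be sketched as concatenating the two features and passing them through a fully connected layer. The dimensions and the emotion set below are illustrative assumptions, not values given by the patent.

```python
# Hedged sketch of operation S140: feature fusion by concatenation and a
# fully connected emotion classifier (PyTorch assumed).
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, content_dim: int = 64, audio_dim: int = 64, num_emotions: int = 4):
        super().__init__()
        self.fc = nn.Linear(content_dim + audio_dim, num_emotions)

    def forward(self, second_content_feature, second_audio_feature):
        fused = torch.cat([second_content_feature, second_audio_feature], dim=-1)  # fusion feature
        return self.fc(fused)                        # emotion logits

head = EmotionHead()
logits = head(torch.randn(1, 64), torch.randn(1, 64))
predicted_emotion = logits.argmax(dim=-1)            # e.g. index into {happy, sad, ...}
```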
Through the embodiments of the disclosure, the emotion corresponding to an audio frame is determined by taking into account both the sequential relationship between audio frames and the association between an audio frame and its content, which improves the accuracy of emotion recognition. By adopting a chain graph structure or a line graph structure, the amount of computation can be reduced while the accuracy of emotion recognition is further improved.
Fig. 2A is a schematic diagram of a chain graph structure according to one embodiment of the present disclosure.
As shown in FIG. 2A, the chain graph structure 201 comprises N nodes, wherein the N-th node $V_N$ is connected with the 1st node. In one example, N = 120. In another example, N = 16.
Fig. 2B is a schematic diagram of a line graph structure according to one embodiment of the present disclosure.
As shown in FIG. 2B, the line graph structure 202 comprises M nodes, wherein the M-th node $V'_M$ is not connected with the 1st node. In one example, M = 120. In another example, M = 16.
Fig. 3 is a schematic diagram of a method of recognizing emotion according to one embodiment of the present disclosure.
As shown in fig. 3, the input of the second feature extraction model 302 is target data 301, and the first content feature and the first audio feature are output.
The first feature extraction model comprises a graph convolution sub-model, which may comprise a first graph convolution network 303 and a second graph convolution network 305. The first feature extraction model may also include a first pooling layer 304 and a second pooling layer 306.
The input to the first graph convolution network 303 is a first content characteristic and the output is a second content characteristic. The first graph convolution network 303 includes H first graph convolution layers. The input to the 1 st first graph convolution layer 3031 is the first content feature, outputting the 1 st first intermediate feature. The 1 st first intermediate feature serves as an input to the 2 nd first map convolutional layer. The input to the h-th first graph convolution layer 3032 is the h-1 st first intermediate feature, outputting the h-th first intermediate feature. The input to the H first graph convolutional layer 3033 is the H-1 st first intermediate feature, and the H first intermediate feature is output. The input to the first pooling layer 304 is the H first intermediate features and the second content feature is output. In one example, the first graph convolution network 303 employs a graph structure that includes 16 nodes.
The input to the second graph convolution network 305 is the first audio feature and the output is the second audio feature. The second graph convolution network 305 includes K second graph convolution layers. The input of the 1st second graph convolution layer 3051 is the first audio feature, and the 1st second intermediate feature is output. The 1st second intermediate feature serves as an input to the 2nd second graph convolution layer. The input to the k-th second graph convolution layer 3052 is the (k-1)-th second intermediate feature, and the k-th second intermediate feature is output. The input to the K-th second graph convolution layer 3053 is the (K-1)-th second intermediate feature, and the K-th second intermediate feature is output. The inputs to the second pooling layer 306 are the K second intermediate features, and the second audio feature is output. In one example, the second graph convolution network 305 employs a graph structure that includes 120 nodes.
The inputs to the fusion model 307 are the second content feature and the second audio feature, and the fusion feature is output. The fusion model 307 may concatenate the second content feature and the second audio feature.

The classification model 308 may include one or several fully connected layers. The input to the classification model 308 is the fusion feature, and the output is the emotion of the target object corresponding to the target data 301.
FIG. 4 is a flow diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure.
As shown in fig. 4, the method 400 may include operations S410 to S460.
In operation S410, a first content feature and a first audio feature of sample data are acquired.
In an embodiment of the present disclosure, the emotion recognition model may include a second feature extraction model.
In the embodiment of the present disclosure, the sample data may be input into the second feature extraction model, and text information and time information of the sample data may be obtained.
In the embodiment of the present disclosure, the first content feature may be obtained according to the text information.
In the embodiment of the present disclosure, the first audio feature may be obtained according to the text information and the time information.
For the embodiment of operation S410, reference may be made to the above-mentioned embodiment of operation S110, and the disclosure is not repeated herein.
In operation S420, the first content feature is input into the first feature extraction model, and a second content feature is obtained.
In this disclosure, the first feature extraction model may include a graph convolution sub-model, and a graph structure adopted by the graph convolution sub-model may be a chain graph structure.
For example, the first adjacency matrix corresponding to the chain graph structure is:
$$A_C=\begin{pmatrix} 0 & a & & & & a\\ a & 0 & a & & & \\ & a & 0 & \ddots & & \\ & & \ddots & \ddots & a & \\ & & & a & 0 & a\\ a & & & & a & 0 \end{pmatrix}$$

For example, $A_C$ is the first adjacency matrix, and a is a real number greater than 0.

For example, the first adjacency matrix is an N × N matrix, N being a positive integer greater than 2; the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
In the embodiment of the present disclosure, the first feature extraction model includes a graph convolution sub-model, and a graph structure adopted by the graph convolution sub-model is a line graph structure.
For example, the second adjacency matrix corresponding to the line graph structure is:
$$A_L=\begin{pmatrix} 0 & b & & & & \\ b & 0 & b & & & \\ & b & 0 & \ddots & & \\ & & \ddots & \ddots & b & \\ & & & b & 0 & b\\ & & & & b & 0 \end{pmatrix}$$

For example, $A_L$ is the second adjacency matrix, and b is a real number greater than 0.

For example, the second adjacency matrix is an M × M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
In an embodiment of the present disclosure, the graph convolution submodel may include a first graph convolution network, which may include H first graph convolution layers.
In an embodiment of the present disclosure, the first content feature may be input into the 1 st first graph convolution layer, resulting in the 1 st first intermediate feature.
In the embodiment of the present disclosure, the h-th first intermediate feature may be input into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature, h = 1, …, H-1.
In the embodiment of the present disclosure, the second content feature may be obtained according to the H first intermediate features.
For the embodiment of operation S420, reference may be made to the above-mentioned embodiment of operation S120, and the disclosure is not repeated herein.
In operation S430, the first audio feature is input into the first feature extraction model, and a second audio feature is obtained.
In an embodiment of the disclosure, the graph convolution submodel may include a second graph convolution network, which may include K second graph convolution layers.
In an embodiment of the present disclosure, the first audio feature may be input into the 1 st second graph convolution layer, resulting in the 1 st second intermediate feature.
In the embodiment of the present disclosure, the k-th second intermediate feature may be input into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature, k = 1, …, K-1.
In the embodiment of the present disclosure, the second audio feature may be obtained according to the K second intermediate features.
For the embodiment of operation S430, reference may be made to the above-mentioned embodiment of operation S130, and the disclosure is not repeated herein.
In operation S440, an emotion of the sample object corresponding to the sample data is identified according to the second content feature and the second audio feature.
In the disclosed embodiment, the emotion recognition model may include a fusion model and a classification model.
In an embodiment of the present disclosure, the second content feature and the second audio feature may be input into a fusion model to obtain a fusion feature.
In the disclosed embodiment, the fused features may be input into a classification model to identify the emotion of the sample object.
For the embodiment of operation S440, reference may be made to the above-mentioned embodiment of operation S140, and the disclosure is not repeated herein.
In operation S450, a loss value is obtained according to the emotion of the sample object and the label of the sample data.
For example, a loss value may be derived using a cross-entropy loss function based on the emotion of the sample object and the label of the sample data.
In operation S460, the emotion recognition model is trained according to the loss value.
In the embodiment of the present disclosure, parameters of the first feature extraction model may be adjusted according to the loss value to train the emotion recognition model.
For example, based on the loss value, the parameter $g_{\theta_1}$ in Formula Two may be adjusted. The parameter $g_{\theta_{h+1}}$ in Formula Five may also be adjusted. The parameter $g_{\theta'_1}$ in Formula Seven may also be adjusted. The parameter $g_{\theta'_{k+1}}$ in Formula Eight may also be adjusted.
For example, the number of nodes of the graph structure employed by the graph convolution sub-model may be adjusted based on the loss value. In this way, a model that accurately identifies the emotion of the target object can be obtained.
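Operations S450 and S460 amount to a standard supervised update: compute a cross-entropy loss between the predicted emotion and the sample label, then back-propagate to adjust the model parameters. The sketch below uses a stand-in linear model so it runs on its own; all names, shapes, and label values are illustrative.

```python
# Hedged sketch of operations S450-S460 (PyTorch assumed). nn.Linear stands in
# for the emotion recognition model; features and labels are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(128, 4)                        # stand-in for the emotion recognition model
criterion = nn.CrossEntropyLoss()                # cross-entropy loss, as mentioned above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

fused_features = torch.randn(8, 128)             # placeholder fused features of 8 samples
labels = torch.randint(0, 4, (8,))               # placeholder emotion labels

logits = model(fused_features)                   # emotions of the sample objects
loss = criterion(logits, labels)                 # loss value (operation S450)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                 # adjust parameters to train the model (operation S460)
```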
FIG. 5 is a schematic diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure.
As shown in fig. 5, the emotion recognition model may include, for example, a first feature extraction model, a second feature extraction model 302, a fusion model 307, and a classification model 308. The first feature extraction model may comprise a graph convolution sub-model, which may comprise a first graph convolution network 303 and a second graph convolution network 305. The first feature extraction model may also include a first pooling layer 304 and a second pooling layer 306.
For example, the processing method of the target data 301 described in fig. 3 may be referred to as the processing method of the sample data 501, and details of the present disclosure are not repeated herein.
The emotion recognition model processes the sample data 501 and outputs the emotion of the sample object corresponding to the sample data. From the emotion of the sample object and the label of the sample data, a loss value can be derived. Parameters of the first graph convolution network 303 and the second graph convolution network 305 may be adjusted according to the loss value to train the emotion recognition model.
Fig. 6 is a block diagram of an apparatus for recognizing emotion according to one embodiment of the present disclosure.
As shown in fig. 6, the apparatus 600 may include a first obtaining module 610, a first obtaining module 620, a second obtaining module 630, and a first identifying module 640.
The first obtaining module 610 is configured to obtain a first content feature and a first audio feature of the target data.
The first obtaining module 620 is configured to input the first content feature into the first feature extraction model to obtain a second content feature.
The second obtaining module 630 is configured to input the first audio feature into the first feature extraction model to obtain a second audio feature.
The first identifying module 640 is configured to identify an emotion of the target object corresponding to the target data according to the second content feature and the second audio feature.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph structure adopted by the graph convolution sub-model is a chain graph structure, and the first adjacency matrix corresponding to the chain graph structure is:
$$A_C=\begin{pmatrix} 0 & a & & & & a\\ a & 0 & a & & & \\ & a & 0 & \ddots & & \\ & & \ddots & \ddots & a & \\ & & & a & 0 & a\\ a & & & & a & 0 \end{pmatrix}$$

wherein $A_C$ is the first adjacency matrix, and a is a real number greater than 0; the first adjacency matrix is an N × N matrix, N is a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph convolution sub-model uses a graph structure of a line graph structure, and the second adjacency matrix corresponding to the line graph structure is:
$$A_L=\begin{pmatrix} 0 & b & & & & \\ b & 0 & b & & & \\ & b & 0 & \ddots & & \\ & & \ddots & \ddots & b & \\ & & & b & 0 & b\\ & & & & b & 0 \end{pmatrix}$$

wherein $A_L$ is the second adjacency matrix, and b is a real number greater than 0; the second adjacency matrix is an M × M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
In some embodiments, the first obtaining module includes: a first obtaining unit, configured to input the target data into a second feature extraction model to obtain text information and time information of the target data; a second obtaining unit, configured to obtain the first content feature according to the text information; a third obtaining unit, configured to obtain the first audio feature according to the text information and the time information.
In some embodiments, the graph convolution submodel includes a first graph convolution network including H first graph convolution layers, and the first obtaining module includes: a fourth obtaining unit, configured to input the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature; a fifth obtaining unit, configured to input the h-th first intermediate feature into the (h+1)-th first graph convolution layer, so as to obtain the (h+1)-th first intermediate feature, where h = 1, …, H-1; and a sixth obtaining unit, configured to obtain the second content feature according to the H first intermediate features.

In some embodiments, the graph convolution submodel includes a second graph convolution network including K second graph convolution layers, and the second obtaining module includes: a seventh obtaining unit, configured to input the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature; an eighth obtaining unit, configured to input the k-th second intermediate feature into the (k+1)-th second graph convolution layer, so as to obtain the (k+1)-th second intermediate feature, where k = 1, …, K-1; and a ninth obtaining unit, configured to obtain the second audio feature according to the K second intermediate features.
In some embodiments, the first identification module comprises: the first fusion unit is used for executing fusion operation on the second content characteristic and the second audio characteristic to obtain a fusion characteristic; and the first identification unit is used for identifying the emotion of the target object according to the fusion characteristics.
Fig. 7 is a block diagram of an apparatus for training an emotion recognition model according to one embodiment of the present disclosure.
As shown in fig. 7, the apparatus 700 includes a second obtaining module 710, a third obtaining module 720, a fourth obtaining module 730, a second identifying module 740, a fifth obtaining module 750, and a training module 760. The emotion recognition model includes a first feature extraction model.
The second obtaining module 710 is configured to obtain a first content feature and a first audio feature of the sample data.
The third obtaining module 720 is configured to input the first content feature into the first feature extraction model to obtain a second content feature.
The fourth obtaining module 730 is configured to input the first audio feature into the first feature extraction model to obtain a second audio feature.
The second identifying module 740 is configured to identify an emotion of the sample object corresponding to the sample data according to the second content feature and the second audio feature.
A fifth obtaining module 750, configured to obtain a loss value according to the emotion of the sample object and the label of the sample data.
And the training module 760 is configured to train the emotion recognition model according to the loss value.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph structure adopted by the graph convolution sub-model is a chain graph structure, and the first adjacency matrix corresponding to the chain graph structure is:
$$A_C=\begin{pmatrix} 0 & a & & & & a\\ a & 0 & a & & & \\ & a & 0 & \ddots & & \\ & & \ddots & \ddots & a & \\ & & & a & 0 & a\\ a & & & & a & 0 \end{pmatrix}$$

wherein $A_C$ is the first adjacency matrix, and a is a real number greater than 0; the first adjacency matrix is an N × N matrix, N being a positive integer greater than 2; the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph convolution sub-model uses a graph structure of a line graph structure, and the second adjacency matrix corresponding to the line graph structure is:
$$A_L=\begin{pmatrix} 0 & b & & & & \\ b & 0 & b & & & \\ & b & 0 & \ddots & & \\ & & \ddots & \ddots & b & \\ & & & b & 0 & b\\ & & & & b & 0 \end{pmatrix}$$

wherein $A_L$ is the second adjacency matrix, and b is a real number greater than 0; the second adjacency matrix is an M × M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
In some embodiments, the emotion recognition model includes a second feature extraction model, and the second obtaining module includes: a tenth obtaining unit, configured to input the sample data into a second feature extraction model, so as to obtain text information and time information of the sample data; an eleventh obtaining unit, configured to obtain the first content feature according to the text information; a twelfth obtaining unit, configured to obtain the first audio feature according to the text information and the time information.
In some embodiments, the graph convolution submodel includes a first graph convolution network including H first graph convolution layers, and the third obtaining module includes: a thirteenth obtaining unit, configured to input the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature; a fourteenth obtaining unit, configured to input the h-th first intermediate feature into the (h+1)-th first graph convolution layer, so as to obtain the (h+1)-th first intermediate feature, where h = 1, …, H-1; and a fifteenth obtaining unit, configured to obtain the second content feature according to the H first intermediate features.

In some embodiments, the graph convolution submodel includes a second graph convolution network including K second graph convolution layers, and the fourth obtaining module includes: a sixteenth obtaining unit, configured to input the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature; a seventeenth obtaining unit, configured to input the k-th second intermediate feature into the (k+1)-th second graph convolution layer, so as to obtain the (k+1)-th second intermediate feature, where k = 1, …, K-1; and an eighteenth obtaining unit, configured to obtain the second audio feature according to the K second intermediate features.
In some embodiments, the emotion recognition model includes a fusion model and a classification model, and the second recognition module includes: the second fusion unit is used for inputting the second content characteristic and the second audio characteristic into the fusion model to obtain a fusion characteristic; and a second recognition unit for inputting the fusion feature into the classification model to recognize the emotion of the sample object.
In some embodiments, the training module is further configured to adjust parameters of the first feature extraction model according to the loss value to train the emotion recognition model.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 801 performs the various methods and processes described above, such as a method of recognizing emotion and/or a method of training an emotion recognition model. For example, in some embodiments, the method of recognizing emotion and/or the method of training an emotion recognition model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM802 and/or communications unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the above described method of recognizing emotion and/or method of training an emotion recognition model may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of recognizing emotions and/or the method of training an emotion recognition model.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (20)

1. A method of recognizing emotion, comprising:
acquiring a first content feature and a first audio feature of target data;
inputting the first content feature into a first feature extraction model to obtain a second content feature;
inputting the first audio feature into the first feature extraction model to obtain a second audio feature; and
identifying, according to the second content feature and the second audio feature, an emotion of a target object corresponding to the target data.
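By way of a non-limiting illustration of claim 1, the sketch below wires the two branches through a single shared "first feature extraction model" and hands the two refined features to a classifier. The module and attribute names (EmotionRecognizer, first_feature_extractor, classifier) are assumptions introduced here for readability; they do not appear in the patent.

```python
import torch
import torch.nn as nn

class EmotionRecognizer(nn.Module):
    """Illustrative wiring of claim 1: one shared first feature extraction model
    refines the first content feature and the first audio feature, and a
    classifier predicts the emotion from the two refined features."""

    def __init__(self, first_feature_extractor: nn.Module, classifier: nn.Module):
        super().__init__()
        self.first_feature_extractor = first_feature_extractor
        self.classifier = classifier  # maps (second content, second audio) -> emotion logits

    def forward(self,
                first_content_feature: torch.Tensor,
                first_audio_feature: torch.Tensor) -> torch.Tensor:
        second_content_feature = self.first_feature_extractor(first_content_feature)
        second_audio_feature = self.first_feature_extractor(first_audio_feature)
        return self.classifier(second_content_feature, second_audio_feature)
```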
2. The method of claim 1, wherein the first feature extraction model comprises a graph convolution submodel, the graph structure adopted by the graph convolution submodel is a chain graph structure, and a first adjacency matrix corresponding to the chain graph structure is:
(formula image FDA0003323632170000011: definition of the first adjacency matrix A_C)
wherein A_C is the first adjacency matrix, and a is a real number greater than 0;
the first adjacency matrix is an N x N matrix, N is a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by circularly shifting the i-th row vector one bit to the right, and i is an integer greater than 1 and less than or equal to N-2.
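The matrix entries themselves are given only in the formula image above, but the textual limitations (an N x N matrix with weight a > 0 in which each row is the previous row circularly shifted one bit to the right) describe a circulant construction. The numpy sketch below builds such a matrix from an assumed first row that places a on the immediate neighbour; the actual first row of A_C is not reproduced in this text, so that choice is an assumption. The line-graph matrix A_L of claim 3 can be built the same way from its own first row with weight b.

```python
import numpy as np

def circulant_adjacency(first_row: np.ndarray) -> np.ndarray:
    """Build an N x N matrix whose (i+1)-th row is the i-th row
    circularly shifted one position to the right."""
    n = first_row.shape[0]
    return np.stack([np.roll(first_row, shift=i) for i in range(n)])

# Assumed example first row (not taken from the patent image): weight a on the next node.
a, n = 1.0, 5
first_row = np.zeros(n)
first_row[1] = a
A_C = circulant_adjacency(first_row)
```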
3. The method of claim 1, wherein the first feature extraction model comprises a graph convolution submodel, the graph structure adopted by the graph convolution submodel is a line graph structure, and a second adjacency matrix corresponding to the line graph structure is:
(formula image FDA0003323632170000012: definition of the second adjacency matrix A_L)
wherein A_L is the second adjacency matrix, and b is a real number greater than 0;
the second adjacency matrix is an M x M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by circularly shifting the j-th row vector one bit to the right, and j is an integer greater than 1 and less than or equal to M-2.
4. The method of claim 1, wherein the acquiring of the first content feature and the first audio feature of the target data comprises:
inputting the target data into a second feature extraction model to obtain text information and time information of the target data;
obtaining the first content feature according to the text information; and
obtaining the first audio feature according to the text information and the time information.
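Claim 4 only requires that a second feature extraction model yield text information and time information. One plausible, assumed realization is a speech recognizer that outputs word-level transcripts with timestamps: the content feature is then derived from the words, and the audio feature from acoustic frames selected by the timestamps. All names and feature choices below are illustrative, not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class RecognizedWord:
    text: str     # text information
    start: float  # time information, in seconds
    end: float

def first_content_feature(words: List[RecognizedWord],
                          embed: Callable[[str], np.ndarray]) -> np.ndarray:
    # Content branch: one embedding vector per recognized word.
    return np.stack([embed(w.text) for w in words])

def first_audio_feature(words: List[RecognizedWord],
                        frames: np.ndarray,     # (num_frames, dim) acoustic features
                        frame_rate: float) -> np.ndarray:
    # Audio branch: average the acoustic frames that fall inside each word's time span.
    pooled = []
    for w in words:
        lo = int(w.start * frame_rate)
        hi = max(int(w.end * frame_rate), lo + 1)  # keep at least one frame per word
        pooled.append(frames[lo:hi].mean(axis=0))
    return np.stack(pooled)
```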
5. The method of claim 2 or 3, wherein the graph convolution submodel comprises a first graph convolution network comprising H first graph convolution layers,
and the inputting of the first content feature into the first feature extraction model to obtain the second content feature comprises:
inputting the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature;
inputting the h-th first intermediate feature into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature, where h = 1, …, H-1; and
obtaining the second content feature according to the H first intermediate features.
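The layer stacking of claims 5 and 6 can be sketched as follows, assuming each "graph convolution layer" applies the common propagation rule relu(A·X·W) with the fixed chain- or line-graph adjacency matrix, and assuming the H intermediate features are aggregated by concatenation (the claim only says the second feature is obtained "according to" them). The audio branch of claim 6 is the same construction with its own K layers.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph convolution layer: X' = relu(A @ X @ W), with A fixed."""
    def __init__(self, adjacency: torch.Tensor, in_dim: int, out_dim: int):
        super().__init__()
        self.register_buffer("adjacency", adjacency)  # chain- or line-graph matrix
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.linear(self.adjacency @ x))

class GraphConvBranch(nn.Module):
    """H stacked layers; the branch output is built from all H intermediate features."""
    def __init__(self, adjacency: torch.Tensor, dim: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [SimpleGraphConv(adjacency, dim, dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        intermediates = []
        for layer in self.layers:        # the h-th intermediate feature feeds layer h+1
            x = layer(x)
            intermediates.append(x)
        return torch.cat(intermediates, dim=-1)  # assumed aggregation of the H features
```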
6. The method of claim 2 or 3, wherein the graph convolution submodel comprises a second graph convolution network comprising K second graph convolution layers,
and the inputting of the first audio feature into the first feature extraction model to obtain the second audio feature comprises:
inputting the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature;
inputting the k-th second intermediate feature into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature, where k = 1, …, K-1; and
obtaining the second audio feature according to the K second intermediate features.
7. The method of any one of claims 1 to 6, wherein the identifying, according to the second content feature and the second audio feature, of the emotion of the target object corresponding to the target data comprises:
performing a fusion operation on the second content feature and the second audio feature to obtain a fused feature; and
identifying the emotion of the target object according to the fused feature.
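Claim 7 leaves the fusion operation unspecified; mean-pooling each branch and concatenating the results before a small classification head is one common choice, used below purely as an illustration.

```python
import torch
import torch.nn as nn

class FuseAndClassify(nn.Module):
    """Fuse the second content feature and second audio feature, then classify.
    Assumes each branch output has shape (num_nodes, dim)."""
    def __init__(self, content_dim: int, audio_dim: int, num_emotions: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(content_dim + audio_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, content: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Assumed fusion: mean-pool each branch, then concatenate.
        fused = torch.cat([content.mean(dim=0), audio.mean(dim=0)], dim=-1)
        return self.head(fused)
```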
8. A method of training an emotion recognition model, the emotion recognition model comprising a first feature extraction model, the method comprising:
acquiring a first content feature and a first audio feature of sample data;
inputting the first content feature into the first feature extraction model to obtain a second content feature;
inputting the first audio feature into the first feature extraction model to obtain a second audio feature;
identifying, according to the second content feature and the second audio feature, an emotion of a sample object corresponding to the sample data;
obtaining a loss value according to the emotion of the sample object and a label of the sample data; and
training the emotion recognition model according to the loss value.
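A minimal training-loop sketch for claim 8, assuming the recognized emotion is produced as class logits, the sample label is an integer class index, cross-entropy serves as the loss value, and the data loader yields batched tensors. None of these choices are fixed by the claim.

```python
import torch.nn as nn

def train_one_epoch(model: nn.Module, data_loader, optimizer) -> None:
    """One pass over the sample data: extract features, recognize the emotion,
    compute the loss value against the labels, and update the model."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for first_content, first_audio, label in data_loader:
        logits = model(first_content, first_audio)  # second features -> emotion logits
        loss = criterion(logits, label)             # loss value from emotion vs. label
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                            # train the emotion recognition model
```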
9. The method of claim 8, wherein the first feature extraction model comprises a graph convolution submodel, the graph structure adopted by the graph convolution submodel is a chain graph structure, and a first adjacency matrix corresponding to the chain graph structure is:
(formula image FDA0003323632170000031: definition of the first adjacency matrix A_C)
wherein A_C is the first adjacency matrix, and a is a real number greater than 0;
wherein the first adjacency matrix is an N x N matrix, and N is a positive integer greater than 2; the (i+1)-th row vector of the first adjacency matrix is obtained by circularly shifting the i-th row vector one bit to the right, and i is an integer greater than 1 and less than or equal to N-2.
10. The method of claim 8, wherein the first feature extraction model comprises a graph convolution submodel, the graph structure adopted by the graph convolution submodel is a line graph structure, and a second adjacency matrix corresponding to the line graph structure is:
(formula image FDA0003323632170000032: definition of the second adjacency matrix A_L)
wherein A_L is the second adjacency matrix, and b is a real number greater than 0;
the second adjacency matrix is an M x M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by circularly shifting the j-th row vector one bit to the right, and j is an integer greater than 1 and less than or equal to M-2.
11. The method of claim 8, wherein the emotion recognition model comprises a second feature extraction model,
and the acquiring of the first content feature and the first audio feature of the sample data comprises:
inputting the sample data into the second feature extraction model to obtain text information and time information of the sample data;
obtaining the first content feature according to the text information; and
obtaining the first audio feature according to the text information and the time information.
12. The method of claim 9 or 10, wherein the graph convolution submodel comprises a first graph convolution network comprising H first graph convolution layers,
and the inputting of the first content feature into the first feature extraction model to obtain the second content feature comprises:
inputting the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature;
inputting the h-th first intermediate feature into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature, where h = 1, …, H-1; and
obtaining the second content feature according to the H first intermediate features.
13. The method of claim 9 or 10, wherein the graph convolution submodel comprises a second graph convolution network comprising K second graph convolution layers,
and the inputting of the first audio feature into the first feature extraction model to obtain the second audio feature comprises:
inputting the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature;
inputting the k-th second intermediate feature into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature, where k = 1, …, K-1; and
obtaining the second audio feature according to the K second intermediate features.
14. The method according to any one of claims 8 to 13, wherein the emotion recognition model comprises a fusion model and a classification model,
and the identifying, according to the second content feature and the second audio feature, of the emotion of the sample object corresponding to the sample data comprises:
inputting the second content feature and the second audio feature into the fusion model to obtain a fused feature; and
inputting the fused feature into the classification model to identify the emotion of the sample object.
15. The method of any one of claims 8 to 14, wherein the training of the emotion recognition model according to the loss value comprises:
adjusting parameters of the first feature extraction model according to the loss value, so as to train the emotion recognition model.
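Claim 15 ties the parameter update specifically to the first feature extraction model. One way to realize this, sketched below, is to hand only that submodule's parameters to the optimizer; the attribute name first_feature_extractor is an assumption.

```python
import torch
import torch.nn as nn

def optimizer_for_first_extractor(emotion_model: nn.Module,
                                  lr: float = 1e-4) -> torch.optim.Optimizer:
    """Optimize only the first feature extraction model's parameters (claim 15),
    assuming it is exposed as the attribute `first_feature_extractor`."""
    return torch.optim.Adam(emotion_model.first_feature_extractor.parameters(), lr=lr)

def training_step(loss: torch.Tensor, optimizer: torch.optim.Optimizer) -> None:
    optimizer.zero_grad()
    loss.backward()   # gradients of the loss value
    optimizer.step()  # only the first feature extraction model is adjusted
```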
16. An apparatus for recognizing emotion, comprising:
a first acquisition module configured to acquire a first content feature and a first audio feature of target data;
a first obtaining module configured to input the first content feature into a first feature extraction model to obtain a second content feature;
a second obtaining module configured to input the first audio feature into the first feature extraction model to obtain a second audio feature; and
a first identification module configured to identify, according to the second content feature and the second audio feature, an emotion of a target object corresponding to the target data.
17. An apparatus for training an emotion recognition model, the emotion recognition model comprising a first feature extraction model, the apparatus comprising:
a second acquisition module configured to acquire a first content feature and a first audio feature of sample data;
a third obtaining module configured to input the first content feature into the first feature extraction model to obtain a second content feature;
a fourth obtaining module configured to input the first audio feature into the first feature extraction model to obtain a second audio feature;
a second identification module configured to identify, according to the second content feature and the second audio feature, an emotion of a sample object corresponding to the sample data;
a fifth obtaining module configured to obtain a loss value according to the emotion of the sample object and a label of the sample data; and
a training module configured to train the emotion recognition model according to the loss value.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 15.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 15.
20. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 15.
CN202111259447.7A 2021-10-27 2021-10-27 Method for recognizing emotion, method, device and equipment for training emotion recognition model Pending CN113990353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111259447.7A CN113990353A (en) 2021-10-27 2021-10-27 Method for recognizing emotion, method, device and equipment for training emotion recognition model

Publications (1)

Publication Number Publication Date
CN113990353A true CN113990353A (en) 2022-01-28

Family

ID=79743024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111259447.7A Pending CN113990353A (en) 2021-10-27 2021-10-27 Method for recognizing emotion, method, device and equipment for training emotion recognition model

Country Status (1)

Country Link
CN (1) CN113990353A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017218243A2 (en) * 2016-06-13 2017-12-21 Microsoft Technology Licensing, LLC Intent recognition and emotional text-to-speech learning system
CN108520275A (en) * 2017-06-28 2018-09-11 Zhejiang University Link information regularization system, graph feature extraction system, graph classification system and method based on adjacency matrix
CN112015872A (en) * 2019-05-29 2020-12-01 Huawei Technologies Co., Ltd. Question recognition method and device
CN112489635A (en) * 2020-12-03 2021-03-12 Hangzhou Dianzi University Multi-mode emotion recognition method based on attention enhancement mechanism
CN112735404A (en) * 2020-12-18 2021-04-30 Ping An Technology (Shenzhen) Co., Ltd. Irony detection method, system, terminal device and storage medium
CN112948541A (en) * 2021-02-01 2021-06-11 South China University of Technology Financial news text emotional tendency analysis method based on graph convolution network
CN112818861A (en) * 2021-02-02 2021-05-18 Nanjing University of Posts and Telecommunications Emotion classification method and system based on multi-mode context semantic features
CN113112994A (en) * 2021-04-21 2021-07-13 Jiangsu Normal University Cross-corpus emotion recognition method based on graph convolution neural network
CN113223560A (en) * 2021-04-23 2021-08-06 Ping An Technology (Shenzhen) Co., Ltd. Emotion recognition method, device, equipment and storage medium
CN113378160A (en) * 2021-06-11 2021-09-10 Zhejiang University of Technology Graph neural network model defense method and device based on generative adversarial network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FEI TAO ET AL.: "An ensemble framework of voice-based emotion recognition system for films and TV programs", ICASSP 2018 *

Similar Documents

Publication Publication Date Title
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN109003625B (en) Speech emotion recognition method and system based on ternary loss
CN108428446A (en) Audio recognition method and device
CN114895817B (en) Interactive information processing method, network model training method and device
CN115358243A (en) Training method, device, equipment and storage medium for multi-round dialogue recognition model
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN113947189A (en) Training method and device for image generation model, electronic equipment and storage medium
CN114020950A (en) Training method, device and equipment of image retrieval model and storage medium
CN111554270B (en) Training sample screening method and electronic equipment
EP4024393A2 (en) Training a speech recognition model
CN113990353A (en) Method for recognizing emotion, method, device and equipment for training emotion recognition model
CN112530415B (en) Negative reply recognition model acquisition and negative reply recognition method and device
CN115223573A (en) Voice wake-up method and device, electronic equipment and storage medium
CN114882334A (en) Method for generating pre-training model, model training method and device
CN114119972A (en) Model acquisition and object processing method and device, electronic equipment and storage medium
CN113889085A (en) Speech recognition method, apparatus, device, storage medium and program product
CN113889089A (en) Method and device for acquiring voice recognition model, electronic equipment and storage medium
CN113468857A (en) Method and device for training style conversion model, electronic equipment and storage medium
CN113792540A (en) Intention recognition model updating method and related equipment
CN113641724A (en) Knowledge tag mining method and device, electronic equipment and storage medium
CN112735432A (en) Audio recognition method and device, electronic equipment and storage medium
CN112632999A (en) Named entity recognition model obtaining method, named entity recognition device and named entity recognition medium
CN113380233B (en) Audio recognition method, device, training method, training device, equipment and storage medium
CN115169549B (en) Artificial intelligent model updating method and device, electronic equipment and storage medium
CN116631379B (en) Speech recognition method, device, equipment and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination