CN113990353A - Method for recognizing emotion, method, device and equipment for training emotion recognition model - Google Patents

Method for recognizing emotion, method, device and equipment for training emotion recognition model

Info

Publication number
CN113990353A
CN113990353A (application CN202111259447.7A)
Authority
CN
China
Prior art keywords
feature
content
audio
graph convolution
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111259447.7A
Other languages
Chinese (zh)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111259447.7A priority Critical patent/CN113990353A/en
Publication of CN113990353A publication Critical patent/CN113990353A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The disclosure provides a method for recognizing emotion, which relates to the field of artificial intelligence and in particular to the field of deep learning. A specific implementation scheme is as follows: acquiring a first content feature and a first audio feature of target data; inputting the first content feature into a first feature extraction model to obtain a second content feature; inputting the first audio feature into the first feature extraction model to obtain a second audio feature; and identifying an emotion of the target object corresponding to the target data according to the second content feature and the second audio feature. The disclosure also provides a method and an apparatus for training an emotion recognition model, an electronic device, and a storage medium.

Description

Method for recognizing emotion, method, device and equipment for training emotion recognition model
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to deep learning techniques. More particularly, the present disclosure provides a method of recognizing emotion, a method of training an emotion recognition model, an apparatus, an electronic device, and a storage medium.
Background
Speech is an important carrier of emotion in human communication. People use different language expressions in different emotional states. For example, sentences with identical content can carry different emotions and thereby express completely different meanings.
Disclosure of Invention
The present disclosure provides a method of recognizing emotion, a method of training an emotion recognition model, an apparatus, a device, and a storage medium.
According to a first aspect, there is provided a method of identifying an emotion, the method comprising: acquiring a first content characteristic and a first audio characteristic of target data; inputting the first content characteristics into a first characteristic extraction model to obtain second content characteristics; inputting the first audio features into a first feature extraction model to obtain second audio features; and identifying the emotion of the target object corresponding to the target data according to the second content feature and the second audio feature.
According to a second aspect, there is provided a method of training an emotion recognition model, the emotion recognition model including a first feature extraction model, the method comprising: acquiring a first content characteristic and a first audio characteristic of sample data; inputting the first content characteristics into a first characteristic extraction model to obtain second content characteristics; inputting the first audio features into a first feature extraction model to obtain second audio features; recognizing the emotion of the sample object corresponding to the sample data according to the second content feature and the second audio feature; obtaining a loss value according to the emotion of the sample object and the label of the sample data; and training the emotion recognition model according to the loss value.
According to a third aspect, there is provided an apparatus for recognizing emotion, the apparatus comprising: the first acquisition module is used for acquiring a first content characteristic and a first audio characteristic of the target data; the first obtaining module is used for inputting the first content characteristics into a first characteristic extraction model to obtain second content characteristics; the second obtaining module is used for inputting the first audio features into the first feature extraction model to obtain second audio features; and the first identification module is used for identifying the emotion of the target object corresponding to the target data according to the second content characteristic and the second audio characteristic.
According to a fourth aspect, there is provided an apparatus for training an emotion recognition model, the emotion recognition model including a first feature extraction model, the apparatus comprising: the second acquisition module is used for acquiring a first content characteristic and a first audio characteristic of the sample data; a third obtaining module, configured to input the first content feature into a first feature extraction model to obtain a second content feature; a fourth obtaining module, configured to input the first audio feature into the first feature extraction model to obtain a second audio feature; a second identification module, configured to identify an emotion of the sample object corresponding to the sample data according to the second content feature and the second audio feature; a fifth obtaining module, configured to obtain a loss value according to the emotion of the sample object and the label of the sample data; and the training module is used for training the emotion recognition model according to the loss value.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided in accordance with the present disclosure.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a method of identifying emotions according to one embodiment of the present disclosure;
FIG. 2A is a schematic diagram of a chain graph structure according to one embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a line graph structure according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a method of recognizing emotion according to one embodiment of the present disclosure;
FIG. 4 is a flow diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure;
FIG. 6 is a block diagram of an apparatus to recognize emotions according to one embodiment of the present disclosure;
FIG. 7 is a block diagram of an apparatus for training emotion recognition models, according to one embodiment of the present disclosure; and
FIG. 8 is a block diagram of an electronic device to which a method of recognizing emotion and/or a method of training an emotion recognition model may be applied, according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Language expression patterns differ across emotional states. For example, when the emotion is happy, the tone of speech is cheerful. For another example, when the emotion is upset or sad, the tone of speech tends to be low and dull.
Deep learning techniques have accelerated the development of emotion recognition from speech, but research in this area remains insufficient. For example, different speakers may express different emotions with utterances of the same content, and the related art has difficulty distinguishing these emotions.
At present, in order to improve the effect of an emotion recognition model, front-end feature extraction, for example MFCC (Mel-Frequency Cepstral Coefficient) features of speech, may be optimized to improve the accuracy of emotion recognition. For another example, the feature dimension may be increased, such as from 40 dimensions to 80 dimensions. However, optimizing front-end feature extraction alone cannot significantly improve the accuracy of emotion recognition.
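To make the front-end feature extraction discussed above concrete, the following sketch computes MFCC features at 40 and 80 dimensions. It is an illustration added here, not part of the patent; librosa, the file name, and the parameter values are assumptions.

```python
# Hedged illustration of front-end MFCC extraction (not part of the patent).
# "speech.wav" and the parameter values are placeholders.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)            # waveform and sample rate
mfcc_40 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)   # 40-dimensional MFCC features
mfcc_80 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=80)   # increased to 80 dimensions
print(mfcc_40.shape, mfcc_80.shape)                     # (40, num_frames), (80, num_frames)
```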
Fig. 1 is a flow diagram of a method of recognizing emotion according to one embodiment of the present disclosure.
As shown in fig. 1, the method 100 may include operations S110 to S140.
In operation S110, a first content feature and a first audio feature of target data are acquired.
In the disclosed embodiment, the target data may be voice data.
For example, the target data may be a segment of speech originating from the target object.
In the disclosed embodiment, the target data may be voice data in video data.
For example, video data of the target object may be acquired. The target data may be voice data extracted from the video data. In one example, video data for a target object may be captured, with an audio stream in the video data as the target data.
In the embodiment of the present disclosure, the target data may be input into the second feature extraction model, so as to obtain text information and time information of the target data.
For example, the second feature extraction model may include a forced alignment submodel. In some examples, the forced alignment submodel may be a GMM-HMM (Gaussian Mixture Model-Hidden Markov Model), an LSTM-CTC (Long Short-Term Memory with Connectionist Temporal Classification), or a chain model.

For example, the second feature extraction model may be a pre-trained model. In some examples, the second feature extraction model may be pre-trained using open-source datasets such as AISHELL or LibriSpeech as training samples.
For example, the text information may include information such as phonemes, words, and the like in the target data.
For example, the time information may include time stamps of occurrences of phonemes, words. In one example, the time information includes a start time of occurrence of a phoneme and a duration of the phoneme.
In the embodiment of the present disclosure, the first content feature may be obtained according to the text information.
For example, the second feature extraction model further includes a content feature generation submodel. The text information may be input into the content feature generation submodel to obtain the first content feature. In one example, one or more of the phonemes and words of the target data may be input into the content feature generation submodel to obtain the first content feature. The content feature generation submodel may be a convolutional neural network model.
In the embodiment of the present disclosure, the first audio feature may be obtained according to the text information and the time information.
For example, the forced alignment submodel may output the first audio feature according to the text information and the time information.
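As a rough illustration of operation S110, the sketch below shows how text information (phonemes or words) and time information (start time and duration) from a forced-alignment submodel could be organized before feeding the content-feature and audio-feature paths. `forced_align`, `AlignedUnit`, and the returned values are hypothetical placeholders, not an API defined by the patent.

```python
# Hedged sketch of operation S110; forced_align is a hypothetical stand-in for
# the forced-alignment submodel (e.g. a GMM-HMM or chain model).
from dataclasses import dataclass
from typing import List

@dataclass
class AlignedUnit:
    unit: str        # text information: a phoneme or word
    start: float     # time information: start time in seconds
    duration: float  # time information: duration in seconds

def forced_align(waveform) -> List[AlignedUnit]:
    """Placeholder returning a fixed alignment so the sketch runs."""
    return [AlignedUnit("h", 0.00, 0.08), AlignedUnit("ay", 0.08, 0.15)]

units = forced_align(waveform=None)
text_info = [u.unit for u in units]                  # input to the content feature submodel
time_info = [(u.start, u.duration) for u in units]   # used with text_info for the first audio feature
```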
In operation S120, the first content feature is input into the first feature extraction model, and a second content feature is obtained.
In an embodiment of the present disclosure, the first feature extraction model may include a graph convolution sub-model.
For example, the graph convolution submodel may be a GCN (Graph Convolutional Network) model. The graph structure adopted by the GCN model can be an undirected graph structure.
For example, the graph structure adopted by the graph convolution sub-model is a chain graph structure, and the first adjacency matrix corresponding to the chain graph structure is:
$$A_C=\begin{pmatrix} 0 & a & & & & a\\ a & 0 & a & & & \\ & a & 0 & \ddots & & \\ & & \ddots & \ddots & a & \\ & & & a & 0 & a\\ a & & & & a & 0 \end{pmatrix}$$

where all unmarked entries are 0.

For example, $A_C$ is the first adjacency matrix, and a is a real number greater than 0.

For example, the first adjacency matrix is an N × N matrix, N is a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.

For example, each row vector of the first adjacency matrix includes two non-zero entries (namely a). In one example, a = 1.
For example, the graph structure adopted by the graph convolution sub-model is a line graph structure, and the second adjacency matrix corresponding to the line graph structure is:
$$A_L=\begin{pmatrix} 0 & b & & & & \\ b & 0 & b & & & \\ & b & 0 & \ddots & & \\ & & \ddots & \ddots & b & \\ & & & b & 0 & b\\ & & & & b & 0 \end{pmatrix}$$

where all unmarked entries are 0.

For example, $A_L$ is the second adjacency matrix, and b is a real number greater than 0.

For example, the second adjacency matrix is an M × M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.

For example, the first row vector of the second adjacency matrix includes one non-zero entry (namely b), and the last row vector of the second adjacency matrix includes one non-zero entry (namely b). The second through (M-1)-th row vectors of the second adjacency matrix include two non-zero entries. In one example, b = 1.
The graph structure of the graph convolution submodel includes a plurality of nodes, and the relationship between adjacent nodes is far more important than the relationship between non-adjacent nodes. By adopting a graph convolution submodel with a chain graph structure or a line graph structure, the relationships between adjacent nodes can be learned, so that the amount of computation of the graph convolution submodel can be reduced while the accuracy of emotion recognition is guaranteed.
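The two adjacency matrices can be built directly from the descriptions above. The NumPy sketch below is an added illustration, with a = b = 1 and node counts taken from the examples: the chain-graph matrix A_C is a ring and the line-graph matrix A_L is a path.

```python
# Sketch of the chain-graph and line-graph adjacency matrices (a = b = 1).
import numpy as np

def chain_adjacency(n: int, a: float = 1.0) -> np.ndarray:
    """Chain graph: each node connects to its two neighbours; node n wraps to node 1."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = a
        A[i, (i - 1) % n] = a
    return A  # every row holds exactly two non-zero entries

def line_adjacency(m: int, b: float = 1.0) -> np.ndarray:
    """Line graph: a path, so the last node does not connect back to the first."""
    A = np.zeros((m, m))
    for j in range(m - 1):
        A[j, j + 1] = b
        A[j + 1, j] = b
    return A  # first and last rows hold one non-zero entry, the others two

A_C = chain_adjacency(16)    # e.g. 16 nodes for the content branch
A_L = line_adjacency(120)    # e.g. 120 nodes for the audio branch
```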
In an embodiment of the present disclosure, the graph convolution submodel may include a first graph convolution network, which may include H first graph convolution layers.
In an embodiment of the present disclosure, the first content feature may be input into the 1 st first graph convolution layer, resulting in the 1 st first intermediate feature.
For example, the 1st first intermediate feature may be obtained using Formula Two:

$$X_c^{(1)} = U\, g_{\theta_1}\, U^{T} X_c^{(0)} \qquad \text{(Formula Two)}$$

where $X_c^{(1)}$ is the 1st first intermediate feature, U is the eigenvector matrix of the normalized graph Laplacian matrix $\tilde{L}$, $U^{T}$ is the transpose of U, $X_c^{(0)}$ is the first content feature, and $g_{\theta_1}$ is the parameter of the 1st first graph convolution layer.
The graph Laplacian matrix $\tilde{L}$ can be obtained by Formula Three:

$$\tilde{L} = I - D^{-1/2} A D^{-1/2} \qquad \text{(Formula Three)}$$

wherein A is an adjacency matrix, e.g., $A_C$ or $A_L$ described above, and D is the degree matrix.

The graph Laplacian matrix $\tilde{L}$ can be decomposed by eigenvalue decomposition according to Formula Four:

$$\tilde{L} = U \Lambda U^{T} \qquad \text{(Formula Four)}$$

where $\lambda_g$ is the g-th eigenvalue, corresponding to the eigenvector $u_g$, $U = [u_1, u_2, \ldots, u_G]$, and $\Lambda = \mathrm{diag}(\lambda_g)$.
In some examples, when A in Formula Three is $A_C$, the graph Laplacian matrix $\tilde{L}$ is a circulant matrix, and the corresponding graph Fourier transform is the discrete Fourier transform. Accordingly, $\tilde{L}$ is an N × N matrix.

In some examples, when A in Formula Three is $A_L$, the corresponding graph Fourier transform is the discrete cosine transform. Accordingly, $\tilde{L}$ is an M × M matrix.
In the disclosed embodiment, the h-th first intermediate feature may be input into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature.

For example, h = 1, …, H-1.
For example, the (h+1)-th first intermediate feature can be obtained by Formula Five:

$$X_c^{(h+1)} = U\, g_{\theta_{h+1}}\, U^{T} X_c^{(h)} \qquad \text{(Formula Five)}$$

where $X_c^{(h+1)}$ is the (h+1)-th first intermediate feature, $X_c^{(h)}$ is the h-th first intermediate feature, and $g_{\theta_{h+1}}$ is the parameter of the (h+1)-th first graph convolution layer.
In the embodiment of the present disclosure, the second content feature may be obtained according to the H first intermediate features.
For example, the H first intermediate features may be input into the first pooling layer and pooled to obtain the second content feature:

$$C = \mathrm{Pool}\left(X_c^{(1)}, X_c^{(2)}, \ldots, X_c^{(H)}\right)$$

where C is the second content feature, $X_c^{(h)}$ is the h-th first intermediate feature, and Pool denotes the pooling operation of the first pooling layer.
In one example, the first graph convolution network employs a graph structure that includes 16 nodes.
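The layer-by-layer computation and the pooling over the H intermediate features could look like the PyTorch sketch below. The diagonal spectral filter, the mean pooling, and PyTorch itself are assumptions of this illustration rather than details fixed by the patent.

```python
# Hedged sketch of the first graph convolution network: H spectral layers
# followed by pooling over the H intermediate features (PyTorch assumed).
import torch
import torch.nn as nn

class SpectralGraphConv(nn.Module):
    """One graph convolution layer: x -> U g_theta U^T x (diagonal filter assumed)."""
    def __init__(self, num_nodes: int):
        super().__init__()
        self.theta = nn.Parameter(torch.ones(num_nodes))

    def forward(self, x: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
        return U @ torch.diag(self.theta) @ U.T @ x

class FirstGraphConvNetwork(nn.Module):
    """H stacked layers; the h-th output is the input of the (h+1)-th layer."""
    def __init__(self, num_nodes: int = 16, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([SpectralGraphConv(num_nodes) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
        intermediates = []
        for layer in self.layers:
            x = layer(x, U)
            intermediates.append(x)                 # keep all H intermediate features
        return torch.stack(intermediates).mean(0)   # pooling (mean is an assumption)

# Usage with a 16-node graph whose eigenvector matrix U has been precomputed:
U, _ = torch.linalg.qr(torch.randn(16, 16))          # placeholder orthogonal matrix
net = FirstGraphConvNetwork()
second_content_feature = net(torch.randn(16, 8), U)  # 8-dim feature per node (illustrative)
```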
In operation S130, the first audio feature is input into the first feature extraction model to obtain a second audio feature.
In an embodiment of the present disclosure, the first feature extraction model may include a graph convolution sub-model.
For a detailed description of the first feature extraction model, reference may be made to the description in operation S120; details are not repeated here.
In an embodiment of the disclosure, the graph convolution submodel includes a second graph convolution network, and the second graph convolution network includes K second graph convolution layers.
In an embodiment of the present disclosure, the first audio feature may be input into the 1 st second graph convolution layer, resulting in the 1 st second intermediate feature.
For example, the 1st second intermediate feature can be obtained using Formula Seven:

$$X_a^{(1)} = U\, g_{\theta'_1}\, U^{T} X_a^{(0)} \qquad \text{(Formula Seven)}$$

where $X_a^{(1)}$ is the 1st second intermediate feature, U is the eigenvector matrix of the normalized graph Laplacian matrix $\tilde{L}$, $U^{T}$ is the transpose of U, $X_a^{(0)}$ is the first audio feature, and $g_{\theta'_1}$ is the parameter of the 1st second graph convolution layer.

For the graph Laplacian matrix $\tilde{L}$ and the eigenvector matrix of the normalized graph Laplacian matrix, reference may be made to Formula Three and Formula Four described above; details are not repeated here.
In the disclosed embodiment, the kth second intermediate feature may be input into the (k + 1) th second graph convolution layer to obtain a (k + 1) th second intermediate feature.
For example, k = 1, …, K-1.
For example, the (k+1)-th second intermediate feature can be obtained by Formula Eight:

$$X_a^{(k+1)} = U\, g_{\theta'_{k+1}}\, U^{T} X_a^{(k)} \qquad \text{(Formula Eight)}$$

where $X_a^{(k+1)}$ is the (k+1)-th second intermediate feature, $X_a^{(k)}$ is the k-th second intermediate feature, and $g_{\theta'_{k+1}}$ is the parameter of the (k+1)-th second graph convolution layer.
In the embodiment of the present disclosure, the second audio feature may be obtained according to the K second intermediate features.
For example, the K second intermediate features may be input into the second pooling layer and pooled to obtain the second audio feature:

$$\mathrm{Audio} = \mathrm{Pool}\left(X_a^{(1)}, X_a^{(2)}, \ldots, X_a^{(K)}\right)$$

where Audio is the second audio feature, $X_a^{(k)}$ is the k-th second intermediate feature, and Pool denotes the pooling operation of the second pooling layer.
In one example, the graph structure employed by the second graph convolution network contains 120 nodes.
In some examples, H may be equal to K. I.e. the first and second graph convolution networks may have the same number of graph convolution layers.
It should be noted that the graph structure adopted by the graph convolution sub-model may be a graph structure adopted by the first graph convolution network and/or the second graph convolution network.
It should be noted that, if the first graph convolution network adopts a chain graph structure, the graph structure adopted by one or more first graph convolution layers in the H first graph convolution layers may be a chain graph structure.
It should be noted that, if the second graph convolution network adopts a chain graph structure, the graph structure adopted by one or more second graph convolution layers in the K second graph convolution layers may be a chain graph structure.
It should be noted that the first graph convolution network and the second graph convolution network may have the same parameters except for the nodes of the graph structure.
It should be noted that the first graph convolution network and the second graph convolution network may both adopt a chain graph structure. Alternatively, the first and second graph convolution networks may both employ a line graph structure. Alternatively, the first graph convolution network may adopt a chain graph structure, and the second graph convolution network adopts a line graph structure. Alternatively, the first graph convolution network may employ a line graph structure, and the second graph convolution network may employ a chain graph structure.
In operation S140, an emotion of the target object corresponding to the target data is identified according to the second content feature and the second audio feature.
In the embodiment of the present disclosure, a fusion operation may be performed on the second content feature and the second audio feature, resulting in a fusion feature.
For example, the second content feature and the second audio feature may be concatenated to obtain the fusion feature.
In the embodiment of the present disclosure, the emotion of the target object may be recognized according to the fusion feature.
For example, the fusion feature may be input into a fully connected layer to identify the emotion of the target object. The emotion may be, for example, happy or sad.
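Operation S140 can be sketched as concatenating the two features and passing them through a fully connected layer. The dimensions and the emotion set below are illustrative assumptions, not values given by the patent.

```python
# Hedged sketch of operation S140: feature fusion by concatenation and a
# fully connected emotion classifier (PyTorch assumed).
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, content_dim: int = 64, audio_dim: int = 64, num_emotions: int = 4):
        super().__init__()
        self.fc = nn.Linear(content_dim + audio_dim, num_emotions)

    def forward(self, second_content_feature, second_audio_feature):
        fused = torch.cat([second_content_feature, second_audio_feature], dim=-1)  # fusion feature
        return self.fc(fused)                        # emotion logits

head = EmotionHead()
logits = head(torch.randn(1, 64), torch.randn(1, 64))
predicted_emotion = logits.argmax(dim=-1)            # e.g. index into {happy, sad, ...}
```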
Through the embodiments of the disclosure, the emotion corresponding to an audio frame is determined by taking into account both the sequential relationship between audio frames and the association between an audio frame and its content, which improves the accuracy of emotion recognition. By adopting a chain graph structure or a line graph structure, the amount of computation can be reduced while the accuracy of emotion recognition is further improved.
Fig. 2A is a schematic diagram of a chain graph structure according to one embodiment of the present disclosure.
As shown in FIG. 2A, the chain graph structure 201 comprises N nodes, wherein the N-th node $V_N$ is connected with the 1st node. In one example, N = 120. In another example, N = 16.
Fig. 2B is a schematic diagram of a line graph structure according to one embodiment of the present disclosure.
As shown in FIG. 2B, the line graph structure 202 comprises M nodes, wherein the M-th node $V'_M$ is not connected with the 1st node. In one example, M = 120. In another example, M = 16.
Fig. 3 is a schematic diagram of a method of recognizing emotion according to one embodiment of the present disclosure.
As shown in fig. 3, the input of the second feature extraction model 302 is target data 301, and the first content feature and the first audio feature are output.
The first feature extraction model comprises a graph convolution sub-model, which may comprise a first graph convolution network 303 and a second graph convolution network 305. The first feature extraction model may also include a first pooling layer 304 and a second pooling layer 306.
The input to the first graph convolution network 303 is a first content characteristic and the output is a second content characteristic. The first graph convolution network 303 includes H first graph convolution layers. The input to the 1 st first graph convolution layer 3031 is the first content feature, outputting the 1 st first intermediate feature. The 1 st first intermediate feature serves as an input to the 2 nd first map convolutional layer. The input to the h-th first graph convolution layer 3032 is the h-1 st first intermediate feature, outputting the h-th first intermediate feature. The input to the H first graph convolutional layer 3033 is the H-1 st first intermediate feature, and the H first intermediate feature is output. The input to the first pooling layer 304 is the H first intermediate features and the second content feature is output. In one example, the first graph convolution network 303 employs a graph structure that includes 16 nodes.
The input to the second graph convolution network 305 is the first audio feature and the output is the second audio feature. The second graph convolution network 305 includes K second graph convolution layers. The input of the 1st second graph convolution layer 3051 is the first audio feature, and the 1st second intermediate feature is output. The 1st second intermediate feature serves as an input to the 2nd second graph convolution layer. The input to the k-th second graph convolution layer 3052 is the (k-1)-th second intermediate feature, and the k-th second intermediate feature is output. The input to the K-th second graph convolution layer 3053 is the (K-1)-th second intermediate feature, and the K-th second intermediate feature is output. The inputs to the second pooling layer 306 are the K second intermediate features, and the second audio feature is output. In one example, the second graph convolution network 305 employs a graph structure that includes 120 nodes.
The inputs to the fusion model 307 are the second content feature and the second audio feature, and the fusion feature is output. The fusion model 307 may concatenate the second content feature and the second audio feature.

The classification model 308 may include one or several fully connected layers. The input to the classification model 308 is the fusion feature, and the output is the emotion of the target object corresponding to the target data 301.
FIG. 4 is a flow diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure.
As shown in fig. 4, the method 400 may include operations S410 to S460.
In operation S410, a first content feature and a first audio feature of sample data are acquired.
In an embodiment of the present disclosure, the emotion recognition model may include a second feature extraction model.
In the embodiment of the present disclosure, the sample data may be input into the second feature extraction model, and text information and time information of the sample data may be obtained.
In the embodiment of the present disclosure, the first content feature may be obtained according to the text information.
In the embodiment of the present disclosure, the first audio feature may be obtained according to the text information and the time information.
For the embodiment of operation S410, reference may be made to the above-mentioned embodiment of operation S110, and the disclosure is not repeated herein.
In operation S420, the first content feature is input into the first feature extraction model, and a second content feature is obtained.
In this disclosure, the first feature extraction model may include a graph convolution sub-model, and a graph structure adopted by the graph convolution sub-model may be a chain graph structure.
For example, the first adjacency matrix corresponding to the chain graph structure is:
$$A_C=\begin{pmatrix} 0 & a & & & & a\\ a & 0 & a & & & \\ & a & 0 & \ddots & & \\ & & \ddots & \ddots & a & \\ & & & a & 0 & a\\ a & & & & a & 0 \end{pmatrix}$$

For example, $A_C$ is the first adjacency matrix, and a is a real number greater than 0.

For example, the first adjacency matrix is an N × N matrix, N being a positive integer greater than 2; the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
In the embodiment of the present disclosure, the first feature extraction model includes a graph convolution sub-model, and a graph structure adopted by the graph convolution sub-model is a line graph structure.
For example, the second adjacency matrix corresponding to the line graph structure is:
$$A_L=\begin{pmatrix} 0 & b & & & & \\ b & 0 & b & & & \\ & b & 0 & \ddots & & \\ & & \ddots & \ddots & b & \\ & & & b & 0 & b\\ & & & & b & 0 \end{pmatrix}$$

For example, $A_L$ is the second adjacency matrix, and b is a real number greater than 0.

For example, the second adjacency matrix is an M × M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
In an embodiment of the present disclosure, the graph convolution submodel may include a first graph convolution network, which may include H first graph convolution layers.
In an embodiment of the present disclosure, the first content feature may be input into the 1 st first graph convolution layer, resulting in the 1 st first intermediate feature.
In the embodiment of the present disclosure, the h-th first intermediate feature may be input into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature, h = 1, …, H-1.
In the embodiment of the present disclosure, the second content feature may be obtained according to the H first intermediate features.
For the embodiment of operation S420, reference may be made to the above-mentioned embodiment of operation S120, and the disclosure is not repeated herein.
In operation S430, the first audio feature is input into the first feature extraction model, and a second audio feature is obtained.
In an embodiment of the disclosure, the graph convolution submodel may include a second graph convolution network, which may include K second graph convolution layers.
In an embodiment of the present disclosure, the first audio feature may be input into the 1 st second graph convolution layer, resulting in the 1 st second intermediate feature.
In the embodiment of the present disclosure, the k-th second intermediate feature may be input into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature, k = 1, …, K-1.
In the embodiment of the present disclosure, the second audio feature may be obtained according to the K second intermediate features.
For the embodiment of operation S430, reference may be made to the above-mentioned embodiment of operation S130, and the disclosure is not repeated herein.
In operation S440, an emotion of the sample object corresponding to the sample data is identified according to the second content feature and the second audio feature.
In the disclosed embodiment, the emotion recognition model may include a fusion model and a classification model.
In an embodiment of the present disclosure, the second content feature and the second audio feature may be input into a fusion model to obtain a fusion feature.
In the disclosed embodiment, the fused features may be input into a classification model to identify the emotion of the sample object.
For the embodiment of operation S440, reference may be made to the above-mentioned embodiment of operation S140, and the disclosure is not repeated herein.
In operation S450, a loss value is obtained according to the emotion of the sample object and the label of the sample data.
For example, a loss value may be derived using a cross-entropy loss function based on the emotion of the sample object and the label of the sample data.
In operation S460, the emotion recognition model is trained according to the loss value.
In the embodiment of the present disclosure, parameters of the first feature extraction model may be adjusted according to the loss value to train the emotion recognition model.
For example, based on the loss value, the parameter $g_{\theta_1}$ in Formula Two may be adjusted. The parameter $g_{\theta_{h+1}}$ in Formula Five may also be adjusted. The parameter $g_{\theta'_1}$ in Formula Seven may also be adjusted. The parameter $g_{\theta'_{k+1}}$ in Formula Eight may also be adjusted.
For example, the number of nodes of the graph structure employed by the graph convolution sub-model may be adjusted based on the loss value. In this way, a model that accurately identifies the emotion of the target object can be obtained.
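Operations S450 and S460 amount to a standard supervised update: compute a cross-entropy loss between the predicted emotion and the sample label, then back-propagate to adjust the model parameters. The sketch below uses a stand-in linear model so it runs on its own; all names, shapes, and label values are illustrative.

```python
# Hedged sketch of operations S450-S460 (PyTorch assumed). nn.Linear stands in
# for the emotion recognition model; features and labels are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(128, 4)                        # stand-in for the emotion recognition model
criterion = nn.CrossEntropyLoss()                # cross-entropy loss, as mentioned above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

fused_features = torch.randn(8, 128)             # placeholder fused features of 8 samples
labels = torch.randint(0, 4, (8,))               # placeholder emotion labels

logits = model(fused_features)                   # emotions of the sample objects
loss = criterion(logits, labels)                 # loss value (operation S450)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                 # adjust parameters to train the model (operation S460)
```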
FIG. 5 is a schematic diagram of a method of training an emotion recognition model according to one embodiment of the present disclosure.
As shown in fig. 5, the emotion recognition model may include, for example, a first feature extraction model, a second feature extraction model 302, a fusion model 307, and a classification model 308. The first feature extraction model may comprise a graph convolution sub-model, which may comprise a first graph convolution network 303 and a second graph convolution network 305. The first feature extraction model may also include a first pooling layer 304 and a second pooling layer 306.
For example, the processing method of the target data 301 described in fig. 3 may be referred to as the processing method of the sample data 501, and details of the present disclosure are not repeated herein.
The emotion recognition model processes the sample data 501 and outputs the emotion of the sample object corresponding to the sample data. From the emotion of the sample object and the label of the sample data, a loss value can be derived. Parameters of the first graph convolution network 303 and the second graph convolution network 305 may be adjusted according to the loss value to train the emotion recognition model.
Fig. 6 is a block diagram of an apparatus for recognizing emotion according to one embodiment of the present disclosure.
As shown in fig. 6, the apparatus 600 may include a first obtaining module 610, a first obtaining module 620, a second obtaining module 630, and a first identifying module 640.
The first obtaining module 610 is configured to obtain a first content feature and a first audio feature of the target data.
The first obtaining module 620 is configured to input the first content feature into the first feature extraction model to obtain a second content feature.
The second obtaining module 630 is configured to input the first audio feature into the first feature extraction model to obtain a second audio feature.
The first identifying module 640 is configured to identify an emotion of the target object corresponding to the target data according to the second content feature and the second audio feature.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph structure adopted by the graph convolution sub-model is a chain graph structure, and the first adjacency matrix corresponding to the chain graph structure is:
$$A_C=\begin{pmatrix} 0 & a & & & & a\\ a & 0 & a & & & \\ & a & 0 & \ddots & & \\ & & \ddots & \ddots & a & \\ & & & a & 0 & a\\ a & & & & a & 0 \end{pmatrix}$$

wherein $A_C$ is the first adjacency matrix, and a is a real number greater than 0; the first adjacency matrix is an N × N matrix, N is a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph convolution sub-model uses a graph structure of a line graph structure, and the second adjacency matrix corresponding to the line graph structure is:
$$A_L=\begin{pmatrix} 0 & b & & & & \\ b & 0 & b & & & \\ & b & 0 & \ddots & & \\ & & \ddots & \ddots & b & \\ & & & b & 0 & b\\ & & & & b & 0 \end{pmatrix}$$

wherein $A_L$ is the second adjacency matrix, and b is a real number greater than 0; the second adjacency matrix is an M × M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
In some embodiments, the first obtaining module includes: a first obtaining unit, configured to input the target data into a second feature extraction model to obtain text information and time information of the target data; a second obtaining unit, configured to obtain the first content feature according to the text information; a third obtaining unit, configured to obtain the first audio feature according to the text information and the time information.
In some embodiments, the graph convolution submodel includes a first graph convolution network including H first graph convolution layers, and the first obtaining module includes: a fourth obtaining unit, configured to input the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature; a fifth obtaining unit, configured to input the h-th first intermediate feature into the (h+1)-th first graph convolution layer, so as to obtain the (h+1)-th first intermediate feature, where h = 1, …, H-1; and a sixth obtaining unit, configured to obtain the second content feature according to the H first intermediate features.

In some embodiments, the graph convolution submodel includes a second graph convolution network including K second graph convolution layers, and the second obtaining module includes: a seventh obtaining unit, configured to input the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature; an eighth obtaining unit, configured to input the k-th second intermediate feature into the (k+1)-th second graph convolution layer, so as to obtain the (k+1)-th second intermediate feature, where k = 1, …, K-1; and a ninth obtaining unit, configured to obtain the second audio feature according to the K second intermediate features.
In some embodiments, the first identification module comprises: the first fusion unit is used for executing fusion operation on the second content characteristic and the second audio characteristic to obtain a fusion characteristic; and the first identification unit is used for identifying the emotion of the target object according to the fusion characteristics.
Fig. 7 is a block diagram of an apparatus for training an emotion recognition model according to one embodiment of the present disclosure.
As shown in fig. 7, the apparatus 700 includes a second obtaining module 710, a third obtaining module 720, a fourth obtaining module 730, a second identifying module 740, a fifth obtaining module 750, and a training module 760. The emotion recognition model includes a first feature extraction model.
The second obtaining module 710 is configured to obtain a first content feature and a first audio feature of the sample data.
The third obtaining module 720 is configured to input the first content feature into the first feature extraction model to obtain a second content feature.
The fourth obtaining module 730 is configured to input the first audio feature into the first feature extraction model to obtain a second audio feature.
The second identifying module 740 is configured to identify an emotion of the sample object corresponding to the sample data according to the second content feature and the second audio feature.
A fifth obtaining module 750, configured to obtain a loss value according to the emotion of the sample object and the label of the sample data.
And the training module 760 is configured to train the emotion recognition model according to the loss value.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph structure adopted by the graph convolution sub-model is a chain graph structure, and the first adjacency matrix corresponding to the chain graph structure is:
$$A_C=\begin{pmatrix} 0 & a & & & & a\\ a & 0 & a & & & \\ & a & 0 & \ddots & & \\ & & \ddots & \ddots & a & \\ & & & a & 0 & a\\ a & & & & a & 0 \end{pmatrix}$$

wherein $A_C$ is the first adjacency matrix, and a is a real number greater than 0; the first adjacency matrix is an N × N matrix, N being a positive integer greater than 2; the (i+1)-th row vector of the first adjacency matrix is obtained by cyclically shifting the i-th row vector one position to the right, and i is an integer greater than 1 and less than or equal to N-2.
In some embodiments, the first feature extraction model includes a graph convolution sub-model, the graph convolution sub-model uses a graph structure of a line graph structure, and the second adjacency matrix corresponding to the line graph structure is:
$$A_L=\begin{pmatrix} 0 & b & & & & \\ b & 0 & b & & & \\ & b & 0 & \ddots & & \\ & & \ddots & \ddots & b & \\ & & & b & 0 & b\\ & & & & b & 0 \end{pmatrix}$$

wherein $A_L$ is the second adjacency matrix, and b is a real number greater than 0; the second adjacency matrix is an M × M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by cyclically shifting the j-th row vector one position to the right, and j is an integer greater than 1 and less than or equal to M-2.
In some embodiments, the emotion recognition model includes a second feature extraction model, and the second obtaining module includes: a tenth obtaining unit, configured to input the sample data into a second feature extraction model, so as to obtain text information and time information of the sample data; an eleventh obtaining unit, configured to obtain the first content feature according to the text information; a twelfth obtaining unit, configured to obtain the first audio feature according to the text information and the time information.
In some embodiments, the graph convolution submodel includes a first graph convolution network including H first graph convolution layers, and the third obtaining module includes: a thirteenth obtaining unit, configured to input the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature; a fourteenth obtaining unit, configured to input the h-th first intermediate feature into the (h+1)-th first graph convolution layer, so as to obtain the (h+1)-th first intermediate feature, where h = 1, …, H-1; and a fifteenth obtaining unit, configured to obtain the second content feature according to the H first intermediate features.

In some embodiments, the graph convolution submodel includes a second graph convolution network including K second graph convolution layers, and the fourth obtaining module includes: a sixteenth obtaining unit, configured to input the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature; a seventeenth obtaining unit, configured to input the k-th second intermediate feature into the (k+1)-th second graph convolution layer, so as to obtain the (k+1)-th second intermediate feature, where k = 1, …, K-1; and an eighteenth obtaining unit, configured to obtain the second audio feature according to the K second intermediate features.
In some embodiments, the emotion recognition model includes a fusion model and a classification model, and the second recognition module includes: the second fusion unit is used for inputting the second content characteristic and the second audio characteristic into the fusion model to obtain a fusion characteristic; and a second recognition unit for inputting the fusion feature into the classification model to recognize the emotion of the sample object.
In some embodiments, the training module is further configured to adjust parameters of the first feature extraction model according to the loss value to train the emotion recognition model.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 801 performs the various methods and processes described above, such as a method of recognizing emotion and/or a method of training an emotion recognition model. For example, in some embodiments, the method of recognizing emotion and/or the method of training an emotion recognition model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM802 and/or communications unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the above described method of recognizing emotion and/or method of training an emotion recognition model may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of recognizing emotions and/or the method of training an emotion recognition model.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (20)

1. A method of recognizing emotion, comprising:
acquiring a first content feature and a first audio feature of target data;
inputting the first content feature into a first feature extraction model to obtain a second content feature;
inputting the first audio feature into the first feature extraction model to obtain a second audio feature; and
identifying, according to the second content feature and the second audio feature, an emotion of a target object corresponding to the target data.
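By way of a non-limiting illustration of claim 1, the sketch below wires the two branches through a single shared "first feature extraction model" and hands the two refined features to a classifier. The module and attribute names (EmotionRecognizer, first_feature_extractor, classifier) are assumptions introduced here for readability; they do not appear in the patent.

```python
import torch
import torch.nn as nn

class EmotionRecognizer(nn.Module):
    """Illustrative wiring of claim 1: one shared first feature extraction model
    refines the first content feature and the first audio feature, and a
    classifier predicts the emotion from the two refined features."""

    def __init__(self, first_feature_extractor: nn.Module, classifier: nn.Module):
        super().__init__()
        self.first_feature_extractor = first_feature_extractor
        self.classifier = classifier  # maps (second content, second audio) -> emotion logits

    def forward(self,
                first_content_feature: torch.Tensor,
                first_audio_feature: torch.Tensor) -> torch.Tensor:
        second_content_feature = self.first_feature_extractor(first_content_feature)
        second_audio_feature = self.first_feature_extractor(first_audio_feature)
        return self.classifier(second_content_feature, second_audio_feature)
```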
2. The method of claim 1, wherein the first feature extraction model comprises a graph convolution submodel, the graph structure adopted by the graph convolution submodel is a chain graph structure, and a first adjacency matrix corresponding to the chain graph structure is:
(formula image FDA0003323632170000011: definition of the first adjacency matrix A_C)
wherein A_C is the first adjacency matrix, and a is a real number greater than 0;
the first adjacency matrix is an N x N matrix, N is a positive integer greater than 2, the (i+1)-th row vector of the first adjacency matrix is obtained by circularly shifting the i-th row vector one bit to the right, and i is an integer greater than 1 and less than or equal to N-2.
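The matrix entries themselves are given only in the formula image above, but the textual limitations (an N x N matrix with weight a > 0 in which each row is the previous row circularly shifted one bit to the right) describe a circulant construction. The numpy sketch below builds such a matrix from an assumed first row that places a on the immediate neighbour; the actual first row of A_C is not reproduced in this text, so that choice is an assumption. The line-graph matrix A_L of claim 3 can be built the same way from its own first row with weight b.

```python
import numpy as np

def circulant_adjacency(first_row: np.ndarray) -> np.ndarray:
    """Build an N x N matrix whose (i+1)-th row is the i-th row
    circularly shifted one position to the right."""
    n = first_row.shape[0]
    return np.stack([np.roll(first_row, shift=i) for i in range(n)])

# Assumed example first row (not taken from the patent image): weight a on the next node.
a, n = 1.0, 5
first_row = np.zeros(n)
first_row[1] = a
A_C = circulant_adjacency(first_row)
```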
3. The method of claim 1, wherein the first feature extraction model comprises a graph convolution submodel, the graph structure adopted by the graph convolution submodel is a line graph structure, and a second adjacency matrix corresponding to the line graph structure is:
(formula image FDA0003323632170000012: definition of the second adjacency matrix A_L)
wherein A_L is the second adjacency matrix, and b is a real number greater than 0;
the second adjacency matrix is an M x M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by circularly shifting the j-th row vector one bit to the right, and j is an integer greater than 1 and less than or equal to M-2.
4. The method of claim 1, wherein the acquiring of the first content feature and the first audio feature of the target data comprises:
inputting the target data into a second feature extraction model to obtain text information and time information of the target data;
obtaining the first content feature according to the text information; and
obtaining the first audio feature according to the text information and the time information.
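Claim 4 only requires that a second feature extraction model yield text information and time information. One plausible, assumed realization is a speech recognizer that outputs word-level transcripts with timestamps: the content feature is then derived from the words, and the audio feature from acoustic frames selected by the timestamps. All names and feature choices below are illustrative, not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class RecognizedWord:
    text: str     # text information
    start: float  # time information, in seconds
    end: float

def first_content_feature(words: List[RecognizedWord],
                          embed: Callable[[str], np.ndarray]) -> np.ndarray:
    # Content branch: one embedding vector per recognized word.
    return np.stack([embed(w.text) for w in words])

def first_audio_feature(words: List[RecognizedWord],
                        frames: np.ndarray,     # (num_frames, dim) acoustic features
                        frame_rate: float) -> np.ndarray:
    # Audio branch: average the acoustic frames that fall inside each word's time span.
    pooled = []
    for w in words:
        lo = int(w.start * frame_rate)
        hi = max(int(w.end * frame_rate), lo + 1)  # keep at least one frame per word
        pooled.append(frames[lo:hi].mean(axis=0))
    return np.stack(pooled)
```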
5. The method of claim 2 or 3, wherein the graph convolution submodel comprises a first graph convolution network comprising H first graph convolution layers,
and the inputting of the first content feature into the first feature extraction model to obtain the second content feature comprises:
inputting the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature;
inputting the h-th first intermediate feature into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature, where h = 1, …, H-1; and
obtaining the second content feature according to the H first intermediate features.
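The layer stacking of claims 5 and 6 can be sketched as follows, assuming each "graph convolution layer" applies the common propagation rule relu(A·X·W) with the fixed chain- or line-graph adjacency matrix, and assuming the H intermediate features are aggregated by concatenation (the claim only says the second feature is obtained "according to" them). The audio branch of claim 6 is the same construction with its own K layers.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph convolution layer: X' = relu(A @ X @ W), with A fixed."""
    def __init__(self, adjacency: torch.Tensor, in_dim: int, out_dim: int):
        super().__init__()
        self.register_buffer("adjacency", adjacency)  # chain- or line-graph matrix
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.linear(self.adjacency @ x))

class GraphConvBranch(nn.Module):
    """H stacked layers; the branch output is built from all H intermediate features."""
    def __init__(self, adjacency: torch.Tensor, dim: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [SimpleGraphConv(adjacency, dim, dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        intermediates = []
        for layer in self.layers:        # the h-th intermediate feature feeds layer h+1
            x = layer(x)
            intermediates.append(x)
        return torch.cat(intermediates, dim=-1)  # assumed aggregation of the H features
```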
6. The method of claim 2 or 3, wherein the graph convolution submodel comprises a second graph convolution network comprising K second graph convolution layers,
and the inputting of the first audio feature into the first feature extraction model to obtain the second audio feature comprises:
inputting the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature;
inputting the k-th second intermediate feature into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature, where k = 1, …, K-1; and
obtaining the second audio feature according to the K second intermediate features.
7. The method of any one of claims 1 to 6, wherein the identifying, according to the second content feature and the second audio feature, of the emotion of the target object corresponding to the target data comprises:
performing a fusion operation on the second content feature and the second audio feature to obtain a fused feature; and
identifying the emotion of the target object according to the fused feature.
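Claim 7 leaves the fusion operation unspecified; mean-pooling each branch and concatenating the results before a small classification head is one common choice, used below purely as an illustration.

```python
import torch
import torch.nn as nn

class FuseAndClassify(nn.Module):
    """Fuse the second content feature and second audio feature, then classify.
    Assumes each branch output has shape (num_nodes, dim)."""
    def __init__(self, content_dim: int, audio_dim: int, num_emotions: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(content_dim + audio_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, content: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Assumed fusion: mean-pool each branch, then concatenate.
        fused = torch.cat([content.mean(dim=0), audio.mean(dim=0)], dim=-1)
        return self.head(fused)
```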
8. A method of training an emotion recognition model, the emotion recognition model comprising a first feature extraction model, the method comprising:
acquiring a first content feature and a first audio feature of sample data;
inputting the first content feature into the first feature extraction model to obtain a second content feature;
inputting the first audio feature into the first feature extraction model to obtain a second audio feature;
identifying, according to the second content feature and the second audio feature, an emotion of a sample object corresponding to the sample data;
obtaining a loss value according to the emotion of the sample object and a label of the sample data; and
training the emotion recognition model according to the loss value.
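A minimal training-loop sketch for claim 8, assuming the recognized emotion is produced as class logits, the sample label is an integer class index, cross-entropy serves as the loss value, and the data loader yields batched tensors. None of these choices are fixed by the claim.

```python
import torch.nn as nn

def train_one_epoch(model: nn.Module, data_loader, optimizer) -> None:
    """One pass over the sample data: extract features, recognize the emotion,
    compute the loss value against the labels, and update the model."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for first_content, first_audio, label in data_loader:
        logits = model(first_content, first_audio)  # second features -> emotion logits
        loss = criterion(logits, label)             # loss value from emotion vs. label
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                            # train the emotion recognition model
```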
9. The method of claim 8, wherein the first feature extraction model comprises a graph convolution submodel, the graph structure adopted by the graph convolution submodel is a chain graph structure, and a first adjacency matrix corresponding to the chain graph structure is:
(formula image FDA0003323632170000031: definition of the first adjacency matrix A_C)
wherein A_C is the first adjacency matrix, and a is a real number greater than 0;
wherein the first adjacency matrix is an N x N matrix, and N is a positive integer greater than 2; the (i+1)-th row vector of the first adjacency matrix is obtained by circularly shifting the i-th row vector one bit to the right, and i is an integer greater than 1 and less than or equal to N-2.
10. The method of claim 8, wherein the first feature extraction model comprises a graph convolution submodel, the graph structure adopted by the graph convolution submodel is a line graph structure, and a second adjacency matrix corresponding to the line graph structure is:
(formula image FDA0003323632170000032: definition of the second adjacency matrix A_L)
wherein A_L is the second adjacency matrix, and b is a real number greater than 0;
the second adjacency matrix is an M x M matrix, M is a positive integer greater than 2, the (j+1)-th row vector of the second adjacency matrix is obtained by circularly shifting the j-th row vector one bit to the right, and j is an integer greater than 1 and less than or equal to M-2.
11. The method of claim 8, wherein the emotion recognition model comprises a second feature extraction model,
and the acquiring of the first content feature and the first audio feature of the sample data comprises:
inputting the sample data into the second feature extraction model to obtain text information and time information of the sample data;
obtaining the first content feature according to the text information; and
obtaining the first audio feature according to the text information and the time information.
12. The method of claim 9 or 10, wherein the graph convolution submodel comprises a first graph convolution network comprising H first graph convolution layers,
and the inputting of the first content feature into the first feature extraction model to obtain the second content feature comprises:
inputting the first content feature into the 1st first graph convolution layer to obtain the 1st first intermediate feature;
inputting the h-th first intermediate feature into the (h+1)-th first graph convolution layer to obtain the (h+1)-th first intermediate feature, where h = 1, …, H-1; and
obtaining the second content feature according to the H first intermediate features.
13. The method of claim 9 or 10, wherein the graph convolution submodel comprises a second graph convolution network comprising K second graph convolution layers,
and the inputting of the first audio feature into the first feature extraction model to obtain the second audio feature comprises:
inputting the first audio feature into the 1st second graph convolution layer to obtain the 1st second intermediate feature;
inputting the k-th second intermediate feature into the (k+1)-th second graph convolution layer to obtain the (k+1)-th second intermediate feature, where k = 1, …, K-1; and
obtaining the second audio feature according to the K second intermediate features.
14. The method according to any one of claims 8 to 13, wherein the emotion recognition model comprises a fusion model and a classification model,
and the identifying, according to the second content feature and the second audio feature, of the emotion of the sample object corresponding to the sample data comprises:
inputting the second content feature and the second audio feature into the fusion model to obtain a fused feature; and
inputting the fused feature into the classification model to identify the emotion of the sample object.
15. The method of any one of claims 8 to 14, wherein the training of the emotion recognition model according to the loss value comprises:
adjusting parameters of the first feature extraction model according to the loss value, so as to train the emotion recognition model.
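Claim 15 ties the parameter update specifically to the first feature extraction model. One way to realize this, sketched below, is to hand only that submodule's parameters to the optimizer; the attribute name first_feature_extractor is an assumption.

```python
import torch
import torch.nn as nn

def optimizer_for_first_extractor(emotion_model: nn.Module,
                                  lr: float = 1e-4) -> torch.optim.Optimizer:
    """Optimize only the first feature extraction model's parameters (claim 15),
    assuming it is exposed as the attribute `first_feature_extractor`."""
    return torch.optim.Adam(emotion_model.first_feature_extractor.parameters(), lr=lr)

def training_step(loss: torch.Tensor, optimizer: torch.optim.Optimizer) -> None:
    optimizer.zero_grad()
    loss.backward()   # gradients of the loss value
    optimizer.step()  # only the first feature extraction model is adjusted
```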
16. An apparatus for recognizing emotion, comprising:
a first acquisition module configured to acquire a first content feature and a first audio feature of target data;
a first obtaining module configured to input the first content feature into a first feature extraction model to obtain a second content feature;
a second obtaining module configured to input the first audio feature into the first feature extraction model to obtain a second audio feature; and
a first identification module configured to identify, according to the second content feature and the second audio feature, an emotion of a target object corresponding to the target data.
17. An apparatus for training an emotion recognition model, the emotion recognition model comprising a first feature extraction model, the apparatus comprising:
a second acquisition module configured to acquire a first content feature and a first audio feature of sample data;
a third obtaining module configured to input the first content feature into the first feature extraction model to obtain a second content feature;
a fourth obtaining module configured to input the first audio feature into the first feature extraction model to obtain a second audio feature;
a second identification module configured to identify, according to the second content feature and the second audio feature, an emotion of a sample object corresponding to the sample data;
a fifth obtaining module configured to obtain a loss value according to the emotion of the sample object and a label of the sample data; and
a training module configured to train the emotion recognition model according to the loss value.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 15.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 15.
20. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 15.
CN202111259447.7A 2021-10-27 2021-10-27 Method for recognizing emotion, method, device and equipment for training emotion recognition model Pending CN113990353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111259447.7A CN113990353A (en) 2021-10-27 2021-10-27 Method for recognizing emotion, method, device and equipment for training emotion recognition model

Publications (1)

Publication Number Publication Date
CN113990353A true CN113990353A (en) 2022-01-28

Family

ID=79743024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111259447.7A Pending CN113990353A (en) 2021-10-27 2021-10-27 Method for recognizing emotion, method, device and equipment for training emotion recognition model

Country Status (1)

Country Link
CN (1) CN113990353A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017218243A2 (en) * 2016-06-13 2017-12-21 Microsoft Technology Licensing, LLC Intent recognition and emotional text-to-speech learning system
CN108520275A (en) * 2017-06-28 2018-09-11 Zhejiang University Link information regularization system, graph feature extraction system, graph classification system and method based on adjacency matrix
CN112015872A (en) * 2019-05-29 2020-12-01 Huawei Technologies Co., Ltd. Question recognition method and device
CN112489635A (en) * 2020-12-03 2021-03-12 Hangzhou Dianzi University Multi-mode emotion recognition method based on attention enhancement mechanism
CN112735404A (en) * 2020-12-18 2021-04-30 Ping An Technology (Shenzhen) Co., Ltd. Irony detection method, system, terminal device and storage medium
CN112948541A (en) * 2021-02-01 2021-06-11 South China University of Technology Financial news text emotional tendency analysis method based on graph convolution network
CN112818861A (en) * 2021-02-02 2021-05-18 Nanjing University of Posts and Telecommunications Emotion classification method and system based on multi-mode context semantic features
CN113112994A (en) * 2021-04-21 2021-07-13 Jiangsu Normal University Cross-corpus emotion recognition method based on graph convolution neural network
CN113223560A (en) * 2021-04-23 2021-08-06 Ping An Technology (Shenzhen) Co., Ltd. Emotion recognition method, device, equipment and storage medium
CN113378160A (en) * 2021-06-11 2021-09-10 Zhejiang University of Technology Graph neural network model defense method and device based on generative adversarial network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FEI TAO ET AL.: "An ensemble framework of voice-based emotion recognition system for films and TV programs", ICASSP 2018 *

Similar Documents

Publication Publication Date Title
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN109003625B (en) Speech emotion recognition method and system based on ternary loss
CN108428446A (en) Audio recognition method and device
CN114895817B (en) Interactive information processing method, network model training method and device
CN115358243A (en) Training method, device, equipment and storage medium for multi-round dialogue recognition model
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN113947189A (en) Training method and device for image generation model, electronic equipment and storage medium
CN114020950A (en) Training method, device and equipment of image retrieval model and storage medium
CN111554270B (en) Training sample screening method and electronic equipment
EP4024393A2 (en) Training a speech recognition model
CN113990353A (en) Method for recognizing emotion, method, device and equipment for training emotion recognition model
CN112530415B (en) Negative reply recognition model acquisition and negative reply recognition method and device
CN115223573A (en) Voice wake-up method and device, electronic equipment and storage medium
CN114882334A (en) Method for generating pre-training model, model training method and device
CN114119972A (en) Model acquisition and object processing method and device, electronic equipment and storage medium
CN113889085A (en) Speech recognition method, apparatus, device, storage medium and program product
CN113889089A (en) Method and device for acquiring voice recognition model, electronic equipment and storage medium
CN113468857A (en) Method and device for training style conversion model, electronic equipment and storage medium
CN113792540A (en) Intention recognition model updating method and related equipment
CN113641724A (en) Knowledge tag mining method and device, electronic equipment and storage medium
CN112735432A (en) Audio recognition method and device, electronic equipment and storage medium
CN112632999A (en) Named entity recognition model obtaining method, named entity recognition device and named entity recognition medium
CN113380233B (en) Audio recognition method, device, training method, training device, equipment and storage medium
CN115169549B (en) Artificial intelligent model updating method and device, electronic equipment and storage medium
CN116631379B (en) Speech recognition method, device, equipment and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination