CN109003625B - Speech emotion recognition method and system based on ternary loss - Google Patents

Speech emotion recognition method and system based on ternary loss

Info

Publication number
CN109003625B
CN109003625B
Authority
CN
China
Prior art keywords
voice
emotion
speech
preset
ternary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810839374.0A
Other languages
Chinese (zh)
Other versions
CN109003625A (en)
Inventor
陶建华
黄健
李雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201810839374.0A priority Critical patent/CN109003625B/en
Publication of CN109003625A publication Critical patent/CN109003625A/en
Application granted granted Critical
Publication of CN109003625B publication Critical patent/CN109003625B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention belongs to the technical field of emotion recognition, and particularly relates to a speech emotion recognition method and system based on ternary loss, aiming at solving the technical problem of accurately recognizing easily confusable emotion categories. To this end, the speech emotion recognition method of the present invention includes: performing framing processing on voice data to be detected to obtain a voice sequence of a specific length; performing time sequence coding of the voice sequence based on a preset emotion time sequence coding network to obtain the emotion feature vector corresponding to the voice sequence; and predicting the emotion category corresponding to the emotion feature vector based on a preset speech emotion classifier and according to a plurality of preset real emotion categories. The speech emotion recognition method can better distinguish easily confusable speech emotion categories, and the speech emotion recognition system can execute and implement the method.

Description

Speech emotion recognition method and system based on ternary loss
Technical Field
The invention belongs to the technical field of emotion recognition, and particularly relates to a voice emotion recognition method and system based on ternary loss.
Background
Speech emotion recognition has wide application in human-computer interaction and artificial intelligence, and is a key research direction in these fields. It mainly comprises two parts: speech emotion feature extraction and speech emotion recognition model training. Most speech emotion recognition methods focus on extracting robust and effective speech emotion features and on finding effective emotion recognition models. However, emotions are inherently ambiguous, and some emotion categories are particularly easy to confuse with each other, such as "angry" and "disgust", or "surprise" and "sad".
In addition, speech emotion recognition must handle inputs of variable length. Traditional machine learning methods require fixed-length inputs, and the usual approach is to truncate long samples and zero-pad short ones; the experimental results of such methods are not ideal.
Accordingly, there is a need in the art for a new speech emotion recognition method and system to solve the above problems.
Disclosure of Invention
The invention aims to solve the above technical problem in the prior art, namely how to accurately recognize easily confusable emotion categories. To this end, in one aspect of the present invention, a speech emotion recognition method based on ternary loss is provided, which includes:
performing framing processing on voice data to be detected to obtain a voice sequence with a specific length;
carrying out time sequence coding based on a preset emotion time sequence coding network and according to the voice sequence to obtain an emotion feature vector corresponding to the voice sequence;
predicting emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm.
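For illustration, a minimal sketch of this two-stage pipeline is given below, assuming PyTorch for the LSTM-based emotion time sequence coding network and scikit-learn for the support vector machine classifier; class and variable names such as EmotionEncoder and embed, as well as the feature dimensions, are hypothetical placeholders and not values from the patent.

```python
# Minimal sketch of the pipeline described above: an LSTM-based emotion time
# sequence coding network produces one emotion feature vector per utterance,
# and an SVM classifier predicts the emotion category. Names are illustrative.
import torch
import torch.nn as nn
from sklearn.svm import SVC

class EmotionEncoder(nn.Module):
    """Emotion time sequence coding network (LSTM encoder)."""
    def __init__(self, feat_dim=40, hidden_dim=128, emb_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, emb_dim)

    def forward(self, x):                  # x: (batch, F frames, feat_dim)
        _, (h_n, _) = self.lstm(x)         # h_n: (1, batch, hidden_dim)
        return self.proj(h_n[-1])          # one emotion feature vector per utterance

encoder = EmotionEncoder()

# After the encoder has been trained with the ternary (triplet + cross-entropy)
# loss, embed the training utterances and fit the SVM emotion classifier.
def embed(batch):                          # batch: (n, F, feat_dim) tensor
    with torch.no_grad():
        return encoder(batch).numpy()

train_x = torch.randn(32, 300, 40)              # placeholder frame-level features
train_y = torch.randint(0, 4, (32,)).numpy()    # placeholder emotion labels
svm = SVC(kernel="rbf").fit(embed(train_x), train_y)
pred = svm.predict(embed(torch.randn(1, 300, 40)))   # predicted emotion category
```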
Further, a preferred technical solution provided by the present invention is:
before the step of "obtaining an emotion feature vector corresponding to the voice sequence by performing time sequence coding according to the voice sequence based on a preset emotion time sequence coding network", the method further includes:
obtaining a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
and carrying out network training on the emotion time sequence coding network according to the ternary voice sample group and a loss function shown as the following formula:
L = L_1 + L_2

wherein L_1 denotes a preset triplet loss function and L_2 denotes a preset cross-entropy loss function;
said L_1 is shown in the following formula:

L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+

wherein the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples in the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; \alpha denotes a preset distance parameter;
said L_2 is shown in the following formula:

L_2 = -\sum_i y_i \log \hat{y}_i

wherein y_i denotes a preset i-th real emotion category label, and \hat{y}_i denotes the value of y_i after logistic-regression processing.
Further, a preferred technical solution provided by the present invention is:
the step of obtaining a plurality of ternary sets of speech samples from the speech data samples comprises:
obtaining a ternary voice sample group according to the voice data sample and according to the method shown in the following formula:
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2

wherein \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n).
Further, a preferred technical solution provided by the present invention is:
the step of framing the voice data to be tested to obtain the voice sequence with specific length comprises the following steps:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
Further, a preferred technical solution provided by the present invention is:
the step of obtaining the emotion feature vector corresponding to the voice sequence by performing time sequence coding according to the voice sequence based on a preset emotion time sequence coding network comprises the following steps:
obtaining the emotion feature vector corresponding to the voice sequence according to the method shown in the following formula:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)

h_t = o_t \cdot \tanh(c_t)

wherein i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function.
In another aspect of the present invention, there is also provided a speech emotion recognition system based on ternary loss, including:
the variable-length input processing module is configured to perform framing processing on the voice data to be detected to obtain a voice sequence with a specific length;
the emotion time sequence coding module is configured to perform time sequence coding according to the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence;
the voice emotion recognition module is configured to predict emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm.
Further, a preferred technical solution provided by the present invention is:
the voice emotion recognition system also comprises a ternary voice sample group acquisition module and a voice emotion loss function module; the voice emotion loss function module comprises a ternary loss function submodule and a cross entropy loss function submodule;
the ternary voice sample group acquisition module is configured to acquire a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
the ternary loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+

wherein the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples in the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; \alpha denotes a preset distance parameter;
the cross entropy loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
L_2 = -\sum_i y_i \log \hat{y}_i

wherein y_i denotes a preset i-th real emotion category label, and \hat{y}_i denotes the value of y_i after logistic-regression processing.
Further, a preferred technical solution provided by the present invention is:
the ternary voice sample group obtaining module is further configured to obtain a ternary voice sample group according to the voice data sample and according to the following method:
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2

wherein \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n).
Further, a preferred technical solution provided by the present invention is:
the variable length input processing module is further configured to perform the following operations:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
Further, a preferred technical solution provided by the present invention is:
the emotion time sequence coding module is further configured to obtain an emotion feature vector according to a method shown in the following formula:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)

h_t = o_t \cdot \tanh(c_t)

wherein i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function.
Compared with the closest prior art, the technical scheme at least has the following beneficial effects:
1. the speech emotion recognition method based on ternary loss mainly comprises the following steps: framing the voice data to be detected to obtain a voice sequence with a specific length; carrying out time sequence coding on the voice sequence by utilizing an emotion time sequence coding network to obtain a robust emotion characteristic vector; and predicting the emotion type of the voice data to be detected based on the voice emotion classifier and by using the emotion feature vector. The method can effectively improve the speech emotion recognition precision.
2. The method trains the emotion time sequence coding network with a triplet loss function over ternary voice sample groups; the triplet loss reduces the distance between positive sample pairs and increases the distance between negative sample pairs, i.e., it reduces the intra-class distance and increases the inter-class distance, so that the speech emotions of different classes are easier to distinguish.
3. In the invention, the criterion for selecting the ternary voice sample groups used to train the emotion time sequence coding network is that the distance from the anchor sample's emotion feature to the positive sample's emotion feature is greater than the distance to the negative sample's emotion feature, i.e., "difficult" samples are selected. As a result, the trained emotion time sequence coding network produces more robust emotion feature vectors, useless training on "easy" samples is avoided, and network convergence is accelerated.
4. According to the method, the voice data to be detected is framed according to a time threshold to obtain a plurality of voice frames; the number of voice frames is compared with the frame number threshold F, and a voice sequence is acquired based on the comparison result and the plurality of voice frames. This handles the variable-length input problem well and improves the accuracy of speech emotion recognition.
5. The invention provides a speech emotion recognition system based on ternary loss, which can realize the speech emotion recognition method based on ternary loss.
Drawings
FIG. 1 is a schematic diagram of the main steps of a speech emotion recognition method based on ternary loss in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a main structure of a speech emotion recognition system based on ternary loss according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a main structure of a variable-length input processing module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a main structure of a speech emotion recognition system based on ternary loss according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The emotion recognition method based on ternary loss provided by the invention is explained below with reference to the accompanying drawings.
Fig. 1 exemplarily shows main steps of an emotion recognition method based on ternary loss in this embodiment, and as shown in fig. 1, an emotion recognition method based on ternary loss in this embodiment may include the following steps:
step S101: and performing framing processing on the voice data to be detected to obtain a voice sequence with a specific length.
Specifically, framing processing is carried out on voice data to be detected according to a preset time threshold value, and a plurality of voice frames are obtained;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and a plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking a plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences, namely deleting the voice frames at the head end and the tail end in the voice data to be detected so that the number of the voice frames is equal to the frame number threshold value F;
if the number of the voice frames is less than the frame number threshold value F, the invention provides three processing modes: a loop mode, a copy mode and a fill mode. In the loop mode, the plurality of voice frames are treated as a whole, and this whole is copied and spliced repeatedly until the total frame number is greater than the frame number threshold F; F consecutive voice frames are then randomly selected from it as the voice sequence. In the copy mode, each voice frame is copied and spliced repeatedly until the total frame number is greater than the frame number threshold F, and F consecutive voice frames are randomly selected as the voice sequence. In the fill mode, the last voice frame of the voice data to be detected is copied and appended repeatedly until the total frame number equals the frame number threshold F.
In this embodiment, speech data from the speech emotion database IEMOCAP is used. For the case where the number of voice frames is less than the frame number threshold F, the speech emotion recognition accuracy of the three modes is compared. Table 1 shows the comparison of speech emotion recognition accuracy for the three modes:
TABLE 1

Mode        Accuracy
Loop mode   59.6%
Copy mode   58.1%
Fill mode   56.3%
As can be seen from Table 1, the loop mode works best, the copy mode second, and the fill mode worst. The loop mode repeats the whole dynamic process of emotion change, making it longer and periodic, which helps the long short-term memory model capture emotion dynamics. The copy mode repeats individual frames, which effectively slows down the dynamic process of emotion change and hinders its modeling, and the frames appended by the fill mode contribute nothing to the dynamics of emotion change, so these two modes perform worse. Therefore, in this embodiment, when the number of speech frames is smaller than the frame number threshold F, the loop mode is used to obtain the voice sequence.
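By way of illustration, the three modes compared in Table 1 could be implemented roughly as in the sketch below; the function name to_fixed_length, the use of NumPy arrays, and the example dimensions are assumptions made for this example only.

```python
# Sketch of the variable-length handling described above. `frames` is an
# (n_frames, feat_dim) array of frame-level features; F is the frame threshold.
import numpy as np

def to_fixed_length(frames, F, mode="loop"):
    n = len(frames)
    if n == F:
        return frames
    if n > F:
        # keep F consecutive frames starting from a random position
        start = np.random.randint(0, n - F + 1)
        return frames[start:start + F]
    if mode == "loop":        # repeat the whole sequence, then crop F frames
        reps = int(np.ceil(F / n)) + 1
        tiled = np.tile(frames, (reps, 1))
        start = np.random.randint(0, len(tiled) - F + 1)
        return tiled[start:start + F]
    if mode == "copy":        # repeat each frame in place, then crop F frames
        reps = int(np.ceil(F / n)) + 1
        repeated = np.repeat(frames, reps, axis=0)
        start = np.random.randint(0, len(repeated) - F + 1)
        return repeated[start:start + F]
    # "fill": pad by repeating the last frame until exactly F frames
    pad = np.tile(frames[-1:], (F - n, 1))
    return np.concatenate([frames, pad], axis=0)

frames = np.random.randn(120, 40)                    # e.g. 120 frames of 40-dim features
seq = to_fixed_length(frames, F=300, mode="loop")    # -> shape (300, 40)
```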
Step S102: and carrying out time sequence coding based on a preset emotion time sequence coding network and according to the voice sequence to obtain an emotion feature vector corresponding to the voice sequence.
Specifically, the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm. The emotion time sequence coding network obtains the emotion feature vector corresponding to the voice sequence according to formulas (1) to (5):
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)    (1)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)    (2)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (3)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)    (4)

h_t = o_t \cdot \tanh(c_t)    (5)

where i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function. In addition, h_t and c_t are intermediate data within the emotion time sequence coding network and are expressed in matrix form.
The embodiment further comprises a step of network training the emotion time sequence coding network, which specifically comprises the following steps:
acquiring a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, the emotion classes of the first voice data sample and the second voice data sample are the same, and the emotion classes of the first voice data sample and the third voice data sample are different;
and (3) carrying out network training on the emotion time sequence coding network according to the obtained ternary voice sample group and the loss function shown in the formula (6):
L = L_1 + L_2    (6)

where L_1 denotes a preset triplet loss function and L_2 denotes a preset cross-entropy loss function. The triplet loss function decreases the distance between positive sample pairs and increases the distance between negative sample pairs. The cross-entropy loss function is a supervised loss used to supervise the learning of the network and to guide the clustering of samples of the same emotion class by using the preset sample class information.
L_1 is shown in equation (7):

L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+    (7)

where the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples of the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; and \alpha denotes a preset distance parameter.

L_2 is shown in equation (8):

L_2 = -\sum_i y_i \log \hat{y}_i    (8)

where y_i denotes the preset i-th real emotion category label and \hat{y}_i denotes the value of y_i after logistic-regression processing.
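To make the combined objective of equations (6) to (8) concrete, the following sketch computes L = L_1 + L_2 for a batch of ternary voice sample groups, assuming PyTorch; the linear classification head used to obtain \hat{y}, the margin value, and the tensor shapes are hypothetical.

```python
# Sketch of the combined training loss L = L1 + L2 from equations (6)-(8).
# `anchor`, `positive`, `negative` are encoder outputs f(x) for the first,
# second and third samples of each ternary voice sample group.
import torch
import torch.nn.functional as F

def ternary_loss(anchor, positive, negative, logits, labels, alpha=0.2):
    # L1: triplet loss with squared 2-norm distances and margin alpha, eq. (7)
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    l1 = torch.clamp(d_pos - d_neg + alpha, min=0.0).sum()
    # L2: cross-entropy between true labels y and predicted outputs y_hat, eq. (8)
    l2 = F.cross_entropy(logits, labels)
    return l1 + l2                                      # eq. (6)

emb_dim, n_classes, batch = 64, 4, 8
anchor = torch.randn(batch, emb_dim, requires_grad=True)
positive = torch.randn(batch, emb_dim)
negative = torch.randn(batch, emb_dim)
head = torch.nn.Linear(emb_dim, n_classes)              # classification head for y_hat
labels = torch.randint(0, n_classes, (batch,))
loss = ternary_loss(anchor, positive, negative, head(anchor), labels)
loss.backward()                                         # gradients flow back to the encoder
```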
Further, the selection of the ternary voice sample groups used to train the emotion time sequence coding network is important: a very large number of ternary voice sample groups can be formed from the whole training set, but most of them do not benefit the training of the emotion time sequence coding network and would slow down its convergence. Therefore, "difficult" training samples are selected as far as possible, that is, ternary voice sample groups are selected from the training set to train the emotion time sequence coding network according to formula (9):
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2    (9)

where \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n). In this embodiment, when a ternary voice sample group is selected, an anchor sample x_i^a is given, a positive sample x_i^p is selected such that \| f(x_i^a) - f(x_i^p) \|_2^2 is as large as possible, and a negative sample x_i^n is selected such that \| f(x_i^a) - f(x_i^n) \|_2^2 is as small as possible; the emotion time sequence coding network is then trained on the resulting ternary voice sample groups. When training the emotion time sequence coding network, these selection operations are carried out only over the batch of samples input at each step.
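The within-batch selection of "difficult" ternary voice sample groups could be sketched as follows, assuming PyTorch; the hardest-positive/hardest-negative strategy shown here is one plausible reading of the selection rule above, and the function name and batch layout are illustrative.

```python
# Sketch of hard ternary sample selection within one training batch: for each
# anchor, pick the farthest sample of the same emotion class as the positive
# and the closest sample of a different class as the negative.
# Assumes every emotion class occurs at least twice in the batch.
import torch

def mine_hard_triplets(embeddings, labels):
    # squared 2-norm distances between all pairs of embeddings in the batch
    dist = torch.cdist(embeddings, embeddings, p=2).pow(2)     # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)          # (B, B) bool
    eye = torch.eye(len(labels), dtype=torch.bool)

    pos_dist = dist.masked_fill(~same | eye, float("-inf"))
    neg_dist = dist.masked_fill(same, float("inf"))
    hardest_pos = pos_dist.argmax(dim=1)    # farthest same-class sample
    hardest_neg = neg_dist.argmin(dim=1)    # closest different-class sample
    return hardest_pos, hardest_neg

emb = torch.randn(16, 64)                   # batch of 16 emotion feature vectors
lab = torch.randint(0, 4, (16,))            # 4 emotion classes
p_idx, n_idx = mine_hard_triplets(emb, lab)
anchor, positive, negative = emb, emb[p_idx], emb[n_idx]   # ternary sample groups
```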
The embodiment of the present invention further provides a speech emotion recognition system based on ternary loss, referring to fig. 2, fig. 2 exemplarily shows a main structure of a speech emotion recognition system based on ternary loss in this embodiment, and the speech emotion recognition system shown in fig. 2 may include:
a variable-length input processing module 21 configured to perform framing processing on the voice data to be detected to obtain a voice sequence with a specific length;
the emotion time sequence coding module 22 is configured to perform time sequence coding according to the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence;
the speech emotion recognition module 23 is configured to predict emotion classes corresponding to the emotion feature vectors based on a preset speech emotion classifier and according to a plurality of preset real emotion classes;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm.
Further, the variable-length input processing module is further configured to perform the following operations:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and a plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking a plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences, namely deleting the voice frames at the head end and the tail end in the voice data to be detected so that the number of the voice frames is equal to the frame number threshold value F;
and if the number of the voice frames is less than the frame number threshold value F, taking a plurality of voice frames as a data whole, copying and splicing the data whole for multiple times until the total frame number is greater than the frame number threshold value F, randomly selecting continuous F voice frames from the data whole as a voice sequence, or copying and splicing each voice frame for multiple times until the total frame number is greater than the frame number threshold value F, randomly selecting continuous F voice frames from the data whole as the voice sequence, or copying and splicing the last voice frame of the voice data to be tested for multiple times until the total frame number is equal to the frame number threshold value F.
Referring to fig. 3, fig. 3 illustrates the main structure of a variable-length input processing module according to an embodiment, and the variable-length input processing module shown in fig. 3 may further include a speech framing sub-module 211 and a variable-length input processing sub-module 212.
The voice framing submodule 211 is configured to perform framing processing on the voice data to be detected according to a preset time threshold, and obtain a plurality of voice frames.
And the variable length input processing sub-module 212 is configured to compare the number of the voice frames with a preset frame number threshold value F and acquire a voice sequence according to the comparison result and the plurality of voice frames.
Further, the emotion time sequence coding module is configured to obtain the emotion feature vector according to the methods shown in formulas (1) to (5).
Further, referring to fig. 4, fig. 4 illustrates the main structure of a speech emotion recognition system based on ternary loss, and as shown in fig. 4, the speech emotion recognition system further includes a ternary speech sample group obtaining module 41 and a speech emotion loss function module 42; the speech emotion loss function module includes a ternary loss function sub-module 421 and a cross entropy loss function sub-module 422.
The ternary speech sample set obtaining module 41 is configured to obtain a plurality of ternary speech sample sets from the speech data samples. The ternary set of speech samples includes a first speech data sample, a second speech data sample, and a third speech data sample, and the emotion classifications of the first speech data sample and the second speech data sample are the same and the emotion classifications of the first speech data sample and the third speech data sample are different.
Ternary loss function submodule 421 is configured to perform network training on the emotion time sequence coding network according to the loss function shown in equation (7).
The cross entropy loss function sub-module 422 is configured to perform network training on the emotion time series coding network according to the loss function shown in formula (8).
Further, the ternary speech sample set obtaining module 41 may obtain the ternary speech sample set according to the method shown in formula (9), that is, selecting a "difficult" training sample to train the emotion time sequence coding network.
It will be understood by those skilled in the art that the variable-length input processing module, the emotion time sequence coding module and the speech emotion recognition module may be physically independent of each other, or may be functional units integrated into one physical module. The emotion time sequence coding module described above may include a memory, a processor, and a computing program stored in the memory and executable on the processor, and the computing program may perform the functions of the variable-length input processing module, the emotion time sequence coding module and the speech emotion recognition module.
Those of skill in the art will appreciate that the various illustrative method steps and systems described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (8)

1. A speech emotion recognition method based on ternary loss is characterized by comprising the following steps:
performing framing processing on voice data to be detected to obtain a voice sequence with a specific length;
carrying out time sequence coding based on a preset emotion time sequence coding network and according to the voice sequence to obtain an emotion feature vector corresponding to the voice sequence;
predicting emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm;
before the step of obtaining the emotion feature vector corresponding to the voice sequence by performing time sequence coding according to the voice sequence based on a preset emotion time sequence coding network, the method further includes: obtaining a plurality of ternary voice sample groups according to the voice data samples;
and carrying out network training on the emotion time sequence coding network according to the ternary voice sample group and a loss function shown as the following formula:
L = L_1 + L_2

wherein L_1 denotes a preset triplet loss function and L_2 denotes a preset cross-entropy loss function;
the step of "framing the voice data to be tested and acquiring the voice sequence with a specific length" includes:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
2. The method for speech emotion recognition based on ternary loss according to claim 1, wherein before the step of obtaining the emotion feature vector corresponding to the speech sequence based on the preset emotion time sequence coding network and performing time sequence coding according to the speech sequence, the method further comprises:
the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
said L_1 is shown in the following formula:

L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+

wherein the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples in the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; \alpha denotes a preset distance parameter;

said L_2 is shown in the following formula:

L_2 = -\sum_i y_i \log \hat{y}_i

wherein y_i denotes a preset i-th real emotion category label, and \hat{y}_i denotes the value of y_i after logistic-regression processing.
3. The method of claim 2, wherein the step of obtaining a plurality of ternary speech sample groups from the speech data samples comprises:
obtaining a ternary voice sample group according to the voice data sample and according to the method shown in the following formula:
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2

wherein \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n).
4. The method for speech emotion recognition based on ternary loss according to any one of claims 1-3, wherein the step of obtaining the emotion feature vector corresponding to the speech sequence based on a preset emotion time sequence coding network and performing time sequence coding according to the speech sequence comprises:
obtaining the emotion feature vector corresponding to the voice sequence according to the method shown in the following formula:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)

h_t = o_t \cdot \tanh(c_t)

wherein i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function.
5. A speech emotion recognition system based on ternary loss is characterized by comprising:
the variable-length input processing module is configured to perform framing processing on the voice data to be detected to obtain a voice sequence with a specific length;
the emotion time sequence coding module is configured to perform time sequence coding according to the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence;
the voice emotion recognition module is configured to predict emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm;
the voice emotion recognition system comprises a voice emotion loss function module; the voice emotion loss function module comprises a ternary loss function submodule and a cross entropy loss function submodule;
wherein the variable length input processing module is further configured to perform the following operations:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
6. The ternary loss based speech emotion recognition system of claim 5, further comprising a ternary speech sample set acquisition module;
the ternary voice sample group acquisition module is configured to acquire a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
the ternary loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+

wherein the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples in the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; \alpha denotes a preset distance parameter;
the cross entropy loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
L_2 = -\sum_i y_i \log \hat{y}_i

wherein y_i denotes a preset i-th real emotion category label, and \hat{y}_i denotes the value of y_i after logistic-regression processing.
7. The system according to claim 6, wherein the ternary voice sample group acquisition module is further configured to obtain the ternary voice sample groups from the voice data samples according to the method shown in the following formula:
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2

wherein \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n).
8. The ternary loss based speech emotion recognition system of any of claims 5-7, wherein the emotion time sequence coding module is further configured to obtain the emotion feature vector according to the following method:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)

h_t = o_t \cdot \tanh(c_t)

wherein i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function.
CN201810839374.0A 2018-07-27 2018-07-27 Speech emotion recognition method and system based on ternary loss Active CN109003625B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810839374.0A CN109003625B (en) 2018-07-27 2018-07-27 Speech emotion recognition method and system based on ternary loss

Publications (2)

Publication Number Publication Date
CN109003625A CN109003625A (en) 2018-12-14
CN109003625B true CN109003625B (en) 2021-01-12

Family

ID=64597222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810839374.0A Active CN109003625B (en) 2018-07-27 2018-07-27 Speech emotion recognition method and system based on ternary loss

Country Status (1)

Country Link
CN (1) CN109003625B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11947593B2 (en) * 2018-09-28 2024-04-02 Sony Interactive Entertainment Inc. Sound categorization system
CN109599128B (en) * 2018-12-24 2022-03-01 北京达佳互联信息技术有限公司 Speech emotion recognition method and device, electronic equipment and readable medium
CN110059616A (en) * 2019-04-17 2019-07-26 南京邮电大学 Pedestrian's weight identification model optimization method based on fusion loss function
CN110223714B (en) * 2019-06-03 2021-08-03 杭州哲信信息技术有限公司 Emotion recognition method based on voice
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
CN111445899B (en) * 2020-03-09 2023-08-01 咪咕文化科技有限公司 Speech emotion recognition method, device and storage medium
CN111768764B (en) * 2020-06-23 2024-01-19 北京猎户星空科技有限公司 Voice data processing method and device, electronic equipment and medium
CN112084338B (en) * 2020-09-18 2024-02-06 达而观数据(成都)有限公司 Automatic document classification method, system, computer equipment and storage medium
CN114757310B (en) * 2022-06-16 2022-11-11 山东海量信息技术研究院 Emotion recognition model and training method, device, equipment and readable storage medium thereof
CN116528438B (en) * 2023-04-28 2023-10-10 广州力铭光电科技有限公司 Intelligent dimming method and device for lamp

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436806A (en) * 2011-09-29 2012-05-02 复旦大学 Audio frequency copy detection method based on similarity
CN107221320A (en) * 2017-05-19 2017-09-29 百度在线网络技术(北京)有限公司 Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120137367A1 (en) * 2009-11-06 2012-05-31 Cataphora, Inc. Continuous anomaly detection based on behavior modeling and heterogeneous information analysis
CN105469065B (en) * 2015-12-07 2019-04-23 中国科学院自动化研究所 A kind of discrete emotion identification method based on recurrent neural network
CN106782602B (en) * 2016-12-01 2020-03-17 南京邮电大学 Speech emotion recognition method based on deep neural network

Also Published As

Publication number Publication date
CN109003625A (en) 2018-12-14

Similar Documents

Publication Publication Date Title
CN109003625B (en) Speech emotion recognition method and system based on ternary loss
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
Dong et al. Automatic age estimation based on deep learning algorithm
CN109063666A (en) The lightweight face identification method and system of convolution are separated based on depth
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN113095357A (en) Multi-mode emotion recognition method and system based on attention mechanism and GMN
CN111598979B (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN108804453A (en) A kind of video and audio recognition methods and device
CN110047517A (en) Speech-emotion recognition method, answering method and computer equipment
CN112560495A (en) Microblog rumor detection method based on emotion analysis
Noroozi et al. Supervised vocal-based emotion recognition using multiclass support vector machine, random forests, and adaboost
Bahari Speaker age estimation using Hidden Markov Model weight supervectors
KR20220098991A (en) Method and apparatus for recognizing emtions based on speech signal
CN111460923A (en) Micro-expression recognition method, device, equipment and storage medium
Shivakumar et al. Simplified and supervised i-vector modeling for speaker age regression
CN110120231B (en) Cross-corpus emotion recognition method based on self-adaptive semi-supervised non-negative matrix factorization
Shih et al. Speech emotion recognition with ensemble learning methods
Pham et al. Vietnamese scene text detection and recognition using deep learning: An empirical study
CN115712739A (en) Dance action generation method, computer device and storage medium
CN106971731B (en) Correction method for voiceprint recognition
CN112738724B (en) Method, device, equipment and medium for accurately identifying regional target crowd
Chandrakala et al. Combination of generative models and SVM based classifier for speech emotion recognition
Elbarougy et al. Continuous audiovisual emotion recognition using feature selection and lstm
Tripathi et al. Facial expression recognition using data mining algorithm
Saeed et al. Robust Visual Lips Feature Extraction Method for Improved Visual Speech Recognition System

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant