CN109003625B - Speech emotion recognition method and system based on ternary loss
- Publication number
- CN109003625B (application CN201810839374.0A)
- Authority
- CN
- China
- Prior art keywords
- voice
- emotion
- speech
- preset
- ternary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention belongs to the technical field of emotion recognition, and particularly relates to a speech emotion recognition method and system based on ternary loss, aiming at solving the technical problem of accurately recognizing confusable emotion categories. To this end, the speech emotion recognition method of the invention includes: performing framing processing on voice data to be detected to obtain a voice sequence of a specific length; performing time sequence coding on the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence; and predicting the emotion category corresponding to the emotion feature vector based on a preset speech emotion classifier and according to a plurality of preset real emotion categories. The speech emotion recognition method can better distinguish confusable speech emotion categories, and the speech emotion recognition system of the invention can implement the method.
Description
Technical Field
The invention belongs to the technical field of emotion recognition, and particularly relates to a voice emotion recognition method and system based on ternary loss.
Background
Speech emotion recognition has wide application in human-computer interaction and artificial intelligence, and is a key research direction in these fields. Speech emotion recognition mainly comprises two parts: speech emotion feature extraction and speech emotion recognition model training. Most speech emotion recognition methods focus on extracting robust and effective speech emotion features and on finding effective emotion recognition models. However, emotions are inherently ambiguous, and some emotion categories are particularly easy to confuse with each other, such as the two categories "angry" and "disgust", or the two categories "surprise" and "sad".
In addition, speech emotion recognition faces the problem of variable-length input: traditional machine learning methods require input of a fixed length, and the usual approach is to truncate long samples and zero-pad short samples, but the experimental results of such methods are not ideal.
Accordingly, there is a need in the art for a new speech emotion recognition method and system to solve the above problems.
Disclosure of Invention
The invention aims to solve the above technical problem in the prior art, namely the technical problem of how to accurately recognize confusable emotion categories. To this end, in one aspect of the present invention, a speech emotion recognition method based on ternary loss is provided, which includes:
performing framing processing on voice data to be detected to obtain a voice sequence with a specific length;
carrying out time sequence coding based on a preset emotion time sequence coding network and according to the voice sequence to obtain an emotion feature vector corresponding to the voice sequence;
predicting emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed from preset voice data samples using a machine learning algorithm; the speech emotion classifier is a support vector machine (SVM) model constructed from the voice data samples using a machine learning algorithm.
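By way of illustration only, the flow of the above method can be sketched in Python as follows. This is a minimal sketch under assumptions: frame_fn and encoder are hypothetical stand-ins for the framing step and the trained emotion time sequence coding network, and sklearn.svm.SVC stands in for the support vector machine classifier.

```python
import numpy as np
from sklearn.svm import SVC  # support vector machine classifier

def recognize_emotion(waveform, frame_fn, encoder, classifier: SVC, F: int):
    """frame_fn and encoder are hypothetical stand-ins for the framing step and
    the trained emotion time sequence coding network; classifier is a trained SVC."""
    sequence = frame_fn(waveform, F)        # step 1: fixed-length voice sequence, shape (F, feature_dim)
    feature = encoder(sequence)             # step 2: emotion feature vector, shape (embedding_dim,)
    # step 3: predict the emotion category among the preset real emotion categories
    return classifier.predict(np.asarray(feature).reshape(1, -1))[0]
```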
Further, a preferred technical solution provided by the present invention is:
before the step of "obtaining an emotion feature vector corresponding to the voice sequence by performing time sequence coding according to the voice sequence based on a preset emotion time sequence coding network", the method further includes:
obtaining a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
and carrying out network training on the emotion time sequence coding network according to the ternary voice sample group and a loss function shown as the following formula:
L = L_1 + L_2
wherein L_1 represents a preset triplet loss function and L_2 represents a preset cross entropy loss function;
said L_1 is shown in the following formula:
L_1 = Σ_{i=1}^{N} [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + α ]_+
wherein "[·]_+" means that when the value inside the brackets is greater than zero it is taken as the loss value, and when it is less than zero the loss is zero; x_i^a, x_i^p and x_i^n respectively represent the first voice data sample, the second voice data sample and the third voice data sample in the i-th ternary voice sample group; N represents the number of ternary voice sample groups; f(x) represents the emotion feature vector, a real-valued vector, corresponding to the voice data sample x; α represents a preset distance parameter;
said L_2 is shown in the following formula:
L_2 = -Σ_i y_i log(ŷ_i)
wherein y_i represents the preset i-th real emotion category label and ŷ_i represents the value of y_i after linear regression processing.
Further, a preferred technical solution provided by the present invention is:
the step of obtaining a plurality of ternary sets of speech samples from the speech data samples comprises:
obtaining a ternary voice sample group according to the voice data samples and according to the method shown in the following formula:
x_i^p = argmax_{x^p} ||f(x_i^a) - f(x^p)||_2^2 ,  x_i^n = argmin_{x^n} ||f(x_i^a) - f(x^n)||_2^2
wherein x_i^a is a given anchor sample, x_i^p is the selected positive sample of the same emotion category, and x_i^n is the selected negative sample of a different emotion category.
Further, a preferred technical solution provided by the present invention is:
the step of framing the voice data to be tested to obtain the voice sequence with specific length comprises the following steps:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
Further, a preferred technical solution provided by the present invention is:
the step of obtaining the emotion feature vector corresponding to the voice sequence by performing time sequence coding according to the voice sequence based on a preset emotion time sequence coding network comprises the following steps:
obtaining the emotion feature vector corresponding to the voice sequence according to the method shown in the following formula:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci}·c_{t-1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf}·c_{t-1} + b_f)
c_t = f_t·c_{t-1} + i_t·tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co}·c_t + b_o)
h_t = o_t·tanh(c_t)
wherein i_t, f_t and o_t respectively represent the input gate, the forgetting gate and the output gate of the emotion time sequence coding network; x_t, h_t and c_t respectively represent the input matrix, the hidden layer matrix and the unit state of the emotion time sequence coding network at the current time t; c_{t-1} represents the unit state of the emotion time sequence coding network at the previous time t-1; h_{t-1} represents the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} represents the transition matrix from the input matrix to the input gate; W_{hi} represents the transition matrix from the hidden layer matrix to the input gate; W_{ci} represents the transition matrix from the unit state to the input gate; W_{xf} represents the transition matrix from the input matrix to the forgetting gate; W_{hf} represents the transition matrix from the hidden layer matrix to the forgetting gate; W_{cf} represents the transition matrix from the unit state to the forgetting gate; W_{xc} represents the transition matrix from the input matrix to the unit state; W_{hc} represents the transition matrix from the hidden layer matrix to the unit state; W_{xo} represents the transition matrix from the input matrix to the output gate; W_{ho} represents the transition matrix from the hidden layer matrix to the output gate; W_{co} represents the transition matrix from the unit state to the output gate; b_i, b_f, b_o and b_c respectively represent the bias terms corresponding to the input gate, the forgetting gate, the output gate and the unit state; "·" represents the Hadamard product, σ represents a preset activation function, and tanh represents the hyperbolic tangent function.
In another aspect of the present invention, there is also provided a speech emotion recognition system based on ternary loss, including:
the variable-length input processing module is configured to perform framing processing on the voice data to be detected to obtain a voice sequence with a specific length;
the emotion time sequence coding module is configured to perform time sequence coding according to the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence;
the voice emotion recognition module is configured to predict emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed from preset voice data samples using a machine learning algorithm; the speech emotion classifier is a support vector machine (SVM) model constructed from the voice data samples using a machine learning algorithm.
Further, a preferred technical solution provided by the present invention is:
the voice emotion recognition system also comprises a ternary voice sample group acquisition module and a voice emotion loss function module; the voice emotion loss function module comprises a ternary loss function submodule and a cross entropy loss function submodule;
the ternary voice sample group acquisition module is configured to acquire a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
the ternary loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
wherein "+" represents when the "[ solution ] is used]When the value in "" is larger than zero, the value is taken as a loss value, and when said "," is larger than zero]"a value less than zero is zero; the above-mentionedThe first voice data sample, the second voice data sample and the third voice data sample in the ith ternary voice sample group; the N represents the number of the ternary voice sample groups; the f (x) represents the emotion feature vector corresponding to the voice data sample x,the alpha represents a preset distance parameter;
the cross entropy loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
L_2 = -Σ_i y_i log(ŷ_i)
wherein y_i represents the preset i-th real emotion category label and ŷ_i represents the value of y_i after linear regression processing.
Further, a preferred technical solution provided by the present invention is:
the ternary voice sample group obtaining module is further configured to obtain a ternary voice sample group according to the voice data samples and according to the following method:
x_i^p = argmax_{x^p} ||f(x_i^a) - f(x^p)||_2^2 ,  x_i^n = argmin_{x^n} ||f(x_i^a) - f(x^n)||_2^2
Further, a preferred technical solution provided by the present invention is:
the variable length input processing module is further configured to perform the following operations:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
Further, a preferred technical solution provided by the present invention is:
the emotion time sequence coding module is further configured to obtain an emotion feature vector according to a method shown in the following formula:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci}·c_{t-1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf}·c_{t-1} + b_f)
c_t = f_t·c_{t-1} + i_t·tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co}·c_t + b_o)
h_t = o_t·tanh(c_t)
wherein i_t, f_t and o_t respectively represent the input gate, the forgetting gate and the output gate of the emotion time sequence coding network; x_t, h_t and c_t respectively represent the input matrix, the hidden layer matrix and the unit state of the emotion time sequence coding network at the current time t; c_{t-1} represents the unit state of the emotion time sequence coding network at the previous time t-1; h_{t-1} represents the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} represents the transition matrix from the input matrix to the input gate; W_{hi} represents the transition matrix from the hidden layer matrix to the input gate; W_{ci} represents the transition matrix from the unit state to the input gate; W_{xf} represents the transition matrix from the input matrix to the forgetting gate; W_{hf} represents the transition matrix from the hidden layer matrix to the forgetting gate; W_{cf} represents the transition matrix from the unit state to the forgetting gate; W_{xc} represents the transition matrix from the input matrix to the unit state; W_{hc} represents the transition matrix from the hidden layer matrix to the unit state; W_{xo} represents the transition matrix from the input matrix to the output gate; W_{ho} represents the transition matrix from the hidden layer matrix to the output gate; W_{co} represents the transition matrix from the unit state to the output gate; b_i, b_f, b_o and b_c respectively represent the bias terms corresponding to the input gate, the forgetting gate, the output gate and the unit state; "·" represents the Hadamard product, σ represents a preset activation function, and tanh represents the hyperbolic tangent function.
Compared with the closest prior art, the technical scheme at least has the following beneficial effects:
1. The speech emotion recognition method based on ternary loss mainly comprises the following steps: framing the voice data to be detected to obtain a voice sequence of a specific length; performing time sequence coding on the voice sequence with an emotion time sequence coding network to obtain a robust emotion feature vector; and predicting the emotion category of the voice data to be detected with the speech emotion classifier using the emotion feature vector. The method can effectively improve speech emotion recognition accuracy.
2. The method trains the emotion time sequence coding network with a triplet loss function and ternary voice sample groups; the triplet loss function reduces the distance between positive sample pairs and increases the distance between negative sample pairs, that is, it reduces the intra-class distance and increases the inter-class distance, so that the speech emotions of different categories are easier to distinguish.
3. In the invention, the selection criterion for the ternary voice sample groups used to train the emotion time sequence coding network is that the distance between the emotion features of the anchor sample and the positive sample is greater than the distance between the emotion features of the anchor sample and the negative sample, that is, "difficult" samples are selected. The trained emotion time sequence coding network therefore produces more robust emotion feature vectors, useless training on "easy" samples is avoided, and network convergence is accelerated.
4. According to the method, the voice data to be detected is framed according to a time threshold to obtain a plurality of voice frames; the number of voice frames is compared with the frame number threshold F, and a voice sequence is acquired based on the comparison result and the plurality of voice frames. This handles the problem of variable-length input well and improves the accuracy of speech emotion recognition.
5. The invention provides a speech emotion recognition system based on ternary loss, which can realize the speech emotion recognition method based on ternary loss.
Drawings
FIG. 1 is a schematic diagram of the main steps of a speech emotion recognition method based on ternary loss in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a main structure of a speech emotion recognition system based on ternary loss according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a main structure of a variable-length input processing module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a main structure of a speech emotion recognition system based on ternary loss according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The emotion recognition method based on ternary loss provided by the invention is explained below with reference to the accompanying drawings.
Fig. 1 exemplarily shows main steps of an emotion recognition method based on ternary loss in this embodiment, and as shown in fig. 1, an emotion recognition method based on ternary loss in this embodiment may include the following steps:
step S101: and performing framing processing on the voice data to be detected to obtain a voice sequence with a specific length.
Specifically, framing processing is carried out on voice data to be detected according to a preset time threshold value, and a plurality of voice frames are obtained;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and a plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking a plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences, namely deleting the voice frames at the head end and the tail end in the voice data to be detected so that the number of the voice frames is equal to the frame number threshold value F;
if the number of the voice frames is less than the frame number threshold value F, the invention processes the frames in one of three modes: a cyclic mode, a copy mode and a fill mode. Cyclic mode: the plurality of voice frames are taken as a data whole, the data whole is copied and spliced multiple times until the total frame number is greater than the frame number threshold F, and F consecutive voice frames are randomly selected from it as the voice sequence. Copy mode: each voice frame is copied and spliced multiple times until the total frame number is greater than the frame number threshold F, and F consecutive voice frames are randomly selected as the voice sequence. Fill mode: the last voice frame of the voice data to be detected is copied and spliced multiple times until the total frame number is equal to the frame number threshold F.
In this embodiment, speech data from the speech emotion database IEMOCAP are used. For the case where the number of voice frames is less than the frame number threshold F, the speech emotion recognition accuracy of the three modes is compared. Table 1 shows the comparison of the speech emotion recognition accuracy of the above three modes:
TABLE 1
Mode | Accuracy
---|---
Cyclic mode | 59.6%
Copy mode | 58.1%
Fill mode | 56.3%
As can be seen from Table 1, the cyclic mode works best, the copy mode second best, and the fill mode worst. The cyclic mode repeats the process of emotion dynamic change over a longer sequence, which helps the long short-term memory model to better model the emotion dynamics; the copy mode repeatedly copies single frames, which effectively slows down the emotion dynamic change process and is not conducive to modeling the emotion dynamics; and the frames added by the fill mode contribute nothing to the emotion dynamic change process, so the latter two modes perform worse. Therefore, in this embodiment, when the number of voice frames is smaller than the frame number threshold F, the cyclic mode is selected as the method of acquiring the voice sequence.
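By way of illustration, the variable-length processing of step S101 can be sketched in Python (numpy) as follows; it is a minimal sketch that assumes per-frame features have already been extracted into an array of shape (num_frames, feature_dim), and it implements the cyclic mode selected above for the short-sample case.

```python
import numpy as np

def fix_length(frames: np.ndarray, F: int, rng=np.random) -> np.ndarray:
    """Return exactly F consecutive frames from an array of shape (num_frames, feature_dim)."""
    n = len(frames)
    if n == F:
        return frames
    if n > F:
        # drop frames at the head and tail: take a contiguous block of F frames from the middle
        start = rng.randint(0, n - F + 1)
        return frames[start:start + F]
    # n < F, cyclic mode: tile the whole utterance until it exceeds F frames,
    # then randomly take F consecutive frames
    reps = int(np.ceil((F + 1) / n))
    tiled = np.tile(frames, (reps, 1))
    start = rng.randint(0, len(tiled) - F + 1)
    return tiled[start:start + F]
```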
Step S102: and carrying out time sequence coding based on a preset emotion time sequence coding network and according to the voice sequence to obtain an emotion feature vector corresponding to the voice sequence.
Specifically, the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed from preset voice data samples using a machine learning algorithm. The emotion time sequence coding network obtains the emotion feature vector corresponding to the voice sequence according to formulas (1) to (5):
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci}·c_{t-1} + b_i) (1)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf}·c_{t-1} + b_f) (2)
c_t = f_t·c_{t-1} + i_t·tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) (3)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co}·c_t + b_o) (4)
h_t = o_t·tanh(c_t) (5)
wherein i_t, f_t and o_t respectively represent the input gate, the forgetting gate and the output gate of the emotion time sequence coding network; x_t, h_t and c_t respectively represent the input matrix, the hidden layer matrix and the unit state of the emotion time sequence coding network at the current time t; c_{t-1} represents the unit state of the emotion time sequence coding network at the previous time t-1; h_{t-1} represents the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} represents the transition matrix from the input matrix to the input gate; W_{hi} represents the transition matrix from the hidden layer matrix to the input gate; W_{ci} represents the transition matrix from the unit state to the input gate; W_{xf} represents the transition matrix from the input matrix to the forgetting gate; W_{hf} represents the transition matrix from the hidden layer matrix to the forgetting gate; W_{cf} represents the transition matrix from the unit state to the forgetting gate; W_{xc} represents the transition matrix from the input matrix to the unit state; W_{hc} represents the transition matrix from the hidden layer matrix to the unit state; W_{xo} represents the transition matrix from the input matrix to the output gate; W_{ho} represents the transition matrix from the hidden layer matrix to the output gate; W_{co} represents the transition matrix from the unit state to the output gate; b_i, b_f, b_o and b_c respectively represent the bias terms corresponding to the input gate, the forgetting gate, the output gate and the unit state; "·" represents the Hadamard product, σ represents a preset activation function, and tanh represents the hyperbolic tangent function. In addition, h_t and c_t are both intermediate data in the emotion time sequence coding network and are expressed in matrix form.
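By way of illustration, a single step of formulas (1) to (5) can be sketched in Python (numpy) as follows; this is a minimal sketch under assumptions, with the transition matrices and bias terms passed in a dictionary whose keys follow the names used above, and with the unit-state weights W_ci, W_cf, W_co treated as element-wise (Hadamard) peephole weights, as indicated by the "·" notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of formulas (1)-(5); p holds the transition matrices and biases
    named as in the text (W_ci, W_cf, W_co act element-wise on the unit state)."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] * c_prev + p["b_i"])  # input gate, (1)
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] * c_prev + p["b_f"])  # forgetting gate, (2)
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # unit state, (3)
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] * c_t + p["b_o"])     # output gate, (4)
    h_t = o_t * np.tanh(c_t)                                                             # hidden output, (5)
    return h_t, c_t

def encode_sequence(sequence, p, hidden_dim):
    """Run the voice sequence through the cell; the final hidden state can serve
    as the emotion feature vector f(x)."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, p)
    return h
```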
The embodiment further comprises a step of network training the emotion time sequence coding network, which specifically comprises the following steps:
acquiring a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, the emotion classes of the first voice data sample and the second voice data sample are the same, and the emotion classes of the first voice data sample and the third voice data sample are different;
and carrying out network training on the emotion time sequence coding network according to the obtained ternary voice sample groups and the loss function shown in formula (6):
L = L_1 + L_2 (6)
wherein L_1 represents a preset triplet loss function and L_2 represents a preset cross entropy loss function. The triplet loss function decreases the distance between positive sample pairs and increases the distance between negative sample pairs. The cross entropy loss function is a supervised cross entropy loss function, used to supervise the learning of the network and to guide the clustering of samples of the same emotion category using the preset sample category information.
L_1 is shown in formula (7):
L_1 = Σ_{i=1}^{N} [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + α ]_+ (7)
wherein "[·]_+" means that when the value inside the brackets is greater than zero it is taken as the loss value, and when it is less than zero the loss is zero; x_i^a, x_i^p and x_i^n respectively represent the first voice data sample, the second voice data sample and the third voice data sample in the i-th ternary voice sample group; N represents the number of ternary voice sample groups; f(x) represents the emotion feature vector corresponding to the voice data sample x, and f(x) is a real-valued vector; α represents a preset distance parameter.
L_2 is shown in formula (8):
L_2 = -Σ_i y_i log(ŷ_i) (8)
wherein y_i represents the preset i-th real emotion category label and ŷ_i represents the value of y_i after linear regression processing.
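By way of illustration, the loss of formulas (6) to (8) can be sketched in Python (numpy) as follows; this is a minimal sketch that assumes f_a, f_p and f_n are (N, d) arrays of anchor, positive and negative emotion feature vectors, y_true is a one-hot label vector and y_pred the corresponding predicted probabilities after regression processing.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha):
    # formula (7): hinge on the squared-distance margin, summed over the N triplets
    d_pos = np.sum((f_a - f_p) ** 2, axis=1)   # ||f(x_a) - f(x_p)||_2^2
    d_neg = np.sum((f_a - f_n) ** 2, axis=1)   # ||f(x_a) - f(x_n)||_2^2
    return np.sum(np.maximum(d_pos - d_neg + alpha, 0.0))

def cross_entropy_loss(y_true, y_pred):
    # formula (8): supervised cross entropy over the predicted category probabilities
    return -np.sum(y_true * np.log(y_pred + 1e-12))

def total_loss(f_a, f_p, f_n, alpha, y_true, y_pred):
    # formula (6): L = L_1 + L_2
    return triplet_loss(f_a, f_p, f_n, alpha) + cross_entropy_loss(y_true, y_pred)
```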
Further, the selection of the ternary voice sample groups used to train the emotion time sequence coding network is important, because a very large number of ternary voice sample groups can be formed from the whole training set, but most of them are not beneficial to the training of the emotion time sequence coding network and would also reduce its convergence rate. Therefore, "difficult" training samples are selected as far as possible, that is, the ternary voice sample groups used to train the emotion time sequence coding network are selected from the training set according to formula (9):
x_i^p = argmax_{x^p} ||f(x_i^a) - f(x^p)||_2^2 ,  x_i^n = argmin_{x^n} ||f(x_i^a) - f(x^n)||_2^2 (9)
wherein ||·||_2^2 denotes the square of the 2-norm. In this embodiment, when selecting a ternary voice sample group, given an anchor sample x_i^a, a positive sample x_i^p is selected to maximize ||f(x_i^a) - f(x_i^p)||_2^2 and a negative sample x_i^n is selected to minimize ||f(x_i^a) - f(x_i^n)||_2^2, and the emotion time sequence coding network is then trained. When training the emotion time sequence coding network, these operations are only carried out on the batch of samples input each time.
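By way of illustration, the "difficult" sample selection of formula (9) within an input batch can be sketched in Python (numpy) as follows; it is a minimal sketch that assumes embeddings is an (N, d) array of emotion feature vectors for the batch and labels an integer array of their emotion categories.

```python
import numpy as np

def mine_hard_triplets(embeddings: np.ndarray, labels: np.ndarray):
    """For each anchor, pick the hardest positive (largest squared distance, same
    category) and the hardest negative (smallest squared distance, different
    category); returns a list of index triplets (anchor, positive, negative)."""
    # pairwise squared Euclidean distances between all emotion feature vectors
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    triplets = []
    for a in range(len(labels)):
        pos_mask = (labels == labels[a]) & (np.arange(len(labels)) != a)
        neg_mask = labels != labels[a]
        if not pos_mask.any() or not neg_mask.any():
            continue  # no valid positive or negative sample for this anchor in the batch
        p = np.where(pos_mask)[0][np.argmax(dist[a][pos_mask])]
        n = np.where(neg_mask)[0][np.argmin(dist[a][neg_mask])]
        triplets.append((a, int(p), int(n)))
    return triplets
```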
The embodiment of the present invention further provides a speech emotion recognition system based on ternary loss, referring to fig. 2, fig. 2 exemplarily shows a main structure of a speech emotion recognition system based on ternary loss in this embodiment, and the speech emotion recognition system shown in fig. 2 may include:
a variable-length input processing module 21 configured to perform framing processing on the voice data to be detected to obtain a voice sequence with a specific length;
the emotion time sequence coding module 22 is configured to perform time sequence coding according to the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence;
the speech emotion recognition module 23 is configured to predict emotion classes corresponding to the emotion feature vectors based on a preset speech emotion classifier and according to a plurality of preset real emotion classes;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed from preset voice data samples using a machine learning algorithm; the speech emotion classifier is a support vector machine (SVM) model constructed from the voice data samples using a machine learning algorithm.
Further, the variable-length input processing module is further configured to perform the following operations:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and a plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking a plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences, namely deleting the voice frames at the head end and the tail end in the voice data to be detected so that the number of the voice frames is equal to the frame number threshold value F;
and if the number of the voice frames is less than the frame number threshold value F, taking a plurality of voice frames as a data whole, copying and splicing the data whole for multiple times until the total frame number is greater than the frame number threshold value F, randomly selecting continuous F voice frames from the data whole as a voice sequence, or copying and splicing each voice frame for multiple times until the total frame number is greater than the frame number threshold value F, randomly selecting continuous F voice frames from the data whole as the voice sequence, or copying and splicing the last voice frame of the voice data to be tested for multiple times until the total frame number is equal to the frame number threshold value F.
Referring to fig. 3, fig. 3 illustrates the main structure of a variable-length input processing module according to an embodiment, and the variable-length input processing module shown in fig. 3 may further include a speech framing sub-module 211 and a variable-length input processing sub-module 212.
The voice framing submodule 211 is configured to perform framing processing on the voice data to be detected according to a preset time threshold, and obtain a plurality of voice frames.
And the variable length input processing sub-module 212 is configured to compare the number of the voice frames with a preset frame number threshold value F and acquire a voice sequence according to the comparison result and the plurality of voice frames.
Further, the emotion time sequence coding module is configured to obtain the emotion feature vector according to the method shown in formula (1) -formula (5).
Further, referring to fig. 4, fig. 4 illustrates the main structure of a speech emotion recognition system based on ternary loss, and as shown in fig. 4, the speech emotion recognition system further includes a ternary speech sample group obtaining module 41 and a speech emotion loss function module 42; the speech emotion loss function module includes a ternary loss function sub-module 421 and a cross entropy loss function sub-module 422.
The ternary speech sample set obtaining module 41 is configured to obtain a plurality of ternary speech sample sets from the speech data samples. The ternary set of speech samples includes a first speech data sample, a second speech data sample, and a third speech data sample, and the emotion classifications of the first speech data sample and the second speech data sample are the same and the emotion classifications of the first speech data sample and the third speech data sample are different.
Ternary loss function submodule 421 is configured to perform network training on the emotion time sequence coding network according to the loss function shown in equation (7).
The cross entropy loss function sub-module 422 is configured to perform network training on the emotion time series coding network according to the loss function shown in formula (8).
Further, the ternary speech sample set obtaining module 41 may obtain the ternary speech sample set according to the method shown in formula (9), that is, selecting a "difficult" training sample to train the emotion time sequence coding network.
It will be understood by those skilled in the art that the physical forms of the variable-length input processing module, the emotion time sequence coding module and the speech emotion recognition module may be independent from each other, and may of course be functional units integrated into one physical module, and the emotion time sequence coding module described above may include a memory and a processor, and a computing program stored in the memory and executable on the processor, and the computing program may perform the functions of the variable-length input processing module, the emotion time sequence coding module and the speech emotion recognition module.
Those of skill in the art will appreciate that the various illustrative method steps and systems described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
Claims (8)
1. A speech emotion recognition method based on ternary loss is characterized by comprising the following steps:
performing framing processing on voice data to be detected to obtain a voice sequence with a specific length;
carrying out time sequence coding based on a preset emotion time sequence coding network and according to the voice sequence to obtain an emotion feature vector corresponding to the voice sequence;
predicting emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory neural network model constructed based on preset voice data samples using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples using a machine learning algorithm;
before the step of obtaining the emotion feature vector corresponding to the voice sequence by performing time sequence coding according to the voice sequence based on a preset emotion time sequence coding network, the method further includes: obtaining a plurality of ternary voice sample groups according to the voice data samples;
and carrying out network training on the emotion time sequence coding network according to the ternary voice sample group and a loss function shown as the following formula:
L = L_1 + L_2
wherein L_1 represents a preset triplet loss function and L_2 represents a preset cross entropy loss function;
the step of "framing the voice data to be tested and acquiring the voice sequence with a specific length" includes:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
2. The method for speech emotion recognition based on ternary loss according to claim 1, wherein before the step of obtaining the emotion feature vector corresponding to the speech sequence based on the preset emotion time sequence coding network and performing time sequence coding according to the speech sequence, the method further comprises:
the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
said L_1 is shown in the following formula:
L_1 = Σ_{i=1}^{N} [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + α ]_+
wherein "[·]_+" means that when the value inside the brackets is greater than zero it is taken as the loss value, and when it is less than zero the loss is zero; x_i^a, x_i^p and x_i^n respectively represent the first voice data sample, the second voice data sample and the third voice data sample in the i-th ternary voice sample group; N represents the number of ternary voice sample groups; f(x) represents the emotion feature vector, a real-valued vector, corresponding to the voice data sample x; α represents a preset distance parameter;
said L_2 is shown in the following formula:
L_2 = -Σ_i y_i log(ŷ_i)
wherein y_i represents the preset i-th real emotion category label and ŷ_i represents the value of y_i after linear regression processing.
3. The method of claim 2, wherein the step of obtaining a plurality of ternary speech sample groups from the speech data samples comprises:
obtaining a ternary voice sample group according to the voice data samples and according to the method shown in the following formula:
x_i^p = argmax_{x^p} ||f(x_i^a) - f(x^p)||_2^2 ,  x_i^n = argmin_{x^n} ||f(x_i^a) - f(x^n)||_2^2
4. The method for speech emotion recognition based on ternary loss according to any one of claims 1-3, wherein the step of obtaining the emotion feature vector corresponding to the speech sequence based on a preset emotion time sequence coding network and performing time sequence coding according to the speech sequence comprises:
obtaining the emotion feature vector corresponding to the voice sequence according to the method shown in the following formula:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci}·c_{t-1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf}·c_{t-1} + b_f)
c_t = f_t·c_{t-1} + i_t·tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co}·c_t + b_o)
h_t = o_t·tanh(c_t)
wherein i_t, f_t and o_t respectively represent the input gate, the forgetting gate and the output gate of the emotion time sequence coding network; x_t, h_t and c_t respectively represent the input matrix, the hidden layer matrix and the unit state of the emotion time sequence coding network at the current time t; c_{t-1} represents the unit state of the emotion time sequence coding network at the previous time t-1; h_{t-1} represents the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} represents the transition matrix from the input matrix to the input gate; W_{hi} represents the transition matrix from the hidden layer matrix to the input gate; W_{ci} represents the transition matrix from the unit state to the input gate; W_{xf} represents the transition matrix from the input matrix to the forgetting gate; W_{hf} represents the transition matrix from the hidden layer matrix to the forgetting gate; W_{cf} represents the transition matrix from the unit state to the forgetting gate; W_{xc} represents the transition matrix from the input matrix to the unit state; W_{hc} represents the transition matrix from the hidden layer matrix to the unit state; W_{xo} represents the transition matrix from the input matrix to the output gate; W_{ho} represents the transition matrix from the hidden layer matrix to the output gate; W_{co} represents the transition matrix from the unit state to the output gate; b_i, b_f, b_o and b_c respectively represent the bias terms corresponding to the input gate, the forgetting gate, the output gate and the unit state; "·" represents the Hadamard product, σ represents a preset activation function, and tanh represents the hyperbolic tangent function.
5. A speech emotion recognition system based on ternary loss is characterized by comprising:
the variable-length input processing module is configured to perform framing processing on the voice data to be detected to obtain a voice sequence with a specific length;
the emotion time sequence coding module is configured to perform time sequence coding according to the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence;
the voice emotion recognition module is configured to predict emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory neural network model constructed based on preset voice data samples using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples using a machine learning algorithm;
the voice emotion recognition system comprises a voice emotion loss function module; the voice emotion loss function module comprises a ternary loss function submodule and a cross entropy loss function submodule;
wherein the variable length input processing module is further configured to perform the following operations:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
6. The ternary loss based speech emotion recognition system of claim 5, further comprising a ternary speech sample set acquisition module;
the ternary voice sample group acquisition module is configured to acquire a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
the ternary loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
wherein "+" represents when the "[ solution ] is used]When the value in "" is larger than zero, the value is taken as a loss value, and when said "," is larger than zero]"a value less than zero is zero; the above-mentionedThe first voice data sample, the second voice data sample and the third voice data sample in the ith ternary voice sample group; the N represents the number of the ternary voice sample groups; the f (x) represents the emotion feature vector corresponding to the voice data sample x,the alpha represents a preset distance parameter;
the cross entropy loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
7. The system according to claim 6, wherein the ternary voice sample group acquisition module is further configured to obtain a ternary voice sample group according to the voice data samples and according to the following method:
x_i^p = argmax_{x^p} ||f(x_i^a) - f(x^p)||_2^2 ,  x_i^n = argmin_{x^n} ||f(x_i^a) - f(x^n)||_2^2
8. The ternary loss based speech emotion recognition system of any of claims 5-7, wherein the emotion time sequence coding module is further configured to obtain the emotion feature vector according to the following method:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci}·c_{t-1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf}·c_{t-1} + b_f)
c_t = f_t·c_{t-1} + i_t·tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co}·c_t + b_o)
h_t = o_t·tanh(c_t)
wherein i_t, f_t and o_t respectively represent the input gate, the forgetting gate and the output gate of the emotion time sequence coding network; x_t, h_t and c_t respectively represent the input matrix, the hidden layer matrix and the unit state of the emotion time sequence coding network at the current time t; c_{t-1} represents the unit state of the emotion time sequence coding network at the previous time t-1; h_{t-1} represents the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} represents the transition matrix from the input matrix to the input gate; W_{hi} represents the transition matrix from the hidden layer matrix to the input gate; W_{ci} represents the transition matrix from the unit state to the input gate; W_{xf} represents the transition matrix from the input matrix to the forgetting gate; W_{hf} represents the transition matrix from the hidden layer matrix to the forgetting gate; W_{cf} represents the transition matrix from the unit state to the forgetting gate; W_{xc} represents the transition matrix from the input matrix to the unit state; W_{hc} represents the transition matrix from the hidden layer matrix to the unit state; W_{xo} represents the transition matrix from the input matrix to the output gate; W_{ho} represents the transition matrix from the hidden layer matrix to the output gate; W_{co} represents the transition matrix from the unit state to the output gate; b_i, b_f, b_o and b_c respectively represent the bias terms corresponding to the input gate, the forgetting gate, the output gate and the unit state; "·" represents the Hadamard product, σ represents a preset activation function, and tanh represents the hyperbolic tangent function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810839374.0A CN109003625B (en) | 2018-07-27 | 2018-07-27 | Speech emotion recognition method and system based on ternary loss |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810839374.0A CN109003625B (en) | 2018-07-27 | 2018-07-27 | Speech emotion recognition method and system based on ternary loss |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109003625A CN109003625A (en) | 2018-12-14 |
CN109003625B true CN109003625B (en) | 2021-01-12 |
Family
ID=64597222
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810839374.0A Active CN109003625B (en) | 2018-07-27 | 2018-07-27 | Speech emotion recognition method and system based on ternary loss |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109003625B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11947593B2 (en) * | 2018-09-28 | 2024-04-02 | Sony Interactive Entertainment Inc. | Sound categorization system |
CN109599128B (en) * | 2018-12-24 | 2022-03-01 | 北京达佳互联信息技术有限公司 | Speech emotion recognition method and device, electronic equipment and readable medium |
CN110059616A (en) * | 2019-04-17 | 2019-07-26 | 南京邮电大学 | Pedestrian's weight identification model optimization method based on fusion loss function |
CN110223714B (en) * | 2019-06-03 | 2021-08-03 | 杭州哲信信息技术有限公司 | Emotion recognition method based on voice |
CN110556130A (en) * | 2019-09-17 | 2019-12-10 | 平安科技(深圳)有限公司 | Voice emotion recognition method and device and storage medium |
CN111445899B (en) * | 2020-03-09 | 2023-08-01 | 咪咕文化科技有限公司 | Speech emotion recognition method, device and storage medium |
CN111768764B (en) * | 2020-06-23 | 2024-01-19 | 北京猎户星空科技有限公司 | Voice data processing method and device, electronic equipment and medium |
CN112084338B (en) * | 2020-09-18 | 2024-02-06 | 达而观数据(成都)有限公司 | Automatic document classification method, system, computer equipment and storage medium |
CN114757310B (en) * | 2022-06-16 | 2022-11-11 | 山东海量信息技术研究院 | Emotion recognition model and training method, device, equipment and readable storage medium thereof |
CN116528438B (en) * | 2023-04-28 | 2023-10-10 | 广州力铭光电科技有限公司 | Intelligent dimming method and device for lamp |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120137367A1 (en) * | 2009-11-06 | 2012-05-31 | Cataphora, Inc. | Continuous anomaly detection based on behavior modeling and heterogeneous information analysis |
CN105469065B (en) * | 2015-12-07 | 2019-04-23 | 中国科学院自动化研究所 | A kind of discrete emotion identification method based on recurrent neural network |
CN106782602B (en) * | 2016-12-01 | 2020-03-17 | 南京邮电大学 | Speech emotion recognition method based on deep neural network |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102436806A (en) * | 2011-09-29 | 2012-05-02 | 复旦大学 | Audio frequency copy detection method based on similarity |
CN107221320A (en) * | 2017-05-19 | 2017-09-29 | 百度在线网络技术(北京)有限公司 | Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model |
Also Published As
Publication number | Publication date |
---|---|
CN109003625A (en) | 2018-12-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |