CN113989893A - Expression and voice bimodal-based children emotion recognition algorithm - Google Patents
- Publication number
- CN113989893A (application CN202111290611.0A)
- Authority
- CN
- China
- Prior art keywords
- features
- emotion
- audio
- video
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention relates to emotion recognition, and in particular to a bimodal child emotion recognition algorithm based on facial expression and speech. The algorithm constructs a semantic feature space from the emotion label information of the speech and expression features, extracts local and global features of the audio and video with a multi-scale feature extraction method, projects those features into the semantic feature space, and selects the features that contribute most to emotion classification in order to judge and recognize emotion. The technical scheme provided by the invention effectively overcomes the inability of the prior art to judge and recognize emotion accurately.
Description
Technical Field
The invention relates to emotion recognition, and in particular to a child emotion recognition algorithm based on the two modalities of facial expression and speech.
Background
Children's emotions are expressed in many ways, such as speech, facial expression, posture, and action, from which effective information can be extracted for analysis. Speech and expression information, the most salient and most easily analyzed cues, have been widely researched and applied. The psychologist Mehrabian proposed the formula: emotional expression = 7% verbal content + 38% vocal tone + 55% facial expression. Speech and expression information thus cover 93% of emotional information and form the core of communicated information. In emotional expression, changes in facial expression convey mood effectively and intuitively and are among the most important features for emotion recognition, while speech features can also express rich emotion.
Traditional single-modality recognition suffers from the problem that a single emotional feature may not characterize the emotional state well. For example, when a sad emotion is expressed, the facial expression may change little, yet the sadness can still be discerned from low, slow speech. Multi-modal recognition lets information from different modalities complement each other, providing more information for emotion recognition and improving its accuracy.
However, while single-modality emotion recognition research is mature, multi-modal emotion recognition methods still need development and improvement. Multi-modal emotion recognition therefore has important practical significance, and bimodal recognition based on speech and expression features, the two most dominant cues, has particular research and application value. Traditional bimodal emotion recognition uses fixed weighting, ignoring how much each feature contributes to recognition, which is not conducive to judging and recognizing emotion accurately.
Disclosure of Invention
Technical problem to be solved
Aiming at the deficiencies of the prior art, the invention provides a bimodal child emotion recognition algorithm based on facial expression and speech, which effectively overcomes the inability of the prior art to judge and recognize emotion accurately.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
a child emotion recognition algorithm based on expressions and voice dual-modes is characterized in that a semantic feature space is constructed by utilizing emotion label information of voice features and expression features, local features and global features of audio and video are extracted through a multi-scale feature extraction method, the local features and the global features of the audio and the video are projected to the semantic feature space, important features contributing to emotion classification are selected from the semantic feature space, and emotion judgment and recognition are carried out.
Preferably, extracting the local features of the audio with the multi-scale feature extraction method includes:
sampling the audio at a preset sampling period to obtain audio frames, and applying a Fourier transform to each audio frame to obtain a spectrogram;
and training an output-gate convolutional neural network, then extracting features from the spectrogram with the trained network to obtain the local features of the audio.
Preferably, the output-gate convolutional neural network includes a plurality of convolutional layers, each followed by a corresponding pooling layer. The pooling layers down-sample in the time domain and/or the frequency domain, and the total down-sampling rate in the time domain is smaller than that in the frequency domain.
Preferably, the abscissa of the spectrogram is the time of each audio frame, and the ordinate is the spectral values of that frame.
Preferably, extracting the local features of the video with the multi-scale feature extraction method includes:
extracting a feature map from the image with convolutional layers; performing target detection and precise localization on the feature map with an RPN (Region Proposal Network) to obtain candidate regions; max-pooling the candidate regions with the ROI Pooling layer of a Fast R-CNN network; and outputting a group of video local features of the same dimensionality.
Preferably, performing target detection and precise localization on the feature map with the RPN to obtain the candidate regions includes:
performing a convolution over the feature map with the RPN to obtain a scale-transformed feature map;
classifying the anchor boxes in the scale-transformed feature map with a softmax function to obtain foreground candidate regions containing the target object;
computing bounding-box regression offsets for the anchor boxes in the scale-transformed feature map to obtain precise candidate regions;
and obtaining pre-candidate regions from the foreground and precise candidate regions, then removing, with NMS (non-maximum suppression), pre-candidate regions that are too small or exceed the boundary, to obtain the candidate regions.
Preferably, max-pooling the candidate regions with the ROI Pooling layer of the Fast R-CNN network includes:
mapping each precise candidate region to the corresponding position of the feature map to obtain the mapped region on the feature map;
and dividing the mapped region into several sub-windows of equal size, then max-pooling each sub-window to obtain a group of video local features of the same dimensionality.
Preferably, extracting the global features of the audio and video with the multi-scale feature extraction method includes:
performing DAC feature fusion on the groups of audio local features and of video local features respectively, maximizing intra-class correlation and minimizing inter-class correlation, to obtain the audio global features and the video global features respectively.
Preferably, the semantic feature space is constructed through double-sparse linear discriminant analysis on the basis of the emotion label information of the speech and expression features, and, while the local and global features of the audio and video are projected into the semantic feature space, the important features are selected from them according to the contribution of the speech and expression features to emotion classification.
Preferably, selecting from the semantic feature space the important features that contribute to emotion classification, and judging and recognizing the emotion, includes:
using a joint sparse reduced-rank regression model to learn, from the contribution of the local and global features to emotion recognition, a weight that measures each feature's degree of contribution, and then learning the weighted local and global features a second time with the joint sparse reduced-rank regression model to select the features capable of distinguishing different emotional states.
(III) advantageous effects
Compared with the prior art, the bimodal child emotion recognition algorithm based on facial expression and speech provided by the invention constructs a semantic feature space, through double-sparse linear discriminant analysis, on the basis of the emotion label information of the speech and expression features; extracts local and global features of the audio and video with a multi-scale feature extraction method and projects them into the semantic feature space; uses a joint sparse reduced-rank regression model to learn a weight that measures the contribution of the local and global features to emotion recognition; and learns the weighted features a second time with the same model to select the features capable of distinguishing different emotional states. A weight corresponding to each feature's contribution to emotion recognition can therefore be determined, and emotion can be judged and recognized accurately from the features that distinguish different emotional states.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A bimodal child emotion recognition algorithm based on facial expression and speech is disclosed. As shown in FIG. 1, a semantic feature space is constructed from the emotion label information of the speech and expression features; local and global features of the audio and video are extracted with a multi-scale feature extraction method and projected into the semantic feature space; and the important features that contribute to emotion classification are selected from that space to judge and recognize emotion.
The semantic feature space is constructed through double-sparse linear discriminant analysis on the basis of the emotion label information of the speech and expression features, and, while the local and global features of the audio and video are projected into the semantic feature space, the important features are selected from them according to the contribution of the speech and expression features to emotion classification.
Selecting from the semantic feature space the important features that contribute to emotion classification, and judging and recognizing the emotion, includes:
using a joint sparse reduced-rank regression model to learn, from the contribution of the local and global features to emotion recognition, a weight that measures each feature's degree of contribution, and then learning the weighted local and global features a second time with the joint sparse reduced-rank regression model to select the features capable of distinguishing different emotional states.
With this technical scheme, the joint sparse reduced-rank regression model determines a weight corresponding to each feature's contribution to emotion recognition, the weighted local and global features are learned a second time, and the features capable of distinguishing different emotional states are selected, so that emotion can be judged and recognized accurately from those features.
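The weighting step described above can be sketched, under simplifying assumptions, with a plain reduced-rank regression against one-hot emotion labels: fit ordinary least-squares coefficients, truncate them to low rank with an SVD, and read a per-feature weight off the row norms of the truncated coefficient matrix. The joint-sparsity penalty of the patent's model is not reproduced here; the rank, shapes, and random data are all illustrative.

```python
import numpy as np

def reduced_rank_weights(X, Y, rank=2):
    """Fit coefficients from features X (n x d) to one-hot labels Y (n x c),
    truncate to a low rank via SVD, and use the row norms of the truncated
    coefficient matrix as per-feature contribution weights."""
    B = np.linalg.pinv(X) @ Y                            # least-squares fit
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    B_rr = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]   # rank-r truncation
    return np.linalg.norm(B_rr, axis=1)                  # one weight per feature

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                 # hypothetical fused features
Y = np.eye(3)[rng.integers(0, 3, size=50)]   # one-hot emotion labels
w = reduced_rank_weights(X, Y)
print(w.shape)  # (6,)
```

Features with the largest weights would then be kept (or re-weighted) for the second learning pass.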
Extracting the local features of the audio (i.e., the child's speech features) with the multi-scale feature extraction method includes:
sampling the audio at a preset sampling period to obtain audio frames, and applying a Fourier transform to each audio frame to obtain a spectrogram;
and training an output-gate convolutional neural network, then extracting features from the spectrogram with the trained network to obtain the local features of the audio.
The abscissa of the spectrogram is the time of each audio frame, and the ordinate is the spectral values of that frame.
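The framing-and-Fourier-transform step can be sketched as follows; the frame length, hop size, Hann window, and 16 kHz sample rate are illustrative assumptions, not values stated in the patent. Each row of the result corresponds to one audio frame (the time axis) and each column to one spectral value (the frequency axis).

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Split the signal into frames and FFT each frame.
    Returns an array of shape (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # one row of spectral values
    return np.array(frames)

# 1 s of a 440 Hz tone sampled at 16 kHz (illustrative input)
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201): 98 frames (time) x 201 spectral bins (frequency)
```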
The output-gate convolutional neural network includes a plurality of convolutional layers, each followed by a corresponding pooling layer. The pooling layers down-sample in the time domain and/or the frequency domain, and the total down-sampling rate in the time domain is smaller than that in the frequency domain.
Each convolutional layer comprises at least two sub-layers, the output of the preceding sub-layer serving as the input of the following one. Each sub-layer comprises a first channel and a second channel that use different nonlinear activation functions: the hyperbolic tangent tanh for the first channel and the sigmoid function for the second.
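A minimal sketch of one such two-channel layer: the first channel applies tanh and the second a sigmoid to parallel convolutions of the same input. The patent does not state how the two channels are combined; the elementwise product used here (as in gated convolutional networks) is an assumption, and the 1-D convolution and all-ones kernels are illustrative.

```python
import numpy as np

def conv1d(x, w):
    """Valid-mode 1-D correlation of x with kernel w."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def gated_conv_layer(x, w_tanh, w_sig):
    """Two parallel channels with different nonlinearities: tanh on the
    first, sigmoid on the second; combined by elementwise product."""
    a = np.tanh(conv1d(x, w_tanh))                 # first channel
    g = 1.0 / (1.0 + np.exp(-conv1d(x, w_sig)))    # second channel (gate)
    return a * g

x = np.linspace(-1.0, 1.0, 10)    # illustrative 1-D input
out = gated_conv_layer(x, np.ones(3), np.ones(3))
print(out.shape)  # (8,)
```

Stacking such layers, with the output of one as the input of the next, matches the front-to-back arrangement described above.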
Extracting the local features of the video (i.e., the child's expression features) with the multi-scale feature extraction method includes:
extracting a feature map from the image with convolutional layers; performing target detection and precise localization on the feature map with an RPN (Region Proposal Network) to obtain candidate regions; max-pooling the candidate regions with the ROI Pooling layer of a Fast R-CNN network; and outputting a group of video local features of the same dimensionality.
Performing target detection and precise localization on the feature map with the RPN to obtain the candidate regions includes:
performing a convolution over the feature map with the RPN to obtain a scale-transformed feature map;
classifying the anchor boxes in the scale-transformed feature map with a softmax function to obtain foreground candidate regions containing the target object;
computing bounding-box regression offsets for the anchor boxes in the scale-transformed feature map to obtain precise candidate regions;
and obtaining pre-candidate regions from the foreground and precise candidate regions, then removing, with NMS (non-maximum suppression), pre-candidate regions that are too small or exceed the boundary, to obtain the candidate regions.
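The NMS-based filtering in the last step can be sketched as follows; the box format (x1, y1, x2, y2), the scores, and the 0.7 IoU threshold are illustrative assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thresh=0.7):
    """Keep the highest-scoring boxes, dropping any box whose overlap with
    an already-kept box reaches the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        order = order[1:][[iou(boxes[i], boxes[j]) < iou_thresh
                           for j in order[1:]]]
    return keep

boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate of box 0 is suppressed
```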
Max-pooling the candidate regions with the ROI Pooling layer of the Fast R-CNN network includes:
mapping each precise candidate region to the corresponding position of the feature map to obtain the mapped region on the feature map;
and dividing the mapped region into several sub-windows of equal size, then max-pooling each sub-window to obtain a group of video local features of the same dimensionality.
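The mapping-and-max-pooling step can be sketched on a single-channel feature map; the region coordinates and the 2x2 output grid are illustrative assumptions. Whatever the size of a candidate region, it is pooled to the same fixed dimensionality.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Map a region (x1, y1, x2, y2) onto the feature map and max-pool it
    into an out_size x out_size grid of sub-windows, so every region
    yields features of the same fixed dimensionality."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    ys = np.linspace(0, region.shape[0], out_size + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)   # toy single-channel map
pooled = roi_max_pool(fmap, (0, 0, 4, 4))
print(pooled)  # [[ 7.  9.]  [19. 21.]]
```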
In the technical scheme of the present application, after the local features of the audio and video have been extracted with the multi-scale feature extraction method, their global features must also be acquired.
Extracting the global features of the audio and video with the multi-scale feature extraction method includes:
performing DAC feature fusion on the groups of audio local features and of video local features respectively, maximizing intra-class correlation and minimizing inter-class correlation, to obtain the audio global features and the video global features respectively.
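A simplified stand-in for the fusion step, assuming a discriminant-analysis-style reading of "DAC": project each modality's features onto the leading eigenvectors of its between-class scatter (emphasizing class-discriminative directions), then fuse the audio and video projections by summation. The function names, the projection dimensionality, and the random data are all illustrative and not taken from the patent.

```python
import numpy as np

def discriminative_projection(X, labels, dim=2):
    """Project features onto the top eigenvectors of the between-class
    scatter of the class means -- a simplified, discriminant-style
    stand-in for the DAC fusion described above."""
    mu = X.mean(axis=0)
    Sb = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(labels):
        d = X[labels == c].mean(axis=0) - mu
        Sb += (labels == c).sum() * np.outer(d, d)
    _, vecs = np.linalg.eigh(Sb)       # eigenvalues in ascending order
    return X @ vecs[:, -dim:]          # keep the `dim` leading directions

def fuse(audio_feats, video_feats, labels, dim=2):
    """Fuse the two modalities by summing their projected features."""
    return (discriminative_projection(audio_feats, labels, dim)
            + discriminative_projection(video_feats, labels, dim))

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=30)            # three emotion classes
fused = fuse(rng.normal(size=(30, 5)),          # hypothetical audio features
             rng.normal(size=(30, 8)),          # hypothetical video features
             labels)
print(fused.shape)  # (30, 2)
```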
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.
Claims (10)
1. A bimodal child emotion recognition algorithm based on facial expression and speech, characterized in that: a semantic feature space is constructed from the emotion label information of the speech and expression features; local and global features of the audio and video are extracted with a multi-scale feature extraction method and projected into the semantic feature space; and the important features contributing to emotion classification are selected from the semantic feature space to judge and recognize emotion.
2. The bimodal expression-and-speech-based child emotion recognition algorithm of claim 1, characterized in that extracting the local features of the audio with the multi-scale feature extraction method includes:
sampling the audio at a preset sampling period to obtain audio frames, and applying a Fourier transform to each audio frame to obtain a spectrogram;
and training an output-gate convolutional neural network, then extracting features from the spectrogram with the trained network to obtain the local features of the audio.
3. The bimodal expression-and-speech-based child emotion recognition algorithm of claim 2, characterized in that the output-gate convolutional neural network includes a plurality of convolutional layers, each followed by a corresponding pooling layer; the pooling layers down-sample in the time domain and/or the frequency domain, and the total down-sampling rate in the time domain is smaller than that in the frequency domain.
4. The bimodal expression-and-speech-based child emotion recognition algorithm of claim 2, characterized in that the abscissa of the spectrogram is the time of each audio frame and the ordinate is the spectral values of that frame.
5. The bimodal expression-and-speech-based child emotion recognition algorithm of claim 1, characterized in that extracting the local features of the video with the multi-scale feature extraction method includes:
extracting a feature map from the image with convolutional layers, performing target detection and precise localization on the feature map with an RPN (Region Proposal Network) to obtain candidate regions, max-pooling the candidate regions with the ROI Pooling layer of a Fast R-CNN network, and outputting a group of video local features of the same dimensionality.
6. The bimodal expression-and-speech-based child emotion recognition algorithm of claim 5, characterized in that performing target detection and precise localization on the feature map with the RPN to obtain the candidate regions includes:
performing a convolution over the feature map with the RPN to obtain a scale-transformed feature map;
classifying the anchor boxes in the scale-transformed feature map with a softmax function to obtain foreground candidate regions containing the target object;
computing bounding-box regression offsets for the anchor boxes in the scale-transformed feature map to obtain precise candidate regions;
and obtaining pre-candidate regions from the foreground and precise candidate regions, then removing, with NMS (non-maximum suppression), pre-candidate regions that are too small or exceed the boundary, to obtain the candidate regions.
7. The bimodal expression-and-speech-based child emotion recognition algorithm of claim 6, characterized in that max-pooling the candidate regions with the ROI Pooling layer of the Fast R-CNN network includes:
mapping each precise candidate region to the corresponding position of the feature map to obtain the mapped region on the feature map;
and dividing the mapped region into several sub-windows of equal size, then max-pooling each sub-window to obtain a group of video local features of the same dimensionality.
8. The bimodal expression-and-speech-based child emotion recognition algorithm of claim 2 or 7, characterized in that extracting the global features of the audio and video with the multi-scale feature extraction method includes:
performing DAC feature fusion on the groups of audio local features and of video local features respectively, maximizing intra-class correlation and minimizing inter-class correlation, to obtain the audio global features and the video global features respectively.
9. The bimodal expression-and-speech-based child emotion recognition algorithm of claim 1, characterized in that the semantic feature space is constructed through double-sparse linear discriminant analysis on the basis of the emotion label information of the speech and expression features, and, while the local and global features of the audio and video are projected into the semantic feature space, the important features are selected from them according to the contribution of the speech and expression features to emotion classification.
10. The bimodal expression-and-speech-based child emotion recognition algorithm of claim 9, characterized in that selecting from the semantic feature space the important features that contribute to emotion classification, and judging and recognizing the emotion, includes:
using a joint sparse reduced-rank regression model to learn, from the contribution of the local and global features to emotion recognition, a weight that measures each feature's degree of contribution, and then learning the weighted local and global features a second time with the joint sparse reduced-rank regression model to select the features capable of distinguishing different emotional states.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111290611.0A CN113989893A (en) | 2021-11-02 | 2021-11-02 | Expression and voice bimodal-based children emotion recognition algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111290611.0A CN113989893A (en) | 2021-11-02 | 2021-11-02 | Expression and voice bimodal-based children emotion recognition algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113989893A true CN113989893A (en) | 2022-01-28 |
Family
ID=79745903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111290611.0A Pending CN113989893A (en) | 2021-11-02 | 2021-11-02 | Expression and voice bimodal-based children emotion recognition algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113989893A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114710555A (en) * | 2022-06-06 | 2022-07-05 | 深圳市景创科技电子股份有限公司 | Infant monitoring method and device |
CN114898775A (en) * | 2022-04-24 | 2022-08-12 | 中国科学院声学研究所南海研究站 | Voice emotion recognition method and system based on cross-layer cross fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |