CN113989893A - Expression and voice bimodal-based children emotion recognition algorithm - Google Patents

Expression and voice bimodal-based children emotion recognition algorithm

Info

Publication number
CN113989893A
CN113989893A (Application number CN202111290611.0A)
Authority
CN
China
Prior art keywords
features
emotion
audio
video
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111290611.0A
Other languages
Chinese (zh)
Inventor
张云龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Lanchen Information Technology Co ltd
Original Assignee
Anhui Lanchen Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Lanchen Information Technology Co ltd filed Critical Anhui Lanchen Information Technology Co ltd
Priority to CN202111290611.0A
Publication of CN113989893A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention relates to emotion recognition, and in particular to a children's emotion recognition algorithm based on the two modalities of facial expression and speech. A semantic feature space is constructed using the emotion label information of the speech features and expression features; local features and global features of the audio and video are extracted by a multi-scale feature extraction method and projected into the semantic feature space; and the important features contributing to emotion classification are selected from the semantic feature space for emotion judgment and recognition. The technical scheme provided by the invention effectively overcomes the defect of the prior art that emotion cannot be accurately judged and recognized.

Description

Expression and voice bimodal-based children emotion recognition algorithm
Technical Field
The invention relates to emotion recognition, and in particular to a children's emotion recognition algorithm based on the two modalities of facial expression and speech.
Background
Children's emotions are expressed in many ways, such as voice, facial expression, posture and movement, and effective information can be extracted from these expressions for correct analysis. Voice and expression information are the most obvious and most easily analysed cues and have therefore been widely researched and applied. The psychologist Mehrabian gave the formula: emotional expression = 7% words + 38% tone of voice + 55% facial expression. Voice and expression information thus cover 93% of emotional information and form the core of communicated information. In the process of emotional expression, mood can be conveyed effectively and intuitively through changes in facial expression, which is one of the most important kinds of feature information for emotion recognition, while speech features can also express rich emotion.
Traditional single-modal recognition suffers from the problem that a single emotional feature may not characterise the emotional state well. For example, when sadness is expressed the facial expression may change little, yet the sadness can still be distinguished from the low, slow speech. Multi-modal recognition lets information from different modalities complement each other, provides more information for emotion recognition, and improves recognition accuracy.
However, while single-modal emotion recognition research is relatively mature, multi-modal emotion recognition methods still need to be developed and improved. Multi-modal emotion recognition therefore has very important practical significance, and since speech and expression are the two most dominant feature types, bimodal emotion recognition based on speech and expression features has important research significance and application value. Traditional bimodal emotion recognition adopts a simple weighting method that ignores the degree to which each feature contributes to emotion recognition, which is not conducive to accurately judging and recognizing emotion.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects in the prior art, the invention provides a children's emotion recognition algorithm based on the two modalities of facial expression and speech, which can effectively overcome the defect of the prior art that emotion cannot be accurately judged and recognized.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
a child emotion recognition algorithm based on expressions and voice dual-modes is characterized in that a semantic feature space is constructed by utilizing emotion label information of voice features and expression features, local features and global features of audio and video are extracted through a multi-scale feature extraction method, the local features and the global features of the audio and the video are projected to the semantic feature space, important features contributing to emotion classification are selected from the semantic feature space, and emotion judgment and recognition are carried out.
Preferably, extracting the local features of the audio by the multi-scale feature extraction method includes:
sampling the audio at a preset sampling period to obtain audio frames, and applying a Fourier transform to each audio frame to obtain a spectrogram;
and training an output-gated convolutional neural network, and extracting features from the spectrogram with the trained network to obtain the local features of the audio.
Preferably, the output-gated convolutional neural network includes a plurality of convolutional layers, each followed by a corresponding pooling layer; the pooling layers down-sample in the time domain and/or the frequency domain, and the total down-sampling rate of the pooling layers in the time domain is smaller than their total down-sampling rate in the frequency domain.
Preferably, the abscissa of the spectrogram is time corresponding to an audio frame, and the ordinate of the spectrogram is a spectral value corresponding to the audio frame.
Preferably, extracting the local features of the video by the multi-scale feature extraction method includes:
extracting a feature map from the image with the convolutional layers, performing target detection and precise localization on the feature map through a Region Proposal Network (RPN) to obtain candidate regions, max-pooling the candidate regions through the ROI Pooling layer of the Fast R-CNN network, and outputting a group of video local features of the same dimension.
Preferably, performing target detection and precise localization on the feature map through the RPN to obtain the candidate regions includes:
performing convolution on the feature map through the Region Proposal Network to obtain a scale-transformed feature map;
classifying the anchor boxes in the scale-transformed feature map with a softmax function to obtain foreground candidate regions containing the target object;
computing bounding-box regression offsets for the anchor boxes in the scale-transformed feature map to obtain refined candidate regions;
and obtaining pre-candidate regions from the foreground candidate regions and the refined candidate regions, and removing pre-candidate regions that are too small or exceed the image boundary by non-maximum suppression (NMS) to obtain the candidate regions.
Preferably, max-pooling the candidate regions through the ROI Pooling layer of the Fast R-CNN network includes:
mapping the refined candidate regions to the corresponding positions of the feature map to obtain the mapped refined candidate regions on the feature map;
and dividing each mapped refined candidate region into several sub-windows of the same size, and max-pooling each sub-window to obtain a group of video local features of the same dimension.
Preferably, extracting the global features of the audio and video by the multi-scale feature extraction method includes:
performing DAC feature fusion on the groups of audio local features and the groups of video local features respectively, maximizing intra-class correlation and minimizing inter-class correlation, to obtain the audio global features and the video global features respectively.
Preferably, the semantic feature space is constructed through double-sparse linear discriminant analysis on the basis of the emotion label information of the speech features and expression features, and the important features are selected from the local features and global features of the audio and video, according to the contribution of the speech and expression features to emotion classification, in the process of projecting those features into the semantic feature space.
Preferably, selecting from the semantic feature space the important features contributing to emotion classification for emotion judgment and recognition includes:
using a joint sparse reduced-rank regression model to learn, from the contribution of the local features and global features to emotion recognition, a weight that measures each feature's contribution, and performing secondary learning on the weighted local and global features with the joint sparse reduced-rank regression model to select the features capable of distinguishing different emotional states.
(III) advantageous effects
Compared with the prior art, in the children's emotion recognition algorithm based on expression and speech bimodality provided by the invention, a semantic feature space is constructed through double-sparse linear discriminant analysis on the basis of the emotion label information of the speech features and expression features; local features and global features of the audio and video are extracted by a multi-scale feature extraction method and projected into the semantic feature space; a joint sparse reduced-rank regression model learns, from the contribution of the local and global features to emotion recognition, a weight that measures each feature's contribution; and the weighted local and global features are learned a second time by the joint sparse reduced-rank regression model to select the features capable of distinguishing different emotional states. The corresponding weight can therefore be determined according to each feature's contribution to emotion recognition, and emotion can be accurately judged and recognized on the basis of the features that distinguish different emotional states.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the children's emotion recognition algorithm based on expression and speech bimodality, as shown in FIG. 1, a semantic feature space is constructed using the emotion label information of the speech features and expression features; local features and global features of the audio and video are extracted by a multi-scale feature extraction method and projected into the semantic feature space; and the important features contributing to emotion classification are selected from the semantic feature space for emotion judgment and recognition.
The semantic feature space is constructed through double-sparse linear discriminant analysis on the basis of the emotion label information of the speech features and expression features, and the important features are selected from the local features and global features of the audio and video, according to the contribution of the speech and expression features to emotion classification, in the process of projecting those features into the semantic feature space.
Selecting from the semantic feature space the important features contributing to emotion classification for emotion judgment and recognition includes:
using a joint sparse reduced-rank regression model to learn, from the contribution of the local features and global features to emotion recognition, a weight that measures each feature's contribution, and performing secondary learning on the weighted local and global features with the joint sparse reduced-rank regression model to select the features capable of distinguishing different emotional states.
With the above technical scheme, the joint sparse reduced-rank regression model determines the corresponding weight according to each feature's contribution to emotion recognition, performs secondary learning on the weighted local and global features, and selects the features capable of distinguishing different emotional states, so that emotion can be accurately judged and recognized on the basis of those features. A rough sketch of such a feature-weighting step is given below.
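The patent does not give an explicit formulation of this model, so the following Python sketch is only an assumed illustration: it fits a reduced-rank regression of one-hot emotion labels on the feature matrix with an L2,1 (row-sparsity) penalty solved by proximal gradient steps, and uses the row norms of the coefficient matrix as per-feature contribution weights. The function name, hyper-parameters, and solver are hypothetical; a second pass over the re-weighted features would correspond to the secondary learning described above.

```python
import numpy as np

def sparse_reduced_rank_regression(X, Y, rank=3, lam=0.1, lr=1e-3, iters=500):
    """Minimise ||Y - XW||_F^2 + lam * sum_i ||W[i]||_2 with rank(W) <= rank.

    X: (n, d) stacked local+global features, Y: (n, k) one-hot emotion labels.
    Returns W and per-feature contribution weights (the row norms of W).
    """
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    for _ in range(iters):
        grad = X.T @ (X @ W - Y) / n
        W = W - lr * grad
        # Proximal step for the L2,1 (row-sparsity) penalty
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W = W * np.maximum(1 - lr * lam / np.maximum(norms, 1e-12), 0)
        # Project onto the rank constraint via a truncated SVD
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        W = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    weights = np.linalg.norm(W, axis=1)   # contribution of each feature
    return W, weights

X = np.random.randn(200, 40)                         # fused audio/video features (toy data)
Y = np.eye(6)[np.random.randint(0, 6, 200)]          # six emotion classes, one-hot
W, w = sparse_reduced_rank_regression(X, Y)
selected = np.argsort(w)[::-1][:10]                  # features with the largest weights
print(selected)
```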
Extracting the local features of the audio (namely the children's speech features) by the multi-scale feature extraction method includes the following steps:
sampling the audio at a preset sampling period to obtain audio frames, and applying a Fourier transform to each audio frame to obtain a spectrogram;
and training the output-gated convolutional neural network, and extracting features from the spectrogram with the trained network to obtain the local features of the audio.
The abscissa of the spectrogram is the time corresponding to the audio frame, and the ordinate of the spectrogram is the spectral value corresponding to the audio frame.
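As a minimal illustration of this framing-and-FFT step (not the patent's exact implementation), the Python sketch below windows the waveform at an assumed frame length and hop size, applies an FFT per frame, and returns a time-by-frequency map laid out as described above. The 16 kHz rate, 25 ms window, 10 ms hop, and log compression are illustrative assumptions.

```python
import numpy as np

def spectrogram(audio, frame_len=400, hop=160):
    """Frame the waveform and apply an FFT per frame.

    Rows correspond to audio frames (time axis) and columns to spectral
    values (frequency axis), matching the layout described above.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum of each frame; log-compress for a spectrogram-like map
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spec)

# Example: one second of synthetic 16 kHz audio
wave = np.random.randn(16000)
print(spectrogram(wave).shape)   # (number of frames, frequency bins)
```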
The output-gated convolutional neural network comprises a plurality of convolutional layers, each followed by a corresponding pooling layer; the pooling layers down-sample in the time domain and/or the frequency domain, and the total down-sampling rate of the pooling layers in the time domain is smaller than their total down-sampling rate in the frequency domain.
Each convolutional layer comprises at least two sub-layers, the output of the preceding sub-layer serving as the input of the following one. Each sub-layer comprises a first channel and a second channel that use different nonlinear activation functions: the nonlinear activation function of the first channel is the hyperbolic tangent function tanh, and that of the second channel is the sigmoid function.
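A minimal PyTorch sketch of one such gated convolutional block is given below, under the assumption that the two channels are combined by element-wise multiplication (as in standard gated CNNs; the patent does not state the combination explicitly) and that pooling halves only the frequency axis, so the time-domain down-sampling rate stays smaller than the frequency-domain one. Kernel sizes and channel counts are illustrative.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """One convolutional layer with two activation channels followed by pooling.

    The first channel uses tanh, the second uses sigmoid, and their element-wise
    product forms the gated output. The pooling kernel (1, 2) halves the
    frequency axis while leaving the time axis untouched.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_tanh = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv_gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=(1, 2))  # (time, frequency)

    def forward(self, x):  # x: (batch, channels, time, frequency)
        h = torch.tanh(self.conv_tanh(x)) * torch.sigmoid(self.conv_gate(x))
        return self.pool(h)

# Example: a stack of two blocks applied to a batch of spectrograms
net = nn.Sequential(GatedConvBlock(1, 16), GatedConvBlock(16, 32))
spec = torch.randn(4, 1, 98, 201)   # (batch, 1, frames, frequency bins)
print(net(spec).shape)              # frequency axis reduced, time axis kept
```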
Extracting the local features of the video (namely the children's expression features) by the multi-scale feature extraction method includes the following steps:
extracting a feature map from the image with the convolutional layers, performing target detection and precise localization on the feature map through a Region Proposal Network (RPN) to obtain candidate regions, max-pooling the candidate regions through the ROI Pooling layer of the Fast R-CNN network, and outputting a group of video local features of the same dimension.
Performing target detection and precise localization on the feature map through the RPN to obtain the candidate regions includes the following steps (a sketch follows the steps below):
performing convolution on the feature map through the Region Proposal Network to obtain a scale-transformed feature map;
classifying the anchor boxes in the scale-transformed feature map with a softmax function to obtain foreground candidate regions containing the target object;
computing bounding-box regression offsets for the anchor boxes in the scale-transformed feature map to obtain refined candidate regions;
and obtaining pre-candidate regions from the foreground candidate regions and the refined candidate regions, and removing pre-candidate regions that are too small or exceed the image boundary by non-maximum suppression (NMS) to obtain the candidate regions.
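The proposal-filtering steps above can be illustrated with the following hedged sketch, which applies (simplified, additive) regression offsets to a handful of anchor boxes, discards boxes that are too small or out of bounds, and keeps the highest-scoring non-overlapping foreground boxes with torchvision's non-maximum suppression. The thresholds, the additive offset form, and the toy anchors are assumptions, not values from the patent.

```python
import torch
from torchvision.ops import nms

def filter_proposals(anchors, fg_scores, deltas, img_w, img_h,
                     min_size=16, iou_thresh=0.7):
    """anchors, deltas: (N, 4) boxes as (x1, y1, x2, y2); fg_scores: (N,)."""
    # Apply the regression offsets (an additive shift here for brevity;
    # Faster R-CNN uses a centre/size parameterisation)
    boxes = anchors + deltas
    # Clip to the image and drop boxes that are too small
    boxes[:, 0::2] = boxes[:, 0::2].clamp(0, img_w)
    boxes[:, 1::2] = boxes[:, 1::2].clamp(0, img_h)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (w >= min_size) & (h >= min_size)
    boxes, scores = boxes[keep], fg_scores[keep]
    # Non-maximum suppression keeps the highest-scoring non-overlapping boxes
    keep_idx = nms(boxes, scores, iou_thresh)
    return boxes[keep_idx]

anchors = torch.tensor([[10., 10., 60., 60.], [12., 12., 62., 62.],
                        [200., 200., 210., 205.]])
proposals = filter_proposals(anchors, torch.tensor([0.9, 0.8, 0.3]),
                             torch.zeros(3, 4), img_w=224, img_h=224)
print(proposals)   # one box survives size filtering and NMS
```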
Max-pooling the candidate regions through the ROI Pooling layer of the Fast R-CNN network includes the following steps (see the sketch after these steps):
mapping the refined candidate regions to the corresponding positions of the feature map to obtain the mapped refined candidate regions on the feature map;
and dividing each mapped refined candidate region into several sub-windows of the same size, and max-pooling each sub-window to obtain a group of video local features of the same dimension.
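The sub-window division can be sketched as follows: each mapped candidate region on the feature map is split into a fixed grid of sub-windows and each sub-window is max-pooled, so every region yields a local feature of the same dimension regardless of its original size. The 2x2 output grid and the toy feature map are illustrative assumptions; torchvision.ops.roi_pool provides an equivalent batched operation.

```python
import torch

def roi_max_pool(feature_map, roi, out_size=2):
    """feature_map: (C, H, W); roi: (x1, y1, x2, y2) already mapped onto it."""
    x1, y1, x2, y2 = [int(v) for v in roi]
    region = feature_map[:, y1:y2, x1:x2]
    c, h, w = region.shape
    pooled = torch.empty(c, out_size, out_size)
    for i in range(out_size):          # split the region into out_size x out_size
        for j in range(out_size):      # sub-windows and max-pool each of them
            ys = slice(i * h // out_size,
                       max((i + 1) * h // out_size, i * h // out_size + 1))
            xs = slice(j * w // out_size,
                       max((j + 1) * w // out_size, j * w // out_size + 1))
            pooled[:, i, j] = region[:, ys, xs].amax(dim=(1, 2))
    return pooled.flatten()            # fixed-length local feature for this region

fm = torch.randn(32, 14, 14)            # feature map from the conv layers (toy)
feat = roi_max_pool(fm, (2, 3, 9, 11))  # one mapped candidate region
print(feat.shape)                       # torch.Size([128]) regardless of ROI size
```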
In the technical scheme of the present application, after the local features of the audio and video have been extracted by the multi-scale feature extraction method, the global features of the audio and video also need to be acquired.
Extracting the global features of the audio and video by the multi-scale feature extraction method includes the following steps:
performing DAC feature fusion on the groups of audio local features and the groups of video local features respectively, maximizing intra-class correlation and minimizing inter-class correlation, to obtain the audio global features and the video global features respectively.
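The patent does not define the DAC fusion step beyond its stated objective, so the sketch below only illustrates that objective — maximising intra-class correlation and minimising inter-class correlation between two groups of local features of one modality — using a simplified discriminative canonical-correlation style projection. The difference matrix, the SVD solution, and the projection dimension are assumptions for illustration; the same routine would be run once for the audio local features and once for the video local features to obtain each modality's global feature.

```python
import numpy as np

def dac_style_fusion(F1, F2, labels, dim=8):
    """F1: (n, d1), F2: (n, d2) -- two groups of local features of one modality.

    Builds a within-class cross-correlation matrix Cw and a between-class
    matrix Cb, then keeps the leading singular directions of (Cw - Cb) as
    projections that favour intra-class over inter-class correlation.
    """
    F1 = F1 - F1.mean(0)
    F2 = F2 - F2.mean(0)
    Cw = np.zeros((F1.shape[1], F2.shape[1]))
    for c in np.unique(labels):
        m = labels == c
        Cw += F1[m].T @ F2[m]
    Cb = F1.T @ F2 - Cw                 # total cross-correlation minus the within-class part
    U, _, Vt = np.linalg.svd(Cw - Cb)
    W1, W2 = U[:, :dim], Vt[:dim].T
    # Concatenate the projected groups as the modality's global feature
    return np.hstack([F1 @ W1, F2 @ W2])

aud_global = dac_style_fusion(np.random.randn(100, 64), np.random.randn(100, 96),
                              np.random.randint(0, 6, 100))   # toy audio local features
print(aud_global.shape)                                       # (100, 16)
```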
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A children emotion recognition algorithm based on expressions and speech bimodal is characterized in that: the method comprises the steps of constructing a semantic feature space by utilizing emotion label information of voice features and expression features, extracting local features and global features of audio and video by a multi-scale feature extraction method, projecting the local features and the global features of the audio and the video to the semantic feature space, selecting important features contributing to emotion classification from the semantic feature space, and carrying out emotion judgment and identification.
2. The dual expression and speech-based children emotion recognition algorithm of claim 1, wherein: the method for extracting the local features of the audio by the multi-scale feature extraction method comprises the following steps:
sampling the audio at a preset sampling period to obtain audio frames, and carrying out a Fourier transform on each audio frame to obtain a spectrogram;
and performing model training on the output gate convolutional neural network, and performing feature extraction on the spectrogram by using the trained output gate convolutional neural network to obtain the local features of the audio.
3. The dual expression and speech-based children emotion recognition algorithm of claim 2, wherein: the output gate convolutional neural network comprises a plurality of convolutional layers, a corresponding pooling layer is connected behind each convolutional layer, the pooling layers are used for performing down-sampling in a time domain and/or a frequency domain, and the total down-sampling rate of each pooling layer in the time domain is smaller than that in the frequency domain.
4. The dual expression and speech-based children emotion recognition algorithm of claim 2, wherein: the abscissa of the spectrogram is the time corresponding to the audio frame, and the ordinate of the spectrogram is the frequency spectrum value corresponding to the audio frame.
5. The dual expression and speech-based children emotion recognition algorithm of claim 1, wherein: the method for extracting the local features of the video through the multi-scale feature extraction method comprises the following steps:
extracting a feature map from the image by using the convolutional layers, carrying out target detection and accurate positioning on the feature map through a Region Proposal Network (RPN) to obtain candidate regions, carrying out maximum pooling on the candidate regions through the ROI Pooling layer in the Fast R-CNN network, and outputting a group of video local features of the same dimension.
6. The dual expression and speech-based children's emotion recognition algorithm of claim 5, wherein: the target detection and accurate positioning of the feature map through the RPN to obtain the candidate area includes:
carrying out convolution calculation on the feature map through the Region Proposal Network (RPN) to obtain a feature map after scale transformation;
classifying the anchor frames in the feature map after the scale transformation by using a softmax function to obtain a foreground candidate area containing a target object;
calculating frame regression offset of an anchor frame in the feature map after the scale transformation to obtain an accurate candidate region;
and obtaining pre-candidate areas based on the foreground candidate area and the accurate candidate area, and removing pre-candidate areas that are too small or exceed the boundary by non-maximum suppression (NMS) to obtain the candidate area.
7. The dual expression and speech-based children's emotion recognition algorithm of claim 6, wherein: the maximal pooling of candidate regions by the ROI Pooling layer in the Fast R-CNN network includes:
mapping the accurate candidate area to a corresponding position of the characteristic diagram to obtain the mapped accurate candidate area on the characteristic diagram;
and dividing the mapped accurate candidate area into a plurality of sub-windows with the same size, and performing maximum pooling on each sub-window to obtain a group of video local features with multiple same dimensions.
8. The dual expression and speech based child emotion recognition algorithm of claim 2 or 7, wherein: the method for extracting the global features of the audio and the video by the multi-scale feature extraction method comprises the following steps:
and respectively carrying out DAC feature fusion on the multiple groups of audio local features and video local features to maximize the intra-class correlation and minimize the inter-class correlation, so as to respectively obtain the audio global features and the video global features.
9. The dual expression and speech-based children emotion recognition algorithm of claim 1, wherein: the semantic feature space is constructed on the basis of emotion label information of voice features and expression features through double-sparse linear discriminant analysis, and important features are selected from the local features and the global features of the audio and the video according to the contribution of the voice and expression features to emotion classification in the process of projecting the local features and the global features of the audio and the video to the semantic feature space.
10. The dual expression and speech-based children's emotion recognition algorithm of claim 9, wherein: selecting important features contributing to emotion classification from the semantic feature space, and carrying out emotion judgment and identification, wherein the important features comprise:
and learning a weight value for measuring the contribution degree of the local characteristic and the global characteristic according to the contribution of the local characteristic and the global characteristic to emotion recognition by using a combined sparse reduced rank regression model, and performing secondary learning on the weighted local characteristic and the weighted global characteristic by using the combined sparse reduced rank regression model to select the characteristic with the capability of distinguishing different emotion states.
CN202111290611.0A 2021-11-02 2021-11-02 Expression and voice bimodal-based children emotion recognition algorithm Pending CN113989893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111290611.0A CN113989893A (en) 2021-11-02 2021-11-02 Expression and voice bimodal-based children emotion recognition algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111290611.0A CN113989893A (en) 2021-11-02 2021-11-02 Expression and voice bimodal-based children emotion recognition algorithm

Publications (1)

Publication Number Publication Date
CN113989893A true CN113989893A (en) 2022-01-28

Family

ID=79745903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111290611.0A Pending CN113989893A (en) 2021-11-02 2021-11-02 Expression and voice bimodal-based children emotion recognition algorithm

Country Status (1)

Country Link
CN (1) CN113989893A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114710555A (en) * 2022-06-06 2022-07-05 深圳市景创科技电子股份有限公司 Infant monitoring method and device
CN114898775A (en) * 2022-04-24 2022-08-12 中国科学院声学研究所南海研究站 Voice emotion recognition method and system based on cross-layer cross fusion

Similar Documents

Publication Publication Date Title
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
CN112784798B (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN108805087B (en) Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
CN109409296B (en) Video emotion recognition method integrating facial expression recognition and voice emotion recognition
CN108805088B (en) Physiological signal analysis subsystem based on multi-modal emotion recognition system
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN113643723B (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN113989893A (en) Expression and voice bimodal-based children emotion recognition algorithm
Huang et al. Characterizing types of convolution in deep convolutional recurrent neural networks for robust speech emotion recognition
CN110674483B (en) Identity recognition method based on multi-mode information
CN110853656B (en) Audio tampering identification method based on improved neural network
Bhosale et al. End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios.
CN112329438A (en) Automatic lie detection method and system based on domain confrontation training
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Fritsch et al. Estimating the degree of sleepiness by integrating articulatory feature knowledge in raw waveform Based CNNS
CN111048068B (en) Voice wake-up method, device and system and electronic equipment
Zhou et al. Speech Emotion Recognition with Discriminative Feature Learning.
CN116434786A (en) Text-semantic-assisted teacher voice emotion recognition method

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination