CN113989893A - Expression and voice bimodal-based children emotion recognition algorithm - Google Patents

Expression and voice bimodal-based children emotion recognition algorithm

Info

Publication number
CN113989893A
CN113989893A (Application number CN202111290611.0A)
Authority
CN
China
Prior art keywords
features
emotion
audio
video
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111290611.0A
Other languages
Chinese (zh)
Inventor
张云龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Lanchen Information Technology Co ltd
Original Assignee
Anhui Lanchen Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Lanchen Information Technology Co ltd filed Critical Anhui Lanchen Information Technology Co ltd
Priority to CN202111290611.0A
Publication of CN113989893A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention relates to emotion recognition, and in particular to a children's emotion recognition algorithm based on the two modalities of facial expression and speech. A semantic feature space is constructed using the emotion label information of the speech features and expression features; local features and global features of the audio and video are extracted by a multi-scale feature extraction method and projected into the semantic feature space; and the important features contributing to emotion classification are selected from the semantic feature space for emotion judgment and recognition. The technical scheme provided by the invention effectively overcomes the defect of the prior art that emotion cannot be accurately judged and recognized.

Description

Expression and voice bimodal-based children emotion recognition algorithm
Technical Field
The invention relates to emotion recognition, and in particular to a children's emotion recognition algorithm based on the two modalities of facial expression and speech.
Background
Children's emotions are expressed in many ways, such as voice, facial expression, posture and movement, and effective information can be extracted from these expressions for correct analysis. Voice and expression information are the most obvious and most easily analysed cues and have therefore been widely researched and applied. The psychologist Mehrabian gave the formula: emotional expression = 7% words + 38% tone of voice + 55% facial expression. Voice and expression information thus cover 93% of emotional information and form the core of communicated information. In the process of emotional expression, mood can be conveyed effectively and intuitively through changes in facial expression, which is one of the most important kinds of feature information for emotion recognition, while speech features can also express rich emotion.
Traditional single-modal recognition suffers from the problem that a single emotional feature may not characterise the emotional state well. For example, when sadness is expressed the facial expression may change little, yet the sadness can still be distinguished from the low, slow speech. Multi-modal recognition lets information from different modalities complement each other, provides more information for emotion recognition, and improves recognition accuracy.
However, while single-modal emotion recognition research is relatively mature, multi-modal emotion recognition methods still need to be developed and improved. Multi-modal emotion recognition therefore has very important practical significance, and since speech and expression are the two most dominant feature types, bimodal emotion recognition based on speech and expression features has important research significance and application value. Traditional bimodal emotion recognition adopts a simple weighting method that ignores the degree to which each feature contributes to emotion recognition, which is not conducive to accurately judging and recognizing emotion.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects in the prior art, the invention provides a children's emotion recognition algorithm based on the two modalities of facial expression and speech, which can effectively overcome the defect of the prior art that emotion cannot be accurately judged and recognized.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
a child emotion recognition algorithm based on expressions and voice dual-modes is characterized in that a semantic feature space is constructed by utilizing emotion label information of voice features and expression features, local features and global features of audio and video are extracted through a multi-scale feature extraction method, the local features and the global features of the audio and the video are projected to the semantic feature space, important features contributing to emotion classification are selected from the semantic feature space, and emotion judgment and recognition are carried out.
Preferably, extracting the local features of the audio by the multi-scale feature extraction method includes:
sampling the audio at a preset sampling period to obtain audio frames, and applying a Fourier transform to each audio frame to obtain a spectrogram;
and training an output-gated convolutional neural network, and extracting features from the spectrogram with the trained network to obtain the local features of the audio.
Preferably, the output-gated convolutional neural network includes a plurality of convolutional layers, each followed by a corresponding pooling layer; the pooling layers down-sample in the time domain and/or the frequency domain, and the total down-sampling rate of the pooling layers in the time domain is smaller than their total down-sampling rate in the frequency domain.
Preferably, the abscissa of the spectrogram is time corresponding to an audio frame, and the ordinate of the spectrogram is a spectral value corresponding to the audio frame.
Preferably, extracting the local features of the video by the multi-scale feature extraction method includes:
extracting a feature map from the image with the convolutional layers, performing target detection and precise localization on the feature map through a Region Proposal Network (RPN) to obtain candidate regions, max-pooling the candidate regions through the ROI Pooling layer of the Fast R-CNN network, and outputting a group of video local features of the same dimension.
Preferably, performing target detection and precise localization on the feature map through the RPN to obtain the candidate regions includes:
performing convolution on the feature map through the Region Proposal Network to obtain a scale-transformed feature map;
classifying the anchor boxes in the scale-transformed feature map with a softmax function to obtain foreground candidate regions containing the target object;
computing bounding-box regression offsets for the anchor boxes in the scale-transformed feature map to obtain refined candidate regions;
and obtaining pre-candidate regions from the foreground candidate regions and the refined candidate regions, and removing pre-candidate regions that are too small or exceed the image boundary by non-maximum suppression (NMS) to obtain the candidate regions.
Preferably, max-pooling the candidate regions through the ROI Pooling layer of the Fast R-CNN network includes:
mapping the refined candidate regions to the corresponding positions of the feature map to obtain the mapped refined candidate regions on the feature map;
and dividing each mapped refined candidate region into several sub-windows of the same size, and max-pooling each sub-window to obtain a group of video local features of the same dimension.
Preferably, extracting the global features of the audio and video by the multi-scale feature extraction method includes:
performing DAC feature fusion on the groups of audio local features and the groups of video local features respectively, maximizing intra-class correlation and minimizing inter-class correlation, to obtain the audio global features and the video global features respectively.
Preferably, the semantic feature space is constructed through double-sparse linear discriminant analysis on the basis of the emotion label information of the speech features and expression features, and the important features are selected from the local features and global features of the audio and video, according to the contribution of the speech and expression features to emotion classification, in the process of projecting those features into the semantic feature space.
Preferably, selecting from the semantic feature space the important features contributing to emotion classification for emotion judgment and recognition includes:
using a joint sparse reduced-rank regression model to learn, from the contribution of the local features and global features to emotion recognition, a weight that measures each feature's contribution, and performing secondary learning on the weighted local and global features with the joint sparse reduced-rank regression model to select the features capable of distinguishing different emotional states.
(III) advantageous effects
Compared with the prior art, in the children's emotion recognition algorithm based on expression and speech bimodality provided by the invention, a semantic feature space is constructed through double-sparse linear discriminant analysis on the basis of the emotion label information of the speech features and expression features; local features and global features of the audio and video are extracted by a multi-scale feature extraction method and projected into the semantic feature space; a joint sparse reduced-rank regression model learns, from the contribution of the local and global features to emotion recognition, a weight that measures each feature's contribution; and the weighted local and global features are learned a second time by the joint sparse reduced-rank regression model to select the features capable of distinguishing different emotional states. The corresponding weight can therefore be determined according to each feature's contribution to emotion recognition, and emotion can be accurately judged and recognized on the basis of the features that distinguish different emotional states.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the children's emotion recognition algorithm based on expression and speech bimodality, as shown in FIG. 1, a semantic feature space is constructed using the emotion label information of the speech features and expression features; local features and global features of the audio and video are extracted by a multi-scale feature extraction method and projected into the semantic feature space; and the important features contributing to emotion classification are selected from the semantic feature space for emotion judgment and recognition.
The semantic feature space is constructed through double-sparse linear discriminant analysis on the basis of the emotion label information of the speech features and expression features, and the important features are selected from the local features and global features of the audio and video, according to the contribution of the speech and expression features to emotion classification, in the process of projecting those features into the semantic feature space.
Selecting from the semantic feature space the important features contributing to emotion classification for emotion judgment and recognition includes:
using a joint sparse reduced-rank regression model to learn, from the contribution of the local features and global features to emotion recognition, a weight that measures each feature's contribution, and performing secondary learning on the weighted local and global features with the joint sparse reduced-rank regression model to select the features capable of distinguishing different emotional states.
With the above technical scheme, the joint sparse reduced-rank regression model determines the corresponding weight according to each feature's contribution to emotion recognition, performs secondary learning on the weighted local and global features, and selects the features capable of distinguishing different emotional states, so that emotion can be accurately judged and recognized on the basis of those features. A rough sketch of such a feature-weighting step is given below.
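The patent does not give an explicit formulation of this model, so the following Python sketch is only an assumed illustration: it fits a reduced-rank regression of one-hot emotion labels on the feature matrix with an L2,1 (row-sparsity) penalty solved by proximal gradient steps, and uses the row norms of the coefficient matrix as per-feature contribution weights. The function name, hyper-parameters, and solver are hypothetical; a second pass over the re-weighted features would correspond to the secondary learning described above.

```python
import numpy as np

def sparse_reduced_rank_regression(X, Y, rank=3, lam=0.1, lr=1e-3, iters=500):
    """Minimise ||Y - XW||_F^2 + lam * sum_i ||W[i]||_2 with rank(W) <= rank.

    X: (n, d) stacked local+global features, Y: (n, k) one-hot emotion labels.
    Returns W and per-feature contribution weights (the row norms of W).
    """
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    for _ in range(iters):
        grad = X.T @ (X @ W - Y) / n
        W = W - lr * grad
        # Proximal step for the L2,1 (row-sparsity) penalty
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W = W * np.maximum(1 - lr * lam / np.maximum(norms, 1e-12), 0)
        # Project onto the rank constraint via a truncated SVD
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        W = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    weights = np.linalg.norm(W, axis=1)   # contribution of each feature
    return W, weights

X = np.random.randn(200, 40)                         # fused audio/video features (toy data)
Y = np.eye(6)[np.random.randint(0, 6, 200)]          # six emotion classes, one-hot
W, w = sparse_reduced_rank_regression(X, Y)
selected = np.argsort(w)[::-1][:10]                  # features with the largest weights
print(selected)
```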
Extracting the local features of the audio (namely the children's speech features) by the multi-scale feature extraction method includes the following steps:
sampling the audio at a preset sampling period to obtain audio frames, and applying a Fourier transform to each audio frame to obtain a spectrogram;
and training the output-gated convolutional neural network, and extracting features from the spectrogram with the trained network to obtain the local features of the audio.
The abscissa of the spectrogram is the time corresponding to the audio frame, and the ordinate of the spectrogram is the spectral value corresponding to the audio frame.
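As a minimal illustration of this framing-and-FFT step (not the patent's exact implementation), the Python sketch below windows the waveform at an assumed frame length and hop size, applies an FFT per frame, and returns a time-by-frequency map laid out as described above. The 16 kHz rate, 25 ms window, 10 ms hop, and log compression are illustrative assumptions.

```python
import numpy as np

def spectrogram(audio, frame_len=400, hop=160):
    """Frame the waveform and apply an FFT per frame.

    Rows correspond to audio frames (time axis) and columns to spectral
    values (frequency axis), matching the layout described above.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum of each frame; log-compress for a spectrogram-like map
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spec)

# Example: one second of synthetic 16 kHz audio
wave = np.random.randn(16000)
print(spectrogram(wave).shape)   # (number of frames, frequency bins)
```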
The output-gated convolutional neural network comprises a plurality of convolutional layers, each followed by a corresponding pooling layer; the pooling layers down-sample in the time domain and/or the frequency domain, and the total down-sampling rate of the pooling layers in the time domain is smaller than their total down-sampling rate in the frequency domain.
Each convolutional layer comprises at least two sub-layers, the output of the preceding sub-layer serving as the input of the following one. Each sub-layer comprises a first channel and a second channel that use different nonlinear activation functions: the nonlinear activation function of the first channel is the hyperbolic tangent function tanh, and that of the second channel is the sigmoid function.
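A minimal PyTorch sketch of one such gated convolutional block is given below, under the assumption that the two channels are combined by element-wise multiplication (as in standard gated CNNs; the patent does not state the combination explicitly) and that pooling halves only the frequency axis, so the time-domain down-sampling rate stays smaller than the frequency-domain one. Kernel sizes and channel counts are illustrative.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """One convolutional layer with two activation channels followed by pooling.

    The first channel uses tanh, the second uses sigmoid, and their element-wise
    product forms the gated output. The pooling kernel (1, 2) halves the
    frequency axis while leaving the time axis untouched.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_tanh = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv_gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=(1, 2))  # (time, frequency)

    def forward(self, x):  # x: (batch, channels, time, frequency)
        h = torch.tanh(self.conv_tanh(x)) * torch.sigmoid(self.conv_gate(x))
        return self.pool(h)

# Example: a stack of two blocks applied to a batch of spectrograms
net = nn.Sequential(GatedConvBlock(1, 16), GatedConvBlock(16, 32))
spec = torch.randn(4, 1, 98, 201)   # (batch, 1, frames, frequency bins)
print(net(spec).shape)              # frequency axis reduced, time axis kept
```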
Extracting the local features of the video (namely the children's expression features) by the multi-scale feature extraction method includes the following steps:
extracting a feature map from the image with the convolutional layers, performing target detection and precise localization on the feature map through a Region Proposal Network (RPN) to obtain candidate regions, max-pooling the candidate regions through the ROI Pooling layer of the Fast R-CNN network, and outputting a group of video local features of the same dimension.
Performing target detection and precise localization on the feature map through the RPN to obtain the candidate regions includes the following steps (a sketch follows the steps below):
performing convolution on the feature map through the Region Proposal Network to obtain a scale-transformed feature map;
classifying the anchor boxes in the scale-transformed feature map with a softmax function to obtain foreground candidate regions containing the target object;
computing bounding-box regression offsets for the anchor boxes in the scale-transformed feature map to obtain refined candidate regions;
and obtaining pre-candidate regions from the foreground candidate regions and the refined candidate regions, and removing pre-candidate regions that are too small or exceed the image boundary by non-maximum suppression (NMS) to obtain the candidate regions.
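The proposal-filtering steps above can be illustrated with the following hedged sketch, which applies (simplified, additive) regression offsets to a handful of anchor boxes, discards boxes that are too small or out of bounds, and keeps the highest-scoring non-overlapping foreground boxes with torchvision's non-maximum suppression. The thresholds, the additive offset form, and the toy anchors are assumptions, not values from the patent.

```python
import torch
from torchvision.ops import nms

def filter_proposals(anchors, fg_scores, deltas, img_w, img_h,
                     min_size=16, iou_thresh=0.7):
    """anchors, deltas: (N, 4) boxes as (x1, y1, x2, y2); fg_scores: (N,)."""
    # Apply the regression offsets (an additive shift here for brevity;
    # Faster R-CNN uses a centre/size parameterisation)
    boxes = anchors + deltas
    # Clip to the image and drop boxes that are too small
    boxes[:, 0::2] = boxes[:, 0::2].clamp(0, img_w)
    boxes[:, 1::2] = boxes[:, 1::2].clamp(0, img_h)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (w >= min_size) & (h >= min_size)
    boxes, scores = boxes[keep], fg_scores[keep]
    # Non-maximum suppression keeps the highest-scoring non-overlapping boxes
    keep_idx = nms(boxes, scores, iou_thresh)
    return boxes[keep_idx]

anchors = torch.tensor([[10., 10., 60., 60.], [12., 12., 62., 62.],
                        [200., 200., 210., 205.]])
proposals = filter_proposals(anchors, torch.tensor([0.9, 0.8, 0.3]),
                             torch.zeros(3, 4), img_w=224, img_h=224)
print(proposals)   # one box survives size filtering and NMS
```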
Max-pooling the candidate regions through the ROI Pooling layer of the Fast R-CNN network includes the following steps (see the sketch after these steps):
mapping the refined candidate regions to the corresponding positions of the feature map to obtain the mapped refined candidate regions on the feature map;
and dividing each mapped refined candidate region into several sub-windows of the same size, and max-pooling each sub-window to obtain a group of video local features of the same dimension.
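The sub-window division can be sketched as follows: each mapped candidate region on the feature map is split into a fixed grid of sub-windows and each sub-window is max-pooled, so every region yields a local feature of the same dimension regardless of its original size. The 2x2 output grid and the toy feature map are illustrative assumptions; torchvision.ops.roi_pool provides an equivalent batched operation.

```python
import torch

def roi_max_pool(feature_map, roi, out_size=2):
    """feature_map: (C, H, W); roi: (x1, y1, x2, y2) already mapped onto it."""
    x1, y1, x2, y2 = [int(v) for v in roi]
    region = feature_map[:, y1:y2, x1:x2]
    c, h, w = region.shape
    pooled = torch.empty(c, out_size, out_size)
    for i in range(out_size):          # split the region into out_size x out_size
        for j in range(out_size):      # sub-windows and max-pool each of them
            ys = slice(i * h // out_size,
                       max((i + 1) * h // out_size, i * h // out_size + 1))
            xs = slice(j * w // out_size,
                       max((j + 1) * w // out_size, j * w // out_size + 1))
            pooled[:, i, j] = region[:, ys, xs].amax(dim=(1, 2))
    return pooled.flatten()            # fixed-length local feature for this region

fm = torch.randn(32, 14, 14)            # feature map from the conv layers (toy)
feat = roi_max_pool(fm, (2, 3, 9, 11))  # one mapped candidate region
print(feat.shape)                       # torch.Size([128]) regardless of ROI size
```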
In the technical scheme of the present application, after the local features of the audio and video have been extracted by the multi-scale feature extraction method, the global features of the audio and video also need to be acquired.
Extracting the global features of the audio and video by the multi-scale feature extraction method includes the following steps:
performing DAC feature fusion on the groups of audio local features and the groups of video local features respectively, maximizing intra-class correlation and minimizing inter-class correlation, to obtain the audio global features and the video global features respectively.
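The patent does not define the DAC fusion step beyond its stated objective, so the sketch below only illustrates that objective — maximising intra-class correlation and minimising inter-class correlation between two groups of local features of one modality — using a simplified discriminative canonical-correlation style projection. The difference matrix, the SVD solution, and the projection dimension are assumptions for illustration; the same routine would be run once for the audio local features and once for the video local features to obtain each modality's global feature.

```python
import numpy as np

def dac_style_fusion(F1, F2, labels, dim=8):
    """F1: (n, d1), F2: (n, d2) -- two groups of local features of one modality.

    Builds a within-class cross-correlation matrix Cw and a between-class
    matrix Cb, then keeps the leading singular directions of (Cw - Cb) as
    projections that favour intra-class over inter-class correlation.
    """
    F1 = F1 - F1.mean(0)
    F2 = F2 - F2.mean(0)
    Cw = np.zeros((F1.shape[1], F2.shape[1]))
    for c in np.unique(labels):
        m = labels == c
        Cw += F1[m].T @ F2[m]
    Cb = F1.T @ F2 - Cw                 # total cross-correlation minus the within-class part
    U, _, Vt = np.linalg.svd(Cw - Cb)
    W1, W2 = U[:, :dim], Vt[:dim].T
    # Concatenate the projected groups as the modality's global feature
    return np.hstack([F1 @ W1, F2 @ W2])

aud_global = dac_style_fusion(np.random.randn(100, 64), np.random.randn(100, 96),
                              np.random.randint(0, 6, 100))   # toy audio local features
print(aud_global.shape)                                       # (100, 16)
```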
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A children emotion recognition algorithm based on expressions and speech bimodal is characterized in that: the method comprises the steps of constructing a semantic feature space by utilizing emotion label information of voice features and expression features, extracting local features and global features of audio and video by a multi-scale feature extraction method, projecting the local features and the global features of the audio and the video to the semantic feature space, selecting important features contributing to emotion classification from the semantic feature space, and carrying out emotion judgment and identification.
2. The dual expression and speech-based children emotion recognition algorithm of claim 1, wherein: the method for extracting the local features of the audio by the multi-scale feature extraction method comprises the following steps:
sampling the audio at a preset sampling period to obtain audio frames, and carrying out a Fourier transform on each audio frame to obtain a spectrogram;
and performing model training on the output gate convolutional neural network, and performing feature extraction on the spectrogram by using the trained output gate convolutional neural network to obtain the local features of the audio.
3. The dual expression and speech-based children emotion recognition algorithm of claim 2, wherein: the output gate convolutional neural network comprises a plurality of convolutional layers, a corresponding pooling layer is connected behind each convolutional layer, the pooling layers are used for performing down-sampling in a time domain and/or a frequency domain, and the total down-sampling rate of each pooling layer in the time domain is smaller than that in the frequency domain.
4. The dual expression and speech-based children emotion recognition algorithm of claim 2, wherein: the abscissa of the spectrogram is the time corresponding to the audio frame, and the ordinate of the spectrogram is the frequency spectrum value corresponding to the audio frame.
5. The dual expression and speech-based children emotion recognition algorithm of claim 1, wherein: the method for extracting the local features of the video through the multi-scale feature extraction method comprises the following steps:
extracting a feature map from the image by using the convolutional layers, carrying out target detection and accurate positioning on the feature map through a Region Proposal Network (RPN) to obtain candidate regions, carrying out maximum pooling on the candidate regions through the ROI Pooling layer in the Fast R-CNN network, and outputting a group of video local features of the same dimension.
6. The dual expression and speech-based children's emotion recognition algorithm of claim 5, wherein: the target detection and accurate positioning of the feature map through the RPN to obtain the candidate area includes:
carrying out convolution calculation on the feature map through the Region Proposal Network (RPN) to obtain a feature map after scale transformation;
classifying the anchor frames in the feature map after the scale transformation by using a softmax function to obtain a foreground candidate area containing a target object;
calculating frame regression offset of an anchor frame in the feature map after the scale transformation to obtain an accurate candidate region;
and obtaining pre-candidate areas based on the foreground candidate area and the accurate candidate area, and removing pre-candidate areas that are too small or exceed the boundary by non-maximum suppression (NMS) to obtain the candidate area.
7. The dual expression and speech-based children's emotion recognition algorithm of claim 6, wherein: the maximal pooling of candidate regions by the ROI Pooling layer in the Fast R-CNN network includes:
mapping the accurate candidate area to a corresponding position of the characteristic diagram to obtain the mapped accurate candidate area on the characteristic diagram;
and dividing the mapped accurate candidate area into a plurality of sub-windows with the same size, and performing maximum pooling on each sub-window to obtain a group of video local features with multiple same dimensions.
8. The dual expression and speech based child emotion recognition algorithm of claim 2 or 7, wherein: the method for extracting the global features of the audio and the video by the multi-scale feature extraction method comprises the following steps:
and respectively carrying out DAC feature fusion on the multiple groups of audio local features and video local features to maximize the intra-class correlation and minimize the inter-class correlation, so as to respectively obtain the audio global features and the video global features.
9. The dual expression and speech-based children emotion recognition algorithm of claim 1, wherein: the semantic feature space is constructed on the basis of emotion label information of voice features and expression features through double-sparse linear discriminant analysis, and important features are selected from the local features and the global features of the audio and the video according to the contribution of the voice and expression features to emotion classification in the process of projecting the local features and the global features of the audio and the video to the semantic feature space.
10. The dual expression and speech-based children's emotion recognition algorithm of claim 9, wherein: selecting important features contributing to emotion classification from the semantic feature space, and carrying out emotion judgment and identification, wherein the important features comprise:
and learning a weight value for measuring the contribution degree of the local characteristic and the global characteristic according to the contribution of the local characteristic and the global characteristic to emotion recognition by using a combined sparse reduced rank regression model, and performing secondary learning on the weighted local characteristic and the weighted global characteristic by using the combined sparse reduced rank regression model to select the characteristic with the capability of distinguishing different emotion states.
CN202111290611.0A 2021-11-02 2021-11-02 Expression and voice bimodal-based children emotion recognition algorithm Pending CN113989893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111290611.0A CN113989893A (en) 2021-11-02 2021-11-02 Expression and voice bimodal-based children emotion recognition algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111290611.0A CN113989893A (en) 2021-11-02 2021-11-02 Expression and voice bimodal-based children emotion recognition algorithm

Publications (1)

Publication Number Publication Date
CN113989893A true CN113989893A (en) 2022-01-28

Family

ID=79745903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111290611.0A Pending CN113989893A (en) 2021-11-02 2021-11-02 Expression and voice bimodal-based children emotion recognition algorithm

Country Status (1)

Country Link
CN (1) CN113989893A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114710555A (en) * 2022-06-06 2022-07-05 深圳市景创科技电子股份有限公司 Infant monitoring method and device
CN114898775A (en) * 2022-04-24 2022-08-12 中国科学院声学研究所南海研究站 Voice emotion recognition method and system based on cross-layer cross fusion

Similar Documents

Publication Publication Date Title
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
CN112784798B (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN108805087B (en) Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
CN109409296B (en) Video emotion recognition method integrating facial expression recognition and voice emotion recognition
CN108805088B (en) Physiological signal analysis subsystem based on multi-modal emotion recognition system
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN113643723B (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN113989893A (en) Expression and voice bimodal-based children emotion recognition algorithm
Huang et al. Characterizing types of convolution in deep convolutional recurrent neural networks for robust speech emotion recognition
CN110674483B (en) Identity recognition method based on multi-mode information
CN110853656B (en) Audio tampering identification method based on improved neural network
Bhosale et al. End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios.
CN112329438A (en) Automatic lie detection method and system based on domain confrontation training
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Fritsch et al. Estimating the degree of sleepiness by integrating articulatory feature knowledge in raw waveform Based CNNS
CN111048068B (en) Voice wake-up method, device and system and electronic equipment
Zhou et al. Speech Emotion Recognition with Discriminative Feature Learning.
CN116434786A (en) Text-semantic-assisted teacher voice emotion recognition method

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination