WO2020248376A1 - Emotion detection method and apparatus, electronic device, and storage medium - Google Patents

Emotion detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2020248376A1
Authority
WO
WIPO (PCT)
Prior art keywords
samples
image
features
sound
input
Prior art date
Application number
PCT/CN2019/102867
Other languages
French (fr)
Chinese (zh)
Inventor
盛建达 (Sheng Jianda)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020248376A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - ... characterised by the type of extracted parameters
    • G10L 25/24 - ... the extracted parameters being the cepstrum
    • G10L 25/48 - ... specially adapted for particular use
    • G10L 25/51 - ... for comparison or discrimination
    • G10L 25/63 - ... for estimating an emotional state

Definitions

  • the present invention relates to the field of artificial intelligence technology, and in particular to an emotion detection method, device, electronic equipment and storage medium.
  • Emotion is a state that integrates a person's feelings, thoughts and behaviors; it includes a person's psychological response to external or internal stimuli, and it causes significant changes in a person's physiology and behavior. Facial expressions are an important aspect of the outward manifestation of emotion: changes in the eyes, eyebrows or mouth best express a person's emotions, and a person's emotion can be judged by recognizing and analyzing facial expressions. In a specific emotional state, people produce specific patterns of facial muscle movement and expression, so emotions can be recognized from the correspondence between expressions and emotions.
  • Emotion recognition is a key technology in the field of artificial intelligence.
  • the research on emotion recognition has important practical application value for human-computer interaction.
  • traditional emotion recognition methods generally use the LBP (Local Binary Pattern) method to extract features, and then an SVM (Support Vector Machine) classifier to classify emotions.
  • in existing facial expression recognition methods, because facial expressions fall into many classes with complex patterns, the recognition process is computationally complex, and the recognition accuracy and recognition efficiency for facial expressions are not high.
  • the first aspect of the present invention provides an emotion detection method, which includes:
  • before the acquiring of multiple audio and video samples, the method further includes:
  • the training samples are input into the densely connected convolutional network DenseNet model for training to obtain an image-based emotion recognition model.
  • the fusing the multiple image features and the multiple voice features to obtain the fusion feature includes:
  • the multiple image features and multiple voice features are spliced by feature values in a splicing layer to obtain a fusion feature.
  • before the acquiring of multiple audio and video samples, the method further includes:
  • the MFCC features are trained to obtain a voice-based emotion recognition model.
  • the inputting a plurality of the image samples into an image-based emotion recognition model to obtain a plurality of image features includes:
  • Input a plurality of said image samples into an image-based emotion recognition model to obtain a plurality of facial action unit AU detection network features; determine a plurality of said AU detection network features as a plurality of image features; or
  • a plurality of the image samples are input into an image-based emotion recognition model to obtain a plurality of emotion detection network features; the plurality of emotion detection network features are determined as a plurality of image features.
  • the inputting a plurality of the sound samples into a sound-based emotion recognition model to obtain a plurality of sound features includes:
  • Input a plurality of said sound samples into a voice-based emotion recognition model to obtain a plurality of Mel frequency cepstral coefficients; determine the plurality of Mel frequency cepstral coefficients as a plurality of sound features; or
  • the method further includes:
  • if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belong is a negative emotion, obtain the user terminal of the user;
  • the prompt information carrying the emotion recognition result is sent to the user terminal, and the prompt information is used to prompt the user to pay attention to adjusting emotions.
  • a second aspect of the present invention provides an emotion detection device, which includes:
  • the acquisition module is used to acquire multiple audio and video samples
  • a grabbing module for grabbing multiple image samples and multiple sound samples matching the multiple image samples from the audio and video samples
  • An input module configured to input a plurality of said image samples into an image-based emotion recognition model to obtain multiple image features, and input a plurality of said sound samples into a voice-based emotion recognition model to obtain multiple voice features;
  • a fusion module for fusing the multiple image features and multiple voice features to obtain fusion features
  • a training module configured to input the fusion features into the model to be trained for training, and obtain a mixed emotion detection model
  • the input module is also used to input the processed audio and video to be recognized into the emotion detection mixture model to obtain emotion recognition results.
  • a third aspect of the present invention provides an electronic device including a processor and a memory, and the processor is configured to implement the emotion detection method when executing computer-readable instructions stored in the memory.
  • a fourth aspect of the present invention provides a non-volatile readable storage medium having computer readable instructions stored on the non-volatile readable storage medium, and when the computer readable instructions are executed by a processor, the Method of emotion detection.
  • multiple audio and video samples can be obtained, multiple image samples and multiple sound samples matching the multiple image samples can be captured from the audio and video samples.
  • the multiple image features and the multiple sound features are fused to obtain a fusion feature, the fusion feature is input into the model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized are input into the emotion detection hybrid model to obtain an emotion recognition result.
  • the emotion detection hybrid model obtained by training on both image features and sound features can compensate for an emotion detection model trained on images alone or on sound alone; it can better predict emotions, improve the accuracy of the overall emotion prediction, and also increase the robustness of the model.
  • Fig. 1 is a flowchart of a preferred embodiment of an emotion detection method disclosed in the present invention.
  • Fig. 2 is a schematic diagram of the training process of an image-based emotion recognition model disclosed in the present invention.
  • Fig. 3 is a schematic diagram of the training process of a voice-based emotion recognition model disclosed in the present invention.
  • Fig. 4 is a schematic diagram of the training process of a mixed emotion detection model disclosed in the present invention.
  • Fig. 5 is a functional block diagram of a preferred embodiment of an emotion detection device disclosed in the present invention.
  • Fig. 6 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the emotion detection method of the present invention.
  • the emotion detection method of the embodiment of the present invention is applied to an electronic device, and can also be applied to a hardware environment composed of an electronic device and a server connected to the electronic device through a network, and is executed by the server and the electronic device.
  • Networks include but are not limited to: wide area network, metropolitan area network or local area network.
  • Fig. 1 is a flowchart of a preferred embodiment of an emotion detection method disclosed in the present invention. Among them, according to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the electronic device obtains multiple audio and video samples.
  • the audio and video samples include audio information and video information samples, and may include, but are not limited to, audio and video samples extracted from multimedia files, other public audio and video data sets, or user-collected and labeled audio and video samples.
  • the electronic device captures multiple image samples and multiple sound samples matching the multiple image samples from the audio and video samples.
  • each image sample has a sound sample that matches it, which is helpful for subsequently combining images and sound to build the emotion detection hybrid model.
  • the electronic device inputs a plurality of said image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputs a plurality of said sound samples into a voice-based emotion recognition model to obtain a plurality of sound features.
  • the image-based emotion recognition model may be a DenseNet (densely connected convolutional network) model, or it may be a ResNet (deep residual network) model.
  • the method further includes:
  • the training samples are input into the densely connected convolutional network DenseNet model for training to obtain an image-based emotion recognition model.
  • the face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures self-annotated from data captured in search engines using crawler technology.
  • the face pictures are sorted and labeled, and each face picture is assigned a corresponding emotion label, where the emotion label represents the emotion presented by the face picture, and different emotion labels represent different facial emotions.
  • the format of the face picture may include, but is not limited to, jpg, png, and jpeg.
  • the emotions represented by face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust and calm, and each facial expression represents only one type of emotion.
  • the face pictures can be preprocessed, for example with normalization and pooling operations, which speeds up the subsequent model training and removes redundant information from the face pictures; pictures of a uniform specification are obtained as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
  • an initial parameter can be assigned to the weights and biases of each network layer of the DenseNet model, that is, the DenseNet model is initialized to obtain an initial model; the training samples are input into the initial model and normalized, the output parameters of each network layer of the initial model are calculated, and the forward output of the initial model is obtained.
  • the initial parameters of each network layer in the initial model are then adjusted using the back-propagation algorithm to obtain the emotion recognition model.
  • how to calculate the output parameters of each network layer of the initial model belongs to the prior art, and will not be repeated here.
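As a rough illustration of the training procedure just described (initialize the DenseNet parameters, compute the forward output on normalized training samples, then adjust the parameters by back-propagation), the following Python sketch uses PyTorch and torchvision; the patent does not prescribe a framework, and the dataset loader, label set and hyper-parameters below are assumptions.

```python
# Hedged sketch only, not the patent's reference implementation.
import torch
import torch.nn as nn
from torchvision import models

EMOTIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust", "calm"]

model = models.densenet121(num_classes=len(EMOTIONS))      # initialized weights and biases
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_epoch(loader):
    """`loader` yields (normalized face-picture batch, emotion-label batch)."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        logits = model(images)              # forward output of the initial model
        loss = criterion(logits, labels)
        loss.backward()                     # back-propagation adjusts the layer parameters
        optimizer.step()
```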
  • the method further includes:
  • the training samples are input into a deep residual network ResNet model for training, and an image-based emotion recognition model is obtained.
  • the face picture may include, but is not limited to, an image captured from a video, it may also be a collection of other public data, or a face picture self-annotated using crawler technology to capture data on a search engine.
  • the emotions represented by face pictures include but are not limited to emotions such as happiness, sadness, fear, anger, surprise, disgust, and calm.
  • Each face expression represents only one type of emotion, and the emotion label corresponding to the face picture is assigned.
  • the emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions.
  • the preprocessing process of the picture refers to the processing of transforming the size, grayscale, and shape of the picture to form a unified specification, so that subsequent picture processing is more efficient.
  • the picture preprocessing can also consist of normalization and pooling operations on the pictures, which remove redundant information from the face pictures; pictures of a uniform specification are obtained as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
  • the training samples can be input into the constructed deep residual network model and the features of the training samples extracted; according to the emotion labels carried by the training samples, the preset weights and preset thresholds in each layer of neurons are adjusted so that the classification result obtained from the features matches the emotion label corresponding to the training sample; the preset model parameters of the deep residual network model are updated to obtain the emotion recognition model.
  • the preset model parameters include preset weights and preset thresholds.
  • the preset model parameters are the model parameters used to calculate the features of the training samples.
  • the deep residual network model is trained using the training samples to obtain appropriate network parameters and prevent model overfitting.
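A minimal sketch of the deep residual network variant, assuming torchvision's ResNet-18 as the backbone; only the backbone differs from the DenseNet sketch above, and the label set and training loop can stay the same.

```python
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(num_classes=7)     # train from scratch on the face pictures
# or start from a pretrained backbone and replace only the final layer:
# resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# resnet.fc = nn.Linear(resnet.fc.in_features, 7)
```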
  • Input represents an input image sample.
  • the image samples may include, but are not limited to, images extracted from multimedia files, other public image data sets, or images collected by users themselves, and the network layers are identified by Layer.
  • the output result (Result) can be the 7 major emotions, such as happiness, surprise, fear, disgust, anger, sadness and calm; it can also be the prediction result for facial action units (AU), or the positive or negative direction and the degree of arousal under the arousal (excitation) theory.
  • the method further includes:
  • the MFCC features are trained to obtain a voice-based emotion recognition model.
  • the audio file samples may include, but are not limited to, audio files extracted from multimedia files, and also include other public audio data sets or audio collected by the user.
  • the audio file samples can first be pre-emphasized, framed and windowed; then, for each short-time analysis window, the corresponding frequency spectrum is obtained through the fast Fourier transform (FFT), and the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis is performed on the Mel spectrum (taking the logarithm and applying the inverse transform) to obtain the Mel frequency cepstral coefficients, which form the MFCC feature of each speech frame.
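The MFCC pipeline above (pre-emphasis, framing and windowing, FFT, Mel filter bank, logarithm and inverse transform) can be sketched with librosa; the patent does not name a library, and the file path, sampling rate and frame parameters below are assumptions.

```python
import librosa

y, sr = librosa.load("sample.wav", sr=16000)           # an audio file sample
y = librosa.effects.preemphasis(y)                     # pre-emphasis
# framing, windowing, FFT, Mel filtering, log and the inverse (DCT) step
# are all performed inside librosa.feature.mfcc:
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)                                      # (13, n_frames): one MFCC vector per frame
```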
  • FIG. 3 is a schematic diagram of the training process of a voice-based emotion recognition model disclosed in the present invention.
  • h0, h1,...hn are the output results obtained through two LSTMs (Long Short Term Memory Network), and a0, a1...an and c0, c1...cn are all coefficients.
  • the h values (h0, h1, ..., hn) are obtained after the two LSTM layers, and the audio features are obtained after applying the coefficients (a0, a1, ..., an and c0, c1, ..., cn).
  • the audio features are then passed to a softmax regression classification layer, which produces the audio-based prediction result, i.e., the output result.
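The structure of Fig. 3 (two stacked LSTM layers producing h0...hn, coefficients that weight them into an audio feature, and a softmax regression layer giving the audio-based prediction) could look roughly like the following PyTorch sketch; layer sizes and the exact attention form are assumptions.

```python
import torch
import torch.nn as nn

class AudioEmotionNet(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_emotions=7):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.attn = nn.Linear(hidden, 1)              # produces the a0...an style coefficients
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, x):                             # x: (batch, frames, n_mfcc)
        h, _ = self.lstm(x)                           # h: (batch, frames, hidden) = h0...hn
        a = torch.softmax(self.attn(h), dim=1)        # attention weights over the frames
        audio_feature = (a * h).sum(dim=1)            # weighted sum of the h values
        return self.classifier(audio_feature)         # softmax applied at the loss / inference step
```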
  • the inputting a plurality of the image samples into an image-based emotion recognition model to obtain a plurality of image features includes:
  • Input a plurality of said image samples into an image-based emotion recognition model to obtain a plurality of facial action unit AU detection network features; determine a plurality of said AU detection network features as a plurality of image features; or
  • a plurality of the image samples are input into an image-based emotion recognition model to obtain a plurality of emotion detection network features; the plurality of emotion detection network features are determined as a plurality of image features.
  • the face action unit (Action Units, AU) is proposed to analyze the facial muscle movement. Facial expressions can be recognized by AU.
  • AU refers to the basic muscle action units of the human face, such as: raised inner eyebrows, raised mouth corners, and wrinkled nose.
  • the overall neural network structure used to obtain the AU detection network features is as follows: a plurality of the image samples are fed to an input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution blocks with different parameters (conv_x blocks), and finally through a pooling layer, a fully connected layer (Full Connect) and a sigmoid layer to obtain the AU prediction result.
  • the AU detection network feature is the value of the middle layer of the entire neural network structure.
  • inputting a plurality of the image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features works as follows: a plurality of the image samples are fed to an input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution blocks with different parameters (conv_x blocks), and finally through a pooling layer and a fully connected layer (Full Connect); classification prediction is performed by the regression classification layer (Softmax), and the prediction result for the face picture is obtained from the output layer (Result).
  • the emotion detection network characteristic is the value of the middle layer of the entire neural network structure.
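The two image branches described above share the same backbone (input, one convolution and pooling layer, four groups of convolution blocks with different parameters, then pooling and a fully connected layer) and differ only in the head: a sigmoid head for AU prediction or a softmax head for emotion prediction. In the hedged PyTorch sketch below, the intermediate value `feat` stands in for the AU detection network feature or emotion detection network feature; channel widths and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class FaceBranch(nn.Module):
    def __init__(self, n_aus=17, n_emotions=7):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 7, stride=2, padding=3),    # Conv_
                                  nn.ReLU(inplace=True), nn.MaxPool2d(2))      # Pool
        self.blocks = nn.Sequential(conv_block(32, 64), conv_block(64, 128),   # 4 conv_x groups
                                    conv_block(128, 256), conv_block(256, 512))
        self.pool = nn.AdaptiveAvgPool2d(1)                                    # Pool
        self.fc = nn.Linear(512, 256)                                          # Full Connect
        self.au_head = nn.Linear(256, n_aus)                                   # -> sigmoid (AU result)
        self.emotion_head = nn.Linear(256, n_emotions)                         # -> softmax (Result)

    def forward(self, x):
        feat = self.fc(self.pool(self.blocks(self.stem(x))).flatten(1))        # middle-layer value
        return torch.sigmoid(self.au_head(feat)), self.emotion_head(feat), feat
```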
  • the inputting a plurality of the sound samples into a sound-based emotion recognition model to obtain a plurality of sound features includes:
  • Input a plurality of said sound samples into a voice-based emotion recognition model to obtain a plurality of Mel frequency cepstral coefficients; determine the plurality of Mel frequency cepstral coefficients as a plurality of sound features; or
  • the power spectrum is short for the power spectral density function
  • the power spectrum is the signal power in the unit frequency band. It shows the change of signal power with frequency, that is, the distribution of signal power in the frequency domain.
  • the original audio undergoes a series of processing steps, and after the FFT a sequence of complex values over time is obtained.
  • the squared magnitude of these complex values gives the sequence of power over time, that is, the power spectrum.
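A small numpy sketch of that computation: the FFT of a windowed frame gives a complex sequence, and the squared magnitude of those complex values is the power spectrum. The frame and FFT sizes are example values.

```python
import numpy as np

def power_spectrum(frame, n_fft=512):
    spectrum = np.fft.rfft(frame, n=n_fft)        # complex sequence produced by the FFT
    return (np.abs(spectrum) ** 2) / n_fft        # squared magnitude = power per frequency bin
```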
  • the electronic device fuses the multiple image features and the multiple voice features to obtain a fusion feature.
  • the fusing the multiple image features and the multiple voice features to obtain the fusion feature includes:
  • the multiple image features and multiple voice features are spliced by feature values in a splicing layer to obtain a fusion feature.
  • multiple image features and multiple voice features obtained through individual recognition are spliced in the concat layer (ie, splicing layer), that is, fusion is performed, and the fusion feature is obtained.
  • the sound features and image features are spliced in the concat layer to obtain the fusion feature.
  • the function of the concat layer is to splice two or more feature maps in the channel (channel) or num (number) dimensions.
  • the first feature map and the second feature map are spliced in the channel dimension.
  • the other dimensions must be the same (that is, num, H, and W are consistent).
  • the output of the concat layer can be expressed as: N*(k1+k2)*H*W.
  • H is the height of the picture
  • W is the width of the picture
  • k1 is the number of channels in the first feature map
  • k2 is the number of channels in the second feature map.
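For example, with PyTorch's tensor concatenation (the shapes below are example values, not taken from the patent):

```python
import torch

image_feat = torch.randn(8, 256, 7, 7)                # N=8, k1=256, H=W=7
sound_feat = torch.randn(8, 128, 7, 7)                # N=8, k2=128, same N, H and W
fused = torch.cat([image_feat, sound_feat], dim=1)    # splice along the channel dimension
print(fused.shape)                                    # torch.Size([8, 384, 7, 7]) = N*(k1+k2)*H*W
```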
  • the electronic device inputs the fusion feature into the model to be trained for training, and obtains a mixed emotion detection model.
  • the fusion features are input into the model to be trained; using a network structure of two LSTM (Long Short-Term Memory) layers together with an attention mechanism, training yields the emotion detection hybrid model, which predicts emotions on the basis of the excitation (arousal) theory.
  • FIG. 4 is a schematic diagram of a training process of a mixed emotion detection model disclosed in the present invention.
  • the audio sub-frames and pictures are identified first to obtain sound features and image features, and then the acquired sound features and image features are stitched through the concatenation layer (Concate) to obtain the fusion features.
  • then, through a fully connected layer (Full Connect), classification prediction is performed by the regression classification layer (Softmax), and the emotion classification result is obtained from the output layer (Result).
  • the parameters before the concate layer are not updated, only the parameters after the concate layer need to be updated, and training is continued to obtain a mixed emotion detection model.
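A sketch of that training strategy, reusing the FaceBranch and AudioEmotionNet sketches above as the pre-trained branches; the HybridModel structure, attribute names and layer sizes are hypothetical, and its forward pass is omitted.

```python
import torch
import torch.nn as nn

class HybridModel(nn.Module):
    """Placeholder structure: pre-trained branches before the concat layer,
    trainable layers after it (forward pass omitted in this sketch)."""
    def __init__(self, image_branch, sound_branch, n_emotions=7):
        super().__init__()
        self.image_branch = image_branch                 # pre-trained, frozen below
        self.sound_branch = sound_branch                 # pre-trained, frozen below
        self.post_concat = nn.LSTM(384, 128, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(128, n_emotions)

hybrid_model = HybridModel(image_branch=FaceBranch(), sound_branch=AudioEmotionNet())

# parameters before the concat layer are not updated
for branch in (hybrid_model.image_branch, hybrid_model.sound_branch):
    for p in branch.parameters():
        p.requires_grad = False

# only the layers after the concat layer keep being trained
optimizer = torch.optim.Adam((p for p in hybrid_model.parameters() if p.requires_grad), lr=1e-4)
```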
  • the electronic device inputs the processed audio and video to be recognized into the emotion detection mixed model to obtain an emotion recognition result.
  • the audio and video to be recognized that require emotion detection can be obtained first and processed to remove redundant information; the processed audio and video to be recognized are then used as the input of the emotion detection hybrid model, and the features of the audio and video to be recognized are calculated according to the updated weights and biases of each network layer to obtain a probability value for each of the preset number of emotions; according to these probability values, the emotion corresponding to the emotion category with the highest probability is taken as the emotion in the audio and video to be recognized, and the recognition result is output.
  • a probability result of a preset number of emotions is obtained, and the preset number of emotions is the same as the total amount of emotion labels of the training sample.
  • the preset number can be set to 7, which means that the face pictures are classified into 7 emotions such as happiness, sadness, fear, anger, surprise, disgust and calm; alternatively, the preset number can be set to 8, which means that the face pictures are classified into 8 emotions in total.
  • the specific emotions can be set according to the needs of the actual application, and there is no restriction here.
  • using the emotion detection hybrid model to perform emotion prediction can improve the accuracy of the entire emotion prediction, and at the same time, it also increases the robustness of the model.
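For a single processed audio-video clip, the recognition step described above reduces to taking the emotion category with the highest probability; the sketch below assumes a 7-emotion label set and a `hybrid_model` callable on preprocessed image frames and audio features (both are assumptions).

```python
import torch

EMOTIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust", "calm"]

def recognize(hybrid_model, image_frames, audio_features):
    with torch.no_grad():
        logits = hybrid_model(image_frames, audio_features)   # one score per preset emotion
        probs = torch.softmax(logits, dim=-1)                 # probability value of each emotion
    return EMOTIONS[int(probs.argmax(dim=-1))]                # category with the highest probability
```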
  • the method further includes:
  • if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belong is a negative emotion, obtain the user terminal of the user;
  • the prompt information carrying the emotion recognition result is sent to the user terminal, and the prompt information is used to prompt the user to pay attention to adjusting emotions.
  • the negative emotions are generally negative energy emotions, such as sadness, fear, anger, disgust, etc.
  • if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belong is a negative emotion, it indicates that the user's emotion is relatively extreme; in order to let the user better understand his or her emotion and to prevent the user from subsequently acting aggressively under that emotion, the user terminal of the user can be obtained, and prompt information carrying the emotion recognition result can be sent to the user terminal, so that the user pays attention to adjusting his or her emotions.
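A trivial sketch of that follow-up step; the negative-emotion set, the terminal lookup and the sending function are placeholders rather than an API defined by the patent.

```python
NEGATIVE_EMOTIONS = {"sadness", "fear", "anger", "disgust"}

def notify_if_negative(user_id, emotion, get_user_terminal, send_to_terminal):
    if emotion in NEGATIVE_EMOTIONS:
        terminal = get_user_terminal(user_id)                  # obtain the user's terminal
        send_to_terminal(terminal, f"Emotion recognition result: {emotion}. "
                                   "Please pay attention to adjusting your emotions.")
```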
  • multiple audio and video samples can be obtained, and multiple image samples and multiple sound samples matching the multiple image samples can be captured from the audio and video samples.
  • a plurality of the image samples can be input into an image-based emotion recognition model to obtain multiple image features
  • a plurality of the sound samples can be input into a voice-based emotion recognition model to obtain multiple voice features
  • the multiple image features and the multiple sound features are fused to obtain a fusion feature, the fusion feature is input into the model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized are input into the emotion detection hybrid model to obtain an emotion recognition result.
  • the emotion detection hybrid model obtained by training on both image features and sound features can compensate for an emotion detection model trained on images alone or on sound alone; it can better predict emotions, improve the accuracy of the overall emotion prediction, and also increase the robustness of the model.
  • Fig. 5 is a functional block diagram of a preferred embodiment of an emotion detection device disclosed in the present invention.
  • the emotion detection device runs in an electronic device.
  • the emotion detection device may include multiple functional modules composed of program code segments.
  • the instruction code of each program segment in the emotion detection device can be stored in a memory and executed by at least one processor to perform part or all of the steps in the emotion detection method described in FIG. 1.
  • the emotion detection device can be divided into multiple functional modules according to the functions it performs.
  • the functional modules may include: an acquisition module 501, a capture module 502, an input module 503, a fusion module 504, and a training module 505.
  • the module referred to in the present invention refers to a series of computer-readable instruction segments that can be executed by at least one processor and can complete fixed functions, and are stored in a memory. In some embodiments, the functions of each module will be detailed in subsequent embodiments.
  • the obtaining module 501 is used to obtain multiple audio and video samples
  • the audio and video samples include audio information and video information samples, and may include, but are not limited to, audio and video samples extracted from multimedia files, other public audio and video data sets, or user-collected and labeled audio and video samples.
  • the grabbing module 502 is configured to grab multiple image samples and multiple sound samples matching the multiple image samples from the audio and video samples;
  • each image sample has a sound sample that matches it, which is helpful for subsequently combining images and sound to build the emotion detection hybrid model.
  • the input module 503 is configured to input a plurality of the image samples into an image-based emotion recognition model to obtain multiple image features, and input a plurality of the sound samples into the voice-based emotion recognition model to obtain multiple voice features;
  • the image-based emotion recognition model may be a DenseNet (densely connected convolutional network) model, or it may be a ResNet (deep residual network) model.
  • the fusion module 504 is configured to merge the multiple image features and the multiple voice features to obtain a fusion feature
  • the fusion module 504 fuses the multiple image features and the multiple voice features to obtain the fusion feature includes:
  • the multiple image features and multiple voice features are spliced by feature values in a splicing layer to obtain a fusion feature.
  • multiple image features and multiple voice features obtained through individual recognition are spliced in the concat layer (ie, splicing layer), that is, fusion is performed, and the fusion feature is obtained.
  • the sound features and image features are spliced in the concat layer to obtain the fusion feature.
  • the function of the concat layer is to splice two or more feature maps in the channel (channel) or num (number) dimension.
  • the first feature map and the second feature map are spliced in the channel dimension.
  • the other dimensions must be the same (that is, num, H, and W are consistent).
  • the output of the concat layer can be expressed as: N*(k1+k2)*H*W.
  • H is the height of the picture
  • W is the width of the picture
  • k1 is the number of channels in the first feature map
  • k2 is the number of channels in the second feature map.
  • the training module 505 is configured to input the fusion features into the model to be trained for training to obtain a mixed emotion detection model
  • the fusion features are input into the model to be trained; using a network structure of two LSTM (Long Short-Term Memory) layers together with an attention mechanism, training yields the emotion detection hybrid model, which predicts emotions on the basis of the excitation (arousal) theory.
  • the input module 503 is further configured to input the processed audio and video to be recognized into the emotion detection mixture model to obtain emotion recognition results.
  • the audio and video to be recognized are used as the input of the emotion detection hybrid model, and the features of the audio and video to be recognized are calculated according to the updated weights and biases of each network layer to obtain a probability value for each of the preset number of emotions; according to these probability values, the emotion corresponding to the emotion category with the highest probability is taken as the emotion in the audio and video to be recognized, and the recognition result is output.
  • a probability result of a preset number of emotions is obtained, and the preset number of emotions is the same as the total amount of emotion labels of the training sample.
  • the preset number can be set to 7, which means that the face pictures are classified into 7 emotions such as happiness, sadness, fear, anger, surprise, disgust and calm; alternatively, the preset number can be set to 8, which means that the face pictures are classified into 8 emotions in total.
  • the specific emotions can be set according to the needs of the actual application, and there is no restriction here.
  • using the emotion detection hybrid model to perform emotion prediction can improve the accuracy of the entire emotion prediction, and at the same time, it also increases the robustness of the model.
  • the acquiring module 501 is also used to acquire a face picture
  • the emotion detection device further includes:
  • a preprocessing module for preprocessing the face picture to obtain training samples
  • the input module 503 is also used to input the training samples into the densely connected convolutional network DenseNet model for training, to obtain an image-based emotion recognition model.
  • the face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures self-annotated from data captured in search engines using crawler technology.
  • the face pictures are sorted and labeled, and each face picture is assigned a corresponding emotion label, where the emotion label represents the emotion presented by the face picture, and different emotion labels represent different facial emotions.
  • the format of the face picture may include, but is not limited to, jpg, png, and jpeg.
  • the emotions represented by face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust and calm, and each facial expression represents only one type of emotion.
  • the face pictures can be preprocessed, for example with normalization and pooling operations, which speeds up the subsequent model training and removes redundant information from the face pictures; pictures of a uniform specification are obtained as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
  • an initial parameter can be assigned to the weights and biases of each network layer of the DenseNet model, that is, the DenseNet model is initialized to obtain an initial model; the training samples are input into the initial model and normalized, the output parameters of each network layer of the initial model are calculated, and the forward output of the initial model is obtained.
  • the initial parameters of each network layer in the initial model are then adjusted using the back-propagation algorithm to obtain the emotion recognition model.
  • how to calculate the output parameters of each network layer of the initial model belongs to the prior art, and will not be repeated here.
  • the acquiring module 501 is also used to acquire a face picture
  • the preprocessing module is also used to preprocess the face picture to obtain training samples
  • the input module 503 is also used to input the training samples into a deep residual network ResNet model for training, to obtain an image-based emotion recognition model.
  • the face picture may include, but is not limited to, an image captured from a video, it may also be a collection of other public data, or a face picture self-annotated using crawler technology to capture data on a search engine.
  • the emotions represented by face pictures include but are not limited to emotions such as happiness, sadness, fear, anger, surprise, disgust, and calm.
  • Each face expression represents only one type of emotion, and the emotion label corresponding to the face picture is assigned.
  • the emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions.
  • the preprocessing process of the picture refers to the processing of transforming the size, grayscale, and shape of the picture to form a unified specification, so that subsequent picture processing is more efficient.
  • the picture preprocessing can also consist of normalization and pooling operations on the pictures, which remove redundant information from the face pictures; pictures of a uniform specification are obtained as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
  • the training samples can be input into the constructed deep residual network model and the features of the training samples extracted; according to the emotion labels carried by the training samples, the preset weights and preset thresholds in each layer of neurons are adjusted so that the classification result obtained from the features matches the emotion label corresponding to the training sample; the preset model parameters of the deep residual network model are updated to obtain the emotion recognition model.
  • the preset model parameters include preset weights and preset thresholds.
  • the preset model parameters are the model parameters used to calculate the features of the training samples.
  • the deep residual network model is trained using the training samples to obtain appropriate network parameters and prevent model overfitting.
  • the acquiring module 501 is further configured to acquire audio file samples
  • the emotion detection device further includes:
  • the calculation module is used to calculate the Mel frequency cepstrum coefficient of the audio file samples to obtain the Mel frequency cepstrum coefficient MFCC feature;
  • the training module 505 is also used to train the MFCC features to obtain a voice-based emotion recognition model.
  • the audio file samples may include, but are not limited to, audio files extracted from multimedia files, and also include other public audio data sets or audio collected by the user.
  • the audio file samples can first be pre-emphasized, framed and windowed; then, for each short-time analysis window, the corresponding frequency spectrum is obtained through the fast Fourier transform (FFT), and the spectrum is passed through a Mel filter bank to obtain the Mel spectrum.
  • cepstral analysis is performed on the Mel spectrum (taking the logarithm and applying the inverse transform) to obtain the Mel frequency cepstral coefficients; this MFCC is the feature of this frame of speech, that is, the MFCC feature.
  • the MFCC feature can be trained to obtain a voice-based emotion recognition model.
  • the input module 503 inputs a plurality of the image samples into an image-based emotion recognition model, and the specific method for obtaining a plurality of image features is:
  • Input a plurality of said image samples into an image-based emotion recognition model to obtain a plurality of facial action unit AU detection network features; determine a plurality of said AU detection network features as a plurality of image features; or
  • a plurality of the image samples are input into an image-based emotion recognition model to obtain a plurality of emotion detection network features; the plurality of emotion detection network features are determined as a plurality of image features.
  • the face action unit (Action Units, AU) is proposed to analyze the facial muscle movement. Facial expressions can be recognized by AU.
  • AU refers to the basic muscle action units of the human face, such as: raised inner eyebrows, raised mouth corners, and wrinkled nose.
  • the overall neural network structure used to obtain the AU detection network features is as follows: a plurality of the image samples are fed to an input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution blocks with different parameters (conv_x blocks), and finally through a pooling layer, a fully connected layer (Full Connect) and a sigmoid layer to obtain the AU prediction result.
  • the AU detection network feature is the value of the middle layer of the entire neural network structure.
  • inputting a plurality of the image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features works as follows: a plurality of the image samples are fed to an input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution blocks with different parameters (conv_x blocks), and finally through a pooling layer and a fully connected layer (Full Connect); classification prediction is performed by the regression classification layer (Softmax), and the prediction result for the face picture is obtained from the output layer (Result).
  • the emotion detection network characteristic is the value of the middle layer of the entire neural network structure.
  • the input module 503 inputs a plurality of the sound samples into a sound-based emotion recognition model, and the specific method for obtaining a plurality of sound features is:
  • Input a plurality of said sound samples into a voice-based emotion recognition model to obtain a plurality of Mel frequency cepstral coefficients; determine the plurality of Mel frequency cepstral coefficients as a plurality of sound features; or
  • the power spectrum is short for the power spectral density function
  • the power spectrum is the signal power in the unit frequency band. It shows the change of signal power with frequency, that is, the distribution of signal power in the frequency domain.
  • the original audio undergoes a series of processing steps, and after the FFT a sequence of complex values over time is obtained.
  • the squared magnitude of these complex values gives the sequence of power over time, that is, the power spectrum.
  • the obtaining module 501 is further configured to obtain the user terminal of the user if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belongs is a negative emotion;
  • the emotion detection device further includes:
  • the sending module is configured to send prompt information carrying the emotion recognition result to the user terminal, and the prompt information is used to prompt the user to pay attention to adjusting emotions.
  • the negative emotions are generally negative energy emotions, such as sadness, fear, anger, disgust, etc.
  • if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belong is a negative emotion, it indicates that the user's emotion is relatively extreme; in order to let the user better understand his or her emotion and to prevent the user from subsequently acting aggressively under that emotion, the user terminal of the user can be obtained, and prompt information carrying the emotion recognition result can be sent to the user terminal, so that the user pays attention to adjusting his or her emotions.
  • multiple audio and video samples can be obtained, and multiple image samples and multiple sound samples matching the multiple image samples can be captured from the audio and video samples.
  • a plurality of the image samples can be input into an image-based emotion recognition model to obtain multiple image features
  • a plurality of the sound samples can be input into a voice-based emotion recognition model to obtain multiple voice features
  • the multiple image features and the multiple sound features are fused to obtain a fusion feature, the fusion feature is input into the model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized are input into the emotion detection hybrid model to obtain an emotion recognition result.
  • the emotion detection hybrid model obtained by training on both image features and sound features can compensate for an emotion detection model trained on images alone or on sound alone; it can better predict emotions, improve the accuracy of the overall emotion prediction, and also increase the robustness of the model.
  • FIG. 6 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the emotion detection method of the present invention.
  • the electronic device 6 includes a memory 61, at least one processor 62, computer readable instructions 63 stored in the memory 61 and executable on the at least one processor 62, and at least one communication bus 64.
  • FIG. 6 is only an example of the electronic device 6 and does not constitute a limitation on the electronic device 6; it may include more or fewer components than those shown in the figure, a combination of certain components, or different components; for example, the electronic device 6 may also include input and output devices, network access devices, and so on.
  • the electronic device 6 also includes, but is not limited to, any electronic product that can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, Personal digital assistants (Personal Digital Assistant, PDA), game consoles, interactive network television (Internet Protocol Television, IPTV), smart wearable devices, etc.
  • the network where the electronic device 6 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), etc.
  • the at least one processor 62 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the processor 62 can be a microprocessor or the processor 62 can also be any conventional processor, etc.
  • the processor 62 is the control center of the electronic device 6 and connects the various parts of the entire electronic device 6 through various interfaces and lines.
  • the memory 61 may be used to store the computer-readable instructions 63 and/or modules/units, and the processor 62 runs or executes the computer-readable instructions and/or modules/units stored in the memory 61 and calls the data stored in the memory 61 to realize various functions of the electronic device 6.
  • the memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data created according to the use of the electronic device 6 (such as audio data) and the like.
  • the memory 61 in the electronic device 6 stores multiple instructions to implement an emotion detection method, and the processor 62 can execute the multiple instructions to achieve:
  • multiple audio and video samples can be acquired, and multiple image samples and multiple sound samples matching the multiple image samples can be captured from the audio and video samples; further, a plurality of the image samples can be input into an image-based emotion recognition model to obtain multiple image features, and a plurality of the sound samples can be input into a sound-based emotion recognition model to obtain multiple sound features; finally, the multiple image features and the multiple sound features are fused to obtain a fusion feature, the fusion feature is input into the model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized are input into the emotion detection hybrid model to obtain an emotion recognition result.
  • the emotion detection hybrid model obtained by training on both image features and sound features can compensate for an emotion detection model trained on images alone or on sound alone; it can better predict emotions, improve the accuracy of the overall emotion prediction, and also increase the robustness of the model.
  • if the integrated module/unit of the electronic device 6 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile readable storage medium.
  • the present invention implements all or part of the processes in the above-mentioned embodiment methods, and can also be completed by instructing relevant hardware through a computer program.
  • the computer program can be stored in a non-volatile readable storage medium.
  • the computer program includes computer readable instruction code
  • the computer readable instruction code may be in the form of source code, object code, executable file, or some intermediate form.
  • the non-volatile readable medium may include: any entity or device capable of carrying the computer-readable instruction code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, read-only memory (ROM, Read-Only Memory) etc.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Image Analysis (AREA)

Abstract

An emotion detection method. Said method comprises: acquiring a plurality of audio and video samples; extracting, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples; inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features; fusing the plurality of image features and the plurality of sound features to obtain fused features; inputting the fused features into a model to be trained, for training, so as to obtain an emotion detection hybrid model; and inputting the processed audio and video to be recognized into the emotion detection hybrid model, to obtain an emotion recognition result. The present invention further provides an emotion detection apparatus, an electronic device and a storage medium. The present invention improves the accuracy of the whole emotion prediction, and also increases the robustness of a model.

Description

情绪检测方法、装置、电子设备及存储介质Emotion detection method, device, electronic equipment and storage medium
本申请要求于2019年06月14日提交中国专利局,申请号为201910518131.1发明名称为“情绪检测方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims to be submitted to the Chinese Patent Office on June 14, 2019. The application number is 201910518131.1. The priority of the Chinese patent application with the title of "emotion detection method, device, electronic equipment and storage medium", the entire content of which is incorporated by reference In this application.
技术领域Technical field
本发明涉及人工智能技术领域,尤其涉及一种情绪检测方法、装置、电子设备及存储介质。The present invention relates to the field of artificial intelligence technology, and in particular to an emotion detection method, device, electronic equipment and storage medium.
Background
Emotion is a state that integrates a person's feelings, thoughts, and behaviors. It includes a person's psychological responses to external or internal stimuli, and emotions produce significant physiological and behavioral changes. Facial expression is an important aspect of the outward behavior of emotion: changes in the eyes, eyebrows, or mouth best reveal a person's emotion, so a person's emotion can be judged by recognizing and analyzing facial expressions. In a particular emotional state, people produce specific facial muscle movements and expression patterns, and emotions can therefore be recognized from the correspondence between expressions and emotions.
Emotion recognition is a key technology in the field of artificial intelligence, and research on emotion recognition has important practical application value for human-computer interaction. Traditional emotion recognition methods generally use the LBP (Local Binary Pattern) method to extract features and then use an SVM (Support Vector Machine) classifier to classify emotions. However, because facial expressions fall into many classes and follow complex patterns, the recognition process in existing facial expression recognition methods is computationally complex, and both the recognition accuracy and the recognition efficiency for facial expressions are low.
Summary of the Invention
In view of the above, it is necessary to provide an emotion detection method that can improve the accuracy of overall emotion prediction while also increasing the robustness of the model.
A first aspect of the present invention provides an emotion detection method, the method comprising:
acquiring a plurality of audio and video samples;
extracting, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
fusing the plurality of image features and the plurality of sound features to obtain a fused feature;
inputting the fused feature into a model to be trained for training, to obtain an emotion detection hybrid model; and
inputting the processed audio and video to be recognized into the emotion detection hybrid model, to obtain an emotion recognition result.
In a possible implementation, before the acquiring of the plurality of audio and video samples, the method further comprises:
acquiring face pictures;
preprocessing the face pictures to obtain training samples; and
inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
In a possible implementation, the fusing of the plurality of image features and the plurality of sound features to obtain the fused feature comprises:
concatenating the feature values of the plurality of image features and the plurality of sound features in a concatenation layer, to obtain the fused feature.
In a possible implementation, before the acquiring of the plurality of audio and video samples, the method further comprises:
acquiring audio file samples;
performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features; and
training on the MFCC features to obtain the sound-based emotion recognition model.
In a possible implementation, the inputting of the plurality of image samples into the image-based emotion recognition model to obtain the plurality of image features comprises:
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
In a possible implementation, the inputting of the plurality of sound samples into the sound-based emotion recognition model to obtain the plurality of sound features comprises:
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
In a possible implementation, the method further comprises:
if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, acquiring the user terminal of the user; and
sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to prompt the user to pay attention to adjusting his or her emotion.
A second aspect of the present invention provides an emotion detection apparatus, the apparatus comprising:
an acquisition module, configured to acquire a plurality of audio and video samples;
an extraction module, configured to extract, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
an input module, configured to input the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and to input the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
a fusion module, configured to fuse the plurality of image features and the plurality of sound features to obtain a fused feature; and
a training module, configured to input the fused feature into a model to be trained for training, to obtain an emotion detection hybrid model;
wherein the input module is further configured to input the processed audio and video to be recognized into the emotion detection hybrid model, to obtain an emotion recognition result.
A third aspect of the present invention provides an electronic device, the electronic device comprising a processor and a memory, the processor being configured to implement the emotion detection method when executing computer-readable instructions stored in the memory.
A fourth aspect of the present invention provides a non-volatile readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions, when executed by a processor, implementing the emotion detection method.
According to the above technical solutions, in the present invention, a plurality of audio and video samples can be acquired, and a plurality of image samples and a plurality of sound samples matching the plurality of image samples can be extracted from the audio and video samples. Further, the plurality of image samples can be input into an image-based emotion recognition model to obtain a plurality of image features, and the plurality of sound samples can be input into a sound-based emotion recognition model to obtain a plurality of sound features. Finally, the plurality of image features and the plurality of sound features can be fused to obtain a fused feature, the fused feature can be input into a model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized can be input into the emotion detection hybrid model to obtain an emotion recognition result. It can be seen that in the present invention, the emotion detection hybrid model obtained by training on both image features and sound features can compensate for the shortcomings of an emotion detection model trained on images alone or on sound alone, which enables better emotion prediction, improves the accuracy of overall emotion prediction, and also increases the robustness of the model.
Description of the Drawings
Fig. 1 is a flowchart of a preferred embodiment of an emotion detection method disclosed in the present invention.
Fig. 2 is a schematic diagram of the training process of an image-based emotion recognition model disclosed in the present invention.
Fig. 3 is a schematic diagram of the training process of a sound-based emotion recognition model disclosed in the present invention.
Fig. 4 is a schematic diagram of the training process of an emotion detection hybrid model disclosed in the present invention.
Fig. 5 is a functional block diagram of a preferred embodiment of an emotion detection apparatus disclosed in the present invention.
Fig. 6 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the emotion detection method of the present invention.
Detailed Description
The emotion detection method of the embodiments of the present invention is applied to an electronic device; it may also be applied to a hardware environment composed of an electronic device and a server connected to the electronic device through a network, in which case it is executed jointly by the server and the electronic device. The network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network.
Please refer to Fig. 1, which is a flowchart of a preferred embodiment of an emotion detection method disclosed in the present invention. According to different needs, the order of the steps in the flowchart may be changed, and some steps may be omitted.
S11. The electronic device acquires a plurality of audio and video samples.
The audio and video samples, i.e., samples containing both audio information and video information, may include but are not limited to audio and video samples extracted from multimedia files, other public audio and video data sets, or audio and video collected and labeled by users.
S12. The electronic device extracts, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples.
The plurality of image samples and the plurality of sound samples used for separate training need to come from the same audio and video sample; in other words, each image sample has a sound sample matching that image sample, which facilitates the subsequent construction of the emotion detection hybrid model from images and sound together.
S13. The electronic device inputs the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputs the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features.
The image-based emotion recognition model may be a DenseNet (Dense Convolutional Network) model, or the image-based emotion recognition model may be a ResNet (deep residual network) model.
As an optional implementation, before step S11, the method further includes:
acquiring face pictures;
preprocessing the face pictures to obtain training samples; and
inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
In this optional implementation, the face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures labeled by the user after crawling data from search engines using crawler technology. By sorting and labeling the face pictures, each face picture is assigned a corresponding emotion label, where an emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions. The format of the face pictures may include, but is not limited to, jpg, png, and jpeg. The emotions represented by the face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust, and calm, and each facial expression represents only one type of emotion.
Specifically, the face pictures can be preprocessed, for example by normalization and pooling operations, which speeds up subsequent model training and, at the same time, removes redundant information from the face pictures and yields pictures of a uniform specification as training samples; this benefits the effectiveness of training, and the resulting image-based emotion recognition model is more accurate.
Specifically, an initial parameter can be assigned to the weights and biases of each network layer of the DenseNet model, i.e., the DenseNet model is initialized to obtain an initial model; the training samples are input into the initial model and normalized, the output parameters of each network layer of the initial model are calculated, and the forward output of the initial model is obtained; according to the forward output, the back-propagation algorithm is used to adjust the initial parameters of each network in the initial model, so as to obtain the emotion recognition model. How to calculate the output parameters of each network layer of the initial model belongs to the prior art and will not be repeated here.
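As an illustrative, non-limiting sketch of the initialization, forward output, and back-propagation steps described above, the following Python code assumes a recent PyTorch/torchvision environment and a 7-class emotion head; the data loader and hyperparameters are hypothetical and not part of the original disclosure.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 7  # happiness, sadness, fear, anger, surprise, disgust, calm

# Initialize a DenseNet model and replace its classifier head for emotion recognition.
model = models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, NUM_EMOTIONS)

criterion = nn.CrossEntropyLoss()                              # loss over emotion labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_epoch(loader):
    """One pass of forward output + back-propagation over normalized face images."""
    model.train()
    for images, labels in loader:        # images: (N, 3, H, W), already normalized
        optimizer.zero_grad()
        logits = model(images)           # forward output of the initial model
        loss = criterion(logits, labels)
        loss.backward()                  # back-propagation adjusts the layer parameters
        optimizer.step()
```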
As an optional implementation, before step S11, the method further includes:
acquiring face pictures;
preprocessing the face pictures to obtain training samples; and
inputting the training samples into a deep residual network (ResNet) model for training, to obtain the image-based emotion recognition model.
The face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures labeled by the user after crawling data from search engines using crawler technology. The emotions represented by the face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust, and calm; each facial expression represents only one type of emotion, and an emotion label corresponding to the face picture is assigned, where the emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions. Optionally, the preprocessing of the pictures refers to transforming the size, grayscale, shape, and so on of the pictures into a uniform specification, so that subsequent picture processing is more efficient. Optionally, the image preprocessing may also include normalization and pooling operations on the images to remove redundant information from the face pictures and obtain pictures of a uniform specification as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate. A sketch of such preprocessing is shown below.
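The preprocessing just described (unifying size, grayscale conversion, and normalization) could look roughly like the following sketch; OpenCV and the 224x224 target size are illustrative assumptions rather than requirements of the disclosure.

```python
import cv2
import numpy as np

def preprocess_face(path, size=(224, 224)):
    """Transform a face picture into a uniform-specification training sample."""
    img = cv2.imread(path)                        # read face picture (jpg/png/jpeg)
    img = cv2.resize(img, size)                   # unify the size
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # grayscale transform
    norm = gray.astype(np.float32) / 255.0        # normalize pixel values to [0, 1]
    return norm
```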
Specifically, the training samples can be input into the constructed deep residual network model to extract the features of the training samples; according to the emotion labels carried by the training samples, the preset weights and preset thresholds in the neurons of each layer are adjusted so that the classification result based on the features matches the emotion label corresponding to the training sample; and the preset model parameters of the deep residual network model are updated to obtain the emotion recognition model. The preset model parameters include preset weights and preset thresholds and are the model parameters used to compute over the features of the training samples; the deep residual network model is trained with the training samples in order to obtain suitable network parameters and prevent over-fitting of the model.
Please also refer to Fig. 2, which is a schematic diagram of the training process of an image-based emotion recognition model disclosed in the present invention. Input represents the input image samples; the image samples may include, but are not limited to, images extracted from multimedia files, as well as other public image data sets or images collected by users themselves. Layer identifies a layer of neural network computation. The output result (Result) may be the 7 major emotions, such as happiness, surprise, fear, disgust, anger, sadness, and calm; it may also be the prediction result of facial action units (AUs), or the emotion valence (positive/negative) and arousal level according to arousal theory.
As an optional implementation, before step S11, the method further includes:
acquiring audio file samples;
performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features; and
training on the MFCC features to obtain the sound-based emotion recognition model.
In this optional implementation, the audio file samples may include, but are not limited to, audio files extracted from multimedia files, as well as other public audio data sets or audio collected by the users themselves. Specifically, the audio file samples can first be pre-emphasized, framed, and windowed; then, for each short-time analysis window, the corresponding spectrum is obtained through a fast Fourier transform (FFT), and the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis is performed on the Mel spectrum (taking the logarithm and applying an inverse transform; in practice the inverse transform is generally realized by a DCT discrete cosine transform, and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients) to obtain the Mel-frequency cepstral coefficients (MFCC). This MFCC is the feature of that frame of speech, i.e., the MFCC feature. Finally, the MFCC features can be trained on to obtain the sound-based emotion recognition model. MFCC stands for Mel Frequency Cepstral Coefficients.
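The MFCC pipeline described above (pre-emphasis, framing, windowing, FFT, Mel filter bank, logarithm, DCT, keeping coefficients 2 to 13) could be sketched as follows, assuming librosa for the Mel filter bank and SciPy for the DCT; the frame length, hop size, and number of Mel filters are illustrative assumptions.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(signal, sr, frame_len=0.025, hop_len=0.010, n_mels=26):
    """Compute frame-wise MFCCs following the steps described in the text."""
    # 1) Pre-emphasis
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2) Framing and 3) windowing (Hamming), assuming the signal spans at least one frame
    frame_size = int(round(frame_len * sr))
    hop_size = int(round(hop_len * sr))
    n_frames = 1 + (len(emphasized) - frame_size) // hop_size
    window = np.hamming(frame_size)
    frames = np.stack([
        emphasized[i * hop_size: i * hop_size + frame_size] * window
        for i in range(n_frames)
    ])

    # 4) FFT -> power spectrum of each short-time analysis window
    n_fft = 512
    power_spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft

    # 5) Mel filter bank, 6) logarithm
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(power_spec @ mel_fb.T + 1e-10)

    # 7) DCT (discrete cosine transform), 8) keep the 2nd to 13th coefficients
    cepstra = dct(log_mel, type=2, axis=1, norm='ortho')
    return cepstra[:, 1:13]
```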
Please also refer to Fig. 3, which is a schematic diagram of the training process of a sound-based emotion recognition model disclosed in the present invention. In Fig. 3, h0, h1, ..., hn are the outputs obtained through two LSTM (Long Short-Term Memory network) layers, and a0, a1, ..., an as well as c0, c1, ..., cn are coefficients. Features are extracted from multiple audio frames; the h values (h0, h1, ..., hn) are obtained after the two LSTM layers, and the audio features are then obtained after applying the coefficients (a0, a1, ..., an and c0, c1, ..., cn); finally, the regression classification layer (Softmax) performs the classification prediction, and the output layer (Result) gives the audio-based prediction result (i.e., the output result).
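A minimal sketch of a two-layer LSTM with attention-style coefficients over the per-frame outputs, in the spirit of Fig. 3, might look like the following; PyTorch, the hidden size, and the exact attention formulation are assumptions made for illustration, not details taken from the figure.

```python
import torch
import torch.nn as nn

class SoundEmotionNet(nn.Module):
    """Two stacked LSTMs over MFCC frames, attention-weighted pooling, softmax head."""
    def __init__(self, n_mfcc=12, hidden=128, n_emotions=7):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.attn = nn.Linear(hidden, 1)           # produces the per-frame coefficients
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, frames):                     # frames: (batch, time, n_mfcc)
        h, _ = self.lstm(frames)                   # h0..hn from the two LSTM layers
        a = torch.softmax(self.attn(h), dim=1)     # coefficients a0..an (sum to 1 over time)
        pooled = (a * h).sum(dim=1)                # weighted sum = audio feature
        return self.classifier(pooled)             # logits; Softmax/Result at prediction time
```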
Specifically, the inputting of the plurality of image samples into the image-based emotion recognition model to obtain the plurality of image features includes:
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
Facial action units (AUs) were proposed for analyzing facial muscle movement. Facial expressions can be recognized through AUs. An AU is a basic muscle action unit of the human face, for example: inner eyebrow raised, mouth corners raised, or nose wrinkled.
When the plurality of image samples are input into the image-based emotion recognition model to obtain the plurality of facial action unit (AU) detection network features, the overall neural network structure is as follows: the plurality of image samples are fed in through the input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution packages with different parameters (conv_x package), and finally through a pooling layer, a fully connected layer (Full connect), and a sigmoid layer to obtain the AU prediction result. The AU detection network features are the values of the intermediate layers of this overall neural network structure.
When the plurality of image samples are input into the image-based emotion recognition model to obtain the plurality of emotion detection network features, the overall neural network structure is as follows: the plurality of image samples are fed in through the input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution packages with different parameters (conv_x package), and finally through a pooling layer and a fully connected layer (Full connect); the regression classification layer (Softmax) performs the classification prediction, and the output layer (Result) gives the prediction result for the face picture. The emotion detection network features are the values of the intermediate layers of this overall neural network structure.
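A compact sketch of that structure, including how an intermediate-layer value can be taken as the image feature, is shown below; the channel counts, the contents of each convolution package, and the use of PyTorch are illustrative assumptions rather than details of the disclosure.

```python
import torch
import torch.nn as nn

def conv_package(c_in, c_out):
    """One of the 4 parameter-varying convolution packages (illustrative contents)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class EmotionCNN(nn.Module):
    def __init__(self, n_outputs=7, use_sigmoid=False):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1),    # Conv_
                                  nn.MaxPool2d(2))                    # Pool
        self.packages = nn.Sequential(conv_package(32, 64),
                                      conv_package(64, 128),
                                      conv_package(128, 256),
                                      conv_package(256, 256))         # 4 conv_x packages
        self.pool = nn.AdaptiveAvgPool2d(1)                           # final Pool
        self.fc = nn.Linear(256, n_outputs)                           # Full connect
        self.head = nn.Sigmoid() if use_sigmoid else nn.Softmax(dim=1)

    def forward(self, x, return_feature=False):
        mid = self.pool(self.packages(self.stem(x))).flatten(1)       # intermediate-layer value
        if return_feature:
            return mid                                                # used as the image feature
        return self.head(self.fc(mid))                                # AU / emotion prediction
```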
Specifically, the inputting of the plurality of sound samples into the sound-based emotion recognition model to obtain the plurality of sound features includes:
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
The power spectrum is short for the power spectral density function; it is the signal power per unit frequency band. It describes how the signal power varies with frequency, i.e., the distribution of the signal power in the frequency domain. After a series of processing steps and an FFT transform, the original audio yields a sequence of complex numbers over time, and the squared modulus of these complex numbers gives the sequence of power over time, i.e., the power spectrum.
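As a small numerical illustration of that definition (FFT of a windowed frame followed by the squared modulus), assuming NumPy:

```python
import numpy as np

def power_spectrum(frame, n_fft=512):
    """Squared modulus of the FFT of one audio frame, i.e., its power spectrum."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)  # complex sequence
    return (np.abs(spectrum) ** 2) / n_fft                           # power per frequency bin
```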
S14. The electronic device fuses the plurality of image features and the plurality of sound features to obtain a fused feature.
Specifically, the fusing of the plurality of image features and the plurality of sound features to obtain the fused feature includes:
concatenating the feature values of the plurality of image features and the plurality of sound features in a concatenation layer, to obtain the fused feature.
The plurality of image features and the plurality of sound features obtained by the separate recognition are concatenated, i.e., fused, in the concat layer (the concatenation layer), yielding the fused feature. As shown in Fig. 4, after the input audio frames yield the sound features and the input pictures yield the image features, the sound features and image features are concatenated in the concat layer to obtain the fused feature.
The role of the concat layer is to concatenate two or more feature maps along the channel or num (number) dimension. For example, to concatenate a first feature map and a second feature map along the channel dimension, all dimensions other than the channel dimension (which may differ) must be the same, i.e., num, H, and W must be consistent; specifically, the k1 channels of the first feature map are added to the k2 channels of the second feature map, and the output of the concat layer can be expressed as N*(k1+k2)*H*W, where H is the height of the picture, W is the width of the picture, k1 is the number of channels of the first feature map, and k2 is the number of channels of the second feature map.
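A quick check of that shape rule, assuming PyTorch tensors with illustrative sizes:

```python
import torch

N, k1, k2, H, W = 4, 64, 32, 28, 28
feat_a = torch.randn(N, k1, H, W)           # first feature map
feat_b = torch.randn(N, k2, H, W)           # second feature map (same N, H, W)

fused = torch.cat([feat_a, feat_b], dim=1)  # concatenate along the channel dimension
assert fused.shape == (N, k1 + k2, H, W)    # output is N*(k1+k2)*H*W
```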
S15. The electronic device inputs the fused feature into the model to be trained for training, to obtain the emotion detection hybrid model.
Specifically, the fused feature is input into the model to be trained, and a network structure consisting of two LSTM (Long Short-Term Memory network) layers and an attention mechanism is used for training, yielding an emotion detection hybrid model that predicts emotion based on arousal theory.
Please also refer to Fig. 4, which is a schematic diagram of the training process of an emotion detection hybrid model disclosed in the present invention. In Fig. 4, the audio frames and the pictures are first recognized separately to obtain the sound features and the image features; the obtained sound features and image features are then concatenated by feature value in the concatenation layer (Concate) to obtain the fused feature, which passes through 2 fully connected layers (full connect); finally, the regression classification layer (Softmax) performs the classification prediction, and the output layer (Result) gives the emotion classification result. The parameters before the concate layer are not updated; only the parameters after the concate layer need to be updated, and with continued training the emotion detection hybrid model is obtained.
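The following sketch illustrates that training arrangement, with the feature extractors before the concatenation frozen and only the post-concatenation layers updated; the feature dimensions, layer sizes, and freezing helper are illustrative assumptions rather than details of the disclosure.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Post-concat part of the hybrid model: 2 fully connected layers + softmax output."""
    def __init__(self, image_dim=256, sound_dim=128, n_emotions=7):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(image_dim + sound_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, n_emotions),
        )

    def forward(self, image_feat, sound_feat):
        fused = torch.cat([image_feat, sound_feat], dim=1)  # concatenation layer
        return self.fc(fused)                                # logits; softmax at prediction time

def freeze(module: nn.Module):
    """Parameters before the concat layer are not updated during hybrid-model training."""
    for p in module.parameters():
        p.requires_grad = False
```

Training would then optimize only the parameters of this head with a classification loss over the emotion labels, leaving the image branch and the sound branch unchanged.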
S16. The electronic device inputs the processed audio and video to be recognized into the emotion detection hybrid model to obtain the emotion recognition result.
The audio and video to be recognized that require emotion detection can first be acquired and processed to remove redundant information; the processed audio and video to be recognized are then used as the input of the emotion detection hybrid model, and the features of the audio and video to be recognized are computed according to the updated weights and biases of each network layer to obtain probability values for a preset number of emotions; according to the probability values of the preset number of emotions, the emotion corresponding to the emotion class with the highest probability is taken as the emotion in the audio and video to be recognized, and the recognition result is output.
Probability results for a preset number of emotions are obtained, the preset number of emotions being the same as the total number of emotion labels of the training samples. Specifically, the preset number may be set to 7, meaning that the face pictures have 7 emotions in total: happiness, sadness, fear, anger, surprise, disgust, and calm; or the preset number may be set to 8, meaning that the face pictures have 8 emotions in total: contempt, happiness, sadness, fear, anger, surprise, disgust, and calm. The number can be set according to the needs of the actual application and is not limited here.
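A minimal sketch of that last step, taking the class with the highest probability from the softmax output (the label list and the trained `model` are hypothetical):

```python
import torch

EMOTIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust", "calm"]

@torch.no_grad()
def recognize(model, image_feat, sound_feat):
    """Return the emotion whose predicted probability is highest."""
    probs = torch.softmax(model(image_feat, sound_feat), dim=1)  # preset number of emotions
    return EMOTIONS[probs.argmax(dim=1).item()]                  # class with highest probability
```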
Using the emotion detection hybrid model for emotion prediction can improve the accuracy of overall emotion prediction and, at the same time, also increases the robustness of the model.
As an optional implementation, the method further includes:
if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, acquiring the user terminal of the user; and
sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to prompt the user to pay attention to adjusting his or her emotion.
In this optional implementation, a negative emotion is generally a negative-energy emotion, such as sadness, fear, anger, or disgust. When the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, this indicates that the user's emotion is rather extreme. In order to let the user better understand his or her own emotion, and also to prevent the user from subsequently acting impulsively under the same emotion, the user terminal of the user can be acquired, and the prompt information carrying the emotion recognition result can be sent to the user terminal, so that the user pays attention to adjusting his or her emotion.
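As a rough sketch of that notification flow (the negative-emotion set, terminal lookup, and `send_to_terminal` transport are hypothetical placeholders rather than APIs from the disclosure):

```python
NEGATIVE_EMOTIONS = {"sadness", "fear", "anger", "disgust"}

def notify_if_negative(user_id, emotion, terminal_lookup, send_to_terminal):
    """If the recognized emotion is negative, prompt the user's terminal to adjust it."""
    if emotion in NEGATIVE_EMOTIONS:
        terminal = terminal_lookup(user_id)              # acquire the user terminal
        send_to_terminal(terminal, {
            "emotion": emotion,
            "message": "A negative emotion was recognized; please pay attention to adjusting it.",
        })
```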
In the method flow described in Fig. 1, a plurality of audio and video samples can be acquired, and a plurality of image samples and a plurality of sound samples matching the plurality of image samples can be extracted from the audio and video samples. Further, the plurality of image samples can be input into an image-based emotion recognition model to obtain a plurality of image features, and the plurality of sound samples can be input into a sound-based emotion recognition model to obtain a plurality of sound features. Finally, the plurality of image features and the plurality of sound features can be fused to obtain a fused feature, the fused feature can be input into the model to be trained for training to obtain the emotion detection hybrid model, and the processed audio and video to be recognized can be input into the emotion detection hybrid model to obtain the emotion recognition result. It can be seen that in the present invention, the emotion detection hybrid model obtained by training on both image features and sound features can compensate for the shortcomings of an emotion detection model trained on images alone or on sound alone, which enables better emotion prediction, improves the accuracy of overall emotion prediction, and also increases the robustness of the model.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Those of ordinary skill in the art may make improvements without departing from the inventive concept of the present invention, and all such improvements fall within the protection scope of the present invention.
Please refer to Fig. 5, which is a functional block diagram of a preferred embodiment of an emotion detection apparatus disclosed in the present invention.
In some embodiments, the emotion detection apparatus runs in an electronic device. The emotion detection apparatus may include a plurality of functional modules composed of program code segments. The instruction code of each program segment in the emotion detection apparatus can be stored in a memory and executed by at least one processor, so as to perform some or all of the steps of the emotion detection method described in Fig. 1.
In this embodiment, the emotion detection apparatus can be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an acquisition module 501, an extraction module 502, an input module 503, a fusion module 504, and a training module 505. A module referred to in the present invention is a series of computer-readable instruction segments that can be executed by at least one processor and can complete a fixed function, and it is stored in a memory. In some embodiments, the functions of each module will be described in detail in subsequent embodiments.
The acquisition module 501 is configured to acquire a plurality of audio and video samples.
The audio and video samples, i.e., samples containing both audio information and video information, may include but are not limited to audio and video samples extracted from multimedia files, other public audio and video data sets, or audio and video collected and labeled by users.
The extraction module 502 is configured to extract, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples.
The plurality of image samples and the plurality of sound samples used for separate training need to come from the same audio and video sample; in other words, each image sample has a sound sample matching that image sample, which facilitates the subsequent construction of the emotion detection hybrid model from images and sound together.
The input module 503 is configured to input the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and to input the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features.
The image-based emotion recognition model may be a DenseNet (Dense Convolutional Network) model, or the image-based emotion recognition model may be a ResNet (deep residual network) model.
The fusion module 504 is configured to fuse the plurality of image features and the plurality of sound features to obtain a fused feature.
Specifically, the fusion module 504 fusing the plurality of image features and the plurality of sound features to obtain the fused feature includes:
concatenating the feature values of the plurality of image features and the plurality of sound features in a concatenation layer, to obtain the fused feature.
The plurality of image features and the plurality of sound features obtained by the separate recognition are concatenated, i.e., fused, in the concat layer (the concatenation layer), yielding the fused feature. As shown in Fig. 4, after the input audio frames yield the sound features and the input pictures yield the image features, the sound features and image features are concatenated in the concat layer to obtain the fused feature.
The role of the concat layer is to concatenate two or more feature maps along the channel or num (number) dimension. For example, to concatenate a first feature map and a second feature map along the channel dimension, all dimensions other than the channel dimension (which may differ) must be the same, i.e., num, H, and W must be consistent; specifically, the k1 channels of the first feature map are added to the k2 channels of the second feature map, and the output of the concat layer can be expressed as N*(k1+k2)*H*W, where H is the height of the picture, W is the width of the picture, k1 is the number of channels of the first feature map, and k2 is the number of channels of the second feature map.
The training module 505 is configured to input the fused feature into the model to be trained for training, to obtain the emotion detection hybrid model.
Specifically, the fused feature is input into the model to be trained, and a network structure consisting of two LSTM (Long Short-Term Memory network) layers and an attention mechanism is used for training, yielding an emotion detection hybrid model that predicts emotion based on arousal theory.
The input module 503 is further configured to input the processed audio and video to be recognized into the emotion detection hybrid model, to obtain the emotion recognition result.
The audio and video to be recognized are used as the input of the emotion detection hybrid model, and the features of the audio and video to be recognized are computed according to the updated weights and biases of each network layer to obtain probability values for a preset number of emotions; according to the probability values of the preset number of emotions, the emotion corresponding to the emotion class with the highest probability is taken as the emotion in the audio and video to be recognized, and the recognition result is output.
Probability results for a preset number of emotions are obtained, the preset number of emotions being the same as the total number of emotion labels of the training samples. Specifically, the preset number may be set to 7, meaning that the face pictures have 7 emotions in total: happiness, sadness, fear, anger, surprise, disgust, and calm; or the preset number may be set to 8, meaning that the face pictures have 8 emotions in total: contempt, happiness, sadness, fear, anger, surprise, disgust, and calm. The number can be set according to the needs of the actual application and is not limited here.
Using the emotion detection hybrid model for emotion prediction can improve the accuracy of overall emotion prediction and, at the same time, also increases the robustness of the model.
As an optional implementation, the acquisition module 501 is further configured to acquire face pictures.
The emotion detection apparatus further includes:
a preprocessing module, configured to preprocess the face pictures to obtain training samples.
The input module 503 is further configured to input the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
In this optional implementation, the face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures labeled by the user after crawling data from search engines using crawler technology. By sorting and labeling the face pictures, each face picture is assigned a corresponding emotion label, where an emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions. The format of the face pictures may include, but is not limited to, jpg, png, and jpeg. The emotions represented by the face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust, and calm, and each facial expression represents only one type of emotion.
Specifically, the face pictures can be preprocessed, for example by normalization and pooling operations, which speeds up subsequent model training and, at the same time, removes redundant information from the face pictures and yields pictures of a uniform specification as training samples; this benefits the effectiveness of training, and the resulting image-based emotion recognition model is more accurate.
Specifically, an initial parameter can be assigned to the weights and biases of each network layer of the DenseNet model, i.e., the DenseNet model is initialized to obtain an initial model; the training samples are input into the initial model and normalized, the output parameters of each network layer of the initial model are calculated, and the forward output of the initial model is obtained; according to the forward output, the back-propagation algorithm is used to adjust the initial parameters of each network in the initial model, so as to obtain the emotion recognition model. How to calculate the output parameters of each network layer of the initial model belongs to the prior art and will not be repeated here.
As an optional implementation, the acquisition module 501 is further configured to acquire face pictures.
The preprocessing module is further configured to preprocess the face pictures to obtain training samples.
The input module 503 is further configured to input the training samples into a deep residual network (ResNet) model for training, to obtain the image-based emotion recognition model.
The face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures labeled by the user after crawling data from search engines using crawler technology. The emotions represented by the face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust, and calm; each facial expression represents only one type of emotion, and an emotion label corresponding to the face picture is assigned, where the emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions. Optionally, the preprocessing of the pictures refers to transforming the size, grayscale, shape, and so on of the pictures into a uniform specification, so that subsequent picture processing is more efficient. Optionally, the image preprocessing may also include normalization and pooling operations on the images to remove redundant information from the face pictures and obtain pictures of a uniform specification as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
Specifically, the training samples can be input into the constructed deep residual network model to extract the features of the training samples; according to the emotion labels carried by the training samples, the preset weights and preset thresholds in the neurons of each layer are adjusted so that the classification result based on the features matches the emotion label corresponding to the training sample; and the preset model parameters of the deep residual network model are updated to obtain the emotion recognition model. The preset model parameters include preset weights and preset thresholds and are the model parameters used to compute over the features of the training samples; the deep residual network model is trained with the training samples in order to obtain suitable network parameters and prevent over-fitting of the model.
As an optional implementation, the acquisition module 501 is further configured to acquire audio file samples.
The emotion detection apparatus further includes:
a calculation module, configured to perform Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features.
The training module 505 is further configured to train on the MFCC features to obtain the sound-based emotion recognition model.
In this optional implementation, the audio file samples may include, but are not limited to, audio files extracted from multimedia files, as well as other public audio data sets or audio collected by the users themselves. Specifically, the audio file samples can first be pre-emphasized, framed, and windowed; then, for each short-time analysis window, the corresponding spectrum is obtained through a fast Fourier transform (FFT), and the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis is performed on the Mel spectrum (taking the logarithm and applying an inverse transform; in practice the inverse transform is generally realized by a DCT discrete cosine transform, and the 5th to 13th coefficients after the DCT are taken as the MFCC coefficients) to obtain the Mel-frequency cepstral coefficients (MFCC). This MFCC is the feature of that frame of speech, i.e., the MFCC feature. Finally, the MFCC features can be trained on to obtain the sound-based emotion recognition model. MFCC stands for Mel Frequency Cepstral Coefficients.
As an optional implementation, the input module 503 inputs the plurality of image samples into the image-based emotion recognition model to obtain the plurality of image features specifically by:
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
Facial action units (AUs) were proposed for analyzing facial muscle movement. Facial expressions can be recognized through AUs. An AU is a basic muscle action unit of the human face, for example: inner eyebrow raised, mouth corners raised, or nose wrinkled.
When the plurality of image samples are input into the image-based emotion recognition model to obtain the plurality of facial action unit (AU) detection network features, the overall neural network structure is as follows: the plurality of image samples are fed in through the input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution packages with different parameters (conv_x package), and finally through a pooling layer, a fully connected layer (Full connect), and a sigmoid layer to obtain the AU prediction result. The AU detection network features are the values of the intermediate layers of this overall neural network structure.
When the plurality of image samples are input into the image-based emotion recognition model to obtain the plurality of emotion detection network features, the overall neural network structure is as follows: the plurality of image samples are fed in through the input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution packages with different parameters (conv_x package), and finally through a pooling layer and a fully connected layer (Full connect); the regression classification layer (Softmax) performs the classification prediction, and the output layer (Result) gives the prediction result for the face picture. The emotion detection network features are the values of the intermediate layers of this overall neural network structure.
As an optional implementation, the input module 503 inputs the plurality of sound samples into the sound-based emotion recognition model to obtain the plurality of sound features specifically in the following manner:
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
The power spectrum is short for the power spectral density function and denotes the signal power per unit frequency band. It describes how the signal power varies with frequency, that is, how the signal power is distributed over the frequency domain. After a series of processing steps and an FFT, the original audio yields a sequence of complex values over time; the squared modulus of those complex values is the power as a function of time, that is, the power spectrum.
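A brief sketch, for illustration only, of the power-spectrum computation described above: frame the audio, apply an FFT, and take the squared modulus of the resulting complex sequence. The frame length, hop size and normalization are assumed values, not fixed by the description.

```python
# Hedged sketch of the described power-spectrum computation; parameters are assumptions.
import numpy as np

def power_spectrum(signal, frame_len=400, hop=160, n_fft=512):
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    spectrum = np.fft.rfft(frames, n_fft)      # complex sequence per frame
    return (np.abs(spectrum) ** 2) / n_fft     # squared modulus gives the power spectrum
```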
As an optional implementation, the obtaining module 501 is further configured to obtain the user terminal of the user if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion;
the emotion detection apparatus further includes:
a sending module, configured to send prompt information carrying the emotion recognition result to the user terminal, where the prompt information is used to remind the user to pay attention to adjusting his or her emotions.
In this optional implementation, negative emotions are generally emotions of negative energy, such as sadness, fear, anger and disgust. When the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is negative, it indicates that the user's emotional state is relatively extreme. In order to help the user better understand his or her own emotions, and to prevent the user from subsequently acting rashly under the same emotion, the user terminal of the user can be obtained, and prompt information carrying the emotion recognition result can be sent to the user terminal, so that the user is reminded to adjust his or her emotions.
In the emotion detection apparatus described in FIG. 5, a plurality of audio and video samples can be obtained, and a plurality of image samples and a plurality of sound samples matching the plurality of image samples can be captured from the audio and video samples. Further, the plurality of image samples can be input into an image-based emotion recognition model to obtain a plurality of image features, and the plurality of sound samples can be input into a sound-based emotion recognition model to obtain a plurality of sound features. Finally, the plurality of image features and the plurality of sound features can be fused to obtain a fusion feature, the fusion feature can be input into a model to be trained for training to obtain a hybrid emotion detection model, and the processed audio and video to be recognized can be input into the hybrid emotion detection model to obtain an emotion recognition result. It can be seen that in the present invention, the hybrid emotion detection model trained on both image features and sound features can compensate for the shortcomings of an emotion detection model trained on images alone or on sound alone, which enables better emotion prediction, improves the overall prediction accuracy, and also increases the robustness of the model.
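The fusion and training flow summarized above can be illustrated with the following hedged sketch: image features and matching sound features are concatenated into a fusion feature, and a classifier is trained on the fused vector. The classifier architecture, the feature dimensions and the number of emotion classes are assumptions made for the example, not values specified by the disclosure.

```python
# Hedged sketch of feature fusion and training of the hybrid model; dimensions are assumptions.
import torch
import torch.nn as nn

class FusionEmotionModel(nn.Module):
    def __init__(self, img_dim, snd_dim, n_emotions):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + snd_dim, 256), nn.ReLU(),
            nn.Linear(256, n_emotions),
        )

    def forward(self, img_feat, snd_feat):
        fused = torch.cat([img_feat, snd_feat], dim=1)  # feature-value concatenation
        return self.classifier(fused)

# Illustrative training step on one batch of matched image/sound features
model = FusionEmotionModel(img_dim=512, snd_dim=9, n_emotions=7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
img_feat, snd_feat = torch.randn(8, 512), torch.randn(8, 9)
labels = torch.randint(0, 7, (8,))
loss = loss_fn(model(img_feat, snd_feat), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```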
As shown in FIG. 6, FIG. 6 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention for implementing the emotion detection method. The electronic device 6 includes a memory 61, at least one processor 62, computer-readable instructions 63 stored in the memory 61 and executable on the at least one processor 62, and at least one communication bus 64.
Those skilled in the art can understand that the schematic diagram shown in FIG. 6 is merely an example of the electronic device 6 and does not constitute a limitation on the electronic device 6; the electronic device 6 may include more or fewer components than shown, a combination of certain components, or different components. For example, the electronic device 6 may also include input/output devices, network access devices and the like.
The electronic device 6 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote control, a touch panel or a voice-control device, for example a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV) or a smart wearable device. The network in which the electronic device 6 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network and a virtual private network (VPN).
The at least one processor 62 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor 62 may be a microprocessor or any conventional processor. The processor 62 is the control center of the electronic device 6 and connects all parts of the entire electronic device 6 through various interfaces and lines.
The memory 61 may be used to store the computer-readable instructions 63 and/or modules/units. The processor 62 implements the various functions of the electronic device 6 by running or executing the computer-readable instructions and/or modules/units stored in the memory 61 and by calling data stored in the memory 61. The memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the electronic device 6 (such as audio data).
With reference to FIG. 1, the memory 61 in the electronic device 6 stores a plurality of instructions to implement an emotion detection method, and the processor 62 can execute the plurality of instructions to implement:
obtaining a plurality of audio and video samples;
capturing, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
fusing the plurality of image features and the plurality of sound features to obtain a fusion feature;
inputting the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
inputting the processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
Specifically, for the specific manner in which the processor 62 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, and details are not repeated here.
In the electronic device 6 described in FIG. 6, a plurality of audio and video samples can be obtained, and a plurality of image samples and a plurality of sound samples matching the plurality of image samples can be captured from the audio and video samples. Further, the plurality of image samples can be input into an image-based emotion recognition model to obtain a plurality of image features, and the plurality of sound samples can be input into a sound-based emotion recognition model to obtain a plurality of sound features. Finally, the plurality of image features and the plurality of sound features can be fused to obtain a fusion feature, the fusion feature can be input into a model to be trained for training to obtain a hybrid emotion detection model, and the processed audio and video to be recognized can be input into the hybrid emotion detection model to obtain an emotion recognition result. It can be seen that in the present invention, the hybrid emotion detection model trained on both image features and sound features can compensate for the shortcomings of an emotion detection model trained on images alone or on sound alone, which enables better emotion prediction, improves the overall prediction accuracy, and also increases the robustness of the model.
If the modules/units integrated in the electronic device 6 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention may also be completed by instructing relevant hardware through a computer program. The computer program may be stored in a non-volatile readable storage medium, and when executed by a processor, the computer program can implement the steps of the foregoing method embodiments. The computer program includes computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The non-volatile readable medium may include any entity or apparatus capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), and the like.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into modules is only a division by logical function, and other division manners are possible in actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional modules in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the present invention. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and scope of equivalents of the claims are therefore intended to be embraced in the present invention. No reference sign in the claims should be construed as limiting the claim concerned. In addition, it is obvious that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or apparatuses stated in a system claim may also be implemented by one unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention can be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of the present invention.

Claims (20)

1. An emotion detection method, wherein the method comprises:
    obtaining a plurality of audio and video samples;
    capturing, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
    inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
    fusing the plurality of image features and the plurality of sound features to obtain a fusion feature;
    inputting the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
    inputting processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
2. The method according to claim 1, wherein before the obtaining a plurality of audio and video samples, the method further comprises:
    obtaining face pictures;
    preprocessing the face pictures to obtain training samples;
    inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
3. The method according to claim 1, wherein the fusing the plurality of image features and the plurality of sound features to obtain a fusion feature comprises:
    concatenating feature values of the plurality of image features and the plurality of sound features at a concatenation layer, to obtain the fusion feature.
4. The method according to claim 1, wherein before the obtaining a plurality of audio and video samples, the method further comprises:
    obtaining audio file samples;
    performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features;
    training on the MFCC features to obtain the sound-based emotion recognition model.
5. The method according to any one of claims 1 to 4, wherein the inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features comprises:
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
6. The method according to any one of claims 1 to 4, wherein the inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features comprises:
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
7. The method according to any one of claims 1 to 4, wherein the method further comprises:
    if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, obtaining the user terminal of the user;
    sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to remind the user to pay attention to adjusting his or her emotions.
8. An emotion detection apparatus, wherein the emotion detection apparatus comprises:
    an obtaining module, configured to obtain a plurality of audio and video samples;
    a capture module, configured to capture, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
    an input module, configured to input the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and to input the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
    a fusion module, configured to fuse the plurality of image features and the plurality of sound features to obtain a fusion feature;
    a training module, configured to input the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
    wherein the input module is further configured to input processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
9. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    obtaining a plurality of audio and video samples;
    capturing, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
    inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
    fusing the plurality of image features and the plurality of sound features to obtain a fusion feature;
    inputting the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
    inputting processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
10. The electronic device according to claim 9, wherein before the obtaining a plurality of audio and video samples, the processor executes the at least one computer-readable instruction to further implement the following steps:
    obtaining face pictures;
    preprocessing the face pictures to obtain training samples;
    inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
11. The electronic device according to claim 9, wherein before the obtaining a plurality of audio and video samples, the processor executes the at least one computer-readable instruction to further implement the following steps:
    obtaining audio file samples;
    performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features;
    training on the MFCC features to obtain the sound-based emotion recognition model.
12. The electronic device according to any one of claims 9 to 11, wherein when the processor executes the at least one computer-readable instruction to implement the inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, the implementation comprises:
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
13. The electronic device according to any one of claims 9 to 11, wherein when the processor executes the at least one computer-readable instruction to implement the inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features, the implementation comprises:
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
14. The electronic device according to any one of claims 9 to 11, wherein the processor executes the at least one computer-readable instruction to further implement the following steps:
    if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, obtaining the user terminal of the user;
    sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to remind the user to pay attention to adjusting his or her emotions.
15. A non-volatile readable storage medium, wherein the non-volatile readable storage medium stores at least one computer-readable instruction, and when the at least one computer-readable instruction is executed by a processor, the following steps are implemented:
    obtaining a plurality of audio and video samples;
    capturing, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
    inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
    fusing the plurality of image features and the plurality of sound features to obtain a fusion feature;
    inputting the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
    inputting processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
16. The storage medium according to claim 15, wherein before the obtaining a plurality of audio and video samples, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    obtaining face pictures;
    preprocessing the face pictures to obtain training samples;
    inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
17. The storage medium according to claim 15, wherein before the obtaining a plurality of audio and video samples, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    obtaining audio file samples;
    performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features;
    training on the MFCC features to obtain the sound-based emotion recognition model.
18. The storage medium according to any one of claims 15 to 17, wherein when the at least one computer-readable instruction is executed by the processor to implement the inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, the implementation comprises:
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
19. The storage medium according to any one of claims 15 to 17, wherein when the at least one computer-readable instruction is executed by the processor to implement the inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features, the implementation comprises:
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
20. The storage medium according to any one of claims 15 to 17, wherein the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, obtaining the user terminal of the user;
    sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to remind the user to pay attention to adjusting his or her emotions.
PCT/CN2019/102867 2019-06-14 2019-08-27 Emotion detection method and apparatus, electronic device, and storage medium WO2020248376A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910518131.1 2019-06-14
CN201910518131.1A CN110414323A (en) 2019-06-14 2019-06-14 Mood detection method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020248376A1 true WO2020248376A1 (en) 2020-12-17

Family

ID=68359158

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102867 WO2020248376A1 (en) 2019-06-14 2019-08-27 Emotion detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN110414323A (en)
WO (1) WO2020248376A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343860A (en) * 2021-06-10 2021-09-03 南京工业大学 Bimodal fusion emotion recognition method based on video image and voice
CN113766405A (en) * 2021-07-22 2021-12-07 上海闻泰信息技术有限公司 Method and device for detecting noise of loudspeaker, electronic equipment and storage medium
CN114005468A (en) * 2021-09-07 2022-02-01 华院计算技术(上海)股份有限公司 Interpretable emotion recognition method and system based on global working space
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment
CN114863548A (en) * 2022-03-22 2022-08-05 天津大学 Emotion recognition method and device based on human motion posture nonlinear spatial features
CN114863636A (en) * 2022-03-25 2022-08-05 吉林云帆智能工程有限公司 Emotion recognition algorithm for rail vehicle driver
CN115100560A (en) * 2022-05-27 2022-09-23 中国科学院半导体研究所 Method, device and equipment for monitoring bad state of user and computer storage medium
CN114863548B (en) * 2022-03-22 2024-05-31 天津大学 Emotion recognition method and device based on nonlinear space characteristics of human body movement gestures

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145282B (en) * 2019-12-12 2023-12-05 科大讯飞股份有限公司 Avatar composition method, apparatus, electronic device, and storage medium
CN110991427B (en) * 2019-12-25 2023-07-14 北京百度网讯科技有限公司 Emotion recognition method and device for video and computer equipment
CN113129926A (en) * 2019-12-30 2021-07-16 中移(上海)信息通信科技有限公司 Voice emotion recognition model training method, voice emotion recognition method and device
CN114375466A (en) * 2019-12-31 2022-04-19 深圳市欢太科技有限公司 Video scoring method and device, storage medium and electronic equipment
CN111339913A (en) * 2020-02-24 2020-06-26 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing emotion of character in video
CN111402920B (en) * 2020-03-10 2023-09-12 同盾控股有限公司 Method and device for identifying asthma-relieving audio, terminal and storage medium
CN111414959B (en) * 2020-03-18 2024-02-02 南京星火技术有限公司 Image recognition method, device, computer readable medium and electronic equipment
CN111967361A (en) * 2020-08-07 2020-11-20 盐城工学院 Emotion detection method based on baby expression recognition and crying
CN112101129B (en) * 2020-08-21 2023-08-18 广东工业大学 Face-to-face video and audio multi-view emotion distinguishing method and system
CN112233698B (en) * 2020-10-09 2023-07-25 中国平安人寿保险股份有限公司 Character emotion recognition method, device, terminal equipment and storage medium
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN112686048B (en) * 2020-12-23 2021-11-23 沈阳新松机器人自动化股份有限公司 Emotion recognition method and device based on fusion of voice, semantics and facial expressions
CN112699774B (en) * 2020-12-28 2024-05-24 深延科技(北京)有限公司 Emotion recognition method and device for characters in video, computer equipment and medium
CN112884326A (en) * 2021-02-23 2021-06-01 无锡爱视智能科技有限责任公司 Video interview evaluation method and device based on multi-modal analysis and storage medium
CN113435357B (en) * 2021-06-30 2022-09-02 平安科技(深圳)有限公司 Voice broadcasting method, device, equipment and storage medium
CN113536999B (en) * 2021-07-01 2022-08-19 汇纳科技股份有限公司 Character emotion recognition method, system, medium and electronic device
CN114431970A (en) * 2022-02-25 2022-05-06 上海联影医疗科技股份有限公司 Medical imaging equipment control method, system and equipment
CN115620268A (en) * 2022-12-20 2023-01-17 深圳市徐港电子有限公司 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
CN109522818A (en) * 2018-10-29 2019-03-26 中国科学院深圳先进技术研究院 A kind of method, apparatus of Expression Recognition, terminal device and storage medium
CN109766476A (en) * 2018-12-27 2019-05-17 西安电子科技大学 Video content sentiment analysis method, apparatus, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101317047B1 (en) * 2012-07-23 2013-10-11 충남대학교산학협력단 Emotion recognition appatus using facial expression and method for controlling thereof
CN107633203A (en) * 2017-08-17 2018-01-26 平安科技(深圳)有限公司 Facial emotions recognition methods, device and storage medium
CN109190487A (en) * 2018-08-07 2019-01-11 平安科技(深圳)有限公司 Face Emotion identification method, apparatus, computer equipment and storage medium


Also Published As

Publication number Publication date
CN110414323A (en) 2019-11-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19933006; Country of ref document: EP; Kind code of ref document: A1)
122 Ep: pct application non-entry in european phase (Ref document number: 19933006; Country of ref document: EP; Kind code of ref document: A1)