WO2020248376A1 - Emotion detection method and apparatus, electronic device, and storage medium - Google Patents

Emotion detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2020248376A1
Authority
WO
WIPO (PCT)
Prior art keywords
samples
image
features
sound
input
Prior art date
Application number
PCT/CN2019/102867
Other languages
French (fr)
Chinese (zh)
Inventor
盛建达 (Sheng Jianda)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020248376A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - ... characterised by the type of extracted parameters
    • G10L 25/24 - ... the extracted parameters being the cepstrum
    • G10L 25/48 - ... specially adapted for particular use
    • G10L 25/51 - ... for comparison or discrimination
    • G10L 25/63 - ... for estimating an emotional state

Definitions

  • the present invention relates to the field of artificial intelligence technology, and in particular to an emotion detection method, device, electronic equipment and storage medium.
  • Emotion is a state that integrates a person's feelings, thoughts and behaviors; it includes a person's psychological response to external or internal stimuli, and it causes significant changes in a person's physiology and behavior. Facial expressions are an important aspect of the outward manifestation of emotion: changes in the eyes, eyebrows or mouth best express a person's emotions, and a person's emotion can be judged by recognizing and analyzing facial expressions. In a specific emotional state, people produce specific patterns of facial muscle movement and expression, so emotions can be recognized from the correspondence between expressions and emotions.
  • Emotion recognition is a key technology in the field of artificial intelligence.
  • the research on emotion recognition has important practical application value for human-computer interaction.
  • traditional emotion recognition methods generally use the LBP (Local Binary Pattern) method to extract features, and then an SVM (Support Vector Machine) classifier to classify emotions.
  • in existing facial expression recognition methods, because facial expressions fall into many classes with complex patterns, the recognition process is computationally complex, and the recognition accuracy and recognition efficiency for facial expressions are not high.
  • the first aspect of the present invention provides an emotion detection method, which includes:
  • before the acquiring of multiple audio and video samples, the method further includes:
  • the training samples are input into the densely connected convolutional network DenseNet model for training to obtain an image-based emotion recognition model.
  • the fusing the multiple image features and the multiple voice features to obtain the fusion feature includes:
  • the multiple image features and multiple voice features are spliced by feature values in a splicing layer to obtain a fusion feature.
  • before the acquiring of multiple audio and video samples, the method further includes:
  • the MFCC features are trained to obtain a voice-based emotion recognition model.
  • the inputting a plurality of the image samples into an image-based emotion recognition model to obtain a plurality of image features includes:
  • Input a plurality of said image samples into an image-based emotion recognition model to obtain a plurality of facial action unit AU detection network features; determine a plurality of said AU detection network features as a plurality of image features; or
  • a plurality of the image samples are input into an image-based emotion recognition model to obtain a plurality of emotion detection network features; the plurality of emotion detection network features are determined as a plurality of image features.
  • the inputting a plurality of the sound samples into a sound-based emotion recognition model to obtain a plurality of sound features includes:
  • Input a plurality of said sound samples into a voice-based emotion recognition model to obtain a plurality of Mel frequency cepstral coefficients; determine the plurality of Mel frequency cepstral coefficients as a plurality of sound features; or
  • the method further includes:
  • if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belong is a negative emotion, obtain the user terminal of the user;
  • the prompt information carrying the emotion recognition result is sent to the user terminal, and the prompt information is used to prompt the user to pay attention to adjusting emotions.
  • a second aspect of the present invention provides an emotion detection device, which includes:
  • the acquisition module is used to acquire multiple audio and video samples
  • a grabbing module for grabbing multiple image samples and multiple sound samples matching the multiple image samples from the audio and video samples
  • An input module configured to input a plurality of said image samples into an image-based emotion recognition model to obtain multiple image features, and input a plurality of said sound samples into a voice-based emotion recognition model to obtain multiple voice features;
  • a fusion module for fusing the multiple image features and multiple voice features to obtain fusion features
  • a training module configured to input the fusion features into the model to be trained for training, and obtain a mixed emotion detection model
  • the input module is also used to input the processed audio and video to be recognized into the emotion detection mixture model to obtain emotion recognition results.
  • a third aspect of the present invention provides an electronic device including a processor and a memory, and the processor is configured to implement the emotion detection method when executing computer-readable instructions stored in the memory.
  • a fourth aspect of the present invention provides a non-volatile readable storage medium having computer readable instructions stored on the non-volatile readable storage medium, and when the computer readable instructions are executed by a processor, the Method of emotion detection.
  • multiple audio and video samples can be obtained, multiple image samples and multiple sound samples matching the multiple image samples can be captured from the audio and video samples.
  • the multiple image features and the multiple sound features are fused to obtain a fusion feature, the fusion feature is input into the model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized are input into the emotion detection hybrid model to obtain an emotion recognition result.
  • the emotion detection hybrid model obtained by training on both image features and sound features can compensate for an emotion detection model trained on images alone or on sound alone; it can better predict emotions, improve the accuracy of the overall emotion prediction, and also increase the robustness of the model.
  • Fig. 1 is a flowchart of a preferred embodiment of an emotion detection method disclosed in the present invention.
  • Fig. 2 is a schematic diagram of the training process of an image-based emotion recognition model disclosed in the present invention.
  • Fig. 3 is a schematic diagram of the training process of a voice-based emotion recognition model disclosed in the present invention.
  • Fig. 4 is a schematic diagram of the training process of a mixed emotion detection model disclosed in the present invention.
  • Fig. 5 is a functional block diagram of a preferred embodiment of an emotion detection device disclosed in the present invention.
  • Fig. 6 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the emotion detection method of the present invention.
  • the emotion detection method of the embodiment of the present invention is applied to an electronic device, and can also be applied to a hardware environment composed of an electronic device and a server connected to the electronic device through a network, and is executed by the server and the electronic device.
  • Networks include but are not limited to: wide area network, metropolitan area network or local area network.
  • Fig. 1 is a flowchart of a preferred embodiment of an emotion detection method disclosed in the present invention. Among them, according to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the electronic device obtains multiple audio and video samples.
  • the audio and video samples include audio information and video information samples, and may include, but are not limited to, audio and video samples extracted from multimedia files, other public audio and video data sets, or user-collected and labeled audio and video samples.
  • the electronic device captures multiple image samples and multiple sound samples matching the multiple image samples from the audio and video samples.
  • each image sample has a sound sample that matches it, which is helpful for subsequently combining images and sound to build the emotion detection hybrid model.
  • the electronic device inputs a plurality of said image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputs a plurality of said sound samples into a voice-based emotion recognition model to obtain a plurality of sound features.
  • the image-based emotion recognition model may be a DenseNet (densely connected convolutional network) model, or it may be a ResNet (deep residual network) model.
  • the method further includes:
  • the training samples are input into the densely connected convolutional network DenseNet model for training to obtain an image-based emotion recognition model.
  • the face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures self-annotated from data captured in search engines using crawler technology.
  • the face pictures are sorted and labeled, and each face picture is assigned a corresponding emotion label, where the emotion label represents the emotion presented by the face picture, and different emotion labels represent different facial emotions.
  • the format of the face picture may include, but is not limited to, jpg, png, and jpeg.
  • the emotions represented by face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust and calm, and each facial expression represents only one type of emotion.
  • the face pictures can be preprocessed, for example with normalization and pooling operations, which speeds up the subsequent model training and removes redundant information from the face pictures; pictures of a uniform specification are obtained as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
  • an initial parameter can be assigned to the weights and biases of each network layer of the DenseNet model, that is, the DenseNet model is initialized to obtain an initial model; the training samples are input into the initial model and normalized, the output parameters of each network layer of the initial model are calculated, and the forward output of the initial model is obtained.
  • the initial parameters of each network layer in the initial model are then adjusted using the back-propagation algorithm to obtain the emotion recognition model.
  • how to calculate the output parameters of each network layer of the initial model belongs to the prior art, and will not be repeated here.
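As a rough illustration of the training procedure just described (initialize the DenseNet parameters, compute the forward output on normalized training samples, then adjust the parameters by back-propagation), the following Python sketch uses PyTorch and torchvision; the patent does not prescribe a framework, and the dataset loader, label set and hyper-parameters below are assumptions.

```python
# Hedged sketch only, not the patent's reference implementation.
import torch
import torch.nn as nn
from torchvision import models

EMOTIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust", "calm"]

model = models.densenet121(num_classes=len(EMOTIONS))      # initialized weights and biases
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_epoch(loader):
    """`loader` yields (normalized face-picture batch, emotion-label batch)."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        logits = model(images)              # forward output of the initial model
        loss = criterion(logits, labels)
        loss.backward()                     # back-propagation adjusts the layer parameters
        optimizer.step()
```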
  • the method further includes:
  • the training samples are input into a deep residual network ResNet model for training, and an image-based emotion recognition model is obtained.
  • the face picture may include, but is not limited to, an image captured from a video, it may also be a collection of other public data, or a face picture self-annotated using crawler technology to capture data on a search engine.
  • the emotions represented by face pictures include but are not limited to emotions such as happiness, sadness, fear, anger, surprise, disgust, and calm.
  • Each face expression represents only one type of emotion, and the emotion label corresponding to the face picture is assigned.
  • the emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions.
  • the preprocessing process of the picture refers to the processing of transforming the size, grayscale, and shape of the picture to form a unified specification, so that subsequent picture processing is more efficient.
  • the picture preprocessing can also consist of normalization and pooling operations on the pictures, which remove redundant information from the face pictures; pictures of a uniform specification are obtained as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
  • the training samples can be input into the constructed deep residual network model and the features of the training samples extracted; according to the emotion labels carried by the training samples, the preset weights and preset thresholds in each layer of neurons are adjusted so that the classification result obtained from the features matches the emotion label corresponding to the training sample; the preset model parameters of the deep residual network model are updated to obtain the emotion recognition model.
  • the preset model parameters include preset weights and preset thresholds.
  • the preset model parameters are the model parameters used to calculate the features of the training samples.
  • the deep residual network model is trained using the training samples to obtain appropriate network parameters and prevent model overfitting.
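A minimal sketch of the deep residual network variant, assuming torchvision's ResNet-18 as the backbone; only the backbone differs from the DenseNet sketch above, and the label set and training loop can stay the same.

```python
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(num_classes=7)     # train from scratch on the face pictures
# or start from a pretrained backbone and replace only the final layer:
# resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# resnet.fc = nn.Linear(resnet.fc.in_features, 7)
```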
  • Input represents an input image sample.
  • the image samples may include, but are not limited to, images extracted from multimedia files, other public image data sets, or images collected by users themselves, and the network layers are identified by Layer.
  • the output result (Result) can be the 7 major emotions, such as happiness, surprise, fear, disgust, anger, sadness and calm; it can also be the prediction result for facial action units (AU), or the positive or negative direction and the degree of arousal under the arousal (excitation) theory.
  • the method further includes:
  • the MFCC features are trained to obtain a voice-based emotion recognition model.
  • the audio file samples may include, but are not limited to, audio files extracted from multimedia files, and also include other public audio data sets or audio collected by the user.
  • the audio file samples can first be pre-emphasized, framed and windowed; then, for each short-time analysis window, the corresponding frequency spectrum is obtained through the fast Fourier transform (FFT), and the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis is performed on the Mel spectrum (taking the logarithm and applying the inverse transform) to obtain the Mel frequency cepstral coefficients, which form the MFCC feature of each speech frame.
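The MFCC pipeline above (pre-emphasis, framing and windowing, FFT, Mel filter bank, logarithm and inverse transform) can be sketched with librosa; the patent does not name a library, and the file path, sampling rate and frame parameters below are assumptions.

```python
import librosa

y, sr = librosa.load("sample.wav", sr=16000)           # an audio file sample
y = librosa.effects.preemphasis(y)                     # pre-emphasis
# framing, windowing, FFT, Mel filtering, log and the inverse (DCT) step
# are all performed inside librosa.feature.mfcc:
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)                                      # (13, n_frames): one MFCC vector per frame
```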
  • FIG. 3 is a schematic diagram of the training process of a voice-based emotion recognition model disclosed in the present invention.
  • h0, h1,...hn are the output results obtained through two LSTMs (Long Short Term Memory Network), and a0, a1...an and c0, c1...cn are all coefficients.
  • the h values (h0, h1, ..., hn) are obtained after the two LSTM layers, and the audio features are obtained after applying the coefficients (a0, a1, ..., an and c0, c1, ..., cn).
  • the audio features are then passed to a softmax regression classification layer, which produces the audio-based prediction result, i.e., the output result.
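The structure of Fig. 3 (two stacked LSTM layers producing h0...hn, coefficients that weight them into an audio feature, and a softmax regression layer giving the audio-based prediction) could look roughly like the following PyTorch sketch; layer sizes and the exact attention form are assumptions.

```python
import torch
import torch.nn as nn

class AudioEmotionNet(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_emotions=7):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.attn = nn.Linear(hidden, 1)              # produces the a0...an style coefficients
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, x):                             # x: (batch, frames, n_mfcc)
        h, _ = self.lstm(x)                           # h: (batch, frames, hidden) = h0...hn
        a = torch.softmax(self.attn(h), dim=1)        # attention weights over the frames
        audio_feature = (a * h).sum(dim=1)            # weighted sum of the h values
        return self.classifier(audio_feature)         # softmax applied at the loss / inference step
```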
  • the inputting a plurality of the image samples into an image-based emotion recognition model to obtain a plurality of image features includes:
  • Input a plurality of said image samples into an image-based emotion recognition model to obtain a plurality of facial action unit AU detection network features; determine a plurality of said AU detection network features as a plurality of image features; or
  • a plurality of the image samples are input into an image-based emotion recognition model to obtain a plurality of emotion detection network features; the plurality of emotion detection network features are determined as a plurality of image features.
  • the face action unit (Action Units, AU) is proposed to analyze the facial muscle movement. Facial expressions can be recognized by AU.
  • AU refers to the basic muscle action units of the human face, such as: raised inner eyebrows, raised mouth corners, and wrinkled nose.
  • the overall neural network structure used to obtain the AU detection network features is as follows: a plurality of the image samples are fed to an input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution blocks with different parameters (conv_x blocks), and finally through a pooling layer, a fully connected layer (Full Connect) and a sigmoid layer to obtain the AU prediction result.
  • the AU detection network feature is the value of the middle layer of the entire neural network structure.
  • inputting a plurality of the image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features works as follows: a plurality of the image samples are fed to an input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution blocks with different parameters (conv_x blocks), and finally through a pooling layer and a fully connected layer (Full Connect); classification prediction is performed by the regression classification layer (Softmax), and the prediction result for the face picture is obtained from the output layer (Result).
  • the emotion detection network characteristic is the value of the middle layer of the entire neural network structure.
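The two image branches described above share the same backbone (input, one convolution and pooling layer, four groups of convolution blocks with different parameters, then pooling and a fully connected layer) and differ only in the head: a sigmoid head for AU prediction or a softmax head for emotion prediction. In the hedged PyTorch sketch below, the intermediate value `feat` stands in for the AU detection network feature or emotion detection network feature; channel widths and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class FaceBranch(nn.Module):
    def __init__(self, n_aus=17, n_emotions=7):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 7, stride=2, padding=3),    # Conv_
                                  nn.ReLU(inplace=True), nn.MaxPool2d(2))      # Pool
        self.blocks = nn.Sequential(conv_block(32, 64), conv_block(64, 128),   # 4 conv_x groups
                                    conv_block(128, 256), conv_block(256, 512))
        self.pool = nn.AdaptiveAvgPool2d(1)                                    # Pool
        self.fc = nn.Linear(512, 256)                                          # Full Connect
        self.au_head = nn.Linear(256, n_aus)                                   # -> sigmoid (AU result)
        self.emotion_head = nn.Linear(256, n_emotions)                         # -> softmax (Result)

    def forward(self, x):
        feat = self.fc(self.pool(self.blocks(self.stem(x))).flatten(1))        # middle-layer value
        return torch.sigmoid(self.au_head(feat)), self.emotion_head(feat), feat
```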
  • the inputting a plurality of the sound samples into a sound-based emotion recognition model to obtain a plurality of sound features includes:
  • Input a plurality of said sound samples into a voice-based emotion recognition model to obtain a plurality of Mel frequency cepstral coefficients; determine the plurality of Mel frequency cepstral coefficients as a plurality of sound features; or
  • the power spectrum is short for the power spectral density function
  • the power spectrum is the signal power in the unit frequency band. It shows the change of signal power with frequency, that is, the distribution of signal power in the frequency domain.
  • the original audio undergoes a series of processing steps, and after the FFT a sequence of complex values over time is obtained.
  • the squared magnitude of these complex values gives the sequence of power over time, that is, the power spectrum.
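A small numpy sketch of that computation: the FFT of a windowed frame gives a complex sequence, and the squared magnitude of those complex values is the power spectrum. The frame and FFT sizes are example values.

```python
import numpy as np

def power_spectrum(frame, n_fft=512):
    spectrum = np.fft.rfft(frame, n=n_fft)        # complex sequence produced by the FFT
    return (np.abs(spectrum) ** 2) / n_fft        # squared magnitude = power per frequency bin
```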
  • the electronic device fuses the multiple image features and the multiple voice features to obtain a fusion feature.
  • the fusing the multiple image features and the multiple voice features to obtain the fusion feature includes:
  • the multiple image features and multiple voice features are spliced by feature values in a splicing layer to obtain a fusion feature.
  • multiple image features and multiple voice features obtained through individual recognition are spliced in the concat layer (ie, splicing layer), that is, fusion is performed, and the fusion feature is obtained.
  • the sound features and image features are spliced in the concat layer to obtain the fusion feature.
  • the function of the concat layer is to splice two or more feature maps in the channel (channel) or num (number) dimensions.
  • the first feature map and the second feature map are spliced in the channel dimension.
  • the other dimensions must be the same (that is, num, H, and W are consistent).
  • the output of the concat layer can be expressed as: N*(k1+k2)*H*W.
  • H is the height of the picture
  • W is the width of the picture
  • k1 is the number of channels in the first feature map
  • k2 is the number of channels in the second feature map.
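For example, with PyTorch's tensor concatenation (the shapes below are example values, not taken from the patent):

```python
import torch

image_feat = torch.randn(8, 256, 7, 7)                # N=8, k1=256, H=W=7
sound_feat = torch.randn(8, 128, 7, 7)                # N=8, k2=128, same N, H and W
fused = torch.cat([image_feat, sound_feat], dim=1)    # splice along the channel dimension
print(fused.shape)                                    # torch.Size([8, 384, 7, 7]) = N*(k1+k2)*H*W
```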
  • the electronic device inputs the fusion feature into the model to be trained for training, and obtains a mixed emotion detection model.
  • the fusion features are input into the model to be trained; using a network structure of two LSTM (Long Short-Term Memory) layers together with an attention mechanism, training yields the emotion detection hybrid model, which predicts emotions on the basis of the excitation (arousal) theory.
  • FIG. 4 is a schematic diagram of a training process of a mixed emotion detection model disclosed in the present invention.
  • the audio sub-frames and pictures are identified first to obtain sound features and image features, and then the acquired sound features and image features are stitched through the concatenation layer (Concate) to obtain the fusion features.
  • then, through a fully connected layer (Full Connect), classification prediction is performed by the regression classification layer (Softmax), and the emotion classification result is obtained from the output layer (Result).
  • the parameters before the concate layer are not updated, only the parameters after the concate layer need to be updated, and training is continued to obtain a mixed emotion detection model.
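A sketch of that training strategy, reusing the FaceBranch and AudioEmotionNet sketches above as the pre-trained branches; the HybridModel structure, attribute names and layer sizes are hypothetical, and its forward pass is omitted.

```python
import torch
import torch.nn as nn

class HybridModel(nn.Module):
    """Placeholder structure: pre-trained branches before the concat layer,
    trainable layers after it (forward pass omitted in this sketch)."""
    def __init__(self, image_branch, sound_branch, n_emotions=7):
        super().__init__()
        self.image_branch = image_branch                 # pre-trained, frozen below
        self.sound_branch = sound_branch                 # pre-trained, frozen below
        self.post_concat = nn.LSTM(384, 128, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(128, n_emotions)

hybrid_model = HybridModel(image_branch=FaceBranch(), sound_branch=AudioEmotionNet())

# parameters before the concat layer are not updated
for branch in (hybrid_model.image_branch, hybrid_model.sound_branch):
    for p in branch.parameters():
        p.requires_grad = False

# only the layers after the concat layer keep being trained
optimizer = torch.optim.Adam((p for p in hybrid_model.parameters() if p.requires_grad), lr=1e-4)
```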
  • the electronic device inputs the processed audio and video to be recognized into the emotion detection mixed model to obtain an emotion recognition result.
  • the audio and video to be recognized that require emotion detection can be obtained first and processed to remove redundant information; the processed audio and video to be recognized are then used as the input of the emotion detection hybrid model, and the features of the audio and video to be recognized are calculated according to the updated weights and biases of each network layer to obtain a probability value for each of the preset number of emotions; according to these probability values, the emotion corresponding to the emotion category with the highest probability is taken as the emotion in the audio and video to be recognized, and the recognition result is output.
  • a probability result of a preset number of emotions is obtained, and the preset number of emotions is the same as the total amount of emotion labels of the training sample.
  • the preset number can be set to 7, which means that the face pictures are classified into 7 emotions such as happiness, sadness, fear, anger, surprise, disgust and calm; alternatively, the preset number can be set to 8, which means that the face pictures are classified into 8 emotions in total.
  • the specific emotions can be set according to the needs of the actual application, and there is no restriction here.
  • using the emotion detection hybrid model to perform emotion prediction can improve the accuracy of the entire emotion prediction, and at the same time, it also increases the robustness of the model.
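For a single processed audio-video clip, the recognition step described above reduces to taking the emotion category with the highest probability; the sketch below assumes a 7-emotion label set and a `hybrid_model` callable on preprocessed image frames and audio features (both are assumptions).

```python
import torch

EMOTIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust", "calm"]

def recognize(hybrid_model, image_frames, audio_features):
    with torch.no_grad():
        logits = hybrid_model(image_frames, audio_features)   # one score per preset emotion
        probs = torch.softmax(logits, dim=-1)                 # probability value of each emotion
    return EMOTIONS[int(probs.argmax(dim=-1))]                # category with the highest probability
```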
  • the method further includes:
  • if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belong is a negative emotion, obtain the user terminal of the user;
  • the prompt information carrying the emotion recognition result is sent to the user terminal, and the prompt information is used to prompt the user to pay attention to adjusting emotions.
  • the negative emotions are generally negative energy emotions, such as sadness, fear, anger, disgust, etc.
  • if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belong is a negative emotion, it indicates that the user's emotion is relatively extreme; in order to let the user better understand his or her emotion and to prevent the user from subsequently acting aggressively under that emotion, the user terminal of the user can be obtained, and prompt information carrying the emotion recognition result can be sent to the user terminal, so that the user pays attention to adjusting his or her emotions.
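A trivial sketch of that follow-up step; the negative-emotion set, the terminal lookup and the sending function are placeholders rather than an API defined by the patent.

```python
NEGATIVE_EMOTIONS = {"sadness", "fear", "anger", "disgust"}

def notify_if_negative(user_id, emotion, get_user_terminal, send_to_terminal):
    if emotion in NEGATIVE_EMOTIONS:
        terminal = get_user_terminal(user_id)                  # obtain the user's terminal
        send_to_terminal(terminal, f"Emotion recognition result: {emotion}. "
                                   "Please pay attention to adjusting your emotions.")
```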
  • multiple audio and video samples can be obtained, and multiple image samples and multiple sound samples matching the multiple image samples can be captured from the audio and video samples.
  • a plurality of the image samples can be input into an image-based emotion recognition model to obtain multiple image features
  • a plurality of the sound samples can be input into a voice-based emotion recognition model to obtain multiple voice features
  • the multiple image features and the multiple sound features are fused to obtain a fusion feature, the fusion feature is input into the model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized are input into the emotion detection hybrid model to obtain an emotion recognition result.
  • the emotion detection hybrid model obtained by training on both image features and sound features can compensate for an emotion detection model trained on images alone or on sound alone; it can better predict emotions, improve the accuracy of the overall emotion prediction, and also increase the robustness of the model.
  • Fig. 5 is a functional block diagram of a preferred embodiment of an emotion detection device disclosed in the present invention.
  • the emotion detection device runs in an electronic device.
  • the emotion detection device may include multiple functional modules composed of program code segments.
  • the instruction code of each program segment in the emotion detection device can be stored in a memory and executed by at least one processor to perform part or all of the steps in the emotion detection method described in FIG. 1.
  • the emotion detection device can be divided into multiple functional modules according to the functions it performs.
  • the functional modules may include: an acquisition module 501, a capture module 502, an input module 503, a fusion module 504, and a training module 505.
  • the module referred to in the present invention refers to a series of computer-readable instruction segments that can be executed by at least one processor and can complete fixed functions, and are stored in a memory. In some embodiments, the functions of each module will be detailed in subsequent embodiments.
  • the obtaining module 501 is used to obtain multiple audio and video samples
  • the audio and video samples include audio information and video information samples, and may include, but are not limited to, audio and video samples extracted from multimedia files, other public audio and video data sets, or user-collected and labeled audio and video samples.
  • the grabbing module 502 is configured to grab multiple image samples and multiple sound samples matching the multiple image samples from the audio and video samples;
  • each image sample has a sound sample that matches it, which is helpful for subsequently combining images and sound to build the emotion detection hybrid model.
  • the input module 503 is configured to input a plurality of the image samples into an image-based emotion recognition model to obtain multiple image features, and input a plurality of the sound samples into the voice-based emotion recognition model to obtain multiple voice features;
  • the image-based emotion recognition model may be a DenseNet (densely connected convolutional network) model, or it may be a ResNet (deep residual network) model.
  • the fusion module 504 is configured to merge the multiple image features and the multiple voice features to obtain a fusion feature
  • the fusion module 504 fuses the multiple image features and the multiple voice features to obtain the fusion feature includes:
  • the multiple image features and multiple voice features are spliced by feature values in a splicing layer to obtain a fusion feature.
  • multiple image features and multiple voice features obtained through individual recognition are spliced in the concat layer (ie, splicing layer), that is, fusion is performed, and the fusion feature is obtained.
  • the sound features and image features are spliced in the concat layer to obtain the fusion feature.
  • the function of the concat layer is to splice two or more feature maps in the channel (channel) or num (number) dimension.
  • the first feature map and the second feature map are spliced in the channel dimension.
  • the other dimensions must be the same (that is, num, H, and W are consistent).
  • the output of the concat layer can be expressed as: N*(k1+k2)*H*W.
  • H is the height of the picture
  • W is the width of the picture
  • k1 is the number of channels in the first feature map
  • k2 is the number of channels in the second feature map.
  • the training module 505 is configured to input the fusion features into the model to be trained for training to obtain a mixed emotion detection model
  • the fusion features are input into the model to be trained; using a network structure of two LSTM (Long Short-Term Memory) layers together with an attention mechanism, training yields the emotion detection hybrid model, which predicts emotions on the basis of the excitation (arousal) theory.
  • the input module 503 is further configured to input the processed audio and video to be recognized into the emotion detection mixture model to obtain emotion recognition results.
  • the audio and video to be recognized are used as the input of the emotion detection hybrid model, and the features of the audio and video to be recognized are calculated according to the updated weights and biases of each network layer to obtain a probability value for each of the preset number of emotions; according to these probability values, the emotion corresponding to the emotion category with the highest probability is taken as the emotion in the audio and video to be recognized, and the recognition result is output.
  • a probability result of a preset number of emotions is obtained, and the preset number of emotions is the same as the total amount of emotion labels of the training sample.
  • the preset number can be set to 7, which means that the face pictures are classified into 7 emotions such as happiness, sadness, fear, anger, surprise, disgust and calm; alternatively, the preset number can be set to 8, which means that the face pictures are classified into 8 emotions in total.
  • the specific emotions can be set according to the needs of the actual application, and there is no restriction here.
  • using the emotion detection hybrid model to perform emotion prediction can improve the accuracy of the entire emotion prediction, and at the same time, it also increases the robustness of the model.
  • the acquiring module 501 is also used to acquire a face picture
  • the emotion detection device further includes:
  • a preprocessing module for preprocessing the face picture to obtain training samples
  • the input module 503 is also used to input the training samples into the densely connected convolutional network DenseNet model for training, to obtain an image-based emotion recognition model.
  • the face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures self-annotated from data captured in search engines using crawler technology.
  • the face pictures are sorted and labeled, and each face picture is assigned a corresponding emotion label, where the emotion label represents the emotion presented by the face picture, and different emotion labels represent different facial emotions.
  • the format of the face picture may include, but is not limited to, jpg, png, and jpeg.
  • the emotions represented by face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust and calm, and each facial expression represents only one type of emotion.
  • the face pictures can be preprocessed, for example with normalization and pooling operations, which speeds up the subsequent model training and removes redundant information from the face pictures; pictures of a uniform specification are obtained as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
  • an initial parameter can be assigned to the weights and biases of each network layer of the DenseNet model, that is, the DenseNet model is initialized to obtain an initial model; the training samples are input into the initial model and normalized, the output parameters of each network layer of the initial model are calculated, and the forward output of the initial model is obtained.
  • the initial parameters of each network layer in the initial model are then adjusted using the back-propagation algorithm to obtain the emotion recognition model.
  • how to calculate the output parameters of each network layer of the initial model belongs to the prior art, and will not be repeated here.
  • the acquiring module 501 is also used to acquire a face picture
  • the preprocessing module is also used to preprocess the face picture to obtain training samples
  • the input module 503 is also used to input the training samples into a deep residual network ResNet model for training, to obtain an image-based emotion recognition model.
  • the face picture may include, but is not limited to, an image captured from a video, it may also be a collection of other public data, or a face picture self-annotated using crawler technology to capture data on a search engine.
  • the emotions represented by face pictures include but are not limited to emotions such as happiness, sadness, fear, anger, surprise, disgust, and calm.
  • Each face expression represents only one type of emotion, and the emotion label corresponding to the face picture is assigned.
  • the emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions.
  • the preprocessing process of the picture refers to the processing of transforming the size, grayscale, and shape of the picture to form a unified specification, so that subsequent picture processing is more efficient.
  • the picture preprocessing can also consist of normalization and pooling operations on the pictures, which remove redundant information from the face pictures; pictures of a uniform specification are obtained as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
  • the training samples can be input into the constructed deep residual network model and the features of the training samples extracted; according to the emotion labels carried by the training samples, the preset weights and preset thresholds in each layer of neurons are adjusted so that the classification result obtained from the features matches the emotion label corresponding to the training sample; the preset model parameters of the deep residual network model are updated to obtain the emotion recognition model.
  • the preset model parameters include preset weights and preset thresholds.
  • the preset model parameters are the model parameters used to calculate the features of the training samples.
  • the deep residual network model is trained using the training samples to obtain appropriate network parameters and prevent model overfitting.
  • the acquiring module 501 is further configured to acquire audio file samples
  • the emotion detection device further includes:
  • the calculation module is used to calculate the Mel frequency cepstrum coefficient of the audio file samples to obtain the Mel frequency cepstrum coefficient MFCC feature;
  • the training module 505 is also used to train the MFCC features to obtain a voice-based emotion recognition model.
  • the audio file samples may include, but are not limited to, audio files extracted from multimedia files, and also include other public audio data sets or audio collected by the user.
  • the audio file samples can first be pre-emphasized, framed and windowed; then, for each short-time analysis window, the corresponding frequency spectrum is obtained through the fast Fourier transform (FFT), and the spectrum is passed through a Mel filter bank to obtain the Mel spectrum.
  • cepstral analysis is performed on the Mel spectrum (taking the logarithm and applying the inverse transform) to obtain the Mel frequency cepstral coefficients; this MFCC is the feature of this frame of speech, that is, the MFCC feature.
  • the MFCC feature can be trained to obtain a voice-based emotion recognition model.
  • the input module 503 inputs a plurality of the image samples into an image-based emotion recognition model, and the specific method for obtaining a plurality of image features is:
  • Input a plurality of said image samples into an image-based emotion recognition model to obtain a plurality of facial action unit AU detection network features; determine a plurality of said AU detection network features as a plurality of image features; or
  • a plurality of the image samples are input into an image-based emotion recognition model to obtain a plurality of emotion detection network features; the plurality of emotion detection network features are determined as a plurality of image features.
  • the face action unit (Action Units, AU) is proposed to analyze the facial muscle movement. Facial expressions can be recognized by AU.
  • AU refers to the basic muscle action units of the human face, such as: raised inner eyebrows, raised mouth corners, and wrinkled nose.
  • the overall neural network structure used to obtain the AU detection network features is as follows: a plurality of the image samples are fed to an input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution blocks with different parameters (conv_x blocks), and finally through a pooling layer, a fully connected layer (Full Connect) and a sigmoid layer to obtain the AU prediction result.
  • the AU detection network feature is the value of the middle layer of the entire neural network structure.
  • inputting a plurality of the image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features works as follows: a plurality of the image samples are fed to an input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution blocks with different parameters (conv_x blocks), and finally through a pooling layer and a fully connected layer (Full Connect); classification prediction is performed by the regression classification layer (Softmax), and the prediction result for the face picture is obtained from the output layer (Result).
  • the emotion detection network characteristic is the value of the middle layer of the entire neural network structure.
  • the input module 503 inputs a plurality of the sound samples into a sound-based emotion recognition model, and the specific method for obtaining a plurality of sound features is:
  • Input a plurality of said sound samples into a voice-based emotion recognition model to obtain a plurality of Mel frequency cepstral coefficients; determine the plurality of Mel frequency cepstral coefficients as a plurality of sound features; or
  • the power spectrum is short for the power spectral density function
  • the power spectrum is the signal power in the unit frequency band. It shows the change of signal power with frequency, that is, the distribution of signal power in the frequency domain.
  • the original audio undergoes a series of processing steps, and after the FFT a sequence of complex values over time is obtained.
  • the squared magnitude of these complex values gives the sequence of power over time, that is, the power spectrum.
  • the obtaining module 501 is further configured to obtain the user terminal of the user if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belongs is a negative emotion;
  • the emotion detection device further includes:
  • the sending module is configured to send prompt information carrying the emotion recognition result to the user terminal, and the prompt information is used to prompt the user to pay attention to adjusting emotions.
  • the negative emotions are generally negative energy emotions, such as sadness, fear, anger, disgust, etc.
  • if the emotion recognition result indicates that the emotion of the user to which the audio and video to be recognized belong is a negative emotion, it indicates that the user's emotion is relatively extreme; in order to let the user better understand his or her emotion and to prevent the user from subsequently acting aggressively under that emotion, the user terminal of the user can be obtained, and prompt information carrying the emotion recognition result can be sent to the user terminal, so that the user pays attention to adjusting his or her emotions.
  • multiple audio and video samples can be obtained, and multiple image samples and multiple sound samples matching the multiple image samples can be captured from the audio and video samples.
  • a plurality of the image samples can be input into an image-based emotion recognition model to obtain multiple image features
  • a plurality of the sound samples can be input into a voice-based emotion recognition model to obtain multiple voice features
  • the multiple image features and the multiple sound features are fused to obtain a fusion feature, the fusion feature is input into the model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized are input into the emotion detection hybrid model to obtain an emotion recognition result.
  • the emotion detection hybrid model obtained by training on both image features and sound features can compensate for an emotion detection model trained on images alone or on sound alone; it can better predict emotions, improve the accuracy of the overall emotion prediction, and also increase the robustness of the model.
  • FIG. 6 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the emotion detection method of the present invention.
  • the electronic device 6 includes a memory 61, at least one processor 62, computer readable instructions 63 stored in the memory 61 and executable on the at least one processor 62, and at least one communication bus 64.
  • FIG. 6 is only an example of the electronic device 6 and does not constitute a limitation on the electronic device 6; it may include more or fewer components than those shown in the figure, a combination of certain components, or different components; for example, the electronic device 6 may also include input and output devices, network access devices, and so on.
  • the electronic device 6 also includes, but is not limited to, any electronic product that can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, Personal digital assistants (Personal Digital Assistant, PDA), game consoles, interactive network television (Internet Protocol Television, IPTV), smart wearable devices, etc.
  • the network where the electronic device 6 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), etc.
  • the at least one processor 62 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the processor 62 can be a microprocessor or the processor 62 can also be any conventional processor, etc.
  • the processor 62 is the control center of the electronic device 6 and connects the various parts of the entire electronic device 6 through various interfaces and lines.
  • the memory 61 may be used to store the computer-readable instructions 63 and/or modules/units, and the processor 62 runs or executes the computer-readable instructions and/or modules/units stored in the memory 61 and calls the data stored in the memory 61 to realize various functions of the electronic device 6.
  • the memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data created according to the use of the electronic device 6 (such as audio data) and the like.
  • the memory 61 in the electronic device 6 stores multiple instructions to implement an emotion detection method, and the processor 62 can execute the multiple instructions to achieve:
  • multiple audio and video samples can be acquired, and multiple image samples and multiple sound samples matching the multiple image samples can be captured from the audio and video samples; further, a plurality of the image samples can be input into an image-based emotion recognition model to obtain multiple image features, and a plurality of the sound samples can be input into a sound-based emotion recognition model to obtain multiple sound features; finally, the multiple image features and the multiple sound features are fused to obtain a fusion feature, the fusion feature is input into the model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized are input into the emotion detection hybrid model to obtain an emotion recognition result.
  • the emotion detection hybrid model obtained by training on both image features and sound features can compensate for an emotion detection model trained on images alone or on sound alone; it can better predict emotions, improve the accuracy of the overall emotion prediction, and also increase the robustness of the model.
  • if the integrated module/unit of the electronic device 6 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile readable storage medium.
  • the present invention implements all or part of the processes in the above-mentioned embodiment methods, and can also be completed by instructing relevant hardware through a computer program.
  • the computer program can be stored in a non-volatile readable storage medium.
  • the computer program includes computer readable instruction code
  • the computer readable instruction code may be in the form of source code, object code, executable file, or some intermediate form.
  • the non-volatile readable medium may include: any entity or device capable of carrying the computer-readable instruction code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, read-only memory (ROM, Read-Only Memory) etc.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Image Analysis (AREA)

Abstract

An emotion detection method. Said method comprises: acquiring a plurality of audio and video samples; extracting, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples; inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features; fusing the plurality of image features and the plurality of sound features to obtain fused features; inputting the fused features into a model to be trained, for training, so as to obtain an emotion detection hybrid model; and inputting the processed audio and video to be recognized into the emotion detection hybrid model, to obtain an emotion recognition result. The present invention further provides an emotion detection apparatus, an electronic device and a storage medium. The present invention improves the accuracy of the whole emotion prediction, and also increases the robustness of a model.

Description

情绪检测方法、装置、电子设备及存储介质Emotion detection method, device, electronic equipment and storage medium
本申请要求于2019年06月14日提交中国专利局,申请号为201910518131.1发明名称为“情绪检测方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims to be submitted to the Chinese Patent Office on June 14, 2019. The application number is 201910518131.1. The priority of the Chinese patent application with the title of "emotion detection method, device, electronic equipment and storage medium", the entire content of which is incorporated by reference In this application.
技术领域Technical field
本发明涉及人工智能技术领域,尤其涉及一种情绪检测方法、装置、电子设备及存储介质。The present invention relates to the field of artificial intelligence technology, and in particular to an emotion detection method, device, electronic equipment and storage medium.
Background
Emotion is a state that integrates a person's feelings, thoughts, and behaviors. It includes a person's psychological responses to external or internal stimuli, and emotions produce significant physiological and behavioral changes. Facial expression is an important aspect of the outward behavior of emotion: changes in the eyes, eyebrows, or mouth best reveal a person's emotion, so a person's emotion can be judged by recognizing and analyzing facial expressions. In a particular emotional state, people produce specific facial muscle movements and expression patterns, and emotions can therefore be recognized from the correspondence between expressions and emotions.
Emotion recognition is a key technology in the field of artificial intelligence, and research on emotion recognition has important practical application value for human-computer interaction. Traditional emotion recognition methods generally use the LBP (Local Binary Pattern) method to extract features and then use an SVM (Support Vector Machine) classifier to classify emotions. However, because facial expressions fall into many classes and follow complex patterns, the recognition process in existing facial expression recognition methods is computationally complex, and both the recognition accuracy and the recognition efficiency for facial expressions are low.
Summary of the Invention
In view of the above, it is necessary to provide an emotion detection method that can improve the accuracy of overall emotion prediction while also increasing the robustness of the model.
A first aspect of the present invention provides an emotion detection method, the method comprising:
acquiring a plurality of audio and video samples;
extracting, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
fusing the plurality of image features and the plurality of sound features to obtain a fused feature;
inputting the fused feature into a model to be trained for training, to obtain an emotion detection hybrid model; and
inputting the processed audio and video to be recognized into the emotion detection hybrid model, to obtain an emotion recognition result.
In a possible implementation, before the acquiring of the plurality of audio and video samples, the method further comprises:
acquiring face pictures;
preprocessing the face pictures to obtain training samples; and
inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
In a possible implementation, the fusing of the plurality of image features and the plurality of sound features to obtain the fused feature comprises:
concatenating the feature values of the plurality of image features and the plurality of sound features in a concatenation layer, to obtain the fused feature.
In a possible implementation, before the acquiring of the plurality of audio and video samples, the method further comprises:
acquiring audio file samples;
performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features; and
training on the MFCC features to obtain the sound-based emotion recognition model.
In a possible implementation, the inputting of the plurality of image samples into the image-based emotion recognition model to obtain the plurality of image features comprises:
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
In a possible implementation, the inputting of the plurality of sound samples into the sound-based emotion recognition model to obtain the plurality of sound features comprises:
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
In a possible implementation, the method further comprises:
if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, acquiring the user terminal of the user; and
sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to prompt the user to pay attention to adjusting his or her emotion.
A second aspect of the present invention provides an emotion detection apparatus, the apparatus comprising:
an acquisition module, configured to acquire a plurality of audio and video samples;
an extraction module, configured to extract, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
an input module, configured to input the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and to input the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
a fusion module, configured to fuse the plurality of image features and the plurality of sound features to obtain a fused feature; and
a training module, configured to input the fused feature into a model to be trained for training, to obtain an emotion detection hybrid model;
wherein the input module is further configured to input the processed audio and video to be recognized into the emotion detection hybrid model, to obtain an emotion recognition result.
A third aspect of the present invention provides an electronic device, the electronic device comprising a processor and a memory, the processor being configured to implement the emotion detection method when executing computer-readable instructions stored in the memory.
A fourth aspect of the present invention provides a non-volatile readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions, when executed by a processor, implementing the emotion detection method.
According to the above technical solutions, in the present invention, a plurality of audio and video samples can be acquired, and a plurality of image samples and a plurality of sound samples matching the plurality of image samples can be extracted from the audio and video samples. Further, the plurality of image samples can be input into an image-based emotion recognition model to obtain a plurality of image features, and the plurality of sound samples can be input into a sound-based emotion recognition model to obtain a plurality of sound features. Finally, the plurality of image features and the plurality of sound features can be fused to obtain a fused feature, the fused feature can be input into a model to be trained for training to obtain an emotion detection hybrid model, and the processed audio and video to be recognized can be input into the emotion detection hybrid model to obtain an emotion recognition result. It can be seen that in the present invention, the emotion detection hybrid model obtained by training on both image features and sound features can compensate for the shortcomings of an emotion detection model trained on images alone or on sound alone, which enables better emotion prediction, improves the accuracy of overall emotion prediction, and also increases the robustness of the model.
Description of the Drawings
Fig. 1 is a flowchart of a preferred embodiment of an emotion detection method disclosed in the present invention.
Fig. 2 is a schematic diagram of the training process of an image-based emotion recognition model disclosed in the present invention.
Fig. 3 is a schematic diagram of the training process of a sound-based emotion recognition model disclosed in the present invention.
Fig. 4 is a schematic diagram of the training process of an emotion detection hybrid model disclosed in the present invention.
Fig. 5 is a functional block diagram of a preferred embodiment of an emotion detection apparatus disclosed in the present invention.
Fig. 6 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the emotion detection method of the present invention.
Detailed Description
The emotion detection method of the embodiments of the present invention is applied to an electronic device; it may also be applied to a hardware environment composed of an electronic device and a server connected to the electronic device through a network, in which case it is executed jointly by the server and the electronic device. The network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network.
Please refer to Fig. 1, which is a flowchart of a preferred embodiment of an emotion detection method disclosed in the present invention. According to different needs, the order of the steps in the flowchart may be changed, and some steps may be omitted.
S11. The electronic device acquires a plurality of audio and video samples.
The audio and video samples, i.e., samples containing both audio information and video information, may include but are not limited to audio and video samples extracted from multimedia files, other public audio and video data sets, or audio and video collected and labeled by users.
S12. The electronic device extracts, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples.
The plurality of image samples and the plurality of sound samples used for separate training need to come from the same audio and video sample; in other words, each image sample has a sound sample matching that image sample, which facilitates the subsequent construction of the emotion detection hybrid model from images and sound together.
S13. The electronic device inputs the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputs the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features.
The image-based emotion recognition model may be a DenseNet (Dense Convolutional Network) model, or the image-based emotion recognition model may be a ResNet (deep residual network) model.
As an optional implementation, before step S11, the method further includes:
acquiring face pictures;
preprocessing the face pictures to obtain training samples; and
inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
In this optional implementation, the face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures labeled by the user after crawling data from search engines using crawler technology. By sorting and labeling the face pictures, each face picture is assigned a corresponding emotion label, where an emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions. The format of the face pictures may include, but is not limited to, jpg, png, and jpeg. The emotions represented by the face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust, and calm, and each facial expression represents only one type of emotion.
Specifically, the face pictures can be preprocessed, for example by normalization and pooling operations, which speeds up subsequent model training and, at the same time, removes redundant information from the face pictures and yields pictures of a uniform specification as training samples; this benefits the effectiveness of training, and the resulting image-based emotion recognition model is more accurate.
Specifically, an initial parameter can be assigned to the weights and biases of each network layer of the DenseNet model, i.e., the DenseNet model is initialized to obtain an initial model; the training samples are input into the initial model and normalized, the output parameters of each network layer of the initial model are calculated, and the forward output of the initial model is obtained; according to the forward output, the back-propagation algorithm is used to adjust the initial parameters of each network in the initial model, so as to obtain the emotion recognition model. How to calculate the output parameters of each network layer of the initial model belongs to the prior art and will not be repeated here.
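As an illustrative, non-limiting sketch of the initialization, forward output, and back-propagation steps described above, the following Python code assumes a recent PyTorch/torchvision environment and a 7-class emotion head; the data loader and hyperparameters are hypothetical and not part of the original disclosure.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 7  # happiness, sadness, fear, anger, surprise, disgust, calm

# Initialize a DenseNet model and replace its classifier head for emotion recognition.
model = models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, NUM_EMOTIONS)

criterion = nn.CrossEntropyLoss()                              # loss over emotion labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_epoch(loader):
    """One pass of forward output + back-propagation over normalized face images."""
    model.train()
    for images, labels in loader:        # images: (N, 3, H, W), already normalized
        optimizer.zero_grad()
        logits = model(images)           # forward output of the initial model
        loss = criterion(logits, labels)
        loss.backward()                  # back-propagation adjusts the layer parameters
        optimizer.step()
```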
As an optional implementation, before step S11, the method further includes:
acquiring face pictures;
preprocessing the face pictures to obtain training samples; and
inputting the training samples into a deep residual network (ResNet) model for training, to obtain the image-based emotion recognition model.
The face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures labeled by the user after crawling data from search engines using crawler technology. The emotions represented by the face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust, and calm; each facial expression represents only one type of emotion, and an emotion label corresponding to the face picture is assigned, where the emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions. Optionally, the preprocessing of the pictures refers to transforming the size, grayscale, shape, and so on of the pictures into a uniform specification, so that subsequent picture processing is more efficient. Optionally, the image preprocessing may also include normalization and pooling operations on the images to remove redundant information from the face pictures and obtain pictures of a uniform specification as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate. A sketch of such preprocessing is shown below.
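The preprocessing just described (unifying size, grayscale conversion, and normalization) could look roughly like the following sketch; OpenCV and the 224x224 target size are illustrative assumptions rather than requirements of the disclosure.

```python
import cv2
import numpy as np

def preprocess_face(path, size=(224, 224)):
    """Transform a face picture into a uniform-specification training sample."""
    img = cv2.imread(path)                        # read face picture (jpg/png/jpeg)
    img = cv2.resize(img, size)                   # unify the size
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # grayscale transform
    norm = gray.astype(np.float32) / 255.0        # normalize pixel values to [0, 1]
    return norm
```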
Specifically, the training samples can be input into the constructed deep residual network model to extract the features of the training samples; according to the emotion labels carried by the training samples, the preset weights and preset thresholds in the neurons of each layer are adjusted so that the classification result based on the features matches the emotion label corresponding to the training sample; and the preset model parameters of the deep residual network model are updated to obtain the emotion recognition model. The preset model parameters include preset weights and preset thresholds and are the model parameters used to compute over the features of the training samples; the deep residual network model is trained with the training samples in order to obtain suitable network parameters and prevent over-fitting of the model.
Please also refer to Fig. 2, which is a schematic diagram of the training process of an image-based emotion recognition model disclosed in the present invention. Input represents the input image samples; the image samples may include, but are not limited to, images extracted from multimedia files, as well as other public image data sets or images collected by users themselves. Layer identifies a layer of neural network computation. The output result (Result) may be the 7 major emotions, such as happiness, surprise, fear, disgust, anger, sadness, and calm; it may also be the prediction result of facial action units (AUs), or the emotion valence (positive/negative) and arousal level according to arousal theory.
As an optional implementation, before step S11, the method further includes:
acquiring audio file samples;
performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features; and
training on the MFCC features to obtain the sound-based emotion recognition model.
In this optional implementation, the audio file samples may include, but are not limited to, audio files extracted from multimedia files, as well as other public audio data sets or audio collected by the users themselves. Specifically, the audio file samples can first be pre-emphasized, framed, and windowed; then, for each short-time analysis window, the corresponding spectrum is obtained through a fast Fourier transform (FFT), and the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis is performed on the Mel spectrum (taking the logarithm and applying an inverse transform; in practice the inverse transform is generally realized by a DCT discrete cosine transform, and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients) to obtain the Mel-frequency cepstral coefficients (MFCC). This MFCC is the feature of that frame of speech, i.e., the MFCC feature. Finally, the MFCC features can be trained on to obtain the sound-based emotion recognition model. MFCC stands for Mel Frequency Cepstral Coefficients.
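The MFCC pipeline described above (pre-emphasis, framing, windowing, FFT, Mel filter bank, logarithm, DCT, keeping coefficients 2 to 13) could be sketched as follows, assuming librosa for the Mel filter bank and SciPy for the DCT; the frame length, hop size, and number of Mel filters are illustrative assumptions.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(signal, sr, frame_len=0.025, hop_len=0.010, n_mels=26):
    """Compute frame-wise MFCCs following the steps described in the text."""
    # 1) Pre-emphasis
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2) Framing and 3) windowing (Hamming), assuming the signal spans at least one frame
    frame_size = int(round(frame_len * sr))
    hop_size = int(round(hop_len * sr))
    n_frames = 1 + (len(emphasized) - frame_size) // hop_size
    window = np.hamming(frame_size)
    frames = np.stack([
        emphasized[i * hop_size: i * hop_size + frame_size] * window
        for i in range(n_frames)
    ])

    # 4) FFT -> power spectrum of each short-time analysis window
    n_fft = 512
    power_spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft

    # 5) Mel filter bank, 6) logarithm
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(power_spec @ mel_fb.T + 1e-10)

    # 7) DCT (discrete cosine transform), 8) keep the 2nd to 13th coefficients
    cepstra = dct(log_mel, type=2, axis=1, norm='ortho')
    return cepstra[:, 1:13]
```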
Please also refer to Fig. 3, which is a schematic diagram of the training process of a sound-based emotion recognition model disclosed in the present invention. In Fig. 3, h0, h1, ..., hn are the outputs obtained through two LSTM (Long Short-Term Memory network) layers, and a0, a1, ..., an as well as c0, c1, ..., cn are coefficients. Features are extracted from multiple audio frames; the h values (h0, h1, ..., hn) are obtained after the two LSTM layers, and the audio features are then obtained after applying the coefficients (a0, a1, ..., an and c0, c1, ..., cn); finally, the regression classification layer (Softmax) performs the classification prediction, and the output layer (Result) gives the audio-based prediction result (i.e., the output result).
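A minimal sketch of a two-layer LSTM with attention-style coefficients over the per-frame outputs, in the spirit of Fig. 3, might look like the following; PyTorch, the hidden size, and the exact attention formulation are assumptions made for illustration, not details taken from the figure.

```python
import torch
import torch.nn as nn

class SoundEmotionNet(nn.Module):
    """Two stacked LSTMs over MFCC frames, attention-weighted pooling, softmax head."""
    def __init__(self, n_mfcc=12, hidden=128, n_emotions=7):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.attn = nn.Linear(hidden, 1)           # produces the per-frame coefficients
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, frames):                     # frames: (batch, time, n_mfcc)
        h, _ = self.lstm(frames)                   # h0..hn from the two LSTM layers
        a = torch.softmax(self.attn(h), dim=1)     # coefficients a0..an (sum to 1 over time)
        pooled = (a * h).sum(dim=1)                # weighted sum = audio feature
        return self.classifier(pooled)             # logits; Softmax/Result at prediction time
```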
Specifically, the inputting of the plurality of image samples into the image-based emotion recognition model to obtain the plurality of image features includes:
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
Facial action units (AUs) were proposed for analyzing facial muscle movement. Facial expressions can be recognized through AUs. An AU is a basic muscle action unit of the human face, for example: inner eyebrow raised, mouth corners raised, or nose wrinkled.
When the plurality of image samples are input into the image-based emotion recognition model to obtain the plurality of facial action unit (AU) detection network features, the overall neural network structure is as follows: the plurality of image samples are fed in through the input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution packages with different parameters (conv_x package), and finally through a pooling layer, a fully connected layer (Full connect), and a sigmoid layer to obtain the AU prediction result. The AU detection network features are the values of the intermediate layers of this overall neural network structure.
When the plurality of image samples are input into the image-based emotion recognition model to obtain the plurality of emotion detection network features, the overall neural network structure is as follows: the plurality of image samples are fed in through the input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution packages with different parameters (conv_x package), and finally through a pooling layer and a fully connected layer (Full connect); the regression classification layer (Softmax) performs the classification prediction, and the output layer (Result) gives the prediction result for the face picture. The emotion detection network features are the values of the intermediate layers of this overall neural network structure.
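A compact sketch of that structure, including how an intermediate-layer value can be taken as the image feature, is shown below; the channel counts, the contents of each convolution package, and the use of PyTorch are illustrative assumptions rather than details of the disclosure.

```python
import torch
import torch.nn as nn

def conv_package(c_in, c_out):
    """One of the 4 parameter-varying convolution packages (illustrative contents)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class EmotionCNN(nn.Module):
    def __init__(self, n_outputs=7, use_sigmoid=False):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1),    # Conv_
                                  nn.MaxPool2d(2))                    # Pool
        self.packages = nn.Sequential(conv_package(32, 64),
                                      conv_package(64, 128),
                                      conv_package(128, 256),
                                      conv_package(256, 256))         # 4 conv_x packages
        self.pool = nn.AdaptiveAvgPool2d(1)                           # final Pool
        self.fc = nn.Linear(256, n_outputs)                           # Full connect
        self.head = nn.Sigmoid() if use_sigmoid else nn.Softmax(dim=1)

    def forward(self, x, return_feature=False):
        mid = self.pool(self.packages(self.stem(x))).flatten(1)       # intermediate-layer value
        if return_feature:
            return mid                                                # used as the image feature
        return self.head(self.fc(mid))                                # AU / emotion prediction
```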
Specifically, the inputting of the plurality of sound samples into the sound-based emotion recognition model to obtain the plurality of sound features includes:
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
The power spectrum is short for the power spectral density function; it is the signal power per unit frequency band. It describes how the signal power varies with frequency, i.e., the distribution of the signal power in the frequency domain. After a series of processing steps and an FFT transform, the original audio yields a sequence of complex numbers over time, and the squared modulus of these complex numbers gives the sequence of power over time, i.e., the power spectrum.
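As a small numerical illustration of that definition (FFT of a windowed frame followed by the squared modulus), assuming NumPy:

```python
import numpy as np

def power_spectrum(frame, n_fft=512):
    """Squared modulus of the FFT of one audio frame, i.e., its power spectrum."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)  # complex sequence
    return (np.abs(spectrum) ** 2) / n_fft                           # power per frequency bin
```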
S14. The electronic device fuses the plurality of image features and the plurality of sound features to obtain a fused feature.
Specifically, the fusing of the plurality of image features and the plurality of sound features to obtain the fused feature includes:
concatenating the feature values of the plurality of image features and the plurality of sound features in a concatenation layer, to obtain the fused feature.
The plurality of image features and the plurality of sound features obtained by the separate recognition are concatenated, i.e., fused, in the concat layer (the concatenation layer), yielding the fused feature. As shown in Fig. 4, after the input audio frames yield the sound features and the input pictures yield the image features, the sound features and image features are concatenated in the concat layer to obtain the fused feature.
The role of the concat layer is to concatenate two or more feature maps along the channel or num (number) dimension. For example, to concatenate a first feature map and a second feature map along the channel dimension, all dimensions other than the channel dimension (which may differ) must be the same, i.e., num, H, and W must be consistent; specifically, the k1 channels of the first feature map are added to the k2 channels of the second feature map, and the output of the concat layer can be expressed as N*(k1+k2)*H*W, where H is the height of the picture, W is the width of the picture, k1 is the number of channels of the first feature map, and k2 is the number of channels of the second feature map.
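A quick check of that shape rule, assuming PyTorch tensors with illustrative sizes:

```python
import torch

N, k1, k2, H, W = 4, 64, 32, 28, 28
feat_a = torch.randn(N, k1, H, W)           # first feature map
feat_b = torch.randn(N, k2, H, W)           # second feature map (same N, H, W)

fused = torch.cat([feat_a, feat_b], dim=1)  # concatenate along the channel dimension
assert fused.shape == (N, k1 + k2, H, W)    # output is N*(k1+k2)*H*W
```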
S15. The electronic device inputs the fused feature into the model to be trained for training, to obtain the emotion detection hybrid model.
Specifically, the fused feature is input into the model to be trained, and a network structure consisting of two LSTM (Long Short-Term Memory network) layers and an attention mechanism is used for training, yielding an emotion detection hybrid model that predicts emotion based on arousal theory.
Please also refer to Fig. 4, which is a schematic diagram of the training process of an emotion detection hybrid model disclosed in the present invention. In Fig. 4, the audio frames and the pictures are first recognized separately to obtain the sound features and the image features; the obtained sound features and image features are then concatenated by feature value in the concatenation layer (Concate) to obtain the fused feature, which passes through 2 fully connected layers (full connect); finally, the regression classification layer (Softmax) performs the classification prediction, and the output layer (Result) gives the emotion classification result. The parameters before the concate layer are not updated; only the parameters after the concate layer need to be updated, and with continued training the emotion detection hybrid model is obtained.
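The following sketch illustrates that training arrangement, with the feature extractors before the concatenation frozen and only the post-concatenation layers updated; the feature dimensions, layer sizes, and freezing helper are illustrative assumptions rather than details of the disclosure.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Post-concat part of the hybrid model: 2 fully connected layers + softmax output."""
    def __init__(self, image_dim=256, sound_dim=128, n_emotions=7):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(image_dim + sound_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, n_emotions),
        )

    def forward(self, image_feat, sound_feat):
        fused = torch.cat([image_feat, sound_feat], dim=1)  # concatenation layer
        return self.fc(fused)                                # logits; softmax at prediction time

def freeze(module: nn.Module):
    """Parameters before the concat layer are not updated during hybrid-model training."""
    for p in module.parameters():
        p.requires_grad = False
```

Training would then optimize only the parameters of this head with a classification loss over the emotion labels, leaving the image branch and the sound branch unchanged.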
S16. The electronic device inputs the processed audio and video to be recognized into the emotion detection hybrid model to obtain the emotion recognition result.
The audio and video to be recognized that require emotion detection can first be acquired and processed to remove redundant information; the processed audio and video to be recognized are then used as the input of the emotion detection hybrid model, and the features of the audio and video to be recognized are computed according to the updated weights and biases of each network layer to obtain probability values for a preset number of emotions; according to the probability values of the preset number of emotions, the emotion corresponding to the emotion class with the highest probability is taken as the emotion in the audio and video to be recognized, and the recognition result is output.
Probability results for a preset number of emotions are obtained, the preset number of emotions being the same as the total number of emotion labels of the training samples. Specifically, the preset number may be set to 7, meaning that the face pictures have 7 emotions in total: happiness, sadness, fear, anger, surprise, disgust, and calm; or the preset number may be set to 8, meaning that the face pictures have 8 emotions in total: contempt, happiness, sadness, fear, anger, surprise, disgust, and calm. The number can be set according to the needs of the actual application and is not limited here.
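A minimal sketch of that last step, taking the class with the highest probability from the softmax output (the label list and the trained `model` are hypothetical):

```python
import torch

EMOTIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust", "calm"]

@torch.no_grad()
def recognize(model, image_feat, sound_feat):
    """Return the emotion whose predicted probability is highest."""
    probs = torch.softmax(model(image_feat, sound_feat), dim=1)  # preset number of emotions
    return EMOTIONS[probs.argmax(dim=1).item()]                  # class with highest probability
```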
Using the emotion detection hybrid model for emotion prediction can improve the accuracy of overall emotion prediction and, at the same time, also increases the robustness of the model.
As an optional implementation, the method further includes:
if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, acquiring the user terminal of the user; and
sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to prompt the user to pay attention to adjusting his or her emotion.
In this optional implementation, a negative emotion is generally a negative-energy emotion, such as sadness, fear, anger, or disgust. When the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, this indicates that the user's emotion is rather extreme. In order to let the user better understand his or her own emotion, and also to prevent the user from subsequently acting impulsively under the same emotion, the user terminal of the user can be acquired, and the prompt information carrying the emotion recognition result can be sent to the user terminal, so that the user pays attention to adjusting his or her emotion.
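As a rough sketch of that notification flow (the negative-emotion set, terminal lookup, and `send_to_terminal` transport are hypothetical placeholders rather than APIs from the disclosure):

```python
NEGATIVE_EMOTIONS = {"sadness", "fear", "anger", "disgust"}

def notify_if_negative(user_id, emotion, terminal_lookup, send_to_terminal):
    """If the recognized emotion is negative, prompt the user's terminal to adjust it."""
    if emotion in NEGATIVE_EMOTIONS:
        terminal = terminal_lookup(user_id)              # acquire the user terminal
        send_to_terminal(terminal, {
            "emotion": emotion,
            "message": "A negative emotion was recognized; please pay attention to adjusting it.",
        })
```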
In the method flow described in Fig. 1, a plurality of audio and video samples can be acquired, and a plurality of image samples and a plurality of sound samples matching the plurality of image samples can be extracted from the audio and video samples. Further, the plurality of image samples can be input into an image-based emotion recognition model to obtain a plurality of image features, and the plurality of sound samples can be input into a sound-based emotion recognition model to obtain a plurality of sound features. Finally, the plurality of image features and the plurality of sound features can be fused to obtain a fused feature, the fused feature can be input into the model to be trained for training to obtain the emotion detection hybrid model, and the processed audio and video to be recognized can be input into the emotion detection hybrid model to obtain the emotion recognition result. It can be seen that in the present invention, the emotion detection hybrid model obtained by training on both image features and sound features can compensate for the shortcomings of an emotion detection model trained on images alone or on sound alone, which enables better emotion prediction, improves the accuracy of overall emotion prediction, and also increases the robustness of the model.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Those of ordinary skill in the art may make improvements without departing from the inventive concept of the present invention, and all such improvements fall within the protection scope of the present invention.
Please refer to Fig. 5, which is a functional block diagram of a preferred embodiment of an emotion detection apparatus disclosed in the present invention.
In some embodiments, the emotion detection apparatus runs in an electronic device. The emotion detection apparatus may include a plurality of functional modules composed of program code segments. The instruction code of each program segment in the emotion detection apparatus can be stored in a memory and executed by at least one processor, so as to perform some or all of the steps of the emotion detection method described in Fig. 1.
In this embodiment, the emotion detection apparatus can be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an acquisition module 501, an extraction module 502, an input module 503, a fusion module 504, and a training module 505. A module referred to in the present invention is a series of computer-readable instruction segments that can be executed by at least one processor and can complete a fixed function, and it is stored in a memory. In some embodiments, the functions of each module will be described in detail in subsequent embodiments.
The acquisition module 501 is configured to acquire a plurality of audio and video samples.
The audio and video samples, i.e., samples containing both audio information and video information, may include but are not limited to audio and video samples extracted from multimedia files, other public audio and video data sets, or audio and video collected and labeled by users.
The extraction module 502 is configured to extract, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples.
The plurality of image samples and the plurality of sound samples used for separate training need to come from the same audio and video sample; in other words, each image sample has a sound sample matching that image sample, which facilitates the subsequent construction of the emotion detection hybrid model from images and sound together.
The input module 503 is configured to input the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and to input the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features.
The image-based emotion recognition model may be a DenseNet (Dense Convolutional Network) model, or the image-based emotion recognition model may be a ResNet (deep residual network) model.
The fusion module 504 is configured to fuse the plurality of image features and the plurality of sound features to obtain a fused feature.
Specifically, the fusion module 504 fusing the plurality of image features and the plurality of sound features to obtain the fused feature includes:
concatenating the feature values of the plurality of image features and the plurality of sound features in a concatenation layer, to obtain the fused feature.
The plurality of image features and the plurality of sound features obtained by the separate recognition are concatenated, i.e., fused, in the concat layer (the concatenation layer), yielding the fused feature. As shown in Fig. 4, after the input audio frames yield the sound features and the input pictures yield the image features, the sound features and image features are concatenated in the concat layer to obtain the fused feature.
The role of the concat layer is to concatenate two or more feature maps along the channel or num (number) dimension. For example, to concatenate a first feature map and a second feature map along the channel dimension, all dimensions other than the channel dimension (which may differ) must be the same, i.e., num, H, and W must be consistent; specifically, the k1 channels of the first feature map are added to the k2 channels of the second feature map, and the output of the concat layer can be expressed as N*(k1+k2)*H*W, where H is the height of the picture, W is the width of the picture, k1 is the number of channels of the first feature map, and k2 is the number of channels of the second feature map.
The training module 505 is configured to input the fused feature into the model to be trained for training, to obtain the emotion detection hybrid model.
Specifically, the fused feature is input into the model to be trained, and a network structure consisting of two LSTM (Long Short-Term Memory network) layers and an attention mechanism is used for training, yielding an emotion detection hybrid model that predicts emotion based on arousal theory.
The input module 503 is further configured to input the processed audio and video to be recognized into the emotion detection hybrid model, to obtain the emotion recognition result.
The audio and video to be recognized are used as the input of the emotion detection hybrid model, and the features of the audio and video to be recognized are computed according to the updated weights and biases of each network layer to obtain probability values for a preset number of emotions; according to the probability values of the preset number of emotions, the emotion corresponding to the emotion class with the highest probability is taken as the emotion in the audio and video to be recognized, and the recognition result is output.
Probability results for a preset number of emotions are obtained, the preset number of emotions being the same as the total number of emotion labels of the training samples. Specifically, the preset number may be set to 7, meaning that the face pictures have 7 emotions in total: happiness, sadness, fear, anger, surprise, disgust, and calm; or the preset number may be set to 8, meaning that the face pictures have 8 emotions in total: contempt, happiness, sadness, fear, anger, surprise, disgust, and calm. The number can be set according to the needs of the actual application and is not limited here.
Using the emotion detection hybrid model for emotion prediction can improve the accuracy of overall emotion prediction and, at the same time, also increases the robustness of the model.
As an optional implementation, the acquisition module 501 is further configured to acquire face pictures.
The emotion detection apparatus further includes:
a preprocessing module, configured to preprocess the face pictures to obtain training samples.
The input module 503 is further configured to input the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
In this optional implementation, the face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures labeled by the user after crawling data from search engines using crawler technology. By sorting and labeling the face pictures, each face picture is assigned a corresponding emotion label, where an emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions. The format of the face pictures may include, but is not limited to, jpg, png, and jpeg. The emotions represented by the face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust, and calm, and each facial expression represents only one type of emotion.
Specifically, the face pictures can be preprocessed, for example by normalization and pooling operations, which speeds up subsequent model training and, at the same time, removes redundant information from the face pictures and yields pictures of a uniform specification as training samples; this benefits the effectiveness of training, and the resulting image-based emotion recognition model is more accurate.
Specifically, an initial parameter can be assigned to the weights and biases of each network layer of the DenseNet model, i.e., the DenseNet model is initialized to obtain an initial model; the training samples are input into the initial model and normalized, the output parameters of each network layer of the initial model are calculated, and the forward output of the initial model is obtained; according to the forward output, the back-propagation algorithm is used to adjust the initial parameters of each network in the initial model, so as to obtain the emotion recognition model. How to calculate the output parameters of each network layer of the initial model belongs to the prior art and will not be repeated here.
As an optional implementation, the acquisition module 501 is further configured to acquire face pictures.
The preprocessing module is further configured to preprocess the face pictures to obtain training samples.
The input module 503 is further configured to input the training samples into a deep residual network (ResNet) model for training, to obtain the image-based emotion recognition model.
The face pictures may include, but are not limited to, images captured from videos; they may also come from other public data sets, or be face pictures labeled by the user after crawling data from search engines using crawler technology. The emotions represented by the face pictures include, but are not limited to, happiness, sadness, fear, anger, surprise, disgust, and calm; each facial expression represents only one type of emotion, and an emotion label corresponding to the face picture is assigned, where the emotion label represents the emotion presented by the face image, and different emotion labels represent different facial emotions. Optionally, the preprocessing of the pictures refers to transforming the size, grayscale, shape, and so on of the pictures into a uniform specification, so that subsequent picture processing is more efficient. Optionally, the image preprocessing may also include normalization and pooling operations on the images to remove redundant information from the face pictures and obtain pictures of a uniform specification as training samples, which benefits the effectiveness of training and makes the resulting image-based emotion recognition model more accurate.
Specifically, the training samples can be input into the constructed deep residual network model to extract the features of the training samples; according to the emotion labels carried by the training samples, the preset weights and preset thresholds in the neurons of each layer are adjusted so that the classification result based on the features matches the emotion label corresponding to the training sample; and the preset model parameters of the deep residual network model are updated to obtain the emotion recognition model. The preset model parameters include preset weights and preset thresholds and are the model parameters used to compute over the features of the training samples; the deep residual network model is trained with the training samples in order to obtain suitable network parameters and prevent over-fitting of the model.
As an optional implementation, the acquisition module 501 is further configured to acquire audio file samples.
The emotion detection apparatus further includes:
a calculation module, configured to perform Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features.
The training module 505 is further configured to train on the MFCC features to obtain the sound-based emotion recognition model.
In this optional implementation, the audio file samples may include, but are not limited to, audio files extracted from multimedia files, as well as other public audio data sets or audio collected by the users themselves. Specifically, the audio file samples can first be pre-emphasized, framed, and windowed; then, for each short-time analysis window, the corresponding spectrum is obtained through a fast Fourier transform (FFT), and the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis is performed on the Mel spectrum (taking the logarithm and applying an inverse transform; in practice the inverse transform is generally realized by a DCT discrete cosine transform, and the 5th to 13th coefficients after the DCT are taken as the MFCC coefficients) to obtain the Mel-frequency cepstral coefficients (MFCC). This MFCC is the feature of that frame of speech, i.e., the MFCC feature. Finally, the MFCC features can be trained on to obtain the sound-based emotion recognition model. MFCC stands for Mel Frequency Cepstral Coefficients.
As an optional implementation, the input module 503 inputs the plurality of image samples into the image-based emotion recognition model to obtain the plurality of image features specifically by:
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
Facial action units (AUs) were proposed for analyzing facial muscle movement. Facial expressions can be recognized through AUs. An AU is a basic muscle action unit of the human face, for example: inner eyebrow raised, mouth corners raised, or nose wrinkled.
When the plurality of image samples are input into the image-based emotion recognition model to obtain the plurality of facial action unit (AU) detection network features, the overall neural network structure is as follows: the plurality of image samples are fed in through the input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution packages with different parameters (conv_x package), and finally through a pooling layer, a fully connected layer (Full connect), and a sigmoid layer to obtain the AU prediction result. The AU detection network features are the values of the intermediate layers of this overall neural network structure.
When the plurality of image samples are input into the image-based emotion recognition model to obtain the plurality of emotion detection network features, the overall neural network structure is as follows: the plurality of image samples are fed in through the input layer (Input), pass through a convolutional layer (Conv_) and a pooling layer (Pool), then through 4 groups of convolution packages with different parameters (conv_x package), and finally through a pooling layer and a fully connected layer (Full connect); the regression classification layer (Softmax) performs the classification prediction, and the output layer (Result) gives the prediction result for the face picture. The emotion detection network features are the values of the intermediate layers of this overall neural network structure.
As an optional implementation, the input module 503 inputs the plurality of sound samples into the sound-based emotion recognition model to obtain the plurality of sound features specifically in the following manner:
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
The power spectrum is short for the power spectral density function and denotes the signal power per unit frequency band. It describes how the signal power varies with frequency, that is, how the signal power is distributed over the frequency domain. After a series of processing steps and an FFT, the original audio yields a sequence of complex values over time; the squared modulus of those complex values is the power as a function of time, that is, the power spectrum.
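A brief sketch, for illustration only, of the power-spectrum computation described above: frame the audio, apply an FFT, and take the squared modulus of the resulting complex sequence. The frame length, hop size and normalization are assumed values, not fixed by the description.

```python
# Hedged sketch of the described power-spectrum computation; parameters are assumptions.
import numpy as np

def power_spectrum(signal, frame_len=400, hop=160, n_fft=512):
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    spectrum = np.fft.rfft(frames, n_fft)      # complex sequence per frame
    return (np.abs(spectrum) ** 2) / n_fft     # squared modulus gives the power spectrum
```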
As an optional implementation, the obtaining module 501 is further configured to obtain the user terminal of the user if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion;
the emotion detection apparatus further includes:
a sending module, configured to send prompt information carrying the emotion recognition result to the user terminal, where the prompt information is used to remind the user to pay attention to adjusting his or her emotions.
In this optional implementation, negative emotions are generally emotions of negative energy, such as sadness, fear, anger and disgust. When the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is negative, it indicates that the user's emotional state is relatively extreme. In order to help the user better understand his or her own emotions, and to prevent the user from subsequently acting rashly under the same emotion, the user terminal of the user can be obtained, and prompt information carrying the emotion recognition result can be sent to the user terminal, so that the user is reminded to adjust his or her emotions.
In the emotion detection apparatus described in FIG. 5, a plurality of audio and video samples can be obtained, and a plurality of image samples and a plurality of sound samples matching the plurality of image samples can be captured from the audio and video samples. Further, the plurality of image samples can be input into an image-based emotion recognition model to obtain a plurality of image features, and the plurality of sound samples can be input into a sound-based emotion recognition model to obtain a plurality of sound features. Finally, the plurality of image features and the plurality of sound features can be fused to obtain a fusion feature, the fusion feature can be input into a model to be trained for training to obtain a hybrid emotion detection model, and the processed audio and video to be recognized can be input into the hybrid emotion detection model to obtain an emotion recognition result. It can be seen that in the present invention, the hybrid emotion detection model trained on both image features and sound features can compensate for the shortcomings of an emotion detection model trained on images alone or on sound alone, which enables better emotion prediction, improves the overall prediction accuracy, and also increases the robustness of the model.
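The fusion and training flow summarized above can be illustrated with the following hedged sketch: image features and matching sound features are concatenated into a fusion feature, and a classifier is trained on the fused vector. The classifier architecture, the feature dimensions and the number of emotion classes are assumptions made for the example, not values specified by the disclosure.

```python
# Hedged sketch of feature fusion and training of the hybrid model; dimensions are assumptions.
import torch
import torch.nn as nn

class FusionEmotionModel(nn.Module):
    def __init__(self, img_dim, snd_dim, n_emotions):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + snd_dim, 256), nn.ReLU(),
            nn.Linear(256, n_emotions),
        )

    def forward(self, img_feat, snd_feat):
        fused = torch.cat([img_feat, snd_feat], dim=1)  # feature-value concatenation
        return self.classifier(fused)

# Illustrative training step on one batch of matched image/sound features
model = FusionEmotionModel(img_dim=512, snd_dim=9, n_emotions=7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
img_feat, snd_feat = torch.randn(8, 512), torch.randn(8, 9)
labels = torch.randint(0, 7, (8,))
loss = loss_fn(model(img_feat, snd_feat), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```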
As shown in FIG. 6, FIG. 6 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention for implementing the emotion detection method. The electronic device 6 includes a memory 61, at least one processor 62, computer-readable instructions 63 stored in the memory 61 and executable on the at least one processor 62, and at least one communication bus 64.
Those skilled in the art can understand that the schematic diagram shown in FIG. 6 is merely an example of the electronic device 6 and does not constitute a limitation on the electronic device 6; the electronic device 6 may include more or fewer components than shown, a combination of certain components, or different components. For example, the electronic device 6 may also include input/output devices, network access devices and the like.
The electronic device 6 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote control, a touch panel or a voice-control device, for example a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV) or a smart wearable device. The network in which the electronic device 6 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network and a virtual private network (VPN).
The at least one processor 62 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor 62 may be a microprocessor or any conventional processor. The processor 62 is the control center of the electronic device 6 and connects all parts of the entire electronic device 6 through various interfaces and lines.
The memory 61 may be used to store the computer-readable instructions 63 and/or modules/units. The processor 62 implements the various functions of the electronic device 6 by running or executing the computer-readable instructions and/or modules/units stored in the memory 61 and by calling data stored in the memory 61. The memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the electronic device 6 (such as audio data).
With reference to FIG. 1, the memory 61 in the electronic device 6 stores a plurality of instructions to implement an emotion detection method, and the processor 62 can execute the plurality of instructions to implement:
obtaining a plurality of audio and video samples;
capturing, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
fusing the plurality of image features and the plurality of sound features to obtain a fusion feature;
inputting the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
inputting the processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
Specifically, for the specific manner in which the processor 62 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, and details are not repeated here.
In the electronic device 6 described in FIG. 6, a plurality of audio and video samples can be obtained, and a plurality of image samples and a plurality of sound samples matching the plurality of image samples can be captured from the audio and video samples. Further, the plurality of image samples can be input into an image-based emotion recognition model to obtain a plurality of image features, and the plurality of sound samples can be input into a sound-based emotion recognition model to obtain a plurality of sound features. Finally, the plurality of image features and the plurality of sound features can be fused to obtain a fusion feature, the fusion feature can be input into a model to be trained for training to obtain a hybrid emotion detection model, and the processed audio and video to be recognized can be input into the hybrid emotion detection model to obtain an emotion recognition result. It can be seen that in the present invention, the hybrid emotion detection model trained on both image features and sound features can compensate for the shortcomings of an emotion detection model trained on images alone or on sound alone, which enables better emotion prediction, improves the overall prediction accuracy, and also increases the robustness of the model.
If the modules/units integrated in the electronic device 6 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention may also be completed by instructing relevant hardware through a computer program. The computer program may be stored in a non-volatile readable storage medium, and when executed by a processor, the computer program can implement the steps of the foregoing method embodiments. The computer program includes computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The non-volatile readable medium may include any entity or apparatus capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), and the like.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into modules is only a division by logical function, and other division manners are possible in actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional modules in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the present invention. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and scope of equivalents of the claims are therefore intended to be embraced in the present invention. No reference sign in the claims should be construed as limiting the claim concerned. In addition, it is obvious that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or apparatuses stated in a system claim may also be implemented by one unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention can be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of the present invention.

Claims (20)

1. An emotion detection method, wherein the method comprises:
    obtaining a plurality of audio and video samples;
    capturing, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
    inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
    fusing the plurality of image features and the plurality of sound features to obtain a fusion feature;
    inputting the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
    inputting processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
2. The method according to claim 1, wherein before the obtaining a plurality of audio and video samples, the method further comprises:
    obtaining face pictures;
    preprocessing the face pictures to obtain training samples;
    inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
3. The method according to claim 1, wherein the fusing the plurality of image features and the plurality of sound features to obtain a fusion feature comprises:
    concatenating feature values of the plurality of image features and the plurality of sound features at a concatenation layer, to obtain the fusion feature.
4. The method according to claim 1, wherein before the obtaining a plurality of audio and video samples, the method further comprises:
    obtaining audio file samples;
    performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features;
    training on the MFCC features to obtain the sound-based emotion recognition model.
5. The method according to any one of claims 1 to 4, wherein the inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features comprises:
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
6. The method according to any one of claims 1 to 4, wherein the inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features comprises:
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
7. The method according to any one of claims 1 to 4, wherein the method further comprises:
    if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, obtaining the user terminal of the user;
    sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to remind the user to pay attention to adjusting his or her emotions.
8. An emotion detection apparatus, wherein the emotion detection apparatus comprises:
    an obtaining module, configured to obtain a plurality of audio and video samples;
    a capture module, configured to capture, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
    an input module, configured to input the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and to input the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
    a fusion module, configured to fuse the plurality of image features and the plurality of sound features to obtain a fusion feature;
    a training module, configured to input the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
    wherein the input module is further configured to input processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
9. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    obtaining a plurality of audio and video samples;
    capturing, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
    inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
    fusing the plurality of image features and the plurality of sound features to obtain a fusion feature;
    inputting the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
    inputting processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
10. The electronic device according to claim 9, wherein before the obtaining a plurality of audio and video samples, the processor executes the at least one computer-readable instruction to further implement the following steps:
    obtaining face pictures;
    preprocessing the face pictures to obtain training samples;
    inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
11. The electronic device according to claim 9, wherein before the obtaining a plurality of audio and video samples, the processor executes the at least one computer-readable instruction to further implement the following steps:
    obtaining audio file samples;
    performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features;
    training on the MFCC features to obtain the sound-based emotion recognition model.
12. The electronic device according to any one of claims 9 to 11, wherein when the processor executes the at least one computer-readable instruction to implement the inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, the implementation comprises:
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
13. The electronic device according to any one of claims 9 to 11, wherein when the processor executes the at least one computer-readable instruction to implement the inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features, the implementation comprises:
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
14. The electronic device according to any one of claims 9 to 11, wherein the processor executes the at least one computer-readable instruction to further implement the following steps:
    if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, obtaining the user terminal of the user;
    sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to remind the user to pay attention to adjusting his or her emotions.
15. A non-volatile readable storage medium, wherein the non-volatile readable storage medium stores at least one computer-readable instruction, and when the at least one computer-readable instruction is executed by a processor, the following steps are implemented:
    obtaining a plurality of audio and video samples;
    capturing, from the audio and video samples, a plurality of image samples and a plurality of sound samples matching the plurality of image samples;
    inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, and inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features;
    fusing the plurality of image features and the plurality of sound features to obtain a fusion feature;
    inputting the fusion feature into a model to be trained for training, to obtain a hybrid emotion detection model;
    inputting processed audio and video to be recognized into the hybrid emotion detection model, to obtain an emotion recognition result.
16. The storage medium according to claim 15, wherein before the obtaining a plurality of audio and video samples, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    obtaining face pictures;
    preprocessing the face pictures to obtain training samples;
    inputting the training samples into a densely connected convolutional network (DenseNet) model for training, to obtain the image-based emotion recognition model.
17. The storage medium according to claim 15, wherein before the obtaining a plurality of audio and video samples, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    obtaining audio file samples;
    performing Mel-frequency cepstral coefficient calculation on the audio file samples to obtain Mel-frequency cepstral coefficient (MFCC) features;
    training on the MFCC features to obtain the sound-based emotion recognition model.
18. The storage medium according to any one of claims 15 to 17, wherein when the at least one computer-readable instruction is executed by the processor to implement the inputting the plurality of image samples into an image-based emotion recognition model to obtain a plurality of image features, the implementation comprises:
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of facial action unit (AU) detection network features, and determining the plurality of AU detection network features as the plurality of image features; or
    inputting the plurality of image samples into the image-based emotion recognition model to obtain a plurality of emotion detection network features, and determining the plurality of emotion detection network features as the plurality of image features.
19. The storage medium according to any one of claims 15 to 17, wherein when the at least one computer-readable instruction is executed by the processor to implement the inputting the plurality of sound samples into a sound-based emotion recognition model to obtain a plurality of sound features, the implementation comprises:
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of Mel-frequency cepstral coefficients, and determining the plurality of Mel-frequency cepstral coefficients as the plurality of sound features; or
    inputting the plurality of sound samples into the sound-based emotion recognition model to obtain a plurality of power spectra, and determining the plurality of power spectra as the plurality of sound features.
20. The storage medium according to any one of claims 15 to 17, wherein the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    if the emotion recognition result indicates that the emotion of the user to whom the audio and video to be recognized belong is a negative emotion, obtaining the user terminal of the user;
    sending prompt information carrying the emotion recognition result to the user terminal, the prompt information being used to remind the user to pay attention to adjusting his or her emotions.
PCT/CN2019/102867 2019-06-14 2019-08-27 Emotion detection method and apparatus, electronic device, and storage medium WO2020248376A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910518131.1 2019-06-14
CN201910518131.1A CN110414323A (en) 2019-06-14 2019-06-14 Mood detection method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020248376A1 true WO2020248376A1 (en) 2020-12-17

Family

ID=68359158

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102867 WO2020248376A1 (en) 2019-06-14 2019-08-27 Emotion detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN110414323A (en)
WO (1) WO2020248376A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343860A (en) * 2021-06-10 2021-09-03 南京工业大学 Bimodal fusion emotion recognition method based on video image and voice
CN113766405A (en) * 2021-07-22 2021-12-07 上海闻泰信息技术有限公司 Method and device for detecting noise of loudspeaker, electronic equipment and storage medium
CN114005468A (en) * 2021-09-07 2022-02-01 华院计算技术(上海)股份有限公司 Interpretable emotion recognition method and system based on global working space
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment
CN114863548A (en) * 2022-03-22 2022-08-05 天津大学 Emotion recognition method and device based on human motion posture nonlinear spatial features
CN114863636A (en) * 2022-03-25 2022-08-05 吉林云帆智能工程有限公司 Emotion recognition algorithm for rail vehicle driver
CN115100560A (en) * 2022-05-27 2022-09-23 中国科学院半导体研究所 Method, device and equipment for monitoring bad state of user and computer storage medium
CN114863548B (en) * 2022-03-22 2024-05-31 天津大学 Emotion recognition method and device based on nonlinear space characteristics of human body movement gestures

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145282B (en) * 2019-12-12 2023-12-05 科大讯飞股份有限公司 Avatar composition method, apparatus, electronic device, and storage medium
CN110991427B (en) * 2019-12-25 2023-07-14 北京百度网讯科技有限公司 Emotion recognition method and device for video and computer equipment
CN113129926A (en) * 2019-12-30 2021-07-16 中移(上海)信息通信科技有限公司 Voice emotion recognition model training method, voice emotion recognition method and device
CN114375466A (en) * 2019-12-31 2022-04-19 深圳市欢太科技有限公司 Video scoring method and device, storage medium and electronic equipment
CN111339913A (en) * 2020-02-24 2020-06-26 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing emotion of character in video
CN111402920B (en) * 2020-03-10 2023-09-12 同盾控股有限公司 Method and device for identifying asthma-relieving audio, terminal and storage medium
CN111414959B (en) * 2020-03-18 2024-02-02 南京星火技术有限公司 Image recognition method, device, computer readable medium and electronic equipment
CN111967361A (en) * 2020-08-07 2020-11-20 盐城工学院 Emotion detection method based on baby expression recognition and crying
CN112101129B (en) * 2020-08-21 2023-08-18 广东工业大学 Face-to-face video and audio multi-view emotion distinguishing method and system
CN112233698B (en) * 2020-10-09 2023-07-25 中国平安人寿保险股份有限公司 Character emotion recognition method, device, terminal equipment and storage medium
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN112686048B (en) * 2020-12-23 2021-11-23 沈阳新松机器人自动化股份有限公司 Emotion recognition method and device based on fusion of voice, semantics and facial expressions
CN112699774B (en) * 2020-12-28 2024-05-24 深延科技(北京)有限公司 Emotion recognition method and device for characters in video, computer equipment and medium
CN112884326A (en) * 2021-02-23 2021-06-01 无锡爱视智能科技有限责任公司 Video interview evaluation method and device based on multi-modal analysis and storage medium
CN113435357B (en) * 2021-06-30 2022-09-02 平安科技(深圳)有限公司 Voice broadcasting method, device, equipment and storage medium
CN113536999B (en) * 2021-07-01 2022-08-19 汇纳科技股份有限公司 Character emotion recognition method, system, medium and electronic device
CN114431970A (en) * 2022-02-25 2022-05-06 上海联影医疗科技股份有限公司 Medical imaging equipment control method, system and equipment
CN115620268A (en) * 2022-12-20 2023-01-17 深圳市徐港电子有限公司 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
CN109522818A (en) * 2018-10-29 2019-03-26 中国科学院深圳先进技术研究院 A kind of method, apparatus of Expression Recognition, terminal device and storage medium
CN109766476A (en) * 2018-12-27 2019-05-17 西安电子科技大学 Video content sentiment analysis method, apparatus, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101317047B1 (en) * 2012-07-23 2013-10-11 충남대학교산학협력단 Emotion recognition appatus using facial expression and method for controlling thereof
CN107633203A (en) * 2017-08-17 2018-01-26 平安科技(深圳)有限公司 Facial emotions recognition methods, device and storage medium
CN109190487A (en) * 2018-08-07 2019-01-11 平安科技(深圳)有限公司 Face Emotion identification method, apparatus, computer equipment and storage medium


Also Published As

Publication number Publication date
CN110414323A (en) 2019-11-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19933006; Country of ref document: EP; Kind code of ref document: A1)
122 Ep: pct application non-entry in european phase (Ref document number: 19933006; Country of ref document: EP; Kind code of ref document: A1)