CN117409396A - Multi-mode emotion recognition method, system, equipment and medium - Google Patents

Multi-mode emotion recognition method, system, equipment and medium

Info

Publication number
CN117409396A
CN117409396A (application CN202311416752.1A)
Authority
CN
China
Prior art keywords
data
emotion recognition
feature
vector
electroencephalogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311416752.1A
Other languages
Chinese (zh)
Inventor
林燕丹
吴磊磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202311416752.1A priority Critical patent/CN117409396A/en
Publication of CN117409396A publication Critical patent/CN117409396A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V 20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a multi-modal emotion recognition method, system, device and medium, and relates to the field of emotion recognition. The method comprises the following steps: acquiring multi-modal data of a driver; framing the electroencephalogram information data and extracting features to obtain an electroencephalogram feature vector; performing gray-level conversion and feature detection on the face image data to obtain a face feature vector; splicing the electroencephalogram feature vector and the face feature vector to obtain a multi-modal fusion vector; performing data enhancement processing on the multi-modal fusion vector with an adversarial network to obtain an enhanced feature fusion vector; and inputting the multi-modal fusion vector and the enhanced feature fusion vector into an emotion recognition model to output an emotion recognition result. The emotion recognition model is constructed using a learning algorithm. The invention can improve the accuracy and response rate of emotion recognition.

Description

Multi-mode emotion recognition method, system, equipment and medium
Technical Field
The present invention relates to the field of emotion recognition, and in particular, to a method, system, device, and medium for multi-modal emotion recognition.
Background
Intelligent driving vehicles and autonomous vehicles are comprehensive systems integrating sensing, computation, decision-making and other functions, and combine technologies from automotive engineering, sensors, automation, artificial intelligence and automated driving. The two main directions of intelligent driving vehicles are the intelligent cabin and intelligent driving. Automated driving mainly concerns perception, interaction and decision-making between the vehicle and its external environment, while the intelligent cabin mainly concerns perception, interaction and decision-making between the vehicle and its interior. By comparison, the intelligent cabin is easier to realize and is already used in many intelligent automobiles. Emotion recognition is one of the typical functions of an intelligent cabin system: the emotional state of the driver can be determined through the emotion recognition function, and the music, lighting, temperature, scent and other aspects of the cabin can then be adjusted automatically and in a targeted manner, improving the driving experience and safety, for example by preventing road rage.
In the prior art, research on driver emotion recognition can be roughly divided into four types according to the modality used: emotion recognition based on speech, on facial images, on the driver's physiological signals, and on driving behaviour. Because changes in the driver's emotion are not reflected in speech in a timely manner, speech can serve only as an auxiliary means rather than a primary recognition method. Recognition based on driving behaviour mainly infers emotion from the interaction between the driver and the various vehicle modules, such as the force applied to the steering wheel or pedals, and can generally also serve only as an auxiliary means. Recognition based on the driver's physiological signals, such as electroencephalogram (EEG) signals, can identify the driver's emotion accurately, but EEG signals are weak and easily disturbed by the driver's movements or the driving environment, making them difficult to use as the sole method in a real environment. Emotion recognition based on facial images is currently the most widely used method and generally recognizes emotion by extracting facial features; however, facial features are easily masked, so emotion cannot always be recognized accurately. Current intelligent cabin systems generally use a single modality for emotion recognition, so the recognition accuracy is not high.
With the development of deep learning, emotion recognition models built on neural networks and related techniques can significantly improve recognition accuracy. However, the amount of data that can be collected during driving is limited, and a long period of data accumulation is generally required before a deep learning recognition model reaches high accuracy, which greatly affects the response rate of emotion recognition.
Therefore, how to improve the accuracy and response rate of emotion recognition to meet the requirements of intelligent cabins is of great importance.
Disclosure of Invention
The invention aims to provide a multi-modal emotion recognition method, system, device and medium that improve the accuracy and response rate of emotion recognition.
In order to achieve the above object, the present invention provides the following solutions:
a method of multimodal emotion recognition, the method comprising:
acquiring multi-modal data of a driver; the multi-modal data includes: electroencephalogram information data and face image data; the electroencephalogram data are electroencephalogram signals acquired by electrodes distributed on the driver's head;
framing the electroencephalogram information data and performing feature extraction to obtain an electroencephalogram feature vector;
performing gray-level conversion and feature detection on the face image data to obtain a face feature vector;
splicing the electroencephalogram feature vector and the face feature vector to obtain a multi-modal fusion vector;
performing data enhancement processing on the multi-modal fusion vector with an adversarial network to obtain an enhanced feature fusion vector;
inputting the multi-modal fusion vector and the enhanced feature fusion vector into an emotion recognition model, and outputting an emotion recognition result; the emotion recognition model is constructed using a learning algorithm; the emotion recognition result is an anger emotion or a non-anger emotion.
Optionally, framing the electroencephalogram information data and extracting features to obtain an electroencephalogram feature vector specifically includes:
preprocessing the electroencephalogram information data with a band-pass filter and blind source separation to obtain processed electroencephalogram information data; the preprocessing includes noise removal and artifact removal;
framing the processed electroencephalogram information data according to a set step size using a sliding window method to obtain a plurality of window data;
extracting features from each window of data to obtain corresponding feature vectors;
and determining all the feature vectors as the electroencephalogram feature vector.
Optionally, the set step size is 0-2 seconds.
Optionally, performing gray-level conversion and feature detection on the face image data to obtain a face feature vector specifically includes:
cropping the face image data to obtain cropped face image data;
performing gray-level conversion on the cropped face image data to obtain gray-level face image data;
and performing feature detection on the gray-level face image data using the histogram of oriented gradients and local binary pattern feature methods to obtain the face feature vector.
Optionally, the method for determining the emotion recognition model specifically includes:
acquiring training data; the training data includes: training multi-modal data and label data; the label data are the emotion recognition results corresponding to the training multi-modal data;
framing the electroencephalogram information data in the training multi-modal data and extracting features to obtain a training electroencephalogram feature vector;
performing gray-level conversion and feature detection on the face image data in the training multi-modal data to obtain a training face feature vector;
splicing the training electroencephalogram feature vector and the training face feature vector to obtain a training multi-modal fusion vector;
performing data enhancement processing on the training multi-modal fusion vector with an adversarial network to obtain a training enhanced feature fusion vector;
dividing the fusion vectors into a training set and a test set; the fusion vectors include: the training enhanced feature fusion vector and the training multi-modal fusion vector;
constructing a learning neural network framework;
inputting the training set into the learning neural network, and training the parameters of the learning network with the goal of minimizing an objective function to obtain a trained learning neural network; the objective function is determined from the error between the output of the learning neural network and the label data corresponding to the training set;
inputting the test set and the label data corresponding to the test set into the trained learning neural network, and adjusting the parameters of the trained learning neural network to obtain an adjusted learning neural network;
and determining the adjusted learning neural network as the emotion recognition model.
Optionally, the method further comprises:
performing emotion adjustment processing on the driver according to the emotion recognition result.
A multi-modal emotion recognition system, the system comprising:
the data acquisition module is used for acquiring multi-modal data of a driver; the multi-modal data includes: electroencephalogram information data and face image data; the electroencephalogram data are electroencephalogram signals acquired by electrodes distributed on the driver's head;
the extraction module is used for framing the electroencephalogram information data and performing feature extraction to obtain an electroencephalogram feature vector;
the detection module is used for performing gray-level conversion and feature detection on the face image data to obtain a face feature vector;
the fusion module is used for splicing the electroencephalogram feature vector and the face feature vector to obtain a multi-modal fusion vector;
the enhancement processing module is used for performing data enhancement processing on the multi-modal fusion vector with an adversarial network to obtain an enhanced feature fusion vector;
the recognition module is used for inputting the multi-modal fusion vector and the enhanced feature fusion vector into an emotion recognition model and outputting an emotion recognition result; the emotion recognition model is constructed using a learning algorithm; the emotion recognition result is an anger emotion or a non-anger emotion.
An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the multimodal emotion recognition method described above.
A computer readable storage medium storing a computer program which when executed by a processor implements the multimodal emotion recognition method described above.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a multi-mode emotion recognition method, a system, equipment and a medium, wherein multi-mode data of a driver are obtained; framing the electroencephalogram information data and extracting features to obtain electroencephalogram feature vectors; carrying out gray level conversion and feature detection on the face image data to obtain a face feature vector; splicing the brain electrical feature vector and the face feature vector to obtain a multi-mode fusion vector; carrying out data enhancement processing on the multi-mode fusion vector by adopting an countermeasure network to obtain an enhancement feature fusion vector; inputting the enhanced feature fusion vector into an emotion recognition model, and outputting an emotion recognition result; because the emotion recognition model is constructed by adopting a learning algorithm and the multi-mode fusion vector is subjected to data enhancement processing, the invention can improve the accuracy and response rate of emotion recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a multi-modal emotion recognition method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the steps of the multi-modal emotion recognition method provided by an embodiment of the present invention in practical application;
FIG. 3 is a schematic diagram illustrating a processing procedure of a multimodal fusion vector according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the splicing process that produces the multi-modal fusion vector according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a countermeasure network processing procedure according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of emotion recognition and emotional intervention provided by an embodiment of the present invention;
FIG. 7 is a block diagram of a multi-modal emotion recognition system according to an embodiment of the present invention.
Symbol description:
the device comprises a data acquisition module-1, an extraction module-2, a detection module-3, a fusion module-4, an enhancement processing module-5 and an identification module-6.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a multi-modal emotion recognition method, system, device and medium that improve the accuracy and response rate of emotion recognition.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
As shown in fig. 1, an embodiment of the present invention provides a multi-modal emotion recognition method, which includes:
step 100: multimodal data of the driver is obtained. Wherein the multi-modal data includes: electroencephalogram information data and face image data; the electroencephalogram data is acquired by adopting electrodes distributed on the head of a driver.
Step 200: the electroencephalogram information data are framed and features are extracted to obtain an electroencephalogram feature vector.
Framing the electroencephalogram information data and extracting features to obtain the electroencephalogram feature vector specifically includes the following steps:
preprocessing the electroencephalogram information data with a band-pass filter and blind source separation to obtain processed electroencephalogram information data; the preprocessing includes noise removal and artifact removal.
The processed electroencephalogram information data are framed according to a set step size using a sliding window method to obtain a plurality of window data. The set step size is 0-2 seconds.
Features are extracted from each window of data to obtain the corresponding feature vectors.
All the feature vectors are determined as the electroencephalogram feature vector.
In practical application, the electrodes relevant to emotion recognition can be selected from all the electrodes arranged on the driver's head, which reduces the computational complexity of the electroencephalogram signal samples; for example, electrodes such as FP1, FP2, F3 and C4 are selected.
The electroencephalogram data of the selected channels are framed: using a Hanning window, Hamming window, triangular window or similar technique, the data are divided into a plurality of frames with a window of 0-2 seconds, features are extracted from each frame, and an electroencephalogram feature vector is formed.
The electroencephalogram feature vector includes vectors corresponding to time-domain features, frequency-domain features and time-frequency features. The invention preferably uses the differential entropy (DE) feature vector of the electroencephalogram signal.
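As a non-limiting illustration, the sliding-window framing and differential-entropy extraction described above may be sketched as follows; the sampling rate, 2-second window, 1-second step, channel count and use of NumPy are assumptions chosen only for illustration and are not requirements of the invention.

    import numpy as np

    def frame_eeg(eeg, fs, win_sec=2.0, step_sec=1.0):
        # Split a (channels, samples) EEG array into overlapping sliding windows.
        win, step = int(win_sec * fs), int(step_sec * fs)
        frames = [eeg[:, s:s + win] for s in range(0, eeg.shape[1] - win + 1, step)]
        return np.stack(frames) * np.hanning(win)      # apply a Hanning window to each frame

    def differential_entropy(frame):
        # DE of an approximately Gaussian signal: 0.5 * ln(2 * pi * e * variance), per channel.
        return 0.5 * np.log(2 * np.pi * np.e * np.var(frame, axis=-1))

    # Hypothetical example: 4 selected channels (e.g. FP1, FP2, F3, C4) sampled at 200 Hz.
    eeg = np.random.randn(4, 200 * 60)                  # placeholder for filtered, artifact-free EEG
    frames = frame_eeg(eeg, fs=200)                     # shape (n_frames, 4, 400)
    eeg_feature_vectors = np.array([differential_entropy(f) for f in frames])   # (n_frames, 4)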
Step 300: gray-level conversion and feature detection are performed on the face image data to obtain a face feature vector.
Performing gray-level conversion and feature detection on the face image data to obtain the face feature vector specifically includes the following steps:
cropping the face image data to obtain cropped face image data; performing gray-level conversion on the cropped face image data to obtain gray-level face image data; and performing feature detection on the gray-level face image data using the histogram of oriented gradients and local binary pattern feature methods to obtain the face feature vector.
In practical application, the face image data are cropped to remove information irrelevant to the face region. The face image data may be extracted from a video sequence.
The face in the image can be detected using the Viola-Jones technique; the detected face is cropped and converted into a gray-level image of a fixed size, from which features are extracted.
The Viola-Jones technique is a real-time object detection method mainly used for face detection. It extracts a feature matrix from the picture, selects features and trains classifiers with the AdaBoost machine learning algorithm, and uses a cascade architecture to identify face regions quickly with a low missed-detection/false-detection rate.
Feature detection is performed on the gray-level face image data to obtain information or features in the face image that may be related to emotion recognition. After cropping and conversion to a gray-level map, all pixels lie in the gray-level range 0-255, and under different emotional states the different structures of the face, such as the eyes, cheeks, eyebrows, nose and mouth, exhibit different gray-level patterns. From the gray-level variations of these structures, corresponding features can be obtained and used to recognize emotion.
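As a non-limiting sketch of this face pipeline (Viola-Jones detection, cropping, gray-level conversion, HOG and LBP features), the following uses OpenCV's Haar-cascade detector and scikit-image feature functions; the 64x64 crop size, HOG/LBP parameters and the file name are illustrative assumptions.

    import cv2
    import numpy as np
    from skimage.feature import hog, local_binary_pattern

    # Viola-Jones face detector shipped with OpenCV (Haar cascade).
    detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def face_feature_vector(bgr_frame, size=(64, 64)):
        # Detect the face, crop it, convert to gray level, and return concatenated HOG + LBP features.
        gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None                                  # no face found in this frame
        x, y, w, h = faces[0]
        face = cv2.resize(gray[y:y + h, x:x + w], size)  # cropped gray-level face image, pixels 0-255
        hog_feat = hog(face, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        lbp = local_binary_pattern(face, P=8, R=1, method="uniform")
        lbp_hist, _ = np.histogram(lbp, bins=np.arange(0, 11), density=True)   # 10-bin uniform-LBP histogram
        return np.concatenate([hog_feat, lbp_hist])

    frame = cv2.imread("driver_frame.jpg")               # hypothetical frame taken from a cabin camera video
    face_feat = None if frame is None else face_feature_vector(frame)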
Step 400: the electroencephalogram feature vector and the face feature vector are spliced to obtain the multi-modal fusion vector.
In practical application, the electroencephalogram feature vector extracted from each 0-2 second frame and the face feature vector extracted from the corresponding 0-2 second frame are spliced to obtain the multi-modal fusion vector.
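A minimal sketch of this splicing step, assuming the per-frame feature vectors of the two modalities have been computed as above (the dimensions are placeholders, not specified by the invention):

    import numpy as np

    eeg_feat = np.random.randn(4)        # e.g. DE values of 4 selected channels for one frame (placeholder)
    face_feat = np.random.randn(1774)    # e.g. HOG + LBP features for the matching frame (placeholder)

    fusion_vector = np.concatenate([eeg_feat, face_feat])   # multi-modal fusion vector for this frame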
Step 500: data enhancement processing is performed on the multi-modal fusion vector with an adversarial network to obtain an enhanced feature fusion vector.
Based on the multi-modal fusion vector, additional high-quality feature fusion data are generated by a deep learning method. Because the data enhancement is performed after the feature fusion stage rather than on the raw data, the amount of data to be processed is greatly reduced and the speed of data enhancement is significantly improved.
Generative adversarial networks (GAN) and derived models such as the Wasserstein GAN are preferably used.
GAN is a deep learning architecture often used to generate data that resembles real data. A standard GAN consists of two competing deep neural network components: a generator and a discriminator. Given an input noise variable, the generator G produces data that resembles the real data, and the discriminator attempts to decide whether a sample comes from the generated data or from the real data. The adversarial training between the generator and the discriminator can be expressed as the minimax problem

\min_{\theta_g} \max_{\theta_d} V(P_r, P_g) = \mathbb{E}_{x_r \sim P_r}[\log D(x_r)] + \mathbb{E}_{x_g \sim P_g}[\log(1 - D(x_g))]

where θ_g and θ_d are the parameters of the generator and the discriminator; P_r is the distribution of the real feature fusion data; P_g is the distribution implicitly defined by x_g = G(x_z), with x_z a noise sample typically drawn from a uniform or Gaussian distribution; E_{x_r∼P_r}[·] is the expectation over the feature fusion data x_r; D(x_r) is the probability that the discriminator identifies x_r as feature fusion data; D(x_g) is the probability that the discriminator identifies x_g as feature fusion data; V(P_r, P_g) is the cross-entropy loss; and E denotes expectation.
Through adversarial training against the discriminator, the generator is able to generate a large number of high-quality feature fusion vectors, as shown in FIG. 5. The quality of the generated data may be screened using the loss conditions of the generator and the discriminator.
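The adversarial data-enhancement idea can be sketched as follows with PyTorch; the multilayer-perceptron sizes, optimizer settings and the use of the standard (non-Wasserstein) GAN loss are assumptions chosen only to illustrate the generator/discriminator training described above.

    import torch
    import torch.nn as nn

    feat_dim, noise_dim = 1778, 32        # hypothetical fusion-vector and noise dimensions

    generator = nn.Sequential(            # G: noise -> synthetic feature fusion vector
        nn.Linear(noise_dim, 256), nn.ReLU(),
        nn.Linear(256, feat_dim))
    discriminator = nn.Sequential(        # D: fusion vector -> probability of being real fusion data
        nn.Linear(feat_dim, 256), nn.LeakyReLU(0.2),
        nn.Linear(256, 1), nn.Sigmoid())

    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def train_step(real_batch):
        n = real_batch.size(0)
        ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)
        # Discriminator step: tell real fusion vectors apart from generated ones.
        fake = generator(torch.randn(n, noise_dim)).detach()
        d_loss = bce(discriminator(real_batch), ones) + bce(discriminator(fake), zeros)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Generator step: try to make the discriminator accept generated vectors.
        g_loss = bce(discriminator(generator(torch.randn(n, noise_dim))), ones)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()

    # After training, enhanced feature fusion vectors can be sampled from the generator, e.g.:
    # enhanced = generator(torch.randn(1000, noise_dim)).detach()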
Step 600: the multi-modal fusion vector and the enhanced feature fusion vector are input into the emotion recognition model, and an emotion recognition result is output. The emotion recognition model is constructed using a learning algorithm; the emotion recognition result is an anger emotion or a non-anger emotion.
The method for determining the emotion recognition model specifically includes the following steps:
Training data are acquired; the training data include training multi-modal data and label data; the label data are the emotion recognition results corresponding to the training multi-modal data.
The electroencephalogram information data in the training multi-modal data are framed and features extracted to obtain a training electroencephalogram feature vector.
Gray-level conversion and feature detection are performed on the face image data in the training multi-modal data to obtain a training face feature vector.
The training electroencephalogram feature vector and the training face feature vector are spliced to obtain a training multi-modal fusion vector.
Data enhancement processing is performed on the training multi-modal fusion vector with an adversarial network to obtain a training enhanced feature fusion vector.
The fusion vectors are divided into a training set and a test set; the fusion vectors include the training enhanced feature fusion vector and the training multi-modal fusion vector.
A learning neural network framework is constructed; the training set is input into the learning neural network, and the parameters of the learning network are trained with the goal of minimizing an objective function to obtain a trained learning neural network; the objective function is determined from the error between the output of the learning neural network and the label data corresponding to the training set.
The test set and the label data corresponding to the test set are input into the trained learning neural network, and the parameters of the trained learning neural network are adjusted to obtain an adjusted learning neural network.
The adjusted learning neural network is determined as the emotion recognition model.
As shown in FIG. 6, in practical application, the feature fusion data Pr and the generated high-quality feature fusion data Pg may be input into the emotion recognition model together. The emotion recognition model uses a deep learning or machine learning algorithm, such as a support vector machine (SVM), random forest (RF) or K-nearest neighbours (KNN), to classify the feature fusion data and outputs the emotion recognition result.
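As a non-limiting sketch of this recognition step, the snippet below trains a scikit-learn SVM on real fusion vectors Pr together with generated vectors Pg; the data here are random placeholders, and the feature dimension, train/test split and RBF kernel are illustrative assumptions.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    real_X = rng.normal(size=(200, 1778))    # placeholder for real multi-modal fusion vectors (Pr)
    gen_X = rng.normal(size=(800, 1778))     # placeholder for GAN-generated fusion vectors (Pg)
    real_y = rng.integers(0, 2, 200)         # 1 = anger emotion, 0 = non-anger emotion
    gen_y = rng.integers(0, 2, 800)          # labels associated with the generated vectors (assumed known)

    X = np.vstack([real_X, gen_X])
    y = np.concatenate([real_y, gen_y])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = SVC(kernel="rbf").fit(X_train, y_train)        # emotion recognition model (SVM)
    print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))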
In one embodiment, the method further comprises: and carrying out emotion adjustment processing on the driver according to the emotion recognition result.
In practical application, the cabin system can take intervention actions according to the result output by the emotion recognition model. For example, when an abnormal emotion of the driver is detected, an early warning signal can be issued in time and the driver's emotion can be regulated by various means. Common emotion adjustment methods include video, music, lighting, temperature, scent and the like.
In practical application, as shown in FIG. 2, the multi-modal emotion recognition method includes: acquiring raw multi-modal data of a driver or passenger, preprocessing the raw data, performing multi-modal data enhancement, obtaining an emotion recognition result through the emotion recognition model, and performing intervention.
The processing of the multi-modal fusion vector is illustrated in FIG. 3: the electroencephalogram information data and the face image data are each preprocessed and their features extracted, and feature fusion Pr is finally performed to obtain the multi-modal fusion vector.
FIG. 4 is a schematic diagram of the process of splicing the electroencephalogram feature vector and the face feature vector to obtain the multi-modal fusion vector.
For the face feature detection, the histogram of oriented gradients (HOG) and local binary pattern (LBP) feature methods are preferably employed.
Example 2
As shown in fig. 7, an embodiment of the present invention provides a multi-modal emotion recognition system, including: the device comprises a data acquisition module 1, an extraction module 2, a detection module 3, a fusion module 4, an enhancement processing module 5 and an identification module 6.
The data acquisition module 1 is used for acquiring multi-modal data of a driver; the multi-modal data include electroencephalogram information data and face image data; the electroencephalogram data are acquired by electrodes distributed on the driver's head.
The extraction module 2 is used for framing the electroencephalogram information data and performing feature extraction to obtain an electroencephalogram feature vector.
The detection module 3 is used for performing gray-level conversion and feature detection on the face image data to obtain a face feature vector.
The fusion module 4 is used for splicing the electroencephalogram feature vector and the face feature vector to obtain a multi-modal fusion vector.
The enhancement processing module 5 is used for performing data enhancement processing on the multi-modal fusion vector with an adversarial network to obtain an enhanced feature fusion vector.
The recognition module 6 is used for inputting the multi-modal fusion vector and the enhanced feature fusion vector into an emotion recognition model and outputting an emotion recognition result; the emotion recognition model is constructed using a learning algorithm; the emotion recognition result is an anger emotion or a non-anger emotion.
Example 3
The embodiment of the invention provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic equipment to execute the multi-mode emotion recognition method in the embodiment 1.
As an alternative embodiment, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the multimodal emotion recognition method of embodiment 1.
The invention acquires multi-modal data, performs data preprocessing and feature extraction, carries out feature fusion and feature fusion enhancement, builds an emotion recognition model on a multi-modal network, outputs emotion recognition results, and finally provides a basis for emotion intervention in the intelligent cabin. Because the emotion recognition model has high requirements on data volume, in order to recognize emotion rapidly the invention uses neural networks such as the adversarial network to generate high-quality enhanced feature fusion vectors, which can be input into the emotion recognition model together with the original fusion features, thereby improving the recognition accuracy and speed of the model. The driver emotion recognition method can significantly improve the accuracy and response speed of emotion recognition and can contribute to driver emotion recognition and emotion intervention in the intelligent cabin.
In short, the invention provides a multi-modal feature extraction and fusion method in which the fused features undergo data enhancement processing through a deep learning algorithm, and the driver's emotion is rapidly recognized by the emotion recognition model, providing a basis for emotion adjustment by the intelligent cabin. The invention can recognize emotion accurately and, even with little multi-modal data collected over a short time, improves recognition accuracy and response speed through simultaneous data enhancement of multiple modalities.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The principles and embodiments of the present invention have been described herein with reference to specific examples, which are intended only to assist in understanding the method of the present invention and its core ideas; meanwhile, those of ordinary skill in the art may, based on the ideas of the present invention, make changes to the specific implementation and application scope. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (9)

1. A method of multimodal emotion recognition, the method comprising:
acquiring multi-modal data of a driver; the multi-modal data includes: electroencephalogram information data and face image data; the electroencephalogram data are electroencephalogram signals acquired by electrodes distributed on the driver's head;
framing the electroencephalogram information data and performing feature extraction to obtain an electroencephalogram feature vector;
performing gray-level conversion and feature detection on the face image data to obtain a face feature vector;
splicing the electroencephalogram feature vector and the face feature vector to obtain a multi-modal fusion vector;
performing data enhancement processing on the multi-modal fusion vector with an adversarial network to obtain an enhanced feature fusion vector;
inputting the multi-modal fusion vector and the enhanced feature fusion vector into an emotion recognition model, and outputting an emotion recognition result; the emotion recognition model is constructed using a learning algorithm; the emotion recognition result is an anger emotion or a non-anger emotion.
2. The multi-modal emotion recognition method according to claim 1, wherein framing the electroencephalogram information data and extracting features to obtain the electroencephalogram feature vector specifically comprises:
preprocessing the electroencephalogram information data with a band-pass filter and blind source separation to obtain processed electroencephalogram information data; the preprocessing includes noise removal and artifact removal;
framing the processed electroencephalogram information data according to a set step size using a sliding window method to obtain a plurality of window data;
extracting features from each window of data to obtain corresponding feature vectors;
and determining all the feature vectors as the electroencephalogram feature vector.
3. A multimodal emotion recognition method according to claim 2, wherein the set step size is 0-2 seconds.
4. The multi-modal emotion recognition method according to claim 1, wherein performing gray-level conversion and feature detection on the face image data to obtain the face feature vector specifically comprises:
cropping the face image data to obtain cropped face image data;
performing gray-level conversion on the cropped face image data to obtain gray-level face image data;
and performing feature detection on the gray-level face image data using the histogram of oriented gradients and local binary pattern feature methods to obtain the face feature vector.
5. The multi-modal emotion recognition method according to claim 1, wherein the method for determining the emotion recognition model specifically comprises:
acquiring training data; the training data includes: training multi-modal data and label data; the label data are the emotion recognition results corresponding to the training multi-modal data;
framing the electroencephalogram information data in the training multimodal data and extracting the characteristics to obtain a training electroencephalogram characteristic vector;
performing gray-level conversion and feature detection on the face image data in the training multi-modal data to obtain a training face feature vector;
splicing the training electroencephalogram feature vector and the training face feature vector to obtain a training multi-mode fusion vector;
performing data enhancement processing on the training multi-modal fusion vector with an adversarial network to obtain a training enhanced feature fusion vector;
dividing the fusion vectors into a training set and a test set; the fusion vectors include: the training enhanced feature fusion vector and the training multi-modal fusion vector;
constructing a learning neural network framework;
inputting the training set into the learning neural network, and training the parameters of the learning network with the goal of minimizing an objective function to obtain a trained learning neural network; the objective function is determined from the error between the output of the learning neural network and the label data corresponding to the training set;
inputting the test set and the label data corresponding to the test set into a trained learning neural network, and adjusting parameters of the trained learning neural network to obtain an adjusted learning neural network;
and determining the adjusted learning neural network as the emotion recognition model.
6. A multi-modal emotion recognition method as claimed in claim 1, further comprising:
performing emotion adjustment processing on the driver according to the emotion recognition result.
7. A multi-modal emotion recognition system, the system comprising:
the data acquisition module is used for acquiring multi-modal data of a driver; the multi-modal data includes: electroencephalogram information data and face image data; the electroencephalogram data are electroencephalogram signals acquired by electrodes distributed on the driver's head;
the extraction module is used for framing the electroencephalogram information data and performing feature extraction to obtain an electroencephalogram feature vector;
the detection module is used for performing gray-level conversion and feature detection on the face image data to obtain a face feature vector;
the fusion module is used for splicing the electroencephalogram feature vector and the face feature vector to obtain a multi-modal fusion vector;
the enhancement processing module is used for performing data enhancement processing on the multi-modal fusion vector with an adversarial network to obtain an enhanced feature fusion vector;
the recognition module is used for inputting the multi-modal fusion vector and the enhanced feature fusion vector into an emotion recognition model and outputting an emotion recognition result; the emotion recognition model is constructed using a learning algorithm; the emotion recognition result is an anger emotion or a non-anger emotion.
8. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the multimodal emotion recognition method of any of claims 1 to 6.
9. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the multimodal emotion recognition method as claimed in any one of claims 1 to 6.
CN202311416752.1A 2023-10-30 2023-10-30 Multi-mode emotion recognition method, system, equipment and medium Pending CN117409396A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311416752.1A CN117409396A (en) 2023-10-30 2023-10-30 Multi-mode emotion recognition method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311416752.1A CN117409396A (en) 2023-10-30 2023-10-30 Multi-mode emotion recognition method, system, equipment and medium

Publications (1)

Publication Number Publication Date
CN117409396A 2024-01-16

Family

ID=89486807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311416752.1A Pending CN117409396A (en) 2023-10-30 2023-10-30 Multi-mode emotion recognition method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN117409396A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118141377A (en) * 2024-05-10 2024-06-07 吉林大学 Negative emotion monitoring system and method for patient
CN118141377B (en) * 2024-05-10 2024-07-09 吉林大学 Negative emotion monitoring system and method for patient


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination