CN115620268A - Multi-modal emotion recognition method and device, electronic equipment and storage medium - Google Patents

Multi-modal emotion recognition method and device, electronic equipment and storage medium

Info

Publication number
CN115620268A
CN115620268A (Application No. CN202211636214.9A)
Authority
CN
China
Prior art keywords
driver
emotion
video data
emotion recognition
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211636214.9A
Other languages
Chinese (zh)
Inventor
李少君
汪骏
张富国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xugang Electronics Co ltd
Original Assignee
Shenzhen Xugang Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xugang Electronics Co ltd filed Critical Shenzhen Xugang Electronics Co ltd
Priority to CN202211636214.9A priority Critical patent/CN115620268A/en
Publication of CN115620268A publication Critical patent/CN115620268A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of emotion recognition, and in particular to a multi-modal emotion recognition method and device, electronic equipment and a storage medium. The speech and images to be processed are aligned in the time dimension through data labeling; within the multi-modal emotion recognition model, a resnet18 network and a speech model fuse the data of the two modalities at the feature level; finally, an LSTM network captures the context information in the data and outputs an emotion two-dimensional value, from which the current emotion of the driver is obtained.

Description

Multi-modal emotion recognition method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of emotion recognition, in particular to a multi-modal emotion recognition method and device, electronic equipment and a storage medium.
Background
An intelligent vehicle is a comprehensive system integrating environmental perception, planning and decision-making, multi-level driver assistance and other functions; it concentrates technologies such as computing, modern sensing, information fusion, communication, artificial intelligence and automatic control, and is a typical high-tech complex. At present, intelligent automobiles are mainly developing in two directions: the intelligent cockpit and intelligent driving. The intelligent cockpit is relatively easy to implement and cost-effective, and has become the first scenario to land on the path to intelligence. Emotion recognition is one of the basic functions of an intelligent cockpit system: by recognizing the driver's emotional characteristics, the system can automatically adjust music, lighting and other cockpit settings and, most importantly, can help prevent "road rage", thereby improving the driving experience and safety.
In the prior art, intelligent cockpit systems generally perform emotion recognition with a single modality; for example, a camera collects only the driver's facial features to recognize emotion, which leads to low recognition accuracy. Although multi-modal emotion recognition exists in other application fields, the computational cost of fusing multiple modalities is too high, making those methods unsuitable for the mobile terminal of an intelligent vehicle.
Disclosure of Invention
In view of the above, an object of the present application is to provide a multi-modal emotion recognition method, apparatus, electronic device and storage medium, which recognize the emotion of a driver through a multi-modal emotion recognition model with a small parameter count, improving emotion recognition accuracy while meeting the requirements of mobile-terminal deployment in an intelligent cockpit.
The embodiment of the application provides a multi-modal emotion recognition method, which is applied to an automobile cabin system and comprises the following steps:
acquiring video data of a driver;
pre-processing the video data, comprising: extracting image data containing the face of a driver from the video data, extracting audio data from the video data, and extracting the Mel cepstrum coefficient characteristics of the audio data;
based on a trained multi-modal emotion recognition model, obtaining an emotion two-dimensional value related to the driver according to the input image data and the Mel cepstrum coefficient characteristics, and obtaining the current emotion of the driver according to the emotion two-dimensional value, wherein the emotion two-dimensional value comprises an emotion intensity value and an emotion positive degree value.
In some embodiments, the extracting image data including a face of the driver from the video data includes:
extracting a frame of picture from the video data according to a set time interval;
carrying out face detection on each acquired picture and acquiring image data containing the face of a driver, wherein the method comprises the following steps: acquiring the coordinates of the face of the driver in each picture; cutting the picture based on the acquired coordinates of the face of the driver to obtain a face block image of the driver; and carrying out size conversion and normalization processing on the obtained face block image of the driver to obtain image data containing the face of the driver.
In some embodiments, mel-frequency cepstral coefficient features of the audio data are extracted by:
dividing the audio data according to a set time interval, and performing Fourier transform on the audio data of each divided time interval section to obtain a corresponding signal spectrum;
passing the obtained signal spectrum through a Mel filter to obtain a Mel spectrum;
and carrying out cepstrum analysis on the obtained Mel frequency spectrum to obtain a Mel cepstrum coefficient characteristic.
In some embodiments, the multi-modal emotion recognition model comprises a resnet18 network, a convolution module and an LSTM network, and the obtaining of the emotion two-dimensional value about the driver according to the input image data and the mel-frequency cepstrum coefficient features based on the trained multi-modal emotion recognition model comprises the following steps:
inputting the extracted image data containing the face of the driver into a resnet18 network to obtain an image feature map;
inputting the extracted Mel cepstrum coefficient characteristics of the audio data into a four-layer convolution module to obtain a voice feature map;
splicing the obtained image feature map and the obtained voice feature map through a Concat function to obtain feature data of the fused image and voice;
and inputting the feature data of the fused image and voice into a full connection layer of the multi-modal emotion recognition model after passing through the two-layer LSTM network so as to output an emotion two-dimensional value about the driver.
In some embodiments, the multimodal emotion recognition model is trained by:
collecting a plurality of video data samples containing the faces of drivers;
dividing each acquired video data sample according to a set time interval, and labeling emotion two-dimensional values of the video data of each divided time interval section;
and taking a video data sample containing emotion two-dimensional value marks as a video data training set, training the multi-modal emotion recognition model by using the video data training set until a loss function of the multi-modal emotion recognition model is smaller than a set threshold value, and obtaining the trained multi-modal emotion recognition model.
In some embodiments, several samples of video data containing a driver's face are acquired by:
collecting a plurality of video data samples containing the face of a driver based on monitoring equipment in an automobile cabin;
or acquiring a plurality of video data samples containing the faces of the drivers based on network crawling.
In some embodiments, the current emotion of the driver is classified, based on the emotion two-dimensional value, as one or more of happy, excited, nervous, angry, depressed, bored, tired, calm, relaxed and satisfied.
The embodiment of the application provides a multi-modal emotion recognition device, is applied to car cockpit system, the device includes:
the acquisition module is used for acquiring video data of a driver based on monitoring equipment in a vehicle cabin;
the preprocessing module is used for preprocessing the video data and comprises: extracting image data containing the face of a driver from the video data, extracting audio data from the video data, and extracting the Mel cepstrum coefficient characteristics of the audio data;
and the recognition module is used for obtaining an emotion two-dimensional value related to the driver according to the input image data and the Mel cepstrum coefficient characteristics based on a trained multi-modal emotion recognition model so as to obtain the current emotion of the driver according to the emotion two-dimensional value, wherein the emotion two-dimensional value comprises an emotion intensity degree value and an emotion positive degree value.
An electronic device provided in an embodiment of the present application includes a processor, a memory and a bus, where the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate with each other through the bus, and when the machine-readable instructions are executed by the processor, the steps of any one of the multi-modal emotion recognition methods described above are performed.
A computer-readable storage medium is provided in an embodiment of the present application, and has a computer program stored thereon, where the computer program is executed by a processor to perform any one of the steps of the multimodal emotion recognition method described above.
According to the multi-modal emotion recognition method and device, the electronic equipment and the storage medium, video data of a driver is acquired; the video data is preprocessed, including extracting image data containing the face of the driver from the video data, extracting audio data from the video data, and extracting the Mel cepstrum coefficient characteristics of the audio data; based on a trained multi-modal emotion recognition model, an emotion two-dimensional value of the driver is obtained from the input image data and the Mel cepstrum coefficient characteristics, and the current emotion of the driver is obtained from the emotion two-dimensional value. The multi-modal emotion recognition model fuses the video modality and the voice modality, has a small parameter count, and meets the application requirements of the intelligent cockpit's mobile terminal.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 shows a flow chart of a multimodal emotion recognition method provided by an embodiment of the present application;
FIG. 2 is a flow chart illustrating a process of extracting image data containing a face of a driver from video data according to an embodiment of the present application;
fig. 3 shows a flowchart of extracting mel-frequency cepstral coefficient features of audio data according to an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of a multi-modal emotion recognition model provided by an embodiment of the present application;
FIG. 5 shows a flow chart for obtaining a two-dimensional value of emotion related to a driver based on a trained multi-modal emotion recognition model provided by an embodiment of the application;
FIG. 6 is a schematic diagram illustrating emotion classification based on emotion two-dimensional values provided in an embodiment of the present application;
FIG. 7 shows a flowchart for training a multi-modal emotion recognition model provided by an embodiment of the present application;
fig. 8 is a block diagram illustrating a multi-modal emotion recognition apparatus provided in an embodiment of the present application;
fig. 9 shows a block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Further, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. In addition, one skilled in the art, under the guidance of the present disclosure, may add one or more other operations to the flowchart, or may remove one or more operations from the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
The automobile is gradually changing from a simple means of transport into an intelligent terminal, and the cabin has become a key factor in consumers' purchase and use decisions. Today's automobile cabin is no longer a simple riding space: through networking and AI, the interaction between people and the vehicle is much richer, and the emotion recognition function in particular can greatly improve the user experience. Compared with single-modal emotion recognition, multi-modal emotion recognition offers stronger robustness and higher accuracy. However, when existing intelligent cockpits implement the emotion recognition function, data from several different modalities are simply concatenated and fused directly; correlation information between modalities is not considered, and the information redundancy makes the model prone to overfitting. Moreover, since data of different modalities lie in different spaces, direct fusion is not only technically difficult but also brings the curse of dimensionality and model convergence problems, so the computational cost is too high and model deployment on a mobile terminal is difficult. Based on this, the present application provides a multi-modal emotion recognition method and device, an electronic device and a storage medium, which recognize the emotion of the driver through a multi-modal emotion recognition model with a small parameter count, improving emotion recognition accuracy and meeting the mobile-terminal application requirements of the intelligent cockpit.
Referring to the accompanying drawing 1, in one embodiment, the application provides a multi-modal emotion recognition method applied to an automobile cabin system, and the method comprises the following steps:
s1, acquiring video data of a driver;
in order to clearly understand the technical solution of the embodiment of the present invention, an application scenario may be first exemplarily described. In the application, the multi-mode emotion recognition method is applied to an automobile cabin system, the automobile cabin system is provided with hardware devices such as a camera, a loudspeaker and a microphone, and a vehicle mainboard is used as a control center to perform man-machine interaction.
In step S1, video data of the driver may be obtained through a camera configured in the vehicle cabin system, where the video data obtained by the camera includes not only face information of the driver but also voice information of the driver.
S2, preprocessing the video data, including: extracting image data containing the face of a driver from the video data, extracting audio data from the video data, and extracting the Mel cepstrum coefficient characteristics of the audio data;
in step S2, the image data and the audio data in the video data obtained in step S1 are respectively preprocessed, so that a more accurate emotion recognition result is obtained through a multi-modal emotion recognition model in the following step.
Specifically, referring to fig. 2 of the specification, the method for extracting image data containing a face of a driver from the video data includes the following steps:
S201, extracting a frame of picture from the video data according to a set time interval;
S202, carrying out face detection on each acquired picture and acquiring image data containing the face of the driver, comprising the following steps: acquiring the coordinates of the face of the driver in each picture; cutting the picture based on the acquired coordinates of the face of the driver to obtain a face block image of the driver; and carrying out size conversion and normalization processing on the obtained face block image of the driver to obtain image data containing the face of the driver.
For example, if the duration of the video data acquired in step S1 is 100 seconds, one picture may be extracted every 10 seconds, i.e. at 10 s, 20 s, ..., 100 s, and face detection is then performed on the 10 pictures obtained. Specifically, the coordinates of the driver's face are detected in each picture; these coordinates include at least three items, the first representing a vertex coordinate of the driver's face region and the second and third representing the length and width of the face region, so that a face block image of the driver can be cropped out based on these three items. Size transformation and normalization are then performed on the 10 face block images of the driver to obtain the image data containing the driver's face. Size transformation and normalization are technical means well known to those skilled in the art and are not computationally expensive, so they are not described in detail here.
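As an illustration only, a minimal Python sketch of this face preprocessing is given below. It assumes OpenCV is available and uses a Haar-cascade detector as a stand-in for whatever face detection model the cabin system actually employs; the 10-second interval and the 224×224 output size are example settings rather than values fixed by this application.

    import cv2
    import numpy as np

    def extract_face_images(video_path, interval_s=10, size=(224, 224)):
        """Sample one frame every interval_s seconds, crop the driver's face,
        then resize each crop and normalize it to [0, 1]."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        step = max(1, int(fps * interval_s))
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        faces, frame_idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % step == 0:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                boxes = detector.detectMultiScale(gray, 1.1, 5)
                if len(boxes) > 0:
                    x, y, w, h = boxes[0]          # top-left vertex plus width and height
                    crop = frame[y:y + h, x:x + w]
                    crop = cv2.resize(crop, size).astype(np.float32) / 255.0
                    faces.append(crop)
            frame_idx += 1
        cap.release()
        return np.stack(faces) if faces else np.empty((0, size[1], size[0], 3))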
When the audio data of the driver contained in the video data is preprocessed, referring to the attached figure 3 of the specification, the mel frequency cepstrum coefficient characteristics of the audio data are extracted in the following way:
S203, segmenting the audio data according to a set time interval, and performing Fourier transform on the audio data of each segmented time interval segment to obtain a corresponding signal frequency spectrum;
S204, passing the obtained signal spectrum through a Mel filter to obtain a Mel spectrum;
S205, performing cepstrum analysis on the obtained Mel frequency spectrum to obtain a Mel cepstrum coefficient characteristic.
For example, Mel cepstrum coefficient feature extraction is performed on all the audio data within each N-second segment. If the audio sampling rate is S, the number of audio frames per segment is audio = int(S × N); that is, within every N seconds the Mel cepstrum coefficient feature data of these audio frames are obtained, and all of them are stored as the input of the multi-modal emotion recognition model in step S3. The Mel frequency cepstrum coefficient is a speech feature commonly used in speech signal processing, and extracting it according to the above steps is not complicated.
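For illustration, a minimal sketch of this segment-wise Mel cepstrum coefficient extraction is given below. It assumes the librosa library; the segment length N and the number of coefficients are example values, not parameters specified by this application.

    import librosa
    import numpy as np

    def extract_mfcc_segments(audio_path, n_seconds=10, n_mfcc=40):
        """Split the audio into N-second segments and compute MFCC features per
        segment (FFT -> Mel filter bank -> cepstral analysis, as in S203-S205)."""
        y, sr = librosa.load(audio_path, sr=None)   # keep the original sample rate S
        seg_len = int(sr * n_seconds)               # audio = int(S x N) samples per segment
        segments = []
        for start in range(0, len(y) - seg_len + 1, seg_len):
            seg = y[start:start + seg_len]
            mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
            segments.append(mfcc.T)                 # time-major for the downstream model
        return np.stack(segments) if segments else np.empty((0, 0, n_mfcc))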
And S3, based on the trained multi-modal emotion recognition model, obtaining an emotion two-dimensional value related to the driver according to the input image data and the Mel cepstrum coefficient characteristics, and obtaining the current emotion of the driver according to the emotion two-dimensional value, wherein the emotion two-dimensional value comprises an emotion intensity degree value and an emotion positive degree value.
In step S3, the multi-modal emotion recognition model is shown in fig. 4, and includes a resnet18 network, a convolution module, and an LSTM network. Furthermore, referring to fig. 5 of the specification, when performing emotion recognition based on the trained multi-modal emotion recognition model, the method comprises the following steps:
S301, inputting the extracted image data containing the face of the driver into a resnet18 network to obtain an image feature map;
S302, inputting the extracted Mel cepstrum coefficient characteristics of the audio data into a four-layer convolution module to obtain a voice feature map;
S303, splicing the obtained image feature map and the obtained voice feature map through a Concat function to obtain feature data of the fused image and voice;
and S304, inputting the feature data of the fused image and the voice into a full connection layer of the multi-modal emotion recognition model after passing through the two-layer LSTM network so as to output an emotion two-dimensional value related to the driver.
First, the image data obtained by preprocessing in step S2 is input into the resnet18 network of the multi-modal emotion recognition model to output an image feature map, and the Mel cepstrum coefficient features obtained by preprocessing in step S2 are input into the convolution module of the multi-modal emotion recognition model to output a voice feature map. In this embodiment, the speech model of the multi-modal emotion recognition model is composed of four convolution modules, each of which contains a one-dimensional convolution (Conv1D), batch normalization (BatchNormalization) and an ELU activation function. Then, a Concat splicing operation is performed on the image feature map output by the resnet18 network and the voice feature map output by the speech model to obtain feature data fusing the image and the voice. The fused feature data is input into a two-layer LSTM network, and the output of the second LSTM layer is used as the input of a fully connected layer with a linear activation function. The LSTM network is a long short-term memory structure, a special kind of RNN; in addition to the ability of an ordinary RNN to process time-series data, it can mitigate the vanishing-gradient and exploding-gradient problems that long sequence data causes during training. The output of the fully connected layer is the output of the multi-modal emotion recognition model; because this output is a continuous emotion two-dimensional value, a linear activation function is selected.
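A minimal sketch of this architecture, written in PyTorch purely for illustration, is given below. The channel widths, hidden size and pooling are assumptions of the sketch rather than values given by this application; only the overall structure (a resnet18 branch, four Conv1D + BatchNormalization + ELU blocks, Concat fusion, two LSTM layers and a linear fully connected output) follows the description above.

    import torch
    import torch.nn as nn
    import torchvision

    class MultiModalEmotionNet(nn.Module):
        """resnet18 for face images, four Conv1D blocks for MFCC features,
        Concat fusion, two LSTM layers and a linear head giving (Arousal, Valence)."""
        def __init__(self, n_mfcc=40, hidden=128):
            super().__init__()
            backbone = torchvision.models.resnet18(weights=None)
            backbone.fc = nn.Identity()              # 512-d image feature per sampled frame
            self.image_net = backbone
            blocks, in_ch = [], n_mfcc
            for out_ch in (64, 64, 128, 128):        # four Conv1D + BatchNorm + ELU blocks
                blocks += [nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                           nn.BatchNorm1d(out_ch), nn.ELU()]
                in_ch = out_ch
            self.speech_net = nn.Sequential(*blocks, nn.AdaptiveAvgPool1d(1))
            self.lstm = nn.LSTM(512 + 128, hidden, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden, 2)         # linear activation: continuous output

        def forward(self, faces, mfcc):
            # faces: (B, T, 3, H, W); mfcc: (B, T, n_mfcc, frames)
            B, T = faces.shape[:2]
            img = self.image_net(faces.flatten(0, 1)).view(B, T, -1)
            sp = self.speech_net(mfcc.flatten(0, 1)).squeeze(-1).view(B, T, -1)
            fused = torch.cat([img, sp], dim=-1)     # Concat fusion at the feature level
            out, _ = self.lstm(fused)
            return self.head(out[:, -1])             # one (Arousal, Valence) pair per sample

In use, the face image sequence and the per-segment Mel cepstrum coefficient features produced by the preprocessing in step S2 would be batched and passed to forward(), each sample yielding one continuous (Arousal, Valence) pair.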
It should be noted that the emotion two-dimensional value consists of a value representing the intensity of the emotion and a value representing how positive the emotion is. In this example, Arousal represents the intensity of the emotion and Valence represents how negative or positive it is; both take values in the range [-1, 1]. Referring to FIG. 6 of the specification, the driver's emotion is classified, based on the emotion two-dimensional value, into at least one or more of happiness, joy, excitement, tension, anger, depression, boredom, fatigue, calmness, relaxation and satisfaction.
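By way of illustration only, a simple mapping from the (Arousal, Valence) plane to a discrete label might look like the sketch below; the thresholds and the quadrant-to-label assignment are assumptions, since the actual partition of the plane is the one defined by FIG. 6.

    def classify_emotion(arousal, valence):
        """Illustrative mapping from (Arousal, Valence) in [-1, 1] x [-1, 1] to a
        discrete label; the boundaries here are assumed, not taken from FIG. 6."""
        if abs(arousal) < 0.2 and abs(valence) < 0.2:
            return "calm"
        if valence >= 0:
            return "excited" if arousal >= 0 else "relaxed"
        return "angry" if arousal >= 0 else "depressed"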
In addition, referring to fig. 7 of the specification, the multi-modal emotion recognition model is trained by:
P1, collecting a plurality of video data samples containing the faces of drivers;
P2, segmenting each acquired video data sample according to a set time interval, and labeling the emotion two-dimensional value of the video data of each segmented time interval segment;
and P3, taking a video data sample containing emotion two-dimensional value marks as a video data training set, training the multi-modal emotion recognition model by using the video data training set until the loss function of the multi-modal emotion recognition model is smaller than a set threshold value, and obtaining the trained multi-modal emotion recognition model.
First, a plurality of video data samples containing the faces of drivers are collected, either by monitoring equipment in the automobile cabin or by network crawling. For each video data sample, Valence and Arousal emotion values are labeled for the current video frames every N seconds. The labeled video data samples are then used as the video data training set, and the constructed multi-modal emotion recognition model is trained until its loss function is smaller than the set threshold, yielding the trained multi-modal emotion recognition model. The loss function is used to evaluate the quality of the multi-modal emotion recognition model (it is a function of the degree of difference between the model's predictions and the ground truth); the weights of each layer of the multi-modal emotion recognition model are adjusted continuously according to the loss of each training round until the loss falls below the threshold, at which point training of the multi-modal emotion recognition model is complete.
In this embodiment, the predicted value of the multi-modal emotion recognition model is denoted PR, with mean PRm and variance PRv; the ground-truth value is denoted TR, with mean TRm and variance TRv. The loss function Loss is defined in terms of these statistics.
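The concrete formula of the loss is given in the original filing as equation images and is not reproduced here. A loss that is commonly built from exactly these statistics (the means and variances of the predictions and the ground truth) is one minus the concordance correlation coefficient; the sketch below shows that formulation as an assumption, not as the confirmed formula of this application.

    import torch

    def ccc_loss(pr, tr):
        """1 - concordance correlation coefficient, a common loss for continuous
        Arousal/Valence regression; assumed here, not confirmed by the filing."""
        prm, trm = pr.mean(), tr.mean()
        prv, trv = pr.var(unbiased=False), tr.var(unbiased=False)
        cov = ((pr - prm) * (tr - trm)).mean()
        ccc = 2 * cov / (prv + trv + (prm - trm) ** 2 + 1e-8)
        return 1 - ccc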
according to the multi-modal emotion recognition method, data are used for marking voice to be processed and images to be aligned in a time dimension, a resnet18 network and a voice model are used in a multi-modal emotion recognition model to fuse two modal data in a characteristic layer, finally, an LSTM network is used for capturing context information in the data, and an emotion two-dimensional value is output so that the current emotion of a driver can be obtained according to the emotion two-dimensional value. Therefore, according to the multi-mode emotion recognition method, emotion recognition is carried out by utilizing two modes of voice and images, and when data of the two modes are fused, the parameter quantity is small, the multi-mode emotion recognition model can be conveniently deployed at a mobile terminal, the calculation capability of a vehicle mainboard of an automobile cabin system is met, and meanwhile the emotion recognition accuracy of a driver is improved.
Based on the same inventive concept, the embodiment of the present application further provides a multi-modal emotion recognition apparatus, and as the principle of the apparatus in the embodiment of the present application for solving the problem is similar to the multi-modal emotion recognition method in the embodiment of the present application, the implementation of the apparatus can refer to the implementation of the method, and repeated details are not repeated.
As shown in fig. 8, the present application further provides a multi-modal emotion recognition apparatus for use in an automobile cabin system, the apparatus comprising:
an obtaining module 801, configured to obtain video data of a driver;
a preprocessing module 802, configured to preprocess the video data, including: extracting image data containing the face of a driver from the video data, extracting audio data from the video data, and extracting the Mel cepstrum coefficient characteristics of the audio data;
the recognition module 803 is configured to obtain, based on the trained multi-modal emotion recognition model, an emotion two-dimensional value of the driver according to the input image data and the mel-frequency cepstrum coefficient characteristics, so as to obtain the current emotion of the driver according to the emotion two-dimensional value, where the emotion two-dimensional value includes an emotion intensity value and an emotion aggressiveness value.
In some embodiments, the pre-processing module 802 extracts image data containing a face of the driver from the video data, including:
extracting a frame of picture from the video data according to a set time interval;
the method for detecting the face of each acquired picture by using the face detection model and acquiring the image data containing the face of the driver comprises the following steps: acquiring the coordinates of the face of the driver in each picture; cutting the picture based on the acquired coordinates of the face of the driver to obtain a face block image of the driver; and carrying out size conversion and normalization processing on the obtained face block image of the driver to obtain image data containing the face of the driver.
In some implementations, the pre-processing module 802 extracts mel-frequency cepstral coefficient features of the audio data by:
dividing the audio data according to a set time interval, and performing Fourier transform on the audio data of each divided time interval section to obtain a corresponding signal spectrum;
passing the obtained signal spectrum through a Mel filter to obtain a Mel spectrum;
and carrying out cepstrum analysis on the obtained Mel frequency spectrum to obtain a Mel cepstrum coefficient characteristic.
In some embodiments, the multi-modal emotion recognition model includes a resnet18 network, a convolution module, and an LSTM network, and the recognition module 803 obtains an emotion two-dimensional value about the driver according to the input image data and the mel-frequency cepstrum coefficient features based on the trained multi-modal emotion recognition model, including:
inputting the extracted image data containing the face of the driver into a resnet18 network to obtain an image feature map;
inputting the extracted Mel cepstrum coefficient characteristics of the audio data into a four-layer convolution module to obtain a voice feature map;
splicing the obtained image feature map and the obtained voice feature map through a Concat function to obtain feature data of the fused image and voice;
and inputting the feature data of the fused image and voice into a full connection layer of the multi-modal emotion recognition model after passing through the two-layer LSTM network so as to output an emotion two-dimensional value about the driver.
In some embodiments, the apparatus further comprises a training module to train the multi-modal emotion recognition models by:
collecting a plurality of video data samples containing the face of a driver;
dividing each acquired video data sample according to a set time interval, and labeling emotion two-dimensional values of the video data of each divided time interval section;
and taking the video data containing the emotion two-dimensional value mark as a video data training set, training the multi-modal emotion recognition model by using the video data training set until the loss function of the multi-modal emotion recognition model is smaller than a set threshold value, and obtaining the trained multi-modal emotion recognition model.
In some embodiments, the training module collects a number of video data samples containing the face of the driver by:
collecting a plurality of video data samples containing the faces of drivers based on monitoring equipment in an automobile cabin;
or acquiring a plurality of video data samples containing the faces of the drivers based on network crawling.
In some embodiments, the current emotion of the driver is classified, based on the emotion two-dimensional value, as one or more of happy, excited, nervous, angry, depressed, bored, tired, calm, relaxed and satisfied.
In the multi-modal emotion recognition device described above, the acquisition module acquires video data of the driver; the preprocessing module preprocesses the video data, including extracting image data containing the face of the driver from the video data, extracting audio data from the video data, and extracting the Mel cepstrum coefficient characteristics of the audio data; and the recognition module obtains, based on the trained multi-modal emotion recognition model, an emotion two-dimensional value of the driver from the input image data and the Mel cepstrum coefficient characteristics, so as to obtain the current emotion of the driver from the emotion two-dimensional value. The multi-modal emotion recognition model fuses the video modality and the voice modality, has a small parameter count, and meets the application requirements of the intelligent cockpit's mobile terminal.
Based on the same inventive concept, as shown in FIG. 9 of the specification, an embodiment of the present application provides an electronic device 900, which includes: at least one processor 901, at least one network interface 904 or other user interface 903, a memory 905, and at least one communication bus 902. The communication bus 902 is used to enable connection and communication between these components. The electronic device 900 optionally contains a user interface 903 including a display (e.g., a touch screen, LCD, CRT, holographic display or projector), a keyboard or a pointing device (e.g., a mouse, trackball, touch pad or touch screen).
The memory 905 may include a read-only memory and a random access memory, and provides instructions and data to the processor 901. A portion of the memory 905 may also include non-volatile random access memory (NVRAM).
In some embodiments, the memory 905 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof:
an operating system 9051, which includes various system programs for implementing various basic services and for processing hardware-based tasks;
the application module 9052 contains various applications, such as a desktop (launcher), a Media Player (Media Player), a Browser (Browser), and the like, for implementing various application services.
In the embodiment of the present application, by calling programs or instructions stored in the memory 905, the processor 901 is configured to execute the steps of the multi-modal emotion recognition method, recognizing the emotion of the driver through a multi-modal emotion recognition model with a small parameter count, so that the accuracy of emotion recognition is improved and the mobile-terminal application requirements of the intelligent vehicle are met.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps as in the multimodal emotion recognition method.
In particular, the storage medium can be a general storage medium, such as a removable disk, a hard disk, etc., and when a computer program on the storage medium is executed, the multi-modal emotion recognition method can be executed.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of one logic function, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some communication interfaces, indirect coupling or communication connection between devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific implementations of the present application, intended to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may still, within the technical scope disclosed in the present application, modify the technical solutions described in the foregoing embodiments or readily conceive of changes, or make equivalent substitutions for some of the technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A multi-modal emotion recognition method, applied to an automobile cabin system, comprising the steps of:
acquiring video data of a driver;
pre-processing the video data, comprising: extracting image data containing the face of a driver from the video data, extracting audio data from the video data, and extracting the Mel-cepstrum coefficient characteristics of the audio data;
based on a trained multi-modal emotion recognition model, obtaining an emotion two-dimensional value related to the driver according to the input image data and the Mel cepstrum coefficient characteristics, and obtaining the current emotion of the driver according to the emotion two-dimensional value, wherein the emotion two-dimensional value comprises an emotion intensity value and an emotion positive degree value.
2. The method of claim 1, wherein the step of extracting image data containing the face of the driver from the video data comprises the steps of:
extracting a frame of picture from the video data according to a set time interval;
carrying out face detection on each acquired picture and acquiring image data containing the face of a driver, wherein the method comprises the following steps: acquiring the coordinates of the face of the driver in each picture; cutting the picture based on the acquired coordinates of the face of the driver to obtain a face block image of the driver; and carrying out size conversion and normalization processing on the obtained face block image of the driver to obtain image data containing the face of the driver.
3. A method of multimodal emotion recognition, as recited in claim 2, wherein mel-frequency cepstral coefficient features of the audio data are extracted by:
dividing the audio data according to a set time interval, and performing Fourier transform on the audio data of each divided time interval section to obtain a corresponding signal spectrum;
passing the obtained signal spectrum through a Mel filter to obtain a Mel spectrum;
and carrying out cepstrum analysis on the obtained Mel frequency spectrum to obtain a Mel cepstrum coefficient characteristic.
4. A multi-modal emotion recognition method as recited in claim 3, wherein said multi-modal emotion recognition model comprises a resnet18 network, a convolution module, and an LSTM network, and said obtaining an emotion two-dimensional value about said driver from said inputted image data and said mel-frequency cepstrum coefficient characteristics based on said trained multi-modal emotion recognition model comprises the steps of:
inputting the extracted image data containing the face of the driver into a resnet18 network to obtain an image feature map;
inputting the extracted Mel cepstrum coefficient characteristics of the audio data into a four-layer convolution module to obtain a voice feature map;
splicing the obtained image feature map and the obtained voice feature map through a Concat function to obtain feature data of the fused image and voice;
and inputting the feature data of the fused image and voice into a full connection layer of the multi-modal emotion recognition model after passing through the two-layer LSTM network so as to output an emotion two-dimensional value about the driver.
5. The method for multi-modal emotion recognition of claim 4, wherein the multi-modal emotion recognition model is trained by:
collecting a plurality of video data samples containing the face of a driver;
dividing each acquired video data sample according to a set time interval, and labeling emotion two-dimensional values of the video data of each divided time interval section;
and taking a video data sample containing emotion two-dimensional value marks as a video data training set, training the multi-modal emotion recognition model by using the video data training set until a loss function of the multi-modal emotion recognition model is smaller than a set threshold value, and obtaining the trained multi-modal emotion recognition model.
6. The method of claim 5, wherein a plurality of video data samples containing the face of the driver are collected by:
collecting a plurality of video data samples containing the face of a driver based on monitoring equipment in an automobile cabin;
or acquiring a plurality of video data samples containing the faces of the drivers based on network crawling.
7. The method of claim 6, wherein the current emotion of the driver is classified, based on the emotion two-dimensional value, as one or more of happy, excited, nervous, angry, depressed, bored, tired, calm, relaxed, and satisfied.
8. A multimodal emotion recognition apparatus, applied to an automobile cabin system, comprising:
the acquisition module is used for acquiring video data of a driver based on monitoring equipment in a vehicle cabin;
the preprocessing module is used for preprocessing the video data and comprises: extracting image data containing the face of a driver from the video data, extracting audio data from the video data, and extracting the Mel-cepstrum coefficient characteristics of the audio data;
and the recognition module is used for obtaining an emotion two-dimensional value related to the driver according to the input image data and the Mel cepstrum coefficient characteristics based on the trained multi-modal emotion recognition model so as to obtain the current emotion of the driver according to the emotion two-dimensional value, wherein the emotion two-dimensional value comprises an emotion intensity degree value and an emotion positive degree value.
9. An electronic device comprising a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the multi-modal emotion recognition method of any of claims 1 to 7.
10. A computer-readable storage medium characterized by: the computer readable storage medium has stored thereon a computer program which, when being executed by a processor, performs the steps of the multimodal emotion recognition method as claimed in any of claims 1 to 7.
CN202211636214.9A 2022-12-20 2022-12-20 Multi-modal emotion recognition method and device, electronic equipment and storage medium Pending CN115620268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211636214.9A CN115620268A (en) 2022-12-20 2022-12-20 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211636214.9A CN115620268A (en) 2022-12-20 2022-12-20 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115620268A true CN115620268A (en) 2023-01-17

Family

ID=84881051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211636214.9A Pending CN115620268A (en) 2022-12-20 2022-12-20 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115620268A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117219265A (en) * 2023-10-07 2023-12-12 东北大学秦皇岛分校 Multi-mode data analysis method, device, storage medium and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414323A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Mood detection method, device, electronic equipment and storage medium
CN113033450A (en) * 2021-04-02 2021-06-25 山东大学 Multi-mode continuous emotion recognition method, service inference method and system
CN113496156A (en) * 2020-03-20 2021-10-12 阿里巴巴集团控股有限公司 Emotion prediction method and equipment
CN115359576A (en) * 2022-07-29 2022-11-18 华南师范大学 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414323A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Mood detection method, device, electronic equipment and storage medium
CN113496156A (en) * 2020-03-20 2021-10-12 阿里巴巴集团控股有限公司 Emotion prediction method and equipment
CN113033450A (en) * 2021-04-02 2021-06-25 山东大学 Multi-mode continuous emotion recognition method, service inference method and system
CN115359576A (en) * 2022-07-29 2022-11-18 华南师范大学 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PANAGIOTIS ET AL: "End-to-End Multimodal Emotion Recognition using Deep Neural Networks" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117219265A (en) * 2023-10-07 2023-12-12 东北大学秦皇岛分校 Multi-mode data analysis method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
Omerustaoglu et al. Distracted driver detection by combining in-vehicle and image data using deep learning
CN110914872A (en) Navigating video scenes with cognitive insights
KR101617649B1 (en) Recommendation system and method for video interesting section
CN111078940B (en) Image processing method, device, computer storage medium and electronic equipment
CN110298257B (en) Driver behavior recognition method based on human body multi-part characteristics
CN114465737A (en) Data processing method and device, computer equipment and storage medium
CN111091044B (en) Network appointment-oriented in-vehicle dangerous scene identification method
KR101996371B1 (en) System and method for creating caption for image and computer program for the same
CN114549946A (en) Cross-modal attention mechanism-based multi-modal personality identification method and system
CN115620268A (en) Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN111798259A (en) Application recommendation method and device, storage medium and electronic equipment
CN115376559A (en) Emotion recognition method, device and equipment based on audio and video
CN112365956A (en) Psychological treatment method, psychological treatment device, psychological treatment server and psychological treatment storage medium based on virtual reality
CN115937949A (en) Expression recognition method and device, electronic equipment and storage medium
CN113128284A (en) Multi-mode emotion recognition method and device
CN111274946A (en) Face recognition method, system and equipment
CN116721449A (en) Training method of video recognition model, video recognition method, device and equipment
CN114821552A (en) Desktop background dynamic display method, device, equipment and storage medium
CN112115779B (en) Interpretable classroom student emotion analysis method, system, device and medium
CN115719428A (en) Face image clustering method, device, equipment and medium based on classification model
CN114299295A (en) Data processing method and related device
CN116958615A (en) Picture identification method, device, equipment and medium
CN113096134A (en) Real-time instance segmentation method based on single-stage network, system and electronic equipment thereof
CN111797869A (en) Model training method and device, storage medium and electronic equipment
CN116453194B (en) Face attribute discriminating method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20230117