CN114424940A - Emotion recognition method and system based on multi-mode spatiotemporal feature fusion - Google Patents

Emotion recognition method and system based on multi-mode spatiotemporal feature fusion

Info

Publication number
CN114424940A
Authority
CN
China
Prior art keywords
modal
data
fusion
emotion recognition
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210101019.XA
Other languages
Chinese (zh)
Inventor
郑向伟
郭鲠源
张利峰
郑法
高鹏志
嵇存
李淑芹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202210101019.XA
Publication of CN114424940A
Legal status: Pending


Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B 5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B 5/165: Evaluating the state of mind, e.g. depression, anxiety
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B 5/72: Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B 5/7235: Details of waveform analysis
    • A61B 5/7264: Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Psychiatry (AREA)
  • Public Health (AREA)
  • Surgery (AREA)
  • Veterinary Medicine (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Fuzzy Systems (AREA)
  • Physiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Developmental Disabilities (AREA)
  • Educational Technology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of emotion recognition and provides an emotion recognition method and system based on multi-modal spatiotemporal feature fusion, comprising the following steps: acquiring raw physiological data; preprocessing the acquired raw physiological data to obtain multi-modal physiological data; extracting the spatial and temporal features of the multi-modal data from the obtained multi-modal physiological data; performing feature-level fusion of the extracted spatial and temporal features to obtain fusion features; and classifying the fusion features to obtain the emotion recognition result.

Description

Emotion recognition method and system based on multi-mode spatiotemporal feature fusion
Technical Field
The disclosure belongs to the technical field of emotion recognition, and particularly relates to an emotion recognition method and system based on multi-mode spatiotemporal feature fusion.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Emotional changes are usually produced under the stimulation of the external environment and are accompanied by changes in individual physiological characteristics and psychological responses, so they can be measured and modeled by scientific methods. With the development and popularization of portable and wearable sensors, acquiring physiological signals has become easier, and emotion recognition methods based on physiological signals have attracted more and more researchers. Such methods analyze the emotional changes of the human mind by processing physiological signals, and they have been widely used in fields such as fatigue-driving detection and medical care. Many studies show that physiological signals such as electroencephalogram (EEG), electrocardiogram (ECG), galvanic skin response (GSR), respiration (RSP) and electrooculogram (EOG) are correlated with specific emotions; however, each physiological signal has different characteristics and behaves differently in emotion recognition tasks, so each signal has usually been studied independently.
Current emotion recognition methods based on physiological signals can be roughly divided into two categories: single-modal methods and multi-modal methods. Among single-modal methods, emotion recognition based on EEG signals is the most widely used, and physiological signals such as ECG, galvanic skin response and respiration also perform well in emotion recognition. Multi-modal emotion recognition combines the data and features of multiple modalities to obtain a joint classification result, and it mainly operates at three levels: data-level fusion, feature-level fusion and decision-level fusion. Both single-modal and multi-modal recognition mainly involve data preprocessing, feature extraction, feature optimization, feature fusion and emotion classification, and the core of these methods lies in the feature engineering stage. Therefore, how to extract features with strong emotion characterization capability and apply them to emotion recognition tasks is a key challenge.
According to the inventors, existing emotion recognition methods have the following technical problems:
(1) Traditional emotion recognition methods based on physiological signals mainly extract hand-crafted features, such as statistical and frequency-domain features, from the data according to professional knowledge and experience. These features are highly interpretable, but they demand considerable domain expertise and can cause information loss, which reduces recognition accuracy. Some researchers have proposed extracting high-level features with neural networks instead; however, different network structures differ greatly in their feature extraction performance. How to use neural networks to extract high-level features with strong emotion characterization capability remains a major technical difficulty.
(2) Multi-modal techniques mainly comprise data-level fusion, feature-level fusion and decision-level fusion, and research in emotion recognition has concentrated on the latter two. Decision-level fusion integrates the results after training a classifier for each modality; it makes full use of the useful information of each modality and is simple to implement and highly interpretable, but it may lose complementary information between modalities. Feature-level fusion extracts features from the raw data and fuses the features of each modality into fusion features used for the recognition task, which makes full use of the emotional complementarity between modalities.
Disclosure of Invention
To solve the above problems, the present disclosure provides an emotion recognition method and system based on multi-modal spatiotemporal feature fusion. ECG, RSP and eye movement signals are used as input; linear interpolation and noise reduction are applied to the raw physiological data to eliminate the influence of outliers and noise on recognition accuracy; a convolutional neural network (CNN) and a long short-term memory network (LSTM) are used to extract the temporal and spatial features of the physiological data to characterize emotion; and a multi-modal compact bilinear pooling layer fuses the temporal and spatial features, making full use of the complementary information between different physiological signals while retaining effective information and reducing dimensionality. This solves the technical problem of fusing complementary information across modalities and improves the recognition accuracy of the model.
According to some embodiments, a first aspect of the present disclosure provides an emotion recognition method based on multi-modal spatiotemporal feature fusion, which adopts the following technical solutions:
a method for recognizing emotion based on multi-modal spatiotemporal feature fusion comprises the following steps:
acquiring original physiological data;
preprocessing the acquired original physiological data to obtain multi-modal physiological data;
respectively extracting spatial characteristics and temporal characteristics of the multi-modal data based on the obtained multi-modal physiological data;
performing feature level fusion on the spatial characteristics and the temporal characteristics of the extracted multi-modal data to obtain fusion features;
and classifying according to the obtained fusion characteristics to obtain a result of emotion recognition.
Here, the acquired raw physiological data include at least electrocardiogram (ECG), respiration (RSP) and eye movement signals.
As a further technical limitation, the original emotion data set comprises an emotional stimulation stage and a self-evaluation stage, physiological signals in the original emotion data set are cut, and emotional stimulation stage data are intercepted; performing linear interpolation on the intercepted data to eliminate the influence of missing values in the data acquisition and processing processes; and performing noise reduction processing on the data by using a wavelet noise reduction method to eliminate the influence of noise on the identification effect.
As a further technical limitation, the preprocessing comprises:
the original emotion data set comprises an emotional stimulation stage and a self-evaluation stage; the physiological signals in the original emotion data set are cut, and the emotional stimulation stage data are intercepted;
performing linear interpolation on the intercepted data to eliminate the influence of missing values introduced during data acquisition and processing;
and performing noise reduction on the data with a wavelet noise reduction method to eliminate the influence of noise on the recognition effect.
Furthermore, the preprocessed multi-modal physiological data is converted into a gray scale image and input into the neural network, and the spatial features of the multi-modal physiological data are extracted from the gray scale image.
Furthermore, the preprocessed multi-modal physiological data are respectively input into the neural network, and the time characteristics of the multi-modal physiological data are extracted.
As a further technical limitation, feature-level fusion is performed on the temporal and spatial features of the multi-modal physiological data extracted by the neural networks to obtain fusion features, which are used for the emotion recognition task. The specific process is as follows: the CountSketch algorithm counts the occurrence frequency of each element to map the features from a high dimension to a low dimension; the dimension-reduced features are then fused by a bilinear pooling method to obtain the fusion features.
As a further technical limitation, a classification task is performed according to the obtained fusion features to obtain a final emotion recognition result, and the specific process is as follows: training an SVM classifier; and inputting the fusion features into a classifier to obtain a final recognition result.
According to some embodiments, a second aspect of the present disclosure provides an emotion recognition system based on multi-modal spatiotemporal feature fusion, which adopts the following technical solutions:
an emotion recognition system based on multi-modal spatiotemporal feature fusion, comprising:
the acquisition module is configured to acquire original physiological data, and preprocess the acquired original physiological data to obtain multi-modal physiological data;
the fusion module is configured to respectively extract spatial characteristics and temporal characteristics of the multi-modal data based on the obtained multi-modal physiological data, and perform feature level fusion on the spatial characteristics and the temporal characteristics of the extracted multi-modal data to obtain fusion features;
and the recognition module is configured to classify according to the obtained fusion characteristics to obtain a result of emotion recognition.
According to some embodiments, a third aspect of the present disclosure provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the steps in the method for emotion recognition based on multimodal spatiotemporal feature fusion as described in the first aspect of the present disclosure.
According to some embodiments, a fourth aspect of the present disclosure provides an electronic device, which adopts the following technical solutions:
an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps in the method for emotion recognition based on multimodal spatiotemporal feature fusion according to the first aspect of the present disclosure when executing the program.
Compared with the prior art, the beneficial effect of this disclosure is:
1. The method first preprocesses the raw physiological signals, including data truncation, linear interpolation and noise reduction; second, it provides a spatial feature extraction module based on a two-dimensional convolutional neural network (2D-CNN), which converts the preprocessed multi-modal physiological data into a 1 × 120 × 120 image (1 represents the number of layers and 120 × 120 represents the pixel size of the image) and inputs the image into the 2D-CNN to extract spatial features; third, it provides a temporal feature extraction module based on a long short-term memory network (LSTM), into which the preprocessed multi-modal physiological data are separately input to extract temporal features; fourth, a multimodal compact bilinear pooling (MCB) method performs feature-level fusion of the temporal and spatial features to obtain fusion features; fifth, the fusion features are input into a trained classifier to obtain the final emotion recognition result.
2. The system described in this disclosure consists of five parts: a data preprocessing module, a spatial feature extraction module, a temporal feature extraction module, a feature fusion module and an emotion classification module. Analysis shows that when human emotion changes or fluctuates strongly, signals such as EEG and RSP also change, for example in amplitude and frequency, so the current emotion can be identified accurately by analyzing these waveform patterns. The spatial feature extraction module proposed in this disclosure converts the preprocessed multi-modal physiological data into 1 × 120 × 120 images (1 represents the number of layers and 120 × 120 represents the pixel size of the images) and extracts the spatial features of the physiological signals with a 2D-CNN, making maximal use of the waveform patterns of the physiological signals.
3. To extract the temporal information in the physiological signals in a targeted manner, the disclosure provides a temporal feature extraction module, which inputs each preprocessed physiological signal into the LSTM to extract temporal features, making maximal use of the temporal information in the physiological signals.
4. To fully integrate the temporal and spatial features and exploit the complementary information between different modalities, the disclosure provides a feature fusion module, which fuses the temporal and spatial features with a multimodal compact bilinear pooling method to obtain fusion features for the emotion recognition task.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a flowchart of an emotion recognition method based on multi-modal spatiotemporal feature fusion in a first embodiment of the disclosure;
FIG. 2 is a specific working schematic diagram of an emotion recognition method based on multi-modal spatiotemporal feature fusion in a first embodiment of the disclosure;
FIG. 3 is a flowchart illustration of an emotion recognition method based on multi-modal spatiotemporal feature fusion in an embodiment of the disclosure;
FIG. 4 is a schematic structural diagram of spatial feature extraction in the first embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of temporal feature extraction in the first embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of feature fusion in the first embodiment of the present disclosure;
FIG. 7 is a block diagram of the structure of an emotion recognition system based on multi-modal spatiotemporal feature fusion in the second embodiment of the disclosure.
Detailed Description
The present disclosure is further illustrated by the following examples in conjunction with the accompanying drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments of the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Example one
The embodiment I of the disclosure introduces an emotion recognition method based on multi-mode spatiotemporal feature fusion.
As shown in FIG. 1, the emotion recognition method based on multi-modal spatiotemporal feature fusion is characterized by comprising the following steps:
acquiring original physiological data;
preprocessing the acquired original physiological data to obtain multi-modal physiological data;
respectively extracting spatial characteristics and temporal characteristics of the multi-modal data based on the obtained multi-modal physiological data;
performing feature level fusion on the spatial characteristics and the temporal characteristics of the extracted multi-modal data to obtain fusion features;
and classifying according to the obtained fusion characteristics to obtain a result of emotion recognition.
As shown in fig. 2 and fig. 3, this embodiment provides an emotion recognition method based on multi-modal spatiotemporal feature fusion. The embodiment is illustrated by applying the method to a server; it can be understood that the method can also be applied to a terminal, or to a system comprising a terminal and a server, implemented through interaction between the terminal and the server. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, tablet computer, laptop computer, desktop computer, smart speaker or smart watch. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application. In this embodiment, the method includes the following steps:
step S01: for an original emotion data set D, data cutting, linear interpolation and noise reduction processing are carried out on physiological signals in the data set D to obtain a data set D*
Step S02: for data set D*The physiological signal S in (1) is converted into a gray image I _ S, and the gray image I _ S is input into the 2D-CNN to obtain a spatial characteristic Fspatio
Step S03: for data level D*The physiological signal S is input into the LSTM to obtain a time characteristic Ftemporal
Step S04: spatial feature FspatioAnd time characteristic FtemporalInputting the data into a multi-mode compact bilinear pooling layer for feature fusion to obtain a fusion feature Ffusion
Step S05: fusing the features FfusionInput to the classificationAnd obtaining a final recognition result in the device.
In step S01 of the embodiment, the original emotion data set includes an emotional stimulation phase and a self-evaluation phase, and the physiological signals in the original emotion data set are clipped to capture emotional stimulation phase data; performing linear interpolation on the intercepted data to eliminate the influence of missing values in the data acquisition and processing processes; and performing noise reduction processing on the data by using a wavelet noise reduction method to eliminate the influence of noise on the identification effect.
The input original emotion data set is initialized as D = [S_1, S_2, …, S_N], where S_n is the physiological signal sequence collected in the n-th emotional stimulation experiment, n ∈ N, and N denotes the number of experiments. The physiological signal sequence collected in each experiment is S_n = [ECG_{1,n}, ECG_{2,n}, ECG_{3,n}, RSP_n, Eye_Data_n], where ECG_{i,n} denotes the electrocardiogram signal of the i-th channel collected in the n-th emotional stimulation experiment (i = 1, 2, 3), RSP_n denotes the respiration data collected in the n-th experiment, and Eye_Data_n denotes the eye movement data collected in the n-th experiment. For each experiment's physiological signal sequence S_n in the data set D, data truncation is performed to intercept the physiological signal of the emotional stimulation segment, followed by linear interpolation and noise reduction, yielding the preprocessed data set D*.
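The following is a minimal Python sketch of this preprocessing step; the segment boundaries, the db4 wavelet, the decomposition level and the soft-threshold rule are illustrative assumptions and are not fixed by this disclosure.

```python
# Hypothetical preprocessing sketch: data truncation, linear interpolation of
# missing samples, and wavelet denoising of each channel in a sequence S_n.
import numpy as np
import pywt

def truncate(signal, stim_start, stim_end):
    """Keep only the emotional-stimulation segment of a 1-D signal."""
    return signal[stim_start:stim_end]

def interpolate_missing(signal):
    """Fill NaN samples by linear interpolation over the sample index."""
    x = np.arange(len(signal))
    mask = np.isnan(signal)
    signal = signal.copy()
    signal[mask] = np.interp(x[mask], x[~mask], signal[~mask])
    return signal

def wavelet_denoise(signal, wavelet="db4", level=4):
    """Soft-threshold the detail coefficients and reconstruct the signal."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745      # noise estimate from finest level
    thr = sigma * np.sqrt(2 * np.log(len(signal)))      # universal threshold
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

def preprocess_sequence(seq, stim_start, stim_end):
    """Apply truncation, interpolation and denoising to every channel of S_n."""
    return [wavelet_denoise(interpolate_missing(truncate(ch, stim_start, stim_end)))
            for ch in seq]
```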
In step S02 of the present embodiment, as shown in fig. 4, the present disclosure uses the matplotlib toolkit to convert each physiological signal sequence in the data set D* into a 1 × 120 × 120 grayscale image (1 represents the number of layers and 120 × 120 represents the pixel size of the image). A 2D-CNN network is then built with the PyTorch toolkit, the grayscale image is input into the network, and the spatial feature F_spatio is extracted.
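A minimal sketch of this signal-to-image conversion is shown below, assuming matplotlib renders the waveform off-screen and the result is resampled to 120 × 120; the figure size, the DPI and the choice to overlay all channels in a single plot are assumptions.

```python
# Hypothetical conversion of one preprocessed signal sequence into a
# 1 x 120 x 120 grayscale image using matplotlib and Pillow.
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")                   # render off-screen, no display needed
import matplotlib.pyplot as plt
from PIL import Image

def sequence_to_gray_image(channels, size=120):
    fig, ax = plt.subplots(figsize=(1, 1), dpi=size)    # 1 inch at 120 dpi -> 120 x 120 px
    for ch in channels:                                  # overlay all modalities as waveforms
        ax.plot(ch, color="black", linewidth=0.5)
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=size)
    plt.close(fig)
    buf.seek(0)
    img = Image.open(buf).convert("L").resize((size, size))   # force grayscale, fix size
    return np.asarray(img, dtype=np.float32)[None, :, :] / 255.0   # shape (1, 120, 120)
```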
For the design of the spatial feature extraction module: first, a grayscale image I_s of dimension 120 × 120 × 1 is input into the 2D-CNN, which includes two 5 × 5 × 32 convolution operations and two 5 × 5 × 64 convolution operations to obtain a 30 × 30 × 64 feature map, and the spatial feature F_spatio of dimension 1 × 50 is then obtained through two fully connected layers. The convolution operation can be expressed as:
s(t) = (x ∗ ω)(t)   (1)
where the first argument x is called the input data, the second argument ω is called the kernel function, and s(t) is the output, i.e. the feature map.
Taking the grayscale image I_s of one experiment as an example, the calculation process of extracting spatial features with the 2D-CNN is as follows:
(1) First convolutional layer: the input is the 120 × 120 × 1 grayscale image I_s (height × width × number of color channels); the layer has 32 convolution kernels, each of size 7 × 7 × 1, and the output matrix has size 60 × 60 × 32;
(2) Second convolutional layer: the input is a 60 × 60 × 32 matrix; the layer has 32 convolution kernels, each of size 7 × 7 × 32, and the output matrix has size 60 × 60 × 32;
(3) Third convolutional layer: the input is a 60 × 60 × 32 matrix; the layer has 64 convolution kernels, each of size 7 × 7 × 32, and the output matrix has size 30 × 30 × 64;
(4) Fourth convolutional layer: the input is a 30 × 30 × 64 matrix; the layer has 64 convolution kernels, each of size 7 × 7 × 64, and the output matrix has size 30 × 30 × 64;
(5) First fully connected layer: the input is a 30 × 30 × 64 matrix and the output is a 1 × 512 matrix;
(6) Second fully connected layer: the input is a 1 × 512 matrix and the output is a 1 × 50 feature matrix.
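A possible PyTorch sketch of this spatial feature extractor, following the layer-by-layer sizes listed above, is given below; the strides and paddings are assumptions chosen so that the stated output sizes are reproduced.

```python
# Hypothetical 2D-CNN spatial feature extractor:
# 120x120x1 -> 60x60x32 -> 60x60x32 -> 30x30x64 -> 30x30x64 -> 512 -> 50.
import torch
import torch.nn as nn

class SpatialCNN(nn.Module):
    def __init__(self, feat_dim=50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),   # 120 -> 60
            nn.Conv2d(32, 32, kernel_size=7, stride=1, padding=3), nn.ReLU(),  # 60 -> 60
            nn.Conv2d(32, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),  # 60 -> 30
            nn.Conv2d(64, 64, kernel_size=7, stride=1, padding=3), nn.ReLU(),  # 30 -> 30
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(30 * 30 * 64, 512), nn.ReLU(),   # first fully connected layer
            nn.Linear(512, feat_dim),                  # 1 x 50 spatial feature F_spatio
        )

    def forward(self, gray_image):                     # gray_image: (batch, 1, 120, 120)
        return self.classifier(self.features(gray_image))

# Example: one grayscale image I_s -> a 1 x 50 spatial feature
f_spatio = SpatialCNN()(torch.randn(1, 1, 120, 120))   # shape (1, 50)
```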
In step S03 of the present embodiment, as shown in fig. 5, the present disclosure builds an LSTM network with the PyTorch toolkit and extracts the temporal feature F_temporal from each physiological signal sequence in the data set D*. The LSTM relies on the concept of a gate, which is essentially a fully connected layer and can be expressed as:
g(x) = σ(wx + b)   (2)
where x is the input vector, w is the weight vector of the gate, b is the bias term, σ is the sigmoid function, and g(x) is the output vector.
The LSTM uses two gates to control the contents of the cell state c. One is the forget gate, which determines how much of the cell state c_{t-1} at the previous moment is kept in the current cell state c_t. The forget gate can be expressed as:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (3)
where W_f is the weight matrix of the forget gate, [h_{t-1}, x_t] denotes the concatenation of the two vectors into one longer vector, b_f is the bias term of the forget gate, and σ is the sigmoid function.
The other is the input gate, which determines how much of the current network input x_t is saved into the cell state c_t. The input gate can be expressed as:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (4)
where W_i is the weight matrix of the input gate, b_i is the bias term of the input gate, and σ is the sigmoid function.
The candidate cell state for the current input, c̃_t, is calculated from the previous output and the current input as follows:
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   (5)
Next, the cell state c_t at the current time is calculated: the previous cell state c_{t-1} is multiplied element-wise by the forget gate f_t, the candidate cell state c̃_t is multiplied element-wise by the input gate i_t, and the two products are added:
c_t = f_t · c_{t-1} + i_t · c̃_t   (6)
where "·" here denotes element-wise multiplication.
The output gate of the LSTM controls how much of the cell state c_t is output to the current output value h_t of the LSTM:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (7)
The final output of the LSTM is determined jointly by the output gate and the cell state:
h_t = o_t · tanh(c_t)   (8)
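Equations (2) through (8) describe the standard LSTM gating that PyTorch's nn.LSTM implements internally, so the temporal feature extractor can be sketched as below; the hidden size, the projection to 50 dimensions and the use of the last time step as F_temporal are assumptions.

```python
# Hypothetical LSTM temporal feature extractor for one preprocessed sequence.
import torch
import torch.nn as nn

class TemporalLSTM(nn.Module):
    def __init__(self, in_channels=5, hidden_size=64, feat_dim=50):
        super().__init__()
        self.lstm = nn.LSTM(input_size=in_channels, hidden_size=hidden_size,
                            batch_first=True)
        self.proj = nn.Linear(hidden_size, feat_dim)   # map hidden state to F_temporal

    def forward(self, seq):                 # seq: (batch, time_steps, channels)
        _, (h_n, _) = self.lstm(seq)        # h_n: (1, batch, hidden_size)
        return self.proj(h_n[-1])           # F_temporal: (batch, feat_dim)

# Example: a sequence with 5 channels (3 ECG channels, RSP, eye movement)
f_temporal = TemporalLSTM()(torch.randn(1, 1000, 5))    # shape (1, 50)
```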
In step S04 of this embodiment, as shown in fig. 6, the present disclosure uses the CountSketch algorithm to count the occurrence frequency of each element, mapping F_spatio and F_temporal from a high dimension to a low dimension and obtaining the dimension-reduced features, which reduces the computational load of the subsequent feature fusion. The fusion feature F_fusion is then obtained by a Fast Fourier Transform (FFT), an element-wise (point) multiplication, and an Inverse Fast Fourier Transform (IFFT).
In step S05 of this embodiment, a classification task is performed according to the obtained fusion features to obtain a final emotion recognition result, which specifically includes:
training an SVM classifier;
and inputting the fusion features into a classifier to obtain a final recognition result.
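A minimal sketch of this classification step with scikit-learn is given below; the RBF kernel, the penalty parameter and the train/test split are assumptions.

```python
# Hypothetical final classification step: train an SVM on fusion features,
# then predict emotion labels for held-out samples.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_and_classify(fusion_features, labels):
    """fusion_features: (num_samples, d) array of F_fusion; labels: emotion classes."""
    x_train, x_test, y_train, y_test = train_test_split(
        fusion_features, labels, test_size=0.2, random_state=0)
    clf = SVC(kernel="rbf", C=1.0)           # train the SVM classifier
    clf.fit(x_train, y_train)
    predictions = clf.predict(x_test)        # final emotion recognition results
    return clf, predictions
```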
Example two
The second embodiment of the disclosure introduces an emotion recognition system based on multi-modal spatiotemporal feature fusion.
Fig. 7 shows an emotion recognition system based on multi-modal spatiotemporal feature fusion, which includes:
the acquisition module is configured to acquire original physiological data, and preprocess the acquired original physiological data to obtain multi-modal physiological data;
the fusion module is configured to respectively extract spatial characteristics and temporal characteristics of the multi-modal data based on the obtained multi-modal physiological data, and perform feature level fusion on the spatial characteristics and the temporal characteristics of the extracted multi-modal data to obtain fusion features;
and the recognition module is configured to classify according to the obtained fusion characteristics to obtain a result of emotion recognition.
The detailed steps are the same as the emotion recognition method based on multi-modal spatiotemporal feature fusion provided in the first embodiment, and are not described herein again.
EXAMPLE III
The third embodiment of the disclosure provides a computer-readable storage medium.
A computer-readable storage medium, on which a program is stored, which when executed by a processor implements the steps in the method for emotion recognition based on multi-modal spatiotemporal feature fusion according to an embodiment of the present disclosure.
The detailed steps are the same as the emotion recognition method based on multi-modal spatiotemporal feature fusion provided in the first embodiment, and are not described again here.
Example four
The fourth embodiment of the disclosure provides an electronic device.
An electronic device includes a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for emotion recognition based on multi-modal spatiotemporal feature fusion according to an embodiment of the present disclosure.
The detailed steps are the same as the emotion recognition method based on multi-modal spatiotemporal feature fusion provided in the first embodiment, and are not described herein again.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A method for recognizing emotion based on multi-modal spatiotemporal feature fusion is characterized by comprising the following steps:
acquiring original physiological data;
preprocessing the acquired original physiological data to obtain multi-modal physiological data;
respectively extracting spatial characteristics and temporal characteristics of the multi-modal data based on the obtained multi-modal physiological data;
performing feature level fusion on the spatial characteristics and the temporal characteristics of the extracted multi-modal data to obtain fusion features;
and classifying according to the obtained fusion characteristics to obtain a result of emotion recognition.
2. The emotion recognition method based on multi-modal spatiotemporal feature fusion as claimed in claim 1, wherein the original emotion data set comprises an emotional stimulation phase and a self-assessment phase, the physiological signals in the original emotion data set are cut, and the emotional stimulation phase data are intercepted; performing linear interpolation on the intercepted data to eliminate the influence of missing values in the data acquisition and processing processes; and performing noise reduction processing on the data by using a wavelet noise reduction method to eliminate the influence of noise on the identification effect.
3. A method of emotion recognition based on multi-modal spatiotemporal feature fusion as recited in claim 1, wherein said preprocessing comprises:
the original emotion data set comprises an emotional stimulation stage and a self-evaluation stage; the physiological signals in the original emotion data set are cut, and the emotional stimulation stage data are intercepted;
performing linear interpolation on the intercepted data to eliminate the influence of missing values in the data acquisition and processing processes;
and performing noise reduction on the data with a wavelet noise reduction method to eliminate the influence of noise on the recognition effect.
4. The emotion recognition method based on fusion of multi-modal spatiotemporal features as defined in claim 3, wherein the preprocessed multi-modal physiological data is converted into gray scale images and inputted into the neural network, and the spatial features of the multi-modal physiological data are extracted from the gray scale images.
5. The emotion recognition method based on fusion of multi-modal spatiotemporal features as defined in claim 3, wherein the pre-processed multi-modal physiological data are respectively inputted into the neural network to extract the temporal features of the multi-modal physiological data.
6. The emotion recognition method based on multi-modal spatiotemporal feature fusion as claimed in claim 1, wherein the temporal features and the spatial features of the multi-modal physiological data extracted by the neural networks are subjected to feature-level fusion to obtain fusion features for the emotion recognition task, and the specific process is as follows: counting the occurrence frequency of each element with the CountSketch algorithm to map the features from a high dimension to a low dimension; and fusing the dimension-reduced features through a bilinear pooling method to obtain the fusion features.
7. The emotion recognition method based on multi-modal spatiotemporal feature fusion as claimed in claim 1, wherein, according to the obtained fusion features, a classification task is performed to obtain a final emotion recognition result, and the specific process is as follows: training an SVM classifier; and inputting the fusion features into a classifier to obtain a final recognition result.
8. An emotion recognition system based on multi-modal spatiotemporal feature fusion, characterized by comprising:
the acquisition module is configured to acquire original physiological data, and preprocess the acquired original physiological data to obtain multi-modal physiological data;
the fusion module is configured to respectively extract spatial characteristics and temporal characteristics of the multi-modal data based on the obtained multi-modal physiological data, and perform feature level fusion on the spatial characteristics and the temporal characteristics of the extracted multi-modal data to obtain fusion features;
and the recognition module is configured to classify according to the obtained fusion characteristics to obtain a result of emotion recognition.
9. A computer-readable storage medium, on which a program is stored, which when executed by a processor performs the steps in the method for emotion recognition based on multi-modal spatiotemporal feature fusion as claimed in any of claims 1-7.
10. An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for emotion recognition based on multi-modal spatiotemporal feature fusion as claimed in any of claims 1-7 when executing the program.
CN202210101019.XA 2022-01-27 2022-01-27 Emotion recognition method and system based on multi-mode spatiotemporal feature fusion Pending CN114424940A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210101019.XA CN114424940A (en) 2022-01-27 2022-01-27 Emotion recognition method and system based on multi-mode spatiotemporal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210101019.XA CN114424940A (en) 2022-01-27 2022-01-27 Emotion recognition method and system based on multi-mode spatiotemporal feature fusion

Publications (1)

Publication Number Publication Date
CN114424940A true CN114424940A (en) 2022-05-03

Family

ID=81313284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210101019.XA Pending CN114424940A (en) 2022-01-27 2022-01-27 Emotion recognition method and system based on multi-mode spatiotemporal feature fusion

Country Status (1)

Country Link
CN (1) CN114424940A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190128933A (en) * 2018-05-09 2019-11-19 연세대학교 산학협력단 Emotion recognition apparatus and method based on spatiotemporal attention
CN109614895A (en) * 2018-10-29 2019-04-12 山东大学 A method of the multi-modal emotion recognition based on attention Fusion Features
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN109934158A (en) * 2019-03-11 2019-06-25 合肥工业大学 Video feeling recognition methods based on local strengthening motion history figure and recursive convolution neural network
CN111297380A (en) * 2020-02-12 2020-06-19 电子科技大学 Emotion recognition method based on space-time convolution core block
CN111461204A (en) * 2020-03-30 2020-07-28 华南理工大学 Emotion identification method based on electroencephalogram signals and used for game evaluation
CN112120716A (en) * 2020-09-02 2020-12-25 中国人民解放军军事科学院国防科技创新研究院 Wearable multi-mode emotional state monitoring device
CN112381008A (en) * 2020-11-17 2021-02-19 天津大学 Electroencephalogram emotion recognition method based on parallel sequence channel mapping network
CN113057633A (en) * 2021-03-26 2021-07-02 华南理工大学 Multi-modal emotional stress recognition method and device, computer equipment and storage medium
CN113288146A (en) * 2021-05-26 2021-08-24 杭州电子科技大学 Electroencephalogram emotion classification method based on time-space-frequency combined characteristics
CN113705398A (en) * 2021-08-17 2021-11-26 陕西师范大学 Music electroencephalogram space-time characteristic classification method based on convolution-long and short term memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程程 (Cheng Cheng): "Research on Emotion Recognition Methods Based on Multimodal Deep Learning", China Master's Theses Full-text Database, 15 August 2021 (2021-08-15), pages 1 - 4 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694234A (en) * 2022-06-02 2022-07-01 杭州智诺科技股份有限公司 Emotion recognition method, system, electronic device and storage medium
CN114694234B (en) * 2022-06-02 2023-02-03 杭州智诺科技股份有限公司 Emotion recognition method, system, electronic device and storage medium
CN114913590A (en) * 2022-07-15 2022-08-16 山东海量信息技术研究院 Data emotion recognition method, device and equipment and readable storage medium
CN115985464A (en) * 2023-03-17 2023-04-18 山东大学齐鲁医院 Muscle fatigue degree classification method and system based on multi-modal data fusion
CN117520826A (en) * 2024-01-03 2024-02-06 武汉纺织大学 Multi-mode emotion recognition method and system based on wearable equipment
CN117520826B (en) * 2024-01-03 2024-04-05 武汉纺织大学 Multi-mode emotion recognition method and system based on wearable equipment

Similar Documents

Publication Publication Date Title
Tao et al. EEG-based emotion recognition via channel-wise attention and self attention
CN111209885B (en) Gesture information processing method and device, electronic equipment and storage medium
CN114424940A (en) Emotion recognition method and system based on multi-mode spatiotemporal feature fusion
CN113693613B (en) Electroencephalogram signal classification method, electroencephalogram signal classification device, computer equipment and storage medium
Majumdar et al. Robust greedy deep dictionary learning for ECG arrhythmia classification
Guo et al. A hybrid fuzzy cognitive map/support vector machine approach for EEG-based emotion classification using compressed sensing
CN110555468A (en) Electroencephalogram signal identification method and system combining recursion graph and CNN
Peng et al. OGSSL: A semi-supervised classification model coupled with optimal graph learning for EEG emotion recognition
Zhao et al. Applying contrast-limited adaptive histogram equalization and integral projection for facial feature enhancement and detection
WO2022012668A1 (en) Training set processing method and apparatus
Jinliang et al. EEG emotion recognition based on granger causality and capsnet neural network
Paul et al. Deep learning and its importance for early signature of neuronal disorders
Miah et al. Movie Oriented Positive Negative Emotion Classification from EEG Signal using Wavelet transformation and Machine learning Approaches
CN114595725B (en) Electroencephalogram signal classification method based on addition network and supervised contrast learning
Tang et al. A hybrid SAE and CNN classifier for motor imagery EEG classification
CN115238796A (en) Motor imagery electroencephalogram signal classification method based on parallel DAMSCN-LSTM
Salim et al. A review on hand gesture and sign language techniques for hearing impaired person
Indira et al. Deep Learning Methods for Data Science
CN112259228A (en) Depression screening method by dynamic attention network non-negative matrix factorization
Bhanumathi et al. Feedback artificial shuffled shepherd optimization-based deep maxout network for human emotion recognition using EEG signals
Kulkarni et al. Analysis of DEAP dataset for emotion recognition
CN116421200A (en) Brain electricity emotion analysis method of multi-task mixed model based on parallel training
WO2022188793A1 (en) Electrophysiological signal classification processing method and apparatus, computer device and storage medium
Zou et al. Multi-task motor imagery EEG classification using broad learning and common spatial pattern
Malathi et al. An estimation of PCA feature extraction in EEG-based emotion prediction with support vector machines

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination