CN117809354B - Emotion recognition method, medium and device based on head wearable device perception - Google Patents

Emotion recognition method, medium and device based on head wearable device perception

Info

Publication number
CN117809354B
CN117809354B
Authority
CN
China
Prior art keywords
emotion
data
emotion recognition
network
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410223747.7A
Other languages
Chinese (zh)
Other versions
CN117809354A
Inventor
张通
吴梦琪
王锦炫
陈俊龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou
South China University of Technology SCUT
Original Assignee
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou, South China University of Technology SCUT filed Critical Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou
Priority to CN202410223747.7A priority Critical patent/CN117809354B/en
Publication of CN117809354A publication Critical patent/CN117809354A/en
Application granted granted Critical
Publication of CN117809354B publication Critical patent/CN117809354B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of emotion recognition, and particularly provides an emotion recognition method, medium and device based on head wearable device perception. The method comprises the following steps: collecting multi-modal emotion data of a wearer; processing the data with a local fusion emotion recognition network, in which the left-eye input and the right-eye input undergo deep convolution, while the lower-left-face input and the lower-right-face input are each embedded together with their extracted action units through an embedding layer and then fed, together with the facial action coding, into a spatial-domain graph convolution; performing spatial mapping in a multi-layer perceptron and fusing the feature maps after computing spatial attention and channel attention to obtain emotion features; and fusing the emotion features and classifying them to obtain a composite emotion recognition result. By using facial action unit information in the local multi-view facial data to assist emotion perception, the method improves the robustness of the emotion information projected by the wearer's bodily appearance and thereby improves the precision of emotion discrimination.

Description

Emotion recognition method, medium and device based on head wearable device perception
Technical Field
The invention relates to the technical field of emotion recognition, in particular to an emotion recognition method, medium and device based on head wearable device perception.
Background
Mental diseases often affect a patient's daily life, study, work and social activities. Long-term anxiety and depression can hinder the patient's personal development and may even lead to self-injury or injury to others. Screening and monitoring of mental diseases, so that they can be diagnosed and treated in time, is therefore very important for patients with mental diseases.
Existing approaches to diagnosing mental diseases generally rely on questionnaires, or use external equipment to collect biological-signal data or expression data of the person being diagnosed, which are then processed for emotion recognition to obtain a diagnostic result. However, collecting biological-signal or expression data with external equipment usually requires the diagnosed person to stay in a specific detection environment for a short sampling period; the mental state shown in such a specific environment cannot fully represent the mental state in daily life, and the short sampling time limits the amount of data collected, which affects the accuracy of the diagnostic result. Using a head wearable device to collect emotion data and then processing and analysing those data removes the restrictions on detection location and sampling time and is therefore an ideal approach. However, when a head wearable device is used to acquire facial images, the device is so close to the face that a complete facial image is difficult to obtain, and several cameras must cooperate to capture images of the different facial regions. Existing methods can only extract emotion features of the whole (global) face when recognising expression trends and lack the ability to extract and fuse expression emotion features from multiple viewing angles.
Expressions are divided into macro-expressions and micro-expressions. Macro-expressions are facial expressions with obvious intensity and long duration; they are easy to detect and recognise, play a very important role in emotion recognition, and are the emotion data mainly used by existing emotion recognition techniques. However, because people sometimes hide or unconsciously suppress their emotions and can deliberately control and change macro-expressions, recognising emotion from macro-expressions alone is not accurate. Micro-expressions arise spontaneously in an unconscious state, are difficult to disguise or camouflage, and are generally directly related to the true emotion, so adding micro-expression data to emotion analysis is more reliable. However, a micro-expression usually lasts only about 1/25 s to 1/3 s, has low motion intensity, and is difficult to perceive and capture, so it is rarely used in existing emotion recognition techniques.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention aims to provide an emotion recognition method, medium and device based on head wearable device perception. The method can acquire local multi-view face data of the wearer and uses facial action unit information in that local multi-view face data to assist emotion perception, coordinating the consistency and integrity of the emotional states of multiple facial regions and improving the robustness of the emotion information projected by the wearer's bodily appearance, thereby improving the precision of emotion discrimination.
In order to achieve the above purpose, the invention is realized by the following technical scheme: an emotion recognition method based on head wearable device perception, the head wearable device comprising: the device comprises a device body and a multi-mode data acquisition device arranged on the device body; the multimode data acquisition device comprises four first camera modules for respectively acquiring pictures and videos of four visual angles of a wearer; the four views refer to: left eye, right eye, lower left face, lower right face;
The emotion recognition method based on the perception of the head wearable device comprises a wearer emotion recognition method; the emotion recognition method for the wearer comprises the following steps of:
Step X1, collecting multi-mode emotion data of a wearer; the multimode emotion data of the wearer comprise picture data I and video data I which are acquired through four camera modules I; the first picture data and the first video data comprise four view angle data;
Step X2, extracting emotion characteristics of the multimodal emotion data of the wearer:
Processing the first picture data and the first video data with local fusion emotion recognition networks respectively; for the first picture data, the four view-angle data are used directly as the four view-angle inputs of a local fusion emotion recognition network; for the first video data, a start frame and a peak frame are first extracted from each of the four view-angle data and then used as the four view-angle inputs of a local fusion emotion recognition network;
The four view-angle inputs are processed in the local fusion emotion recognition network as follows: the left-eye input and the right-eye input undergo deep convolution to extract local view-angle features; the lower-left-face input and the lower-right-face input are each embedded with their extracted action units through an embedding layer and then fed, together with the Facial Action Coding System (FACS), into a spatial-domain graph convolution to extract local view-angle features; the local view-angle features extracted from the four view-angle inputs are simultaneously input into a multi-layer perceptron for spatial mapping, and feature-map fusion is performed after computing spatial attention and channel attention to obtain the final emotion features;
And step X3, fusing the emotion characteristics obtained in the step X2, and obtaining a composite emotion recognition result through classification.
Preferably, the number of the local fusion emotion recognition networks is four, namely a local fusion emotion recognition network I, a local fusion emotion recognition network II, a local fusion emotion recognition network III and a local fusion emotion recognition network IV;
The local fusion emotion recognition network processes the first picture data to obtain macro expression emotion characteristics of the first picture data; processing the first picture data by the local fusion emotion recognition network II to obtain micro-expression emotion characteristics of the first picture data; processing the video data I by the local fusion emotion recognition network III to obtain macro expression emotion characteristics of the video data I; and processing the video data one by the local fusion emotion recognition network four to obtain micro-expression emotion characteristics of the video data one.
Preferably, each of the four local fusion emotion recognition networks comprises four local feature extraction units, one for each of the four view-angle inputs; the two local feature extraction units for the left-eye input and the right-eye input each comprise a deep convolution network I; the two local feature extraction units for the lower-left-face input and the lower-right-face input are each formed by connecting an embedding layer and a spatial-domain graph convolution network in sequence, the embedding layer being further connected with an action unit extractor and the spatial-domain graph convolution network being further connected with a Facial Action Coding System; the outputs of the four local feature extraction units are simultaneously connected to the multi-layer perceptron and are fused through channel attention and spatial attention;
Local fusion emotion recognition network III and local fusion emotion recognition network IV each additionally include an action amplification network in the two local feature extraction units for the left-eye input and the right-eye input; the left-eye input and the right-eye input each have the smile expression amplified by the action amplification network and are then fed into deep convolution network I to extract local view-angle features.
Preferably, in the step X2, the first picture data and the first video data are preprocessed respectively before being processed by the local fusion emotion recognition network;
Preprocessing the first picture data comprises performing face detection with serially connected preprocessing convolutional neural networks I; this face detection consists of generating candidate boxes, preliminarily screening the candidate boxes, and detecting facial key points: after convolution, activation, pooling and fully connected processing, the confidence, the coordinate offsets and the coordinates of five key points of each candidate box are output, thereby realizing face detection;
Preprocessing the first video data comprises performing face detection with a serially connected preprocessing multi-layer deep convolutional neural network II; this face detection means that the first video data are read frame by frame in video-streaming mode, each frame image is resized by the serially connected preprocessing multi-layer deep convolutional neural network II into an image pyramid, and operations on the pyramid data yield the face box, the key-point coordinates and the face classification, thereby realizing face detection; the preprocessing multi-layer deep convolutional neural network II comprises an image resizing layer, convolutional neural unit I, convolutional neural unit II, max pooling layer I, fully connected layer I, convolutional neural unit III, max pooling layer II and fully connected layer II which are connected in sequence, and a spatial attention layer connected between convolutional neural unit III and max pooling layer II.
Preferably, step X3 means: the emotion features obtained in step X2 are fused with a multi-modal adaptive fusion module: the input of the multi-modal adaptive fusion module is the emotion features X = {X_1, …, X_n}, where X_i is the i-th emotion feature and n is the number of emotion features; feature fusion is performed iteratively with an attention mechanism to finally obtain a fusion feature; the fusion feature is input into a classifier for learning to obtain a composite emotion recognition result; the composite emotion recognition result adopts a composite representation of the emotional state, namely emotion categories and their corresponding proportions.
Preferably, in the head wearable device, the multi-mode data acquisition device further comprises an audio acquisition module; in step X1, the multi-modal emotion data further include audio data; in step X2, emotion feature extraction is also performed on the audio data: the audio data are filtered, smoothed and framed; Mel-frequency cepstral coefficient (MFCC) features are extracted; the MFCC features are assembled into feature vectors and input into an attention-based BiLSTM neural network to extract emotion features;
In the step X1, the multi-modal emotion data further comprises text data; in the step X2, emotion feature extraction is also performed on the text data: and processing the text data by using a word2vec model to obtain a sequence context word vector representation, and extracting emotion characteristics by using an LSTM-based emotion analysis network.
Preferably, the head wearable device, the multi-mode data acquisition device further comprises a second camera module for acquiring pictures and videos of the observed person;
the emotion recognition method based on the perception of the head wearable equipment further comprises an observed person emotion recognition method; the emotion recognition method of the observed person comprises the following steps:
Step Y1, collecting multi-mode emotion data of an observed person; the multi-mode emotion data of the observed person comprises picture data II and video data II which are acquired through a camera module II;
step Y2, extracting emotion characteristics of the multi-mode emotion data of the observed person:
Carrying out emotion feature extraction on picture data II with expression convolutional neural network I, expression deep convolutional neural network II and a graph neural network respectively, obtaining the macro-expression emotion features, micro-expression emotion features and gesture emotion features of picture data II;
extracting emotion characteristics of the video data II by adopting a three-dimensional convolutional neural network III and a convolutional neural network IV based on peak frame optical flow respectively to obtain macro expression emotion characteristics and micro expression emotion characteristics of the video data II;
Emotion feature extraction by convolutional neural network IV based on peak-frame optical flow means: first, video data II are preprocessed by rotation, cropping and face alignment, and the peak frame is extracted with a peak-frame detection algorithm; then the optical flow vectors u, v and the optical strain ε between the start frame and the peak frame of video data II are calculated; after graying, u, v and ε are taken as the three channels of an RGB image and combined into one RGB image; emotion features are then extracted from this RGB image;
The optical flow vectors u, v and the optical strain ε are calculated using the following set of equations:
I(x, y, t) = I(x + dx, y + dy, t + dt)
I_x·dx + I_y·dy + I_t·dt = 0, i.e. I_x·u + I_y·v + I_t = 0, with u = dx/dt, v = dy/dt
ε = ½[∇p + (∇p)ᵀ], where p = (u, v) is the optical flow field
wherein I(x, y, t) represents the light intensity of a pixel point in the start frame; t represents the time dimension; dx, dy represent the horizontal and vertical displacements from the start frame to the peak frame, respectively; dt represents the time taken to move from the start frame to the peak frame; I_x = ∂I/∂x, I_y = ∂I/∂y, I_t = ∂I/∂t denote the partial derivatives of the pixel gray scale along the x, y and t directions, respectively, and are obtained from the image data;
and Y3, fusing the emotion characteristics obtained in the step Y2, and obtaining a composite emotion recognition result through classification.
Preferably, the method for establishing the individual data storage database is further included: constructing an individual ID in an individual data storage database; before the emotion recognition method of the wearer or the emotion recognition method of the observed person starts, the individual ID of the wearer or the observed person is acquired; after the emotion recognition method of the wearer or the observed person is completed, the emotion recognition result obtained by the emotion recognition method of the wearer or the observed person and the time and place are stored in the corresponding individual ID.
A readable storage medium, wherein the storage medium stores a computer program which, when executed by a processor, causes the processor to perform the emotion recognition method based on head wearable device perception.
The computer equipment comprises a processor and a memory for storing a program executable by the processor, wherein the emotion recognition method based on the perception of the head wearable equipment is realized when the processor executes the program stored by the memory.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention is based on the head wearable equipment, can collect the data of the wearer in real time, and further realize emotion perception and recognition of the wearer; the method has the advantages that local multi-view expression data of a wearer can be acquired, a local fusion emotion recognition network of macro expression and micro expression is assisted by facial action unit information, and the expression capacity of macro expression and micro expression characteristics is improved;
2. The invention can recognise composite emotional states, which accords with the neurological understanding of emotion as a multidimensional, multi-level experience;
3. The invention supports the extraction and fusion of the emotion characteristics of the multi-modal emotion data, and has the capability of mutually supporting and mutually supplementing the multi-modal emotion data;
4. The invention realizes the collection of the emotion signals of the cross-individuals, can collect and process the emotion signals of the user and a plurality of testees at the same time, and expands the application scene; for example, the method can be used in the scenes of assisting in mental disease inquiry, daily communication, mental disease screening, mental disease patient daily emotion change monitoring and the like; thereby the invention has the capability of improving the auxiliary screening, diagnosis and monitoring efficiency;
5. The method comprises the steps of establishing an individual data storage database from collected data and emotion analysis results; the method is beneficial to mining emotion requirements in home life, assisting in psychological disease diagnosis and recording daily emotion states; carrying out the review and analysis of the emotion state by combining the historical activities and data records of the face and the equipment with real-time data, and establishing the emotion personal portrait of the user so as to know the change of the emotion of the individual in time, environment and experience; and meanwhile, information such as places, background pictures, wearing and facial expressions and the like acquired by the equipment are processed, and factors affecting the emotion state are analyzed.
Drawings
FIG. 1 is a schematic structural view of a head wearable device of the present invention;
FIG. 2 is a schematic flow chart of the emotion recognition method of the wearer of the present invention;
FIG. 3 is a block diagram of a pre-processing multi-layer deep convolutional neural network two of the present invention;
FIG. 4 is a block diagram of a first and a second local fusion emotion recognition networks of the present invention;
FIG. 5 is a block diagram of a third and a fourth local fusion emotion recognition network of the present invention;
FIG. 6 is a block diagram of a multi-modal adaptive fusion module of the present invention;
FIG. 7 is a schematic flow chart of a preferred embodiment of the emotion recognition method of the present invention;
FIG. 8 is a schematic structural view of a two-head wearable device of an embodiment;
fig. 9 is a flowchart of a second embodiment of a method for identifying emotion of an observed person.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
Example 1
In the emotion recognition method based on head wearable device perception of this embodiment, the head wearable device comprises a device body 1 and a multi-mode data acquisition device arranged on the device body; in this embodiment the device body 1 is a pair of glasses, as shown in fig. 1; the multi-mode data acquisition device comprises four first camera modules 3, 4, 5 and 6 for respectively acquiring pictures and videos of four view angles of the wearer, the four view angles being: left eye, right eye, lower left face, lower right face; first camera module 3 collects pictures and videos of the wearer's right eye; first camera module 4 collects pictures and videos of the wearer's left eye; first camera module 5 collects pictures and videos of the wearer's lower right face; first camera module 6 collects pictures and videos of the wearer's lower left face.
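As a purely illustrative sketch (not part of the patent), the four first camera modules could be read out as ordinary video devices and tagged with their view angles; the device indices, the OpenCV-based capture and the view mapping below are assumptions.

# Minimal sketch (assumptions: OpenCV-compatible cameras, device indices 0..3).
import cv2

VIEW_BY_DEVICE = {0: "right_eye", 1: "left_eye",
                  2: "lower_right_face", 3: "lower_left_face"}

def grab_multiview_frames():
    frames = {}
    for idx, view in VIEW_BY_DEVICE.items():
        cap = cv2.VideoCapture(idx)
        ok, frame = cap.read()          # one grab per first camera module
        if ok:
            frames[view] = frame        # BGR image for this view angle
        cap.release()
    return frames                       # e.g. {"right_eye": ndarray, ...}

if __name__ == "__main__":
    views = grab_multiview_frames()
    print({v: f.shape for v, f in views.items()})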
The emotion recognition method based on the perception of the head wearable device comprises a wearer emotion recognition method.
The emotion recognition method of the wearer, as shown in fig. 2, comprises the following steps:
Step X1, collecting multi-mode emotion data of a wearer; the multimode emotion data of the wearer comprise picture data I and video data I which are acquired through four camera modules I; the first picture data and the first video data comprise four view angle data;
Step X2, extracting emotion characteristics of the multimodal emotion data of the wearer:
first, it is preferable to perform preprocessing on the first picture data and the first video data, respectively:
Preprocessing the first picture data comprises performing face detection with serially connected preprocessing convolutional neural networks I; this face detection consists of generating candidate boxes, preliminarily screening the candidate boxes, and detecting facial key points: after convolution, activation, pooling and fully connected processing, the confidence, the coordinate offsets and the coordinates of five key points of each candidate box are output, thereby realizing face detection;
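The following minimal sketch illustrates one stage of such a cascaded candidate-box detector in the spirit of the description above; the layer widths, the 48×48 candidate crops and the 0.9 confidence threshold are assumptions, not values taken from the patent.

# One stage of a cascaded face detector: scores candidate boxes and predicts
# box offsets and five facial key points (layer sizes are illustrative).
import torch
import torch.nn as nn

class CandidateHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3), nn.PReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(32, 64, 3), nn.PReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(64, 64, 3), nn.PReLU(),
            nn.Flatten(), nn.Linear(64 * 7 * 7, 256), nn.PReLU(),
        )
        self.confidence = nn.Linear(256, 1)   # face / non-face score per candidate
        self.box_offset = nn.Linear(256, 4)   # coordinate offsets of the candidate box
        self.landmarks = nn.Linear(256, 10)   # five key points, (x, y) each

    def forward(self, crops):                 # crops: (N, 3, 48, 48) candidate boxes
        f = self.features(crops)
        return torch.sigmoid(self.confidence(f)), self.box_offset(f), self.landmarks(f)

def screen_candidates(head, crops, threshold=0.9):
    conf, offsets, pts = head(crops)
    keep = conf.squeeze(1) > threshold        # preliminary screening by confidence
    return crops[keep], offsets[keep], pts[keep]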
Preprocessing the first video data comprises performing face detection with a serially connected preprocessing multi-layer deep convolutional neural network II; this face detection means that the first video data are read frame by frame in video-streaming mode, each frame image is resized by the serially connected preprocessing multi-layer deep convolutional neural network II into an image pyramid, and operations on the pyramid data yield the face box, the key-point coordinates and the face classification, thereby realizing face detection; the preprocessing multi-layer deep convolutional neural network II comprises an image resizing layer, convolutional neural unit I, convolutional neural unit II, max pooling layer I, fully connected layer I, convolutional neural unit III, max pooling layer II and fully connected layer II which are connected in sequence, and a spatial attention layer connected between convolutional neural unit III and max pooling layer II, as shown in fig. 3.
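A possible PyTorch rendering of preprocessing multi-layer deep convolutional neural network II, following the layer order described above, is sketched below; the channel counts, the 64×64 working resolution and the reading of fully connected layer I as a 1×1 convolution (so that convolutional neural unit III still receives a spatial map) are assumptions.

# Sketch of preprocessing network II: resize, conv units, pooling, spatial attention,
# and three output heads (face classification, face box, key points).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                              # x: (N, C, H, W)
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        att = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * att                                 # re-weight spatial positions

class PreprocessNet2(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_unit1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.conv_unit2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool1 = nn.MaxPool2d(2)
        self.fc1 = nn.Conv2d(32, 64, 1)                # "fully connected layer I", per location
        self.conv_unit3 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.attn = SpatialAttention()                 # between conv unit III and max pool II
        self.pool2 = nn.MaxPool2d(2)
        self.fc2 = nn.Linear(64 * 16 * 16, 128)
        self.cls_head = nn.Linear(128, 2)              # face / non-face classification
        self.box_head = nn.Linear(128, 4)              # face bounding box
        self.pts_head = nn.Linear(128, 10)             # key-point coordinates

    def forward(self, frame):                          # frame: (N, 3, H, W) video frame
        x = F.interpolate(frame, size=(64, 64))        # image resizing layer
        x = self.pool1(self.conv_unit2(self.conv_unit1(x)))
        x = self.conv_unit3(self.fc1(x))
        x = self.pool2(self.attn(x))
        x = self.fc2(x.flatten(1))
        return self.cls_head(x), self.box_head(x), self.pts_head(x)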
Then, the first picture data and the first video data are processed with local fusion emotion recognition networks respectively; for the first picture data, the four view-angle data are used directly as the four view-angle inputs of a local fusion emotion recognition network; for the first video data, a start frame and a peak frame are extracted from each of the four view-angle data to serve as the four view-angle inputs of a local fusion emotion recognition network.
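The patent does not state how the start frame and peak frame are selected; the sketch below uses a simple frame-difference heuristic as an illustrative assumption: the first frame is taken as the start frame, and the frame differing most from it as the peak frame.

# Illustrative assumption (not specified by the patent): first frame = start frame,
# frame with the largest mean absolute difference from it = peak (apex) frame.
import cv2
import numpy as np

def extract_start_and_peak(video_path):
    cap = cv2.VideoCapture(video_path)
    ok, start = cap.read()
    if not ok:
        raise ValueError("empty video: " + video_path)
    start_gray = cv2.cvtColor(start, cv2.COLOR_BGR2GRAY).astype(np.float32)
    peak, best_diff = start, 0.0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        diff = np.abs(gray - start_gray).mean()        # mean intensity change vs. start frame
        if diff > best_diff:
            best_diff, peak = diff, frame
    cap.release()
    return start, peak                                 # the two frames used as view-angle inputs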
Specifically, the number of the local fusion emotion recognition networks is four, namely a local fusion emotion recognition network I, a local fusion emotion recognition network II, a local fusion emotion recognition network III and a local fusion emotion recognition network IV.
Processing the first picture data by the local fusion emotion recognition network to obtain macro expression emotion characteristics of the first picture data; and processing the first picture data by the local fusion emotion recognition network II to obtain micro-expression emotion characteristics of the first picture data.
The structure of local fusion emotion recognition network I and local fusion emotion recognition network II is shown in fig. 4; each of them comprises four local feature extraction units, one for each of the four view-angle inputs. The two local feature extraction units for the left-eye input and the right-eye input each comprise a deep convolution network I; the two local feature extraction units for the lower-left-face input and the lower-right-face input are each formed by connecting an embedding layer and a spatial-domain graph convolution network in sequence, the embedding layer being further connected with an action unit extractor and the spatial-domain graph convolution network being further connected with a Facial Action Coding System; the outputs of the four local feature extraction units are simultaneously connected to the multi-layer perceptron and are fused through channel attention and spatial attention.
Processing the video data I by the local fusion emotion recognition network III to obtain macro expression emotion characteristics of the video data I; and processing the video data one by the local fusion emotion recognition network four to obtain micro-expression emotion characteristics of the video data one.
The structure of local fusion emotion recognition network III and local fusion emotion recognition network IV is shown in fig. 5; each of them comprises four local feature extraction units, one for each of the four view-angle inputs. The two local feature extraction units for the left-eye input and the right-eye input each comprise an action amplification network and a deep convolution network I; the left-eye input and the right-eye input each have the smile expression amplified by the action amplification network and are then fed into deep convolution network I to extract local view-angle features. The two local feature extraction units for the lower-left-face input and the lower-right-face input are each formed by connecting an embedding layer and a spatial-domain graph convolution network in sequence, the embedding layer being further connected with an action unit extractor and the spatial-domain graph convolution network being further connected with a Facial Action Coding System; the outputs of the four local feature extraction units are simultaneously connected to the multi-layer perceptron and are fused through channel attention and spatial attention.
The four view-angle inputs are processed in the local fusion emotion recognition network as follows: the left-eye input and the right-eye input undergo deep convolution to extract local view-angle features; the lower-left-face input and the lower-right-face input are each embedded with their extracted action units through an embedding layer and then fed, together with the Facial Action Coding System (FACS), into a spatial-domain graph convolution to extract local view-angle features; FACS (Facial Action Coding System) refers to a coding system that describes facial muscle movements. The local view-angle features extracted from the four view-angle inputs are simultaneously input into a multi-layer perceptron for spatial mapping, and feature-map fusion is performed after computing spatial attention and channel attention to obtain the final emotion features.
The four local fusion emotion recognition networks use convolutional neural networks to extract emotion features and adopt a facial expression recognition network based on privileged action unit information, with AUs (action units, the minimal facial movement units used to describe facial movements and expression characteristics) serving as privileged information to guide emotion recognition; at the same time, AUs are used as auxiliary output labels of the shallow layers of the network to assist the representation of shallow features in the model.
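A condensed sketch of the forward pass of such a local fusion emotion recognition network is given below; only the overall wiring follows the description, while the upstream AU extractor (assumed to supply per-AU intensities for the lower-face crops), the FACS-derived adjacency matrix, the feature sizes and the exact attention and fusion arithmetic are assumptions.

# Condensed sketch of the local fusion network wiring; sizes and fusion details are assumptions.
import torch
import torch.nn as nn

class DepthwiseBranch(nn.Module):                      # eye inputs: deep (depthwise) convolution
    def __init__(self, ch=3, dim=128):
        super().__init__()
        self.dw = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch), nn.Conv2d(ch, 32, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(32 * 16, dim))

    def forward(self, x):
        return self.dw(x)

class GraphBranch(nn.Module):                          # lower-face inputs: AU embedding + graph conv
    def __init__(self, n_aus=17, dim=128):
        super().__init__()
        self.embed = nn.Linear(1, dim)                 # embedding layer for extracted AU intensities
        self.gcn = nn.Linear(dim, dim)                 # one spatial-domain graph convolution step

    def forward(self, au_intensity, adjacency):        # au_intensity: (N, n_aus); adjacency from FACS
        h = self.embed(au_intensity.unsqueeze(-1))     # (N, n_aus, dim)
        h = torch.relu(adjacency @ self.gcn(h))        # propagate along FACS-derived AU relations
        return h.mean(dim=1)                           # (N, dim) local view-angle feature

class LocalFusionNet(nn.Module):
    def __init__(self, dim=128, n_classes=7):
        super().__init__()
        self.left_eye, self.right_eye = DepthwiseBranch(dim=dim), DepthwiseBranch(dim=dim)
        self.ll_face, self.lr_face = GraphBranch(dim=dim), GraphBranch(dim=dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())   # shared spatial mapping
        self.channel_att = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.spatial_att = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.head = nn.Linear(dim, n_classes)

    def forward(self, le, re, ll_au, lr_au, adjacency):
        views = torch.stack([self.mlp(self.left_eye(le)),
                             self.mlp(self.right_eye(re)),
                             self.mlp(self.ll_face(ll_au, adjacency)),
                             self.mlp(self.lr_face(lr_au, adjacency))], dim=1)   # (N, 4, dim)
        views = views * self.channel_att(views)        # channel attention: re-weight feature channels
        weights = self.spatial_att(views)              # attention over the four view features
        fused = (views * weights).sum(dim=1)           # fused emotion feature for this network
        return fused, self.head(fused)                 # feature (used downstream) + optional logits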
Step X3: the emotion features obtained in step X2 are fused with a multi-modal adaptive fusion module: the structure of the multi-modal adaptive fusion module is shown in fig. 6; its input is the emotion features X = {X_1, …, X_n}, where X_i is the i-th emotion feature and n is the number of emotion features; feature fusion is performed iteratively with an attention mechanism to finally obtain a fusion feature; the fusion feature is input into a classifier for learning to obtain a composite emotion recognition result; the composite emotion recognition result adopts a composite representation of the emotional state, namely emotion categories and their corresponding proportions, for example: happy 75%, sad 1%, surprised 5%, fear 1.5%, disgust 1.5%, angry 1%, neutral 15%.
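A sketch of the multi-modal adaptive fusion module and of the composite readout (emotion categories with proportions) is given below; the learned query vector, the number of attention iterations and the linear classifier are assumptions.

# Sketch of iterative attention fusion over n modality features plus a composite readout.
import torch
import torch.nn as nn

EMOTIONS = ["happy", "sad", "surprised", "fear", "disgust", "angry", "neutral"]

class AdaptiveFusion(nn.Module):
    def __init__(self, dim=128, n_iters=3, n_classes=len(EMOTIONS)):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(dim))    # fusion query, refined iteratively
        self.key = nn.Linear(dim, dim)
        self.n_iters = n_iters
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, features):                  # features: (N, n, dim) = {X_1, ..., X_n}
        q = self.query.expand(features.size(0), -1)
        for _ in range(self.n_iters):             # iterative attention-based feature fusion
            scores = torch.einsum("nd,nkd->nk", q, self.key(features))
            weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
            q = (weights * features).sum(dim=1)   # updated fusion feature
        proportions = torch.softmax(self.classifier(q), dim=-1)
        return q, proportions                     # fused feature, composite emotion proportions

# Example readout of the composite representation (emotion category : proportion):
# fused, p = AdaptiveFusion()(torch.randn(1, 5, 128))
# print({e: round(float(v) * 100, 1) for e, v in zip(EMOTIONS, p[0])})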
In this embodiment, the head wearable device, the multi-modal data collection apparatus preferably further comprises an audio collection module 7, such as a microphone;
the preferable scheme of the emotion recognition method of the wearer is shown in fig. 7:
in the step X1, the multi-modal emotion data further comprises audio data and text data; text data can be obtained through an applet/mobile terminal APP;
In step X2, emotion feature extraction is also performed on the audio data: the audio data are filtered, smoothed and framed; Mel-frequency cepstral coefficient (MFCC) features are extracted; the MFCC features are assembled into feature vectors and input into an attention-based BiLSTM neural network to extract emotion features;
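A sketch of this audio branch, assuming librosa for pre-emphasis filtering and MFCC extraction (which frames the signal internally) and PyTorch for the attention-based BiLSTM, is given below; the sampling rate, hidden sizes and attention form are assumptions.

# Audio branch sketch: pre-emphasis, MFCC feature vectors, attention-based BiLSTM.
import librosa
import torch
import torch.nn as nn

def mfcc_features(wav_path, n_mfcc=40):
    y, sr = librosa.load(wav_path, sr=16000)
    y = librosa.effects.preemphasis(y)                 # simple filtering / smoothing step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return torch.tensor(mfcc.T, dtype=torch.float32)   # (frames, n_mfcc) feature vectors

class AttentiveBiLSTM(nn.Module):
    def __init__(self, n_mfcc=40, hidden=64, dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, dim)

    def forward(self, x):                              # x: (N, frames, n_mfcc)
        h, _ = self.lstm(x)                            # (N, frames, 2*hidden)
        w = torch.softmax(self.att(h), dim=1)          # attention over time frames
        return self.out((w * h).sum(dim=1))            # audio emotion feature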
emotion feature extraction is also performed on the text data: and processing the text data by using a word2vec model to obtain a sequence context word vector representation, and extracting emotion characteristics by using an LSTM-based emotion analysis network.
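A sketch of the text branch, assuming gensim word2vec vectors feeding an LSTM-based emotion analysis network, is shown below; the corpus handling, vector size and hidden size are assumptions.

# Text branch sketch: word2vec context vectors (gensim) + LSTM emotion feature extractor.
import torch
import torch.nn as nn
from gensim.models import Word2Vec

def build_w2v(tokenized_sentences, vector_size=100):
    return Word2Vec(sentences=tokenized_sentences, vector_size=vector_size,
                    window=5, min_count=1)             # sequence-context word vectors

def sentence_tensor(w2v, tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return torch.tensor(vecs, dtype=torch.float32).unsqueeze(0)   # (1, T, vector_size)

class TextEmotionLSTM(nn.Module):
    def __init__(self, vector_size=100, hidden=64, dim=128):
        super().__init__()
        self.lstm = nn.LSTM(vector_size, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):                              # x: (N, T, vector_size)
        _, (h, _) = self.lstm(x)
        return self.out(h[-1])                         # text emotion feature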
The method for establishing the individual data storage database is preferably further included in the embodiment: constructing an individual ID in an individual data storage database; before the emotion recognition method of the wearer or the emotion recognition method of the observed person starts, the individual ID of the wearer or the observed person is acquired; after the emotion recognition method of the wearer or the observed person is completed, the emotion recognition result obtained by the emotion recognition method of the wearer or the observed person and the time and place are stored in the corresponding individual ID. The location may be obtained by the GPS module 2 of the head wearable device.
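A minimal sketch of such an individual data storage database, assuming an SQLite backend with a hypothetical schema (individual ID, time, place, serialized composite result), is given below.

# Individual data storage sketch: each recognition result is stored under its individual ID
# together with time and place; the table layout is an assumption.
import json
import sqlite3
from datetime import datetime

def init_db(path="emotion_records.db"):
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS emotion_record (
                       individual_id TEXT,
                       recorded_at   TEXT,
                       place         TEXT,
                       result_json   TEXT)""")         # composite result, e.g. {"happy": 75, ...}
    return con

def store_result(con, individual_id, place, result):
    con.execute("INSERT INTO emotion_record VALUES (?, ?, ?, ?)",
                (individual_id, datetime.now().isoformat(), place, json.dumps(result)))
    con.commit()

def history(con, individual_id):
    rows = con.execute("SELECT recorded_at, place, result_json FROM emotion_record "
                       "WHERE individual_id = ? ORDER BY recorded_at", (individual_id,))
    return [(t, p, json.loads(r)) for t, p, r in rows]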
The user's personal emotion portrait is established by using database technology to store the data records collected by the device, including individual ID, time, place, emotional state, high-frequency spoken words, picture background information, emotional state results and clothing (colour); data statistics are carried out and, combined with real-time data, the emotional state is reviewed and analysed so as to understand how the individual's emotion changes with time, environment and experience, and the factors affecting the emotional state are analysed.
By establishing an emotional state portrait of the individual in the time dimension, the individual's high-frequency spoken words are counted and visualised with a histogram; for the emotional states at different times, an emotion change curve and a frequency statistics chart of high-frequency emotions are given; a negative-emotion early warning is issued when negative emotions occur frequently, together with corresponding reference suggestions.
By establishing an emotional state portrait of the individual in the spatial dimension, data such as position, picture background, emotional state, clothing and high-frequency words are counted and analysed to identify the factors affecting emotion, including the influence of environment and place; for example, the emotional states in the same environment are counted and a brief statistical report is given.
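As an illustrative sketch of the time-dimension portrait, the records returned by the hypothetical history() helper above could be summarized into an emotion change curve and a simple negative-emotion early warning; the 30% warning threshold and the grouping of negative emotions are assumptions.

# Time-dimension portrait sketch: dominant emotion per record, counts, and an early warning.
import collections

NEGATIVE = {"sad", "fear", "disgust", "angry"}

def emotion_curve(records):
    """records: list of (recorded_at, place, {emotion: percent}) as returned by history()."""
    curve = [(t, max(result, key=result.get)) for t, _, result in records]   # dominant emotion over time
    counts = collections.Counter(label for _, label in curve)
    neg_ratio = sum(counts[e] for e in NEGATIVE) / max(len(curve), 1)
    warning = neg_ratio > 0.30            # issue an early warning when negative emotions dominate
    return curve, counts, warning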
Example two
The difference between the emotion recognition method based on head wearable device perception and the first embodiment is that: in this embodiment, the head wearable device, the multi-mode data acquisition device further includes a second camera module 8 for acquiring pictures and videos of the observed person, as shown in fig. 8.
The emotion recognition method based on the perception of the head wearable equipment further comprises an observed person emotion recognition method; as shown in fig. 9, the observed emotion recognition method includes the steps of:
Step Y1, collecting multi-mode emotion data of an observed person; the multi-mode emotion data of the observed person comprises picture data II and video data II which are acquired through a camera module II 8;
step Y2, extracting emotion characteristics of the multi-mode emotion data of the observed person:
Carrying out emotion feature extraction on picture data II with expression convolutional neural network I, expression deep convolutional neural network II and a graph neural network respectively, obtaining the macro-expression emotion features, micro-expression emotion features and gesture emotion features of picture data II;
extracting emotion characteristics of the video data II by adopting a three-dimensional convolutional neural network III and a convolutional neural network IV based on peak frame optical flow respectively to obtain macro expression emotion characteristics and micro expression emotion characteristics of the video data II;
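A minimal sketch of three-dimensional convolutional neural network III as a small 3D CNN over a clip of video data II is given below (convolutional neural network IV, based on peak-frame optical flow, is detailed next); the clip length, channel counts and pooling schedule are assumptions.

# Sketch of a small 3D CNN for macro-expression features from a video clip.
import torch
import torch.nn as nn

class MacroExpr3DCNN(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, clip):               # clip: (N, 3, T, H, W), e.g. 16 aligned face frames
        return self.net(clip)              # macro-expression emotion feature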
Emotion feature extraction by convolutional neural network IV based on peak-frame optical flow means: first, video data II are preprocessed by rotation, cropping and face alignment, and the peak frame is extracted with a peak-frame detection algorithm; then the optical flow vectors u, v and the optical strain ε between the start frame and the peak frame of video data II are calculated; after graying, u, v and ε are taken as the three channels of an RGB image and combined into one RGB image; emotion features are then extracted from this RGB image;
The optical flow vectors u, v and the optical strain ε are calculated using the following set of equations:
I(x, y, t) = I(x + dx, y + dy, t + dt)
I_x·dx + I_y·dy + I_t·dt = 0, i.e. I_x·u + I_y·v + I_t = 0, with u = dx/dt, v = dy/dt
ε = ½[∇p + (∇p)ᵀ], where p = (u, v) is the optical flow field
wherein I(x, y, t) represents the light intensity of a pixel point in the start frame; t represents the time dimension; dx, dy represent the horizontal and vertical displacements from the start frame to the peak frame, respectively; dt represents the time taken to move from the start frame to the peak frame; I_x = ∂I/∂x, I_y = ∂I/∂y, I_t = ∂I/∂t denote the partial derivatives of the pixel gray scale along the x, y and t directions, respectively, and are obtained from the image data.
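A sketch of how the optical-flow input of convolutional neural network IV could be assembled with OpenCV is given below; the patent does not name a specific flow algorithm, so the Farnebäck method and its parameters are assumptions, and the strain magnitude follows the symmetric flow-gradient definition given above.

# Peak-frame optical-flow input sketch: u, v and strain magnitude stacked as one 3-channel image.
import cv2
import numpy as np

def flow_strain_image(start_bgr, peak_bgr):
    prev = cv2.cvtColor(start_bgr, cv2.COLOR_BGR2GRAY)     # graying treatment
    curr = cv2.cvtColor(peak_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]
    ux, uy = np.gradient(u)[1], np.gradient(u)[0]
    vx, vy = np.gradient(v)[1], np.gradient(v)[0]
    # optical strain magnitude from the symmetric part of the flow gradient
    strain = np.sqrt(ux**2 + vy**2 + 0.5 * (uy + vx)**2)

    def norm(a):                                            # scale each map to 0..255
        a = a - a.min()
        return (255 * a / (a.max() + 1e-8)).astype(np.uint8)

    return cv2.merge([norm(u), norm(v), norm(strain)])      # 3-channel image fed to the CNN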
In step Y1 and step Y2, the multimodal emotion data of the observed person may further include audio data two; and extracting emotion characteristics of the second audio data.
And Y3, fusing the emotion characteristics obtained in the step Y2, and obtaining a composite emotion recognition result through classification.
Example III
The readable storage medium of this embodiment stores a computer program, which when executed by a processor, causes the processor to perform the emotion recognition method based on perception of a head wearable device described in the first or second embodiment.
Example IV
The computer device of the present embodiment includes a processor and a memory for storing a program executable by the processor, where when the processor executes the program stored in the memory, the emotion recognition method based on perception of the head wearable device of the first embodiment or the second embodiment is implemented.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.

Claims (8)

1. An emotion recognition method based on head wearable equipment perception is characterized by comprising the following steps of: the head wearable device includes: the device comprises a device body and a multi-mode data acquisition device arranged on the device body; the multimode data acquisition device comprises four first camera modules for respectively acquiring pictures and videos of four visual angles of a wearer; the four views refer to: left eye, right eye, lower left face, lower right face;
The emotion recognition method based on the perception of the head wearable device comprises a wearer emotion recognition method; the emotion recognition method for the wearer comprises the following steps of:
Step X1, collecting multi-mode emotion data of a wearer; the multimode emotion data of the wearer comprise picture data I and video data I which are acquired through four camera modules I; the first picture data and the first video data comprise four view angle data;
Step X2, extracting emotion characteristics of the multimodal emotion data of the wearer:
Processing the first picture data and the first video data with local fusion emotion recognition networks respectively; for the first picture data, the four view-angle data are used directly as the four view-angle inputs of a local fusion emotion recognition network; for the first video data, a start frame and a peak frame are first extracted from each of the four view-angle data and then used as the four view-angle inputs of a local fusion emotion recognition network;
The four view-angle inputs are processed in the local fusion emotion recognition network as follows: the left-eye input and the right-eye input undergo deep convolution to extract local view-angle features; the lower-left-face input and the lower-right-face input are each embedded with their extracted action units through an embedding layer and then fed, together with the Facial Action Coding System (FACS), into a spatial-domain graph convolution to extract local view-angle features; the local view-angle features extracted from the four view-angle inputs are simultaneously input into a multi-layer perceptron for spatial mapping, and feature-map fusion is performed after computing spatial attention and channel attention to obtain the final emotion features;
Step X3, fusing the emotion characteristics obtained in the step X2, and obtaining a composite emotion recognition result through classification;
The local fusion emotion recognition networks are four, namely a local fusion emotion recognition network I, a local fusion emotion recognition network II, a local fusion emotion recognition network III and a local fusion emotion recognition network IV;
The local fusion emotion recognition network processes the first picture data to obtain macro expression emotion characteristics of the first picture data; processing the first picture data by the local fusion emotion recognition network II to obtain micro-expression emotion characteristics of the first picture data; processing the video data I by the local fusion emotion recognition network III to obtain macro expression emotion characteristics of the video data I; processing the first video data by the local fusion emotion recognition network IV to obtain micro-expression emotion characteristics of the first video data;
each of the four local fusion emotion recognition networks comprises four local feature extraction units, one for each of the four view-angle inputs; the two local feature extraction units for the left-eye input and the right-eye input each comprise a deep convolution network I; the two local feature extraction units for the lower-left-face input and the lower-right-face input are each formed by connecting an embedding layer and a spatial-domain graph convolution network in sequence, the embedding layer being further connected with an action unit extractor and the spatial-domain graph convolution network being further connected with a Facial Action Coding System; the outputs of the four local feature extraction units are simultaneously connected to the multi-layer perceptron and are fused through channel attention and spatial attention;
Local fusion emotion recognition network III and local fusion emotion recognition network IV each additionally include an action amplification network in the two local feature extraction units for the left-eye input and the right-eye input; the left-eye input and the right-eye input each have the smile expression amplified by the action amplification network and are then fed into deep convolution network I to extract local view-angle features.
2. The emotion recognition method based on head wearable device perception according to claim 1, characterized in that: in the step X2, the first picture data and the first video data are preprocessed respectively before being processed by adopting a local fusion emotion recognition network;
Preprocessing the first picture data comprises performing face detection with serially connected preprocessing convolutional neural networks I; this face detection consists of generating candidate boxes, preliminarily screening the candidate boxes, and detecting facial key points: after convolution, activation, pooling and fully connected processing, the confidence, the coordinate offsets and the coordinates of five key points of each candidate box are output, thereby realizing face detection;
Preprocessing the first video data comprises performing face detection with a serially connected preprocessing multi-layer deep convolutional neural network II; this face detection means that the first video data are read frame by frame in video-streaming mode, each frame image is resized by the serially connected preprocessing multi-layer deep convolutional neural network II into an image pyramid, and operations on the pyramid data yield the face box, the key-point coordinates and the face classification, thereby realizing face detection; the preprocessing multi-layer deep convolutional neural network II comprises an image resizing layer, convolutional neural unit I, convolutional neural unit II, max pooling layer I, fully connected layer I, convolutional neural unit III, max pooling layer II and fully connected layer II which are connected in sequence, and a spatial attention layer connected between convolutional neural unit III and max pooling layer II.
3. The emotion recognition method based on head wearable device perception according to claim 1, characterized in that: step X3 refers to: fusing the emotion features obtained in step X2 with a multi-modal adaptive fusion module: the input of the multi-modal adaptive fusion module is the emotion features X = {X_1, …, X_n}, where X_i is the i-th emotion feature and n is the number of emotion features; feature fusion is performed iteratively with an attention mechanism to finally obtain a fusion feature; the fusion feature is input into a classifier for learning to obtain a composite emotion recognition result; the composite emotion recognition result adopts a composite representation of the emotional state, namely emotion categories and their corresponding proportions.
4. The emotion recognition method based on head wearable device perception according to claim 1, characterized in that: the multi-mode data acquisition device further comprises an audio acquisition module; in step X1, the multi-modal emotion data further include audio data; in step X2, emotion feature extraction is also performed on the audio data: the audio data are filtered, smoothed and framed; Mel-frequency cepstral coefficient (MFCC) features are extracted; the MFCC features are assembled into feature vectors and input into an attention-based BiLSTM neural network to extract emotion features;
In the step X1, the multi-modal emotion data further comprises text data; in the step X2, emotion feature extraction is also performed on the text data: and processing the text data by using a word2vec model to obtain a sequence context word vector representation, and extracting emotion characteristics by using an LSTM-based emotion analysis network.
5. The emotion recognition method based on head wearable device perception according to claim 1, characterized in that: the multi-mode data acquisition device further comprises a second camera module for acquiring pictures and videos of observed persons;
the emotion recognition method based on the perception of the head wearable equipment further comprises an observed person emotion recognition method; the emotion recognition method of the observed person comprises the following steps:
Step Y1, collecting multi-mode emotion data of an observed person; the multi-mode emotion data of the observed person comprises picture data II and video data II which are acquired through a camera module II;
step Y2, extracting emotion characteristics of the multi-mode emotion data of the observed person:
Carrying out emotion feature extraction on picture data II with expression convolutional neural network I, expression deep convolutional neural network II and a graph neural network respectively, obtaining the macro-expression emotion features, micro-expression emotion features and gesture emotion features of picture data II;
extracting emotion characteristics of the video data II by adopting a three-dimensional convolutional neural network III and a convolutional neural network IV based on peak frame optical flow respectively to obtain macro expression emotion characteristics and micro expression emotion characteristics of the video data II;
Emotion feature extraction by convolutional neural network IV based on peak-frame optical flow means: first, video data II are preprocessed by rotation, cropping and face alignment, and the peak frame is extracted with a peak-frame detection algorithm; then the optical flow vectors u, v and the optical strain ε between the start frame and the peak frame of video data II are calculated; after graying, u, v and ε are taken as the three channels of an RGB image and combined into one RGB image; emotion features are then extracted from this RGB image;
The optical flow vectors u, v and the optical strain ε are calculated using the following set of equations:
I(x, y, t) = I(x + dx, y + dy, t + dt)
I_x·dx + I_y·dy + I_t·dt = 0, i.e. I_x·u + I_y·v + I_t = 0, with u = dx/dt, v = dy/dt
ε = ½[∇p + (∇p)ᵀ], where p = (u, v) is the optical flow field
wherein I(x, y, t) represents the light intensity of a pixel point in the start frame; t represents the time dimension; dx, dy represent the horizontal and vertical displacements from the start frame to the peak frame, respectively; dt represents the time taken to move from the start frame to the peak frame; I_x = ∂I/∂x, I_y = ∂I/∂y, I_t = ∂I/∂t denote the partial derivatives of the pixel gray scale along the x, y and t directions, respectively, and are obtained from the image data;
and Y3, fusing the emotion characteristics obtained in the step Y2, and obtaining a composite emotion recognition result through classification.
6. The emotion recognition method based on head wearable device perception of claim 5, wherein: the method for establishing the individual data storage database is also included: constructing an individual ID in an individual data storage database; before the emotion recognition method of the wearer or the emotion recognition method of the observed person starts, the individual ID of the wearer or the observed person is acquired; after the emotion recognition method of the wearer or the observed person is completed, the emotion recognition result obtained by the emotion recognition method of the wearer or the observed person and the time and place are stored in the corresponding individual ID.
7. A readable storage medium, wherein the storage medium has stored thereon a computer program which, when executed by a processor, causes the processor to perform the emotion recognition method based on head wearable device perception of any of claims 1-6.
8. A computer device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the emotion recognition method of any of claims 1-6 based on head wearable device perception.
CN202410223747.7A 2024-02-29 2024-02-29 Emotion recognition method, medium and device based on head wearable device perception Active CN117809354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410223747.7A CN117809354B (en) 2024-02-29 2024-02-29 Emotion recognition method, medium and device based on head wearable device perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410223747.7A CN117809354B (en) 2024-02-29 2024-02-29 Emotion recognition method, medium and device based on head wearable device perception

Publications (2)

Publication Number Publication Date
CN117809354A CN117809354A (en) 2024-04-02
CN117809354B true CN117809354B (en) 2024-06-21

Family

ID=90422178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410223747.7A Active CN117809354B (en) 2024-02-29 2024-02-29 Emotion recognition method, medium and device based on head wearable device perception

Country Status (1)

Country Link
CN (1) CN117809354B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409222A (en) * 2018-09-20 2019-03-01 中国地质大学(武汉) A kind of multi-angle of view facial expression recognizing method based on mobile terminal

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016195474A1 (en) * 2015-05-29 2016-12-08 Charles Vincent Albert Method for analysing comprehensive state of a subject
US10667697B2 (en) * 2015-06-14 2020-06-02 Facense Ltd. Identification of posture-related syncope using head-mounted sensors
CN110313923B (en) * 2019-07-05 2022-08-16 昆山杜克大学 Autism early-stage screening system based on joint attention ability test and audio-video behavior analysis
CN111652159B (en) * 2020-06-05 2023-04-14 山东大学 Micro-expression recognition method and system based on multi-level feature combination
CN112232191B (en) * 2020-10-15 2023-04-18 南京邮电大学 Depression recognition system based on micro-expression analysis
CN113420591B (en) * 2021-05-13 2023-08-22 华东师范大学 Emotion-based OCC-PAD-OCEAN federal cognitive modeling method
CN113469153B (en) * 2021-09-03 2022-01-11 中国科学院自动化研究所 Multi-modal emotion recognition method based on micro-expressions, limb actions and voice
US20230237844A1 (en) * 2022-01-26 2023-07-27 The Regents Of The University Of Michigan Detecting emotional state of a user based on facial appearance and visual perception information
CN115205923A (en) * 2022-05-19 2022-10-18 重庆邮电大学 Micro-expression recognition method based on macro-expression state migration and mixed attention constraint
CN115169507B (en) * 2022-09-08 2023-05-19 华中科技大学 Brain-like multi-mode emotion recognition network, recognition method and emotion robot

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409222A (en) * 2018-09-20 2019-03-01 中国地质大学(武汉) A kind of multi-angle of view facial expression recognizing method based on mobile terminal

Also Published As

Publication number Publication date
CN117809354A (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN109934176B (en) Pedestrian recognition system, recognition method, and computer-readable storage medium
US20190188903A1 (en) Method and apparatus for providing virtual companion to a user
CN112766173B (en) Multi-mode emotion analysis method and system based on AI deep learning
CN111666845B (en) Small sample deep learning multi-mode sign language recognition method based on key frame sampling
CN113920568B (en) Face and human body posture emotion recognition method based on video image
CN116825365B (en) Mental health analysis method based on multi-angle micro-expression
CN112016367A (en) Emotion recognition system and method and electronic equipment
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
CN116230234A (en) Multi-mode feature consistency psychological health abnormality identification method and system
CN114170537A (en) Multi-mode three-dimensional visual attention prediction method and application thereof
CN111079465A (en) Emotional state comprehensive judgment method based on three-dimensional imaging analysis
CN116092119A (en) Human behavior recognition system based on multidimensional feature fusion and working method thereof
RU2005100267A (en) METHOD AND SYSTEM OF AUTOMATIC VERIFICATION OF THE PRESENCE OF A LIVING FACE OF A HUMAN IN BIOMETRIC SECURITY SYSTEMS
CN113673308A (en) Object identification method, device and electronic system
CN116665281B (en) Key emotion extraction method based on doctor-patient interaction
CN117122324A (en) Practitioner psychological health detection method based on multi-mode emotion data fusion
CN117809354B (en) Emotion recognition method, medium and device based on head wearable device perception
Hou Deep learning-based human emotion detection framework using facial expressions
CN108197593B (en) Multi-size facial expression recognition method and device based on three-point positioning method
Kadhim et al. A face recognition application for Alzheimer’s patients using ESP32-CAM and Raspberry Pi
CN115035438A (en) Emotion analysis method and device and electronic equipment
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
KR20220144983A (en) Emotion recognition system using image and electrocardiogram
CN113255535A (en) Depression identification method based on micro-expression analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant