CN111353354B - Human body stress information identification method and device and electronic equipment

Info

Publication number
CN111353354B
CN111353354B (application CN201811582833.8A)
Authority
CN
China
Prior art keywords
facial
expression
facial image
micro
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811582833.8A
Other languages
Chinese (zh)
Other versions
CN111353354A (en)
Inventor
任亦立
许朝斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811582833.8A
Publication of CN111353354A
Application granted
Publication of CN111353354B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The application provides a method and a device for identifying human body stress information and electronic equipment, wherein the method comprises the following steps: acquiring at least one frame of image containing the same target object; respectively determining a first facial image of a target object in each frame of image, and generating a first facial image sequence; adjacent first facial images in the first sequence of facial images are associated in adjacent timing; processing each first facial image in the first facial image sequence to obtain each second facial image only containing a specified facial organ, and generating a second facial image sequence; adjacent second facial images in the second facial image sequence are associated in adjacent time sequences; inputting the second facial image sequence into a preset stress information identification model to identify the stress information of the target object by the stress information identification model. By using the method provided by the application, the identification of the stress information of the human body can be realized.

Description

Human body stress information identification method and device and electronic equipment
Technical Field
The present invention relates to the field of image processing, and in particular, to a method and an apparatus for identifying stress information of a human body, and an electronic device.
Background
Human body stress refers to unconscious behaviors of a person caused by external stimuli, and may include facial expressions, facial micro-expressions, and the like.
Identifying human stress information is of great significance for analyzing human emotion and psychology, and has great potential application value in fields such as human-computer interaction, intelligent monitoring, and virtual reality; therefore, how to identify human stress information has become a problem to be solved in the industry.
Disclosure of Invention
In view of the above, the present application provides a method, an apparatus and an electronic device for identifying human body stress information, which are used for identifying the human body stress information.
Specifically, the application is realized by the following technical scheme:
according to a first aspect of the present application, the present application provides a method for identifying stress information of a human body, including:
acquiring at least one frame of image containing the same target object;
respectively determining a first facial image of a target object in each frame of image, and generating a first facial image sequence; adjacent first facial images in the first sequence of facial images are associated in adjacent timing;
processing each first facial image in the first facial image sequence to obtain each second facial image only containing a specified facial organ, and generating a second facial image sequence; adjacent second facial images in the second facial image sequence are associated in adjacent time sequences;
Inputting the second facial image sequence into a preset stress information identification model to identify the stress information of the target object by the stress information identification model.
Optionally, the processing each first facial image in the first facial image sequence to obtain each second facial image only including the specified facial organ, and generating a second facial image sequence includes:
inputting the first facial image sequence into a trained face segmentation neural network, identifying all facial organs of each first facial image in the first facial image sequence by the face segmentation neural network to obtain each third facial image with each facial organ marked in a distinguishing way, and outputting a third facial image sequence composed of each third facial image; adjacent third face images in the third face image sequence are associated at adjacent timings;
and respectively performing mask processing on each third facial image in the third facial image sequence to obtain each second facial image only containing the appointed facial organ, and forming a second facial image sequence.
Optionally, the stress information is expression information, and the specified facial organs are all facial organs of the face of the target object;
Performing mask processing on each third face image in the third face image sequence to obtain each second face image only including the appointed face organ, forming a second face image sequence, including:
taking the union of the pixel points where the facial organs marked in all the third facial images are located, to generate an expression mask image;
and performing mask operation on the expression mask image and each third facial image respectively to obtain each second facial image containing all facial organs, and generating a second facial image sequence.
Optionally, the expression information includes confidence degrees of various expressions;
the inputting the second facial image sequence into a preset stress information identification model to identify the stress information of the target object by the stress information identification model includes:
inputting the second facial image sequence into a trained 3D convolutional neural network, assembling the second facial image sequence into a 3D data set by a convolutional layer of the 3D convolutional neural network, performing convolution operation by adopting at least one preset 3D convolution kernel to obtain an expression feature map, and inputting the expression feature map into a pooling layer of the 3D convolutional neural network;
The pooling layer of the 3D convolutional neural network pools the expression feature images, and inputs the pooled expression feature images to the softmax layer of the 3D convolutional neural network;
the softmax layer classifies the expression feature images to obtain the confidence degrees of the target object corresponding to each expression type;
wherein a first dimension of the three dimensions represents a second facial image sequence length, a second dimension represents a second facial image height, and a third dimension represents a second facial image width.
Optionally, the stress information is microexpressive information of N types of microexpressions; the specified facial organ is a facial organ corresponding to N types of micro-expressions; n is an integer greater than zero;
performing mask processing on each third face image in the third face image sequence to obtain each second face image only including the appointed face organ, forming a second face image sequence, including:
selecting the pixel points of the appointed facial organs corresponding to each type of micro-expression from the pixel points of all the facial organs marked in each third facial image aiming at each type of micro-expression, and taking a union set of the pixel points selected from each third facial image to generate a micro-expression mask image of the type of micro-expression;
Performing mask operation on each frame of image in the third facial image sequence by adopting a micro-expression mask image corresponding to N micro-expressions respectively to obtain N second facial image sequences corresponding to the N micro-expressions one by one;
wherein, the second facial image in the second facial image sequence corresponding to each type of micro-expression only contains: the specified facial organ corresponding to that type of micro-expression.
Optionally, the microexpressive information of the N types of microexpressions includes: the microexpressive intensity corresponding to the N types of microexpressions;
the stress information identification model comprises a first 2D convolutional neural network and a second 2D convolutional neural network;
inputting the second face image sequence into a preset stress information identification model to identify the stress information of the target object by the stress information identification model, wherein the method comprises the following steps of:
respectively calculating optical flow diagrams of any two continuous frames of second facial images in the N second facial image sequences in the vertical direction and the horizontal direction, generating N optical flow diagram sequences, inputting the N optical flow diagram sequences into a first 2D convolutional neural network, enabling the first 2D convolutional neural network to conduct micro-expression intensity recognition on the N optical flow diagram sequences, and outputting a first intensity set composed of micro-expression intensities corresponding to N types of micro-expressions;
Processing the appointed second face images of the N second face image sequences to obtain N RGB images, inputting the N RGB images into a second 2D convolutional neural network, carrying out microexpressive intensity recognition on the N RGB images by the second 2D convolutional neural network, and outputting a second intensity set composed of microexpressive intensities corresponding to N types of microexpressions;
and combining the microexpressive intensity in the first intensity set and the microexpressive intensity in the second intensity set to obtain microexpressive intensity corresponding to the final N types of microexpressions.
Optionally, the N types of micro-expressions are divided into three micro-expression groups, namely a first micro-expression group, a second micro-expression group and a third micro-expression group;
combining the microexpressive intensity in the first intensity set and the microexpressive intensity in the second intensity set to obtain microexpressive intensity corresponding to the final N types of microexpressions, wherein the method comprises the following steps:
aiming at various micro-expressions in the first micro-expression group, selecting micro-expression intensities corresponding to the micro-expressions from the first intensity set;
aiming at various micro-expressions in the second micro-expression group, selecting micro-expression intensities corresponding to the micro-expressions from the second intensity set;
For each type of microexpressions in the third microexpressions group, detecting whether the microexpressions intensity of the type of microexpressions in the first intensity set is larger than or equal to a preset threshold value corresponding to the type of microexpressions, if so, selecting the microexpressions intensity corresponding to the type of microexpressions from the second intensity set; if not, determining that the micro expression intensity is 0;
and combining the microexpressive intensities corresponding to the microexpressive expressions in the first microexpressive expression group, the second microexpressive expression group and the third microexpressive expression group to obtain the final microexpressive intensity corresponding to the N types of microexpressive expressions.
According to a second aspect of the present application, there is provided a human body stress information identifying apparatus comprising:
an acquisition unit for acquiring at least one frame of image containing the same target object;
a first generation unit configured to determine first face images of the target object in each frame of images, respectively, and generate a first face image sequence; adjacent first facial images in the first sequence of facial images are associated in adjacent timing;
a second generating unit, configured to process each first facial image in the first facial image sequence to obtain each second facial image only including a specified facial organ, and generate a second facial image sequence; adjacent second facial images in the second facial image sequence are associated in adjacent time sequences;
And the identification unit is used for inputting the second facial image sequence into a preset stress information identification model so as to identify the stress information of the target object by the stress information identification model.
Optionally, the second generating unit is specifically configured to input the first facial image sequence into a trained face segmentation neural network, identify all facial organs of each first facial image in the first facial image sequence by using the face segmentation neural network, obtain each third facial image that is marked by distinguishing each facial organ, and output a third facial image sequence composed of each third facial image; adjacent third face images in the third face image sequence are associated at adjacent timings; and respectively performing mask processing on each third facial image in the third facial image sequence to obtain each second facial image only containing the appointed facial organ, and forming a second facial image sequence.
Optionally, the stress information is expression information, and the specified facial organs are all facial organs of the face of the target object;
the second generating unit is specifically configured to, when performing mask processing on each third facial image in the third facial image sequence to obtain each second facial image containing only the specified facial organ and forming the second facial image sequence: take the union of the pixel points where the facial organs marked in each third facial image are located to generate an expression mask image; and perform a mask operation on the expression mask image with each third facial image respectively, to obtain each second facial image containing all facial organs and generate the second facial image sequence.
Optionally, the expression information includes confidence degrees of various expressions;
the recognition unit is specifically configured to input the second facial image sequence into a trained 3D convolutional neural network, assemble the second facial image sequence into a 3D data set by a convolutional layer of the 3D convolutional neural network, perform convolutional operation by using at least one preset 3D convolutional kernel, obtain an expression feature map, and input the expression feature map into a pooling layer of the 3D convolutional neural network; the pooling layer of the 3D convolutional neural network pools the expression feature images, and inputs the pooled expression feature images to the softmax layer of the 3D convolutional neural network; the softmax layer classifies the expression feature images to obtain the confidence degrees of the target object corresponding to each expression type; wherein a first dimension of the three dimensions represents a second facial image sequence length, a second dimension represents a second facial image height, and a third dimension represents a second facial image width.
Optionally, the stress information is microexpressive information of N types of microexpressions; the specified facial organ is a facial organ corresponding to N types of micro-expressions; n is an integer greater than zero;
The second generating unit is specifically configured to, when obtaining each second facial image containing only the specified facial organ and forming the second facial image sequence: for each type of micro-expression, select, from the pixel points where all the facial organs marked in each third facial image are located, the pixel points where the specified facial organ corresponding to that type of micro-expression is located, and take the union of the pixel points selected from each third facial image to generate the micro-expression mask image of that type of micro-expression; perform a mask operation on each frame image in the third facial image sequence using the micro-expression mask images corresponding to the N types of micro-expressions respectively, to obtain N second facial image sequences in one-to-one correspondence with the N types of micro-expressions; wherein the second facial image in the second facial image sequence corresponding to each type of micro-expression contains only the specified facial organ corresponding to that type of micro-expression.
Optionally, the microexpressive information of the N types of microexpressions includes: the microexpressive intensity corresponding to the N types of microexpressions;
the stress information identification model comprises a first 2D convolutional neural network and a second 2D convolutional neural network;
The recognition unit is specifically configured to respectively calculate optical flow diagrams of any two consecutive second facial images in the N second facial image sequences in a vertical direction and a horizontal direction, generate N optical flow diagram sequences, and input the N optical flow diagram sequences into the first 2D convolutional neural network, so that the first 2D convolutional neural network performs micro-expression intensity recognition on the N optical flow diagram sequences, and output a first intensity set composed of micro-expression intensities corresponding to N types of micro-expressions; processing the appointed second face images of the N second face image sequences to obtain N RGB images, inputting the N RGB images into a second 2D convolutional neural network, carrying out microexpressive intensity recognition on the N RGB images by the second 2D convolutional neural network, and outputting a second intensity set composed of microexpressive intensities corresponding to N types of microexpressions; and combining the microexpressive intensity in the first intensity set and the microexpressive intensity in the second intensity set to obtain microexpressive intensity corresponding to the final N types of microexpressions.
Optionally, the N types of micro-expressions are divided into three micro-expression groups, namely a first micro-expression group, a second micro-expression group and a third micro-expression group;
The identification unit is specifically configured to select, for each type of microexpressions in the first microexpressions group, microexpressions intensity corresponding to the type of microexpressions from the first intensity set when combining the microexpressions intensity in the first intensity set and the microexpressions intensity in the second intensity set to obtain microexpressions intensity corresponding to the final N types of microexpressions; aiming at various micro-expressions in the second micro-expression group, selecting micro-expression intensities corresponding to the micro-expressions from the second intensity set; for each type of microexpressions in the third microexpressions group, detecting whether the microexpressions intensity of the type of microexpressions in the first intensity set is larger than or equal to a preset threshold value corresponding to the type of microexpressions, if so, selecting the microexpressions intensity corresponding to the type of microexpressions from the second intensity set; if not, determining that the micro expression intensity is 0; and combining the microexpressive intensities corresponding to the microexpressive expressions in the first microexpressive expression group, the second microexpressive expression group and the third microexpressive expression group to obtain the final microexpressive intensity corresponding to the N types of microexpressive expressions.
According to a third aspect of the present application there is provided an electronic device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the processor being caused by the machine-executable instructions to perform the method of the first aspect.
According to a fourth aspect of the present application there is provided a machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to perform the method of the first aspect.
As can be seen from the above description, on one hand, the human body stress information identification method provided by the application can realize the identification of the human body stress information.
On the other hand, since mask processing is performed on the first facial images, the background in the facial images is removed and only the key facial organs that are conducive to identifying human stress information are retained in the second facial image sequence; the stress model then performs recognition on this second facial image sequence, so the recognition of expressions or micro-expressions is not affected by the background in the facial images and is therefore more accurate.
Drawings
FIG. 1 is a flow chart of a method for identifying human stress information according to an exemplary embodiment of the present application;
FIG. 2a is a schematic illustration of a first facial image shown in accordance with an exemplary embodiment of the present application;
FIG. 2b is a schematic illustration of a third facial image shown in accordance with an exemplary embodiment of the present application;
FIG. 2c is a schematic diagram of a third facial image masked with an expressive mask image according to an example embodiment of the present application;
FIG. 2d is a schematic illustration of masking a third facial image using an eyebrow mask map according to an exemplary embodiment of the present application;
FIG. 2e is a schematic illustration of a third face image masked with an eye mask image according to an exemplary embodiment of the present application;
FIG. 2f is a schematic illustration of a third face image masked with a mouth mask image according to an exemplary embodiment of the present application;
fig. 3 is a block diagram of a human body stress information identifying apparatus according to an exemplary embodiment of the present application;
fig. 4 is a hardware configuration diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
Referring to fig. 1, fig. 1 is a flowchart of a method for identifying human body stress information, which is applicable to an electronic device and includes the following steps.
Step 101: the electronic device may acquire at least one frame of image containing the same target object.
When the electronic device is a server, the electronic device can receive at least one frame of image containing the same target object, which is acquired by the front-end image acquisition device.
When the electronic device is a front-end image acquisition device, such as a camera, the electronic device can acquire at least one frame of image containing the same target object.
The electronic device is described here by way of example only and is not particularly limited.
Step 102: the electronic equipment respectively determines a first facial image of a target object in each frame of image, and generates a first facial image sequence; adjacent first facial images in the first sequence of facial images are associated at adjacent timings.
When implemented, the electronic device may sequentially input each frame of images to the face recognition model. For each frame of image, the face recognition model may identify a face region of a target object in the frame of image, and then may extract the face region from the frame of image to form a face region image. The face recognition model may output face region images corresponding to the respective frame images.
The electronic device may then pre-process each facial region image to form each first facial image, constituting a first sequence of facial images. Adjacent first facial images in the first sequence of facial images are associated at adjacent timings.
For example, the electronic apparatus receives 3 frames of images including the target object 1, which are respectively a 1 st frame image (denoted as image 1), a 2 nd frame image (denoted as image 2), and a 3 rd frame image (denoted as image 3).
The electronic device may input image 1, image 2, and image 3 into a face recognition model, which may identify a face region of a target object in image 1, and extract the identified face region from image 1 to form a face region image 1 corresponding to image 1. Similarly, the face recognition model can also obtain a face region image 2 and a face region image 3. The face recognition model may output a face region image 1, a face region image 2, and a face region image 3.
The electronic device may pre-process the face region image 1, the face region image 2, and the face region image 3, respectively, to obtain a first facial image 1, a first facial image 2, and a first facial image 3 corresponding to the face region image 1, the face region image 2, and the face region image 3, respectively. The electronic device may generate the first facial image sequence in frame order from the first facial image 1, the first facial image 2, and the first facial image 3. The generated first facial image sequence is {first facial image 1, first facial image 2, first facial image 3}.
The first facial image 1 and the first facial image 2 have an adjacent time-sequence relationship, and the first facial image 2 and the first facial image 3 have an adjacent time-sequence relationship.
The preprocessing of the facial region image may be performed by a processing method known to those skilled in the art, for example, correction processing and alignment processing of the facial region image. Pretreatment is described here by way of example only, and is not particularly limited.
In addition, the face recognition model may be an FRCNN (Fast Region-based Convolutional Neural Network) network or a YOLO (You Only Look Once) network; these are only examples of the face recognition model and are not specifically limited.
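For illustration, the following is a minimal Python sketch of steps 101 and 102, assuming OpenCV's Haar cascade detector as a stand-in for the FRCNN/YOLO face recognition model mentioned above and a simple resize as the preprocessing step; the function name and the 128x128 target size are illustrative assumptions, not taken from this disclosure.

```python
import cv2

# Stand-in face detector; the patent's face recognition model could be FRCNN or YOLO instead.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def build_first_facial_image_sequence(frames, size=(128, 128)):
    """frames: BGR images of the same target object, in time order (step 101)."""
    sequence = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue                              # no face found in this frame
        x, y, w, h = faces[0]                     # face region of the target object
        face_region = frame[y:y + h, x:x + w]     # extract the face region image
        # "Preprocessing" is reduced here to resizing to a fixed resolution.
        sequence.append(cv2.resize(face_region, size))
    return sequence  # adjacent images keep their adjacent timing relationship
```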
Step 103: the electronic equipment respectively processes each first facial image in the first facial image sequence to obtain each second facial image only containing a designated facial organ, and generates a second facial image sequence; adjacent second facial images in the second sequence of facial images are associated at adjacent timings.
It should be noted that, because mask processing is performed on the first facial images, the background in the facial images is removed and only the key facial organs that are conducive to expression and micro-expression recognition are retained in the second facial image sequence; the stress model then performs recognition on this second facial image sequence, so expression or micro-expression recognition is not affected by the background in the facial images and is therefore more accurate.
Step 103 will be specifically described below by two steps 1031 and 1032.
Step 1031: the electronic equipment can input the first facial image sequence into a trained face segmentation neural network so as to identify all facial organs of each first facial image in the first facial image sequence by the face segmentation neural network, obtain each third facial image which corresponds to each first facial image and is marked with all facial organs in a distinguishing way, and output a third facial image sequence consisting of each third facial image; adjacent third face images in the third face image sequence are associated at adjacent timings.
For example, the example in step 102 is still taken as an example.
The first facial image sequence includes: a first facial image 1, a first facial image 2, and a first facial image 3.
The electronic device may input the first sequence of facial images to a Face segmentation (Face segmentation) neural network.
Taking the processing of the first facial image 1 as an example, see fig. 2a, it is assumed that the first facial image 1 is as shown in fig. 2 a. The face segmentation neural network can identify all facial organs of the first facial image 1, and obtain a third facial image 1 corresponding to the first facial image 1 and labeled with all facial organs differently, and the third facial image 1 is shown in fig. 2 b.
It should be noted that the function of the face segmentation network is to distinguish and mark different facial organs. For example, as shown in fig. 2b, after the first facial image 1 is processed by the face segmentation neural network, a third facial image 1 is obtained, and the pixel values of different facial organs in the third facial image 1 are different. And distinguishing and marking each facial organ through the value of the pixel point.
For example, in the third face image 1, the values of the pixels at the eyebrows are 1, the values of the pixels at the mouth are 2, the values of the pixels at the nose are 3, and the values of the pixels at the eyes are 4. All facial organs such as eyebrows, mouth, nose, eyes and the like are marked by different pixel values.
Similarly, the face segmentation neural network can also obtain a third facial image 2 corresponding to the first facial image 2 and a third facial image 3 corresponding to the first facial image 3.
The face segmentation neural network may then compose the third facial image 1, the third facial image 2, and the third facial image 3 into a third facial image sequence in frame order, and output the third facial image sequence. The third facial image sequence is {third facial image 1, third facial image 2, third facial image 3}.
The third face image 1 and the third face image 2 have a time sequence relationship adjacent to each other, and the third face image 2 and the third face image 3 have a time sequence relationship.
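As a hedged illustration of how such a label map can be used downstream, the sketch below derives binary masks for individual facial organs from a third facial image, assuming the pixel codes from the example above (1 = eyebrows, 2 = mouth, 3 = nose, 4 = eyes); `organ_mask` is a hypothetical helper name, not part of the original disclosure.

```python
import numpy as np

# Organ codes follow the example given for the third facial image 1.
ORGAN_CODES = {"eyebrows": 1, "mouth": 2, "nose": 3, "eyes": 4}

def organ_mask(label_map: np.ndarray, organs) -> np.ndarray:
    """Binary mask: 1 at pixels belonging to the listed organs, 0 elsewhere."""
    codes = [ORGAN_CODES[name] for name in organs]
    return np.isin(label_map, codes).astype(np.uint8)
```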
Step 1032: the electronic device performs mask processing on each third face image in the third face image sequence to obtain each second face image which corresponds to each third face image and only contains a specified face organ, and forms a second face image sequence.
1) The stress information is expression information
When the stress information is expression information, the specified facial organ is all organs of the face of the target object, and the mask used in performing the masking process is an expression mask.
When the method is implemented, the electronic equipment obtains a union set of pixel points where the facial organs marked in all the third facial images are located, and an expression mask image is generated; the expression mask map includes all facial organs, for example, the values of the pixels corresponding to all facial organs on the expression mask map are 1, and the values of the pixels of other parts except all facial organs are 0.
In implementing step 1032, the electronic device may perform mask operation with each third facial image in the third facial sequence by using the preset expression mask map, so as to obtain each second facial image corresponding to each third facial image and including all facial organs, and generate the second facial image sequence.
For example, still taking the example of step 1031 as an example, assume that the third facial image sequence is {third facial image 1, third facial image 2, third facial image 3}.
The electronic device may generate an expression mask map by merging the pixel points where all the facial organs are marked in the third facial image 1, the pixel points where all the facial organs are marked in the third facial image 2, and the pixel points where all the facial organs are marked in the third facial image 3. And the pixel point value of the union set in the generated expression mask image is 1, and the other pixel points are 0.
And then the electronic equipment can adopt the expression mask images to respectively carry out mask operation with each third facial image to obtain each second facial image containing all facial organs, and a second facial image sequence is generated.
Taking the third face image 1 as an example, the third face image 1 is shown in fig. 2 b.
The electronic device may perform mask operation on the third face image 1 by using a preset expression mask map, so as to obtain a second face image 1. The second face image 1 is obtained as shown in fig. 2 c.
Similarly, the electronic device can also obtain the second face image 2 corresponding to the third face image 2 and the second face image 3 corresponding to the third face image 3.
Then, the electronic device may compose the second face image 1, the second face image 2, and the second face image 3 into a second face image sequence in frame order, and output the second face image sequence. The second face image sequence is { second face image 1, second face image 2, second face image 3}.
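The expression-mask branch described above can be sketched as follows, reusing the hypothetical `organ_mask` helper from the earlier sketch; the union over frames and the elementwise mask operation follow the text, while the array shapes and helper names are assumptions.

```python
import numpy as np

def expression_mask(label_maps):
    """Union of all labelled organ pixels over every third facial image."""
    mask = np.zeros_like(label_maps[0], dtype=np.uint8)
    for label_map in label_maps:
        mask |= organ_mask(label_map, ["eyebrows", "mouth", "nose", "eyes"])
    return mask  # 1 at any pixel that belongs to a facial organ in any frame

def apply_mask(images, mask):
    """Mask operation: keep organ pixels, zero out the background."""
    return [img * (mask[..., None] if img.ndim == 3 else mask) for img in images]

# second_facial_image_sequence = apply_mask(third_facial_images, expression_mask(label_maps))
```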
2) The stress information is microexpressive information of N types of microexpressions
When the stress information is microexpressive information, the specified facial organ is an organ corresponding to N types of microexpressions, and the mask used in the masking process is a microexpressive mask corresponding to N types of microexpressions.
Wherein, the value of N is an integer greater than 0.
Typically, a micro-expression category may be represented by an AU (Action Unit). Paul Ekman proposed the Facial Action Coding System (FACS), which divides facial motion into 64 AUs, AU1 to AU64. Each AU represents a class of micro-expressions.
It should be further noted that different types of micro-expressions may correspond to the same mask image or may correspond to different mask images.
For example, the N-type microexpressions are 8-type microexpressions, which are AU1, AU4, AU5, AU12, AU15, AU25, AU32, and AU45, respectively.
For example, AU1 and AU4 are micro expressions related to the eyebrows, the designated organ corresponding to AU1 and AU4 is the eyebrow, and AU1 and AU4 may correspond to the same mask, where the mask includes the eyebrows, that is, the value of the pixel point of the eyebrow in the mask is 1, and the value of the other pixel points except for the eyebrow is 0.
AU5 and AU45 are micro expressions related to eyes, and then the designated organs corresponding to AU5 and AU45 are eyes, and AU5 and AU45 can correspond to the same mask, where the mask includes eyes, that is, the value of the pixel point of the eyes in the mask is 1, and the value of the other pixel points except the eyes is 0.
AU12, AU15, AU25, and AU32 are micro-expressions related to the mouth, so the specified organ corresponding to AU12, AU15, AU25, and AU32 is the mouth; AU12, AU15, AU25, and AU32 can correspond to the same mask, which contains the mouth, that is, the value of the pixel points of the mouth in the mask is 1, and the values of the other pixel points except the mouth are 0.
When implementing step 1032, the electronic device may, for each type of micro-expression, select, from the pixel points where all the facial organs marked in each third facial image are located, the pixel points where the specified facial organ corresponding to that type of micro-expression is located, and take the union of the pixel points selected from each third facial image to generate the micro-expression mask image of that type of micro-expression. Based on this method, the electronic device can generate N micro-expression mask images corresponding to the N types of micro-expressions.
Then, the electronic equipment can adopt mask images corresponding to preset N types of micro expressions respectively to perform mask operation on each frame of image in the third facial image sequence, so as to obtain N second facial image sequences corresponding to the N types of micro expressions one by one;
the second facial image in each second facial image sequence only comprises a designated facial organ corresponding to the micro expression of the second facial image sequence; the mask map corresponding to each type of micro-expression comprises a designated facial organ corresponding to the type of micro-expression.
For example, still taking the example of step 1031 as an example, assume that the third facial image sequence is {third facial image 1, third facial image 2, third facial image 3}.
Let N kinds of micro-expressions be 8 kinds of micro-expressions, which are AU1, AU4, AU5, AU12, AU15, AU25, AU32, AU45, respectively.
The designated facial organs corresponding to AU1 and AU4 are eyebrows, the designated facial organs corresponding to AU5 and AU45 are eyes, and the designated facial organs corresponding to AU12, AU15, AU25 and AU32 are mouth.
Taking the micro-expression mask diagram corresponding to the generated AU1 as an example for explanation.
The electronic device may select the pixel point where the eyebrow is located among the pixel points where all the facial organs marked in the third facial image 1 are located. Similarly, the pixel point where the eyebrow is located is selected in each of the third face image 2 and the third face image 3.
Then, the electronic device can generate an eyebrow mask map corresponding to AU1 by merging the pixels where the eyebrows selected from the third face image 1, the third face image 2, and the third face image 3 are located.
Similarly, the electronic device may also generate an eyebrow mask map corresponding to AU4, an eye mask map corresponding to AU5 and AU45, and a mouth mask map corresponding to AU12, AU15, AU25, and AU32, respectively, based on the method.
Then, the electronic device performs a mask operation on each frame image in the third facial image sequence using the micro-expression mask images corresponding to the 8 types of micro-expressions respectively, to obtain 8 second facial image sequences in one-to-one correspondence with the 8 types of micro-expressions.
taking AU1 as an example, the electronic device may perform a masking operation on the third face image 1 (as shown in fig. 2 b) using an eyebrow mask map, to obtain a second face image 1 (as shown in fig. 2 d) corresponding to the third face image 1 and including only eyebrows. Similarly, the electronic apparatus can also obtain the second face image 12 corresponding to the third face image 2 and including only the eyebrows, and the second face image 13 corresponding to the third face image 13 and including only the eyebrows. The electronic device may compose the second face image 11, the second face image 12, and the second face image 13 into a second face image sequence for AU 1. Adjacent second facial images in the second sequence of facial images are associated in adjacent timing.
Similarly, the electronic device may further obtain a second facial image sequence corresponding to AU4, where the facial image in the second facial image sequence corresponding to AU4 includes only the eyebrows.
Taking AU5 as an example, the electronic device may perform a masking operation on the third face image 1 (as shown in fig. 2 b) using the eye mask map, to obtain a second face image 21 (as shown in fig. 2 e) corresponding to the third face image 1 and including only eyes. Similarly, the electronic apparatus can also obtain the second face image 22 corresponding to the third face image 2 and containing only eyes, and the second face image 23 corresponding to the third face image 3 and containing only eyes. The electronic device may compose the second face image 21, the second face image 22, and the second face image 23 into a second face image sequence for AU 5. Adjacent second facial images in the second sequence of facial images are associated in adjacent timing.
Similarly, AU45 corresponds to a second sequence of facial images, the second facial image in this second sequence of facial images comprising only eyes.
Taking AU12 as an example, the electronic device may perform a masking operation on the third facial image 1 (as shown in fig. 2b) using the mouth mask map, to obtain a second facial image 31 (as shown in fig. 2f) corresponding to the third facial image 1 and containing only the mouth. Similarly, the electronic device can also obtain the second facial image 32 corresponding to the third facial image 2 and containing only the mouth, and the second facial image 33 corresponding to the third facial image 3 and containing only the mouth. The electronic device may compose the second facial image 31, the second facial image 32, and the second facial image 33 into a second facial image sequence for AU12. Adjacent second facial images in the second facial image sequence are associated in adjacent timing.
Similarly, the electronic device may further obtain 3 second facial image sequences corresponding to AU15, AU25, and AU32, respectively; the second facial images in these 4 second facial image sequences (for AU12, AU15, AU25, and AU32) contain only the mouth.
Finally, the electronic device outputs 8 second face image sequences corresponding to AU1, AU4, AU5, AU12, AU15, AU25, AU32, AU45, respectively.
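As a sketch of the micro-expression branch just described, the snippet below builds one masked second facial image sequence per AU, using the AU-to-organ grouping from the example (AU1/AU4 → eyebrows, AU5/AU45 → eyes, AU12/AU15/AU25/AU32 → mouth) and the hypothetical `organ_mask` and `apply_mask` helpers sketched earlier.

```python
import numpy as np

AU_TO_ORGANS = {
    "AU1": ["eyebrows"], "AU4": ["eyebrows"],
    "AU5": ["eyes"], "AU45": ["eyes"],
    "AU12": ["mouth"], "AU15": ["mouth"], "AU25": ["mouth"], "AU32": ["mouth"],
}

def micro_expression_sequences(label_maps, face_images):
    """Return one masked second facial image sequence per AU (N = 8 here)."""
    sequences = {}
    for au, organs in AU_TO_ORGANS.items():
        mask = np.zeros_like(label_maps[0], dtype=np.uint8)
        for label_map in label_maps:            # union over all third facial images
            mask |= organ_mask(label_map, organs)
        sequences[au] = apply_mask(face_images, mask)
    return sequences
```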
Step 104: the electronic equipment inputs the second facial image sequence into a preset stress information identification model so as to identify the stress information of the target object by the stress information identification model.
1) The stress information is expression information
The expression information may include confidence degrees of various expressions, and may include other information related to the expressions, which is only illustrated by way of example and not limited in detail herein.
When the stress information is expression information, the stress recognition model is a 3D convolutional neural network.
Wherein the 3D convolutional neural network comprises a convolutional layer, a pooling layer, and a softmax layer.
In implementing step 104, the electronic device may input the second sequence of facial images obtained in step 103 into a 3D convolutional neural network.
And the convolution layer of the 3D convolution neural network assembles the second facial image sequence into a 3-dimensional data set, carries out convolution operation by adopting at least one preset 3-dimensional convolution kernel, obtains an expression characteristic diagram and inputs the expression characteristic diagram into the pooling layer of the 3D convolution neural network. And the pooling layer of the 3D convolutional neural network pools the expression feature map, and inputs the pooled expression feature map into the softmax layer of the 3D convolutional neural network. And classifying the expression feature images by a softmax layer of the 3D convolutional neural network to obtain the confidence degrees of the target object corresponding to each expression type.
For example, the expression types include 8 categories, namely disgust, anger, fear, sadness, happiness, surprise, contempt, and neutral (no expression).
After the electronic device inputs the second facial image sequence into the 3D convolutional neural network, the 3D convolutional neural network can output the confidence that the expression is disgust, the confidence that the expression is anger, the confidence that the expression is fear, the confidence that the expression is sadness, the confidence that the expression is happiness, the confidence that the expression is surprise, the confidence that the expression is contempt, and the confidence that the expression is neutral.
The electronic device can then present the 8 expressions and their confidence levels as a pie chart. Of course, the electronic device may also use other display modes, such as a table; this is only an exemplary illustration of how the electronic device displays the 8 expressions and their confidences, and is not specifically limited here.
When the expression recognition is carried out on the second facial image sequence by the 3D convolutional neural network, the spatial relationship of the second facial images of each frame in the second facial image sequence is considered, and the time sequence relationship of the second facial images of each frame is also considered, so that the accuracy of the expression recognition is greatly improved by adopting the 3D convolutional neural network.
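A toy PyTorch sketch of the 3D convolutional stress information identification model for expressions is given below; the layer sizes, the single input channel, and the 128x128 resolution are illustrative assumptions, while the (sequence length, height, width) data layout and the softmax over the 8 expression classes follow the description above.

```python
import torch
import torch.nn as nn

class Expression3DCNN(nn.Module):
    def __init__(self, num_classes=8, in_channels=1):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, 16, kernel_size=3, padding=1)  # 3D convolution kernel
        self.pool = nn.AdaptiveAvgPool3d(1)                               # pooling layer
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):
        # x: (batch, channels, sequence_length, height, width)
        feat = torch.relu(self.conv(x))              # expression feature maps
        feat = self.pool(feat).flatten(1)
        return torch.softmax(self.fc(feat), dim=1)   # confidence per expression type

# Example: a sequence of 3 grayscale 128x128 second facial images.
seq = torch.randn(1, 1, 3, 128, 128)
confidences = Expression3DCNN()(seq)                 # shape (1, 8)
```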
2) The stress information is microexpressive information of N types of microexpressions
It should be noted that, the microexpressive information of the N types of microexpressions includes microexpressive intensity of the N types of microexpressions, and of course, the microexpressive information may also include other information related to the microexpressions, where the microexpressive information is illustrated by way of example and not limited specifically.
Wherein, each type of micro-expression intensity refers to the degree of change of the type of micro-expression relative to the state of no expression. For example, if the microexpressive type is AU1, AU1 indicates that the inner side of the eyebrow is pulled up, the strength of AU1 is the degree of the inner side of the eyebrow being pulled up.
When the stress information is microexpressive information of N types of microexpressions, the stress identification model comprises a first 2D convolutional neural network and a second 2D convolutional neural network.
The following describes the microexpressive intensity for identifying N types of microexpressions in detail through steps 1041 to 1043.
Step 1041: the electronic equipment can respectively calculate optical flow diagrams of any two continuous frames of second facial images in the N second facial image sequences in the vertical direction and the horizontal direction, generate N optical flow diagram sequences, input the N optical flow diagram sequences into the first 2D convolutional neural network, enable the first 2D convolutional neural network to conduct micro-expression intensity recognition on the N optical flow diagram sequences, and output a first intensity set composed of micro-expression intensities corresponding to N types of micro-expressions.
When calculating the optical flow sequence corresponding to one second facial image sequence, two optical flow maps, one in the horizontal direction and one in the vertical direction, need to be calculated for each pair of consecutive second facial images in the second facial image sequence.
Therefore, when the second face image sequence includes M frames of second face images, the optical flow sequence corresponding to the second face image sequence includes 2 x (M-1) frames of optical flow images, and the optical flow sequence has a length of (M-1).
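A minimal sketch of this optical flow step is shown below, assuming OpenCV's Farneback dense flow as a concrete algorithm choice (this disclosure does not name a specific optical flow algorithm) and grayscale second facial images as input.

```python
import cv2
import numpy as np

def optical_flow_sequence(gray_frames):
    """2*(M-1) optical flow maps for an M-frame second facial image sequence."""
    flow_maps = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flow_maps.append(flow[..., 0])   # horizontal component
        flow_maps.append(flow[..., 1])   # vertical component
    return np.stack(flow_maps)           # shape: (2*(M-1), H, W)
```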
The first 2D convolutional neural network is trained by taking N optical flow sequences and the calibrated microexpressive intensity values corresponding to the N optical flow sequences one by one as samples, and the training process can be trained by using a process of training the neural network, which is well known to those skilled in the art, and will not be described herein.
The preset data format of the first 2D convolutional neural network is [N, C, H, W], where N is a preset value, usually 1; C is the number of optical flow maps in the optical flow sequence, i.e., 2*(M-1); H is the height of the optical flow maps; and W is the width of the optical flow maps.
The first 2D convolutional neural network includes a convolutional layer, a pooling layer, and a Euclidean Loss layer.
The convolution layer of the first 2D convolutional neural network may use the preset data format [N, C, H, W] to compose the input N optical flow sequences into a data set, then perform a convolution operation on the data set using a 2D convolution kernel, and send the data set after the convolution operation to the pooling layer. The pooling layer pools the data set after the convolution operation and sends it to the Euclidean Loss layer, and the Euclidean Loss layer performs a regression calculation to obtain the micro-expression intensities corresponding to the N types of micro-expressions. The micro-expression intensities corresponding to the N types of micro-expressions obtained through the first 2D convolutional neural network form the first intensity set.
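The following PyTorch sketch illustrates one possible shape of the first 2D convolutional neural network: the 2*(M-1) optical flow maps of a sequence are stacked as input channels in the [N, C, H, W] format, and a regression head (trained with an MSE/Euclidean loss) outputs one micro-expression intensity per sequence. The layer sizes are assumptions, not the architecture of this disclosure.

```python
import torch
import torch.nn as nn

class FlowIntensity2DCNN(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)  # 2D convolution kernel
        self.pool = nn.AdaptiveAvgPool2d(1)                               # pooling layer
        self.fc = nn.Linear(16, 1)        # regression head (Euclidean/MSE loss during training)

    def forward(self, x):                 # x: (N, C = 2*(M-1), H, W)
        feat = torch.relu(self.conv(x))
        return self.fc(self.pool(feat).flatten(1))  # one micro-expression intensity

# Training would minimise nn.MSELoss() between predicted and calibrated intensities,
# corresponding to the Euclidean Loss layer described in the text.
```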
For example, let 8 second face image sequences be provided, which are the second face image sequence 1 to the second face image sequence 8, respectively.
It is assumed that 3 frames of second facial images are included in each second facial image sequence.
Taking the second facial image sequence 1 as an example, assume that the second facial image sequence 1 is {second facial image 1, second facial image 2, second facial image 3}.
The electronic device can calculate an optical flow map 11 in the horizontal direction and an optical flow map 12 in the vertical direction based on the second facial image 1 and the second facial image 2. Based on the second facial image 2 and the second facial image 3, an optical flow map 21 in the horizontal direction and an optical flow map 22 in the vertical direction are calculated; 4 optical flow maps are thus obtained, forming the optical flow sequence 1 corresponding to the second facial image sequence 1.
Similarly, the electronic device may calculate the optical flow sequences 2 to 8 corresponding to the second face image sequences 2 to 8, respectively.
The electronic device may input optical flow sequence 1 through optical flow sequence 8 into the first 2D convolutional neural network.
The convolution layer of the first 2D convolutional neural network may use the preset data format [N, C, H, W] to compose the optical flow sequence 1 to the optical flow sequence 8 into a data set, then perform a convolution operation on the data set using a 2D convolution kernel, and send the data set after the convolution operation to the pooling layer. The pooling layer pools the data set after the convolution operation and sends it to the Euclidean Loss layer, and the Euclidean Loss layer performs a regression calculation to obtain the micro-expression intensities corresponding to the 8 types of micro-expressions. The micro-expression intensities corresponding to the 8 types of micro-expressions obtained through the first 2D convolutional neural network form the first intensity set.
Step 1042: the electronic device processes a designated second facial image from each of the N second facial image sequences to obtain N RGB images, inputs the N RGB images into the second 2D convolutional neural network, the second 2D convolutional neural network performs micro-expression intensity recognition on the N RGB images, and outputs a second intensity set composed of the micro-expression intensities corresponding to the N types of micro-expressions.
The second 2D convolutional neural network is trained by using the N RGB images and the calibrated microexpressive intensity values corresponding to the N RGB images as samples, and the training process may be performed by using a process of training the neural network, which is well known to those skilled in the art, and will not be described herein.
The preset data format of the second 2D convolutional neural network is [N, C, H, W], where N is a preset value, typically 1. Since the second 2D convolutional neural network processes RGB images, C is 3; H is the height of the RGB image and W is the width of the RGB image.
The second 2D convolutional neural network includes a convolutional layer, a pooling layer, and a Euclidean Loss layer.
When the method is implemented, the electronic device can respectively process the appointed second face images of the N second face image sequences to obtain N RGB images. The resulting N RGB images are then input to a second 2D convolutional neural network.
The convolution layer of the second 2D convolution neural network may use the preset data format [ N, C, H, W ] to compose the input N RGB maps into a data set, then use the 2D convolution kernel to perform convolution operation on the data set, and send the data set after the convolution operation to the pooling layer. The pooling layer pools the data set after convolution operation and sends the data set to the Euclidean Loss layer, and the Euclidean Loss layer carries out regression calculation to obtain the micro expression intensity corresponding to the N micro expressions. The microexpressive intensities corresponding to the N types of microexpressions obtained through the second 2D convolutional neural network can form a second intensity set.
The designated second facial image may be the last frame of the second facial image sequence or the second-to-last frame; the designated second facial image is only described here by way of example and is not specifically limited.
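The second branch can be sketched in the same way: the designated second facial image (the last frame is assumed here, matching the example below) is converted to a 3-channel tensor and passed through a 2D CNN with the same structure as the earlier sketch but with in_channels=3; `rgb_intensity` is a hypothetical helper name.

```python
import torch

def rgb_intensity(second_facial_image_sequence, model):
    """Intensity from the RGB branch for one second facial image sequence."""
    designated = second_facial_image_sequence[-1]        # last frame (an assumed choice)
    rgb = torch.as_tensor(designated).permute(2, 0, 1)   # H x W x 3 -> 3 x H x W
    return model(rgb.unsqueeze(0).float())               # [1, 3, H, W] -> one intensity value

# model = FlowIntensity2DCNN(in_channels=3)  # reusing the network shape sketched above
```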
For example, let 8 second face image sequences be provided, which are the second face image sequence 1 to the second face image sequence 8, respectively. Assume that the second facial image is designated as the last frame of the second facial image in the second facial image sequence.
The electronic device may obtain 8 second face images by respectively taking the second face image of the last frame from the 8 second face image sequences.
Then, the electronic device may perform RGB processing on the 8 second face images, to obtain 8 RGB images. The electronic device may input 8 RGB images to the second 2D convolutional neural network.
The convolution layer of the second 2D convolutional neural network can adopt the preset data format [ N, C, H, W ] to compose the input 8 RGB images into a data set, then use a 2D convolution kernel to perform a convolution operation on the data set, and send the convolved data set to the pooling layer. The pooling layer pools the convolved data set and sends it to the Euclidean Loss layer, and the Euclidean Loss layer performs regression calculation to obtain the micro-expression intensities corresponding to the 8 types of micro-expressions. The micro-expression intensities corresponding to the 8 types of micro-expressions obtained through the second 2D convolutional neural network can form a second intensity set.
Step 1043: the electronic equipment can combine the micro-expression intensity in the first intensity set and the micro-expression intensity in the second intensity set to obtain micro-expression intensities corresponding to the final N types of micro-expressions.
The N micro-expressions are divided into 3 micro-expression groups in advance, namely a first micro-expression group, a second micro-expression group and a third micro-expression group.
Specifically, for each type of micro-expression in the first micro-expression group, the electronic device may select a micro-expression intensity corresponding to the type of micro-expression from the first intensity set.
Aiming at various micro-expressions in the second micro-expression group, the electronic equipment can select micro-expression intensities corresponding to the micro-expressions from the second intensity set;
aiming at each type of micro-expression in the third micro-expression group, the electronic device can detect whether the micro-expression intensity of that type of micro-expression in the first intensity set is greater than or equal to a preset threshold corresponding to that type of micro-expression; if so, the micro-expression intensity corresponding to that type of micro-expression is selected from the second intensity set; if not, the micro-expression intensity is determined to be 0. The preset thresholds corresponding to the various micro-expressions in the third micro-expression group can be the same or different.
The electronic device can combine the micro-expression intensities corresponding to the micro-expressions in the first micro-expression group, the second micro-expression group and the third micro-expression group to obtain the final micro-expression intensities corresponding to the N types of micro-expressions.
For example, 8 kinds of micro-expressions are assumed, namely AU1, AU4, AU5, AU12, AU15, AU25, AU32 and AU45. These 8 kinds of micro-expressions are divided into 3 micro-expression groups in advance, which are a first micro-expression group, a second micro-expression group, and a third micro-expression group, respectively.
The micro-expression types in the first micro-expression group include: AU45;
the micro-expression types in the second micro-expression group include: AU1, AU12, AU15, AU32, AU25;
the micro-expression types in the third micro-expression group include: AU4 and AU5.
For the first microexpressive group, the electronic device may select the microexpressive intensity value of AU45 from the first intensity set as the final microexpressive intensity value of AU45.
For the second microexpressive group, the electronic device may select microexpressive intensity values corresponding to AU1, AU12, AU15, AU32, and AU25 from the second intensity set, and the microexpressive intensity values are the final microexpressive intensity values of AU1, AU12, AU15, AU32, and AU 25.
For the third micro-expression group, take AU4 as an example. The electronic device may detect whether the micro-expression intensity corresponding to AU4 in the first intensity set is greater than or equal to the preset threshold corresponding to AU4; if it is, the micro-expression intensity corresponding to AU4 is selected from the second intensity set as the final micro-expression intensity of AU4. If the micro-expression intensity corresponding to AU4 is smaller than the preset threshold, the micro-expression intensity corresponding to AU4 is determined to be 0, and 0 is taken as the final micro-expression intensity of AU4.
Similarly, the electronic device can obtain the final micro-expression intensity of AU5 according to the manner of obtaining the final micro-expression intensity of AU 4.
The electronic equipment can combine the final micro-expression intensities of all the obtained AU to obtain the final micro-expression intensities of 8 micro-expressions of AU1, AU4, AU5, AU12, AU15, AU25, AU32 and AU 45.
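The following Python sketch restates the combination rule of step 1043 for this example. The group membership follows the text above; the thresholds for AU4 and AU5 (0.3 and 0.7) follow the numeric example given later, and all function names are illustrative assumptions.

```python
# A minimal sketch of the step 1043 combination rule, using Python dicts.
FIRST_GROUP = {"AU45"}
SECOND_GROUP = {"AU1", "AU12", "AU15", "AU32", "AU25"}
THIRD_GROUP_THRESHOLDS = {"AU4": 0.3, "AU5": 0.7}   # per-AU preset thresholds (example values)

def merge_intensities(first_set, second_set):
    """first_set / second_set: dicts mapping AU name -> intensity from the two 2D networks."""
    final = {}
    for au in FIRST_GROUP:
        final[au] = first_set[au]                    # group 1: take the first network's value
    for au in SECOND_GROUP:
        final[au] = second_set[au]                   # group 2: take the second network's value
    for au, threshold in THIRD_GROUP_THRESHOLDS.items():
        # group 3: gate the second network's value by the first network's response
        final[au] = second_set[au] if first_set[au] >= threshold else 0.0
    return final
```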
It should be noted that the N types of micro-expressions may be divided into different micro-expression groups by a preset dividing algorithm, or may be divided manually according to the experience of a user; no specific limitation is imposed here.
It should also be noted that when the present application recognizes micro-expressions, two dimensions are adopted for recognition. First dimension: a first 2D convolutional neural network performs micro-expression intensity recognition on the N input optical flow diagram sequences to obtain the micro-expression intensities of the N types of micro-expressions, which form a first intensity set. Second dimension: a second 2D convolutional neural network performs micro-expression intensity recognition on the N RGB images to obtain the micro-expression intensities of the N types of micro-expressions, which form a second intensity set. The electronic device can then combine the micro-expression intensities of the various micro-expressions in the first intensity set and the second intensity set to obtain the micro-expression intensities of the various micro-expressions. Because more dimensions are used for micro-expression recognition, the recognized micro-expression intensities are more accurate.
As can be seen from the above description, the electronic device receives at least one frame of image containing the same target object; respectively determining a first facial image of a target object in each frame of image, and generating a first facial image sequence; adjacent first facial images in the first sequence of facial images are associated in adjacent timing; processing each first facial image in the first facial image sequence to obtain each second facial image only containing a specified facial organ, and generating a second facial image sequence; adjacent second facial images in the second facial image sequence are associated in adjacent time sequences; inputting the second facial image sequence into a preset stress information identification model to identify the stress information of the target object by the stress information identification model.
On one hand, mask processing is first performed on the first facial images to obtain the second facial image sequence, which removes the background of the facial images and highlights only the important facial organs that are helpful for recognizing expressions and micro-expressions; the stress model then recognizes the second facial image sequence, so recognition of expressions or micro-expressions is not affected by the background of the facial images, and the recognition of expressions and micro-expressions is more accurate.
In another aspect, the present application identifies the second facial image sequence through a 3D convolutional neural network. When the 3D convolutional neural network performs expression recognition on the second facial image sequence, not only the spatial relationship of the second facial images of each frame in the second facial image sequence, but also the time sequence relationship of the second facial images of each frame are considered, so that the accuracy of expression recognition is greatly improved by adopting the 3D convolutional neural network.
In the third aspect, when micro-expressions are recognized, two dimensions are adopted. First dimension: a first 2D convolutional neural network performs micro-expression intensity recognition on the N input optical flow diagram sequences to obtain the micro-expression intensities of the N types of micro-expressions, which form a first intensity set. Second dimension: a second 2D convolutional neural network performs micro-expression intensity recognition on the N RGB images to obtain the micro-expression intensities of the N types of micro-expressions, which form a second intensity set. The electronic device can then combine the micro-expression intensities of the various micro-expressions in the first intensity set and the second intensity set to obtain the final micro-expression intensities of the various micro-expressions. Because both dimensions are referenced, the final micro-expression intensities are determined more accurately.
In a fourth aspect, when the expression and the micro-expression of the target object are recognized, multiple frames of images containing the target object are used rather than a single frame; because the time sequence among the images is taken into account, the recognized expressions and micro-expressions are more accurate.
The expression recognition and the micro-expression recognition are described in detail below by way of specific examples.
1. Expression recognition
Step 201: the electronic device receives at least one frame of image containing the same target object.
Assume that at least one frame of image containing the same target object is 3 frames of image, namely image 1, image 2 and image 3. Wherein, image 1 is a first frame image, image 2 is a second frame image, and image 3 is a third frame image.
The three frames of images are the originally acquired images and may contain, for example, the whole body or the upper body of the target object.
Step 202: the electronic device may determine facial images of the target object in image 1, image 2, and image 3, respectively, to form a first sequence of facial images.
The electronic device may input image 1, image 2, and image 3 to the face recognition model.
Taking image 1 as an example, the face recognition model may recognize a face region of a target object in image 1, and extract the recognized face region from image 1 to form a face region image 1 corresponding to image 1. Similarly, the face recognition model can also obtain a face region image 2 and a face region image 3. The face recognition model may output a face region image 1, a face region image 2, and a face region image 3.
The electronic device may perform preprocessing such as correction, alignment, etc. on the face region image 1, the face region image 2, and the face region image 3, respectively, to obtain a first face image 1, a first face image 2, and a first face image 3 corresponding to the face region image 1, the face region image 2, and the face region image 3, respectively.
The electronic device may generate the first sequence of facial images in frame order of the first facial image 1, the first facial image 2, and the first facial image 3. The first facial image sequence generated is { first facial image 1, first facial image 2, first facial image 3}.
The first facial image 1 and the first facial image 2 are associated in adjacent time sequence, and the first facial image 2 and the first facial image 3 are likewise associated in adjacent time sequence.
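As a rough illustration of steps 201 and 202, the sketch below detects and crops the face of the target object in each frame to build the first facial image sequence. The OpenCV Haar cascade is used purely as an assumed stand-in for the face recognition model named in the text, and the resize stands in for the correction and alignment preprocessing; none of this is the application's own implementation.

```python
# Sketch of steps 201-202 under the assumptions stated above.
import cv2

detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def build_first_face_sequence(frames, size=112):
    """frames: [image 1, image 2, image 3] in temporal order (BGR numpy arrays)."""
    sequence = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            raise ValueError("no face detected in this frame")
        x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])        # keep the largest detected face
        face = cv2.resize(frame[y:y + h, x:x + w], (size, size))  # crude crop + alignment stand-in
        sequence.append(face)
    return sequence  # { first facial image 1, first facial image 2, first facial image 3 }
```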
Step 203: the electronic device may input the first facial image sequence to a trained face segmentation neural network, so that all facial organs of each first facial image in the first facial image sequence are identified by the face segmentation neural network, each third facial image corresponding to each first facial image and marked with each facial organ in a distinguishing manner is obtained, and a third facial image sequence composed of each third facial image is output.
For example, the electronic device may input the first sequence of facial images to a face segmentation neural network.
Taking the processing of the first facial image 1 as an example, see fig. 2a, it is assumed that the first facial image 1 is as shown in fig. 2 a. The face segmentation neural network can identify all facial organs of the first facial image 1, and obtain a third facial image 1 corresponding to the first facial image 1 and labeled with all facial organs, and the third facial image 1 is shown in fig. 2 b.
It should be noted that the function of the face segmentation network is to distinguish and mark the different facial organs. For example, as shown in fig. 2b, after the first facial image 1 is processed by the face segmentation neural network, a third facial image 1 is obtained, and the pixel values of different facial organs in the third facial image 1 are different; each facial organ is thus distinguished and marked by the value of its pixel points.
For example, in the third face image 1, the values of the pixels at the eyebrows are 1, the values of the pixels at the mouth are 2, the values of the pixels at the nose are 3, and the values of the pixels at the eyes are 4. All facial organs such as eyebrows, mouth, nose, eyes and the like are marked by different pixel values.
Similarly, the face segmentation neural network can also obtain a third facial image 2 corresponding to the first facial image 2 and a third facial image 3 corresponding to the first facial image 3.
The face segmentation neural network may then compose the third facial image 1, the third facial image 2, and the third facial image 3 into a third facial image sequence in frame order, and output the third facial image sequence. The third facial image sequence is { third facial image 1, third facial image 2, third facial image 3}.
The third facial image 1 and the third facial image 2 are associated in adjacent time sequence, and the third facial image 2 and the third facial image 3 are likewise associated in adjacent time sequence.
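One simple way to realize such a third facial image is a label map in which each organ is marked by its own pixel value, as in the example above. The following numpy sketch assumes that convention; the specific label values and function names are illustrative assumptions.

```python
# Illustrative label-map encoding of a "third facial image" (values follow the example above).
import numpy as np

ORGAN_LABELS = {"eyebrows": 1, "mouth": 2, "nose": 3, "eyes": 4}   # 0 = background / skin

def organ_mask(third_face_image, organ):
    """third_face_image: 2D numpy array of organ labels; returns a boolean mask for one organ."""
    return third_face_image == ORGAN_LABELS[organ]

# e.g. the eyebrow pixels of third facial image 1:
# brow_pixels = organ_mask(third_face_image_1, "eyebrows")
```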
Step 204: the electronic device may generate an expression mask map, and perform mask operation with each third facial image in the third facial sequence by using the expression mask map, to obtain each second facial image corresponding to each third facial image and including all facial organs, and generate a second facial image sequence.
The electronic device may generate the expression mask map by taking the union of the pixel points where all the facial organs marked in the third facial image 1 are located, the pixel points where all the facial organs marked in the third facial image 2 are located, and the pixel points where all the facial organs marked in the third facial image 3 are located.
For the third face image 1 in the third face image sequence, the electronic device may use the expression mask map to perform a mask operation on the third face image 1 to obtain the second face image 1. The second face image 1 is obtained as shown in fig. 2 c.
Similarly, the electronic device can also obtain the second face image 2 corresponding to the third face image 2 and the second face image 3 corresponding to the third face image 3.
Then, the electronic device may compose the second face image 1, the second face image 2, and the second face image 3 into a second face image sequence in frame order, and output the second face image sequence. The second face image sequence is { second face image 1, second face image 2, second face image 3}.
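Continuing the label-map assumption above, the sketch below builds the expression mask as the union of labelled organ pixels over the three third facial images and applies it to the corresponding facial image. The function names, the grayscale-image assumption, and the choice to zero out background pixels are illustrative, not the application's exact masking operation.

```python
# Sketch of step 204: union of organ pixels across frames, then masking.
import numpy as np

def expression_mask(third_face_images):
    """third_face_images: list of 2D label maps of identical shape."""
    union = np.zeros_like(third_face_images[0], dtype=bool)
    for label_map in third_face_images:
        union |= (label_map > 0)          # any labelled organ pixel joins the union
    return union

def apply_mask(face_image, mask):
    """face_image: 2D grayscale image of the same shape as the mask."""
    return np.where(mask, face_image, 0)  # keep organ pixels, zero out the background

# second_face_1 = apply_mask(face_1, expression_mask([label_map_1, label_map_2, label_map_3]))
```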
Step 205: the electronic device can input the second facial image sequence into the trained 3D convolutional neural network, so that the 3D convolutional neural network can identify the second facial image sequence, and the confidence coefficient of the target object relative to various expressions can be obtained.
For example, the expression types include 8 categories, namely aversion, anger, fear, sadness, happiness, surprise, contempt, and no expression.
After the electronic equipment inputs the second facial image sequence into the 3D convolutional neural network, the convolutional layer of the 3D convolutional neural network assembles the second facial image sequence into a 3D data set, and carries out convolutional operation by adopting at least one preset 3D convolutional kernel to obtain an expression feature map and inputs the expression feature map into the pooling layer of the 3D convolutional neural network. And the pooling layer of the 3D convolutional neural network pools the expression feature map, and inputs the pooled expression feature map into the softmax layer of the 3D convolutional neural network. And classifying the expression feature images by a softmax layer of the 3D convolutional neural network to obtain the confidence degrees of the target object corresponding to each expression type.
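A minimal PyTorch sketch of such a 3D convolutional pipeline is given below. The layer sizes, the single-channel input, and the sequence length of 3 are assumptions for illustration and do not reproduce the trained network described here.

```python
# Hypothetical sketch of a 3D convolution -> pooling -> softmax expression classifier.
import torch
import torch.nn as nn

class Expression3DNet(nn.Module):
    def __init__(self, num_expressions=8, seq_len=3, img_size=112):
        super().__init__()
        self.conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))
        self.fc = nn.Linear(8 * seq_len * (img_size // 2) * (img_size // 2), num_expressions)

    def forward(self, x):                 # x: [batch, 1, sequence length, height, width]
        x = torch.relu(self.conv3d(x))
        x = self.pool(x)
        logits = self.fc(x.flatten(1))
        return torch.softmax(logits, dim=1)   # one confidence per expression type

# A second facial image sequence of 3 grayscale frames assembled into a 3D data set:
clip = torch.randn(1, 1, 3, 112, 112)
confidences = Expression3DNet()(clip)     # 8 values summing to 1 (aversion, anger, fear, ...)
```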
The 3D convolutional neural network can output the following confidences: aversion 0.1, anger 0.05, fear 0.1, sadness 0.1, happiness 0.4, surprise 0.2, contempt 0.05, and no expression 0.
The electronic device can then display the 8 expressions and their confidences as a pie chart. Of course, the electronic device may also display them in other forms, such as a table; this is only an exemplary illustration of how the 8 expressions and their confidences are displayed and is not specifically limited here.
2. Micro-expression recognition
Let N kinds of micro-expressions be 8 kinds of micro-expressions, which are AU1, AU4, AU5, AU12, AU15, AU25, AU32, AU45, respectively.
The micro-expression recognition method will be described in detail through steps 301 to 307.
Step 301: the electronic device receives at least one frame of image containing the same target object.
Assume that at least one frame of image containing the same target object is 3 frames of image, namely image 1, image 2 and image 3. Wherein, image 1 is a first frame image, image 2 is a second frame image, and image 3 is a third frame image.
The three frames of images are the originally acquired images and may contain, for example, the whole body or the upper body of the target object.
Step 302: the electronic device may determine facial images of the target object in image 1, image 2, and image 3, respectively, to form a first sequence of facial images.
The method for generating the first facial image sequence is referred to as step 202, and will not be described in detail here.
The first facial image sequence generated is { first facial image 1, first facial image 2, first facial image 3}.
Step 303: the electronic device may input the first facial image sequence to a trained face segmentation neural network, so that all facial organs of each first facial image in the first facial image sequence are identified by the face segmentation neural network, each third facial image corresponding to each first facial image and marked with all facial organs in a distinguishing manner is obtained, and a third facial image sequence composed of each third facial image is output.
The manner in which the third face image sequence is generated is shown in step 203 and will not be described in detail here.
The third face image sequence generated is { third face image 1, third face image 2, third face image 3}.
Step 304: the electronic equipment can adopt mask images corresponding to 8 preset micro-expressions respectively to perform mask operation on each frame of image in the third facial image sequence, so as to obtain N second facial image sequences corresponding to the 8 micro-expressions one by one.
The specified facial organs corresponding to AU1 and AU4 are eyebrows, the specified facial organs corresponding to AU5 and AU45 are eyes, and the specified facial organs corresponding to AU12, AU15, AU25 and AU32 are mouth. Taking AU1 as an example, the electronic device can select the pixel point where the eyebrow is located among the pixel points where all the facial organs are located, which are marked in the third face image 1. Similarly, the pixel point where the eyebrow is located is selected in each of the third face image 2 and the third face image 3.
Then, the electronic device can generate an eyebrow mask map corresponding to AU1 by merging the pixels where the eyebrows selected from the third face image 1, the third face image 2, and the third face image 3 are located.
The electronic device may perform a masking operation on the third face image 1 (as shown in fig. 2 b) using the eyebrow mask map, to obtain a second face image 11 (as shown in fig. 2 d) corresponding to the third face image 1 and including only the eyebrows. Similarly, the electronic apparatus can also obtain the second face image 12 corresponding to the third face image 2 and including only the eyebrows, and the second face image 13 corresponding to the third face image 3 and including only the eyebrows. The electronic device may compose the second face image 11, the second face image 12, and the second face image 13 into the second face image sequence 1 for AU 1. Adjacent second facial images in the second sequence of facial images are associated in adjacent timing.
The second face image sequence 1 is { second face image 11, second face image 12, second face image 13}.
Like AU1, the electronic device may further obtain a second facial image sequence 2 corresponding to AU4, where the second facial image in the second facial image sequence corresponding to AU4 includes only the eyebrows. The second face image sequence 2 is { second face image 21, second face image 22, second face image 23}.
Taking AU5 as an example, the electronic device may select, from among the pixels where all the facial organs marked in the third facial image 1 are located, the pixel where the eyes are located. Similarly, the pixel point where the eye is selected in the third face image 2 and the third face image 3, respectively.
Then, the electronic device can generate an eye mask map corresponding to AU5 by merging the pixel points where the eyes selected from the third face image 1, the third face image 2, and the third face image 3 are located.
The electronic device may perform a masking operation on the third face image 1 (as shown in fig. 2 b) using the eye mask map, to obtain a second face image 31 (as shown in fig. 2 e) corresponding to the third face image 1 and including only eyes. Similarly, the electronic apparatus can also obtain the second face image 32 corresponding to the third face image 2 and including only the eyes, and the second face image 33 corresponding to the third face image 3 and including only the eyes. The electronic device may compose the second face image 31, the second face image 32, and the second face image 33 into the second face image sequence 3 for AU 5. Adjacent second facial images in the second sequence of facial images are associated in adjacent timing.
Like AU5, the electronic device may further obtain a second sequence of facial images 4 corresponding to AU45, where the second facial images in the second sequence of facial images 4 include only eyes. The second face image sequence 4 is { second face image 41, second face image 42, second face image 43}.
Taking AU12 as an example, the electronic device may select, from among the pixel points where all the facial organs marked in the third face image 1 are located, the pixel point where the mouth is located. Similarly, the pixel point where the mouth is located is selected in the third face image 2 and the third face image 3, respectively.
Then, the electronic device can combine the pixel points where the mouth is selected in the third face image 1, the third face image 2, and the third face image 3, and generate a mouth mask map corresponding to the AU 12.
The electronic device may perform a masking operation on the third facial image 1 (as shown in fig. 2 b) using the mouth mask map, to obtain a second facial image 51 (as shown in fig. 2 f) corresponding to the third facial image 1 and including only the mouth. Similarly, the electronic device can also obtain the second facial image 52 corresponding to the third facial image 2 and including only the mouth, and the second facial image 53 corresponding to the third facial image 3 and including only the mouth. The electronic device may compose the second facial image 51, the second facial image 52, and the second facial image 53 into the second facial image sequence 5 for AU 12. Adjacent second facial images in the second facial image sequence are associated in adjacent time sequence. The second facial image sequence 5 is { second facial image 51, second facial image 52, second facial image 53}.
Like AU12, the electronic device can obtain a second sequence of facial images 6 corresponding to AU15, where the second sequence of facial images 6 is { second facial image 61, second facial image 62, second facial image 63}.
Like AU12, the electronic device can obtain a second sequence of facial images 7 corresponding to AU25, where the second sequence of facial images 7 is { second facial image 71, second facial image 72, second facial image 73}.
Like AU12, the electronic device can obtain a second facial image sequence 8 corresponding to AU32, and the second facial image sequence 8 is { second facial image 81, second facial image 82, second facial image 83}.
The second facial images of the second facial image sequences 5, 6, 7 and 8 include only the mouth.
Finally, the electronic device outputs 8 second face image sequences corresponding to AU1, AU4, AU5, AU12, AU15, AU25, AU32, AU45, respectively.
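Under the same label-map assumption used earlier, the sketch below shows how an AU-specific second facial image sequence could be produced: select the organ corresponding to the AU, take the union of its pixels over the three third facial images, and mask every frame with it. The AU-to-organ mapping follows the text; the label values, the grayscale-image assumption, and the function names are illustrative.

```python
# Sketch of step 304: per-AU mask generation and application.
import numpy as np

AU_TO_ORGAN_LABEL = {"AU1": 1, "AU4": 1,                          # eyebrows
                     "AU5": 4, "AU45": 4,                          # eyes
                     "AU12": 2, "AU15": 2, "AU25": 2, "AU32": 2}   # mouth

def au_mask(third_face_images, au):
    """Union, over all frames, of the pixels carrying the organ label for this AU."""
    label = AU_TO_ORGAN_LABEL[au]
    union = np.zeros_like(third_face_images[0], dtype=bool)
    for label_map in third_face_images:
        union |= (label_map == label)
    return union

def au_face_sequence(face_images, third_face_images, au):
    """face_images: grayscale frames matching the label maps; returns the AU-specific sequence."""
    mask = au_mask(third_face_images, au)
    return [np.where(mask, face, 0) for face in face_images]   # e.g. second facial image sequence 1 for AU1
```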
Step 305: the electronic device can respectively calculate optical flow diagrams of each frame of second facial image in the 8 second facial image sequences in the vertical direction and the horizontal direction, generate 8 optical flow diagram sequences, input the 8 optical flow diagram sequences into the first 2D convolutional neural network, enable the first 2D convolutional neural network to conduct micro-expression intensity recognition on the 8 optical flow diagram sequences, and output a first intensity set composed of micro-expression intensities corresponding to 8 types of micro-expressions.
Taking the second face image sequence 1 as an example, assume that the second face image sequence 1 is { second face image 11, second face image 12, and second face image 13}.
The electronic device can calculate the optical flow map 11 in the horizontal direction and the optical flow map 12 in the vertical direction based on the second facial image 11 and the second facial image 12, and calculate the optical flow map 21 in the horizontal direction and the optical flow map 22 in the vertical direction based on the second facial image 12 and the second facial image 13, obtaining 4 optical flow maps in total, which form the optical flow sequence 1 corresponding to the second facial image sequence 1.
Similarly, the electronic device may calculate the optical flow sequences 2 to 8 corresponding to the second face image sequences 2 to 8, respectively.
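The application does not name a particular optical-flow algorithm; purely as an assumed stand-in, the sketch below uses OpenCV's Farneback dense flow to produce the horizontal and vertical optical-flow maps for two consecutive second facial images.

```python
# Illustrative optical-flow computation between two consecutive frames (assumed algorithm).
import cv2

def flow_maps(prev_face, next_face):
    """prev_face / next_face: consecutive grayscale second facial images (uint8 arrays)."""
    # Farneback parameters: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_face, next_face, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return flow[..., 0], flow[..., 1]      # horizontal (x) flow map, vertical (y) flow map

# For the sequence {second facial image 11, 12, 13} this yields the 4 maps of optical flow sequence 1:
# flow_maps(img_11, img_12) -> maps 11 and 12; flow_maps(img_12, img_13) -> maps 21 and 22
```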
The electronic device may input optical flow sequence 1 through optical flow sequence 8 into the first 2D convolutional neural network.
The convolution layer of the first 2D convolutional neural network may use the preset data format [ N, C, H, W ] to compose the optical flow sequence 1 to the optical flow sequence 8 into a data set, then use a 2D convolution kernel to perform a convolution operation on the data set, and send the convolved data set to the pooling layer. The pooling layer pools the convolved data set and sends it to the Euclidean Loss layer, and the Euclidean Loss layer performs regression calculation to obtain the micro-expression intensities corresponding to the 8 types of micro-expressions.
Assume that the 8 micro-expression intensities output by the first 2D convolutional neural network are respectively: AU1 intensity 0.5, AU4 intensity 0.8, AU5 intensity 0.6, AU12 intensity 0.4, AU15 intensity 0.6, AU25 intensity 0.7, AU32 intensity 0.3, and AU45 intensity 0.4.
The microexpressive intensities corresponding to the 8 types of microexpressions obtained through the first 2D convolutional neural network can form a first intensity set. The first set of intensities is { AU1:0.5, AU4:0.8, AU5:0.6, AU12:0.4, AU15:0.6, AU25:0.7, AU32:0.3, AU45:0.4}.
The preset data format of the first 2D convolutional neural network is [ N, C, H, W ]. Wherein N is a preset value, usually 1, C is the number of the optical flow graphs in the optical flow sequence (namely 4), H is the height of the optical flow graphs, and W is the width of the optical flow graphs.
Step 306: the electronic equipment processes the last frame of the 8 second facial image sequences to obtain 8 RGB images, inputs the 8 RGB images into a second 2D convolutional neural network, carries out micro-expression intensity recognition on the 8 RGB images by the second 2D convolutional neural network, and outputs a second intensity set composed of micro-expression intensities corresponding to 8 types of micro-expressions
The electronic device may obtain 8 second face images by respectively taking the second face image of the last frame from the 8 second face image sequences.
Then, the electronic device may perform RGB processing on the 8 second face images, to obtain 8 RGB images. The electronic device may input 8 RGB images to the second 2D convolutional neural network.
The convolution layer of the second 2D convolutional neural network can adopt the preset data format [ N, C, H, W ] to compose the input 8 RGB images into a data set, then use a 2D convolution kernel to perform a convolution operation on the data set, and send the convolved data set to the pooling layer. The pooling layer pools the convolved data set and sends it to the Euclidean Loss layer, and the Euclidean Loss layer performs regression calculation to obtain the micro-expression intensities corresponding to the 8 types of micro-expressions.
Assume that the 8 micro-expression intensities output by the second 2D convolutional neural network are respectively: AU1 intensity 0.2, AU4 intensity 0.3, AU5 intensity 0.5, AU12 intensity 0.4, AU15 intensity 0.5, AU25 intensity 0.6, AU32 intensity 0.4, and AU45 intensity 0.5.
The microexpressive intensities corresponding to 8 types of microexpressions obtained through the second 2D convolutional neural network can form a second intensity set. The second intensity set is { AU1:0.2, AU4:0.3, AU5:0.5, AU12:0.4, AU15:0.5, AU25:0.6, AU32:0.4, AU45:0.5}.
The preset data format of the second 2D convolutional neural network is [ N, C, H, W ]. Here N is a preset value, typically 1; since the second 2D convolutional neural network processes RGB images, C is 3; H is the height of the RGB image and W is the width of the RGB image.
Step 307: the electronic equipment can combine the micro-expression intensity in the first intensity set and the micro-expression intensity in the second intensity set to obtain micro-expression intensity corresponding to the final 8 types of micro-expressions.
It is assumed that these 8 kinds of micro-expressions are divided into 3 micro-expression groups in advance, which are a first micro-expression group, a second micro-expression group, and a third micro-expression group, respectively.
The micro-expression types in the first micro-expression group include: AU45;
the micro-expression types in the second micro-expression group include: AU1, AU12, AU15, AU32, AU25;
the micro-expression types in the third micro-expression group include: AU4 and AU5.
For the first microexpressive group, the electronic device may select the microexpressive intensity value of AU45 from the first intensity set as the final microexpressive intensity value of AU 45. I.e. the final AU45 has a strength of 0.4.
For the second microexpressive group, the electronic device may select microexpressive intensity values corresponding to AU1, AU12, AU15, AU32, and AU25 from the second intensity set, and the microexpressive intensity values are the final microexpressive intensity values of AU1, AU12, AU15, AU32, and AU 25. Namely, the final AU1 had a strength of 0.2, AU12 had a strength of 0.4, AU15 had a strength of 0.5, AU32 had a strength of 0.4, and AU25 had a strength of 0.6.
For the third micro-expression group, take AU4 first and assume that the preset intensity threshold corresponding to AU4 is 0.3. The electronic device may detect whether the micro-expression intensity corresponding to AU4 in the first intensity set is greater than or equal to the preset threshold corresponding to AU4; in this example, since the micro-expression intensity 0.8 of AU4 in the first intensity set is greater than the preset threshold 0.3 corresponding to AU4, the micro-expression intensity corresponding to AU4 (i.e. 0.3) is selected from the second intensity set as the final micro-expression intensity of AU4. That is, the final intensity of AU4 is 0.3.
For AU5, assume that the preset intensity threshold corresponding to AU5 is 0.7. The electronic device may detect whether the micro-expression intensity corresponding to AU5 in the first intensity set is greater than or equal to the preset threshold corresponding to AU5; in this example, since the micro-expression intensity 0.6 of AU5 in the first intensity set is less than the preset threshold 0.7 corresponding to AU5, the final intensity of AU5 is 0.
The final micro-expression intensities of the various micro-expressions are thus: AU1 intensity 0.2, AU4 intensity 0.3, AU5 intensity 0, AU12 intensity 0.4, AU15 intensity 0.5, AU32 intensity 0.4, AU25 intensity 0.6, and AU45 intensity 0.4.
The electronic device can display the micro-expression intensities corresponding to these 8 AUs in the form of a graph or the like.
The application also provides a human body stress identification device corresponding to the human body stress information identification method.
Referring to fig. 3, fig. 3 is a block diagram of a human body stress information identifying apparatus according to an exemplary embodiment of the present application. The apparatus may include the units shown below.
An acquiring unit 301, configured to acquire at least one frame of image including the same target object;
a first generating unit 302, configured to determine first facial images of the target object in each frame image, and generate a first facial image sequence; adjacent first facial images in the first sequence of facial images are associated in adjacent timing;
a second generating unit 303, configured to process each first facial image in the first facial image sequence to obtain each second facial image only including a specified facial organ, and generate a second facial image sequence; adjacent second facial images in the second facial image sequence are associated in adjacent time sequences;
and an identifying unit 304, configured to input the second facial image sequence into a preset stress information identifying model, so as to identify the stress information of the target object by the stress information identifying model.
Optionally, the second generating unit 303 is specifically configured to input the first facial image sequence into a trained face segmentation neural network, so that all facial organs of each first facial image in the first facial image sequence are identified by the face segmentation neural network, obtain each third facial image in which each facial organ is marked in a distinguishing manner, and output a third facial image sequence composed of the third facial images; adjacent third facial images in the third facial image sequence are associated in adjacent time sequences; and respectively perform mask processing on each third facial image in the third facial image sequence to obtain each second facial image only containing the specified facial organ, and form a second facial image sequence.
Optionally, the stress information is expression information, and the specified facial organs are all facial organs of the face of the target object;
the second generating unit 303 is configured to perform mask processing on each third face image in the third face image sequence to obtain each second face image only including the specified face organ, and when forming the second face image sequence, specifically configured to obtain a union set of pixel points where the face organs marked in all the third face images are located, and generate an expression mask map; and performing mask operation on the expression mask image and each third facial image respectively to obtain each second facial image containing all facial organs, and generating a second facial image sequence.
Optionally, the expression information includes confidence degrees of various expressions;
the identifying unit 304 is specifically configured to input the second facial image sequence into a trained 3D convolutional neural network, assemble the second facial image sequence into a 3D data set by a convolutional layer of the 3D convolutional neural network, perform convolutional operation by using at least one preset 3D convolutional kernel, obtain an expression feature map, and input the expression feature map into a pooling layer of the 3D convolutional neural network; the pooling layer of the 3D convolutional neural network pools the expression feature images, and inputs the pooled expression feature images to the softmax layer of the 3D convolutional neural network; the softmax layer classifies the expression feature images to obtain the confidence degrees of the target object corresponding to each expression type; wherein a first dimension of the three dimensions represents a second facial image sequence length, a second dimension represents a second facial image height, and a third dimension represents a second facial image width.
Optionally, the stress information is microexpressive information of N types of microexpressions; the specified facial organ is a facial organ corresponding to N types of micro-expressions; n is an integer greater than zero;
The second generating unit 303 is specifically configured to, when obtaining each second facial image containing only the specified facial organ and forming the second facial image sequence, select, for each type of micro-expression, the pixel points where the specified facial organ corresponding to that type of micro-expression is located from among the pixel points where all the facial organs marked in each third facial image are located, take the union of the pixel points selected from all the third facial images to generate a micro-expression mask map of that type of micro-expression, and perform a mask operation on each frame of the third facial image sequence using the micro-expression mask maps corresponding to the N types of micro-expressions respectively, to obtain N second facial image sequences corresponding to the N types of micro-expressions one by one; wherein the second facial images in the second facial image sequence corresponding to each micro-expression contain only the specified facial organ corresponding to that micro-expression.
Optionally, the microexpressive information of the N types of microexpressions includes: the microexpressive intensity corresponding to the N types of microexpressions;
the stress information identification model comprises a first 2D convolutional neural network and a second 2D convolutional neural network;
The identifying unit 304 is specifically configured to respectively calculate optical flow diagrams of any two consecutive second facial images in the N second facial image sequences in a vertical direction and a horizontal direction, generate N optical flow diagram sequences, and input the N optical flow diagram sequences into a first 2D convolutional neural network, so that the first 2D convolutional neural network performs micro-expression intensity identification on the N optical flow diagram sequences, and output a first intensity set composed of micro-expression intensities corresponding to N types of micro-expressions; processing the appointed second face images of the N second face image sequences to obtain N RGB images, inputting the N RGB images into a second 2D convolutional neural network, carrying out microexpressive intensity recognition on the N RGB images by the second 2D convolutional neural network, and outputting a second intensity set composed of microexpressive intensities corresponding to N types of microexpressions; and combining the microexpressive intensity in the first intensity set and the microexpressive intensity in the second intensity set to obtain microexpressive intensity corresponding to the final N types of microexpressions.
Optionally, the N types of micro-expressions are divided into three micro-expression groups, namely a first micro-expression group, a second micro-expression group and a third micro-expression group;
The identifying unit 304 is specifically configured to select, for each type of microexpressions in the first microexpressions group, microexpressions intensity corresponding to the type of microexpressions from the first intensity set when combining the microexpressions intensity in the first intensity set and the microexpressions intensity in the second intensity set to obtain microexpressions intensity corresponding to the final N types of microexpressions; aiming at various micro-expressions in the second micro-expression group, selecting micro-expression intensities corresponding to the micro-expressions from the second intensity set; for each type of microexpressions in the third microexpressions group, detecting whether the microexpressions intensity of the type of microexpressions in the first intensity set is larger than or equal to a preset threshold value corresponding to the type of microexpressions, if so, selecting the microexpressions intensity corresponding to the type of microexpressions from the second intensity set; if not, determining that the micro expression intensity is 0; and combining the microexpressive intensities corresponding to the microexpressive expressions in the first microexpressive expression group, the second microexpressive expression group and the third microexpressive expression group to obtain the final microexpressive intensity corresponding to the N types of microexpressive expressions.
The present application also provides a hardware architecture diagram of an electronic device corresponding to the apparatus shown in fig. 3.
Referring to fig. 4, fig. 4 is a hardware configuration diagram of an electronic device according to an exemplary embodiment of the present application.
The electronic device in fig. 4 includes: a communication interface 401, a processor 402, a machine-readable storage medium 403, and a bus 404; wherein the communication interface 401, the processor 402 and the machine readable storage medium 403 perform communication with each other via a bus 404. The processor 402 may perform the human stress information identification method described above by reading and executing machine-executable instructions corresponding to the human stress information identification control logic in the machine-readable storage medium 403.
The machine-readable storage medium 403 referred to herein may be any electronic, magnetic, optical, or other physical storage device that may contain or store information, such as executable instructions, data, or the like. For example, a machine-readable storage medium may be: volatile memory, nonvolatile memory, or a similar storage medium. In particular, the machine-readable storage medium 403 may be RAM (Random Access Memory), flash memory, a storage drive (e.g., hard drive), a solid state drive, any type of storage disk (e.g., optical disk, DVD, etc.), a similar storage medium, or a combination thereof.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, since they essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
The foregoing description of the preferred embodiments of the present invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (14)

1. A method for identifying stress information of a human body, comprising:
Acquiring at least one frame of image containing the same target object;
respectively determining a first facial image of a target object in each frame of image, and generating a first facial image sequence; adjacent first facial images in the first sequence of facial images are associated in adjacent timing;
processing each first facial image in the first facial image sequence to obtain each second facial image only containing a specified facial organ, and generating a second facial image sequence; adjacent second facial images in the second facial image sequence are associated in adjacent time sequences;
inputting the second facial image sequence into a preset stress information identification model to identify the stress information of the target object by the stress information identification model;
the stress information is microexpressive information of N types of microexpressions, and the microexpressive information of the N types of microexpressions comprises: microexpressive intensities corresponding to the N types of microexpressions; the specified facial organ is a facial organ corresponding to the N types of micro-expressions, the second facial image sequences are N second facial image sequences corresponding to the N types of micro-expressions one by one, N is an integer greater than zero, and the stress information identification model comprises a first 2D convolutional neural network and a second 2D convolutional neural network;
The inputting the second facial image sequence into a preset stress information identification model to identify the stress information of the target object by the stress information identification model includes:
respectively calculating optical flow diagrams of any two continuous frames of second facial images in the N second facial image sequences in the vertical direction and the horizontal direction, generating N optical flow diagram sequences, inputting the N optical flow diagram sequences into a first 2D convolutional neural network, enabling the first 2D convolutional neural network to conduct micro-expression intensity recognition on the N optical flow diagram sequences, and outputting a first intensity set composed of micro-expression intensities corresponding to N types of micro-expressions;
processing the appointed second face images of the N second face image sequences to obtain N RGB images, inputting the N RGB images into a second 2D convolutional neural network, carrying out microexpressive intensity recognition on the N RGB images by the second 2D convolutional neural network, and outputting a second intensity set composed of microexpressive intensities corresponding to N types of microexpressions;
and combining the microexpressive intensity in the first intensity set and the microexpressive intensity in the second intensity set to obtain microexpressive intensity corresponding to the final N types of microexpressions.
2. The method of claim 1, wherein the processing each first facial image in the first sequence of facial images to obtain each second facial image comprising only the specified facial organ, generating the second sequence of facial images, comprises:
inputting the first facial image sequence into a trained face segmentation neural network, identifying all facial organs of each first facial image in the first facial image sequence by the face segmentation neural network to obtain each third facial image with each facial organ marked in a distinguishing way, and outputting a third facial image sequence composed of each third facial image; adjacent third face images in the third face image sequence are associated at adjacent timings;
and respectively performing mask processing on each third facial image in the third facial image sequence to obtain each second facial image only containing the appointed facial organ, and forming a second facial image sequence.
3. The method according to claim 2, wherein the stress information is expression information, and the specified facial organ is all facial organs of the target subject's face;
performing mask processing on each third face image in the third face image sequence to obtain each second face image only including the appointed face organ, forming a second face image sequence, including:
taking a union of the pixel points where the facial organs marked in all the third facial images are located to generate an expression mask map;
and performing mask operation on the expression mask image and each third facial image respectively to obtain each second facial image containing all facial organs, and generating a second facial image sequence.
4. The method of claim 3, wherein the expression information includes confidence levels for each type of expression;
the inputting the second facial image sequence into a preset stress information identification model to identify the stress information of the target object by the stress information identification model includes:
inputting the second facial image sequence into a trained 3D convolutional neural network, assembling the second facial image sequence into a 3D data set by a convolutional layer of the 3D convolutional neural network, performing convolution operation by adopting at least one preset 3D convolution kernel to obtain an expression feature map, and inputting the expression feature map into a pooling layer of the 3D convolutional neural network;
the pooling layer of the 3D convolutional neural network pools the expression feature images, and inputs the pooled expression feature images to the softmax layer of the 3D convolutional neural network;
The softmax layer classifies the expression feature images to obtain the confidence degrees of the target object corresponding to each expression type;
wherein a first dimension of the three dimensions represents a second facial image sequence length, a second dimension represents a second facial image height, and a third dimension represents a second facial image width.
5. The method according to claim 2, wherein masking each third facial image in the sequence of third facial images to obtain each second facial image including only the specified facial organ, and forming the second sequence of facial images, comprises:
selecting the pixel points of the appointed facial organs corresponding to each type of micro-expression from the pixel points of all the facial organs marked in each third facial image aiming at each type of micro-expression, and taking a union set of the pixel points selected from each third facial image to generate a micro-expression mask image of the type of micro-expression;
performing mask operation on each frame of image in the third facial image sequence by adopting a micro-expression mask image corresponding to N micro-expressions respectively to obtain N second facial image sequences corresponding to the N micro-expressions one by one;
wherein the second facial image in the second facial image sequence corresponding to each micro-expression comprises only the specified facial organ corresponding to that micro-expression.
6. The method of claim 1, wherein the N types of microexpressions are divided into three microexpressions groups, a first microexpressions group, a second microexpressions group, and a third microexpressions group, respectively;
combining the microexpressive intensity in the first intensity set and the microexpressive intensity in the second intensity set to obtain microexpressive intensity corresponding to the final N types of microexpressions, wherein the method comprises the following steps:
aiming at various micro-expressions in the first micro-expression group, selecting micro-expression intensities corresponding to the micro-expressions from the first intensity set;
aiming at various micro-expressions in the second micro-expression group, selecting micro-expression intensities corresponding to the micro-expressions from the second intensity set;
for each type of microexpressions in the third microexpressions group, detecting whether the microexpressions intensity of the type of microexpressions in the first intensity set is larger than or equal to a preset threshold value corresponding to the type of microexpressions, if so, selecting the microexpressions intensity corresponding to the type of microexpressions from the second intensity set; if not, determining that the micro expression intensity is 0;
And combining the microexpressive intensities corresponding to the microexpressive expressions in the first microexpressive expression group, the second microexpressive expression group and the third microexpressive expression group to obtain the final microexpressive intensity corresponding to the N types of microexpressive expressions.
7. A human stress information recognition device, comprising:
an acquisition unit for acquiring at least one frame of image containing the same target object;
a first generation unit configured to determine first face images of the target object in each frame of images, respectively, and generate a first face image sequence; adjacent first facial images in the first sequence of facial images are associated in adjacent timing;
a second generating unit, configured to process each first facial image in the first facial image sequence to obtain each second facial image only including a specified facial organ, and generate a second facial image sequence; adjacent second facial images in the second facial image sequence are associated in adjacent time sequences;
the identification unit is used for inputting the second facial image sequence into a preset stress information identification model so as to identify the stress information of the target object by the stress information identification model;
The stress information is microexpressive information of N types of microexpressions, and the microexpressive information of the N types of microexpressions comprises: microexpressive intensities corresponding to the N types of microexpressions; the specified facial organ is a facial organ corresponding to the N types of micro-expressions, the second facial image sequences are N second facial image sequences corresponding to the N types of micro-expressions one by one, N is an integer greater than zero, and the stress information identification model comprises a first 2D convolutional neural network and a second 2D convolutional neural network;
the recognition unit is specifically configured to respectively calculate optical flow diagrams of any two consecutive second facial images in the N second facial image sequences in a vertical direction and a horizontal direction, generate N optical flow diagram sequences, and input the N optical flow diagram sequences into the first 2D convolutional neural network, so that the first 2D convolutional neural network performs micro-expression intensity recognition on the N optical flow diagram sequences, and output a first intensity set composed of micro-expression intensities corresponding to N types of micro-expressions; processing the appointed second face images of the N second face image sequences to obtain N RGB images, inputting the N RGB images into a second 2D convolutional neural network, carrying out microexpressive intensity recognition on the N RGB images by the second 2D convolutional neural network, and outputting a second intensity set composed of microexpressive intensities corresponding to N types of microexpressions; and combining the microexpressive intensity in the first intensity set and the microexpressive intensity in the second intensity set to obtain microexpressive intensity corresponding to the final N types of microexpressions.
8. The apparatus according to claim 7, wherein the second generating unit is specifically configured to: input the first facial image sequence into a trained face segmentation neural network, so that the face segmentation neural network identifies all facial organs in each first facial image of the first facial image sequence, obtains third facial images in which the facial organs are differentially labeled, and outputs a third facial image sequence composed of the third facial images, wherein adjacent third facial images in the third facial image sequence are adjacent in time sequence; and respectively perform mask processing on each third facial image in the third facial image sequence to obtain second facial images each containing only the specified facial organ, and form the second facial image sequence.
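A rough sketch of the segmentation step of claim 8, assuming a hypothetical PyTorch face-parsing network `face_parsing_net` that outputs per-pixel logits over facial-organ classes; the claim does not specify an architecture or label set, so both are placeholders:

```python
import torch

# Assumed organ label set; index 0 is background.
ORGAN_CLASSES = ["background", "brow", "eye", "nose", "mouth"]

@torch.no_grad()
def label_facial_organs(face_parsing_net, first_face_seq):
    """first_face_seq: float tensor (T, 3, H, W); returns (T, H, W) organ label maps."""
    logits = face_parsing_net(first_face_seq)  # (T, num_classes, H, W)
    return logits.argmax(dim=1)                # "third" facial images with organs labeled
```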
9. The apparatus according to claim 8, wherein the stress information is expression information, and the specified facial organ is all facial organs of the face of the target object;
the second generating unit, when performing mask processing on each third facial image in the third facial image sequence to obtain second facial images each containing only the specified facial organ and forming the second facial image sequence, is specifically configured to: obtain a union of the pixel points at which the facial organs labeled in each third facial image are located, to generate an expression mask map; and perform a mask operation between the expression mask map and each third facial image respectively, to obtain second facial images each containing all facial organs, and generate the second facial image sequence.
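A rough sketch of the expression-mask construction in claim 9, assuming per-pixel organ label maps such as those from the previous sketch (0 = background, non-zero = some facial organ); the union of labeled pixels over all frames forms the expression mask, which is then applied to every third facial image:

```python
import numpy as np

def build_expression_masked_sequence(third_face_seq, label_maps):
    """third_face_seq: list of (H, W, 3) images; label_maps: list of (H, W) label arrays."""
    # Union of all pixels marked as any facial organ in any frame.
    expression_mask = np.zeros(label_maps[0].shape, dtype=bool)
    for labels in label_maps:
        expression_mask |= labels > 0
    # Apply the same mask to every frame, keeping only facial-organ pixels.
    return [img * expression_mask[..., None] for img in third_face_seq]
```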
10. The apparatus according to claim 9, wherein the expression information comprises a confidence level for each expression type;
the recognition unit is specifically configured to: input the second facial image sequence into a trained 3D convolutional neural network, so that a convolutional layer of the 3D convolutional neural network assembles the second facial image sequence into 3D data and performs a convolution operation using at least one preset 3D convolution kernel to obtain an expression feature map, and inputs the expression feature map to a pooling layer of the 3D convolutional neural network; the pooling layer of the 3D convolutional neural network pools the expression feature map and inputs the pooled expression feature map to a softmax layer of the 3D convolutional neural network; and the softmax layer classifies the expression feature map to obtain the confidence level of the target object for each expression type; wherein, of the three dimensions, the first dimension represents the length of the second facial image sequence, the second dimension represents the height of the second facial images, and the third dimension represents the width of the second facial images.
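A minimal sketch of a 3D convolutional classifier of the kind described in claim 10, in PyTorch; the kernel size, channel count and number of expression classes are illustrative assumptions, and the input layout follows the claim's three dimensions (sequence length, image height, image width) after the channel axis:

```python
import torch
import torch.nn as nn

class ExpressionNet3D(nn.Module):
    def __init__(self, num_expressions=7):
        super().__init__()
        # One 3D kernel sliding over (sequence length, height, width).
        self.conv = nn.Conv3d(3, 16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(16, num_expressions)

    def forward(self, seq):
        # seq: (batch, 3, T, H, W) with T = sequence length, H = height, W = width.
        x = torch.relu(self.conv(seq))
        x = self.pool(x).flatten(1)
        return torch.softmax(self.fc(x), dim=1)  # confidence per expression type
```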
11. The apparatus according to claim 8, wherein
the second generating unit, when performing mask processing on each third facial image in the third facial image sequence to obtain second facial images each containing only the specified facial organ and forming the second facial image sequence, is specifically configured to: for each type of micro-expression, select, from the pixel points at which all the facial organs labeled in each third facial image are located, the pixel points at which the specified facial organ corresponding to that type of micro-expression is located, and obtain a union of the pixel points selected from all the third facial images as a micro-expression mask map corresponding to that type of micro-expression; and perform a mask operation on each frame image in the third facial image sequence using the micro-expression mask maps corresponding to the N types of micro-expressions respectively, to obtain N second facial image sequences in one-to-one correspondence with the N types of micro-expressions; wherein the second facial images in the second facial image sequence corresponding to each micro-expression contain only the specified facial organ corresponding to that micro-expression.
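A rough sketch of the per-micro-expression masking in claim 11; the mapping from micro-expression type to the organ labels it depends on, and the label values themselves, are placeholder assumptions:

```python
import numpy as np

# Hypothetical organ-label subsets per micro-expression type.
MICRO_EXPRESSION_ORGANS = {
    "brow_raise": {1},   # brow label
    "lip_press": {4},    # mouth label
}

def build_micro_expression_sequences(third_face_seq, label_maps):
    sequences = {}
    for name, organ_labels in MICRO_EXPRESSION_ORGANS.items():
        mask = np.zeros(label_maps[0].shape, dtype=bool)
        for labels in label_maps:                        # union over all frames
            mask |= np.isin(labels, list(organ_labels))
        # One masked sequence per micro-expression, keeping only its organs.
        sequences[name] = [img * mask[..., None] for img in third_face_seq]
    return sequences
```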
12. The apparatus according to claim 7, wherein the N types of micro-expressions are divided into three micro-expression groups: a first micro-expression group, a second micro-expression group and a third micro-expression group;
the recognition unit, when combining the micro-expression intensities in the first intensity set and the micro-expression intensities in the second intensity set to obtain the final micro-expression intensities corresponding to the N types of micro-expressions, is specifically configured to: for each type of micro-expression in the first micro-expression group, select the micro-expression intensity corresponding to that type of micro-expression from the first intensity set; for each type of micro-expression in the second micro-expression group, select the micro-expression intensity corresponding to that type of micro-expression from the second intensity set; for each type of micro-expression in the third micro-expression group, detect whether the micro-expression intensity of that type of micro-expression in the first intensity set is greater than or equal to a preset threshold corresponding to that type of micro-expression; if so, select the micro-expression intensity corresponding to that type of micro-expression from the second intensity set; if not, determine that the micro-expression intensity is 0; and combine the micro-expression intensities corresponding to the micro-expressions in the first micro-expression group, the second micro-expression group and the third micro-expression group to obtain the final micro-expression intensities corresponding to the N types of micro-expressions.
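A minimal sketch of the intensity-merging rule in claim 12; the grouping of micro-expression types and the per-type thresholds are placeholder inputs, and all arguments are dictionaries keyed by micro-expression type:

```python
def merge_intensities(first_set, second_set, group1, group2, group3, thresholds):
    merged = {}
    for t in group1:   # take the optical-flow branch intensity as-is
        merged[t] = first_set[t]
    for t in group2:   # take the RGB branch intensity as-is
        merged[t] = second_set[t]
    for t in group3:   # gated: keep the RGB intensity only if the flow intensity passes the threshold
        merged[t] = second_set[t] if first_set[t] >= thresholds[t] else 0.0
    return merged
```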
13. An electronic device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor to cause the method of any one of claims 1 to 6 to be performed.
14. A machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to perform the method of any one of claims 1 to 6.
CN201811582833.8A 2018-12-24 2018-12-24 Human body stress information identification method and device and electronic equipment Active CN111353354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811582833.8A CN111353354B (en) 2018-12-24 2018-12-24 Human body stress information identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111353354A CN111353354A (en) 2020-06-30
CN111353354B (en) 2024-01-23

Family

ID=71192031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811582833.8A Active CN111353354B (en) 2018-12-24 2018-12-24 Human body stress information identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111353354B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023002636A1 (en) * 2021-07-21 2023-01-26 株式会社ライフクエスト Stress assessment device, stress assessment method, and program

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881660A (en) * 2015-06-17 2015-09-02 吉林纪元时空动漫游戏科技股份有限公司 Facial expression recognition and interaction method based on GPU acceleration
CN105139438A (en) * 2014-09-19 2015-12-09 电子科技大学 Video face cartoon animation generation method
CN105141885A (en) * 2014-05-26 2015-12-09 杭州海康威视数字技术股份有限公司 Method for video monitoring and device
CN105913046A (en) * 2016-05-06 2016-08-31 姜振宇 Micro-expression identification device and method
CN106980811A (en) * 2016-10-21 2017-07-25 商汤集团有限公司 Facial expression recognizing method and expression recognition device
CN107358206A (en) * 2017-07-13 2017-11-17 山东大学 Micro- expression detection method that a kind of Optical-flow Feature vector modulus value and angle based on area-of-interest combine
CN107480622A (en) * 2017-08-07 2017-12-15 深圳市科迈爱康科技有限公司 Micro- expression recognition method, device and storage medium
CN107704834A (en) * 2017-10-13 2018-02-16 上海壹账通金融科技有限公司 Householder method, device and storage medium are examined in micro- expression face
CN108090408A (en) * 2016-11-21 2018-05-29 三星电子株式会社 For performing the method and apparatus of Facial expression recognition and training
CN109034126A (en) * 2018-08-31 2018-12-18 上海理工大学 A kind of micro- expression recognition method based on light stream principal direction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI430185B (en) * 2010-06-17 2014-03-11 Inst Information Industry Facial expression recognition systems and methods and computer program products thereof
WO2015186519A1 (en) * 2014-06-06 2015-12-10 シャープ株式会社 Image processing device and image display device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Research on Facial Expression Recognition Based on MHMMs; Tang Lijun; Zou Beiji; Wang Lei; Peng Xiaoning; Li Yan; Journal of Engineering Graphics, No. 05; pp. 56-61 *
Micro-expression Recognition Based on Global Optical Flow Features; Zhang Xuange; Tian Yantao; Yan Fei; Wang Meiqian; Pattern Recognition and Artificial Intelligence, No. 08; full text *
Micro-expression Recognition Based on Differential Localization and Optical Flow Feature Extraction; Xu Gang; Zhao Zhongyuan; Tan Yuanpeng; Computer Applications and Software, No. 01; full text *
Tang Lijun; Zou Beiji; Wang Lei; Peng Xiaoning; Li Yan. Research on Facial Expression Recognition Based on MHMMs. Journal of Engineering Graphics, 2007, No. 05, pp. 56-61. *

Similar Documents

Publication Publication Date Title
CN110569795B (en) Image identification method and device and related equipment
Barsoum et al. Hp-gan: Probabilistic 3d human motion prediction via gan
CN106068514B (en) System and method for identifying face in free media
CN109543526B (en) True and false facial paralysis recognition system based on depth difference characteristics
KR100969298B1 (en) Method For Social Network Analysis Based On Face Recognition In An Image or Image Sequences
CN109902565B (en) Multi-feature fusion human behavior recognition method
Kumar et al. Indian sign language recognition using graph matching on 3D motion captured signs
Chavan et al. Real time emotion recognition through facial expressions for desktop devices
CN108256567B (en) Target identification method and system based on deep learning
CN110546644A (en) Recognition device, recognition method, and recognition program
CN111783748A (en) Face recognition method and device, electronic equipment and storage medium
US10733467B2 (en) Information processing device, image processing system, image processing method, and program storage medium
CN109063626A (en) Dynamic human face recognition methods and device
CN111028319A (en) Three-dimensional non-photorealistic expression generation method based on facial motion unit
CN111353385B (en) Pedestrian re-identification method and device based on mask alignment and attention mechanism
Rao et al. Facial expression recognition with multiscale graph convolutional networks
CN111353354B (en) Human body stress information identification method and device and electronic equipment
Yao et al. Micro-expression recognition by feature points tracking
JP2013003706A (en) Facial-expression recognition device, method, and program
CN112329663B (en) Micro-expression time detection method and device based on face image sequence
CN116311472B (en) Micro-expression recognition method and device based on multi-level graph convolution network
Rosiani et al. Micro Expression: Comparison of Speed and Marking Accuracy in Facial Component Detection
KR101145672B1 (en) A smile analysis system for smile self-training
CN112200236A (en) Training method of face parameter recognition model and face parameter recognition method
Kondo et al. Siamese-structure deep neural network recognizing changes in facial expression according to the degree of smiling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant