CN115984930A - Micro expression recognition method and device and micro expression recognition model training method - Google Patents

Micro expression recognition method and device and micro expression recognition model training method

Info

Publication number
CN115984930A
Authority
CN
China
Prior art keywords: image, feature map, expression recognition, image blocks, micro
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211681160.8A
Other languages
Chinese (zh)
Inventor
陶江龙
胡治满
于亚洲
陶和平
闫帅
张艺严
申润业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202211681160.8A priority Critical patent/CN115984930A/en
Publication of CN115984930A publication Critical patent/CN115984930A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a micro-expression recognition method and device, a micro-expression recognition model training method, an electronic device and a storage medium, wherein the micro-expression recognition method comprises the following steps: acquiring a plurality of image blocks of an image to be recognized; respectively extracting self-attention features of the image blocks based on a multi-head self-attention mechanism to obtain feature maps of the image blocks; respectively learning attention weights of the feature maps based on two channels to adjust the feature maps and obtain a target feature map; and carrying out micro-expression recognition on the image to be recognized based on the target feature map. According to the technical scheme, the micro-expression recognition method and device can achieve accurate recognition of micro-expressions.

Description

Micro expression recognition method and device and micro expression recognition model training method
Technical Field
The application belongs to the field of artificial intelligence, and particularly relates to a micro expression recognition method and device, a micro expression recognition model training method, electronic equipment and a storage medium.
Background
Because micro-expressions last only a short time and involve low-intensity muscle movements, they pose a significant challenge to automatic recognition technology. Traditional micro-expression recognition methods are generally based on hand-crafted features such as local binary patterns, optical flow histograms and gradient histograms to realize micro-expression analysis, but these methods rely excessively on prior knowledge, most of the extracted information remains superficial, and abstract features representing micro-expressions are lacking.
In recent years, methods based on convolutional neural networks have been widely applied to the automatic recognition of facial micro-expressions, but these methods require massive data to train the models, while micro-expression data are often relatively scarce, so micro-expressions cannot be recognized accurately; in addition, the global modeling capability of convolutional neural networks is weak, so micro-expression changes cannot be perceived from global facial muscle movements.
Disclosure of Invention
In order to solve the technical problem, embodiments of the present application provide a micro expression recognition method, a micro expression recognition device, a micro expression recognition model training method, an electronic device, and a computer-readable storage medium.
According to an aspect of an embodiment of the present application, there is provided a micro-expression recognition method, including: acquiring a plurality of image blocks of an image to be identified; respectively extracting self-attention features of the image blocks based on a multi-head self-attention mechanism to obtain feature maps of the image blocks; respectively learning the attention weight of the feature map based on two channels to adjust the feature map to obtain a target feature map; and carrying out micro-expression recognition on the image to be recognized based on the target feature map.
In an embodiment, the extracting self-attention features of the plurality of image blocks respectively based on the multi-head self-attention mechanism to obtain feature maps of the plurality of image blocks includes:
dividing the plurality of image blocks into different image block sets;
respectively extracting set self-attention features of all image block sets aiming at different image block sets;
and performing feature splicing on the set self-attention features of each image block set to obtain the feature maps of the plurality of image blocks.
In one embodiment, the image blocks comprise a first set containing face key points and a second set not containing face key points; the learning of the attention weight of the feature map based on the two channels respectively to adjust the feature map to obtain a target feature map includes:
respectively learning the feature map of the first set and the feature map of the second set based on two channels, and correspondingly obtaining a first attention weight and a second attention weight;
adjusting the feature maps of the first set based on the first attention weight and adjusting the feature maps of the second set based on the second attention weight;
and splicing the adjusted feature map of the first set and the adjusted feature map of the second set to obtain the target feature map.
In an embodiment, before learning the feature map of the first set and the feature map of the second set based on two channels respectively to obtain the corresponding first attention weight and second attention weight, the method further includes:
positioning key points of the human face in the image to be recognized;
and taking the image blocks containing the face key points in the plurality of image blocks as the first set, and taking the image blocks not containing the face key points in the plurality of image blocks as the second set.
In an embodiment, the locating the key points of the face in the image to be recognized includes:
acquiring an initial face key point in the image to be recognized;
removing key points related to the face contour from the initial face key points to obtain first key points;
based on the first key point, positioning the position of the cheek in the image to be identified to obtain a second key point;
and taking the first key point and the second key point as the key points of the human face.
In an embodiment, the locating, based on the first keypoint, a position of a cheek in the image to be recognized, and obtaining a second keypoint includes:
selecting a target key point set from the first key points;
calculating central points among the key points in the target key point set;
and applying a fixed offset to the central point, and taking both the offset central point and the original central point as second key points.
According to an aspect of an embodiment of the present application, there is provided a training method of a micro expression recognition model, including: inputting an image to be trained into an initial micro expression recognition model, carrying out random mask inactivation treatment on a plurality of training image blocks of the image to be trained in the initial micro expression recognition model, obtaining training feature maps of the training image blocks subjected to the random mask inactivation treatment based on a multi-head self-attention mechanism, adjusting the training feature maps based on two channels to obtain a target training feature map, and obtaining a training prediction result based on the target training feature map; and training the initial micro-expression recognition model according to a prediction result output by a pre-trained teacher model aiming at the image to be trained and the training prediction result.
According to an aspect of an embodiment of the present application, there is provided a micro expression recognition apparatus including: the image block acquisition module is configured to acquire a plurality of image blocks of an image to be identified; the feature map acquisition module is configured to extract self-attention features of the image blocks respectively based on a multi-head self-attention mechanism to obtain feature maps of the image blocks; the target characteristic diagram module is configured to learn attention weights of the characteristic diagrams respectively based on two channels so as to adjust the characteristic diagrams to obtain target characteristic diagrams; and the micro expression recognition module is configured to perform micro expression recognition on the image to be recognized based on the target feature map.
According to an aspect of an embodiment of the present application, there is provided an electronic device including one or more processors; a storage device for storing one or more computer programs that, when executed by the one or more processors, cause the electronic device to implement the micro expression recognition method or the training method of the micro expression recognition model as described above.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer-readable instructions, which, when executed by a processor of a computer, cause the computer to execute the micro expression recognition method or the training method of the micro expression recognition model as described above.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the micro expression recognition method or the training method of the micro expression recognition model provided in the above-mentioned various alternative embodiments.
In the technical scheme provided by the embodiments of the application, the multi-head self-attention mechanism deeply mines the features of the image to be recognized, and the importance of different channels of the feature vector is learned adaptively through the binary channel sensing unit, so that micro-expression recognition is achieved accurately.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a schematic illustration of an implementation environment to which the present application relates;
FIG. 2 is a flow diagram of a micro expression recognition method shown in an exemplary embodiment of the present application;
FIG. 3 is a block diagram of a micro expression recognition model shown in an exemplary embodiment of the present application;
FIG. 4 is a flowchart of step S230 of the embodiment shown in FIG. 2 in an exemplary embodiment;
FIG. 5 is a flowchart of step S250 of the embodiment shown in FIG. 2 in an exemplary embodiment;
FIG. 6 is a block diagram of a binary channel sensing unit shown in an exemplary embodiment of the present application;
FIG. 7 is a flow diagram of a micro expression recognition method in accordance with another exemplary embodiment of the present application;
FIG. 8 is a flowchart illustrating a method for training a micro-expression recognition model in accordance with an exemplary embodiment of the present application;
FIG. 9 is a flow chart of a method for training a micro-expression recognition model in accordance with another exemplary embodiment of the present application;
FIG. 10 is a diagram of a hard distillation process shown in an exemplary embodiment of the present application;
fig. 11 is a schematic structural diagram of a micro expression recognition apparatus according to an exemplary embodiment of the present application;
FIG. 12 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flowcharts shown in the figures are illustrative only and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It should also be noted that: reference to "a plurality" in this application means two or more. "and/or" describe the association relationship of the associated objects, meaning that there may be three relationships, e.g., A and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The existing micro-expression recognition methods have the following advantages and disadvantages.
Advantages of recognizing micro-expressions based on traditional methods:
Features such as image texture and edges are extracted by a manually designed feature extractor; the approach is easy to implement and the model is highly interpretable.
Disadvantages of recognizing micro-expressions based on traditional methods:
They rely excessively on prior knowledge, most of the extracted information remains superficial, and abstract features representing micro-expressions are lacking; complex experimental design and tedious parameter tuning are required to obtain an ideal model.
Advantages of recognizing micro-expressions based on neural networks:
The powerful feature-extraction capability of neural networks can effectively extract facial features, and micro-expressions can be recognized in a fully automated way.
Disadvantages of recognizing micro-expressions based on neural networks:
The model relies on massive amounts of micro-expression training data, the global relationship modeling capability of convolutional neural networks is weak, and micro-expression changes cannot be perceived from global facial muscle movements.
The method and the device for recognizing micro expressions, the method for training the micro expression recognition model, the electronic device and the storage medium provided by the embodiment of the application will be described in detail below.
Referring first to fig. 1, fig. 1 is a schematic diagram of an implementation environment related to the present application. The implementation environment includes a terminal 100 and a server 200, and communication between the terminal 100 and the server 200 is performed through a wired or wireless network.
The terminal 100 is configured to receive an image to be recognized, where the image to be recognized should include a face image of a person, so as to perform micro-expression recognition based on the face image; the image to be recognized may be an image frame in a segment of video.
The terminal 100 further sends the image to be recognized to the server 200, and a pre-trained micro-expression recognition model is arranged in the server 200, so that the micro-expression recognition model in the server 200 recognizes the micro-expression of the person in the image to be recognized to obtain a recognition result, which can finally be visually displayed through a display module of the terminal 100.
For example, after receiving the image to be recognized, the terminal 100 sends the image to be recognized to the server 200; after receiving the image to be recognized, the server 200 acquires a plurality of image blocks of the image to be recognized; respectively extracting self-attention features of the image blocks based on a multi-head self-attention mechanism to obtain feature maps of the image blocks; respectively learning the attention weight of the feature map based on two channels to adjust the feature map to obtain a target feature map; and carrying out micro-expression recognition on the image to be recognized based on the target characteristic graph.
The terminal 100 may be any electronic device capable of implementing data visualization, such as a smart phone, a tablet, a notebook, and a computer, and is not limited in this respect. The server 200 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, where the plurality of servers may form a block chain, and the server is a node on the block chain, and the server 200 may also be a cloud server providing basic cloud computing services such as cloud service, cloud database, cloud computing, cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (Content Delivery Network ), big data, and artificial intelligence platform, which is not limited herein.
Of course, the micro-expression recognition method proposed in this embodiment can also be implemented in the terminal 100 alone.
FIG. 2 is a flow chart illustrating a micro-expression recognition method according to an example embodiment. The micro expression recognition method can be applied to the implementation environment shown in fig. 1 and specifically executed by the server 200 in the implementation environment, it should be understood that the method can also be applied to other exemplary implementation environments and specifically executed by devices in other implementation environments, and the embodiment does not limit the implementation environment to which the method is applied.
As shown in fig. 2, in an exemplary embodiment, the method may include steps S210 to S270, which are described in detail as follows:
step S210: a plurality of image blocks of an image to be recognized are acquired.
The micro-expression recognition in this embodiment is completed in a pre-trained micro-expression recognition model, whose structure can refer to fig. 3; it includes a preprocessing module, a Transformer module (a machine learning model) and a prediction module, where the Transformer module further includes a multi-head self-attention unit and a binary channel sensing unit.
In a specific embodiment, an image to be recognized enters a preprocessing module, and the preprocessing module divides the image to be recognized to obtain a plurality of image blocks.
Step S230: based on a multi-head self-attention mechanism, self-attention features of the image blocks are respectively extracted to obtain feature maps of the image blocks.
The plurality of image blocks arrive at the Transformer module. First, the multi-head self-attention unit encodes the image blocks to obtain a two-dimensional image block sequence, then two learnable vectors x_class and x_distill are introduced and spliced with the two-dimensional image block sequence to obtain a spliced two-dimensional image block sequence, and the micro-expression recognition model subsequently performs micro-expression recognition according to x_class and x_distill.
The multi-head self-attention unit comprises a plurality of heads, the image blocks in the spliced two-dimensional image block sequence are sequentially assigned with different heads, self-attention feature extraction is respectively carried out, namely each head can obtain a corresponding feature sequence, and then the feature sequences obtained by calculation of each head are spliced and subjected to linear mapping to obtain a feature map with the same size as the input (namely the spliced two-dimensional image block sequence).
In this embodiment, the multiple heads in the multiple head self-attention unit may be regarded as multiple channels, that is, the two-dimensional image block sequence is split and spliced, the multiple image blocks obtained after the splitting are respectively sent to different channels to perform self-attention feature acquisition, and finally the feature sequences of each channel are spliced, so as to obtain a feature map of the spliced two-dimensional image block sequence.
Step S250: and respectively learning the attention weight of the feature map based on the two channels so as to adjust the feature map to obtain a target feature map.
Of course, some further unit structures (not shown in fig. 3) exist in the Transformer module, such as an LN layer (for the normalization operation) and a residual connection unit: after the multi-head self-attention unit outputs the feature map, the LN layer normalizes the output feature map, and the residual connection unit splices the output feature map with the data input into the multi-head self-attention unit and sends the spliced feature map to the binary channel sensing unit.
In this embodiment, the image blocks of the image to be recognized include a first set including image blocks with face key points and a second set including image blocks without face key points, and the binary channel sensing unit adaptively learns the channel attention weights of the input feature sequences from the first set and the second set, respectively.
That is, the binary channel sensing unit includes two channels, one channel learns the feature map corresponding to the second set, and the other channel learns the features corresponding to the first set, where the feature map corresponding to the first set is the feature map obtained by the image block including the face keypoint through the steps shown in step S230.
Thus, the two channels correspondingly output a first attention weight and a second attention weight, at this time, the feature map of the first set is adjusted by the first attention weight, the feature map of the second set is adjusted based on the second attention weight, and the adjusted feature map of the first set and the adjusted feature map of the second set are restored according to the position when the feature maps are input into the binary channel sensing unit, so as to obtain a target feature map with the same size as the input binary channel sensing unit.
Step S270: and carrying out micro expression recognition on the image to be recognized based on the target characteristic diagram.
The processing is the same as after the multi-head self-attention unit outputs its feature map: after the binary channel sensing unit outputs the target feature map of the same size, the LN layer normalizes the output target feature map, and the residual connection unit splices the output target feature map with the data input into the binary channel sensing unit and sends the spliced target feature map to the prediction unit for micro-expression recognition. At this point the vectors x_class and x_distill in the target feature map have been learned, i.e., micro-expression recognition can be performed through the vectors x_class and x_distill in the target feature map.
In this embodiment, the prediction unit may be an MLP Head classifier to identify the micro-expression through the MLP Head.
In the embodiment, a multi-head self-attention mechanism is arranged to deeply mine the characteristics of the image to be recognized, so that richer characteristic data are provided for the subsequent micro expression recognition; on the other hand, the image blocks are divided into different sets based on the key points of the face, the importance of different channels of the feature vectors is learned in a self-adaptive mode through the binary channel sensing unit based on the different sets, and then the feature vectors are balanced again according to importance values, so that the model can adapt to various micro-expression recognition scenes, and the effect of accurately recognizing the change of the micro-expression of the face is achieved.
The micro-expression recognition method in this embodiment can be applied to scenes such as interview communication, online education and fatigue driving: it can capture, in real time, the "micro-expressions" a recognized person shows unconsciously and perceive that person's true feelings and inner emotional conflicts, thereby helping an observer or a system to take effective intervention measures or improve communication skills. For example, in a fatigue-driving scene, detecting the driver's facial micro-expressions in real time makes it possible to accurately judge whether abnormal driving states such as fatigue or distraction exist, reducing traffic accidents and accelerating the construction of smart, safe cities.
Fig. 4 is a flow chart in an exemplary embodiment for step S230 of the embodiment shown in fig. 2. As shown in fig. 4, in an exemplary embodiment, the step S230 extracts the self-attention features of the plurality of image blocks respectively based on the multi-head self-attention mechanism, and the process of obtaining the feature maps of the plurality of image blocks may include steps S410 to S450, which are described in detail as follows:
step S410: the plurality of image blocks is divided into different sets of image blocks.
In this embodiment, for an input image to be recognized of size x ∈ R^{224×224×3}, the image can be divided into 196 image blocks and encoded into x_p ∈ R^{196×768}, where 196 is the number of image blocks and 768 is the length of each one-dimensional image block vector. Two learnable vectors x_class ∈ R^{768} and x_distill ∈ R^{768} are then introduced and spliced with the image block sequence to obtain x_p ∈ R^{198×768} as the input to the multi-head self-attention unit.
For x_p ∈ R^{198×768}, if there are 8 heads, 8 image block sets are obtained and distributed in head order. For example, the image block set corresponding to the first head is x_1 ∈ R^{198×96} and the image block set corresponding to the second head is x_2 ∈ R^{198×96}, where the 96 dimensions of the second head are the 96 units immediately following those of the first head; in this way the image block set corresponding to each head can be allocated.
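For illustration, the following is a minimal PyTorch-style sketch of this preprocessing. It is not taken from the patent itself: the 224 × 224 input, the 196 blocks, the x_class / x_distill tokens and the 8-head split follow the description above, while the convolutional patch embedding and all other identifiers are illustrative assumptions.

import torch
import torch.nn as nn

# Patch embedding: 224/16 = 14, so 14*14 = 196 image blocks of dimension 768 (assumed ViT-style).
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
x = torch.randn(1, 3, 224, 224)                               # image to be recognized
x_p = patch_embed(x).flatten(2).transpose(1, 2)               # (1, 196, 768)

x_class = nn.Parameter(torch.zeros(1, 1, 768))                # learnable class vector
x_distill = nn.Parameter(torch.zeros(1, 1, 768))              # learnable distillation vector
x_p = torch.cat([x_class, x_distill, x_p], dim=1)             # (1, 198, 768), input to the unit

# Split the 768 channels into 8 heads of 96 dimensions each, in head order.
heads = x_p.reshape(1, 198, 8, 96).permute(0, 2, 1, 3)        # (1, 8, 198, 96)
x_1, x_2 = heads[:, 0], heads[:, 1]                           # sets for the first and second heads
print(x_1.shape)                                               # torch.Size([1, 198, 96])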
Step S430: and respectively extracting set self-attention features of all image block sets aiming at different image block sets.
And each head processes the corresponding image block set to obtain the corresponding set self-attention feature.
In a specific embodiment, each head comprises three fully-connected layers, and q, k, v are calculated by the three fully-connected layers from the input image block set. For example, for the first head the calculation is [q, k, v] = x_1 · (W_q, W_k, W_v), where W_q, W_k, W_v are the weights of the 3 fully-connected layers. The set self-attention feature of the image block set is then calculated from q, k, v:
O_h = softmax(q · k^T / √d_k) · v
where d_k is the dimension of the k sequence (a constant) and O_h is the set self-attention feature of the image block set.
Step S450: and performing feature splicing on the set of each image block set from the attention features to obtain a feature map of a plurality of image blocks.
In this embodiment, the feature sequences (set self-attention features) calculated by each head are spliced and then linearly mapped to obtain a feature map O_final of the same size as the input:
O_final = concat(O_h1, O_h2, ..., O_h8) · W_H
where W_H is the weight of the fully-connected layer and O_h1 is the set self-attention feature output by the first head.
Of course, 8 heads are given above as an example; in other embodiments, other numbers of heads may be used to extract the self-attention features of the image blocks, and no particular limitation is made here.
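A minimal sketch of steps S430 and S450 is given below, assuming 8 heads of 96 dimensions as in the example above; modelling W_q, W_k, W_v and W_H as nn.Linear layers is an implementation assumption, not something stated in the description.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)    # W_q, W_k, W_v for all heads at once
        self.proj = nn.Linear(dim, dim)       # W_H, the final fully-connected layer

    def forward(self, x):                     # x: (B, 198, 768)
        B, N, C = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim) \
                             .permute(2, 0, 3, 1, 4)           # each (B, 8, 198, 96)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        o_h = attn @ v                                          # set self-attention feature per head
        o_final = o_h.transpose(1, 2).reshape(B, N, C)          # concat(O_h1, ..., O_h8)
        return self.proj(o_final)                               # same size as the input

x_p = torch.randn(1, 198, 768)
print(MultiHeadSelfAttention()(x_p).shape)                      # torch.Size([1, 198, 768])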
This embodiment provides a way of extracting self-attention features based on a multi-head self-attention mechanism, which directly performs global relationship modeling on all image blocks, prompts the model to extract micro-expression features from the difference-information subspaces of different image blocks, and extracts the features of the image blocks separately through multiple heads, so as to obtain richer and more accurate features and provide reference data for subsequent accurate recognition of micro-expressions.
Fig. 5 is a flow chart in an exemplary embodiment for step S250 of the embodiment shown in fig. 2. As shown in fig. 5, in an exemplary embodiment, the plurality of image blocks include a first set containing face keypoints and a second set not containing face keypoints; step S250 learns the attention weights of the feature maps based on two channels respectively to adjust the feature maps, and the process of obtaining the target feature map may include steps S510 to S550, which are described in detail as follows:
step S510: and respectively learning the feature map of the first set and the feature map of the second set based on two channels, and correspondingly obtaining a first attention weight and a second attention weight.
For the plurality of image blocks, the image blocks can be divided into a first set and a second set according to the face key points; the binary channel sensing unit adaptively learns the channel attention weights of the input feature map for the first set and the second set respectively, and then dynamically adjusts the values in the feature map, which can refer to fig. 6. For x_p ∈ R^{198×768}, the first set may be x_1 ∈ R^{N1×768} and the second set may be x_2 ∈ R^{N2×768}.
The dropout probabilities of the first set and the second set are different (dropout means that some neurons are disabled with a certain probability): the dropout probability of the first set is 0.2 and that of the second set is 0.1. The attention weight is calculated as follows:
z_h = sigmoid(conv_2(dropout(Gelu(conv_1(z_in)))))
where z_h is the attention weight, conv denotes convolution, Gelu is the activation function, and z_in is the feature map of the first set or the feature map of the second set.
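A minimal sketch of this attention-weight branch follows. Only the conv-Gelu-dropout-conv-sigmoid order and the dropout probabilities 0.2 / 0.1 come from the description above; treating the convolutions as 1×1 convolutions over the 768 feature channels with a squeeze ratio of 4 is an illustrative assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionBranch(nn.Module):
    # z_h = sigmoid(conv2(dropout(Gelu(conv1(z_in)))))
    def __init__(self, dim=768, p_drop=0.2, ratio=4):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim // ratio, kernel_size=1)
        self.conv2 = nn.Conv1d(dim // ratio, dim, kernel_size=1)
        self.drop = nn.Dropout(p_drop)

    def forward(self, z_in):                        # z_in: (B, N, 768)
        z = z_in.transpose(1, 2)                    # (B, 768, N) so Conv1d acts on channels
        z = self.conv2(self.drop(F.gelu(self.conv1(z))))
        return torch.sigmoid(z).transpose(1, 2)     # z_h: (B, N, 768)

branch_first = ChannelAttentionBranch(p_drop=0.2)    # channel for the first set
branch_second = ChannelAttentionBranch(p_drop=0.1)   # channel for the second set
z_h = branch_first(torch.randn(1, 57, 768))          # e.g. N1 = 57 blocks with key points
print(z_h.shape)                                      # torch.Size([1, 57, 768])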
Step S530: the feature maps of the first set are adjusted based on the first attention weight, and the feature maps of the second set are adjusted based on the second attention weight.
After obtaining the attention weight, the value of the corresponding position of the feature map may be adjusted based on the attention weight:
z_out = z_h × z_in + z_in
where z_out is the adjusted feature map of the first set or the adjusted feature map of the second set.
Step S550: and splicing the adjusted feature map of the first set and the adjusted feature map of the second set to obtain a target feature map.
In this embodiment, the adjusted feature map of the first set and the adjusted feature map of the second set are merged and fused (merge), that is, restored to a target feature map having the same size as the data input into the binary channel sensing unit:
z_OUT = merge(z_out_1, z_out_2)
where z_OUT is the target feature map, and z_out_1 and z_out_2 are respectively the adjusted feature map of the first set and the adjusted feature map of the second set.
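Putting steps S510 to S550 together, a minimal sketch of the binary channel sensing unit is given below. The boolean key-point mask and all identifiers are illustrative assumptions; in the usage example, torch.sigmoid merely stands in for the two ChannelAttentionBranch channels sketched above.

import torch

def binary_channel_sensing(z, keypoint_mask, branch_first, branch_second):
    """z: (B, N, 768) feature map; keypoint_mask: (N,) bool, True where the
    image block contains a face key point (the first set)."""
    z_out = z.clone()
    z1 = z[:, keypoint_mask]                       # feature map of the first set
    z2 = z[:, ~keypoint_mask]                      # feature map of the second set
    z_out[:, keypoint_mask] = branch_first(z1) * z1 + z1     # z_out = z_h * z_in + z_in
    z_out[:, ~keypoint_mask] = branch_second(z2) * z2 + z2
    return z_out                                    # merged back to the original positions

mask = torch.zeros(198, dtype=torch.bool)
mask[2:60] = True                                   # e.g. 58 blocks contain key points
z_target = binary_channel_sensing(torch.randn(1, 198, 768), mask,
                                  torch.sigmoid, torch.sigmoid)
print(z_target.shape)                               # torch.Size([1, 198, 768])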
In this embodiment, the importance of different channels of the feature map is adaptively learned through the binary channel sensing unit, and then the feature map is re-weighted according to the importance value, so that an accurate prediction result is obtained subsequently.
FIG. 7 is a flowchart illustrating a micro expression recognition method according to another exemplary embodiment. The method may run before step S510 of fig. 5, and specifically, the method may be completed in step S210 of fig. 2, that is, in a preprocessing module, and the process may include steps S710 to S730, which are described in detail as follows:
step S710: and positioning the key points of the human face in the image to be recognized.
In this embodiment, the first set and the second set are determined in the preprocessing module.
Specifically, 68 initial face key points are first located on the image to be recognized through 2D-FAN (based on human body posture estimation architecture), 18 key points of the face contour are discarded, and the remaining 50 initial face key points, namely 50 first key points, are retained.
In order to effectively represent muscle changes in the cheek region, four key points need to be added to calibrate the positions of the cheeks. These key points can be obtained by manual calibration or calculated from the first key points: for example, a target key point set is selected from the first key points; the central point between the key points in the target key point set is calculated to obtain one key point on the left face and one on the right face respectively; and a fixed offset is applied to the left/right central point to obtain two further key points.
For example, in a specific embodiment, a certain key point on the eyebrow bow and a certain key point on the lip are respectively selected on the left face and the right face, so as to obtain a target key point set, where the target key point set includes 4 key points, 2 key points are on the left face in the image to be recognized, and the other two key points are on the right face in the image to be recognized.
For the 2 key points of the left face, one is on the left eyebrow arch and the other is on the left lip; for example, the eyebrow key point may be the second key point along the eyebrow arch in the image to be recognized and the lip key point may be the first key point of the left lower lip. The central point is then calculated from these two left-face key points, giving the key point that calibrates the position of the left cheek.
For the 2 key points of the right face, the process of acquiring the key point of the position where the cheek is located on the left face may also be referred to, and thus, the key point of the position where one cheek is located on the right face is also obtained.
Then, according to the fixed offsets of the left and right mouth corner points, (x, y)_left = (x_left - 16, y_left - 16) and (x, y)_right = (x_right + 16, y_right + 16), the other two points are calculated, giving 4 key points in total, where (x, y)_left are the coordinates of the left-face key point obtained by applying a fixed offset to the left-face key point that calibrates the cheek position, x_left and y_left are the coordinates of the left-face key point that calibrates the cheek position, (x, y)_right are the coordinates of the right-face key point obtained by applying a fixed offset to the right-face key point that calibrates the cheek position, and x_right and y_right are the coordinates of the right-face key point that calibrates the cheek position.
The obtained 4 key points and 50 first key points are collectively called as face key points in the image to be recognized.
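A minimal NumPy sketch of this cheek key-point construction follows. Only the midpoint-plus-(±16, ±16)-offset logic comes from the description above; the particular eyebrow/lip indices and function names are illustrative assumptions.

import numpy as np

def add_cheek_keypoints(first_keypoints, left_pair=(19, 31), right_pair=(24, 35)):
    """first_keypoints: (50, 2) array of the retained face key points.
    left_pair / right_pair: illustrative indices of an eyebrow-arch point and a
    lip point on the left and right face."""
    left_center = first_keypoints[list(left_pair)].mean(axis=0)     # left cheek calibration point
    right_center = first_keypoints[list(right_pair)].mean(axis=0)   # right cheek calibration point
    left_offset = left_center + np.array([-16.0, -16.0])            # (x_left - 16, y_left - 16)
    right_offset = right_center + np.array([16.0, 16.0])            # (x_right + 16, y_right + 16)
    cheek_points = np.stack([left_center, right_center, left_offset, right_offset])
    return np.concatenate([first_keypoints, cheek_points])          # 54 face key points in total

keypoints = np.random.rand(50, 2) * 224
print(add_cheek_keypoints(keypoints).shape)                          # (54, 2)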
Step S730: and taking image blocks which contain the face key points in the plurality of image blocks as a first set, and taking image blocks which do not contain the face key points in the plurality of image blocks as a second set.
For an input image to be recognized of size x ∈ R^{H×W×C}, where H, W and C are respectively the height, width and number of channels (generally 3) of the image, the image is divided into 16 × 16 image blocks to obtain 196 image blocks; the image blocks in which face key points are located are divided into the first set, and the other image blocks are divided into the second set.
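A minimal sketch of this division is shown below, assuming the 224 × 224 input and the 14 × 14 = 196 block grid described above; the identifiers are illustrative.

import numpy as np

def split_image_blocks(keypoints, image_size=224, patch_size=16):
    """keypoints: (K, 2) array of (x, y) face key points in pixel coordinates.
    Returns a boolean mask of length 196: True for the first set (the block
    contains a key point), False for the second set."""
    grid = image_size // patch_size                       # 14 blocks per side
    mask = np.zeros(grid * grid, dtype=bool)
    for x, y in keypoints:
        col = min(int(x) // patch_size, grid - 1)
        row = min(int(y) // patch_size, grid - 1)
        mask[row * grid + col] = True
    return mask

keypoints = np.random.rand(54, 2) * 224
first_set_mask = split_image_blocks(keypoints)
print(first_set_mask.sum(), "blocks in the first set,", (~first_set_mask).sum(), "in the second set")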
The embodiment of the invention provides a way of dividing the image blocks into a first set and a second set, separating the image blocks that contain face key points from those that do not, so that the image blocks can subsequently be processed through multiple channels, which improves the accuracy of subsequent micro-expression recognition.
Based on the micro expression recognition methods in fig. 2 to 7, fig. 8 is a flowchart illustrating a training method of a micro expression recognition model according to an exemplary embodiment. As shown in fig. 8, in an exemplary embodiment, the method may include steps S810 to S830, which are described in detail as follows:
step S810: inputting an image to be trained into an initial micro expression recognition model, carrying out random mask inactivation treatment on a plurality of training image blocks of the image to be trained in the initial micro expression recognition model, obtaining training feature maps of the training image blocks subjected to the random mask inactivation treatment based on a multi-head self-attention mechanism, adjusting the training feature maps based on two channels to obtain a target training feature map, and obtaining a training prediction result based on the target training feature map.
Referring to fig. 9, in the training method of the micro-expression recognition model in this embodiment, the image to be trained is first input into the initial micro-expression recognition model, and the initial micro-expression recognition model preprocesses the input data. Data augmentation: the input image is center-cropped to R^{224×224×3}, horizontally flipped, and transformed with ColorJitter (adjusting brightness, contrast, saturation and hue). Mixing augmentation: Mixup (an image-blending augmentation algorithm used in computer vision) and CutMix (patch overlay) are applied, with a probability of 0.8 for Mixup and 1.0 for CutMix. Data normalization: the mean μ and standard deviation σ of the dataset of images to be trained are computed, and the data are normalized using Z-Score.
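A minimal torchvision-style sketch of this preprocessing is given below. The ColorJitter parameter values and the mean/std placeholders are illustrative assumptions (the real statistics μ and σ come from the training set), and the mixup helper is a simplified stand-in for the Mixup/CutMix augmentations named above.

import torch
from torchvision import transforms

# Data augmentation + Z-score normalisation (mean/std here are placeholder statistics).
train_transform = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.25, 0.25, 0.25]),
])

def mixup(images, labels, alpha=0.8):
    """Simplified mixup: blend each image/label pair with a shuffled partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, labels, labels[perm], lam        # use lam to weight the two losses

images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 7, (8,))
mixed, y_a, y_b, lam = mixup(images, labels)
print(mixed.shape, float(lam))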
Similarly, in the preprocessing module of the initial micro-expression recognition model, a plurality of training image blocks of the image to be trained are also divided into a first training set containing face key points and a second training set without the face key points, and in the training stage, the preprocessing module further performs random mask inactivation processing on the first training set and the second training set respectively.
In the training stage, a mask inactivation mechanism is set up to address the over-fitting problem of the neural network on the micro-expression task. The problem takes two forms: first, the neural network tends to classify the micro-expression from a small number of salient regions; second, the feature extractors of the neural network are interdependent and interacting, and their generalization capability is poor.
Specifically, image blocks are randomly inactivated (inactivation means setting the pixels of those image blocks to 0) in the first training set and the second training set respectively, at a ratio of 1/8 each time, where the first training set inactivates image blocks at a high inactivation rate of 0.5 and the second training set inactivates image blocks at a low inactivation rate of 0.3. Inactivating different image blocks in different images forces the model to learn the whole facial features more comprehensively, instead of outputting a discrimination result that depends only on a local area.
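The sketch below illustrates one plausible reading of this step on token-level image blocks: the 0.5 / 0.3 inactivation rates and the 1/8-per-step ratio come from the description, while the exact interaction between the two (here implemented as a cap of N/8 inactivated blocks per sample) and all identifiers are assumptions.

import torch

def random_mask_inactivation(patches, first_set_mask, p_first=0.5, p_second=0.3):
    """patches: (B, N, D) training image block tokens; first_set_mask: (N,) bool.
    Zero out ("inactivate") randomly chosen blocks, with a higher rate for blocks
    that contain face key points."""
    B, N, _ = patches.shape
    rate = torch.where(first_set_mask, torch.tensor(p_first), torch.tensor(p_second))
    drop = torch.bernoulli(rate.repeat(B, 1)).bool()             # sample blocks to inactivate
    for b in range(B):                                            # cap at N // 8 blocks per sample
        idx = drop[b].nonzero().flatten()
        if idx.numel() > N // 8:
            keep = idx[torch.randperm(idx.numel())[: N // 8]]
            drop[b] = torch.zeros(N, dtype=torch.bool)
            drop[b, keep] = True
    return patches.masked_fill(drop.unsqueeze(-1), 0.0)

tokens = torch.randn(4, 196, 768)
mask = torch.zeros(196, dtype=torch.bool); mask[:60] = True       # first set: 60 blocks
print(random_mask_inactivation(tokens, mask).shape)               # torch.Size([4, 196, 768])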
Subsequently, the training image blocks subjected to the random mask deactivation process enter the Transformer module of the initial micro-expression recognition model, and the processing process in the Transformer module is the same as the process of the micro-expression recognition model in the practical application, which can be referred to specifically in fig. 2 to fig. 7.
At this time, the Transformer module outputs a target training feature map, which includes the learned vectors x_class and x_distill. Unlike the micro-expression recognition model in practical application, in this embodiment a knowledge distillation module is additionally arranged in the prediction module for model training, to ensure the training quality of the initial micro-expression recognition model.
Step S830: and training the initial micro-expression recognition model according to a prediction result output by the pre-trained teacher model aiming at the image to be trained and a training prediction result.
The self-attention mechanism in a self-attention network has no inductive bias capability, whereas a convolutional neural network naturally has strong inductive biases such as local similarity and translation invariance, which also allows the convolutional neural network to achieve better results with less data. In this embodiment, a knowledge distillation module is introduced so that the initial micro-expression recognition model learns the inductive bias capability of a pre-trained teacher model (RegNetY-16GF, a machine learning model), reducing the dependence on data volume; hard distillation can obtain better results in this task.
The distillation mechanism includes a soft distillation mechanism and a hard distillation mechanism, where soft distillation is achieved by minimizing the KL divergence between the softmax results of the teacher and student models. The calculation formula is:
L_global = (1 - λ) · L_CE(ψ(Z_s), y) + λ · τ² · KL(ψ(Z_s / τ), ψ(Z_t / τ))
where Z_s and Z_t respectively denote the prediction outputs of the student network and the teacher network, ψ denotes the softmax operation, τ denotes the distillation temperature, λ denotes the balance coefficient, KL denotes the Kullback-Leibler divergence loss, L_CE denotes the cross-entropy loss, and y is the true label.
In this embodiment, hard distillation is used in the training process; the hard distillation process can refer to fig. 10, and it directly uses the decision result of the teacher model as another true label. Its formula is as follows:
L_global^hardDistill = (1/2) · L_CE(ψ(Z_s), y) + (1/2) · L_CE(ψ(Z_s), y_t)
where y_t = argmax_c(Z_t(c)) is the decision result output by the teacher model, and y_t plays the same role as y.
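A minimal sketch of this hard-distillation objective follows: the class-token logits are compared with the true label and the distillation-token logits with the teacher's argmax decision. The 50/50 weighting follows the formula above; splitting the student output into two heads this way is an assumption based on the description of fig. 10.

import torch
import torch.nn.functional as F

def hard_distillation_loss(student_class_logits, student_distill_logits,
                           teacher_logits, y):
    """L = 1/2 * CE(student_class, y) + 1/2 * CE(student_distill, y_t),
    with y_t = argmax_c teacher_logits(c)."""
    y_t = teacher_logits.argmax(dim=1)                        # teacher's hard decision
    return 0.5 * F.cross_entropy(student_class_logits, y) \
         + 0.5 * F.cross_entropy(student_distill_logits, y_t)

B, num_classes = 8, 7                                          # 7 micro-expression classes
loss = hard_distillation_loss(torch.randn(B, num_classes), torch.randn(B, num_classes),
                              torch.randn(B, num_classes), torch.randint(0, num_classes, (B,)))
print(float(loss))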
In this embodiment, the student network is the initial micro-expression recognition model. Unlike the micro-expression recognition model in application, during distillation the prediction result output by the student network is not obtained from both vectors x_class and x_distill in the target training feature map, but from x_distill alone; the prediction result output by the teacher model is produced for the image to be trained.
In fig. 10, the patch tokens are 768-dimensional features obtained by cutting the image into 16 × 16 blocks, splicing them and encoding them through a linear layer; the class token and the distillation token are learnable embedded vectors with the same dimension as the patch tokens, where the class token is used to generate the discrimination layer whose loss function is finally computed against the true label, and the distillation token is used to generate the discrimination layer whose loss function is computed against the teacher network output. L_teacher is the loss function of the teacher model.
In this embodiment, the initial micro-expression recognition model is pre-trained on ImageNet-1K, the AdamW optimization algorithm is used, the training mini-batch size is 64, the number of iteration epochs is 200, and the initial learning rate is 0.0005; the learning rate decays with a cosine annealing strategy, decaying once every 30 epochs at a rate of 0.05. The loss function is the hard distillation loss L_global^hardDistill computed on the initial micro-expression recognition model. The loss is reduced through back propagation and iteration, and the inductive bias capability of the teacher network is transferred to the student network, i.e., the initial micro-expression recognition model; when the loss function reaches its minimum the prediction result is output, otherwise the weight matrix is updated iteratively, finally yielding the trained micro-expression recognition model.
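A minimal training-loop sketch with the hyper-parameters listed above (AdamW, batch size 64, 200 epochs, initial learning rate 0.0005) is given below. It reuses the hard_distillation_loss helper sketched earlier; the cosine-annealing schedule stepped once per epoch is only one possible reading of the decay description, and student_model / teacher_model / train_loader are assumed to exist.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_student(student_model, teacher_model, train_loader, epochs=200, device="cpu"):
    optimizer = AdamW(student_model.parameters(), lr=5e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    teacher_model.eval()
    for epoch in range(epochs):
        for images, labels in train_loader:                       # mini-batches of 64
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                teacher_logits = teacher_model(images)             # pre-trained RegNetY-16GF
            class_logits, distill_logits = student_model(images)   # initial micro-expression model
            loss = hard_distillation_loss(class_logits, distill_logits, teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()                                        # propagate the distillation loss back
            optimizer.step()
        scheduler.step()
    return student_model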
Five training schemes are compared. The first is the training method of fig. 8, which applies both random mask inactivation and hard distillation to the initial micro-expression recognition model; the second uses only hard distillation; the third uses random mask inactivation and hard distillation but replaces the binary channel sensing unit of the initial micro-expression recognition model with an MLP layer; the fourth uses only random mask inactivation; and the fifth uses random mask inactivation and soft distillation. The resulting accuracy (ACC) is shown in Table 1:
training method ACC(%)
Method of the first kind 90.3
Method of the second kind 87.6
Method of the third type 89.4
Method of the fourth type 88.2
Method of the fifth type 88.2
TABLE 1
As can be seen from Table 1, the accuracy of the microexpression recognition model can be greatly improved by performing random mask inactivation treatment and a hard distillation mechanism and adding a binary channel sensing unit, and the hard distillation mechanism also proves that the microexpression recognition model learns the inductive bias capability of the teacher model.
On the other hand, in order to verify the effect of the model, the trained micro-expression recognition model was verified by sampling data from more than 2000 members of a video networking platform, capturing 20000 face images in total. The micro-expressions are divided into 7 classes: happiness, anger, neutral, surprise, disgust, fear and sadness. The 20000 face images were labeled one by one according to the 7 expression classes, 16000 images were selected as the training set and 4000 images as the test set, and the test data were input into the model for verification. After verification, the accuracy and recall of the micro-expression recognition model are greatly improved; the performance of each algorithm model on the test set is shown in Table 2:
Model             Accuracy (%)   Recall (%)   F-Measure (%)
ResNet-50         93.43          82.76        83.1
RegNetY-16GF      86.6           85.45        86.02
ViT-B/16          79.2           76.3         77.7
DeiT-B            87.6           85.42        86.5
Proposed model    90.3           88.7         89.5
Table 2
Here ResNet-50, RegNetY-16GF, ViT-B/16 and DeiT-B are all machine learning models.
Fig. 11 is a schematic structural diagram illustrating a micro expression recognition apparatus according to an exemplary embodiment. As shown in fig. 11, in an exemplary embodiment, the apparatus includes:
an image block obtaining module 1110 configured to obtain a plurality of image blocks of an image to be recognized;
the feature map obtaining module 1130 is configured to extract self-attention features of the plurality of image blocks respectively based on a multi-head self-attention mechanism, so as to obtain feature maps of the plurality of image blocks;
a target feature map module 1150 configured to learn attention weights of the feature maps based on the two channels, respectively, to adjust the feature maps to obtain target feature maps;
and the micro-expression recognition module 1170 is configured to perform micro-expression recognition on the image to be recognized based on the target feature map.
The micro expression recognition device provided by the embodiment can be used for precise micro expression recognition.
In one embodiment, the feature map acquisition module includes:
a set dividing unit configured to divide a plurality of image blocks into different sets of image blocks;
the self-attention feature acquisition unit is configured to extract set self-attention features of all image block sets respectively aiming at different image block sets;
and the feature map acquisition unit is configured to perform feature splicing on the set self-attention features of each image block set to obtain the feature maps of the plurality of image blocks.
In one embodiment, the image blocks comprise a first set containing face key points and a second set not containing face key points; the target feature map module comprises:
an attention weight acquiring unit configured to learn a feature map of a first set and a feature map of a second set based on two channels, respectively, and obtain a first attention weight and a second attention weight correspondingly;
a feature adjusting unit configured to adjust feature maps of the first set based on the first attention weight and adjust feature maps of the second set based on the second attention weight;
and the target feature map unit is configured to splice the adjusted feature maps of the first set and the adjusted feature maps of the second set to obtain a target feature map.
In one embodiment, the micro-expression recognition apparatus further includes:
the key point positioning module is configured to position key points of the face in the image to be recognized;
and the set dividing module is configured to take the image blocks containing the face key points in the plurality of image blocks as a first set, and take the image blocks not containing the face key points in the plurality of image blocks as a second set.
In one embodiment, the keypoint localization module comprises:
the initial key point positioning unit is configured to acquire initial face key points in an image to be recognized;
the contour removing unit is configured to remove key points related to the face contour from the initial face key points to obtain first key points;
the cheek positioning unit is configured to position the position of a cheek in the image to be identified based on the first key point to obtain a second key point;
and the face key point acquisition unit is configured to take the first key point and the second key point as face key points.
In one embodiment, the cheek positioning unit includes:
the set determination plate is configured to select a target key point set from the first key points;
a central point acquisition plate configured to calculate central points between the key points in the target key point set;
and the cheek positioning plate is configured to fixedly offset the central point, and the central point after the fixed offset and the central point are taken as second key points.
It should be noted that the micro expression recognition apparatus provided in the foregoing embodiment and the micro expression recognition method provided in the foregoing embodiment belong to the same concept, and specific ways of performing operations by each module and unit have been described in detail in the method embodiment, and are not described again here.
An embodiment of the present application further provides an electronic device, including: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by one or more processors, the electronic equipment is enabled to realize the micro expression recognition method provided in the above embodiments.
FIG. 12 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system 1200 of the electronic device shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 12, the computer system 1200 includes a Central Processing Unit (CPU) 1201, which can perform various appropriate actions and processes, such as executing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. In the RAM1203, various programs and data necessary for system operation are also stored. The CPU 1201, ROM 1202, and RAM1203 are connected to each other by a bus 1204. An Input/Output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a Display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. A drive 1210 is also connected to the I/O interface 1205 as necessary. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1210 as necessary, so that a computer program read out therefrom is mounted into the storage section 1208 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211. The computer program performs various functions defined in the system of the present application when executed by a Central Processing Unit (CPU) 1201.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of these units do not, in some cases, constitute a limitation on the units themselves.
Another aspect of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-mentioned micro-expression recognition method. The computer-readable storage medium may be included in the electronic device described in the above embodiment, or may exist separately without being incorporated in the electronic device.
Another aspect of the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the micro expression recognition method provided in the above embodiments.
The above description is only a preferred exemplary embodiment of the present application, and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make various changes and modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A micro-expression recognition method is characterized by comprising the following steps:
acquiring a plurality of image blocks of an image to be identified;
respectively extracting self-attention features of the image blocks based on a multi-head self-attention mechanism to obtain feature maps of the image blocks;
respectively learning the attention weight of the feature map based on two channels to adjust the feature map to obtain a target feature map;
and carrying out micro-expression recognition on the image to be recognized based on the target feature map.
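Purely for illustration, a minimal Python/PyTorch sketch of the pipeline recited in claim 1 follows. The block size, feature dimension, number of heads, classifier head and the way the two channel weights are fused are assumptions made for the sketch and are not taken from this application; the keypoint-based split of the two channels recited in claim 3 is sketched separately after that claim.

```python
import torch
import torch.nn as nn

def to_image_blocks(image, block=16):
    # image: (batch, channels, H, W) -> (batch, num_blocks, block*block*channels)
    b, c, h, w = image.shape
    patches = image.unfold(2, block, block).unfold(3, block, block)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * block * block)

class MicroExpressionSketch(nn.Module):
    """Hypothetical pipeline: multi-head self-attention over image blocks,
    two learned channel weights to adjust the feature map, then classification."""
    def __init__(self, dim=768, heads=8, num_classes=7):
        super().__init__()
        # dim must equal block*block*channels of to_image_blocks (3*16*16 = 768 for RGB).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel_a = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.channel_b = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, image):
        blocks = to_image_blocks(image)                 # plurality of image blocks
        feat, _ = self.attn(blocks, blocks, blocks)     # feature map of the image blocks
        weight = 0.5 * (self.channel_a(feat) + self.channel_b(feat))  # assumed fusion of two channels
        target_feat = feat * weight                     # adjusted (target) feature map
        return self.classifier(target_feat.mean(dim=1)) # micro-expression prediction
```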
2. The method according to claim 1, wherein the extracting self-attention features of the plurality of image blocks respectively based on the multi-head self-attention mechanism to obtain a feature map of the plurality of image blocks comprises:
dividing the plurality of image blocks into different image block sets;
respectively extracting set self-attention features of all image block sets aiming at different image block sets;
and performing feature splicing on the set self-attention features of each image block set to obtain the feature map of the plurality of image blocks.
3. The method of claim 1, wherein the plurality of image blocks comprise a first set containing face key points and a second set containing no face key points; and the learning the attention weight of the feature map based on two channels respectively to adjust the feature map to obtain a target feature map comprises the following steps:
respectively learning the feature map of the first set and the feature map of the second set based on two channels, and correspondingly obtaining a first attention weight and a second attention weight;
adjusting the feature maps of the first set based on the first attention weight and adjusting the feature maps of the second set based on the second attention weight;
and splicing the adjusted feature map of the first set and the adjusted feature map of the second set to obtain the target feature map.
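As one possible reading of claim 3, the following sketch (same assumptions as above) weights the keypoint blocks and the non-keypoint blocks with separate channels before splicing; the sigmoid gating is an assumption, not a feature recited in the claim.

```python
import torch
import torch.nn as nn

class TwoChannelAdjust(nn.Module):
    """Hypothetical two-channel adjustment: one channel per block subset."""
    def __init__(self, dim=768):
        super().__init__()
        self.channel_first = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())   # keypoint blocks
        self.channel_second = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # remaining blocks

    def forward(self, feat_first, feat_second):
        w1 = self.channel_first(feat_first)     # first attention weight
        w2 = self.channel_second(feat_second)   # second attention weight
        # Adjust each subset's feature map and splice into the target feature map.
        return torch.cat([feat_first * w1, feat_second * w2], dim=1)
```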
4. The method of claim 3, wherein before the learning the feature map of the first set and the feature map of the second set based on two channels respectively to correspondingly obtain a first attention weight and a second attention weight, the method further comprises:
positioning key points of the human face in the image to be recognized;
and taking the image blocks containing the face key points in the plurality of image blocks as the first set, and taking the image blocks not containing the face key points in the plurality of image blocks as the second set.
5. The method according to claim 4, wherein the locating the key points of the face in the image to be recognized comprises:
acquiring an initial face key point in the image to be recognized;
removing key points related to the face contour from the initial face key points to obtain first key points;
based on the first key point, positioning the position of the cheek in the image to be recognized to obtain a second key point;
and taking the first key point and the second key point as the key points of the human face.
6. The method according to claim 5, wherein the locating the position of the cheek in the image to be recognized based on the first key point to obtain a second key point comprises:
selecting a target key point set from the first key points;
calculating central points among the key points in the target key point set;
and carrying out fixed offset on the central point, and taking the central point subjected to fixed offset and the central point as a second key point.
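For illustration only, a sketch of the key point handling in claims 5 and 6 follows; the landmark indexing convention and the offset vector are assumptions, not values taken from this application.

```python
import numpy as np

def face_key_points(initial_kps, contour_idx, target_idx, offset=(0.0, 15.0)):
    """Hypothetical helper: remove contour-related points, then derive cheek (second)
    key points from the centre of a target key point subset plus a fixed offset."""
    first_kps = np.delete(initial_kps, contour_idx, axis=0)         # first key points
    centre = first_kps[target_idx].mean(axis=0)                     # centre of the target key point set
    shifted = centre + np.asarray(offset, dtype=first_kps.dtype)    # fixed offset toward the cheek
    second_kps = np.stack([centre, shifted])                        # second key points
    return np.concatenate([first_kps, second_kps], axis=0)          # face key points
```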
7. A training method of a micro expression recognition model is characterized by comprising the following steps:
inputting an image to be trained into an initial micro expression recognition model, carrying out random mask inactivation treatment on a plurality of training image blocks of the image to be trained in the initial micro expression recognition model, obtaining training feature maps of the training image blocks subjected to the random mask inactivation treatment based on a multi-head self-attention mechanism, adjusting the training feature maps based on two channels to obtain a target training feature map, and obtaining a training prediction result based on the target training feature map;
and training the initial micro-expression recognition model according to a prediction result output by a pre-trained teacher model aiming at the image to be trained and the training prediction result.
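A minimal sketch of one possible training step for claim 7 is shown below; the mask ratio, the KL-divergence distillation loss and the temperature are assumptions, since the claim does not fix a specific loss, and the student and teacher are assumed to be callables mapping block tensors to class logits.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, blocks, mask_ratio=0.3, temperature=2.0):
    """Hypothetical step: random mask inactivation of training image blocks,
    student prediction on the masked blocks, teacher prediction on the full blocks,
    and a soft-label distillation loss between the two."""
    keep = (torch.rand(blocks.shape[:2], device=blocks.device) > mask_ratio).float().unsqueeze(-1)
    student_logits = student(blocks * keep)      # training prediction result
    with torch.no_grad():
        teacher_logits = teacher(blocks)         # teacher prediction for the image to be trained
    loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2
    return loss
```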
8. A micro expression recognition device, comprising:
the image block acquisition module is configured to acquire a plurality of image blocks of an image to be identified;
the feature map acquisition module is configured to extract self-attention features of the image blocks respectively based on a multi-head self-attention mechanism to obtain feature maps of the image blocks;
the target characteristic diagram module is configured to learn attention weights of the characteristic diagrams respectively based on two channels so as to adjust the characteristic diagrams to obtain target characteristic diagrams;
and the micro expression recognition module is configured to perform micro expression recognition on the image to be recognized based on the target feature map.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more computer programs that, when executed by the one or more processors, cause the electronic device to implement the method of any of claims 1-7.
10. A computer-readable storage medium having computer-readable instructions stored thereon, which, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1-7.
CN202211681160.8A 2022-12-26 2022-12-26 Micro expression recognition method and device and micro expression recognition model training method Pending CN115984930A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211681160.8A CN115984930A (en) 2022-12-26 2022-12-26 Micro expression recognition method and device and micro expression recognition model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211681160.8A CN115984930A (en) 2022-12-26 2022-12-26 Micro expression recognition method and device and micro expression recognition model training method

Publications (1)

Publication Number Publication Date
CN115984930A true CN115984930A (en) 2023-04-18

Family

ID=85967615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211681160.8A Pending CN115984930A (en) 2022-12-26 2022-12-26 Micro expression recognition method and device and micro expression recognition model training method

Country Status (1)

Country Link
CN (1) CN115984930A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189272A (en) * 2023-05-05 2023-05-30 南京邮电大学 Facial expression recognition method and system based on feature fusion and attention mechanism
CN116189272B (en) * 2023-05-05 2023-07-07 南京邮电大学 Facial expression recognition method and system based on feature fusion and attention mechanism
CN116311192A (en) * 2023-05-15 2023-06-23 中国科学院长春光学精密机械与物理研究所 System and method for space target positioning, regional super-resolution reconstruction and type identification
CN116311192B (en) * 2023-05-15 2023-08-22 中国科学院长春光学精密机械与物理研究所 System and method for space target positioning, regional super-resolution reconstruction and type identification

Similar Documents

Publication Publication Date Title
CN111709409B (en) Face living body detection method, device, equipment and medium
US11487995B2 (en) Method and apparatus for determining image quality
CN108280477B (en) Method and apparatus for clustering images
CN108288051B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
US20190087683A1 (en) Method and apparatus for outputting information
CN111310731A (en) Video recommendation method, device and equipment based on artificial intelligence and storage medium
CN111523621A (en) Image recognition method and device, computer equipment and storage medium
CN106022317A (en) Face identification method and apparatus
CN110728209A (en) Gesture recognition method and device, electronic equipment and storage medium
CN115984930A (en) Micro expression recognition method and device and micro expression recognition model training method
CN110619059B (en) Building marking method based on transfer learning
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN110765882B (en) Video tag determination method, device, server and storage medium
CN113449704B (en) Face recognition model training method and device, electronic equipment and storage medium
CN116824278B (en) Image content analysis method, device, equipment and medium
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN113591763A (en) Method and device for classifying and identifying face shape, storage medium and computer equipment
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
US20210365719A1 (en) System and method for few-shot learning
CN113033305B (en) Living body detection method, living body detection device, terminal equipment and storage medium
CN110008922A (en) Image processing method, unit, medium for terminal device
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN109101984B (en) Image identification method and device based on convolutional neural network
CN111401112B (en) Face recognition method and device
CN112862840A (en) Image segmentation method, apparatus, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination