WO2021217919A1 - Facial action unit recognition method and apparatus, and electronic device, and storage medium

Facial action unit recognition method and apparatus, and electronic device, and storage medium

Info

Publication number
WO2021217919A1
Authority
WO
WIPO (PCT)
Prior art keywords
action unit
face
feature
target
sub
Application number
PCT/CN2020/104042
Other languages
French (fr)
Chinese (zh)
Inventor
胡艺飞
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021217919A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device, electronic device, and storage medium for recognizing face action units.
  • With the development of computer vision technology in artificial intelligence, facial action units have shown great potential for exploration in the field of human-computer interaction, attracting the attention of more and more enterprises and researchers.
  • The recognition of facial action units is the basis of facial expression analysis, emotion analysis, and deeper behavioral analysis such as whether a subject is lying or committing fraud, and usually needs to be implemented by building a neural network model on an annotated face image data set.
  • To improve recognition accuracy, existing facial action unit recognition models adopt relatively complex network structures, and the trained models are generally too large; they are therefore not suitable for mobile devices. Even if such a model can be deployed on a mobile device, the inventor realized that, because the performance of a mobile device's processor is far below that of a server's, a single run of the model consumes a great deal of time, which makes facial action unit recognition inefficient.
  • The embodiments of the present application provide a facial action unit recognition method, device, electronic device, and storage medium, which help improve the efficiency of facial action unit recognition in face images.
  • In a first aspect, an embodiment of the present application provides a facial action unit recognition method, the method including:
  • obtaining a face image to be recognized, and performing face correction on the face image to be recognized to obtain a target face image to be recognized;
  • performing feature extraction on the target face image to be recognized using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain a first-target-category facial action unit sub-feature, a second-target-category facial action unit sub-feature, and a third-target-category facial action unit sub-feature;
  • inputting the first-, second-, and third-target-category facial action unit sub-features into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain a first output feature of the first-target-category sub-feature, a second output feature of the second-target-category sub-feature, and a third output feature of the third-target-category sub-feature;
  • according to the first, second, and third output features, respectively obtaining the recognition result of the first-target-category facial action units, the recognition result of the second-target-category facial action units, and the recognition result of the third-target-category facial action units.
  • In a second aspect, an embodiment of the present application provides a facial action unit recognition device, which includes:
  • a face correction module, configured to obtain a face image to be recognized and perform face correction on it to obtain a target face image to be recognized;
  • a feature extraction module, configured to perform feature extraction on the target face image to be recognized using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain the first-, second-, and third-target-category facial action unit sub-features;
  • a feature processing module, configured to input the first-, second-, and third-target-category facial action unit sub-features into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain the first, second, and third output features of the respective sub-features;
  • a facial action unit classification module, configured to obtain, according to the first, second, and third output features, the recognition results of the first-, second-, and third-target-category facial action units respectively.
  • In a third aspect, an embodiment of the present application provides an electronic device.
  • The electronic device includes a processor, a memory, and a computer program stored on the memory and runnable on the processor; when the processor executes the computer program, the following is implemented:
  • obtaining a face image to be recognized and performing face correction on it to obtain a target face image to be recognized;
  • performing feature extraction on the target face image to be recognized using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain the first-, second-, and third-target-category facial action unit sub-features;
  • inputting the three sub-features into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain the first, second, and third output features of the respective sub-features;
  • and, according to the first, second, and third output features, respectively obtaining the recognition results of the first-, second-, and third-target-category facial action units.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the following is implemented:
  • obtaining a face image to be recognized and performing face correction on it to obtain a target face image to be recognized;
  • performing feature extraction on the target face image to be recognized using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain the first-, second-, and third-target-category facial action unit sub-features;
  • inputting the three sub-features into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain the first, second, and third output features of the respective sub-features;
  • and, according to the first, second, and third output features, respectively obtaining the recognition results of the first-, second-, and third-target-category facial action units.
  • In the embodiments of the present application, the backbone network of the facial action unit recognition model uses a stack of separable convolution blocks and inverted residual blocks to extract the sub-features. The separable convolution reduces the model's processing parameters severalfold, the inverted residual block is smaller than a standard residual structure, and the attention mechanism is computed with matrix multiplication, which guarantees the running speed of the facial action unit recognition model. The entire model is therefore lighter in structure and computes quickly, which helps improve the efficiency of facial action unit recognition in face images.
  • FIG. 1 is an example diagram of an application scenario provided by an embodiment of the application;
  • FIG. 2 is a network architecture diagram provided by an embodiment of the application;
  • FIG. 3 is a schematic flowchart of a facial action unit recognition method provided by an embodiment of the application;
  • FIG. 4 is a schematic structural diagram of a multi-task convolutional neural network model provided by an embodiment of the application;
  • FIG. 5 is a schematic structural diagram of a facial action unit recognition model provided by an embodiment of the application;
  • FIG. 6 is an example diagram of separable convolution provided by an embodiment of the application;
  • FIG. 7 is a schematic flowchart of another facial action unit recognition method provided by an embodiment of the application;
  • FIG. 8 is a schematic structural diagram of a facial action unit recognition device provided by an embodiment of the application;
  • FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • The embodiment of the present application proposes a facial action unit recognition solution, which can be applied to scenarios in which staff handle business for customers or the public, as shown in Figure 1.
  • Staff usually need to use a terminal to collect videos or photos, for example when bank staff handle loan business for customers, insurance company staff handle insurance business for customers, or a government affairs center handles related business for the public. Of course, the scene shown in Figure 1 is only illustrative and does not limit this application.
  • The facial action unit recognition proposed in this application can also be used in many other scenarios, such as facial expression analysis, psychological activity analysis, and interviews.
  • The facial action unit recognition model adopted in this solution uses separable convolution in its convolution processing, which greatly reduces the parameter count of the model, and uses inverted residual modules to extract deeper features; compared with a standard residual module, the inverted residual module is lighter.
  • The operations in the model's backbone network and in its attention mechanism use matrix multiplication.
  • The overall design keeps the model under 7 MB in size while still recognizing 39 facial action units, so it runs faster and more efficiently and can be deployed not only on the server side but also on mobile terminals.
  • The facial action unit recognition solution can be implemented based on the network architecture shown in FIG. 2.
  • The network architecture includes at least a terminal and a server.
  • The terminal and the server communicate through a network, which includes but is not limited to a virtual private network, a local area network, or a metropolitan area network. The terminal is mainly used for capturing and uploading face images and for displaying the final recognition result.
  • The terminal can be a mobile phone, a tablet, a notebook computer, a handheld computer, or a similar device.
  • After obtaining the face image sent by the terminal, the server performs a series of facial action unit recognition operations and finally outputs the recognition results to the terminal.
  • The server, which can be a single server, a server cluster, or a cloud server, is the execution body of the entire facial action unit recognition scheme.
  • The execution body may also be the terminal, in which case related models or algorithms such as face detection and face correction are also deployed on the terminal.
  • FIG. 3 is a schematic flowchart of a facial action unit recognition method provided by an embodiment of the application, applied to a server; as shown in FIG. 3, the method includes steps S31-S34:
  • S31: Obtain a face image to be recognized, and perform face correction on the face image to be recognized to obtain a target face image to be recognized.
  • The face image to be recognized is collected by the terminal and uploaded to the server in real time. It may come from a short video or be a single picture, which is not limited here.
  • After the server obtains the image to be recognized, it first inputs it into a pre-trained multi-task convolutional neural network model for face detection and face key point positioning.
  • The multi-task convolutional neural network model is composed of three sub-networks: P-Net, R-Net, and O-Net.
  • The input size (i.e., width, height, and depth) of P-Net is 12*12*3.
  • The input size of R-Net is 24*24*3.
  • The input size of O-Net is 48*48*3, and O-Net is followed by a 256-channel fully connected layer.
  • The face image to be recognized is first input to P-Net for processing.
  • The output of P-Net serves as the input of R-Net, and the output of R-Net serves as the input of O-Net, forming a cascaded structure.
  • Each sub-network uses 3*3 or 2*2 convolutions and 3*3 or 2*2 pooling for processing, and finally a face classifier gives the confidence that a region is a face.
  • Bounding box regression and a key point locator are then used to calibrate the face region and locate the face key points.
  • The face key points are the five key points of the face in the face image to be recognized: the two eyes, the nose, and the left and right corners of the mouth; locating them yields the coordinate information of the five key points.
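  • As a hedged illustration of the cascade described above, the following Python sketch uses the open-source mtcnn package, an independent implementation of the same P-Net/R-Net/O-Net design rather than the patent's own model; the file name face.jpg is a placeholder.

```python
import cv2
from mtcnn import MTCNN  # pip install mtcnn

detector = MTCNN()  # builds the P-Net -> R-Net -> O-Net cascade internally

image = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2RGB)
for face in detector.detect_faces(image):
    box = face["box"]                # [x, y, width, height] of the face region
    confidence = face["confidence"]  # face classifier confidence
    keypoints = face["keypoints"]    # dict: left_eye, right_eye, nose,
                                     # mouth_left, mouth_right -> (x, y)
```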
  • The pre-stored coordinate information of the face key points of a standard face image is obtained from a database.
  • The standard face image is a face image in which the face has no rotation and needs no correction. The coordinate information of the five face key points in the face image to be recognized is compared with the coordinate information of the face key points in the standard face image to obtain a similarity transformation matrix T, which is solved according to the following similarity transformation equation:

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$

  • In the similarity transformation equation, (x, y) represents the coordinates of a face key point in the face image to be recognized, (x', y') represents the coordinates of the corresponding face key point in the standard face image, s represents the scaling factor, θ represents the rotation angle (usually a counterclockwise rotation), and (t_x, t_y) represents the translation parameters; together these determine the similarity transformation matrix T.
  • The coordinate information of the five face key points in the face image to be recognized is then multiplied by the similarity transformation matrix T to obtain the target face image to be recognized; that is, the correction of the face in the face image to be recognized is completed.
  • The transform.SimilarityTransform function can be used to solve for the similarity transformation matrix T.
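  • A minimal sketch of this correction step, assuming scikit-image's transform.SimilarityTransform (the function the text names) and OpenCV for warping; the key point coordinates and the 112*112 output size are illustrative assumptions, not values from the patent.

```python
import numpy as np
import cv2
from skimage import transform

# Five key points (x, y) detected in the face image to be recognized;
# placeholder values standing in for a real detection result.
src = np.array([[38.0, 52.0], [74.0, 51.0], [56.0, 72.0],
                [41.0, 92.0], [71.0, 91.0]])
# Pre-stored key points of the standard (upright) face image; an assumed
# 112x112 alignment template, not the patent's stored coordinates.
dst = np.array([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                [41.5, 92.4], [70.7, 92.2]])

tform = transform.SimilarityTransform()
tform.estimate(src, dst)              # solves for s, theta, (t_x, t_y)
T = tform.params                      # 3x3 similarity transformation matrix

image = cv2.imread("face.jpg")        # placeholder input image
aligned = cv2.warpAffine(image, T[:2, :], (112, 112))  # corrected target face
```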
  • To improve the processing efficiency of the model, a more lightweight convolutional neural network is used.
  • Its specific structure is shown in Figure 5.
  • The backbone network part of the facial action unit recognition model is a stack of 7 separable convolution blocks and inverted residual modules, 17 layers in total, and is mainly used for feature extraction on the input target face image to be recognized.
  • The convolution kernels of all standard convolutional layers in the facial action unit recognition model are replaced with separable convolutions.
  • Suppose the input feature map size is d*d*m (where d is the width and height of the feature map and m is the number of channels),
  • the output feature map is d*d*n,
  • and the convolution kernel size is k*k.
  • The computational complexity of a standard convolution is then d*d*m*n*k*k,
  • while the computational complexity of a separable convolution is d*d*m*(k*k+n).
  • For example, a 10*10*3 feature map is first processed channel by channel by the depthwise convolution to produce a 10*10*3 feature map, which is then convolved with a 1*1*3 convolution kernel to obtain a 10*10*1 feature map.
  • The inverted residual module is built on top of separable convolution: the depth of the feature map is first expanded and then compressed, following an "expansion-convolution-compression" processing pattern, in order to extract deeper features.
  • Compared with a standard residual module, the inverted residual module has a smaller structure, which is more conducive to improving the computational efficiency of the model.
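  • A minimal PyTorch sketch of the two building blocks named above; the layer widths, expansion factor, and activations are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """Depthwise k*k convolution followed by a 1*1 pointwise convolution.

    For a d*d*m input and d*d*n output, the cost is d*d*m*k*k (depthwise)
    plus d*d*m*n (pointwise), i.e. d*d*m*(k*k + n), versus d*d*m*n*k*k for
    a standard convolution, matching the complexity given in the text.
    """
    def __init__(self, m, n, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(m, m, k, padding=k // 2, groups=m)
        self.pointwise = nn.Conv2d(m, n, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class InvertedResidual(nn.Module):
    """'Expansion-convolution-compression' block built on separable convolution."""
    def __init__(self, channels, expand=4, k=3):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.ReLU6(),              # expansion
            nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden),
            nn.ReLU6(),                                              # depthwise conv
            nn.Conv2d(hidden, channels, 1),                          # compression
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection around the thin ends

x = torch.randn(1, 3, 112, 112)
y = InvertedResidual(16)(SeparableConv(3, 16)(x))
print(y.shape)  # torch.Size([1, 16, 112, 112])
```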
  • The first target category of facial action units is the pre-divided facial action units around the eyes,
  • the second target category is the facial action units of the face and nose,
  • and the third target category is the mouth-related facial action units.
  • The data set used to train the facial action unit recognition model is an annotated data set in which 39 facial action units are divided into these 3 categories: the area around the eyes, the face and nose, and the mouth.
  • Changes in facial action units around the eyes are generally slight tightening or stretching of the skin, changes around the nose are generally folds, and changes around the mouth are generally bulging of the skin caused by the lips or tongue, and so on.
  • For example, AU45 (blinking) belongs to the eye-area category,
  • AU18 belongs to the mouth category,
  • and AU04 (frowning) belongs to the eye-area category.
  • Through training, the facial action unit recognition model learns to extract sub-features for each of the above three categories of facial action units;
  • that is, after processing by the separable convolution blocks and inverted residual blocks, the model outputs the first-target-category facial action unit sub-feature, the second-target-category facial action unit sub-feature, and the third-target-category facial action unit sub-feature.
  • The first output feature is the feature map output after the first-target-category facial action unit sub-feature is processed by the convolution in the attention mechanism module; the second and third output features are obtained in the same way.
  • After the backbone network, the facial action unit recognition model splits into three branches, which respectively process the sub-features of the eye area, the face and nose, and the mouth.
  • An attention mechanism module is added to each branch, and each attention mechanism module is composed of three layers of 1*1 convolution. The first-, second-, and third-target-category facial action unit sub-features each undergo three 1*1 convolutions to obtain the output feature of the corresponding sub-feature.
  • The three consecutive 1*1 convolution layers of the attention mechanism module in each branch learn two-dimensional weights, which make clear which positions of the input face carry feature information more conducive to recognizing facial action units.
  • The attention mechanism module is computed with matrix multiplication, which guarantees the model's running speed and strengthens its ability to extract high-level features of facial action units.
  • The output feature is used as a weight: its width and height are multiplied element-wise with the corresponding width and height of the sub-feature it was computed from, so that the useful features of each target category receive more attention. That is, the width and height of the first output feature are multiplied with the width and height of the first-target-category facial action unit sub-feature, and the same operation is performed with the second and third output features, yielding the first feature to be classified of the first-target-category facial action units, the second feature to be classified, and the third feature to be classified.
  • The feature to be classified of each category of facial action unit is the input feature of the fully connected layer.
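  • A minimal PyTorch sketch of one branch's attention module as described above: three stacked 1*1 convolutions produce a two-dimensional weight map that is multiplied element-wise with the branch's sub-feature. The channel sizes and the sigmoid gating are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BranchAttention(nn.Module):
    """Three stacked 1x1 convolutions learn a 2-D spatial weight map that is
    multiplied, across width and height, with the branch's sub-feature."""
    def __init__(self, channels):
        super().__init__()
        self.weights = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid(),   # one weight per position
        )

    def forward(self, sub_feature):
        w = self.weights(sub_feature)       # the branch's "output feature"
        return sub_feature * w              # width/height-wise multiplication

sub = torch.randn(1, 64, 14, 14)            # one target category's sub-feature
to_classify = BranchAttention(64)(sub)      # input to the fully connected layer
print(to_classify.shape)                    # torch.Size([1, 64, 14, 14])
```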
  • The first, second, and third features to be classified are input to the fully connected layer, which performs classification for each branch and finally outputs the recognition result of the first-target-category facial action units, the recognition result of the second-target-category facial action units, and the recognition result of the third-target-category facial action units;
  • that is, the outputs are the recognition results of the eye-area facial action units, the face-and-nose facial action units, and the mouth facial action units.
  • Each result is a probability value, and a threshold can be set for it.
  • If the recognition result of a specific facial action unit is greater than or equal to the threshold, it indicates that this facial action unit appears in the face of the face image to be recognized.
  • For example, suppose the value output for AU45 (blinking) is 0.8
  • and the value output for AU18 is 0.3.
  • If the threshold is 0.5, this indicates that the face in the image to be recognized shows AU45 but not AU18.
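  • A minimal sketch of turning the per-unit probabilities into detected action units, using the AU45/AU18 values and the 0.5 threshold stated above.

```python
# Per-unit probabilities as in the example above (AU45 = 0.8, AU18 = 0.3).
probs = {"AU45": 0.8, "AU18": 0.3}
THRESHOLD = 0.5  # threshold stated in the text

detected = [au for au, p in probs.items() if p >= THRESHOLD]
print(detected)  # ['AU45']: the face shows AU45 but not AU18
```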
  • The embodiment of the present application obtains a face image to be recognized and performs face correction on it to obtain a target face image to be recognized; uses the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model to perform feature extraction on the target face image, obtaining the first-, second-, and third-target-category facial action unit sub-features;
  • inputs the three sub-features into the attention mechanism of the facial action unit recognition model for convolution processing to obtain the first, second, and third output features; and, according to these output features, respectively obtains the recognition results of the first-, second-, and third-target-category facial action units.
  • Because the backbone network uses a stack of separable convolution blocks and inverted residual blocks to extract the sub-features, the separable convolution reduces the model's processing parameters severalfold,
  • the inverted residual block is smaller than a standard residual structure, and the attention mechanism is computed with matrix multiplication, which guarantees the running speed of the facial action unit recognition model. The entire model is therefore lighter in structure and computes quickly, which helps improve the efficiency of facial action unit recognition in face images.
  • FIG. 7 is a schematic flowchart of another facial action unit recognition method provided by an embodiment of the application; as shown in FIG. 7, it includes steps S71-S75:
  • The above-mentioned performing of face correction on the face image to be recognized to obtain the target face image to be recognized includes:
  • performing face correction on the face image to be recognized based on the face key points, which includes:
  • multiplying the coordinate information of the face key points by the solved similarity transformation matrix T to obtain the target face image to be recognized.
  • In this embodiment, the face image to be recognized is not directly input into the facial action unit recognition model for processing; instead, a multi-task convolutional neural network model is first used to support face correction of the face image to be recognized, so that even when the face is rotated at various angles,
  • the model can still judge accurately, which guarantees the stability of the model.
  • The above-mentioned inputting of the first-, second-, and third-target-category facial action unit sub-features into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain the first output feature of the first-target-category sub-feature, the second output feature of the second-target-category sub-feature, and
  • the third output feature of the third-target-category sub-feature, includes:
  • inputting the first-, second-, and third-target-category facial action unit sub-features respectively into the corresponding branches of the facial action unit recognition model;
  • and obtaining the first output feature, the second output feature, and the third output feature.
  • The backbone network is followed by 3 branches, which respectively process the sub-features of the eye-area facial action units, the face-and-nose facial action units, and the mouth facial action units.
  • This ensures that all 39 facial action units can be recognized, and the attention mechanism module in each branch adopts a stack of three 1*1 convolution layers, which makes the model pay more attention to useful features.
  • Obtaining the recognition results of the first-, second-, and third-target-category facial action units includes:
  • multiplying the width and height of the first, second, and third output features respectively with the width and height of the first-, second-, and third-target-category facial action unit sub-features, to obtain the first, second, and third features to be classified, and classifying these features to obtain the respective recognition results.
  • The above-mentioned recognition results can also be stored in a node of a blockchain.
  • The features output by the attention mechanism modules are used as weights and applied to their respective input features to obtain the input features of the fully connected layer; the features to be classified of the three target categories of facial action units are then input to the fully connected layer for binary classification, which helps the model pay more attention to the differences between the three target categories of facial action units.
  • The present application also provides a facial action unit recognition device, which can execute the method shown in FIG. 3 or FIG. 7; see Figure 8.
  • The device includes:
  • the face correction module 81, configured to obtain a face image to be recognized and perform face correction on it to obtain a target face image to be recognized;
  • the feature extraction module 82, configured to perform feature extraction on the target face image to be recognized using the separable convolution blocks and inverted residual blocks of the pre-trained facial action unit recognition model, to obtain the first-, second-, and third-target-category facial action unit sub-features;
  • the feature processing module 83, configured to input the first-, second-, and third-target-category facial action unit sub-features into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain the first, second, and third output features of the respective sub-features;
  • the facial action unit classification module 84, configured to obtain, according to the first, second, and third output features, the recognition results of the first-, second-, and third-target-category facial action units respectively.
  • The feature extraction module 82 is specifically configured to:
  • perform feature extraction on the target face image to be recognized through the separable convolution blocks and inverted residual blocks of the backbone network.
  • In terms of inputting the first-, second-, and third-target-category facial action unit sub-features into the attention mechanism of the facial action unit recognition model for convolution processing to obtain the first, second, and third output features of the respective sub-features,
  • the feature processing module 83 is specifically configured to:
  • input the first-, second-, and third-target-category facial action unit sub-features respectively into the corresponding branches of the facial action unit recognition model;
  • and obtain the first output feature, the second output feature, and the third output feature.
  • The facial action unit classification module 84 is specifically configured to:
  • multiply the width and height of the first, second, and third output features respectively with the width and height of the first-, second-, and third-target-category facial action unit sub-features, to obtain the first, second, and third features to be classified, and classify these features to obtain the respective recognition results.
  • The face correction module 81 is specifically configured to perform face correction on the face image to be recognized based on the face key points.
  • The face correction module 81 is specifically further configured to:
  • multiply the coordinate information of the face key points by the solved similarity transformation matrix T to obtain the target face image to be recognized.
  • The facial action unit recognition device of the embodiment of the application obtains a face image to be recognized and performs face correction on it to obtain a target face image to be recognized; uses the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model to perform feature extraction on the target face image, obtaining the first-, second-, and third-target-category facial action unit sub-features; inputs the three sub-features into the attention mechanism of the facial action unit recognition model for convolution
  • processing to obtain the first, second, and third output features; and, according to these output features, respectively obtains the recognition results of the first-, second-, and third-target-category facial action units.
  • Because the backbone network of the facial action unit recognition model uses a stack of separable convolution blocks and inverted residual blocks to extract the sub-features, the separable convolution reduces the model's processing parameters severalfold,
  • the inverted residual block is smaller than a standard residual structure, and the attention mechanism is computed with matrix multiplication, which guarantees the running speed of the facial action unit recognition model. The entire model is lighter in structure and computes quickly, which helps improve the efficiency of facial action unit recognition in face images.
  • The various modules of the facial action unit recognition device shown in FIG. 8 may be separately or wholly combined into one or several other units, or one of the modules may be further divided into multiple functionally smaller units; this achieves the same operations without affecting the technical effects of the embodiments of the present application.
  • The above-mentioned units are divided based on logical functions.
  • In practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit.
  • In other embodiments, the facial action unit recognition device may also include other units; in practical applications, these functions may be implemented with the assistance of other units, and may be implemented by multiple units in cooperation.
  • The facial action unit recognition device may be implemented by running a computer program capable of executing the steps of the above method on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM).
  • The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above-mentioned computing device through the computer-readable recording medium, and run therein.
  • FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • The electronic device includes at least: a memory 901 for storing a computer program; a processor 902 for calling the computer program stored in the memory 901 to implement the steps of the facial action unit recognition method embodiments described above; and an input/output interface 903 for performing input and output, of which there may be one or more. It is understandable that each part of the electronic device is connected to a bus.
  • A computer-readable storage medium may be stored in the memory 901 of the electronic device.
  • The computer-readable storage medium is used to store a computer program,
  • and the computer program includes program instructions.
  • The processor 902 is used to execute the program instructions stored in the computer-readable storage medium.
  • The processor 902 (or CPU, Central Processing Unit) is the computing core and control core of the electronic device; it is suitable for implementing one or more instructions, and specifically for loading and executing one or more instructions to realize the corresponding method flow or function.
  • The processor 902 is specifically configured to call the computer program to execute the following steps:
  • obtaining a face image to be recognized and performing face correction on it to obtain a target face image to be recognized;
  • performing feature extraction on the target face image to be recognized using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain the first-, second-, and third-target-category facial action unit sub-features;
  • inputting the three sub-features into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain the first, second, and third output features of the respective sub-features;
  • and, according to the first, second, and third output features, respectively obtaining the recognition results of the first-, second-, and third-target-category facial action units.
  • When the processor 902 performs the feature extraction on the target face image to be recognized using the separable convolution blocks and inverted residual blocks of the pre-trained facial action unit recognition model, this includes:
  • performing feature extraction on the target face image to be recognized through the separable convolution blocks and inverted residual blocks of the backbone network.
  • When the processor 902 inputs the first-, second-, and third-target-category facial action unit sub-features into the attention mechanism of the facial action unit recognition model for convolution processing to obtain the first, second, and third output features of the respective sub-features, this includes:
  • inputting the first-, second-, and third-target-category facial action unit sub-features respectively into the corresponding branches of the facial action unit recognition model;
  • and obtaining the first output feature, the second output feature, and the third output feature.
  • When the processor 902 obtains, according to the first, second, and third output features, the recognition results of the first-, second-, and third-target-category facial action units, this includes:
  • multiplying the width and height of the first, second, and third output features respectively with the width and height of the first-, second-, and third-target-category facial action unit sub-features, to obtain the first, second, and third features to be classified, and classifying these features to obtain the respective recognition results.
  • When the processor 902 performs the face correction on the face image to be recognized, this includes:
  • performing face correction on the face image to be recognized based on the face key points, which includes:
  • multiplying the coordinate information of the face key points by the solved similarity transformation matrix T to obtain the target face image to be recognized.
  • The foregoing electronic device may be one of various servers, hosts, or other devices.
  • The electronic device may include, but is not limited to, a processor 902, a memory 901, and an input/output interface 903.
  • The schematic diagram is only an example of the electronic device and does not constitute a limitation on it; the electronic device may include more or fewer components than those shown in the figure, a combination of certain components, or different components.
  • The processor 902 of the electronic device executes the computer program to implement the steps in the above-mentioned facial action unit recognition method;
  • the above-mentioned embodiments of the facial action unit recognition method are all applicable to the electronic device, and all can achieve the same or similar beneficial effects.
  • The embodiment of the present application also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the following steps are implemented:
  • obtaining a face image to be recognized and performing face correction on it to obtain a target face image to be recognized;
  • performing feature extraction on the target face image to be recognized using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain the first-, second-, and third-target-category facial action unit sub-features;
  • inputting the three sub-features into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain the first, second, and third output features of the respective sub-features;
  • and, according to the first, second, and third output features, respectively obtaining the recognition results of the first-, second-, and third-target-category facial action units.
  • The feature extraction is performed on the target face image to be recognized through the separable convolution blocks and inverted residual blocks of the backbone network.
  • The first-, second-, and third-target-category facial action unit sub-features are respectively input into the corresponding branches of the facial action unit recognition model,
  • and the first output feature, the second output feature, and the third output feature are obtained.
  • The width and height of the first, second, and third output features are respectively multiplied with the width and height of the first-, second-, and third-target-category facial action unit sub-features to obtain the first, second, and third features to be classified, which are classified to obtain the respective recognition results.
  • The coordinate information of the face key points is multiplied by the solved similarity transformation matrix T to obtain the target face image to be recognized.
  • The computer program in the computer-readable storage medium includes computer program code,
  • and the computer program code may be in the form of source code, object code, an executable file, or some intermediate form.
  • The computer-readable storage medium may be non-volatile or volatile, and may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunication signals, software distribution media, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present application relate to the technical field of artificial intelligence, and provide a facial action unit recognition method. The method comprises: obtaining a face image to be recognized, and performing face correction on said face image to obtain a target face image to be recognized; performing feature extraction on said target face image by using a separable convolutional block and an inverted residual block of a pretrained facial action unit recognition model to obtain sub-features of three target-class facial action units; obtaining outputs of the sub-features of the three target-class facial action units by means of an attention mechanism of the facial action unit recognition model; and respectively obtaining a recognition result of each target-class facial action unit according to the outputs of the sub-features of the three target-class facial action units. Implementation of embodiments of the facial action unit recognition method of the present application facilitates improving the efficiency of facial action unit recognition in face images. In addition, the present application further relates to a blockchain technology, and the recognition results can be stored in a blockchain node.

Description

人脸动作单元识别方法、装置、电子设备及存储介质Face action unit recognition method, device, electronic equipment and storage medium
本申请要求于2020年4月29日提交中国专利局、申请号为202010359833.2,发明名称为“人脸动作单元识别方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed with the Chinese Patent Office on April 29, 2020, the application number is 202010359833.2, and the invention title is "Face Action Unit Recognition Method, Device, Electronic Equipment and Storage Medium", and its entire content Incorporated in this application by reference.
技术领域Technical field
本申请涉及人工智能技术领域,尤其涉及一种人脸动作单元识别方法、装置、电子设备及存储介质。This application relates to the field of artificial intelligence technology, and in particular to a method, device, electronic device, and storage medium for recognizing face action units.
背景技术Background technique
随着人工智能中计算机视觉技术的发展,人脸动作单元在人机交互领域表现出了巨大的可挖掘性,吸引了越来越多的企业或研究者的关注。人脸动作单元的识别是人脸表情分析、情绪分析以及考察对象是否有撒谎、欺诈等更深层次行为分析的基础,通常需要采用经过标注的人脸图像数据集构建神经网络模型实现。现有的人脸动作单元识别模型为了提高识别精度,采用的网络结构较为复杂,训练出来的模型量级普遍偏大,因此,并不适合于移动设备,即使能够部署在移动设备上,发明人意识到由于移动设备处理器的性能远低于服务器的处理器性能,模型运行一次需要消耗大量的时间,这就使得人脸动作单元识别的效率偏低。With the development of computer vision technology in artificial intelligence, the face action unit has shown great excavability in the field of human-computer interaction, attracting more and more enterprises or researchers. The recognition of facial action units is the basis of facial expression analysis, emotion analysis, and deeper behavioral analysis of whether the subject has lied, fraud, etc. It usually needs to be implemented by building a neural network model using annotated facial image data sets. In order to improve the recognition accuracy of the existing facial action unit recognition model, the network structure adopted is relatively complex, and the model level after training is generally too large. Therefore, it is not suitable for mobile devices. Even if it can be deployed on mobile devices, the inventor Realize that because the performance of the processor of the mobile device is much lower than that of the server, the model requires a lot of time to run once, which makes the recognition efficiency of the facial action unit low.
发明内容Summary of the invention
本申请实施例提供了一种人脸动作单元识别方法、装置、电子设备及存储介质,有利于提高人脸图像中人脸动作单元识别的效率。The embodiments of the present application provide a method, device, electronic device, and storage medium for recognizing a face action unit, which are beneficial to improve the efficiency of recognition of a face action unit in a face image.
第一方面,本申请实施例提供了一种人脸动作单元识别方法,该方法包括:In the first aspect, an embodiment of the present application provides a method for recognizing a facial action unit, the method including:
获取待识别人脸图像,对所述待识别人脸图像进行人脸矫正,得到待识别目标人脸图像;Acquiring a face image to be recognized, performing face correction on the face image to be recognized, to obtain a target face image to be recognized;
采用预训练的人脸动作单元识别模型的可分离卷积块和反残差块对所述待识别目标人脸图像进行特征提取,得到第一目标类人脸动作单元子特征、第二目标类人脸动作单元子特征及第三目标类人脸动作单元子特征;The separable convolution block and the de-residual block of the pre-trained face action unit recognition model are used to perform feature extraction on the target face image to be recognized to obtain the first target category face action unit sub-features and the second target category Face action unit sub-features and the third target category face action unit sub-features;
将所述第一目标类人脸动作单元子特征、所述第二目标类人脸动作单元子特征及所述第三目标类人脸动作单元子特征输入所述人脸动作单元识别模型的注意力机制进行卷积处理,得到所述第一目标类人脸动作单元子特征的第一输出特征、所述第二目标类人脸动作单元子特征的第二输出特征以及所述第三目标类人脸动作单元子特征的第三输出特征;Input the first target type face action unit sub-feature, the second target type face action unit sub-feature, and the third target type face action unit sub-feature into the attention of the face action unit recognition model The force mechanism performs convolution processing to obtain the first output feature of the face action unit sub-feature of the first target category, the second output feature of the face action unit sub-feature of the second target category, and the third target category The third output feature of the sub-feature of the face action unit;
根据所述第一输出特征、所述第二输出特征及所述第三输出特征,分别获取所述第一目标类人脸动作单元的识别结果、所述第二目标类人脸动作单元的识别结果及所述第三目标类人脸动作单元的识别结果。According to the first output feature, the second output feature, and the third output feature, the recognition result of the first target type face action unit and the recognition of the second target type face action unit are respectively obtained Result and the recognition result of the third target type face action unit.
第二方面,本申请实施例提供了一种人脸动作单元识别装置,该装置包括:In the second aspect, an embodiment of the present application provides a face action unit recognition device, which includes:
人脸矫正模块,用于获取待识别人脸图像,对所述待识别人脸图像进行人脸矫正,得到待识别目标人脸图像;The face correction module is used to obtain a face image to be recognized, perform face correction on the face image to be recognized, to obtain a target face image to be recognized;
特征提取模块,用于采用预训练的人脸动作单元识别模型的可分离卷积块和反残差块对所述待识别目标人脸图像进行特征提取,得到第一目标类人脸动作单元子特征、第二目标类人脸动作单元子特征及第三目标类人脸动作单元子特征;The feature extraction module is used to extract features of the target face image to be recognized by using the separable convolution block and the inverse residual block of the pre-trained face action unit recognition model to obtain the first target type face action unit sub Features, sub-features of the second target type of facial action unit, and sub-features of the third target type of facial action unit;
特征处理模块,用于将所述第一目标类人脸动作单元子特征、所述第二目标类人脸动作单元子特征及所述第三目标类人脸动作单元子特征输入所述人脸动作单元识别模型的注意力机制进行卷积处理,得到所述第一目标类人脸动作单元子特征的第一输出特征、所述第二目标类人脸动作单元子特征的第二输出特征以及所述第三目标类人脸动作单元子特征 的第三输出特征;A feature processing module, configured to input the sub-features of the first target-type face action unit, the sub-features of the second target-type face action unit, and the sub-features of the third target-type face action unit into the face The attention mechanism of the action unit recognition model performs convolution processing to obtain the first output feature of the sub-feature of the first target type of face action unit, the second output feature of the sub-feature of the second target type of face action unit, and The third output feature of the sub-feature of the face action unit of the third target category;
人脸动作单元分类模块,用于根据所述第一输出特征、所述第二输出特征及所述第三输出特征,分别获取所述第一目标类人脸动作单元的识别结果、所述第二目标类人脸动作单元的识别结果及所述第三目标类人脸动作单元的识别结果。The facial action unit classification module is configured to obtain the recognition result of the first target type facial action unit and the first target facial action unit according to the first output feature, the second output feature, and the third output feature. Two recognition results of the target face action unit and the recognition result of the third target face action unit.
In a third aspect, an embodiment of the present application provides an electronic device. The electronic device includes a processor, a memory, and a computer program stored on the memory and runnable on the processor, and the processor, when executing the computer program, implements:
acquiring a face image to be recognized, and performing face correction on the face image to be recognized to obtain a target face image to be recognized;
performing feature extraction on the target face image to be recognized by using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain a first target-category facial action unit sub-feature, a second target-category facial action unit sub-feature, and a third target-category facial action unit sub-feature;
inputting the first target-category facial action unit sub-feature, the second target-category facial action unit sub-feature, and the third target-category facial action unit sub-feature into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain a first output feature of the first target-category facial action unit sub-feature, a second output feature of the second target-category facial action unit sub-feature, and a third output feature of the third target-category facial action unit sub-feature;
obtaining, according to the first output feature, the second output feature, and the third output feature, a recognition result of the first target category of facial action units, a recognition result of the second target category of facial action units, and a recognition result of the third target category of facial action units respectively.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, and the computer program, when executed by a processor, implements:
acquiring a face image to be recognized, and performing face correction on the face image to be recognized to obtain a target face image to be recognized;
performing feature extraction on the target face image to be recognized by using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain a first target-category facial action unit sub-feature, a second target-category facial action unit sub-feature, and a third target-category facial action unit sub-feature;
inputting the first target-category facial action unit sub-feature, the second target-category facial action unit sub-feature, and the third target-category facial action unit sub-feature into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain a first output feature of the first target-category facial action unit sub-feature, a second output feature of the second target-category facial action unit sub-feature, and a third output feature of the third target-category facial action unit sub-feature;
obtaining, according to the first output feature, the second output feature, and the third output feature, a recognition result of the first target category of facial action units, a recognition result of the second target category of facial action units, and a recognition result of the third target category of facial action units respectively.
In the embodiments of the present application, the backbone network of the facial action unit recognition model extracts sub-features with a stack of separable convolution blocks and inverted residual blocks. Separable convolution reduces the model's processing parameters severalfold, an inverted residual block is structurally smaller than a conventional residual block, and the attention mechanism computes with matrix multiplication, which preserves the model's running speed. The facial action unit recognition model as a whole is therefore lighter in structure and faster in computation, which helps improve the efficiency of facial action unit recognition in face images.
Description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is an example diagram of an application scenario provided by an embodiment of the present application;
FIG. 2 is a network architecture diagram provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a facial action unit recognition method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a multi-task convolutional neural network model provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a facial action unit recognition model provided by an embodiment of the present application;
FIG. 6 is an example diagram of separable convolution provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of another facial action unit recognition method provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a facial action unit recognition apparatus provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
The terms "including" and "having" in the specification, claims, and drawings of the present application, and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes unlisted steps or units, or optionally further includes other steps or units inherent to the process, method, product, or device. In addition, the terms "first", "second", and "third" are used to distinguish different objects, not to describe a specific order.
An embodiment of the present application proposes a facial action unit recognition solution, which can be applied to scenarios in which staff handle business for customers or the public, as shown in FIG. 1. Staff usually need to use a terminal to capture videos or photos, for example when bank staff handle a loan for a customer, when an insurance company handles a policy for a customer, or when a government service center handles business for the public. Of course, the scenario shown in FIG. 1 is only an illustration and does not limit the present application; the facial action unit recognition proposed here can also be applied to many other scenarios such as expression analysis, psychological activity analysis, and interviews. The facial action unit recognition model adopted in this solution uses separable convolution for all convolution processing, which greatly reduces the model's parameter count, and uses inverted residual modules, which are lighter than residual modules, to extract deeper features. Meanwhile, both the backbone network and the operations in the attention mechanism are matrix-multiplication-like computations. The overall design keeps the model under 7 MB in size; while maintaining recognition accuracy on 39 facial action units, it runs faster and more efficiently, and can be deployed not only on the server side but also on mobile terminals.
The facial action unit recognition solution can be implemented on the network architecture shown in FIG. 2. As shown in FIG. 2, the network architecture includes at least a terminal and a server, which communicate over a network including, but not limited to, a virtual private network, a local area network, or a metropolitan area network. The terminal is mainly used to capture and upload face images and to display the final recognition results, and may be a mobile phone, a tablet, a laptop, a handheld computer, or a similar device. After obtaining a face image sent by the terminal, the server performs a series of facial action unit recognition operations and finally outputs the recognition result to the terminal. The server may be a single server, a server cluster, or a cloud server, and is the execution body of the entire facial action unit recognition solution. In some embodiments of the present application, when the facial action unit recognition model is deployed on the terminal, the execution body may also be the terminal, in which case related models or algorithms such as face detection and face correction are also deployed on the terminal.
Based on the above description, the facial action unit recognition method provided by the embodiments of the present application is described in detail below with reference to the other drawings. Referring to FIG. 3, FIG. 3 is a schematic flowchart of a facial action unit recognition method provided by an embodiment of the present application, applied to a server. As shown in FIG. 3, the method includes steps S31-S34:
S31: acquire a face image to be recognized, and perform face correction on the face image to be recognized to obtain a target face image to be recognized.
In the specific embodiments of the present application, the face image to be recognized is a face image collected by the terminal and uploaded to the server in real time; it may be a short video or a single picture, which is not limited here. After obtaining the image to be recognized, the server first inputs it into a pre-trained multi-task convolutional neural network model for face detection and facial keypoint localization. As shown in FIG. 4, the multi-task convolutional neural network model consists of three sub-networks, P-Net, R-Net, and O-Net. The input size (i.e., width, height, and depth) of P-Net is 12*12*3; the input size of R-Net is 24*24*3, followed by a 128-channel fully connected layer; and the input size of O-Net is 48*48*3, followed by a 256-channel fully connected layer. The face image to be recognized is first processed by P-Net, the output of P-Net serves as the input of R-Net, and the output of R-Net serves as the input of O-Net, forming a cascaded structure. Each sub-network uses 3*3 or 2*2 convolutions and 3*3 or 2*2 pooling for processing. Finally, a face classifier gives the confidence that a region is a face, while bounding-box regression and a keypoint locator are used to calibrate the face region and locate the facial keypoints. The facial keypoints are five keypoints of the face in the image to be recognized: the two eyes, the nose, and the left and right corners of the mouth; locating them yields the coordinate information of the five keypoints.
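For concreteness, below is a minimal PyTorch sketch of the P-Net stage described above. Only the 12*12*3 input size, the 3*3 convolutions, the 2*2 pooling, and the three output heads (face confidence, bounding-box regression, five-keypoint location) follow the text; the channel widths and exact layer count are illustrative assumptions, not the configuration of FIG. 4.

```python
import torch
import torch.nn as nn

class PNet(nn.Module):
    """Minimal P-Net-style sub-network: 12x12x3 in, three heads out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=3), nn.PReLU(),   # 12x12 -> 10x10
            nn.MaxPool2d(kernel_size=2, stride=2),         # 10x10 -> 5x5
            nn.Conv2d(10, 16, kernel_size=3), nn.PReLU(),  # 5x5  -> 3x3
            nn.Conv2d(16, 32, kernel_size=3), nn.PReLU(),  # 3x3  -> 1x1
        )
        self.face_cls = nn.Conv2d(32, 2, kernel_size=1)    # face / non-face confidence
        self.box_reg = nn.Conv2d(32, 4, kernel_size=1)     # bounding-box regression
        self.landmarks = nn.Conv2d(32, 10, kernel_size=1)  # 5 keypoints, (x, y) each

    def forward(self, x):
        f = self.features(x)
        return self.face_cls(f), self.box_reg(f), self.landmarks(f)

cls_out, box_out, pts_out = PNet()(torch.randn(1, 3, 12, 12))
# shapes: (1, 2, 1, 1), (1, 4, 1, 1), (1, 10, 1, 1)
```

R-Net and O-Net would follow the same pattern at 24*24*3 and 48*48*3, with 128- and 256-channel fully connected layers respectively, each consuming the candidate regions passed on by the previous stage.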
In addition, after the coordinate information of the five facial keypoints is obtained, the pre-stored coordinate information of the facial keypoints of a standard face image is retrieved from a database. A standard face image is one in which the face is not rotated and needs no correction. The coordinate information of the five facial keypoints in the face image to be recognized is compared with that of the facial keypoints in the standard face image to obtain a similarity transformation matrix T, which is solved according to the following similarity transformation matrix equation:
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = T \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad T = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}$$

Afterwards, the coordinate information of the five facial keypoints in the face image to be recognized is multiplied by the similarity transformation matrix T to obtain the target face image to be recognized, which completes the correction of the face in the face image to be recognized. In the similarity transformation matrix equation above, (x, y) denotes the coordinates of a facial keypoint in the face image to be recognized, (x', y') denotes the coordinates of the corresponding facial keypoint in the standard face image, T is the similarity transformation matrix, s is the scale factor, θ is the rotation angle (usually a counterclockwise rotation), and (t_x, t_y) are the translation parameters. Specifically, the transform.SimilarityTransform function can be used to solve the similarity transformation matrix T iteratively.
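As a usage sketch, the transform.SimilarityTransform function mentioned above is available in scikit-image; the keypoint coordinates below are placeholder values for illustration only.

```python
import numpy as np
from skimage import transform

# Five detected keypoints (x, y) and the stored standard-face keypoints (x', y')
src = np.array([[30.3, 51.7], [65.5, 51.5], [48.0, 71.7],
                [33.5, 92.4], [62.7, 92.2]])
dst = np.array([[30.3, 51.7], [65.5, 51.7], [48.0, 71.7],
                [33.5, 92.4], [62.7, 92.4]])

tform = transform.SimilarityTransform()
tform.estimate(src, dst)   # fits s, theta, (tx, ty) from the point pairs
T = tform.params           # the 3x3 matrix T in the equation above

# Applying T to homogeneous keypoint coordinates reproduces the multiplication step
pts_h = np.column_stack([src, np.ones(len(src))])
aligned = (T @ pts_h.T).T[:, :2]

# In practice the whole image is warped accordingly, e.g.
# corrected = transform.warp(image, tform.inverse)
```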
S32: perform feature extraction on the target face image to be recognized by using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain a first target-category facial action unit sub-feature, a second target-category facial action unit sub-feature, and a third target-category facial action unit sub-feature.
In the specific embodiments of the present application, after the target face image to be recognized is obtained by the method described in step S31, it is input into a pre-trained facial action unit recognition model for facial action unit recognition. To improve the model's processing efficiency, a more lightweight convolutional neural network is adopted. The specific structure is shown in FIG. 5: the backbone of the facial action unit recognition model is a stack of 7 separable convolution blocks and inverted residual modules, 17 layers in total, mainly used to extract features from the input target face image. All standard convolution kernels in the facial action unit recognition model are replaced with separable convolutions. If the input feature map is d*d*m (where d is the width and height of the feature map and m is the number of channels), the output feature map is d*d*n, and the kernel size is k*k, then the computational complexity of a standard convolution is d*d*m*n*k*k, whereas that of a separable convolution is d*d*m*(n+k*k). For example, for a 12*12*3 feature map of the target face image, as shown in FIG. 6, a 3*3*1 depthwise kernel is first applied to each channel, yielding a 10*10*3 feature map; each 1*1*3 pointwise kernel then convolves this 10*10*3 feature map into a 10*10*1 feature map, and with three such pointwise kernels the model's processing parameters drop from the original 3*3*3*3=81 to 3*3*3+1*1*3*3=36, so the computation is markedly faster than an ordinary convolution. Second, inverted residual modules are built on top of the separable convolutions, expanding and compressing the depth of the feature map with an "expand-convolve-compress" processing pattern so as to extract deeper features. Compared with a conventional residual module, an inverted residual module has a smaller structure, which further improves the model's computational efficiency.
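The two building blocks can be sketched in PyTorch as follows. The channel sizes and the expansion factor are illustrative assumptions; the parameter count at the end reproduces the 81-versus-36 comparison from the example above.

```python
import torch
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    """3*3 depthwise convolution followed by a 1*1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # per-channel k*k
        nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # mixes channels
    )

class InvertedResidual(nn.Module):
    """Expand-convolve-compress block with a residual around the thin ends."""
    def __init__(self, ch, expand=6):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1), nn.ReLU6(),                    # expand depth
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # depthwise conv
            nn.ReLU6(),
            nn.Conv2d(hidden, ch, 1),                                # compress back
        )

    def forward(self, x):
        return x + self.block(x)

# Parameter comparison for the 3-in / 3-out, 3*3-kernel example in the text:
std = nn.Conv2d(3, 3, 3, bias=False)   # 3*3*3*3 = 81 weights
sep = separable_conv(3, 3)             # 3*3*3 + 1*1*3*3 = 36 weights
print(sum(p.numel() for p in std.parameters()),
      sum(p.numel() for p in sep.parameters()))  # 81 36
```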
The first target category of facial action units is the pre-divided eye-region facial action units, the second target category is the face-and-nose facial action units, and the third target category is the mouth facial action units. The dataset used to train the above facial action unit recognition model is an annotated dataset that divides 39 facial action units into these 3 categories: the eye region, the face and nose, and the mouth. Facial action unit changes around the eyes are generally subtle skin tightening or stretching, those around the nose are generally folds, and those around the mouth are generally bulges of the skin caused by the lips or tongue. For example, AU45 (blink) belongs to the eye-region category, AU18 (lip pucker) belongs to the mouth category, and AU04 (brow lowerer) again belongs to the eye-region category. The facial action unit recognition model therefore learns to extract the sub-features of these three categories separately, i.e., the first, second, and third target-category facial action unit sub-features output after processing by the separable convolution blocks and inverted residual blocks.
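By way of illustration only, such a three-category annotation can be kept as a simple mapping. Only AU45, AU18, and AU04 are assigned by the text above; the remaining entries are common FACS action units added here as assumptions.

```python
# Hypothetical grouping of a few of the 39 AUs into the three target categories
AU_GROUPS = {
    "eye_region":    ["AU01", "AU02", "AU04", "AU45"],  # e.g. AU04 brow lowerer, AU45 blink
    "face_and_nose": ["AU06", "AU09", "AU11"],          # e.g. AU09 nose wrinkler
    "mouth":         ["AU12", "AU18", "AU25"],          # e.g. AU18 lip pucker
}
```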
S33: input the first target-category facial action unit sub-feature, the second target-category facial action unit sub-feature, and the third target-category facial action unit sub-feature into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain a first output feature of the first target-category facial action unit sub-feature, a second output feature of the second target-category facial action unit sub-feature, and a third output feature of the third target-category facial action unit sub-feature.
In the specific embodiments of the present application, the first output feature is the feature map output after the first target-category facial action unit sub-feature has passed through the convolution processing of the attention mechanism module, and likewise for the second and third output features. Referring again to FIG. 5, after the backbone network the facial action unit recognition model splits into three branches, which process the sub-features of the eye-region, face-and-nose, and mouth categories respectively. An attention mechanism module is added to each branch, and each attention mechanism module consists of three layers of 1*1 convolution; the first, second, and third target-category facial action unit sub-features each pass through three 1*1 convolutions to obtain the output feature of each category of sub-features.
Feeding the sub-features of different regions into their corresponding branches reduces the learning difficulty of the network and allows the network to be shallower, improving processing efficiency. The attention mechanism module in each branch learns two-dimensional weights through three consecutive 1*1 convolution layers, making explicit which locations of the input face carry feature information more useful for facial action unit recognition. Meanwhile, the attention mechanism module computes with matrix multiplication, which preserves the model's computation speed and strengthens its ability to extract high-order features of facial action units.
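A minimal sketch of one branch's attention module follows. The three consecutive 1*1 convolutions come from the text; the intermediate channel widths, the activations, and the single-channel sigmoid weight map are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BranchAttention(nn.Module):
    """Three stacked 1*1 convolutions producing a 2-D spatial weight map."""
    def __init__(self, ch):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(ch, ch, 1), nn.ReLU(),
            nn.Conv2d(ch, ch, 1), nn.ReLU(),
            nn.Conv2d(ch, 1, 1), nn.Sigmoid(),  # one weight per spatial position
        )

    def forward(self, sub_feature):
        return self.attn(sub_feature)           # the branch's "output feature"

sub = torch.randn(1, 64, 14, 14)                # e.g. the eye-region sub-feature
weight_map = BranchAttention(64)(sub)           # shape (1, 1, 14, 14)
```

Because a 1*1 convolution is equivalent to a matrix multiplication across the channel dimension at every spatial position, this construction matches the matrix-multiplication property emphasized above.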
S34: obtain, according to the first output feature, the second output feature, and the third output feature, a recognition result of the first target category of facial action units, a recognition result of the second target category of facial action units, and a recognition result of the third target category of facial action units respectively.
In the specific embodiments of the present application, after the first, second, and third output features are obtained, each output feature is used as a weight: its width and height are multiplied elementwise with the width and height of the corresponding first, second, or third target-category facial action unit sub-feature, so that the model attends more to the features useful for each target category of facial action units. That is, the width and height of the first output feature are multiplied with the width and height of the first target-category facial action unit sub-feature, and the same operation is applied to the second and third output features, yielding a first to-be-classified feature of the first target category of facial action units, a second to-be-classified feature of the second target category, and a third to-be-classified feature of the third target category; the to-be-classified feature of each category is the input feature of the fully connected layer. The first, second, and third to-be-classified features are input into the fully connected layer, which classifies each of them, and finally the recognition results of the first, second, and third target categories of facial action units are output, i.e., the recognition results of the eye-region, face-and-nose, and mouth facial action units. Each result is a probability value for which a threshold can be set: when the recognition result of a specific facial action unit is greater than or equal to the threshold, the face in the image to be recognized exhibits that facial action unit; when it is below the threshold, it does not. For example, if AU45 (blink) has a value of 0.8 and AU18 (lip pucker) has a value of 0.3, then with a threshold of 0.5 the face in the image to be recognized exhibits AU45 but not AU18.
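The weighting, classification, and thresholding steps can be sketched as follows. The feature sizes, the number of action units per branch, and the sigmoid scoring are illustrative assumptions; the 0.5 threshold and the AU45/AU18 example values follow the text.

```python
import torch
import torch.nn as nn

ch, h, w, num_aus = 64, 14, 14, 13
sub_feature = torch.randn(1, ch, h, w)     # a branch's sub-feature
weight_map = torch.rand(1, 1, h, w)        # 2-D weights from the attention module

# Elementwise weighting over width and height gives the to-be-classified feature
to_classify = sub_feature * weight_map

# Fully connected layer scores each AU of the branch's category
fc = nn.Linear(ch * h * w, num_aus)
probs = torch.sigmoid(fc(to_classify.flatten(1)))  # one probability per AU

present = probs >= 0.5  # e.g. AU45 at 0.8 -> present, AU18 at 0.3 -> absent
```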
It can be seen that in the embodiments of the present application, a face image to be recognized is acquired and face-corrected to obtain a target face image to be recognized; feature extraction is performed on the target face image with the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model to obtain the first, second, and third target-category facial action unit sub-features; the three sub-features are input into the attention mechanism of the facial action unit recognition model for convolution processing to obtain the first, second, and third output features; and the recognition results of the first, second, and third target categories of facial action units are obtained from these output features respectively. Because the backbone network of the facial action unit recognition model extracts sub-features with a stack of separable convolution blocks and inverted residual blocks, the separable convolution reduces the model's processing parameters severalfold, the inverted residual block is smaller than a conventional residual structure, and the attention mechanism computes with matrix multiplication, preserving the model's running speed. The whole facial action unit recognition model is thus lighter in structure and faster in computation, which helps improve the efficiency of facial action unit recognition in face images.
Based on the description of the facial action unit recognition method embodiment shown in FIG. 3, refer to FIG. 7, which is a schematic flowchart of another facial action unit recognition method provided by an embodiment of the present application. As shown in FIG. 7, the method includes steps S71-S75:
S71: acquire a face image to be recognized;
S72: perform face correction on the face image to be recognized to obtain a target face image to be recognized;
Optionally, performing face correction on the face image to be recognized to obtain the target face image to be recognized includes:
performing face detection on the face image to be recognized by using a pre-trained multi-task convolutional neural network model, and locating the facial keypoints in the face image to be recognized;
performing face correction on the face image to be recognized based on the facial keypoints.
Optionally, performing face correction on the face image to be recognized based on the facial keypoints includes:
comparing the coordinate information of the facial keypoints with pre-stored coordinate information of facial keypoints in a standard face image to obtain a similarity transformation matrix T;
solving the similarity transformation matrix T according to a preset similarity transformation matrix equation;
multiplying the coordinate information of the facial keypoints by the solved similarity transformation matrix T to obtain the target face image to be recognized.
In this implementation, the face image to be recognized is not fed directly into the facial action unit recognition model; instead, a multi-task convolutional neural network model first performs face correction on it, so that the model can judge accurately even when the face is rotated at different angles, which guarantees the model's stability.
S73: input the target face image to be recognized into the backbone network of the pre-trained facial action unit recognition model, and perform feature extraction on the target face image through the separable convolution blocks and inverted residual blocks of the backbone network, to obtain a first target-category facial action unit sub-feature, a second target-category facial action unit sub-feature, and a third target-category facial action unit sub-feature;
S74: input the first target-category facial action unit sub-feature, the second target-category facial action unit sub-feature, and the third target-category facial action unit sub-feature into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain a first output feature of the first target-category facial action unit sub-feature, a second output feature of the second target-category facial action unit sub-feature, and a third output feature of the third target-category facial action unit sub-feature;
Optionally, inputting the first, second, and third target-category facial action unit sub-features into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain the first output feature of the first target-category facial action unit sub-feature, the second output feature of the second target-category facial action unit sub-feature, and the third output feature of the third target-category facial action unit sub-feature, includes:
inputting the first target-category facial action unit sub-feature, the second target-category facial action unit sub-feature, and the third target-category facial action unit sub-feature into the corresponding branches of the facial action unit recognition model respectively;
obtaining the first output feature, the second output feature, and the third output feature after multiple 1*1 convolution passes through the attention mechanism in each branch.
In this implementation, the backbone network is followed by 3 branches, which respectively process the sub-features of the eye-region facial action units, the face-and-nose facial action units, and the mouth facial action units, ensuring that all 39 facial action units can be recognized; and the attention mechanism module in each branch adopts a stack of three 1*1 convolution layers, making the model attend more to useful features.
S75: obtain, according to the first output feature, the second output feature, and the third output feature, a recognition result of the first target category of facial action units, a recognition result of the second target category of facial action units, and a recognition result of the third target category of facial action units respectively.
Optionally, obtaining, according to the first output feature, the second output feature, and the third output feature, the recognition results of the first, second, and third target categories of facial action units respectively includes:
multiplying the width and height of the first output feature, the second output feature, and the third output feature respectively with the width and height of the first, second, and third target-category facial action unit sub-features, to obtain a first to-be-classified feature of the first target category of facial action units, a second to-be-classified feature of the second target category of facial action units, and a third to-be-classified feature of the third target category of facial action units;
inputting the first to-be-classified feature, the second to-be-classified feature, and the third to-be-classified feature into the fully connected layer of the facial action unit recognition model for separate classification, to obtain the recognition result of the first target category of facial action units, the recognition result of the second target category of facial action units, and the recognition result of the third target category of facial action units, where the recognition results are stored in a blockchain.
It should be emphasized that, to further ensure the privacy and security of the above recognition results, the recognition results may also be stored in a node of a blockchain.
In this implementation, the features output by the attention mechanism module are used as weights and multiplied with its input features to obtain the input features of the fully connected layer, and the to-be-classified features of the three target categories of facial action units are then input into the fully connected layer for binary classification, which helps the model attend more to the differences among the three target categories of facial action units.
The specific implementations of the above steps S71-S75 have been described in detail in the embodiment shown in FIG. 3 and can achieve the same or similar beneficial effects; to avoid repetition, they are not repeated here.
Based on the description of the above facial action unit recognition method embodiments, the present application further provides a facial action unit recognition apparatus that can execute the method shown in FIG. 3 or FIG. 7. Referring to FIG. 8, the apparatus includes:
a face correction module 81, configured to acquire a face image to be recognized and perform face correction on the face image to be recognized to obtain a target face image to be recognized;
a feature extraction module 82, configured to perform feature extraction on the target face image to be recognized by using the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model, to obtain a first target-category facial action unit sub-feature, a second target-category facial action unit sub-feature, and a third target-category facial action unit sub-feature;
a feature processing module 83, configured to input the first target-category facial action unit sub-feature, the second target-category facial action unit sub-feature, and the third target-category facial action unit sub-feature into the attention mechanism of the facial action unit recognition model for convolution processing, to obtain a first output feature of the first target-category facial action unit sub-feature, a second output feature of the second target-category facial action unit sub-feature, and a third output feature of the third target-category facial action unit sub-feature;
a facial action unit classification module 84, configured to obtain, according to the first output feature, the second output feature, and the third output feature, a recognition result of the first target category of facial action units, a recognition result of the second target category of facial action units, and a recognition result of the third target category of facial action units respectively.
In one embodiment, in performing feature extraction on the target face image to be recognized by using the separable convolution blocks and inverted residual blocks of the pre-trained facial action unit recognition model, the feature extraction module 82 is specifically configured to:
input the target face image to be recognized into the backbone network;
perform feature extraction on the target face image to be recognized through the separable convolution blocks and the inverted residual blocks of the backbone network.
In one embodiment, in inputting the first, second, and third target-category facial action unit sub-features into the attention mechanism of the facial action unit recognition model for convolution processing to obtain the first output feature of the first target-category facial action unit sub-feature, the second output feature of the second target-category facial action unit sub-feature, and the third output feature of the third target-category facial action unit sub-feature, the feature processing module 83 is specifically configured to:
input the first target-category facial action unit sub-feature, the second target-category facial action unit sub-feature, and the third target-category facial action unit sub-feature into the corresponding branches of the facial action unit recognition model respectively;
obtain the first output feature, the second output feature, and the third output feature after multiple 1*1 convolution passes through the attention mechanism in each branch.
In one embodiment, in obtaining, according to the first output feature, the second output feature, and the third output feature, the recognition results of the first, second, and third target categories of facial action units respectively, the facial action unit classification module 84 is specifically configured to:
multiply the width and height of the first output feature, the second output feature, and the third output feature respectively with the width and height of the first, second, and third target-category facial action unit sub-features, to obtain a first to-be-classified feature of the first target category of facial action units, a second to-be-classified feature of the second target category of facial action units, and a third to-be-classified feature of the third target category of facial action units;
input the first to-be-classified feature, the second to-be-classified feature, and the third to-be-classified feature into the fully connected layer of the facial action unit recognition model for separate classification, to obtain the recognition result of the first target category of facial action units, the recognition result of the second target category of facial action units, and the recognition result of the third target category of facial action units, where the recognition results are stored in a blockchain.
In one embodiment, in performing face correction on the face image to be recognized, the face correction module 81 is specifically configured to:
perform face detection on the face image to be recognized by using a pre-trained multi-task convolutional neural network model, and locate the facial keypoints in the face image to be recognized;
perform face correction on the face image to be recognized based on the facial keypoints.
In one embodiment, in performing face correction on the face image to be recognized based on the facial keypoints, the face correction module 81 is specifically further configured to:
compare the coordinate information of the facial keypoints with pre-stored coordinate information of facial keypoints in a standard face image to obtain a similarity transformation matrix T;
solve the similarity transformation matrix T according to a preset similarity transformation matrix equation;
multiply the coordinate information of the facial keypoints by the solved similarity transformation matrix T to obtain the target face image to be recognized.
The facial action unit recognition apparatus provided by the embodiments of the present application acquires a face image to be recognized and face-corrects it to obtain a target face image to be recognized; performs feature extraction on the target face image with the separable convolution blocks and inverted residual blocks of a pre-trained facial action unit recognition model to obtain the first, second, and third target-category facial action unit sub-features; inputs the three sub-features into the attention mechanism of the facial action unit recognition model for convolution processing to obtain the first, second, and third output features; and obtains, from these output features, the recognition results of the first, second, and third target categories of facial action units respectively. Because the backbone network of the facial action unit recognition model extracts sub-features with a stack of separable convolution blocks and inverted residual blocks, the separable convolution reduces the model's processing parameters severalfold, the inverted residual block is smaller than a conventional residual structure, and the attention mechanism computes with matrix multiplication, preserving the model's running speed. The whole facial action unit recognition model is thus lighter in structure and faster in computation, which helps improve the efficiency of facial action unit recognition in face images.
According to an embodiment of the present application, the modules of the facial action unit recognition apparatus shown in FIG. 8 may be combined, separately or all together, into one or several additional units, or one or more of the modules may be further split into multiple functionally smaller units. This can achieve the same operations without affecting the technical effects of the embodiments of the present application. The above units are divided on the basis of logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the facial action unit recognition apparatus may also include other units; in practical applications, these functions may also be implemented with the assistance of other units and may be implemented by multiple units in cooperation.
According to another embodiment of the present application, the facial action unit recognition apparatus shown in FIG. 8 may be constructed, and the facial action unit recognition method of the embodiments of the present application may be implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in FIG. 3 or FIG. 7 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access storage medium (RAM), and a read-only storage medium (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above computing device via the computer-readable recording medium, and run therein.
Based on the description of the above method and apparatus embodiments, refer to FIG. 9, which is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 9, the electronic device includes at least a memory 901 for storing a computer program; a processor 902 for invoking the computer program stored in the memory 901 to implement the steps in the above facial action unit recognition method embodiments; and an input/output interface 903 for input and output, of which there may be one or more. It can be understood that each part of the electronic device is connected to a bus.
A computer-readable storage medium may be stored in the memory 901 of the electronic device. The computer-readable storage medium is used to store a computer program, the computer program includes program instructions, and the processor 902 is used to execute the program instructions stored in the computer-readable storage medium. The processor 902 (or CPU (Central Processing Unit)) is the computing core and control core of the electronic device; it is suited to implementing one or more instructions, and specifically to loading and executing one or more instructions so as to realize the corresponding method flow or function.
The processor 902 is specifically configured to invoke the computer program to perform the following steps:
获取待识别人脸图像,对所述待识别人脸图像进行人脸矫正,得到待识别目标人脸图像;Acquiring a face image to be recognized, performing face correction on the face image to be recognized, to obtain a target face image to be recognized;
采用预训练的人脸动作单元识别模型的可分离卷积块和反残差块对所述待识别目标人脸图像进行特征提取,得到第一目标类人脸动作单元子特征、第二目标类人脸动作单元子特征及第三目标类人脸动作单元子特征;The separable convolution block and the de-residual block of the pre-trained face action unit recognition model are used to perform feature extraction on the target face image to be recognized to obtain the first target category face action unit sub-features and the second target category Face action unit sub-features and the third target category face action unit sub-features;
将所述第一目标类人脸动作单元子特征、所述第二目标类人脸动作单元子特征及所述第三目标类人脸动作单元子特征输入所述人脸动作单元识别模型的注意力机制进行卷积处 理,得到所述第一目标类人脸动作单元子特征的第一输出特征、所述第二目标类人脸动作单元子特征的第二输出特征以及所述第三目标类人脸动作单元子特征的第三输出特征;Input the first target type face action unit sub-feature, the second target type face action unit sub-feature, and the third target type face action unit sub-feature into the attention of the face action unit recognition model The force mechanism performs convolution processing to obtain the first output feature of the face action unit sub-feature of the first target category, the second output feature of the face action unit sub-feature of the second target category, and the third target category The third output feature of the sub-feature of the face action unit;
根据所述第一输出特征、所述第二输出特征及所述第三输出特征,分别获取所述第一目标类人脸动作单元的识别结果、所述第二目标类人脸动作单元的识别结果及所述第三目标类人脸动作单元的识别结果。According to the first output feature, the second output feature, and the third output feature, the recognition result of the first target type face action unit and the recognition of the second target type face action unit are respectively obtained Result and the recognition result of the third target type face action unit.
在一种可能的实施方式中,处理器902执行所述采用预训练的人脸动作单元识别模型的可分离卷积块和反残差块对所述待识别目标人脸图像进行特征提取,包括:In a possible implementation manner, the processor 902 executes the feature extraction of the target face image to be recognized by using the separable convolution block and the inverse residual block of the pre-trained face action unit recognition model, including :
将所述待识别目标人脸图像输入所述骨干网络;Inputting the face image of the target to be recognized into the backbone network;
通过所述骨干网络的所述可分离卷积块和所述反残差块对所述待识别目标人脸图像进行特征提取。The feature extraction of the target face image to be recognized is performed through the separable convolution block and the inverse residual block of the backbone network.
在一种可能的实施方式中,处理器902执行所述将所述第一目标类人脸动作单元子特征、所述第二目标类人脸动作单元子特征及所述第三目标类人脸动作单元子特征输入所述人脸动作单元识别模型的注意力机制进行卷积处理,得到所述第一目标类人脸动作单元子特征的第一输出特征、所述第二目标类人脸动作单元子特征的第二输出特征以及所述第三目标类人脸动作单元子特征的第三输出特征,包括:In a possible implementation manner, the processor 902 executes the combination of the first target type face action unit sub-feature, the second target type face action unit sub-feature, and the third target type face The action unit sub-features are input into the attention mechanism of the facial action unit recognition model for convolution processing to obtain the first output feature of the first target-type face action unit sub-features, and the second target-type face action The second output feature of the unit sub-feature and the third output feature of the third target-type face action unit sub-feature include:
将所述第一目标类人脸动作单元子特征、所述第二目标类人脸动作单元子特征及所述第三目标类人脸动作单元子特征,分别输入所述人脸动作单元识别模型中对应的分支中;The first target type face action unit sub-feature, the second target type face action unit sub-feature, and the third target type face action unit sub-feature are respectively input into the face action unit recognition model In the corresponding branch;
经过每个分支中的所述注意力机制多次1*1的卷积处理,得到所述第一输出特征、所述第二输出特征及所述第三输出特征。After multiple times of 1*1 convolution processing by the attention mechanism in each branch, the first output feature, the second output feature, and the third output feature are obtained.
In a possible implementation manner, the processor 902 performing the obtaining, according to the first output feature, the second output feature and the third output feature, of the recognition result of the first target-type face action unit, the recognition result of the second target-type face action unit and the recognition result of the third target-type face action unit, respectively, includes:
multiplying the first output feature, the second output feature and the third output feature, element-wise over the width and height dimensions, with the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature, respectively, to obtain the first to-be-classified feature of the first target-type face action unit, the second to-be-classified feature of the second target-type face action unit and the third to-be-classified feature of the third target-type face action unit;
inputting the first to-be-classified feature, the second to-be-classified feature and the third to-be-classified feature into the fully connected layer of the face action unit recognition model for classification, respectively, to obtain the recognition result of the first target-type face action unit, the recognition result of the second target-type face action unit and the recognition result of the third target-type face action unit, wherein the recognition results are stored in a blockchain.
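Combining the two sketches above, the per-branch classification step could plausibly be implemented as below. The global average pooling before the fully connected layer and the sigmoid multi-label output are assumptions about details the description leaves open.

import torch
import torch.nn as nn

class AUClassifierHead(nn.Module):
    # Weights a sub-feature by its attention map over width and height, then classifies with a fully connected layer.
    def __init__(self, channels, num_action_units):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_action_units)

    def forward(self, sub_feature, attention_map):
        weighted = sub_feature * attention_map   # (N, C, H, W) * (N, 1, H, W): element-wise over H and W
        pooled = self.pool(weighted).flatten(1)  # (N, C) feature to be classified
        return torch.sigmoid(self.fc(pooled))    # per-action-unit activation scores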
In a possible implementation manner, the processor 902 performing the face correction on the face image to be recognized includes:
performing face detection on the face image to be recognized by using a pre-trained multi-task convolutional neural network model, and locating face key points in the face image to be recognized;
performing face correction on the face image to be recognized based on the face key points.
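For context, one way to obtain such key points is the pre-trained MTCNN implementation in the facenet-pytorch package; this is an assumed tooling choice, as the application does not name a library.

from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=False)  # pre-trained multi-task cascaded CNN for face detection
img = Image.open("face.jpg")   # hypothetical input file
boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)
# landmarks[0] holds five (x, y) key points: both eye centers, the nose tip and both mouth corners.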
In a possible implementation manner, the processor 902 performing the face correction on the face image to be recognized based on the face key points includes:
comparing the coordinate information of the face key points with the coordinate information of face key points in a pre-stored standard face image to obtain a similarity transformation matrix T;
solving the similarity transformation matrix T according to a preset similarity transformation matrix equation;
multiplying the coordinate information of the face key points by the solved similarity transformation matrix T to obtain the target face image to be recognized.
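A minimal alignment sketch of this correction step, assuming five detected landmarks and using scikit-image to estimate the similarity transformation matrix T and OpenCV to apply it; the standard landmark coordinates below are illustrative assumptions, not values disclosed in this application.

import cv2
import numpy as np
from skimage.transform import SimilarityTransform

# Illustrative reference landmarks for a 112x112 standard face (assumed values).
STANDARD_LANDMARKS = np.array([
    [38.3, 51.7], [73.5, 51.5],   # eye centers
    [56.0, 71.7],                 # nose tip
    [41.5, 92.4], [70.7, 92.2],   # mouth corners
], dtype=np.float32)

def align_face(image, landmarks):
    # Estimate the similarity transformation matrix T between the detected and
    # standard landmarks, then warp the image into the corrected target face image.
    tform = SimilarityTransform()
    tform.estimate(np.asarray(landmarks, dtype=np.float32), STANDARD_LANDMARKS)
    T = tform.params[:2]  # 2x3 matrix accepted by cv2.warpAffine
    return cv2.warpAffine(image, T, (112, 112))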
Exemplarily, the foregoing electronic device may be any of various servers, hosts and similar devices. The electronic device may include, but is not limited to, the processor 902, the memory 901 and the input/output interface 903. Those skilled in the art can understand that the schematic diagram is merely an example of the electronic device and does not constitute a limitation on it; the electronic device may include more or fewer components than those shown, combine certain components, or use different components.
It should be noted that, since the processor 902 of the electronic device implements the steps of the above face action unit recognition method when executing the computer program, all embodiments of the above face action unit recognition method are applicable to the electronic device and can achieve the same or similar beneficial effects.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
acquiring a face image to be recognized, and performing face correction on the face image to be recognized to obtain a target face image to be recognized;
performing feature extraction on the target face image to be recognized by using the separable convolution block and the inverted residual block of a pre-trained face action unit recognition model, to obtain a first target-type face action unit sub-feature, a second target-type face action unit sub-feature and a third target-type face action unit sub-feature;
inputting the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into the attention mechanism of the face action unit recognition model for convolution processing, to obtain a first output feature of the first target-type face action unit sub-feature, a second output feature of the second target-type face action unit sub-feature and a third output feature of the third target-type face action unit sub-feature;
according to the first output feature, the second output feature and the third output feature, obtaining the recognition result of the first target-type face action unit, the recognition result of the second target-type face action unit and the recognition result of the third target-type face action unit, respectively.
In yet another example, when the computer program is executed by the processor, the following steps are further implemented:
inputting the target face image to be recognized into the backbone network;
performing feature extraction on the target face image to be recognized through the separable convolution block and the inverted residual block of the backbone network.
In yet another example, when the computer program is executed by the processor, the following steps are further implemented:
inputting the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into the corresponding branches of the face action unit recognition model, respectively;
obtaining the first output feature, the second output feature and the third output feature after multiple 1×1 convolution operations performed by the attention mechanism in each branch.
In yet another example, when the computer program is executed by the processor, the following steps are further implemented:
multiplying the first output feature, the second output feature and the third output feature, element-wise over the width and height dimensions, with the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature, respectively, to obtain the first to-be-classified feature of the first target-type face action unit, the second to-be-classified feature of the second target-type face action unit and the third to-be-classified feature of the third target-type face action unit;
inputting the first to-be-classified feature, the second to-be-classified feature and the third to-be-classified feature into the fully connected layer of the face action unit recognition model for classification, respectively, to obtain the recognition result of the first target-type face action unit, the recognition result of the second target-type face action unit and the recognition result of the third target-type face action unit, wherein the recognition results are stored in a blockchain.
In yet another example, when the computer program is executed by the processor, the following steps are further implemented:
performing face detection on the face image to be recognized by using a pre-trained multi-task convolutional neural network model, and locating face key points in the face image to be recognized;
performing face correction on the face image to be recognized based on the face key points.
In yet another example, when the computer program is executed by the processor, the following steps are further implemented:
comparing the coordinate information of the face key points with the coordinate information of face key points in a pre-stored standard face image to obtain a similarity transformation matrix T;
solving the similarity transformation matrix T according to a preset similarity transformation matrix equation;
multiplying the coordinate information of the face key points by the solved similarity transformation matrix T to obtain the target face image to be recognized.
Exemplarily, the computer program in the computer-readable storage medium includes computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable storage medium may be non-volatile or volatile, and may include any entity or apparatus capable of carrying the computer program code: a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
It should be noted that, since the computer program in the computer-readable storage medium implements the steps of the above face action unit recognition method when executed by the processor 902, all embodiments of the above face action unit recognition method are applicable to the computer-readable storage medium and can achieve the same or similar beneficial effects.
The embodiments of the present application have been described in detail above, and specific examples are used herein to explain the principles and implementations of the present application. The description of the above embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application based on the ideas of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (20)

  1. A face action unit recognition method, wherein the method comprises:
    acquiring a face image to be recognized, and performing face correction on the face image to be recognized to obtain a target face image to be recognized;
    performing feature extraction on the target face image to be recognized by using a separable convolution block and an inverted residual block of a pre-trained face action unit recognition model, to obtain a first target-type face action unit sub-feature, a second target-type face action unit sub-feature and a third target-type face action unit sub-feature;
    inputting the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into an attention mechanism of the face action unit recognition model for convolution processing, to obtain a first output feature of the first target-type face action unit sub-feature, a second output feature of the second target-type face action unit sub-feature and a third output feature of the third target-type face action unit sub-feature;
    according to the first output feature, the second output feature and the third output feature, obtaining a recognition result of the first target-type face action unit, a recognition result of the second target-type face action unit and a recognition result of the third target-type face action unit, respectively.
  2. The method according to claim 1, wherein the performing feature extraction on the target face image to be recognized by using the separable convolution block and the inverted residual block of the pre-trained face action unit recognition model comprises:
    inputting the target face image to be recognized into a backbone network of the face action unit recognition model;
    performing feature extraction on the target face image to be recognized through the separable convolution block and the inverted residual block of the backbone network.
  3. The method according to claim 1, wherein the inputting the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into the attention mechanism of the face action unit recognition model for convolution processing, to obtain the first output feature of the first target-type face action unit sub-feature, the second output feature of the second target-type face action unit sub-feature and the third output feature of the third target-type face action unit sub-feature, comprises:
    inputting the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into corresponding branches of the face action unit recognition model, respectively;
    obtaining the first output feature, the second output feature and the third output feature after multiple 1×1 convolution operations performed by the attention mechanism in each branch.
  4. The method according to any one of claims 1 to 3, wherein the obtaining, according to the first output feature, the second output feature and the third output feature, the recognition result of the first target-type face action unit, the recognition result of the second target-type face action unit and the recognition result of the third target-type face action unit, respectively, comprises:
    multiplying the first output feature, the second output feature and the third output feature, element-wise over the width and height dimensions, with the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature, respectively, to obtain a first to-be-classified feature of the first target-type face action unit, a second to-be-classified feature of the second target-type face action unit and a third to-be-classified feature of the third target-type face action unit;
    inputting the first to-be-classified feature, the second to-be-classified feature and the third to-be-classified feature into a fully connected layer of the face action unit recognition model for classification, respectively, to obtain the recognition result of the first target-type face action unit, the recognition result of the second target-type face action unit and the recognition result of the third target-type face action unit, wherein the recognition results are stored in a blockchain.
  5. The method according to any one of claims 1 to 3, wherein the performing face correction on the face image to be recognized comprises:
    performing face detection on the face image to be recognized by using a pre-trained multi-task convolutional neural network model, and locating face key points in the face image to be recognized;
    performing face correction on the face image to be recognized based on the face key points.
  6. The method according to claim 5, wherein the performing face correction on the face image to be recognized based on the face key points comprises:
    comparing coordinate information of the face key points with coordinate information of face key points in a pre-stored standard face image to obtain a similarity transformation matrix T;
    solving the similarity transformation matrix T according to a preset similarity transformation matrix equation;
    multiplying the coordinate information of the face key points by the solved similarity transformation matrix T to obtain the target face image to be recognized.
  7. The method according to claim 1, wherein the first target-type face action unit refers to a pre-divided eye-region face action unit, the second target-type face action unit refers to a pre-divided face-and-nose-region face action unit, and the third target-type face action unit refers to a pre-divided mouth-region face action unit.
  8. A face action unit recognition apparatus, wherein the apparatus comprises:
    a face correction module, configured to acquire a face image to be recognized and perform face correction on the face image to be recognized to obtain a target face image to be recognized;
    a feature extraction module, configured to perform feature extraction on the target face image to be recognized by using a separable convolution block and an inverted residual block of a pre-trained face action unit recognition model, to obtain a first target-type face action unit sub-feature, a second target-type face action unit sub-feature and a third target-type face action unit sub-feature;
    a feature processing module, configured to input the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into an attention mechanism of the face action unit recognition model for convolution processing, to obtain a first output feature of the first target-type face action unit sub-feature, a second output feature of the second target-type face action unit sub-feature and a third output feature of the third target-type face action unit sub-feature; and
    a face action unit classification module, configured to obtain, according to the first output feature, the second output feature and the third output feature, a recognition result of the first target-type face action unit, a recognition result of the second target-type face action unit and a recognition result of the third target-type face action unit, respectively.
  9. An electronic device, wherein the electronic device comprises a processor, a memory, and a computer program stored on the memory and executable on the processor, and the processor, when executing the computer program, implements:
    acquiring a face image to be recognized, and performing face correction on the face image to be recognized to obtain a target face image to be recognized;
    performing feature extraction on the target face image to be recognized by using a separable convolution block and an inverted residual block of a pre-trained face action unit recognition model, to obtain a first target-type face action unit sub-feature, a second target-type face action unit sub-feature and a third target-type face action unit sub-feature;
    inputting the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into an attention mechanism of the face action unit recognition model for convolution processing, to obtain a first output feature of the first target-type face action unit sub-feature, a second output feature of the second target-type face action unit sub-feature and a third output feature of the third target-type face action unit sub-feature;
    according to the first output feature, the second output feature and the third output feature, obtaining a recognition result of the first target-type face action unit, a recognition result of the second target-type face action unit and a recognition result of the third target-type face action unit, respectively.
  10. The electronic device according to claim 9, wherein the processor performing the feature extraction on the target face image to be recognized by using the separable convolution block and the inverted residual block of the pre-trained face action unit recognition model comprises:
    inputting the target face image to be recognized into the backbone network;
    performing feature extraction on the target face image to be recognized through the separable convolution block and the inverted residual block of the backbone network.
  11. The electronic device according to claim 9, wherein the processor performing the inputting of the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into the attention mechanism of the face action unit recognition model for convolution processing, to obtain the first output feature of the first target-type face action unit sub-feature, the second output feature of the second target-type face action unit sub-feature and the third output feature of the third target-type face action unit sub-feature, comprises:
    inputting the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into corresponding branches of the face action unit recognition model, respectively;
    obtaining the first output feature, the second output feature and the third output feature after multiple 1×1 convolution operations performed by the attention mechanism in each branch.
  12. The electronic device according to any one of claims 9 to 11, wherein the processor performing the obtaining, according to the first output feature, the second output feature and the third output feature, of the recognition result of the first target-type face action unit, the recognition result of the second target-type face action unit and the recognition result of the third target-type face action unit, respectively, comprises:
    multiplying the first output feature, the second output feature and the third output feature, element-wise over the width and height dimensions, with the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature, respectively, to obtain a first to-be-classified feature of the first target-type face action unit, a second to-be-classified feature of the second target-type face action unit and a third to-be-classified feature of the third target-type face action unit;
    inputting the first to-be-classified feature, the second to-be-classified feature and the third to-be-classified feature into a fully connected layer of the face action unit recognition model for classification, respectively, to obtain the recognition result of the first target-type face action unit, the recognition result of the second target-type face action unit and the recognition result of the third target-type face action unit, wherein the recognition results are stored in a blockchain.
  13. The electronic device according to any one of claims 9 to 11, wherein the processor performing the face correction on the face image to be recognized comprises:
    performing face detection on the face image to be recognized by using a pre-trained multi-task convolutional neural network model, and locating face key points in the face image to be recognized;
    performing face correction on the face image to be recognized based on the face key points.
  14. The electronic device according to claim 13, wherein the processor performing the face correction on the face image to be recognized based on the face key points comprises:
    comparing coordinate information of the face key points with coordinate information of face key points in a pre-stored standard face image to obtain a similarity transformation matrix T;
    solving the similarity transformation matrix T according to a preset similarity transformation matrix equation;
    multiplying the coordinate information of the face key points by the solved similarity transformation matrix T to obtain the target face image to be recognized.
  15. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements:
    acquiring a face image to be recognized, and performing face correction on the face image to be recognized to obtain a target face image to be recognized;
    performing feature extraction on the target face image to be recognized by using a separable convolution block and an inverted residual block of a pre-trained face action unit recognition model, to obtain a first target-type face action unit sub-feature, a second target-type face action unit sub-feature and a third target-type face action unit sub-feature;
    inputting the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into an attention mechanism of the face action unit recognition model for convolution processing, to obtain a first output feature of the first target-type face action unit sub-feature, a second output feature of the second target-type face action unit sub-feature and a third output feature of the third target-type face action unit sub-feature;
    according to the first output feature, the second output feature and the third output feature, obtaining a recognition result of the first target-type face action unit, a recognition result of the second target-type face action unit and a recognition result of the third target-type face action unit, respectively.
  16. The computer-readable storage medium according to claim 15, wherein the computer program, when executed by the processor, further implements:
    inputting the target face image to be recognized into the backbone network;
    performing feature extraction on the target face image to be recognized through the separable convolution block and the inverted residual block of the backbone network.
  17. The computer-readable storage medium according to claim 15, wherein the computer program, when executed by the processor, further implements:
    inputting the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature into corresponding branches of the face action unit recognition model, respectively;
    obtaining the first output feature, the second output feature and the third output feature after multiple 1×1 convolution operations performed by the attention mechanism in each branch.
  18. The computer-readable storage medium according to any one of claims 15 to 17, wherein the computer program, when executed by the processor, further implements:
    multiplying the first output feature, the second output feature and the third output feature, element-wise over the width and height dimensions, with the first target-type face action unit sub-feature, the second target-type face action unit sub-feature and the third target-type face action unit sub-feature, respectively, to obtain a first to-be-classified feature of the first target-type face action unit, a second to-be-classified feature of the second target-type face action unit and a third to-be-classified feature of the third target-type face action unit;
    inputting the first to-be-classified feature, the second to-be-classified feature and the third to-be-classified feature into a fully connected layer of the face action unit recognition model for classification, respectively, to obtain the recognition result of the first target-type face action unit, the recognition result of the second target-type face action unit and the recognition result of the third target-type face action unit, wherein the recognition results are stored in a blockchain.
  19. The computer-readable storage medium according to any one of claims 15 to 17, wherein the computer program, when executed by the processor, further implements:
    performing face detection on the face image to be recognized by using a pre-trained multi-task convolutional neural network model, and locating face key points in the face image to be recognized;
    performing face correction on the face image to be recognized based on the face key points.
  20. The computer-readable storage medium according to claim 19, wherein the computer program, when executed by the processor, further implements:
    comparing coordinate information of the face key points with coordinate information of face key points in a pre-stored standard face image to obtain a similarity transformation matrix T;
    solving the similarity transformation matrix T according to a preset similarity transformation matrix equation;
    multiplying the coordinate information of the face key points by the solved similarity transformation matrix T to obtain the target face image to be recognized.
PCT/CN2020/104042 2020-04-29 2020-07-24 Facial action unit recognition method and apparatus, and electronic device, and storage medium WO2021217919A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010359833.2 2020-04-29
CN202010359833.2A CN111639537A (en) 2020-04-29 2020-04-29 Face action unit identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021217919A1 true WO2021217919A1 (en) 2021-11-04

Family

ID=72332439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/104042 WO2021217919A1 (en) 2020-04-29 2020-07-24 Facial action unit recognition method and apparatus, and electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN111639537A (en)
WO (1) WO2021217919A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631525B (en) * 2022-10-26 2023-06-23 万才科技(杭州)有限公司 Face edge point identification-based insurance instant matching method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019213459A1 (en) * 2018-05-04 2019-11-07 Northeastern University System and method for generating image landmarks
CN110399788A (en) * 2019-06-13 2019-11-01 平安科技(深圳)有限公司 AU detection method, device, electronic equipment and the storage medium of image
CN110427867A (en) * 2019-07-30 2019-11-08 华中科技大学 Human facial expression recognition method and system based on residual error attention mechanism
CN110889325A (en) * 2019-10-12 2020-03-17 平安科技(深圳)有限公司 Multitask facial motion recognition model training and multitask facial motion recognition method
CN110929603A (en) * 2019-11-09 2020-03-27 北京工业大学 Weather image identification method based on lightweight convolutional neural network
CN111310705A (en) * 2020-02-28 2020-06-19 深圳壹账通智能科技有限公司 Image recognition method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114025198A (en) * 2021-11-08 2022-02-08 深圳万兴软件有限公司 Video cartoon method, device, equipment and medium based on attention mechanism
CN114025198B (en) * 2021-11-08 2023-06-27 深圳万兴软件有限公司 Video cartoon method, device, equipment and medium based on attention mechanism

Also Published As

Publication number Publication date
CN111639537A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
Rao et al. Deep convolutional neural networks for sign language recognition
Sun et al. Lattice long short-term memory for human action recognition
WO2020125623A1 (en) Method and device for live body detection, storage medium, and electronic device
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
Haider et al. Deepgender: real-time gender classification using deep learning for smartphones
JP7386545B2 (en) Method for identifying objects in images and mobile device for implementing the method
WO2022000420A1 (en) Human body action recognition method, human body action recognition system, and device
WO2021196389A1 (en) Facial action unit recognition method and apparatus, electronic device, and storage medium
WO2019227479A1 (en) Method and apparatus for generating face rotation image
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
TW202038191A (en) Method, device and electronic equipment for living detection and storage medium thereof
CN110222718B (en) Image processing method and device
CN111292262B (en) Image processing method, device, electronic equipment and storage medium
WO2021169754A1 (en) Photographic composition prompting method and apparatus, storage medium, and electronic device
WO2021047587A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
CN111108508B (en) Face emotion recognition method, intelligent device and computer readable storage medium
WO2021217919A1 (en) Facial action unit recognition method and apparatus, and electronic device, and storage medium
WO2022052782A1 (en) Image processing method and related device
CN113361387A (en) Face image fusion method and device, storage medium and electronic equipment
Neverova Deep learning for human motion analysis
CN112528978B (en) Face key point detection method and device, electronic equipment and storage medium
Das et al. A fusion of appearance based CNNs and temporal evolution of skeleton with LSTM for daily living action recognition
Vernikos et al. Fusing handcrafted and contextual features for human activity recognition
CN117237547A (en) Image reconstruction method, reconstruction model processing method and device
WO2023142886A1 (en) Expression transfer method, model training method, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20932996

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.02.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20932996

Country of ref document: EP

Kind code of ref document: A1