WO2021196389A1 - Facial action unit recognition method and apparatus, electronic device, and storage medium

Facial action unit recognition method and apparatus, electronic device, and storage medium

Info

Publication number
WO2021196389A1
Authority
WO
WIPO (PCT)
Prior art keywords
recognized
face image
face
feature map
key points
Application number
PCT/CN2020/092805
Other languages
French (fr)
Chinese (zh)
Inventor
胡艺飞 (Hu Yifei)
徐国强 (Xu Guoqiang)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021196389A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Definitions

  • This application relates to the field of computer vision technology, and in particular to a facial action unit recognition method, apparatus, electronic device, and storage medium.
  • Facial expression recognition and facial emotion analysis are currently popular areas of computer vision research, and the results of these studies depend to varying degrees on the recognition accuracy of facial action units (AUs).
  • A facial action unit refers to a muscle action at a specific part of the face, such as blinking, frowning, or pouting, and facial action unit recognition determines whether such actions appear.
  • With the development of computer information technology, deep learning has come to be widely applied to facial action unit recognition, that is, recognition is performed by constructing a network model.
  • However, the inventor realized that most existing facial action unit recognition models support only a small number of facial action units and describe subtle facial expression changes rather coarsely.
  • In addition, when the face in a picture is at a different rotation angle, when the picture contains interference information that does not affect the face, or when certain attributes of the picture are changed, the output of the facial action unit recognition model is affected, resulting in lower recognition accuracy.
  • The embodiments of the present application provide a facial action unit recognition method, apparatus, electronic device, and storage medium, which help improve the accuracy of facial action unit recognition in face images.
  • In the first aspect, an embodiment of the present application provides a facial action unit recognition method, which includes:
  • acquiring the first face image to be recognized uploaded by the terminal;
  • using a pre-trained convolutional neural network model to perform face detection on the first face image to be recognized, to obtain position information of the face key points in the first face image to be recognized;
  • performing face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized;
  • inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and
  • outputting the facial action unit recognition result of the first face image to be recognized to the terminal.
  • In the second aspect, an embodiment of the present application provides a facial action unit recognition apparatus, which includes:
  • the image acquisition module is used to acquire the first face image to be recognized uploaded by the terminal;
  • a face detection module configured to use a pre-trained convolutional neural network model to perform face detection on the first face image to be recognized, to obtain position information of key points of the face in the first face image to be recognized;
  • a face correction module configured to perform face correction on the first face image to be recognized by using the position information of the key points of the face to obtain a second face image to be recognized;
  • the facial action unit recognition module, configured to input the second face image to be recognized into a pre-trained facial action unit recognition model and, through the processing of the model's main body network part, attention mechanism, and fully connected layer, obtain the facial action unit recognition result of the first face image to be recognized, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and
  • the recognition result output module is configured to output the facial action unit recognition result of the first face image to be recognized to the terminal.
  • In the third aspect, an embodiment of the present application provides an electronic device that includes an input device, an output device, and a processor adapted to implement one or more instructions; and a computer-readable storage medium.
  • The computer-readable storage medium stores one or more instructions, and the one or more instructions are suitable for being loaded by the processor to execute the following steps:
  • acquiring the first face image to be recognized uploaded by the terminal; performing face detection on it with a pre-trained convolutional neural network model to obtain position information of the face key points; performing face correction using that position information to obtain a second face image to be recognized;
  • inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and outputting the recognition result to the terminal.
  • In the fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores one or more instructions suitable for being loaded by a processor to execute the following steps:
  • acquiring the first face image to be recognized uploaded by the terminal; performing face detection on it with a pre-trained convolutional neural network model to obtain position information of the face key points; performing face correction using that position information to obtain a second face image to be recognized;
  • inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and outputting the recognition result to the terminal.
  • In the embodiments of this application, when the terminal uploads a first face image to be recognized, the position information of the face key points in that image is first obtained and used to correct the face so as to straighten it.
  • The straightened second face image to be recognized is then input into the facial action unit recognition model composed of the main body network part, the attention mechanism module, and the fully connected layer for recognition.
  • The resulting facial action unit recognition is more accurate than in the prior art.
  • FIG. 1 is a network architecture diagram provided by an embodiment of the application;
  • FIG. 2a is an example diagram of acquiring a face image provided by an embodiment of the application;
  • FIG. 2b is another example diagram of acquiring a face image provided by an embodiment of the application;
  • FIG. 3 is a schematic flowchart of a facial action unit recognition method provided by an embodiment of the application;
  • FIG. 4 is a schematic structural diagram of a convolutional neural network model provided by an embodiment of the application;
  • FIG. 5 is a schematic structural diagram of a facial action unit recognition model provided by an embodiment of the application;
  • FIG. 6 is a schematic structural diagram of a deep residual dense network provided by an embodiment of the application;
  • FIG. 7 is a schematic flowchart of another facial action unit recognition method provided by an embodiment of the application;
  • FIG. 8 is a schematic structural diagram of a facial action unit recognition apparatus provided by an embodiment of the application;
  • FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • The embodiments of this application propose a facial action unit recognition scheme that can be applied in many scenarios, such as face-to-face review when handling business (for example, loan or insurance business), customer expression analysis, and psychological activity analysis.
  • The facial action unit recognition model used in the scheme combines a deep residual network and a deep dense network, which ensures that high-order features can be learned, thereby improving the accuracy of facial action unit recognition on the face image input by the terminal.
  • At the same time, because the features of different facial action units are similar at the low-order feature stage, training a separate model for each facial action unit would produce a large amount of repetitive work. This scheme instead branches the facial action unit recognition model at the high-order feature stage, so that a single trained model can recognize 39 facial action units.
  • This reduces the difficulty of deploying the facial action unit recognition model on a device and increases the model's running speed.
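  • As an illustration of this multi-label design, the following is a minimal PyTorch training-step sketch, assuming a model that emits one logit per action unit; the tensor shapes, loss choice, and helper name are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

NUM_AUS = 39  # number of facial action units the single model recognizes

def training_step(model, optimizer, images, au_labels):
    """One multi-label training step (illustrative sketch).

    images:    float tensor of shape (batch, 3, H, W), RGB input
    au_labels: float tensor of shape (batch, NUM_AUS); 1.0 if the AU
               appears in the image, 0.0 otherwise
    """
    logits = model(images)  # (batch, NUM_AUS), one logit per action unit
    # One binary cross-entropy term per action unit: a single model learns
    # all 39 AUs jointly instead of training 39 separate models.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, au_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```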
  • Specifically, the scheme can be implemented based on the network architecture shown in FIG. 1.
  • As shown in FIG. 1, the network architecture includes at least a terminal and a server.
  • The terminal and the server communicate through a network, which includes but is not limited to a virtual private network, a local area network, or a metropolitan area network. The terminal can collect face images directly, or it can rely on an external image collection tool and then obtain the face images from that tool.
  • The terminal can be a mobile phone, tablet, laptop computer, handheld computer, or similar device.
  • In some embodiments of the present application, as shown in FIG. 2a, the terminal can automatically complete the collection of a face image when a face is detected and then send the collected face image to the server.
  • In other embodiments, as shown in FIG. 2b, the terminal can start collecting face images only after a control on the screen is triggered, and then send the collected face images to the server.
  • The control can appear in a fixed form or in a floating form, and the triggering method can be a light touch, long press, slide, and so on, which is not limited here.
  • After the server obtains the face image sent by the terminal, its processor performs a series of operations such as face key point detection, face correction, and calling the facial action unit recognition model for facial action unit recognition, and finally outputs the recognition result to the terminal for presentation to the user.
  • The server can be a single server, a server cluster, or a cloud server, and is the execution body of the entire facial action unit recognition scheme. It can be seen that the network architecture shown in FIG. 1 enables this scheme to be implemented; of course, the network architecture can also include more components, such as a database.
  • Referring to FIG. 3, FIG. 3 is a schematic flowchart of a facial action unit recognition method according to an embodiment of the application. As shown in FIG. 3, the method includes steps S31-S35:
  • S31: Acquire the first face image to be recognized uploaded by the terminal.
  • In this embodiment, the first face image to be recognized is the original face image uploaded by the terminal, without face detection or face correction. It can be a face image from any open-source database at home or abroad, a customer's face image collected by a bank, insurance company, communication company, or similar institution when handling business, or an image collected by monitoring equipment in any monitored area such as a residential community or a shopping mall.
  • S32: Use a pre-trained convolutional neural network model to perform face detection on the first face image to be recognized, to obtain position information of the face key points in the first face image to be recognized.
  • In this embodiment, the face key points are the five key points of the detected face: the two eyes, the nose, and the left and right corners of the mouth.
  • The position information consists of the coordinates of the key points, for example: the coordinates of the centers of the two eye ellipses, the coordinates of the nose tip, and the coordinates of the left and right mouth corners.
  • The pre-trained convolutional neural network model refers to a Multi-task Cascaded Convolutional Network (MTCNN), which, as shown in FIG. 4, uses a three-layer cascade architecture combined with convolutional neural network algorithms to perform face detection and key point localization, and includes the neural networks P-Net, R-Net, and O-Net.
  • The first face image to be recognized is first input into P-Net for recognition; the output of P-Net serves as the input of R-Net, and the output of R-Net in turn serves as the input of O-Net.
  • The input size of each network is different: P-Net takes 12*12*3, R-Net takes 24*24*3, and O-Net takes 48*48*3. The processing in P-Net is mainly 3*3 convolution and 2*2 pooling; the processing in R-Net is mainly 3*3 convolution, 3*3 pooling, and 2*2 pooling; and O-Net performs similar 3*3 convolution and 2*2 pooling but with one more convolutional layer than R-Net.
  • After each network, a face classifier judges whether a region is a face, while bounding-box regression and a key point locator detect the face region.
  • Specifically, the processing of the multi-task cascaded convolutional network is as follows: the first face image to be recognized is input into P-Net for recognition to obtain first candidate windows and bounding regression boxes; the first candidate windows are calibrated according to the bounding regression boxes, and non-maximum suppression removes the overlapping calibrated first candidate windows to obtain second candidate windows; the second candidate windows are input into R-Net for recognition, and false second candidate windows are filtered out to obtain third candidate windows; the third candidate windows are input into O-Net for recognition, which outputs the face region through bounding-box regression and outputs the position information of the face key points in the first face image to be recognized through key point localization.
  • It should be noted that P-Net does not use a fully connected layer, whereas R-Net and O-Net use 128-channel and 256-channel fully connected layers respectively, and O-Net has one more layer of convolution processing than R-Net.
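  • As a concrete illustration, the open-source `mtcnn` Python package implements the same P-Net/R-Net/O-Net cascade and returns the five key points described above. The following sketch assumes that package and an illustrative input file name; it is not the patent's own implementation.

```python
import cv2
from mtcnn import MTCNN  # pip install mtcnn; an open-source MTCNN implementation

detector = MTCNN()
# "face.jpg" is an assumed input path; MTCNN expects an RGB array.
image = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2RGB)

# detect_faces runs the P-Net -> R-Net -> O-Net cascade and returns, per face,
# a bounding box plus the five key points used here: both eyes, the nose tip,
# and the two mouth corners.
for face in detector.detect_faces(image):
    box = face["box"]              # [x, y, width, height] of the face region
    keypoints = face["keypoints"]  # dict: left_eye, right_eye, nose,
                                   #       mouth_left, mouth_right -> (x, y)
    print(box, keypoints)
```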
  • S33: Perform face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized.
  • In this embodiment, the second face image to be recognized is the straightened face image obtained after face correction, which involves scaling, rotation, translation, and so on, is performed on the first face image to be recognized.
  • Specifically, the position information of the face key points in the first face image to be recognized, obtained with MTCNN, and the position information of the face key points in a pre-stored standard face image are first acquired.
  • The standard face image is an image in which the face is frontal and the head is not tilted, that is, a face that does not need correction.
  • The position information (coordinate information) of the face key points in the standard face image has been obtained in advance and stored in a preset database.
  • The position information of the face key points in the first face image to be recognized is compared with the position information of the face key points in the standard face image to obtain a similarity transformation matrix H, which is solved according to the following similarity transformation equation:

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}, \qquad H = \begin{pmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix}$$

  • where (x, y) represents the position information of a face key point in the first face image to be recognized, (x', y') represents the position information of the corresponding face key point in the standard face image, s represents the scaling factor, θ represents the rotation angle (usually a counterclockwise rotation), and (t_x, t_y) represents the translation parameters.
  • The position information of each pixel in the first face image to be recognized is then multiplied by the solved similarity transformation matrix H, to obtain the second face image to be recognized with the face straightened.
  • In a specific implementation, the SimilarityTransform class in the transform module of the Python scikit-image library (an image processing library) can be used to solve for the similarity transformation matrix H.
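  • A minimal sketch of this correction step using scikit-image follows; the five landmark coordinate pairs are illustrative placeholder values, and estimate() performs the least-squares solve for H described above.

```python
import numpy as np
from skimage import transform

# Five detected key points (x, y) in the first face image to be recognized,
# e.g. as returned by MTCNN: eyes, nose tip, mouth corners (placeholder values).
src = np.array([[71, 80], [121, 78], [96, 110], [78, 135], [118, 133]], dtype=float)
# Corresponding key points in the pre-stored standard (frontal) face image
# (placeholder values for illustration).
dst = np.array([[70, 75], [122, 75], [96, 108], [77, 138], [115, 138]], dtype=float)

tform = transform.SimilarityTransform()
tform.estimate(src, dst)  # solves for scale s, rotation angle and (tx, ty)
H = tform.params          # the 3x3 similarity transformation matrix H

def straighten(image):
    # Warp every pixel of the input image (H x W x 3 RGB array) with H to
    # obtain the straightened second face image to be recognized.
    return transform.warp(image, tform.inverse)
```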
  • S34: Input the second face image to be recognized into the pre-trained facial action unit recognition model, and obtain the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer; the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network.
  • In this embodiment, the structure of the facial action unit recognition model is shown in FIG. 5 and mainly includes the main body network part, the attention mechanism module, and the final fully connected layer.
  • The input of the model is a color image in RGB format, that is, the image depth of the input is 3, and the recognition result of the model is the probability value of each of the 39 facial action units appearing: a probability greater than or equal to 0.5 means the facial action unit appears, and a probability less than 0.5 means it does not.
  • For example, if the output value of AU45 (blink) is 0.8 and the output value of AU04 (frown) is 0.3, the face in the input image exhibits AU45 but not AU04.
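  • In other words, the model's 39 output probabilities are thresholded at 0.5. A minimal sketch mirroring the example above (only two of the 39 entries are shown):

```python
# Map each output probability to a present/absent decision; the two entries
# below mirror the example in the text, and the other 37 AUs are analogous.
probabilities = {"AU45": 0.8, "AU04": 0.3}  # AU45 = blink, AU04 = frown

detected = [au for au, p in probabilities.items() if p >= 0.5]
print(detected)  # ['AU45'] -> the face shows AU45 (blink) but not AU04 (frown)
```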
  • In some embodiments, inputting the aforementioned second face image to be recognized into the pre-trained facial action unit recognition model and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer includes:
  • inputting the second face image to be recognized into the main body network part for feature extraction to obtain a high-order feature map; performing maximum pooling and average pooling on the high-order feature map to obtain a first feature map and a second feature map;
  • splicing the first feature map and the second feature map in the depth direction, and performing a 1*1 convolution on the spliced feature map to obtain a third feature map; multiplying the width and height of the third feature map correspondingly with the width and height of the high-order feature map to obtain a target feature map; and
  • using the target feature map as the input of the fully connected layer, which performs binary classification and finally outputs the facial action unit recognition result of the first face image to be recognized.
  • In this embodiment, the main body network part of the facial action unit recognition model is composed of four deep residual dense networks, with a total of 92 hidden layers. As shown in FIG. 6, each deep residual dense network is formed by stacking a deep residual module and a deep dense module.
  • A deep residual dense network starts with a 1*1 convolutional layer, followed by a 3*3 convolutional layer, and after the last 1*1 convolutional layer it is divided into two parts. One part is connected to the deep residual module in an additive manner over the corresponding width and height, exploiting the property of residual networks that good features already learned are not forgotten as the network deepens; for example, the features obtained by the second hidden layer are added to the features obtained by the fifth hidden layer in the width and height dimensions, with the depth dimension unchanged.
  • The other part is connected to the path of the deep dense module: the depth dimension of the features obtained by the second hidden layer is spliced with the depth of the features obtained by the fifth hidden layer; for example, if each has a depth of 25, the spliced feature depth will be 50, while the width and height remain unchanged.
  • The main body network part thus combines a deep residual network and a deep dense network; compared with prior art that uses only a deep residual network, this is more conducive to maintaining the diversity of high-order features and to accurately recognizing the 39 facial action units.
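  • The following PyTorch sketch shows one way to realize such a deep residual dense block: a 1*1 / 3*3 / 1*1 convolution stem whose output feeds both an element-wise residual addition (depth unchanged) and a depth-wise concatenation. The channel counts and the final fusion layer are assumptions for illustration, since the patent does not publish exact hyperparameters.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Sketch of one deep residual dense network: a 1*1 / 3*3 / 1*1 stem whose
    output feeds a residual (additive) path and a dense (concatenation) path."""

    def __init__(self, channels: int):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),             # 1*1 conv
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),  # 3*3 conv
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),             # last 1*1 conv
        )
        # Fuse the two paths back to `channels` so blocks can be stacked; this
        # fusion layer is an assumption, as the patent does not spell out how
        # the two paths are merged again.
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.stem(x)
        residual = x + out                  # residual path: element-wise addition
                                            # over width/height, depth unchanged
        dense = torch.cat([x, out], dim=1)  # dense path: splice along depth,
                                            # e.g. 25 + 25 -> 50 channels
        return self.fuse(torch.cat([residual, dense], dim=1))

# e.g. ResidualDenseBlock(64) maps (N, 64, H, W) to (N, 64, H, W), so four such
# blocks can be stacked to form the main body network part.
```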
  • The function of the attention mechanism module is to assign weights to the high-order features extracted by the main body network part so that these high-order features can be recombined. It uses a combination of maximum pooling, average pooling, and 1*1 convolution. Its input is the output of the main body network part; after maximum pooling and average pooling, two feature maps with the same width and height as the input feature map and a depth of 1 are obtained, namely the first feature map and the second feature map. The two feature maps are spliced in depth, and the output feature map of the attention mechanism module, that is, the third feature map, is obtained through a 1*1 convolution.
  • The width and height of this output feature map are multiplied correspondingly with the width and height of the input feature map of the attention mechanism module (that is, the high-order feature map) to obtain the input feature map of the fully connected layer, that is, the target feature map.
  • The fully connected layer performs binary classification and outputs the two-class probability value of each facial action unit; finally, the two-class probability values of the 39 facial action units are output to the terminal, where the facial action unit recognition result of the first image to be recognized is displayed.
  • Using maximum pooling and average pooling at different scales helps capture feature information of different scales, focusing on obtaining weights over the width and height dimensions and clarifying which positions of the input face carry more feature information, which benefits facial action unit recognition.
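  • A PyTorch sketch of this attention module follows: the depth-wise maximum and average of the high-order feature map give two maps of depth 1, which are spliced and passed through a 1*1 convolution to produce a spatial weight map that multiplies the input over its width and height. The sigmoid used to squash the weights is an assumed choice, not stated in the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the attention module: max pooling and average pooling along
    the depth dimension, depth-wise splicing, then a 1*1 convolution."""

    def __init__(self):
        super().__init__()
        # 2 input channels (max map + average map) -> 1 output weight map.
        self.conv = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (N, C, H, W) high-order feature map from the main network.
        max_map = features.max(dim=1, keepdim=True).values  # first feature map, depth 1
        avg_map = features.mean(dim=1, keepdim=True)        # second feature map, depth 1
        spliced = torch.cat([max_map, avg_map], dim=1)      # splice in depth: (N, 2, H, W)
        weights = torch.sigmoid(self.conv(spliced))         # third feature map as weights
                                                            # (sigmoid is an assumption)
        # Multiply over width and height with the input feature map to obtain
        # the target feature map that feeds the fully connected layer.
        return features * weights
```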
  • In the embodiments of this application, when the terminal uploads a first face image to be recognized, the position information of the face key points in that image is first obtained and used to correct the face so as to straighten it.
  • The straightened second face image to be recognized is then input into the facial action unit recognition model composed of the main body network part, the attention mechanism module, and the fully connected layer for recognition.
  • The resulting facial action unit recognition is more accurate than in the prior art.
  • Referring to FIG. 7, FIG. 7 is a schematic flowchart of another facial action unit recognition method provided by an embodiment of the application. As shown in FIG. 7, the method includes steps S71-S76, of which S71-S73 correspond to acquiring the first face image to be recognized, performing face detection on it to obtain the position information of its face key points, and obtaining the position information of the face key points in the pre-stored standard face image.
  • S74: Perform face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image, to obtain the second face image to be recognized.
  • In some embodiments, performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image, to obtain the second face image to be recognized, includes:
  • comparing the two sets of position information to solve for the similarity transformation matrix H; and
  • multiplying the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H, to obtain the straightened second face image to be recognized.
  • In this embodiment, the key points detected by MTCNN are used to perform face correction, so that the model can judge accurately when the face in the first face image to be recognized is rotated at different angles, which ensures the stability of the model.
  • S75: Input the second face image to be recognized into the pre-trained facial action unit recognition model, and obtain the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer; the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network.
  • In some embodiments, this processing includes: inputting the second face image to be recognized into the main body network part for feature extraction to obtain a high-order feature map; performing maximum pooling and average pooling on the high-order feature map to obtain a first feature map and a second feature map; and obtaining a target feature map according to the first feature map and the second feature map.
  • Obtaining the target feature map according to the first feature map and the second feature map includes: splicing the two feature maps in the depth direction, performing a 1*1 convolution on the spliced feature map to obtain a third feature map, and multiplying the width and height of the third feature map correspondingly with the width and height of the high-order feature map.
  • The target feature map is then input into the fully connected layer of the facial action unit recognition model for binary classification, and the facial action unit recognition result of the first face image to be recognized is output.
  • In some embodiments, inputting the second face image to be recognized into the main body network part for feature extraction to obtain the high-order feature map includes:
  • inputting the second face image to be recognized into the main body network part and performing feature extraction through the plurality of deep residual dense networks to obtain the high-order feature map. In each deep residual dense network, the convolution processing starts with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and another 1*1 convolutional layer, after which the processing is divided into two parts: one part is connected to the deep residual network, where the features output by two hidden layers are added in the width and height dimensions with the depth unchanged; the other part is connected to the path of the deep dense network, where the features output by two hidden layers are spliced in depth with the width and height unchanged.
  • In this embodiment, the main body network part of the facial action unit recognition model is formed by stacking deep residual networks and deep dense networks to ensure that high-order features are learned; together with the attention mechanism module composed of maximum pooling, average pooling, and 1*1 convolution, which helps remove redundant features, this improves the recognition accuracy of the 39 facial action units.
  • S76: Output the facial action unit recognition result of the first face image to be recognized to the terminal.
  • The present application also provides a facial action unit recognition apparatus.
  • The facial action unit recognition apparatus may be a computer program (including program code) running in a terminal.
  • The facial action unit recognition apparatus can execute the method shown in FIG. 3 or FIG. 7.
  • Referring to FIG. 8, the apparatus includes:
  • the image acquisition module 81, configured to acquire the first face image to be recognized uploaded by the terminal;
  • the face detection module 82, configured to use a pre-trained convolutional neural network model to perform face detection on the first face image to be recognized, to obtain position information of the face key points in the first face image to be recognized;
  • the face correction module 83, configured to perform face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized;
  • the facial action unit recognition module 84, configured to input the second face image to be recognized into a pre-trained facial action unit recognition model and, through the processing of the model's main body network part, attention mechanism, and fully connected layer, obtain the facial action unit recognition result of the first face image to be recognized, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and
  • the recognition result output module 85, configured to output the facial action unit recognition result of the first face image to be recognized to the terminal.
  • In some embodiments, the face correction module 83 is specifically configured to:
  • compare the position information of the face key points in the first face image to be recognized with the position information of the face key points in the standard face image to solve for the similarity transformation matrix H; and
  • multiply the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H, to obtain the straightened second face image to be recognized.
  • In some embodiments, the facial action unit recognition module 84 is specifically configured to:
  • input the second face image to be recognized into the main body network part for feature extraction to obtain a high-order feature map; perform maximum pooling and average pooling on the high-order feature map to obtain a first feature map and a second feature map; splice the two feature maps in the depth direction and perform a 1*1 convolution to obtain a third feature map; and
  • obtain the target feature map by correspondingly multiplying the width and height of the third feature map with the width and height of the high-order feature map.
  • In some embodiments, the facial action unit recognition module 84 is specifically configured to:
  • input the second face image to be recognized into the main body network part and perform feature extraction through the plurality of deep residual dense networks to obtain the high-order feature map. In each deep residual dense network, the convolution processing starts with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and another 1*1 convolutional layer, after which the processing is divided into two parts: one part is connected to the deep residual network, where the features output by two hidden layers are added in the width and height dimensions with the depth unchanged; the other part is connected to the path of the deep dense network, where the features output by two hidden layers are spliced in depth with the width and height unchanged.
  • The various modules of the facial action unit recognition apparatus shown in FIG. 8 can be separately or completely combined into one or several other units, or some of the modules can be split into multiple functionally smaller units. This can achieve the same operation without affecting the realization of the technical effects of the embodiments of the present application.
  • The above units are divided based on logical functions.
  • In practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit.
  • In other embodiments, the facial action unit recognition apparatus may also include other units; in practical applications, these functions may be implemented with the assistance of other units and may be implemented by multiple units in cooperation.
  • In other embodiments of the present application, the facial action unit recognition apparatus as shown in FIG. 8 may be constructed by running a computer program (including program code) capable of executing the steps of the corresponding method shown in FIG. 3 or FIG. 7 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access storage medium (RAM), and a read-only storage medium (ROM), thereby implementing the facial action unit recognition method of the embodiments of the present application.
  • The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above-mentioned computing device through the computer-readable recording medium, and run therein.
  • Referring to FIG. 9, FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the application. As shown in FIG. 9, the electronic device includes a processor 901, an input device 902, an output device 903, and a computer-readable storage medium 904, which may be connected by a bus or in other ways.
  • The computer-readable storage medium 904 may be stored in the memory of the electronic device.
  • The computer-readable storage medium 904 is used to store a computer program, and the computer program includes program instructions.
  • The processor 901 is used to execute the program instructions stored in the computer-readable storage medium 904.
  • The processor 901 (or CPU, Central Processing Unit) is the computing core and control core of the electronic device; it is suitable for implementing one or more instructions, and specifically suitable for loading and executing one or more instructions to realize the corresponding method flow or corresponding function.
  • In some embodiments, the processor 901 of the electronic device provided in the embodiments of the present application may be used to perform a series of facial action unit recognition operations on an acquired face image:
  • acquiring the first face image to be recognized uploaded by the terminal; performing face detection on it with a pre-trained convolutional neural network model to obtain position information of the face key points; performing face correction using that position information to obtain a second face image to be recognized;
  • inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and outputting the recognition result to the terminal.
  • In some embodiments, when the processor 901 performs face correction on the first face image to be recognized by using the position information of the face key points to obtain the second face image to be recognized, the processing includes:
  • comparing the position information of the face key points in the first face image to be recognized with the position information of the face key points in the standard face image to solve for the similarity transformation matrix H; and
  • multiplying the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H, to obtain the straightened second face image to be recognized.
  • In some embodiments, when the processor 901 inputs the second face image to be recognized into the pre-trained facial action unit recognition model and obtains the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer, the processing includes: inputting the second face image to be recognized into the main body network part for feature extraction to obtain a high-order feature map; performing maximum pooling and average pooling on the high-order feature map to obtain a first feature map and a second feature map; and obtaining a target feature map according to the first feature map and the second feature map.
  • When the processor 901 obtains the target feature map according to the first feature map and the second feature map, the processing includes: splicing the two feature maps in the depth direction and performing a 1*1 convolution to obtain a third feature map; and
  • obtaining the target feature map by correspondingly multiplying the width and height of the third feature map with the width and height of the high-order feature map.
  • In some embodiments, when the processor 901 inputs the second face image to be recognized into the main body network part for feature extraction to obtain the high-order feature map, the processing includes:
  • inputting the second face image to be recognized into the main body network part and performing feature extraction through the plurality of deep residual dense networks to obtain the high-order feature map. In each deep residual dense network, the convolution processing starts with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and another 1*1 convolutional layer, after which the processing is divided into two parts: one part is connected to the deep residual network, where the features output by two hidden layers are added in the width and height dimensions with the depth unchanged; the other part is connected to the path of the deep dense network, where the features output by two hidden layers are spliced in depth with the width and height unchanged.
  • the above-mentioned electronic device may be a server, a computer host, a cloud server and other devices.
  • The electronic device may include, but is not limited to, a processor 901, an input device 902, an output device 903, and a computer-readable storage medium 904.
  • Those skilled in the art can understand that the schematic diagram is only an example of the electronic device and does not constitute a limitation on it; the electronic device may include more or fewer components than shown in the figure, a combination of certain components, or different components.
  • Since the processor 901 of the electronic device executes the computer program to implement the steps of the facial action unit recognition method described above, the embodiments of the facial action unit recognition method are all applicable to the electronic device and can achieve the same or similar beneficial effects.
  • the embodiment of the present application also provides a computer-readable storage medium (Memory).
  • the computer-readable storage medium is a memory device in an electronic device for storing programs and data. It can be understood that the computer-readable storage medium herein may include a built-in storage medium in the terminal, and of course, may also include an extended storage medium supported by the terminal.
  • the computer-readable storage medium provides storage space, and the storage space stores the operating system of the terminal.
  • one or more instructions suitable for being loaded and executed by the processor 901 are stored in the storage space, and these instructions may be one or more computer programs (including program codes).
  • The computer-readable storage medium here can be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it can also be at least one computer-readable storage medium located far away from the aforementioned processor 901.
  • In a specific implementation, the processor 901 can load and execute one or more instructions stored in the computer-readable storage medium to implement the corresponding steps of the facial action unit recognition method described above; specifically, the one or more instructions in the computer-readable storage medium are loaded by the processor 901 to execute the following steps:
  • acquiring the first face image to be recognized uploaded by the terminal; performing face detection on it with a pre-trained convolutional neural network model to obtain position information of the face key points; performing face correction using that position information to obtain a second face image to be recognized;
  • inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and outputting the recognition result to the terminal.
  • the position information of each pixel in the first face image to be recognized is multiplied by the similarity transformation matrix H obtained after the solution, to obtain the second face image to be recognized that is straightened.
  • the target feature map is obtained by correspondingly multiplying the width and height of the third feature map with the width and height of the higher-order feature map.
  • the second face image to be recognized is input into the main body network part, and feature extraction is performed through the plurality of deep residual dense networks to obtain the high-order feature map. In each deep residual dense network, the convolution processing starts with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and another 1*1 convolutional layer, after which the processing is divided into two parts: one part is connected to the deep residual network, where the features output by two hidden layers are added in the width and height dimensions with the depth unchanged; the other part is connected to the path of the deep dense network, where the features output by two hidden layers are spliced in depth with the width and height unchanged.
  • the computer program in the computer-readable storage medium includes computer program code
  • the computer program code may be in the form of source code, object code, executable file, or some intermediate form, etc.
  • The computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on.
  • The program can be stored in a computer-readable storage medium, and when the program is executed, the procedures of the above-mentioned method embodiments may be included.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Abstract

A facial action unit recognition method and apparatus, an electronic device and a storage medium. Said method comprises: acquiring a first facial image to be recognized, uploaded by a terminal; using a pre-trained convolutional neural network model to perform face detection on said first facial image, to obtain position information of key facial points in said first facial image; using the position information of key facial points to perform face correction on said first facial image, to obtain a second facial image to be recognized; inputting said second facial image into a pre-trained facial action unit recognition model, and processing same by means of a main network part of the facial action unit recognition model, an attention mechanism and a fully-connected layer, so as to obtain a facial action unit recognition result of said first facial image; and outputting to the terminal the facial action unit recognition result of said first facial image. Said method is beneficial for improving the accuracy of recognition of a facial action unit in a facial image.

Description

Facial action unit recognition method, apparatus, electronic device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 3, 2020, with application number 202010262740.8 and entitled "Facial Action Unit Recognition Method, Apparatus, Electronic Equipment, and Storage Medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of computer vision technology, and in particular to a facial action unit recognition method, apparatus, electronic device, and storage medium.
Background
Facial expression recognition and facial emotion analysis are currently popular areas of computer vision research, and the results of these studies depend to varying degrees on the recognition accuracy of facial action units (AUs). A facial action unit refers to a muscle action at a specific part of the face, such as blinking, frowning, or pouting, and facial action unit recognition determines whether such actions appear. With the development of computer information technology, deep learning has come to be widely applied to facial action unit recognition, that is, recognition is performed by constructing a network model. However, the inventor realized that most existing facial action unit recognition models support only a small number of facial action units and describe subtle facial expression changes rather coarsely. In addition, when the face in a picture is at a different rotation angle, when the picture contains interference information that does not affect the face, or when certain attributes of the picture are changed, the output of the facial action unit recognition model is affected, resulting in lower recognition accuracy.
Summary of the invention
The embodiments of the present application provide a facial action unit recognition method, apparatus, electronic device, and storage medium, which help improve the accuracy of facial action unit recognition in face images.
In the first aspect, an embodiment of the present application provides a facial action unit recognition method, which includes:
acquiring the first face image to be recognized uploaded by the terminal;
using a pre-trained convolutional neural network model to perform face detection on the first face image to be recognized, to obtain position information of the face key points in the first face image to be recognized;
performing face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized;
inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and
outputting the facial action unit recognition result of the first face image to be recognized to the terminal.
In the second aspect, an embodiment of the present application provides a facial action unit recognition apparatus, which includes:
an image acquisition module, configured to acquire the first face image to be recognized uploaded by the terminal;
a face detection module, configured to use a pre-trained convolutional neural network model to perform face detection on the first face image to be recognized, to obtain position information of the face key points in the first face image to be recognized;
a face correction module, configured to perform face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized;
a facial action unit recognition module, configured to input the second face image to be recognized into a pre-trained facial action unit recognition model and, through the processing of the model's main body network part, attention mechanism, and fully connected layer, obtain the facial action unit recognition result of the first face image to be recognized, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and
a recognition result output module, configured to output the facial action unit recognition result of the first face image to be recognized to the terminal.
In the third aspect, an embodiment of the present application provides an electronic device that includes an input device, an output device, and a processor adapted to implement one or more instructions; and a computer-readable storage medium that stores one or more instructions suitable for being loaded by the processor to execute the following steps:
acquiring the first face image to be recognized uploaded by the terminal;
using a pre-trained convolutional neural network model to perform face detection on the first face image to be recognized, to obtain position information of the face key points in the first face image to be recognized;
performing face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized;
inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and
outputting the facial action unit recognition result of the first face image to be recognized to the terminal.
In the fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores one or more instructions suitable for being loaded by a processor to execute the following steps:
acquiring the first face image to be recognized uploaded by the terminal;
using a pre-trained convolutional neural network model to perform face detection on the first face image to be recognized, to obtain position information of the face key points in the first face image to be recognized;
performing face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized;
inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the model's main body network part, attention mechanism, and fully connected layer, where the main body network part includes a plurality of deep residual dense networks, each formed by stacking a deep residual network and a deep dense network; and
outputting the facial action unit recognition result of the first face image to be recognized to the terminal.
In the embodiments of this application, when the terminal uploads a first face image to be recognized, the position information of the face key points in that image is first obtained and used to correct the face so as to straighten it; the straightened second face image to be recognized is then input into the facial action unit recognition model composed of the main body network part, the attention mechanism module, and the fully connected layer for recognition, and the resulting facial action unit recognition is more accurate than in the prior art.
Description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 is a network architecture diagram provided by an embodiment of the application;
FIG. 2a is an example diagram of acquiring a face image provided by an embodiment of the application;
FIG. 2b is another example diagram of acquiring a face image provided by an embodiment of the application;
FIG. 3 is a schematic flowchart of a facial action unit recognition method provided by an embodiment of the application;
FIG. 4 is a schematic structural diagram of a convolutional neural network model provided by an embodiment of the application;
FIG. 5 is a schematic structural diagram of a facial action unit recognition model provided by an embodiment of the application;
FIG. 6 is a schematic structural diagram of a deep residual dense network provided by an embodiment of the application;
FIG. 7 is a schematic flowchart of another facial action unit recognition method provided by an embodiment of the application;
FIG. 8 is a schematic structural diagram of a facial action unit recognition apparatus provided by an embodiment of the application;
FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The embodiments of this application propose a facial action unit recognition scheme that can be applied in many scenarios, such as face-to-face review when handling business (for example, loan or insurance business), customer expression analysis, and psychological activity analysis. The facial action unit recognition model used in the scheme combines a deep residual network and a deep dense network, which ensures that high-order features can be learned, thereby improving the accuracy of facial action unit recognition on the face image input by the terminal. At the same time, because the features of different facial action units are similar at the low-order feature stage, training a separate model for each facial action unit would produce a large amount of repetitive work; this scheme branches the facial action unit recognition model at the high-order feature stage, so that a single trained model can recognize 39 facial action units, which reduces the difficulty of deploying the model on a device and increases its running speed. Specifically, the scheme can be implemented based on the network architecture shown in FIG. 1. As shown in FIG. 1, the network architecture includes at least a terminal and a server, which communicate through a network including but not limited to a virtual private network, a local area network, or a metropolitan area network. The terminal can collect face images directly, or rely on an external image collection tool and then obtain the face images from that tool; the terminal can be a mobile phone, tablet, laptop computer, handheld computer, or similar device. In some embodiments of the present application, as shown in FIG. 2a, the terminal can automatically complete the collection of a face image when a face is detected and then send the collected face image to the server; in other embodiments, as shown in FIG. 2b, the terminal can start collecting face images only after a control on the screen is triggered and then send the collected face images to the server. The control can appear in a fixed or floating form, and the triggering method can be a light touch, long press, slide, and so on, which is not limited here. After the server obtains the face image sent by the terminal, its processor performs a series of operations such as face key point detection, face correction, and calling the facial action unit recognition model for facial action unit recognition, and finally outputs the recognition result to the terminal for presentation to the user.
The server can be a single server, a server cluster, or a cloud server, and is the execution body of the entire facial action unit recognition scheme. It can be seen that the network architecture shown in FIG. 1 enables this scheme to be implemented; of course, the network architecture can also include more components, such as a database.
Please refer to FIG. 3, which is a schematic flowchart of a facial action unit recognition method provided by an embodiment of the present application. As shown in FIG. 3, the method includes steps S31-S35:
S31: Acquire the first face image to be recognized uploaded by the terminal.
In the specific embodiments of the present application, the first face image to be recognized is the original face image uploaded by the terminal, on which no face detection or face correction has been performed. It can be a face image from any open-source database at home or abroad, a customer's face image collected when a bank, insurance company, or communication company handles business, or an image collected by monitoring equipment in any monitored area such as a residential community or a shopping mall.
S32: Perform face detection on the first face image to be recognized using a pre-trained convolutional neural network model to obtain the position information of the face key points in the first face image to be recognized.
In the specific embodiments of the present application, the face key points are the five key points of a detected face: the two eyes, the nose, and the left and right mouth corners. The position information consists of the coordinates of these key points, for example, the coordinates of the center points of the two eye ellipses, the coordinates of the nose tip, and the coordinates of the left and right mouth corners.
The pre-trained convolutional neural network model refers to the Multi-task Cascaded Convolutional Networks (MTCNN). As shown in FIG. 4, MTCNN uses a three-level cascaded architecture combined with a convolutional neural network algorithm to perform face detection and key point localization, and comprises the neural networks P-Net, R-Net, and O-Net. The first face image to be recognized is first input into P-Net for recognition; the output of P-Net serves as the input of R-Net, and in turn the output of R-Net serves as the input of O-Net. The input size of each network is different: the input size of P-Net is 12*12*3, that of R-Net is 24*24*3, and that of O-Net is 48*48*3. The processing in P-Net mainly consists of 3*3 convolutions and 2*2 pooling; the processing in R-Net mainly consists of 3*3 convolutions, 3*3 pooling, and 2*2 pooling; and the processing in O-Net is similar to the 3*3 convolutions and 2*2 pooling of R-Net. After each network, a face classifier judges whether the region is a face, while bounding box regression and a key point locator are used to detect the face region. Specifically, the processing procedure of the multi-task convolutional neural network is as follows: input the first face image to be recognized into P-Net for recognition to obtain first candidate windows and bounding box regressions; calibrate the first candidate windows according to the bounding box regressions, and apply non-maximum suppression to remove the first candidate windows that overlap after calibration, obtaining second candidate windows; input the second candidate windows into R-Net for recognition and filter out false second candidate windows, obtaining third candidate windows; and input the third candidate windows into O-Net for recognition, output the face region through bounding box regression, and output the position information of the face key points in the first face image to be recognized through key point localization. It should be noted that P-Net does not use a fully connected layer, whereas R-Net and O-Net use 128-channel and 256-channel fully connected layers respectively, and O-Net has one more convolutional layer than R-Net.
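As a minimal illustration of this detection stage, the sketch below uses the open-source mtcnn Python package to obtain the five key points described above; the package choice and the file name are assumptions, since the text does not prescribe a particular implementation.

```python
# A minimal sketch of MTCNN key point detection, assuming the
# open-source `mtcnn` package (pip install mtcnn); the file name
# "face.jpg" is a hypothetical placeholder.
import cv2
from mtcnn import MTCNN

detector = MTCNN()  # internally runs the P-Net -> R-Net -> O-Net cascade
image = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2RGB)

results = detector.detect_faces(image)
if results:
    kp = results[0]["keypoints"]
    # Five key points: two eyes, nose tip, left/right mouth corners.
    print(kp["left_eye"], kp["right_eye"], kp["nose"],
          kp["mouth_left"], kp["mouth_right"])
```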
S33: Perform face correction on the first face image to be recognized using the position information of the face key points to obtain a second face image to be recognized.
In the specific embodiments of the present application, the second face image to be recognized is the straightened face image obtained after performing face correction on the first face image to be recognized, where face correction involves operations such as scaling, rotation, and translation. After the position information of the face key points in the first face image to be recognized is obtained with MTCNN, the position information of the face key points in a pre-stored standard face image is acquired. A standard face image is an image in which the face is frontal, the head is not rotated, and no correction is needed; the position information (coordinate information) of the face key points in the standard face image has been obtained in advance and stored in a preset database. The position information of the face key points in the first face image to be recognized is compared with the position information of the face key points in the standard face image to obtain a similarity transformation matrix H, which is solved according to the following similarity transformation matrix equation:
$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$
Thereafter, the position information of each pixel in the first face image to be recognized is multiplied by the solved similarity transformation matrix H to obtain the straightened second face image to be recognized. In the above similarity transformation matrix equation, (x, y) denotes the position information of a face key point in the first face image to be recognized, (x', y') denotes the position information of the corresponding face key point in the standard face image, and

$$H = \begin{pmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix}$$

is the similarity transformation matrix, where s denotes the scaling factor, θ denotes the rotation angle (usually a counterclockwise rotation), and (t_x, t_y) denotes the translation parameters. Specifically, the transform.SimilarityTransform function can be used to iteratively solve the similarity transformation matrix H; this function is provided by the Python scikit-image library.
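A minimal sketch of this correction step using scikit-image follows; the key point coordinates are hypothetical placeholders, since the actual standard key points stored in the database are not given in the text.

```python
# A minimal sketch of face correction via a similarity transform,
# assuming scikit-image; all coordinate values are hypothetical.
import numpy as np
from skimage import io, transform as trans

# Five detected key points (x, y) from MTCNN: left eye, right eye,
# nose tip, left mouth corner, right mouth corner.
src = np.array([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                [41.5, 92.4], [70.7, 92.2]])
# Pre-stored standard-face key points (hypothetical values).
dst = np.array([[38.0, 52.0], [74.0, 52.0], [56.0, 72.0],
                [42.0, 92.0], [70.0, 92.0]])

tform = trans.SimilarityTransform()
tform.estimate(src, dst)   # least-squares solve for the transform
H = tform.params           # 3x3 homogeneous similarity matrix

image = io.imread("face.jpg")            # first face image to be recognized
aligned = trans.warp(image, tform.inverse)  # apply H to every pixel
```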
S34: Input the second face image to be recognized into a pre-trained facial action unit recognition model, and obtain the facial action unit recognition result of the first face image to be recognized after processing by the main body network part, attention mechanism, and fully connected layer of the facial action unit recognition model, where the main body network part includes multiple deep residual dense networks, each of which is formed by stacking a deep residual network and a deep dense network.
S35: Output the facial action unit recognition result of the first face image to be recognized to the terminal.
In the specific embodiments of the present application, the structure of the facial action unit recognition model is shown in FIG. 5. It mainly includes the main body network part, the attention mechanism module, and a final fully connected layer. The input of the model is a color image in RGB format, i.e., the input image depth is 3, and the recognition result of the model consists of the probability values of the occurrence of 39 facial action units. A value greater than or equal to 0.5 indicates that the facial action unit occurs, and a value less than 0.5 indicates that it does not. For example, if the output value of AU45 (blink) is 0.8 and that of AU04 (frown) is 0.3, the face in the input image exhibits AU45 but not AU04.
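A minimal sketch of this 0.5 decision rule, reusing the two example values from the text (the model in fact outputs all 39 probabilities):

```python
# Hypothetical sigmoid outputs for two of the 39 AUs, as in the example.
probs = {"AU45": 0.8, "AU04": 0.3}
present = {au: p >= 0.5 for au, p in probs.items()}
print(present)  # {'AU45': True, 'AU04': False}
```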
Specifically, inputting the second face image to be recognized into the pre-trained facial action unit recognition model and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the main body network part, attention mechanism, and fully connected layer of the facial action unit recognition model includes:
inputting the second face image to be recognized into the main body network part of the pre-trained facial action unit recognition model and performing feature extraction through multiple deep residual dense networks to obtain a high-order feature map; performing maximum pooling and average pooling operations on the high-order feature map using the attention mechanism of the facial action unit recognition model to obtain a first feature map and a second feature map whose width and height are the same as those of the high-order feature map and whose depth is 1; concatenating the first feature map and the second feature map in the depth direction and performing a 1*1 convolution on the concatenated feature map to obtain a third feature map; multiplying the width and height of the third feature map with the width and height of the high-order feature map to obtain a target feature map; and taking the target feature map as the input of the fully connected layer, which performs binary classification and finally outputs the facial action unit recognition result of the first face image to be recognized.
The main body network part of the facial action unit recognition model is composed of four deep residual dense networks, with 92 hidden layers in total. As shown in FIG. 6, each deep residual dense network is formed by stacking a deep residual module and a deep dense module. A deep residual dense network starts with a 1*1 convolutional layer, followed by a 3*3 convolutional layer, and splits into two parts after the last 1*1 convolutional layer. One part is connected to the deep residual module by element-wise addition over the width and height dimensions; using the characteristics of the residual network, good features that have already been learned are not forgotten as the network deepens. For example, the width and height dimensions of the features obtained by the second hidden layer are added to the width and height of the features obtained by the fifth hidden layer, while the depth dimension remains unchanged. The other part is connected to the path of the deep dense module; for example, the depth dimension of the features obtained by the second hidden layer is concatenated with the depth of the features obtained by the fifth hidden layer, preserving the diversity of the high-order features. For instance, for two features with depths of 20 and 30, the concatenated feature has a depth of 50, while the width and height remain unchanged. It should be noted that the main body network part adopts a structure combining a deep residual network with a deep dense network, which, compared with the prior art that uses only a deep residual network, is more conducive to preserving the diversity of high-order features and thus to accurately recognizing the 39 facial action units.
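A minimal PyTorch sketch of one such residual-dense block is given below, under the assumption that the addition branch and the concatenation branch both start from the block input; the channel sizes and activation functions are hypothetical, since the text does not specify them.

```python
# A minimal sketch of one deep residual dense block, assuming PyTorch;
# channel counts and ReLU activations are assumptions.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """1*1 conv -> 3*3 conv -> 1*1 conv, then a residual (add) branch
    and a dense (concat-along-depth) branch, as described above."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor):
        y = self.body(x)
        residual = x + y                  # add over width/height, depth unchanged
        dense = torch.cat([x, y], dim=1)  # concat along depth, width/height unchanged
        return residual, dense

x = torch.randn(1, 64, 56, 56)
res, den = ResidualDenseBlock()(x)
print(res.shape, den.shape)  # (1, 64, 56, 56) and (1, 128, 56, 56)
```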
In addition, the role of the attention mechanism module is to assign weights to the high-order features extracted by the main body network part so that these high-order features are recombined. It adopts a combination of maximum pooling, average pooling, and 1*1 convolution. Its input is the output of the main body network part; after maximum pooling and average pooling, two feature maps with the same width and height as the input and a depth of 1 are obtained, namely the first feature map and the second feature map. These two feature maps are concatenated along the depth dimension, and a 1*1 convolution produces the output feature map of the attention mechanism module, i.e., the third feature map. The width and height of this output feature map are multiplied with the corresponding width and height of the input feature map of the attention mechanism module (i.e., the high-order feature map) to obtain the input feature map of the fully connected layer, i.e., the target feature map. The target feature map is input into the fully connected layer for matrix multiplication to obtain the binary classification probability values of the 39 facial action units, and finally the binary classification probability values of the 39 facial action units are output to the terminal to display the facial action unit recognition result of the first image to be recognized. Here, using maximum pooling and average pooling at different scales helps capture feature information at different scales; focusing on obtaining weights over the width and height dimensions makes it clear which positions of the input face carry feature information that is more conducive to recognizing facial action units.
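A minimal PyTorch sketch of this attention module and classification head follows; interpreting the pooling as channel-wise max/mean (so that the result has depth 1) is an assumption, as are the feature map sizes and the sigmoid used to produce per-AU probabilities.

```python
# A minimal sketch of the attention module plus fully connected head,
# assuming PyTorch; tensor sizes and the sigmoid are assumptions.
import torch
import torch.nn as nn

class SpatialAttentionHead(nn.Module):
    def __init__(self, channels: int, spatial: int, num_aus: int = 39):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)  # 1*1 conv on the 2-map stack
        self.fc = nn.Linear(channels * spatial * spatial, num_aus)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        max_map = feat.max(dim=1, keepdim=True).values  # first feature map, depth 1
        avg_map = feat.mean(dim=1, keepdim=True)        # second feature map, depth 1
        third = self.conv(torch.cat([max_map, avg_map], dim=1))  # third feature map
        target = feat * third                 # reweight over width/height
        logits = self.fc(target.flatten(1))   # fully connected binary classification
        return torch.sigmoid(logits)          # probabilities for the 39 AUs

feat = torch.randn(1, 128, 14, 14)  # hypothetical high-order feature map
probs = SpatialAttentionHead(128, 14)(feat)
print(probs.shape)  # torch.Size([1, 39])
```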
In the embodiments of the present application, when the terminal inputs the first face image to be recognized, the position information of the face key points of the first face image to be recognized is first obtained; this position information is used to correct the face in the first face image to be recognized so as to straighten it; and the straightened second face image to be recognized is then input into the facial action unit recognition model composed of the main body network part, the attention mechanism module, and the fully connected layer for recognition. The obtained facial action unit recognition result is more accurate than that of the prior art.
Please refer to FIG. 7, which is a schematic flowchart of another facial action unit recognition method provided by an embodiment of the present application. As shown in FIG. 7, the method includes steps S71-S76:
S71: Acquire the first face image to be recognized uploaded by the terminal.
S72: Perform face detection on the first face image to be recognized using a pre-trained convolutional neural network model to obtain the position information of the face key points in the first face image to be recognized.
S73: Acquire the pre-stored position information of the face key points in a standard face image from a database.
S74: Perform face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image to obtain a second face image to be recognized.
In a possible implementation, performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image to obtain the second face image to be recognized includes:
comparing the position information of the face key points in the first face image to be recognized with the position information of the face key points in the standard face image to obtain a similarity transformation matrix H;
solving the similarity transformation matrix H according to a preset similarity transformation matrix equation; and
multiplying the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H to obtain the straightened second face image to be recognized.
In this implementation, MTCNN is used for face correction, so that the model can judge accurately even when the face in the first face image to be recognized is rotated by different angles, which guarantees the stability of the model.
S75: Input the second face image to be recognized into a pre-trained facial action unit recognition model, and obtain the facial action unit recognition result of the first face image to be recognized after processing by the main body network part, attention mechanism, and fully connected layer of the facial action unit recognition model, where the main body network part includes multiple deep residual dense networks, each of which is formed by stacking a deep residual network and a deep dense network.
In a possible implementation, inputting the second face image to be recognized into the pre-trained facial action unit recognition model and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the main body network part, attention mechanism, and fully connected layer of the facial action unit recognition model includes:
inputting the second face image to be recognized into the main body network part for feature extraction to obtain a high-order feature map;
performing maximum pooling and average pooling operations on the high-order feature map using the attention mechanism of the facial action unit recognition model to obtain a first feature map and a second feature map; and
obtaining a target feature map according to the first feature map and the second feature map.
Here, obtaining the target feature map according to the first feature map and the second feature map includes:
concatenating the first feature map and the second feature map in the depth direction and performing a 1*1 convolution on the concatenated feature map to obtain a third feature map;
obtaining the target feature map according to the high-order feature map and the third feature map; and
inputting the target feature map into the fully connected layer of the facial action unit recognition model for binary classification, and outputting the facial action unit recognition result of the first face image to be recognized.
Here, inputting the second face image to be recognized into the main body network part for feature extraction to obtain the high-order feature map includes:
inputting the second face image to be recognized into the main body network part and performing feature extraction through multiple deep residual dense networks to obtain the high-order feature map, where each deep residual dense network starts convolution processing with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and then another 1*1 convolutional layer, after which the processing splits into two parts: one part is connected to the deep residual network, in which the features output by two hidden layers are added over the width and height while the depth remains unchanged; the other part is connected to the path of the deep dense network, in which the features output by two hidden layers are concatenated along the depth while the width and height remain unchanged.
In this implementation, the main body network part of the facial action unit recognition model is formed by stacking deep residual networks and deep dense networks, which guarantees that higher-order features are learned; together with the attention mechanism module of maximum pooling, average pooling, and 1*1 convolution, this helps remove redundant features and improves the recognition accuracy of the 39 facial action units.
S76: Output the facial action unit recognition result of the first face image to be recognized to the terminal.
It should be noted that the specific implementations of the above steps S71-S76 have been described in detail in the embodiment shown in FIG. 3, can achieve the same or similar beneficial effects, and are not repeated here.
The present application also provides a facial action unit recognition apparatus, which may be a computer program (including program code) running in a terminal. The facial action unit recognition apparatus can execute the method shown in FIG. 3 or FIG. 7. Referring to FIG. 8, the apparatus includes:
an image acquisition module 81, configured to acquire the first face image to be recognized uploaded by the terminal;
a face detection module 82, configured to perform face detection on the first face image to be recognized using a pre-trained convolutional neural network model to obtain the position information of the face key points in the first face image to be recognized;
a face correction module 83, configured to perform face correction on the first face image to be recognized using the position information of the face key points to obtain a second face image to be recognized;
a facial action unit recognition module 84, configured to input the second face image to be recognized into a pre-trained facial action unit recognition model and obtain the facial action unit recognition result of the first face image to be recognized after processing by the main body network part, attention mechanism, and fully connected layer of the facial action unit recognition model, where the main body network part includes multiple deep residual dense networks, each of which is formed by stacking a deep residual network and a deep dense network; and
a recognition result output module 85, configured to output the facial action unit recognition result of the first face image to be recognized to the terminal.
In one embodiment, in terms of performing face correction on the first face image to be recognized using the position information of the face key points to obtain the second face image to be recognized, the face correction module 83 is specifically configured to:
acquire the pre-stored position information of the face key points in a standard face image from a database; and
perform face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image to obtain the second face image to be recognized.
In one embodiment, in terms of performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image to obtain the second face image to be recognized, the face correction module 83 is specifically configured to:
compare the position information of the face key points in the first face image to be recognized with the position information of the face key points in the standard face image to obtain a similarity transformation matrix H;
solve the similarity transformation matrix H according to a preset similarity transformation matrix equation; and
multiply the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H to obtain the straightened second face image to be recognized.
In one embodiment, in terms of inputting the second face image to be recognized into the pre-trained facial action unit recognition model and obtaining the facial action unit recognition result of the first face image to be recognized through the processing of the main body network part, attention mechanism, and fully connected layer of the facial action unit recognition model, the facial action unit recognition module 84 is specifically configured to:
input the second face image to be recognized into the main body network part for feature extraction to obtain a high-order feature map;
perform maximum pooling and average pooling operations on the high-order feature map using the attention mechanism to obtain a first feature map and a second feature map whose width and height are the same as those of the high-order feature map and whose depth is 1; and
obtain a target feature map according to the first feature map and the second feature map, input the target feature map into the fully connected layer for binary classification, and obtain the facial action unit recognition result of the first face image to be recognized.
In one embodiment, in terms of obtaining the target feature map according to the first feature map and the second feature map, the facial action unit recognition module 84 is specifically configured to:
concatenate the first feature map and the second feature map in the depth direction and perform a 1*1 convolution on the concatenated feature map to obtain a third feature map; and
multiply the width and height of the third feature map with the corresponding width and height of the high-order feature map to obtain the target feature map.
In one embodiment, in terms of inputting the second face image to be recognized into the main body network part for feature extraction to obtain the high-order feature map, the facial action unit recognition module 84 is specifically configured to:
input the second face image to be recognized into the main body network part and perform feature extraction through multiple deep residual dense networks to obtain the high-order feature map, where each deep residual dense network starts convolution processing with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and then another 1*1 convolutional layer, after which the processing splits into two parts: one part is connected to the deep residual network, in which the features output by two hidden layers are added over the width and height while the depth remains unchanged; the other part is connected to the path of the deep dense network, in which the features output by two hidden layers are concatenated along the depth while the width and height remain unchanged.
According to an embodiment of the present application, the modules of the facial action unit recognition apparatus shown in FIG. 8 can be separately or wholly combined into one or several additional units, or one or more of the modules can be further split into multiple functionally smaller units; this achieves the same operations without affecting the realization of the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit can also be realized by multiple units, or the functions of multiple units can be realized by one unit. In other embodiments of the present application, the facial action unit recognition apparatus can also include other units; in practical applications, these functions can also be realized with the assistance of other units, and can be realized by multiple units in cooperation.
According to another embodiment of the present application, the facial action unit recognition apparatus shown in FIG. 8 can be constructed, and the facial action unit recognition method of the embodiments of the present application can be implemented, by running a computer program (including program code) capable of executing the steps involved in the corresponding method shown in FIG. 3 or FIG. 7 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access storage medium (RAM), and a read-only storage medium (ROM). The computer program can be recorded on, for example, a computer-readable recording medium, loaded into the above computing device through the computer-readable recording medium, and run therein.
Based on the descriptions of the above method and apparatus embodiments, please refer to FIG. 9, which is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 9, the electronic device includes at least a processor 901, an input device 902, an output device 903, and a computer-readable storage medium 904, which can be connected by a bus or in other ways.
The computer-readable storage medium 904 can be stored in the memory of the electronic device and is used to store a computer program that includes program instructions; the processor 901 is used to execute the program instructions stored in the computer-readable storage medium 904. The processor 901 (or CPU (Central Processing Unit)) is the computing core and control core of the electronic device; it is suitable for implementing one or more instructions, and specifically for loading and executing one or more instructions to realize the corresponding method flow or corresponding function.
In one embodiment, the processor 901 of the electronic device provided in the embodiments of the present application can be used to perform a series of facial action unit recognition processes on the acquired face image:
acquiring the first face image to be recognized uploaded by the terminal;
performing face detection on the first face image to be recognized using a pre-trained convolutional neural network model to obtain the position information of the face key points in the first face image to be recognized;
performing face correction on the first face image to be recognized using the position information of the face key points to obtain a second face image to be recognized;
inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized after processing by the main body network part, attention mechanism, and fully connected layer of the facial action unit recognition model, where the main body network part includes multiple deep residual dense networks, each of which is formed by stacking a deep residual network and a deep dense network; and
outputting the facial action unit recognition result of the first face image to be recognized to the terminal.
In a possible implementation, when the processor 901 performs face correction on the first face image to be recognized using the position information of the face key points to obtain the second face image to be recognized, the processing includes:
acquiring the pre-stored position information of the face key points in a standard face image from a database; and
performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image to obtain the second face image to be recognized.
In a possible implementation, when the processor 901 performs face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image to obtain the second face image to be recognized, the processing includes:
comparing the position information of the face key points in the first face image to be recognized with the position information of the face key points in the standard face image to obtain a similarity transformation matrix H;
solving the similarity transformation matrix H according to a preset similarity transformation matrix equation; and
multiplying the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H to obtain the straightened second face image to be recognized.
In a possible implementation, when the processor 901 inputs the second face image to be recognized into the pre-trained facial action unit recognition model and obtains the facial action unit recognition result of the first face image to be recognized through the processing of the main body network part, attention mechanism, and fully connected layer of the facial action unit recognition model, the processing includes:
inputting the second face image to be recognized into the main body network part for feature extraction to obtain a high-order feature map;
performing maximum pooling and average pooling operations on the high-order feature map using the attention mechanism to obtain a first feature map and a second feature map whose width and height are the same as those of the high-order feature map and whose depth is 1; and
obtaining a target feature map according to the first feature map and the second feature map, inputting the target feature map into the fully connected layer for binary classification, and obtaining the facial action unit recognition result of the first face image to be recognized.
In a possible implementation, when the processor 901 obtains the target feature map according to the first feature map and the second feature map, the processing includes:
concatenating the first feature map and the second feature map in the depth direction and performing a 1*1 convolution on the concatenated feature map to obtain a third feature map; and
multiplying the width and height of the third feature map with the corresponding width and height of the high-order feature map to obtain the target feature map.
In a possible implementation, when the processor 901 inputs the second face image to be recognized into the main body network part for feature extraction to obtain the high-order feature map, the processing includes:
inputting the second face image to be recognized into the main body network part and performing feature extraction through multiple deep residual dense networks to obtain the high-order feature map, where each deep residual dense network starts convolution processing with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and then another 1*1 convolutional layer, after which the processing splits into two parts: one part is connected to the deep residual network, in which the features output by two hidden layers are added over the width and height while the depth remains unchanged; the other part is connected to the path of the deep dense network, in which the features output by two hidden layers are concatenated along the depth while the width and height remain unchanged.
Exemplarily, the above electronic device can be a server, a computer host, a cloud server, or a similar device. The electronic device can include, but is not limited to, the processor 901, the input device 902, the output device 903, and the computer-readable storage medium 904. Those skilled in the art can understand that the schematic diagram is only an example of the electronic device and does not constitute a limitation on it; the device can include more or fewer components than shown, a combination of certain components, or different components.
It should be noted that since the processor 901 of the electronic device implements the steps in the above facial action unit recognition method when executing the computer program, the embodiments of the facial action unit recognition method are all applicable to the electronic device and can achieve the same or similar beneficial effects.
The embodiments of the present application also provide a computer-readable storage medium (memory), which is a memory device in an electronic device and is used to store programs and data. It can be understood that the computer-readable storage medium here can include both a built-in storage medium in the terminal and an extended storage medium supported by the terminal. The computer-readable storage medium provides storage space that stores the operating system of the terminal. In addition, one or more instructions suitable for being loaded and executed by the processor 901 are stored in the storage space; these instructions can be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here can be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it can also be at least one computer-readable storage medium located away from the aforementioned processor 901. In one embodiment, one or more instructions stored in the computer-readable storage medium can be loaded and executed by the processor 901 to implement the corresponding steps of the above facial action unit recognition method. In a specific implementation, one or more instructions in the computer-readable storage medium are loaded by the processor 901 to execute the following steps:
acquiring the first face image to be recognized uploaded by the terminal;
performing face detection on the first face image to be recognized using a pre-trained convolutional neural network model to obtain the position information of the face key points in the first face image to be recognized;
performing face correction on the first face image to be recognized using the position information of the face key points to obtain a second face image to be recognized;
inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized after processing by the main body network part, attention mechanism, and fully connected layer of the facial action unit recognition model, where the main body network part includes multiple deep residual dense networks, each of which is formed by stacking a deep residual network and a deep dense network; and
outputting the facial action unit recognition result of the first face image to be recognized to the terminal.
In one example, when the one or more instructions in the computer-readable storage medium are loaded by the processor 901, the following steps are also executed:
acquiring the pre-stored position information of the face key points in a standard face image from a database; and
performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image to obtain the second face image to be recognized.
In one example, when the one or more instructions in the computer-readable storage medium are loaded by the processor 901, the following steps are also executed:
comparing the position information of the face key points in the first face image to be recognized with the position information of the face key points in the standard face image to obtain a similarity transformation matrix H;
solving the similarity transformation matrix H according to a preset similarity transformation matrix equation; and
multiplying the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H to obtain the straightened second face image to be recognized.
In one example, when the one or more instructions in the computer-readable storage medium are loaded by the processor 901, the following steps are also executed:
inputting the second face image to be recognized into the main body network part for feature extraction to obtain a high-order feature map;
performing maximum pooling and average pooling operations on the high-order feature map using the attention mechanism to obtain a first feature map and a second feature map whose width and height are the same as those of the high-order feature map and whose depth is 1; and
obtaining a target feature map according to the first feature map and the second feature map, inputting the target feature map into the fully connected layer for binary classification, and obtaining the facial action unit recognition result of the first face image to be recognized.
In one example, when the one or more instructions in the computer-readable storage medium are loaded by the processor 901, the following steps are also executed:
concatenating the first feature map and the second feature map in the depth direction and performing a 1*1 convolution on the concatenated feature map to obtain a third feature map; and
multiplying the width and height of the third feature map with the corresponding width and height of the high-order feature map to obtain the target feature map.
In one example, when the one or more instructions in the computer-readable storage medium are loaded by the processor 901, the following steps are also executed:
inputting the second face image to be recognized into the main body network part and performing feature extraction through multiple deep residual dense networks to obtain the high-order feature map, where each deep residual dense network starts convolution processing with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and then another 1*1 convolutional layer, after which the processing splits into two parts: one part is connected to the deep residual network, in which the features output by two hidden layers are added over the width and height while the depth remains unchanged; the other part is connected to the path of the deep dense network, in which the features output by two hidden layers are concatenated along the depth while the width and height remain unchanged.
Exemplarily, the computer program of the computer-readable storage medium includes computer program code, which can be in the form of source code, object code, an executable file, some intermediate form, or the like; the computer-readable storage medium can be non-volatile or volatile. The computer-readable storage medium can include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
It should be noted that since the computer program of the computer-readable storage medium implements the steps in the above facial action unit recognition method when executed by the processor, all the embodiments of the facial action unit recognition method are applicable to the computer-readable storage medium and can achieve the same or similar beneficial effects.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is only part of the embodiments of the present application and of course cannot be used to limit the scope of rights of the present application. A person of ordinary skill in the art can understand all or part of the processes for implementing the above embodiments, and equivalent changes made in accordance with the claims of the present application still fall within the scope covered by the present application.

Claims (20)

  1. A facial action unit recognition method, wherein the method comprises:
    acquiring a first face image to be recognized uploaded by a terminal;
    performing face detection on the first face image to be recognized using a pre-trained convolutional neural network model to obtain position information of face key points in the first face image to be recognized;
    performing face correction on the first face image to be recognized using the position information of the face key points to obtain a second face image to be recognized;
    inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining a facial action unit recognition result of the first face image to be recognized after processing by a main body network part, an attention mechanism, and a fully connected layer of the facial action unit recognition model, wherein the main body network part comprises a plurality of deep residual dense networks, each of which is formed by stacking a deep residual network and a deep dense network; and
    outputting the facial action unit recognition result of the first face image to be recognized to the terminal.
  2. The method according to claim 1, wherein performing face correction on the first face image to be recognized using the position information of the face key points to obtain the second face image to be recognized comprises:
    acquiring pre-stored position information of face key points in a standard face image from a database; and
    performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image to obtain the second face image to be recognized.
  3. The method according to claim 2, wherein the performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image, to obtain the second face image to be recognized, comprises:
    comparing the position information of the face key points in the first face image to be recognized with the position information of the face key points in the standard face image, to obtain a similarity transformation matrix H;
    solving the similarity transformation matrix H according to a preset similarity transformation matrix equation; and
    multiplying the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H, to obtain the straightened second face image to be recognized.
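As a concrete reading of claims 2 and 3, the minimal sketch below estimates the similarity transformation matrix H from the detected key points and the pre-stored standard key points, then applies it to every pixel by warping the image. The application does not spell out the preset similarity transformation matrix equation; a least-squares similarity fit via OpenCV's estimateAffinePartial2D is assumed here, and all names are placeholders.

import cv2
import numpy as np

def align_face(image, keypoints, standard_keypoints):
    # keypoints, standard_keypoints: float32 arrays of shape (N, 2).
    # Least-squares similarity fit (rotation + uniform scale + translation);
    # this stands in for the claim's "preset similarity transformation matrix equation".
    H, _ = cv2.estimateAffinePartial2D(np.float32(keypoints), np.float32(standard_keypoints))
    h, w = image.shape[:2]
    # Multiplying each pixel position by H is equivalent to warping the image with H.
    return cv2.warpAffine(image, H, (w, h))

Because a similarity transform only rotates, uniformly scales, and translates, the straightened face keeps its proportions, which is what makes the corrected image suitable for the downstream recognition model.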
  4. The method according to any one of claims 1-3, wherein the inputting the second face image to be recognized into a pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized after processing by the main network part, the attention mechanism, and the fully connected layer of the facial action unit recognition model, comprises:
    inputting the second face image to be recognized into the main network part for feature extraction, to obtain a high-level feature map;
    performing max pooling and average pooling operations on the high-level feature map by using the attention mechanism, to obtain a first feature map and a second feature map that have the same width and height as the high-level feature map and a depth of 1; and
    obtaining a target feature map according to the first feature map and the second feature map, and inputting the target feature map into the fully connected layer for binary classification, to obtain the facial action unit recognition result of the first face image to be recognized.
  5. The method according to claim 4, wherein the obtaining a target feature map according to the first feature map and the second feature map comprises:
    concatenating the first feature map and the second feature map in the depth direction, and performing a 1*1 convolution on the concatenated feature map to obtain a third feature map; and
    multiplying the third feature map by the high-level feature map correspondingly in width and height, to obtain the target feature map.
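A minimal PyTorch sketch of the attention step of claims 4 and 5, assuming an (N, C, H, W) tensor layout: depth-wise max and mean pooling give two depth-1 maps, which are concatenated in depth, passed through a 1*1 convolution, and multiplied element-wise onto the high-level feature map. The sigmoid squashing is an added assumption, not stated in the claims.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)  # 1*1 convolution over the 2-deep stack

    def forward(self, feat):                              # feat: (N, C, H, W) high-level map
        first = feat.max(dim=1, keepdim=True).values      # max pooling map, depth 1
        second = feat.mean(dim=1, keepdim=True)           # average pooling map, depth 1
        third = torch.sigmoid(self.conv(torch.cat([first, second], dim=1)))  # (N, 1, H, W)
        return feat * third                               # target feature map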
  6. The method according to claim 4, wherein the inputting the second face image to be recognized into the main network part for feature extraction, to obtain a high-level feature map, comprises:
    inputting the second face image to be recognized into the main network part, and performing feature extraction through the plurality of deep residual dense networks, to obtain the high-level feature map; wherein each deep residual dense network starts its convolution processing with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and then another 1*1 convolutional layer, after which the processing splits into two parts: one part is fed into the deep residual network, in which the features output by the two hidden layers are added in width and height while the depth remains unchanged; the other part is connected to the path of the deep dense network, in which the features output by the two hidden layers are concatenated in depth while the width and height remain unchanged.
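One plausible reading of the deep residual dense network of claim 6, again in PyTorch: a 1*1 -> 3*3 -> 1*1 bottleneck feeds a residual path (element-wise addition, depth unchanged) and a dense path (depth-wise concatenation, width and height unchanged). The claim does not say how the two paths are merged afterwards; fusing them with a further 1*1 convolution is an assumption of this sketch, as are all channel counts.

import torch
import torch.nn as nn

class DeepResidualDenseBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        # Hypothetical fusion: residual output (C channels) + dense output (2C) -> C.
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):
        f = self.bottleneck(x)
        residual = x + f                   # residual path: add, depth unchanged
        dense = torch.cat([x, f], dim=1)   # dense path: concatenate in depth
        return self.fuse(torch.cat([residual, dense], dim=1))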
  7. A facial action unit recognition apparatus, wherein the apparatus comprises:
    an image acquisition module, configured to acquire a first face image to be recognized uploaded by a terminal;
    a face detection module, configured to perform face detection on the first face image to be recognized by using a pre-trained convolutional neural network model, to obtain position information of face key points in the first face image to be recognized;
    a face correction module, configured to perform face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized;
    a facial action unit recognition module, configured to input the second face image to be recognized into a pre-trained facial action unit recognition model, and obtain a facial action unit recognition result of the first face image to be recognized after processing by a main network part, an attention mechanism, and a fully connected layer of the facial action unit recognition model, wherein the main network part comprises a plurality of deep residual dense networks, and each deep residual dense network is formed by stacking a deep residual network and a deep dense network; and
    a recognition result output module, configured to output the facial action unit recognition result of the first face image to be recognized to the terminal.
  8. The apparatus according to claim 7, wherein, in performing face correction on the first face image to be recognized by using the position information of the face key points to obtain a second face image to be recognized, the face correction module is specifically configured to:
    obtain, from a database, position information of face key points in a pre-stored standard face image; and
    perform face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image, to obtain the second face image to be recognized.
  9. The apparatus according to claim 8, wherein, in performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image to obtain the second face image to be recognized, the face correction module is specifically configured to:
    compare the position information of the face key points in the first face image to be recognized with the position information of the face key points in the standard face image, to obtain a similarity transformation matrix H;
    solve the similarity transformation matrix H according to a preset similarity transformation matrix equation; and
    multiply the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H, to obtain the straightened second face image to be recognized.
  10. The apparatus according to any one of claims 7-9, wherein, in inputting the second face image to be recognized into the pre-trained facial action unit recognition model and obtaining the facial action unit recognition result of the first face image to be recognized after processing by the main network part, the attention mechanism, and the fully connected layer of the facial action unit recognition model, the facial action unit recognition module is specifically configured to:
    input the second face image to be recognized into the main network part for feature extraction, to obtain a high-level feature map;
    perform max pooling and average pooling operations on the high-level feature map by using the attention mechanism, to obtain a first feature map and a second feature map that have the same width and height as the high-level feature map and a depth of 1; and
    obtain a target feature map according to the first feature map and the second feature map, and input the target feature map into the fully connected layer for binary classification, to obtain the facial action unit recognition result of the first face image to be recognized.
  11. The apparatus according to claim 10, wherein, in obtaining the target feature map according to the first feature map and the second feature map, the facial action unit recognition module is specifically configured to:
    concatenate the first feature map and the second feature map in the depth direction, and perform a 1*1 convolution on the concatenated feature map to obtain a third feature map; and
    multiply the third feature map by the high-level feature map correspondingly in width and height, to obtain the target feature map.
  12. The apparatus according to claim 10, wherein, in inputting the second face image to be recognized into the main network part for feature extraction to obtain the high-level feature map, the facial action unit recognition module is specifically configured to:
    input the second face image to be recognized into the main network part, and perform feature extraction through the plurality of deep residual dense networks, to obtain the high-level feature map; wherein each deep residual dense network starts its convolution processing with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and then another 1*1 convolutional layer, after which the processing splits into two parts: one part is fed into the deep residual network, in which the features output by the two hidden layers are added in width and height while the depth remains unchanged; the other part is connected to the path of the deep dense network, in which the features output by the two hidden layers are concatenated in depth while the width and height remain unchanged.
  13. An electronic device, comprising an input device and an output device, and further comprising:
    a processor, adapted to implement one or more instructions; and
    a computer-readable storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by the processor and executed to: acquire a first face image to be recognized uploaded by a terminal; perform face detection on the first face image to be recognized by using a pre-trained convolutional neural network model, to obtain position information of face key points in the first face image to be recognized; perform face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized; input the second face image to be recognized into a pre-trained facial action unit recognition model, and obtain a facial action unit recognition result of the first face image to be recognized after processing by a main network part, an attention mechanism, and a fully connected layer of the facial action unit recognition model, wherein the main network part comprises a plurality of deep residual dense networks, and each deep residual dense network is formed by stacking a deep residual network and a deep dense network; and output the facial action unit recognition result of the first face image to be recognized to the terminal.
  14. The electronic device according to claim 13, wherein the processor performing the face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized, comprises:
    obtaining, from a database, position information of face key points in a pre-stored standard face image; and
    performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image, to obtain the second face image to be recognized.
  15. The electronic device according to claim 14, wherein the processor performing the face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image, to obtain the second face image to be recognized, comprises:
    comparing the position information of the face key points in the first face image to be recognized with the position information of the face key points in the standard face image, to obtain a similarity transformation matrix H;
    solving the similarity transformation matrix H according to a preset similarity transformation matrix equation; and
    multiplying the position information of each pixel in the first face image to be recognized by the solved similarity transformation matrix H, to obtain the straightened second face image to be recognized.
  16. The electronic device according to any one of claims 13-15, wherein the processor performing the inputting of the second face image to be recognized into the pre-trained facial action unit recognition model, and obtaining the facial action unit recognition result of the first face image to be recognized after processing by the main network part, the attention mechanism, and the fully connected layer of the facial action unit recognition model, comprises:
    inputting the second face image to be recognized into the main network part for feature extraction, to obtain a high-level feature map;
    performing max pooling and average pooling operations on the high-level feature map by using the attention mechanism, to obtain a first feature map and a second feature map that have the same width and height as the high-level feature map and a depth of 1; and
    obtaining a target feature map according to the first feature map and the second feature map, and inputting the target feature map into the fully connected layer for binary classification, to obtain the facial action unit recognition result of the first face image to be recognized.
  17. The electronic device according to claim 16, wherein the processor performing the obtaining of the target feature map according to the first feature map and the second feature map comprises:
    concatenating the first feature map and the second feature map in the depth direction, and performing a 1*1 convolution on the concatenated feature map to obtain a third feature map; and
    multiplying the third feature map by the high-level feature map correspondingly in width and height, to obtain the target feature map.
  18. The electronic device according to claim 16, wherein the processor performing the inputting of the second face image to be recognized into the main network part for feature extraction, to obtain the high-level feature map, comprises:
    inputting the second face image to be recognized into the main network part, and performing feature extraction through the plurality of deep residual dense networks, to obtain the high-level feature map; wherein each deep residual dense network starts its convolution processing with a 1*1 convolutional layer, followed by a 3*3 convolutional layer and then another 1*1 convolutional layer, after which the processing splits into two parts: one part is fed into the deep residual network, in which the features output by the two hidden layers are added in width and height while the depth remains unchanged; the other part is connected to the path of the deep dense network, in which the features output by the two hidden layers are concatenated in depth while the width and height remain unchanged.
  19. A computer-readable storage medium, wherein the computer-readable storage medium stores one or more instructions, the one or more instructions being adapted to be loaded by a processor and executed to: acquire a first face image to be recognized uploaded by a terminal; perform face detection on the first face image to be recognized by using a pre-trained convolutional neural network model, to obtain position information of face key points in the first face image to be recognized; perform face correction on the first face image to be recognized by using the position information of the face key points, to obtain a second face image to be recognized; input the second face image to be recognized into a pre-trained facial action unit recognition model, and obtain a facial action unit recognition result of the first face image to be recognized after processing by a main network part, an attention mechanism, and a fully connected layer of the facial action unit recognition model, wherein the main network part comprises a plurality of deep residual dense networks, and each deep residual dense network is formed by stacking a deep residual network and a deep dense network; and output the facial action unit recognition result of the first face image to be recognized to the terminal.
  20. The computer-readable storage medium according to claim 19, wherein the one or more instructions in the computer-readable storage medium, when loaded by the processor, further perform the following steps: obtaining, from a database, position information of face key points in a pre-stored standard face image; and performing face correction on the first face image to be recognized according to the position information of the face key points in the first face image to be recognized and the position information of the face key points in the standard face image, to obtain the second face image to be recognized.
PCT/CN2020/092805 2020-04-03 2020-05-28 Facial action unit recognition method and apparatus, electronic device, and storage medium WO2021196389A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010262740.8A CN111597884A (en) 2020-04-03 2020-04-03 Facial action unit identification method and device, electronic equipment and storage medium
CN2020102627408 2020-04-03

Publications (1)

Publication Number Publication Date
WO2021196389A1

Family

ID=72185476

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092805 WO2021196389A1 (en) 2020-04-03 2020-05-28 Facial action unit recognition method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN111597884A (en)
WO (1) WO2021196389A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112116355A (en) * 2020-09-18 2020-12-22 支付宝(杭州)信息技术有限公司 Method, system and device for confirming whether payment is finished or not based on willingness recognition
CN113542527B (en) * 2020-11-26 2023-08-18 腾讯科技(深圳)有限公司 Face image transmission method and device, electronic equipment and storage medium
CN112861752B (en) * 2021-02-23 2022-06-14 东北农业大学 DCGAN and RDN-based crop disease identification method and system
CN114821747A (en) * 2022-05-26 2022-07-29 深圳市科荣软件股份有限公司 Method and device for identifying abnormal state of construction site personnel
CN115067945A (en) * 2022-08-22 2022-09-20 深圳市海清视讯科技有限公司 Fatigue detection method, device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921061B (en) * 2018-06-20 2022-08-26 腾讯科技(深圳)有限公司 Expression recognition method, device and equipment
CN110059593B (en) * 2019-04-01 2022-09-30 华侨大学 Facial expression recognition method based on feedback convolutional neural network
CN110633665B (en) * 2019-09-05 2023-01-10 卓尔智联(武汉)研究院有限公司 Identification method, device and storage medium
CN110889325B (en) * 2019-10-12 2023-05-23 平安科技(深圳)有限公司 Multitasking facial motion recognition model training and multitasking facial motion recognition method
CN110796643A (en) * 2019-10-18 2020-02-14 四川大学 Rail fastener defect detection method and system
CN110929583A (en) * 2019-10-26 2020-03-27 湖北讯獒信息工程有限公司 High-detection-precision face recognition method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110222724A1 (en) * 2010-03-15 2011-09-15 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN105654049A (en) * 2015-12-29 2016-06-08 中国科学院深圳先进技术研究院 Facial expression recognition method and device
CN108460343A (en) * 2018-02-06 2018-08-28 北京达佳互联信息技术有限公司 Image processing method, system and server
CN110263673A (en) * 2019-05-31 2019-09-20 合肥工业大学 Human facial expression recognition method, apparatus, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI ZHENG; ZHANG TONG; ZHU GUOTAO; WANG XIN; WANG WEI: "A Super-resolution Image Reconstruction Method Based on Deep Learning", vol. 28, no. 6, 30 November 2019 (2019-11-30), pages 59-63, XP055855544, ISSN: 1672-7304, DOI: 10.3969/j.issn.1672-7304.2019.06.0011 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049250A (en) * 2022-01-13 2022-02-15 广州卓腾科技有限公司 Method, device and medium for correcting face pose of certificate photo
CN114596624A (en) * 2022-04-20 2022-06-07 深圳市海清视讯科技有限公司 Human eye state detection method and device, electronic equipment and storage medium
CN114842542A (en) * 2022-05-31 2022-08-02 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN116486464A (en) * 2023-06-20 2023-07-25 齐鲁工业大学(山东省科学院) Attention mechanism-based face counterfeiting detection method for convolution countermeasure network
CN116486464B (en) * 2023-06-20 2023-09-01 齐鲁工业大学(山东省科学院) Attention mechanism-based face counterfeiting detection method for convolution countermeasure network

Also Published As

Publication number Publication date
CN111597884A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
WO2021196389A1 (en) Facial action unit recognition method and apparatus, electronic device, and storage medium
CN110532984B (en) Key point detection method, gesture recognition method, device and system
Han et al. Two-stage learning to predict human eye fixations via SDAEs
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
TWI754887B (en) Method, device and electronic equipment for living detection and storage medium thereof
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
CN107545241A (en) Neural network model is trained and biopsy method, device and storage medium
US20220237812A1 (en) Item display method, apparatus, and device, and storage medium
Wang et al. Dynamic attention guided multi-trajectory analysis for single object tracking
CN106874826A (en) Face key point-tracking method and device
CN110163111A (en) Method, apparatus of calling out the numbers, electronic equipment and storage medium based on recognition of face
CN110232318A (en) Acupuncture point recognition methods, device, electronic equipment and storage medium
CN111222433B (en) Automatic face auditing method, system, equipment and readable storage medium
WO2021047587A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
WO2021218238A1 (en) Image processing method and image processing apparatus
WO2022188697A1 (en) Biological feature extraction method and apparatus, device, medium, and program product
CN110222572A (en) Tracking, device, electronic equipment and storage medium
CN111127309A (en) Portrait style transfer model training method, portrait style transfer method and device
CN111382791B (en) Deep learning task processing method, image recognition task processing method and device
Baggio et al. Mastering OpenCV 3
Yu Emotion monitoring for preschool children based on face recognition and emotion recognition algorithms
CN110163095B (en) Loop detection method, loop detection device and terminal equipment
WO2021217919A1 (en) Facial action unit recognition method and apparatus, and electronic device, and storage medium
CN111353325A (en) Key point detection model training method and device
Lu et al. Cost-effective real-time recognition for human emotion-age-gender using deep learning with normalized facial cropping preprocess

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20929378
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20929378
    Country of ref document: EP
    Kind code of ref document: A1