WO2020248841A1

WO2020248841A1 - Au detection method and apparatus for image, and electronic device and storage medium

Info

Publication number: WO2020248841A1
Application number: PCT/CN2020/093313
Authority: WO
Inventors: 盛建达
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-06-13
Filing date: 2020-05-29
Publication date: 2020-12-17
Also published as: CN110399788A

Abstract

An AU detection method and apparatus for an image, and an electronic device and a storage medium, wherein same relate to fields of image processing and predictive analysis such as artificial intelligence. The method comprises: acquiring a facial image (S201); carrying out detection processing on the acquired facial image to acquire a unified facial area (S202); by taking the facial image on which the detection processing has been carried out as an original image, inputting same into an optimized ResNet network to carry out feature value extraction, so as to output a facial feature vector (S203); and inputting the facial feature vector output by the ResNet network into an LSTM network for training to obtain an AU recognition result of the facial image (S204). According to the method, a training model can make full use of dynamic information of a facial AU change to automatically learn of a mapping relationship between AU features of a recognized object, such that the prediction precision and robustness of the training model are improved, and the performance of AU recognition of the facial image is thus improved.

Description

Image AU detection method, device, electronic equipment and storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on June 13, 2019, the application number is 201910511707.1, and the invention title is "Image AU detection method, device, electronic equipment and storage medium". The entire content is approved The reference is incorporated in this application.

Technical field

This application relates to the field of artificial intelligence image processing, and in particular to an image AU detection method, device, electronic equipment and storage medium.

Background technique

Existing AU (Action Units, used to detect subtle movements of facial muscles) detection refers to comparing the similarity between the expression in the face image and the AU to determine which category the AU of the face image belongs to. FACS (Facial Action Coding System) analyzes in detail the activities of all facial muscle tissues and the changes in individual parts of the face caused by their activities, as well as the observable expressions caused by these muscle activities. On this basis, the facial movement is decomposed into some basic AU. AU refers to the basic muscle action units of the human face, such as: raised inner eyebrows, raised mouth corners, and wrinkled nose.

technical problem

At present, the AU detection methods for video streams in the industry usually include the following: 1 AU detection based on a single frame of image; 2 AU detection using LSTM (Long Short-Term Memory) algorithm. However, the AU detection method based on a single frame image detects AU on an average face. The inventor realized that the method ignores the correlation between frames and the AU detection accuracy is not high. Although the method of using the LSTM algorithm for AU detection makes good use of spatial correlation, the extraction of AU feature values is relatively rough, which also makes the AU detection accuracy not high.

Technical solutions

In view of the above, it is necessary to propose an image AU detection method, device, electronic equipment and computer readable storage medium to solve the problem of low AU detection accuracy when AU detection method based on single frame image or AU detection based on LSTM algorithm problem.

The first aspect of the present application provides an image AU detection method, the method includes:

Obtain face images;

Perform detection processing on the acquired face image to obtain a uniform face area;

Input the detected face image as the original image into the optimized ResNe network for feature value extraction to output the face feature vector; and

The face feature vector output by the ResNet network is input into the LSTM network for training, and the AU recognition result of the face image is obtained.

The second aspect of the present application provides an image AU detection device, the device includes:

The acquisition module is used to acquire a face image;

The preprocessing module is used to detect and process the acquired face images to obtain a unified face area;

The feature extraction module is used to input the detected face image as the original image into the optimized ResNe network for feature value extraction to output the face feature vector; and

The recognition module is used to input the face feature vector output by the ResNet network into the LSTM network for training, and obtain the AU recognition result of the face image.

A third aspect of the present application provides an electronic device, the electronic device includes a processor, and the processor is configured to implement the AU detection method of the image when executing computer-readable instructions stored in a memory.

The fourth aspect of the present application provides one or more readable storage media storing computer readable instructions. When the computer readable instructions are executed by one or more processors, when the one or more processors execute Realize the AU detection method of the image.

Beneficial effect

This application enables the training model to make full use of the dynamic information of facial AU changes to automatically learn the mapping relationship between the AU features of the recognized object, thereby improving the prediction accuracy and robustness of the training model, thereby improving the face image AU recognition performance.

The details of one or more embodiments of the present application are presented in the following drawings and description, and other features and advantages of the present application will become apparent from the description, drawings and claims.

Description of the drawings

Fig. 1 is an application environment diagram of an image AU detection method in an embodiment of this application.

Fig. 2 is a flowchart of an image AU detection method in an embodiment of the present application.

Figure 3 is a schematic diagram of the basic operation structure of the ResNet network in this application.

Fig. 4 is a structural diagram of a ResNet network in an embodiment of this application.

FIG. 5 is a schematic diagram of the sequence processing flow of the LSTM network in an embodiment of this application.

[Corrected according to Rule 91 03.08.2020]

[Corrected according to Rule 91 03.08.2020]
FIG. 6 is a structural diagram of an image AU detection device in an embodiment of this application.

[Corrected according to Rule 91 03.08.2020]
Fig. 7 is a schematic diagram of the electronic equipment of this application.

Embodiments of the invention

In order to be able to understand the above objectives, features and advantages of the application more clearly, the application will be described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the application and the features in the embodiments can be combined with each other if there is no conflict.

In the following description, many specific details are set forth in order to fully understand the present application. The described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in the description of the application herein are only for the purpose of describing specific embodiments, and are not intended to limit the application.

Preferably, the AU detection method of the image of this application is applied to one or more electronic devices. The electronic device is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions. Its hardware includes, but is not limited to, a microprocessor and an application specific integrated circuit (ASIC) , Field-Programmable Gate Array (FPGA), Digital Processor (Digital Signal Processor, DSP), embedded equipment, etc.

The electronic device may be a computing device such as a desktop computer, a notebook computer, a tablet computer, and a cloud server. The device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.

Example 1

FIG. 1 is a schematic diagram of an application environment of an image AU detection method in an embodiment of the present application.

Referring to FIG. 1, the AU detection method of the image is applied in the terminal device 1. The terminal device 1 includes an image acquisition unit 11. The image collection unit 11 is used to collect face images. In this embodiment, the terminal device 1 may obtain a face image through the image acquisition unit 11 and perform AU detection on the face image. In this application, 19 AUs in FACS are selected, including 6 upper half face AUs and 13 lower half face AUs. In this application, the above 19 AUs are used as the standard for detecting and comparing face images to predict which AU category the face image belongs to.

The terminal device 1 is also connected to an external device 2 in communication. In one embodiment, the terminal device 1 is in communication connection with the external device 2 via a network. In a specific embodiment, the network used to support the communication between the terminal device 1 and the external device 2 may be a wired network or a wireless network, such as radio, wireless fidelity (WIFI), cellular, satellite, broadcast Wait. In this embodiment, the terminal device 1 may be a computer device, a single server, a server cluster, or a cloud server. The external device 2 can be, but is not limited to, a computer device, a mobile phone, a notebook computer, a tablet computer, and other devices.

Fig. 2 is a flowchart of an image AU detection method in an embodiment of the present application. According to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.

Referring to FIG. 2, the AU detection method of the image specifically includes the following steps:

Step S201: Obtain a face image.

In this embodiment, the image acquisition unit 11 may be a 2D camera, and the terminal device 1 acquires the user's 2D face image as the user's face image through the 2D camera. In another embodiment, the image acquisition unit 11 may also be a 3D camera, and the terminal device 1 acquires the user's 3D face image as the user's face image through the 3D camera. In another embodiment, the terminal device 1 receives a face picture sent by an external device 2 communicatively connected with the terminal device. In other embodiments, the face image is stored in a storage device of the terminal device 1, and the terminal device 1 obtains the face image from the storage device. Wherein, the face image includes consecutive frames of face pictures. For example, in one embodiment, the face picture may be a face video or the like.

Step S202: Perform detection processing on the acquired face image to acquire a unified face area.

In this embodiment, the terminal device 1 may use the Adaboost face detection algorithm based on Haar-like features to perform face detection on each frame of face images in the acquired face images to determine the face area. In a specific implementation process, the Adaboost face detection algorithm may be used to scan each frame of the face image with a window of a preset size and a preset compensation until the face area in each frame of the image is determined. In this embodiment, the face area may be a fixed rectangular area including the forehead, chin, left cheek, and right cheek in the face image.

In an embodiment, the terminal device 1 is also used to calibrate the face area. Specifically, the terminal device 1 detects the key feature points in the face area, and performs alignment and calibration on the corresponding face images based on the positions of the detected key feature points. The key feature points in the face area may be eyes, nose, mouth, left cheek outer contour, right cheek outer contour, and so on. According to the key feature points detected in the face region, the face image can be aligned and calibrated by the landmark method, so that the positions of the key feature points of the face in the face image are basically the same. In this embodiment, in order to avoid that the size of the face image is not uniform and affect the recognition result of the subsequent face image, the terminal device 1 may also edit the face image after alignment and calibration according to a preset template to obtain uniformity. The size of the face image. In this embodiment, the editing processing includes one or two of cutting processing and zoom processing. For example, in the editing process, the terminal device 1 cuts out the corresponding face image according to a uniform template based on the key feature points in the detected face area and scales the face image to a uniform size, so, The editing of the face image is realized. In this embodiment, opencv's resize can be used to scale the face image to a uniform size based on a bilinear difference or area difference algorithm.

Step S203: Input the face image subjected to the detection process as the original image to the optimized ResNet (Residual Neural Network, deep residual network) network for feature value extraction to output a face feature vector.

Please refer to FIG. 3, which shows a schematic diagram of the basic operation structure of the ResNet network in this application. In this embodiment, the first basic operation structure of the ResNet network is shown in FIG. 3(a), and the input in the first basic operation structure is superimposed with the original input through the output of three convolutional layers. When the input and output matrix sizes are the same, the first basic operation structure shown in Figure 3(a) is used. The second basic operation structure of the ResNet network is shown in FIG. 3(b). The input in the second basic operation structure is superimposed with the original input through the output of three convolutional layers. When the input and output matrix sizes are not the same, the second basic operation structure shown in Figure 3(b) is used.

Please refer to FIG. 4, which shows a structural diagram of a ResNet network in an embodiment of this application. As shown in Figure 4, the overall structure of the ResNet network includes: a convolutional layer, a pooling layer, 4 sets of convolution packages with different parameters, a pooling layer, a fully connected layer, and a sigmoid layer. In this embodiment, the original image is sequentially processed through the convolutional layer, pooling layer, 4 sets of convolution packages with different parameters, pooling layer, fully connected layer, and sigmoid layer of the ResNet network to obtain the face Feature vector. In this embodiment, the four convolutional packets include conv2_x, conv3_x, conv4_x, and conv5_x, respectively. In this embodiment, the second convolutional layer of the first package in each group of convolutional packages is down-sampled with a stride of 2.

In this embodiment, the step of "inputting the face image after the detection process as the original image into the optimized ResNet network for feature value extraction to output the face feature vector" includes:

(S2031) Augment the original image to obtain sample data.

In this embodiment, the augmentation of the original image to obtain sample data specifically includes: obtaining the original image; randomly cropping a picture with preset resolution pixels from the original image to obtain an initial sample picture; and obtaining an evenly distributed image with [0,1] Random number. When the random number of the initial sample picture is less than the random threshold, it will flip to generate a new [0,1] uniformly distributed random number. When the random number is less than 0.5, the initial sample picture will be grayed out to obtain the first Sample picture; add point light source processing to the first sample picture to obtain a second sample picture; use the obtained initial sample picture, first sample picture, and second sample picture as sample data.

In this embodiment, the purpose of augmenting the original image is to increase the number of training samples. The methods of augmentation include, but are not limited to, flipping, randomly cropping an original image with a size of 256*256 by 248*248, graying the original image, modifying the lighting of the original image, and adding point light source lighting to the original image. Use multiple augmentation methods to couple the augmentation. For example, randomly crop a 248*248 pixel image from a 256*256 original image to obtain an initial sample image; generate a random number with uniform distribution of [0,1]. When the random number is less than 0.5, it will flip to generate a new uniform [0,1] The distributed random number, when the random number is less than 0.5, the initial sample image is grayed out to obtain the first sample image; or the first sample image is processed by adding point light sources to obtain the second sample image, etc., when the When the number of initial sample pictures, first sample pictures, and second sample pictures reaches the actual application scenario requirements, the augmentation ends.

(S2032) Train the sample data to optimize the ResNet network to obtain an optimized ResNet network.

In this embodiment, the ResNe network is optimized through the training sample data of the first basic operation structure or the second basic operation structure of the ResNet network. Among them, the first basic arithmetic structure of the ResNet network is the superposition of the output of the input through the three convolutional layers and the original input. The first basic operation structure is used when the input and output matrices have the same size.

In this embodiment, training the sample data to optimize the ResNet network, and the optimized ResNet network includes: training a classification network of face images to obtain a face classification network; and transferring the trained face classification network to train an AU neural network , Get the sample data after training.

In this embodiment, the migration method can be used to train the face classification network step by step, that is, the last fully-linked layer parameter is set to the number of face classifications, and the last 9 AU sigmoid layers in the AU neural network are replaced with a softmax layer . First train 100 classified faces, and when the accuracy reaches 70%, transfer the result of 100 classified faces to 1200 face classification for training. When the accuracy of 1200 face classification reaches 90%, it will be transferred to 16000 face classification for training, and finally 16000 face classification results will be trained as high as possible. In this embodiment, the parameters of each layer of the ResNet network from the convolutional layer to the conv3_x are fixed, and 16000 human face training parameters are transferred as the initial parameters to train the parameters of the conv4_x and subsequent layers. In this way, the prior knowledge of the existing face classification learning is fully utilized to improve the AU detection accuracy.

In this embodiment, the migration training refers to directly loading the face classification network parameters into the AU neural network, because only the last layer of the two neural networks is different in structure, and the number of other parameters are the same. So the parameters can be loaded. The AU neural network has a low output dimension and only 19 results, while the face classification network has a high dimension. Use the face classification training results and transfer them, and at the same time part of the layers are locked, so that the face structure features learned in the face classification are fully utilized in the AU detection.

(S2033) Input the face image to be detected in the optimized ResNet network to obtain the face feature vector.

In this embodiment, after the ResNet network is optimized, the face image after the detection process is input to the optimized ResNet network to obtain the features of the fully connected layer as the face feature vector output by the ResNet network.

Step S204: Input the face feature vector output by the ResNet network into an LSTM (Long Short-Term Memory) network for training, and obtain the AU recognition result of the face image.

In this embodiment, the LSTM network is a special recurrent neural network. The LSTM network regards the input sequence as a time sequence, and the long and short-term memory network can learn the short-term and long-term dependencies of the data in the input sequence in time. In this embodiment, the aforementioned AU recognition result may indicate the AU category of the face image.

Please refer to FIG. 5, which is a schematic diagram of the sequence processing flow of the LSTM network in an embodiment of this application. Among them, X0, X1,..., Xn are each frame image of a face image with a length of n frames, and each frame image in the face image is extracted through the ResNet network to extract the face feature vector Y0, Y1,... ., Yn, the face feature vectors Y0, Y1,..., Yn are sequentially input to the LSTM network in chronological order, and the AU recognition results h0, h1,... , Hn.

In this embodiment, the LSTM network includes: input gates (that is, input gates), forget gates (that is, forget gates), output gates (that is, output gates), state units (that is, cells), and LSTM output gates.

For the case where the input face image sequence contains more than two frames of face images, the processing procedures of the input gate, forget gate, output gate, state unit and LSTM output gate can be calculated and implemented by the following formulas:

i _t =σ(W _ix ·x _t +W _im ·m _t-1 +W _ic ·c _t-1 +b _i );

f _t =σ(W _fx ·xt+W _fm ·mt-1+W _fc ·c _t-1 +b _f );

c _t =ft⊙c _t-1 +i _t ⊙σ(W _cx ·x _t +W _cm ·m _t-1 +b _c );

o _t =σ(W _ox ·x _t +W _om ·m _t-1 +W _oc ·c _t-1 +b _o );

m _t =o _t ⊙h(c _t ).

Among them, in the above formula, x _{t is} expressed as the face feature vector input at time t; W (ie _Wix , _Wim , _Wiic , W _fx , W _fm , W _fc , W _cx , W _cm , W _ox , W _om and W _oc ) are preset weight matrices, indicating that the elements of each gate are obtained from the data of the corresponding dimension, that is to say, nodes of different dimensions do not interfere with each other; b (ie, b _i , b _f, b _c, b _o) represents a predetermined offset _{_{vector, i t, f t, o}} t, c t, m t respectively represent the gate input at time t, forgetting gate, the output of the gate, and the status unit In the state of the LSTM output gate, ⊙ is the dot product, σ() is the sigmoid function, and h() is the output activation function of the above-mentioned state unit. The output activation function may specifically be a tanh function.

The specific process of training the LSTM network in this embodiment is as follows: each frame image in the face image is extracted through the ResNet network to extract the face feature vector into the LSTM network, and the LSTM network is processed based on the back propagation algorithm. Training is performed so that the deviation between the value of the input image processed by the LSTM network and the mapping value of the expression category to which the image belongs is within a preset allowable range. Of course, the training process of the LSTM network can also be implemented with reference to other existing technical solutions, which is not limited here.

The image AU detection method in this application builds a training model based on the ResNet network and the LSTM network, and uses a collection of continuous frame images (such as video) of the face image as the training input of the training model, so that the training model can be fully utilized The dynamic information of face AU changes automatically learns the mapping relationship between the AU features of the recognition object, thereby improving the prediction accuracy and robustness of the training model, and further improving the performance of AU recognition of the face image.

Example 2

FIG. 6 is a structural diagram of an image AU detection device 40 in an embodiment of this application.

In some embodiments, the image AU detection device 40 runs in the terminal device 1. The image AU detection device 40 may include multiple functional modules composed of program code segments. The program code of each program segment in the image AU detection device 40 can be stored in the memory and executed by at least one processor to perform the function of face recognition.

In this embodiment, the image AU detection device 40 can be divided into multiple functional modules according to the functions it performs. Referring to FIG. 7, the image AU detection device 40 may include an acquisition module 401, a preprocessing module 402, a feature extraction module 403, and an identification module 404. The module referred to in this application refers to a series of computer-readable instruction segments that can be executed by at least one processor and can complete fixed functions, and are stored in a memory. In some embodiments, the functions of each module will be detailed in subsequent embodiments.

The acquiring module 401 is used to acquire a face image.

In this embodiment, the image acquisition unit 11 may be a 2D camera, and the acquisition module 401 acquires the user's 2D face image as the user's face image through the 2D camera. In another embodiment, the image acquisition unit 11 may also be a 3D camera, and the acquisition module 401 acquires the user's 3D face image as the user's face image through the 3D camera. In another embodiment, the acquisition module 401 receives a face picture sent by the external device 2 communicatively connected with the terminal device. In other embodiments, the face image is stored in a storage device of the terminal device 1, and the acquisition module 401 acquires the face image from the storage device. Wherein, the face image includes consecutive frames of face pictures. For example, in one embodiment, the face picture may be a face video or the like.

The preprocessing module 402 is used to perform detection processing on the acquired face image to acquire a unified face area.

In this embodiment, the preprocessing module 402 may use the Adaboost face detection algorithm based on Haar-like features to perform face detection on each frame of face images in the acquired face images to determine the face area. In a specific implementation process, the Adaboost face detection algorithm may be used to scan each frame of the face image with a window of a preset size and a preset compensation until the face area in each frame of the image is determined. In this embodiment, the face area may be a fixed rectangular area including the forehead, chin, left cheek, and right cheek in the face image.

In an embodiment, the preprocessing module 402 is also used to perform calibration processing on the face area. Specifically, the preprocessing module 402 detects key feature points in the face region, and performs alignment and calibration on the corresponding face image based on the positions of the detected key feature points. The key feature points in the face area may be eyes, nose, mouth, left cheek outer contour, right cheek outer contour, and so on. According to the key feature points detected in the face region, the face image can be aligned and calibrated by the landmark method, so that the positions of the key feature points of the face in the face image are basically the same. In this embodiment, in order to avoid that the size of the face image is not uniform and affect the recognition result of the subsequent face image, the preprocessing module 402 may also edit the face image after alignment and calibration according to a preset template to obtain Face images of uniform size. In this embodiment, the editing processing includes one or two of cutting processing and zoom processing. For example, in the editing process, the preprocessing module 402 cuts out the corresponding face image according to a uniform template based on the key feature points in the detected face area and scales the face image to a uniform size. , To realize the editing of the face image. In this embodiment, opencv's resize can be used to scale the face image to a uniform size based on a bilinear difference or area difference algorithm.

The feature extraction module 403 is configured to input the detected face image as an original image into an optimized ResNet (Residual Neural Network, deep residual network) network for feature value extraction to output a face feature vector.

In this embodiment, the first basic operation structure of the ResNet network is shown in FIG. 3(a), and the input in the first basic operation structure is superimposed with the original input through the output of three convolutional layers. When the input and output matrix sizes are the same, the first basic operation structure shown in Figure 3(a) is used. The second basic operation structure of the ResNet network is shown in FIG. 3(b). The input in the second basic operation structure is superimposed with the original input through the output of three convolutional layers. When the input and output matrix sizes are not the same, the second basic operation structure shown in Figure 3(b) is used.

As shown in Figure 4, the overall structure of the ResNet network includes: a convolutional layer, a pooling layer, 4 sets of convolution packages with different parameters, a pooling layer, a fully connected layer, and a sigmoid layer. In this embodiment, the original image is sequentially processed through the convolutional layer, pooling layer, 4 sets of convolution packages with different parameters, pooling layer, fully connected layer, and sigmoid layer of the ResNet network to obtain the face Feature vector. In this embodiment, the four convolutional packets include conv2_x, conv3_x, conv4_x, and conv5_x, respectively. In this embodiment, the second convolutional layer of the first package in each group of convolutional packages is down-sampled with a stride of 2.

In this embodiment, the “inputting the face image after the detection process as the original image into the optimized ResNet network for feature value extraction to output the face feature vector” includes:

(S2031) Augment the original image to obtain sample data.

In this embodiment, the ResNe network is optimized through the training sample data of the first basic operation structure or the second basic operation structure of the ResNet network. Among them, the first basic arithmetic structure of the ResNet network is the superposition of the output of the input through three convolutional layers and the original input. The first basic operation structure is used when the input and output matrices have the same size.

In this embodiment, after the ResNet network is optimized, the face image after the detection process is input to the optimized ResNet network to obtain the features of the fully connected layer as the face feature vectors output by the ResNet network.

The recognition module 404 is configured to input the face feature vector output by the ResNet network into an LSTM (Long Short-Term Memory) network for training, and obtain the AU recognition result of the face image.

In this embodiment, the LSTM network is a special recurrent neural network. The LSTM network regards the input sequence as a time sequence, and the long and short-term memory network can learn the short-term and long-term dependence of data in the input sequence in time. In this embodiment, the aforementioned AU recognition result may indicate the AU category of the face image.

FIG. 5 is a schematic diagram of the sequence processing flow of the LSTM network in an embodiment of this application. Among them, X0, X1,..., Xn are each frame image of a face image with a length of n frames, and each frame image in the face image is extracted through the ResNet network to extract the face feature vector Y0, Y1,... ., Yn, the face feature vectors Y0, Y1,..., Yn are sequentially input to the LSTM network in chronological order, and the AU recognition results h0, h1,... , Hn.

i _t =σ(W _ix ·x _t +W _im ·m _t-1 +W _ic ·c _t-1 +b _i );

f _t =σ(W _fx ·xt+W _fm ·mt-1+W _fc ·c _t-1 +b _f );

c _t =ft⊙c _t-1 +i _t ⊙σ(W _cx ·x _t +W _cm ·m _t-1 +b _c );

o _t =σ(W _ox ·x _t +W _om ·m _t-1 +W _oc ·c _t-1 +b _o );

m _t =o _t ⊙h(c _t ).

Example 3

FIG. 7 is a schematic diagram of the electronic device 6 in an embodiment of the application.

The electronic device 6 includes a memory 61, a processor 62, and computer readable instructions 63 stored in the memory 61 and executable on the processor 62. The processor 62 implements the steps in the embodiment of the AU detection method for the image when the processor 62 executes the computer readable instruction 63, such as steps S201 to S204 shown in FIG. 2. Alternatively, when the processor 62 executes the computer-readable instruction 63, the functions of the modules/units in the embodiment of the image AU detection device are realized, such as the modules 401 to 404 in FIG. 6.

Exemplarily, the computer-readable instructions 63 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 61 and executed by the processor 62, To complete this application. The one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 63 in the electronic device 6. For example, the computer-readable instruction 63 may be divided into the acquisition module 401, the preprocessing module 402, the feature extraction module 403, and the recognition module 404 in FIG. 6. For the specific functions of each module, refer to Embodiment 2.

The electronic device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud terminal device 1. Those skilled in the art can understand that the schematic diagram is only an example of the electronic device 6 and does not constitute a limitation on the electronic device 6. It may include more or less components than those shown in the figure, or combine certain components, or different components. Components, for example, the electronic device 6 may also include input and output devices, network access devices, buses, and so on.

The so-called processor 62 may be a central processing module (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Ready-made programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or the processor 62 can also be any conventional processor, etc. The processor 62 is the control center of the electronic device 6 and connects the entire electronic device 6 through various interfaces and lines. Parts.

In an embodiment, the present application provides an electronic device, wherein the electronic device includes a processor, and the processor is configured to execute computer-readable instructions stored in a memory to implement the following steps:

Obtain face images;

Wherein, the processor further implements the following steps when executing the computer-readable instruction:

Based on the Adaboost face detection algorithm, scan each frame of the face image with a preset size window and preset compensation until the face area in each frame of the image is determined, wherein the face area includes the forehead, the chin, The fixed rectangular area of the left and right cheeks.

Detect key feature points in the face area, and align the face image based on the positions of the detected key feature points. The key feature points in the face area include eyes, nose, mouth, The outer contour of the left cheek and the outer contour of the right cheek; and

The face image after alignment and calibration is edited according to a preset template to obtain a face image of a uniform size.

Wherein, the ResNet network structure includes a convolutional layer, a pooling layer, four groups of convolutional packets with different parameters, a pooling layer, a fully connected layer, and a sigmoid layer.

Augment the original image to obtain training sample data;

Training the sample data to optimize the ResNet network to obtain an optimized ResNet network; and

Input the face image to be detected into the optimized ResNet network to obtain the face feature vector.

Acquiring the original image;

Randomly crop a picture with preset resolution pixels from the original image to obtain an initial sample picture;

Obtain [0,1] uniformly distributed random numbers, and when the random number of the initial sample picture is less than the random threshold, flip to generate a new [0,1] uniformly distributed random number;

When the random number is less than 0.5, graying the initial sample picture to obtain a first sample picture;

Processing the first sample picture by adding a point light source to obtain a second sample picture;

And use the obtained initial sample picture, first sample picture, and second sample picture as the sample data.

The memory 61 may be used to store the computer-readable instructions 63 and/or modules/units, and the processor 62 can run or execute the computer-readable instructions and/or modules/units stored in the memory 61, and The data stored in the memory 61 is called to realize various functions of the electronic device 6. The memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.); the storage data area may The data (such as audio data, phone book, etc.) created according to the use of the electronic device 6 is stored. In addition, the memory 61 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart memory card (Smart Media Card, SMC), and a Secure Digital (SD) Card, Flash Card, at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

If the integrated module/unit of the electronic device 6 is implemented in the form of a software function module and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, this application implements all or part of the processes in the above-mentioned embodiments and methods, and can also be completed by instructing relevant hardware through computer-readable instructions, and the computer-readable instructions may be stored in a computer-readable storage medium. Wherein, the computer-readable storage medium may be non-volatile or volatile, and the computer-readable instructions may implement the steps of the foregoing method embodiments when executed by a processor. Wherein, the computer-readable instruction includes computer-readable instruction code, and the computer-readable instruction code may be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), electrical carrier signal, telecommunications signal, and software distribution media. It should be noted that the content contained in the computer-readable medium can be appropriately added or deleted in accordance with the requirements of the legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to the legislation and patent practice, the computer-readable medium Does not include electrical carrier signals and telecommunication signals.

In an embodiment, the agency’s application also provides one or more readable storage media storing computer readable instructions, wherein when the computer readable instructions are executed by one or more processors, the one or Multiple processors perform the following steps:

Obtain face images;

Wherein, when the computer-readable instruction is executed by one or more processors, the one or more processors further execute the following steps:

Augment the original image to obtain training sample data;

Acquiring the original image;

In the several embodiments provided in this application, it should be understood that the disclosed electronic device and method may be implemented in other ways. For example, the electronic device embodiments described above are merely illustrative. For example, the division of the modules is only a logical function division, and there may be other division methods in actual implementation.

In addition, the functional modules in the various embodiments of the present application may be integrated in the same processing module, or each module may exist alone physically, or two or more modules may be integrated in the same module. The above-mentioned integrated modules can be implemented in the form of hardware, or in the form of hardware plus software functional modules.

For those skilled in the art, it is obvious that the present application is not limited to the details of the foregoing exemplary embodiments, and the present application can be implemented in other specific forms without departing from the spirit or basic characteristics of the application. Therefore, no matter from which point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than the above description, and therefore it is intended to fall into the claims. All changes in the meaning and scope of the equivalent elements of are included in this application. Any reference signs in the claims should not be regarded as limiting the claims involved. In addition, it is obvious that the word "including" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or electronic devices stated in the claims of an electronic device can also be implemented by the same module or electronic device through software or hardware. Words such as first and second are used to denote names, but do not denote any specific order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the application and not to limit them. Although the application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the application can be Modifications or equivalent replacements are made without departing from the spirit and scope of the technical solution of this application.

Claims

An image AU detection method, wherein the method includes:

Obtain face images;

Perform detection processing on the acquired face image to obtain a uniform face area;

Input the detected face image as the original image into the optimized ResNe network for feature value extraction to output the face feature vector; and

The face feature vector output by the ResNet network is input into the LSTM network for training, and the AU recognition result of the face image is obtained.
The AU detection method of an image according to claim 1, wherein said performing detection processing on the obtained face image to obtain a unified face area comprises:

Based on the Adaboost face detection algorithm, scan each frame of the face image with a preset size window and preset compensation until the face area in each frame of the image is determined, wherein the face area includes the forehead, the chin, The fixed rectangular area of the left and right cheeks.
The AU detection method of an image according to claim 1, wherein said performing detection processing on the obtained face image to obtain a unified face area comprises:

Detect key feature points in the face area, and align the face image based on the positions of the detected key feature points. The key feature points in the face area include eyes, nose, mouth, The outer contour of the left cheek and the outer contour of the right cheek; and

The face image after alignment and calibration is edited according to a preset template to obtain a face image of a uniform size.
8. The image AU detection method according to claim 1, wherein the ResNet network structure includes a convolutional layer, a pooling layer, 4 sets of convolutional packets with different parameters, a pooling layer, a fully connected layer, and a sigmoid layer.
The AU detection method of an image according to claim 4, wherein the input of the detected face image as the original image into the optimized ResNet network for feature value extraction to output the face feature vector comprises:

Augment the original image to obtain training sample data;

Training the sample data to optimize the ResNet network to obtain an optimized ResNet network; and

Input the face image to be detected into the optimized ResNet network to obtain the face feature vector.
8. The image AU detection method of claim 5, wherein the sample data obtained by augmenting the original image to obtain training comprises:

Acquiring the original image;

Randomly crop a picture with preset resolution pixels from the original image to obtain an initial sample picture;

Obtain [0,1] uniformly distributed random numbers, and when the random number of the initial sample picture is less than the random threshold, flip to generate a new [0,1] uniformly distributed random number;

When the random number is less than 0.5, graying the initial sample picture to obtain a first sample picture;

Processing the first sample picture by adding a point light source to obtain a second sample picture;

And use the obtained initial sample picture, first sample picture, and second sample picture as the sample data.
The image AU detection method according to claim 1, wherein the LSTM network includes an input gate, a forget gate, an output gate, a state unit and an LSTM output gate, and the input gate, forget gate, output gate, state unit and The LSTM output gate is calculated and realized by the following formula:

i t =σ(W ix ·x t +W im ·m t-1 +W ic ·c t-1 +b i );

f t =σ(W fx ·xt+W fm ·mt-1+W fc ·c t-1 +b f );

c t =ft⊙c t-1 +i t ⊙σ(W cx ·x t +W cm ·m t-1 +b c );

o t =σ(W ox ·x t +W om ·m t-1 +W oc ·c t-1 +b o );

m t ＝o t ⊙h(c t );

Among them, x t represents the face feature vector input at time t; Wix , Wim , Wiic , W fx , W fm , W fc , W cx , W cm , W ox , W om and W oc are preset weight matrix; b i, b f, b c, b o represents a predetermined offset vector; i t, f t, o t, c t, m t represent time t of the input gate, door forgotten, The state of the output gate, the state unit and the LSTM output gate, where σ() is a sigmoid function, h() is the output activation function of the above state unit, and the output activation function is a tanh function.
An image AU detection device, wherein the device includes:

The acquisition module is used to acquire a face image;

The preprocessing module is used to detect and process the acquired face images to obtain a unified face area;

The feature extraction module is used to input the detected face image as the original image into the optimized ResNe network for feature value extraction to output the face feature vector; and

The recognition module is used to input the face feature vector output by the ResNet network into the LSTM network for training, and obtain the AU recognition result of the face image.
An electronic device, wherein the electronic device includes a processor configured to execute computer-readable instructions stored in a memory to implement the following steps:

Obtain face images;

Perform detection processing on the acquired face image to obtain a uniform face area;

Input the detected face image as the original image into the optimized ResNe network for feature value extraction to output the face feature vector; and

The face feature vector output by the ResNet network is input into the LSTM network for training, and the AU recognition result of the face image is obtained.
9. The electronic device of claim 9, wherein the processor further implements the following steps when executing the computer-readable instructions:

Based on the Adaboost face detection algorithm, scan each frame of the face image with a preset size window and preset compensation until the face area in each frame of the image is determined, wherein the face area includes the forehead, the chin, The fixed rectangular area of the left and right cheeks.
9. The electronic device of claim 9, wherein the processor further implements the following steps when executing the computer-readable instructions:

Detect key feature points in the face area, and align the face image based on the positions of the detected key feature points. The key feature points in the face area include eyes, nose, mouth, The outer contour of the left cheek and the outer contour of the right cheek; and

The face image after alignment and calibration is edited according to a preset template to obtain a face image of a uniform size.
8. The electronic device according to claim 9, wherein the ResNet network structure includes a convolutional layer, a pooling layer, four sets of convolutional packets with different parameters, a pooling layer, a fully connected layer, and a sigmoid layer.
The electronic device according to claim 12, wherein the processor further implements the following steps when executing the computer-readable instruction:

Augment the original image to obtain training sample data;

Training the sample data to optimize the ResNet network to obtain an optimized ResNet network; and

Input the face image to be detected into the optimized ResNet network to obtain the face feature vector.
The electronic device according to claim 13, wherein the processor further implements the following steps when executing the computer-readable instructions:

Acquiring the original image;

Randomly crop a picture with preset resolution pixels from the original image to obtain an initial sample picture;

Obtain [0,1] uniformly distributed random numbers, and when the random number of the initial sample picture is less than the random threshold, flip to generate a new [0,1] uniformly distributed random number;

When the random number is less than 0.5, graying the initial sample picture to obtain a first sample picture;

Processing the first sample picture by adding a point light source to obtain a second sample picture;

And use the obtained initial sample picture, first sample picture, and second sample picture as the sample data.
One or more readable storage media storing computer readable instructions, where when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:

Obtain face images;

Perform detection processing on the acquired face image to obtain a uniform face area;

Input the detected face image as the original image into the optimized ResNe network for feature value extraction to output the face feature vector; and

The face feature vector output by the ResNet network is input into the LSTM network for training, and the AU recognition result of the face image is obtained.
15. The readable storage medium of claim 15, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

Based on the Adaboost face detection algorithm, scan each frame of the face image with a preset size window and preset compensation until the face area in each frame of the image is determined, wherein the face area includes the forehead, the chin, The fixed rectangular area of the left and right cheeks.
15. The readable storage medium of claim 15, wherein when the computer readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

Detect key feature points in the face area, and align the face image based on the positions of the detected key feature points. The key feature points in the face area include eyes, nose, mouth, The outer contour of the left cheek and the outer contour of the right cheek; and

The face image after alignment and calibration is edited according to a preset template to obtain a face image of a uniform size.
15. The readable storage medium of claim 15, wherein the ResNet network structure includes a convolutional layer, a pooling layer, four sets of convolutional packets with different parameters, a pooling layer, a fully connected layer, and a sigmoid layer.
The readable storage medium according to claim 18, wherein when the computer readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

Augment the original image to obtain training sample data;

Training the sample data to optimize the ResNet network to obtain an optimized ResNet network; and

Input the face image to be detected into the optimized ResNet network to obtain the face feature vector.
The readable storage medium of claim 19, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

Acquiring the original image;

Randomly crop a picture with preset resolution pixels from the original image to obtain an initial sample picture;

Obtain [0,1] uniformly distributed random numbers, and when the random number of the initial sample picture is less than the random threshold, flip to generate a new [0,1] uniformly distributed random number;

When the random number is less than 0.5, graying the initial sample picture to obtain a first sample picture;

Processing the first sample picture by adding a point light source to obtain a second sample picture;

And use the obtained initial sample picture, first sample picture, and second sample picture as the sample data.