CN114581973A - Face pose estimation method and device, storage medium and computer equipment

Face pose estimation method and device, storage medium and computer equipment

Info

Publication number
CN114581973A
CN114581973A
Authority
CN
China
Prior art keywords
face
key point
coordinates
keypoint
nose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210130312.9A
Other languages
Chinese (zh)
Inventor
孟娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Zhuhai Zero Boundary Integrated Circuit Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Zero Boundary Integrated Circuit Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Zero Boundary Integrated Circuit Co Ltd filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN202210130312.9A priority Critical patent/CN114581973A/en
Publication of CN114581973A publication Critical patent/CN114581973A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a face pose estimation method and apparatus, a storage medium and computer equipment, and relates to the field of image processing. The method comprises the following steps: normalizing an input picture to obtain a normalized image, wherein the input picture contains a human face; detecting the normalized image with a pre-trained simplified network model to obtain position information of a face prediction frame and position information of a plurality of face key points; and calculating the pose angles of the face according to the position information of the plurality of face key points. The convolutional neural network has a simple structure and can be deployed on embedded devices, so that face pose estimation with high precision, high real-time performance and low computation cost can be realized.

Description

Face pose estimation method and device, storage medium and computer equipment
Technical Field
The present application relates to the field of image processing, and in particular, to a method and an apparatus for estimating a face pose, a storage medium, and a computer device.
Background
With the rapid development of artificial intelligence technology, related industries and applications have spread into many fields, and a wide variety of intelligent products greatly facilitate people's lives. Face pose estimation, as an entry point and an important branch of the artificial intelligence field, is already used in many areas of production and daily life, such as posture detection in intelligent desk lamps, driver fatigue detection, and virtual reality.
Face pose estimation methods are generally divided into model-based methods, face-appearance-based methods and classification-based methods. In a model-based method, 2D feature points (such as the mouth corners, eye corners and nose) are extracted from the face region of a two-dimensional image and put into correspondence with the 3D feature points of a three-dimensional face model, and the pose of the face is estimated with geometric and similar methods; provided the feature points are detected accurately, this way of estimating the face pose is simple, efficient, fast and computationally light. An appearance-based method assumes a correspondence between the three-dimensional face pose and certain features of the face in the two-dimensional image (gray scale, image gradients and the like), and recovers this relation by constructing a mathematical model and training it on a large number of face images with known poses, thereby determining the face pose; however, establishing the actual correspondence requires a large number of training images for verification, interpolation is needed when processing the images, the amount of computation is large, and the final pose estimation result is poor.
Disclosure of Invention
The embodiment of the application provides a face pose estimation method, a face pose estimation device, a storage medium and computer equipment, and can solve the problems of large calculation amount, low accuracy and low real-time performance of face pose estimation in the prior art. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a method for estimating a face pose, where the method includes:
carrying out normalization processing on an input picture to obtain a normalized image; wherein the input picture comprises a human face;
detecting the normalized image by using a pre-trained network model to obtain position information of a face prediction frame and position information of a plurality of face key points;
and calculating the attitude angle of the human face according to the position information of the plurality of human face key points.
In a second aspect, an embodiment of the present application provides a face pose estimation apparatus, including:
the normalization unit is used for normalizing the input picture to obtain a normalized image; wherein the input picture comprises a human face;
the detection unit is used for detecting the normalized image by utilizing a pre-trained network model to obtain the position information of a face prediction frame and the position information of a plurality of face key points;
and the calculation unit is used for calculating the attitude angle of the human face according to the position information of the plurality of human face key points.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides a computer device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The beneficial effects brought by the technical scheme provided by some embodiments of the application at least comprise:
the method comprises the steps of outputting position information of a face prediction frame and position information of face key points by using a pre-trained network model, and then calculating attitude angles of a face by using the position information of the face key points.
Drawings
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a face pose estimation method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a face prediction box and an anchor box provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a world coordinate system provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a human face pose estimation apparatus provided in the present application;
fig. 5 is a schematic structural diagram of a computer device provided in the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be noted that, the face pose estimation method provided by the present application is generally executed by a computer device, and accordingly, the face pose estimation apparatus is generally disposed in the computer device.
Computer devices include, but are not limited to, smartphones, tablets, laptop computers, desktop computers and the like. The computer device may also be provided with a display device and a camera; the display device may be any device capable of realizing a display function, and the camera is used to collect video streams. For example, the display device may be a cathode ray tube (CRT) display, a light-emitting diode (LED) display, an electronic ink screen, a liquid crystal display (LCD), a plasma display panel (PDP) or the like. The user can use the display device on the computer device to view displayed information such as text, pictures and videos.
The following describes in detail the face pose estimation method provided by the embodiment of the present application with reference to fig. 1. The face pose estimation apparatus in the embodiment of the present application may be the computer device shown in fig. 5.
Referring to fig. 1, a schematic flow chart of a face pose estimation method provided in an embodiment of the present application is shown. As shown in fig. 1, the method of the embodiment of the present application may include the following steps:
and S101, carrying out normalization processing on the original image to obtain a normalized image.
The original image comprises one or more human faces. The image normalization means that the original image to be processed is converted into a corresponding unique standard form through a series of transformation. For example: the normalization process includes: and dividing the value of each pixel included in the original image by 255 to obtain a normalized image, wherein the value of each pixel included in the normalized image is between [0 and 1 ].
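By way of illustration, a minimal Python sketch of this normalization step, assuming an 8-bit image read with OpenCV (the function name normalize_image and the use of OpenCV are illustrative, not part of the disclosure):

```python
import cv2
import numpy as np

def normalize_image(path: str) -> np.ndarray:
    """Read an 8-bit image and scale every pixel value into [0, 1]."""
    img = cv2.imread(path)                 # uint8 array, values in [0, 255]
    if img is None:
        raise FileNotFoundError(path)
    return img.astype(np.float32) / 255.0  # float32 array, values in [0, 1]
```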
S102: detect the normalized image with a pre-trained network model to obtain the position information of a face prediction frame and the position information of a plurality of face key points.
The pre-trained network model is a convolutional neural network obtained by training with a machine learning or deep learning algorithm, for example a yolov3 (you only look once v3) network model; the following embodiments take the yolov3 network model as an example. The computer device is deployed with a pre-trained yolov3 network model. The yolov3 network model is trained with a data set comprising a training set, a validation set and a test set; the data set contains a plurality of face images, and the type of the data set is not limited in this application. The proportions of the training set, validation set and test set within the data set can be determined according to actual requirements and are likewise not limited. Each face image in the data set carries annotation information, which may be annotated manually by a user with an annotation tool or automatically by the computer device. The annotation information represents the position of the face and the positions of a plurality of face key points in each face image; for example, the true position of the face is marked with a rectangular frame in the face image (this rectangular frame is the face labeling frame), and the true positions of the key points on the face are marked with annotation points.
It should be noted that the yolov3 network model of the present application is a simplified model obtained by pruning the original yolov3 network model. The pruning process includes deleting layers, changing the number of channels, changing convolution kernel sizes and so on; the number of layers of the yolov3 network model is reduced to about 20 and the model size is about 9.4 MB, so that the model matches the computing power of embedded devices.
In one or more possible embodiments, the training process of the network model includes:
clustering the data set by using a clustering algorithm to obtain anchor frames with various scales;
determining a face detection loss function and a face key point detection loss function;
and training on the training set with the anchor frames of multiple scales, the face detection loss function and the face key point detection loss function to obtain the network model.
Specifically, the sizes of the face labeling frames in all images of the annotated data set are acquired, and these sizes are clustered with the K-means clustering algorithm to obtain anchor frames (anchor boxes) of multiple scales. The face detection loss function is used to calculate the difference between the face prediction frame and the face labeling frame, and the face key point detection loss function is used to calculate the difference between the true and predicted positions of the face key points. In the present application, the face detection loss function may be an L1 loss function or an L2 loss function, and the face key point detection loss function may be expressed as
loss(x) = w · ln(1 + |x| / a),   if |x| < w
loss(x) = |x| − C,               otherwise
where x is the difference between the true value and the predicted value, and w, a and C are constants whose values can be determined according to actual requirements; the application is not limited in this respect. The gradient of the logarithmic function in the loss function is 1/x, and the optimal step size is x², which balances error terms of different magnitudes, alleviates the outlier problem of face key points, and improves the accuracy of face key point detection. During training, the faces in a face image are searched with the anchor frames of each scale, and a plurality of candidate face prediction frames are obtained when the search is finished; finally, the confidence of each candidate face prediction frame is calculated with a non-maximum suppression algorithm, and the face prediction frame with the highest confidence is taken as the final face prediction frame.
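By way of illustration, a minimal NumPy sketch of a key point loss of this piecewise logarithmic form; the default values w = 10.0 and a = 2.0, and the choice C = w − w·ln(1 + w/a) that joins the two pieces continuously, are assumptions rather than values from the disclosure:

```python
import numpy as np

def keypoint_loss(pred: np.ndarray, target: np.ndarray,
                  w: float = 10.0, a: float = 2.0) -> float:
    """Piecewise loss: logarithmic near zero, linear (L1-like) for large errors."""
    x = np.abs(pred - target)
    # Assumed: C chosen so the two pieces join continuously at |x| = w.
    C = w - w * np.log(1.0 + w / a)
    loss = np.where(x < w, w * np.log(1.0 + x / a), x - C)
    return float(loss.mean())
```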
In the embodiment of the present application, the parameter values output by the network model include t_x, t_y, t_w and t_h, where t_x and t_y denote the coordinate offsets and t_w and t_h are used for scale scaling: t_x = G_x − c_x and t_y = G_y − c_y, where G_x and G_y are the 2D coordinates of the center point of the face labeling frame on the corresponding feature map (see FIG. 2), and c_x and c_y are the 2D coordinates of the upper left corner of the feature grid cell in which the center point is located.
Then, the position information of the face prediction frame is calculated according to the following formula:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)

wherein b_x and b_y are the 2D coordinates of the center point of the face prediction frame, b_w is the width of the face prediction frame, b_h is the height of the face prediction frame, σ(·) is a sigmoid function, p_w and p_h are the width and height of the corresponding anchor frame, and c_x and c_y are the 2D coordinates of the upper left corner of the grid cell on the feature map corresponding to the normalized image.
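A minimal sketch of this decoding step for a single grid cell; the stride that maps feature-map units back to input-image pixels is an assumption (the formulas above are stated at feature-map scale):

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride=32.0):
    """Decode network outputs into a face prediction frame (center, width, height)."""
    bx = (sigmoid(tx) + cx) * stride  # center x in input-image pixels
    by = (sigmoid(ty) + cy) * stride  # center y in input-image pixels
    bw = pw * np.exp(tw)              # anchor width scaled by e^(tw)
    bh = ph * np.exp(th)              # anchor height scaled by e^(th)
    return bx, by, bw, bh
```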
Further, the parameter values output by the network model of the application comprise the offsets between the 2D coordinates of the plurality of face key points and the center point of the face prediction frame; each offset comprises a horizontal 2D coordinate offset and a vertical 2D coordinate offset. The number of face key points is 5, and the offsets of the 5 face key points are (t_x1, t_y1), (t_x2, t_y2), (t_x3, t_y3), (t_x4, t_y4) and (t_x5, t_y5), representing the left eye key point offset, the right eye key point offset, the nose key point offset, the left mouth corner key point offset and the right mouth corner key point offset, respectively.
Then, the 2D coordinates of the 5 face key points are calculated according to the following formula:
Leye_x = σ(t_x1) + c_x    Leye_y = σ(t_y1) + c_y
Reye_x = σ(t_x2) + c_x    Reye_y = σ(t_y2) + c_y
Nose_x = σ(t_x3) + c_x    Nose_y = σ(t_y3) + c_y
Lmouth_x = σ(t_x4) + c_x    Lmouth_y = σ(t_y4) + c_y
Rmouth_x = σ(t_x5) + c_x    Rmouth_y = σ(t_y5) + c_y

wherein Leye_x and Leye_y represent the 2D coordinates of the left eye key point, Reye_x and Reye_y the 2D coordinates of the right eye key point, Nose_x and Nose_y the 2D coordinates of the nose key point, Lmouth_x and Lmouth_y the 2D coordinates of the left mouth corner key point, and Rmouth_x and Rmouth_y the 2D coordinates of the right mouth corner key point.
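The five key points can be decoded in the same way; a self-contained sketch under the same stride assumption, with an assumed ordering of the offset pairs:

```python
import numpy as np

def decode_keypoints(offsets, cx, cy, stride=32.0):
    """offsets: [(tx1, ty1), ..., (tx5, ty5)] for the left eye, right eye, nose,
    left mouth corner and right mouth corner, in that order."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    names = ["left_eye", "right_eye", "nose", "left_mouth", "right_mouth"]
    return {n: ((sig(tx) + cx) * stride, (sig(ty) + cy) * stride)
            for n, (tx, ty) in zip(names, offsets)}
```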
S103: calculate the pose angles of the human face according to the position information of the plurality of face key points.
After the 2D coordinates of the plurality of face key points are obtained, three pose angles of the face in a world coordinate system are estimated from them: the pitch angle (pitch), the yaw angle (yaw) and the roll angle (roll). For example, the pose angles of the face are calculated with the EPnP (Efficient Perspective-n-Point) algorithm; the world coordinate system is shown in FIG. 3.
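For reference, OpenCV ships an EPnP solver that can recover these three angles from the five 2D key points and a 3D face model; a sketch in which the 3D model point values and the camera intrinsics are placeholder assumptions, not values from the disclosure:

```python
import cv2
import numpy as np

# Generic 3D reference points (left eye, right eye, nose, left/right mouth
# corners) in model coordinates -- illustrative values only.
MODEL_3D = np.array([[-30.0,  35.0, -30.0], [ 30.0,  35.0, -30.0],
                     [  0.0,   0.0,   0.0], [-25.0, -35.0, -30.0],
                     [ 25.0, -35.0, -30.0]], dtype=np.float64)

def estimate_pose(keypoints_2d, fx, fy, u0, v0):
    """Return (pitch, yaw, roll) in degrees from five 2D key points (5, 2)."""
    K = np.array([[fx, 0, u0], [0, fy, v0], [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_3D, np.asarray(keypoints_2d, np.float64),
                                  K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2])))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return pitch, yaw, roll
```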
For example, the pose angles of the face in the world coordinate system are calculated from the 2D coordinates of the 5 face key points obtained in S102. In the following, coordinates in the world coordinate system and the camera coordinate system are denoted with superscripts w and c, respectively. The coordinates of the 5 3D reference points of the face in the world coordinate system are p_i^w, i = 1, ..., 5, and the corresponding 2D key points in the camera coordinate system are p_i^c, i = 1, ..., 5. The coordinates of the 4 control points in the world coordinate system are c_j^w, j = 1, ..., 4, and the coordinates of the 4 virtual control points in the camera coordinate system are c_j^c, j = 1, ..., 4. All of these are non-homogeneous coordinates.
Wherein, the process of estimating the attitude angle comprises:
(1) Select the centroid of the 5 3D reference points in a standard 3D face model, together with the unit vectors along the directions of its three principal axes, as the four control points c_j^w in the world coordinate system (FIG. 3 is a schematic diagram of the world coordinate system). The centroid is taken as the first control point, and the calculation formula is:

c_1^w = (1/n) · Σ_{i=1..n} p_i^w

where n denotes the number of reference points. The four control points in the world coordinate system are thus obtained.
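A sketch of this control point selection, taking the centroid plus unit vectors along the principal axes obtained from an SVD of the centered reference points; whether the axes should additionally be scaled, as some EPnP implementations do, is left open here:

```python
import numpy as np

def select_control_points(ref_pts_w: np.ndarray) -> np.ndarray:
    """ref_pts_w: (n, 3) reference points in world coordinates.
    Returns (4, 3): the centroid, then centroid + unit principal axes."""
    c1 = ref_pts_w.mean(axis=0)            # first control point: the centroid
    centered = ref_pts_w - c1
    # Rows of vt are the unit principal-axis directions of the point cloud.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return np.vstack([c1, c1 + vt[:3]])
```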
(2) Express the coordinates of the 5 3D reference points in the world coordinate system as a weighted sum of the control points. Because this representation is invariant under Euclidean transformation, the same weighting parameters also apply in the camera coordinate system to the 2D key points calculated in step S102:

p_i^w = Σ_{j=1..4} α_ij · c_j^w,   p_i^c = Σ_{j=1..4} α_ij · c_j^c,   Σ_{j=1..4} α_ij = 1

where α_ij are the weighting parameters of the four control points.

(3) The camera coordinates and the control point coordinates in the world coordinate system can therefore be linked through the camera intrinsic parameters and the weighting parameters, and two linear equations are obtained for each 3D reference point:

Σ_{j=1..4} (α_ij · f_u · x_j^c + α_ij · (u_c − u_i) · z_j^c) = 0
Σ_{j=1..4} (α_ij · f_v · y_j^c + α_ij · (v_c − v_i) · z_j^c) = 0
wherein the virtual control point coordinates are c_j^c = [x_j^c, y_j^c, z_j^c]^T, j = 1, ..., 4; [u_i, v_i, 1]^T is the projection of the i-th point on the pixel plane; and (f_u, f_v) and (u_c, v_c) are the focal lengths and the optical center of the camera, obtained by camera calibration. Concatenating the equations of the 5 face key points and collecting the corresponding coefficients into a matrix M yields the equation Mx = 0, where

x = [(c_1^c)^T, (c_2^c)^T, (c_3^c)^T, (c_4^c)^T]^T

The virtual control point coordinates of the control points in the camera coordinate system can be obtained by solving this equation.
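In the simplest case, x can be taken as the right singular vector of M associated with its smallest singular value; a sketch (the full EPnP algorithm also handles null spaces of dimension greater than one and fixes the overall scale from the inter-control-point distances, which is omitted here):

```python
import numpy as np

def solve_control_points(M: np.ndarray) -> np.ndarray:
    """Solve Mx = 0 for the stacked camera-frame control points (up to scale)."""
    _, _, vt = np.linalg.svd(M)
    x = vt[-1]                   # kernel vector for the smallest singular value
    ctrl_c = x.reshape(4, 3)     # four virtual control points in the camera frame
    if ctrl_c[:, 2].mean() < 0:  # keep the points in front of the camera
        ctrl_c = -ctrl_c
    return ctrl_c
```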
(4) In the above calculation steps, the control points computed from the reference points do not change, so the virtual control points computed from the reference points are taken as the initial control points, while the virtual control point coordinates computed from the 2D coordinates change constantly; the pose of the human face can therefore be obtained by computing only the rotation and translation of the virtual control points relative to the initial virtual control points. The three-dimensional-to-two-dimensional PnP problem is thus converted into the classical three-dimensional-to-three-dimensional rigid motion problem, which is solved with the ICP algorithm as follows:
the calculation center:
Figure BDA0003502340610000056
removing the center:
Figure BDA0003502340610000057
the objective function to be minimized is:
Figure BDA0003502340610000058
defining a matrix H:
Figure BDA0003502340610000059
SVD decomposition of H can obtain: h ═ U ∑ VTU and V unitary matrices, Σ is a semi-positive diagonal matrix.
And (3) combining the coordinate of the removal center to mathematically deform the target function and then substituting H into the target function to obtain: r ═ VUT,
Figure BDA00035023406100000510
Wherein, the value of N is 5,
Figure BDA00035023406100000511
respectively the center coordinates of the 2D keypoint and the 3D reference point coordinates,
Figure BDA00035023406100000512
and respectively removing the coordinates of the central point for the 2D key point and the coordinates of the central point for the 3D key point, wherein R and t are rotation and translation matrixes in the motion process of the human face.
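A sketch of this SVD-based closed-form alignment, with the usual determinant check added so that a reflection is not returned as R:

```python
import numpy as np

def rigid_transform(pts_w: np.ndarray, pts_c: np.ndarray):
    """Find R, t minimizing sum ||p_c - (R p_w + t)||^2 for (N, 3) point sets."""
    cw, cc = pts_w.mean(axis=0), pts_c.mean(axis=0)  # compute the centers
    qw, qc = pts_w - cw, pts_c - cc                  # remove the centers
    H = qw.T @ qc                                    # H = sum over i of q_w q_c^T
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                                   # R = V U^T
    if np.linalg.det(R) < 0:                         # avoid returning a reflection
        Vt[-1] *= -1.0
        R = Vt.T @ U.T
    t = cc - R @ cw
    return R, t
```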
According to the embodiment of the application, the position information of the face prediction frame and the position information of the face key points are output with a pre-trained network model, and the pose angles of the face are then calculated from the position information of the face key points. Face detection and face key point detection are both completed by the yolov3 network; the network model has a simple structure and can be deployed on embedded devices to realize high-precision, high-real-time face pose estimation.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 4, a schematic structural diagram of a face pose estimation apparatus provided in an exemplary embodiment of the present application is shown, hereinafter referred to as the apparatus 4. The apparatus 4 may be implemented as all or part of a computer device in software, hardware or a combination of both. The apparatus 4 comprises: a normalization unit 401, a detection unit 402 and a calculation unit 403.
The normalization unit is used for normalizing the input picture to obtain a normalized image; wherein the input picture comprises a human face;
the detection unit is used for detecting the normalized image by utilizing a pre-trained network model to obtain the position information of a face prediction frame and the position information of a plurality of face key points;
and the calculation unit is used for calculating the attitude angle of the human face according to the position information of the plurality of human face key points.
In one or more possible embodiments, the plurality of face key points are: a left eye key point, a right eye key point, a nose key point, a left mouth corner key point and a right mouth corner key point; the parameter values output by the network model comprise: a left eye key point offset (t_x1, t_y1), a right eye key point offset (t_x2, t_y2), a nose key point offset (t_x3, t_y3), a left mouth corner key point offset (t_x4, t_y4) and a right mouth corner key point offset (t_x5, t_y5).

The 2D coordinates of each key point are calculated according to the following formulas:

Leye_x = σ(t_x1) + c_x    Leye_y = σ(t_y1) + c_y
Reye_x = σ(t_x2) + c_x    Reye_y = σ(t_y2) + c_y
Nose_x = σ(t_x3) + c_x    Nose_y = σ(t_y3) + c_y
Lmouth_x = σ(t_x4) + c_x    Lmouth_y = σ(t_y4) + c_y
Rmouth_x = σ(t_x5) + c_x    Rmouth_y = σ(t_y5) + c_y

wherein Leye_x and Leye_y represent the 2D coordinates of the left eye key point, Reye_x and Reye_y the 2D coordinates of the right eye key point, Nose_x and Nose_y the 2D coordinates of the nose key point, Lmouth_x and Lmouth_y the 2D coordinates of the left mouth corner key point, and Rmouth_x and Rmouth_y the 2D coordinates of the right mouth corner key point; c_x and c_y are the 2D coordinates of the upper left corner of the feature grid cell in which the center point of the face prediction frame is located; and σ(·) is a sigmoid function.
In one or more possible embodiments, the apparatus further comprises:
a training unit, configured to cluster the data set with a clustering algorithm to obtain anchor frames of multiple scales;
determine a face detection loss function and a face key point detection loss function;
and train on the training set with the anchor frames of multiple scales, the face detection loss function and the face key point detection loss function to obtain the network model.
In one or more possible embodiments, the face key point loss function is:

loss(x) = w · ln(1 + |x| / a),   if |x| < w
loss(x) = |x| − C,               otherwise

wherein w, a and C are constants.
In one or more possible embodiments, the normalization process includes:
the value of each pixel included in the original image is divided by 255.
It should be noted that when the apparatus 4 provided in the foregoing embodiment executes the face pose estimation method, the division into the above functional modules is only an example; in practical applications, the above functions may be distributed among different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the face pose estimation apparatus and the face pose estimation method provided by the above embodiments belong to the same concept; for the detailed implementation process, refer to the method embodiments, which are not repeated here.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
An embodiment of the present application further provides a computer storage medium, where multiple instructions may be stored in the computer storage medium, where the instructions are suitable for being loaded by a processor and for executing the method steps in the embodiment shown in fig. 1, and a specific execution process may refer to a specific description of the embodiment shown in fig. 1, which is not described herein again.
The present application further provides a computer program product storing at least one instruction, which is loaded and executed by the processor to implement the method for estimating a face pose as described in the above embodiments.
Referring to fig. 5, a schematic structural diagram of a computer device is provided in an embodiment of the present application. As shown in fig. 5, the computer device 500 may include: at least one processor 501, at least one network interface 504, a user interface 503, memory 505, at least one communication bus 502.
Wherein a communication bus 502 is used to enable connective communication between these components.
The user interface 503 may include a display screen (Display) and a camera (Camera); optionally, the user interface 503 may also include a standard wired interface and a wireless interface.
The network interface 504 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
Processor 501 may include one or more processing cores. The processor 501 connects various components throughout the computer device 500 using various interfaces and lines, and performs the various functions of the computer device 500 and processes data by running or executing the instructions, programs, code sets or instruction sets stored in the memory 505 and by invoking the data stored in the memory 505. Optionally, the processor 501 may be implemented in at least one hardware form among digital signal processing (DSP), field-programmable gate array (FPGA) and programmable logic array (PLA). The processor 501 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem and the like. The CPU mainly handles the operating system, the user interface, application programs and so on; the GPU is responsible for rendering and drawing the content to be displayed on the display screen; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 501 and instead be implemented by a single chip.
The memory 505 may include a random access memory (RAM) or a read-only memory (ROM). Optionally, the memory 505 includes a non-transitory computer-readable medium. The memory 505 may be used to store instructions, programs, code sets or instruction sets. The memory 505 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function or an image playing function), instructions for implementing the various method embodiments described above, and the like; the data storage area may store the data referred to in the above method embodiments. The memory 505 may optionally also be at least one storage device located remotely from the processor 501. As shown in fig. 5, the memory 505, as a computer storage medium, may include an operating system, a network communication module, a user interface module and an application program.
In the computer device 500 shown in fig. 5, the user interface 503 is mainly used as an interface for providing input for a user, and acquiring data input by the user; the processor 501 may be configured to call the application program stored in the memory 505 and specifically execute the method shown in fig. 1, and the specific process may refer to fig. 1 and is not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above disclosure only describes preferred embodiments of the present application and is not intended to limit the scope of the present application; the present application is therefore not limited thereto, and equivalent variations and modifications remain within the scope of the present application.

Claims (10)

1. A face pose estimation method is characterized by comprising the following steps:
carrying out normalization processing on an input picture to obtain a normalized image; wherein the input picture comprises a human face;
detecting the normalized image by using a pre-trained network model to obtain position information of a face prediction frame and position information of a plurality of face key points;
and calculating the attitude angle of the face according to the position information of the plurality of face key points.
2. The method of claim 1, wherein the plurality of face key points are: a left eye key point, a right eye key point, a nose key point, a left mouth corner key point and a right mouth corner key point; the parameter values output by the network model comprise: a left eye key point offset (t_x1, t_y1), a right eye key point offset (t_x2, t_y2), a nose key point offset (t_x3, t_y3), a left mouth corner key point offset (t_x4, t_y4) and a right mouth corner key point offset (t_x5, t_y5);
the 2D coordinates of each key point are calculated according to the following formulas:

Leye_x = σ(t_x1) + c_x    Leye_y = σ(t_y1) + c_y
Reye_x = σ(t_x2) + c_x    Reye_y = σ(t_y2) + c_y
Nose_x = σ(t_x3) + c_x    Nose_y = σ(t_y3) + c_y
Lmouth_x = σ(t_x4) + c_x    Lmouth_y = σ(t_y4) + c_y
Rmouth_x = σ(t_x5) + c_x    Rmouth_y = σ(t_y5) + c_y

wherein Leye_x and Leye_y represent the 2D coordinates of the left eye key point, Reye_x and Reye_y the 2D coordinates of the right eye key point, Nose_x and Nose_y the 2D coordinates of the nose key point, Lmouth_x and Lmouth_y the 2D coordinates of the left mouth corner key point, and Rmouth_x and Rmouth_y the 2D coordinates of the right mouth corner key point; c_x and c_y are the 2D coordinates of the upper left corner of the feature grid cell in which the center point of the face prediction frame is located; and σ(·) is a sigmoid function.
3. The method according to claim 1 or 2, wherein before normalizing the input picture to obtain the normalized image, the method further comprises:
clustering the data set by using a clustering algorithm to obtain anchor frames with various scales;
determining a face detection loss function and a face key point detection loss function;
and training on the training set with the anchor frames of multiple scales, the face detection loss function and the face key point detection loss function to obtain the network model.
4. The method of claim 3, wherein the face key point loss function is:

loss(x) = w · ln(1 + |x| / a),   if |x| < w
loss(x) = |x| − C,               otherwise

wherein w, a and C are constants.
5. The method according to claim 1, 2 or 4, wherein the normalization process comprises:
the value of each pixel included in the original image is divided by 255.
6. A face pose estimation apparatus, comprising:
the normalization unit is used for normalizing the input picture to obtain a normalized image; wherein the input picture comprises a human face;
the detection unit is used for detecting the normalized image by utilizing a pre-trained network model to obtain the position information of a face prediction frame and the position information of a plurality of face key points;
and the calculation unit is used for calculating the attitude angle of the human face according to the position information of the plurality of human face key points.
7. The apparatus of claim 6, wherein the plurality of face key points are: a left eye key point, a right eye key point, a nose key point, a left mouth corner key point and a right mouth corner key point; the parameter values output by the network model comprise: a left eye key point offset (t_x1, t_y1), a right eye key point offset (t_x2, t_y2), a nose key point offset (t_x3, t_y3), a left mouth corner key point offset (t_x4, t_y4) and a right mouth corner key point offset (t_x5, t_y5);
the 2D coordinates of each key point are calculated according to the following formulas:

Leye_x = σ(t_x1) + c_x    Leye_y = σ(t_y1) + c_y
Reye_x = σ(t_x2) + c_x    Reye_y = σ(t_y2) + c_y
Nose_x = σ(t_x3) + c_x    Nose_y = σ(t_y3) + c_y
Lmouth_x = σ(t_x4) + c_x    Lmouth_y = σ(t_y4) + c_y
Rmouth_x = σ(t_x5) + c_x    Rmouth_y = σ(t_y5) + c_y

wherein Leye_x and Leye_y represent the 2D coordinates of the left eye key point, Reye_x and Reye_y the 2D coordinates of the right eye key point, Nose_x and Nose_y the 2D coordinates of the nose key point, Lmouth_x and Lmouth_y the 2D coordinates of the left mouth corner key point, and Rmouth_x and Rmouth_y the 2D coordinates of the right mouth corner key point; c_x and c_y are the 2D coordinates of the upper left corner of the feature grid cell in which the center point of the face prediction frame is located; and σ(·) is a sigmoid function.
8. The apparatus of claim 6 or 7, further comprising:
the training unit is used for clustering the data set by utilizing a clustering algorithm to obtain anchor frames with various scales;
determining a face detection loss function and a face key point detection loss function;
and training on the training set with the anchor frames of multiple scales, the face detection loss function and the face key point detection loss function to obtain the network model.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to perform the method steps according to any one of claims 1 to 5.
10. A computer device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 5.
CN202210130312.9A 2022-02-11 2022-02-11 Face pose estimation method and device, storage medium and computer equipment Pending CN114581973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210130312.9A CN114581973A (en) 2022-02-11 2022-02-11 Face pose estimation method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210130312.9A CN114581973A (en) 2022-02-11 2022-02-11 Face pose estimation method and device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN114581973A 2022-06-03

Family

ID=81770208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210130312.9A Pending CN114581973A (en) 2022-02-11 2022-02-11 Face pose estimation method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN114581973A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination