CN115984461A - Face three-dimensional key point detection method based on RGBD camera


Info

Publication number
CN115984461A
CN115984461A
Authority
CN
China
Prior art keywords
face
key points
coordinate system
camera
model
Prior art date
Legal status
Granted
Application number
CN202211609939.9A
Other languages
Chinese (zh)
Other versions
CN115984461B (en)
Inventor
李观喜
张威
覃镇波
Current Assignee
Guangzhou Ziweiyun Technology Co ltd
Original Assignee
Guangzhou Ziweiyun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Ziweiyun Technology Co ltd
Priority to CN202211609939.9A
Priority claimed from CN202211609939.9A
Publication of CN115984461A
Application granted
Publication of CN115984461B
Status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a face three-dimensional key point detection method based on an RGBD camera, aiming to obtain high-precision and complete face three-dimensional key points. RGB images and depth images are acquired from multiple viewing angles by moving the RGBD camera around the face. Each frame of the depth image is converted into a point cloud according to the depth camera parameters; the point clouds are fused into a world coordinate system to obtain a high-precision dense point cloud, which is then converted into a mesh. Two-dimensional face key points are detected in each frame of the RGB image, and three-dimensional face key points in the camera coordinate system are obtained from the correspondence between the RGB image and the depth image. Using the per-frame transformation matrix between the camera coordinate system and the world coordinate system, the face key points of all frames are expressed in the world coordinate system, and the final three-dimensional face key points are selected from the vertices of the mesh according to these key points.

Description

Face three-dimensional key point detection method based on RGBD camera
Technical Field
The invention relates to the technical field of face three-dimensional key point detection, in particular to a face three-dimensional key point detection method based on an RGBD camera.
Background
Existing face three-dimensional key point detection algorithms fall into three main categories. The first detects three-dimensional key points directly from a single frame of RGB image; however, a monocular camera provides no depth information, so the estimated z coordinate of the three-dimensional key points has poor precision and cannot meet high-precision requirements. The second uses a binocular camera to obtain a pair of images, detects two-dimensional key points in the left and right images separately, and then obtains accurate three-dimensional key points by triangulation. Some key points, however, cannot be observed by the left and right cameras simultaneously, so this method detects three-dimensional key points with high precision but incompletely. The third uses an RGBD camera to acquire a pair of RGB and depth images, detects two-dimensional key points in the RGB image, and obtains three-dimensional key points from the correspondence between the RGB image and the depth image. This approach shares the drawback of the binocular method: some key points cannot be observed by the RGB camera and the depth camera simultaneously. In summary, to detect face three-dimensional key points with both high precision and completeness, multiple frames of face images must be acquired from different viewing angles.
To obtain high-precision and complete face three-dimensional key points, the invention acquires RGB images and depth images from multiple viewing angles by moving the RGBD camera around the face. Each frame of the depth image is converted into a point cloud according to the depth camera parameters; the point clouds are fused into a world coordinate system to obtain a high-precision dense point cloud, which is then converted into a mesh. Two-dimensional face key points are detected in each frame of the RGB image, and three-dimensional face key points in the camera coordinate system are obtained from the correspondence between the RGB image and the depth image. Using the per-frame transformation matrix between the camera coordinate system and the world coordinate system, the face key points of all frames are expressed in the world coordinate system, and the final three-dimensional face key points are selected from the mesh vertices according to these key points.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention discloses a face three-dimensional key point detection method based on an RGBD camera, which comprises the following steps:
step 1, acquiring an RGB image and a corresponding depth image based on the RGBD camera, performing face detection and two-dimensional face key point detection on the RGB image, and converting the depth image into point cloud data;
step 2, detecting and judging the acquired image, determining a first frame in the RGB image and the corresponding depth image, and taking a first frame depth camera coordinate system as a world coordinate system;
step 3, starting from the second frame, calculating the transformation matrix between the camera coordinate system of the current frame and that of the previous frame by using the ICP (Iterative Closest Point) algorithm, obtaining the transformation matrix from the current camera coordinate system to the world coordinate system, and updating a TSDF (Truncated Signed Distance Function) model according to the transformation matrix; judging whether a sufficient amount of data has been acquired, and if not, returning to step 1 to acquire an RGB image and the corresponding depth image again; if so, obtaining face point cloud data from the TSDF model and performing surface reconstruction on it to obtain face mesh data;
step 4, carrying out face detection on each frame of RGB image through a lightweight face recognition model, and extracting face key points through a PFLD model;
and step 5, calculating the centers of the multiple groups of three-dimensional key points, and searching the mesh vertices with a kd-tree algorithm to find the vertex closest to each center point.
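To make steps 1-3 concrete, the following Python sketch shows one possible implementation of the frame-to-frame ICP tracking and TSDF fusion loop using the Open3D library. The camera intrinsics, voxel size, truncation distance, and the `frames` iterable are illustrative assumptions, not values prescribed by the invention.

```python
# Hedged sketch of steps 1-3 with Open3D; all parameter values are assumptions.
import numpy as np
import open3d as o3d

intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 600.0, 600.0, 320.0, 240.0)
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.002,   # ~2 mm voxels for a face-sized working volume
    sdf_trunc=0.01,       # truncation distance t
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

T_world_to_cam = np.eye(4)  # step 2: first-frame depth camera frame = world frame
frames = []                 # fill with (color, depth) Open3D Image pairs from the RGBD camera
prev_pcd = None
for color, depth in frames:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, convert_rgb_to_intensity=False)
    pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)
    pcd.estimate_normals()
    if prev_pcd is not None:
        # step 3: ICP between the current and previous frame point clouds
        reg = o3d.pipelines.registration.registration_icp(
            pcd, prev_pcd, 0.01, np.eye(4),
            o3d.pipelines.registration.TransformationEstimationPointToPlane())
        # chain current->previous with previous->world, then invert to world->current
        T_world_to_cam = np.linalg.inv(reg.transformation) @ T_world_to_cam
    volume.integrate(rgbd, intrinsic, T_world_to_cam)  # update the TSDF model
    prev_pcd = pcd

mesh = volume.extract_triangle_mesh()  # face mesh used later in step 5
```

Open3D's `ScalableTSDFVolume` internally performs the truncation and weighted-average update described in the claims below, so it stands in here for the hand-rolled TSDF model.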
Furthermore, the TSDF model is a three-dimensional space representation, wherein the TSDF model presets a cuboid bounding box and divides each dimension of the cuboid into n equal parts, i.e., the cuboid is divided into n³ small cubes, each defined as a voxel; the voxels are then transformed from the world coordinate system to the camera coordinate system:
$$v_j^{c_i} = T_w^{c_i}\, v_j^{w}$$

where $v_j^{c_i}$ represents the coordinates of the j-th voxel in the i-th camera coordinate system, $v_j^{w}$ its coordinates in the world coordinate system, and $T_w^{c_i}$ the transformation matrix from the world coordinate system to the i-th camera coordinate system.
Further, the calculation formula of the TSDF value in the TSDF model is as follows:
$$\mathrm{tsdf}_j' = \max\!\left(-1,\ \min\!\left(\mathrm{sdf}_j / t,\ 1\right)\right)$$

where $\mathrm{tsdf}_j'$ denotes the TSDF value of the j-th voxel before the update, the max and min functions clamp $\mathrm{sdf}_j / t$ to the interval [-1, 1], and t denotes the truncation distance; the higher the precision of the depth camera, the smaller the truncation distance.
Further, said updating the TSDF model according to said transformation matrix further comprises updating the TSDF value and its weight, and the update formulas are as follows:
$$\mathrm{tsdf}_j = \frac{w_j'\,\mathrm{tsdf}_j' + w_{j-1}\,\mathrm{tsdf}_{j-1}}{w_j' + w_{j-1}}$$

$$w_j = w_j' + w_{j-1}$$

where $w_j'$ denotes the weight of the j-th voxel's current observation before the update (1 by default) and $w_j$ the updated weight. When enough images have been acquired, the 0-isosurface of the TSDF model is extracted with the Marching Cubes algorithm as the reconstructed point cloud; the normal vector of each point of the point cloud is calculated, and a mesh is obtained from the point cloud and its normal vectors by Poisson reconstruction.
Still further, performing face detection through the lightweight face recognition model further comprises: the lightweight face recognition model replaces the backbone of the YOLOv5-s model with ShuffleNetV2, and the input resolution of the model is set to 320×320.
Still further, extracting the face key points through the PFLD model further comprises: the PFLD model uses MobileNetV2 as its backbone, and the coordinates of the key points are obtained by regression, wherein the loss function adopted by the PFLD model is as follows:
$$\mathcal{L} = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} \gamma_n \left\| d_n^m \right\|^2$$

where M represents the number of samples, N the number of key points, $\gamma_n$ the weight of the n-th keypoint, and $d_n^m$ the distance between the regressed value and the ground truth of the n-th keypoint in the m-th sample.
Further, the weight $\gamma_n$ is calculated as follows:
$$\gamma_n = \sum_{c=1}^{C} \omega_n^c \sum_{k=1}^{K} \left(1 - \cos \theta_n^k\right)$$

where C represents the number of face categories, $\omega_n^c$ the weight corresponding to the c-th face category, K equals 3, and $\theta_n^1, \theta_n^2, \theta_n^3$ denote the yaw, pitch and roll angles of the n-th keypoint, respectively.
Furthermore, multiple frames of RGB images and depth images are required to reconstruct the face, so there are multiple groups of three-dimensional key points; the center of each key point over the groups is calculated with the following formula:
$$p_i = \frac{1}{a} \sum_{j=1}^{a} p_i^j$$

where $p_i$ represents the center of the i-th keypoint, a the number of face frames, and $p_i^j$ the i-th keypoint in the j-th frame image.
Still further, step 5 further comprises: the reconstructed face mesh contains a large number of vertices, and a kd-tree is adopted to accelerate finding the vertex closest to each center point, which would otherwise require traversing all vertices.
Furthermore, the kd-tree is a binary tree that stores the vertices of the face mesh in a tree structure to reduce the number of searches, where N is the number of face key points and M is the number of face mesh vertices: the traversal search method requires N·M comparisons, while the kd-tree search method requires about N·log M.
Compared with the prior art, the invention has the following beneficial effects. Unlike single-view RGBD face three-dimensional key point detection schemes, the invention acquires RGBD images from different viewing angles: the RGBD camera moves slowly around the face, and a large amount of data from different viewing angles is acquired and fused into the TSDF model. The TSDF model improves the precision of the three-dimensional face reconstruction, thereby yielding more precise face three-dimensional key points. Because the data from multiple viewing angles covers the complete face, the problem that some key points cannot be observed by the RGB camera and the depth camera simultaneously, leaving the reconstruction incomplete, is solved.
Drawings
The invention will be further understood from the following description in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. In the drawings, like reference numerals designate corresponding parts throughout the different views.
Fig. 1 is a general flow chart of face surface reconstruction in the RGBD camera-based face three-dimensional keypoint detection method of the present invention.
Fig. 2 is a flowchart of selecting a point closest to the center from the face mesh as a final three-dimensional key point according to the embodiment of the present invention.
FIG. 3 is a schematic diagram of the face mesh and some key points according to the present invention.
Detailed Description
The technical solution of the present invention will be described in more detail with reference to the accompanying drawings and examples.
A mobile terminal implementing various embodiments of the present invention will now be described with reference to the accompanying drawings. In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only to facilitate the explanation of the present invention and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
The mobile terminal may be implemented in various forms. For example, the terminal described in the present invention may include a mobile terminal such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a navigation device, and the like, as well as a stationary terminal such as a digital TV, a desktop computer, and the like. In the following, it is assumed that the terminal is a mobile terminal. However, it will be understood by those skilled in the art that the configuration according to the embodiments of the present invention can also be applied to a fixed-type terminal, except for any elements specifically intended for mobile use.
The invention obtains multiple frames of consecutive RGB and depth images, solves for the transformation matrix between the depth camera coordinate system of each frame and the world coordinate system, and fuses the three-dimensional information of each frame to obtain a complete, high-precision dense face point cloud. Surface reconstruction is then performed with the Poisson reconstruction algorithm to obtain a face mesh; the general flow of face surface reconstruction is shown in fig. 1. Two-dimensional face key points are detected in each frame of the RGB image, three-dimensional face key points are obtained in the world coordinate system across frames, abnormal key points are removed, the center of each key point is calculated, and the point closest to the center is selected from the face mesh as the final three-dimensional key point. The flow for obtaining the final three-dimensional key points is shown in fig. 2. The face mesh and some of the key points are shown in fig. 3.
1. Point cloud fusion and TSDF model
To improve efficiency and precision, a TSDF (Truncated Signed Distance Function) model is adopted to represent the three-dimensional space during fusion. The depth camera coordinate system of the first frame is taken as the world coordinate system, and a point cloud is obtained from the depth camera parameters and the depth map. Starting from the second frame, the transformation matrix between the camera coordinate system of the current frame and that of the previous frame is calculated with the ICP (Iterative Closest Point) algorithm, and the transformation matrix from the current camera coordinate system to the world coordinate system is obtained. The TSDF model is updated according to this transformation matrix. The TSDF model is a three-dimensional space representation: a cuboid bounding box is preset, and each dimension of the cuboid is divided into n equal parts, i.e., the cuboid is divided into n³ small cubes, called voxels. Voxels are transformed from the world coordinate system to the camera coordinate system:
$$v_j^{c_i} = T_w^{c_i}\, v_j^{w}$$

where $v_j^{c_i}$ represents the coordinates of the j-th voxel in the i-th camera coordinate system, $v_j^{w}$ its coordinates in the world coordinate system, and $T_w^{c_i}$ the transformation matrix from the world coordinate system to the i-th camera coordinate system.
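For illustration, a numpy sketch of this voxel grid construction and world-to-camera transform follows; the bounding-box extents, the resolution n, and the matrix `T_w2c` are assumed values, not parameters specified by the invention.

```python
# Voxel grid and world->camera transform (numpy sketch; bounds, n and T_w2c
# are illustrative assumptions).
import numpy as np

n = 128
lo = np.array([-0.15, -0.20, -0.15])   # cuboid bounding box minimum, in meters
hi = np.array([0.15, 0.20, 0.15])      # cuboid bounding box maximum, in meters
axes = [np.linspace(lo[d], hi[d], n) for d in range(3)]
gx, gy, gz = np.meshgrid(*axes, indexing="ij")
ones = np.ones_like(gx)
voxels_w = np.stack([gx, gy, gz, ones], axis=-1).reshape(-1, 4)  # n^3 x 4, homogeneous

T_w2c = np.eye(4)                         # T_w^{c_i}: world -> i-th camera (placeholder)
voxels_c = (T_w2c @ voxels_w.T).T[:, :3]  # v_j^{c_i} = T_w^{c_i} v_j^w
```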
Calculating the SDF value:
$$\mathrm{sdf}_j = D_i\!\left(\pi\!\left(v_j^{c_i}\right)\right) - z\!\left(v_j^{c_i}\right)$$

where $\mathrm{sdf}_j$ represents the SDF value of the j-th voxel, $z\!\left(v_j^{c_i}\right)$ the z coordinate of $v_j^{c_i}$, $\pi(\cdot)$ the projection of a voxel onto the image coordinate system, and $D_i(\cdot)$ the depth of the image at the projected coordinates.
TSDF value calculation:
$$\mathrm{tsdf}_j' = \max\!\left(-1,\ \min\!\left(\mathrm{sdf}_j / t,\ 1\right)\right)$$

where $\mathrm{tsdf}_j'$ denotes the TSDF value of the j-th voxel before the update, the max and min functions clamp $\mathrm{sdf}_j / t$ to the interval [-1, 1], and t denotes the truncation distance. The higher the precision of the depth camera, the smaller the truncation distance.
Updating the TSDF value and its weight:
$$\mathrm{tsdf}_j = \frac{w_j'\,\mathrm{tsdf}_j' + w_{j-1}\,\mathrm{tsdf}_{j-1}}{w_j' + w_{j-1}}$$

$$w_j = w_j' + w_{j-1}$$

where $w_j'$ denotes the weight of the j-th voxel's current observation before the update (1 by default) and $w_j$ the updated weight.
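The three formulas above combine into a single per-frame integration routine. The sketch below is a plain-numpy rendering under an assumed pinhole camera model with intrinsics fx, fy, cx, cy; all variable names are illustrative.

```python
# Per-frame TSDF integration implementing the SDF, truncation and weighted
# update formulas above (sketch; intrinsics and array names are assumptions).
import numpy as np

def integrate_frame(voxels_c, depth, fx, fy, cx, cy, t, tsdf, weight):
    """voxels_c: n^3 x 3 voxel coordinates in the current camera frame;
    depth: HxW depth map in meters; tsdf, weight: flat n^3 arrays, updated in place."""
    z = voxels_c[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)                     # avoid division by zero
    u = np.round(voxels_c[:, 0] * fx / z_safe + cx).astype(int)  # pi(v): pixel column
    v = np.round(voxels_c[:, 1] * fy / z_safe + cy).astype(int)  # pi(v): pixel row
    ok = (z > 1e-6) & (u >= 0) & (u < depth.shape[1]) & (v >= 0) & (v < depth.shape[0])
    d_meas = np.zeros_like(z)
    d_meas[ok] = depth[v[ok], u[ok]]                        # D_i(pi(v))
    sdf = d_meas - z                                        # sdf_j = D_i(pi(v)) - z(v)
    valid = ok & (d_meas > 0) & (sdf > -t)                  # skip holes and occluded voxels
    tsdf_cur = np.clip(sdf / t, -1.0, 1.0)                  # tsdf'_j = max(-1, min(sdf_j/t, 1))
    w_cur = 1.0                                             # default observation weight w'_j
    tsdf[valid] = (w_cur * tsdf_cur[valid] + weight[valid] * tsdf[valid]) \
                  / (w_cur + weight[valid])                 # weighted running average
    weight[valid] += w_cur                                  # w_j = w'_j + w_{j-1}
    return tsdf, weight
```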
When enough images have been acquired, the 0-isosurface of the TSDF model is extracted using the Marching Cubes algorithm as the reconstructed point cloud. The normal vector of each point of the point cloud is calculated, and a mesh is obtained from the point cloud and its normal vectors using Poisson reconstruction.
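One way to sketch this extraction step pairs scikit-image's Marching Cubes with Open3D's Poisson reconstruction; the toy sphere TSDF, grid resolution, and Poisson octree depth below are assumptions standing in for the fused face data.

```python
# Marching Cubes on a TSDF grid, then Poisson surface reconstruction (sketch;
# the sphere field is toy data standing in for the fused face TSDF).
import numpy as np
import open3d as o3d
from skimage import measure

n = 128
grid = np.linspace(-1.0, 1.0, n)
gx, gy, gz = np.meshgrid(grid, grid, grid, indexing="ij")
tsdf = np.clip((np.sqrt(gx**2 + gy**2 + gz**2) - 0.5) / 0.05, -1.0, 1.0)  # toy TSDF

verts, faces, normals, _ = measure.marching_cubes(tsdf, level=0.0)  # 0-isosurface
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(verts)     # reconstructed point cloud
pcd.normals = o3d.utility.Vector3dVector(normals)  # per-point normal vectors
mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
```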
2. Face detection
Because the only target to be detected is the face, a lightweight model meets the requirements. The backbone of the YOLOv5-s model is replaced with ShuffleNetV2, and the input resolution of the model is set to 320×320. The modified YOLOv5-s detects the face region accurately while consuming very little time.
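For reference, the detection interface at this resolution looks as follows via torch.hub. Note that the patent's ShuffleNetV2-backbone, face-trained variant is not publicly available, so the stock YOLOv5-s (trained on COCO, not on faces) only stands in for the call pattern.

```python
# Detection-interface sketch: stock YOLOv5-s from torch.hub stands in for the
# patent's custom face-trained ShuffleNetV2-backbone variant (not public).
import numpy as np
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # placeholder COCO weights
rgb_image = np.zeros((480, 640, 3), dtype=np.uint8)      # stand-in RGB frame
results = model(rgb_image, size=320)                     # 320x320 input, as in the text
boxes = results.xyxy[0]                                  # [x1, y1, x2, y2, conf, cls] rows
```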
3. Face keypoint detection
The PFLD model detects faces at different angles and with different expressions very well. MobileNetV2 is used as the backbone of the PFLD, and the coordinates of the key points are obtained by regression. The loss function adopted by the PFLD model is as follows:
$$\mathcal{L} = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} \gamma_n \left\| d_n^m \right\|^2$$

where M represents the number of samples, N the number of key points, $\gamma_n$ the weight of the n-th keypoint, and $d_n^m$ the distance between the regressed value and the ground truth of the n-th keypoint in the m-th sample. The weight $\gamma_n$ is calculated as follows:
$$\gamma_n = \sum_{c=1}^{C} \omega_n^c \sum_{k=1}^{K} \left(1 - \cos \theta_n^k\right)$$

where C represents the number of face categories (a face may be assigned to one of several categories, for example frontal, profile, head-down, etc.), $\omega_n^c$ the weight corresponding to the c-th face category, K equals 3, and $\theta_n^1, \theta_n^2, \theta_n^3$ denote the yaw, pitch and roll angles of the n-th keypoint, respectively.
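A numpy sketch of this weighted loss follows. The array shapes and the use of the squared Euclidean distance for the term ‖d‖ follow the PFLD formulation and are assumptions where the text above is silent.

```python
# PFLD-style weighted landmark loss from the two formulas above (numpy sketch;
# shapes and the squared-distance choice are assumptions).
import numpy as np

def pfld_loss(pred, gt, omega, theta):
    """pred, gt: M x N x 2 predicted / ground-truth landmarks;
    omega: N x C per-category weights; theta: N x 3 (yaw, pitch, roll), radians."""
    gamma = omega.sum(axis=1) * (1.0 - np.cos(theta)).sum(axis=1)  # gamma_n, shape N
    d2 = np.sum((pred - gt) ** 2, axis=-1)                         # ||d_n^m||^2, M x N
    return float(np.mean(np.sum(gamma * d2, axis=1)))              # average over M samples
```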
4. Key point center and kd-Tree
Face reconstruction requires multiple frames of RGB images and depth images, so there are multiple groups of three-dimensional key points. The center of each key point across the groups is calculated as follows:
$$p_i = \frac{1}{a} \sum_{j=1}^{a} p_i^j$$

where $p_i$ represents the center of the i-th keypoint, a the number of face frames, and $p_i^j$ the i-th keypoint in the j-th frame image.
The reconstructed face mesh contains a large number of vertices, and traversing all of them to find the closest point to each center is time-consuming, so a kd-tree is adopted for acceleration. The kd-tree is a binary tree that stores the face mesh vertices in a tree structure, effectively reducing the number of searches. Let the number of face key points be N and the number of face mesh vertices be M: brute-force traversal requires N·M comparisons, while the kd-tree search requires about N·log M.
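Together, the center computation and the kd-tree lookup amount to a few lines with scipy; the shapes below (a frames, N keypoints, M mesh vertices) and the random placeholder data are illustrative assumptions.

```python
# Keypoint centers across frames, then nearest mesh vertex per center via a
# kd-tree (scipy sketch; shapes and random placeholder data are assumptions).
import numpy as np
from scipy.spatial import cKDTree

a, N, M = 20, 68, 100000
keypoints_w = np.random.rand(a, N, 3)   # N keypoints per frame, world coordinates
mesh_vertices = np.random.rand(M, 3)    # vertices of the reconstructed face mesh

centers = keypoints_w.mean(axis=0)      # p_i = (1/a) * sum_j p_i^j, shape N x 3
tree = cKDTree(mesh_vertices)           # store the M mesh vertices in a kd-tree
_, idx = tree.query(centers)            # N lookups, ~O(N log M) vs O(N*M) brute force
final_keypoints = mesh_vertices[idx]    # final three-dimensional key points
```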
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Although the invention has been described above with reference to various embodiments, it should be understood that many changes and modifications can be made without departing from the scope of the invention. It is therefore intended that the foregoing detailed description be regarded as illustrative rather than limiting, and that it be understood that it is the following claims, including all equivalents, that are intended to define the spirit and scope of this invention. The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the present invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (10)

1. A face three-dimensional key point detection method based on an RGBD camera is characterized by comprising the following steps:
step 1, acquiring an RGB image and a corresponding depth image based on the RGBD camera, performing face detection and two-dimensional face key point detection on the RGB image, and converting the depth image into point cloud data;
step 2, detecting and judging the acquired image, determining a first frame in the RGB image and the corresponding depth image, and taking a first frame depth camera coordinate system as a world coordinate system;
step 3, starting from the second frame, calculating the transformation matrix between the camera coordinate system of the current frame and that of the previous frame by using the ICP (Iterative Closest Point) algorithm, obtaining the transformation matrix from the current camera coordinate system to the world coordinate system, and updating a TSDF (Truncated Signed Distance Function) model according to the transformation matrix; judging whether a sufficient amount of data has been acquired, and if not, returning to step 1 to acquire an RGB image and the corresponding depth image again; if so, obtaining face point cloud data from the TSDF model and performing surface reconstruction on it to obtain face mesh data;
step 4, carrying out face detection on each frame of RGB image through a lightweight face recognition model, and extracting face key points through a PFLD model;
and step 5, calculating the centers of the multiple groups of three-dimensional key points, and searching the mesh vertices with a kd-tree algorithm to find the vertex closest to each center point.
2. The method for detecting the three-dimensional key points of the human face based on the RGBD camera as claimed in claim 1, wherein the TSDF model is a three-dimensional space representation, wherein the TSDF model presets a cuboid bounding box and divides each dimension of the cuboid into n equal parts, i.e., divides the cuboid into n³ small cubes, each defined as a voxel, which are then transformed from the world coordinate system to the camera coordinate system:
$$v_j^{c_i} = T_w^{c_i}\, v_j^{w}$$

where $v_j^{c_i}$ represents the coordinates of the j-th voxel in the i-th camera coordinate system, $v_j^{w}$ its coordinates in the world coordinate system, and $T_w^{c_i}$ the transformation matrix from the world coordinate system to the i-th camera coordinate system.
3. The method for detecting the three-dimensional key points of the human face based on the RGBD camera as claimed in claim 2, wherein the TSDF value in the TSDF model is calculated according to the following formula:
$$\mathrm{tsdf}_j' = \max\!\left(-1,\ \min\!\left(\mathrm{sdf}_j / t,\ 1\right)\right)$$

where $\mathrm{tsdf}_j'$ denotes the TSDF value of the j-th voxel before the update, the max and min functions clamp $\mathrm{sdf}_j / t$ to the interval [-1, 1], and t denotes the truncation distance; the higher the precision of the depth camera, the smaller the truncation distance.
4. The method as claimed in claim 3, wherein the updating the TSDF model according to the transformation matrix further includes updating the TSDF value and its weight according to the following formula:
$$\mathrm{tsdf}_j = \frac{w_j'\,\mathrm{tsdf}_j' + w_{j-1}\,\mathrm{tsdf}_{j-1}}{w_j' + w_{j-1}}$$

$$w_j = w_j' + w_{j-1}$$

where $w_j'$ denotes the weight of the j-th voxel's current observation before the update (1 by default) and $w_j$ the updated weight; when enough images have been collected, the 0-isosurface of the TSDF model is extracted using the Marching Cubes algorithm as the reconstructed point cloud, the normal vector of each point of the point cloud is calculated, and a mesh is obtained from the point cloud and its normal vectors using Poisson reconstruction.
5. The method for detecting the three-dimensional key points of the human face based on the RGBD camera as claimed in claim 1, wherein performing the face detection through the lightweight face recognition model further comprises: the lightweight face recognition model replaces the backbone of the YOLOv5-s model with ShuffleNetV2, and the input resolution of the model is set to 320×320.
6. The method of claim 1, wherein extracting the face key points through the PFLD model further comprises: the PFLD model uses MobileNetV2 as its backbone, and the coordinates of the key points are obtained by regression, wherein the loss function adopted by the PFLD model is as follows:
$$\mathcal{L} = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} \gamma_n \left\| d_n^m \right\|^2$$

where M represents the number of samples, N the number of key points, $\gamma_n$ the weight of the n-th keypoint, and $d_n^m$ the distance between the regressed value and the ground truth of the n-th keypoint in the m-th sample.
7. The method for detecting the three-dimensional key points of the human face based on the RGBD camera as claimed in claim 6, wherein the weight $\gamma_n$ is calculated as follows:
$$\gamma_n = \sum_{c=1}^{C} \omega_n^c \sum_{k=1}^{K} \left(1 - \cos \theta_n^k\right)$$

where C represents the number of face categories, $\omega_n^c$ the weight corresponding to the c-th face category, K equals 3, and $\theta_n^1, \theta_n^2, \theta_n^3$ denote the yaw, pitch and roll angles of the n-th keypoint, respectively.
8. The method for detecting the three-dimensional key points of the human face based on the RGBD camera as claimed in claim 1, wherein the reconstruction of the human face requires a plurality of frames of RGB images and depth images, so that there are a plurality of groups of three-dimensional key points, and the centers of the plurality of groups of three-dimensional key points are calculated, wherein the center calculation formula is as follows:
$$p_i = \frac{1}{a} \sum_{j=1}^{a} p_i^j$$

where $p_i$ represents the center of the i-th keypoint, a the number of face frames, and $p_i^j$ the i-th keypoint in the j-th frame image.
9. The method for detecting the three-dimensional key points of the human face based on the RGBD camera as claimed in claim 1, wherein step 5 further comprises: the reconstructed face mesh contains a large number of vertices, and a kd-tree is adopted to accelerate finding the vertex closest to each center point, which would otherwise require traversing all vertices.
10. The method as claimed in claim 9, wherein the kd-tree is a binary tree, and the vertices of the face mesh are stored in a tree structure to reduce the number of searches, where N is the number of face key points and M is the number of face mesh vertices; the traversal search method requires N·M comparisons, while the kd-tree search method requires about N·log M.
CN202211609939.9A (filed 2022-12-12): Face three-dimensional key point detection method based on RGBD camera. Active; granted as CN115984461B.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211609939.9A CN115984461B (en) 2022-12-12 Face three-dimensional key point detection method based on RGBD camera


Publications (2)

Publication Number Publication Date
CN115984461A 2023-04-18
CN115984461B 2024-10-25


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016192477A1 (en) * 2015-05-29 2016-12-08 腾讯科技(深圳)有限公司 Method and terminal for locating critical point of face
CN110807451A (en) * 2020-01-08 2020-02-18 腾讯科技(深圳)有限公司 Face key point detection method, device, equipment and storage medium
WO2020098686A1 (en) * 2018-11-16 2020-05-22 广州市百果园信息技术有限公司 Face detection model training method and apparatus, and face key point detection method and apparatus
CN111243093A (en) * 2020-01-07 2020-06-05 腾讯科技(深圳)有限公司 Three-dimensional face grid generation method, device, equipment and storage medium
CN111753644A (en) * 2020-05-09 2020-10-09 清华大学 Method and device for detecting key points on three-dimensional face scanning
CN112836566A (en) * 2020-12-01 2021-05-25 北京智云视图科技有限公司 Multitask neural network face key point detection method for edge equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant