CN111523398A - Method and device for fusing 2D face detection and 3D face recognition - Google Patents


Info

Publication number
CN111523398A
Authority
CN
China
Prior art keywords
face
image
depth
dimensional
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010241057.6A
Other languages
Chinese (zh)
Inventor
葛晨阳
邓鹏超
卢泳冲
屈渝立
乔欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202010241057.6A
Publication of CN111523398A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Abstract

A method for fusing 2D face detection and 3D face recognition comprises the following steps: collecting an RGB image and a depth image, or an IR image and a depth image, with a 3D depth camera; preprocessing the RGB image or the IR image; performing face detection on the preprocessed RGB or IR image with the RetinaFace 2D detection method, and framing the face bounding box and 5 facial key points: the nose tip, the left and right eyes, and the left and right mouth corners; mapping the 5 facial key points to the corresponding 5 key points in the XYZ point cloud, and standardizing the detected three-dimensional face point cloud with respect to the XYZ point cloud data set; orthogonally projecting the standardized three-dimensional face point cloud onto a specified plane, performing grid fitting, and converting the grid-fitted three-dimensional face point cloud into a two-dimensional face depth map; and recognizing the two-dimensional face depth map with a face recognition method to obtain a face recognition result.

Description

Method and device for fusing 2D face detection and 3D face recognition
Technical Field
The disclosure belongs to the technical field of computer vision and image recognition, and particularly relates to a method and a device for fusing 2D face detection and 3D face recognition.
Background
Face recognition, an effective non-intrusive biometric identification technique, has rapidly become a tool of choice in surveillance, security and entertainment since the rise of computer vision. Compared with other biometric technologies such as iris, fingerprint and palm-print recognition, face recognition is contactless, highly interactive and easy to capture; it has become the basis of technologies such as city security, device unlocking, human-computer interaction, virtual reality and digital media, and plays an increasingly important role.
2D and 3D face recognition belong to the field of computer vision. Because a 2D image is the projection of a three-dimensional object onto a plane, 2D face recognition is disturbed by external factors such as illumination, pose and expression, which affects its accuracy. How to make full use of the two-dimensional and three-dimensional information of a face and how to improve the recognition rate, especially in scenes with face occlusion, poor lighting or poor shooting angles, is therefore one of the key problems to be solved.
Disclosure of Invention
In order to solve the above problem, the present disclosure provides a method for fusing 2D face detection and 3D face recognition, the method comprising the steps of:
S100: collecting an RGB image and a depth image, or an IR image and a depth image, with a 3D depth camera;
S200: preprocessing the RGB image or the IR image;
S300: performing face detection on the preprocessed RGB or IR image with the RetinaFace 2D detection method, and framing the face bounding box and 5 facial key points: the nose tip, the left and right eyes, and the left and right mouth corners;
S400: mapping the obtained 5 facial key points to the corresponding 5 key points in the XYZ point cloud, and standardizing the detected three-dimensional face point cloud with respect to the XYZ point cloud data set;
S500: orthogonally projecting the standardized three-dimensional face point cloud onto a specified plane, performing grid fitting, and converting the grid-fitted three-dimensional face point cloud into a two-dimensional face depth map;
S600: recognizing the two-dimensional face depth map with a face recognition method to obtain a face recognition result.
The present disclosure also provides a device for fusing 2D face detection and 3D face recognition, including:
means for collecting an RGB image and a depth image, or an IR image and a depth image, with a 3D depth camera;
means for preprocessing the RGB image or the IR image;
means for performing face detection on the preprocessed RGB or IR image with the RetinaFace 2D detection method and framing the face bounding box and 5 facial key points: the nose tip, the left and right eyes, and the left and right mouth corners;
means for mapping the obtained 5 facial key points to the corresponding 5 key points in the XYZ point cloud and standardizing the detected three-dimensional face point cloud with respect to the XYZ point cloud data set;
means for orthogonally projecting the standardized three-dimensional face point cloud onto a specified plane, performing grid fitting, and converting the grid-fitted three-dimensional face point cloud into a two-dimensional face depth map;
and means for recognizing the two-dimensional face depth map with a face recognition method to obtain a face recognition result.
With this technical scheme, fast 2D face detection is performed on the RGB or IR image, the facial key points are mapped onto the 3D face point cloud for standardization, the standardized 3D face point cloud is mapped to a 2D face depth map, and a face recognition result is obtained quickly with a face recognition method. The face recognition process is fast, efficient, safe and reliable, making the method suitable for embedded 3D face recognition applications.
Drawings
FIG. 1 is a flow chart of a method of fusing 2D face detection and 3D face recognition provided in an embodiment of the present disclosure;
FIG. 2 is a schematic representation of a grid in one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of bilinear interpolation in one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of weight calculation in one embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of a face recognition method in an embodiment of the present disclosure.
Detailed Description
In one embodiment, as shown in fig. 1, a method for fusing 2D face detection and 3D face recognition is disclosed, the method comprising the steps of:
S100: collecting an RGB image and a depth image, or an IR image and a depth image, with a 3D depth camera;
S200: preprocessing the RGB image or the IR image;
S300: performing face detection on the preprocessed RGB or IR image with the RetinaFace 2D detection method, and framing the face bounding box and 5 facial key points: the nose tip, the left and right eyes, and the left and right mouth corners;
S400: mapping the obtained 5 facial key points to the corresponding 5 key points in the XYZ point cloud, and standardizing the detected three-dimensional face point cloud with respect to the XYZ point cloud data set;
S500: orthogonally projecting the standardized three-dimensional face point cloud onto a specified plane, performing grid fitting, and converting the grid-fitted three-dimensional face point cloud into a two-dimensional face depth map;
S600: recognizing the two-dimensional face depth map with a face recognition method to obtain a face recognition result.
In this embodiment, the method uses 2D data (an IR or RGB image) only as an aid for face detection: the face bounding box and key points are extracted in 2D, this information is mapped onto the depth data, and face recognition is performed on the depth data. The 2D data thus only assists, and recognition relies on the depth data; having a single device run both full 2D recognition and full 3D recognition would increase complexity.
The three-dimensional face feature vector obtained by the method identifies faces accurately, and the recognition method can be ported to an embedded system platform with good reliability and convenience. The face recognition process is fast, efficient, safe and reliable: the speed comes from using a lightweight network, and the accuracy comes from extracting not only global features but also local features.
In another embodiment, the 3D depth camera in step S100 is a structured light depth camera or a ToF depth camera.
In this embodiment, the 3D depth camera may be a structured light depth camera or a ToF depth camera, and mainly comprises a projector, an RGB camera, an IR camera and a depth calculation module.
For the structured light depth camera, the projector is an infrared laser speckle projector. The working process is as follows: the infrared laser speckle projector projects dense infrared laser beams outwards, and after coherent interference and diffuse reflection on the object surface the beams form a coded pattern. The coded pattern is a speckle pattern consisting of randomly distributed speckle points; the pattern itself is fixed, but it shifts in the horizontal or vertical direction as the distance changes. The IR camera collects the speckle pattern, and the depth calculation module performs block-matching disparity estimation between the collected speckle pattern and an internally stored reference speckle pattern of known distance, treating them as the left and right images of a binocular pair, to generate a disparity vector map; the depth image of the projection space or target object is finally obtained by a depth calculation method.
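The patent does not give the disparity-to-depth formula; assuming the standard triangulation relation depth = focal length × baseline / disparity, a minimal Python/numpy sketch of that final depth calculation step could look like this (function and parameter names are illustrative, not from the patent):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=0.1):
    """Convert a block-matching disparity map (in pixels) to a depth map (in metres)
    using the standard triangulation relation depth = f * B / disparity.
    Pixels with near-zero disparity are left at 0 (invalid)."""
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > min_disp
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```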
For the ToF depth camera, the projector is a flood illuminator or a regular speckle projector, and the IR camera is a ToF receiving camera. The flood illuminator emits a uniform light source, or the regular speckle projector emits a laser speckle pattern; the ToF receiving camera synchronously receives the phase-shifted images reflected after the projection hits the object surface, and the depth calculation module decodes depth from the RAW phase-shift data with a phase-shift method.
In another embodiment, step S200 further comprises: converting the depth image into an XYZ point cloud data set according to the internal and external parameters of the RGB camera and the IR depth camera, and obtaining the pixel mapping relation between the RGB image and the IR image.
In this embodiment, preprocessing addresses three aspects: image contrast, image layering, and image detail.
Image contrast: the dynamic range of the image is enlarged by increasing its contrast; for an 8-bit image, stretching the dynamic range to occupy the full 0-255 gray levels gives a clear contrast gain over an image confined to a narrow local gray range. Dynamic-range stretching methods include (1) linear mapping, where stretching in equal proportion may lose gray levels through saturation truncation depending on the parameter setting; (2) non-linear Gamma mapping, where the mapping curve can be chosen as required to expand the dynamic range of high gray levels and compress that of low gray levels; and (3) an improved Gamma transformation that stretches from the middle gray levels of the dynamic range towards both ends.
Image layering: a contrast-limited adaptive histogram equalization (CLAHE) algorithm is adopted; the image is divided into blocks and a histogram is computed and mapped for each block separately, which enhances the local sense of layering.
Image detail: an image sharpening algorithm first separates the high-frequency information in the image, multiplies it by an enhancement coefficient, and superimposes it back onto the original image. The sharpening kernel may use a 4-neighbourhood or an 8-neighbourhood. Extending this idea, changing the size and shape of the low-pass or high-pass filtering window used for high/low frequency separation yields different detail-enhancement effects.
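A minimal OpenCV/numpy sketch of the three preprocessing aspects described above (dynamic-range stretch with Gamma mapping, CLAHE for local layering, Laplacian sharpening); the percentile bounds, gamma value and sharpening gain are illustrative assumptions, not values given in the patent:

```python
import cv2
import numpy as np

def preprocess_ir_or_gray(img_u8, gamma=0.6, sharpen_gain=1.0):
    """Illustrative preprocessing chain for a single-channel 8-bit image:
    contrast stretch, Gamma mapping, CLAHE, and Laplacian-based sharpening."""
    # 1) Linear dynamic-range stretch to the full 0-255 range.
    lo, hi = np.percentile(img_u8, (1, 99))
    stretched = np.clip((img_u8.astype(np.float32) - lo) * 255.0 / max(hi - lo, 1), 0, 255)

    # 2) Non-linear Gamma mapping (gamma < 1 expands the dark gray levels).
    gamma_mapped = 255.0 * (stretched / 255.0) ** gamma

    # 3) Contrast-limited adaptive histogram equalization on image tiles.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalized = clahe.apply(gamma_mapped.astype(np.uint8))

    # 4) Sharpening: separate the high-frequency detail with a 4-neighbourhood
    #    Laplacian, scale it, and add it back to the image.
    lap = cv2.Laplacian(equalized, cv2.CV_32F, ksize=1)
    sharpened = np.clip(equalized.astype(np.float32) - sharpen_gain * lap, 0, 255)
    return sharpened.astype(np.uint8)
```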
In another embodiment, the RetinaFace 2D detection method in step S300 is specifically: the preprocessed IR or RGB image is input into a backbone network; in the multi-task learning of face detection, in addition to the conventional face classification loss function and face box loss function, the 5 facial key points are additionally annotated, an extra supervised loss function for face alignment is introduced from the key-point information, and a self-supervised decoding branch is introduced to predict 3D face information.
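A hedged PyTorch sketch of how such a multi-task detection loss could be composed from the classification, box and 5-landmark terms; the loss weights are illustrative and the self-supervised 3D branch is omitted, so this is not the patent's or RetinaFace's exact implementation:

```python
import torch
import torch.nn.functional as F

def multitask_detection_loss(cls_logits, cls_targets, box_pred, box_targets,
                             pts_pred, pts_targets, w_box=0.25, w_pts=0.1):
    """Weighted sum of face classification, box regression and 5-landmark
    regression losses; regression terms use positive anchors only."""
    loss_cls = F.cross_entropy(cls_logits, cls_targets)   # face vs. background per anchor
    pos = cls_targets > 0
    if pos.any():
        loss_box = F.smooth_l1_loss(box_pred[pos], box_targets[pos])   # box offsets
        loss_pts = F.smooth_l1_loss(pts_pred[pos], pts_targets[pos])   # 5 (x, y) landmarks
    else:
        loss_box = loss_pts = cls_logits.new_zeros(())
    return loss_cls + w_box * loss_box + w_pts * loss_pts
```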
In another embodiment, the three-dimensional face point cloud standardization in step S400 is specifically: the nose tip is taken as the origin of a new coordinate system, the direction from the left eye to the right eye as the new x-axis, and the direction from the midpoint of the two eyes to the midpoint of the two mouth corners as the new y-axis; the new z-axis is determined by the cross product of the x-axis and the y-axis. The coordinates of each three-dimensional face in the new coordinate system are obtained through coordinate transformation, so that after transformation all three-dimensional face data have the same orientation and posture.
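A minimal numpy sketch of building this face-aligned coordinate frame from the 5 key points; the explicit orthogonalization of the y-axis against the x-axis is an added assumption not spelled out in the patent:

```python
import numpy as np

def face_frame_from_keypoints(nose, l_eye, r_eye, l_mouth, r_mouth):
    """Build the face-aligned frame: origin at the nose tip, x from left eye to
    right eye, y from the eye midpoint towards the mouth-corner midpoint,
    z = x cross y. Returns (R, origin) so that (points - origin) @ R.T maps a
    face point cloud into this frame."""
    x = r_eye - l_eye
    x = x / np.linalg.norm(x)
    y_raw = 0.5 * (l_mouth + r_mouth) - 0.5 * (l_eye + r_eye)
    y = y_raw - np.dot(y_raw, x) * x      # remove the component along x (assumed step)
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)
    R = np.stack([x, y, z])               # rows are the new basis vectors
    return R, nose
```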
In this embodiment, consider two point sets whose points are in one-to-one correspondence:
A = (A1, A2, ..., An), with Ai = (XAi, YAi, ZAi)^T
B = (B1, B2, ..., Bn), with Bi = (XBi, YBi, ZBi)^T
where A and B each correspond to a face point cloud data set. The centroid of each point set is computed:
uA = (1/n) Σ_i Ai
uB = (1/n) Σ_i Bi
and each point set is translated by its centroid uA or uB:
A' = {Ai - uA}
B' = {Bi - uB}
The covariance matrix H between the point sets is then computed and factorized by singular value decomposition:
H = Σ_i A'i (B'i)^T
SVD(H) = U S V^T
where U and V^T are unitary matrices and S is a rectangular diagonal matrix. The rotation matrix R = V U^T and the translation vector t = -R uA + uB are obtained, which gives the transformation matrix that maps a depth point cloud at an arbitrary position to the standardized position.
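The SVD-based alignment above corresponds to the standard Kabsch procedure; a short numpy sketch under that assumption (the function name and the reflection guard are additions beyond the patent text):

```python
import numpy as np

def rigid_transform_svd(A, B):
    """Estimate rotation R and translation t aligning point set A (n x 3) to the
    corresponding reference set B (n x 3): centre both sets, build the covariance
    H, and take its SVD, as in the standardization step above."""
    uA, uB = A.mean(axis=0), B.mean(axis=0)
    H = (A - uA).T @ (B - uB)                 # 3 x 3 covariance between the sets
    U, S, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                            # R = V U^T
    if np.linalg.det(R) < 0:                  # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = uB - R @ uA                           # t = -R uA + uB
    return R, t                               # aligned = (R @ A.T).T + t
```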
In a specific embodiment, the conversion of the depth image into the XYZ point cloud data set is obtained from the intrinsic parameters of the IR camera, with the specific formulas:
X(i,j) = depth(i,j) * (j - cx) / fx
Y(i,j) = depth(i,j) * (i - cy) / fy
Z(i,j) = depth(i,j)
where (i, j) is the pixel position in the depth map, depth(i, j) is the depth value at that pixel, and fx, fy, cx and cy are the intrinsic parameters of the IR camera.
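A minimal numpy sketch of this back-projection (the function name and the H x W x 3 output layout are assumptions for illustration, not from the patent):

```python
import numpy as np

def depth_to_xyz(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W) into an XYZ point cloud using the IR
    camera intrinsics: X = depth*(j-cx)/fx, Y = depth*(i-cy)/fy, Z = depth."""
    h, w = depth.shape
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    X = depth * (j - cx) / fx
    Y = depth * (i - cy) / fy
    Z = depth
    return np.stack([X, Y, Z], axis=-1)       # H x W x 3 point cloud
```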
In another embodiment, the mesh fitting in step S500 refers to converting from a three-dimensional format to a two-dimensional format, the pixel values of the two-dimensional format representing depth values.
In this embodiment, a depth map is a lossy conversion of three-dimensional data to two-dimensional data: the horizontal and vertical coordinates are replaced by pixel coordinates and only the depth values remain, which is why it is often called 2.5D data. Here, a mesh surface fitting algorithm converts the input unordered three-dimensional point cloud into a mesh surface without loss, preserving the physical meaning of x, y and z. For a given point cloud set {(x1, y1, z1), (x2, y2, z2), ..., (xn, yn, zn)}, whose data values range over (-∞, +∞), the corresponding mesh surface z = f(x, y) must be found, i.e. the data are fitted onto a grid of width and length nx × ny with spacing d. The value at each grid vertex represents the depth value interpolated at that point. The grid is illustrated in FIG. 2, where the numbers shown next to the grid points along the x-axis and y-axis are the grid numbers.
As shown in FIG. 3, for an input point P falling inside the grid cell formed by the four grid vertices numbered J, K, L and M, the depth value at the point obtained by bilinear interpolation is:
Depth(P) = Σ_{Q ∈ {J, K, L, M}} (len_x,Q / d) * (len_y,Q / d) * Depth(Q)
where Depth(·) denotes a depth value, P is the position of the input point, Q ranges over the four surrounding grid vertices, d is the grid spacing, and len_x,Q and len_y,Q are the distances along the x-axis and the y-axis between P and the vertex diagonally opposite Q. The four products of these proportional distances along the x-axis and y-axis act as the interpolation weights.
All points in the point cloud are mapped into the grid as input points and, as shown in FIG. 4, the bilinear interpolation weights of each input point with respect to its grid vertices are computed. In the example, 7 points are fitted onto a grid of width and length 5 × 5 with spacing 1; the left table lists the spatial coordinates of the 7 input points, the middle table shows the bilinear interpolation computation, and the rightmost table gives the interpolation weights of each point. For example, the point (0, 4, 1.5) lies exactly on grid vertex No. 21, so by the formula its weights are (0, 0, 1, 0); for the point (3.9, 0.1, 0.2), which lies between vertices 4, 5, 9 and 10, the interpolation weights are (0.09, 0.81, 0.01, 0.09).
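The weight example above can be checked with a few lines of Python; the function name and the corner ordering (x0,y0), (x1,y0), (x0,y1), (x1,y1) are illustrative assumptions:

```python
def bilinear_weights(x, y, d=1.0):
    """Bilinear interpolation weights of a point (x, y) with respect to the four
    surrounding grid vertices of spacing d, ordered (x0,y0), (x1,y0), (x0,y1), (x1,y1).
    Reproduces the example above: (3.9, 0.1) -> (0.09, 0.81, 0.01, 0.09)."""
    x0, y0 = int(x // d), int(y // d)
    tx, ty = (x - x0 * d) / d, (y - y0 * d) / d
    return ((1 - tx) * (1 - ty), tx * (1 - ty), (1 - tx) * ty, tx * ty)
```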
The weight vector of each input point is a sparse vector of dimension nx × ny, so all weight vectors together form a sparse weight matrix A_F; the true depth values of the input points form b_F, and the depth value of the fitted surface at each grid vertex is z. According to the equation:
A_F · z = b_F
the fitted mesh surface can be solved. Because this is an underdetermined system of equations, it has infinitely many solutions, and the fitted surface would be non-smooth and oscillate severely. A penalty term on the first derivative therefore has to be added, requiring the first derivative to be approximately 0; with I denoting the depth value at a point, each point should satisfy:
I(x,y)-I(x+1,y)=0
I(x,y)-I(x,y+1)=0
I(x,y)-I(x+1,y+1)=0
I(x+1,y)-I(x,y+1)=0
after fitting the mesh, the point cloud may be converted from a three-dimensional format to a two-dimensional format.
In another embodiment, the face recognition method in step S600 mainly comprises feature point detection, a global feature extraction network GFE, a local feature extraction network LFE and a fully connected layer.
In this embodiment, the face recognition method has two main characteristics: first, it is based on MobileFaceNet, giving high speed, a deep network and good performance; second, a local feature extraction network is adopted to extract the relational features between pairs of local features.
The flow of the face recognition method is shown in FIG. 5; it mainly comprises feature point detection, a global feature extraction network (GFE), a local feature extraction network (LFE) and a fully connected layer. The input to the network is the grid-fitted surface. The two-dimensional face depth map with detected facial key points is input and enters the global feature extraction stage. The global feature extraction network has two roles: it serves as the main network framework for feature extraction, and it provides a multi-channel feature map for local feature extraction; a modified MobileFaceNet is used as the basic network structure (Base CNN) for global feature extraction. In local feature extraction, Region-Of-Interest (ROI) selection is performed according to the mapping of the facial key points onto the feature map, so that the features at the corresponding key-point positions are extracted. Finally, the global and local features are fused, and a feature vector representing the three-dimensional face is output through a fully connected layer.
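A loose PyTorch sketch of the global-plus-local fusion idea only: the backbone here is a two-layer placeholder standing in for the modified MobileFaceNet, and the ROI pooling window, channel counts and embedding size are assumptions, not the patent's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFusion(nn.Module):
    """Sketch: a base CNN yields a global descriptor and a feature map; small ROIs
    around the mapped key points are pooled into local features; global and local
    features are concatenated and passed through a fully connected layer."""
    def __init__(self, feat_ch=64, embed_dim=128, n_points=5, roi=3):
        super().__init__()
        self.backbone = nn.Sequential(                     # placeholder base CNN
            nn.Conv2d(1, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU())
        self.roi = roi
        self.fc = nn.Linear(feat_ch + n_points * feat_ch, embed_dim)

    def forward(self, depth_map, keypoints):
        # depth_map: B x 1 x H x W, keypoints: B x 5 x 2 in feature-map coordinates
        fmap = self.backbone(depth_map)                    # B x C x h x w
        g = F.adaptive_avg_pool2d(fmap, 1).flatten(1)      # global feature, B x C
        locs = []
        for k in range(keypoints.shape[1]):                # pool a small ROI per key point
            xs = keypoints[:, k, 0].long().clamp(self.roi, fmap.shape[3] - self.roi - 1)
            ys = keypoints[:, k, 1].long().clamp(self.roi, fmap.shape[2] - self.roi - 1)
            patches = [fmap[b, :, ys[b] - self.roi:ys[b] + self.roi + 1,
                            xs[b] - self.roi:xs[b] + self.roi + 1].mean(dim=(1, 2))
                       for b in range(fmap.shape[0])]
            locs.append(torch.stack(patches))              # local feature, B x C
        return self.fc(torch.cat([g] + locs, dim=1))       # fused face embedding
```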
On the three data sets IAS-Lab RGBD, RGBD and ND-2006, the recognition accuracy of global feature extraction alone is 99.0%, 98.9% and 96.0% respectively, and the recognition accuracy after fusing the global and local features rises to 99.4%, 99.3% and 96.8%.
In another embodiment, the internal and external parameters of the RGB camera and the IR depth camera in step S200 are obtained by 3D depth camera calibration.
In this embodiment, because of the positional difference between the IR camera and the RGB camera, the pixels of the finally output RGB image and depth image cannot be aligned one to one. Since the method uses the color image as an aid for face detection, the internal and external parameters of the RGB camera and the IR camera must be calibrated, so that a depth map and a color map captured at the same moment are calibrated such that pixels at the same position correspond to the same point in space. Pixel coordinates, image coordinates, camera coordinates and world coordinates are related through the camera intrinsics and extrinsics, so the pixel coordinates of the depth image and the color image can be unified. The specific conversion formula is:
[X_rgb, Y_rgb, Z_rgb]^T = R · [X_d, Y_d, Z_d]^T + t
where [X_rgb, Y_rgb, Z_rgb]^T are the three-dimensional coordinates of a point in the RGB camera coordinate system, [X_d, Y_d, Z_d]^T are the three-dimensional coordinates of the same point in the IR camera coordinate system, and R and t are the rotation and translation from the IR camera coordinate system to the RGB camera coordinate system, calibrated with Zhang's calibration method. The relative position of the RGB camera and the IR camera is fixed, so the calibration only needs to be performed once.
After the coordinate conversion relation between the RGB camera and the IR camera is obtained, the pixel correspondence between the IR image and the RGB image follows from the intrinsic parameters:
u_rgb = fx * X_rgb / Z_rgb + cx
v_rgb = fy * Y_rgb / Z_rgb + cy
where [u_rgb, v_rgb]^T are the pixel coordinates in the RGB image, and fx, fy, cx and cy are the intrinsic parameters of the RGB camera. Because the depth image and the IR image correspond pixel for pixel, given a pixel in the depth image, its three-dimensional coordinates in the IR camera coordinate system can be obtained from the IR camera intrinsics, its coordinates in the RGB camera coordinate system then follow from the extrinsics, and the corresponding RGB pixel coordinates are obtained from the RGB camera intrinsics. This establishes the pixel correspondence between the RGB image and the depth (IR) image and yields RGB and depth images under a unified viewing angle.
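A minimal numpy sketch of this depth-to-RGB pixel mapping; passing the intrinsics as (fx, fy, cx, cy) tuples and the single-pixel interface are assumptions for illustration:

```python
import numpy as np

def depth_pixel_to_rgb_pixel(i, j, depth_val, K_ir, K_rgb, R, t):
    """Map one depth/IR pixel (row i, col j) with depth value depth_val to the
    corresponding RGB pixel: back-project with the IR intrinsics, transform with
    the extrinsics (R, t), and re-project with the RGB intrinsics."""
    fx_ir, fy_ir, cx_ir, cy_ir = K_ir
    fx_rgb, fy_rgb, cx_rgb, cy_rgb = K_rgb
    # 3D point in the IR camera coordinate system.
    P_ir = np.array([depth_val * (j - cx_ir) / fx_ir,
                     depth_val * (i - cy_ir) / fy_ir,
                     depth_val])
    # Same point in the RGB camera coordinate system.
    X, Y, Z = R @ P_ir + t
    # Project with the RGB intrinsics to get the RGB pixel coordinates.
    u = fx_rgb * X / Z + cx_rgb
    v = fy_rgb * Y / Z + cy_rgb
    return u, v
```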
In another embodiment, the RGB image pixel mapping relation in step S200 is obtained as follows: the pixels of the RGB image and the depth image are aligned using the internal and external parameters obtained by calibration.
In this embodiment, the bounding box and facial key points detected on the IR image can correspond directly to the depth image without any pixel alignment operation, whereas detection results from the RGB image require mapping the pixels of the RGB image onto the pixels of the depth image through the internal and external parameters of the RGB camera and the IR camera.
In this embodiment, the pose of face data obtained by actual scanning inevitably deviates somewhat at acquisition time, and the spatial positions of different data sets differ even more. Such positional inconsistency strongly interferes with the related vision tasks, so the spatial positions of the face point clouds must be transformed to make the positions of different faces in the world coordinate system roughly consistent.
In another embodiment, an apparatus for fusing 2D face detection and 3D face recognition includes:
means for collecting an RGB image and a depth image, or an IR image and a depth image, with a 3D depth camera;
means for preprocessing the RGB image or the IR image, converting the depth image into an XYZ point cloud data set according to the internal and external parameters of the RGB camera and the IR depth camera, and obtaining the pixel mapping relation between the RGB image and the IR image;
means for performing face detection on the preprocessed RGB or IR image with the RetinaFace 2D detection method and framing the face bounding box and 5 facial key points: the nose tip, the left and right eyes, and the left and right mouth corners;
means for mapping the obtained 5 facial key points to the corresponding 5 key points in the XYZ point cloud and standardizing the detected three-dimensional face point cloud with respect to the XYZ point cloud data set;
means for orthogonally projecting the standardized three-dimensional face point cloud onto a specified plane, performing grid fitting, and converting the grid-fitted three-dimensional face point cloud into a two-dimensional face depth map;
and means for recognizing the two-dimensional face depth map with a face recognition method to obtain a face recognition result.
In this embodiment, the RetinaFace 2D detection method is built on a one-stage object detection network and directly regresses the class probability and position coordinates of the object, which makes it fast and accurate.
In summary, the above embodiments are only used for illustrating the technical solutions of the present disclosure, and not for limiting the same; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.

Claims (10)

1. A method for fusing 2D face detection and 3D face recognition, the method comprising the steps of:
S100: collecting an RGB image and a depth image, or an IR image and a depth image, with a 3D depth camera;
S200: preprocessing the RGB image or the IR image;
S300: performing face detection on the preprocessed RGB or IR image with the RetinaFace 2D detection method, and framing the face bounding box and 5 facial key points: the nose tip, the left and right eyes, and the left and right mouth corners;
S400: mapping the obtained 5 facial key points to the corresponding 5 key points in the XYZ point cloud, and standardizing the detected three-dimensional face point cloud with respect to the XYZ point cloud data set;
S500: orthogonally projecting the standardized three-dimensional face point cloud onto a specified plane, performing grid fitting, and converting the grid-fitted three-dimensional face point cloud into a two-dimensional face depth map;
S600: recognizing the two-dimensional face depth map with a face recognition method to obtain a face recognition result.
2. The method according to claim 1, wherein the 3D depth camera in step S100 is a structured light depth camera or a ToF depth camera.
3. The method according to claim 1, wherein step S200 further comprises: converting the depth image into an XYZ point cloud data set according to the internal and external parameters of the RGB camera and the IR depth camera, and obtaining the pixel mapping relation between the RGB image and the IR image.
4. The method according to claim 1, wherein the RetinaFace 2D detection method in step S300 is specifically: the preprocessed IR or RGB image is input into a backbone network; in the multi-task learning of face detection, in addition to the conventional face classification loss function and face box loss function, the 5 facial key points are additionally annotated, an extra supervised loss function for face alignment is introduced from the key-point information, and a self-supervised decoding branch is introduced to predict 3D face information.
5. The method according to claim 1, wherein the three-dimensional face point cloud standardization in step S400 comprises: taking the nose tip as the origin of a new coordinate system, the direction from the left eye to the right eye as the new x-axis, and the direction from the midpoint of the two eyes to the midpoint of the two mouth corners as the new y-axis, the new z-axis being determined by the cross product of the x-axis and the y-axis; and obtaining the coordinates of each three-dimensional face in the new coordinate system through coordinate transformation, so that after transformation all three-dimensional face data have the same orientation and posture.
6. The method of claim 1, wherein the mesh fitting in step S500 refers to converting from a three-dimensional format to a two-dimensional format, the pixel values of the two-dimensional format representing depth values.
7. The method according to claim 1, wherein the face recognition method in step S600 mainly comprises feature point detection, a global feature extraction network GFE, a local feature extraction network LFE and a fully connected layer.
8. The method of claim 3, wherein the RGB camera and IR depth camera internal and external parameters are obtained by 3D depth camera calibration.
9. The method according to claim 3, wherein the RGB image pixel mapping relation is obtained by aligning the pixels of the RGB image and the depth image using the internal and external parameters obtained by calibration.
10. An apparatus for fusing 2D face detection and 3D face recognition, comprising:
means for capturing RGB images and depth images, or IR images and depth images, by a 3D depth camera;
means for pre-processing the RGB image or the IR image;
means for performing face detection on the preprocessed RGB or IR image with the RetinaFace 2D detection method and framing the face bounding box and 5 facial key points: the nose tip, the left and right eyes, and the left and right mouth corners;
means for mapping the obtained 5 facial key points to the corresponding 5 key points in the XYZ point cloud and standardizing the detected three-dimensional face point cloud with respect to the XYZ point cloud data set;
means for orthogonally projecting the standardized three-dimensional face point cloud onto a specified plane, performing grid fitting, and converting the grid-fitted three-dimensional face point cloud into a two-dimensional face depth map;
and means for recognizing the two-dimensional face depth map with a face recognition method to obtain a face recognition result.
CN202010241057.6A 2020-03-30 2020-03-30 Method and device for fusing 2D face detection and 3D face recognition Pending CN111523398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010241057.6A CN111523398A (en) 2020-03-30 2020-03-30 Method and device for fusing 2D face detection and 3D face recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010241057.6A CN111523398A (en) 2020-03-30 2020-03-30 Method and device for fusing 2D face detection and 3D face recognition

Publications (1)

Publication Number Publication Date
CN111523398A true CN111523398A (en) 2020-08-11

Family

ID=71901235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010241057.6A Pending CN111523398A (en) 2020-03-30 2020-03-30 Method and device for fusing 2D face detection and 3D face recognition

Country Status (1)

Country Link
CN (1) CN111523398A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183481A (en) * 2020-10-29 2021-01-05 中国科学院计算技术研究所厦门数据智能研究院 3D face recognition method based on structured light camera
CN112215174A (en) * 2020-10-19 2021-01-12 江苏中讯通物联网技术有限公司 Sanitation vehicle state analysis method based on computer vision
CN112434647A (en) * 2020-12-09 2021-03-02 浙江光珀智能科技有限公司 Human face living body detection method
CN112597823A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Attention recognition method and device, electronic equipment and storage medium
CN112651279A (en) * 2020-09-24 2021-04-13 深圳福鸽科技有限公司 3D face recognition method and system based on short-distance application
CN113139465A (en) * 2021-04-23 2021-07-20 北京华捷艾米科技有限公司 Face recognition method and device
CN113392763A (en) * 2021-06-15 2021-09-14 支付宝(杭州)信息技术有限公司 Face recognition method, device and equipment
CN113469043A (en) * 2021-06-30 2021-10-01 南方科技大学 Method and device for detecting wearing state of safety helmet, computer equipment and storage medium
CN113886477A (en) * 2021-09-28 2022-01-04 北京三快在线科技有限公司 Face recognition method and device
CN115050149A (en) * 2022-06-17 2022-09-13 郑州铁路职业技术学院 Automatic teller machine based on face recognition and automatic teller method thereof
CN116645299A (en) * 2023-07-26 2023-08-25 中国人民解放军国防科技大学 Method and device for enhancing depth fake video data and computer equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091162A (en) * 2014-07-17 2014-10-08 东南大学 Three-dimensional face recognition method based on feature points
CN105243374A (en) * 2015-11-02 2016-01-13 湖南拓视觉信息技术有限公司 Three-dimensional human face recognition method and system, and data processing device applying same
CN107944435A (en) * 2017-12-27 2018-04-20 广州图语信息科技有限公司 A kind of three-dimensional face identification method, device and processing terminal
CN108197587A (en) * 2018-01-18 2018-06-22 中科视拓(北京)科技有限公司 A kind of method that multi-modal recognition of face is carried out by face depth prediction
CN108520204A (en) * 2018-03-16 2018-09-11 西北大学 A kind of face identification method
CN108549873A (en) * 2018-04-19 2018-09-18 北京华捷艾米科技有限公司 Three-dimensional face identification method and three-dimensional face recognition system
CN109101871A (en) * 2018-08-07 2018-12-28 北京华捷艾米科技有限公司 A kind of living body detection device based on depth and Near Infrared Information, detection method and its application
CN109753875A (en) * 2018-11-28 2019-05-14 北京的卢深视科技有限公司 Face identification method, device and electronic equipment based on face character perception loss
CN110458041A (en) * 2019-07-19 2019-11-15 国网安徽省电力有限公司建设分公司 A kind of face identification method and system based on RGB-D camera
CN110852310A (en) * 2020-01-14 2020-02-28 长沙小钴科技有限公司 Three-dimensional face recognition method and device, terminal equipment and computer readable medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104091162A (en) * 2014-07-17 2014-10-08 东南大学 Three-dimensional face recognition method based on feature points
CN105243374A (en) * 2015-11-02 2016-01-13 湖南拓视觉信息技术有限公司 Three-dimensional human face recognition method and system, and data processing device applying same
CN107944435A (en) * 2017-12-27 2018-04-20 广州图语信息科技有限公司 A kind of three-dimensional face identification method, device and processing terminal
CN108197587A (en) * 2018-01-18 2018-06-22 中科视拓(北京)科技有限公司 A kind of method that multi-modal recognition of face is carried out by face depth prediction
CN108520204A (en) * 2018-03-16 2018-09-11 西北大学 A kind of face identification method
CN108549873A (en) * 2018-04-19 2018-09-18 北京华捷艾米科技有限公司 Three-dimensional face identification method and three-dimensional face recognition system
CN109101871A (en) * 2018-08-07 2018-12-28 北京华捷艾米科技有限公司 A kind of living body detection device based on depth and Near Infrared Information, detection method and its application
CN109753875A (en) * 2018-11-28 2019-05-14 北京的卢深视科技有限公司 Face identification method, device and electronic equipment based on face character perception loss
CN110458041A (en) * 2019-07-19 2019-11-15 国网安徽省电力有限公司建设分公司 A kind of face identification method and system based on RGB-D camera
CN110852310A (en) * 2020-01-14 2020-02-28 长沙小钴科技有限公司 Three-dimensional face recognition method and device, terminal equipment and computer readable medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JIANKANG DENG: "RetinaFace: Single-stage Dense Face Localisation in the Wild", Computer Vision and Pattern Recognition *
Zhang Huisen: "Face Recognition Technology", Computer Engineering and Design *
Ming Yue: "Audio-Visual Media Perception and Recognition", 31 August 2015, Beijing University of Posts and Telecommunications Press *
Li Kefeng: "Face Image Processing and Recognition Technology", 31 August 2018, Yellow River Water Conservancy Press *
Zhao Lixin: "Analysis of Smart Hardware Security in the Mobile Internet Era", 31 July 2019, China Fortune Press *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651279A (en) * 2020-09-24 2021-04-13 深圳福鸽科技有限公司 3D face recognition method and system based on short-distance application
CN112215174A (en) * 2020-10-19 2021-01-12 江苏中讯通物联网技术有限公司 Sanitation vehicle state analysis method based on computer vision
CN112183481B (en) * 2020-10-29 2022-05-20 中科(厦门)数据智能研究院 3D face recognition method based on structured light camera
CN112183481A (en) * 2020-10-29 2021-01-05 中国科学院计算技术研究所厦门数据智能研究院 3D face recognition method based on structured light camera
CN112597823A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Attention recognition method and device, electronic equipment and storage medium
CN112434647A (en) * 2020-12-09 2021-03-02 浙江光珀智能科技有限公司 Human face living body detection method
CN113139465A (en) * 2021-04-23 2021-07-20 北京华捷艾米科技有限公司 Face recognition method and device
CN113392763A (en) * 2021-06-15 2021-09-14 支付宝(杭州)信息技术有限公司 Face recognition method, device and equipment
CN113469043A (en) * 2021-06-30 2021-10-01 南方科技大学 Method and device for detecting wearing state of safety helmet, computer equipment and storage medium
CN113886477A (en) * 2021-09-28 2022-01-04 北京三快在线科技有限公司 Face recognition method and device
CN113886477B (en) * 2021-09-28 2023-01-06 北京三快在线科技有限公司 Face recognition method and device
CN115050149A (en) * 2022-06-17 2022-09-13 郑州铁路职业技术学院 Automatic teller machine based on face recognition and automatic teller method thereof
CN115050149B (en) * 2022-06-17 2023-08-04 郑州铁路职业技术学院 Face recognition-based self-service cash dispenser and cash withdrawal method thereof
CN116645299A (en) * 2023-07-26 2023-08-25 中国人民解放军国防科技大学 Method and device for enhancing depth fake video data and computer equipment
CN116645299B (en) * 2023-07-26 2023-10-10 中国人民解放军国防科技大学 Method and device for enhancing depth fake video data and computer equipment

Similar Documents

Publication Publication Date Title
CN111523398A (en) Method and device for fusing 2D face detection and 3D face recognition
CN111066065B (en) System and method for hybrid depth regularization
TWI455062B (en) Method for 3d video content generation
CN101443817B (en) Method and device for determining correspondence, preferably for the three-dimensional reconstruction of a scene
Cao et al. Sparse photometric 3D face reconstruction guided by morphable models
KR100681320B1 (en) Method for modelling three dimensional shape of objects using level set solutions on partial difference equation derived from helmholtz reciprocity condition
CN114666564B (en) Method for synthesizing virtual viewpoint image based on implicit neural scene representation
CN106919257B (en) Haptic interactive texture force reproduction method based on image brightness information force
CN110910437A (en) Depth prediction method for complex indoor scene
CN113012293A (en) Stone carving model construction method, device, equipment and storage medium
CN113850865A (en) Human body posture positioning method and system based on binocular vision and storage medium
CN107767358B (en) Method and device for determining ambiguity of object in image
Kang et al. Competitive learning of facial fitting and synthesis using uv energy
CN109345570B (en) Multi-channel three-dimensional color point cloud registration method based on geometric shape
Zhuang et al. A dense stereo matching method based on optimized direction-information images for the real underwater measurement environment
Yin et al. Virtual reconstruction method of regional 3D image based on visual transmission effect
KR101673144B1 (en) Stereoscopic image registration method based on a partial linear method
CN113129348B (en) Monocular vision-based three-dimensional reconstruction method for vehicle target in road scene
Ervan et al. Downsampling of a 3D LiDAR point cloud by a tensor voting based method
CN115147577A (en) VR scene generation method, device, equipment and storage medium
CN106056599B (en) A kind of object recognition algorithm and device based on Object Depth data
Villa-Uriol et al. Automatic creation of three-dimensional avatars
Chang et al. Pixel-based adaptive normalized cross correlation for illumination invariant stereo matching
Drap et al. Underwater multimodal survey: Merging optical and acoustic data
CN117593618B (en) Point cloud generation method based on nerve radiation field and depth map

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200811)