CN112365586B - 3D face modeling and stereo judging method and binocular 3D face modeling and stereo judging method of embedded platform - Google Patents


Info

Publication number
CN112365586B
CN112365586B (application CN202011334611.1A)
Authority
CN
China
Prior art keywords
face
parallax
binocular
network
stereo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011334611.1A
Other languages
Chinese (zh)
Other versions
CN112365586A (en)
Inventor
袁嘉言
陈明木
徐绍凯
王汉超
贾宝芝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Ruiwei Information Technology Co ltd
Original Assignee
Xiamen Ruiwei Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Ruiwei Information Technology Co., Ltd.
Priority to CN202011334611.1A
Publication of CN112365586A
Application granted
Publication of CN112365586B
Active legal status
Anticipated expiration

Classifications

    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T7/85 Stereo camera calibration
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/168 Feature extraction; Face representation
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a 3D face modeling and stereo judging method, and a binocular 3D face modeling and stereo judging method for an embedded platform, in the technical field of computer vision and intelligent security terminals. By designing a novel binocular 3D face modeling and stereo-matching deep-learning architecture, the algorithm quickly judges whether a face carries stereo information, and is denser and more accurate than traditional image stereo-matching algorithms. Ported to an embedded platform, the algorithm runs at near real-time speed, so the user experience is seamless. The 3D face modeling accuracy of the invention approaches that of structured-light or TOF devices, but at a lower cost. The method can be used in intelligent security access control, automatic attendance machines, 3D face door locks and similar fields, producing substantial economic benefit.

Description

3D face modeling and stereo judging method and binocular 3D face modeling and stereo judging method of embedded platform
Technical Field
The invention relates to the technical field of computer vision, in particular to a 3D face modeling and stereo judging method and a binocular 3D face modeling and stereo judging method of an embedded platform.
Background
Face recognition technology is widely applied in the security field, including surveillance snapshots, intelligent access control and smart door locks. With the rapid development of computer vision technology and the huge data accumulation of the Internet age, face recognition technology has made remarkable progress.
Face recognition brings convenience to many aspects of life, but it is not absolutely safe: attacks with non-living faces must be prevented, especially in unattended access control, payment and similar fields, so strict face liveness detection is very important. Currently the market largely relies on monocular silent liveness detection or cooperative liveness detection to distinguish real persons from fakes, which achieves only 2D face liveness judgment. However, a monocular silent liveness scheme can only learn facial texture and edge/background information, while cooperative liveness requires the user's cooperation and gives a poor experience; neither can model information in the depth direction of the face. Consequently, a high-definition printed face is quite likely to attack such a face recognition device successfully, because the captured liveness information is insufficient. From this analysis, 2D face liveness detection lacks depth-information judgment and is not safe enough, so 3D face liveness technology is very necessary for unattended access control. Several typical methods for judging liveness using face depth information are described below:
(1) Monocular capture of two images and pseudo-3D liveness judgment: the German biometric recognition company BioID published the patent "Method for discriminating between a real face and a two-dimensional image of the face in a biometric detection process". Its main idea is to snap two images of a moving face at different moments with a monocular camera, and then judge from the position changes of facial key points between the two images whether the current face is a real three-dimensional person or a flat sheet of paper. The patent distinguishes the key-feature-point positional relationship of two snapshots of a moving real face from that of two snapshots of a moving paper face, exploiting the difference in positional deformation between a three-dimensional object and a planar object after projection through the camera. Advantage: whether a face has three-dimensional information can be judged with only two monocular shots. Disadvantage: three-dimensionality is inferred only from the two-dimensional feature-point deformation of the imaged stereo versus planar object; no true three-dimensional model of the face is built, so it is only a pseudo-3D judgment.
(2) Structured-light or TOF (Time of Flight) depth imaging: this scheme images the depth of the photographed object directly with a structured-light or TOF device, so a face region can be selected on the depth map and the corresponding stereo judgment performed. Advantage: the stereo judgment uses a real face depth map with high imaging precision. Disadvantage: structured-light or TOF equipment costs considerably more than conventional lenses.
(3) Binocular stereo modeling: by the binocular imaging principle, the same object imaged by the left and right lenses exhibits a certain parallax, which can be converted into distance by the triangulation of the pinhole imaging model, thereby constructing a depth map of the scene. Applied to face recognition, the binocular method can construct a depth map of the face and judge from it whether the face has three-dimensional information. Advantage: binocular modules are mature, mass-produced and cheap. Disadvantage: if disparity matching is done with traditional image methods such as BM (Block Matching) or SGM (Semi-Global Matching), matching failures in weakly textured face regions leave obvious holes and the precision is low; if an end-to-end deep learning method is used, the precision is high but the running speed on an embedded platform is very slow.
Each of the above technologies has advantages and disadvantages. For safety, the precision of 3D face modeling must be high; for user experience, the recognition speed on the embedded platform must be fast; and for overall production cost, the chosen module must be cost-effective. The invention therefore provides a binocular 3D face modeling and stereo judging method for an embedded platform that remedies the defects of the prior art.
Disclosure of Invention
The invention aims to provide a binocular 3D face stereo judging method with good security, good user experience and low cost.
In order to achieve the above object, the solution of the present invention is:
a binocular 3D face modeling and stereo judging method comprises the following steps:
step (1), inputting left and right view data to construct a cost volume: set the network input size for disparity prediction, and normalize the acquired ROI image left_face_img of the left view and ROI image right_face_img of the right view (ROI: region of interest) to the network input size before feeding the network. The cost-volume operation is moved forward to the raw data layer: a tensor of shape 2D x (network input size) is constructed in the input space, where D is the maximum matchable disparity and takes the value 2^n (n >= 4);
step (2), CNN1 disparity feature extraction: the CNN1 network structure comprises two parts: the front part downsamples by 8x to extract high-dimensional semantic information, and the rear part upsamples by 8x to fuse the high-level semantic information with the low-level texture information. The CNN1 disparity-feature learning process uses SoftmaxWithLoss cross-entropy learning; because the cost-volume operation was moved forward to the raw data layer in step (1), a single CNN1 network suffices to decode and output a disparity feature of shape D x (network input size);
step (3), argmax disparity parsing: the argmax operation parses the disparity value from the D x (network input size) tensor by finding, for each position, the index of the maximum value along the D channel; that channel index is the disparity predicted by the network. After the argmax, the disparity prediction tensor becomes a map of the network input size;
step (4), modeling the 3D face map: the network disparity prediction is relative to the network-input-size image; the true disparity at the original resolution is obtained by multiplying by the ratio scale_x = (ROI.x2 - ROI.x1) / (network input width), where ROI.x2 and ROI.x1 are the maximum and minimum X coordinates of the region of interest, i.e. the original-resolution disparity D_truth = D_predict * scale_x, with D_predict the predicted disparity and D_truth the true disparity. The depth finally output by the 3D face modeling is Depth = B * F / D_truth, where B is the baseline distance between the binocular cameras, F is the coplanar pixel focal length after binocular baseline rectification, and Depth is the depth distance of the face;
step (5), CNN2 predicting whether the face is 3D stereoscopic: a CNN2 deep-learning classification network is designed for the 3D face map; features are extracted directly from the 3D face map at the network input size, and a fully connected layer then classifies whether stereo information is present.
Further, CNN1 in step (2) and CNN2 in step (5) use model parameters saved after training on actual data from the device. The training data are real left and right views collected by the actual binocular device, and the training labels comprise two parts: an accurate disparity map aligned with the face of the left view, and a classification label indicating whether the input left and right face views contain depth information. The network training framework is Caffe, CNN1 and CNN2 are trained jointly, and cross-entropy loss is chosen for both the disparity prediction and the stereo judgment.
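The joint training described above learns disparity as a per-pixel classification over D disparity bins with a cross-entropy objective. A minimal numpy sketch of such a SoftmaxWithLoss-style loss (the function name and array shapes are illustrative assumptions, not the patent's actual Caffe layers):

```python
import numpy as np

def disparity_softmax_xent(logits, gt_disp):
    """Per-pixel cross-entropy over D disparity bins.

    logits:  (D, H, W) raw disparity-feature outputs of CNN1 (assumed layout).
    gt_disp: (H, W) integer ground-truth disparity indices in [0, D).
    """
    # Numerically stable log-softmax along the disparity axis.
    z = logits - logits.max(axis=0, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    # Pick the log-probability of the true disparity bin at each pixel.
    picked = np.take_along_axis(log_prob, gt_disp[None], axis=0)[0]
    return -picked.mean()
```

With D = 16, 32, 64 or 128 as in the claims, the same loss form also serves the CNN2 real/fake classification with D replaced by 2 classes.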
A further aim of the invention is to provide a binocular 3D face modeling and stereo judging method for an embedded platform that comprehensively considers product safety, user experience and cost, featuring high face stereo modeling precision, fast recognition speed on the embedded platform and a cost-effective choice of module.
In order to achieve the above object, the solution of the present invention is:
a binocular 3D face modeling and stereo judging method of an embedded platform comprises the following steps:
calibrating internal parameters and external parameters of the binocular camera;
step two, positioning the positions of the face in the binocular left and right views;
step three, selecting the matching region of interest of the binocular views and performing distortion correction and baseline alignment; the region of interest is handled as follows: the face positions located in step two are positions on the views before rectification; let the left-view face position be ori_left_face and the right-view face position be ori_right_face, with the face-frame position information comprising (x1, y1, x2, y2). Disparity stereo matching must match rectified images, so ori_left_face and ori_right_face of the original views are mapped, via the binocular camera parameters calibrated in step one, into the baseline-aligned image space, giving the left- and right-view face frames correct_left_face and correct_right_face; the union of the two rectified face frames is selected as the region of interest ROI by the following formulas:
ROI.x1=min(correct_left_face.x1,correct_right_face.x1);
ROI.y1=min(correct_left_face.y1,correct_right_face.y1);
ROI.x2=max(correct_left_face.x2,correct_right_face.x2);
ROI.y2=max(correct_left_face.y2,correct_right_face.y2);
after the ROI of the rectified face position is selected, the ROIs of the left and right views of the binocular lens are baseline-aligned according to the baseline rectification matrix, yielding the left-view ROI image left_face_img and the right-view ROI image right_face_img, in which the same world-coordinate point lies on the same horizontal line of the two baseline-aligned images;
step four, binocular 3D face modeling and face stereo judgment based on a deep learning method are carried out on the region of interest, and the method comprises the following steps:
step (1), inputting left and right view data to construct a cost volume: set the network input size for disparity prediction, normalize left_face_img and right_face_img obtained in step three to the network input size before feeding the network, and construct a tensor of shape 2D x (network input size) in the input space, where D is the maximum matchable disparity and takes the value 2^n (n >= 4);
step (2), CNN1 disparity feature extraction: the CNN1 network structure comprises two parts: the front part downsamples by 8x to extract high-dimensional semantic information, and the rear part upsamples by 8x to fuse the high-level semantic information with the low-level texture information. The CNN1 disparity-feature learning process uses SoftmaxWithLoss cross-entropy learning; because the cost-volume operation was moved forward to the raw data layer in step (1), a single CNN1 network suffices to decode and output a disparity feature of shape D x (network input size);
step (3), argmax disparity parsing: the argmax operation parses the disparity value from the D x (network input size) tensor; because the CNN1 disparity-feature learning uses SoftmaxWithLoss cross-entropy, the argmax finds, for each position of the network input size, the index of the maximum value along the D channel, and that channel index is the disparity predicted by the network; after the argmax, the disparity prediction tensor becomes a map of the network input size;
step (4), modeling the 3D face map: the network disparity prediction is relative to the network-input-size image; the true disparity at the original resolution is obtained by multiplying by the ratio scale_x = (ROI.x2 - ROI.x1) / (network input width), i.e. the original-resolution disparity D_truth = D_predict * scale_x, with D_predict the predicted disparity and D_truth the true disparity; the depth finally output by the 3D face modeling is Depth = B * F / D_truth, where B is the baseline distance between the binocular cameras, F is the coplanar pixel focal length after binocular baseline rectification, and Depth is the depth distance of the face;
step (5), CNN2 predicting whether the face is 3D stereoscopic: a CNN2 deep-learning classification network is designed for the 3D face map; features are extracted directly from the 3D face map at the network input size, and a fully connected layer then classifies whether stereo information is present;
step five, porting the deep-learning models for binocular 3D face modeling and face stereo judgment of step four onto the embedded platform, comprising the following steps:
step (1), quantizing and porting the SFD face detection model to the embedded platform to detect faces in the original left and right views, i.e. realizing step two on the embedded platform; step (2), computing the rectified face region-of-interest images with the opencv image processing library from the faces detected in step (1) and the calibrated binocular camera parameters, and normalizing them to the network input size, i.e. realizing step three on the embedded platform;
step (3), porting the 3D face modeling network CNN1: quantize the CNN1 model with the training data arranged in step four, implement it on the NPU of the embedded platform, and implement the disparity-parsing argmax operation and 3D depth modeling on the ARM processor to output the 3D face information;
step (4), porting the stereo-judgment network CNN2: quantize the CNN2 model with the training data finished in step four and implement it on the NPU of the embedded platform to output the stereo judgment result.
Further, the calibration process of step one is as follows: place the calibration board at a suitable position in front of the binocular lens, ensuring the whole board lies within both the left and right views and its pixel area exceeds 1/4 of the image pixel area; collect more than 10 picture pairs (calibration pairs) with the board positioned up, down, left, right and rotated; then perform the intrinsic and extrinsic calibration of the binocular camera on the collected calibration pairs and save the parameter data.
Further, step two detects the face with a simplified classical SFD detection algorithm: the SFD algorithm locates in each of the binocular left and right views the face position, which is the region of interest for face stereo modeling.
Further, in step four, CNN1 and CNN2 use model parameters saved after training on actual data from the device. The training data are real left and right views collected by the actual binocular device, and the training labels comprise two parts: an accurate disparity map aligned with the face of the left view, and a classification label indicating whether the input left and right face views contain depth information. The network training framework is Caffe, CNN1 and CNN2 are trained jointly, and cross-entropy loss is chosen for both the disparity prediction and the stereo judgment.
Further, the embedded platform is a HiSilicon Hi3516 platform.
With the above scheme, a novel binocular 3D face modeling and stereo-matching deep-learning architecture is designed. Compared with traditional image stereo-matching algorithms, the algorithm yields a denser and more accurate 3D face map and quickly judges whether the face has stereo information. (1) In precision, the face stereo judging algorithm achieves very high binocular 3D face modeling precision and generalizes very well under various lighting conditions; (2) in speed, ported to the embedded platform, the algorithm runs at near real-time speed, so the user experience is seamless. The 3D face modeling accuracy of the invention approaches that of structured-light or TOF devices, but at a lower cost. In conclusion, the invention provides a 3D liveness judgment scheme with high face modeling accuracy, fast running speed on the HiSilicon embedded hardware platform and low cost, which can be used in intelligent security access control, automatic attendance machines, 3D face door locks and similar fields, producing substantial economic benefit.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic diagram of a binocular 3D face modeling and face stereo judgment algorithm for a region of interest according to the present invention.
Fig. 3 is a 3D modeling effect diagram of a binocular real human face of the present invention.
Fig. 4 is a 3D modeling effect diagram of the binocular paper face of the present invention.
Detailed Description
In order to further explain the technical scheme of the invention, the invention is explained in detail by specific examples.
As shown in fig. 1, the invention relates to a binocular 3D face modeling and stereo judging method of an embedded platform, which comprises the following steps:
Step one, calibrating the intrinsic and extrinsic parameters of the binocular camera. The invention calibrates with the OpenCV implementation of Zhang Zhengyou's method, using a checkerboard calibration board of 12 x 9 squares, each 30 mm. The calibration process is as follows: place the calibration board at a suitable position in front of the binocular lens, ensuring the whole board lies within both the left and right views and its pixel area exceeds 1/4 of the image pixel area; collect more than 10 picture pairs, i.e. calibration pairs (at one suitable distance shoot 5 pairs with the board facing straight on, tilted up, tilted down, tilted left and tilted right, then move the board to another suitable distance and shoot 5 more pairs in the same way); then perform the intrinsic and extrinsic calibration of the binocular camera on the collected calibration pairs and save the parameter data.
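The calibration step above can be sketched with OpenCV's standard stereo-calibration pipeline. This is a hedged outline, not the invention's actual code: the file-name patterns are hypothetical, and a 12 x 9-square board gives 11 x 8 inner corners for OpenCV's corner finder.

```python
import glob
import numpy as np

# Checkerboard from the description: 12 x 9 squares of 30 mm side.
PATTERN = (11, 8)   # inner corners
SQUARE_MM = 30.0

def board_object_points():
    """3D coordinates of the checkerboard corners in the board's own frame."""
    objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM
    return objp

def calibrate_stereo(left_glob="left_*.png", right_glob="right_*.png"):
    """Zhang-style stereo calibration over >= 10 captured pairs (paths hypothetical)."""
    import cv2  # imported lazily so the geometry helper above has no cv2 dependency
    obj_pts, left_pts, right_pts, size = [], [], [], None
    for lf, rf in zip(sorted(glob.glob(left_glob)), sorted(glob.glob(right_glob))):
        gl = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
        gr = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
        okl, cl = cv2.findChessboardCorners(gl, PATTERN)
        okr, cr = cv2.findChessboardCorners(gr, PATTERN)
        if okl and okr:
            obj_pts.append(board_object_points())
            left_pts.append(cl)
            right_pts.append(cr)
            size = gl.shape[::-1]
    # Per-camera intrinsics first, then the stereo extrinsics (R, T).
    _, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
    _, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
    _, K1, d1, K2, d2, R, T, _, _ = cv2.stereoCalibrate(
        obj_pts, left_pts, right_pts, K1, d1, K2, d2, size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    # Rectification transforms: rows of the remapped views share epipolar lines.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, size, R, T)
    return dict(K1=K1, d1=d1, K2=K2, d2=d2, R=R, T=T,
                R1=R1, R2=R2, P1=P1, P2=P2, Q=Q)
```

The saved parameters (intrinsics, distortion, R, T and the rectification matrices) are exactly what steps three and four consume.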
Step two, locating the face positions in the binocular left and right views. This embodiment detects faces with a simplified classical SFD (Single Shot Scale-Invariant Face Detector) detection algorithm, locating in each of the binocular left and right views the face position, which is the region of interest for face stereo modeling.
Step three, selecting the matching region of interest of the binocular views and performing distortion correction and baseline alignment. Because of installation tolerances, the optical-axis directions and optical-center positions of the two lenses are not fixed, which would make the disparity matching of the left and right views incorrect; the binocular left and right views therefore require distortion correction and baseline alignment. Taking a left-right binocular lens as an example, after distortion correction and baseline alignment the same world-coordinate object point lies on the same horizontal line of the left and right views, so disparity need only be searched along that line. This narrows the disparity search range and reduces the computation of disparity matching. The invention focuses on 3D reconstruction and stereo judgment of the face, so the algorithm performs distortion correction and baseline alignment only on the located face region of interest; images of irrelevant non-face regions need not be rectified, avoiding a great waste of computing resources. The way the invention processes the region of interest is detailed below: the face position located in step two is the position on the view before rectification; let the left-view face position be ori_left_face and the right-view face position be ori_right_face, with the face-frame position information comprising (x1, y1, x2, y2).
Disparity stereo matching must match rectified images, so ori_left_face and ori_right_face of the original views are mapped, using the undistortPoints function of opencv and the binocular camera parameters calibrated in step one, into the baseline-aligned image space, giving the left- and right-view face frames correct_left_face and correct_right_face. To ensure the complete face can be stereoscopically modeled, the union of the two rectified faces is selected as the region of interest ROI by the following formulas.
ROI.x1=min(correct_left_face.x1,correct_right_face.x1);
ROI.y1=min(correct_left_face.y1,correct_right_face.y1);
ROI.x2=max(correct_left_face.x2,correct_right_face.x2);
ROI.y2=max(correct_left_face.y2,correct_right_face.y2);
After the ROI of the rectified face position is selected, the ROIs of the left and right views of the binocular lens are baseline-aligned according to the baseline rectification matrix, yielding the left-view ROI image left_face_img and the right-view ROI image right_face_img, in which the same world-coordinate point lies on the same horizontal line of the two baseline-aligned images.
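The four union formulas above can be sketched directly; boxes are (x1, y1, x2, y2) tuples and the function name is illustrative:

```python
def roi_union(correct_left_face, correct_right_face):
    """Union of the two rectified face boxes, per the ROI.x1/y1/x2/y2 formulas above."""
    lx1, ly1, lx2, ly2 = correct_left_face
    rx1, ry1, rx2, ry2 = correct_right_face
    return (min(lx1, rx1), min(ly1, ry1), max(lx2, rx2), max(ly2, ry2))
```

Taking the union (rather than the intersection) guarantees that every face pixel visible in either rectified view falls inside the matched region.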
Step four, performing deep-learning-based binocular 3D face modeling and face stereo judgment on the region of interest. This is the key step of the invention; its workflow is shown in fig. 2 and comprises the following steps:
Step (1), inputting left and right view data to construct a cost volume: the network input size for disparity prediction is set to 256 x 256, and left_face_img and right_face_img obtained in step three are normalized to 256 x 256 before being fed to the network. Deep-learning disparity matching has an important concept called the cost volume, or matching cost; the disparity with the minimum matching cost is the predicted disparity of the current point. The conventional deep-learning disparity matching method PSMNet (Pyramid Stereo Matching Network) applies the cost-volume operation to the feature map obtained by downsampling the input picture with a feature-extraction CNN (convolutional neural network), but the invention performs the cost-volume operation directly on the raw picture data, constructing a tensor of 2D x 256 x 256 in the input space, where D is the maximum matchable disparity; D may be set to 16, 32, 64, 128, etc.
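One plausible reading of "cost volume on the raw data layer" is to pair the left crop with the right crop shifted by each candidate disparity, giving the 2D x 256 x 256 input tensor; the exact channel layout below is an assumption, since the patent does not fix it:

```python
import numpy as np

def build_cost_volume(left, right, max_disp):
    """Raw-data cost volume: (2 * max_disp, H, W) tensor.

    left, right: (H, W) rectified grayscale crops, already resized to the
    network input size (256 x 256 in the example).  For each disparity d,
    channel pair (2d, 2d+1) holds (left, right shifted right by d pixels),
    zero-padded where the shift leaves no data.
    """
    H, W = left.shape
    vol = np.zeros((2 * max_disp, H, W), dtype=np.float32)
    for d in range(max_disp):
        vol[2 * d] = left
        if d == 0:
            vol[1] = right
        else:
            vol[2 * d + 1, :, d:] = right[:, :-d]  # shift right view by d pixels
    return vol
```

Because this tensor is built before any convolution, a single decoding network (CNN1) can map it straight to the D-channel disparity feature, which is the speed advantage the patent claims over PSMNet's two-network design.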
Step (2), CNN1 disparity feature extraction: PSMNet's prediction is divided into two parts — the first CNN performs high-dimensional feature extraction, the intermediate operation constructs a cost volume from the high-dimensional features, and the second CNN decodes the disparity prediction result — so PSMNet requires two CNN networks to predict disparity. The CNN1 network structure of the present invention comprises two parts: the front part downsamples by 8× to extract high-dimensional semantic information, and the rear part upsamples by 8× to fuse the high-level semantic information with the low-level texture information. Because the cost-volume operation has been moved forward to the raw data layer, a single CNN1 network can decode and output a D×256×256 disparity feature.
Step (3), argmax disparity parsing: the argmax operation parses the disparity value from the D×256×256 tensor. Because the CNN1 disparity-feature extraction is trained with SoftmaxWithLoss cross-entropy loss, the argmax operation finds, at each of the 256×256 position points, the index of the largest value along the D channels; that channel index is the disparity predicted by the network. After argmax, the disparity prediction tensor becomes a 256×256 map, as in the binocular real-face disparity prediction result of fig. 3 and the binocular paper-face disparity prediction result of fig. 4; the disparity maps predicted in figs. 3 and 4 are more accurate than those of the traditional opencv matching algorithms BM (Block Matching) and SGBM (Semi-Global Block Matching).
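The argmax parsing amounts to a channel-wise argmax over the D disparity channels; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def parse_disparity(features):
    """features: D x H x W score tensor from CNN1 (post-softmax scores).

    Returns an H x W disparity map: at each pixel, the index of the
    winning channel along the D axis is the predicted disparity.
    """
    return np.argmax(features, axis=0)
```

This is exactly the operation that, in the porting step below, has to run on the CPU because the NNIE accelerator does not support argmax.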
(4) Modeling the 3D face map: the network disparity prediction is based on 256×256 images, so the real disparity at the original resolution must be rescaled by a factor scale_x = (ROI.x2 − ROI.x1)/256.0, i.e. the original-resolution disparity D_truth = D_predict × scale_x, where D_predict is the predicted disparity value and D_truth is the real disparity value. The Depth finally output by the 3D face modeling is Depth = B×F/D_truth, where B is the baseline distance between the binocular cameras, F is the coplanar pixel focal length after binocular baseline rectification, and Depth is the depth distance of the face.
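The rescaling and depth formulas of this step can be written as a small helper — a sketch only; the 256-pixel network width is the value used in this embodiment, and the example baseline/focal values are invented for illustration:

```python
def disparity_to_depth(d_predict, roi_x1, roi_x2, B, F, net_w=256.0):
    """Rescale a network disparity to original resolution, then convert to depth.

    scale_x = (ROI.x2 - ROI.x1) / net_w
    D_truth = D_predict * scale_x
    Depth   = B * F / D_truth   (B: baseline; F: rectified focal length, px)
    """
    scale_x = (roi_x2 - roi_x1) / net_w
    d_truth = d_predict * scale_x
    return B * F / d_truth

# Illustrative values: 512-px-wide ROI (scale_x = 2), 60 mm baseline,
# 640 px focal length, predicted disparity 32 -> D_truth = 64.
print(disparity_to_depth(32, 0, 512, 60.0, 640.0))  # -> 600.0 (mm)
```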
(5) CNN2 predicts whether the 3D face is stereoscopic: based on the 3D face map, a CNN2 deep-learning classification network is designed that extracts features directly from the 256×256 3D face map and then classifies, through a fully connected layer, whether stereo information is present. CNN1 and CNN2 are model parameters saved after training on actual device data; the training data are real left and right views captured by the actual binocular device, and the training labels comprise two parts: one part is an accurate disparity map aligned with the left-view face, and the other is a classification label indicating whether the input left and right face views contain depth information. The network is trained with caffe, CNN1 and CNN2 are trained jointly, and both the disparity-prediction and stereo-judgment loss functions use cross-entropy loss.
Step five, porting the novel deep-learning model for binocular 3D face modeling and face stereo judgment to the embedded HiSilicon platform. The four steps above introduced the complete binocular 3D face modeling and stereo judgment algorithm design flow; to generate practical social value, the algorithm must ultimately be deployed in a product. The hardware platform selected for this embodiment is the HiSilicon 3516. The porting comprises the following steps:
Step (1), quantizing and porting the SFD face detection model to the HiSilicon 3516 platform for detecting faces in the original left and right views;
Step (2), calculating the rectified face region-of-interest images with the opencv image processing library, according to the faces detected in step (1) and the binocular camera calibration parameters, and normalizing their size to 256×256;
Step (3), porting the 3D face modeling network CNN1: the CNN1 model quantized on the training data in step four is deployed to the NPU (embedded neural network processor) of the HiSilicon 3516, while the argmax disparity parsing and 3D depth modeling are coded on the HiSilicon ARM processor, outputting the 3D face information;
Step (4), porting the stereo-judgment network CNN2: the CNN2 model quantized on the training data in step four is deployed to the NPU (embedded neural network processor) of the HiSilicon 3516, outputting the stereo judgment result. The specific optimization process is as follows. The first stage of the original PSMNet, the feature-extraction network, costs 27.8 GFLOPs (floating-point operations, used to evaluate the computational complexity of a CNN); the second stage, which constructs the cost volume from the extracted features and parses the disparity, costs 54.99 GFLOPs. The present invention reduces the network input size in step four and moves the cost-volume operation forward to the data layer, converting PSMNet's two-stage network into a single-stage network and reducing the computation; the CNN1 and CNN2 network structures are further slimmed with model pruning and distillation, giving a final total of 2.27 GFLOPs for CNN1 and CNN2 — 2.7% of PSMNet's computation. During caffe (convolutional neural network framework) training, the binocular 3D face modeling disparity-matching network CNN1 and the stereo-judgment network CNN2 are trained jointly; because the HiSilicon NNIE does not support the argmax operation of disparity parsing, CNN1 and CNN2 must be quantized and ported separately during the HiSilicon porting process, with the argmax operation handled on the CPU. The ported result runs the whole algorithm in 100 ms with high-precision model quantization and 60 ms with low-precision model quantization — near real-time speed on the embedded hardware platform — whereas, according to the PSMNet paper, PSMNet takes 410 ms to predict disparity on a Titan X training server.
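The quoted computation-reduction ratio can be checked arithmetically against the figures above:

```python
# Sanity check of the computation figures quoted in the text.
psmnet_flops = 27.8 + 54.99   # GFLOPs: PSMNet feature extraction + cost-volume/decoding stages
ours_flops = 2.27             # GFLOPs: CNN1 + CNN2 combined after pruning and distillation
ratio = ours_flops / psmnet_flops * 100
print(f"{ratio:.1f}%")  # prints "2.7%", matching the text
```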
This embodiment can be used for face liveness detection on front-end devices in scenarios such as access-control gates, automatic attendance machines, and face door locks.
The invention designs a novel binocular 3D face modeling and stereo matching deep-learning architecture; compared with traditional image stereo matching algorithms, it produces a denser and more accurate 3D face map and can quickly judge whether a face carries stereo information. The algorithm has been ported to the HiSilicon 3516 embedded platform, where it runs at near real-time speed, imperceptible to the user. The 3D face modeling accuracy of the invention is close to that of structured-light or TOF devices, but at a lower cost. In conclusion, the invention provides a 3D liveness judgment scheme with high face modeling accuracy, fast running speed on the HiSilicon embedded hardware platform, and low cost; it can be used in fields such as intelligent security access control, automatic attendance machines, and 3D face door locks, generating large economic benefits.
The above examples and drawings are not intended to limit the form or scope of the present invention; any suitable variations or modifications thereof by those skilled in the art should be construed as not departing from the scope of the present invention.

Claims (7)

1. The binocular 3D face modeling and stereo judging method is characterized by comprising the following steps of:
step (1), inputting left and right view data to construct a cost volume: setting a network input size for disparity prediction, and normalizing the acquired left-view ROI image left_face_img and right-view ROI image right_face_img to the network input size before inputting them into the network, wherein the ROI is a region of interest; that is, the cost-volume operation is moved forward to the raw data layer, and a tensor of size 2×D×(network input size) is constructed in the input space, wherein D is the maximum disparity that can actually be matched and the value of D is 2^n (n≥4);
step (2), CNN1 disparity feature extraction: the CNN1 network structure comprises two parts, the front part downsampling by 8× to extract high-dimensional semantic information and the rear part upsampling by 8× to fuse the high-level semantic information with the low-level texture information; the CNN1 disparity-feature extraction is learned with SoftmaxWithLoss cross-entropy loss, and because the cost-volume operation is moved forward to the raw data layer in step (1), a single CNN1 network structure can decode and output a disparity feature of size D×(network input size);
step (3), argmax disparity parsing: the argmax operation parses the disparity value from the D×(network input size) tensor by finding, at each position point, the index of the largest value along the D channels; that channel index is the disparity value predicted by the network, and after argmax the disparity prediction tensor becomes a map of the network input size;
step (4), modeling the 3D face map: the network disparity prediction is based on images of the network input size, so the real disparity at the original resolution must be rescaled by a factor scale_x = (ROI.x2 − ROI.x1)/(width of the network input size), where ROI.x2 is the maximum X coordinate of the region of interest and ROI.x1 is the minimum X coordinate; that is, the original-resolution disparity D_truth = D_predict × scale_x, where D_predict is the predicted disparity value and D_truth is the real disparity value; the Depth finally output by the 3D face modeling is Depth = B×F/D_truth, where B is the baseline distance between the binocular cameras, F is the coplanar pixel focal length after binocular baseline rectification, and Depth is the depth distance of the face;
step (5), CNN2 predicts whether the 3D face is stereoscopic: based on the 3D face map, a CNN2 deep-learning classification network is designed that extracts features directly from the 3D face map at the network input size and then classifies, through a fully connected layer, whether stereo information is present.
2. The 3D face modeling and stereo judging method as defined in claim 1, wherein: CNN1 in step (2) and CNN2 in step (5) are model parameters saved after training on actual device data, the training data being real left and right views captured by the actual binocular device, and the training labels comprising two parts: one part is an accurate disparity map aligned with the left-view face, and the other is a classification label indicating whether the input left and right face views contain depth information; the network is trained with caffe, CNN1 and CNN2 are trained jointly, and both the disparity-prediction and stereo-judgment loss functions use cross-entropy loss.
3. The binocular 3D face modeling and stereo judging method of the embedded platform is characterized by comprising the following steps of:
step one, calibrating the intrinsic and extrinsic parameters of the binocular camera;
step two, positioning the positions of the face in the binocular left and right views;
step three, selecting a matching region of interest in the binocular views, and performing distortion correction and baseline alignment; the region of interest is processed as follows: the face positions located in step two are positions on the views before rectification — the left-view face position is ori_left_face and the right-view face position is ori_right_face, the face frame position information comprising (x1, y1, x2, y2); since disparity stereo matching must match rectified images, ori_left_face and ori_right_face of the original views are mapped into the baseline-aligned image space through the binocular camera parameters calibrated in step one, giving the rectified left-view and right-view face frames correct_left_face and correct_right_face, and the union of the two rectified faces is selected as the region of interest ROI through the following formulas:
ROI.x1=min(correct_left_face.x1,correct_right_face.x1);
ROI.y1=min(correct_left_face.y1,correct_right_face.y1);
ROI.x2=max(correct_left_face.x2,correct_right_face.x2);
ROI.y2=max(correct_left_face.y2,correct_right_face.y2);
after the ROI of the rectified face position is selected, the ROIs of the left and right views of the binocular camera are baseline-aligned according to the baseline rectification matrix, yielding the left-view ROI image left_face_img and the right-view ROI image right_face_img; after baseline alignment, any world-coordinate point projects onto the same horizontal line in both views;
step four, binocular 3D face modeling and face stereo judgment based on a deep learning method are carried out on the region of interest, and the method comprises the following steps:
step (1), inputting left and right view data to construct a cost volume: setting the network input size for disparity prediction, normalizing the left_face_img and right_face_img obtained in step three to the network input size before inputting them into the network, and constructing a tensor of size 2×D×(network input size) in the input space, wherein D is the maximum disparity that can actually be matched and the value of D is 2^n (n≥4);
step (2), CNN1 disparity feature extraction: the CNN1 network structure comprises two parts, the front part downsampling by 8× to extract high-dimensional semantic information and the rear part upsampling by 8× to fuse the high-level semantic information with the low-level texture information; the CNN1 disparity-feature extraction is learned with SoftmaxWithLoss cross-entropy loss, and because the cost-volume operation is moved forward to the raw data layer in step (1), a single CNN1 network structure can decode and output a disparity feature of size D×(network input size);
step (3), argmax disparity parsing: the argmax operation parses the disparity value from the D×(network input size) tensor; because the CNN1 disparity-feature extraction is learned with SoftmaxWithLoss cross-entropy loss, the argmax operation finds, at each position point, the index of the largest value along the D channels; that channel index is the disparity value predicted by the network, and after argmax the disparity prediction tensor becomes a map of the network input size;
step (4), modeling the 3D face map: the network disparity prediction is based on images of the network input size, so the real disparity at the original resolution is obtained by multiplying by a scale factor scale_x = (ROI.x2 − ROI.x1)/(width of the network input size), i.e. the original-resolution disparity D_truth = D_predict × scale_x, where D_predict is the predicted disparity value and D_truth is the real disparity value; the Depth finally output by the 3D face modeling is Depth = B×F/D_truth, where B is the baseline distance between the binocular cameras, F is the coplanar pixel focal length after binocular baseline rectification, and Depth is the depth distance of the face;
step (5), CNN2 predicts whether the 3D face is stereoscopic: based on the 3D face map, a CNN2 deep-learning classification network is designed that extracts features directly from the 3D face map at the network input size and then classifies, through a fully connected layer, whether stereo information is present;
step five, porting the deep-learning model for binocular 3D face modeling and face stereo judgment of step four onto an embedded platform, the porting comprising the following steps:
step (1), quantizing and porting the SFD face detection model to the embedded platform for detecting faces in the original left and right views, i.e. implementing step two on the embedded platform; step (2), calculating the rectified face region-of-interest images with the opencv image processing library according to the faces detected in step (1) and the binocular camera calibration parameters, and normalizing their size to the network input size, i.e. implementing step three on the embedded platform;
step (3), porting the 3D face modeling network CNN1: the CNN1 model quantized on the training data in step four is deployed to the NPU of the embedded platform, while the argmax disparity parsing and 3D depth modeling are coded on the ARM processor, outputting the 3D face information;
step (4), porting the stereo-judgment network CNN2: the CNN2 model quantized on the training data in step four is deployed to the NPU of the embedded platform, outputting the stereo judgment result.
4. The method for binocular 3D face modeling and stereo judging of an embedded platform according to claim 3, wherein the calibration process of step one is as follows: place the calibration plate at a suitable position in front of the binocular camera, ensuring that the entire calibration plate lies within both the left and right views and that its pixel area exceeds 1/4 of the image area; collect more than 10 picture pairs (calibration pairs) with the plate in the upper, lower, left, right, and rotated positions; then perform binocular camera intrinsic and extrinsic calibration on the collected calibration pairs and save the parameter data.
5. The method for binocular 3D face modeling and stereo judging of an embedded platform according to claim 3, wherein step two detects faces using a simplified classical SFD detection algorithm, locating the face positions in the binocular left and right views respectively; the face region is the region of interest in face stereo modeling.
6. The method for binocular 3D face modeling and stereo judging of an embedded platform according to claim 3, wherein in step (4) of step four, CNN1 and CNN2 are model parameters saved after training on actual device data, the training data being real left and right views captured by the actual binocular device, and the training labels comprising two parts: one part is an accurate disparity map aligned with the left-view face, and the other is a classification label indicating whether the input left and right face views contain depth information; the network is trained with caffe, CNN1 and CNN2 are trained jointly, and both the disparity-prediction and stereo-judgment loss functions use cross-entropy loss.
7. The method for binocular 3D face modeling and stereo judging of an embedded platform according to claim 3, wherein the embedded platform is a HiSilicon 3516 platform.
CN202011334611.1A 2020-11-25 2020-11-25 3D face modeling and stereo judging method and binocular 3D face modeling and stereo judging method of embedded platform Active CN112365586B (en)

Publications (2)

Publication Number Publication Date
CN112365586A 2021-02-12
CN112365586B 2023-07-18


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102474644A (en) * 2010-06-07 2012-05-23 索尼公司 Three-dimensional image display system, disparity conversion device, disparity conversion method, and program
CN107403168A (en) * 2017-08-07 2017-11-28 青岛有锁智能科技有限公司 A kind of facial-recognition security systems
CN110287964A (en) * 2019-06-13 2019-09-27 浙江大华技术股份有限公司 A kind of solid matching method and device
WO2019192290A1 (en) * 2018-04-04 2019-10-10 腾讯科技(深圳)有限公司 Method for determining depth information and related device
CN111402129A (en) * 2020-02-21 2020-07-10 西安交通大学 Binocular stereo matching method based on joint up-sampling convolutional neural network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant