CN111160291B - Human eye detection method based on depth information and CNN


Info

Publication number
CN111160291B
CN111160291B (application CN201911416013.6A)
Authority
CN
China
Prior art keywords
depth
face
image
cnn
axis
Prior art date
Legal status
Active
Application number
CN201911416013.6A
Other languages
Chinese (zh)
Other versions
CN111160291A (en)
Inventor
朱志林
张伟香
王禹衡
方勇
Current Assignee
Shanghai Evis Technology Co ltd
Original Assignee
Shanghai Evis Technology Co ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2023-10-31
Application filed by Shanghai Evis Technology Co., Ltd.
Priority to CN201911416013.6A
Publication of CN111160291A (2020-05-15)
Application granted
Publication of CN111160291B (2023-10-31)
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/193 Preprocessing; Feature extraction
    • G06V40/197 Matching; Classification


Abstract

The invention discloses a human eye detection method based on depth information and a CNN, comprising the following steps: S1, inputting an image and its depth map; S2, preprocessing the depth map according to a detection distance range and removing the background outside that range; S3, performing depth-histogram segmentation on the preprocessed depth map to obtain target candidate regions; S4, verifying the candidate regions by matching against head-shoulder templates to determine the face candidate regions; S5, comparing the overlapping areas of the face candidate boxes and merging candidate boxes that meet a set threshold; and S6, performing face-box regression and key-point regression with a trained CNN model on the image region corresponding to each face candidate region to obtain the eye positions. The method improves detection accuracy, reduces computational complexity, raises detection efficiency, guarantees real-time and accurate detection, and meets the requirements of naked-eye 3D displays.

Description

Human eye detection method based on depth information and CNN
Technical Field
The invention belongs to the technical field of human eye detection, and in particular relates to a human eye detection method based on depth information and CNN.
Background
With the growing maturity of naked-eye 3D display technology, accurately detecting the viewer's eye positions in real time and delivering the best 3D viewing effect for those positions has become an important development direction for naked-eye 3D displays. At present, human eye detection is mainly realized by training a CNN model that classifies face regions and then regresses the face box and eye positions, which achieves accurate eye detection.
Meanwhile, with the development of depth cameras, scene depth information can be acquired directly, providing depth features that are not available in an ordinary 2D image. These depth features can accurately segment the objects in a scene and give their depth ranges, greatly reducing the computational complexity of extracting face candidate regions for the CNN.
In view of this, the invention designs a method that combines depth information with a CNN to realize human eye detection meeting the real-time and accuracy requirements of naked-eye 3D displays.
Disclosure of Invention
The invention provides a human eye detection method based on depth information and CNN, which can improve detection accuracy, reduce computational complexity and improve detection efficiency.
To solve the above technical problems, according to one aspect of the present invention, the following technical solution is adopted:
A human eye detection method based on depth information and CNN, the method comprising:
S1, inputting an image and its depth map;
S2, preprocessing the depth map according to a detection distance range, removing the background outside that range;
S3, performing depth-histogram segmentation on the preprocessed depth map to obtain target candidate regions;
S4, verifying the candidate regions by matching against head-shoulder templates, and determining the face candidate regions;
S5, comparing the overlapping areas of the face candidate boxes, and merging candidate boxes that meet a set threshold;
and S6, performing face-box regression and key-point regression with the trained CNN model on the image region corresponding to each face candidate region, to obtain the eye positions.
In step S2, the pixels whose depth values lie within the detection range are extracted from the depth map and converted into a mask in which those pixels are set to 255 and all others to 0; the mask is multiplied point-wise with the depth map to remove the background pixels outside the range.
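A minimal sketch of this preprocessing, assuming a 16-bit depth map in millimetres and an illustrative detection range of 300-1500 mm (both assumptions for the example, not values taken from the patent):

    import numpy as np

    def remove_background(depth_mm, near=300, far=1500):
        # Pixels inside the detection range become 255 in the mask, all others 0.
        mask = np.where((depth_mm >= near) & (depth_mm <= far), 255, 0).astype(np.uint8)
        # Point-wise multiplication with the binarised mask zeroes the background
        # while leaving in-range depth values untouched.
        return depth_mm * (mask > 0), mask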
In step S3, the depth values of the masked depth map are normalized to 0-255; the map is transformed from the xy plane of the screen coordinate system to the xz plane, the transformed data are projected onto the x and z axes respectively, and the object ranges are segmented. The segmented ranges correspond to an x-axis region and a z-axis region, and the template image at the matching scale for each region is obtained from the middle value of the corresponding depth range.
In step S3, the differences between the depth values of different objects are fully exploited. First, the xy screen coordinate system is converted to the xz plane and the objects in the scene are projected vertically onto this two-dimensional xz plane. Then, from the vertical projection of the object map onto the x axis, the peaks and troughs of the projection are located; each peak position is taken to indicate the presence of an object, the troughs before and after each peak serve as segmentation thresholds, and the objects in the scene are segmented along the x axis.
Each segmented x-axis region is then projected vertically onto the z axis and the peaks and troughs of that projection are located; each peak is taken as an object's depth, and the troughs before and after the peak serve as segmentation thresholds. Combined with the x-axis segmentation, this yields a segmentation range for each object, so every object in the scene is separated.
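One plausible reading of this peak/trough segmentation, sketched with scipy.signal.find_peaks (the prominence value and the histogram-based z projection are assumptions for the example):

    import numpy as np
    from scipy.signal import find_peaks

    def segment_x_axis(depth_u8, prominence=50):
        # Vertical projection onto the x axis: count foreground pixels per column.
        proj = (depth_u8 > 0).sum(axis=0).astype(float)
        peaks, _ = find_peaks(proj, prominence=prominence)   # one peak per object
        troughs, _ = find_peaks(-proj)                       # troughs bound the objects
        segments = []
        for p in peaks:
            left = max([t for t in troughs if t < p], default=0)
            right = min([t for t in troughs if t > p], default=len(proj) - 1)
            segments.append((left, right))                   # x range of one object
        return segments

    def depth_peaks(depth_u8, x_range, prominence=50):
        # Within one x segment, a histogram over the 0-255 depth values plays the
        # role of the z-axis projection; its peaks give candidate object depths.
        patch = depth_u8[:, x_range[0]:x_range[1]]
        hist = np.bincount(patch[patch > 0].ravel(), minlength=256).astype(float)
        peaks, _ = find_peaks(hist, prominence=prominence)
        return peaks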
As an embodiment of the invention, step S3 matches the template scale adaptively to the depth value, avoiding the complexity of conventional multi-scale detection. For the depth range obtained for each object, the middle value of the range is taken as the object's depth, and the head-shoulder template image matching that depth is selected; this achieves the best template-scale matching accuracy while avoiding simultaneous detection over multiple template scales.
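A sketch of this adaptive scale selection, assuming a hypothetical dictionary named templates that maps a nominal depth in millimetres to a pre-scaled head-shoulder template (the dictionary and its keys are illustrative):

    def select_template(depth_range, templates):
        # Take the middle value of the object's depth range as its depth, then
        # pick the template whose nominal depth is closest to that value.
        mid_depth = (depth_range[0] + depth_range[1]) / 2.0
        nearest = min(templates, key=lambda d: abs(d - mid_depth))
        return templates[nearest]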
In step S4, the depth map is divided into several detection parts and the depth regions are detected in parallel. Using template matching, the head-shoulder template image corresponding to the current depth value is compared for similarity against the input depth map through a sliding window with stride 1, and the resulting values are stored in a result map.
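With OpenCV, this stride-1 sliding-window similarity search corresponds directly to cv2.matchTemplate; the normalised cross-correlation metric below is an assumption, since the patent does not name the similarity measure:

    import cv2

    def head_shoulder_result_map(depth_u8, template_u8):
        # Slide the template over the depth map with stride 1 and store a
        # similarity score at every position: the "result map" of step S4.
        return cv2.matchTemplate(depth_u8, template_u8, cv2.TM_CCOEFF_NORMED)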
In step S5, non-maximum suppression is used to merge the candidate boxes in the result map; the resulting boxes are cropped to the head according to the head-to-shoulder ratio, and the cropped positions in the depth map are mapped onto the image as the input images of the CNN model.
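A standard greedy non-maximum suppression sketch; the 0.5 IOU threshold is illustrative, as the patent only speaks of a set threshold:

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.5):
        # boxes: (N, 4) array of [x0, y0, x1, y1]; scores: (N,) similarity values.
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # Intersection of the current best box with every remaining box.
            xx0 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            yy0 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            xx1 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            yy1 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.clip(xx1 - xx0, 0, None) * np.clip(yy1 - yy0, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
            iou = inter / (area_i + areas - inter)
            # Drop the boxes that overlap the kept box beyond the threshold.
            order = order[1:][iou <= iou_thresh]
        return keep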
In one embodiment of the present invention, in step S6, the face region segmented from the image is resized to the same scale as the training images and used as the input image of the CNN model.
In step S6, a model with four convolution layers and two fully connected layers performs the face/non-face binary classification together with the face-box and key-point regression; the binary classification is trained with a softmax loss function, and the face-box and face key-point regression with a minimum mean-square-error loss. The trained model is imported into the network structure, and the face images separated with the help of the depth information are fed in to obtain the refined face-box position and face key points.
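A PyTorch sketch of a network consistent with this description: four 3x3 convolution layers, each followed by a PReLU, then two fully connected layers emitting 2 + 4 + 2 x PointNum values. The channel widths, pooling, input size and PointNum = 5 are assumptions; the patent fixes only the kernel size, the layer counts, the 3x3x128 feature map and the loss functions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EyeNet(nn.Module):
        def __init__(self, point_num=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3), nn.PReLU(32), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3), nn.PReLU(64), nn.MaxPool2d(2),
                nn.Conv2d(64, 64, 3), nn.PReLU(64), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3), nn.PReLU(128),
                nn.AdaptiveMaxPool2d(3),  # force the 3x3x128 feature map
            )
            self.fc1 = nn.Linear(128 * 3 * 3, 256)
            self.fc2 = nn.Linear(256, 2 + 4 + 2 * point_num)

        def forward(self, x):
            h = self.features(x).flatten(1)
            h = F.relu(self.fc1(h))
            out = self.fc2(h)
            # 2 face/non-face logits, 4 box coordinates, 2*point_num landmark coords.
            return out[:, :2], out[:, 2:6], out[:, 6:]

    def detection_loss(cls_logits, box_pred, pts_pred, cls_gt, box_gt, pts_gt):
        # Softmax (cross-entropy) loss for the binary face classification,
        # minimum mean-square error for the box and key-point regression.
        return (F.cross_entropy(cls_logits, cls_gt)
                + F.mse_loss(box_pred, box_gt)
                + F.mse_loss(pts_pred, pts_gt))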
The invention has the following beneficial effects: the human eye detection method based on depth information and CNN improves detection accuracy, reduces computational complexity, improves detection efficiency, guarantees real-time and accurate detection, and meets the requirements of naked-eye 3D displays.
The method first uses depth information to coarsely extract the face region, and then refines the face and eye positions with a CNN. Exploiting the horizontal and depth separation between objects in the scene, it segments the objects using depth information and performs head-shoulder template matching on each object, achieving fast face localization; this greatly reduces the complexity of face detection while preserving its accuracy, so the algorithm is fast. On the segmented face region, the invention applies a network of 4 convolution layers and 2 fully connected layers to regress the face-box edges and the positions of the key (landmark) points, and accurate eye localization is achieved with pre-trained model parameters.
Drawings
Fig. 1 is a flowchart of a human eye detection method according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
For a further understanding of the present invention, preferred embodiments are described below in conjunction with the examples; it should be understood that these descriptions merely illustrate further features and advantages of the invention and do not limit its claims.
The description in this section covers only a few exemplary embodiments, and the scope of the invention is not limited by them. Interchanging some technical features of the embodiments with equivalent or similar prior-art features also falls within the scope of the description and claims of the invention.
The invention discloses a human eye detection method based on depth information and CNN; FIG. 1 is a flowchart of the method in an embodiment of the invention. Referring to FIG. 1, in an embodiment of the present invention, the method comprises:
s1, obtaining an image and a corresponding depth map through a depth camera;
s2, extracting pixel points of a detection range from the depth values in the depth map, converting the pixel points into a mask with the pixel values set to 255 and the rest of 0, multiplying the mask with the depth map points, and removing pixels outside the range.
S3, converting the depth value of the depth map after mask into 0-255, mapping the depth map from the xy axis of a screen coordinate system to an xz axis plane, performing x-axis projection on the mapped image, finding out the positions of wave crests and wave troughs, and taking the front and rear wave trough values of the wave crest positions as a dividing region of the x axis. And then, respectively carrying out z-axis projection on the segmented areas, finding out the projected wave crests and wave troughs, and taking the wave troughs before and after the wave crests as an area of the z-axis. The divided ranges correspond to the x-axis region range and the z-axis region range, and template images of corresponding scales of region template matching are obtained according to the intermediate values of the corresponding depth ranges.
S4, dividing the depth map into several detection parts, detecting the depth regions in parallel, and, using template matching, comparing the head-shoulder template image corresponding to the current depth value for similarity against the input depth map through a sliding window with stride 1, storing the resulting values in a result map.
S5, comparing the overlapping areas of the face candidate boxes, and merging candidate boxes that meet a set threshold.
In an embodiment of the present invention, every pixel value of the result map is traversed against a set threshold; positions meeting the threshold condition are kept as candidate boxes, non-maximum suppression is applied to these candidates, and candidate boxes whose IOU (intersection-over-union) meets the set threshold are merged.
S6, mapping the merged candidate boxes onto the image, cropping the head region as the input image, and performing face-box regression and key-point regression in the trained CNN model to obtain the eye positions.
In one embodiment of the present invention, in step S6, the face-box edge and face key-point regression comprises the following steps:
Step S61: the input face region is resized to the training scale and fed into 4 convolution layers with 3x3 kernels, a PReLU activation layer following each convolution layer; the trained model weights are imported, and a 3x3x128 feature map is extracted.
Step S62: from the feature map, 2 fully connected layers output a vector of size 2 + 4 + 2 x PointNum, where 2 encodes the face/non-face classification, 4 encodes the face-box position, and PointNum is the number of face key points.
Step S63: the eye positions are extracted from the obtained face key points, and the eye positions on the face are mapped back to the source image, realizing detection of the eye positions.
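A sketch of the mapping back to the source image in step S63, assuming the network outputs landmark coordinates normalised to the face crop and that the first two landmarks are the eye centres (the landmark ordering and normalisation are assumptions):

    def eyes_to_source(landmarks, face_box):
        # landmarks: flat sequence [x1, y1, x2, y2, ...] normalised to [0, 1]
        # within the face crop; face_box: (x0, y0, x1, y1) in source coordinates.
        x0, y0, x1, y1 = face_box
        w, h = x1 - x0, y1 - y0
        left_eye = (x0 + landmarks[0] * w, y0 + landmarks[1] * h)
        right_eye = (x0 + landmarks[2] * w, y0 + landmarks[3] * h)
        return left_eye, right_eye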
In summary, the human eye detection method based on depth information and CNN provided by the invention guarantees real-time and accurate detection and meets the requirements of naked-eye 3D displays.
The method first uses depth information to coarsely extract the face region, and then refines the face and eye positions with a CNN. Exploiting the horizontal and depth separation between objects in the scene, it segments the objects using depth information and performs head-shoulder template matching on each object, achieving fast face localization; this greatly reduces the complexity of face detection while preserving its accuracy, so the algorithm is fast. On the segmented face region, the invention applies a network of 4 convolution layers and 2 fully connected layers to regress the face-box edges and key-point positions, and accurate eye localization is achieved with pre-trained model parameters.
The technical features of the above embodiments may be combined arbitrarily; for brevity, not every possible combination is described, but any combination of these technical features that involves no contradiction should be considered within the scope of this description.
The description and applications of the present invention herein are illustrative and are not intended to limit the scope of the invention to the embodiments described above. Variations and modifications of the embodiments disclosed herein are possible, and alternatives and equivalents of the various components of the embodiments are known to those of ordinary skill in the art. It will be clear to those skilled in the art that the present invention may be embodied in other forms, structures, arrangements, proportions, and with other assemblies, materials, and components, without departing from the spirit or essential characteristics thereof. Other variations and modifications of the embodiments disclosed herein may be made without departing from the scope and spirit of the invention.

Claims (6)

1. A human eye detection method based on depth information and CNN, characterized by comprising the following steps:
S1, inputting an image and its depth map;
S2, preprocessing the depth map according to a detection distance range, removing the background outside that range;
S3, performing depth-histogram segmentation on the preprocessed depth map to obtain target candidate regions;
S4, verifying the candidate regions by matching against head-shoulder templates, and determining the face candidate regions;
S5, comparing the overlapping areas of the face candidate boxes, and merging candidate boxes that meet a set threshold;
S6, performing face-box regression and key-point regression with the trained CNN model on the image region corresponding to each face candidate region, to obtain the eye positions;
wherein in step S2, the pixels whose depth values lie within the detection range are extracted from the depth map and converted into a mask in which those pixels are set to 255 and all others to 0; the mask is multiplied point-wise with the depth map, removing the background pixels outside the range;
in step S3, the depth values of the masked depth map are normalized to 0-255; the map is transformed from the xy plane of the screen coordinate system to the xz plane, the transformed data are projected onto the x and z axes respectively, and the object ranges are segmented; the segmented ranges correspond to an x-axis region and a z-axis region, and the template image at the matching scale for each region is obtained from the middle value of the corresponding depth range;
in step S3, the differences between the depth values of different objects are fully exploited: first, the xy screen coordinate system is converted to the xz plane and the objects in the scene are projected vertically onto the two-dimensional xz plane; then, from the vertical projection of the object map onto the x axis, the peaks and troughs of the projection are located, each peak position is taken to indicate the presence of an object, the troughs before and after each peak serve as segmentation thresholds, and the objects in the scene are segmented along the x axis;
each segmented x-axis region is projected vertically onto the z axis, the peaks and troughs of that projection are located, each peak is taken as an object's depth, and the troughs before and after the peak serve as segmentation thresholds; combined with the x-axis segmentation, this forms a segmentation range for each object, so every object in the scene is separated.
2. The human eye detection method based on depth information and CNN according to claim 1, wherein:
in step S4, the template scale is matched adaptively to the depth value; for the depth range obtained for each object, the middle value of the range is taken as the object's depth, and the head-shoulder template image matching that depth value is selected.
3. The human eye detection method based on depth information and CNN according to claim 1, wherein:
in step S4, the depth map is divided into several detection parts and the depth regions are detected in parallel; using template matching, the head-shoulder template image corresponding to the current depth value is compared for similarity against the input depth map through a sliding window with stride 1, and the resulting values are stored in a result map.
4. The human eye detection method based on depth information and CNN according to claim 1, wherein:
in step S5, non-maximum suppression is used to merge the candidate boxes in the result map; the resulting boxes are cropped to the head according to the head-to-shoulder ratio, and the cropped positions in the depth map are mapped onto the image as the input image of the CNN model.
5. The human eye detection method based on depth information and CNN according to claim 1, wherein:
in step S6, the face region segmented from the image is resized to the same scale as the training images and used as the input image of the CNN model.
6. The human eye detection method based on depth information and CNN according to claim 1, wherein:
in step S6, a model with four convolution layers and two fully connected layers performs the face/non-face binary classification together with the face-box and key-point regression; the binary classification is trained with a softmax loss function, and the face-box and face key-point regression with a minimum mean-square-error loss; the trained model is imported into the network structure, and the face images separated with the help of the depth information are fed in to obtain the refined face-box position and face key points.
CN201911416013.6A (priority date 2019-12-31, filed 2019-12-31) Human eye detection method based on depth information and CNN. Status: Active. Granted as CN111160291B.

Priority Applications (1)

Application Number: CN201911416013.6A; Priority Date: 2019-12-31; Filing Date: 2019-12-31; Title: Human eye detection method based on depth information and CNN

Publications (2)

Publication Number Publication Date
CN111160291A (en) 2020-05-15
CN111160291B (en) 2023-10-31

Family

ID=70560194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911416013.6A Human eye detection method based on depth information and CNN 2019-12-31 2019-12-31 (Active; granted as CN111160291B)

Country Status (1)

Country Link
CN (1) CN111160291B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036284B (en) * 2020-08-25 2024-04-19 Tencent Technology (Shenzhen) Co., Ltd. Image processing method, device, equipment and storage medium
CN112365547B (en) * 2020-11-06 2023-08-22 Shanghai Evis Technology Co., Ltd. Camera calibration method and system based on multi-depth grating viewpoints
CN112346258B (en) * 2020-11-06 2022-09-13 Shanghai Evis Technology Co., Ltd. Grating viewing-zone calibration method and system based on square-wave fitting
CN112257674B (en) * 2020-11-17 2022-05-27 Zhuhai Da Hengqin Technology Development Co., Ltd. Visual data processing method and device
CN112329752B (en) * 2021-01-06 2021-04-06 Tencent Technology (Shenzhen) Co., Ltd. Training method of human eye image processing model, image processing method and device
CN113257392B (en) * 2021-04-20 2024-04-16 Harbin Xiaoxin Technology Co., Ltd. Automatic preprocessing method for universal external data of ultrasound machines


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8320621B2 (en) * 2009-12-21 2012-11-27 Microsoft Corporation Depth projector system with integrated VCSEL array

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737235A (en) * 2012-06-28 2012-10-17 Institute of Automation, Chinese Academy of Sciences Head posture estimation method based on depth information and color image
JP2015106252A (en) * 2013-11-29 2015-06-08 Sharp Corporation Face direction detection device and three-dimensional measurement device
CN108256391A (en) * 2016-12-29 2018-07-06 Guangzhou Yingbo Intelligent Technology Co., Ltd. Pupil region localization method based on projection integrals and edge detection
CN108174182A (en) * 2017-12-30 2018-06-15 Shanghai Evis Technology Co., Ltd. Viewing-zone adjustment method and display system for three-dimensional tracking naked-eye stereoscopic display
CN108615016A (en) * 2018-04-28 2018-10-02 Beijing HJIMI Technology Co., Ltd. Face key point detection method and face key point detection device
CN109034051A (en) * 2018-07-24 2018-12-18 Harbin University of Science and Technology Human eye positioning method
CN109725721A (en) * 2018-12-29 2019-05-07 Shanghai Evis Technology Co., Ltd. Human eye positioning method and system for naked-eye 3D display systems
CN109961006A (en) * 2019-01-30 2019-07-02 Donghua University Low-resolution multi-target face detection, key point localization and alignment method
CN109993086A (en) * 2019-03-21 2019-07-09 Beijing HJIMI Technology Co., Ltd. Face detection method, device, system and terminal device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Liu Lintao. "Parallel algorithm and implementation of real-time human eye detection and tracking for naked-eye 3D display." Master's thesis, China Master's Theses Electronic Journal, 2018, full text. *
Li Yanxiao. "Research on face detection and eye localization algorithms." Master's thesis, China Master's Theses Electronic Journal, 2018, full text. *
Wang Wei, Zhang Yousheng, Fang Fang. "A survey of face detection and recognition techniques." Journal of Hefei University of Technology, 2006, pp. 158-162. *


Similar Documents

Publication Publication Date Title
CN111160291B (en) Human eye detection method based on depth information and CNN
Sun et al. Disp r-cnn: Stereo 3d object detection via shape prior guided instance disparity estimation
US10198623B2 (en) Three-dimensional facial recognition method and system
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
US9305206B2 (en) Method for enhancing depth maps
CN103325112B (en) Moving target method for quick in dynamic scene
Kamencay et al. Improved Depth Map Estimation from Stereo Images Based on Hybrid Method.
CN110175504A (en) A kind of target detection and alignment schemes based on multitask concatenated convolutional network
CN104517095B (en) A kind of number of people dividing method based on depth image
WO2019071976A1 (en) Panoramic image saliency detection method based on regional growth and eye movement model
CN102982334B (en) The sparse disparities acquisition methods of based target edge feature and grey similarity
CN107944437B (en) A kind of Face detection method based on neural network and integral image
CN112101208A (en) Feature series fusion gesture recognition method and device for elderly people
Lu et al. An improved graph cut algorithm in stereo matching
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
CN111161219B (en) Robust monocular vision SLAM method suitable for shadow environment
CN110910497B (en) Method and system for realizing augmented reality map
CN111626241A (en) Face detection method and device
CN111160292B (en) Human eye detection method
Wietrzykowski et al. Stereo plane R-CNN: Accurate scene geometry reconstruction using planar segments and camera-agnostic representation
WO2019041447A1 (en) 3d video frame feature point extraction method and system
Gan et al. A dynamic detection method to improve SLAM performance
CN113723432B (en) Intelligent identification and positioning tracking method and system based on deep learning
Arunkumar et al. Estimation of vehicle distance based on feature points using monocular vision
CN116704587B (en) Multi-person head pose estimation method and system integrating texture information and depth information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant