CN113095274A - Sight estimation method, system, device and storage medium - Google Patents

Sight estimation method, system, device and storage medium

Info

Publication number
CN113095274A
CN113095274A (application CN202110450755.1A; granted as CN113095274B)
Authority
CN
China
Prior art keywords
vector
eye image
human eye
estimation
image
Prior art date
Legal status
Granted
Application number
CN202110450755.1A
Other languages
Chinese (zh)
Other versions
CN113095274B (en)
Inventor
梁姗姗
张航
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202110450755.1A
Publication of CN113095274A
Application granted
Publication of CN113095274B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a gaze (sight line) estimation method, system, device and storage medium. The method comprises the following steps: obtaining a face image and performing key point detection and 3D model fitting to obtain a human eye image and a 3D head rotation vector; performing data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head pose estimation vector; and inputting the regularized human eye image and the head pose estimation vector into a pre-trained CNN network, then converting the network output into a 3D gaze direction vector. The system comprises an image preprocessing module, a data regularization module and a result output module. The device comprises a memory and a processor for performing the gaze estimation method described above. With the invention, a high-accuracy gaze estimation result can be obtained. The method, system, device and storage medium can be widely applied in the field of gaze estimation.

Description

Sight estimation method, system, device and storage medium
Technical Field
The present invention relates to the field of gaze estimation, and in particular, to a gaze estimation method, system, apparatus, and storage medium.
Background
Gaze estimation technology studies how to accurately track the direction of human gaze and visual attention. It has wide application scenarios and great practical value: it can be applied in cognitive science, psychology, medical research, automobile driving, entertainment, advertising and marketing research, among other fields, bringing convenience to people's lives and raising the overall technological level of society. With the continuous improvement of optical imaging technology and image processing capability, and in particular the development of computer vision, image-based gaze estimation methods have begun to dominate.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a gaze estimation method, system, device and storage medium that achieve high accuracy, require no calibration, and are simple to operate.
The first technical solution adopted by the invention is a gaze estimation method, comprising the following steps:
obtaining a human face image, and performing key point detection and 3D model fitting to obtain a human eye image and a 3D head rotation vector;
performing data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head pose estimation vector;
inputting the regularized human eye image and the head pose estimation vector into a pre-trained CNN network, and converting the network output into a 3D gaze direction vector.
Further, the step of obtaining a human face image and performing key point detection and 3D model fitting to obtain a human eye image and a 3D head rotation vector specifically includes:
acquiring a complete face image;
performing 2D face alignment based on dlib face detection and 68-point face key point detection to obtain the two-dimensional coordinates of the face key points in the image;
acquiring an eye image according to the eye key point positions in the two-dimensional coordinates of the face key points;
acquiring a 3D face key point model;
fitting the two-dimensional coordinates of the face key points to the 3D key point model based on the EPnP algorithm to obtain a 3D head rotation vector.
Further, before regularizing the human eye image, the method further comprises a step of blink detection and screening of the human eye image, which specifically includes:
obtaining the horizontal line and the vertical lines passing through each eye according to the left-eye and right-eye key point information in the human eye image;
calculating the ratio of the vertical line length to the corresponding horizontal line length;
if the ratio is greater than a preset threshold, determining that the human eye image is in an eye-open state and performing gaze estimation;
if the ratio is smaller than the preset threshold, determining that the human eye image is in an eye-closed state and not performing gaze estimation.
Further, the formula for data regularization is as follows:
M = S * R
where R denotes the inverse of the camera rotation matrix and S denotes the scaling matrix.
Further, the step of performing data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head pose estimation vector specifically includes:
processing the human eye image and the 3D head rotation vector based on the transformation matrix;
rotating the camera coordinate system by the rotation matrix R;
scaling the camera coordinate system by the scaling matrix S;
finally obtaining the regularized human eye image and the head pose estimation vector through perspective transformation.
Further, the training step of the pre-trained CNN network specifically includes:
acquiring human eye images with ground-truth gaze angle labels and head pose estimation vectors, and inputting them into the CNN network to obtain the network output;
calculating the error between the network output and the ground-truth gaze angle labels using a mean-squared-error loss function to obtain an error result;
adjusting the network parameters according to the error result to obtain a trained gaze estimation model.
Further, the step of inputting the regularized human eye image and the head pose estimation vector into a pre-trained CNN network and converting the network output into a 3D gaze direction vector specifically includes:
inputting the regularized human eye image and the head pose estimation vector into the pre-trained CNN network;
extracting eye features through the convolutional layers and compressing them through the pooling layers;
concatenating the head pose estimation vector with the extracted eye features in a fully connected layer, and outputting a 2D gaze angle;
geometrically converting the 2D gaze angle into a 3D gaze direction vector.
The second technical solution adopted by the invention is a gaze estimation system, comprising:
an image preprocessing module for acquiring a face image and performing key point detection and 3D model fitting to obtain a human eye image and a 3D head rotation vector;
a data regularization module for performing data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head pose estimation vector;
a result output module for inputting the regularized human eye image and the head pose estimation vector into the pre-trained CNN network and converting the network output into a 3D gaze direction vector.
The third technical solution adopted by the invention is a gaze estimation device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the gaze estimation method described above.
The fourth technical solution adopted by the invention is a storage medium in which processor-executable instructions are stored; the processor-executable instructions, when executed by a processor, are used to implement the gaze estimation method described above.
The method, system, device and storage medium of the invention have the following beneficial effects: the invention first judges whether a human face is present; if so, it locates several eye key points to perform human eye detection, and finally crops the resulting eye image and inputs it into a CNN network to realize gaze estimation.
Drawings
FIG. 1 is a flow chart of the steps of a gaze estimation method of the present invention;
FIG. 2 is a schematic diagram of a gaze estimation method in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of data regularization of a human eye image according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of regularized human eye (left and right eye) images according to an embodiment of the present invention;
FIG. 5 is a diagram of the 68 face key points according to an embodiment of the present invention;
FIG. 6 is a block diagram of a gaze estimation system according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments. The step numbers in the following embodiments are provided only for convenience of description; they do not limit the order of the steps, and the execution order of the steps may be adapted according to the understanding of those skilled in the art.
Referring to fig. 1 and 2, the present invention provides a gaze estimation method, including the steps of:
s1, obtaining a human face image, and performing key point detection and 3D model fitting processing to obtain a human eye image and a 3D head rotation vector;
s2, carrying out data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head posture estimation vector;
s3, inputting the regularized human eye image and the head pose estimation vector into a pre-trained CNN network, and converting the network output into a 3D sight line direction vector.
Further, as a preferred embodiment of the method, the step of obtaining the face image and performing key point detection and 3D model fitting to obtain the eye image and the 3D head rotation vector specifically includes:
acquiring a complete face image;
performing 2D face alignment based on dlib face detection and 68-point face key point detection to obtain the two-dimensional coordinates of the face key points in the image;
acquiring an eye image according to the eye key point positions in the two-dimensional coordinates of the face key points;
acquiring a 3D face key point model;
specifically, a 3D-FAN network is fine-tuned on data sets such as 300W and 300W-LP-3D to obtain the 68-point 3D face key point model (i.e., the average face model) required here;
fitting the two-dimensional coordinates of the face key points to the 3D key point model based on the EPnP algorithm to obtain a 3D head rotation vector, as illustrated in the sketch below.
The EPnP algorithm represents the n three-dimensional space points as weighted sums of 4 virtual control points. The coordinates of these 4 control points in the camera coordinate system then need to be estimated: they can be obtained by weighting the eigenvectors of a 12×12 matrix built from the point coordinates and solving a small number of quadratic equations to select the correct weights. Finally, from the Euclidean motion between the camera coordinate system and the world coordinate system, the translation vector and rotation matrix between the coordinate systems can be solved.
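To make these steps concrete, here is a minimal Python sketch combining dlib's publicly distributed 68-point shape predictor with OpenCV's EPnP solver. The 3D model coordinates, the landmark subset, and the camera matrix are illustrative assumptions, not values taken from the patent; real values would come from the fitted average face model.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Hypothetical 3D model points in head coordinates (meters), origin at the
# nose tip; outer eye corners are +/-0.045 m apart, matching the 90 mm
# outer-eye-corner distance described below.
MODEL_3D = np.array([
    [-0.045,  0.035, -0.02],  # left outer eye corner (landmark 36)
    [ 0.045,  0.035, -0.02],  # right outer eye corner (landmark 45)
    [ 0.0,    0.0,    0.0 ],  # nose tip (landmark 30)
    [-0.025, -0.03,  -0.02],  # left mouth corner (landmark 48)
    [ 0.025, -0.03,  -0.02],  # right mouth corner (landmark 54)
], dtype=np.float64)
IDX_2D = [36, 45, 30, 48, 54]

def head_rotation_vector(gray, camera_matrix, dist_coeffs=None):
    """Detect one face, fit the 2D landmarks to the 3D model with EPnP."""
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts_2d = np.array([[shape.part(i).x, shape.part(i).y] for i in IDX_2D],
                      dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_3D, pts_2d, camera_matrix,
                                  dist_coeffs, flags=cv2.SOLVEPNP_EPNP)
    return rvec if ok else None
```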
Referring to fig. 3, the head coordinate system (X_h, Y_h, Z_h) of the average face model is defined as follows: the origin is at the tip of the nose; the Z_h axis is perpendicular to the plane formed by the three midpoints of the eyes and the mouth; the X_h axis is parallel to the line passing through the midpoints of the two eyes; and the Y_h axis is perpendicular to both the Z_h axis and the X_h axis. The unit of the coordinate system is the meter, and the outer eye corner distance of the model is set to 90 mm. In the figure, the triangular area is the plane formed by the three midpoints of the eyes and the mouth; the dots, from top to bottom and from left to right, are the outer corners of the left and right eyes, the tip of the nose, and two key points of the mouth.
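As an illustration of this definition, the following sketch constructs the head coordinate axes from the two eye midpoints and the mouth midpoint; the input points are assumed to already be given as 3D coordinates.

```python
import numpy as np

def head_axes(eye_mid_left, eye_mid_right, mouth_mid):
    """All inputs are 3D points (numpy arrays); returns unit axes X_h, Y_h, Z_h."""
    x_h = eye_mid_right - eye_mid_left   # parallel to the line through the eye midpoints
    x_h = x_h / np.linalg.norm(x_h)
    # Z_h is the normal of the plane spanned by the two eye midpoints and the mouth midpoint.
    z_h = np.cross(eye_mid_right - eye_mid_left, mouth_mid - eye_mid_left)
    z_h = z_h / np.linalg.norm(z_h)
    y_h = np.cross(z_h, x_h)             # perpendicular to both, completing the frame
    return x_h, y_h, z_h
```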
As a further preferred embodiment of the method, before regularizing the human eye image, the method further includes a step of blink detection and screening of the human eye image, which specifically includes:
obtaining the horizontal line and the vertical lines passing through each eye according to the left-eye and right-eye key point information in the human eye image;
calculating the ratio of the vertical line length to the corresponding horizontal line length;
if the ratio is greater than a preset threshold, determining that the human eye image is in an eye-open state and performing gaze estimation;
if the ratio is smaller than the preset threshold, determining that the human eye image is in an eye-closed state and not performing gaze estimation.
Specifically, referring to FIG. 5, based on face key point detection we can locate 68 specific face key points, each with a specific index. The key point indices of the left and right eyes are (36, 37, 38, 39, 40, 41) and (42, 43, 44, 45, 46, 47), respectively. As the eyes open and close, the length of the horizontal line stays almost constant, while the vertical line length changes: when the eyes are open it is much longer than when they are closed, and when the eyes are closed it is almost zero. A sketch of this screening step follows.
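A minimal sketch of this screening, using the eye aspect ratio (vertical over horizontal), which matches the open-eye condition described above. The threshold value is an assumption, not taken from the patent, and would be tuned on real data.

```python
import numpy as np

LEFT_EYE = list(range(36, 42))
RIGHT_EYE = list(range(42, 48))
EAR_THRESHOLD = 0.2  # assumed value; tune on real data

def eye_aspect_ratio(pts):
    """pts: (6, 2) array of one eye's landmarks in index order 36..41 (or 42..47)."""
    v1 = np.linalg.norm(pts[1] - pts[5])   # first vertical line
    v2 = np.linalg.norm(pts[2] - pts[4])   # second vertical line
    h = np.linalg.norm(pts[0] - pts[3])    # horizontal line (eye corners)
    return (v1 + v2) / (2.0 * h)

def eyes_open(landmarks):
    """landmarks: (68, 2) array; returns True if gaze estimation should run."""
    ear_left = eye_aspect_ratio(landmarks[LEFT_EYE])
    ear_right = eye_aspect_ratio(landmarks[RIGHT_EYE])
    return (ear_left + ear_right) / 2.0 > EAR_THRESHOLD
```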
Further, as a preferred embodiment of the method, the formula for data regularization is as follows:
M = S * R
where R denotes the inverse of the camera rotation matrix, chosen so that the x-axis of the head coordinate system is perpendicular to the y-axis of the camera coordinate system and the camera z-axis points toward the eye position, and S denotes the scaling matrix, which keeps the distance from the eye to the camera fixed.
Further, as a preferred embodiment of the method, the step of performing data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head pose estimation vector specifically includes:
processing the human eye image and the 3D head rotation vector based on the transformation matrix;
rotating the camera coordinate system by the rotation matrix R;
scaling the camera coordinate system by the scaling matrix S;
finally obtaining the regularized human eye image and the head pose estimation vector through perspective transformation.
Specifically, to achieve high-accuracy gaze estimation under different camera parameters, data regularization is required: the input image is regularized so that the distance between the camera and the human eye is fixed, the x-axis of the head coordinate system is perpendicular to the y-axis of the camera coordinate system, and the camera z-axis faces the eyes.
The image regularization steps are illustrated in fig. 3 and 4: (a) start from the head coordinate system (top), centered on the tip of the nose, and the camera coordinate system (bottom); (b) rotate the camera coordinate system by the rotation matrix R; (c) then scale the camera coordinate system by the scaling matrix S; (d) finally obtain the regularized eye image through perspective transformation. A sketch of this procedure follows.
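The following sketch implements the regularization in the spirit of the description, following the camera-normalization procedure widely used in appearance-based gaze estimation. The normalized focal length, normalized distance, and output image size are assumptions, not values from the patent.

```python
import cv2
import numpy as np

def normalize_eye(img, eye_center, head_R, cam_K,
                  focal_norm=960.0, dist_norm=0.6, size=(60, 36)):
    """img: input frame; eye_center: 3D eye position in camera coordinates;
    head_R: 3x3 head rotation matrix (e.g. cv2.Rodrigues(rvec)[0]);
    cam_K: 3x3 camera intrinsic matrix."""
    dist = np.linalg.norm(eye_center)
    z_axis = eye_center / dist                 # camera z-axis points toward the eye
    head_x = head_R[:, 0]                      # x-axis of the head coordinate system
    y_axis = np.cross(z_axis, head_x)          # perpendicular to the head x-axis
    y_axis /= np.linalg.norm(y_axis)
    x_axis = np.cross(y_axis, z_axis)
    x_axis /= np.linalg.norm(x_axis)
    R = np.stack([x_axis, y_axis, z_axis])     # rotation part of M
    S = np.diag([1.0, 1.0, dist_norm / dist])  # scaling part of M
    M = S @ R                                  # M = S * R, as in the formula above

    # Normalized (virtual) camera matrix: an assumed parameterization.
    K_norm = np.array([[focal_norm, 0.0, size[0] / 2],
                       [0.0, focal_norm, size[1] / 2],
                       [0.0, 0.0, 1.0]])
    W = K_norm @ M @ np.linalg.inv(cam_K)      # perspective transformation
    eye_img = cv2.warpPerspective(img, W, size)
    head_R_norm = R @ head_R                   # regularized head rotation
    return eye_img, head_R_norm
```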
As a further preferred embodiment of the method, the training step of the pre-trained CNN network specifically includes the following, sketched in code below:
acquiring human eye images with ground-truth gaze angle labels and head pose estimation vectors, and inputting them into the CNN network to obtain the network output;
calculating the error between the network output and the ground-truth gaze angle labels using a mean-squared-error loss function to obtain an error result;
adjusting the network parameters according to the error result to obtain a trained gaze estimation model.
As a preferred embodiment of the method, the step of inputting the regularized eye image and the head pose estimation vector into a pre-trained CNN network and converting the network output into a 3D gaze direction vector specifically includes:
inputting the regularized human eye image and the head pose estimation vector into the pre-trained CNN network;
extracting eye features through the convolutional layers and compressing them through the pooling layers;
specifically, the convolutional layers perform convolution operations to extract eye features, and the pooling layers compress the input features and retain the main ones, which reduces the computational complexity of the network;
concatenating the head pose estimation vector with the extracted eye features in a fully connected layer, and outputting a 2D gaze angle;
geometrically converting the 2D gaze angle into a 3D gaze direction vector, as sketched below.
As shown in FIG. 6, a gaze estimation system includes:
an image preprocessing module for acquiring a face image and performing key point detection and 3D model fitting to obtain a human eye image and a 3D head rotation vector;
a data regularization module for performing data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head pose estimation vector;
a result output module for inputting the regularized human eye image and the head pose estimation vector into the pre-trained CNN network and converting the network output into a 3D gaze direction vector.
The contents of the above method embodiment are all applicable to this system embodiment; the functions specifically implemented by this system embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are also the same.
A gaze estimation device comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the gaze estimation method described above.
The contents of the above method embodiment are all applicable to this device embodiment; the functions specifically implemented by this device embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are also the same.
A storage medium stores processor-executable instructions; the processor-executable instructions, when executed by a processor, are used to implement the gaze estimation method described above.
The contents of the above method embodiment are all applicable to this storage medium embodiment; the functions specifically implemented by this storage medium embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are also the same.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A gaze estimation method, characterized by comprising the steps of:
obtaining a human face image, and performing key point detection and 3D model fitting to obtain a human eye image and a 3D head rotation vector;
performing data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head pose estimation vector;
inputting the regularized human eye image and the head pose estimation vector into a pre-trained CNN network, and converting the network output into a 3D gaze direction vector.
2. The gaze estimation method of claim 1, wherein the step of obtaining a human face image and performing key point detection and 3D model fitting to obtain a human eye image and a 3D head rotation vector comprises:
acquiring a complete face image;
performing 2D face alignment based on dlib face detection and 68-point face key point detection to obtain the two-dimensional coordinates of the face key points in the image;
acquiring an eye image according to the eye key point positions in the two-dimensional coordinates of the face key points;
acquiring a 3D face key point model;
fitting the two-dimensional coordinates of the face key points to the 3D key point model based on the EPnP algorithm to obtain a 3D head rotation vector.
3. The gaze estimation method of claim 2, further comprising, before regularizing the human eye image, a step of blink detection and screening of the human eye image, specifically comprising:
obtaining the horizontal line and the vertical lines passing through each eye according to the left-eye and right-eye key point information in the human eye image;
calculating the ratio of the vertical line length to the corresponding horizontal line length;
if the ratio is greater than a preset threshold, determining that the human eye image is in an eye-open state and performing gaze estimation;
if the ratio is smaller than the preset threshold, determining that the human eye image is in an eye-closed state and not performing gaze estimation.
4. The gaze estimation method of claim 3, wherein the data regularization is formulated as follows:
M = S * R
where R denotes the inverse of the camera rotation matrix, and S denotes the scaling matrix.
5. The gaze estimation method of claim 4, wherein the step of performing data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head pose estimation vector comprises:
processing the human eye image and the 3D head rotation vector based on the transformation matrix;
rotating the camera coordinate system by the rotation matrix R;
scaling the camera coordinate system by the scaling matrix S;
finally obtaining the regularized human eye image and the head pose estimation vector through perspective transformation.
6. The gaze estimation method of claim 5, characterized in that the training step of the pre-trained CNN network specifically comprises:
acquiring human eye images with ground-truth gaze angle labels and head pose estimation vectors, and inputting them into the CNN network to obtain the network output;
calculating the error between the network output and the ground-truth gaze angle labels using a mean-squared-error loss function to obtain an error result;
adjusting the network parameters according to the error result to obtain a trained gaze estimation model.
7. The gaze estimation method of claim 6, wherein the step of inputting the regularized eye image and the head pose estimation vector into a pre-trained CNN network and converting the network output into a 3D gaze direction vector comprises:
inputting the regularized human eye image and the head pose estimation vector into the pre-trained CNN network;
extracting eye features through the convolutional layers and compressing them through the pooling layers;
concatenating the head pose estimation vector with the extracted eye features in a fully connected layer, and outputting a 2D gaze angle;
geometrically converting the 2D gaze angle into a 3D gaze direction vector.
8. A gaze estimation system, comprising:
an image preprocessing module for acquiring a face image and performing key point detection and 3D model fitting to obtain a human eye image and a 3D head rotation vector;
a data regularization module for performing data regularization on the human eye image and the 3D head rotation vector to obtain a regularized human eye image and a head pose estimation vector;
a result output module for inputting the regularized human eye image and the head pose estimation vector into the pre-trained CNN network and converting the network output into a 3D gaze direction vector.
9. A gaze estimation device, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement a gaze estimation method as claimed in any one of claims 1-7.
10. A storage medium having stored therein processor-executable instructions, wherein the processor-executable instructions, when executed by a processor, are used to implement a gaze estimation method as claimed in any one of claims 1-7.
Application CN202110450755.1A (priority and filing date 2021-04-26): Sight estimation method, system, device and storage medium. Status: Active. Granted as CN113095274B.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110450755.1A (CN113095274B) | 2021-04-26 | 2021-04-26 | Sight estimation method, system, device and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110450755.1A (CN113095274B) | 2021-04-26 | 2021-04-26 | Sight estimation method, system, device and storage medium

Publications (2)

Publication Number | Publication Date
CN113095274A | 2021-07-09
CN113095274B | 2024-02-09

Family

ID=76680139

Family Applications (1)

Application Number | Status | Priority Date | Filing Date | Title
CN202110450755.1A (CN113095274B) | Active | 2021-04-26 | 2021-04-26 | Sight estimation method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN113095274B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822174A (en) * 2021-09-02 2021-12-21 北京的卢深视科技有限公司 Gaze estimation method, electronic device, and storage medium
CN114967935A (en) * 2022-06-29 2022-08-30 深圳职业技术学院 Interaction method and device based on sight estimation, terminal equipment and storage medium


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171152A (en) * 2017-12-26 2018-06-15 深圳大学 Deep learning human eye sight estimation method, equipment, system and readable storage medium storing program for executing
CN108875524A (en) * 2018-01-02 2018-11-23 北京旷视科技有限公司 Gaze estimation method, device, system and storage medium
CN108171218A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of gaze estimation method for watching network attentively based on appearance of depth
WO2020228224A1 (en) * 2019-05-11 2020-11-19 初速度(苏州)科技有限公司 Face part distance measurement method and apparatus, and vehicle-mounted terminal
CN110458001A (en) * 2019-06-28 2019-11-15 南昌大学 A kind of convolutional neural networks gaze estimation method and system based on attention mechanism
CN111985403A (en) * 2020-08-20 2020-11-24 中再云图技术有限公司 Distracted driving detection method based on face posture estimation and sight line deviation
CN112488067A (en) * 2020-12-18 2021-03-12 北京的卢深视科技有限公司 Face pose estimation method and device, electronic equipment and storage medium
CN112257696A (en) * 2020-12-23 2021-01-22 北京万里红科技股份有限公司 Sight estimation method and computing equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU Yubin; YAN Xiangjun; SHEN Xuqi; LU Zhaolin: "Fatigue driving detection based on cascaded broad learning" (基于级联宽度学习的疲劳驾驶检测), Computer Engineering and Design (计算机工程与设计), no. 02, pages 245-249 *


Also Published As

Publication number Publication date
CN113095274B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
Yu et al. Unsupervised representation learning for gaze estimation
CN108876879B (en) Method and device for realizing human face animation, computer equipment and storage medium
US11600013B2 (en) Facial features tracker with advanced training for natural rendering of human faces in real-time
CN106068514B (en) System and method for identifying face in free media
CN104978548B (en) A kind of gaze estimation method and device based on three-dimensional active shape model
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN109684969B (en) Gaze position estimation method, computer device, and storage medium
US20150035825A1 (en) Method for real-time face animation based on single video camera
CN112766160A (en) Face replacement method based on multi-stage attribute encoder and attention mechanism
WO2001099048A2 (en) Non-linear morphing of faces and their dynamics
CN112614213A (en) Facial expression determination method, expression parameter determination model, medium and device
CN113095274A (en) Sight estimation method, system, device and storage medium
CN111046734A (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN111754637B (en) Large-scale three-dimensional face synthesis system with suppressed sample similarity
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN115661246A (en) Attitude estimation method based on self-supervision learning
CN111626152A (en) Space-time sight direction estimation prototype design based on Few-shot
Tomar et al. Deep hyfeat based attention in attention model for face super-resolution
Ham et al. Learning a manifold-constrained map between image sets: applications to matching and pose estimation
CN113591797B (en) Depth video behavior recognition method
CN113807251A (en) Sight estimation method based on appearance
Park Representation learning for webcam-based gaze estimation
Somepalli et al. Implementation of single camera markerless facial motion capture using blendshapes
CN111739168B (en) Large-scale three-dimensional face synthesis method with suppressed sample similarity
CN116110108A (en) Intelligent man-machine interaction method based on viewpoint tracking

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant