CN108171218A - Gaze estimation method based on a deep appearance-based gaze network - Google Patents
Gaze estimation method based on a deep appearance-based gaze network
- Publication number
- CN108171218A (application CN201810081808.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- facial
- dimensional
- camera
- eye
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/18—Eye characteristics, e.g. of the iris
- G06V40/19—Sensors therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/18—Eye characteristics, e.g. of the iris
- G06V40/193—Preprocessing; Feature extraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/18—Eye characteristics, e.g. of the iris
- G06V40/197—Matching; Classification
Abstract
The present invention proposes a gaze estimation method based on a deep appearance-based gaze network. Its main contents include: a gaze dataset, the gaze network, and cross-dataset evaluation. The process is as follows: a large number of images from different participants is collected as the gaze dataset, and facial landmarks are manually annotated in a subset of the dataset; face calibration is performed on input images captured by a monocular RGB camera, and landmarks are located with face-detection and facial-landmark-detection methods; a generic three-dimensional face shape is fitted to estimate the three-dimensional pose of the detected face; a spatial normalization technique warps the head pose and eye images into a normalized training space; and a convolutional neural network learns the mapping from head pose and eye images to the three-dimensional gaze direction in the camera coordinate system. The present invention detects facial landmarks with a continuous conditional neural network model and performs three-dimensional pose estimation with a mean facial shape; it is suitable for gaze estimation under varying environments and improves the accuracy of the estimation results.
Description
Technical field
The present invention relates to the field of gaze estimation, and more particularly to a gaze estimation method based on a deep appearance-based gaze network.
Background technology
Gaze estimation infers the direction of the human gaze in an image, computing and returning the coordinates of the eyeball centers and the gaze direction vector of both eyes. It enables real-time tracking of the human gaze in video and is widely used in fields such as public security, traffic, medicine and military reconnaissance. Specifically, in the police field, estimating the gaze direction makes it possible to infer a person's area of interest or target of attention and to further study the person's psychological activity, and it can be used for lie detection during the interrogation of suspects. In the traffic field, gaze estimation automatically monitors the driver's gaze direction; once the driver looks at a mobile phone, dozes off, or otherwise looks away from the road ahead, the system issues an alarm to remind the driver, thereby providing driving assistance. In the medical field, gaze estimation can on the one hand be mounted on a robot so that devices are controlled by estimating the user's gaze, providing convenience to the lives of disabled people; on the other hand, estimating a patient's gaze can support the diagnosis of cognitive disorders. In the military field, the motor behavior of the eyes can be used to control external equipment and systems, further improving the human-computer interaction level of military equipment. Because face images captured in daily life have complex backgrounds and are affected by conditions such as illumination, pose, gaze direction and personal appearance, most existing methods are only applicable to datasets with a specific background; once estimation crosses datasets, the results easily incur error, so accurate gaze estimation still presents certain challenges.
The present invention proposes a gaze estimation method based on a deep appearance-based gaze network. A large number of images from different participants is collected as the gaze dataset, and facial landmarks are manually annotated in a provided subset of the dataset. Face calibration is performed on input images captured by a monocular RGB camera, and landmarks are located with face-detection and facial-landmark-detection methods. A generic three-dimensional face shape is fitted to estimate the three-dimensional pose of the detected face; a spatial normalization technique warps the head pose and eye images into a normalized training space; and a convolutional neural network learns the mapping from head pose and eye images to the three-dimensional gaze direction in the camera coordinate system. The present invention detects facial landmarks with a continuous conditional neural network model, performs three-dimensional pose estimation with a mean facial shape, and evaluates the whole gaze estimation pipeline in real environments; it is suitable for gaze estimation under varying environments and improves the accuracy of the estimation results.
Invention content
For gaze estimation, the present invention proposes a gaze estimation method based on a deep appearance-based gaze network: a large number of images from different participants is collected as the gaze dataset, and facial landmarks are manually annotated in a provided subset of the dataset; face calibration is performed on input images captured by a monocular RGB camera, and landmarks are located with face-detection and facial-landmark-detection methods; a generic three-dimensional face shape is fitted to estimate the three-dimensional pose of the detected face; a spatial normalization technique warps the head pose and eye images into a normalized training space; and a convolutional neural network learns the mapping from head pose and eye images to the three-dimensional gaze direction in the camera coordinate system.
To solve the above problems, a gaze estimation method based on a deep appearance-based gaze network is proposed, whose main contents include:
(1) gaze dataset;
(2) gaze network;
(3) cross-dataset evaluation.
Regarding the gaze dataset: in order to evaluate unconstrained gaze estimation methods, the dataset needs to contain different illumination conditions, head poses, gaze directions and personal appearances. A large number of images from different participants is collected as the gaze dataset; each image includes a three-dimensional annotation of the fixation target and of the detected eye or head position. Facial landmarks are then manually annotated in a provided subset of the dataset to assess gaze estimation performance.
Further, for image acquisition, laptops are used as the collection devices: acquisition software runs on each participant's computer and collects calibration annotations using points moving on the screen, with a collection every 10 minutes. The acquisition software automatically asks the participant to look at 20 random positions shown as gray circles; the participant fixates on each acquisition position and presses the space bar to confirm just as the circle is about to disappear, and if a position is missed, the same screen position is recorded again. Because the computer models used for data collection differ, the on-screen gaze positions are converted to three-dimensional positions in the camera coordinate system. Before data collection, six three-dimensional facial landmarks of each participant are recorded with the camera in order to build a three-dimensional face model.
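The screen-to-camera conversion mentioned above can be sketched as follows; this is a minimal illustration assuming the screen plane has been calibrated against the camera once per laptop (the function name, origin offset and per-pixel pitch below are hypothetical values, not taken from the patent):

```python
# Hypothetical screen calibration: position of the screen's top-left pixel
# in camera coordinates (mm) and the physical size of one pixel (mm).
def screen_to_camera_mm(px, py, origin_mm=(-160.0, 10.0, 0.0),
                        pitch_mm=(0.2, 0.2)):
    """Convert an on-screen gaze target (pixels) to a 3D point in the
    camera coordinate system, assuming the screen lies in a known plane."""
    x = origin_mm[0] + px * pitch_mm[0]
    y = origin_mm[1] + py * pitch_mm[1]
    return (x, y, origin_mm[2])
```

Per-laptop calibration of the screen plane is what makes gaze targets from different computer models comparable in a common camera coordinate system.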
Further, for facial landmark annotation, an image subset is manually annotated with facial landmarks to assess the influence of facial alignment error on gaze estimation performance. 1500 left-eye images and 1500 right-eye images are randomly selected as the assessment subset. A facial landmark detection method generates six landmarks on each face image, including four eye-corner landmarks and two mouth-corner landmarks; the eye images are cropped from the manual facial landmarks, and the pupil centers are annotated.
Regarding the gaze network: face calibration is performed on input images captured by a monocular RGB camera, and landmarks are located with face-detection and facial-landmark-detection methods. A generic three-dimensional face shape is fitted to estimate the three-dimensional pose of the detected face. A spatial normalization technique warps the head pose and eye images into a normalized training space, and a convolutional neural network (CNN) learns the mapping from head pose and eye images to the three-dimensional gaze direction in the camera coordinate system.
Further, for face calibration, facial landmarks are detected with a continuous conditional neural network model, and three-dimensional pose estimation is performed with a mean facial shape F; the whole gaze estimation pipeline is evaluated in real environments. F is the average facial shape of all participants and consists of the three-dimensional positions of six facial landmarks. The head coordinate system is defined from the triangle connecting the midpoints of the eyes and the mouth:
(1) the x-axis is given by the line from the right-eye annotation center to the left-eye annotation center;
(2) the y-axis is perpendicular to the x-axis, along the line from the eyes to the mouth;
(3) the z-axis is perpendicular to the triangle plane and points away from the face.
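The head coordinate system defined by steps (1)-(3) above can be sketched as follows; this is a minimal numpy illustration of the stated construction (the function name, axis-sign conventions and toy landmark values are assumptions):

```python
import numpy as np

def head_coordinate_system(right_eye, left_eye, mouth):
    """Build the head rotation frame from the eye midpoints and the mouth
    midpoint: x along the eye line, z normal to the eye-mouth triangle,
    y completing the frame toward the mouth. Columns of the result are
    the x, y, z axes."""
    right_eye, left_eye, mouth = map(np.asarray, (right_eye, left_eye, mouth))
    x_axis = left_eye - right_eye
    x_axis = x_axis / np.linalg.norm(x_axis)
    eye_mid = 0.5 * (right_eye + left_eye)
    # z-axis: perpendicular to the triangle plane
    z_axis = np.cross(x_axis, mouth - eye_mid)
    z_axis = z_axis / np.linalg.norm(z_axis)
    # y-axis: perpendicular to x within the triangle plane, toward the mouth
    y_axis = np.cross(z_axis, x_axis)
    return np.stack([x_axis, y_axis, z_axis], axis=1)

# Toy landmark positions in millimetres (hypothetical).
R = head_coordinate_system([-30.0, 0.0, 0.0], [30.0, 0.0, 0.0], [0.0, 60.0, 10.0])
```

By construction the result is an orthonormal rotation matrix, which is what the pose-estimation step below expects.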
The three-dimensional rotation matrix R_r and translation vector t_r of the face model are obtained from the detected two-dimensional facial landmark points p, where p is a perspective projection. Given the three-dimensional model, the three-dimensional pose of the target is estimated from the corresponding two-dimensional projections in the image: a pose estimation algorithm provides an initial solution that fits F to the detected facial landmarks, and the pose is further refined by minimizing the projection distance.
Further, eye image normalization normalizes the eye images together with the head pose. Because the object pose has six degrees of freedom, an appearance-based gaze estimator would have to handle eye appearance variation in a six-dimensional space; but if the eye region is assumed to be a plane, arbitrary scaling and rotation of the camera can be compensated by a corresponding image warp, so the appearance-based estimation function only has to handle two degrees of freedom of appearance variation.
Further, the rotation steps are as follows:
(1) Given the head rotation matrix R_r, the eye position in the camera coordinate system is e_r = t_r + e_h, where e_h is the midpoint of the two eye corners in the head coordinate system and e_r represents the eye position. The normalization conversion matrix is M = SR, where M represents the three-dimensional scaling and rotation of the eye center in the camera coordinate system, R is the inverse of the camera rotation matrix, and S is the scaling that makes the camera face e_r.
(2) The scaling matrix is S = diag(1, 1, d_n/||e_r||), where d_n is the normalized distance of e_r from the camera coordinate system origin. The original camera projection matrix obtained from camera calibration is denoted C_r, and C_n is the normalized camera projection matrix.
(3) The same conversion is applied to the original image pixels through the perspective warp of the image transformation matrix W, where C_n = [f_x, 0, c_x; 0, f_y, c_y; 0, 0, 1], f is the focal length of the normalized camera and c is its principal point.
(4) The whole normalization process is applied to the right and left eyes in the same way, with e_r defined from the corresponding eye position, producing a set of eye images I. The head rotation matrix R_n = M R_r and the gaze angle vector g_n = M g_r lie in the normalized space, where g_r is the three-dimensional gaze vector originating at e_r in the original camera coordinate system; the normalized head rotation matrix R_n is converted to a three-dimensional rotation angle h_n.
(5) Because the rotation angle around the z-axis after normalization is zero, h_n is represented as a two-dimensional rotation vector h, and g_n, assuming unit length, as a two-dimensional rotation vector g. d_n is defined as 600 mm, the focal lengths f_x and f_y of the normalized camera projection matrix C_n are set to 960, the normalized eye image resolution in I is set to 60 × 36, and c_x and c_y are set to 30 and 18, respectively. After normalization, the eye images I are converted to grayscale and histogram-equalized, making the normalized eye images compatible across different datasets and facilitating cross-dataset evaluation.
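Steps (1)-(2) above can be sketched as follows: a minimal numpy illustration of building M = SR so that the normalized camera faces the eye center at the fixed distance d_n = 600 mm. The up-vector used to complete the rotation, the function name, and the toy eye position are assumptions, not specified by the patent:

```python
import numpy as np

def normalization_matrix(e_r, d_n=600.0):
    """Build M = S R: R rotates the camera so its z-axis points at the eye
    center e_r, and S = diag(1, 1, d_n / ||e_r||) rescales depth so the
    normalized camera sits at the fixed distance d_n."""
    e_r = np.asarray(e_r, dtype=float)
    z = e_r / np.linalg.norm(e_r)          # new z-axis looks at the eye center
    # complete the rotation with an assumed world up-vector (0, 1, 0)
    x = np.cross(np.array([0.0, 1.0, 0.0]), z)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                # rows are the new camera axes
    S = np.diag([1.0, 1.0, d_n / np.linalg.norm(e_r)])
    return S @ R

# Toy eye position in camera coordinates (mm, hypothetical).
M = normalization_matrix([50.0, -20.0, 500.0])
```

Applying M to e_r itself maps the eye center onto the normalized optical axis at depth 600 mm, which is the invariant the warp W then enforces on the image pixels.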
Further, regarding the network structure: the task of the CNN is to learn, in the normalized space, the mapping from the input features to the gaze angle g. To vary the distance to the target fixation plane under unconstrained conditions, the eye images are flipped horizontally and mirrored copies of h and g are created in the horizontal direction. The deep convolutional neural network architecture comprises 13 convolutional layers, two fully connected layers and one classification layer, among which there are five max-pooling layers. It takes a single-channel grayscale image with a resolution of 60 × 36 pixels as input, and the strides of the first and second pooling layers are changed from 2 to 1 to reflect the smaller input resolution. The output is the two-dimensional gaze angle vector composed of the yaw and pitch angles. The head pose information h is fed into the first fully connected layer, and the summed L2-norm losses of the distances between the predicted vectors and the ground-truth vectors g are used as the loss function.
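As a sanity check on the architecture just described, the spatial size of the feature maps can be traced through the five pooling layers. This assumes a VGG-16-style layout (3 × 3 pad-1 convolutions, which preserve size, interleaved with 2 × 2 max pools); the layout is an assumption consistent with "13 convolutional layers and 5 max-pooling layers", not stated explicitly by the patent:

```python
def pool(size, stride):
    """Output spatial size of a 2x2 max pool: (n - 2) // stride + 1 per axis."""
    return tuple((s - 2) // stride + 1 for s in size)

def vgg_gaze_spatial(size=(60, 36)):
    """Trace the 60 x 36 input through the five max pools; the first two
    use stride 1 (as the patent specifies), the rest stride 2. The 3x3
    pad-1 convolutions in between do not change the spatial size."""
    for stride in (1, 1, 2, 2, 2):
        size = pool(size, stride)
    return size
```

With the reduced strides, the small input still survives all five pools with a non-degenerate feature map, which is presumably why the strides were changed.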
Regarding cross-dataset evaluation: the network is initialized with the weights of a 16-layer deep convolutional neural network trained on the ImageNet dataset and then evaluated. 15000 iterations are performed over the entire network, with a batch size of 256 on the training set; the two momentum values of the solver are set to β_1 = 0.9 and β_2 = 0.9, and the initial learning rate of 0.00001 is multiplied by 0.1 after 5000 iterations.
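The training schedule above can be written down as a small step-decay helper. This is only an illustration of the stated hyper-parameters; reading "multiplied by 0.1 after 5000 iterations" as a decay applied every 5000 steps is an assumption, as are the constant names:

```python
# Stated hyper-parameters: batch size 256, Adam-style momentum terms
# beta1 = beta2 = 0.9, 15000 iterations in total.
BATCH_SIZE = 256
BETA1, BETA2 = 0.9, 0.9
TOTAL_ITERS = 15000

def learning_rate(iteration, base_lr=1e-5, decay=0.1, step=5000):
    """Step decay: multiply the base rate by `decay` every `step` iterations."""
    return base_lr * decay ** (iteration // step)
```

Under this reading, the rate is 1e-5 for the first 5000 iterations, 1e-6 for the next 5000, and 1e-7 for the last 5000.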
Description of the drawings
Fig. 1 is a framework diagram of the gaze estimation method based on a deep appearance-based gaze network of the present invention.
Fig. 2 is a network structure diagram of the gaze estimation method based on a deep appearance-based gaze network of the present invention.
Fig. 3 shows eye sample images of the gaze estimation method based on a deep appearance-based gaze network of the present invention.
Fig. 4 is a head coordinate system diagram of the gaze estimation method based on a deep appearance-based gaze network of the present invention.
Specific embodiment
It should be noted that in the absence of conflict, the feature in embodiment and embodiment in the application can phase
It mutually combines, the present invention is described in further detail in the following with reference to the drawings and specific embodiments.
Fig. 1 is a framework diagram of the gaze estimation method based on a deep appearance-based gaze network of the present invention. It mainly includes the gaze dataset, the gaze network, and cross-dataset evaluation.
Fig. 2 is a network structure diagram of the gaze estimation method based on a deep appearance-based gaze network of the present invention. Face calibration is performed on input images captured by a monocular RGB camera, and landmarks are located with face-detection and facial-landmark-detection methods; a generic three-dimensional face shape is fitted to estimate the three-dimensional pose of the detected face; a spatial normalization technique warps the head pose and eye images into a normalized training space; and a convolutional neural network (CNN) learns the mapping from head pose and eye images to the three-dimensional gaze direction in the camera coordinate system.
Fig. 3 shows eye sample images of the gaze estimation method based on a deep appearance-based gaze network of the present invention, i.e. normalized eye sample images from each dataset. Figures (a) and (b) are images from the gaze dataset, while figures (c) and (d) are images from other datasets; each group of images is randomly selected with roughly the same gaze direction. Compared with (c) and (d), figure (b) shows that the gaze dataset contains larger appearance variation in the eye region, and figure (a) shows that the image variation of participants wearing glasses is the most apparent.
Fig. 4 is a head coordinate system diagram of the gaze estimation method based on a deep appearance-based gaze network of the present invention. The head coordinate system is defined from the triangle connecting the three midpoints of the eye and mouth annotations: the x-axis passes through the two eye midpoints, the y-axis is perpendicular to the x-axis within the triangle plane, and the z-axis is perpendicular to the triangle plane.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and the present invention can be realized in other specific forms without departing from its spirit and scope. In addition, those skilled in the art may make various modifications and variations to the present invention without departing from its spirit and scope, and these improvements and modifications should also be regarded as within the protection scope of the present invention. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Claims (10)
1. A gaze estimation method based on a deep appearance-based gaze network, characterized in that it mainly includes a gaze dataset (1); a gaze network (2); and cross-dataset evaluation (3).
2. The gaze dataset (1) according to claim 1, characterized in that, in order to evaluate unconstrained gaze estimation methods, the dataset needs to contain different illumination conditions, head poses, gaze directions and personal appearances; a large number of images from different participants is collected as the gaze dataset, each image including a three-dimensional annotation of the fixation target and of the detected eye or head position; facial landmarks are then manually annotated in a provided subset of the dataset to assess gaze estimation performance.
3. Image acquisition according to claim 2, characterized in that laptops are used as the collection devices: acquisition software runs on the participant's computer and collects calibration annotations using points moving on the screen, with a collection every 10 minutes; the acquisition software automatically asks the participant to look at 20 random positions shown as gray circles, the participant fixates on each acquisition position and presses the space bar to confirm just as the circle is about to disappear, and if a position is missed, the same screen position is recorded again; because the computer models used for data collection differ, the on-screen gaze positions are converted to three-dimensional positions in the camera coordinate system; before data collection, six three-dimensional facial landmarks of the participant are recorded with the camera in order to build a three-dimensional face model.
4. Facial landmark annotation according to claim 2, characterized in that an image subset is manually annotated with facial landmarks to assess the influence of facial alignment error on gaze estimation performance; 1500 left-eye images and 1500 right-eye images are randomly selected as the assessment subset; a facial landmark detection method generates six landmarks on each face image, including four eye-corner landmarks and two mouth-corner landmarks; the eye images are cropped from the manual facial landmarks, and the pupil centers are annotated.
5. The gaze network (2) according to claim 1, characterized in that face calibration is performed on input images captured by a monocular RGB camera, and landmarks are located with face-detection and facial-landmark-detection methods; a generic three-dimensional face shape is fitted to estimate the three-dimensional pose of the detected face; a spatial normalization technique warps the head pose and eye images into a normalized training space; and a convolutional neural network (CNN) learns the mapping from head pose and eye images to the three-dimensional gaze direction in the camera coordinate system.
6. Face calibration according to claim 5, characterized in that facial landmarks are detected with a continuous conditional neural network model and three-dimensional pose estimation is performed with a mean facial shape F, the whole gaze estimation pipeline being evaluated in real environments; F is the average facial shape of all participants and consists of the three-dimensional positions of six facial landmarks; the head coordinate system is defined from the triangle connecting the midpoints of the eyes and the mouth:
(1) the x-axis is given by the line from the right-eye annotation center to the left-eye annotation center;
(2) the y-axis is perpendicular to the x-axis, along the line from the eyes to the mouth;
(3) the z-axis is perpendicular to the triangle plane and points away from the face;
the three-dimensional rotation matrix R_r and translation vector t_r of the face model are obtained from the detected two-dimensional facial landmark points p, where p is a perspective projection; given the three-dimensional model, the three-dimensional pose of the target is estimated from the corresponding two-dimensional projections in the image; a pose estimation algorithm provides an initial solution that fits F to the detected facial landmarks, and the pose is further refined by minimizing the projection distance.
7. Eye image normalization according to claim 5, characterized in that the eye images and the head pose are normalized; because the object pose has six degrees of freedom, an appearance-based gaze estimator would have to handle eye appearance variation in a six-dimensional space, but if the eye region is assumed to be a plane, arbitrary scaling and rotation of the camera can be compensated by a corresponding image warp, so the appearance-based estimation function only has to handle two degrees of freedom of appearance variation.
8. The rotation according to claim 7, characterized in that the rotation proceeds as follows:
(1) given the head rotation matrix Rr, the eye position in the camera coordinate system is er = tr + eh, where eh is the midpoint of the two eye corners in the head coordinate system and er denotes the eye position; the normalization transformation matrix is M = SR, where M represents the three-dimensional scaling and rotation of the eye center in the camera coordinate system, R is the inverse of the camera rotation matrix, and S is the scaling matrix, so that the camera is rotated to look at er;
(2) the scaling matrix is S = diag(1, 1, dn/||er||), where dn is the normalized distance of er from the origin of the camera coordinate system; the original camera projection matrix obtained from camera calibration is denoted Cr, and Cn is the normalized camera projection matrix;
(3) the same transformation is applied to the original image pixels as a perspective warp with the image transformation matrix W = Cn·M·Cr⁻¹, where Cn = [fx, 0, cx; 0, fy, cy; 0, 0, 1], f is the focal length of the normalized camera and c its principal point;
(4) the whole normalization process is applied to the right eye and the left eye in the same way, with er defined from the corresponding eye position, producing a set of eye images I; the head rotation matrix Rn = M·Rr and the gaze angle vector gn = M·gr lie in the normalized space, where gr is the original three-dimensional gaze vector originating at er in the camera coordinate system; the normalized head rotation matrix Rn is converted to a three-dimensional rotation angle hn;
(5) since the rotation about the z-axis is zero after normalization, hn is represented by a two-dimensional rotation vector h, and gn, assumed to be of unit length, by a two-dimensional gaze angle vector g; dn is defined as 600 mm, the focal lengths fx and fy of the normalized camera projection matrix Cn are 960, the resolution of the normalized eye images in I is set to 60 × 36, and cx and cy are set to 30 and 18, respectively; after normalization the eye images I are converted to grayscale and histogram-equalized, which makes the normalized eye images compatible across different datasets and facilitates cross-dataset evaluation.
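The scaling-and-rotation normalization of steps (1)–(2) can be sketched as below. The rotation R is constructed here so that the camera z-axis points at er; the x-axis choice is arbitrary in this sketch, and all names and values are illustrative rather than from the patent:

```python
import numpy as np

def normalization_matrix(er, dn=600.0):
    """M = S @ R: rotate the camera to look at the eye center er,
    then scale along z so that er ends up at distance dn."""
    dist = np.linalg.norm(er)
    z = er / dist                              # new optical axis
    x = np.cross([0.0, 1.0, 0.0], z)           # any axis orthogonal to z
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                    # inverse camera rotation
    S = np.diag([1.0, 1.0, dn / dist])         # S = diag(1, 1, dn/||er||)
    return S @ R

er = np.array([50.0, -30.0, 550.0])            # eye center, millimetres
M = normalization_matrix(er)
print(np.allclose(M @ er, [0.0, 0.0, 600.0]))  # → True
```

The check at the end confirms the defining property of M: the eye center is moved onto the normalized camera's optical axis at the fixed distance dn = 600 mm.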
9. The network structure according to claim 5, characterized in that the task of the CNN is to learn, in the normalized space, the mapping from the input features to the gaze angle vector g under unconstrained conditions with varying distance to the target gaze plane; the eye images are flipped horizontally and mirrored versions of h and g are created accordingly. The deep convolutional neural network architecture comprises 13 convolutional layers, two fully connected layers and one classification layer, with five max-pooling layers; a grayscale single-channel image with a resolution of 60 × 36 pixels is used as input; the stride of the first and second pooling layers is changed from 2 to 1 to reflect the smaller input resolution; the two-dimensional gaze angle vector g = (φ, θ), composed of the yaw angle φ and the pitch angle θ, is the output; the head pose information h is fed into the first fully connected layer; and the loss function is the sum of the L2 distances between the predicted vectors ĝ and the ground-truth vectors g.
10. The cross-dataset evaluation (III) according to claim 1, characterized in that the weights of a 16-layer deep convolutional neural network pre-trained on the ImageNet dataset are used; 15000 training iterations are performed over the entire network; the batch size on the training set is set to 256; the two momentum values of the solver are set to β1 = 0.9 and β2 = 0.9; and the initial learning rate is 0.00001, multiplied by 0.1 every 5000 iterations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810081808.5A CN108171218A (en) | 2018-01-29 | 2018-01-29 | A gaze estimation method based on a deep appearance-based gaze network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108171218A true CN108171218A (en) | 2018-06-15 |
Family
ID=62515678
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810081808.5A Withdrawn CN108171218A (en) | 2018-01-29 | 2018-01-29 | A gaze estimation method based on a deep appearance-based gaze network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108171218A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106547341A (en) * | 2015-09-21 | 2017-03-29 | 现代自动车株式会社 | The method of gaze tracker and its tracing fixation |
CN107368774A (en) * | 2016-03-31 | 2017-11-21 | 富士通株式会社 | Gaze detection equipment and gaze detection method |
2018-01-29: CN CN201810081808.5A patent/CN108171218A/en not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
XUCONG ZHANG, YUSUKE SUGANO ET AL.: "MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation", arXiv * |
Cited By (59)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109033957A (en) * | 2018-06-20 | 2018-12-18 | 同济大学 | A kind of gaze estimation method based on quadratic polynomial |
CN108960302B (en) * | 2018-06-20 | 2021-06-04 | 同济大学 | Head attitude estimation method based on random forest |
CN109033957B (en) * | 2018-06-20 | 2021-05-11 | 同济大学 | Sight estimation method based on quadratic polynomial |
CN108960302A (en) * | 2018-06-20 | 2018-12-07 | 同济大学 | A kind of head pose estimation method based on random forest |
CN110826374B (en) * | 2018-08-10 | 2023-07-14 | Oppo广东移动通信有限公司 | Method and device for monitoring eye gazing time, storage medium and electronic equipment |
CN110826367A (en) * | 2018-08-10 | 2020-02-21 | 北京魔门塔科技有限公司 | Human face image acquisition system for data analysis |
CN110826374A (en) * | 2018-08-10 | 2020-02-21 | Oppo广东移动通信有限公司 | Method and device for monitoring human eye fixation time, storage medium and electronic equipment |
CN109492514A (en) * | 2018-08-28 | 2019-03-19 | 初速度(苏州)科技有限公司 | A kind of method and system in one camera acquisition human eye sight direction |
WO2020042345A1 (en) * | 2018-08-28 | 2020-03-05 | 初速度(苏州)科技有限公司 | Method and system for acquiring line-of-sight direction of human eyes by means of single camera |
CN109359512A (en) * | 2018-08-28 | 2019-02-19 | 深圳壹账通智能科技有限公司 | Eyeball position method for tracing, device, terminal and computer readable storage medium |
CN113227878A (en) * | 2018-08-31 | 2021-08-06 | 目察科技股份有限公司 | Method and system for gaze estimation |
CN109298786A (en) * | 2018-09-13 | 2019-02-01 | 北京旷视科技有限公司 | Mark accuracy rate appraisal procedure and device |
JP2021530823A (en) * | 2018-09-29 | 2021-11-11 | ベイジン センスタイム テクノロジー デベロップメント カンパニー, リミテッド | Neural network training methods, line-of-sight tracking methods and devices, and electronic devices |
JP7146087B2 (en) | 2018-09-29 | 2022-10-03 | ベイジン・センスタイム・テクノロジー・デベロップメント・カンパニー・リミテッド | Neural network training method, line-of-sight tracking method and device, and electronic equipment |
US10769803B2 (en) | 2018-11-16 | 2020-09-08 | Industrial Technology Research Institute | Sight vector detecting method and device |
TWI704473B (en) * | 2018-11-16 | 2020-09-11 | 財團法人工業技術研究院 | Vision vector detecting method and device |
CN109583338A (en) * | 2018-11-19 | 2019-04-05 | 山东派蒙机电技术有限公司 | Driver Vision decentralized detection method based on depth integration neural network |
CN109508679B (en) * | 2018-11-19 | 2023-02-10 | 广东工业大学 | Method, device and equipment for realizing three-dimensional eye gaze tracking and storage medium |
CN109508679A (en) * | 2018-11-19 | 2019-03-22 | 广东工业大学 | Realize method, apparatus, equipment and the storage medium of eyeball three-dimensional eye tracking |
CN109684969B (en) * | 2018-12-18 | 2022-09-30 | 上海科技大学 | Gaze position estimation method, computer device, and storage medium |
CN109684969A (en) * | 2018-12-18 | 2019-04-26 | 上海科技大学 | Stare location estimation method, computer equipment and storage medium |
CN111488775B (en) * | 2019-01-29 | 2023-04-28 | 财团法人资讯工业策进会 | Device and method for judging degree of visibility |
CN111488775A (en) * | 2019-01-29 | 2020-08-04 | 财团法人资讯工业策进会 | Device and method for judging degree of fixation |
CN111723828A (en) * | 2019-03-18 | 2020-09-29 | 北京市商汤科技开发有限公司 | Watching region detection method and device and electronic equipment |
CN110032278A (en) * | 2019-03-29 | 2019-07-19 | 华中科技大学 | A kind of method for recognizing position and attitude, the apparatus and system of human eye attention object |
CN109949374B (en) * | 2019-04-26 | 2020-12-25 | 清华大学深圳研究生院 | Reverse camera calibration system and method based on mirror image |
CN109949374A (en) * | 2019-04-26 | 2019-06-28 | 清华大学深圳研究生院 | A kind of reversed camera calibration system and method based on mirror image |
CN110647790A (en) * | 2019-04-26 | 2020-01-03 | 北京七鑫易维信息技术有限公司 | Method and device for determining gazing information |
WO2020231401A1 (en) * | 2019-05-13 | 2020-11-19 | Huawei Technologies Co., Ltd. | A neural network for head pose and gaze estimation using photorealistic synthetic data |
CN110147163B (en) * | 2019-05-20 | 2022-06-21 | 浙江工业大学 | Eye movement tracking method and system driven by multi-model fusion for mobile equipment |
CN110147163A (en) * | 2019-05-20 | 2019-08-20 | 浙江工业大学 | The eye-tracking method and system of the multi-model fusion driving of facing mobile apparatus |
CN110191234A (en) * | 2019-06-21 | 2019-08-30 | 中山大学 | It is a kind of based on the intelligent terminal unlocking method for watching point analysis attentively |
CN110191234B (en) * | 2019-06-21 | 2021-03-26 | 中山大学 | Intelligent terminal unlocking method based on fixation point analysis |
CN112183160A (en) * | 2019-07-04 | 2021-01-05 | 北京七鑫易维科技有限公司 | Sight estimation method and device |
CN110795982A (en) * | 2019-07-04 | 2020-02-14 | 哈尔滨工业大学(深圳) | Apparent sight estimation method based on human body posture analysis |
WO2021042277A1 (en) * | 2019-09-03 | 2021-03-11 | 浙江大学 | Method for acquiring normal vector, geometry and material of three-dimensional object employing neural network |
US11748618B2 (en) | 2019-09-03 | 2023-09-05 | Zhejiang University | Methods for obtaining normal vector, geometry and material of three-dimensional objects based on neural network |
CN111259713A (en) * | 2019-09-16 | 2020-06-09 | 浙江工业大学 | Sight tracking method based on self-adaptive weighting |
CN111259713B (en) * | 2019-09-16 | 2023-07-21 | 浙江工业大学 | Sight tracking method based on self-adaptive weighting |
CN110909611B (en) * | 2019-10-29 | 2021-03-05 | 深圳云天励飞技术有限公司 | Method and device for detecting attention area, readable storage medium and terminal equipment |
CN110909611A (en) * | 2019-10-29 | 2020-03-24 | 深圳云天励飞技术有限公司 | Method and device for detecting attention area, readable storage medium and terminal equipment |
US11934955B2 (en) | 2019-12-16 | 2024-03-19 | Nvidia Corporation | Neural network based facial analysis using facial landmarks and associated confidence values |
CN112989907A (en) * | 2019-12-16 | 2021-06-18 | 辉达公司 | Neural network based gaze direction determination using spatial models |
CN111626152B (en) * | 2020-05-13 | 2023-05-30 | 闽江学院 | Space-time line-of-sight direction estimation prototype design method based on Few-shot |
CN111626152A (en) * | 2020-05-13 | 2020-09-04 | 闽江学院 | Space-time sight direction estimation prototype design based on Few-shot |
CN112541400A (en) * | 2020-11-20 | 2021-03-23 | 小米科技(武汉)有限公司 | Behavior recognition method and device based on sight estimation, electronic equipment and storage medium |
CN113095274B (en) * | 2021-04-26 | 2024-02-09 | 中山大学 | Sight estimation method, system, device and storage medium |
CN113095274A (en) * | 2021-04-26 | 2021-07-09 | 中山大学 | Sight estimation method, system, device and storage medium |
CN113505694B (en) * | 2021-07-09 | 2024-03-26 | 南开大学 | Man-machine interaction method and device based on sight tracking and computer equipment |
CN113505694A (en) * | 2021-07-09 | 2021-10-15 | 南开大学 | Human-computer interaction method and device based on sight tracking and computer equipment |
CN113627267A (en) * | 2021-07-15 | 2021-11-09 | 中汽创智科技有限公司 | Sight line detection method, device, equipment and medium |
CN113822174A (en) * | 2021-09-02 | 2021-12-21 | 北京的卢深视科技有限公司 | Gaze estimation method, electronic device, and storage medium |
CN113807251A (en) * | 2021-09-17 | 2021-12-17 | 哈尔滨理工大学 | Sight estimation method based on appearance |
CN114546112B (en) * | 2022-02-11 | 2023-10-17 | 清华大学深圳国际研究生院 | Gaze point estimation method, gaze point estimation device and storage medium |
CN114546112A (en) * | 2022-02-11 | 2022-05-27 | 清华大学深圳国际研究生院 | Method, device and storage medium for estimating fixation point |
CN116052261A (en) * | 2022-05-31 | 2023-05-02 | 荣耀终端有限公司 | Sight estimation method and electronic equipment |
CN114967128A (en) * | 2022-06-20 | 2022-08-30 | 深圳市新联优品科技有限公司 | Sight tracking system and method applied to VR glasses |
CN115482574B (en) * | 2022-09-29 | 2023-07-21 | 珠海视熙科技有限公司 | Screen gaze point estimation method, device, medium and equipment based on deep learning |
CN115482574A (en) * | 2022-09-29 | 2022-12-16 | 珠海视熙科技有限公司 | Screen fixation point estimation method, device, medium and equipment based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108171218A (en) | A gaze estimation method based on a deep appearance-based gaze network | |
US20220214747A1 (en) | Interactive Motion-Based Eye Tracking Calibration | |
CN104978548B (en) | A kind of gaze estimation method and device based on three-dimensional active shape model | |
Zhu et al. | Novel eye gaze tracking techniques under natural head movement | |
JP4692526B2 (en) | Gaze direction estimation apparatus, gaze direction estimation method, and program for causing computer to execute gaze direction estimation method | |
CN109690553A (en) | The system and method for executing eye gaze tracking | |
CN108229284A (en) | Eye-controlling focus and training method and device, system, electronic equipment and storage medium | |
Hennessey et al. | Noncontact binocular eye-gaze tracking for point-of-gaze estimation in three dimensions | |
CN106796449A (en) | Eye-controlling focus method and device | |
US20140111630A1 (en) | Systems and methods for iris detection and gaze estimation | |
WO2016014718A1 (en) | Systems, devices, and methods for tracking and compensating for patient motion during a medical imaging scan | |
JP4936491B2 (en) | Gaze direction estimation apparatus, gaze direction estimation method, and program for causing computer to execute gaze direction estimation method | |
EP3154407B1 (en) | A gaze estimation method and apparatus | |
US11181978B2 (en) | System and method for gaze estimation | |
CN109344714A (en) | A gaze estimation method based on key point matching | |
Xiong et al. | Eye control system base on ameliorated hough transform algorithm | |
Al-Rahayfeh et al. | Enhanced frame rate for real-time eye tracking using circular hough transform | |
CN109815913B (en) | Visual enhancement perception system and method based on eye movement information | |
CN104679222A (en) | Medical office system based on human-computer interaction, medical information sharing system and method | |
CN114022514A (en) | Real-time sight line inference method integrating head posture and eyeball tracking | |
Kaminski et al. | Single image face orientation and gaze detection | |
Weidenbacher et al. | Detection of head pose and gaze direction for human-computer interaction | |
Nitschke | Image-based eye pose and reflection analysis for advanced interaction techniques and scene understanding | |
Parada et al. | ExpertEyes: Open-source, high-definition eyetracking | |
Lanillos et al. | A Bayesian hierarchy for robust gaze estimation in human–robot interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20180615 |