CN110458001A

CN110458001A - Convolutional neural network gaze estimation method and system based on an attention mechanism

Info

Publication number: CN110458001A
Application number: CN201910578161.1A
Authority: CN
Inventors: 李菁; 钟艺豪; 陈则金
Original assignee: Nanchang University
Current assignee: Nanchang University
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2019-11-15

Abstract

The invention discloses a convolutional neural network gaze estimation method based on an attention mechanism, comprising the following steps. Step 1: locate the facial key points using Constrained Local Neural Fields. Step 2: crop the eye image using the coordinate points detected in step 1. Step 3: standardize the cropped image. Step 4: feed the standardized image into a convolutional neural network with an attention mechanism for regression, obtaining the estimated gaze angle coordinates. By using an attention mechanism network, the invention makes the features extracted at the high layers come essentially from the position of the pupil, which improves accuracy and reduces error; and because key point detection yields a cropped image of smaller resolution, speed is improved.

Description

A convolutional neural network gaze estimation method and system based on an attention mechanism

Technical field

The present invention relates to the fields of image processing and pattern recognition, and in particular to a convolutional neural network gaze estimation method and system based on an attention mechanism.

Background art

Gaze estimation is a classical problem in computer vision research. Existing methods that estimate gaze from eye images include: (1) the pupil-corneal reflection method; (2) the iris-corneoscleral limbus method; (3) appearance-based methods built on convolutional neural networks.

The main problems of existing methods are: (1) head movement makes gaze estimation inaccurate; (2) the camera must be calibrated and distances in the environment must be measured; (3) professional, expensive hardware is required; (4) accuracy is not high enough.

Summary of the invention

The purpose of the present invention is to provide a convolutional neural network gaze estimation method based on an attention mechanism, so that a person's gaze can be estimated simply, conveniently, and accurately.

To achieve the above purpose, the present invention provides the following technical solution: a convolutional neural network gaze estimation method based on an attention mechanism, comprising the following steps:

Step 1: locate the facial key points using Constrained Local Neural Fields;

Step 2: crop the eye image using the coordinate points detected in step 1;

Step 3: standardize the cropped image;

Step 4: feed the standardized image into the convolutional neural network with an attention mechanism for regression, obtaining the estimated gaze angle coordinates.
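Step 2 above could be sketched as follows. This is only an illustrative sketch: the function names, the row-major list-of-rows image layout, and the `margin` padding fraction are assumptions and do not come from the text, which only states that the eye image is cropped using the detected coordinate points.

```python
def eye_bounding_box(landmarks, image_width, image_height, margin=0.25):
    """Return (left, top, right, bottom) of a box around the eye landmarks.

    `landmarks` is a list of (x, y) pixel coordinates for one eye region.
    `margin` is an assumed padding, given as a fraction of the box size.
    The box is half-open: right and bottom are exclusive.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    pad_x = margin * (max(xs) - min(xs))
    pad_y = margin * (max(ys) - min(ys))
    # Clamp to the image so the crop never indexes outside it.
    left = max(0, int(min(xs) - pad_x))
    top = max(0, int(min(ys) - pad_y))
    right = min(image_width, int(max(xs) + pad_x) + 1)
    bottom = min(image_height, int(max(ys) + pad_y) + 1)
    return left, top, right, bottom


def crop(image, box):
    """Crop a row-major image (a list of rows) to the half-open box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

Because the box size follows the landmarks, the patch size varies with head pose and shooting distance, which is the reason the method standardizes the image in step 3.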

Preferably, the image standardization transforms the image, by an affine transformation, into a standardized camera space in which the distance between the person's head and the camera is the same in all images, and the head pose is also the same.

Preferably, the image standardization comprises three steps:

Step 1: taking the camera coordinate system as the world coordinate system, with the center coordinate of the two eyes e_c and the head pose rotation matrix R known, first rotate the camera until its z-axis points at the center of the two eyes. This step only requires aligning the camera z-axis with e_c, which gives the rotated camera z-axis r_z = e_c/||e_c||;

Step 2: rotate the camera about its z-axis until the camera x-axis and the x-axis of the head pose lie in the same plane. The x-axis of the head pose is a known quantity, namely the first column R_x of the head pose rotation matrix R. For the rotated camera x-axis r_x and R_x to lie in the same plane, the rotated camera y-axis r_y must be perpendicular to that plane. Since r_y is also perpendicular to the rotated camera z-axis r_z, r_y can be obtained as the cross product of R_x and r_z: r_y = R_x × r_z; r_x can then be obtained as the cross product of r_y and r_z: r_x = r_y × r_z. This gives the camera rotation matrix R_c = [r_x, r_y, r_z];

Step 3: standardize the distance from the eye center to the camera center. This can be done by scaling along the camera z-axis, i.e. by defining a scaling matrix S = diag(1, 1, d/||e_c||), where d is the standardized distance from the eye center to the camera center.

Preferably, the attention module of the attention-mechanism convolutional neural network consists of two channels;

The upper one, called the main channel, is composed of CNN modules;

The lower one, called the mask channel, is a bottom-up, top-down hourglass network.

Preferably, for an input image I, denote the output of the main channel by F(I) and the output of the mask channel by A(I); the output M(I) of the attention module is then obtained from the element-wise product of F(I) and A(I): M_c(I) = F_c(I) + F_c(I)·A_c(I);

where F_c(I) denotes the c-th channel of F(I), A_c(I) denotes the c-th channel of A(I), and the symbol · denotes element-wise multiplication of matrices.
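The formula above can be rewritten in factored form, which makes the role of the mask easier to see. The range of the mask values is not stated in the text; the remarks after the equation assume, purely for illustration, that A_c(I) takes values in [0, 1] (for example after a sigmoid).

```latex
M_c(I) \;=\; F_c(I) + F_c(I)\cdot A_c(I) \;=\; \bigl(1 + A_c(I)\bigr)\cdot F_c(I)
```

Under that assumption, a position where A_c(I) = 0 passes the main-channel feature through unchanged, and a position where A_c(I) = 1 doubles it, so the mask can emphasize positions but never removes the main-channel signal.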

Beneficial effects:

(1) The convolutional neural network gaze estimation method and system based on an attention mechanism of the present invention use an attention mechanism network so that the features extracted at the high layers come essentially from the position of the pupil, which improves accuracy and reduces error; and because key point detection yields a cropped image of smaller resolution, speed is improved.

(2) The convolutional neural network gaze estimation method and system based on an attention mechanism of the present invention are accurate (accuracy is improved, with an error as low as 4.8°), objective, convenient (no strict laboratory environment or special equipment is needed, only an ordinary camera or a smartphone), and fast.

Brief description of the drawings

Fig. 1 is a flow diagram of the method of the present invention.

Fig. 2 is a structure diagram of the attention-mechanism convolutional neural network of the present invention.

Fig. 3 is a structure diagram of the attention module of the present invention.

Detailed description of the embodiments

Embodiments of the present invention will be further described below with reference to the accompanying drawings.

As shown in Figs. 1-3, the facial key points are first located using Constrained Local Neural Fields (CLNF).

However, when the eye image is cropped using the coordinate points detected by CLNF, different head poses and different shooting distances produce different image sizes, whereas a convolutional neural network (CNN) usually expects input images of a consistent size. The usual approach is to scale the image to a fixed size, but this distorts the picture; in gaze estimation in particular, such scaling can seriously degrade network performance and introduce bias. To solve this problem, we introduce an image standardization technique: by an affine transformation, the image is transformed into a standardized camera space in which the distance between the person's head and the camera is the same in all images, and the head pose is also the same. Specifically, this is done in the following three steps:

1) Taking the camera coordinate system as the world coordinate system, with the center coordinate of the two eyes e_c and the head pose rotation matrix R known, first rotate the camera until its z-axis points at the center of the two eyes. This step only requires aligning the camera z-axis with e_c, which gives the rotated camera z-axis r_z = e_c/||e_c||.

2) Then rotate the camera about its z-axis until the camera x-axis and the x-axis of the head pose lie in the same plane. The x-axis of the head pose is a known quantity, namely the first column R_x of the head pose rotation matrix R. For the rotated camera x-axis r_x and R_x to lie in the same plane, the rotated camera y-axis r_y must be perpendicular to that plane. Since r_y is also perpendicular to the rotated camera z-axis r_z, r_y can be obtained as the cross product of R_x and r_z:

r_y = R_x × r_z (1)

r_x can then be obtained as the cross product of r_y and r_z:

r_x = r_y × r_z (2)

This gives the camera rotation matrix R_c = [r_x, r_y, r_z].

3) Standardize the distance from the eye center to the camera center. This can be done by scaling along the camera z-axis, i.e. by defining a scaling matrix S = diag(1, 1, d/||e_c||), where d is the standardized distance from the eye center to the camera center; d = 600 mm is used in this application.
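The three steps above can be sketched in plain Python as follows. This is a minimal sketch under assumptions the text does not state: R_x is taken as the first column of R, r_y and r_x are normalized to unit length, and r_x, r_y, r_z are stored as the rows of R_c so that R_c maps camera coordinates into the rotated camera frame. All function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def cross(a, b):
    """Cross product a x b of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matvec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def camera_rotation(e_c, R):
    """Build R_c with rows r_x, r_y, r_z (row layout is an assumption)."""
    R_x = [R[0][0], R[1][0], R[2][0]]   # assumed: first column of R
    r_z = normalize(e_c)                # step 1: z-axis points at the eye center
    # Steps 2: equations (1) and (2); normalization is an added assumption.
    # Degenerate if R_x is parallel to r_z (the cross product is then zero).
    r_y = normalize(cross(R_x, r_z))
    r_x = normalize(cross(r_y, r_z))
    return [r_x, r_y, r_z]


def scaling_matrix(e_c, d=600.0):
    """Step 3: S = diag(1, 1, d / ||e_c||), with d in the same unit as e_c (mm)."""
    norm = math.sqrt(sum(c * c for c in e_c))
    return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, d / norm]]
```

With this layout, applying R_c and then S to e_c places the eye center on the z-axis at distance d, which is a quick way to check the construction.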

Through the above three steps we obtain the camera transformation matrix M = S R_c. In practice, the image to be standardized is warped by an affine transformation matrix that depends on C_r and C_s, where C_r is the real intrinsic matrix of the camera and C_s is the intrinsic matrix of the virtual camera in the standardized space. After standardization, the head pose rotation matrix R and the gaze vector g are both replaced by their counterparts in the standardized space. In addition, the gaze vector can be converted from the three-dimensional Cartesian coordinate system to a spherical coordinate system, which turns the problem of predicting three variables into one of predicting two.
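The text here does not reproduce the formula for the image warp or for the spherical conversion, so the sketch below fills them with conventions that are common in appearance-based gaze estimation, and these are assumptions rather than the patent's own definitions: the warp is taken as W = C_s M C_r^{-1} for a zero-skew pinhole camera, and the gaze vector is mapped to (pitch, yaw) with pitch = asin(-y) and yaw = atan2(-x, -z). All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def intrinsics(fx, fy, cx, cy):
    """Zero-skew pinhole intrinsic matrix (an assumed camera model)."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def inverse_intrinsics(fx, fy, cx, cy):
    """Closed-form inverse of the zero-skew intrinsic matrix above."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def warp_matrix(C_s, M, C_r_inv):
    """Assumed form of the image warp: W = C_s * M * C_r^{-1}."""
    return matmul(matmul(C_s, M), C_r_inv)


def gaze_to_angles(g):
    """Convert a 3D gaze vector to (pitch, yaw) in radians (assumed convention)."""
    x, y, z = g
    norm = math.sqrt(x * x + y * y + z * z)
    pitch = math.asin(-y / norm)
    yaw = math.atan2(-x, -z)
    return pitch, yaw


def angles_to_gaze(pitch, yaw):
    """Inverse of gaze_to_angles: returns a unit gaze vector."""
    return [-math.cos(pitch) * math.sin(yaw),
            -math.sin(pitch),
            -math.cos(pitch) * math.cos(yaw)]
```

Because a gaze direction has unit length, the two angles carry all of its information, which is why the network only needs to regress two values.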

Finally, the standardized image is fed into the attention-mechanism convolutional neural network.

The attention module of this network consists of two channels: the upper one, called the main channel, is composed of popular CNN modules such as the residual module of ResNet; the lower one, called the mask channel, is a bottom-up, top-down hourglass network. For an input image I, denote the output of the main channel by F(I) and the output of the mask channel by A(I); the output M(I) of the attention module is then obtained from the element-wise product of F(I) and A(I):

M_c(I)=F_c(I)+F_c(I)·A_c(I) (3)
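Equation (3) can be sketched in plain Python as follows. The sketch only combines two already computed outputs; the main channel and the hourglass mask channel themselves are not modeled. The function name and the nested-list layout [channel][row][column] are assumptions for illustration.

```python
def attention_output(F, A):
    """Compute M_c = F_c + F_c * A_c element-wise for every channel c.

    F and A are nested lists shaped [channel][row][column] with equal shapes:
    F is the main-channel output and A is the mask-channel output.
    """
    return [[[f + f * a for f, a in zip(f_row, a_row)]
             for f_row, a_row in zip(f_ch, a_ch)]
            for f_ch, a_ch in zip(F, A)]


# One channel, 2x2 feature map: the mask boosts some positions more than others.
features = [[[1.0, 2.0], [3.0, 4.0]]]
mask = [[[0.0, 0.5], [1.0, 0.0]]]
combined = attention_output(features, mask)
```

Positions where the mask is zero keep their main-channel value, while positions with a larger mask value are amplified.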

By stacking such attention modules, a deep attention-mechanism CNN is formed. In the gaze estimation task, the attention modules can thus start looking for the position of the pupil in the image from the bottom layers, continually increasing the weight of that position and reducing the weight of irrelevant positions, so that at the high layers the extracted features come essentially from the position of the pupil. Feeding these features into the classifier then yields high accuracy.

In the experiments, we tested ResNet-50. We changed the size of the convolution kernel in the first convolutional layer from the original 7 × 7 to 5 × 5 to suit our small input images (36 × 224), and replaced the softmax layer at the end with a fully connected layer that regresses the two gaze angles. Since gaze estimation is a task that depends on eye position, we believe that position insensitivity in the network degrades performance. We refer to the attention network based on ResNet-50 as AttentionGazeNet-Res. Loss function:

With this attention-mechanism convolutional neural network, we achieve a gaze estimation error of only 4.8°.

The convolutional neural network gaze estimation method and system based on an attention mechanism of the present invention use an attention mechanism network so that the features extracted at the high layers come essentially from the position of the pupil, which improves accuracy and reduces error. Moreover, because key point detection yields a cropped image of smaller resolution, speed is improved.

The convolutional neural network gaze estimation method and system based on an attention mechanism of the present invention are accurate (accuracy is improved, with an error as low as 4.8°), objective, convenient (no strict laboratory environment or special equipment is needed, only an ordinary camera or a smartphone), and fast.

Specific embodiments of the present invention have been described in detail above, but they are merely examples, and the present invention is not limited to the specific embodiments described above. To those skilled in the art, any equivalent modification of or substitution for the present invention also falls within its scope. Therefore, equivalent changes and modifications made without departing from the spirit and scope of the present invention are all covered by the scope of the present invention.

Claims

1. A convolutional neural network gaze estimation method based on an attention mechanism, characterized by comprising the following steps:

Step 1: locate the facial key points using Constrained Local Neural Fields;

Step 2: crop the eye image using the coordinate points detected in step 1;

Step 3: standardize the cropped image;

Step 4: feed the standardized image into the convolutional neural network with an attention mechanism for regression, obtaining the estimated gaze angle coordinates.

2. The convolutional neural network gaze estimation method based on an attention mechanism according to claim 1, characterized in that:

The image standardization transforms the image, by an affine transformation, into a standardized camera space in which the distance between the person's head and the camera is the same in all images, and the head pose is also the same.

3. The convolutional neural network gaze estimation method based on an attention mechanism according to claim 2, characterized in that:

The image standardization comprises three steps:

Step 1: taking the camera coordinate system as the world coordinate system, with the center coordinate of the two eyes e_c and the head pose rotation matrix R known, first rotate the camera until its z-axis points at the center of the two eyes. This step only requires aligning the camera z-axis with e_c, which gives the rotated camera z-axis r_z = e_c/||e_c||;

Step 2: rotate the camera about its z-axis until the camera x-axis and the x-axis of the head pose lie in the same plane. The x-axis of the head pose is a known quantity, namely the first column R_x of the head pose rotation matrix R. For the rotated camera x-axis r_x and R_x to lie in the same plane, the rotated camera y-axis r_y must be perpendicular to that plane. Since r_y is also perpendicular to the rotated camera z-axis r_z, r_y can be obtained as the cross product of R_x and r_z: r_y = R_x × r_z; r_x can then be obtained as the cross product of r_y and r_z: r_x = r_y × r_z. This gives the camera rotation matrix R_c = [r_x, r_y, r_z];

Step 3: standardize the distance from the eye center to the camera center. This can be done by scaling along the camera z-axis, i.e. by defining a scaling matrix S = diag(1, 1, d/||e_c||), where d is the standardized distance from the eye center to the camera center.

4. The convolutional neural network gaze estimation method based on an attention mechanism according to claim 1, characterized in that:

The attention module of the attention-mechanism convolutional neural network consists of two channels;

The upper one, called the main channel, is composed of CNN modules;

The lower one, called the mask channel, is a bottom-up, top-down hourglass network.

5. The convolutional neural network gaze estimation method based on an attention mechanism according to claim 4, characterized in that:

For an input image I, denote the output of the main channel by F(I) and the output of the mask channel by A(I); the output M(I) of the attention module is then obtained from the element-wise product of F(I) and A(I): M_c(I) = F_c(I) + F_c(I)·A_c(I);

where F_c(I) denotes the c-th channel of F(I), A_c(I) denotes the c-th channel of A(I), and the symbol · denotes element-wise multiplication of matrices.