CN114724211A - Sight line estimation method fusing a parameter-free attention mechanism - Google Patents

Sight line estimation method fusing a parameter-free attention mechanism

Info

Publication number
CN114724211A
CN114724211A
Authority
CN
China
Prior art keywords
normalized
coordinate system
degrees
formula
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210300517.7A
Other languages
Chinese (zh)
Inventor
王鹏
陶文杰
王世龙
苑硕
杨东磊
储智超
陈玥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Institute of Technology
Original Assignee
Changzhou Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Institute of Technology filed Critical Changzhou Institute of Technology
Priority to CN202210300517.7A priority Critical patent/CN114724211A/en
Publication of CN114724211A publication Critical patent/CN114724211A/en
Withdrawn legal-status Critical Current



Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/25 — Fusion techniques
    • G06F18/253 — Fusion techniques of extracted features
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/048 — Activation functions
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods

Abstract

The invention relates to the improvement of a sight line estimation method, in particular to a sight line estimation method fusing a parameter-free attention mechanism. The method comprises the following steps: step one, detecting the face and the facial feature points with a method combining the histogram of oriented gradients with a constrained local model; step two, estimating the head pose based on a PnP method, establishing a normalized coordinate system, and converting the face image into the normalized space. In the face detection step, the method based on the histogram of oriented gradients combined with the constrained local model detects the face and the facial feature points quickly and accurately, provides accurate facial feature points for head pose estimation, saves computing resources, and preserves the running speed of the neural network. After the face image is detected, the head pose of the user is estimated with the PnP algorithm, a normalized coordinate system is established, and the data set is normalized so that sight line estimation can be carried out in the normalized space.

Description

Sight line estimation method fusing a parameter-free attention mechanism
Technical Field
The invention relates to the improvement of a sight line estimation method, in particular to a sight line estimation method fusing a parameter-free attention mechanism.
Background
The direction of a person's gaze often indicates what the person is looking at and is interested in, and reflects the attention behaviour of the agent. Gaze estimation is the process of determining the gaze direction by computer software or hardware. It is widely applied in computer vision, pattern recognition, automatic driving, virtual reality, optics, neurocognitive science and psychology, advertising, marketing and other fields.
Image-based sight line estimation methods fall into two main categories: model-based methods and appearance-based methods. In recent years, with the development of computer hardware and deep learning, appearance-based sight line estimation has gradually become the mainstream. Appearance-based sight line estimation needs no special hardware or auxiliary light sources and works from plane images and two-dimensional data. To capture the high-dimensional features hidden in an image, complex and varied images are used to train a neural network model for sight line estimation. A full-face picture, however, contains many redundant features that are irrelevant to sight line estimation and may interfere with its accuracy; to obtain a better sight line estimation result, the weight of the redundant features should be reduced and the weight of the features that contribute stably to the sight line estimation result should be increased.
The existing way to suppress the weight of irrelevant regions is to use a channel attention mechanism, a spatial attention mechanism, or a combination of the two; both attention mechanisms increase the number of parameters of the neural network to some extent and reduce its computation speed. Moreover, combining the channel attention mechanism with the spatial attention mechanism derives a three-dimensional weight from the one-dimensional weight of the channel attention and the two-dimensional weight of the spatial attention. Compared with a single attention mechanism, this combination improves network performance to some extent, but it does not directly obtain a true three-dimensional weight for the feature map, so the improvement for a sight line estimation network is limited; at the same time the network model becomes more complex, has more parameters, occupies more computing resources, and the real-time performance of sight line estimation suffers. In face detection, most existing methods cascade AdaBoost weak classifiers, where the number of cascaded weak classifiers is difficult to set and training is time-consuming. In facial feature point detection, the existing active shape model and active appearance model need far more iterations when the initial face shape is far from the real face shape, so detection is slow. In head pose estimation, the mainstream random sample consensus algorithm computes an accurate pose but runs slowly, so real-time sight line estimation cannot be achieved.
Disclosure of Invention
The invention aims to provide a sight line estimation method fusing a parameter-free attention mechanism.
In order to achieve this purpose, the invention provides the following technical scheme: a sight line estimation method fusing a parameter-free attention mechanism, comprising the following steps:
step one, detecting the face and the facial feature points by a method combining the histogram of oriented gradients (HOG) with a constrained local model;
acquiring an original picture of the whole face of a person through a camera, detecting the face and the facial feature points by the HOG-based method combined with the constrained local model, and providing accurate facial feature points for head pose estimation;
Step 210, image normalization: the color space of the input image is first standardized by Gamma correction, the Gamma correction formula being:
I'(x, y) = I(x, y)^γ   (1)
where I(x, y) is the original image pixel value, I'(x, y) is the corrected image pixel value, and γ is the Gamma correction coefficient;
Step 220, gradient calculation: the gradient is calculated at each pixel of the image; the gradient describes how quickly the pixel value changes at that point, and the calculation formula is:
Gx(x, y) = I(x+1, y) − I(x−1, y),  Gy(x, y) = I(x, y+1) − I(x, y−1)   (2)
where Gx(x, y) and Gy(x, y) in formula (2) are the gradients of the input image in the horizontal and vertical directions at pixel (x, y); the magnitude and direction of the gradient at that pixel are obtained from these two components, with the calculation formula:
G(x, y) = sqrt(Gx(x, y)^2 + Gy(x, y)^2),  φ(x, y) = arctan(Gy(x, y)/Gx(x, y))   (3)
where G(x, y) in formula (3) is the magnitude of the gradient at pixel (x, y) and φ is the direction of the gradient at pixel (x, y);
Step 230, histogram calculation: the image is divided into 8 × 8 cell units, a gradient histogram is calculated for each cell unit to obtain the descriptor of that cell unit, and the descriptors of the cell units in a block are concatenated to obtain the HOG feature descriptor of that block;
Step 240, block normalization: the L2-norm normalization algorithm is adopted, with the formula:
v'_i = v_i / sqrt(v_1^2 + v_2^2 + … + v_n^2)   (4)
where v'_i in formula (4) is the normalized element and v_i, i = 1, 2, …, n, are the elements of the vector to be normalized;
Step 250, synthesizing the window HOG feature vector: the feature vectors of all blocks in the detection window are concatenated to form the HOG feature vector of the detection window, this feature vector is fed into a pre-trained support vector machine, and whether the picture contains a face is judged;
Step 260, facial feature point detection: the constrained local model is selected; specifically, the facial feature point positions are first initialized for a face looking straight at the camera, and the actual position satisfying the constraint conditions is then searched in the neighborhood of each corresponding feature point, forming the complete distribution of the 68 facial feature points;
detecting the face and the facial feature points with the HOG-based method combined with the constrained local model achieves fast and accurate detection of the face and the facial feature points;
step two, estimating the head pose based on a PnP method, establishing a normalized coordinate system, and converting the face image into the normalized space;
the head pose estimation projects the facial feature points in the picture onto a three-dimensional face model and solves the Euler angles from the transformation relation between the two-dimensional and three-dimensional coordinates to obtain the head pose; seven feature points are selected, namely the inner and outer corners of both eyes, the two mouth corners and the nose tip;
the specific method is as follows: the rotation and translation vectors are calculated from the positions of the input world-coordinate-system points, the facial feature point positions of the picture obtained in Step 260 and the camera parameters, and the Euler angles of the head pose are obtained with the PnP method;
translation matrix: the spatial position relation matrix of the object relative to the camera, denoted by T;
rotation matrix: the spatial attitude relation matrix of the object relative to the camera, denoted by R;
(1) the conversion formula from the world coordinate system to the camera coordinate system is:
[Xc, Yc, Zc]^T = R·[Xw, Yw, Zw]^T + T   (5)
(2) the conversion formula from the camera coordinate system to the pixel coordinate system is:
Zc·[u, v, 1]^T = K·[Xc, Yc, Zc]^T, with K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (6)
(3) the conversion formula from the pixel coordinate system to the world coordinate system combines the two relations above:
[Xw, Yw, Zw]^T = R^(-1)·(Zc·K^(-1)·[u, v, 1]^T − T)   (7)
(4) the conversion formula from the image coordinate system to the pixel coordinate system is:
u = x/dx + cx,  v = y/dy + cy   (8)
where dx and dy are the physical width and height of one pixel;
the Euler angles of the head pose are obtained from the rotation matrix R = (rij) given by the PnP method, with the formula:
α = arctan2(r32, r33),  β = arctan2(−r31, sqrt(r32^2 + r33^2)),  θ = arctan2(r21, r11)   (9)
where α, β and θ are the rotation angles of the head about the three axes x, y and z;
the data set is normalized and a normalized coordinate system Sn is established from the head pose information: the midpoint of the two right-eye feature points is taken as the right eye center, the midpoint of the two left-eye feature points as the left eye center and the midpoint of the two mouth feature points as the mouth center; the direction from the right eye center to the left eye center is the X axis of the head pose coordinate system and the direction perpendicular to the plane formed by the eye centers and the mouth center is the Y axis; data normalization applies an affine transformation to the original input image and converts it into the normalized coordinate system, so that sight line estimation can be carried out in a normalized space with fixed camera parameters and reference point position; given an input image I, the conversion matrix M = S·R is calculated, where R denotes the inverse of the camera rotation matrix and S denotes a scaling matrix defined so that the distance from the eye center to the origin of the normalized coordinate system becomes ds, the value of ds being 600 mm;
the normalized coordinate system Sn has coordinate axes Xn, Yn and Zn and origin On, the head pose coordinate system Sh has coordinate axes Xh, Yh and Zh and origin Oh, and the camera coordinate system Sc has coordinate axes Xc, Yc and Zc and origin Oc; On coincides with Oc and Zn coincides with Zc;
Xn is obtained by projecting Xh onto the plane perpendicular to Zn and normalizing:
Xn = (Xh − (Xh·Zn)·Zn) / ||Xh − (Xh·Zn)·Zn||
Yn is obtained by projecting Yh onto the plane perpendicular to Zn and normalizing:
Yn = (Yh − (Yh·Zn)·Zn) / ||Yh − (Yh·Zn)·Zn||
and the rotation of the normalized coordinate system is R = [Xn, Yn, Zn]^T;
the coordinates of the eye center in the normalized coordinate system are (ex, ey, ez) and the projection of the distance from the eye center to the origin of the normalized coordinate system onto Zn is d = ez, so that
λ = ds / d   (10)
S=diag(1,1,λ) (11)
The transformation matrix M rotates and scales points given in the original camera coordinate system into the normalized coordinate system, and the input image I is affine-transformed into the normalized coordinate system with the image transformation matrix
W = Cs·M·Cr^(-1)
where Cr is the projection matrix corresponding to the input image obtained by camera calibration and Cs is a predefined parameter used to define the camera projection matrix in the normalized space;
Cs = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (12)
where fx and fy in formula (12) are the focal length values of the camera and cx and cy are determined by the length and width of the image;
in the training phase, all training images I with ground-truth annotation vectors g are either normalized in this way or synthesized directly in the training space from ds and Cs, and the ground-truth gaze vector g is likewise normalized to gn = R·g;
in the test phase the test images are normalized in the same way, their corresponding gaze vectors in the normalized space are estimated by a regression function trained in the normalized space, and the actual sight line direction is then obtained by the inverse transformation g = R^(-1)·gn;
step three, building a shallow residual neural network fused with the parameter-free attention mechanism;
in visual processing tasks, neurons exhibiting significant spatial suppression should be given higher weight; the goal of the parameter-free attention mechanism is to find the linear separability between a target neuron exhibiting significant spatial suppression and the other neurons, and on this basis an energy function is defined for each neuron, as shown in formula (13):
e_t(w_t, b_t, y, x_i) = (y_t − t̂)^2 + (1/(M−1))·Σ_{i=1..M−1} (y_o − x̂_i)^2, with t̂ = w_t·t + b_t and x̂_i = w_t·x_i + b_t   (13)
where w_t in formula (13) is a weight, b_t is a bias, t and x_i are the target neuron and the other neurons in a single channel of the input feature, M is the number of neurons in that channel and i is an index over the spatial dimension; binary labels are adopted for y_t and y_o, taking the values 1 and −1 respectively, and a regularization term is added to formula (13) to obtain the final energy function:
e_t(w_t, b_t, y, x_i) = (1/(M−1))·Σ_{i=1..M−1} (−1 − (w_t·x_i + b_t))^2 + (1 − (w_t·t + b_t))^2 + λ·w_t^2   (14)
the fast closed-form solution of w_t and b_t in formula (14) is obtained by the following formulas:
w_t = −2(t − μ_t) / ((t − μ_t)^2 + 2σ_t^2 + 2λ),  b_t = −(t + μ_t)·w_t / 2   (15)
where μ_t = (1/(M−1))·Σ_{i=1..M−1} x_i and σ_t^2 = (1/(M−1))·Σ_{i=1..M−1} (x_i − μ_t)^2 are respectively the mean and variance of all neurons in the channel except the target neuron; the preferred range of λ is 10^-1 to 10^-6; all pixels in a single channel follow the same distribution, so to avoid calculating the mean μ and variance σ^2 separately for every location, the minimum energy is calculated with the following formula:
e_t* = 4(σ̂^2 + λ) / ((t − μ̂)^2 + 2σ̂^2 + 2λ)   (16)
where μ̂ = (1/M)·Σ_{i=1..M} x_i and σ̂^2 = (1/M)·Σ_{i=1..M} (x_i − μ̂)^2 in formula (16); the lower the energy e_t*, the more the neuron t differs from the surrounding neurons and the more important it is for visual processing, so the importance of each neuron can be obtained from 1/e_t*; the final parameter-free attention module formula is:
X̃ = sigmoid(1/E) ⊙ X   (17)
where E in formula (17) groups all the energy values e_t* across the channel and spatial dimensions, ⊙ denotes element-wise multiplication, and adding the sigmoid function does not affect the relative importance of each neuron because the sigmoid is a monotonically increasing function;
the sight line estimation network receives an input normalized face image by calling a convolutional neural network, and the face image is fused with the convolutional neural network without a attention mechanism by the convolutional neural network, and the sight line estimation network specifically comprises the following steps: the normalized face image is translated into 2 through a skip lattice of a first convolution module in sequence and contains 64 convolution kernels with the size of 7 × 7, the size of an obtained feature map is 112 × 112, the feature map is scaled through a skip lattice translation-2 maximum pooling operation with the advantages of translation invariance, rotation invariance and scale invariance after the step of convolution operation to obtain a feature map with the size of 56 × 56, the feature map output in the previous step is sent to layer1, two residual blocks are contained in layer1, the number and the size of 4 convolution layer convolution kernels contained in the two residual blocks are consistent and are 64 convolution kernels with 3 × 3, and the step skip lattice is translated into 1, so that the size of the feature map is unchanged after the operation of layer 1; sending the feature map obtained in the previous step into layer2 containing 2 residual blocks, wherein the jump translation of the first convolution of the first residual block in layer2 is 2, the number and the size of the rest parameter information and the convolution kernels of the rest 3 convolution layers are 128 convolution kernels of 3 × 3, and the size of the feature map obtained after the operation of the step is 28 × 28; sending the feature map obtained in the previous step into layer3 containing 2 residual error blocks, wherein the skip translation of the first convolution of the first residual error block in layer3 is 2, the number and the size of the rest parameter information and the convolution kernels of the rest 3 convolution layers are 256 convolution kernels of 3 × 3, and the size of the feature map obtained after the operation in the step is 14 × 14; then, the characteristic diagram is sent into a convolution layer containing 1 multiplied by 1 to carry out convolution operation; finishing extracting facial feature information by the residual blockThen, the extracted feature information is adjusted into a vector form and then spliced and fused with the head posture information, the head posture information returns to the sight line direction in a normalized space after passing through two full-connection layers, and then the head posture information is inversely normalized and passed
Figure BDA0003565303540000081
The actual gaze direction is obtained by the transformation.
Preferably, when the gradient histogram of a cell unit is calculated, a 9-bin histogram is used to accumulate the gradient information of its 8 × 8 pixels, i.e. the 0°–360° range of gradient directions is divided into 9 direction blocks:
the first directional block range is: 0-20 degrees and 180-200 degrees;
the second directional block range is: 20-40 degrees and 200-220 degrees;
the third directional block range is: 40-60 degrees and 220-240 degrees;
the fourth directional block range is: 60-80 degrees and 240-260 degrees;
the fifth directional block range is: 80-100 degrees and 260-280 degrees;
the sixth directional block range is: 100-120 degrees and 280-300 degrees;
the seventh directional block range is: 120-140 degrees and 300-320 degrees;
the eighth directional block range is: 140-160 degrees and 320-340 degrees;
the ninth directional block range is: 160-180 degrees and 340-360 degrees;
if the gradient direction of a pixel in the cell unit lies in 20–40 degrees or 200–220 degrees, the count of the 2nd bin of the histogram is increased by 1; in this way the gradient direction of every pixel in the cell unit is projected into the histogram as a weighted vote and mapped into the corresponding angle block, giving the gradient direction histogram of the cell unit, i.e. the 9-dimensional feature vector corresponding to that cell unit.
Preferably, the values of λ in formula (15) are all 10^-4.
Compared with the prior art, the sight line estimation method fusing the parameter-free attention mechanism has the following advantages. In the face detection step, the method based on the histogram of oriented gradients combined with the constrained local model detects the face and the facial feature points quickly and accurately, provides accurate facial feature points for head pose estimation, saves computing resources, and preserves the running speed of the neural network. After the face image is detected, the head pose of the user is estimated with the PnP algorithm, a normalized coordinate system is established, and the data set is normalized, so that sight line estimation can be carried out in the normalized space and a more accurate sight line estimation result is obtained. At the same time, the shallow residual network fused with the parameter-free attention mechanism increases the weight of the image regions that contribute stably to sight line estimation without adding extra network parameters, and reduces the weight of the regions that contribute little or nothing to sight line estimation; the real-time performance and the accuracy of sight line estimation are improved, and the method is robust and has strong adaptive capability.
Drawings
FIG. 1 is a flow chart of gaze estimation;
FIG. 2 is a flow chart of face detection;
FIG. 3 is a block diagram showing gradient directions of cell units;
FIG. 4 is a schematic diagram of coordinate system transformation;
FIG. 5 is a construction diagram of a normalized coordinate system;
FIG. 6 is a schematic diagram of the SimAM mechanism;
FIG. 7 is a schematic diagram of the residual network fused with the parameter-free attention mechanism.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the method is specifically implemented, the main process comprises three parts: face detection, head pose estimation and sight line estimation. Face detection is performed on the user in the original picture captured at the PC terminal with the method based on the histogram of oriented gradients, and facial feature point detection is performed on the detected face image with the constrained local model; the head pose of the user is estimated with the PnP algorithm and a normalized coordinate system is established; the parameter-free attention mechanism SimAM is fused into a shallow residual network to improve sight line estimation accuracy without adding extra network parameters. The specific flow is shown in fig. 1.
1. Detecting the face and the facial feature points with the histogram of oriented gradients combined with a constrained local model
First, an original picture of the person's whole face is captured by a camera. When the complete face or the eyes of the user cannot be extracted from the original picture, the direction of the user's sight line cannot be predicted from the picture on geometric grounds, so this step is the basis of all subsequent steps. In this step the HOG-based method combined with the constrained local model is used for face and facial feature point detection, so that the face and the facial feature points can be detected quickly and accurately with a small amount of computing resources, and accurate facial feature points are provided for head pose estimation, from which a more accurate head pose can be obtained. The face detection flow chart is shown in fig. 2.
Step 210, image normalization: in order to reduce the influence of illumination and of local shadows and illumination changes in the image, the invention first standardizes the color space of the input image with Gamma correction; the Gamma correction formula is:
I'(x, y) = I(x, y)^γ   (1)
where I(x, y) is the original image pixel value, I'(x, y) is the corrected image pixel value, and γ is the Gamma correction coefficient.
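As an illustration of formula (1), the following is a minimal NumPy sketch of the Gamma correction step; the choice γ = 0.5 and the 8-bit input range are assumptions made for the example, not values fixed by the invention.

```python
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Apply formula (1): I'(x, y) = I(x, y)^gamma.

    The image is assumed to be 8-bit grayscale; it is scaled to [0, 1] before
    the power law and scaled back afterwards. gamma = 0.5 is illustrative.
    """
    normalized = image.astype(np.float64) / 255.0
    corrected = np.power(normalized, gamma)
    return (corrected * 255.0).astype(np.uint8)
```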
Step 220, gradient calculation: the gradient is calculated at each pixel of the image; the gradient describes how quickly the pixel value changes at that point, and the calculation formula is:
Gx(x, y) = I(x+1, y) − I(x−1, y),  Gy(x, y) = I(x, y+1) − I(x, y−1)   (2)
where Gx(x, y) and Gy(x, y) in formula (2) are the gradients of the input image in the horizontal and vertical directions at pixel (x, y); the magnitude and direction of the gradient at that pixel are obtained from these two components, with the calculation formula:
G(x, y) = sqrt(Gx(x, y)^2 + Gy(x, y)^2),  φ(x, y) = arctan(Gy(x, y)/Gx(x, y))   (3)
where G(x, y) in formula (3) is the magnitude of the gradient at pixel (x, y) and φ is the direction of the gradient at pixel (x, y).
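A small sketch of formulas (2) and (3) using central differences; the border handling and the 0–360° wrapping of the orientation are implementation choices for the example, not requirements of the invention.

```python
import numpy as np

def gradient_magnitude_orientation(image: np.ndarray):
    """Compute Gx, Gy by central differences, then the magnitude G(x, y) and
    the direction phi(x, y) in degrees (0-360), as in formulas (2) and (3)."""
    img = image.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal gradient Gx(x, y)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical gradient Gy(x, y)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    orientation = np.degrees(np.arctan2(gy, gx)) % 360.0
    return magnitude, orientation
```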
Step 230, histogram calculation: the image is divided into cell units, for example of 8 × 8 pixels, and a gradient histogram is calculated for each cell unit to obtain the descriptor of that cell unit; cell units are grouped into blocks, for example 2 × 2 cell units per block, and the descriptors of the cell units in a block are concatenated to obtain the HOG feature descriptor of that block.
When the gradient histogram of a cell unit is calculated, a 9-bin histogram is used to accumulate the gradient information of the 8 × 8 pixels, i.e. the 0°–360° range of gradient directions is divided into 9 direction blocks, as shown in fig. 3:
the first directional block range is: 0-20 degrees and 180-200 degrees;
the second directional block range is: 20-40 degrees and 200-220 degrees;
the third directional block range is: 40-60 degrees and 220-240 degrees;
the fourth directional block range is: 60-80 degrees and 240-260 degrees;
the fifth directional block range is: 80-100 degrees and 260-280 degrees;
the sixth directional block range is: 100-120 degrees and 280-300 degrees;
the seventh directional block range is: 120-140 degrees and 300-320 degrees;
the eighth directional block range is: 140-160 degrees and 320-340 degrees;
the ninth directional block range is: 160-180 degrees and 340-360 degrees.
If the gradient direction of a pixel in the cell unit lies in 20–40 degrees or 200–220 degrees, the count of the 2nd bin of the histogram is increased by 1; in this way the gradient direction of every pixel in the cell unit is projected into the histogram as a weighted vote and mapped into the corresponding angle block, giving the gradient direction histogram of the cell unit, i.e. the 9-dimensional feature vector corresponding to that cell unit.
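The mapping of a gradient direction into one of the 9 direction blocks can be sketched as below; the magnitude-weighted hard assignment is one possible reading of the weighted projection described above.

```python
import numpy as np

def direction_bin(angle_deg: float) -> int:
    """Angles theta and theta + 180 deg share a block, so the direction is
    folded to [0, 180) and split into 20-degree bins; e.g. 30 deg and 210 deg
    both fall into the 2nd block (index 1)."""
    return int((angle_deg % 180.0) // 20.0)   # 0..8 -> direction blocks 1..9

def cell_histogram(magnitude: np.ndarray, orientation: np.ndarray) -> np.ndarray:
    """9-bin histogram of one 8x8 cell unit, each pixel voting with a weight
    equal to its gradient magnitude."""
    hist = np.zeros(9)
    for mag, ang in zip(magnitude.ravel(), orientation.ravel()):
        hist[direction_bin(ang)] += mag
    return hist
```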
Step 240, block normalization: the invention adopts the L2-norm normalization algorithm, with the formula:
v'_i = v_i / sqrt(v_1^2 + v_2^2 + … + v_n^2)   (4)
where v'_i in formula (4) is the normalized element and v_i, i = 1, 2, …, n, are the elements of the vector to be normalized.
Step 250, synthesizing the window HOG feature vector: the feature vectors of all blocks of the detection window are concatenated to form the HOG feature vector of the detection window, this feature vector is fed into a pre-trained support vector machine, and whether the picture contains a face is judged.
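A compact sketch of Steps 230–250 using off-the-shelf components: skimage's HOG descriptor (8 × 8 cells, 2 × 2 cells per block, 9 bins, L2 block normalization) and a linear SVM from scikit-learn; the variable names for the training crops and the fixed window size are assumptions for the example.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def window_hog(window: np.ndarray) -> np.ndarray:
    """HOG feature vector of one grayscale detection window (all block
    descriptors concatenated), matching Steps 230-250 above."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2', feature_vector=True)

def train_face_classifier(face_windows, background_windows) -> LinearSVC:
    """Train the support vector machine on positive (face) and negative
    (background) windows; both arguments are lists of fixed-size crops."""
    features = [window_hog(w) for w in list(face_windows) + list(background_windows)]
    labels = [1] * len(face_windows) + [0] * len(background_windows)
    return LinearSVC().fit(np.array(features), labels)

def window_contains_face(classifier: LinearSVC, window: np.ndarray) -> bool:
    # Positive decision value -> the window is classified as a face.
    return classifier.decision_function([window_hog(window)])[0] > 0
```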
Step 260, facial feature point detection: the constrained local model is selected because it combines the advantages of the active shape model and the active appearance model and detects facial feature points with high accuracy and robustness. The specific method is: the facial feature point positions are initialized for a face looking straight at the camera, and the actual position satisfying the constraint conditions is then searched in the neighborhood of each corresponding feature point, forming the complete distribution of the 68 facial feature points.
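For the feature-point stage, the sketch below uses dlib as a stand-in: its face detector is HOG+SVM based, but its 68-point shape predictor is a regression-tree model rather than the constrained local model described here; it does, however, produce the same 68-point layout. The model file name is the one dlib distributes and is assumed to be available locally.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()  # HOG + linear SVM face detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(gray: np.ndarray) -> np.ndarray:
    """Return a (68, 2) array of facial feature points for the first detected
    face, or an empty (0, 2) array when no face is found."""
    faces = detector(gray, 1)
    if len(faces) == 0:
        return np.empty((0, 2))
    shape = predictor(gray, faces[0])
    return np.array([[shape.part(i).x, shape.part(i).y]
                     for i in range(shape.num_parts)], dtype=np.float64)
```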
The method based on the histogram of oriented gradients combined with the constrained local model detects the face and the facial feature points quickly and accurately, lays a good foundation for head pose estimation, saves computing resources, and preserves the real-time performance of the sight line estimation network.
2. Head pose is estimated based on a PnP method, a normalized coordinate system is established, and a face image is converted into a normalized space
The pose of an object relative to the camera can be represented by a rotation matrix and a translation matrix. As shown in fig. 4, the head pose estimation of the invention projects the facial feature points in the picture onto a three-dimensional face model and solves the Euler angles from the transformation relation between the two-dimensional and three-dimensional coordinates to obtain the head pose. The invention selects 7 feature points, namely the inner and outer corners of both eyes, the two mouth corners and the nose tip.
The specific method is as follows: the rotation and translation vectors are calculated from the positions of the input world-coordinate-system points, the facial feature point positions of the picture obtained in Step 260 and the camera parameters, and the Euler angles of the head pose are obtained with the PnP method.
Translation matrix: the spatial position relation matrix of the object relative to the camera, denoted by T;
Rotation matrix: the spatial attitude relation matrix of the object relative to the camera, denoted by R;
(1) the conversion formula from the world coordinate system to the camera coordinate system is:
[Xc, Yc, Zc]^T = R·[Xw, Yw, Zw]^T + T   (5)
(2) the conversion formula from the camera coordinate system to the pixel coordinate system is:
Zc·[u, v, 1]^T = K·[Xc, Yc, Zc]^T, with K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (6)
(3) the conversion formula from the pixel coordinate system to the world coordinate system combines the two relations above:
[Xw, Yw, Zw]^T = R^(-1)·(Zc·K^(-1)·[u, v, 1]^T − T)   (7)
(4) the conversion formula from the image coordinate system to the pixel coordinate system is:
u = x/dx + cx,  v = y/dy + cy   (8)
where dx and dy are the physical width and height of one pixel.
The Euler angles of the head pose are obtained from the rotation matrix R = (rij) given by the PnP method, with the formula:
α = arctan2(r32, r33),  β = arctan2(−r31, sqrt(r32^2 + r33^2)),  θ = arctan2(r21, r11)   (9)
where α, β and θ are the rotation angles of the head about the three axes x, y and z.
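A sketch of this PnP step with OpenCV: the seven 2-D feature points (eye corners, mouth corners, nose tip) are matched against a generic 3-D head model and the Euler angles of formula (9) are read off the resulting rotation matrix. The 3-D model coordinates below are illustrative placeholders, not values taken from the invention.

```python
import cv2
import numpy as np

# Illustrative generic-head coordinates (mm) for the 7 selected feature points.
MODEL_POINTS_3D = np.array([
    [-45.0,  35.0, -25.0],   # right eye, outer corner
    [-20.0,  35.0, -20.0],   # right eye, inner corner
    [ 20.0,  35.0, -20.0],   # left eye, inner corner
    [ 45.0,  35.0, -25.0],   # left eye, outer corner
    [-25.0, -30.0, -20.0],   # right mouth corner
    [ 25.0, -30.0, -20.0],   # left mouth corner
    [  0.0,   0.0,   0.0],   # nose tip
], dtype=np.float64)

def head_pose_euler(image_points_2d: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float):
    """Solve PnP for the 7 correspondences and return (alpha, beta, theta) in
    degrees together with the rotation matrix R and translation vector T."""
    camera_matrix = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS_3D, image_points_2d,
                                  camera_matrix, np.zeros(4),
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)
    alpha = np.degrees(np.arctan2(R[2, 1], R[2, 2]))                      # about x
    beta = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2])))   # about y
    theta = np.degrees(np.arctan2(R[1, 0], R[0, 0]))                      # about z
    return (alpha, beta, theta), R, tvec
```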
The data set is normalized and a normalized coordinate system Sn is established from the head pose information: the midpoint of the two right-eye feature points is taken as the right eye center, the midpoint of the two left-eye feature points as the left eye center and the midpoint of the two mouth feature points as the mouth center; the direction from the right eye center to the left eye center is the X axis of the head pose coordinate system and the direction perpendicular to the plane formed by the eye centers and the mouth center is the Y axis. Data normalization applies an affine transformation to the original input image and converts it into the normalized coordinate system, so that sight line estimation can be carried out in a normalized space with fixed camera parameters and reference point position. Given an input image I, the conversion matrix M = S·R is calculated, where R denotes the inverse of the camera rotation matrix and S denotes a scaling matrix defined so that the distance from the eye center to the origin of the normalized coordinate system becomes ds; the value of ds is 600 mm.
The normalized coordinate system Sn has coordinate axes Xn, Yn and Zn and origin On, the head pose coordinate system Sh has coordinate axes Xh, Yh and Zh and origin Oh, and the camera coordinate system Sc has coordinate axes Xc, Yc and Zc and origin Oc; On coincides with Oc and Zn coincides with Zc. The construction of the normalized coordinate system is shown in FIG. 5.
Xn is obtained by projecting Xh onto the plane perpendicular to Zn and normalizing:
Xn = (Xh − (Xh·Zn)·Zn) / ||Xh − (Xh·Zn)·Zn||
Yn is obtained by projecting Yh onto the plane perpendicular to Zn and normalizing:
Yn = (Yh − (Yh·Zn)·Zn) / ||Yh − (Yh·Zn)·Zn||
and the rotation of the normalized coordinate system is R = [Xn, Yn, Zn]^T.
The coordinates of the eye center in the normalized coordinate system are (ex, ey, ez) and the projection of the distance from the eye center to the origin of the normalized coordinate system onto Zn is d = ez, so that
λ = ds / d   (10)
S=diag(1,1,λ) (11)
The transformation matrix M rotates and scales points given in the original camera coordinate system into the normalized coordinate system, and the input image I is affine-transformed into the normalized coordinate system with the image transformation matrix
W = Cs·M·Cr^(-1)
where Cr is the projection matrix corresponding to the input image obtained by camera calibration and Cs is a predefined parameter used to define the camera projection matrix in the normalized space.
Cs = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (12)
where fx and fy in formula (12) are the focal length values of the camera and cx and cy are determined by the length and width of the image.
In the training phase, all training images I with ground-truth annotation vectors g are either normalized in this way or synthesized directly in the training space from ds and Cs, and the ground-truth gaze vector g is likewise normalized to gn = R·g. In the test phase the test images are normalized in the same way, their corresponding gaze vectors in the normalized space are estimated by a regression function trained in the normalized space, and the actual sight line direction is then obtained by the inverse transformation g = R^(-1)·gn.
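The normalization of formulas (10)–(12) can be sketched as follows. The normalized axes are built with the usual cross-product construction (equivalent to projecting the head X axis onto the plane perpendicular to Zn), the warp uses W = Cs·M·Cr^(-1) as above, and the gaze vector is mapped back by the rotation only; the output image size is an assumption for the example.

```python
import cv2
import numpy as np

def normalize_face(image, R_head, eye_center, C_r, C_s,
                   d_s=600.0, out_size=(224, 224)):
    """Build M = S*R and warp the input image into the normalized space.

    R_head     : 3x3 head rotation matrix from the PnP step
    eye_center : 3-vector, eye center in camera coordinates (mm)
    C_r, C_s   : real and normalized camera projection matrices
    """
    z_n = eye_center / np.linalg.norm(eye_center)            # Zn: camera -> eye center
    x_h = R_head[:, 0]                                        # head-pose X axis
    y_n = np.cross(z_n, x_h); y_n /= np.linalg.norm(y_n)      # Yn, perpendicular to Zn
    x_n = np.cross(y_n, z_n); x_n /= np.linalg.norm(x_n)      # Xn
    R = np.stack([x_n, y_n, z_n])                             # rotation into Sn
    S = np.diag([1.0, 1.0, d_s / np.linalg.norm(eye_center)]) # scaling, formulas (10)-(11)
    M = S @ R
    W = C_s @ M @ np.linalg.inv(C_r)                          # image warp matrix
    return cv2.warpPerspective(image, W, out_size), M, R

def denormalize_gaze(g_n: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Rotate a unit gaze direction estimated in the normalized space back
    into the camera coordinate system."""
    g = R.T @ g_n
    return g / np.linalg.norm(g)
```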
3. Building a shallow residual neural network fused with the parameter-free attention mechanism
The invention builds a shallow residual network fused with the parameter-free attention mechanism on the basis of the residual network.
In visual processing tasks, neurons exhibiting significant spatial suppression should be given higher weight; the goal of the parameter-free attention mechanism is to find the linear separability between a target neuron exhibiting significant spatial suppression and the other neurons, and on this basis an energy function is defined for each neuron, as shown in formula (13):
e_t(w_t, b_t, y, x_i) = (y_t − t̂)^2 + (1/(M−1))·Σ_{i=1..M−1} (y_o − x̂_i)^2, with t̂ = w_t·t + b_t and x̂_i = w_t·x_i + b_t   (13)
where w_t in formula (13) is a weight, b_t is a bias, t and x_i are the target neuron and the other neurons in a single channel of the input feature, M is the number of neurons in that channel and i is an index over the spatial dimension; binary labels are adopted for y_t and y_o, taking the values 1 and −1 respectively, and a regularization term is added to formula (13) to obtain the final energy function:
e_t(w_t, b_t, y, x_i) = (1/(M−1))·Σ_{i=1..M−1} (−1 − (w_t·x_i + b_t))^2 + (1 − (w_t·t + b_t))^2 + λ·w_t^2   (14)
The fast closed-form solution of w_t and b_t in formula (14) is obtained by the following formulas:
w_t = −2(t − μ_t) / ((t − μ_t)^2 + 2σ_t^2 + 2λ),  b_t = −(t + μ_t)·w_t / 2   (15)
where μ_t = (1/(M−1))·Σ_{i=1..M−1} x_i and σ_t^2 = (1/(M−1))·Σ_{i=1..M−1} (x_i − μ_t)^2 are respectively the mean and variance of all neurons in the channel except the target neuron. The preferred range of λ is 10^-1 to 10^-6; in the present invention the values of λ are all 10^-4. All pixels in a single channel follow the same distribution, so to avoid calculating the mean μ and variance σ^2 separately for every location, the minimum energy is calculated with the following formula:
e_t* = 4(σ̂^2 + λ) / ((t − μ̂)^2 + 2σ̂^2 + 2λ)   (16)
where μ̂ = (1/M)·Σ_{i=1..M} x_i and σ̂^2 = (1/M)·Σ_{i=1..M} (x_i − μ̂)^2 in formula (16). The lower the energy e_t*, the more the neuron t differs from the surrounding neurons and the more important it is for visual processing, so the importance of each neuron can be obtained from 1/e_t*; the final parameter-free attention module formula is:
X̃ = sigmoid(1/E) ⊙ X   (17)
where E in formula (17) groups all the energy values e_t* across the channel and spatial dimensions, ⊙ denotes element-wise multiplication, and adding the sigmoid function does not affect the relative importance of each neuron because the sigmoid is a monotonically increasing function.
The parameter-free attention module of the invention is added after the second convolution layer in each residual block of the residual network, as shown in fig. 6.
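A PyTorch sketch of this parameter-free attention module following formulas (16) and (17); the per-channel mean and variance are computed once over the spatial dimensions, as described above, and λ = 10^-4.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: weight each neuron by sigmoid(1 / e_t*) with
    e_t* from formula (16); the module adds no learnable parameters."""
    def __init__(self, lam: float = 1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # (x - mean)^2
        v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance
        e_inv = d / (4 * (v + self.lam)) + 0.5               # proportional to 1 / e_t*
        return x * torch.sigmoid(e_inv)
```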
The sight line estimation network receives the input normalized face image and processes it with a convolutional neural network fused with the parameter-free attention mechanism, specifically as follows. The normalized face image first passes through a first convolution module with a stride of 2 containing 64 convolution kernels of size 7 × 7, giving a feature map of size 112 × 112; after this convolution step the feature map is scaled by a stride-2 max-pooling operation, which has the advantages of translation, rotation and scale invariance, giving a feature map of size 56 × 56. The feature map output by the previous step is fed into layer1, which contains two residual blocks; the four convolution layers in these two residual blocks all use 64 convolution kernels of size 3 × 3 with a stride of 1, so the size of the feature map is unchanged after layer1. The feature map obtained in the previous step is fed into layer2, which contains 2 residual blocks; the first convolution of the first residual block in layer2 has a stride of 2, the other parameters are unchanged, and the convolution layers use 128 convolution kernels of size 3 × 3, giving a feature map of size 28 × 28. The feature map obtained in the previous step is fed into layer3, which contains 2 residual blocks; the first convolution of the first residual block in layer3 has a stride of 2, the other parameters are unchanged, and the convolution layers use 256 convolution kernels of size 3 × 3, giving a feature map of size 14 × 14. The feature map is then fed into a 1 × 1 convolution layer for a further convolution operation. After the residual blocks have finished extracting the facial feature information, the extracted feature information is reshaped into a vector, concatenated and fused with the head pose information, and regressed to the sight line direction in the normalized space through two fully connected layers; the actual sight line direction is then obtained by the inverse normalization g = R^(-1)·gn. The residual network fused with the parameter-free attention mechanism is shown in fig. 7.
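The shallow residual network described above can be sketched in PyTorch as follows, with the SimAM module placed after the second convolution of each residual block. The RGB input, the 3-dimensional head-pose vector, the 128-unit hidden layer and the 2-dimensional output (pitch and yaw in the normalized space) are assumptions for the example, not values fixed by the invention; a 224 × 224 normalized face crop and a pose vector would be passed as GazeNet()(face_batch, pose_batch).

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention module (see the previous sketch)."""
    def __init__(self, lam: float = 1e-4):
        super().__init__()
        self.lam = lam
    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        return x * torch.sigmoid(d / (4 * (v + self.lam)) + 0.5)

class ResidualBlock(nn.Module):
    """3x3-3x3 residual block with SimAM after the second convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.simam = SimAM()
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.simam(self.bn2(self.conv2(out)))
        return self.relu(out + self.shortcut(x))

class GazeNet(nn.Module):
    """7x7 stride-2 stem, three stages of two residual blocks (64/128/256
    channels), a 1x1 convolution, global pooling, concatenation with the head
    pose, and two fully connected layers regressing the gaze direction."""
    def __init__(self, pose_dim=3, hidden=128, out_dim=2):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1))              # 224 -> 112 -> 56
        self.layer1 = nn.Sequential(ResidualBlock(64, 64), ResidualBlock(64, 64))
        self.layer2 = nn.Sequential(ResidualBlock(64, 128, 2), ResidualBlock(128, 128))
        self.layer3 = nn.Sequential(ResidualBlock(128, 256, 2), ResidualBlock(256, 256))
        self.conv1x1 = nn.Conv2d(256, 256, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(256 + pose_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, face, head_pose):
        x = self.layer3(self.layer2(self.layer1(self.stem(face))))
        x = self.pool(self.conv1x1(x)).flatten(1)
        x = torch.cat([x, head_pose], dim=1)
        return self.fc2(torch.relu(self.fc1(x)))
```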
A method based on the histogram of oriented gradients combined with a constrained local model is used for face detection and facial feature point detection, the PnP method is used for head pose estimation, and a parameter-free attention mechanism is fused into the sight line estimation network; the three-dimensional weight of the image regions that contribute stably to sight line estimation is increased without adding extra network parameters and while occupying fewer computing resources, and the weight of the regions that contribute little or nothing to sight line estimation is reduced. The sight line estimation method fused with the parameter-free attention mechanism improves the accuracy of sight line estimation and has strong robustness and adaptive capability.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (3)

1. A sight line estimation method fusing a parameter-free attention mechanism, characterized by comprising the following steps:
step one, detecting the face and the facial feature points by a method combining the histogram of oriented gradients (HOG) with a constrained local model;
acquiring an original picture of the whole face of a person through a camera, detecting the face and the facial feature points by the HOG-based method combined with the constrained local model, and providing accurate facial feature points for head pose estimation;
Step 210, image normalization: the color space of the input image is first standardized by Gamma correction, the Gamma correction formula being:
I'(x, y) = I(x, y)^γ   (1)
where I(x, y) is the original image pixel value, I'(x, y) is the corrected image pixel value, and γ is the Gamma correction coefficient;
Step 220, gradient calculation: the gradient is calculated at each pixel of the image; the gradient describes how quickly the pixel value changes at that point, and the calculation formula is:
Gx(x, y) = I(x+1, y) − I(x−1, y),  Gy(x, y) = I(x, y+1) − I(x, y−1)   (2)
where Gx(x, y) and Gy(x, y) in formula (2) are the gradients of the input image in the horizontal and vertical directions at pixel (x, y); the magnitude and direction of the gradient at that pixel are obtained from these two components, with the calculation formula:
G(x, y) = sqrt(Gx(x, y)^2 + Gy(x, y)^2),  φ(x, y) = arctan(Gy(x, y)/Gx(x, y))   (3)
where G(x, y) in formula (3) is the magnitude of the gradient at pixel (x, y) and φ is the direction of the gradient at pixel (x, y);
Step 230, histogram calculation: the image is divided into 8 × 8 cell units, a gradient histogram is calculated for each cell unit to obtain the descriptor of that cell unit, and the descriptors of the cell units in a block are concatenated to obtain the HOG feature descriptor of that block;
Step 240, block normalization: the L2-norm normalization algorithm is adopted, with the formula:
v'_i = v_i / sqrt(v_1^2 + v_2^2 + … + v_n^2)   (4)
where v'_i in formula (4) is the normalized element and v_i, i = 1, 2, …, n, are the elements of the vector to be normalized;
Step 250, synthesizing the window HOG feature vector: the feature vectors of all blocks in the detection window are concatenated to form the HOG feature vector of the detection window, this feature vector is fed into a pre-trained support vector machine, and whether the picture contains a face is judged;
Step 260, facial feature point detection: the constrained local model is selected; specifically, the facial feature point positions are first initialized for a face looking straight at the camera, and the actual position satisfying the constraint conditions is then searched in the neighborhood of each corresponding feature point, forming the complete distribution of the 68 facial feature points;
detecting the face and the facial feature points with the HOG-based method combined with the constrained local model achieves fast and accurate detection of the face and the facial feature points;
step two, estimating the head pose based on a PnP method, establishing a normalized coordinate system, and converting the face image into the normalized space;
the head pose estimation projects the facial feature points in the picture onto a three-dimensional face model and solves the Euler angles from the transformation relation between the two-dimensional and three-dimensional coordinates to obtain the head pose; seven feature points are selected, namely the inner and outer corners of both eyes, the two mouth corners and the nose tip;
the specific method is as follows: the rotation and translation vectors are calculated from the positions of the input world-coordinate-system points, the facial feature point positions of the picture obtained in Step 260 and the camera parameters, and the Euler angles of the head pose are obtained with the PnP method;
translation matrix: the spatial position relation matrix of the object relative to the camera, denoted by T;
rotation matrix: the spatial attitude relation matrix of the object relative to the camera, denoted by R;
(1) the conversion formula from the world coordinate system to the camera coordinate system is:
[Xc, Yc, Zc]^T = R·[Xw, Yw, Zw]^T + T   (5)
(2) the conversion formula from the camera coordinate system to the pixel coordinate system is:
Zc·[u, v, 1]^T = K·[Xc, Yc, Zc]^T, with K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (6)
(3) the conversion formula from the pixel coordinate system to the world coordinate system combines the two relations above:
[Xw, Yw, Zw]^T = R^(-1)·(Zc·K^(-1)·[u, v, 1]^T − T)   (7)
(4) the conversion formula from the image coordinate system to the pixel coordinate system is:
u = x/dx + cx,  v = y/dy + cy   (8)
where dx and dy are the physical width and height of one pixel;
the Euler angles of the head pose are obtained from the rotation matrix R = (rij) given by the PnP method, with the formula:
α = arctan2(r32, r33),  β = arctan2(−r31, sqrt(r32^2 + r33^2)),  θ = arctan2(r21, r11)   (9)
where α, β and θ are the rotation angles of the head about the three axes x, y and z;
the data set is normalized and a normalized coordinate system Sn is established from the head pose information: the midpoint of the two right-eye feature points is taken as the right eye center, the midpoint of the two left-eye feature points as the left eye center and the midpoint of the two mouth feature points as the mouth center; the direction from the right eye center to the left eye center is the X axis of the head pose coordinate system and the direction perpendicular to the plane formed by the eye centers and the mouth center is the Y axis; data normalization applies an affine transformation to the original input image and converts it into the normalized coordinate system, so that sight line estimation can be carried out in a normalized space with fixed camera parameters and reference point position; given an input image I, the conversion matrix M = S·R is calculated, where R denotes the inverse of the camera rotation matrix and S denotes a scaling matrix defined so that the distance from the eye center to the origin of the normalized coordinate system becomes ds, the value of ds being 600 mm;
the normalized coordinate system Sn has coordinate axes Xn, Yn and Zn and origin On, the head pose coordinate system Sh has coordinate axes Xh, Yh and Zh and origin Oh, and the camera coordinate system Sc has coordinate axes Xc, Yc and Zc and origin Oc; On coincides with Oc and Zn coincides with Zc;
Xn is obtained by projecting Xh onto the plane perpendicular to Zn and normalizing:
Xn = (Xh − (Xh·Zn)·Zn) / ||Xh − (Xh·Zn)·Zn||
Yn is obtained by projecting Yh onto the plane perpendicular to Zn and normalizing:
Yn = (Yh − (Yh·Zn)·Zn) / ||Yh − (Yh·Zn)·Zn||
and the rotation of the normalized coordinate system is R = [Xn, Yn, Zn]^T;
the coordinates of the eye center in the normalized coordinate system are (ex, ey, ez) and the projection of the distance from the eye center to the origin of the normalized coordinate system onto Zn is d = ez, so that
λ = ds / d   (10)
S=diag(1,1,λ) (11)
The transformation matrix M rotates and scales points given in the original camera coordinate system into the normalized coordinate system, and the input image I is affine-transformed into the normalized coordinate system with the image transformation matrix
W = Cs·M·Cr^(-1)
where Cr is the projection matrix corresponding to the input image obtained by camera calibration and Cs is a predefined parameter used to define the camera projection matrix in the normalized space;
Cs = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (12)
where fx and fy in formula (12) are the focal length values of the camera and cx and cy are determined by the length and width of the image;
in the training phase, all training images I with ground-truth annotation vectors g are either normalized in this way or synthesized directly in the training space from ds and Cs, and the ground-truth gaze vector g is likewise normalized to gn = R·g;
in the test phase the test images are normalized in the same way, their corresponding gaze vectors in the normalized space are estimated by a regression function trained in the normalized space, and the actual sight line direction is then obtained by the inverse transformation g = R^(-1)·gn;
step three, building a shallow residual neural network fused with the parameter-free attention mechanism;
in visual processing tasks, neurons exhibiting significant spatial suppression should be given higher weight; the goal of the parameter-free attention mechanism is to find the linear separability between a target neuron exhibiting significant spatial suppression and the other neurons, and on this basis an energy function is defined for each neuron, as shown in formula (13):
e_t(w_t, b_t, y, x_i) = (y_t − t̂)^2 + (1/(M−1))·Σ_{i=1..M−1} (y_o − x̂_i)^2, with t̂ = w_t·t + b_t and x̂_i = w_t·x_i + b_t   (13)
where w_t in formula (13) is a weight, b_t is a bias, t and x_i are the target neuron and the other neurons in a single channel of the input feature, M is the number of neurons in that channel and i is an index over the spatial dimension; binary labels are adopted for y_t and y_o, taking the values 1 and −1 respectively, and a regularization term is added to formula (13) to obtain the final energy function:
e_t(w_t, b_t, y, x_i) = (1/(M−1))·Σ_{i=1..M−1} (−1 − (w_t·x_i + b_t))^2 + (1 − (w_t·t + b_t))^2 + λ·w_t^2   (14)
the fast closed-form solution of w_t and b_t in formula (14) is obtained by the following formulas:
w_t = −2(t − μ_t) / ((t − μ_t)^2 + 2σ_t^2 + 2λ),  b_t = −(t + μ_t)·w_t / 2   (15)
where μ_t = (1/(M−1))·Σ_{i=1..M−1} x_i and σ_t^2 = (1/(M−1))·Σ_{i=1..M−1} (x_i − μ_t)^2 are respectively the mean and variance of all neurons in the channel except the target neuron; the preferred range of λ is 10^-1 to 10^-6; all pixels in a single channel follow the same distribution, so to avoid calculating the mean μ and variance σ^2 separately for every location, the minimum energy is calculated with the following formula:
e_t* = 4(σ̂^2 + λ) / ((t − μ̂)^2 + 2σ̂^2 + 2λ)   (16)
where μ̂ = (1/M)·Σ_{i=1..M} x_i and σ̂^2 = (1/M)·Σ_{i=1..M} (x_i − μ̂)^2 in formula (16); the lower the energy e_t*, the more the neuron t differs from the surrounding neurons and the more important it is for visual processing, so the importance of each neuron can be obtained from 1/e_t*; the final parameter-free attention module formula is:
X̃ = sigmoid(1/E) ⊙ X   (17)
where E in formula (17) groups all the energy values e_t* across the channel and spatial dimensions, ⊙ denotes element-wise multiplication, and adding the sigmoid function does not affect the relative importance of each neuron because the sigmoid is a monotonically increasing function;
the sight line estimation network receives the input normalized face image and processes it with a convolutional neural network fused with the parameter-free attention mechanism, specifically as follows: the normalized face image first passes through a first convolution module with a stride of 2 containing 64 convolution kernels of size 7 × 7, giving a feature map of size 112 × 112; after this convolution step the feature map is scaled by a stride-2 max-pooling operation, which has the advantages of translation, rotation and scale invariance, giving a feature map of size 56 × 56; the feature map output by the previous step is fed into layer1, which contains two residual blocks; the four convolution layers in these two residual blocks all use 64 convolution kernels of size 3 × 3 with a stride of 1, so the size of the feature map is unchanged after layer1; the feature map obtained in the previous step is fed into layer2, which contains 2 residual blocks; the first convolution of the first residual block in layer2 has a stride of 2, the other parameters are unchanged, and the convolution layers use 128 convolution kernels of size 3 × 3, giving a feature map of size 28 × 28; the feature map obtained in the previous step is fed into layer3, which contains 2 residual blocks; the first convolution of the first residual block in layer3 has a stride of 2, the other parameters are unchanged, and the convolution layers use 256 convolution kernels of size 3 × 3, giving a feature map of size 14 × 14; the feature map is then fed into a 1 × 1 convolution layer for a further convolution operation; after the residual blocks have finished extracting the facial feature information, the extracted feature information is reshaped into a vector, concatenated and fused with the head pose information, and regressed to the sight line direction in the normalized space through two fully connected layers; the actual sight line direction is then obtained by the inverse normalization g = R^(-1)·gn.
2. The sight line estimation method fusing a parameter-free attention mechanism according to claim 1, characterized in that when the gradient histogram of a cell unit is calculated, a 9-bin histogram is used to accumulate the gradient information of the 8 × 8 pixels, i.e. the 0°–360° range of gradient directions is divided into 9 direction blocks:
the first directional block range is: 0-20 degrees and 180-200 degrees;
the second directional block range is: 20-40 degrees and 200-220 degrees;
the third directional block range is: 40-60 degrees and 220-240 degrees;
the fourth directional block range is: 60-80 degrees and 240-260 degrees;
the fifth directional block range is: 80-100 degrees and 260-280 degrees;
the sixth directional block range is: 100-120 degrees and 280-300 degrees;
the seventh directional block range is: 120-140 degrees and 300-320 degrees;
the eighth directional block range is: 140-160 degrees and 320-340 degrees;
the ninth directional block range is: 160-180 degrees and 340-360 degrees;
if the gradient direction of a pixel in the cell unit lies in 20–40 degrees or 200–220 degrees, the count of the 2nd bin of the histogram is increased by 1; in this way the gradient direction of every pixel in the cell unit is projected into the histogram as a weighted vote and mapped into the corresponding angle block, giving the gradient direction histogram of the cell unit, i.e. the 9-dimensional feature vector corresponding to that cell unit.
3. The sight line estimation method fusing a parameter-free attention mechanism according to claim 1, characterized in that the values of λ in formula (15) are all 10^-4.
CN202210300517.7A 2022-03-25 2022-03-25 Sight line estimation method fusing a parameter-free attention mechanism Withdrawn CN114724211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210300517.7A CN114724211A (en) Sight line estimation method fusing a parameter-free attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210300517.7A CN114724211A (en) Sight line estimation method fusing a parameter-free attention mechanism

Publications (1)

Publication Number Publication Date
CN114724211A true CN114724211A (en) 2022-07-08

Family

ID=82240844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210300517.7A Withdrawn CN114724211A (en) 2022-03-25 2022-03-25 Sight estimation method integrating non-attention mechanism

Country Status (1)

Country Link
CN (1) CN114724211A (en)

Similar Documents

Publication Publication Date Title
Holte et al. View-invariant gesture recognition using 3D optical flow and harmonic motion context
US10949649B2 (en) Real-time tracking of facial features in unconstrained video
Dornaika et al. On appearance based face and facial action tracking
WO2022156640A1 (en) Gaze correction method and apparatus for image, electronic device, computer-readable storage medium, and computer program product
US20150035825A1 (en) Method for real-time face animation based on single video camera
US20140043329A1 (en) Method of augmented makeover with 3d face modeling and landmark alignment
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109598196B (en) Multi-form multi-pose face sequence feature point positioning method
CN108446672B (en) Face alignment method based on shape estimation of coarse face to fine face
US9158963B2 (en) Fitting contours to features
CN112419170A (en) Method for training occlusion detection model and method for beautifying face image
US9202138B2 (en) Adjusting a contour by a shape model
CN110751097B (en) Semi-supervised three-dimensional point cloud gesture key point detection method
Wu et al. Eyenet: A multi-task deep network for off-axis eye gaze estimation
CN111815768B (en) Three-dimensional face reconstruction method and device
CN115661246A (en) Attitude estimation method based on self-supervision learning
CN112446322A (en) Eyeball feature detection method, device, equipment and computer-readable storage medium
Chang et al. 2d–3d pose consistency-based conditional random fields for 3d human pose estimation
CN111626152A (en) Space-time sight direction estimation prototype design based on Few-shot
Pais et al. Omnidrl: Robust pedestrian detection using deep reinforcement learning on omnidirectional cameras
Sun et al. Adaptive image dehazing and object tracking in UAV videos based on the template updating Siamese network
CN108694348B (en) Tracking registration method and device based on natural features
CN114724211A (en) Sight line estimation method fusing a parameter-free attention mechanism
Li et al. Evaluating effects of focal length and viewing angle in a comparison of recent face landmark and alignment methods
CN113962846A (en) Image alignment method and device, computer readable storage medium and electronic device

Legal Events

Code — Title
PB01 — Publication
SE01 — Entry into force of request for substantive examination
WW01 — Invention patent application withdrawn after publication (application publication date: 20220708)