CN114724211A - Sight line estimation method fusing a parameter-free attention mechanism - Google Patents

Sight line estimation method fusing a parameter-free attention mechanism

Info

Publication number
CN114724211A
CN114724211A
Authority
CN
China
Prior art keywords
normalized
coordinate system
degrees
formula
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210300517.7A
Other languages
Chinese (zh)
Inventor
王鹏
陶文杰
王世龙
苑硕
杨东磊
储智超
陈玥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Institute of Technology
Original Assignee
Changzhou Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Institute of Technology filed Critical Changzhou Institute of Technology
Priority to CN202210300517.7A priority Critical patent/CN114724211A/en
Publication of CN114724211A publication Critical patent/CN114724211A/en
Withdrawn legal-status Critical Current



Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/25 — Fusion techniques
    • G06F18/253 — Fusion techniques of extracted features
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/048 — Activation functions
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods

Abstract

The invention relates to the improvement of a sight line estimation method, in particular to a sight line estimation method fusing a parameter-free attention mechanism. The method comprises the following steps: step one, detecting the face and the facial feature points with a method combining the histogram of oriented gradients with a constrained local model; step two, estimating the head pose based on a PnP method, establishing a normalized coordinate system, and converting the face image into the normalized space. In the face detection step, the method based on the histogram of oriented gradients combined with the constrained local model detects the face and the facial feature points quickly and accurately, provides accurate facial feature points for head pose estimation, saves computing resources, and preserves the running speed of the neural network. After the face image is detected, the head pose of the user is estimated with the PnP algorithm, a normalized coordinate system is established, and the data set is normalized so that sight line estimation can be carried out in the normalized space.

Description

Sight line estimation method fusing a parameter-free attention mechanism
Technical Field
The invention relates to the improvement of a sight line estimation method, in particular to a sight line estimation method fusing a parameter-free attention mechanism.
Background
The direction of a person's gaze often indicates what the person is looking at and is interested in, and reflects the attention behaviour of the agent. Gaze estimation is the process of determining the gaze direction by computer software or hardware. It is widely applied in computer vision, pattern recognition, automatic driving, virtual reality, optics, neurocognitive science and psychology, advertising, marketing and other fields.
Image-based sight line estimation methods fall into two main categories: model-based methods and appearance-based methods. In recent years, with the development of computer hardware and deep learning, appearance-based sight line estimation has gradually become the mainstream. Appearance-based sight line estimation needs no special hardware or auxiliary light sources and works from plane images and two-dimensional data. To capture the high-dimensional features hidden in an image, complex and varied images are used to train a neural network model for sight line estimation. A full-face picture, however, contains many redundant features that are irrelevant to sight line estimation and may interfere with its accuracy; to obtain a better sight line estimation result, the weight of the redundant features should be reduced and the weight of the features that contribute stably to the sight line estimation result should be increased.
The existing way to suppress the weight of irrelevant regions is to use a channel attention mechanism, a spatial attention mechanism, or a combination of the two; both attention mechanisms increase the number of parameters of the neural network to some extent and reduce its computation speed. Moreover, combining the channel attention mechanism with the spatial attention mechanism derives a three-dimensional weight from the one-dimensional weight of the channel attention and the two-dimensional weight of the spatial attention. Compared with a single attention mechanism, this combination improves network performance to some extent, but it does not directly obtain a true three-dimensional weight for the feature map, so the improvement for a sight line estimation network is limited; at the same time the network model becomes more complex, has more parameters, occupies more computing resources, and the real-time performance of sight line estimation suffers. In face detection, most existing methods cascade AdaBoost weak classifiers, where the number of cascaded weak classifiers is difficult to set and training is time-consuming. In facial feature point detection, the existing active shape model and active appearance model need far more iterations when the initial face shape is far from the real face shape, so detection is slow. In head pose estimation, the mainstream random sample consensus algorithm computes an accurate pose but runs slowly, so real-time sight line estimation cannot be achieved.
Disclosure of Invention
The invention aims to provide a sight line estimation method fusing a parameter-free attention mechanism.
In order to achieve this purpose, the invention provides the following technical scheme: a sight line estimation method fusing a parameter-free attention mechanism, comprising the following steps:
step one, detecting the face and the facial feature points by a method combining the histogram of oriented gradients (HOG) with a constrained local model;
acquiring an original picture of the whole face of a person through a camera, detecting the face and the facial feature points by the HOG-based method combined with the constrained local model, and providing accurate facial feature points for head pose estimation;
Step 210, image normalization: the color space of the input image is first standardized by Gamma correction, the Gamma correction formula being:
I'(x, y) = I(x, y)^γ   (1)
where I(x, y) is the original image pixel value, I'(x, y) is the corrected image pixel value, and γ is the Gamma correction coefficient;
Step 220, gradient calculation: the gradient is calculated at each pixel of the image; the gradient describes how quickly the pixel value changes at that point, and the calculation formula is:
Gx(x, y) = I(x+1, y) − I(x−1, y),  Gy(x, y) = I(x, y+1) − I(x, y−1)   (2)
where Gx(x, y) and Gy(x, y) in formula (2) are the gradients of the input image in the horizontal and vertical directions at pixel (x, y); the magnitude and direction of the gradient at that pixel are obtained from these two components, with the calculation formula:
G(x, y) = sqrt(Gx(x, y)^2 + Gy(x, y)^2),  φ(x, y) = arctan(Gy(x, y)/Gx(x, y))   (3)
where G(x, y) in formula (3) is the magnitude of the gradient at pixel (x, y) and φ is the direction of the gradient at pixel (x, y);
Step 230, histogram calculation: the image is divided into 8 × 8 cell units, a gradient histogram is calculated for each cell unit to obtain the descriptor of that cell unit, and the descriptors of the cell units in a block are concatenated to obtain the HOG feature descriptor of that block;
Step 240, block normalization: the L2-norm normalization algorithm is adopted, with the formula:
v'_i = v_i / sqrt(v_1^2 + v_2^2 + … + v_n^2)   (4)
where v'_i in formula (4) is the normalized element and v_i, i = 1, 2, …, n, are the elements of the vector to be normalized;
Step 250, synthesizing the window HOG feature vector: the feature vectors of all blocks in the detection window are concatenated to form the HOG feature vector of the detection window, this feature vector is fed into a pre-trained support vector machine, and whether the picture contains a face is judged;
Step 260, facial feature point detection: the constrained local model is selected; specifically, the facial feature point positions are first initialized for a face looking straight at the camera, and the actual position satisfying the constraint conditions is then searched in the neighborhood of each corresponding feature point, forming the complete distribution of the 68 facial feature points;
detecting the face and the facial feature points with the HOG-based method combined with the constrained local model achieves fast and accurate detection of the face and the facial feature points;
step two, estimating the head pose based on a PnP method, establishing a normalized coordinate system, and converting the face image into the normalized space;
the head pose estimation projects the facial feature points in the picture onto a three-dimensional face model and solves the Euler angles from the transformation relation between the two-dimensional and three-dimensional coordinates to obtain the head pose; seven feature points are selected, namely the inner and outer corners of both eyes, the two mouth corners and the nose tip;
the specific method is as follows: the rotation and translation vectors are calculated from the positions of the input world-coordinate-system points, the facial feature point positions of the picture obtained in Step 260 and the camera parameters, and the Euler angles of the head pose are obtained with the PnP method;
translation matrix: the spatial position relation matrix of the object relative to the camera, denoted by T;
rotation matrix: the spatial attitude relation matrix of the object relative to the camera, denoted by R;
(1) the conversion formula from the world coordinate system to the camera coordinate system is:
[Xc, Yc, Zc]^T = R·[Xw, Yw, Zw]^T + T   (5)
(2) the conversion formula from the camera coordinate system to the pixel coordinate system is:
Zc·[u, v, 1]^T = K·[Xc, Yc, Zc]^T, with K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (6)
(3) the conversion formula from the pixel coordinate system to the world coordinate system combines the two relations above:
[Xw, Yw, Zw]^T = R^(-1)·(Zc·K^(-1)·[u, v, 1]^T − T)   (7)
(4) the conversion formula from the image coordinate system to the pixel coordinate system is:
u = x/dx + cx,  v = y/dy + cy   (8)
where dx and dy are the physical width and height of one pixel;
the Euler angles of the head pose are obtained from the rotation matrix R = (rij) given by the PnP method, with the formula:
α = arctan2(r32, r33),  β = arctan2(−r31, sqrt(r32^2 + r33^2)),  θ = arctan2(r21, r11)   (9)
where α, β and θ are the rotation angles of the head about the three axes x, y and z;
the data set is normalized and a normalized coordinate system Sn is established from the head pose information: the midpoint of the two right-eye feature points is taken as the right eye center, the midpoint of the two left-eye feature points as the left eye center and the midpoint of the two mouth feature points as the mouth center; the direction from the right eye center to the left eye center is the X axis of the head pose coordinate system and the direction perpendicular to the plane formed by the eye centers and the mouth center is the Y axis; data normalization applies an affine transformation to the original input image and converts it into the normalized coordinate system, so that sight line estimation can be carried out in a normalized space with fixed camera parameters and reference point position; given an input image I, the conversion matrix M = S·R is calculated, where R denotes the inverse of the camera rotation matrix and S denotes a scaling matrix defined so that the distance from the eye center to the origin of the normalized coordinate system becomes ds, the value of ds being 600 mm;
the normalized coordinate system Sn has coordinate axes Xn, Yn and Zn and origin On, the head pose coordinate system Sh has coordinate axes Xh, Yh and Zh and origin Oh, and the camera coordinate system Sc has coordinate axes Xc, Yc and Zc and origin Oc; On coincides with Oc and Zn coincides with Zc;
Xn is obtained by projecting Xh onto the plane perpendicular to Zn and normalizing:
Xn = (Xh − (Xh·Zn)·Zn) / ||Xh − (Xh·Zn)·Zn||
Yn is obtained by projecting Yh onto the plane perpendicular to Zn and normalizing:
Yn = (Yh − (Yh·Zn)·Zn) / ||Yh − (Yh·Zn)·Zn||
and the rotation of the normalized coordinate system is R = [Xn, Yn, Zn]^T;
the coordinates of the eye center in the normalized coordinate system are (ex, ey, ez) and the projection of the distance from the eye center to the origin of the normalized coordinate system onto Zn is d = ez, so that
λ = ds / d   (10)
S=diag(1,1,λ) (11)
The transformation matrix M rotates and scales points given in the original camera coordinate system into the normalized coordinate system, and the input image I is affine-transformed into the normalized coordinate system with the image transformation matrix
W = Cs·M·Cr^(-1)
where Cr is the projection matrix corresponding to the input image obtained by camera calibration and Cs is a predefined parameter used to define the camera projection matrix in the normalized space;
Cs = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (12)
where fx and fy in formula (12) are the focal length values of the camera and cx and cy are determined by the length and width of the image;
in the training phase, all training images I with ground-truth annotation vectors g are either normalized in this way or synthesized directly in the training space from ds and Cs, and the ground-truth gaze vector g is likewise normalized to gn = R·g;
in the test phase the test images are normalized in the same way, their corresponding gaze vectors in the normalized space are estimated by a regression function trained in the normalized space, and the actual sight line direction is then obtained by the inverse transformation g = R^(-1)·gn;
step three, building a shallow residual neural network fused with the parameter-free attention mechanism;
in visual processing tasks, neurons exhibiting significant spatial suppression should be given higher weight; the goal of the parameter-free attention mechanism is to find the linear separability between a target neuron exhibiting significant spatial suppression and the other neurons, and on this basis an energy function is defined for each neuron, as shown in formula (13):
e_t(w_t, b_t, y, x_i) = (y_t − t̂)^2 + (1/(M−1))·Σ_{i=1..M−1} (y_o − x̂_i)^2, with t̂ = w_t·t + b_t and x̂_i = w_t·x_i + b_t   (13)
where w_t in formula (13) is a weight, b_t is a bias, t and x_i are the target neuron and the other neurons in a single channel of the input feature, M is the number of neurons in that channel and i is an index over the spatial dimension; binary labels are adopted for y_t and y_o, taking the values 1 and −1 respectively, and a regularization term is added to formula (13) to obtain the final energy function:
e_t(w_t, b_t, y, x_i) = (1/(M−1))·Σ_{i=1..M−1} (−1 − (w_t·x_i + b_t))^2 + (1 − (w_t·t + b_t))^2 + λ·w_t^2   (14)
the fast closed-form solution of w_t and b_t in formula (14) is obtained by the following formulas:
w_t = −2(t − μ_t) / ((t − μ_t)^2 + 2σ_t^2 + 2λ),  b_t = −(t + μ_t)·w_t / 2   (15)
where μ_t = (1/(M−1))·Σ_{i=1..M−1} x_i and σ_t^2 = (1/(M−1))·Σ_{i=1..M−1} (x_i − μ_t)^2 are respectively the mean and variance of all neurons in the channel except the target neuron; the preferred range of λ is 10^-1 to 10^-6; all pixels in a single channel follow the same distribution, so to avoid calculating the mean μ and variance σ^2 separately for every location, the minimum energy is calculated with the following formula:
e_t* = 4(σ̂^2 + λ) / ((t − μ̂)^2 + 2σ̂^2 + 2λ)   (16)
where μ̂ = (1/M)·Σ_{i=1..M} x_i and σ̂^2 = (1/M)·Σ_{i=1..M} (x_i − μ̂)^2 in formula (16); the lower the energy e_t*, the more the neuron t differs from the surrounding neurons and the more important it is for visual processing, so the importance of each neuron can be obtained from 1/e_t*; the final parameter-free attention module formula is:
X̃ = sigmoid(1/E) ⊙ X   (17)
where E in formula (17) groups all the energy values e_t* across the channel and spatial dimensions, ⊙ denotes element-wise multiplication, and adding the sigmoid function does not affect the relative importance of each neuron because the sigmoid is a monotonically increasing function;
the sight line estimation network receives an input normalized face image by calling a convolutional neural network, and the face image is fused with the convolutional neural network without a attention mechanism by the convolutional neural network, and the sight line estimation network specifically comprises the following steps: the normalized face image is translated into 2 through a skip lattice of a first convolution module in sequence and contains 64 convolution kernels with the size of 7 × 7, the size of an obtained feature map is 112 × 112, the feature map is scaled through a skip lattice translation-2 maximum pooling operation with the advantages of translation invariance, rotation invariance and scale invariance after the step of convolution operation to obtain a feature map with the size of 56 × 56, the feature map output in the previous step is sent to layer1, two residual blocks are contained in layer1, the number and the size of 4 convolution layer convolution kernels contained in the two residual blocks are consistent and are 64 convolution kernels with 3 × 3, and the step skip lattice is translated into 1, so that the size of the feature map is unchanged after the operation of layer 1; sending the feature map obtained in the previous step into layer2 containing 2 residual blocks, wherein the jump translation of the first convolution of the first residual block in layer2 is 2, the number and the size of the rest parameter information and the convolution kernels of the rest 3 convolution layers are 128 convolution kernels of 3 × 3, and the size of the feature map obtained after the operation of the step is 28 × 28; sending the feature map obtained in the previous step into layer3 containing 2 residual error blocks, wherein the skip translation of the first convolution of the first residual error block in layer3 is 2, the number and the size of the rest parameter information and the convolution kernels of the rest 3 convolution layers are 256 convolution kernels of 3 × 3, and the size of the feature map obtained after the operation in the step is 14 × 14; then, the characteristic diagram is sent into a convolution layer containing 1 multiplied by 1 to carry out convolution operation; finishing extracting facial feature information by the residual blockThen, the extracted feature information is adjusted into a vector form and then spliced and fused with the head posture information, the head posture information returns to the sight line direction in a normalized space after passing through two full-connection layers, and then the head posture information is inversely normalized and passed
Figure BDA0003565303540000081
The actual gaze direction is obtained by the transformation.
Preferably, when the gradient histogram of a cell unit is calculated, a 9-bin histogram is used to accumulate the gradient information of its 8 × 8 pixels, i.e. the 0°–360° range of gradient directions is divided into 9 direction blocks:
the first directional block range is: 0-20 degrees and 180-200 degrees;
the second directional block range is: 20-40 degrees and 200-220 degrees;
the third directional block range is: 40-60 degrees and 220-240 degrees;
the fourth directional block range is: 60-80 degrees and 240-260 degrees;
the fifth directional block range is: 80-100 degrees and 260-280 degrees;
the sixth directional block range is: 100-120 degrees and 280-300 degrees;
the seventh directional block range is: 120-140 degrees and 300-320 degrees;
the eighth directional block range is: 140-160 degrees and 320-340 degrees;
the ninth directional block range is: 160-180 degrees and 340-360 degrees;
if the gradient direction of a pixel in the cell unit lies in 20–40 degrees or 200–220 degrees, the count of the 2nd bin of the histogram is increased by 1; in this way the gradient direction of every pixel in the cell unit is projected into the histogram as a weighted vote and mapped into the corresponding angle block, giving the gradient direction histogram of the cell unit, i.e. the 9-dimensional feature vector corresponding to that cell unit.
Preferably, the values of λ in formula (15) are all 10^-4.
Compared with the prior art, the sight line estimation method fusing the parameter-free attention mechanism has the following advantages. In the face detection step, the method based on the histogram of oriented gradients combined with the constrained local model detects the face and the facial feature points quickly and accurately, provides accurate facial feature points for head pose estimation, saves computing resources, and preserves the running speed of the neural network. After the face image is detected, the head pose of the user is estimated with the PnP algorithm, a normalized coordinate system is established, and the data set is normalized, so that sight line estimation can be carried out in the normalized space and a more accurate sight line estimation result is obtained. At the same time, the shallow residual network fused with the parameter-free attention mechanism increases the weight of the image regions that contribute stably to sight line estimation without adding extra network parameters, and reduces the weight of the regions that contribute little or nothing to sight line estimation; the real-time performance and the accuracy of sight line estimation are improved, and the method is robust and has strong adaptive capability.
Drawings
FIG. 1 is a flow chart of gaze estimation;
FIG. 2 is a flow chart of face detection;
FIG. 3 is a block diagram showing gradient directions of cell units;
FIG. 4 is a schematic diagram of coordinate system transformation;
FIG. 5 is a construction diagram of a normalized coordinate system;
FIG. 6 is a schematic diagram of the SimAM mechanism;
FIG. 7 is a schematic diagram of the residual network fused with the parameter-free attention mechanism.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the method is specifically implemented, the main process comprises three parts: face detection, head pose estimation and sight line estimation. Face detection is performed on the user in the original picture captured at the PC terminal with the method based on the histogram of oriented gradients, and facial feature point detection is performed on the detected face image with the constrained local model; the head pose of the user is estimated with the PnP algorithm and a normalized coordinate system is established; the parameter-free attention mechanism SimAM is fused into a shallow residual network to improve sight line estimation accuracy without adding extra network parameters. The specific flow is shown in fig. 1.
1. Detecting the face and the facial feature points with the histogram of oriented gradients combined with a constrained local model
First, an original picture of the person's whole face is captured by a camera. When the complete face or the eyes of the user cannot be extracted from the original picture, the direction of the user's sight line cannot be predicted from the picture on geometric grounds, so this step is the basis of all subsequent steps. In this step the HOG-based method combined with the constrained local model is used for face and facial feature point detection, so that the face and the facial feature points can be detected quickly and accurately with a small amount of computing resources, and accurate facial feature points are provided for head pose estimation, from which a more accurate head pose can be obtained. The face detection flow chart is shown in fig. 2.
Step 210, image normalization: in order to reduce the influence of illumination and of local shadows and illumination changes in the image, the invention first standardizes the color space of the input image with Gamma correction; the Gamma correction formula is:
I'(x, y) = I(x, y)^γ   (1)
where I(x, y) is the original image pixel value, I'(x, y) is the corrected image pixel value, and γ is the Gamma correction coefficient.
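As an illustration of formula (1), the following is a minimal NumPy sketch of the Gamma correction step; the choice γ = 0.5 and the 8-bit input range are assumptions made for the example, not values fixed by the invention.

```python
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Apply formula (1): I'(x, y) = I(x, y)^gamma.

    The image is assumed to be 8-bit grayscale; it is scaled to [0, 1] before
    the power law and scaled back afterwards. gamma = 0.5 is illustrative.
    """
    normalized = image.astype(np.float64) / 255.0
    corrected = np.power(normalized, gamma)
    return (corrected * 255.0).astype(np.uint8)
```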
Step 220, gradient calculation: the gradient is calculated at each pixel of the image; the gradient describes how quickly the pixel value changes at that point, and the calculation formula is:
Gx(x, y) = I(x+1, y) − I(x−1, y),  Gy(x, y) = I(x, y+1) − I(x, y−1)   (2)
where Gx(x, y) and Gy(x, y) in formula (2) are the gradients of the input image in the horizontal and vertical directions at pixel (x, y); the magnitude and direction of the gradient at that pixel are obtained from these two components, with the calculation formula:
G(x, y) = sqrt(Gx(x, y)^2 + Gy(x, y)^2),  φ(x, y) = arctan(Gy(x, y)/Gx(x, y))   (3)
where G(x, y) in formula (3) is the magnitude of the gradient at pixel (x, y) and φ is the direction of the gradient at pixel (x, y).
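A small sketch of formulas (2) and (3) using central differences; the border handling and the 0–360° wrapping of the orientation are implementation choices for the example, not requirements of the invention.

```python
import numpy as np

def gradient_magnitude_orientation(image: np.ndarray):
    """Compute Gx, Gy by central differences, then the magnitude G(x, y) and
    the direction phi(x, y) in degrees (0-360), as in formulas (2) and (3)."""
    img = image.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal gradient Gx(x, y)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical gradient Gy(x, y)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    orientation = np.degrees(np.arctan2(gy, gx)) % 360.0
    return magnitude, orientation
```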
Step 230, histogram calculation: the image is divided into cell units, for example of 8 × 8 pixels, and a gradient histogram is calculated for each cell unit to obtain the descriptor of that cell unit; cell units are grouped into blocks, for example 2 × 2 cell units per block, and the descriptors of the cell units in a block are concatenated to obtain the HOG feature descriptor of that block.
When the gradient histogram of a cell unit is calculated, a 9-bin histogram is used to accumulate the gradient information of the 8 × 8 pixels, i.e. the 0°–360° range of gradient directions is divided into 9 direction blocks, as shown in fig. 3:
the first directional block range is: 0-20 degrees and 180-200 degrees;
the second directional block range is: 20-40 degrees and 200-220 degrees;
the third directional block range is: 40-60 degrees and 220-240 degrees;
the fourth directional block range is: 60-80 degrees and 240-260 degrees;
the fifth directional block range is: 80-100 degrees and 260-280 degrees;
the sixth directional block range is: 100-120 degrees and 280-300 degrees;
the seventh directional block range is: 120-140 degrees and 300-320 degrees;
the eighth directional block range is: 140-160 degrees and 320-340 degrees;
the ninth directional block range is: 160-180 degrees and 340-360 degrees.
If the gradient direction of a pixel in the cell unit lies in 20–40 degrees or 200–220 degrees, the count of the 2nd bin of the histogram is increased by 1; in this way the gradient direction of every pixel in the cell unit is projected into the histogram as a weighted vote and mapped into the corresponding angle block, giving the gradient direction histogram of the cell unit, i.e. the 9-dimensional feature vector corresponding to that cell unit.
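The mapping of a gradient direction into one of the 9 direction blocks can be sketched as below; the magnitude-weighted hard assignment is one possible reading of the weighted projection described above.

```python
import numpy as np

def direction_bin(angle_deg: float) -> int:
    """Angles theta and theta + 180 deg share a block, so the direction is
    folded to [0, 180) and split into 20-degree bins; e.g. 30 deg and 210 deg
    both fall into the 2nd block (index 1)."""
    return int((angle_deg % 180.0) // 20.0)   # 0..8 -> direction blocks 1..9

def cell_histogram(magnitude: np.ndarray, orientation: np.ndarray) -> np.ndarray:
    """9-bin histogram of one 8x8 cell unit, each pixel voting with a weight
    equal to its gradient magnitude."""
    hist = np.zeros(9)
    for mag, ang in zip(magnitude.ravel(), orientation.ravel()):
        hist[direction_bin(ang)] += mag
    return hist
```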
Step 240, block normalization: the invention adopts the L2-norm normalization algorithm, with the formula:
v'_i = v_i / sqrt(v_1^2 + v_2^2 + … + v_n^2)   (4)
where v'_i in formula (4) is the normalized element and v_i, i = 1, 2, …, n, are the elements of the vector to be normalized.
Step 250, synthesizing the window HOG feature vector: the feature vectors of all blocks of the detection window are concatenated to form the HOG feature vector of the detection window, this feature vector is fed into a pre-trained support vector machine, and whether the picture contains a face is judged.
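A compact sketch of Steps 230–250 using off-the-shelf components: skimage's HOG descriptor (8 × 8 cells, 2 × 2 cells per block, 9 bins, L2 block normalization) and a linear SVM from scikit-learn; the variable names for the training crops and the fixed window size are assumptions for the example.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def window_hog(window: np.ndarray) -> np.ndarray:
    """HOG feature vector of one grayscale detection window (all block
    descriptors concatenated), matching Steps 230-250 above."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2', feature_vector=True)

def train_face_classifier(face_windows, background_windows) -> LinearSVC:
    """Train the support vector machine on positive (face) and negative
    (background) windows; both arguments are lists of fixed-size crops."""
    features = [window_hog(w) for w in list(face_windows) + list(background_windows)]
    labels = [1] * len(face_windows) + [0] * len(background_windows)
    return LinearSVC().fit(np.array(features), labels)

def window_contains_face(classifier: LinearSVC, window: np.ndarray) -> bool:
    # Positive decision value -> the window is classified as a face.
    return classifier.decision_function([window_hog(window)])[0] > 0
```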
Step 260, facial feature point detection: the constrained local model is selected because it combines the advantages of the active shape model and the active appearance model and detects facial feature points with high accuracy and robustness. The specific method is: the facial feature point positions are initialized for a face looking straight at the camera, and the actual position satisfying the constraint conditions is then searched in the neighborhood of each corresponding feature point, forming the complete distribution of the 68 facial feature points.
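For the feature-point stage, the sketch below uses dlib as a stand-in: its face detector is HOG+SVM based, but its 68-point shape predictor is a regression-tree model rather than the constrained local model described here; it does, however, produce the same 68-point layout. The model file name is the one dlib distributes and is assumed to be available locally.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()  # HOG + linear SVM face detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(gray: np.ndarray) -> np.ndarray:
    """Return a (68, 2) array of facial feature points for the first detected
    face, or an empty (0, 2) array when no face is found."""
    faces = detector(gray, 1)
    if len(faces) == 0:
        return np.empty((0, 2))
    shape = predictor(gray, faces[0])
    return np.array([[shape.part(i).x, shape.part(i).y]
                     for i in range(shape.num_parts)], dtype=np.float64)
```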
The method based on the histogram of oriented gradients combined with the constrained local model detects the face and the facial feature points quickly and accurately, lays a good foundation for head pose estimation, saves computing resources, and preserves the real-time performance of the sight line estimation network.
2. Head pose is estimated based on a PnP method, a normalized coordinate system is established, and a face image is converted into a normalized space
The pose of an object relative to the camera can be represented by a rotation matrix and a translation matrix. As shown in fig. 4, the head pose estimation of the invention projects the facial feature points in the picture onto a three-dimensional face model and solves the Euler angles from the transformation relation between the two-dimensional and three-dimensional coordinates to obtain the head pose. The invention selects 7 feature points, namely the inner and outer corners of both eyes, the two mouth corners and the nose tip.
The specific method is as follows: the rotation and translation vectors are calculated from the positions of the input world-coordinate-system points, the facial feature point positions of the picture obtained in Step 260 and the camera parameters, and the Euler angles of the head pose are obtained with the PnP method.
Translation matrix: the spatial position relation matrix of the object relative to the camera, denoted by T;
Rotation matrix: the spatial attitude relation matrix of the object relative to the camera, denoted by R;
(1) the conversion formula from the world coordinate system to the camera coordinate system is:
[Xc, Yc, Zc]^T = R·[Xw, Yw, Zw]^T + T   (5)
(2) the conversion formula from the camera coordinate system to the pixel coordinate system is:
Zc·[u, v, 1]^T = K·[Xc, Yc, Zc]^T, with K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (6)
(3) the conversion formula from the pixel coordinate system to the world coordinate system combines the two relations above:
[Xw, Yw, Zw]^T = R^(-1)·(Zc·K^(-1)·[u, v, 1]^T − T)   (7)
(4) the conversion formula from the image coordinate system to the pixel coordinate system is:
u = x/dx + cx,  v = y/dy + cy   (8)
where dx and dy are the physical width and height of one pixel.
The Euler angles of the head pose are obtained from the rotation matrix R = (rij) given by the PnP method, with the formula:
α = arctan2(r32, r33),  β = arctan2(−r31, sqrt(r32^2 + r33^2)),  θ = arctan2(r21, r11)   (9)
where α, β and θ are the rotation angles of the head about the three axes x, y and z.
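A sketch of this PnP step with OpenCV: the seven 2-D feature points (eye corners, mouth corners, nose tip) are matched against a generic 3-D head model and the Euler angles of formula (9) are read off the resulting rotation matrix. The 3-D model coordinates below are illustrative placeholders, not values taken from the invention.

```python
import cv2
import numpy as np

# Illustrative generic-head coordinates (mm) for the 7 selected feature points.
MODEL_POINTS_3D = np.array([
    [-45.0,  35.0, -25.0],   # right eye, outer corner
    [-20.0,  35.0, -20.0],   # right eye, inner corner
    [ 20.0,  35.0, -20.0],   # left eye, inner corner
    [ 45.0,  35.0, -25.0],   # left eye, outer corner
    [-25.0, -30.0, -20.0],   # right mouth corner
    [ 25.0, -30.0, -20.0],   # left mouth corner
    [  0.0,   0.0,   0.0],   # nose tip
], dtype=np.float64)

def head_pose_euler(image_points_2d: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float):
    """Solve PnP for the 7 correspondences and return (alpha, beta, theta) in
    degrees together with the rotation matrix R and translation vector T."""
    camera_matrix = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS_3D, image_points_2d,
                                  camera_matrix, np.zeros(4),
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)
    alpha = np.degrees(np.arctan2(R[2, 1], R[2, 2]))                      # about x
    beta = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2])))   # about y
    theta = np.degrees(np.arctan2(R[1, 0], R[0, 0]))                      # about z
    return (alpha, beta, theta), R, tvec
```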
The data set is normalized and a normalized coordinate system Sn is established from the head pose information: the midpoint of the two right-eye feature points is taken as the right eye center, the midpoint of the two left-eye feature points as the left eye center and the midpoint of the two mouth feature points as the mouth center; the direction from the right eye center to the left eye center is the X axis of the head pose coordinate system and the direction perpendicular to the plane formed by the eye centers and the mouth center is the Y axis. Data normalization applies an affine transformation to the original input image and converts it into the normalized coordinate system, so that sight line estimation can be carried out in a normalized space with fixed camera parameters and reference point position. Given an input image I, the conversion matrix M = S·R is calculated, where R denotes the inverse of the camera rotation matrix and S denotes a scaling matrix defined so that the distance from the eye center to the origin of the normalized coordinate system becomes ds; the value of ds is 600 mm.
The normalized coordinate system Sn has coordinate axes Xn, Yn and Zn and origin On, the head pose coordinate system Sh has coordinate axes Xh, Yh and Zh and origin Oh, and the camera coordinate system Sc has coordinate axes Xc, Yc and Zc and origin Oc; On coincides with Oc and Zn coincides with Zc. The construction of the normalized coordinate system is shown in FIG. 5.
Xn is obtained by projecting Xh onto the plane perpendicular to Zn and normalizing:
Xn = (Xh − (Xh·Zn)·Zn) / ||Xh − (Xh·Zn)·Zn||
Yn is obtained by projecting Yh onto the plane perpendicular to Zn and normalizing:
Yn = (Yh − (Yh·Zn)·Zn) / ||Yh − (Yh·Zn)·Zn||
and the rotation of the normalized coordinate system is R = [Xn, Yn, Zn]^T.
The coordinates of the eye center in the normalized coordinate system are (ex, ey, ez) and the projection of the distance from the eye center to the origin of the normalized coordinate system onto Zn is d = ez, so that
λ = ds / d   (10)
S=diag(1,1,λ) (11)
The transformation matrix M rotates and scales points given in the original camera coordinate system into the normalized coordinate system, and the input image I is affine-transformed into the normalized coordinate system with the image transformation matrix
W = Cs·M·Cr^(-1)
where Cr is the projection matrix corresponding to the input image obtained by camera calibration and Cs is a predefined parameter used to define the camera projection matrix in the normalized space.
Cs = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (12)
where fx and fy in formula (12) are the focal length values of the camera and cx and cy are determined by the length and width of the image.
In the training phase, all training images I with ground-truth annotation vectors g are either normalized in this way or synthesized directly in the training space from ds and Cs, and the ground-truth gaze vector g is likewise normalized to gn = R·g. In the test phase the test images are normalized in the same way, their corresponding gaze vectors in the normalized space are estimated by a regression function trained in the normalized space, and the actual sight line direction is then obtained by the inverse transformation g = R^(-1)·gn.
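The normalization of formulas (10)–(12) can be sketched as follows. The normalized axes are built with the usual cross-product construction (equivalent to projecting the head X axis onto the plane perpendicular to Zn), the warp uses W = Cs·M·Cr^(-1) as above, and the gaze vector is mapped back by the rotation only; the output image size is an assumption for the example.

```python
import cv2
import numpy as np

def normalize_face(image, R_head, eye_center, C_r, C_s,
                   d_s=600.0, out_size=(224, 224)):
    """Build M = S*R and warp the input image into the normalized space.

    R_head     : 3x3 head rotation matrix from the PnP step
    eye_center : 3-vector, eye center in camera coordinates (mm)
    C_r, C_s   : real and normalized camera projection matrices
    """
    z_n = eye_center / np.linalg.norm(eye_center)            # Zn: camera -> eye center
    x_h = R_head[:, 0]                                        # head-pose X axis
    y_n = np.cross(z_n, x_h); y_n /= np.linalg.norm(y_n)      # Yn, perpendicular to Zn
    x_n = np.cross(y_n, z_n); x_n /= np.linalg.norm(x_n)      # Xn
    R = np.stack([x_n, y_n, z_n])                             # rotation into Sn
    S = np.diag([1.0, 1.0, d_s / np.linalg.norm(eye_center)]) # scaling, formulas (10)-(11)
    M = S @ R
    W = C_s @ M @ np.linalg.inv(C_r)                          # image warp matrix
    return cv2.warpPerspective(image, W, out_size), M, R

def denormalize_gaze(g_n: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Rotate a unit gaze direction estimated in the normalized space back
    into the camera coordinate system."""
    g = R.T @ g_n
    return g / np.linalg.norm(g)
```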
3. Building a shallow residual neural network fused with the parameter-free attention mechanism
The invention builds a shallow residual network fused with the parameter-free attention mechanism on the basis of the residual network.
In visual processing tasks, neurons exhibiting significant spatial suppression should be given higher weight; the goal of the parameter-free attention mechanism is to find the linear separability between a target neuron exhibiting significant spatial suppression and the other neurons, and on this basis an energy function is defined for each neuron, as shown in formula (13):
e_t(w_t, b_t, y, x_i) = (y_t − t̂)^2 + (1/(M−1))·Σ_{i=1..M−1} (y_o − x̂_i)^2, with t̂ = w_t·t + b_t and x̂_i = w_t·x_i + b_t   (13)
where w_t in formula (13) is a weight, b_t is a bias, t and x_i are the target neuron and the other neurons in a single channel of the input feature, M is the number of neurons in that channel and i is an index over the spatial dimension; binary labels are adopted for y_t and y_o, taking the values 1 and −1 respectively, and a regularization term is added to formula (13) to obtain the final energy function:
e_t(w_t, b_t, y, x_i) = (1/(M−1))·Σ_{i=1..M−1} (−1 − (w_t·x_i + b_t))^2 + (1 − (w_t·t + b_t))^2 + λ·w_t^2   (14)
The fast closed-form solution of w_t and b_t in formula (14) is obtained by the following formulas:
w_t = −2(t − μ_t) / ((t − μ_t)^2 + 2σ_t^2 + 2λ),  b_t = −(t + μ_t)·w_t / 2   (15)
where μ_t = (1/(M−1))·Σ_{i=1..M−1} x_i and σ_t^2 = (1/(M−1))·Σ_{i=1..M−1} (x_i − μ_t)^2 are respectively the mean and variance of all neurons in the channel except the target neuron. The preferred range of λ is 10^-1 to 10^-6; in the present invention the values of λ are all 10^-4. All pixels in a single channel follow the same distribution, so to avoid calculating the mean μ and variance σ^2 separately for every location, the minimum energy is calculated with the following formula:
e_t* = 4(σ̂^2 + λ) / ((t − μ̂)^2 + 2σ̂^2 + 2λ)   (16)
where μ̂ = (1/M)·Σ_{i=1..M} x_i and σ̂^2 = (1/M)·Σ_{i=1..M} (x_i − μ̂)^2 in formula (16). The lower the energy e_t*, the more the neuron t differs from the surrounding neurons and the more important it is for visual processing, so the importance of each neuron can be obtained from 1/e_t*; the final parameter-free attention module formula is:
X̃ = sigmoid(1/E) ⊙ X   (17)
where E in formula (17) groups all the energy values e_t* across the channel and spatial dimensions, ⊙ denotes element-wise multiplication, and adding the sigmoid function does not affect the relative importance of each neuron because the sigmoid is a monotonically increasing function.
The parameter-free attention module of the invention is added after the second convolution layer in each residual block of the residual network, as shown in fig. 6.
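A PyTorch sketch of this parameter-free attention module following formulas (16) and (17); the per-channel mean and variance are computed once over the spatial dimensions, as described above, and λ = 10^-4.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: weight each neuron by sigmoid(1 / e_t*) with
    e_t* from formula (16); the module adds no learnable parameters."""
    def __init__(self, lam: float = 1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # (x - mean)^2
        v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance
        e_inv = d / (4 * (v + self.lam)) + 0.5               # proportional to 1 / e_t*
        return x * torch.sigmoid(e_inv)
```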
The sight line estimation network receives the input normalized face image and processes it with a convolutional neural network fused with the parameter-free attention mechanism, specifically as follows. The normalized face image first passes through a first convolution module with a stride of 2 containing 64 convolution kernels of size 7 × 7, giving a feature map of size 112 × 112; after this convolution step the feature map is scaled by a stride-2 max-pooling operation, which has the advantages of translation, rotation and scale invariance, giving a feature map of size 56 × 56. The feature map output by the previous step is fed into layer1, which contains two residual blocks; the four convolution layers in these two residual blocks all use 64 convolution kernels of size 3 × 3 with a stride of 1, so the size of the feature map is unchanged after layer1. The feature map obtained in the previous step is fed into layer2, which contains 2 residual blocks; the first convolution of the first residual block in layer2 has a stride of 2, the other parameters are unchanged, and the convolution layers use 128 convolution kernels of size 3 × 3, giving a feature map of size 28 × 28. The feature map obtained in the previous step is fed into layer3, which contains 2 residual blocks; the first convolution of the first residual block in layer3 has a stride of 2, the other parameters are unchanged, and the convolution layers use 256 convolution kernels of size 3 × 3, giving a feature map of size 14 × 14. The feature map is then fed into a 1 × 1 convolution layer for a further convolution operation. After the residual blocks have finished extracting the facial feature information, the extracted feature information is reshaped into a vector, concatenated and fused with the head pose information, and regressed to the sight line direction in the normalized space through two fully connected layers; the actual sight line direction is then obtained by the inverse normalization g = R^(-1)·gn. The residual network fused with the parameter-free attention mechanism is shown in fig. 7.
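The shallow residual network described above can be sketched in PyTorch as follows, with the SimAM module placed after the second convolution of each residual block. The RGB input, the 3-dimensional head-pose vector, the 128-unit hidden layer and the 2-dimensional output (pitch and yaw in the normalized space) are assumptions for the example, not values fixed by the invention; a 224 × 224 normalized face crop and a pose vector would be passed as GazeNet()(face_batch, pose_batch).

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention module (see the previous sketch)."""
    def __init__(self, lam: float = 1e-4):
        super().__init__()
        self.lam = lam
    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        return x * torch.sigmoid(d / (4 * (v + self.lam)) + 0.5)

class ResidualBlock(nn.Module):
    """3x3-3x3 residual block with SimAM after the second convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.simam = SimAM()
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.simam(self.bn2(self.conv2(out)))
        return self.relu(out + self.shortcut(x))

class GazeNet(nn.Module):
    """7x7 stride-2 stem, three stages of two residual blocks (64/128/256
    channels), a 1x1 convolution, global pooling, concatenation with the head
    pose, and two fully connected layers regressing the gaze direction."""
    def __init__(self, pose_dim=3, hidden=128, out_dim=2):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1))              # 224 -> 112 -> 56
        self.layer1 = nn.Sequential(ResidualBlock(64, 64), ResidualBlock(64, 64))
        self.layer2 = nn.Sequential(ResidualBlock(64, 128, 2), ResidualBlock(128, 128))
        self.layer3 = nn.Sequential(ResidualBlock(128, 256, 2), ResidualBlock(256, 256))
        self.conv1x1 = nn.Conv2d(256, 256, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(256 + pose_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, face, head_pose):
        x = self.layer3(self.layer2(self.layer1(self.stem(face))))
        x = self.pool(self.conv1x1(x)).flatten(1)
        x = torch.cat([x, head_pose], dim=1)
        return self.fc2(torch.relu(self.fc1(x)))
```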
A method based on the histogram of oriented gradients combined with a constrained local model is used for face detection and facial feature point detection, the PnP method is used for head pose estimation, and a parameter-free attention mechanism is fused into the sight line estimation network; the three-dimensional weight of the image regions that contribute stably to sight line estimation is increased without adding extra network parameters and while occupying fewer computing resources, and the weight of the regions that contribute little or nothing to sight line estimation is reduced. The sight line estimation method fused with the parameter-free attention mechanism improves the accuracy of sight line estimation and has strong robustness and adaptive capability.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (3)

1. A sight line estimation method fusing a parameter-free attention mechanism, characterized by comprising the following steps:
step one, detecting the face and the facial feature points by a method combining the histogram of oriented gradients (HOG) with a constrained local model;
acquiring an original picture of the whole face of a person through a camera, detecting the face and the facial feature points by the HOG-based method combined with the constrained local model, and providing accurate facial feature points for head pose estimation;
Step 210, image normalization: the color space of the input image is first standardized by Gamma correction, the Gamma correction formula being:
I'(x, y) = I(x, y)^γ   (1)
where I(x, y) is the original image pixel value, I'(x, y) is the corrected image pixel value, and γ is the Gamma correction coefficient;
Step 220, gradient calculation: the gradient is calculated at each pixel of the image; the gradient describes how quickly the pixel value changes at that point, and the calculation formula is:
Gx(x, y) = I(x+1, y) − I(x−1, y),  Gy(x, y) = I(x, y+1) − I(x, y−1)   (2)
where Gx(x, y) and Gy(x, y) in formula (2) are the gradients of the input image in the horizontal and vertical directions at pixel (x, y); the magnitude and direction of the gradient at that pixel are obtained from these two components, with the calculation formula:
G(x, y) = sqrt(Gx(x, y)^2 + Gy(x, y)^2),  φ(x, y) = arctan(Gy(x, y)/Gx(x, y))   (3)
where G(x, y) in formula (3) is the magnitude of the gradient at pixel (x, y) and φ is the direction of the gradient at pixel (x, y);
Step 230, histogram calculation: the image is divided into 8 × 8 cell units, a gradient histogram is calculated for each cell unit to obtain the descriptor of that cell unit, and the descriptors of the cell units in a block are concatenated to obtain the HOG feature descriptor of that block;
Step 240, block normalization: the L2-norm normalization algorithm is adopted, with the formula:
v'_i = v_i / sqrt(v_1^2 + v_2^2 + … + v_n^2)   (4)
where v'_i in formula (4) is the normalized element and v_i, i = 1, 2, …, n, are the elements of the vector to be normalized;
Step 250, synthesizing the window HOG feature vector: the feature vectors of all blocks in the detection window are concatenated to form the HOG feature vector of the detection window, this feature vector is fed into a pre-trained support vector machine, and whether the picture contains a face is judged;
Step 260, facial feature point detection: the constrained local model is selected; specifically, the facial feature point positions are first initialized for a face looking straight at the camera, and the actual position satisfying the constraint conditions is then searched in the neighborhood of each corresponding feature point, forming the complete distribution of the 68 facial feature points;
detecting the face and the facial feature points with the HOG-based method combined with the constrained local model achieves fast and accurate detection of the face and the facial feature points;
step two, estimating the head pose based on a PnP method, establishing a normalized coordinate system, and converting the face image into the normalized space;
the head pose estimation projects the facial feature points in the picture onto a three-dimensional face model and solves the Euler angles from the transformation relation between the two-dimensional and three-dimensional coordinates to obtain the head pose; seven feature points are selected, namely the inner and outer corners of both eyes, the two mouth corners and the nose tip;
the specific method is as follows: the rotation and translation vectors are calculated from the positions of the input world-coordinate-system points, the facial feature point positions of the picture obtained in Step 260 and the camera parameters, and the Euler angles of the head pose are obtained with the PnP method;
translation matrix: the spatial position relation matrix of the object relative to the camera, denoted by T;
rotation matrix: the spatial attitude relation matrix of the object relative to the camera, denoted by R;
(1) the conversion formula from the world coordinate system to the camera coordinate system is:
[Xc, Yc, Zc]^T = R·[Xw, Yw, Zw]^T + T   (5)
(2) the conversion formula from the camera coordinate system to the pixel coordinate system is:
Zc·[u, v, 1]^T = K·[Xc, Yc, Zc]^T, with K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (6)
(3) the conversion formula from the pixel coordinate system to the world coordinate system combines the two relations above:
[Xw, Yw, Zw]^T = R^(-1)·(Zc·K^(-1)·[u, v, 1]^T − T)   (7)
(4) the conversion formula from the image coordinate system to the pixel coordinate system is:
u = x/dx + cx,  v = y/dy + cy   (8)
where dx and dy are the physical width and height of one pixel;
the Euler angles of the head pose are obtained from the rotation matrix R = (rij) given by the PnP method, with the formula:
α = arctan2(r32, r33),  β = arctan2(−r31, sqrt(r32^2 + r33^2)),  θ = arctan2(r21, r11)   (9)
where α, β and θ are the rotation angles of the head about the three axes x, y and z;
the data set is normalized and a normalized coordinate system Sn is established from the head pose information: the midpoint of the two right-eye feature points is taken as the right eye center, the midpoint of the two left-eye feature points as the left eye center and the midpoint of the two mouth feature points as the mouth center; the direction from the right eye center to the left eye center is the X axis of the head pose coordinate system and the direction perpendicular to the plane formed by the eye centers and the mouth center is the Y axis; data normalization applies an affine transformation to the original input image and converts it into the normalized coordinate system, so that sight line estimation can be carried out in a normalized space with fixed camera parameters and reference point position; given an input image I, the conversion matrix M = S·R is calculated, where R denotes the inverse of the camera rotation matrix and S denotes a scaling matrix defined so that the distance from the eye center to the origin of the normalized coordinate system becomes ds, the value of ds being 600 mm;
the normalized coordinate system Sn has coordinate axes Xn, Yn and Zn and origin On, the head pose coordinate system Sh has coordinate axes Xh, Yh and Zh and origin Oh, and the camera coordinate system Sc has coordinate axes Xc, Yc and Zc and origin Oc; On coincides with Oc and Zn coincides with Zc;
Xn is obtained by projecting Xh onto the plane perpendicular to Zn and normalizing:
Xn = (Xh − (Xh·Zn)·Zn) / ||Xh − (Xh·Zn)·Zn||
Yn is obtained by projecting Yh onto the plane perpendicular to Zn and normalizing:
Yn = (Yh − (Yh·Zn)·Zn) / ||Yh − (Yh·Zn)·Zn||
and the rotation of the normalized coordinate system is R = [Xn, Yn, Zn]^T;
the coordinates of the eye center in the normalized coordinate system are (ex, ey, ez) and the projection of the distance from the eye center to the origin of the normalized coordinate system onto Zn is d = ez, so that
λ = ds / d   (10)
S=diag(1,1,λ) (11)
The transformation matrix M rotates and scales points given in the original camera coordinate system into the normalized coordinate system, and the input image I is affine-transformed into the normalized coordinate system with the image transformation matrix
W = Cs·M·Cr^(-1)
where Cr is the projection matrix corresponding to the input image obtained by camera calibration and Cs is a predefined parameter used to define the camera projection matrix in the normalized space;
Cs = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]   (12)
where fx and fy in formula (12) are the focal length values of the camera and cx and cy are determined by the length and width of the image;
in the training phase, all training images I with ground-truth annotation vectors g are either normalized in this way or synthesized directly in the training space from ds and Cs, and the ground-truth gaze vector g is likewise normalized to gn = R·g;
in the test phase the test images are normalized in the same way, their corresponding gaze vectors in the normalized space are estimated by a regression function trained in the normalized space, and the actual sight line direction is then obtained by the inverse transformation g = R^(-1)·gn;
step three, building a shallow residual neural network fused with the parameter-free attention mechanism;
in visual processing tasks, neurons exhibiting significant spatial suppression should be given higher weight; the goal of the parameter-free attention mechanism is to find the linear separability between a target neuron exhibiting significant spatial suppression and the other neurons, and on this basis an energy function is defined for each neuron, as shown in formula (13):
e_t(w_t, b_t, y, x_i) = (y_t − t̂)^2 + (1/(M−1))·Σ_{i=1..M−1} (y_o − x̂_i)^2, with t̂ = w_t·t + b_t and x̂_i = w_t·x_i + b_t   (13)
where w_t in formula (13) is a weight, b_t is a bias, t and x_i are the target neuron and the other neurons in a single channel of the input feature, M is the number of neurons in that channel and i is an index over the spatial dimension; binary labels are adopted for y_t and y_o, taking the values 1 and −1 respectively, and a regularization term is added to formula (13) to obtain the final energy function:
e_t(w_t, b_t, y, x_i) = (1/(M−1))·Σ_{i=1..M−1} (−1 − (w_t·x_i + b_t))^2 + (1 − (w_t·t + b_t))^2 + λ·w_t^2   (14)
the fast closed-form solution of w_t and b_t in formula (14) is obtained by the following formulas:
w_t = −2(t − μ_t) / ((t − μ_t)^2 + 2σ_t^2 + 2λ),  b_t = −(t + μ_t)·w_t / 2   (15)
where μ_t = (1/(M−1))·Σ_{i=1..M−1} x_i and σ_t^2 = (1/(M−1))·Σ_{i=1..M−1} (x_i − μ_t)^2 are respectively the mean and variance of all neurons in the channel except the target neuron; the preferred range of λ is 10^-1 to 10^-6; all pixels in a single channel follow the same distribution, so to avoid calculating the mean μ and variance σ^2 separately for every location, the minimum energy is calculated with the following formula:
e_t* = 4(σ̂^2 + λ) / ((t − μ̂)^2 + 2σ̂^2 + 2λ)   (16)
where μ̂ = (1/M)·Σ_{i=1..M} x_i and σ̂^2 = (1/M)·Σ_{i=1..M} (x_i − μ̂)^2 in formula (16); the lower the energy e_t*, the more the neuron t differs from the surrounding neurons and the more important it is for visual processing, so the importance of each neuron can be obtained from 1/e_t*; the final parameter-free attention module formula is:
X̃ = sigmoid(1/E) ⊙ X   (17)
where E in formula (17) groups all the energy values e_t* across the channel and spatial dimensions, ⊙ denotes element-wise multiplication, and adding the sigmoid function does not affect the relative importance of each neuron because the sigmoid is a monotonically increasing function;
the sight line estimation network receives the input normalized face image and processes it with a convolutional neural network fused with the parameter-free attention mechanism, specifically as follows: the normalized face image first passes through a first convolution module with a stride of 2 containing 64 convolution kernels of size 7 × 7, giving a feature map of size 112 × 112; after this convolution step the feature map is scaled by a stride-2 max-pooling operation, which has the advantages of translation, rotation and scale invariance, giving a feature map of size 56 × 56; the feature map output by the previous step is fed into layer1, which contains two residual blocks; the four convolution layers in these two residual blocks all use 64 convolution kernels of size 3 × 3 with a stride of 1, so the size of the feature map is unchanged after layer1; the feature map obtained in the previous step is fed into layer2, which contains 2 residual blocks; the first convolution of the first residual block in layer2 has a stride of 2, the other parameters are unchanged, and the convolution layers use 128 convolution kernels of size 3 × 3, giving a feature map of size 28 × 28; the feature map obtained in the previous step is fed into layer3, which contains 2 residual blocks; the first convolution of the first residual block in layer3 has a stride of 2, the other parameters are unchanged, and the convolution layers use 256 convolution kernels of size 3 × 3, giving a feature map of size 14 × 14; the feature map is then fed into a 1 × 1 convolution layer for a further convolution operation; after the residual blocks have finished extracting the facial feature information, the extracted feature information is reshaped into a vector, concatenated and fused with the head pose information, and regressed to the sight line direction in the normalized space through two fully connected layers; the actual sight line direction is then obtained by the inverse normalization g = R^(-1)·gn.
2. The sight line estimation method fusing a parameter-free attention mechanism according to claim 1, characterized in that when the gradient histogram of a cell unit is calculated, a 9-bin histogram is used to accumulate the gradient information of the 8 × 8 pixels, i.e. the 0°–360° range of gradient directions is divided into 9 direction blocks:
the first directional block range is: 0-20 degrees and 180-200 degrees;
the second directional block range is: 20-40 degrees and 200-220 degrees;
the third directional block range is: 40-60 degrees and 220-240 degrees;
the fourth directional block range is: 60-80 degrees and 240-260 degrees;
the fifth directional block range is: 80-100 degrees and 260-280 degrees;
the sixth directional block range is: 100-120 degrees and 280-300 degrees;
the seventh directional block range is: 120-140 degrees and 300-320 degrees;
the eighth directional block range is: 140-160 degrees and 320-340 degrees;
the ninth directional block range is: 160-180 degrees and 340-360 degrees;
if the gradient direction of a pixel in the cell unit lies in 20–40 degrees or 200–220 degrees, the count of the 2nd bin of the histogram is increased by 1; in this way the gradient direction of every pixel in the cell unit is projected into the histogram as a weighted vote and mapped into the corresponding angle block, giving the gradient direction histogram of the cell unit, i.e. the 9-dimensional feature vector corresponding to that cell unit.
3. The sight line estimation method fusing a parameter-free attention mechanism according to claim 1, characterized in that the values of λ in formula (15) are all 10^-4.
CN202210300517.7A 2022-03-25 2022-03-25 Sight line estimation method fusing a parameter-free attention mechanism Withdrawn CN114724211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210300517.7A CN114724211A (en) Sight line estimation method fusing a parameter-free attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210300517.7A CN114724211A (en) Sight line estimation method fusing a parameter-free attention mechanism

Publications (1)

Publication Number Publication Date
CN114724211A true CN114724211A (en) 2022-07-08

Family

ID=82240844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210300517.7A Withdrawn CN114724211A (en) 2022-03-25 2022-03-25 Sight estimation method integrating non-attention mechanism

Country Status (1)

Country Link
CN (1) CN114724211A (en)

Similar Documents

Publication Publication Date Title
Holte et al. View-invariant gesture recognition using 3D optical flow and harmonic motion context
US10949649B2 (en) Real-time tracking of facial features in unconstrained video
Dornaika et al. On appearance based face and facial action tracking
WO2022156640A1 (en) Gaze correction method and apparatus for image, electronic device, computer-readable storage medium, and computer program product
US20150035825A1 (en) Method for real-time face animation based on single video camera
US20140043329A1 (en) Method of augmented makeover with 3d face modeling and landmark alignment
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109598196B (en) Multi-form multi-pose face sequence feature point positioning method
CN108446672B (en) Face alignment method based on shape estimation of coarse face to fine face
US9158963B2 (en) Fitting contours to features
CN112419170A (en) Method for training occlusion detection model and method for beautifying face image
US9202138B2 (en) Adjusting a contour by a shape model
CN110751097B (en) Semi-supervised three-dimensional point cloud gesture key point detection method
Wu et al. Eyenet: A multi-task deep network for off-axis eye gaze estimation
CN111815768B (en) Three-dimensional face reconstruction method and device
CN115661246A (en) Attitude estimation method based on self-supervision learning
CN112446322A (en) Eyeball feature detection method, device, equipment and computer-readable storage medium
Chang et al. 2d–3d pose consistency-based conditional random fields for 3d human pose estimation
CN111626152A (en) Space-time sight direction estimation prototype design based on Few-shot
Pais et al. Omnidrl: Robust pedestrian detection using deep reinforcement learning on omnidirectional cameras
Sun et al. Adaptive image dehazing and object tracking in UAV videos based on the template updating Siamese network
CN108694348B (en) Tracking registration method and device based on natural features
CN114724211A (en) Sight line estimation method fusing a parameter-free attention mechanism
Li et al. Evaluating effects of focal length and viewing angle in a comparison of recent face landmark and alignment methods
CN113962846A (en) Image alignment method and device, computer readable storage medium and electronic device

Legal Events

Code — Title
PB01 — Publication
SE01 — Entry into force of request for substantive examination
WW01 — Invention patent application withdrawn after publication (application publication date: 20220708)