CN114241179A - Sight estimation method based on self-learning - Google Patents

Sight estimation method based on self-learning

Info

Publication number
CN114241179A
Authority
CN
China
Prior art keywords
tree
network
probability
leaf
node
Prior art date
Legal status
Pending
Application number
CN202111480164.5A
Other languages
Chinese (zh)
Inventor
孟明明
潘力立
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202111480164.5A
Publication of CN114241179A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a sight line estimation method based on self-learning, belonging to the field of computer vision. The method first selects a deep regression forest as the basic framework and introduces two independent sub-networks for feature extraction, fusing the extracted features through a feature fusion network to improve the network's feature extraction capability. A regression forest structure is then introduced as the regression model to estimate the probability distribution of the sight line direction of the input image, from which the prediction result and the entropy of each sample are computed. Finally, the whole network model is trained with a self-learning method, in which the sample entropy is used to correct the ordering of the samples, completing the training of the whole model. The method fully exploits the advantages of the deep regression forest and the self-learning training strategy, improving the accuracy and robustness of the model on the sight line estimation task.

Description

Sight estimation method based on self-learning
Technical Field
The invention belongs to the field of computer vision and mainly relates to image-based sight line estimation; it is chiefly applied in film and television entertainment, human-computer interaction, machine vision understanding and similar areas.
Background
Sight line estimation refers to taking an image containing an eye region as input, analyzing and processing it with computer technology, and estimating the sight line direction of the eyes in the input image. Demand for sight line estimation is growing in film and television entertainment, human-computer interaction, machine vision understanding and other fields. For example, the sight line direction can be computed in real time from a camera to improve the efficiency of human-computer interaction; in behavior analysis in public places, sight line estimation can help analyze the visual behavior of a monitored subject. Existing sight line estimation methods fall mainly into model-based and appearance-based methods.
The model-based sight line estimation method is an early approach whose basic principle can be divided into three steps. First, the eye position is roughly extracted from the image using a classifier and the center of the eye is located using a shape-based method; second, the eye area is detected and a two-dimensional elliptical contour covering the eye area is modeled on the basis of the corneal limbus; third, the two-dimensional elliptical contour is back-projected into three-dimensional space to locate the optical axis of the eye, and the gaze direction is then estimated from the intersection of the optical axis with the screen. This approach relies on accurate modeling of the eye image, places high demands on the quality of the input image, has poor robustness to interference, and often fails to meet the required estimation accuracy. Reference: Wood E, Bulling A. EyeTab: Model-based gaze estimation on unmodified tablet computers. Proceedings of the Symposium on Eye Tracking Research and Applications, 2014: 207-.
The appearance-based sight line estimation method computes the sight line direction directly from eye images: a model is trained on a large number of labeled eye images so that it learns a mapping function that estimates the sight line direction directly from an eye image. Its advantages are that the complicated geometric modeling of the eye can be avoided, the quality requirements on the input eye image are reduced, and the estimation accuracy is improved. Its disadvantages are that training relies on a large number of accurately labeled images, the robustness of the model is not high, the estimation accuracy may drop significantly in cross-person scenarios, and effective cross-person transfer prediction cannot be performed. Reference: Fischer T, Chang H J, Demiris Y. RT-GENE: Real-time eye gaze estimation in natural environments. Proceedings of the European Conference on Computer Vision (ECCV), 2018: 334-.
In recent years, appearance-based sight line estimation has matured, and higher requirements have been placed on its accuracy and robustness. Existing methods still have problems in model training and cannot achieve sufficient accuracy and robustness. Addressing these shortcomings, the invention provides a sight line estimation method based on self-learning that markedly improves accuracy and robustness.
Disclosure of Invention
The invention discloses a sight line estimation method based on self-learning, which solves the problems of low sight line estimation precision and poor robustness in the prior art.
The method first selects a deep regression forest as the basic framework. Each training sample consists of a pair of left and right eye images, and each monocular image is normalized to a size of 36 × 60 × 3. A feature extraction network is constructed for each of the left and right eyes; the features extracted from the two eyes are taken as input to a feature fusion network, yielding a fusion feature vector, which in turn serves as the input feature of a regression forest that estimates the sight line direction of the input image. A self-learning strategy is introduced into the training of the model: the ordering of the samples is corrected based on sample uncertainty, and training samples are gradually added to the training process until the training of the model is complete. Once trained, the sight line direction can be estimated simply by feeding the left and right eye images into the trained network model. By exploiting the advantages of the deep regression forest and self-learning, the estimation accuracy and robustness of the model are improved. The overall structure of the algorithm is shown in fig. 1.
For the convenience of describing the present disclosure, certain terms are first defined.
Definition 1: normal distribution. Also known as the Gaussian distribution, it is a probability distribution of great importance in mathematics, physics, engineering and other fields, with significant influence on many aspects of statistics. A random variable $x$ follows a normal distribution if its probability density function satisfies

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mathematical expectation and $\sigma^2$ is the variance of the distribution; this is commonly written $x \sim \mathcal{N}(\mu, \sigma^2)$.
Definition 2: the ReLU function. The rectified linear unit is an activation function commonly used in artificial neural networks, generally referring to the ramp function and its variants; its expression is f(x) = max(0, x).
Definition 3: the sigmoid function. Defined by the expression

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
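For concreteness, the two activation functions defined above can be written in a few lines of Python:

```python
import math

def relu(x):
    # Rectified linear unit: f(x) = max(0, x)
    return max(0.0, x)

def sigmoid(x):
    # Logistic sigmoid: sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
```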
Therefore, the technical scheme of the invention is a sight line estimation method based on self-learning, which comprises the following steps:
step 1: preprocessing the data set;
acquiring a data set, wherein the data set consists of images and their corresponding annotation information; extracting the left and right eye regions of each image according to the annotation information, and randomly shuffling the order of the left-right eye image pairs; finally, normalizing the pixel values of the images to the range [-1, 1];
step 2: constructing a convolutional neural network, wherein the convolutional neural network comprises a feature extraction network and a feature fusion network;
1) constructing a feature extraction network; the feature extraction network consists of two sub-networks with the same structure, and each sub-network takes a monocular image as input and outputs a feature vector; a sub-network is composed of 5 convolution blocks and 1 standard fully-connected layer, the convolution blocks consisting of 2, 3 and 3 standard convolutional layers respectively; a max-pooling layer with stride 2 is added between convolution blocks, another max-pooling layer with stride 2 follows the 5th convolution block, and finally a standard fully-connected layer outputs the corresponding feature vector; the standard convolutional layer, the standard fully-connected layer, the sub-networks and the feature extraction network are shown in fig. 3.
2) Constructing a feature fusion network; the feature fusion network takes the feature vectors corresponding to the left and right eyes as input and outputs the fusion feature vector; the feature fusion network is composed of 2 standard fully-connected layers and 1 fully-connected layer without activation; the two input feature vectors are first concatenated and then passed through the feature fusion network to output the fusion feature vector; the feature fusion network is shown in fig. 4.
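As a sanity check on the feature extraction network of step 2, the sketch below traces how the spatial size of a 36 × 60 monocular input shrinks through the five stride-2 max-pooling layers. It assumes the convolutional layers themselves preserve spatial size (e.g. 'same' padding); the patent does not state the padding, so that is an assumption:

```python
def pool(h, w):
    # 2x2 max pooling with stride 2: floor division halves each dimension.
    return h // 2, w // 2

h, w = 36, 60  # normalized monocular eye image is 36 x 60 x 3
shapes = [(h, w)]
for _ in range(5):  # one stride-2 pooling per convolution block
    h, w = pool(h, w)
    shapes.append((h, w))
print(shapes)  # [(36, 60), (18, 30), (9, 15), (4, 7), (2, 3), (1, 1)]
```

Under this assumption the spatial extent collapses to 1 × 1 after the fifth pooling, leaving the fully-connected layer to produce the feature vector.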
step 3: constructing a regression forest; the regression forest is composed of 5 complete binary trees, each of depth 6; each tree is composed of 31 internal nodes and 32 leaf nodes, each internal node has a splitting function, and each leaf node has a Gaussian distribution; the probability $s_n$ of moving to the left at the current internal node is calculated from the splitting function of the n-th internal node; after the left-move probabilities of all internal nodes are calculated, the arrival probability $w_\ell$ of each leaf node can be calculated starting from the root node; then, from the probability of reaching each leaf and the leaf distributions, the prediction result of the current tree is calculated; finally, the average of the 5 tree predictions is taken as the final sight line estimation result;
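The node counts in step 3 follow from the structure of a complete binary tree: a tree of depth 6 holds its 2^5 − 1 = 31 internal nodes on the first five levels and its 2^5 = 32 leaves on the last. A quick check:

```python
depth = 6  # levels of the complete binary tree
internal_nodes = 2 ** (depth - 1) - 1  # levels 1..5 hold the split nodes
leaf_nodes = 2 ** (depth - 1)          # level 6 holds the leaves
print(internal_nodes, leaf_nodes)      # 31 32
assert internal_nodes == 31 and leaf_nodes == 32
```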
and 4, step 4: an overall neural network; respectively extracting the feature vectors f of the left eye image and the right eye image by using the feature extraction network in the step 2l,fr(ii) a Then extracting the feature vector fl,frInputting as a feature fusion network to further obtain a fusion feature vector f; finally, calculating the left shift probability of each tree internal node in the regression forest based on the fusion feature vector and the splitting function, and further calculating a final prediction result; the general neural network structure is schematically shown in fig. 1.
step 5: designing a loss function; denote the i-th pair of left and right eye images obtained in step 1 by $x_i$, the label of the i-th pair by $y_i$, and the weight of the i-th sample pair by $v_i$; let $\Theta$ denote the parameters of the feature extraction network and the feature fusion network, and $\pi$ the parameters of the regression-forest leaf Gaussian distributions; the loss function can be represented as

$$\mathcal{L}(\Theta,\pi,v) = \sum_{i=1}^{N} v_i \bigl( \log p(y_i \mid x_i;\Theta,\pi) + \gamma H_i \bigr) + \lambda \sum_{i=1}^{N} v_i$$

where $p(y_i \mid x_i;\Theta,\pi)$ indicates the probability of $y_i$ under the current model parameters, $H_i$ denotes the entropy of the i-th sample pair, $\gamma$ denotes the weight coefficient of the entropy, and $\lambda$ is the control parameter of the learning process; both are hyper-parameters of the model; the goal of the entire model is to maximize the above loss function;
step 6: training the overall neural network based on self-learning; the training of the network model is completed according to the self-learning strategy;
and 7: and estimating the sight in the actual image by adopting the trained total neural network.
Further, the specific method of step 3 is as follows:
step 3.1: calculating the left-move probability of each internal node: the splitting function $s_n(x_i;\Theta): x_i \rightarrow [0,1]$ is determined by the network parameters $\Theta$ and maps an input sample $x_i$ to a scalar between 0 and 1, representing the probability that the sample is routed into the left subtree after reaching the current node; the concrete form of the splitting function is

$$s_n(x_i;\Theta) = \sigma\bigl(f_{\varphi(n)}(x_i;\Theta)\bigr)$$

where $\sigma(\cdot)$ is the sigmoid function, $\varphi(n)$ is an index function indicating which element of the fusion feature $f$ is selected at the n-th split node, and $f_{\varphi(n)}(x_i;\Theta)$ denotes the value of that element for sample $x_i$;
step 3.2: calculating the probability of reaching a leaf: for each sample pair, the probability of reaching each leaf node from the root node is calculated from the left-move probabilities of the split nodes; the arrival probability is given by

$$w_\ell(x_i;\Theta) = \prod_{n} s_n(x_i;\Theta)^{[\ell \in \mathcal{L}_{n_l}]} \bigl(1 - s_n(x_i;\Theta)\bigr)^{[\ell \in \mathcal{L}_{n_r}]}$$

where $[\cdot]$ is the indicator function, returning 1 if the condition is true and 0 otherwise, and $\mathcal{L}_{n_l}$ and $\mathcal{L}_{n_r}$ respectively denote the node sets of the subtrees rooted at the left and right children of split node $n$;
step 3.3: calculating the prediction result of a single tree: the distribution of leaf node $\ell$ is represented by a Gaussian $\mathcal{N}(y_i;\mu_\ell,\sigma_\ell^2)$, where $y_i$ denotes the sight line angle, $\mu_\ell$ the mean and $\sigma_\ell^2$ the variance of the Gaussian; considering that a tree is composed of multiple leaf nodes, the final prediction is the weighted average over all leaves according to the arrival probability:

$$p_{\mathcal{T}}(y_i \mid x_i;\Theta,\pi) = \sum_{\ell \in \mathcal{L}} w_\ell(x_i;\Theta)\, \mathcal{N}(y_i;\mu_\ell,\sigma_\ell^2)$$

where $w_\ell(x_i;\Theta)$ denotes the probability of reaching leaf $\ell$, $\mathcal{N}(y_i;\mu_\ell,\sigma_\ell^2)$ denotes the probability of $y_i$ under leaf $\ell$, and $\mathcal{L}$ denotes the set of leaves of tree $\mathcal{T}$;
step 3.4: calculating the prediction result of the regression forest: the final prediction for the sample is the average of the predictions of the individual trees, given by

$$p(y_i \mid x_i;\Theta,\pi) = \frac{1}{K} \sum_{k=1}^{K} p_{\mathcal{T}_k}(y_i \mid x_i;\Theta,\pi_k)$$

where $K$ denotes the number of trees in the regression forest, $p_{\mathcal{T}_k}$ is the prediction of the k-th tree, and $\pi_k$ is the leaf distribution parameter of the k-th tree;
further, the method for calculating the sample entropy in step 5 is as follows:
since a single tree is obtained by weighted summation of multiple leaf distributions, the integral of such a mixture gaussian distribution is non-trivial, where the lower bound of the single tree entropy is calculated to approximate the true value of the single tree entropy, which is calculated by:
$$H_{\mathcal{T}_k}(x_i) = \sum_{\ell \in \mathcal{L}} w_\ell(x_i;\Theta)\, \tfrac{1}{2} \log\bigl( 2\pi e\, \sigma_\ell^2 \bigr)$$

where $p_{\mathcal{T}_k}$ is the prediction of the k-th tree and $\pi_k$ is the leaf distribution parameter of the k-th tree; the entropy of the sample is then obtained from the average of the entropies of the trees, calculated by

$$H_i = \frac{1}{K} \sum_{k=1}^{K} H_{\mathcal{T}_k}(x_i)$$
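Because a Gaussian mixture's entropy has no closed form, the patent works with a lower bound on each tree's entropy. One standard such bound is the conditional entropy Σ_ℓ w_ℓ · ½ log(2πe σ_ℓ²); the sketch below assumes that form (an assumption on my part) and then averages over trees as described:

```python
import math

def tree_entropy_lower_bound(weights, variances):
    # H(mixture) >= sum_l w_l * H(N(mu_l, sigma_l^2))
    #             = sum_l w_l * 0.5 * log(2 * pi * e * sigma_l^2)
    return sum(w * 0.5 * math.log(2 * math.pi * math.e * v)
               for w, v in zip(weights, variances))

# Hypothetical leaf arrival probabilities and variances for K = 2 trees.
trees = [
    ([0.1, 0.4, 0.3, 0.2], [1.0, 0.5, 2.0, 1.5]),
    ([0.25, 0.25, 0.25, 0.25], [1.0, 1.0, 1.0, 1.0]),
]
# The sample entropy averages the per-tree entropies.
H = sum(tree_entropy_lower_bound(w, v) for w, v in trees) / len(trees)
print(round(H, 4))
```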
the innovation of the invention is that:
1) Two independent sub-networks are used to extract the features of the left and right eye images respectively, and the extracted features are fused through a feature fusion network, as shown in fig. 6.
2) A regression forest structure is introduced as the regression model to estimate the probability distribution of the sight line direction of the input image; the prediction result and the entropy of each sample are calculated from this distribution.
3) The self-learning paradigm is introduced to train the deep regression forest model; the ordering of the samples in self-learning is corrected using the sample uncertainty, improving the prediction accuracy and robustness of the model.
Drawings
FIG. 1 is a diagram of the main network structure of the method of the present invention.
FIG. 2 is a schematic diagram of a standard convolution block and a standard fully-connected block of the present invention.
Fig. 3 is a schematic diagram of a feature extraction network according to the present invention.
FIG. 4 is a schematic diagram of a feature fusion network according to the present invention.
FIG. 5 is a schematic view of a regression forest structure according to the present invention.
FIG. 6 is a flow chart of the model training algorithm for the self-learning of the present invention.
Detailed Description
Step 1: preprocessing the data set;
acquiring the MPIIGaze data set, which consists of images of 15 people and the corresponding annotation information, with 1500 images per person; extracting the left and right eye regions of each image according to the annotation information so that each monocular image has a size of 36 × 60 × 3, and randomly shuffling the order of the left-right eye image pairs; finally, normalizing the pixel values of the images to the range [-1, 1];
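The pixel normalization in step 1 maps 8-bit values from [0, 255] to [-1, 1]; a minimal sketch (the patent does not spell out the exact scaling, so the linear map below is an assumption):

```python
def normalize_pixels(img):
    # Linearly map 8-bit pixel values [0, 255] to [-1, 1].
    return [[p / 127.5 - 1.0 for p in row] for row in img]

out = normalize_pixels([[0, 64, 128, 255]])[0]
print([round(v, 3) for v in out])  # [-1.0, -0.498, 0.004, 1.0]
```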
step 2: constructing a convolutional neural network and a regression forest;
1) Constructing a feature extraction network. The feature extraction network consists of two sub-networks with the same structure, and each sub-network takes a monocular image as input and outputs a feature vector. A sub-network is composed of 5 convolution blocks and 1 standard fully-connected layer, the convolution blocks consisting of 2, 3 and 3 standard convolutional layers respectively; a max-pooling layer with stride 2 is added between convolution blocks, another max-pooling layer with stride 2 follows the 5th convolution block, and finally a standard fully-connected layer outputs the corresponding feature vector. The standard convolutional layer, the standard fully-connected layer, the sub-networks and the feature extraction network are shown in fig. 2.
2) Constructing a feature fusion network. The feature fusion network takes the feature vectors corresponding to the left and right eyes as input and outputs the fusion feature vector. It is composed of 2 standard fully-connected layers and 1 fully-connected layer without activation; the two input feature vectors are first concatenated and then passed through the feature fusion network to output the fusion feature vector. The feature fusion network is shown in fig. 4.
step 3: constructing a regression forest. The regression forest is composed of 5 complete binary trees, each of depth 6. Each tree is composed of 31 internal nodes and 32 leaf nodes; each internal node has a splitting function and each leaf node has a Gaussian distribution. The probability $s_n$ of moving to the left at the current internal node is calculated from the splitting function of the n-th internal node. After the left-move probabilities of all internal nodes have been calculated, the arrival probability $w_\ell$ of each leaf node can be calculated starting from the root node; then, from the probability of reaching each leaf and the leaf distributions, the prediction result of the current tree is calculated. Finally, the average of the 5 tree predictions is taken as the final sight line estimation result.
step 4: the overall neural network. The feature vectors $f_l, f_r$ of the left and right eye images are respectively extracted with the feature extraction network of step 2; the extracted feature vectors $f_l, f_r$ are then used as input to the feature fusion network to obtain the fusion feature vector $f$; finally, the left-move probability of each tree's internal nodes in the regression forest is calculated from the fusion feature vector and the splitting functions, and the final prediction result is calculated. The overall neural network structure is shown schematically in fig. 1.
step 5: designing a loss function. Denote the i-th pair of left and right eye images obtained in step 1 by $x_i$, the label of the i-th pair by $y_i$, and the weight of the i-th sample pair by $v_i$; let $\Theta$ denote the parameters of the feature extraction network and the feature fusion network, and $\pi$ the parameters of the regression-forest leaf Gaussian distributions. The loss function can be represented as

$$\mathcal{L}(\Theta,\pi,v) = \sum_{i=1}^{N} v_i \bigl( \log p(y_i \mid x_i;\Theta,\pi) + \gamma H_i \bigr) + \lambda \sum_{i=1}^{N} v_i$$

where $p(y_i \mid x_i;\Theta,\pi)$ indicates the probability of $y_i$ under the current model parameters, $H_i$ denotes the entropy of the i-th sample pair, $\gamma$ denotes the weight coefficient of the entropy, and $\lambda$ is the control parameter of the learning process; both are hyper-parameters of the model. The goal of the overall model is to maximize the above loss function.
Step 6: training a network model based on self-learning; and finishing the training of the network model according to a self-walking learning strategy, setting the total step number of the self-walking learning to be 6, and setting the number of the samples used from the step 1 to the step 6 to be 50%, 60%, 70%, 80%, 90% and 100% of the total sample number. Initializing lambda0,γ0Ensure that 50% of the data is added to the 1 st training. During each training step, the loss function in the step 5 is maximized, the network parameters and the regression forest parameters are updated, and after the training is finished, the lambda and the gamma are adjusted to ensure that samples with corresponding proportions are added to the next stepAnd (5) a one-step training process. A flow chart of a model training algorithm based on self-learning is shown in fig. 3.
step 7: the testing stage. An image to be tested is taken and preprocessed according to the method of step 1; the preprocessed image pair is then used as input to the model trained in step 6, yielding the sight line estimation result for the test image. In experiments the mean error on the MPIIGaze data set was 4.45°, an improvement of 0.17° over previous methods.
Further, the specific method of step 3 is as follows:
step 3.1: calculating the left-move probability of each internal node: the splitting function $s_n(x_i;\Theta): x_i \rightarrow [0,1]$ is determined by the network parameters $\Theta$ and maps an input sample $x_i$ to a scalar between 0 and 1, characterizing how likely the sample is to be routed into the left subtree after reaching the current node. The concrete form of the splitting function is

$$s_n(x_i;\Theta) = \sigma\bigl(f_{\varphi(n)}(x_i;\Theta)\bigr)$$

where $\sigma(\cdot)$ is the sigmoid function, $\varphi(n)$ is an index function indicating which element of the fusion feature $f$ is selected at the n-th split node, and $f_{\varphi(n)}(x_i;\Theta)$ denotes the value of that element for sample $x_i$.
Step 3.2: calculate the probability of reaching a leaf: for each sample pair, calculating the probability of arriving at each leaf node from the root node according to the left-shift probability of the split node, wherein the calculation of the arrival probability is given by the following formula:
Figure BDA0003393988260000077
wherein [. ]]Is an indication function, if true returns 1, otherwise returns 0;
Figure BDA0003393988260000081
respectively representing node sets of subtrees taking left and right children of the split node n as root nodes.
Step 3.3: calculating the prediction result of a single tree: by Gaussian distribution
Figure BDA0003393988260000082
The distribution state of the leaf nodes is represented, and considering that a tree is composed of a plurality of leaf nodes, the final prediction result is represented by the weighted average of all the leaves according to the arrival probability, and the form of the final prediction result is as follows:
Figure BDA0003393988260000083
step 3.4: calculating the prediction result of the regression forest: the final prediction for the sample is the average of the predictions of the individual trees, given by

$$p(y_i \mid x_i;\Theta,\pi) = \frac{1}{K} \sum_{k=1}^{K} p_{\mathcal{T}_k}(y_i \mid x_i;\Theta,\pi_k)$$
further, the specific method of step 5 is as follows:
step 5.1: calculating the prediction result of the sample: according to the method of step 3, the prediction result of the regression forest, $p(y_i \mid x_i;\Theta,\pi)$, is calculated.
step 5.2: calculating the entropy of the sample: since a single tree's prediction is a weighted sum of several leaf distributions, the entropy integral of such a Gaussian mixture is non-trivial; a lower bound on the single-tree entropy is therefore calculated to approximate its true value:

$$H_{\mathcal{T}_k}(x_i) = \sum_{\ell \in \mathcal{L}} w_\ell(x_i;\Theta)\, \tfrac{1}{2} \log\bigl( 2\pi e\, \sigma_\ell^2 \bigr)$$

where $p_{\mathcal{T}_k}$ is the prediction of the k-th tree and $\pi_k$ is the leaf distribution parameter of the k-th tree. The entropy of the sample is then obtained from the average of the entropies of the trees, calculated by

$$H_i = \frac{1}{K} \sum_{k=1}^{K} H_{\mathcal{T}_k}(x_i)$$

Claims (3)

1. A sight line estimation method based on self-learning, comprising the following steps:
step 1: preprocessing the data set;
acquiring a data set, wherein the data set consists of images and their corresponding annotation information; extracting the left and right eye regions of each image according to the annotation information, and randomly shuffling the order of the left-right eye image pairs; finally, normalizing the pixel values of the images to the range [-1, 1];
step 2: constructing a convolutional neural network, wherein the convolutional neural network comprises a feature extraction network and a feature fusion network;
1) constructing a feature extraction network; the feature extraction network consists of two sub-networks with the same structure, and each sub-network takes a monocular image as input and outputs a feature vector; a sub-network is composed of 5 convolution blocks and 1 standard fully-connected layer, the convolution blocks consisting of 2, 3 and 3 standard convolutional layers respectively; a max-pooling layer with stride 2 is added between convolution blocks, another max-pooling layer with stride 2 follows the 5th convolution block, and finally a standard fully-connected layer outputs the corresponding feature vector;
2) constructing a feature fusion network; the feature fusion network takes the feature vectors corresponding to the left and right eyes as input and outputs the fusion feature vector; the feature fusion network is composed of 2 standard fully-connected layers and 1 fully-connected layer without activation; the two input feature vectors are first concatenated and then passed through the feature fusion network to output the fusion feature vector;
step 3: constructing a regression forest; the regression forest is composed of 5 complete binary trees, each of depth 6; each tree is composed of 31 internal nodes and 32 leaf nodes, each internal node has a splitting function, and each leaf node has a Gaussian distribution; the probability $s_n$ of moving to the left at the current internal node is calculated from the splitting function of the n-th internal node; after the left-move probabilities of all internal nodes are calculated, the arrival probability $w_\ell$ of each leaf node can be calculated starting from the root node; then, from the probability of reaching each leaf and the leaf distributions, the prediction result of the current tree is calculated; finally, the average of the 5 tree predictions is taken as the final sight line estimation result;
step 4: the overall neural network; the feature vectors $f_l$ and $f_r$ of the left-eye and right-eye images are extracted with the feature extraction network of step 2; the extracted feature vectors $f_l$ and $f_r$ are then fed into the feature fusion network to obtain the fusion feature vector $f$; finally, the left-move probability of every internal tree node in the regression forest is computed from the fusion feature vector and the splitting functions, and from these the final prediction result is computed;
step 5: designing the loss function; denote the ith pair of left and right eye images obtained in step 1 by $x_i$, the label of the ith pair by $y_i$, the weight of the ith sample pair by $v_i$, the parameters of the feature extraction network and the feature fusion network by $\theta$, and the parameters of the leaf Gaussian distributions of the regression forest by $\pi$; the loss function can then be expressed as:
$$\mathcal{L}(\theta,\pi,v)=\sum_{i=1}^{N}v_i\left[\log p(y_i\mid x_i;\theta,\pi)+\gamma H_i\right]+\lambda\sum_{i=1}^{N}v_i$$
where $p(y_i\mid x_i;\theta,\pi)$ denotes the probability of $y_i$ under the current model parameters, $H_i$ denotes the entropy of the ith sample pair, $\gamma$ is the weight coefficient of the entropy term, and $\lambda$ is the pace parameter controlling the learning process; both $\gamma$ and $\lambda$ are hyper-parameters of the model; the goal of the entire model is to maximize the above objective;
step 6: training the overall neural network based on self-learning; the training of the network model is completed according to the self-learning strategy;
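The self-learning (self-paced) strategy of step 6 alternates between updating the sample weights and updating the model while the pace parameter $\lambda$ grows. Below is a toy sketch under strong assumptions that are ours, not the patent's: the regression forest is replaced by a single fixed-variance Gaussian with a closed-form weighted update, the entropy term $\gamma H_i$ is dropped, and the binary weight rule $v_i = 1$ iff $\log p_i > -\lambda$ is the standard closed-form solution for a $\lambda\sum_i v_i$ regularizer:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=200)   # toy gaze-angle labels

mu = 0.0          # single-Gaussian stand-in for the regression forest
lam = 1.5         # pace parameter lambda

def log_lik(y, mu, var=1.0):
    """Gaussian log-likelihood; variance kept fixed to keep the toy stable."""
    return -0.5 * np.log(2 * np.pi * var) - (y - mu) ** 2 / (2 * var)

for _ in range(10):
    # Weight step: easy samples (high likelihood) are admitted first.
    ll = log_lik(y, mu)
    v = (ll > -lam).astype(float)
    if v.sum() == 0:
        v[np.argmax(ll)] = 1.0                 # always keep the easiest sample
    # Model step: weighted maximum likelihood on the selected samples.
    mu = (v * y).sum() / v.sum()
    # Pace step: grow lambda so harder samples enter in later rounds.
    lam *= 1.5

print(round(mu, 2), v.mean())
```

After a few rounds all samples are selected and the model converges to the ordinary weighted estimate; the real method replaces the closed-form update with gradient training of $\theta$ and $\pi$.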
step 7: estimating the sight line in actual images with the trained overall neural network.
2. The sight line estimation method based on self-learning according to claim 1, wherein the specific method of the step 3 is as follows:
step 3.1: computing the left-move probability of each internal node: the splitting function $s_n(x_i;\theta): x_i \rightarrow [0,1]$ is determined by the network parameters $\theta$; it maps the input sample $x_i$ to a scalar between 0 and 1 that represents the probability of routing the sample into the left subtree after it reaches the current node; the splitting function has the concrete form:
$$s_n(x_i;\theta)=\sigma\big(f_{\varphi(n)}(x_i;\theta)\big)$$

where $\sigma(\cdot)$ is the sigmoid function, $\varphi(\cdot)$ is the index function that selects, at the nth split node, the $\varphi(n)$th element of the fusion feature $f$, and $f_{\varphi(n)}(x_i;\theta)$ denotes the value of that element for sample $x_i$ at the nth split node;
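A minimal sketch of step 3.1 (illustrative only; the feature size and the random assignment of feature elements to split nodes are our assumptions, since the claim does not fix how $\varphi$ is chosen):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

F_DIM = 128          # assumed size of the fusion feature vector f
N_INTERNAL = 31      # internal nodes of one depth-6 complete binary tree

# Index function phi: a fixed assignment of one feature element per split node.
phi = rng.integers(0, F_DIM, size=N_INTERNAL)

f = rng.normal(size=F_DIM)      # fusion feature of one sample x_i
s = sigmoid(f[phi])             # s_n = sigma(f_phi(n)(x_i)) for every node n
print(s.shape)                  # → (31,)
```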
step 3.2: computing the probability of reaching a leaf: for each sample pair, the probability of travelling from the root node to each leaf node is computed from the left-move probabilities of the split nodes; the arrival probability is given by:
$$\omega_l(x_i\mid\theta)=\prod_{n\in\mathcal{N}} s_n(x_i;\theta)^{[\,l\in\mathcal{L}_{n_{\mathrm{left}}}\,]}\big(1-s_n(x_i;\theta)\big)^{[\,l\in\mathcal{L}_{n_{\mathrm{right}}}\,]}$$

where $\mathcal{N}$ is the set of split nodes, $[\cdot]$ is the indicator function, returning 1 if its argument is true and 0 otherwise, and $\mathcal{L}_{n_{\mathrm{left}}}$ and $\mathcal{L}_{n_{\mathrm{right}}}$ denote the node sets of the subtrees rooted at the left and right children of split node $n$;
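Step 3.2 can be checked with a small sketch: for a heap-ordered complete binary tree, multiplying $s_n$ (go left) or $1-s_n$ (go right) along each root-to-leaf path yields arrival probabilities that sum to 1. The split probabilities here are random stand-ins for the network outputs:

```python
import numpy as np

rng = np.random.default_rng(3)

LEVELS = 5                       # 5 internal levels -> 31 split nodes, 32 leaves
N_INTERNAL = 2 ** LEVELS - 1
N_LEAVES = 2 ** LEVELS

# Stand-in left-move probabilities s_n in heap order (children of n: 2n+1, 2n+2).
s = rng.uniform(0.05, 0.95, size=N_INTERNAL)

def leaf_arrival(s):
    """omega_l for each leaf: product of s_n (left) or 1-s_n (right) on its path."""
    w = np.ones(N_LEAVES)
    for l in range(N_LEAVES):
        node = 0
        for bit in reversed(range(LEVELS)):   # leaf index bits encode the turns
            go_right = (l >> bit) & 1
            w[l] *= (1.0 - s[node]) if go_right else s[node]
            node = 2 * node + 1 + go_right
    return w

w = leaf_arrival(s)
print(round(float(w.sum()), 6))   # → 1.0  (the omega_l form a distribution)
```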
step 3.3: computing the prediction of a single tree: a Gaussian distribution $\pi_l=\mathcal{N}(y_i;\mu,\sigma^2)$ represents the distribution of a leaf node, where $y_i$ denotes the sight angle, $\mu$ the mean of the Gaussian and $\sigma^2$ its variance; since a tree consists of multiple leaf nodes, the final prediction is expressed as the weighted average of all leaves according to their arrival probabilities, in the form:
$$p_{\mathcal{T}}(y_i\mid x_i;\theta,\pi)=\sum_{l\in\mathcal{L}}\omega_l(x_i\mid\theta)\,p_l(y_i)$$

where $\omega_l(x_i\mid\theta)$ denotes the probability of reaching leaf $l$, $p_l(y_i)$ denotes the probability of $y_i$ under the distribution of leaf $l$, and $\mathcal{L}$ denotes the set of leaves of tree $\mathcal{T}$;
step 3.4: computing the prediction of the regression forest: the final prediction for a sample is the average of the predictions of the individual trees, given by:

$$p(y_i\mid x_i;\theta,\Pi)=\frac{1}{K}\sum_{k=1}^{K}p_{\mathcal{T}_k}(y_i\mid x_i;\theta,\pi_k)$$

where $K$ denotes the number of trees in the regression forest, $p_{\mathcal{T}_k}(y_i\mid x_i;\theta,\pi_k)$ is the prediction of the kth tree, and $\pi_k$ is the leaf distribution parameter of the kth tree;
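Steps 3.3 and 3.4 together define the forest's predictive density. The sketch below is illustrative only: the split probabilities and leaf Gaussians are random stand-ins, not learned parameters:

```python
import numpy as np

rng = np.random.default_rng(4)

LEVELS, K = 5, 5                     # depth-6 trees (31 splits, 32 leaves), 5 trees
N_INTERNAL, N_LEAVES = 2**LEVELS - 1, 2**LEVELS

def gauss_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def leaf_arrival(s):
    """omega_l for every leaf of one tree (heap-ordered split probabilities)."""
    w = np.ones(N_LEAVES)
    for l in range(N_LEAVES):
        node = 0
        for bit in reversed(range(LEVELS)):
            go_right = (l >> bit) & 1
            w[l] *= (1.0 - s[node]) if go_right else s[node]
            node = 2 * node + 1 + go_right
    return w

# Random stand-ins for what the network and leaf parameters would supply.
s_all   = rng.uniform(0.05, 0.95, size=(K, N_INTERNAL))   # split probabilities
mu_all  = rng.normal(scale=10.0, size=(K, N_LEAVES))      # leaf means (degrees)
var_all = rng.uniform(0.5, 2.0, size=(K, N_LEAVES))       # leaf variances

def tree_density(y, s, mu, var):
    """Step 3.3: p_T(y|x) as the omega-weighted mixture of leaf Gaussians."""
    return float((leaf_arrival(s) * gauss_pdf(y, mu, var)).sum())

def forest_density(y):
    """Step 3.4: average of the K single-tree predictions."""
    return float(np.mean([tree_density(y, s_all[k], mu_all[k], var_all[k])
                          for k in range(K)]))

print(forest_density(0.0) > 0.0)     # → True
```

Because each tree is a proper mixture of Gaussians, the forest density integrates to 1 over the gaze angle.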
3. The sight line estimation method based on self-learning according to claim 1, wherein the sample entropy in step 5 is calculated as follows:
since the prediction of a single tree is a weighted sum of several leaf distributions, the entropy of such a Gaussian mixture has no closed-form integral; a lower bound of the single-tree entropy is therefore used to approximate its true value, calculated by:
$$H_i^{\mathcal{T}_k}\ge-\sum_{l\in\mathcal{L}_k}\omega_l(x_i\mid\theta)\log\sum_{m\in\mathcal{L}_k}\omega_m(x_i\mid\theta)\,\mathcal{N}\big(\mu_l;\mu_m,\sigma_l^{2}+\sigma_m^{2}\big)$$

where $H_i^{\mathcal{T}_k}$ is the entropy of the prediction of the kth tree and $\pi_k$ is the leaf distribution parameter of the kth tree; the entropy of the sample is then obtained as the average of the entropies of the trees, calculated by:
$$H_i=\frac{1}{K}\sum_{k=1}^{K}H_i^{\mathcal{T}_k}$$
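Assuming the single-tree bound referred to above is the standard entropy lower bound for Gaussian mixtures due to Huber et al. (an assumption on our part), $H \ge -\sum_l \omega_l \log \sum_m \omega_m \mathcal{N}(\mu_l;\mu_m,\sigma_l^2+\sigma_m^2)$, it can be evaluated in closed form and sanity-checked against a Monte-Carlo estimate of the true entropy:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in mixture for one tree: arrival probabilities and leaf Gaussians.
L = 8
w = rng.dirichlet(np.ones(L))            # omega_l, summing to 1
mu = rng.normal(scale=5.0, size=L)       # leaf means
var = rng.uniform(0.5, 2.0, size=L)      # leaf variances

def gauss_pdf(y, m, v):
    return np.exp(-(y - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def entropy_lower_bound(w, mu, var):
    """Huber-style bound: -sum_l w_l log sum_m w_m N(mu_l; mu_m, var_l+var_m)."""
    inner = np.array([(w * gauss_pdf(mu[l], mu, var[l] + var)).sum()
                      for l in range(len(w))])
    return float(-(w * np.log(inner)).sum())

def entropy_monte_carlo(w, mu, var, n=100_000):
    """Unbiased Monte-Carlo estimate of the true mixture entropy."""
    comp = rng.choice(len(w), size=n, p=w)
    y = rng.normal(mu[comp], np.sqrt(var[comp]))
    dens = (w[None, :] * gauss_pdf(y[:, None], mu[None, :], var[None, :])).sum(axis=1)
    return float(-np.log(dens).mean())

lb = entropy_lower_bound(w, mu, var)
mc = entropy_monte_carlo(w, mu, var)
print(lb <= mc)   # → True: the bound never exceeds the true entropy
```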
CN202111480164.5A 2021-12-06 2021-12-06 Sight estimation method based on self-learning Pending CN114241179A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111480164.5A CN114241179A (en) 2021-12-06 2021-12-06 Sight estimation method based on self-learning


Publications (1)

Publication Number Publication Date
CN114241179A true CN114241179A (en) 2022-03-25

Family

ID=80753446


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599994A * 2016-11-23 2017-04-26 University of Electronic Science and Technology of China Sight line estimation method based on depth regression network
CN108765409A * 2018-06-01 2018-11-06 University of Electronic Science and Technology of China Screening method for candidate nodules based on CT images
CN110516537A * 2019-07-15 2019-11-29 University of Electronic Science and Technology of China Face age estimation method based on self-paced learning
CN111414875A * 2020-03-26 2020-07-14 University of Electronic Science and Technology of China Three-dimensional point cloud head attitude estimation system based on depth regression forest
WO2021022970A1 * 2019-08-05 2021-02-11 Qingdao University of Technology Multi-layer random forest-based part recognition method and system


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LILI PAN et al.: "Self-paced deep regression forests with consideration on underrepresented examples" *
TOBIAS FISCHER et al.: "RT-GENE: Real-time eye gaze estimation in natural environments" *
SHAN XINGHUA et al.: "Driver gaze estimation method based on improved random forest" *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination