CN117292421B

CN117292421B - GRU-based continuous vision estimation deep learning method

Info

Publication number: CN117292421B
Application number: CN202311173058.1A
Authority: CN
Inventors: 王可; 王进; 曹硕裕
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2023-09-12
Filing date: 2023-09-12
Publication date: 2024-05-28
Anticipated expiration: 2043-09-12
Also published as: CN117292421A

Abstract

The invention belongs to the field of computer vision and relates in particular to a GRU-based deep learning method for continuous line-of-sight (gaze) estimation. The method comprises the following steps: defining the image feature space dimension and the hidden state space dimension of the GRU; performing feature extraction and feature dimension reduction on an input face image using a pre-trained ResNet-50 model; processing the image feature vector to obtain the hidden state of the model; inputting the hidden state into the GRU for time-series modeling to generate an output vector; performing feature mapping on the output vector to obtain a new feature vector; mapping the new feature vector to a three-dimensional output vector; applying a hyperbolic tangent transformation to the first two elements of the three-dimensional output vector; transforming the third element of the three-dimensional output vector through a sigmoid function; and measuring the error between the predicted result and the true value using the PinBall loss function. By combining a ResNet-50 model with a GRU model, the invention achieves high accuracy in the continuous line-of-sight estimation task.

Description

GRU-based continuous vision estimation deep learning method

Technical Field

The invention belongs to the field of computer vision, and particularly relates to a GRU-based continuous vision estimation deep learning method.

Background

The goal of line-of-sight estimation is to determine the gaze direction and gaze point of a person in an image or video. Its importance stems from the fact that a person's potential behavior and intent can be inferred by observing where that person is looking. For example, a person looking down at a watch at a bus stop may have something urgent to attend to. Because gaze direction carries rich information, line-of-sight estimation can help to better understand a person's intent and to predict what the person might do next. Line-of-sight estimation therefore has broad application prospects in many fields.

Line-of-sight estimation methods can generally be classified into model-based methods and appearance-based methods. Model-based methods generally require dedicated devices, whereas appearance-based methods typically estimate human gaze using simple camera equipment and complex deep learning algorithms.

Early line-of-sight estimation methods take a monocular (single-eye) image as input, train a convolutional neural network, and output the two-dimensional coordinates of the line of sight. Subsequently, binocular line-of-sight estimation methods were proposed, which compensate for the deficiencies of the monocular methods by using complementary information from both eyes. However, both kinds of method still have drawbacks, such as the need for additional modules for eye detection and head pose estimation. Full-face line-of-sight estimation methods appeared later: they output the final estimate from a face image alone, their end-to-end learning strategy can take the global features of the whole face into account, and many modern line-of-sight estimation methods are based on this approach.

Chinese patent application CN114387679A proposes a line-of-sight estimation method based on a recurrent convolutional neural network. Its feature extraction part uses a convolutional neural network based on the DenseNet mechanism, and its line-of-sight regression part jointly encodes dynamic line-of-sight features through an LSTM network to regress the line-of-sight angle. The densely connected structure of DenseNet is superior to ResNet in parameter efficiency, but it requires a large amount of computation and memory when processing large data sets and complex tasks, which incurs considerable overhead. Compared with DenseNet, ResNet-50 is more computationally efficient when processing large data sets and complex tasks for line-of-sight estimation. Although LSTM is suitable for handling long-term dependencies, it has more parameters and requires more computation than GRU, so processing a large number of consecutive video frames may increase computational complexity and impair real-time performance. Compared with LSTM, the GRU has a simpler structure, fewer parameters, and a lower computational cost, and it also performs better than LSTM in the line-of-sight regression task.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a GRU-based deep learning method for continuous line-of-sight estimation that combines ResNet-50 with a GRU model and achieves high accuracy in the continuous line-of-sight estimation task. The method adopts the following technical scheme:

A GRU-based continuous vision estimation deep learning method comprises the following steps:

Step S1, defining the image feature space dimension and the hidden state space dimension of the GRU, which serve as the basic parameters for model training;

Step S2, performing feature extraction on an input face image I using a pre-trained ResNet-50 model to obtain a feature vector F, and performing feature dimension reduction through a linear transformation layer to obtain an image feature vector F';

Step S3, processing the image feature vector F' through the fully connected layer F_C1 to generate a hidden state H of the model;

Step S4, inputting the hidden state H into the GRU for time-series modeling to generate an output vector G of the GRU;

Step S5, performing feature mapping on the output vector G of the GRU through the fully connected layer F_C2 to obtain a new feature vector G';

Step S6, mapping the new feature vector G' into a three-dimensional output vector O through the fully connected layer F_C3, wherein the three-dimensional output vector O represents the predicted line-of-sight direction and the uncertainty of the line-of-sight prediction, the line-of-sight direction comprises a horizontal angle and a vertical angle of the line of sight, and the uncertainty of the line-of-sight prediction comprises an angle error;

Step S7, applying a hyperbolic tangent transformation to the first two elements of the three-dimensional output vector O to obtain the predicted line-of-sight direction;

Step S8, transforming the third element of the three-dimensional output vector O through a sigmoid function to obtain the uncertainty of the line-of-sight prediction;

Step S9, measuring the error between the predicted result and the true value using the PinBall loss function, and then back-propagating the error to update the network parameters.

Further, in step S1, the image feature space dimension is set to d, and the hidden-state space dimension of the GRU is set to h, where d=h=256.

Further, step S2 includes:

Using a pre-trained ResNet-50 model as the backbone convolutional neural network, deep features of the input face image I are extracted to obtain a feature vector F; the image feature vector F' is then obtained through the feature dimension reduction of the linear transformation layer according to the following formula:

F′＝W_L1·F+B_L1

where W_L1 and B_L1 are the weight matrix and bias vector, respectively, of the linear transformation layer.
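As a minimal sketch of this dimension reduction step, the linear transformation layer can be written as a matrix-vector product plus a bias. The function name `linear_layer` and the toy sizes (4 inputs, 2 outputs) are illustrative assumptions; in the method itself the output dimension is d = 256.

```python
def linear_layer(weight, bias, features):
    """Compute F' = W_L1 · F + B_L1 for one feature vector.

    weight: list of rows (out_dim x in_dim), bias: list (out_dim),
    features: list (in_dim). Pure-Python stand-in for a learned layer.
    """
    if len(weight) != len(bias):
        raise ValueError("weight and bias must have the same output dimension")
    if any(len(row) != len(features) for row in weight):
        raise ValueError("each weight row must match the input dimension")
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]


# Toy example: reduce a 4-dimensional feature F to 2 dimensions.
W_L1 = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 0.5, 0.5, 0.0]]
B_L1 = [0.1, -0.1]
F = [2.0, 4.0, 6.0, 8.0]
F_prime = linear_layer(W_L1, B_L1, F)
```

In a trained network the entries of W_L1 and B_L1 are learned; the values above are chosen only so that the result is easy to check by hand.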

Further, in step S3, it includes:

The hidden state H is obtained by passing the image feature vector F' through the fully connected layer F_C1, whose weight matrix and bias vector are learned parameters.

Further, the step S4 includes the steps of:

Step S401, selecting an all-zero vector as an initial state;

Step S402, obtaining an update gate and a reset gate through a sigmoid function and linear transformation of a hidden state H and an input feature vector F', wherein a calculation formula is as follows:

Z＝sigmoid(W_Z·[H,F′]+B_Z)

R＝sigmoid(W_R·[H,F′]+B_R)

where Z denotes the update gate of the GRU, R denotes the reset gate of the GRU, and W_Z, W_R, B_Z, and B_R are the weight and bias parameters learned during training;

Step S403, obtaining a candidate hidden state H' by using the information of the reset gate R, where the calculation formula is as follows:

H'＝tanh(W_H'·[R⊙H,F']+B_H')

where W_H' and B_H' are the weight and bias parameters learned during training, ⊙ denotes the element-wise product, and tanh is the hyperbolic tangent function;

Step S404, calculating the hidden state H at the current time step from the update gate Z, the candidate hidden state H', and the hidden state at the previous time step, where the calculation formula is as follows:

H＝(1-Z)⊙H+Z⊙H'

the hidden state H of the last time step serves as the output vector G of the current sequence.
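The recurrence in steps S401 to S404 can be sketched in plain Python as follows. This is an illustrative sketch rather than the patented implementation: the helper names (`sigmoid`, `affine`, `gru_step`, `gru_forward`), the parameter dictionary keys, and the toy sizes are assumptions, and `x` stands for the per-frame input feature vector F'.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weight, bias, vec):
    """Return W · vec + b for a list-of-rows W, and lists b and vec."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def gru_step(h_prev, x, params):
    """One GRU update following steps S402 to S404."""
    hx = h_prev + x  # list concatenation, i.e. [H, F']
    z = [sigmoid(v) for v in affine(params["W_Z"], params["B_Z"], hx)]
    r = [sigmoid(v) for v in affine(params["W_R"], params["B_R"], hx)]
    rhx = [ri * hi for ri, hi in zip(r, h_prev)] + x  # [R ⊙ H, F']
    h_cand = [math.tanh(v)
              for v in affine(params["W_H"], params["B_H"], rhx)]
    # H = (1 - Z) ⊙ H_prev + Z ⊙ H'
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, h_cand)]


def gru_forward(inputs, params, hidden_size):
    """Run the GRU over a sequence from an all-zero initial state (S401)."""
    h = [0.0] * hidden_size
    for x in inputs:
        h = gru_step(h, x, params)
    return h  # hidden state of the last time step, used as G


# Toy example: hidden size 2, input size 2, zero weights, non-zero B_H.
zeros = [[0.0] * 4 for _ in range(2)]
params = {"W_Z": zeros, "B_Z": [0.0, 0.0],
          "W_R": zeros, "B_R": [0.0, 0.0],
          "W_H": zeros, "B_H": [1.0, -1.0]}
G = gru_forward([[1.0, 2.0], [3.0, 4.0]], params, hidden_size=2)
```

With all weights set to zero the gates equal 0.5, so the state moves halfway toward the candidate tanh(B_H) at every step; after two frames each element of G equals 0.75 times the corresponding tanh value.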

Further, in step S5, it includes:

The new feature vector G' is obtained by passing the output vector G through the fully connected layer F_C2, whose weight matrix and bias vector are learned parameters; the new feature vector G' contains the deep features and the time-series information of the original input face image.

Further, in step S6, it includes:

The new feature vector G' is mapped to the three-dimensional output vector O through the fully connected layer F_C3, whose weight matrix and bias vector are learned parameters.

Further, in step S7, it includes:

A hyperbolic tangent transformation is applied to the first two elements of the output vector O to obtain the horizontal angle O_h and the vertical angle O_v of the line of sight, according to the following formulas:

O_h＝π·tanh(O[0])

O_v＝π/2·tanh(O[1])

where tanh is the hyperbolic tangent function, and O[0] and O[1] denote the first and second elements of the output vector O, respectively; after the hyperbolic tangent transformation, the predicted horizontal angle is limited to the range [-π, π] and the predicted vertical angle to the range [-π/2, π/2], which correspond to the actual angular range of the line of sight.

Further, in step S8, it includes:

The third element of the output vector O is transformed by a sigmoid function and multiplied by π to obtain the uncertainty σ of the line-of-sight prediction, according to the following formula:

σ＝π·sigmoid(O[2])

where σ lies in the range [0, π] and O[2] denotes the third element of the output vector O.
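Steps S7 and S8 can be illustrated together with a short Python sketch. The function name `decode_output` is an assumption introduced here; the formulas are those given above, with all angles in radians.

```python
import math


def decode_output(o):
    """Map a raw 3-element output O to (O_h, O_v, sigma) in radians."""
    if len(o) != 3:
        raise ValueError("expected a three-dimensional output vector")
    o_h = math.pi * math.tanh(o[0])            # horizontal angle
    o_v = (math.pi / 2) * math.tanh(o[1])      # vertical angle
    sigma = math.pi / (1.0 + math.exp(-o[2]))  # pi * sigmoid(O[2])
    return o_h, o_v, sigma


o_h, o_v, sigma = decode_output([0.0, 0.0, 0.0])
```

A raw output of all zeros therefore decodes to a line of sight with zero horizontal and vertical angle and an uncertainty of π/2, the midpoint of the range of σ.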

Further, in step S9, it includes:

The PinBall loss function is evaluated at the 10% and 90% quantiles. For each sample, the loss at each of the two quantiles is computed from the difference between the target value and the corresponding bound of the prediction; the losses are averaged over all samples for each quantile, and the two averages are added to obtain the final loss. Finally, the loss is back-propagated through the network to update the network parameters and improve the accuracy of line-of-sight estimation.

The loss function is shown as follows:

L₁＝1/N∑(q₁*max(t-(o-σ),0)+(1-q₁)*max((o-σ)-t,0))

L₂＝1/N∑(q₉*max(t-(o+σ),0)+(1-q₉)*max((o+σ)-t,0))

L＝L₁+L₂

where L₁ denotes the average loss at the 10% quantile, L₂ denotes the average loss at the 90% quantile, L denotes the final loss, which is the quantity minimized during training, and N denotes the total number of samples; o denotes the predicted value output by the model; t denotes the true value, that is, the actual target value being predicted; σ denotes the predicted uncertainty, which gives the possible offset range around the predicted value; q₁ and q₉ are the two quantiles, 0.1 and 0.9 respectively, which define the interval of the prediction error.
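The loss above can be sketched in a few lines of Python. The function name `pinball_loss` and the list-based interface are assumptions made for illustration; the arithmetic follows the formulas for L₁, L₂, and L term by term for scalar targets.

```python
def pinball_loss(targets, preds, sigmas, q_low=0.1, q_high=0.9):
    """Return L = L1 + L2 averaged over N samples.

    L1 scores the lower bound o - sigma against the 10% quantile,
    L2 scores the upper bound o + sigma against the 90% quantile.
    """
    n = len(targets)
    if n == 0 or not (n == len(preds) == len(sigmas)):
        raise ValueError("inputs must be non-empty and of equal length")
    loss_low = 0.0
    loss_high = 0.0
    for t, o, s in zip(targets, preds, sigmas):
        low, high = o - s, o + s
        loss_low += (q_low * max(t - low, 0.0)
                     + (1 - q_low) * max(low - t, 0.0))
        loss_high += (q_high * max(t - high, 0.0)
                      + (1 - q_high) * max(high - t, 0.0))
    return loss_low / n + loss_high / n


inside = pinball_loss([0.6], [0.5], [0.2])   # target inside [0.3, 0.7]
outside = pinball_loss([1.0], [0.5], [0.2])  # target above the interval
```

In this toy case the target inside the predicted interval yields a much smaller loss than the target outside it, which is the behaviour the two-quantile design is meant to produce.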

Compared with the prior art, the invention has the following beneficial effects:

1. The method uses the ResNet-50 model to extract deep features, so that richer line-of-sight features can be mined;

2. The method uses the two fully connected layers F_C1 and F_C2 to perform dimension reduction on the video features, which improves processing efficiency and accuracy;

3. The method uses a GRU for line-of-sight estimation: time-series modeling captures the dynamic changes of the face and realizes continuous line-of-sight estimation, long-term dependencies in the time-series data are captured and used more effectively, and the problems of vanishing or exploding gradients are alleviated;

4. The method introduces nonlinear activation functions after the fully connected layer F_C3 together with the PinBall loss function, and outputs a three-dimensional vector, which further improves the accuracy and stability of line-of-sight estimation and ensures efficiency and reliability in practical applications.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.

FIG. 1 is a network architecture diagram of the GRU-based continuous line-of-sight estimation deep learning method according to the present invention;

FIG. 2 is a flow chart of the method used in an embodiment of the present invention;

FIG. 3 is a schematic diagram of the hierarchy and dimensional changes of the network model used in an embodiment of the present invention;

FIG. 4 is a diagram of the single-eye, double-eye, and full-face line-of-sight estimation networks in an embodiment of the present invention.

Detailed Description

The invention is further explained in the following detailed description with reference to the drawings so that those skilled in the art can more fully understand the invention and can practice it, but the invention is explained below by way of example only and not by way of limitation.

Early line-of-sight estimation methods take a monocular (single-eye) image as input, train a convolutional neural network, and output the two-dimensional coordinates of the line of sight. Subsequently, binocular line-of-sight estimation methods were proposed, which compensate for the deficiencies of the monocular methods by using complementary information from both eyes. However, both kinds of method still have drawbacks, such as the need for additional modules for eye detection and head pose estimation. Full-face line-of-sight estimation methods output the final estimate from a face image alone, and their end-to-end learning strategy can take the global features of the whole face into account, so many modern line-of-sight estimation methods are based on the full-face approach. FIG. 4 shows the single-eye, double-eye, and full-face line-of-sight estimation networks for comparison.

As shown in FIG. 1, the invention uses the ResNet-50 model to extract line-of-sight features, reduces the 1000-dimensional features extracted by the ResNet-50 network to 256 dimensions through two fully connected layers, uses a GRU as the line-of-sight estimation module, and finally, after a further fully connected layer, applies the nonlinear activation functions and the loss function to output a three-dimensional vector. FIG. 2 is a flow chart of the method used in this embodiment; FIG. 3 is a schematic diagram of the hierarchical structure and dimensional changes of the network model of this embodiment.

In step S1, the image feature space dimension is set to d, and the hidden state space dimension of the GRU is set to h, where d=h=256; these two parameters serve as the basis for model training and define the feature transformation space from the input to the output.

The step S2 includes:

F′＝W_L1·F+B_L1

In step S3, it includes:

The hidden state H is the result of a nonlinear transformation of the feature vector F' by the fully connected layer F_C1. The weight matrix and bias vector of F_C1 are updated during training with the goal of minimizing the loss function of the network.

In step S4, the hidden state H is input to a GRU with h hidden units for time-series modeling to obtain an output vector G, which includes the following steps:

Step S401, selecting an all-zero vector as an initial state;

Z＝sigmoid(W_Z·[H,F′]+B_Z)

R＝sigmoid(W_R·[H,F′]+B_R)

H'＝tanh(W_H'·[R⊙H,F']+B_H')

H＝(1-Z)⊙H+Z⊙H'

The hidden state H is used for the calculation at the next time step or, at the last time step, as the final output vector G of the current sequence.

In step S5, the GRU produces an output sequence, which in this embodiment is converted into a fixed-size feature vector by taking the hidden state of the last time step as the output vector G. The new feature vector G' is obtained by passing G through the fully connected layer F_C2, whose weight matrix and bias vector are learned parameters. The new feature vector G' contains the deep features and the time-series information of the original input face image; the nonlinear transformation of the GRU output improves the expressive capacity of the model.

The step S6 includes:

The step S7 includes:

O_h＝π·tanh(O[0])

O_v＝π/2·tanh(O[1])

The step S8 includes:

σ＝π·sigmoid(O[2])

In step S9, the PinBall loss function is used. It measures the gap between the target value and the upper and lower bounds of the predicted uncertainty interval: if the target value falls outside the predicted interval, the loss increases, and if the target value falls within the predicted interval, the loss decreases. This design enables the model to adjust the penalty dynamically according to the uncertainty of the prediction. The step S9 includes:

The loss function is shown as follows:

L₁＝1/N∑(q₁*max(t-(o-σ),0)+(1-q₁)*max((o-σ)-t,0))

L₂＝1/N∑(q₉*max(t-(o+σ),0)+(1-q₉)*max((o+σ)-t,0))

L＝L₁+L₂

where L₁ denotes the average loss at the 10% quantile, L₂ denotes the average loss at the 90% quantile, L denotes the final loss, which is the quantity minimized during training, and N denotes the total number of samples, so that the above formulas average the per-sample losses; o denotes the predicted value output by the model; t denotes the true value, that is, the actual target value being predicted; σ denotes the predicted uncertainty, which gives the possible offset range around the predicted value; q₁ and q₉ are the two quantiles, 0.1 and 0.9 respectively, which define the interval of the prediction error.
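To make this behaviour concrete, the per-sample loss can be worked out in closed form. The following derivation is a sketch that assumes σ ≥ 0 and uses ℓ (a symbol introduced here) for the contribution of a single sample to L; it substitutes q₁ = 0.1 and q₉ = 0.9 into the two formulas above.

```latex
\ell(t, o, \sigma) =
\begin{cases}
0.1\,\bigl(t-(o-\sigma)\bigr) + 0.1\,\bigl((o+\sigma)-t\bigr) = 0.2\,\sigma,
  & |t-o| \le \sigma,\\[4pt]
0.1\,\bigl(t-(o-\sigma)\bigr) + 0.9\,\bigl(t-(o+\sigma)\bigr) = (t-o) - 0.8\,\sigma,
  & t > o+\sigma,\\[4pt]
0.9\,\bigl((o-\sigma)-t\bigr) + 0.1\,\bigl((o+\sigma)-t\bigr) = (o-t) - 0.8\,\sigma,
  & t < o-\sigma.
\end{cases}
```

Inside the interval the loss depends only on the interval width, so narrower intervals are rewarded; outside it the loss grows linearly with the miss distance |t − o|, and widening σ reduces the loss until the target is covered.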

The effectiveness of the present invention is verified by simulation experiments as follows.

The classical line-of-sight estimation datasets Gaze360 and MPIIFaceGaze are first rectified; the aim of this data preprocessing is to eliminate factors such as the environment and to simplify the gaze regression problem.

The Gaze360 dataset consists of video data collected from 238 subjects in the real world. The dataset is large and contains a large number of video frames, so its time-series information can be fully exploited in this embodiment. In this embodiment, the 84902 images of the train split are used as the training set, and the 11318 images of the val split are used as the test set.

The MPIIFaceGaze dataset contains a total of 45000 images of 15 subjects; this embodiment uses the 3000 images of subject P00 as the test set and the remaining 42000 images as the training set.

The specific steps for processing MPIIFaceGaze datasets are as follows:

Step 1, defining and acquiring the necessary file paths, including the input dataset path, the sample list path, and the output path;

Step 2, acquiring all persons in the sample list and processing each person in turn: first reading the camera matrix and annotation information of the person, creating an output file for storing label information, and creating folders for storing the face, left-eye, and right-eye images;

Step 3, traversing all images of the person and processing each image as follows:

Step 3-1, reading the image and its annotation information, and normalizing the image using the face center, gaze target, head rotation vector, image size, and camera parameters in the annotation information to obtain a normalized face image;

Step 3-2, cropping the left eye and the right eye separately and performing histogram equalization, and obtaining the normalized 3D gaze point, 3D head orientation, face center point, rotation matrix, and scale matrix. If the image is annotated as a right-eye sample, the face image, left-eye image, and right-eye image are flipped, the 3D gaze point and the 3D head orientation are flipped, and the x coordinate of the face center point is negated; the 3D gaze point and the 3D head orientation are then converted into 2D;

Step 3-3, storing the processed face image, left-eye image, right-eye image, and all annotation information in the designated files, and closing the label output file after all images have been processed.
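The flipping and 3D-to-2D conversion in step 3-2 can be sketched as below. This is a hedged illustration, not the preprocessing code used in the experiments: the function names are hypothetical, images are represented as lists of pixel rows, directions are assumed to be stored as 3D vectors, and the yaw/pitch convention (camera looking along -z) is an assumption, since the text does not state one.

```python
import math


def flip_image(rows):
    """Mirror an image horizontally (each row is a list of pixel values)."""
    return [list(reversed(row)) for row in rows]


def flip_direction(vec):
    """Mirror a 3D direction about the vertical plane by negating x."""
    x, y, z = vec
    return (-x, y, z)


def vector_to_angles(vec):
    """Convert a 3D direction into (yaw, pitch) in radians.

    Assumed convention: a direction of (0, 0, -1) maps to
    yaw = pitch = 0.
    """
    x, y, z = vec
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    yaw = math.atan2(-x, -z)
    pitch = math.asin(-y / norm)
    return yaw, pitch


image = [[1, 2, 3], [4, 5, 6]]
gaze_3d = (0.5, 0.0, -0.5)
flipped_image = flip_image(image)
yaw, pitch = vector_to_angles(gaze_3d)
flipped_yaw, flipped_pitch = vector_to_angles(flip_direction(gaze_3d))
```

Under this convention, flipping reverses the sign of the yaw angle and leaves the pitch angle unchanged, which keeps the label consistent with the mirrored image.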

After the datasets are processed, the network model of this embodiment is trained using the preprocessed MPIIFaceGaze and Gaze360 datasets. The training parameters are configured as follows: the batch size is set to 20, the number of epochs to 60, the learning rate to 0.0001, the decay to 1, and the decay step to 5000; PinBall is used as the loss function. Training is then performed with the configured parameters, the datasets, the initialized model, and the loss function, with the following specific steps:

S1: performing forward propagation to obtain the output of the model, and calculating a loss function by using the output of the model and the actual label;

S2: performing back propagation, calculating gradients, updating parameters of the model using an optimizer and adjusting the learning rate;

S3: at the end of each epoch, checking whether the conditions for saving the model are met; if so, saving the parameters of the current model to a specified file.
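The learning-rate settings above can be read as a step-decay schedule. The sketch below assumes that interpretation (the rate is multiplied by `decay` once every `decay_steps` iterations); the function name and this reading of the two parameters are assumptions, as the text does not define them further.

```python
def learning_rate(step, base_lr=0.0001, decay=1.0, decay_steps=5000):
    """Step decay: scale base_lr by `decay` once every `decay_steps` steps."""
    if step < 0:
        raise ValueError("step must be non-negative")
    return base_lr * decay ** (step // decay_steps)


lr_start = learning_rate(0)
lr_late = learning_rate(12000)               # decay = 1 keeps the rate fixed
lr_halved = learning_rate(12000, decay=0.5)  # hypothetical decay of 0.5
```

Under this reading, the configured decay of 1 leaves the learning rate at 0.0001 for the whole run; the last line only shows what a smaller, hypothetical decay factor would do.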

And finally, verifying on the test set by using the trained model.

The current mainstream evaluation metric for line-of-sight estimation is the angular error, that is, the angle between the predicted and true line-of-sight directions; the smaller the metric, the better. The comparison models are the advanced line-of-sight estimation methods Dilated-Net, RT-Gene, and Gaze360. Dilated-Net uses a batch size of 64, 100 epochs, and a learning rate of 0.001; RT-Gene uses a batch size of 64, 40 epochs, and a learning rate of 0.0001; Gaze360 uses a batch size of 80, 100 epochs, and a learning rate of 0.0001. The experimental results are shown in Table 1:

Table 1. Angular error of the network proposed by the invention and of other advanced networks

Method	MPIIFaceGaze	Gaze360
RT-Gene	3.24°	12.16°
Dilated-Net	2.65°	/
Gaze360	2.57°	10.58°
The present invention	2.24°	10.30°

As the experimental data in Table 1 show, the method of the invention achieves the lowest angular error on both datasets, which verifies experimentally that it effectively improves the accuracy of continuous line-of-sight estimation and has strong practical value.
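The angular error reported in Table 1 can be computed from predicted and true angles as sketched below. The function names and the mapping from (yaw, pitch) to a 3D unit vector are assumptions made for illustration; the text defines the metric only as the angle between the predicted and true directions.

```python
import math


def angles_to_vector(yaw, pitch):
    """Convert (yaw, pitch) in radians to a unit 3D direction vector.

    Assumed convention: yaw = pitch = 0 maps to (0, 0, -1).
    """
    return (-math.cos(pitch) * math.sin(yaw),
            -math.sin(pitch),
            -math.cos(pitch) * math.cos(yaw))


def angular_error_deg(pred, truth):
    """Angle in degrees between predicted and true (yaw, pitch) pairs."""
    a = angles_to_vector(*pred)
    b = angles_to_vector(*truth)
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


same = angular_error_deg((0.1, 0.2), (0.1, 0.2))
quarter_turn = angular_error_deg((0.0, 0.0), (math.pi / 2, 0.0))
```

Averaging this value over all test samples gives a single figure in degrees that can be compared across methods.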

The following is an applicable scenario of the embodiment of the present invention:

Line-of-sight estimation has a wide range of application scenarios, one of which is driver fatigue detection. If a driver remains highly concentrated for a long period or is in a tired state, the concentration of the driver's gaze may be affected. For example, during fatigued driving the driver's gaze may wander or the eyes may close frequently, which are important indicators of driver fatigue.

An important indicator of driver fatigue is the driver's gaze state, which the method of the invention is used to predict. Because the network in the method has memory, it can capture the temporal dependence of the driver's gaze state, that is, the influence of the gaze state in the preceding period on the current gaze state. The driver's gaze state can therefore be detected and predicted in real time, so that driver fatigue can be warned of in advance and traffic accidents can be avoided.

First, face images of the driver are captured in real time by a camera while the driver is driving;

Then, the captured face images are input into the network model provided by the invention to predict the gaze state;

Finally, when the model predicts that the driver is likely to be tired, the system alerts the driver to rest by sound or other means, or automatically switches to an automatic driving mode.

While the foregoing is directed to embodiments of the present invention, other and further details of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims

1. The GRU-based continuous vision estimation deep learning method is characterized by comprising the following steps of:

Step S6, mapping the new feature vector G' into a three-dimensional output vector O through the fully connected layer F_C3, wherein the three-dimensional output vector O represents the predicted line-of-sight direction and the predicted uncertainty;

2. The GRU-based continuous line-of-sight estimation deep learning method according to claim 1, wherein in the step S1, the image feature space dimension is set to d, and the hidden state space dimension of the GRU is set to h, where d=h=256.

3. The GRU-based continuous line-of-sight estimation deep learning method according to claim 2, wherein the step S2 includes:

using a pre-trained ResNet-50 model as the backbone convolutional neural network, extracting deep features of the input face image I to obtain a feature vector F; then performing feature dimension reduction through the linear transformation layer to obtain the image feature vector F' according to the following formula:

F′＝W_L1·F+B_L1

4. The GRU-based continuous line-of-sight estimation deep learning method according to claim 3, wherein the step S3 includes:

the formula for obtaining the hidden state H is as follows:

5. The GRU-based continuous line-of-sight estimation deep learning method of claim 4, wherein said step S4 comprises the steps of:

Step S401, selecting an all-zero vector as an initial state;

Z＝sigmoid(W_Z·[H,F′]+B_Z)

R＝sigmoid(W_R·[H,F′]+B_R)

H'＝tanh(W_H'·[R⊙H,F']+B_H')

H＝(1-Z)⊙H+Z⊙H'

6. The GRU-based continuous line-of-sight estimation deep learning method of claim 5, wherein the step S5 includes:

the formula for obtaining the new feature vector G' is as follows:

7. The GRU-based continuous line-of-sight estimation deep learning method of claim 6, wherein the step S6 includes:

8. The GRU-based continuous line-of-sight estimation deep learning method of claim 7, wherein the step S7 includes:

O_h＝π·tanh(O[0])

O_v＝π/2·tanh(O[1])

9. The GRU-based continuous line-of-sight estimation deep learning method of claim 8, wherein the step S8 includes:

σ＝π·sigmoid(O[2])

10. The GRU-based continuous line-of-sight estimation deep learning method of claim 9, wherein the step S9 includes:

evaluating the PinBall loss function at the 10% and 90% quantiles: for each sample, computing the loss at each of the two quantiles from the difference between the target value and the corresponding bound of the prediction, averaging the losses over all samples for each quantile, and adding the two averages to obtain the final loss; finally, back-propagating this final loss through the network to update the network parameters;

The loss function is shown as follows:

L₁＝1/N∑(q₁*max(t-(o-σ),0)+(1-q₁)*max((o-σ)-t,0))

L₂＝1/N∑(q₉*max(t-(o+σ),0)+(1-q₉)*max((o+σ)-t,0))

L＝L₁+L₂