CN117292421B - GRU-based continuous gaze estimation deep learning method

Info

Publication number
CN117292421B
Authority
CN
China
Prior art keywords
gru
vector
sight
output vector
hidden state
Prior art date
Legal status
Active
Application number
CN202311173058.1A
Other languages
Chinese (zh)
Other versions
CN117292421A (en)
Inventor
王可
王进
曹硕裕
Current Assignee
Nantong University
Original Assignee
Nantong University
Priority date
Filing date
Publication date
Application filed by Nantong University
Priority to CN202311173058.1A
Publication of CN117292421A
Application granted
Publication of CN117292421B

Classifications

    • G06V40/168 Human faces: feature extraction; face representation
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/77 Processing image or video features in feature spaces; data integration or data reduction, e.g. PCA, ICA or SOM
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V40/193 Eye characteristics: preprocessing; feature extraction
    • Y02T10/40 Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Ophthalmology & Optometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and relates to a GRU-based continuous gaze estimation deep learning method comprising the following steps: defining the dimensions of the image feature space and of the GRU hidden state space; extracting features from an input face image with a pre-trained ResNet-50 model and reducing their dimensionality; processing the image feature vector to obtain the model's hidden state; feeding the hidden state into the GRU for time-series modeling to generate an output vector; mapping the output vector to a new feature vector; mapping the new feature vector to a three-dimensional output vector; applying a hyperbolic tangent transformation to the first two elements of the three-dimensional output vector; transforming the third element of the three-dimensional output vector with a sigmoid function; and measuring the error between the prediction and the ground truth with a Pinball loss function. By combining the ResNet-50 and GRU models, the invention achieves high accuracy and effectiveness in continuous gaze estimation.

Description

GRU-based continuous gaze estimation deep learning method
Technical Field
The invention belongs to the field of computer vision and particularly relates to a GRU-based continuous gaze estimation deep learning method.
Background
The goal of gaze estimation is to determine the gaze direction and gaze point of a person in an image or video. Its importance stems from the fact that a person's potential behavior and intent can be inferred from where they are looking. For example, a person glancing down at a watch at a bus stop may have something urgent to attend to. Because gaze direction carries rich information, gaze estimation helps to better understand a person's intent and to predict what they might do next. Gaze estimation therefore has broad application prospects in many fields.
Gaze estimation methods can be broadly divided into model-based and appearance-based approaches. Model-based methods generally rely on dedicated devices, whereas appearance-based methods typically estimate human gaze from simple camera equipment combined with sophisticated deep learning algorithms.
Early gaze estimation methods took a monocular eye image as input, trained a convolutional neural network, and output the two-dimensional gaze coordinates. Binocular gaze estimation methods were later proposed to compensate for the shortcomings of the monocular approach by exploiting the complementary information of both eyes. Both kinds of methods still have drawbacks, however, such as requiring additional modules for eye detection and head pose estimation. Full-face gaze estimation methods therefore emerged: they only require a face image as input to produce the final gaze estimate, and their end-to-end learning strategy can exploit the global features of the whole face. Many modern gaze estimation methods are based on this approach.
Chinese patent application CN114387679A proposes a gaze estimation method based on a recurrent convolutional neural network: the feature extraction part designs a convolutional neural network built on the DenseNet mechanism, and the gaze regression part jointly encodes dynamic gaze features with an LSTM network to regress the gaze angle. The densely connected DenseNet is more parameter-efficient than ResNet, but its heavy computation and high memory consumption on large datasets and complex tasks incur substantial overhead; for gaze estimation on large datasets and complex tasks, ResNet-50 is more computationally efficient than DenseNet. Although LSTM is well suited to long-term dependencies, it has more parameters and a higher computational cost than GRU, so processing a large number of consecutive video frames may increase computational complexity and hurt real-time performance. Compared with LSTM, the GRU model is simpler, with fewer parameters and less computation, and it performs better than LSTM on the gaze regression task.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a GRU-based continuous gaze estimation deep learning method that combines ResNet-50 with a GRU model and achieves high accuracy on the continuous gaze estimation task. The method adopts the following technical scheme:
A GRU-based continuous gaze estimation deep learning method comprises the following steps:
Step S1, defining the image feature space dimension and the hidden state space dimension of the GRU, which set the basic parameters for model training;
Step S2, performing feature extraction on an input face image I using a pre-trained ResNet-50 model to obtain a feature vector F, and performing feature dimension reduction through a linear transformation layer to obtain an image feature vector F′;
Step S3, processing the image feature vector F′ through the fully connected layer FC1 to generate the hidden state H of the model;
Step S4, inputting the hidden state H into the GRU for time-series modeling and generating the output vector G of the GRU;
Step S5, performing feature mapping on the GRU output vector G through the fully connected layer FC2 to obtain a new feature vector G′;
Step S6, mapping the new feature vector G′ into a three-dimensional output vector O through the fully connected layer FC3, wherein O represents the predicted gaze direction and the uncertainty of the gaze prediction; the gaze direction comprises the horizontal and vertical gaze angles, and the uncertainty comprises an angular error;
Step S7, performing a hyperbolic tangent transformation on the first two elements of the three-dimensional output vector O to obtain the predicted gaze direction;
Step S8, transforming the third element of the three-dimensional output vector through a sigmoid function to obtain the uncertainty of the gaze prediction;
Step S9, measuring the error between the prediction and the ground truth with the Pinball loss function, and then back-propagating the error to update the network parameters.
Further, in step S1, the image feature space dimension is set to d, and the hidden-state space dimension of the GRU is set to h, where d=h=256.
Further, step S2 includes:
using the pre-trained ResNet-50 model as the backbone convolutional neural network to extract depth features from the input face image I and obtain a feature vector F; the image feature vector F′ is then obtained by feature dimension reduction through the linear transformation layer as:
F′ = W_L1 · F + B_L1
where W_L1 and B_L1 are the weight matrix and bias vector of the linear transformation layer, respectively.
Further, step S3 includes:
the hidden state H is obtained as:
H = tanh(W_FC1 · F′ + B_FC1)
where W_FC1 and B_FC1 are the weight matrix and bias vector of the fully connected layer FC1, respectively.
Further, step S4 includes the following steps:
Step S401, selecting an all-zero vector as the initial hidden state;
Step S402, obtaining the update gate and the reset gate from the hidden state H and the input feature vector F′ through linear transformations followed by a sigmoid function, computed as:
Z = sigmoid(W_Z · [H, F′] + B_Z)
R = sigmoid(W_R · [H, F′] + B_R)
where Z denotes the update gate of the GRU, R denotes the reset gate of the GRU, and W_Z, W_R, B_Z, B_R are weight and bias parameters learned during training;
Step S403, obtaining the candidate hidden state H′ using the information of the reset gate R, computed as:
H′ = tanh(W_H′ · [R ⊙ H, F′] + B_H′)
where W_H′ and B_H′ are weight and bias parameters learned during training, ⊙ denotes the element-wise product, and tanh is the hyperbolic tangent function;
Step S404, computing the hidden state at the current time step from the update gate, the candidate hidden state H′ and the hidden state of the previous time step, computed as:
H_t = (1 - Z) ⊙ H_{t-1} + Z ⊙ H′
The hidden state H of the last time step serves as the output vector G of the current sequence.
Further, step S5 includes:
the new feature vector G′ is obtained as:
G′ = tanh(W_FC2 · G + B_FC2)
where W_FC2 and B_FC2 are the weight matrix and bias vector of the fully connected layer FC2, respectively; the new feature vector G′ contains the depth features and the time-series information of the original input face image.
Further, step S6 includes:
the new feature vector G′ is mapped to the three-dimensional output vector O through the fully connected layer FC3 as:
O = W_FC3 · G′ + B_FC3
where W_FC3 and B_FC3 are the weight matrix and bias vector of the fully connected layer FC3, respectively.
Further, step S7 includes:
the first two elements of the output vector O are passed through a hyperbolic tangent transformation to obtain the horizontal angle O_h and the vertical angle O_v of the gaze, computed as:
O_h = π · tanh(O[0])
O_v = (π/2) · tanh(O[1])
where tanh is the hyperbolic tangent function and O[0], O[1] denote the first and second elements of the output vector O; after the hyperbolic tangent transformation, the predicted gaze angles are limited to the ranges [-π, π] and [-π/2, π/2], which correspond to the actual range of gaze angles.
Further, step S8 includes:
the third element of the output vector O is transformed by a sigmoid function and multiplied by π to obtain the uncertainty σ of the gaze prediction:
σ = π · sigmoid(O[2])
where σ lies in the range (0, π) and O[2] denotes the third element of the output vector O.
Further, step S9 includes:
based on the Pinball loss, the differences between the target value and the predicted 10% and 90% quantiles are used to compute the loss of each sample at the two quantiles; the average losses at the two quantiles are then computed and added to obtain the final loss; finally, this loss is back-propagated through the network to update the network parameters and improve the accuracy of gaze estimation.
The loss function is given by:
L_1 = (1/N) ∑ ( q_1 · max(t - (o - σ), 0) + (1 - q_1) · max((o - σ) - t, 0) )
L_2 = (1/N) ∑ ( q_9 · max(t - (o + σ), 0) + (1 - q_9) · max((o + σ) - t, 0) )
L = L_1 + L_2
where L_1 is the average loss at the 10% quantile, L_2 is the average loss at the 90% quantile, L is the final loss and the quantity minimized during training, and N is the total number of samples; o denotes the predicted value output by the model; t denotes the true target value; σ denotes the predicted uncertainty interval, i.e. the possible offset range around the predicted value; q_1 and q_9 are the two quantiles, set to 0.1 and 0.9 respectively, which define the interval of the prediction error.
Compared with the prior art, the invention has the following beneficial effects:
1. The method extracts depth features with the ResNet-50 model, so richer gaze features can be mined;
2. The method uses the two fully connected layers FC1 and FC2 to reduce the dimensionality of the video features, improving processing efficiency and accuracy;
3. The method uses a GRU for gaze estimation, performing time-series modeling and capturing the dynamic changes of the face to realize continuous gaze estimation; long-term dependencies in the time-series data are captured and exploited more effectively, and the problems of vanishing or exploding gradients can be avoided;
4. The method introduces nonlinear activation functions and the Pinball loss function after the fully connected layer FC3 and outputs three-dimensional features, further improving the accuracy and stability of gaze estimation and ensuring efficiency and reliability in practical applications.
Drawings
The accompanying drawings are included to provide a further understanding of the invention; they are incorporated in and constitute a part of this specification, illustrate the invention and, together with the embodiments, serve to explain it.
FIG. 1 is a network architecture diagram of the GRU-based continuous gaze estimation deep learning method of the present invention;
FIG. 2 is a flow chart of the method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the hierarchy and dimensional changes of the network model used in an embodiment of the present invention;
FIG. 4 is a diagram of single-eye, binocular and full-face gaze estimation networks in an embodiment of the present invention.
Detailed Description
The invention is further explained in the following detailed description with reference to the drawings so that those skilled in the art can more fully understand the invention and can practice it, but the invention is explained below by way of example only and not by way of limitation.
Early gaze estimation methods took a monocular eye image as input, trained a convolutional neural network, and output the two-dimensional gaze coordinates. Binocular gaze estimation methods were later proposed to compensate for the shortcomings of the monocular approach by exploiting the complementary information of both eyes. Both kinds of methods still have drawbacks, however, such as requiring additional modules for eye detection and head pose estimation. A full-face gaze estimation method only requires a face image as input to produce the final gaze estimate, and its end-to-end learning strategy can exploit the global features of the whole face, so many modern gaze estimation methods are based on it. FIG. 4 compares single-eye, binocular and full-face gaze estimation networks.
As shown in FIG. 1, the invention adopts a ResNet-50 model to extract gaze features, reduces the 1000-dimensional features extracted by the ResNet-50 network to 256 dimensions through two fully connected layers, uses a GRU as the gaze estimation module, and finally, after further fully connected layers, introduces nonlinear activation functions and a loss function to output three-dimensional features. FIG. 2 is a flow chart of the method used in this embodiment; FIG. 3 is a schematic diagram of the hierarchical structure and dimensional changes of the network model of this embodiment.
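For illustration only, the pipeline of FIG. 1 can be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the class name, the input shape, the tanh activations after FC1 and FC2, the use of torchvision's pre-trained ResNet-50 (whose 1000-dimensional output serves as the feature vector F), and the recent torchvision weights API are all assumptions of this sketch.

```python
import math
import torch
import torch.nn as nn
from torchvision.models import resnet50

class GazeGRUSketch(nn.Module):
    """Hypothetical sketch of the described pipeline:
    ResNet-50 -> linear reduction (1000 -> 256) -> FC1 -> GRU -> FC2 -> FC3 -> angle/uncertainty heads."""

    def __init__(self, feat_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.backbone = resnet50(weights="IMAGENET1K_V1")  # pre-trained ResNet-50, 1000-d output F
        self.reduce = nn.Linear(1000, feat_dim)            # linear transformation layer: F -> F'
        self.fc1 = nn.Linear(feat_dim, hidden_dim)         # FC1: F' -> H (tanh activation assumed)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)       # FC2: G -> G' (tanh activation assumed)
        self.fc3 = nn.Linear(hidden_dim, 3)                # FC3: G' -> three-dimensional output O

    def forward(self, frames: torch.Tensor):
        # frames: (B, T, 3, 224, 224) sequence of face images (shape assumed)
        b, t = frames.shape[:2]
        f = self.backbone(frames.flatten(0, 1))            # F: (B*T, 1000)
        h = torch.tanh(self.fc1(self.reduce(f)))           # H: (B*T, 256)
        g_seq, _ = self.gru(h.view(b, t, -1))              # zero initial state (step S401)
        g = g_seq[:, -1]                                   # output vector G = last time step
        o = self.fc3(torch.tanh(self.fc2(g)))              # three-dimensional output O
        yaw = math.pi * torch.tanh(o[:, 0])                # horizontal angle in [-pi, pi]
        pitch = (math.pi / 2) * torch.tanh(o[:, 1])        # vertical angle in [-pi/2, pi/2]
        sigma = math.pi * torch.sigmoid(o[:, 2])           # prediction uncertainty in (0, pi)
        return yaw, pitch, sigma
```

Calling the model on a batch of frame sequences returns the two gaze angles and the uncertainty for the last frame of each sequence.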
A GRU-based continuous gaze estimation deep learning method comprises the following steps:
Step S1, defining the image feature space dimension and the hidden state space dimension of the GRU, which set the basic parameters for model training;
Step S2, performing feature extraction on an input face image I using a pre-trained ResNet-50 model to obtain a feature vector F, and performing feature dimension reduction through a linear transformation layer to obtain an image feature vector F′;
Step S3, processing the image feature vector F′ through the fully connected layer FC1 to generate the hidden state H of the model;
Step S4, inputting the hidden state H into the GRU for time-series modeling and generating the output vector G of the GRU;
Step S5, performing feature mapping on the GRU output vector G through the fully connected layer FC2 to obtain a new feature vector G′;
Step S6, mapping the new feature vector G′ into a three-dimensional output vector O through the fully connected layer FC3, wherein O represents the predicted gaze direction and the uncertainty of the gaze prediction; the gaze direction comprises the horizontal and vertical gaze angles, and the uncertainty comprises an angular error;
Step S7, performing a hyperbolic tangent transformation on the first two elements of the three-dimensional output vector O to obtain the predicted gaze direction;
Step S8, transforming the third element of the three-dimensional output vector through a sigmoid function to obtain the uncertainty of the gaze prediction;
Step S9, measuring the error between the prediction and the ground truth with the Pinball loss function, and then back-propagating the error to update the network parameters.
In step S1, the image feature space dimension is set to d and the hidden state space dimension of the GRU is set to h, where d = h = 256; these two parameters form the basis of model training and define the feature transformation space from input to output.
Step S2 includes:
using the pre-trained ResNet-50 model as the backbone convolutional neural network to extract depth features from the input face image I and obtain a feature vector F; the image feature vector F′ is then obtained by feature dimension reduction through the linear transformation layer as:
F′ = W_L1 · F + B_L1
where W_L1 and B_L1 are the weight matrix and bias vector of the linear transformation layer, respectively.
Step S3 includes:
the hidden state H is the result of a nonlinear transformation of the feature vector F′ and is obtained as:
H = tanh(W_FC1 · F′ + B_FC1)
where W_FC1 and B_FC1 are the weight matrix and bias vector of the fully connected layer FC1, respectively; they are updated during training with the goal of minimizing the loss function of the network.
In step S4, the hidden state H is input into a GRU with hidden-state dimension h to perform time-series modeling and obtain the output vector G, which comprises the following steps:
Step S401, selecting an all-zero vector as the initial hidden state;
Step S402, obtaining the update gate and the reset gate from the hidden state H and the input feature vector F′ through linear transformations followed by a sigmoid function, computed as:
Z = sigmoid(W_Z · [H, F′] + B_Z)
R = sigmoid(W_R · [H, F′] + B_R)
where Z denotes the update gate of the GRU, R denotes the reset gate of the GRU, and W_Z, W_R, B_Z, B_R are weight and bias parameters learned during training;
Step S403, obtaining the candidate hidden state H′ using the information of the reset gate R, computed as:
H′ = tanh(W_H′ · [R ⊙ H, F′] + B_H′)
where W_H′ and B_H′ are weight and bias parameters learned during training, ⊙ denotes the element-wise product, and tanh is the hyperbolic tangent function;
Step S404, computing the hidden state at the current time step from the update gate, the candidate hidden state H′ and the hidden state of the previous time step, computed as:
H_t = (1 - Z) ⊙ H_{t-1} + Z ⊙ H′
The hidden state H is used in the computation at the next time step, and the hidden state of the last time step serves as the final output vector G of the current sequence.
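To make steps S401 to S404 concrete, the gate equations above can be written out as an explicit recurrent cell; functionally this is what a standard GRU (and the nn.GRU module in the sketch after FIG. 1) computes, but the version below follows the notation of the steps. The class name, layer sizes and the batch-first sequence layout are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """Recurrent cell spelling out steps S401-S404: update gate Z, reset gate R,
    candidate state H', and the blend H_t = (1 - Z) * H_{t-1} + Z * H'."""

    def __init__(self, input_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.w_z = nn.Linear(hidden_dim + input_dim, hidden_dim)  # W_Z, B_Z
        self.w_r = nn.Linear(hidden_dim + input_dim, hidden_dim)  # W_R, B_R
        self.w_h = nn.Linear(hidden_dim + input_dim, hidden_dim)  # W_H', B_H'

    def forward(self, f_seq: torch.Tensor) -> torch.Tensor:
        # f_seq: (B, T, input_dim) sequence of image feature vectors F'
        b, t, _ = f_seq.shape
        h = f_seq.new_zeros(b, self.w_z.out_features)             # step S401: all-zero initial state
        for step in range(t):
            f = f_seq[:, step]
            hf = torch.cat([h, f], dim=1)                         # concatenation [H, F']
            z = torch.sigmoid(self.w_z(hf))                       # step S402: update gate
            r = torch.sigmoid(self.w_r(hf))                       # step S402: reset gate
            h_cand = torch.tanh(self.w_h(torch.cat([r * h, f], dim=1)))  # step S403: candidate state
            h = (1 - z) * h + z * h_cand                          # step S404: current hidden state
        return h                                                  # last hidden state = output vector G
```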
In step S5, the GRU produces an output sequence; in this embodiment the sequence is converted into a fixed-size feature vector by taking the hidden state of the last time step as the output vector G, and the new feature vector G′ is obtained as:
G′ = tanh(W_FC2 · G + B_FC2)
where W_FC2 and B_FC2 are the weight matrix and bias vector of the fully connected layer FC2, respectively; the new feature vector G′ contains the depth features and the time-series information of the original input face image; the GRU output is thus nonlinearly transformed to increase the expressive capacity of the model.
Step S6 includes:
the new feature vector G′ is mapped to the three-dimensional output vector O through the fully connected layer FC3 as:
O = W_FC3 · G′ + B_FC3
where W_FC3 and B_FC3 are the weight matrix and bias vector of the fully connected layer FC3, respectively.
Step S7 includes:
the first two elements of the output vector O are passed through a hyperbolic tangent transformation to obtain the horizontal angle O_h and the vertical angle O_v of the gaze, computed as:
O_h = π · tanh(O[0])
O_v = (π/2) · tanh(O[1])
where tanh is the hyperbolic tangent function and O[0], O[1] denote the first and second elements of the output vector O; after the hyperbolic tangent transformation, the predicted gaze angles are limited to the ranges [-π, π] and [-π/2, π/2], which correspond to the actual range of gaze angles.
Step S8 includes:
the third element of the output vector O is transformed by a sigmoid function and multiplied by π to obtain the uncertainty σ of the gaze prediction:
σ = π · sigmoid(O[2])
where σ lies in the range (0, π) and O[2] denotes the third element of the output vector O.
In step S9 the Pinball loss function is used. It measures the gap between the prediction and the upper and lower boundaries of the prediction uncertainty interval: if the target value falls outside the predicted interval the loss increases, and if it falls inside the interval the loss decreases. This design lets the model dynamically adjust the penalty according to the uncertainty of the predicted value. Step S9 includes:
based on the Pinball loss, the differences between the target value and the predicted 10% and 90% quantiles are used to compute the loss of each sample at the two quantiles; the average losses at the two quantiles are then computed and added to obtain the final loss; finally, this loss is back-propagated through the network to update the network parameters and improve the accuracy of gaze estimation.
The loss function is given by:
L_1 = (1/N) ∑ ( q_1 · max(t - (o - σ), 0) + (1 - q_1) · max((o - σ) - t, 0) )
L_2 = (1/N) ∑ ( q_9 · max(t - (o + σ), 0) + (1 - q_9) · max((o + σ) - t, 0) )
L = L_1 + L_2
where L_1 is the average loss at the 10% quantile, L_2 is the average loss at the 90% quantile, L is the final loss and the quantity minimized during training, and N is the total number of samples (the formulas average the per-sample losses); o denotes the predicted value output by the model; t denotes the true target value; σ denotes the predicted uncertainty interval, i.e. the possible offset range around the predicted value; q_1 and q_9 are the two quantiles, set to 0.1 and 0.9 respectively, which define the interval of the prediction error.
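A minimal sketch of the loss in step S9, written directly from the three formulas above; the function name and the mean reduction over the batch are assumptions, o, sigma and t are the predicted value, predicted uncertainty and ground-truth value, and the quantiles are fixed at q_1 = 0.1 and q_9 = 0.9 as stated.

```python
import torch

def pinball_loss(o: torch.Tensor, sigma: torch.Tensor, t: torch.Tensor,
                 q1: float = 0.1, q9: float = 0.9) -> torch.Tensor:
    """L = L1 + L2: pinball losses of the lower bound (o - sigma) at the 10% quantile
    and of the upper bound (o + sigma) at the 90% quantile, averaged over the batch."""
    lower, upper = o - sigma, o + sigma
    zero = torch.zeros_like(t)
    l1 = (q1 * torch.maximum(t - lower, zero) + (1 - q1) * torch.maximum(lower - t, zero)).mean()
    l2 = (q9 * torch.maximum(t - upper, zero) + (1 - q9) * torch.maximum(upper - t, zero)).mean()
    return l1 + l2
```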
The effectiveness of the present invention is verified by simulation experiments as follows.
The classical gaze estimation datasets Gaze360 and MPIIFaceGaze are first rectified; the aim of this data preprocessing is to eliminate environmental and other confounding factors and to simplify the gaze regression problem.
The Gaze360 dataset consists of video data collected from 238 subjects in the real world; it is large and contains many video frames, so its time-series information can be fully exploited in this embodiment. Here, 84,902 images from the train split are used as the training set, and 11,318 images from the val split are used as the test set.
The MPIIFaceGaze dataset contains 45,000 images of 15 subjects in total; this embodiment uses the 3,000 images of subject P00 as the test set and the remaining 42,000 images as the training set.
The specific steps for processing MPIIFaceGaze datasets are as follows:
Step 1, defining and acquiring necessary file paths, including an input data set path, a sample list path and an output path;
Step 2, acquiring all subjects in the sample list and processing each subject: first read the subject's camera matrix and annotation information, create an output file for the label information, and create folders for storing the face, left-eye and right-eye images;
Step 3, traversing all images of the subject and processing each image as follows:
Step 3-1, reading the image and its annotation information, and normalizing the image using the face center, gaze target, head rotation vector, image size and camera parameters in the annotation to obtain a normalized face image;
Step 3-2, cropping the left and right eyes and applying histogram equalization to obtain the normalized 3D gaze point, 3D head orientation, face center, rotation matrix and scale matrix; if the sample is annotated for the right eye, the face, left-eye and right-eye images are flipped horizontally, the 3D gaze point and 3D head orientation are mirrored, and the x coordinate of the face center is negated; the 3D gaze point and 3D head orientation are then converted to 2D angles (see the sketch after these steps);
Step 3-3, saving the processed face image, left-eye image, right-eye image and all annotation information to the designated files, and closing the label output file after all images have been processed.
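The conversion of the normalized 3D gaze vector to a 2D angle pair mentioned in step 3-2 is commonly done as in the sketch below. The coordinate convention (x right, y down, z forward) and the resulting formulas are assumptions based on the usual MPIIFaceGaze normalization, not something the text above specifies.

```python
import numpy as np

def gaze_3d_to_2d(gaze_3d: np.ndarray) -> np.ndarray:
    """Convert a 3D gaze direction vector to a (yaw, pitch) pair in radians.
    Convention assumed: x right, y down, z forward; yaw = atan2(-x, -z), pitch = asin(-y)."""
    g = gaze_3d / np.linalg.norm(gaze_3d)
    yaw = np.arctan2(-g[0], -g[2])     # horizontal angle
    pitch = np.arcsin(-g[1])           # vertical angle
    return np.array([yaw, pitch])
```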
After preprocessing, the network model of this embodiment is trained on the preprocessed MPIIFaceGaze and Gaze360 datasets. The training parameters are configured as follows: batch size 20, epochs 60, learning rate 0.0001, decay 1, decay step 5000; the Pinball loss is used as the loss function. Training then proceeds with the configured parameters, datasets, initialized model and loss function in the following steps (a minimal training-loop sketch is given after these steps):
S1: performing forward propagation to obtain the output of the model, and calculating a loss function by using the output of the model and the actual label;
S2: performing back propagation, calculating gradients, updating parameters of the model using an optimizer and adjusting the learning rate;
S3: at the end of each epoch, checking whether the condition for saving the model is met; if so, saving the parameters of the current model to a specified file.
Finally, the trained model is evaluated on the test set.
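A minimal training-loop sketch following steps S1 to S3 above. The optimizer choice (Adam), the StepLR schedule used to realize the decay and decay-step settings, the data loader, the (yaw, pitch) label ordering and the checkpoint condition are assumptions; model and pinball_loss refer to the earlier sketches, not to the patented code.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# model: GazeGRUSketch from the FIG. 1 sketch; pinball_loss: from the step S9 sketch.
def train(model, train_loader, epochs: int = 60, lr: float = 1e-4,
          decay: float = 1.0, decay_step: int = 5000, ckpt_path: str = "model.pt"):
    optimizer = Adam(model.parameters(), lr=lr)
    scheduler = StepLR(optimizer, step_size=decay_step, gamma=decay)  # assumed mapping of decay / decay step
    for epoch in range(epochs):
        for frames, gaze_gt in train_loader:              # gaze_gt: (B, 2) ground-truth (yaw, pitch)
            yaw, pitch, sigma = model(frames)             # S1: forward pass
            pred = torch.stack([yaw, pitch], dim=1)
            loss = pinball_loss(pred, sigma.unsqueeze(1), gaze_gt)  # S1: loss from output and labels
            optimizer.zero_grad()
            loss.backward()                               # S2: back-propagation
            optimizer.step()                              # S2: parameter update
            scheduler.step()                              # S2: learning-rate adjustment
        torch.save(model.state_dict(), ckpt_path)         # S3: save checkpoint at end of epoch (condition assumed)
```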
The mainstream evaluation metric for gaze estimation is the angular error, i.e. the angle between the predicted and ground-truth gaze directions; the smaller this metric, the better the performance. The comparison models are the advanced gaze estimation methods Dilated-Net, RT-Gene and Gaze360. Dilated-Net is trained with batch size 64, 100 epochs and learning rate 0.001; RT-Gene with batch size 64, 40 epochs and learning rate 0.0001; Gaze360 with batch size 80, 100 epochs and learning rate 0.0001. The experimental results are shown in Table 1:
Table 1 Experimental results of the proposed network and other advanced networks (angular error)
Method            MPIIFaceGaze    Gaze360
RT-Gene           3.24°           12.16°
Dilated-Net       2.65°           /
Gaze360           2.57°           10.58°
Proposed method   2.24°           10.30°
As shown by the experimental data in Table 1, the proposed method effectively improves the accuracy of continuous gaze estimation and has strong practical value.
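For reference, the angular error reported in Table 1 is normally computed as the angle between the predicted and ground-truth 3D gaze vectors; a sketch, assuming the (yaw, pitch) parameterization used in the earlier sketches:

```python
import numpy as np

def angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean angle in degrees between predicted and ground-truth gaze directions,
    both given as (N, 2) arrays of (yaw, pitch) in radians."""
    def to_vec(angles: np.ndarray) -> np.ndarray:
        yaw, pitch = angles[:, 0], angles[:, 1]
        return np.stack([-np.cos(pitch) * np.sin(yaw),   # x
                         -np.sin(pitch),                  # y
                         -np.cos(pitch) * np.cos(yaw)],   # z
                        axis=1)
    a, b = to_vec(pred), to_vec(gt)
    cos_sim = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0))).mean())
```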
The following is an applicable scenario of the embodiment of the present invention:
Gaze estimation has a wide range of application scenarios, one of which is driver fatigue detection. Remaining highly focused for a long time, or driving while tired, affects how well a driver's eyes stay concentrated on the road. For example, during fatigued driving the driver's gaze may wander or the eyes may close frequently, both of which are important indicators of driver fatigue.
An important indicator of driver fatigue is therefore the driver's gaze state, and the method of the present invention is used to predict it. Because the network in the method has memory, it can capture the temporal dependence of the driver's gaze state, i.e. the gaze state of the preceding period influences the current one; the driver's gaze can thus be detected and predicted in real time, fatigue can be warned of in advance, and traffic accidents can be avoided.
Firstly, face images of the driver are captured in real time by a camera while the vehicle is being driven;
then, the captured face images are fed into the network model provided by the invention to predict the driver's gaze state;
finally, when the model predicts that the driver is likely to be fatigued, the system alerts the driver audibly or otherwise to rest, or automatically switches to an autonomous driving mode.
While the foregoing is directed to embodiments of the present invention, other and further details of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (10)

1. A GRU-based continuous line-of-sight estimation deep learning method, characterized by comprising the following steps:
Step S1, defining the image feature space dimension and the hidden state space dimension of the GRU, which set the basic parameters for model training;
Step S2, performing feature extraction on an input face image I using a pre-trained ResNet-50 model to obtain a feature vector F, and performing feature dimension reduction through a linear transformation layer to obtain an image feature vector F′;
Step S3, processing the image feature vector F′ through the fully connected layer FC1 to generate the hidden state H of the model;
Step S4, inputting the hidden state H into the GRU for time-series modeling and generating the output vector G of the GRU;
Step S5, performing feature mapping on the GRU output vector G through the fully connected layer FC2 to obtain a new feature vector G′;
Step S6, mapping the new feature vector G′ into a three-dimensional output vector O through the fully connected layer FC3, wherein O represents the predicted gaze direction and the prediction uncertainty;
Step S7, performing a hyperbolic tangent transformation on the first two elements of the three-dimensional output vector O to obtain the predicted gaze direction;
Step S8, transforming the third element of the three-dimensional output vector through a sigmoid function to obtain the uncertainty of the gaze prediction;
Step S9, measuring the error between the prediction and the ground truth with the Pinball loss function, and then back-propagating the error to update the network parameters.
2. The GRU-based continuous line-of-sight estimation deep learning method according to claim 1, wherein in the step S1, the image feature space dimension is set to d, and the hidden state space dimension of the GRU is set to h, where d=h=256.
3. The GRU-based continuous line-of-sight estimation deep learning method according to claim 2, wherein the step S2 includes:
using the pre-trained ResNet-50 model as the backbone convolutional neural network to extract depth features from the input face image I and obtain a feature vector F; the linear transformation layer then performs feature dimension reduction to obtain the image feature vector F′ according to the following formula:
F′ = W_L1 · F + B_L1
where W_L1 and B_L1 are the weight matrix and bias vector of the linear transformation layer, respectively.
4. The GRU-based continuous line-of-sight estimation deep learning method according to claim 3, wherein the step S3 includes:
the formula for obtaining the hidden state H is as follows:
H = tanh(W_FC1 · F′ + B_FC1)
where W_FC1 and B_FC1 are the weight matrix and bias vector of the fully connected layer FC1, respectively.
5. The GRU-based continuous line-of-sight estimation deep learning method of claim 4, wherein said step S4 comprises the steps of:
Step S401, selecting an all-zero vector as the initial hidden state;
Step S402, obtaining the update gate and the reset gate from the hidden state H and the input feature vector F′ through linear transformations followed by a sigmoid function, computed as:
Z = sigmoid(W_Z · [H, F′] + B_Z)
R = sigmoid(W_R · [H, F′] + B_R)
wherein Z denotes the update gate of the GRU, R denotes the reset gate of the GRU, and W_Z, W_R, B_Z, B_R are weight and bias parameters learned during training;
Step S403, obtaining the candidate hidden state H′ using the information of the reset gate R, computed as:
H′ = tanh(W_H′ · [R ⊙ H, F′] + B_H′)
wherein W_H′ and B_H′ are weight and bias parameters learned during training, ⊙ denotes the element-wise product, and tanh is the hyperbolic tangent function;
Step S404, computing the hidden state at the current time step from the update gate, the candidate hidden state H′ and the hidden state of the previous time step, computed as:
H_t = (1 - Z) ⊙ H_{t-1} + Z ⊙ H′
the hidden state H of the last time step serves as the output vector G of the current sequence.
6. The GRU-based continuous line-of-sight estimation deep learning method of claim 5, wherein the step S5 includes:
the formula for obtaining the new feature vector G′ is as follows:
G′ = tanh(W_FC2 · G + B_FC2)
where W_FC2 and B_FC2 are the weight matrix and bias vector of the fully connected layer FC2, respectively; the new feature vector G′ contains the depth features and the time-series information of the original input face image.
7. The GRU-based continuous line-of-sight estimation deep learning method of claim 6, wherein the step S6 includes:
the new feature vector G′ is mapped to the three-dimensional output vector O through the fully connected layer FC3 according to the following formula:
O = W_FC3 · G′ + B_FC3
where W_FC3 and B_FC3 are the weight matrix and bias vector of the fully connected layer FC3, respectively.
8. The GRU-based continuous line-of-sight estimation deep learning method of claim 7, wherein the step S7 includes:
the first two elements of the output vector O are passed through a hyperbolic tangent transformation to obtain the horizontal angle O_h and the vertical angle O_v of the gaze, computed as:
O_h = π · tanh(O[0])
O_v = (π/2) · tanh(O[1])
where tanh is the hyperbolic tangent function and O[0], O[1] denote the first and second elements of the output vector O; after the hyperbolic tangent transformation, the predicted gaze angles are limited to the ranges [-π, π] and [-π/2, π/2], which correspond to the actual range of gaze angles.
9. The GRU-based continuous line-of-sight estimation deep learning method of claim 8, wherein the step S8 includes:
the third element of the output vector O is transformed by a sigmoid function and multiplied by π to obtain the uncertainty σ of the gaze prediction:
σ = π · sigmoid(O[2])
where σ lies in the range (0, π) and O[2] denotes the third element of the output vector O.
10. The GRU-based continuous line-of-sight estimation deep learning method of claim 9, wherein the step S9 includes:
calculating the loss of each sample at the two quantiles based on the differences between the target value of the Pinball loss and the 10% and 90% quantiles of the predicted value; then computing the average losses at the two quantiles and adding them to obtain the final loss; finally, back-propagating this final loss through the network to update the network parameters;
the loss function is given by:
L_1 = (1/N) ∑ ( q_1 · max(t - (o - σ), 0) + (1 - q_1) · max((o - σ) - t, 0) )
L_2 = (1/N) ∑ ( q_9 · max(t - (o + σ), 0) + (1 - q_9) · max((o + σ) - t, 0) )
L = L_1 + L_2
where L_1 is the average loss at the 10% quantile, L_2 is the average loss at the 90% quantile, L is the final loss and the quantity minimized during training, and N is the total number of samples; o denotes the predicted value output by the model; t denotes the true target value; σ denotes the predicted uncertainty interval, i.e. the possible offset range around the predicted value; q_1 and q_9 are the two quantiles, set to 0.1 and 0.9 respectively, which define the interval of the prediction error.
CN202311173058.1A 2023-09-12 2023-09-12 GRU-based continuous gaze estimation deep learning method Active CN117292421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311173058.1A CN117292421B (en) 2023-09-12 2023-09-12 GRU-based continuous vision estimation deep learning method

Publications (2)

Publication Number Publication Date
CN117292421A CN117292421A (en) 2023-12-26
CN117292421B true CN117292421B (en) 2024-05-28

Family

ID=89238146

Country Status (1)

Country Link
CN (1) CN117292421B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108621159A (en) * 2018-04-28 2018-10-09 首都师范大学 Robot dynamics modeling method based on deep learning
CN114444813A (en) * 2022-02-18 2022-05-06 中南大学 Traffic flow prediction method based on deep learning
WO2023159336A1 (en) * 2022-02-22 2023-08-31 大连理工大学 Deep autoregressive network based prediction method for stalling and surging of axial-flow compressor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant