CN117292421B - GRU-based continuous vision estimation deep learning method - Google Patents
- Publication number
- CN117292421B (application CN202311173058.1A)
- Authority
- CN
- China
- Prior art keywords
- gru
- vector
- sight
- output vector
- hidden state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/18—Eye characteristics, e.g. of the iris
- G06V40/193—Preprocessing; Feature extraction
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Human Computer Interaction (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Ophthalmology & Optometry (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the field of computer vision and relates in particular to a GRU-based deep learning method for continuous gaze estimation, comprising the following steps: defining the image feature space dimension and the hidden state space dimension of the GRU; performing feature extraction and feature dimension reduction on an input face image using a pre-trained ResNet-50 model; processing the image feature vector to obtain the hidden state of the model; inputting the hidden state into the GRU for time-series modeling to generate an output vector; performing feature mapping on the output vector to obtain a new feature vector; mapping the new feature vector to a three-dimensional output vector; applying a hyperbolic tangent transformation to the first two elements of the three-dimensional output vector; transforming the third element of the three-dimensional output vector with a sigmoid function; and measuring the error between the predicted result and the true value with a pinball loss function. By combining a ResNet-50 model with a GRU model, the invention achieves high accuracy and effectiveness in the continuous gaze estimation task.
Description
Technical Field
The invention belongs to the field of computer vision and relates in particular to a GRU-based deep learning method for continuous gaze estimation.
Background
The goal of gaze estimation is to determine the gaze direction and gaze point of a person in an image or video. Its importance stems from the fact that a person's potential behavior and intent can be inferred from where that person is looking. For example, a person at a bus stop who looks down at a watch may have something urgent to deal with. Because gaze direction carries rich information, gaze estimation helps in understanding a person's intent and predicting what they might do next. Gaze estimation therefore has broad application prospects in many fields.
Gaze estimation methods are generally classified as model-based or appearance-based. Model-based methods generally require dedicated devices, whereas appearance-based methods typically estimate human gaze using an ordinary camera and complex deep learning algorithms.
Early gaze estimation methods take a monocular (single-eye) image as input, train a convolutional neural network, and output the two-dimensional coordinates of the gaze. Binocular gaze estimation methods were proposed subsequently, which compensate for the deficiency of the monocular methods by using complementary information from both eyes. Both kinds of method still have drawbacks, however, such as requiring additional modules for eye detection and head pose estimation. Full-face gaze estimation methods appeared later: they output the final gaze estimate from a face image alone, and their end-to-end learning strategy can take the global features of the whole face into account. Many modern gaze estimation methods build on this approach.
Chinese patent application CN114387679A proposes a gaze estimation method based on a recurrent convolutional neural network. The feature extraction part of that method uses a convolutional neural network based on the DenseNet architecture, and the gaze regression part jointly encodes dynamic gaze features with an LSTM network to regress the gaze angle. The densely connected structure of DenseNet gives it better parameter efficiency than ResNet, but it requires a large amount of computation and memory when processing large datasets and complex tasks, which incurs considerable overhead. Compared with DenseNet, ResNet-50 is more computationally efficient when processing large datasets and complex tasks for gaze estimation. Although LSTM is well suited to long-term dependencies, it has more parameters and a higher computational cost than GRU, so processing a large number of consecutive video frames with it may increase computational complexity and impair real-time performance. Compared with LSTM, GRU has a simpler model, fewer parameters, and a lower computational cost, and GRU also performs better than LSTM in the gaze regression task.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art and provides a GRU-based deep learning method for continuous gaze estimation that combines ResNet-50 with a GRU model and achieves high accuracy in the continuous gaze estimation task. The method adopts the following technical scheme:
A GRU-based deep learning method for continuous gaze estimation comprises the following steps:
Step S1, defining the image feature space dimension and the hidden state space dimension of the GRU, which set the basic parameters for model training;
Step S2, performing feature extraction on an input face image I using a pre-trained ResNet-50 model to obtain a feature vector F, and performing feature dimension reduction through a linear transformation layer to obtain an image feature vector F';
Step S3, processing the image feature vector F' through the fully connected layer FC1 to generate the hidden state H of the model;
Step S4, inputting the hidden state H into the GRU for time-series modeling to generate the output vector G of the GRU;
Step S5, performing feature mapping on the output vector G of the GRU through the fully connected layer FC2 to obtain a new feature vector G';
Step S6, mapping the new feature vector G' to a three-dimensional output vector O through the fully connected layer FC3, wherein the three-dimensional output vector O represents the predicted gaze direction and the uncertainty of the gaze prediction, the gaze direction comprises the horizontal angle and the vertical angle of the gaze, and the uncertainty of the gaze prediction comprises an angle error;
Step S7, applying a hyperbolic tangent transformation to the first two elements of the three-dimensional output vector O to obtain the predicted gaze direction;
Step S8, transforming the third element of the three-dimensional output vector O with a sigmoid function to obtain the uncertainty of the gaze prediction;
Step S9, measuring the error between the predicted result and the true value using the pinball loss function, and then back-propagating the error to update the network parameters.
Further, in step S1, the image feature space dimension is set to d, and the hidden-state space dimension of the GRU is set to h, where d=h=256.
Further, step S2 includes:
A pre-trained ResNet-50 model is used as the base convolutional neural network to extract deep features from the input face image I, giving the feature vector F; the image feature vector F' is then obtained through the feature dimension reduction of the linear transformation layer by the following formula:
$$F' = W_{L1} \cdot F + B_{L1}$$
where $W_{L1}$ and $B_{L1}$ are the weight matrix and the bias vector of the linear transformation layer, respectively.
Further, step S3 includes:
the hidden state H is obtained from the image feature vector F' through the fully connected layer FC1 (the formula is missing from this text), where the two parameters of the formula are the weight matrix and the bias vector of the fully connected layer FC1, respectively.
Further, step S4 includes the following steps:
Step S401, selecting an all-zero vector as the initial state;
Step S402, obtaining the update gate and the reset gate by applying a sigmoid function to a linear transformation of the hidden state H and the input feature vector F', with the following calculation formulas:
$$Z = \mathrm{sigmoid}(W_Z \cdot [H, F'] + B_Z)$$
$$R = \mathrm{sigmoid}(W_R \cdot [H, F'] + B_R)$$
where Z is the update gate of the GRU, R is the reset gate of the GRU, and $W_Z$, $W_R$, $B_Z$, $B_R$ are the weight and bias parameters learned during training;
Step S403, obtaining the candidate hidden state H' using the information of the reset gate R, with the following calculation formula:
$$H' = \tanh(W_{H'} \cdot [R \odot H, F'] + B_{H'})$$
where $W_{H'}$ and $B_{H'}$ are the weight and bias parameters learned during training, $\odot$ denotes the element-wise product, and tanh is the hyperbolic tangent function;
Step S404, calculating the hidden state H at the current time step from the update gate, the candidate hidden state H', and the hidden state of the previous time step, with the following calculation formula, in which the H on the right-hand side is the hidden state of the previous time step:
$$H = (1 - Z) \odot H + Z \odot H'$$
The hidden state H of the last time step serves as the output vector G of the current sequence.
Further, step S5 includes:
the new feature vector G' is obtained from the output vector G through the fully connected layer FC2 (the formula is missing from this text), where the two parameters of the formula are the weight matrix and the bias vector of the fully connected layer FC2, respectively; the new feature vector G' contains the deep features and the time-series information of the original input face image.
Further, step S6 includes:
the new feature vector G' is mapped to the three-dimensional output vector O through the fully connected layer FC3 (the formula is missing from this text), where the two parameters of the formula are the weight matrix and the bias vector of the fully connected layer FC3, respectively.
Further, step S7 includes:
A hyperbolic tangent transformation is applied to the first two elements of the output vector O to obtain the horizontal angle $O_h$ and the vertical angle $O_v$ of the gaze, with the following calculation formulas:
$$O_h = \pi \cdot \tanh(O[0])$$
$$O_v = \frac{\pi}{2} \cdot \tanh(O[1])$$
where tanh is the hyperbolic tangent function, and O[0] and O[1] are the first and second elements of the output vector O, respectively; after the hyperbolic tangent transformation, the predicted gaze angles are limited to the ranges $[-\pi, \pi]$ and $[-\pi/2, \pi/2]$ respectively, which corresponds to the actual angular range of the gaze.
Further, step S8 includes:
The third element of the output vector O is transformed by a sigmoid function and multiplied by $\pi$ to obtain the uncertainty $\sigma$ of the gaze prediction, with the following formula:
$$\sigma = \pi \cdot \mathrm{sigmoid}(O[2])$$
where $\sigma$ lies between 0 and $\pi$, and O[2] is the third element of the output vector O.
Further, step S9 includes:
The pinball loss is evaluated at the 10% and 90% quantiles. For each sample, the loss at each quantile is computed from the difference between the target value and the corresponding quantile prediction; the losses are averaged over all samples for each quantile, and the two averages are added to give the final loss. Finally, the loss is back-propagated through the network to update the network parameters and improve the accuracy of gaze estimation.
The loss function is as follows:
$$L_1 = \frac{1}{N} \sum \left( q_1 \cdot \max(t - (o - \sigma), 0) + (1 - q_1) \cdot \max((o - \sigma) - t, 0) \right)$$
$$L_2 = \frac{1}{N} \sum \left( q_9 \cdot \max(t - (o + \sigma), 0) + (1 - q_9) \cdot \max((o + \sigma) - t, 0) \right)$$
$$L = L_1 + L_2$$
where $L_1$ is the average loss at the 10% quantile, $L_2$ is the average loss at the 90% quantile, and $L$ is the final loss, the quantity minimized during training; $N$ is the total number of samples; $o$ is the predicted value output by the model; $t$ is the true value, i.e. the actual target of the prediction; $\sigma$ is the predicted uncertainty, giving the possible offset range around the predicted value; and $q_1$ and $q_9$ are the two quantiles, 0.1 and 0.9 respectively, which define the prediction error interval.
Compared with the prior art, the invention has the following beneficial effects:
1. The method uses the ResNet-50 model to extract deep features, so that richer gaze features can be mined and obtained;
2. The method uses the two fully connected layers FC1 and FC2 to reduce the dimension of the gaze features, which improves processing efficiency and accuracy;
3. The method uses a GRU for gaze estimation and time-series modeling, capturing the dynamic changes of the face and realizing continuous gaze estimation; long-term dependencies in the time-series data are captured and used more effectively, and the problems of vanishing and exploding gradients can be avoided;
4. The method introduces nonlinear activation functions and the pinball loss function after the fully connected layer FC3 and outputs three-dimensional features, which further improves the accuracy and stability of gaze estimation and ensures efficiency and reliability in practical applications.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.
FIG. 1 is a network architecture diagram of the GRU-based deep learning method for continuous gaze estimation according to the present invention;
FIG. 2 is a flow chart of the method used in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the hierarchical structure and dimensional changes of the network model used in an embodiment of the present invention;
FIG. 4 is a diagram of the single-eye, double-eye, and full-face gaze estimation networks in an embodiment of the present invention.
Detailed Description
The invention is further explained in the following detailed description with reference to the drawings so that those skilled in the art can more fully understand the invention and can practice it, but the invention is explained below by way of example only and not by way of limitation.
Early gaze estimation methods take a monocular (single-eye) image as input, train a convolutional neural network, and output the two-dimensional coordinates of the gaze. Binocular gaze estimation methods were proposed subsequently, which compensate for the deficiency of the monocular methods by using complementary information from both eyes. Both kinds of method still have drawbacks, however, such as requiring additional modules for eye detection and head pose estimation. Full-face gaze estimation methods output the final gaze estimate from a face image alone, and their end-to-end learning strategy can take the global features of the whole face into account, so many modern gaze estimation methods build on this approach. FIG. 4 shows the single-eye, double-eye, and full-face gaze estimation networks for comparison.
As shown in FIG. 1, the invention uses a ResNet-50 model to extract gaze features, reduces the 1000-dimensional features extracted by the ResNet-50 network to 256 dimensions through two fully connected layers, uses a GRU as the gaze estimation module, and finally, after a fully connected layer, applies nonlinear activation functions and a loss function to output three-dimensional features. FIG. 2 is a flow chart of the method used in this embodiment; FIG. 3 is a schematic diagram of the hierarchical structure and dimensional changes of the network model of this embodiment.
A GRU-based deep learning method for continuous gaze estimation comprises the following steps:
Step S1, defining the image feature space dimension and the hidden state space dimension of the GRU, which set the basic parameters for model training;
Step S2, performing feature extraction on an input face image I using a pre-trained ResNet-50 model to obtain a feature vector F, and performing feature dimension reduction through a linear transformation layer to obtain an image feature vector F';
Step S3, processing the image feature vector F' through the fully connected layer FC1 to generate the hidden state H of the model;
Step S4, inputting the hidden state H into the GRU for time-series modeling to generate the output vector G of the GRU;
Step S5, performing feature mapping on the output vector G of the GRU through the fully connected layer FC2 to obtain a new feature vector G';
Step S6, mapping the new feature vector G' to a three-dimensional output vector O through the fully connected layer FC3, wherein the three-dimensional output vector O represents the predicted gaze direction and the uncertainty of the gaze prediction, the gaze direction comprises the horizontal angle and the vertical angle of the gaze, and the uncertainty of the gaze prediction comprises an angle error;
Step S7, applying a hyperbolic tangent transformation to the first two elements of the three-dimensional output vector O to obtain the predicted gaze direction;
Step S8, transforming the third element of the three-dimensional output vector O with a sigmoid function to obtain the uncertainty of the gaze prediction;
Step S9, measuring the error between the predicted result and the true value using the pinball loss function, and then back-propagating the error to update the network parameters.
In step S1, the image feature space dimension is set to d and the hidden state space dimension of the GRU is set to h, where d=h=256; these two parameters are the basis for model training and define the feature transformation space from input to output.
The step S2 includes:
A pre-trained ResNet-50 model is used as the base convolutional neural network to extract deep features from the input face image I, giving the feature vector F; the image feature vector F' is then obtained through the feature dimension reduction of the linear transformation layer by the following formula:
$$F' = W_{L1} \cdot F + B_{L1}$$
where $W_{L1}$ and $B_{L1}$ are the weight matrix and the bias vector of the linear transformation layer, respectively.
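The dimension reduction in step S2 can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the 1000-dimensional ResNet-50 output mentioned in this description is mapped directly to d = 256; the random weights stand in for learned parameters, and the helper names `linear` and `init_linear` are hypothetical.

```python
import random

def linear(weight, bias, x):
    """Compute weight @ x + bias for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    """Small random weights and a zero bias (placeholders for trained values)."""
    weight = [[rng.uniform(-0.01, 0.01) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

rng = random.Random(0)
F = [rng.random() for _ in range(1000)]   # stand-in for the ResNet-50 feature vector F
W_L1, B_L1 = init_linear(1000, 256, rng)  # assumed shapes: 1000 inputs, d = 256 outputs
F_prime = linear(W_L1, B_L1, F)           # F' = W_L1 . F + B_L1
print(len(F_prime))  # 256
```

In a real implementation the feature vector F would come from the pre-trained ResNet-50 backbone rather than from random numbers.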
Step S3 includes:
The hidden state H is the result of a nonlinear transformation of the feature vector F' through the fully connected layer FC1 (the formula is missing from this text), where the two parameters of the formula are the weight matrix and the bias vector of the fully connected layer FC1, respectively; both are updated during training with the goal of minimizing the loss function of the network.
In step S4, the hidden state H is input to a GRU with h hidden units for time-series modeling to obtain the output vector G, which includes the following steps:
Step S401, selecting an all-zero vector as the initial state;
Step S402, obtaining the update gate and the reset gate by applying a sigmoid function to a linear transformation of the hidden state H and the input feature vector F', with the following calculation formulas:
$$Z = \mathrm{sigmoid}(W_Z \cdot [H, F'] + B_Z)$$
$$R = \mathrm{sigmoid}(W_R \cdot [H, F'] + B_R)$$
where Z is the update gate of the GRU, R is the reset gate of the GRU, and $W_Z$, $W_R$, $B_Z$, $B_R$ are the weight and bias parameters learned during training;
Step S403, obtaining the candidate hidden state H' using the information of the reset gate R, with the following calculation formula:
$$H' = \tanh(W_{H'} \cdot [R \odot H, F'] + B_{H'})$$
where $W_{H'}$ and $B_{H'}$ are the weight and bias parameters learned during training, $\odot$ denotes the element-wise product, and tanh is the hyperbolic tangent function;
Step S404, calculating the hidden state H at the current time step from the update gate, the candidate hidden state H', and the hidden state of the previous time step, with the following calculation formula, in which the H on the right-hand side is the hidden state of the previous time step:
$$H = (1 - Z) \odot H + Z \odot H'$$
The hidden state H is used for the calculation at the next time step or, at the last time step, as the final output vector G of the current sequence.
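Steps S401 to S404 can be sketched in plain Python as follows. This is an illustrative sketch, not the patented implementation: the dimensions (4 instead of 256), the sequence length of 7 frames, the random parameters, and the names `gru_step`, `affine` and `init_params` are all assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weight, bias, x):
    """Compute weight @ x + bias for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gru_step(h_prev, x, params):
    """One GRU update following steps S402 to S404."""
    hx = h_prev + x  # list concatenation, i.e. [H, F']
    z = [sigmoid(a) for a in affine(params["W_Z"], params["B_Z"], hx)]  # update gate Z
    r = [sigmoid(a) for a in affine(params["W_R"], params["B_R"], hx)]  # reset gate R
    rhx = [ri * hi for ri, hi in zip(r, h_prev)] + x                    # [R ⊙ H, F']
    h_cand = [math.tanh(a) for a in affine(params["W_H"], params["B_H"], rhx)]
    # H = (1 - Z) ⊙ H_prev + Z ⊙ H'
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]

def init_params(h_dim, x_dim, rng):
    """Random weights and zero biases standing in for trained parameters."""
    params = {}
    for name in ("Z", "R", "H"):
        params["W_" + name] = [[rng.uniform(-0.5, 0.5) for _ in range(h_dim + x_dim)]
                               for _ in range(h_dim)]
        params["B_" + name] = [0.0] * h_dim
    return params

rng = random.Random(0)
h_dim, x_dim, steps = 4, 4, 7            # toy sizes; the embodiment uses d = h = 256
params = init_params(h_dim, x_dim, rng)
frames = [[rng.uniform(-1, 1) for _ in range(x_dim)] for _ in range(steps)]

h = [0.0] * h_dim                        # S401: all-zero initial state
for x in frames:
    h = gru_step(h, x, params)
G = h                                    # hidden state of the last time step
```

Because each update is a gate-weighted mix of the previous state and a tanh output, every element of G stays strictly between -1 and 1 when the initial state is zero.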
In step S5, the GRU produces an output sequence. In this embodiment the output sequence is converted into a fixed-size feature vector by taking the hidden state of the last time step as the output vector G. The new feature vector G' is then obtained from G through the fully connected layer FC2 (the formula is missing from this text), where the two parameters of the formula are the weight matrix and the bias vector of the fully connected layer FC2, respectively; the new feature vector G' contains the deep features and the time-series information of the original input face image; applying a nonlinear transformation to the GRU output improves the expressive capacity of the model.
Step S6 includes:
The new feature vector G' is mapped to the three-dimensional output vector O through the fully connected layer FC3 (the formula is missing from this text), where the two parameters of the formula are the weight matrix and the bias vector of the fully connected layer FC3, respectively.
Step S7 includes:
A hyperbolic tangent transformation is applied to the first two elements of the output vector O to obtain the horizontal angle $O_h$ and the vertical angle $O_v$ of the gaze, with the following calculation formulas:
$$O_h = \pi \cdot \tanh(O[0])$$
$$O_v = \frac{\pi}{2} \cdot \tanh(O[1])$$
where tanh is the hyperbolic tangent function, and O[0] and O[1] are the first and second elements of the output vector O, respectively; after the hyperbolic tangent transformation, the predicted gaze angles are limited to the ranges $[-\pi, \pi]$ and $[-\pi/2, \pi/2]$ respectively, which corresponds to the actual angular range of the gaze.
Step S8 includes:
The third element of the output vector O is transformed by a sigmoid function and multiplied by $\pi$ to obtain the uncertainty $\sigma$ of the gaze prediction, with the following formula:
$$\sigma = \pi \cdot \mathrm{sigmoid}(O[2])$$
where $\sigma$ lies between 0 and $\pi$, and O[2] is the third element of the output vector O.
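The output transformations of steps S7 and S8 can be written as a short Python sketch. The function name `decode_output` is hypothetical; the three formulas are the ones given above.

```python
import math

def decode_output(o):
    """Map the raw three-element output O to (O_h, O_v, sigma) as in steps S7 and S8."""
    o_h = math.pi * math.tanh(o[0])            # horizontal angle, within [-pi, pi]
    o_v = (math.pi / 2) * math.tanh(o[1])      # vertical angle, within [-pi/2, pi/2]
    sigma = math.pi / (1.0 + math.exp(-o[2]))  # uncertainty: pi * sigmoid(O[2])
    return o_h, o_v, sigma

o_h, o_v, sigma = decode_output([0.3, -1.2, 0.0])
```

A raw output of zero maps to a zero angle for both directions and to an uncertainty of pi/2, the midpoint of the sigmoid range.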
Step S9 uses the pinball loss function, which measures the gap between the target value and the upper and lower boundaries of the predicted uncertainty interval: the loss increases when the target value falls outside the predicted interval and decreases when it falls within it. This design lets the model adjust the loss dynamically according to the uncertainty of the predicted value. Step S9 includes:
The pinball loss is evaluated at the 10% and 90% quantiles. For each sample, the loss at each quantile is computed from the difference between the target value and the corresponding quantile prediction; the losses are averaged over all samples for each quantile, and the two averages are added to give the final loss. Finally, the loss is back-propagated through the network to update the network parameters and improve the accuracy of gaze estimation.
The loss function is as follows:
$$L_1 = \frac{1}{N} \sum \left( q_1 \cdot \max(t - (o - \sigma), 0) + (1 - q_1) \cdot \max((o - \sigma) - t, 0) \right)$$
$$L_2 = \frac{1}{N} \sum \left( q_9 \cdot \max(t - (o + \sigma), 0) + (1 - q_9) \cdot \max((o + \sigma) - t, 0) \right)$$
$$L = L_1 + L_2$$
where $L_1$ is the average loss at the 10% quantile, $L_2$ is the average loss at the 90% quantile, and $L$ is the final loss, the quantity minimized during training; $N$ is the total number of samples, over which the per-sample losses are averaged in the formulas above; $o$ is the predicted value output by the model; $t$ is the true value, i.e. the actual target of the prediction; $\sigma$ is the predicted uncertainty, giving the possible offset range around the predicted value; and $q_1$ and $q_9$ are the two quantiles, 0.1 and 0.9 respectively, which define the prediction error interval.
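The loss formulas above can be sketched in Python for scalar predictions. How the loss is combined across the horizontal and vertical angles is not stated in this text, so the sketch treats each sample as a single scalar; the function names are hypothetical.

```python
def pinball(target, quantile_pred, q):
    """Pinball loss of one sample for quantile level q."""
    diff = target - quantile_pred
    return q * max(diff, 0.0) + (1.0 - q) * max(-diff, 0.0)

def pinball_loss(preds, sigmas, targets, q1=0.1, q9=0.9):
    """L = L1 + L2, with o - sigma as the 10% quantile and o + sigma as the 90% quantile."""
    n = len(targets)
    l1 = sum(pinball(t, o - s, q1) for o, s, t in zip(preds, sigmas, targets)) / n
    l2 = sum(pinball(t, o + s, q9) for o, s, t in zip(preds, sigmas, targets)) / n
    return l1 + l2

inside = pinball_loss([0.0], [0.1], [0.0])   # target within [o - sigma, o + sigma]
outside = pinball_loss([0.0], [0.1], [0.5])  # target above the predicted interval
```

With these numbers the loss is 0.02 when the target lies inside the interval and 0.42 when it lies outside, which matches the behavior described for step S9.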
The effectiveness of the present invention is verified by simulation experiments as follows.
The classical Gaze360 dataset and MPIIFaceGaze dataset of video estimation are subjected to data rectification, and the aim is to eliminate factors such as environment and the like by a data preprocessing method and simplify the fixation regression problem.
The Gaze360 dataset is video data collected from 238 subjects in the real world, and the dataset is large in size and contains a large number of video frames, so that the time series information of the dataset can be fully utilized in the present embodiment. In this embodiment, 84902 pictures of the train group of the dataset are used as the test set, and 11318 pictures of the val group of the dataset are used as the test set.
The MPIIFaceGaze dataset contains a total of 45000 images of 15 subjects. The present embodiment uses the 3000 images of subject P00 as the test set and the remaining 42000 images as the training set.
The specific steps for processing MPIIFaceGaze datasets are as follows:
Step 1, defining and acquiring necessary file paths, including an input data set path, a sample list path and an output path;
Step 2, acquiring all persons in the sample list and processing each person in turn; for each person, first reading the camera matrix and annotation information of the person, creating an output file for storing label information, and creating folders for storing the face, left-eye and right-eye images;
Step 3, traversing all the images of the person, and processing each image as follows:
Step 3-1, reading the image and its annotation information, and normalizing the image by using the face center, the gaze target, the head rotation vector, the image size and the camera parameters in the annotation information to obtain a normalized face image;
Step 3-2, cropping the left eye and the right eye respectively and performing histogram equalization, and obtaining the normalized 3D gaze point, 3D head orientation, face center point, rotation matrix and scale matrix. If the sample is a right-eye sample, the face image, the left-eye image and the right-eye image are flipped, the 3D gaze point and the 3D head orientation are flipped, the x coordinate of the face center point is negated, and the 3D gaze point and the 3D head orientation are converted into 2D;
Step 3-3, storing the processed face image, left-eye image, right-eye image and all annotation information into the designated files, and closing the output file of the label information after all images have been processed.
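The flipping and 3D-to-2D conversion in Step 3-2 can be illustrated with a small sketch. The angle convention used here (yaw = atan2(-x, -z), pitch = asin(-y), with the camera looking along -z) is an assumption that is common in gaze-estimation preprocessing code; the patent does not state which convention it uses, and the function names are hypothetical.

```python
import math


def flip_image(rows):
    """Mirror an image, given as a list of pixel rows, left to right."""
    return [list(reversed(row)) for row in rows]


def flip_gaze(gaze3d):
    """Mirror a 3D direction horizontally by negating its x component."""
    x, y, z = gaze3d
    return (-x, y, z)


def gaze_to_2d(gaze3d):
    """Convert a 3D direction to (yaw, pitch) in radians under the assumed convention."""
    x, y, z = gaze3d
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    yaw = math.atan2(-x, -z)
    pitch = math.asin(-y)
    return yaw, pitch
```

Under this convention, flipping a sample changes the sign of the yaw angle and leaves the pitch angle unchanged, which is why the image and its labels have to be flipped together.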
After the datasets are processed, the network model of the present embodiment is trained using the preprocessed MPIIFaceGaze and Gaze360 datasets. The training parameters are configured as follows: the batch size is set to 20, the number of epochs to 60, the learning rate to 0.0001, the decay to 1 and the decay step to 5000; in addition, PinBall is used as the loss function. Training then proceeds with the configured parameters, the datasets, the initialized model and the loss function, in the following specific steps:
S1: performing forward propagation to obtain the output of the model, and calculating a loss function by using the output of the model and the actual label;
S2: performing back propagation, calculating gradients, updating parameters of the model using an optimizer and adjusting the learning rate;
S3: at the end of each epoch, checking whether the conditions for saving the model are met; if so, saving the parameters of the current model into a specified file.
Finally, the trained model is verified on the test set.
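To make the decay settings above concrete, the following sketch assumes that "decay" and "decay step" describe a step-wise multiplicative schedule, in which the learning rate is multiplied by the decay factor once every decay-step iterations. This interpretation and the function name `learning_rate` are assumptions; the patent only lists the values.

```python
def learning_rate(step, base_lr=0.0001, decay=1.0, decay_steps=5000):
    """Learning rate at a given training step under an assumed step-decay schedule."""
    return base_lr * decay ** (step // decay_steps)
```

Under this reading, a decay of 1 keeps the learning rate fixed at 0.0001 for the whole run, whereas a hypothetical decay of 0.5 would halve it every 5000 steps.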
The mainstream evaluation metric for line-of-sight estimation is the angular error, i.e. the angle between the predicted and the true line-of-sight direction; the smaller the metric, the better the result. The comparison models are the advanced line-of-sight estimation methods Dilated-Net, RT-Gene and Gaze360. Dilated-Net uses a batch size of 64, 100 epochs and a learning rate of 0.001; RT-Gene uses a batch size of 64, 40 epochs and a learning rate of 0.0001; Gaze360 uses a batch size of 80, 100 epochs and a learning rate of 0.0001. The experimental results are shown in Table 1:
Table 1. Experimental results of the network proposed by the present invention and other advanced networks
Method | MPIIFaceGaze | Gaze360 |
---|---|---|
RT-Gene | 3.24° | 12.16° |
Dilated-Net | 2.65° | / |
Gaze360 | 2.57° | 10.58° |
The present invention | 2.24° | 10.30° |
The experimental data in Table 1 show that the method of the invention achieves the lowest angular error on both datasets, which verifies that it effectively improves the accuracy of continuous line-of-sight estimation and has practical value.
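The angular error reported in Table 1 is the angle between the predicted and the true line-of-sight direction. A minimal sketch of that computation for 3D direction vectors follows; the function name is hypothetical, and the patent does not specify how its 2D angle outputs are converted to vectors before comparison.

```python
import math


def angular_error_deg(pred, true):
    """Angle in degrees between two 3D direction vectors (not necessarily unit length)."""
    dot = sum(p * t for p, t in zip(pred, true))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in true))
    # Clamp to [-1, 1] so that rounding errors cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_angle))
```

The reported score of a model is then the mean of this value over all test samples.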
The following is an applicable scenario of the embodiment of the present invention:
Line-of-sight estimation has a wide range of application scenarios, one of which is driver fatigue detection. If a driver remains highly concentrated for a long period or drives in a tired state, the concentration of his or her gaze is affected. For example, during fatigued driving, the driver's gaze may wander or the eyes may close frequently, and both are important indicators of driver fatigue.
An important indicator of driver fatigue is the driver's gaze state, which the method of the present invention is used to predict. The network in the method has memory, so it can capture the temporal dependence of the driver's gaze state, i.e. the influence of the gaze state in the preceding period on the current gaze state. The driver's gaze state can therefore be detected and predicted in real time, so that driver fatigue can be warned of in advance and traffic accidents avoided.
Firstly, capturing face images of a driver in real time through a camera when the driver drives a vehicle;
then, inputting the captured face images into the network model provided by the invention to predict the gaze state;
Finally, when the model predicts that the driver is likely to be in a tired state, the system audibly or otherwise alerts the driver to rest, or automatically switches to an automatic driving mode.
While the foregoing is directed to embodiments of the present invention, other and further details of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
Claims (10)
1. The GRU-based continuous vision estimation deep learning method is characterized by comprising the following steps of:
Step S1, defining the image feature space dimension and the hidden state space dimension of the GRU, for setting the basic parameters of model training;
Step S2, performing feature extraction on an input face image I by using a pre-trained ResNet-50 model to obtain a feature vector F, and performing feature dimension reduction through a linear transformation layer to obtain an image feature vector F';
Step S3, processing the image feature vector F' through the fully connected layer FC1 to generate a hidden state H of the model;
Step S4, inputting the hidden state H into the GRU for time series modeling, and generating an output vector G of the GRU;
Step S5, performing feature mapping on the output vector G of the GRU through the fully connected layer FC2 to obtain a new feature vector G';
Step S6, mapping the new feature vector G' into a three-dimensional output vector O through the fully connected layer FC3, wherein the three-dimensional output vector O represents the predicted sight direction and the predicted uncertainty;
Step S7, performing hyperbolic tangent transformation on the first two elements of the three-dimensional output vector O to obtain the predicted line-of-sight direction;
Step S8, transforming the third element of the three-dimensional output vector O through a sigmoid function to obtain the uncertainty of the sight prediction;
Step S9, measuring the error between the predicted result and the true value by using the PinBall loss function, and then back-propagating the error to update the network parameters.
2. The GRU-based continuous line-of-sight estimation deep learning method according to claim 1, wherein in the step S1, the image feature space dimension is set to d, and the hidden state space dimension of the GRU is set to h, where d=h=256.
3. The GRU-based continuous line-of-sight estimation deep learning method according to claim 2, wherein the step S2 includes:
Using a pre-trained ResNet-50 model as a basic model of a convolutional neural network, and extracting depth features of an input face image I to obtain a feature vector F; then, the linear transformation layer is used for carrying out feature dimension reduction processing to obtain an image feature vector F' with the following formula:
F′=WL1·F+BL1
wherein WL1 and BL1 are the weight matrix and bias vector, respectively, of the linear transformation layer.
4. The GRU-based continuous line-of-sight estimation deep learning method according to claim 3, wherein the step S3 includes:
the hidden state H is obtained by applying the fully connected layer FC1, with its weight matrix and bias vector, to the image feature vector F'.
5. The GRU-based continuous line-of-sight estimation deep learning method of claim 4, wherein said step S4 comprises the steps of:
Step S401, selecting an all-zero vector as an initial state;
Step S402, obtaining an update gate and a reset gate through a sigmoid function and linear transformation of a hidden state H and an input feature vector F', wherein a calculation formula is as follows:
Z=sigmoid(WZ·[H,F′]+BZ)
R=sigmoid(WR·[H,F′]+BR)
Wherein Z represents the update gate of the GRU, R represents the reset gate of the GRU, and WZ, WR, BZ and BR are the weight and bias parameters learned during training;
Step S403, obtaining a candidate hidden state H' by using the information of the reset gate R, where the calculation formula is as follows:
H'=tanh(WH'·[R⊙H,F']+BH')
Wherein WH' and BH' are the weight and bias parameters learned during training, ⊙ denotes the element-wise product, and tanh is the hyperbolic tangent function;
Step S404, calculating the hidden state H at the current time from the update gate Z, the candidate hidden state H', and the hidden state at the previous time, where the calculation formula is as follows:
H=(1-Z)⊙H+Z⊙H'
the hidden state H of the last time step serves as the output vector G of the current sequence.
6. The GRU-based continuous line-of-sight estimation deep learning method of claim 5, wherein the step S5 includes:
the new feature vector G' is obtained by applying the fully connected layer FC2, with its weight matrix and bias vector, to the output vector G of the GRU; the new feature vector G' contains the depth features and time series information of the original input face image.
7. The GRU-based continuous line-of-sight estimation deep learning method of claim 6, wherein the step S6 includes:
The new feature vector G' is mapped to the three-dimensional output vector O by applying the fully connected layer FC3, with its weight matrix and bias vector, to G'.
8. The GRU-based continuous line-of-sight estimation deep learning method of claim 7, wherein the step S7 includes:
The first two elements of the output vector O are subjected to hyperbolic tangent transformation to obtain the horizontal angle Oh and the vertical angle Ov of the line of sight, and the calculation formula is as follows:
Oh=π·tanh(O[0])
Ov=π/2·tanh(O[1])
Wherein tanh is the hyperbolic tangent function, and O[0] and O[1] respectively represent the first and second elements of the output vector O; after the hyperbolic tangent transformation, the predicted horizontal and vertical line-of-sight angles are limited to the ranges [-π, π] and [-π/2, π/2] respectively, which correspond to the actual angle range of the line of sight.
9. The GRU-based continuous line-of-sight estimation deep learning method of claim 8, wherein the step S8 includes:
The third element of the output vector O is transformed by a sigmoid function and multiplied by pi to obtain the uncertainty sigma of the sight prediction, and the formula is as follows:
σ=π·sigmoid(O[2])
where σ is in the range [0, π], and O[2] represents the third element of the output vector O.
10. The GRU-based continuous line-of-sight estimation deep learning method of claim 9, wherein the step S9 includes:
Calculating, with the PinBall loss function, the loss of each sample at the two quantiles from the difference between the target value and the predicted 10% and 90% quantiles, then averaging the losses at each quantile over all samples, and adding the two averages to obtain the final loss; finally, this final loss is back-propagated through the network to update the network parameters;
The loss function is shown as follows:
L1=1/N∑(q1*max(t-(o-σ),0)+(1-q1)*max((o-σ)-t,0))
L2=1/N∑(q9*max(t-(o+σ),0)+(1-q9)*max((o+σ)-t,0))
L=L1+L2
Where L1 represents the average loss for the 10% quantile, L2 represents the average loss for the 90% quantile, L represents the final loss, which is the quantity minimized during training, and N represents the total number of samples; o represents the predicted value output by the model; t represents the true value, i.e. the actual target value; σ represents the predicted uncertainty interval, i.e. the possible offset range around the predicted value; q1 and q9 are the two quantile levels, 0.1 and 0.9 respectively, which define the interval of the prediction error.
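The GRU update of claim 5 and the output transformations of claims 8 and 9 can be sketched in plain Python as follows. This is an illustrative sketch, not the patented implementation: vectors are Python lists, the parameter dictionary keys (`WZ`, `BZ`, `WR`, `BR`, `WH`, `BH`) and all function names are assumptions, and the ResNet-50 backbone and the fully connected layers FC1 to FC3 are left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, x):
    """Return W·x + b, with the matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def gru_step(h, f, p):
    """One GRU update: gates Z and R, candidate state H', then the new hidden state."""
    hf = h + f  # list concatenation [H, F']
    z = [sigmoid(v) for v in affine(p["WZ"], p["BZ"], hf)]  # update gate
    r = [sigmoid(v) for v in affine(p["WR"], p["BR"], hf)]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]                   # R ⊙ H
    cand = [math.tanh(v) for v in affine(p["WH"], p["BH"], rh + f)]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run_gru(features, p, hidden_size):
    """Start from an all-zero state (Step S401) and return the last hidden state as G."""
    h = [0.0] * hidden_size
    for f in features:
        h = gru_step(h, f, p)
    return h


def gaze_head(o):
    """Map the three raw outputs O to (horizontal angle, vertical angle, uncertainty)."""
    o_h = math.pi * math.tanh(o[0])
    o_v = math.pi / 2 * math.tanh(o[1])
    sigma = math.pi * sigmoid(o[2])
    return o_h, o_v, sigma
```

Because the new hidden state is a gate-weighted mix of the previous state and a tanh output, every component of H stays within (-1, 1) when the initial state is all zeros.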
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311173058.1A CN117292421B (en) | 2023-09-12 | 2023-09-12 | GRU-based continuous vision estimation deep learning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117292421A (en) | 2023-12-26 |
CN117292421B (en) | 2024-05-28 |
Family
ID=89238146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311173058.1A Active CN117292421B (en) | 2023-09-12 | 2023-09-12 | GRU-based continuous vision estimation deep learning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117292421B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108621159A (en) * | 2018-04-28 | 2018-10-09 | 首都师范大学 | A kind of Dynamic Modeling in Robotics method based on deep learning |
CN114444813A (en) * | 2022-02-18 | 2022-05-06 | 中南大学 | Traffic flow prediction method based on deep learning |
WO2023159336A1 (en) * | 2022-02-22 | 2023-08-31 | 大连理工大学 | Deep autoregressive network based prediction method for stalling and surging of axial-flow compressor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |