CN116664677A - Sight estimation method based on super-resolution reconstruction - Google Patents
Sight estimation method based on super-resolution reconstruction
- Publication number
- CN116664677A CN116664677A CN202310599847.5A CN202310599847A CN116664677A CN 116664677 A CN116664677 A CN 116664677A CN 202310599847 A CN202310599847 A CN 202310599847A CN 116664677 A CN116664677 A CN 116664677A
- Authority
- CN
- China
- Prior art keywords
- resolution
- super
- face image
- sight
- resolution reconstruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 66
- 238000012549 training Methods 0.000 claims abstract description 25
- 238000006243 chemical reaction Methods 0.000 claims description 28
- 230000004913 activation Effects 0.000 claims description 24
- 239000011159 matrix material Substances 0.000 claims description 16
- 230000008447 perception Effects 0.000 claims description 15
- 230000006870 function Effects 0.000 claims description 9
- 238000010276 construction Methods 0.000 claims description 3
- 230000002401 inhibitory effect Effects 0.000 claims description 3
- 238000002474 experimental method Methods 0.000 abstract description 2
- 238000010586 diagram Methods 0.000 description 9
- 238000012360 testing method Methods 0.000 description 6
- 230000000694 effects Effects 0.000 description 5
- 210000003128 head Anatomy 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000013527 convolutional neural network Methods 0.000 description 4
- 238000011156 evaluation Methods 0.000 description 3
- 230000000750 progressive effect Effects 0.000 description 3
- 230000006399 behavior Effects 0.000 description 2
- 230000000295 complement effect Effects 0.000 description 2
- 230000007547 defect Effects 0.000 description 2
- 238000007781 pre-processing Methods 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000011084 recovery Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 230000008034 disappearance Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000004880 explosion Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000001815 facial effect Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 230000011273 social behavior Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06T3/4076—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution using the original low-resolution images to iteratively correct the high-resolution images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a sight line estimation method based on super-resolution reconstruction, which comprises the following steps: acquiring face images with a camera; constructing a super-resolution reconstruction module and a sight line estimation module; first pre-training the super-resolution reconstruction module and then training the whole network. A face image is input, and the super-resolution reconstruction module recovers the details and definition of the low-resolution face image so as to improve the sight line estimation accuracy; the sight line estimation module then extracts global features, improves the feature expression capability, and increases the weight of sight-line-related areas through a spatial weight mechanism so as to perform accurate sight line estimation. The method designed by the invention has better learning ability, performance and generalization ability. Experiments prove that the method can effectively improve the accuracy of sight estimation in low-resolution scenes.
Description
Technical Field
The invention relates to the field of deep learning and computer vision, in particular to a sight line estimation method based on super-resolution reconstruction.
Background
Gaze estimation aims at determining the direction and point of gaze of a person in an image or video. Since gaze behavior is a fundamental aspect of human social behavior, potential information can be inferred from the objects of gaze estimation.
Early gaze estimation methods took a monocular image as input, trained a convolutional neural network model, and output the two-dimensional coordinate point of the gaze. Binocular gaze estimation methods were then proposed to remedy the fact that the monocular method cannot fully utilize the complementary information of both eyes. However, both the monocular and binocular gaze estimation methods have some drawbacks, such as the need for additional modules to detect the eyes and to estimate head pose. Later, full-face gaze estimation methods were proposed, which obtain the final gaze estimation result from a single input face image; this is an end-to-end learning strategy that can take the global features of the whole face into account, and many mainstream gaze estimation methods are based on it. However, the shallow residual network adopted by such methods has limited learning capability, so the improvement is limited, and the problem that gaze estimation accuracy drops sharply in low-resolution scenarios remains unsolved.
Disclosure of Invention
The purpose of the invention is as follows: to provide a sight line estimation method based on super-resolution reconstruction that solves the problem that sight line estimation accuracy drops significantly in low-resolution scenarios.
In order to achieve the above functions, the present invention designs a sight line estimation method based on super-resolution reconstruction, and performs the following steps S1 to S5 to complete the face sight line estimation of a target object:
step S1: acquiring a preset number of face images by using a camera, and constructing a face image training set;
step S2: the super-resolution reconstruction module is constructed and comprises a preset number of residual blocks and style conversion blocks corresponding to the residual blocks, takes a low-resolution face image as input, and upsamples the features in the face image by adopting a step-by-step upsampling mode based on each residual block to generate a high-resolution face image with a preset size;
step S3: pre-training the super-resolution reconstruction module to obtain a pre-trained super-resolution reconstruction module;
step S4: constructing a sight estimating module, taking a high-resolution face image output by the super-resolution reconstructing module as input, adopting ResNet50 to extract characteristics in the face image, giving weight to each region in the face image based on a space weight mechanism, and inhibiting the weight of other regions by increasing the weight of the relevant region of the sight line in the face image to obtain a sight estimating result aiming at the face image;
step S5: and (3) carrying out overall training on the super-resolution reconstruction module and the sight estimation module by adopting the face image training set constructed in the step (S1) so as to finish the face sight estimation of the target object.
As a preferred technical scheme of the invention: in step S3, the super-resolution reconstruction module is pre-trained by using the face data set FFHQ.
As a preferred technical scheme of the invention: the super-resolution reconstruction module in step S2 has 6 residual blocks sequentially connected in series and performs step-by-step upsampling on the low-resolution face image to extract its features, wherein the input of the first residual block is a learnable constant F_0 of size C×16×16, where C is the channel size; the input of the i-th residual block is the feature F_{i-1} and its output is the feature F_i; the feature F_6 output by the last residual block is converted into an RGB image by a ToRGB convolution layer to give the output high-resolution face image Î^H. The specific formula is as follows:
wherein each stage is composed of a residual convolution block, an up-sampled residual convolution block and a style conversion block;
the style conversion block takes as input the low-resolution face image I_L and the corresponding parsing map I_P; the input pair of the i-th style conversion block is denoted (I_L^i, I_P^i), i.e. the low-resolution face image and its corresponding parsing map fed to the i-th style conversion block;
the style conversion block learns from the input pair, at the corresponding scale, the style conversion parameters y_i = (y_{s,i}, y_{b,i}) of the feature F_i, expressed by the following formula:
where γ denotes a lightweight network, μ and σ are the mean and standard deviation of the features, and y_{s,i} and y_{b,i} are the corresponding scale and bias style conversion parameters, respectively.
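A minimal PyTorch-style sketch of how such a style conversion block could be implemented, assuming an AdaIN-like modulation in which the lightweight network γ predicts a scale y_s and a bias y_b from the input pair (I_L, I_P) resized to the current scale. The layer widths, the 19-channel parsing map and the exact structure of γ are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class StyleConversionBlock(nn.Module):
    """Sketch of a style conversion block (assumed AdaIN-style modulation).

    A lightweight network (gamma) maps the low-resolution face image and its
    parsing map, resized to the current scale, to scale/bias parameters
    y_s and y_b that modulate the normalized feature F_i.
    """

    def __init__(self, feat_channels: int, in_channels: int = 3 + 19):
        super().__init__()
        # gamma: lightweight network predicting 2 * C modulation parameters
        self.gamma = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 2 * feat_channels, kernel_size=3, padding=1),
        )

    def forward(self, feat: torch.Tensor, lr_face: torch.Tensor,
                parsing_map: torch.Tensor) -> torch.Tensor:
        # Resize the input pair (I_L, I_P) to the spatial size of F_i
        pair = torch.cat([lr_face, parsing_map], dim=1)
        pair = nn.functional.interpolate(pair, size=feat.shape[-2:],
                                         mode='bilinear', align_corners=False)
        y = self.gamma(pair)
        y_s, y_b = y.chunk(2, dim=1)          # scale and bias style parameters
        # Normalize F_i with its per-channel mean/std, then modulate
        mu = feat.mean(dim=(2, 3), keepdim=True)
        sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-8
        return y_s * (feat - mu) / sigma + y_b
```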
As a preferred technical scheme of the invention: the super-resolution reconstruction module introduces a semantic perception style loss, calculated by the following formula:
where φ_i denotes the features of the i-th layer of VGG19, M_j denotes the parsing mask with label j, Î^H denotes the high-resolution face image output by the super-resolution reconstruction module, I^H denotes its ground-truth value, and g denotes the Gram matrix of the feature φ_i computed over the parsing mask M_j, calculated as follows:
where ⊙ denotes the element-wise product and ε = 1e-8 is used to avoid division by zero.
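A hedged sketch of the masked Gram-matrix computation behind the semantic perception style loss; the choice of VGG19 layers, the L1 comparison of Gram matrices and the exact mask normalization are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def masked_gram(feat: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Gram matrix of VGG19 features restricted to one semantic region.

    feat: (B, C, H, W) features phi_i from one VGG19 layer.
    mask: (B, 1, H, W) binary parsing mask M_j for semantic label j.
    """
    masked = feat * mask                                # keep only region j
    b, c, h, w = masked.shape
    f = masked.view(b, c, h * w)
    gram = torch.bmm(f, f.transpose(1, 2))              # (B, C, C)
    return gram / (mask.sum(dim=(1, 2, 3)).view(b, 1, 1) + eps)

def semantic_style_loss(feats_sr, feats_hr, masks) -> torch.Tensor:
    """Sum of Gram-matrix differences over VGG layers and semantic regions."""
    loss = 0.0
    for f_sr, f_hr in zip(feats_sr, feats_hr):          # one entry per VGG layer
        for m in masks:                                  # one mask per label j
            m_r = F.interpolate(m, size=f_sr.shape[-2:])
            loss = loss + F.l1_loss(masked_gram(f_sr, m_r), masked_gram(f_hr, m_r))
    return loss
```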
As a preferred technical scheme of the invention: the super-resolution reconstruction module introduces a reconstruction loss, which constrains the high-resolution face image Î^H output by the super-resolution reconstruction module to its ground-truth value I^H; the reconstruction loss is calculated as follows:
where the second term on the right-hand side of the equation is a multi-scale feature-matching loss used to match the features of Î^H and I^H, s is the downsampling factor, D_s(·) denotes the discriminator corresponding to downsampling factor s, and D_s^(k) denotes the k-th layer feature of D_s.
As a preferred technical scheme of the invention: the super-resolution reconstruction module introduces an adversarial loss, with the following formula:
An objective function is constructed based on the multi-scale discriminators and the hinge loss, with the following formula:
Based on the semantic perception style loss, the reconstruction loss and the adversarial loss, the loss function of the super-resolution reconstruction module is constructed as follows:
where λ_SS, λ_rec and λ_adv are the weights of the semantic perception style loss, the reconstruction loss and the adversarial loss, respectively.
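A sketch of how the three loss terms might be combined with the weights λ_SS, λ_rec and λ_adv (the default values follow the training settings given later in the description). The hinge form of the adversarial terms is an assumption consistent with the hinge-loss objective mentioned above, semantic_style_loss refers to the sketch given earlier, and the feature-matching part of the reconstruction loss is omitted for brevity.

```python
import torch.nn.functional as F

def generator_loss(sr, hr, feats_sr, feats_hr, masks, disc_outputs_fake,
                   lambda_ss=100.0, lambda_rec=10.0, lambda_adv=1.0):
    """Total SR loss: semantic style + reconstruction + adversarial terms."""
    l_ss = semantic_style_loss(feats_sr, feats_hr, masks)   # sketched above
    l_rec = F.mse_loss(sr, hr)        # pixel-space part of the reconstruction loss
    # Hinge-style generator term, summed over the multi-scale discriminators
    l_adv = sum(-out.mean() for out in disc_outputs_fake)
    return lambda_ss * l_ss + lambda_rec * l_rec + lambda_adv * l_adv

def discriminator_hinge_loss(disc_outputs_real, disc_outputs_fake):
    """Hinge objective for the multi-scale discriminators."""
    loss = 0.0
    for real, fake in zip(disc_outputs_real, disc_outputs_fake):
        loss = loss + F.relu(1.0 - real).mean() + F.relu(1.0 + fake).mean()
    return loss
```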
As a preferred technical scheme of the invention: the specific method of step S4 is as follows:
step S4.1: the method comprises the steps of adopting a pre-trained ResNet50 as a feature extractor to extract features from a high-resolution face image with a preset size output by a super-resolution reconstruction module and outputting a feature map;
step S4.2: a spatial weight mechanism is adopted, and the weight of each position of the face region in the face image is learned through one branch, so that the weight of the sight-line-related region in the face image is increased and the weight of other regions is suppressed;
step S4.3: the features are classified using the full connection layer, and coordinates (x, y) representing the line of sight are output for representing the line of sight estimation result.
As a preferred technical scheme of the invention: the spatial weight mechanism in step S4.2 comprises three convolution layers with 1×1 filters and rectified linear unit (ReLU) activation; for each convolution layer, an activation tensor U of size N×H×W is input, where N is the number of channels of the feature map and H and W are its height and width; the spatial weight mechanism generates an H×W spatial weight matrix W, and the spatial weight matrix W is multiplied element by element with each channel of the activation tensor U to obtain the weighted activation map of that channel, as follows:
V_C = W ⊙ U_C
where W is the spatial weight matrix, U_C denotes the C-th channel of the activation tensor U, and V_C is the weighted activation map of the C-th channel; the weighted activation maps of all channels are stacked to form the weighted activation tensor V, which is fed into the next convolution layer.
As a preferred technical scheme of the invention: in the training of the sight estimation module, the filter weights of the first two convolution layers of the spatial weight mechanism are randomly initialized from a Gaussian distribution with mean 0 and deviation 0.1, the filter weights of the last convolution layer are randomly initialized from a Gaussian distribution with mean 0 and variance 0.001, and all layers have a constant bias term of 1; the gradients of the activation tensor U and the spatial weight matrix W are expressed as:
where N is the number of channels of the feature map.
As a preferred technical scheme of the invention: the sight estimation module introduces a loss function given by the following formula:
where ξ_gt denotes the true value of the sight estimate and ξ_pred denotes the predicted value of the sight estimate.
The beneficial effects are that: the advantages of the present invention over the prior art include the following.
The sight line estimation method based on super-resolution reconstruction can improve the accuracy of sight estimation in low-resolution scenarios. The mainstream evaluation index for sight estimation is the angular error, i.e. the deviation angle between the predicted value and the true value of the sight estimate; the smaller this index, the better the effect. The experimental training used the classical sight estimation dataset MPIIFaceGaze, and LQ (low-quality) processing was applied to the test set to evaluate the method in a low-resolution scenario. Under the same experimental conditions and compared with other advanced methods, the Dilated-Net method has an average error of 4.86 degrees on this dataset, the Gaze360 method 5.02 degrees, and the RT-Gene method 6.43 degrees, while the PGGA-Net method of the present invention has an average error of 3.96 degrees, which is superior to the other methods. This shows that the method of the invention can improve the accuracy of sight estimation in a low-resolution environment.
Drawings
FIG. 1 (a) is a diagram of a conventional monocular gaze estimation network;
FIG. 1 (b) is a diagram of a conventional binocular vision estimation network;
FIG. 1 (c) is a diagram of a conventional full-face gaze estimation network;
FIG. 2 is a diagram of a conventional full-face gaze estimation network;
FIG. 3 (a) is a flow chart of a prior art method;
fig. 3 (b) is a flowchart of a line-of-sight estimation method based on super-resolution reconstruction according to an embodiment of the present invention;
FIG. 4 is a diagram of a PGGA-Net network skeleton provided in accordance with an embodiment of the present invention;
fig. 5 (a) is a residual block structure diagram provided according to an embodiment of the present invention;
fig. 5 (b) is a block structure diagram of style conversion provided according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Structure diagrams of the existing monocular, binocular and full-face gaze estimation networks are shown in fig. 1 (a) - 1 (c). The monocular method takes a monocular image as input, trains a convolutional neural network model, and outputs the two-dimensional coordinate point of the gaze. The binocular gaze estimation method remedies the defect that the monocular method cannot fully utilize the complementary information of both eyes. However, both the monocular and binocular gaze estimation methods have some drawbacks, such as the need for additional modules to detect the eyes and to estimate head pose. The full-face gaze estimation method overcomes these defects: it obtains the final gaze estimation result from a single input face image, is an end-to-end learning strategy, and can take the global features of the whole face into account.
Referring to fig. 2, in the convolutional neural network fused with a non-attention mechanism, the normalized face image first passes through a convolution module with a 7×7 convolution kernel, is then fed into a three-layer network with 2 residual blocks per layer, and is then fed into a 1×1 convolution layer for the convolution operation to complete facial feature extraction; the extracted features are reshaped into vector form and concatenated with the head pose information, and the gaze estimation result is then obtained through a fully connected layer.
Referring to fig. 3 (a), the existing full-face gaze estimation pipeline uses a shallow residual neural network fused with a non-attention mechanism to perform full-face gaze estimation. This approach can improve network performance without increasing the number of network parameters, but the shallow residual network it adopts has limited learning capability and a limited improvement effect, and it still does not solve the problem that gaze estimation accuracy drops sharply in low-resolution scenarios.
Referring to fig. 3 (b), fig. 4, the line of sight estimation method based on super resolution reconstruction provided by the embodiment of the invention is based on a PGGA-Net network framework, wherein the PGGA-Net network framework mainly comprises two modules, namely a super resolution reconstruction module and a line of sight estimation module, the super resolution reconstruction module is a progressive semantic perception style conversion framework, and details and definition are recovered for a low resolution face image so as to improve the line of sight estimation precision. The following steps S1-S5 are executed to finish the face sight estimation of the target object:
step S1: acquiring a preset number of face images by using a camera, and constructing a face image training set;
step S2: the method comprises the steps of constructing a super-resolution reconstruction module, recovering details and definition of a low-resolution face image by using a progressive semantic perception style conversion frame to improve the sight estimation precision, wherein the super-resolution reconstruction module takes the low-resolution face image as input, and upsamples the features in the face image by adopting a progressive upsampling mode based on each residual block to generate a high-resolution face image with a preset size;
the super-resolution reconstruction module is provided with 6 residual blocks which are sequentially connected in series, and gradually upsamples the face image with low resolution to extract the characteristics thereof, wherein the input of the first residual block is a learning constant F with the size of C multiplied by 16 0 Wherein C is the channel size; the input of the ith residual block is feature F i-1 The output is characteristic F i Last residual block output feature F 6 Converting into RGB image by ToRGB convolution layer, and outputting high resolution face imageIn particular asThe following formula:
wherein the method comprises the steps ofRepresenting residual convolution block, ">Representing up-sampled residual convolution block,>representing a style conversion block; the residual block and the grid conversion block structure diagram refer to fig. 5 (a) -5 (b);
style conversion blockIs input as a face image I with low resolution L And corresponding analytic graph I P The input pair is expressed as +.> And->Respectively inputting a low-resolution face image and an analytic graph corresponding to the low-resolution face image for the ith style conversion block;
style conversion blockFrom input pair->Scale-in-scale learning F i Style conversion parameter y of (2) i =(y s,i ,y b,i ) Expressed by the following formula:
wherein γ represents a lightweight network, where μ and σ are the mean and standard deviation of the features, y s,i Is thatCorresponding style conversion parameters, y b,i Is->Corresponding style conversion parameters. The method can fully utilize the input low-resolution face image I L Spatial color and texture information from the parse map I P Computing and F based on shape and semantic guidance information i Style conversion parameter y of the same size i 。
The super-resolution reconstruction module introduces a semantic perception style loss. The Gram matrix loss used in style transfer is good at texture recovery; to recover face details better, the semantic perception style loss is introduced, which computes the Gram matrix loss for each semantic region separately. The semantic perception style loss is calculated by the following formula:
where φ_i denotes the features of the i-th layer of VGG19, M_j denotes the parsing mask with label j, Î^H denotes the high-resolution face image output by the super-resolution reconstruction module, I^H denotes its ground-truth value, and g denotes the Gram matrix of the feature φ_i computed over the parsing mask M_j, with the following formula:
where ⊙ denotes the element-wise product and ε = 1e-8 is used to avoid division by zero.
The super-resolution reconstruction module introduces a reconstruction loss, a combination of pixel-space and feature-space mean squared error, which constrains the high-resolution face image Î^H output by the super-resolution reconstruction module to its ground-truth value I^H; the reconstruction loss is calculated as follows:
where the second term on the right-hand side of the equation is a multi-scale feature-matching loss used to match the features of Î^H and I^H, s is the downsampling factor, D_s(·) denotes the discriminator corresponding to downsampling factor s, and D_s^(k) denotes the k-th layer feature of D_s.
The super-resolution reconstruction module introduces an adversarial loss; adversarial losses have been shown to be effective and critical for generating realistic textures in image restoration tasks. The specific formula is as follows:
An objective function is constructed based on the multi-scale discriminators and the hinge loss, with the following formula:
Based on the semantic perception style loss, the reconstruction loss and the adversarial loss, the loss function of the super-resolution reconstruction module is constructed as follows:
where λ_SS, λ_rec and λ_adv are the weights of the semantic perception style loss, the reconstruction loss and the adversarial loss, respectively.
Step S3: pre-training the super-resolution reconstruction module by adopting a face data set FFHQ to obtain a pre-trained super-resolution reconstruction module; the pre-training aims to initialize model parameters, accelerate the convergence rate of the model and improve the generalization capability of the model.
Step S4: constructing a sight estimating module, taking a high-resolution face image output by the super-resolution reconstructing module as input, adopting ResNet50 to extract characteristics in the face image, giving weight to each region in the face image based on a space weight mechanism, and inhibiting the weight of other regions by increasing the weight of the relevant region of the sight line in the face image to obtain a sight estimating result aiming at the face image;
the specific method of step S4 is as follows:
step S4.1: the method comprises the steps of adopting a pre-trained ResNet50 as a feature extractor to extract features from a high-resolution face image with a preset size output by a super-resolution reconstruction module and outputting a feature map;
compared with a shallow neural network, the deep residual neural network has the advantages of stronger expression capability, better generalization performance, higher accuracy, self-adaptive feature learning capability and the like, the ResNet50 utilizes residual connection, and cross-layer connection is added in a model, so that the problems of gradient disappearance, gradient explosion and the like in the neural network can be solved, compared with a traditional convolutional neural network, the ResNet50 has higher accuracy, and meanwhile, model training is easier to converge due to the introduction of residual connection, so that the ResNet50 becomes a model widely applied.
ResNet50 is adopted as the feature extractor to improve the performance and generalization capability of the sight estimation model. The input high-resolution face image has a size of 224×224, and after feature extraction by ResNet50 the output feature map has a size of 2048×14×14.
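A sketch of the feature-extraction step, assuming a torchvision ResNet50 truncated before its pooling and classification head. With the default strides this trunk yields a 2048×7×7 map for a 224×224 input; producing the 14×14 spatial size stated above would require a reduced-stride variant, which is not specified here, so the code is illustrative only.

```python
import torch
import torch.nn as nn
from torchvision import models

class FaceFeatureExtractor(nn.Module):
    """Pre-trained ResNet50 trunk used as the gaze feature extractor."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Drop the average pooling and fully connected head, keep the conv trunk
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        # face: (B, 3, 224, 224) high-resolution image from the SR module
        return self.trunk(face)            # (B, 2048, H', W') feature map
```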
Step S4.2: a space weight mechanism is adopted, and the weight of each position of a face region in the face image is learned through one branch, so that the weight of a view line related region in the face image is increased, and the weight of other regions is restrained;
the spatial weight mechanism comprises three convolution layers, the filter size of the three convolution layers is 1 multiplied by 1, the three convolution layers are modified linear unit layers, an activation tensor U with the size of NxHxW is input from the convolution layers for each convolution layer, N is the number of channels of the feature map, H and W are the height and the width of the feature map, the spatial weight mechanism generates an H multiplied by W spatial weight matrix W, and the spatial weight matrix W is multiplied by each channel of the activation tensor U element by element to obtain a weighted activation map on the channel, wherein the formula is as follows:
V C =W⊙U C
wherein W is a space weight matrix, U C The C-th channel, V, representing the activation tensor U C For the weighted activation graph of the C-th channel, the weighted activation graphs of the channels are stacked to form a weighted activation tensor V and fed into the next convolutional layer. Since the ResNet50 feature extractor was previously used, the first layer of convolution input channels was 2048, the output channels were 256, the convolution kernel size was 1, the second layer of convolution input channels was 256, the output channels were 256, the convolution kernel size was 1, the third layer of convolution input channels was 256, the output channels were 1, and the convolution kernel size was 1. The spatial weighting mechanism can retain information from different regions for all feature channelsThe same weights are applied so that the weights of the line of sight estimation directly correspond to the face regions in the input image.
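A sketch of the spatial weight mechanism with the channel sizes given above (2048→256→256→1, all 1×1 kernels); the placement of the ReLU activations and the use of the stated deviation values directly as standard deviations in the initialization are assumptions.

```python
import torch
import torch.nn as nn

class SpatialWeights(nn.Module):
    """Learns an HxW weight map W and re-weights every channel of the feature map."""

    def __init__(self, in_channels: int = 2048, hidden: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=1)
        self.conv3 = nn.Conv2d(hidden, 1, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.reset_parameters()

    def reset_parameters(self):
        # Initialization described in the text: first two layers from N(0, 0.1),
        # last layer from N(0, 0.001), constant bias term of 1
        for conv, std in zip([self.conv1, self.conv2, self.conv3], [0.1, 0.1, 0.001]):
            nn.init.normal_(conv.weight, mean=0.0, std=std)
            nn.init.constant_(conv.bias, 1.0)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: activation tensor of size (B, N, H, W) from the feature extractor
        w = self.relu(self.conv1(u))
        w = self.relu(self.conv2(w))
        w = self.relu(self.conv3(w))       # (B, 1, H, W) spatial weight matrix W
        return u * w                        # V_C = W ⊙ U_C for every channel C
```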
Step S4.3: the features are classified using the full connection layer, and coordinates (x, y) representing the line of sight are output for representing the line of sight estimation result.
Step S5: and (3) carrying out overall training on the super-resolution reconstruction module and the sight estimation module by adopting the face image training set constructed in the step (S1) so as to finish the face sight estimation of the target object.
In the training of the sight estimation module, the filter weights of the first two convolution layers of the spatial weight mechanism are randomly initialized from a Gaussian distribution with mean 0 and deviation 0.1, the filter weights of the last convolution layer are randomly initialized from a Gaussian distribution with mean 0 and variance 0.001, and all layers have a constant bias term of 1; the gradients of the activation tensor U and the spatial weight matrix W are expressed as:
where N is the number of channels of the feature map.
The error of the sight estimation module adopts the L1 Loss, also called the mean absolute error, which is the mean of the absolute error between the model's predicted value and the true value. The sight estimation module introduces a loss function given by the following formula:
where ξ_gt denotes the true value of the sight estimate and ξ_pred denotes the predicted value of the sight estimate.
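A minimal sketch of the L1 gaze loss, assuming prediction and ground truth are 2D gaze values (e.g. yaw and pitch angles).

```python
import torch

def gaze_l1_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean absolute error between predicted and ground-truth gaze values."""
    return torch.mean(torch.abs(pred - gt))
```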
The following is one embodiment of the method contemplated by the present invention:
In this embodiment, the super-resolution reconstruction module needs to be pre-trained first. The FFHQ face dataset is adopted as the training dataset and the model is pre-trained with the Adam optimizer, with β_1 = 0.5 and β_2 = 0.999; the learning rate of the generator is set to 0.0001 and that of the discriminator to 0.0004. The weights of the different losses are set to λ_SS = 100, λ_rec = 10, λ_adv = 1, and the training batch size is set to 4.
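A sketch of the pre-training optimizer setup under the hyper-parameters listed above, assuming the second learning rate belongs to the discriminator; the generator and discriminator modules are placeholders.

```python
import torch

def build_pretraining_optimizers(generator: torch.nn.Module,
                                 discriminator: torch.nn.Module):
    """Adam optimizers with beta_1 = 0.5, beta_2 = 0.999 and the stated learning rates."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.999))
    return opt_g, opt_d

loss_weights = {"lambda_ss": 100.0, "lambda_rec": 10.0, "lambda_adv": 1.0}
batch_size = 4
```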
The gaze estimation dataset used is the classical gaze estimation dataset MPIIFaceGaze, comprising a total of 45000 images of 15 subjects; the 3000 images of subject P00 are used as the test set and the remaining 42000 images as the training set.
Data preprocessing is performed on the gaze estimation dataset to eliminate environmental factors and simplify the gaze regression problem. The specific steps are as follows:
S1: preprocess the entire gaze estimation dataset. The preprocessing function first obtains the list of per-subject folders in the MPIIFaceGaze dataset and sorts them by file name, then traverses each subject folder, obtains that subject's annotation information and image information, and stores the processed images and information under a specified path.
S2: reading the camera matrix and annotation information of the person, traversing all images of the person, acquiring important annotation information such as a face center point, left and right eye corner points and the like, normalizing and cutting the images through the annotation information, acquiring the images of the face and the left and right eyes in the images, finally acquiring important information such as a 3D gaze point and a 3D head orientation, and storing the processed images and information in a specified path.
S3: for each image, firstly carrying out normalization processing on the image through annotation information to obtain the distance between the center point of the human face and the point of regard, then carrying out scaling on the image according to a certain proportion, ensuring that the distance between the point of regard and the center point of the human face is a fixed value, and the image size of the scaled data set is 224 multiplied by 224.
S4: and acquiring important information such as a 3D gaze point and a 3D head orientation according to the normalized annotation information, and storing the processed image and information in a specified path.
S5: to test the results of the method on low resolution images, the test set was downsampled using the resize () function in python, resized down to 112 x 112 resolution, and restored up to 224 x 224 resolution, turning into low resolution images.
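A sketch of the low-resolution degradation of the test images (down to 112×112 and back up to 224×224), here with Pillow; the interpolation mode is an assumption.

```python
from PIL import Image

def make_low_resolution(path_in: str, path_out: str) -> None:
    """Degrade a 224x224 face crop by downsampling to 112x112 and resizing back."""
    img = Image.open(path_in).convert("RGB")
    low = img.resize((112, 112), Image.BICUBIC)
    restored = low.resize((224, 224), Image.BICUBIC)
    restored.save(path_out)
```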
S6: the whole PGGA-Net network was trained using the preprocessed MPIIFaceGaze with the base size set to 128, epoch set to 20, and learning rate set to 0.00001.
S7: verification is performed on the test set using the trained model.
Evaluation index: the evaluation index of the current main stream of the sight line estimation is mostly an angle error, namely the deviation angle of the predicted value and the true value of the sight line estimation, and the smaller the index is, the better the effect is.
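A sketch of the angular-error metric, assuming gaze is expressed as (yaw, pitch) angles in radians that are converted to 3D unit vectors before measuring the angle between prediction and ground truth.

```python
import numpy as np

def angles_to_vector(yaw: float, pitch: float) -> np.ndarray:
    """Convert (yaw, pitch) in radians to a 3D gaze direction vector."""
    return np.array([np.cos(pitch) * np.sin(yaw),
                     np.sin(pitch),
                     np.cos(pitch) * np.cos(yaw)])

def angular_error_deg(pred, gt) -> float:
    """Angle in degrees between predicted and ground-truth gaze directions."""
    v1, v2 = angles_to_vector(*pred), angles_to_vector(*gt)
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0))))
```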
The comparison models adopt the advanced gaze estimation methods Dilated-Net, RT-Gene and Gaze360. Dilated-Net uses a batch size of 128, 20 epochs and a learning rate of 0.001; RT-Gene uses a batch size of 128, 20 epochs and a learning rate of 0.0001; Gaze360 uses a batch size of 128, 20 epochs and a learning rate of 0.0001. The experimental results are shown in Table 1:
Table 1. Experimental results of the proposed network and other advanced networks (mean angular error on the low-resolution MPIIFaceGaze test set)

Method | Mean angular error (degrees)
---|---
Dilated-Net | 4.86
Gaze360 | 5.02
RT-Gene | 6.43
PGGA-Net (proposed) | 3.96
The experimental data in Table 1 show that the method of the present invention is superior to the other methods and can effectively improve the accuracy of gaze estimation in a low-resolution environment.
The following is an applicable scenario of the embodiment of the present invention:
Gaze estimation has a wide range of application scenarios. One of them is the detection of cheating in examinations: gaze estimation is performed on the examinee through the computer's camera to monitor whether the examinee's gaze is directed at the computer, so as to judge whether the examinee is cheating. Because many school computer rooms use old computers or notebooks, the pictures captured by their cameras have low definition, and the accuracy of traditional gaze estimation in such scenarios is low; the method provided by the invention can solve this problem.
S1: the front camera of the old computer is used to capture pictures of the examinee's face at equal intervals of 5 s; the resolution of the pictures is low.
S2: the collected pictures are input into the PGGA-Net network provided by the invention.
S3: the PGGA-Net network provided by the invention can calculate and obtain the sight line estimation result of the examinee, then compares the result with the sight line threshold value, and considers that the examinee is highly likely to have cheating behaviors if the sight line angle of the examinee exceeds the threshold value continuously for a plurality of times.
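A sketch of the exam-monitoring loop described in steps S1-S3: capture a frame every 5 s, run the gaze estimator, and flag the examinee after several consecutive out-of-range gaze angles. The capture function, estimator interface and threshold values are placeholders, not part of the patent.

```python
import time

def monitor_examinee(capture_frame, estimate_gaze,
                     angle_threshold_deg: float = 30.0,
                     max_consecutive: int = 3, interval_s: float = 5.0) -> None:
    """Flag possible cheating when the gaze angle exceeds the threshold repeatedly."""
    consecutive = 0
    while True:
        frame = capture_frame()                     # low-resolution webcam image
        yaw_deg, pitch_deg = estimate_gaze(frame)   # PGGA-Net style estimator (placeholder)
        if max(abs(yaw_deg), abs(pitch_deg)) > angle_threshold_deg:
            consecutive += 1
            if consecutive >= max_consecutive:
                print("Possible cheating behaviour detected")
                consecutive = 0
        else:
            consecutive = 0
        time.sleep(interval_s)
```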
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.
Claims (10)
1. A sight line estimation method based on super-resolution reconstruction is characterized in that the following steps S1-S5 are executed to finish the face sight line estimation of a target object:
step S1: acquiring a preset number of face images by using a camera, and constructing a face image training set;
step S2: the super-resolution reconstruction module is constructed and comprises a preset number of residual blocks and style conversion blocks corresponding to the residual blocks, takes a low-resolution face image as input, and upsamples the features in the face image by adopting a step-by-step upsampling mode based on each residual block to generate a high-resolution face image with a preset size;
step S3: pre-training the super-resolution reconstruction module to obtain a pre-trained super-resolution reconstruction module;
step S4: constructing a sight estimating module, taking a high-resolution face image output by the super-resolution reconstructing module as input, adopting ResNet50 to extract characteristics in the face image, giving weight to each region in the face image based on a space weight mechanism, and inhibiting the weight of other regions by increasing the weight of the relevant region of the sight line in the face image to obtain a sight estimating result aiming at the face image;
step S5: and (3) carrying out overall training on the super-resolution reconstruction module and the sight estimation module by adopting the face image training set constructed in the step (S1) so as to finish the face sight estimation of the target object.
2. The sight line estimation method based on super-resolution reconstruction according to claim 1, wherein the super-resolution reconstruction module is pre-trained by using a face data set FFHQ in step S3.
3. The sight line estimation method based on super-resolution reconstruction according to claim 1, wherein the super-resolution reconstruction module in step S2 has 6 residual blocks sequentially connected in series and performs step-wise upsampling on the low-resolution face image to extract its features, wherein the input of the first residual block is a learnable constant F_0 of size C×16×16, where C is the channel size; the input of the i-th residual block is the feature F_{i-1} and its output is the feature F_i; the feature F_6 output by the last residual block is converted into an RGB image by a ToRGB convolution layer to give the output high-resolution face image Î^H; the specific formula is as follows:
wherein each stage is composed of a residual convolution block, an up-sampled residual convolution block and a style conversion block;
the style conversion block takes as input the low-resolution face image I_L and the corresponding parsing map I_P; the input pair of the i-th style conversion block is denoted (I_L^i, I_P^i), i.e. the low-resolution face image and its corresponding parsing map fed to the i-th style conversion block;
the style conversion block learns from the input pair, at the corresponding scale, the style conversion parameters y_i = (y_{s,i}, y_{b,i}) of the feature F_i, expressed by the following formula:
where γ denotes a lightweight network, μ and σ are the mean and standard deviation of the features, and y_{s,i} and y_{b,i} are the corresponding scale and bias style conversion parameters, respectively.
4. The sight line estimation method based on super-resolution reconstruction as claimed in claim 3, wherein the super-resolution reconstruction module introduces a semantic perception style loss, calculated by the following formula:
where φ_i denotes the features of the i-th layer of VGG19, M_j denotes the parsing mask with label j, Î^H denotes the high-resolution face image output by the super-resolution reconstruction module, I^H denotes its ground-truth value, and g denotes the Gram matrix of the feature φ_i computed over the parsing mask M_j, with the following formula:
where ⊙ denotes the element-wise product and ε = 1e-8 is used to avoid division by zero.
5. The method of claim 4, wherein the super-resolution reconstruction module introduces a reconstruction loss, which constrains the high-resolution face image Î^H output by the super-resolution reconstruction module to its ground-truth value I^H; the reconstruction loss is calculated as follows:
where the second term on the right-hand side of the equation is a multi-scale feature-matching loss used to match the features of Î^H and I^H, s is the downsampling factor, D_s(·) denotes the discriminator corresponding to downsampling factor s, and D_s^(k) denotes the k-th layer feature of D_s.
6. The method of claim 5, wherein the super-resolution reconstruction module introduces an adversarial loss, with the following formula:
an objective function is constructed based on the multi-scale discriminators and the hinge loss, with the following formula:
based on the semantic perception style loss, the reconstruction loss and the adversarial loss, the loss function of the super-resolution reconstruction module is constructed as follows:
where λ_SS, λ_rec and λ_adv are the weights of the semantic perception style loss, the reconstruction loss and the adversarial loss, respectively.
7. The sight line estimation method based on super-resolution reconstruction according to claim 1, wherein the specific method of step S4 is as follows:
step S4.1: the method comprises the steps of adopting a pre-trained ResNet50 as a feature extractor to extract features from a high-resolution face image with a preset size output by a super-resolution reconstruction module and outputting a feature map;
step S4.2: a spatial weight mechanism is adopted, and the weight of each position of the face region in the face image is learned through one branch, so that the weight of the sight-line-related region in the face image is increased and the weight of other regions is suppressed;
step S4.3: the features are classified using the full connection layer, and coordinates (x, y) representing the line of sight are output for representing the line of sight estimation result.
8. The line-of-sight estimation method according to claim 7, wherein the spatial weight mechanism of step S4.2 comprises three convolution layers with 1×1 filters and rectified linear unit (ReLU) activation; for each convolution layer, an activation tensor U of size N×H×W is input, where N is the number of channels of the feature map and H and W are its height and width; the spatial weight mechanism generates an H×W spatial weight matrix W, and the spatial weight matrix W is multiplied element by element with each channel of the activation tensor U to obtain the weighted activation map of that channel, as follows:
V_C = W ⊙ U_C
where W is the spatial weight matrix, U_C denotes the C-th channel of the activation tensor U, and V_C is the weighted activation map of the C-th channel; the weighted activation maps of all channels are stacked to form the weighted activation tensor V, which is fed into the next convolution layer.
9. The line-of-sight estimation method based on super-resolution reconstruction according to claim 8, wherein in the training of the line-of-sight estimation module, the filter weights of the first two convolution layers of the spatial weight mechanism are randomly initialized from a Gaussian distribution with mean 0 and deviation 0.1, the filter weights of the last convolution layer are randomly initialized from a Gaussian distribution with mean 0 and variance 0.001, and all layers have a constant bias term of 1; wherein the gradients of the activation tensor U and the spatial weight matrix W are expressed as:
where N is the number of channels of the feature map.
10. The line-of-sight estimation method based on super-resolution reconstruction of claim 9, wherein the line-of-sight estimation module introduces a loss function given by the following formula:
where ξ_gt denotes the true value of the gaze estimate and ξ_pred denotes the predicted value of the gaze estimate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310599847.5A CN116664677B (en) | 2023-05-24 | 2023-05-24 | Sight estimation method based on super-resolution reconstruction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310599847.5A CN116664677B (en) | 2023-05-24 | 2023-05-24 | Sight estimation method based on super-resolution reconstruction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116664677A true CN116664677A (en) | 2023-08-29 |
CN116664677B CN116664677B (en) | 2024-06-14 |
Family
ID=87719969
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310599847.5A Active CN116664677B (en) | 2023-05-24 | 2023-05-24 | Sight estimation method based on super-resolution reconstruction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116664677B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830783A (en) * | 2024-01-03 | 2024-04-05 | 南通大学 | Sight estimation method based on local super-resolution fusion attention mechanism |
CN118506430A (en) * | 2024-07-17 | 2024-08-16 | 江苏富翰医疗产业发展有限公司 | Sight line estimation method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105469399A (en) * | 2015-11-20 | 2016-04-06 | 中国地质大学(武汉) | Face super-resolution reconstruction method facing mixed noises and apparatus thereof |
KR20170087734A (en) * | 2016-01-21 | 2017-07-31 | 한국전자통신연구원 | Apparatus and method for high resolution image generation using gradient information |
CN111754403A (en) * | 2020-06-15 | 2020-10-09 | 南京邮电大学 | Image super-resolution reconstruction method based on residual learning |
CN113298717A (en) * | 2021-06-08 | 2021-08-24 | 浙江工业大学 | Medical image super-resolution reconstruction method based on multi-attention residual error feature fusion |
CN113362223A (en) * | 2021-05-25 | 2021-09-07 | 重庆邮电大学 | Image super-resolution reconstruction method based on attention mechanism and two-channel network |
CN116091315A (en) * | 2023-01-05 | 2023-05-09 | 南昌大学 | Face super-resolution reconstruction method based on progressive training and face semantic segmentation |
-
2023
- 2023-05-24 CN CN202310599847.5A patent/CN116664677B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105469399A (en) * | 2015-11-20 | 2016-04-06 | 中国地质大学(武汉) | Face super-resolution reconstruction method facing mixed noises and apparatus thereof |
KR20170087734A (en) * | 2016-01-21 | 2017-07-31 | 한국전자통신연구원 | Apparatus and method for high resolution image generation using gradient information |
CN111754403A (en) * | 2020-06-15 | 2020-10-09 | 南京邮电大学 | Image super-resolution reconstruction method based on residual learning |
CN113362223A (en) * | 2021-05-25 | 2021-09-07 | 重庆邮电大学 | Image super-resolution reconstruction method based on attention mechanism and two-channel network |
CN113298717A (en) * | 2021-06-08 | 2021-08-24 | 浙江工业大学 | Medical image super-resolution reconstruction method based on multi-attention residual error feature fusion |
CN116091315A (en) * | 2023-01-05 | 2023-05-09 | 南昌大学 | Face super-resolution reconstruction method based on progressive training and face semantic segmentation |
Non-Patent Citations (2)
Title |
---|
XINTAO WANG 等: "RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization", PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 10 October 2022 (2022-10-10), pages 2556, XP059131453, DOI: 10.1145/3503161.3547915 * |
XU Shi: "Research on multi-scale convolutional neural network models for single-image super-resolution", China Master's Theses Full-text Database, Information Science and Technology, no. 1, 15 January 2023 (2023-01-15), pages 138 - 1546 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830783A (en) * | 2024-01-03 | 2024-04-05 | 南通大学 | Sight estimation method based on local super-resolution fusion attention mechanism |
CN118506430A (en) * | 2024-07-17 | 2024-08-16 | 江苏富翰医疗产业发展有限公司 | Sight line estimation method and system |
Also Published As
Publication number | Publication date |
---|---|
CN116664677B (en) | 2024-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022111236A1 (en) | Facial expression recognition method and system combined with attention mechanism | |
US11644898B2 (en) | Eye tracking method and system | |
CN112766160A (en) | Face replacement method based on multi-stage attribute encoder and attention mechanism | |
CN112530019B (en) | Three-dimensional human body reconstruction method and device, computer equipment and storage medium | |
JP6207210B2 (en) | Information processing apparatus and method | |
CN111723707B (en) | Gaze point estimation method and device based on visual saliency | |
JP2008152789A (en) | Method and device for calculating similarity of face video, method and device for retrieving face video using this, and face composing method | |
CN111046734B (en) | Multi-modal fusion sight line estimation method based on expansion convolution | |
CN109583338A (en) | Driver Vision decentralized detection method based on depth integration neural network | |
CN112232134B (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
WO2021218238A1 (en) | Image processing method and image processing apparatus | |
CN111783748A (en) | Face recognition method and device, electronic equipment and storage medium | |
JP6822482B2 (en) | Line-of-sight estimation device, line-of-sight estimation method, and program recording medium | |
CN113610046B (en) | Behavior recognition method based on depth video linkage characteristics | |
CN114120432A (en) | Online learning attention tracking method based on sight estimation and application thereof | |
CN114943924B (en) | Pain assessment method, system, equipment and medium based on facial expression video | |
CN116664677B (en) | Sight estimation method based on super-resolution reconstruction | |
CN113850231A (en) | Infrared image conversion training method, device, equipment and storage medium | |
Guo et al. | Remote sensing image super-resolution using cascade generative adversarial nets | |
CN113642393A (en) | Attention mechanism-based multi-feature fusion sight line estimation method | |
CN114170537A (en) | Multi-mode three-dimensional visual attention prediction method and application thereof | |
Dutta | Facial Pain Expression Recognition in Real‐Time Videos | |
CN114220138A (en) | Face alignment method, training method, device and storage medium | |
CN114492634A (en) | Fine-grained equipment image classification and identification method and system | |
CN117037244A (en) | Face security detection method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |