CN116664677A - Sight estimation method based on super-resolution reconstruction - Google Patents
Sight estimation method based on super-resolution reconstruction
- Publication number
- CN116664677A CN116664677A CN202310599847.5A CN202310599847A CN116664677A CN 116664677 A CN116664677 A CN 116664677A CN 202310599847 A CN202310599847 A CN 202310599847A CN 116664677 A CN116664677 A CN 116664677A
- Authority
- CN
- China
- Prior art keywords
- resolution
- super
- face image
- sight
- resolution reconstruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 66
- 238000012549 training Methods 0.000 claims abstract description 25
- 238000006243 chemical reaction Methods 0.000 claims description 28
- 230000004913 activation Effects 0.000 claims description 24
- 239000011159 matrix material Substances 0.000 claims description 16
- 230000008447 perception Effects 0.000 claims description 15
- 230000006870 function Effects 0.000 claims description 9
- 238000010276 construction Methods 0.000 claims description 3
- 230000002401 inhibitory effect Effects 0.000 claims description 3
- 238000002474 experimental method Methods 0.000 abstract description 2
- 238000010586 diagram Methods 0.000 description 9
- 238000012360 testing method Methods 0.000 description 6
- 230000000694 effects Effects 0.000 description 5
- 210000003128 head Anatomy 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000013527 convolutional neural network Methods 0.000 description 4
- 238000011156 evaluation Methods 0.000 description 3
- 230000000750 progressive effect Effects 0.000 description 3
- 230000006399 behavior Effects 0.000 description 2
- 230000000295 complement effect Effects 0.000 description 2
- 230000007547 defect Effects 0.000 description 2
- 238000007781 pre-processing Methods 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000011084 recovery Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 230000008034 disappearance Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000004880 explosion Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000001815 facial effect Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 230000011273 social behavior Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06T3/4076—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution using the original low-resolution images to iteratively correct the high-resolution images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a sight line estimation method based on super-resolution reconstruction, which comprises the following steps: acquiring face images with a camera; constructing a super-resolution reconstruction module and a sight line estimation module; first pre-training the super-resolution reconstruction module and then training the whole network. A face image is input, and the super-resolution reconstruction module recovers the details and definition of the low-resolution face image so as to improve the sight line estimation accuracy; the sight line estimation module then extracts global features, improves the feature expression capability, and increases the weight of sight-line-related areas through a spatial weight mechanism so as to perform accurate sight line estimation. The method designed by the invention has better learning ability, performance and generalization ability. Experiments prove that the method can effectively improve the accuracy of sight estimation in low-resolution scenes.
Description
Technical Field
The invention relates to the field of deep learning and computer vision, in particular to a sight line estimation method based on super-resolution reconstruction.
Background
Gaze estimation aims at determining the direction and point of gaze of a person in an image or video. Since gaze behavior is a fundamental aspect of human social behavior, potential information can be inferred from the objects of gaze estimation.
Early gaze estimation methods took a monocular image as input, trained a convolutional neural network model, and output the two-dimensional coordinate point of the gaze. Binocular gaze estimation methods were then proposed to remedy the fact that the monocular method cannot fully utilize the complementary information of both eyes. However, both the monocular and binocular gaze estimation methods have some drawbacks, such as the need for additional modules to detect the eyes and to estimate head pose. Later, full-face gaze estimation methods were proposed, which obtain the final gaze estimation result from a single input face image; this is an end-to-end learning strategy that can take the global features of the whole face into account, and many mainstream gaze estimation methods are based on it. However, the shallow residual network adopted by such methods has limited learning capability, so the improvement is limited, and the problem that gaze estimation accuracy drops sharply in low-resolution scenarios remains unsolved.
Disclosure of Invention
The purpose of the invention is as follows: to provide a sight line estimation method based on super-resolution reconstruction that solves the problem that sight line estimation accuracy drops significantly in low-resolution scenarios.
In order to achieve the above functions, the present invention designs a sight line estimation method based on super-resolution reconstruction, and performs the following steps S1 to S5 to complete the face sight line estimation of a target object:
step S1: acquiring a preset number of face images by using a camera, and constructing a face image training set;
step S2: the super-resolution reconstruction module is constructed and comprises a preset number of residual blocks and style conversion blocks corresponding to the residual blocks, takes a low-resolution face image as input, and upsamples the features in the face image by adopting a step-by-step upsampling mode based on each residual block to generate a high-resolution face image with a preset size;
step S3: pre-training the super-resolution reconstruction module to obtain a pre-trained super-resolution reconstruction module;
step S4: constructing a sight estimating module, taking a high-resolution face image output by the super-resolution reconstructing module as input, adopting ResNet50 to extract characteristics in the face image, giving weight to each region in the face image based on a space weight mechanism, and inhibiting the weight of other regions by increasing the weight of the relevant region of the sight line in the face image to obtain a sight estimating result aiming at the face image;
step S5: and (3) carrying out overall training on the super-resolution reconstruction module and the sight estimation module by adopting the face image training set constructed in the step (S1) so as to finish the face sight estimation of the target object.
As a preferred technical scheme of the invention: in step S3, the super-resolution reconstruction module is pre-trained by using the face data set FFHQ.
As a preferred technical scheme of the invention: the super-resolution reconstruction module in step S2 has 6 residual blocks sequentially connected in series and performs step-by-step upsampling on the low-resolution face image to extract its features, wherein the input of the first residual block is a learnable constant F_0 of size C×16×16, where C is the channel size; the input of the i-th residual block is the feature F_{i-1} and its output is the feature F_i; the feature F_6 output by the last residual block is converted into an RGB image by a ToRGB convolution layer to give the output high-resolution face image Î^H. The specific formula is as follows:
wherein each stage is composed of a residual convolution block, an up-sampled residual convolution block and a style conversion block;
the style conversion block takes as input the low-resolution face image I_L and the corresponding parsing map I_P; the input pair of the i-th style conversion block is denoted (I_L^i, I_P^i), i.e. the low-resolution face image and its corresponding parsing map fed to the i-th style conversion block;
the style conversion block learns from the input pair, at the corresponding scale, the style conversion parameters y_i = (y_{s,i}, y_{b,i}) of the feature F_i, expressed by the following formula:
where γ denotes a lightweight network, μ and σ are the mean and standard deviation of the features, and y_{s,i} and y_{b,i} are the corresponding scale and bias style conversion parameters, respectively.
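A minimal PyTorch-style sketch of how such a style conversion block could be implemented, assuming an AdaIN-like modulation in which the lightweight network γ predicts a scale y_s and a bias y_b from the input pair (I_L, I_P) resized to the current scale. The layer widths, the 19-channel parsing map and the exact structure of γ are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class StyleConversionBlock(nn.Module):
    """Sketch of a style conversion block (assumed AdaIN-style modulation).

    A lightweight network (gamma) maps the low-resolution face image and its
    parsing map, resized to the current scale, to scale/bias parameters
    y_s and y_b that modulate the normalized feature F_i.
    """

    def __init__(self, feat_channels: int, in_channels: int = 3 + 19):
        super().__init__()
        # gamma: lightweight network predicting 2 * C modulation parameters
        self.gamma = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 2 * feat_channels, kernel_size=3, padding=1),
        )

    def forward(self, feat: torch.Tensor, lr_face: torch.Tensor,
                parsing_map: torch.Tensor) -> torch.Tensor:
        # Resize the input pair (I_L, I_P) to the spatial size of F_i
        pair = torch.cat([lr_face, parsing_map], dim=1)
        pair = nn.functional.interpolate(pair, size=feat.shape[-2:],
                                         mode='bilinear', align_corners=False)
        y = self.gamma(pair)
        y_s, y_b = y.chunk(2, dim=1)          # scale and bias style parameters
        # Normalize F_i with its per-channel mean/std, then modulate
        mu = feat.mean(dim=(2, 3), keepdim=True)
        sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-8
        return y_s * (feat - mu) / sigma + y_b
```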
As a preferred technical scheme of the invention: the super-resolution reconstruction module introduces a semantic perception style loss, calculated by the following formula:
where φ_i denotes the features of the i-th layer of VGG19, M_j denotes the parsing mask with label j, Î^H denotes the high-resolution face image output by the super-resolution reconstruction module, I^H denotes its ground-truth value, and g denotes the Gram matrix of the feature φ_i computed over the parsing mask M_j, calculated as follows:
where ⊙ denotes the element-wise product and ε = 1e-8 is used to avoid division by zero.
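A hedged sketch of the masked Gram-matrix computation behind the semantic perception style loss; the choice of VGG19 layers, the L1 comparison of Gram matrices and the exact mask normalization are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def masked_gram(feat: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Gram matrix of VGG19 features restricted to one semantic region.

    feat: (B, C, H, W) features phi_i from one VGG19 layer.
    mask: (B, 1, H, W) binary parsing mask M_j for semantic label j.
    """
    masked = feat * mask                                # keep only region j
    b, c, h, w = masked.shape
    f = masked.view(b, c, h * w)
    gram = torch.bmm(f, f.transpose(1, 2))              # (B, C, C)
    return gram / (mask.sum(dim=(1, 2, 3)).view(b, 1, 1) + eps)

def semantic_style_loss(feats_sr, feats_hr, masks) -> torch.Tensor:
    """Sum of Gram-matrix differences over VGG layers and semantic regions."""
    loss = 0.0
    for f_sr, f_hr in zip(feats_sr, feats_hr):          # one entry per VGG layer
        for m in masks:                                  # one mask per label j
            m_r = F.interpolate(m, size=f_sr.shape[-2:])
            loss = loss + F.l1_loss(masked_gram(f_sr, m_r), masked_gram(f_hr, m_r))
    return loss
```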
As a preferred technical scheme of the invention: the super-resolution reconstruction module introduces a reconstruction loss, which constrains the high-resolution face image Î^H output by the super-resolution reconstruction module to its ground-truth value I^H; the reconstruction loss is calculated as follows:
where the second term on the right-hand side of the equation is a multi-scale feature-matching loss used to match the features of Î^H and I^H, s is the downsampling factor, D_s(·) denotes the discriminator corresponding to downsampling factor s, and D_s^(k) denotes the k-th layer feature of D_s.
As a preferred technical scheme of the invention: the super-resolution reconstruction module introduces an adversarial loss, with the following formula:
An objective function is constructed based on the multi-scale discriminators and the hinge loss, with the following formula:
Based on the semantic perception style loss, the reconstruction loss and the adversarial loss, the loss function of the super-resolution reconstruction module is constructed as follows:
where λ_SS, λ_rec and λ_adv are the weights of the semantic perception style loss, the reconstruction loss and the adversarial loss, respectively.
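A sketch of how the three loss terms might be combined with the weights λ_SS, λ_rec and λ_adv (the default values follow the training settings given later in the description). The hinge form of the adversarial terms is an assumption consistent with the hinge-loss objective mentioned above, semantic_style_loss refers to the sketch given earlier, and the feature-matching part of the reconstruction loss is omitted for brevity.

```python
import torch.nn.functional as F

def generator_loss(sr, hr, feats_sr, feats_hr, masks, disc_outputs_fake,
                   lambda_ss=100.0, lambda_rec=10.0, lambda_adv=1.0):
    """Total SR loss: semantic style + reconstruction + adversarial terms."""
    l_ss = semantic_style_loss(feats_sr, feats_hr, masks)   # sketched above
    l_rec = F.mse_loss(sr, hr)        # pixel-space part of the reconstruction loss
    # Hinge-style generator term, summed over the multi-scale discriminators
    l_adv = sum(-out.mean() for out in disc_outputs_fake)
    return lambda_ss * l_ss + lambda_rec * l_rec + lambda_adv * l_adv

def discriminator_hinge_loss(disc_outputs_real, disc_outputs_fake):
    """Hinge objective for the multi-scale discriminators."""
    loss = 0.0
    for real, fake in zip(disc_outputs_real, disc_outputs_fake):
        loss = loss + F.relu(1.0 - real).mean() + F.relu(1.0 + fake).mean()
    return loss
```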
As a preferred technical scheme of the invention: the specific method of step S4 is as follows:
step S4.1: the method comprises the steps of adopting a pre-trained ResNet50 as a feature extractor to extract features from a high-resolution face image with a preset size output by a super-resolution reconstruction module and outputting a feature map;
step S4.2: a spatial weight mechanism is adopted, and the weight of each position of the face region in the face image is learned through one branch, so that the weight of the sight-line-related region in the face image is increased and the weight of other regions is suppressed;
step S4.3: the features are classified using the full connection layer, and coordinates (x, y) representing the line of sight are output for representing the line of sight estimation result.
As a preferred technical scheme of the invention: the spatial weight mechanism in step S4.2 comprises three convolution layers with 1×1 filters and rectified linear unit (ReLU) activation; for each convolution layer, an activation tensor U of size N×H×W is input, where N is the number of channels of the feature map and H and W are its height and width; the spatial weight mechanism generates an H×W spatial weight matrix W, and the spatial weight matrix W is multiplied element by element with each channel of the activation tensor U to obtain the weighted activation map of that channel, as follows:
V_C = W ⊙ U_C
where W is the spatial weight matrix, U_C denotes the C-th channel of the activation tensor U, and V_C is the weighted activation map of the C-th channel; the weighted activation maps of all channels are stacked to form the weighted activation tensor V, which is fed into the next convolution layer.
As a preferred technical scheme of the invention: in the training of the sight estimation module, the filter weights of the first two convolution layers of the spatial weight mechanism are randomly initialized from a Gaussian distribution with mean 0 and deviation 0.1, the filter weights of the last convolution layer are randomly initialized from a Gaussian distribution with mean 0 and variance 0.001, and all layers have a constant bias term of 1; the gradients of the activation tensor U and the spatial weight matrix W are expressed as:
where N is the number of channels of the feature map.
As a preferred technical scheme of the invention: the sight estimation module introduces a loss function given by the following formula:
where ξ_gt denotes the true value of the sight estimate and ξ_pred denotes the predicted value of the sight estimate.
The beneficial effects are that: the advantages of the present invention over the prior art include the following.
The sight line estimation method based on super-resolution reconstruction can improve the accuracy of sight estimation in low-resolution scenarios. The mainstream evaluation index for sight estimation is the angular error, i.e. the deviation angle between the predicted value and the true value of the sight estimate; the smaller this index, the better the effect. The experimental training used the classical sight estimation dataset MPIIFaceGaze, and LQ (low-quality) processing was applied to the test set to evaluate the method in a low-resolution scenario. Under the same experimental conditions and compared with other advanced methods, the Dilated-Net method has an average error of 4.86 degrees on this dataset, the Gaze360 method 5.02 degrees, and the RT-Gene method 6.43 degrees, while the PGGA-Net method of the present invention has an average error of 3.96 degrees, which is superior to the other methods. This shows that the method of the invention can improve the accuracy of sight estimation in a low-resolution environment.
Drawings
FIG. 1 (a) is a diagram of a conventional monocular gaze estimation network;
FIG. 1 (b) is a diagram of a conventional binocular vision estimation network;
FIG. 1 (c) is a diagram of a conventional full-face gaze estimation network;
FIG. 2 is a diagram of a conventional full-face gaze estimation network;
FIG. 3 (a) is a flow chart of a prior art method;
fig. 3 (b) is a flowchart of a line-of-sight estimation method based on super-resolution reconstruction according to an embodiment of the present invention;
FIG. 4 is a diagram of a PGGA-Net network skeleton provided in accordance with an embodiment of the present invention;
fig. 5 (a) is a residual block structure diagram provided according to an embodiment of the present invention;
fig. 5 (b) is a block structure diagram of style conversion provided according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Structure diagrams of the existing monocular, binocular and full-face gaze estimation networks are shown in fig. 1 (a) - 1 (c). The monocular method takes a monocular image as input, trains a convolutional neural network model, and outputs the two-dimensional coordinate point of the gaze. The binocular gaze estimation method remedies the defect that the monocular method cannot fully utilize the complementary information of both eyes. However, both the monocular and binocular gaze estimation methods have some drawbacks, such as the need for additional modules to detect the eyes and to estimate head pose. The full-face gaze estimation method overcomes these defects: it obtains the final gaze estimation result from a single input face image, is an end-to-end learning strategy, and can take the global features of the whole face into account.
Referring to fig. 2, in the convolutional neural network fused with a non-attention mechanism, the normalized face image first passes through a convolution module with a 7×7 convolution kernel, is then fed into a three-layer network with 2 residual blocks per layer, and is then fed into a 1×1 convolution layer for the convolution operation to complete facial feature extraction; the extracted features are reshaped into vector form and concatenated with the head pose information, and the gaze estimation result is then obtained through a fully connected layer.
Referring to fig. 3 (a), the existing full-face gaze estimation pipeline uses a shallow residual neural network fused with a non-attention mechanism to perform full-face gaze estimation. This approach can improve network performance without increasing the number of network parameters, but the shallow residual network it adopts has limited learning capability and a limited improvement effect, and it still does not solve the problem that gaze estimation accuracy drops sharply in low-resolution scenarios.
Referring to fig. 3 (b), fig. 4, the line of sight estimation method based on super resolution reconstruction provided by the embodiment of the invention is based on a PGGA-Net network framework, wherein the PGGA-Net network framework mainly comprises two modules, namely a super resolution reconstruction module and a line of sight estimation module, the super resolution reconstruction module is a progressive semantic perception style conversion framework, and details and definition are recovered for a low resolution face image so as to improve the line of sight estimation precision. The following steps S1-S5 are executed to finish the face sight estimation of the target object:
step S1: acquiring a preset number of face images by using a camera, and constructing a face image training set;
step S2: the method comprises the steps of constructing a super-resolution reconstruction module, recovering details and definition of a low-resolution face image by using a progressive semantic perception style conversion frame to improve the sight estimation precision, wherein the super-resolution reconstruction module takes the low-resolution face image as input, and upsamples the features in the face image by adopting a progressive upsampling mode based on each residual block to generate a high-resolution face image with a preset size;
the super-resolution reconstruction module is provided with 6 residual blocks which are sequentially connected in series, and gradually upsamples the face image with low resolution to extract the characteristics thereof, wherein the input of the first residual block is a learning constant F with the size of C multiplied by 16 0 Wherein C is the channel size; the input of the ith residual block is feature F i-1 The output is characteristic F i Last residual block output feature F 6 Converting into RGB image by ToRGB convolution layer, and outputting high resolution face imageIn particular asThe following formula:
wherein the method comprises the steps ofRepresenting residual convolution block, ">Representing up-sampled residual convolution block,>representing a style conversion block; the residual block and the grid conversion block structure diagram refer to fig. 5 (a) -5 (b);
style conversion blockIs input as a face image I with low resolution L And corresponding analytic graph I P The input pair is expressed as +.> And->Respectively inputting a low-resolution face image and an analytic graph corresponding to the low-resolution face image for the ith style conversion block;
style conversion blockFrom input pair->Scale-in-scale learning F i Style conversion parameter y of (2) i =(y s,i ,y b,i ) Expressed by the following formula:
wherein γ represents a lightweight network, where μ and σ are the mean and standard deviation of the features, y s,i Is thatCorresponding style conversion parameters, y b,i Is->Corresponding style conversion parameters. The method can fully utilize the input low-resolution face image I L Spatial color and texture information from the parse map I P Computing and F based on shape and semantic guidance information i Style conversion parameter y of the same size i 。
The super-resolution reconstruction module introduces a semantic perception style loss. The Gram matrix loss used in style transfer is good at texture recovery; to recover face details better, the semantic perception style loss is introduced, which computes the Gram matrix loss for each semantic region separately. The semantic perception style loss is calculated by the following formula:
where φ_i denotes the features of the i-th layer of VGG19, M_j denotes the parsing mask with label j, Î^H denotes the high-resolution face image output by the super-resolution reconstruction module, I^H denotes its ground-truth value, and g denotes the Gram matrix of the feature φ_i computed over the parsing mask M_j, with the following formula:
where ⊙ denotes the element-wise product and ε = 1e-8 is used to avoid division by zero.
The super-resolution reconstruction module introduces a reconstruction loss, a combination of pixel-space and feature-space mean squared error, which constrains the high-resolution face image Î^H output by the super-resolution reconstruction module to its ground-truth value I^H; the reconstruction loss is calculated as follows:
where the second term on the right-hand side of the equation is a multi-scale feature-matching loss used to match the features of Î^H and I^H, s is the downsampling factor, D_s(·) denotes the discriminator corresponding to downsampling factor s, and D_s^(k) denotes the k-th layer feature of D_s.
The super-resolution reconstruction module introduces an adversarial loss; adversarial losses have been shown to be effective and critical for generating realistic textures in image restoration tasks. The specific formula is as follows:
An objective function is constructed based on the multi-scale discriminators and the hinge loss, with the following formula:
Based on the semantic perception style loss, the reconstruction loss and the adversarial loss, the loss function of the super-resolution reconstruction module is constructed as follows:
where λ_SS, λ_rec and λ_adv are the weights of the semantic perception style loss, the reconstruction loss and the adversarial loss, respectively.
Step S3: pre-training the super-resolution reconstruction module by adopting a face data set FFHQ to obtain a pre-trained super-resolution reconstruction module; the pre-training aims to initialize model parameters, accelerate the convergence rate of the model and improve the generalization capability of the model.
Step S4: constructing a sight estimating module, taking a high-resolution face image output by the super-resolution reconstructing module as input, adopting ResNet50 to extract characteristics in the face image, giving weight to each region in the face image based on a space weight mechanism, and inhibiting the weight of other regions by increasing the weight of the relevant region of the sight line in the face image to obtain a sight estimating result aiming at the face image;
the specific method of step S4 is as follows:
step S4.1: the method comprises the steps of adopting a pre-trained ResNet50 as a feature extractor to extract features from a high-resolution face image with a preset size output by a super-resolution reconstruction module and outputting a feature map;
compared with a shallow neural network, the deep residual neural network has the advantages of stronger expression capability, better generalization performance, higher accuracy, self-adaptive feature learning capability and the like, the ResNet50 utilizes residual connection, and cross-layer connection is added in a model, so that the problems of gradient disappearance, gradient explosion and the like in the neural network can be solved, compared with a traditional convolutional neural network, the ResNet50 has higher accuracy, and meanwhile, model training is easier to converge due to the introduction of residual connection, so that the ResNet50 becomes a model widely applied.
ResNet50 is adopted as the feature extractor to improve the performance and generalization capability of the sight estimation model. The input high-resolution face image has a size of 224×224, and after feature extraction by ResNet50 the output feature map has a size of 2048×14×14.
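A sketch of the feature-extraction step, assuming a torchvision ResNet50 truncated before its pooling and classification head. With the default strides this trunk yields a 2048×7×7 map for a 224×224 input; producing the 14×14 spatial size stated above would require a reduced-stride variant, which is not specified here, so the code is illustrative only.

```python
import torch
import torch.nn as nn
from torchvision import models

class FaceFeatureExtractor(nn.Module):
    """Pre-trained ResNet50 trunk used as the gaze feature extractor."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Drop the average pooling and fully connected head, keep the conv trunk
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        # face: (B, 3, 224, 224) high-resolution image from the SR module
        return self.trunk(face)            # (B, 2048, H', W') feature map
```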
Step S4.2: a space weight mechanism is adopted, and the weight of each position of a face region in the face image is learned through one branch, so that the weight of a view line related region in the face image is increased, and the weight of other regions is restrained;
the spatial weight mechanism comprises three convolution layers, the filter size of the three convolution layers is 1 multiplied by 1, the three convolution layers are modified linear unit layers, an activation tensor U with the size of NxHxW is input from the convolution layers for each convolution layer, N is the number of channels of the feature map, H and W are the height and the width of the feature map, the spatial weight mechanism generates an H multiplied by W spatial weight matrix W, and the spatial weight matrix W is multiplied by each channel of the activation tensor U element by element to obtain a weighted activation map on the channel, wherein the formula is as follows:
V C =W⊙U C
wherein W is a space weight matrix, U C The C-th channel, V, representing the activation tensor U C For the weighted activation graph of the C-th channel, the weighted activation graphs of the channels are stacked to form a weighted activation tensor V and fed into the next convolutional layer. Since the ResNet50 feature extractor was previously used, the first layer of convolution input channels was 2048, the output channels were 256, the convolution kernel size was 1, the second layer of convolution input channels was 256, the output channels were 256, the convolution kernel size was 1, the third layer of convolution input channels was 256, the output channels were 1, and the convolution kernel size was 1. The spatial weighting mechanism can retain information from different regions for all feature channelsThe same weights are applied so that the weights of the line of sight estimation directly correspond to the face regions in the input image.
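A sketch of the spatial weight mechanism with the channel sizes given above (2048→256→256→1, all 1×1 kernels); the placement of the ReLU activations and the use of the stated deviation values directly as standard deviations in the initialization are assumptions.

```python
import torch
import torch.nn as nn

class SpatialWeights(nn.Module):
    """Learns an HxW weight map W and re-weights every channel of the feature map."""

    def __init__(self, in_channels: int = 2048, hidden: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=1)
        self.conv3 = nn.Conv2d(hidden, 1, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.reset_parameters()

    def reset_parameters(self):
        # Initialization described in the text: first two layers from N(0, 0.1),
        # last layer from N(0, 0.001), constant bias term of 1
        for conv, std in zip([self.conv1, self.conv2, self.conv3], [0.1, 0.1, 0.001]):
            nn.init.normal_(conv.weight, mean=0.0, std=std)
            nn.init.constant_(conv.bias, 1.0)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: activation tensor of size (B, N, H, W) from the feature extractor
        w = self.relu(self.conv1(u))
        w = self.relu(self.conv2(w))
        w = self.relu(self.conv3(w))       # (B, 1, H, W) spatial weight matrix W
        return u * w                        # V_C = W ⊙ U_C for every channel C
```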
Step S4.3: the features are classified using the full connection layer, and coordinates (x, y) representing the line of sight are output for representing the line of sight estimation result.
Step S5: and (3) carrying out overall training on the super-resolution reconstruction module and the sight estimation module by adopting the face image training set constructed in the step (S1) so as to finish the face sight estimation of the target object.
In the training of the sight estimation module, the filter weights of the first two convolution layers of the spatial weight mechanism are randomly initialized from a Gaussian distribution with mean 0 and deviation 0.1, the filter weights of the last convolution layer are randomly initialized from a Gaussian distribution with mean 0 and variance 0.001, and all layers have a constant bias term of 1; the gradients of the activation tensor U and the spatial weight matrix W are expressed as:
where N is the number of channels of the feature map.
The error of the sight estimation module adopts the L1 Loss, also called the mean absolute error, which is the mean of the absolute error between the model's predicted value and the true value. The sight estimation module introduces a loss function given by the following formula:
where ξ_gt denotes the true value of the sight estimate and ξ_pred denotes the predicted value of the sight estimate.
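A minimal sketch of the L1 gaze loss, assuming prediction and ground truth are 2D gaze values (e.g. yaw and pitch angles).

```python
import torch

def gaze_l1_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean absolute error between predicted and ground-truth gaze values."""
    return torch.mean(torch.abs(pred - gt))
```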
The following is one embodiment of the method contemplated by the present invention:
In this embodiment, the super-resolution reconstruction module needs to be pre-trained first. The FFHQ face dataset is adopted as the training dataset and the model is pre-trained with the Adam optimizer, with β_1 = 0.5 and β_2 = 0.999; the learning rate of the generator is set to 0.0001 and that of the discriminator to 0.0004. The weights of the different losses are set to λ_SS = 100, λ_rec = 10, λ_adv = 1, and the training batch size is set to 4.
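A sketch of the pre-training optimizer setup under the hyper-parameters listed above, assuming the second learning rate belongs to the discriminator; the generator and discriminator modules are placeholders.

```python
import torch

def build_pretraining_optimizers(generator: torch.nn.Module,
                                 discriminator: torch.nn.Module):
    """Adam optimizers with beta_1 = 0.5, beta_2 = 0.999 and the stated learning rates."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.999))
    return opt_g, opt_d

loss_weights = {"lambda_ss": 100.0, "lambda_rec": 10.0, "lambda_adv": 1.0}
batch_size = 4
```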
The gaze estimation dataset used is the classical gaze estimation dataset MPIIFaceGaze, comprising a total of 45000 images of 15 subjects; the 3000 images of subject P00 are used as the test set and the remaining 42000 images as the training set.
Data preprocessing is performed on the gaze estimation dataset to eliminate environmental factors and simplify the gaze regression problem. The specific steps are as follows:
S1: preprocess the entire gaze estimation dataset. The preprocessing function first obtains the list of per-subject folders in the MPIIFaceGaze dataset and sorts them by file name, then traverses each subject folder, obtains that subject's annotation information and image information, and stores the processed images and information under a specified path.
S2: reading the camera matrix and annotation information of the person, traversing all images of the person, acquiring important annotation information such as a face center point, left and right eye corner points and the like, normalizing and cutting the images through the annotation information, acquiring the images of the face and the left and right eyes in the images, finally acquiring important information such as a 3D gaze point and a 3D head orientation, and storing the processed images and information in a specified path.
S3: for each image, firstly carrying out normalization processing on the image through annotation information to obtain the distance between the center point of the human face and the point of regard, then carrying out scaling on the image according to a certain proportion, ensuring that the distance between the point of regard and the center point of the human face is a fixed value, and the image size of the scaled data set is 224 multiplied by 224.
S4: and acquiring important information such as a 3D gaze point and a 3D head orientation according to the normalized annotation information, and storing the processed image and information in a specified path.
S5: to test the results of the method on low resolution images, the test set was downsampled using the resize () function in python, resized down to 112 x 112 resolution, and restored up to 224 x 224 resolution, turning into low resolution images.
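A sketch of the low-resolution degradation of the test images (down to 112×112 and back up to 224×224), here with Pillow; the interpolation mode is an assumption.

```python
from PIL import Image

def make_low_resolution(path_in: str, path_out: str) -> None:
    """Degrade a 224x224 face crop by downsampling to 112x112 and resizing back."""
    img = Image.open(path_in).convert("RGB")
    low = img.resize((112, 112), Image.BICUBIC)
    restored = low.resize((224, 224), Image.BICUBIC)
    restored.save(path_out)
```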
S6: the whole PGGA-Net network was trained using the preprocessed MPIIFaceGaze with the base size set to 128, epoch set to 20, and learning rate set to 0.00001.
S7: verification is performed on the test set using the trained model.
Evaluation index: the evaluation index of the current main stream of the sight line estimation is mostly an angle error, namely the deviation angle of the predicted value and the true value of the sight line estimation, and the smaller the index is, the better the effect is.
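A sketch of the angular-error metric, assuming gaze is expressed as (yaw, pitch) angles in radians that are converted to 3D unit vectors before measuring the angle between prediction and ground truth.

```python
import numpy as np

def angles_to_vector(yaw: float, pitch: float) -> np.ndarray:
    """Convert (yaw, pitch) in radians to a 3D gaze direction vector."""
    return np.array([np.cos(pitch) * np.sin(yaw),
                     np.sin(pitch),
                     np.cos(pitch) * np.cos(yaw)])

def angular_error_deg(pred, gt) -> float:
    """Angle in degrees between predicted and ground-truth gaze directions."""
    v1, v2 = angles_to_vector(*pred), angles_to_vector(*gt)
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0))))
```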
The comparison models adopt the advanced gaze estimation methods Dilated-Net, RT-Gene and Gaze360. Dilated-Net uses a batch size of 128, 20 epochs and a learning rate of 0.001; RT-Gene uses a batch size of 128, 20 epochs and a learning rate of 0.0001; Gaze360 uses a batch size of 128, 20 epochs and a learning rate of 0.0001. The experimental results are shown in Table 1:
Table 1. Experimental results of the proposed network and other advanced networks (mean angular error on the low-resolution MPIIFaceGaze test set)

Method | Mean angular error (degrees)
---|---
Dilated-Net | 4.86
Gaze360 | 5.02
RT-Gene | 6.43
PGGA-Net (proposed) | 3.96
The experimental data in Table 1 show that the method of the present invention is superior to the other methods and can effectively improve the accuracy of gaze estimation in a low-resolution environment.
The following is an applicable scenario of the embodiment of the present invention:
Gaze estimation has a wide range of application scenarios. One of them is the detection of cheating in examinations: gaze estimation is performed on the examinee through the computer's camera to monitor whether the examinee's gaze is directed at the computer, so as to judge whether the examinee is cheating. Because many school computer rooms use old computers or notebooks, the pictures captured by their cameras have low definition, and the accuracy of traditional gaze estimation in such scenarios is low; the method provided by the invention can solve this problem.
S1: the front camera of the old computer is used to capture pictures of the examinee's face at equal intervals of 5 s; the resolution of the pictures is low.
S2: the collected pictures are input into the PGGA-Net network provided by the invention.
S3: the PGGA-Net network provided by the invention can calculate and obtain the sight line estimation result of the examinee, then compares the result with the sight line threshold value, and considers that the examinee is highly likely to have cheating behaviors if the sight line angle of the examinee exceeds the threshold value continuously for a plurality of times.
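A sketch of the exam-monitoring loop described in steps S1-S3: capture a frame every 5 s, run the gaze estimator, and flag the examinee after several consecutive out-of-range gaze angles. The capture function, estimator interface and threshold values are placeholders, not part of the patent.

```python
import time

def monitor_examinee(capture_frame, estimate_gaze,
                     angle_threshold_deg: float = 30.0,
                     max_consecutive: int = 3, interval_s: float = 5.0) -> None:
    """Flag possible cheating when the gaze angle exceeds the threshold repeatedly."""
    consecutive = 0
    while True:
        frame = capture_frame()                     # low-resolution webcam image
        yaw_deg, pitch_deg = estimate_gaze(frame)   # PGGA-Net style estimator (placeholder)
        if max(abs(yaw_deg), abs(pitch_deg)) > angle_threshold_deg:
            consecutive += 1
            if consecutive >= max_consecutive:
                print("Possible cheating behaviour detected")
                consecutive = 0
        else:
            consecutive = 0
        time.sleep(interval_s)
```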
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.
Claims (10)
1. A sight line estimation method based on super-resolution reconstruction is characterized in that the following steps S1-S5 are executed to finish the face sight line estimation of a target object:
step S1: acquiring a preset number of face images by using a camera, and constructing a face image training set;
step S2: the super-resolution reconstruction module is constructed and comprises a preset number of residual blocks and style conversion blocks corresponding to the residual blocks, takes a low-resolution face image as input, and upsamples the features in the face image by adopting a step-by-step upsampling mode based on each residual block to generate a high-resolution face image with a preset size;
step S3: pre-training the super-resolution reconstruction module to obtain a pre-trained super-resolution reconstruction module;
step S4: constructing a sight estimating module, taking a high-resolution face image output by the super-resolution reconstructing module as input, adopting ResNet50 to extract characteristics in the face image, giving weight to each region in the face image based on a space weight mechanism, and inhibiting the weight of other regions by increasing the weight of the relevant region of the sight line in the face image to obtain a sight estimating result aiming at the face image;
step S5: and (3) carrying out overall training on the super-resolution reconstruction module and the sight estimation module by adopting the face image training set constructed in the step (S1) so as to finish the face sight estimation of the target object.
2. The sight line estimation method based on super-resolution reconstruction according to claim 1, wherein the super-resolution reconstruction module is pre-trained by using a face data set FFHQ in step S3.
3. The sight line estimation method based on super-resolution reconstruction according to claim 1, wherein the super-resolution reconstruction module in step S2 has 6 residual blocks sequentially connected in series and performs step-wise upsampling on the low-resolution face image to extract its features, wherein the input of the first residual block is a learnable constant F_0 of size C×16×16, where C is the channel size; the input of the i-th residual block is the feature F_{i-1} and its output is the feature F_i; the feature F_6 output by the last residual block is converted into an RGB image by a ToRGB convolution layer to give the output high-resolution face image Î^H; the specific formula is as follows:
wherein each stage is composed of a residual convolution block, an up-sampled residual convolution block and a style conversion block;
the style conversion block takes as input the low-resolution face image I_L and the corresponding parsing map I_P; the input pair of the i-th style conversion block is denoted (I_L^i, I_P^i), i.e. the low-resolution face image and its corresponding parsing map fed to the i-th style conversion block;
the style conversion block learns from the input pair, at the corresponding scale, the style conversion parameters y_i = (y_{s,i}, y_{b,i}) of the feature F_i, expressed by the following formula:
where γ denotes a lightweight network, μ and σ are the mean and standard deviation of the features, and y_{s,i} and y_{b,i} are the corresponding scale and bias style conversion parameters, respectively.
4. The sight line estimation method based on super-resolution reconstruction as claimed in claim 3, wherein the super-resolution reconstruction module introduces a semantic perception style loss, calculated by the following formula:
where φ_i denotes the features of the i-th layer of VGG19, M_j denotes the parsing mask with label j, Î^H denotes the high-resolution face image output by the super-resolution reconstruction module, I^H denotes its ground-truth value, and g denotes the Gram matrix of the feature φ_i computed over the parsing mask M_j, with the following formula:
where ⊙ denotes the element-wise product and ε = 1e-8 is used to avoid division by zero.
5. The method of claim 4, wherein the super-resolution reconstruction module introduces a reconstruction loss, which constrains the high-resolution face image Î^H output by the super-resolution reconstruction module to its ground-truth value I^H; the reconstruction loss is calculated as follows:
where the second term on the right-hand side of the equation is a multi-scale feature-matching loss used to match the features of Î^H and I^H, s is the downsampling factor, D_s(·) denotes the discriminator corresponding to downsampling factor s, and D_s^(k) denotes the k-th layer feature of D_s.
6. The method of claim 5, wherein the super-resolution reconstruction module introduces an adversarial loss, with the following formula:
an objective function is constructed based on the multi-scale discriminators and the hinge loss, with the following formula:
based on the semantic perception style loss, the reconstruction loss and the adversarial loss, the loss function of the super-resolution reconstruction module is constructed as follows:
where λ_SS, λ_rec and λ_adv are the weights of the semantic perception style loss, the reconstruction loss and the adversarial loss, respectively.
7. The sight line estimation method based on super-resolution reconstruction according to claim 1, wherein the specific method of step S4 is as follows:
step S4.1: the method comprises the steps of adopting a pre-trained ResNet50 as a feature extractor to extract features from a high-resolution face image with a preset size output by a super-resolution reconstruction module and outputting a feature map;
step S4.2: a spatial weight mechanism is adopted, and the weight of each position of the face region in the face image is learned through one branch, so that the weight of the sight-line-related region in the face image is increased and the weight of other regions is suppressed;
step S4.3: the features are classified using the full connection layer, and coordinates (x, y) representing the line of sight are output for representing the line of sight estimation result.
8. The line-of-sight estimation method according to claim 7, wherein the spatial weight mechanism of step S4.2 comprises three convolution layers with 1×1 filters and rectified linear unit (ReLU) activation; for each convolution layer, an activation tensor U of size N×H×W is input, where N is the number of channels of the feature map and H and W are its height and width; the spatial weight mechanism generates an H×W spatial weight matrix W, and the spatial weight matrix W is multiplied element by element with each channel of the activation tensor U to obtain the weighted activation map of that channel, as follows:
V_C = W ⊙ U_C
where W is the spatial weight matrix, U_C denotes the C-th channel of the activation tensor U, and V_C is the weighted activation map of the C-th channel; the weighted activation maps of all channels are stacked to form the weighted activation tensor V, which is fed into the next convolution layer.
9. The line-of-sight estimation method based on super-resolution reconstruction according to claim 8, wherein in the training of the line-of-sight estimation module, the filter weights of the first two convolution layers of the spatial weight mechanism are randomly initialized from a Gaussian distribution with mean 0 and deviation 0.1, the filter weights of the last convolution layer are randomly initialized from a Gaussian distribution with mean 0 and variance 0.001, and all layers have a constant bias term of 1; wherein the gradients of the activation tensor U and the spatial weight matrix W are expressed as:
where N is the number of channels of the feature map.
10. The line-of-sight estimation method based on super-resolution reconstruction of claim 9, wherein the line-of-sight estimation module introduces a loss function given by the following formula:
where ξ_gt denotes the true value of the gaze estimate and ξ_pred denotes the predicted value of the gaze estimate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310599847.5A CN116664677B (en) | 2023-05-24 | 2023-05-24 | Sight estimation method based on super-resolution reconstruction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310599847.5A CN116664677B (en) | 2023-05-24 | 2023-05-24 | Sight estimation method based on super-resolution reconstruction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116664677A true CN116664677A (en) | 2023-08-29 |
CN116664677B CN116664677B (en) | 2024-06-14 |
Family
ID=87719969
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310599847.5A Active CN116664677B (en) | 2023-05-24 | 2023-05-24 | Sight estimation method based on super-resolution reconstruction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116664677B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830783A (en) * | 2024-01-03 | 2024-04-05 | 南通大学 | Sight estimation method based on local super-resolution fusion attention mechanism |
CN118506430A (en) * | 2024-07-17 | 2024-08-16 | 江苏富翰医疗产业发展有限公司 | Sight line estimation method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105469399A (en) * | 2015-11-20 | 2016-04-06 | 中国地质大学(武汉) | Face super-resolution reconstruction method facing mixed noises and apparatus thereof |
KR20170087734A (en) * | 2016-01-21 | 2017-07-31 | 한국전자통신연구원 | Apparatus and method for high resolution image generation using gradient information |
CN111754403A (en) * | 2020-06-15 | 2020-10-09 | 南京邮电大学 | Image super-resolution reconstruction method based on residual learning |
CN113298717A (en) * | 2021-06-08 | 2021-08-24 | 浙江工业大学 | Medical image super-resolution reconstruction method based on multi-attention residual error feature fusion |
CN113362223A (en) * | 2021-05-25 | 2021-09-07 | 重庆邮电大学 | Image super-resolution reconstruction method based on attention mechanism and two-channel network |
CN116091315A (en) * | 2023-01-05 | 2023-05-09 | 南昌大学 | Face super-resolution reconstruction method based on progressive training and face semantic segmentation |
-
2023
- 2023-05-24 CN CN202310599847.5A patent/CN116664677B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105469399A (en) * | 2015-11-20 | 2016-04-06 | 中国地质大学(武汉) | Face super-resolution reconstruction method facing mixed noises and apparatus thereof |
KR20170087734A (en) * | 2016-01-21 | 2017-07-31 | 한국전자통신연구원 | Apparatus and method for high resolution image generation using gradient information |
CN111754403A (en) * | 2020-06-15 | 2020-10-09 | 南京邮电大学 | Image super-resolution reconstruction method based on residual learning |
CN113362223A (en) * | 2021-05-25 | 2021-09-07 | 重庆邮电大学 | Image super-resolution reconstruction method based on attention mechanism and two-channel network |
CN113298717A (en) * | 2021-06-08 | 2021-08-24 | 浙江工业大学 | Medical image super-resolution reconstruction method based on multi-attention residual error feature fusion |
CN116091315A (en) * | 2023-01-05 | 2023-05-09 | 南昌大学 | Face super-resolution reconstruction method based on progressive training and face semantic segmentation |
Non-Patent Citations (2)
Title |
---|
XINTAO WANG 等: "RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization", PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 10 October 2022 (2022-10-10), pages 2556, XP059131453, DOI: 10.1145/3503161.3547915 * |
XU Shi: "Research on multi-scale convolutional neural network models for single-image super-resolution", China Master's Theses Full-text Database, Information Science and Technology, no. 1, 15 January 2023 (2023-01-15), pages 138 - 1546 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830783A (en) * | 2024-01-03 | 2024-04-05 | 南通大学 | Sight estimation method based on local super-resolution fusion attention mechanism |
CN118506430A (en) * | 2024-07-17 | 2024-08-16 | 江苏富翰医疗产业发展有限公司 | Sight line estimation method and system |
Also Published As
Publication number | Publication date |
---|---|
CN116664677B (en) | 2024-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022111236A1 (en) | Facial expression recognition method and system combined with attention mechanism | |
US11644898B2 (en) | Eye tracking method and system | |
CN112766160A (en) | Face replacement method based on multi-stage attribute encoder and attention mechanism | |
CN112530019B (en) | Three-dimensional human body reconstruction method and device, computer equipment and storage medium | |
JP6207210B2 (en) | Information processing apparatus and method | |
CN111723707B (en) | Gaze point estimation method and device based on visual saliency | |
JP2008152789A (en) | Method and device for calculating similarity of face video, method and device for retrieving face video using this, and face composing method | |
CN111046734B (en) | Multi-modal fusion sight line estimation method based on expansion convolution | |
CN109583338A (en) | Driver Vision decentralized detection method based on depth integration neural network | |
CN112232134B (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
WO2021218238A1 (en) | Image processing method and image processing apparatus | |
CN111783748A (en) | Face recognition method and device, electronic equipment and storage medium | |
JP6822482B2 (en) | Line-of-sight estimation device, line-of-sight estimation method, and program recording medium | |
CN113610046B (en) | Behavior recognition method based on depth video linkage characteristics | |
CN114120432A (en) | Online learning attention tracking method based on sight estimation and application thereof | |
CN114943924B (en) | Pain assessment method, system, equipment and medium based on facial expression video | |
CN116664677B (en) | Sight estimation method based on super-resolution reconstruction | |
CN113850231A (en) | Infrared image conversion training method, device, equipment and storage medium | |
Guo et al. | Remote sensing image super-resolution using cascade generative adversarial nets | |
CN113642393A (en) | Attention mechanism-based multi-feature fusion sight line estimation method | |
CN114170537A (en) | Multi-mode three-dimensional visual attention prediction method and application thereof | |
Dutta | Facial Pain Expression Recognition in Real‐Time Videos | |
CN114220138A (en) | Face alignment method, training method, device and storage medium | |
CN114492634A (en) | Fine-grained equipment image classification and identification method and system | |
CN117037244A (en) | Face security detection method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |