CN117275069B - End-to-end head pose estimation method based on learnable vector and attention mechanism

Info

Publication number: CN117275069B
Application number: CN202311251910.2A
Other versions: CN117275069A
Authority: CN (China)
Language: Chinese (zh)
Prior art keywords: module, head, model, local, pose
Inventors: 徐晶, 汪季轩, 王子行, 刘威
Current and original assignee: Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Legal status: Active (application granted)

Classifications

    • G06V40/164: Human faces; detection, localisation, normalisation using holistic features
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/454: Biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06V40/165: Face detection, localisation, normalisation using facial parts and geometric relationships
    • G06V40/168: Feature extraction; face representation
    • Y02T10/40: Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end head pose estimation method based on a learnable vector and an attention mechanism, belonging to the field of computer vision. It introduces an end-to-end integrated design: a set of learnable vectors preserves prior face information in the head pose estimation model and removes a large number of manually set face-related parameters. An attention mechanism and dynamic convolution modules strengthen the head pose features in the image, a cascade pose estimation module is built, and a pose conversion module is designed to convert between local and global poses, improving the estimation accuracy and robustness of the model. In the training design, a bipartite-graph optimal matching method matches each estimate output by the model one-to-one with a ground truth to compute the loss function, so redundant outputs can be suppressed by confidence alone; the usual filtering module can therefore be discarded, increasing the real-time processing speed of the model.

Description

End-to-end head pose estimation method based on learnable vector and attention mechanism
Technical Field
The invention belongs to the field of computer vision, and in particular relates to an end-to-end head pose estimation method based on a learnable vector and an attention mechanism.
Background
Image understanding is an important branch of computer vision within artificial intelligence, aimed at enabling computers to understand and interpret the information in images. Head pose estimation is a classical problem in image understanding: its purpose is to recover the pose of the head from the facial information of a person in an image, providing a strong cue for analyzing that person's intent. With the spread of intelligent devices, head pose estimation is applied in tasks such as driver monitoring and gaze estimation, and its wide application prospects give it substantial research value.
Early head pose research mainly relied on digital image processing methods: features were extracted and detected from predefined image features of the head region, where the predefinition depended on expert knowledge and adapted poorly to changes in illumination, pose and so on. With the development of deep learning, data-driven neural networks became widely used in image analysis tasks, and researchers began to replace hand-crafted features with deep learning models for feature extraction and head pose estimation. Head pose estimation in this style requires two steps: first a face detection model locates the face image, then a head pose estimation model estimates the pose. On the one hand, since the two models are trained independently, their errors accumulate, which limits the achievable estimation accuracy. On the other hand, the image processing times of the two models also add up, raising the computational cost of head pose estimation and hurting its practical applicability.
For this reason, end-to-end designs have begun to be applied to head pose estimation. An end-to-end model couples face detection and head pose estimation, so explicit face detection can be skipped and the head pose estimated directly from the raw image. This reduces accumulated error, simplifies deployment, increases image processing speed, and eases practical application in compute-constrained environments.
However, current mainstream end-to-end pose estimation still has performance defects. On the one hand, too many manually set parameters make optimal model performance hard to guarantee; for example, a region proposal network requires many candidate-box parameters, including the number and sizes of the initial anchor boxes, the IoU thresholds for positive and negative samples, and the positive/negative sample ratio during training. On the other hand, when screening the final head pose boxes, the usual non-maximum suppression module cannot fully eliminate redundant candidate boxes, so the pipeline still outputs redundant high-confidence head pose estimates; extra filtering is then needed, which increases parameter complexity and degrades final performance.
Disclosure of Invention
Aiming at the defects or improvement demands of the prior art, the invention provides an end-to-end head pose estimation method based on a learnable vector and an attention mechanism, which aims to improve the current mainstream end-to-end head pose estimation methods in both model structure and model design, thereby improving head pose estimation accuracy and model performance.
To achieve the above object, according to a first aspect of the present invention, there is provided an end-to-end pose estimation method based on a learnable vector and an attention mechanism, comprising:

Training phase: constructing a head pose estimation model and training the head pose estimation model;

wherein the head pose estimation model comprises:

a feature extraction module, used to extract multi-scale features of the input image to obtain a multi-scale feature map X_FPN;

a cascade pose estimation module, used to process X_FPN to obtain local head poses and face bounding boxes; the cascade pose estimation module comprises several cascaded sub-modules, each comprising an attention module, a region feature aggregation module, a head-pose dynamic convolution module and a candidate-box dynamic convolution module; for the t-th sub-module, the inputs are the local head pose set P_{t-1}^local, the proposal feature set Q_{t-1} and the face bounding candidate box set B_{t-1} output by the (t-1)-th sub-module; the attention module performs attention computation on Q_{t-1} to obtain the attended feature set Q*_{t-1}; the region feature aggregation module extracts the region-of-interest feature set F_{t-1}^RoI from X_FPN according to B_{t-1}; the head-pose dynamic convolution module convolves Q*_{t-1} with F_{t-1}^RoI to obtain the enhanced local head pose feature set P_t^feat, which is used to correct P_{t-1}^local into P_t^local; the candidate-box dynamic convolution module convolves Q*_{t-1} with F_{t-1}^RoI to obtain the enhanced proposal set Q_t; Q_t is then respectively feature-mapped to obtain B_t and dimension-reduced and weighted to obtain the confidence set Confidence_t; wherein P_0^local, Q_0 and B_0 are all learnable vector sets;

a pose conversion module, used to convert the local head pose into a global head pose according to the face bounding box;

Application stage: inputting the image to be estimated into the trained head pose estimation model to obtain the global head pose.
According to a second aspect of the present invention, there is provided an end-to-end pose estimation system based on a learnable vector and an attention mechanism, comprising: a computer readable storage medium and a processor;
The computer-readable storage medium is for storing executable instructions;
the processor is configured to read executable instructions stored in the computer readable storage medium and perform the method according to the first aspect.
According to a third aspect of the present invention there is provided a computer readable storage medium storing computer instructions for causing a processor to perform the method of the first aspect.
In general, compared with the prior art, the technical solutions conceived above achieve the following beneficial effects:
1. Existing mainstream head pose estimation algorithms usually need two steps (face detection/localization and head pose recognition) and reach fairly high accuracy, but suffer from high computational complexity, accumulated error, poor real-time behaviour, and low recognition rates in complex real scenes. To address this, the invention introduces an end-to-end integrated design: a set of learnable vectors preserves prior face information in the head pose estimation model and removes a large number of face-related parameter settings; an attention mechanism and dynamic convolution modules enhance the head pose features in the image; a cascade pose estimation module is constructed; and a pose conversion module converts between local and global poses. This improves the estimation accuracy and robustness of the model.
2. In the training design, the method matches the estimates output by the network one-to-one with the ground truths via a bipartite-graph optimal matching algorithm to compute the loss function, so redundant output can be suppressed by confidence alone; the usual filtering module can therefore be discarded, increasing the real-time processing speed of the model.
3. Considering that different datasets contain head pose information for different faces, the method provides a model-parameter fusion scheme for multi-dataset training, improving the generalization ability of the model.
Drawings
Fig. 1 is a flowchart of the end-to-end pose estimation method based on a learnable vector and an attention mechanism.

Fig. 2 shows the structure of the head pose estimation model provided by the invention.

Fig. 3 is a flowchart of the WIDER FACE dataset preprocessing provided by the invention.

Fig. 4 is a schematic diagram of the backbone network and feature pyramid network structure provided by the invention.

Fig. 5 is a schematic diagram of the cascade pose estimation module provided by the invention.

Fig. 6 shows the structure of the t-th single-stage pose estimation sub-module provided by the invention.

Fig. 7 shows the structure of the head-pose dynamic convolution module provided by the invention.

Fig. 8 is a schematic diagram of the partial image and complete image used by the pose conversion module provided by the invention.

Fig. 9 illustrates the bipartite graph matching and loss function calculation of the invention.

Fig. 10 is a flowchart of model parameter fusion based on multi-dataset training provided by the invention.

Fig. 11 compares head pose estimation errors on the AFLW2000-3D and BIWI datasets for prior models and for the invention's MHPE.
Detailed Description
The present invention will be described in further detail with reference to the drawings and embodiments, so as to make its objects, technical solutions and advantages clearer. It should be understood that the specific embodiments described here are for illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict.
The embodiment of the invention provides an end-to-end head pose estimation method based on a learnable vector and an attention mechanism, comprising:

Training phase: constructing a head pose estimation model and training the head pose estimation model;

wherein the head pose estimation model comprises:

a feature extraction module, used to extract multi-scale features of the input image to obtain a multi-scale feature map X_FPN;

a cascade pose estimation module, used to process X_FPN to obtain local head poses and face bounding boxes. The cascade pose estimation module comprises several cascaded sub-modules, each comprising an attention module, a region feature aggregation module, a head-pose dynamic convolution module and a candidate-box dynamic convolution module. For the t-th sub-module, the inputs are the local head pose set P_{t-1}^local, the proposal feature set Q_{t-1} and the face bounding candidate box set (hereinafter, candidate box set) B_{t-1} output by the (t-1)-th sub-module, and the outputs are P_t^local, Q_t and B_t.

The attention module performs attention computation on Q_{t-1} to obtain the attended feature set Q*_{t-1}. The region feature aggregation module extracts the region-of-interest feature set F_{t-1}^RoI from X_FPN according to B_{t-1}. The head-pose dynamic convolution module convolves Q*_{t-1} with F_{t-1}^RoI to obtain the enhanced local head pose feature set P_t^feat, which is used to correct P_{t-1}^local into P_t^local. The candidate-box dynamic convolution module convolves Q*_{t-1} with F_{t-1}^RoI to obtain the enhanced proposal set Q_t; feature mapping of Q_t gives B_t, and dimension reduction plus weighting of Q_t gives the confidence set Confidence_t.

Here P_0^local, Q_0 and B_0 are all learnable vector sets. In the training stage their values are randomly initialized; in the application stage their values are those learned through the iterative training of the cascade pose estimation module.

The pose conversion module is used to convert the local head pose into a global head pose according to the face bounding box.

Application stage: the image to be estimated is input into the trained head pose estimation model and passes in turn through the feature extraction module, the cascade pose estimation module and the pose conversion module, yielding the global head pose, including Euler angles and face key points.
Further, in the training phase, the loss function is

L_total = Σ_{t=1}^{N} ( λ_cls · L_cls(c_t, c*) + λ_L1 · L_L1(b_t, b*) + λ_giou · L_giou(b_t, b*) + λ_dof · L_dof(p_t^local, p^{local*}) )

where N is the number of sub-modules in the cascade pose estimation module; c*, b* and p^{local*} are respectively the ground-truth confidence, face bounding box and local head pose (it will be appreciated that, since b* and p^{local*} are ground truths, the c* corresponding to them is 1); c_t, b_t and p_t^local are respectively the confidence, face bounding box and local head pose estimates output by the t-th sub-module, obtained by screening and matching within the sets Confidence_t, B_t and P_t^local output by that sub-module according to the optimal mapping relation. The optimal mapping relation is obtained by bipartite-graph matching between the ground truths and the local head pose set and candidate box set estimated by the last-stage sub-module: c_t is paired with c*, b_t with b*, and p_t^local with p^{local*} (for example, b_t is found in B_t as the candidate box assigned to b*, completing the match). L_cls, L_L1, L_giou and L_dof are respectively the focal loss, the L1 bounding-box position loss, the GIoU loss and the local head pose estimation loss; λ_cls, λ_L1, λ_giou and λ_dof are all weight coefficients.
Further, in the application stage, a minimum confidence threshold is set for the last-stage sub-module of the cascade pose estimation module to screen and filter the output local head pose set and candidate box set.

The minimum confidence threshold can be set according to actual requirements. For example, if it is set to 0.8, then in the application stage the last-stage sub-module outputs only the local head poses and candidate boxes whose confidence exceeds 0.8.
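As a concrete illustration of this screening step, the minimal PyTorch sketch below keeps only the pose/box pairs whose confidence clears the threshold; the tensor shapes and the 0.8 value follow the example above and are otherwise assumptions:

```python
import torch

def filter_by_confidence(poses, boxes, scores, threshold=0.8):
    """Screen the last sub-module's outputs by confidence.

    poses:  (300, 6) local 6DoF head pose estimates
    boxes:  (300, 4) face candidate boxes
    scores: (300,)   confidence values from Confidence_t
    """
    keep = scores > threshold
    return poses[keep], boxes[keep], scores[keep]
```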
Further, the head-pose dynamic convolution module comprises a first fully connected layer and a second fully connected layer. The first fully connected layer outputs two sets of convolution kernel parameters, Kernel1_t and Kernel2_t; F_{t-1}^RoI is convolved with Kernel1_t, the result is convolved again with Kernel2_t, and the second fully connected layer then produces P_t^feat.

The candidate-box dynamic convolution module comprises a third fully connected layer and a fourth fully connected layer. The third fully connected layer outputs two sets of convolution kernel parameters, Kernel1'_t and Kernel2'_t; F_{t-1}^RoI is convolved with Kernel1'_t, the result is convolved again with Kernel2'_t, and the fourth fully connected layer then produces Q_t.
Further, correcting P_{t-1}^local to obtain P_t^local comprises:

passing P_t^feat through fully connected layers to compute scaling and offset terms, and performing the weighted computation

P_t^local = P_{t-1}^local ⊙ W_t + S_t

where W_t and S_t are respectively the scaling weights and offset weights regressed from P_t^feat, and ⊙ denotes element-wise multiplication.

Obtaining B_t from Q_t by feature mapping comprises: mapping Q_t through a fully connected layer to obtain the feature-enhanced (i.e., semantically enhanced) face bounding box set B_t.

Obtaining Confidence_t from Q_t by weighting comprises: reducing the dimension of Q_t and weighting it through a fully connected layer to obtain the local head pose confidence set Confidence_t.
Further, the feature extraction module comprises a ResNet-18 network and a feature pyramid network which are connected with each other;
Or, the feature extraction module comprises a ResNet-50 network and a feature pyramid network which are connected with each other.
Further, the head pose is a 6DoF or 3DoF head pose.
Further, the head pose estimation model is trained on several datasets, and the weighted combination of the model parameters obtained from each dataset is used as the parameters of the trained model.
Specifically, taking the 6DoF head pose as an example, as shown in fig. 1 the method provided by the invention mainly includes four parts: the first part preprocesses the datasets; the second part is the detailed design of the end-to-end network model; the third part covers the training scheme, the loss function and the training techniques used during network training; the fourth part trains and tests the network model, outputs the final head pose estimation results, and compares accuracy and network performance.
The first part comprises one step:

S1, processing the common head pose estimation datasets WIDER FACE, 300W-LP, BIWI and AFLW2000-3D to construct local 6DoF head poses for training and testing.
The training sets are the WIDER FACE dataset and the synthetically generated head pose dataset 300W-LP. The WIDER FACE dataset contains rich face information, and training on it improves the generalization ability of the model. However, WIDER FACE carries no manually annotated head pose ground truth, so in S1 a weakly supervised scheme is adopted: a RetinaFace model detects, for every picture in the dataset, both the face key points and bounding boxes and the head pose, and on this basis the 6DoF head pose is annotated; the training annotation format is a local 6DoF head pose vector together with a detection box. Because the pose ground truth obtained by this preprocessing of WIDER FACE is coarse, a second training stage on 300W-LP, which contains pose ground-truth information, is used to further improve model performance. Since the 300W-LP dataset provides Euler-angle ground truths and face key points, the Euler angles are first converted into a rotation matrix, the rotation matrix is converted into a global rotation vector r_global via the Rodrigues formula, the face key-point ground truths are then fed to the SolvePnP algorithm to obtain the global translation vector t_global, and finally a global-to-local transformation yields the local 6DoF head pose vector p^{local*}, producing the local 6DoF head pose, as shown in fig. 2.
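The 300W-LP conversion described above can be sketched with OpenCV as follows. This is a minimal illustration rather than the patent's exact code: the Euler-angle rotation order and the canonical 3D face model passed to solvePnP are assumptions.

```python
import cv2
import numpy as np

def euler_to_rvec(pitch, yaw, roll):
    """Euler angles (radians) -> rotation matrix -> global rotation vector r_global.
    The X-Y-Z composition order used here is an illustrative assumption."""
    rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    ry = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    rz = np.array([[np.cos(roll), -np.sin(roll), 0],
                   [np.sin(roll),  np.cos(roll), 0],
                   [0, 0, 1]])
    rvec, _ = cv2.Rodrigues(rz @ ry @ rx)   # Rodrigues formula: matrix -> 3x1 vector
    return rvec.ravel()

def solve_t_global(landmarks_2d, model_3d, camera_matrix):
    """SolvePnP on the 68 face key points to recover the global translation t_global."""
    ok, rvec, tvec = cv2.solvePnP(model_3d.astype(np.float64),
                                  landmarks_2d.astype(np.float64),
                                  camera_matrix, None)
    return tvec.ravel()
```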
The second part comprises three steps:
S2, building a feature extraction network comprising a backbone network and a feature pyramid network, realizing multi-scale feature extraction through top-down and bottom-up information flow. The backbone network extracts useful feature information from the image; the feature pyramid network, built on top of the backbone, processes the picture data of S1 and generates multi-scale image feature maps.

Preferably, a residual-block based ResNet-18 or ResNet-50 network extracts the bottom-up feature maps of the image data, and the feature pyramid network then propagates semantic information of different levels top-down into the lower-level feature maps, so that image features of different scales can be processed.

As shown in fig. 4, the backbone network on the left is a classical ResNet-series network: its bottom-up path extracts coarse but high-resolution features from the lower layers, gradually obtaining deep semantic information. The feature pyramid network on the right of fig. 4 extracts semantically rich but lower-resolution features from the top layers through its top-down path, and the layers are connected by upsampling and feature fusion. The feature pyramid thus contains, on one hand, the global feature information of specific scales captured by the backbone and, on the other, the local semantic information of each detected object, capturing both the basic features and the deep semantics of detection targets at different scales. In this way the feature pyramid network can handle image features of different scales.
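A minimal sketch of such a backbone-plus-FPN feature extractor, using torchvision's ready-made ResNet-FPN helper (the argument name for pretrained weights varies across torchvision versions; everything below is illustrative, not the patent's code):

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-50 bottom-up backbone with a top-down feature pyramid on top;
# every pyramid level is projected to 256 channels.
backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)

image = torch.randn(1, 3, 800, 800)
x_fpn = backbone(image)               # OrderedDict of multi-scale feature maps
for level, feat in x_fpn.items():
    print(level, tuple(feat.shape))   # strides 4, 8, 16, 32 plus a pooled level
```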
S3, constructing the cascade pose estimation module, which comprises a cascade of several pose estimation sub-modules and three sets of learnable vectors. The module's input is the multi-scale image feature map X_FPN output by the feature pyramid network of S2; its outputs are the local 6DoF head pose set P_6^local detected from the multi-scale features, the face bounding box set B_6 and the confidence set Confidence_6 (with six cascaded sub-modules). For each detected object there is a correspondence among its local head pose, face bounding box and confidence, i.e., among the elements of these three output sets.

Specifically, the cascade pose estimation module is constructed so that the region proposal network can be discarded and manually set parameters reduced, and learnable vector sets are introduced into the network: P_0^local, Q_0 and B_0 are respectively a learnable set of potential local 6DoF head poses, a learnable proposal feature set and a learnable candidate box set. During the training process of S7 they learn face image information through backpropagation and, stage by stage through the cascade, progressively enhance the face-region features in Q_0.
As shown in fig. 5, the cascade pose estimation module includes a plurality of cascaded head pose estimation sub-modules.

Three sets of learnable vectors P_0^local, Q_0 and B_0 are first constructed to replace tedious manual parameter setting. Specifically, the three sets are randomly initialized in the training phase and updated automatically by backpropagation during training; after training their values stay fixed and serve as part of the input of the subsequent pose estimation sub-modules for testing or further fine-tuning.
The structure of the head pose estimation sub-module constructed by the invention is shown in fig. 6. The inputs of the t-th pose estimation sub-module are the outputs of the previous cascaded sub-module (the potential local 6DoF head pose set P_{t-1}^local, the proposal feature set Q_{t-1} and the candidate box set B_{t-1}) together with the multi-scale image feature map X_FPN output by the feature extraction module. Its outputs are the corrected P_t^local, the refined B_t, the semantically enhanced Q_t and the confidence set Confidence_t, where Confidence_t holds the confidence of each candidate box in B_t and of each local 6DoF head pose in P_t^local. In the training stage, the confidence sets Confidence_t output by every cascaded sub-module take part in computing the training loss; in the application stage, the confidence sets of the intermediate sub-modules are discarded and only the Confidence_t output by the last sub-module is used to screen out a series of high-confidence local 6DoF head poses and candidate boxes. Every pose estimation sub-module has the same structure, consisting of an attention module, a region feature aggregation module, a head-pose dynamic convolution module and a candidate-box dynamic convolution module.
In the attention module, self-attention is computed, preferably with 8 attention heads. Each feature vector in the proposal feature set Q_{t-1} is split into 8 groups (h = 8) that are sent in parallel through fully connected layers for the attention computation; the results are concatenated and finally passed through one more fully connected layer.
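A minimal PyTorch sketch of this self-attention step; the 300 proposals and the 256-dimensional feature width are illustrative assumptions:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8                      # h = 8 attention heads
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
final_fc = nn.Linear(embed_dim, embed_dim)         # trailing fully connected layer

q_prev = torch.randn(1, 300, embed_dim)            # proposal feature set Q_{t-1}
attn_out, _ = attn(q_prev, q_prev, q_prev)         # per-head attention, concatenated
q_star = final_fc(attn_out)                        # attended features for stage t
```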
The input of the region feature aggregation module consists of the candidate box set B_{t-1} and the multi-scale image feature map X_FPN. The method adopts RoIAlign, with the region-of-interest feature-map size chosen to match the multi-scale image feature map X_FPN: the positions of the regions of interest are determined from B_{t-1}, the regions are partitioned, and sampling, max pooling and interpolation are applied to each region; the module then outputs the fixed-size region-of-interest feature set F_{t-1}^RoI, which captures the image-region features corresponding to each region of interest.
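The aggregation step can be sketched with torchvision's roi_align; the 7x7 output size and the 1/8 stride of the chosen pyramid level are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_align

x_fpn_level = torch.randn(1, 256, 100, 100)    # one level of X_FPN, (N, C, H, W)
# candidate boxes from B_{t-1} as (batch_index, x1, y1, x2, y2) in image coordinates
boxes = torch.tensor([[0.0, 40.0, 40.0, 120.0, 140.0]])

roi_feats = roi_align(x_fpn_level, boxes, output_size=(7, 7),
                      spatial_scale=1.0 / 8, sampling_ratio=2)
print(roi_feats.shape)                          # torch.Size([1, 256, 7, 7])
```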
The head-pose dynamic convolution module further strengthens the head pose features of face regions in the image and suppresses those of non-face regions. That is, it gives every pose feature vector a specially tailored convolution kernel to interact with. In conventional networks the convolution-kernel weights are fixed after training and cannot adapt feature extraction to the input, whereas this module dynamically generates customized kernel weights for each face image feature in Q, adaptively enhancing the semantic information of every face image feature and improving 6DoF head pose detection performance. Fig. 7 shows the structure of the head-pose dynamic convolution module. Its inputs are the weighted sum of the proposal features, Q*_{t-1}, and the region-of-interest feature set F_{t-1}^RoI. A linear layer produces two sets of convolution kernel parameters, Kernel1_t and Kernel2_t; F_{t-1}^RoI is convolved with Kernel1_t, the result is convolved with Kernel2_t, and one more linear layer yields the enhanced local head pose feature set P_t^feat. The candidate-box dynamic convolution module has the same network structure, and its output is the proposal feature set Q_t.
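A sketch of such per-proposal dynamic convolution, in the spirit of Sparse R-CNN's dynamic instance interaction: each proposal feature generates its own pair of 1x1 kernels that are applied in turn to its RoI features. The 64-dimensional hidden width and the 7x7 RoI size are assumed values, not taken from the patent.

```python
import torch
import torch.nn as nn

class DynamicConv(nn.Module):
    def __init__(self, d_model=256, d_hidden=64, roi_size=7):
        super().__init__()
        self.d_model, self.d_hidden = d_model, d_hidden
        # linear layer emitting the parameters of Kernel1_t and Kernel2_t
        self.kernel_gen = nn.Linear(d_model, 2 * d_model * d_hidden)
        # trailing linear layer mapping the interacted features back to d_model
        self.out = nn.Linear(roi_size * roi_size * d_model, d_model)

    def forward(self, q_star, roi_feats):
        # q_star:    (n, d_model)        weighted proposal features
        # roi_feats: (n, d_model, 7, 7)  region-of-interest features
        n = q_star.size(0)
        params = self.kernel_gen(q_star)
        k1 = params[:, :self.d_model * self.d_hidden].view(n, self.d_model, self.d_hidden)
        k2 = params[:, self.d_model * self.d_hidden:].view(n, self.d_hidden, self.d_model)
        x = roi_feats.flatten(2).permute(0, 2, 1)   # (n, 49, d_model)
        x = torch.relu(x.bmm(k1))                   # interact with Kernel1_t
        x = torch.relu(x.bmm(k2))                   # interact with Kernel2_t
        return self.out(x.flatten(1))               # enhanced features, (n, d_model)
```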
At the end of S3, the proposal feature set Q_t and the head pose feature set P_t^feat output by the modules above are processed as follows: fully connected layers compute scaling and offset terms from P_t^feat, and the last cascaded head pose estimation sub-module outputs the final local pose estimate. Specifically, to make the model more flexible, the local head pose result of every stage is corrected by weight scaling and offsetting according to that stage's local features: after the fully connected layer, P_t^feat regresses the scaling weights W_t and offset weights S_t of the local 6DoF pose, and the weighted computation gives the local 6DoF pose vector set output by the current sub-module. Concretely, the 6DoF head pose set P_{t-1}^local of the previous sub-module is multiplied element-wise by the scaling weights W_t and added element-wise to the offset weights S_t, giving the sub-module's output P_t^local = P_{t-1}^local ⊙ W_t + S_t. The proposal-enhanced Q_t is feature-mapped through a fully connected layer to obtain B_t, and dimension-reduced and weighted through a fully connected layer to obtain Confidence_t, used for loss evaluation and for screening and matching.
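The per-stage correction and output heads reduce to a few lines; the 256-dimensional feature width and the head layouts below are assumptions for illustration:

```python
import torch
import torch.nn as nn

d_model = 256
scale_head  = nn.Linear(d_model, 6)   # regresses the scaling weights W_t
offset_head = nn.Linear(d_model, 6)   # regresses the offset weights S_t
box_head    = nn.Linear(d_model, 4)   # feature mapping of Q_t to the boxes B_t
conf_head   = nn.Linear(d_model, 1)   # dimension reduction of Q_t to a confidence

def stage_outputs(p_prev, pose_feats, q_t):
    """P_t = P_{t-1} (element-wise) * W_t + S_t, plus B_t and Confidence_t."""
    p_t = p_prev * scale_head(pose_feats) + offset_head(pose_feats)
    b_t = box_head(q_t)
    conf_t = conf_head(q_t).sigmoid().squeeze(-1)
    return p_t, b_t, conf_t
```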
The invention thus uses the attention mechanism and dynamic convolution modules to enhance image semantic features and suppress the head pose features of non-face regions.
S4, building the pose conversion module, which converts the local 6DoF head pose P_6^local output by the cascade pose estimation module into the global 6DoF head pose P_global.

Specifically, the output P_6^local of the cascade pose estimation module is a local pose estimate relative to the face region and lacks global information relative to the whole picture, while the face-box coordinates provide the position of the face within the whole picture. The module's inputs are therefore the local 6DoF head pose P_6^local and the face bounding box B_6, and its output is the global 6DoF head pose P_global.
In this step the pose conversion module is built from the camera projection matrices. The invention gives the following conversion between global and local head poses and converts the pose accordingly:

R_global = (K_global)^{-1} K_local R_local

t_global = (K_global)^{-1} K_local t_local

where K_local and K_global are the camera matrices relative to the partial (face) image and the complete image respectively, (x1, y1) denotes the centre coordinates of the face box, and w and h are the width and height of the complete image, as shown in fig. 8. R_local and t_local denote the local rotation matrix and local translation vector relative to the partial image.
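A numeric sketch of these conversion formulas. How the camera matrices are assembled from the face-box centre (x1, y1) and the image size (w, h) is shown under an assumed pinhole model with a shared focal length, which is an illustrative choice rather than the patent's exact construction:

```python
import numpy as np

f, (x1, y1), (w, h) = 500.0, (320.0, 240.0), (640, 480)
K_local  = np.array([[f, 0, x1],    [0, f, y1],    [0, 0, 1.0]])   # assumed form
K_global = np.array([[f, 0, w / 2], [0, f, h / 2], [0, 0, 1.0]])   # assumed form

def local_to_global(R_local, t_local):
    """R_global = K_global^{-1} K_local R_local; likewise for the translation."""
    A = np.linalg.inv(K_global) @ K_local
    return A @ R_local, A @ t_local
```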
The third part comprises two steps:
S5, designing the global loss function based on the Hungarian matching algorithm for bipartite-graph optimal matching. To eliminate redundant head pose outputs and discard the traditional non-maximum suppression module, the loss computation assigns each estimate to a ground truth with the Hungarian algorithm, so the model acquires the ability to filter redundant head pose outputs on its own. In addition, deep supervision is used during training: a local loss value is computed for the output of every cascaded pose estimation stage, accelerating model convergence and improving training efficiency.
Specifically, the loss design has two parts: bipartite-graph matching and the computation of the corresponding loss terms. The cascade pose estimation module finally outputs a set of 300 estimate groups, each containing a candidate box, the corresponding local 6DoF head pose estimate vector, and the confidence of that estimate group. During training, the Hungarian matching algorithm forcibly assigns one estimate group to each group of ground truths, so that the ground-truth set and the estimate set form an optimal bipartite matching with the lowest total matching cost over all groups. Furthermore, the invention adopts the deep-supervision training technique: from the optimal bipartite matching between the ground truths and the estimate set output by the last pose estimation sub-module, the estimate matched to each ground truth is retrieved at every cascade stage according to the same matching relation; the per-stage loss is then computed from the difference between ground truth and estimate, and all per-stage losses are summed into the total loss.
Assigning an estimate to each ground truth with the Hungarian matching algorithm of bipartite-graph optimal matching gives the model the capacity to autonomously filter redundant head pose outputs, and deep supervision over the per-stage outputs accelerates convergence and improves training efficiency. Fig. 9 sketches the bipartite matching and loss computation; the whole calculation splits into bipartite-graph matching and the corresponding loss terms. First, the optimal bipartite matching between the ground-truth set and the estimate set output by the last pose estimation sub-module is obtained: one estimate is matched to each ground truth so that the matching is optimal, and the matched estimates c_t, b_t and p_t^local of each stage-t pose estimation sub-module are extracted accordingly. Finally, the per-stage focal loss L_cls, bounding-box L1 loss L_L1, GIoU loss L_giou and local head pose L2 loss L_dof are computed in turn from the estimates and ground truths and summed into the total loss L_total.
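A minimal sketch of the matching step with SciPy's Hungarian solver; the cost matrices, computed elsewhere, and the weights mirroring λ_cls, λ_L1, λ_giou and λ_dof are assumed inputs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_cls, cost_l1, cost_giou, cost_pose,
                    w_cls=2.0, w_l1=5.0, w_giou=2.0, w_pose=5.0):
    """One-to-one assignment of the 300 estimate groups to the ground truths.

    Each cost_* argument is an (n_estimates, n_truths) matrix."""
    cost = (w_cls * cost_cls + w_l1 * cost_l1
            + w_giou * cost_giou + w_pose * cost_pose)
    est_idx, gt_idx = linear_sum_assignment(cost)   # Hungarian algorithm
    return est_idx, gt_idx   # losses are computed over these matched pairs only
```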
S6, model-parameter fusion based on multi-dataset training. To tune model parameters and improve model performance, the invention proposes a parameter-fusion scheme for multi-dataset training. Model-parameter fusion takes several models of identical structure but different weight parameters and produces a new set of weights by weighted averaging. This effectively improves model performance without adding computational complexity or memory overhead.
The fourth part comprises two steps:
S7, based on the training data preprocessed in S1, tuning the hyper-parameters of the end-to-end head pose estimation network built in steps S2 to S6 until the overall loss on the validation set no longer decreases, obtaining trained models on the two training sets of S1, and finally fusing the parameters of the two trained models according to S6 to obtain weights with stronger generalization ability and higher model accuracy.
S8, testing head pose estimation on the test sets of S1 with the final weights obtained in S7, and defining evaluation metrics of model performance such as error values and processing speed.
Specifically, in steps S6 and S7, for the model's parameter-fusion strategy, M_random denotes the randomly initialized parameters before training and M_1 to M_n denote the model parameters after training on the multiple datasets. M_fusion is the weighted average of the model parameters M_1 to M_n, improving the generalization ability and estimation performance of the model; the fusion formula is

M_fusion = Σ_{i=1}^{n} α_i · M_i, with Σ_{i=1}^{n} α_i = 1

where α_i are the fusion weights.
Further, the evaluation metrics of step S8 contain two parts. The first is the mean absolute error MAE_euler of each Euler angle estimated on the test set and the mean MAE_r of the three per-angle MAE_euler values. The second is the mean absolute error MAE_distance of each translation component and the mean MAE_t of the three per-component MAE_distance values.
The method provided by the invention is further described below with a specific example.
S1: first the common head pose estimation datasets are processed and local 6DoF head poses are constructed for training and testing. The WIDER FACE training set contains rich face information but no manually annotated head pose ground truth, so the invention annotates 6DoF head poses in the dataset with the RetinaFace model. Fig. 3 shows the WIDER FACE preprocessing flow: RetinaFace detects the face bounding boxes and their two-dimensional key-point coordinates in the WIDER FACE data, and the detections per picture are counted; the SolvePnP algorithm then solves the mapping between the two-dimensional face key points in the image and the three-dimensional key points of a standard face model in the head coordinate system, yielding the global 6DoF head pose vector p^{global*}; finally the global-to-local transformation yields the local 6DoF head pose vector p^{local*}. So that the proposal feature set Q_0 can learn the background feature information of the images, the number of feature vectors in Q_0 must exceed the number of faces in a picture; images with more than 100 face samples, along with their ground-truth data, are therefore deleted from the dataset. The resulting WIDER FACE training set contains 12638 images and 101449 local 6DoF head pose vectors, and the validation set contains 3167 pictures and 22945 local 6DoF head pose vectors. Similarly, to compensate for WIDER FACE's lack of pose ground truth, the invention performs a second training stage on 300W-LP, which contains pose ground-truth information: the Euler-angle ground truths provided by the 300W-LP dataset and the 68 face key points are converted via the projection matrix and the Rodrigues formula into global rotation vectors r_global, the 68 key-point ground truths are fed to the SolvePnP algorithm to obtain the global translation vectors t_global, and the global-to-local transformation finally yields the local 6DoF head pose vectors p^{local*}, giving 122450 local 6DoF head poses in total.
In particular, the invention uses the BIWI and AFLW2000-3D datasets as test sets. Since the BIWI dataset only contains Euler-angle ground truths describing the rotation of the head, testing on BIWI only evaluates the model's Euler-angle (i.e., rotation-vector) estimation. The AFLW2000-3D dataset contains both the Euler-angle ground truths of the head rotation and 68 face key-point ground truths; the invention applies the SolvePnP algorithm to the key-point ground truths provided by the dataset to obtain translation-vector ground truths, so testing on AFLW2000-3D evaluates the full 6DoF head pose estimate, covering both the rotation and the translation vector.
Specifically, the BIWI dataset contains 20 subjects (14 men and 6 women, 4 of them wearing glasses) recorded in 24 video sequences totalling 15678 video frames, each with a pixel size of 640 x 480. The first 2000 pictures of AFLW2000-3D, processed in the same manner as 300W-LP, have a pixel size of 450 x 450 and include the three Euler-angle ground truths of the head and the coordinate ground truths of the 68 face key points, covering varied poses, illumination and facial expressions.
S2, building the backbone network and the feature pyramid network, as shown in fig. 4.
S3, building the cascade pose estimation module, as shown in fig. 5.

In particular, P_0^local consists of 300 learnable potential local 6DoF head pose vectors p_{i,0}^local, 1 ≤ i ≤ 300. B_0 comprises 300 learnable candidate boxes b_{i,0} = (x_i, y_i, w_i, h_i), 1 ≤ i ≤ 300, where (x_i, y_i) denotes the centre-point coordinates of a candidate box and (w_i, h_i) its width and height; through subsequent training, B_0 learns statistics of the potential positions of faces in images and represents the model's initial guess of head positions. The number of cascaded sub-modules used in this example is 6, the optimal stage count established by the invention's later ablation experiments. Finally, the information-rich proposal feature set Q_0 is introduced to compensate for the large amount of detail (background distribution, texture features and so on) missing from the candidate box set B_0 and the potential local 6DoF head pose set P_0^local, further assisting P_0^local and B_0 during model optimization; Q_0 consists of 300 learnable feature vectors q_{i,0}, 1 ≤ i ≤ 300.
S4, building the pose conversion module based on the camera projection matrices, as shown in fig. 8.

S5, assigning an estimate to each ground truth with the Hungarian matching algorithm of bipartite-graph optimal matching, giving the model the capacity to autonomously filter redundant head pose outputs, as shown in fig. 9.
S6: in this step, weight parameters are fused based on the model training results over multiple datasets, as shown in fig. 10. M_random denotes the randomly initialized parameters before training, and M_1 to M_n the model parameters after training on the multiple datasets. M_fusion, the weighted average of M_1 to M_n, is obtained by training on the two preprocessed datasets of S1 and fusing the parameters; it serves as the invention's final weight parameters.
S7, training the model with the dataset processing and network construction of steps S1 to S6. The concrete scheme is as follows: training runs on an NVIDIA GeForce RTX 3090 (24 GB), with FPN networks based on ResNet-18 and ResNet-50 as backbones, pre-trained on the COCO dataset. The model is trained for 30 epochs on the preprocessed WIDER FACE dataset with the AdamW optimizer, batch size 12 and initial learning rate 2.5e-5. For the loss function, the focal-loss weight λ_cls is set to 2, the L1-loss weight λ_L1 to 5, the GIoU-loss weight λ_giou to 2 and the pose-loss weight λ_dof to 5. This yields model_wider. The model is then fine-tuned, without data augmentation, at a fixed initial learning rate of 2.5e-6 with the loss-function weights unchanged, producing model_wlp. Finally the models are fused following the Model Soups method, which reduces the overall computational complexity of estimation:
model_fusion = x · model_wider + (1 − x) · model_wlp

where x ∈ [0, 1] is a weight factor.
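A sketch of this fusion as a weighted average over two checkpoints of the same architecture (file names are placeholders):

```python
import torch

def fuse_checkpoints(sd_wider, sd_wlp, x=0.5):
    """model_fusion = x * model_wider + (1 - x) * model_wlp, x in [0, 1]."""
    return {k: x * sd_wider[k] + (1 - x) * sd_wlp[k] for k in sd_wider}

# fused = fuse_checkpoints(torch.load("model_wider.pth"),
#                          torch.load("model_wlp.pth"), x=0.5)
# model.load_state_dict(fused)
```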
S8, setting the model-test evaluation metrics, specifically the Euler-angle errors and the translation-vector errors. The head pose estimation vector is p = (r, t), where r = (r_x, r_y, r_z) is the local rotation vector and t = (t_x, t_y, t_z) is the global translation vector. The rotation vector r is converted into a rotation matrix via the Rodrigues formula and expressed as the yaw angle θ_y, pitch angle θ_x and roll angle θ_z.
Specifically, the evaluation metrics of model performance are divided into two parts.

The first part is the mean absolute error of each Euler angle estimated on the test set and the mean of the three per-angle errors:

MAE_euler = (1/N) Σ_{i=1}^{N} |θ_i^est − θ_i^gt|, MAE_r = (MAE_yaw + MAE_pitch + MAE_roll) / 3

The second part is the mean absolute error of each translation component estimated on the test set and the mean of the three per-component errors:

MAE_distance = (1/N) Σ_{i=1}^{N} |t_i^est − t_i^gt|, MAE_t = (MAE_tx + MAE_ty + MAE_tz) / 3
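The two metric families reduce to a few NumPy lines; the array layouts are illustrative:

```python
import numpy as np

def mae_metrics(pred_euler, gt_euler, pred_t, gt_t):
    """pred/gt_euler: (N, 3) yaw, pitch, roll; pred/gt_t: (N, 3) translations."""
    mae_euler    = np.abs(pred_euler - gt_euler).mean(axis=0)  # per-angle MAE
    mae_r        = mae_euler.mean()                            # mean of the 3 angles
    mae_distance = np.abs(pred_t - gt_t).mean(axis=0)          # per-component MAE
    mae_t        = mae_distance.mean()
    return mae_euler, mae_r, mae_distance, mae_t
```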
Further, the constructed end-to-end head pose estimation model is loaded with the trained fused weights, the BIWI and AFLW2000-3D datasets are fed in for testing, and the corresponding evaluation metrics are computed. Fig. 11 shows the test results on the test datasets and the comparison with other models.
In summary, addressing the excess of manually set parameters in traditional deep network models, the invention proposes the learnable-vector-set scheme: a group of learnable candidate boxes is preset in the network and, through backpropagation during training, learns the prior positions and sizes of faces. Parameters such as the sizes and number of face target boxes therefore need not be set in the model, and the region proposal network can be discarded. Matching the learnable candidate boxes, equal numbers of learnable proposal features and potential head pose estimation vectors are also set.
Aiming at the redundant head pose estimates produced by traditional detection pipelines, the invention proposes a loss-computation scheme based on the Hungarian matching algorithm commonly used for bipartite-graph optimal matching: during loss computation the model's head pose estimates are matched one-to-one with the ground truths, low-confidence redundant head pose data are filtered out on the basis of this loss, and the model weights are updated by backpropagation, so the model learns to output high-confidence head pose estimates. The non-maximum suppression module can thus be discarded, avoiding its side effects.
In the model structure, the invention adds the attention module and the dynamic convolution modules to strengthen the head pose features of face regions and suppress those of non-face regions, improving the model's head pose estimation performance.
In the training design, the invention proposes a model-parameter fusion scheme based on multi-dataset training: training on different training sets lets the model learn different knowledge, and combining the strengths of the differently trained models further improves the model's head pose estimation performance.
The embodiment of the invention provides an end-to-end head pose estimation system based on a learnable vector and an attention mechanism, comprising: a computer readable storage medium and a processor;
The computer-readable storage medium is for storing executable instructions;
the processor is configured to read executable instructions stored in the computer readable storage medium and perform a method as in any of the embodiments described above.
Embodiments of the present invention provide a computer readable storage medium storing computer instructions for causing a processor to perform a method as described in any of the embodiments above.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (7)

1. An end-to-end head pose estimation method based on a learnable vector and an attention mechanism, comprising:
a training phase: constructing a head pose estimation model and training the head pose estimation model;
Wherein the head pose estimation model comprises:
the feature extraction module is used for extracting multi-scale features of the input image to obtain the multi-scale feature map $X_{FPN}$;
the cascade pose estimation module is used for processing $X_{FPN}$ to obtain a local head pose and a face bounding box; the cascade pose estimation module comprises a plurality of cascaded sub-modules, each sub-module comprising an attention module, a region feature aggregation module, a head pose dynamic convolution module and a candidate box dynamic convolution module; for the $t$-th sub-module, the inputs are the local head pose set $P_{t-1}^{local}$, the proposal feature set $Q_{t-1}$ and the face boundary candidate box set $B_{t-1}$ output by the $(t-1)$-th sub-module; the attention module is used for carrying out attention computation on $Q_{t-1}$ to obtain $\hat{Q}_{t-1}$; the region feature aggregation module is used for extracting the region-of-interest feature set $F_t$ from $X_{FPN}$ according to $B_{t-1}$; the head pose dynamic convolution module is used for carrying out convolution processing on $F_t$ and $\hat{Q}_{t-1}$ to obtain the enhanced local head pose feature set $\hat{P}_t^{local}$, and for correcting $P_{t-1}^{local}$ to obtain $P_t^{local}$; the candidate box dynamic convolution module is used for carrying out convolution processing on $F_t$ and $\hat{Q}_{t-1}$ to obtain the enhanced $Q_t$; $Q_t$ is respectively subjected to feature mapping to obtain $B_t$ and to dimension reduction and weighting to obtain the confidence set $\mathrm{Confidence}_t$; wherein $P_0^{local}$, $Q_0$ and $B_0$ are all learnable vector sets;
the pose conversion module is used for converting the local head pose into a global head pose according to the face bounding box;
an application phase: inputting an image to be estimated into the trained head pose estimation model to obtain the global head pose;
the head pose dynamic convolution module comprises a first fully connected layer and a second fully connected layer;
the first fully connected layer outputs two sets of convolution kernel parameters, $\mathrm{Kernel1}_t$ and $\mathrm{Kernel2}_t$; $F_t$ is convolved with $\mathrm{Kernel1}_t$, the resulting features are convolved again with $\mathrm{Kernel2}_t$, and the output is passed through the second fully connected layer to obtain $\hat{P}_t^{local}$;
the candidate box dynamic convolution module comprises a third fully connected layer and a fourth fully connected layer;
the third fully connected layer outputs two sets of convolution kernel parameters, $\mathrm{Kernel1}'_t$ and $\mathrm{Kernel2}'_t$; $F_t$ is convolved with $\mathrm{Kernel1}'_t$, the resulting features are convolved again with $\mathrm{Kernel2}'_t$, and the output is passed through the fourth fully connected layer to obtain $Q_t$;
correcting $P_{t-1}^{local}$ to obtain $P_t^{local}$ comprises:
computing scaling and offset quantities from $\hat{P}_t^{local}$ through a pair of fully connected layers, and performing a weighted computation with $P_{t-1}^{local}$ to obtain $P_t^{local}$;
wherein $W_t$ and $S_t$ are the scaling weight and the offset weight, respectively;
performing feature mapping on $Q_t$ to obtain $B_t$ comprises:
mapping $Q_t$ through a fully connected layer to obtain the feature-enhanced $B_t$;
weighting $Q_t$ to obtain $\mathrm{Confidence}_t$ comprises:
reducing the dimension of $Q_t$ and weighting it through a fully connected layer to obtain the local head pose confidence set $\mathrm{Confidence}_t$;
In the training phase, the loss function is:
$L = \sum_{t=1}^{N}\big(\lambda_{cls} L_{cls}(c_t, c^{*}) + \lambda_{L1} L_{L1}(b_t, b^{*}) + \lambda_{giou} L_{giou}(b_t, b^{*}) + \lambda_{dof} L_{dof}(p_t^{local}, p^{local*})\big)$
wherein $N$ is the number of sub-modules of the cascade pose estimation module; $c^{*}$, $b^{*}$ and $p^{local*}$ are the ground-truth confidence, face bounding box and local head pose, respectively; $c_t$, $b_t$ and $p_t^{local}$ are the confidence, face bounding box and local head pose estimate output by the $t$-th sub-module, selected from $\mathrm{Confidence}_t$, $B_t$ and $P_t^{local}$ according to the optimal mapping $\hat{\sigma}$; $\hat{\sigma}$ is the optimal mapping obtained by bipartite graph matching between the ground-truth values and the confidences, local head poses and candidate boxes output by the last-stage sub-module, and comprises the mappings from $c_t$ to $c^{*}$, from $b_t$ to $b^{*}$ and from $p_t^{local}$ to $p^{local*}$; $L_{cls}$, $L_{L1}$, $L_{giou}$ and $L_{dof}$ are the focal loss, the bounding-box L1 position loss, the GIoU loss and the local head pose estimation loss, respectively; $\lambda_{cls}$, $\lambda_{L1}$, $\lambda_{giou}$ and $\lambda_{dof}$ are all weight coefficients.
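For illustration only (not part of the claims): a sketch of the per-stage loss over already-matched prediction/target pairs, using the focal, L1, GIoU and pose terms named above. The lambda defaults and the L1 form of the pose loss are assumptions; sigmoid_focal_loss and generalized_box_iou are provided by torchvision.

import torch
from torchvision.ops import generalized_box_iou, sigmoid_focal_loss

def stage_loss(c_t, b_t, p_t, c_star, b_star, p_star,
               lam_cls=2.0, lam_l1=5.0, lam_giou=2.0, lam_dof=1.0):
    # c_t: confidence logits; b_t/b_star: matched boxes (x1, y1, x2, y2); p_t/p_star: local poses.
    l_cls = sigmoid_focal_loss(c_t, c_star, reduction="mean")
    l_l1 = torch.abs(b_t - b_star).mean()
    l_giou = (1.0 - torch.diag(generalized_box_iou(b_t, b_star))).mean()
    l_dof = torch.abs(p_t - p_star).mean()
    return lam_cls * l_cls + lam_l1 * l_l1 + lam_giou * l_giou + lam_dof * l_dof

# The total loss sums stage_loss over all N cascade sub-modules.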
2. The method of claim 1, wherein, in the application phase, a minimum confidence threshold is set for the last-stage sub-module of the cascade pose estimation module, so as to screen and filter the local head pose set and the candidate box set output by the cascade pose estimation module.
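For illustration only: the inference-time filtering of claim 2 amounts to a simple threshold on the last-stage confidences (the 0.5 value is an assumption; works for torch tensors or numpy arrays).

def filter_by_confidence(poses, boxes, scores, thresh=0.5):
    # scores: (N,) confidences from the last cascade stage, in [0, 1].
    keep = scores > thresh
    return poses[keep], boxes[keep], scores[keep]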
3. The method of claim 1, wherein the feature extraction module comprises a ResNet-18 network and a feature pyramid network connected to each other;
or the feature extraction module comprises a ResNet-50 network and a feature pyramid network connected to each other.
4. The method of claim 1, wherein the head pose is a 6DoF or 3DoF head pose.
5. The method of claim 1, wherein the head pose estimation model is trained on a plurality of datasets, and the model parameters obtained from each dataset are fused by weighted averaging to serve as the parameters of the trained model.
6. An end-to-end head pose estimation system based on a learnable vector and an attention mechanism, comprising: a computer readable storage medium and a processor;
The computer-readable storage medium is for storing executable instructions;
The processor is configured to read executable instructions stored in the computer readable storage medium and perform the method of any one of claims 1-5.
7. A computer readable storage medium storing computer instructions for causing a processor to perform the method of any one of claims 1-5.