CN111695438B - Head pose estimation method and device - Google Patents


Info

Publication number
CN111695438B
CN111695438B (application CN202010431119.XA; also published as CN111695438A)
Authority
CN
China
Prior art keywords
dimensional
head
angle
dimension
posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010431119.XA
Other languages
Chinese (zh)
Other versions
CN111695438A (en)
Inventor
户磊
石芳
刘其开
朱海涛
陈智超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Dilusense Technology Co Ltd
Original Assignee
Hefei Dilusense Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Dilusense Technology Co Ltd
Priority to CN202010431119.XA
Publication of CN111695438A
Application granted
Publication of CN111695438B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/64: Three-dimensional objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

An embodiment of the invention provides a head pose estimation method and device. The head pose estimation method comprises: acquiring depth image data; and inputting the depth image data into a head pose estimation model to obtain a head pose estimation result output by the head pose estimation model. The head pose estimation model is trained with depth image sample data as samples, and with pre-annotated or interval-partitioned comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels in one-to-one correspondence with the sample data as sample labels. The comprehensive-dimensional large/small-angle binary classification label characterizes the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification label and the single-dimensional regression label both characterize the spatial features of the head pose in a single dimension. The head pose estimation method of the embodiment of the invention has higher accuracy and generalization capability, so the head pose estimation result is more robust.

Description

Head pose estimation method and device
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a head pose estimation method and apparatus.
Background
With the development of computer vision technology, head pose estimation is required in many scenarios, such as face recognition, attention detection, human-computer interaction, and behavior analysis. Head pose estimation is the technique of estimating the rotation angles of the head in three-dimensional space from a face image, using methods such as computer vision and machine learning.
Head pose estimation methods in the prior art fall roughly into three categories:
(1) Template-matching-based methods, which comprise two-dimensional-image and three-dimensional-modeling approaches. The two-dimensional-image approach compares the input image one by one with the images in a template library (each sample carrying a pose label), so that the most similar view and its corresponding pose angle are obtained by matching. The three-dimensional-modeling approach reconstructs a three-dimensional face model of a person from one or more two-dimensional, single three-dimensional, or multi-modal face images, matches the model against a three-dimensional model of a standard face pose so that it coincides with the standard model after rotation correction, and computes the rotation matrix parameters to obtain the corresponding pose angle. The drawbacks of these methods are high computational complexity, long running time, and strong sensitivity of the matching process to face detection and image quality.
(2) Model-based methods, which construct a face structure with a geometric model or build a face model from facial key points, compute the mapping between face image features and the geometric or face model, and finally estimate the head pose. Their drawback is that they are easily affected by key-point detection accuracy, face image quality, and the scene environment, and they generalize poorly.
(3) Manifold-embedding methods, which map the high-dimensional spatial features of images to a low-dimensional space to model the continuous variation of head pose, and then perform template matching in the embedded space. Because this kind of dimensionality reduction is unsupervised, it is difficult to guarantee a high correlation between the low-dimensional principal-component features and the pose features.
Disclosure of Invention
Embodiments of the present invention provide a head pose estimation method and apparatus that overcome, or at least partially solve, the above problems.
In a first aspect, an embodiment of the present invention provides a head pose estimation method, comprising: acquiring depth image data; and inputting the depth image data into a head pose estimation model to obtain a head pose estimation result output by the head pose estimation model. The head pose estimation model is trained with depth image sample data as samples, and with predetermined comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels corresponding to the depth image sample data as sample labels. The comprehensive-dimensional binary classification label characterizes the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification label and the single-dimensional regression label both characterize the spatial features of the head pose in a single dimension.
In some embodiments, the head pose estimation model comprises a feature extraction layer, a single-dimensional pose layer, and a comprehensive-dimensional pose layer. Inputting the depth image data into the head pose estimation model to obtain the head pose estimation result output by the head pose estimation model comprises: inputting the depth image data into the feature extraction layer to obtain a single-dimensional pose feature map and a multi-dimensional pose feature map; inputting the single-dimensional pose feature map into the single-dimensional pose layer to obtain single-dimensional pose multi-classification information and single-dimensional regression information; inputting the multi-dimensional pose feature map into the comprehensive-dimensional pose layer to obtain comprehensive-dimensional large/small-angle binary classification information; and determining the head pose estimation result based on the single-dimensional pose multi-classification information, the single-dimensional regression information, and the comprehensive-dimensional binary classification information. The determination process of the head pose estimation model comprises: training the single-dimensional pose layer with single-dimensional pose sample feature maps as samples and predetermined corresponding single-dimensional pose multi-classification labels and single-dimensional regression labels as sample labels; and training the comprehensive-dimensional pose layer with multi-dimensional pose sample feature maps as samples and predetermined corresponding comprehensive-dimensional large/small-angle binary classification labels as sample labels.
In some embodiments, the acquiring depth image data comprises: collecting an original depth image; and performing face detection on the original depth image, removing redundant background areas, and determining the depth image data.
In some embodiments, the determination process of the head pose estimation model further comprises: acquiring a three-dimensional pose angle label corresponding to each item of depth image sample data, the three-dimensional pose angle label comprising a Yaw angle, a Pitch angle, and a Roll angle; determining the comprehensive-dimensional large/small-angle binary classification label based on comparison of the three-dimensional pose angle label with a pose-angle threshold set for each dimension; determining the single-dimensional pose multi-classification labels based on interval partitioning of the three-dimensional pose angle label; and determining the single-dimensional regression labels based on normalization of the three-dimensional pose angle label.
In some embodiments, the head pose estimation model is trained using a total loss function that is determined based on a single-dimensional pose loss function of a single-dimensional pose layer of the head pose estimation model and a comprehensive dimensional pose loss function of a comprehensive dimensional pose layer of the head pose estimation model.
In some embodiments, determining the total loss function based on the single-dimensional pose loss functions of the single-dimensional pose layer of the head pose estimation model and the comprehensive-dimensional pose loss function of the comprehensive-dimensional pose layer of the head pose estimation model comprises applying the formula

L_total = L_yaw_total + L_pitch_total + L_roll_total + α·L_cls

to determine the total loss function, where L_total is the total loss function; L_yaw_total, L_pitch_total, and L_roll_total are the single-dimensional pose loss functions for the Yaw, Pitch, and Roll angles, respectively; L_cls is the comprehensive-dimensional pose loss function; and α is the weight of L_cls.
In some embodiments, the determination process of the head pose estimation model further comprises: extracting a validation set from the depth image sample data according to a preset sampling strategy; and dynamically adjusting the learning rate with an RMSProp optimizer, using the validation set to verify the generalization and accuracy of the head pose estimation model, thereby determining the head pose estimation model.
In a second aspect, an embodiment of the present invention provides a head pose estimation apparatus, comprising: an acquisition unit configured to acquire depth image data; and a processing unit configured to input the depth image data into a head pose estimation model and obtain a head pose estimation result output by the head pose estimation model. The head pose estimation model is trained with depth image sample data as samples, and with predetermined comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels corresponding to the depth image sample data as sample labels. The comprehensive-dimensional binary classification label characterizes the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification label and the single-dimensional regression label both characterize the spatial features of the head pose in a single dimension.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the head pose estimation method provided by any possible implementation of the first aspect when the program is executed.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the head pose estimation method provided by any possible implementation of the first aspect.
With the head pose estimation method, head pose estimation apparatus, electronic device, and non-transitory computer-readable storage medium of the embodiments of the present invention, a head pose estimation model is trained with a large amount of depth image sample data and the corresponding comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels as sample labels, and the trained model is then used to obtain head pose estimation results. The mapping between face image features and head pose Euler angles is thereby fitted more accurately, so the method has higher accuracy and generalization capability and the head pose estimation result is more robust.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a head pose estimation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a head pose estimation model according to an embodiment of the present invention;
FIG. 3 is a flow chart of a head pose estimation method according to an embodiment of the present invention;
FIG. 4 is a flow chart of a head pose estimation method according to an embodiment of the present invention;
FIG. 5 is a schematic view of a head pose estimation device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A head pose estimation method according to an embodiment of the present invention is described below with reference to fig. 1 to 4.
As shown in fig. 1, the head pose estimation method according to the embodiment of the present invention includes the following steps S100 to S200.
Step S100, obtaining depth image data.
A depth image, also called a range image, is an image whose pixel values are the distances (depths) from the image collector to the points in the scene; it directly reflects the geometry of the scene's visible surfaces. A depth image can be converted into point cloud data through coordinate transformation, and point cloud data that is regular and carries the necessary information can conversely be back-calculated into depth image data.
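To make this depth-image/point-cloud relationship concrete, the following is a minimal back-projection sketch assuming a pinhole camera model; the patent does not specify the camera, so the intrinsics fx, fy, cx, cy are illustrative assumptions.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into an N x 3 point cloud.

    A minimal sketch assuming a pinhole camera; fx, fy, cx, cy are
    assumed intrinsics, not values from the patent.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx          # X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy          # Y = (v - cy) * Z / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]    # keep only pixels with a depth reading
```

The inverse direction (point cloud back to a depth image) follows by projecting each point with the same intrinsics and writing its Z value into the corresponding pixel.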
The depth image data in the embodiments of the present invention are human-head depth images used mainly for head pose estimation. A depth camera can be used to collect depth images with rich head poses (covering as many angles of each dimension as possible) in various scenes (unconstrained by factors such as distance, illumination, occlusion, blur, and accessories).
Step S200, inputting the depth image data into the head pose estimation model to obtain the head pose estimation result output by the head pose estimation model.
It will be appreciated that the depth image data may be processed using a head pose estimation model to obtain a corresponding head pose estimation result.
The head pose estimation model is trained with depth image sample data as samples, and with pre-annotated or interval-partitioned comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels in one-to-one correspondence with the sample data as sample labels. The comprehensive-dimensional binary classification label characterizes the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification label and the single-dimensional regression label both characterize the spatial features of the head pose in a single dimension.
It can be understood that the head pose estimation model is trained on a large amount of depth image sample data together with the corresponding comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels as sample labels.
It should be noted that head pose features are described in terms of three angles. The comprehensive-dimensional large/small-angle binary classification label describes the overall large/small-angle character of the head pose across the three angles; the single-dimensional pose multi-classification label describes the head pose features of each angle through multi-class partitioning; and the single-dimensional regression label describes the head pose features of each angle through regression within an interval.
By training the head pose estimation model on a large amount of depth image sample data with the corresponding comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels as sample labels, and then using the model to obtain head pose estimation results, the embodiments of the present invention fit the mapping between face image features and head pose Euler angles more accurately, achieve higher accuracy and generalization capability, and make the head pose estimation result more robust.
As shown in fig. 2, in some embodiments, the head pose estimation model includes a feature extraction layer, a single-dimensional pose layer, and a comprehensive dimensional pose layer.
As shown in fig. 3, inputting the depth image data into the head pose estimation model to obtain the head pose estimation result output by the head pose estimation model includes the following steps S210 to S240.
Step S210, inputting the depth image data into the feature extraction layer to obtain a single-dimensional pose feature map and a multi-dimensional pose feature map.
Step S220, inputting the single-dimensional pose feature map into the single-dimensional pose layer to obtain single-dimensional pose multi-classification information and single-dimensional regression information.
Step S230, inputting the multi-dimensional pose feature map into the comprehensive-dimensional pose layer to obtain comprehensive-dimensional large/small-angle binary classification information.
Step S240, determining the head pose estimation result based on the single-dimensional pose multi-classification information, the single-dimensional regression information, and the comprehensive-dimensional binary classification information.
The determination process of the head pose estimation model includes the following processes.
The single-dimensional pose layer is trained with single-dimensional pose sample feature maps as samples, and with predetermined corresponding single-dimensional pose multi-classification labels and single-dimensional regression labels as sample labels.
The comprehensive-dimensional pose layer is trained with multi-dimensional pose sample feature maps as samples, and with predetermined corresponding comprehensive-dimensional large/small-angle binary classification labels as sample labels.
It is understood that the head pose estimation model may include a feature extraction layer, a single-dimensional pose layer, and a comprehensive-dimensional pose layer. The feature extraction layer mainly extracts feature maps based on ShuffleNet_Pose (a lightweight network for pose estimation) with shared weights. The single-dimensional pose layer mainly uses the regressed angle of each single-dimensional pose to assist the multi-class training of the corresponding dimension. The comprehensive-dimensional pose layer is designed mainly for the problem that extreme-angle samples are scarce in the single-dimensional pose training data, which makes extreme-angle estimation non-robust; its purpose is to supervise and constrain the extreme-angle estimates of the single-dimensional pose layer.
The embodiments of the present invention apply network pruning and a more streamlined parameter design to ShuffleNet (a lightweight network); the resulting head pose estimation model runs fast and is highly accurate, meeting the real-time and accuracy requirements of face recognition scenarios.
By dividing the head pose estimation model into a feature extraction layer, a single-dimensional pose layer, and a comprehensive-dimensional pose layer, the embodiments of the present invention process the depth image data from multiple dimensions and represent head pose features from those dimensions, making the head pose estimation results output by the model more accurate.
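As an illustration of this three-part layout, the following PyTorch sketch wires a stand-in backbone to three single-dimensional classification heads and one binary head. The internals of the ShuffleNet_Pose backbone are not disclosed in the patent, so a small convolutional stack stands in for it, and the layer sizes are assumptions; the bin counts 60/40/40 come from the label section later in the description.

```python
import torch
import torch.nn as nn

class HeadPoseNet(nn.Module):
    """Sketch: shared feature extraction layer, a single-dimensional pose
    layer (multi-class logits per angle; the regressed angle is decoded
    from their softmax, see the loss section), and a
    comprehensive-dimensional large/small-angle binary layer."""

    def __init__(self, yaw_bins=60, pitch_bins=40, roll_bins=40):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in for ShuffleNet_Pose
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.yaw_head = nn.Linear(64, yaw_bins)   # single-dimensional pose layer
        self.pitch_head = nn.Linear(64, pitch_bins)
        self.roll_head = nn.Linear(64, roll_bins)
        self.cls_head = nn.Linear(64, 1)          # comprehensive-dimensional layer

    def forward(self, depth):                     # depth: (B, 1, H, W)
        feat = self.backbone(depth)
        return (self.yaw_head(feat), self.pitch_head(feat),
                self.roll_head(feat), self.cls_head(feat))
```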
As shown in fig. 4, in some embodiments, acquiring depth image data includes the following steps S110-S120.
Step S110, acquiring an original depth image.
Step S120, face detection is carried out on the original depth image, redundant background areas are removed, and depth image data are determined.
It can be understood that face detection is performed on the original depth image, and the face box is enlarged to a suitable size according to the length of the longest edge of the detection box, ensuring that the whole face lies within the cropping range while no redundant background region is included; the depth image data are thereby determined. Their data format may be a binary depth point cloud of uniform size storing the depth information within the cropped region.
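A minimal sketch of this cropping step follows; the face detector itself is out of scope, so the detection box (x, y, w, h) is assumed given, and the square expansion with a margin is one plausible reading of "enlarged according to the longest edge".

```python
import numpy as np

def crop_head(depth, box, margin=0.2):
    """Expand the detected face box to a square based on its longest edge
    (plus an assumed margin) and crop the depth image; resampling to a
    fixed resolution is omitted here."""
    x, y, w, h = box
    side = int(max(w, h) * (1 + margin))      # longest edge plus margin
    cx, cy = x + w // 2, y + h // 2           # box center
    x0 = max(cx - side // 2, 0)
    y0 = max(cy - side // 2, 0)
    return depth[y0:y0 + side, x0:x0 + side]  # whole face, no extra background
```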
By preprocessing the original depth image in this way, the embodiments of the present invention remove the interference of irrelevant data, making the depth image data more accurate.
In some embodiments, the process of determining the head pose estimation model further includes the following.
A three-dimensional pose angle label corresponding to each item of depth image sample data is acquired; the three-dimensional pose angle label comprises a Yaw angle, a Pitch angle, and a Roll angle.
It will be appreciated that the three-dimensional pose angle label characterizes the depth image data at three angles: Yaw (left-right rotation), Pitch (up-down rotation), and Roll (in-plane tilt).
It can be understood that, in actual training of the head pose estimation model of the embodiments of the present invention, the comprehensive-dimensional large/small-angle binary classification label, the single-dimensional pose multi-classification labels, and the single-dimensional regression labels must be constructed simultaneously. Each angle (Yaw, Pitch, and Roll) corresponds to one multi-classification label and one regression label, giving a 7-dimensional label vector per sample in total.
The comprehensive-dimensional large/small-angle binary classification label is determined based on comparison of the three-dimensional pose angle label with the pose-angle threshold set for each dimension.
It can be understood that the comprehensive-dimensional large/small-angle binary classification label is determined by set thresholds. Because the spatial features of poses near the threshold angles are hard to distinguish, a buffer interval is set around the critical angles. The specific thresholds are set as follows:
(1) Large angle: abs(yaw) > 45 & abs(pitch) > 35 & abs(roll) > 40; the label is 1.
(2) Small angle: abs(yaw) < 40 or abs(pitch) < 30 or abs(roll) < 30; the label is 0.
(3) Critical angle region: between the large-angle and small-angle regions, the label is -1. Because the features of these data are ambiguous, they are only labeled and do not actually participate in the comprehensive-dimensional binary classification training.
The single-dimensional pose multi-classification labels are determined based on interval partitioning of the three-dimensional pose angle label.
It can be understood that the single-dimensional pose multi-classification labels are obtained by partitioning the pose angle of the corresponding dimension into 3-degree intervals. Taking the yaw angle as an example, its range can be divided into 60 subclasses, with labels taking discrete values in the interval [0, 59]; pitch and roll are each divided into 40 classes, with labels taking discrete values in the interval [0, 39].
The single-dimensional regression labels are determined based on normalization of the three-dimensional pose angle label.
It is understood that the single-dimensional regression labels are obtained by normalizing the pose angle of each corresponding dimension to the range [-1, 1].
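Putting the three label constructions together, the sketch below builds the 7-dimensional label vector for one sample. The thresholds, 3-degree bins, and [-1, 1] normalization follow the description above; the angle ranges (yaw in [-90, 90], pitch/roll in [-60, 60]) are inferred from the stated class counts, and the normalization divisors are assumptions.

```python
def bin_index(angle, offset, n_bins):
    """3-degree bins, clamped to the valid class range."""
    return min(max(int((angle + offset) // 3), 0), n_bins - 1)

def make_labels(yaw, pitch, roll):
    """Build the 7-dimensional label vector: 1 large/small-angle binary
    label + 3 per-dimension class labels + 3 per-dimension regression
    labels (angles in degrees)."""
    # comprehensive-dimensional binary label; -1 marks the critical band
    # excluded from binary training (operators as stated in the patent)
    if abs(yaw) > 45 and abs(pitch) > 35 and abs(roll) > 40:
        cls = 1                                  # large angle
    elif abs(yaw) < 40 or abs(pitch) < 30 or abs(roll) < 30:
        cls = 0                                  # small angle
    else:
        cls = -1                                 # critical band, ignored

    # single-dimensional multi-class labels: 3-degree bins
    yaw_cls = bin_index(yaw, 90, 60)             # 60 classes
    pitch_cls = bin_index(pitch, 60, 40)         # 40 classes
    roll_cls = bin_index(roll, 60, 40)           # 40 classes

    # single-dimensional regression labels, normalized to [-1, 1]
    yaw_reg, pitch_reg, roll_reg = yaw / 90.0, pitch / 60.0, roll / 60.0

    return [cls, yaw_cls, pitch_cls, roll_cls, yaw_reg, pitch_reg, roll_reg]
```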
By specifying the concrete generation of these three kinds of labels, the embodiments of the present invention make the training of the head pose estimation model more refined and the accuracy of the head pose estimation model higher.
In some embodiments, the head pose estimation model is trained using a total loss function that is determined based on a single-dimensional pose loss function of a single-dimensional pose layer of the head pose estimation model and a comprehensive dimensional pose loss function of a comprehensive dimensional pose layer of the head pose estimation model.
Determining the total loss function based on the single-dimensional pose loss functions of the single-dimensional pose layer of the head pose estimation model and the comprehensive-dimensional pose loss function of the comprehensive-dimensional pose layer of the head pose estimation model comprises applying the formula

L_total = L_yaw_total + L_pitch_total + L_roll_total + α·L_cls

to determine the total loss function, where L_total is the total loss function; L_yaw_total, L_pitch_total, and L_roll_total are the single-dimensional pose loss functions for the Yaw, Pitch, and Roll angles, respectively; L_cls is the comprehensive-dimensional pose loss function; and α is the weight of L_cls.
It should be noted that the main function of the comprehensive-dimensional pose loss function is the large/small-angle binary classification training; it is intended to constrain and correct the fine-grained classification training of the single-dimensional poses at extreme angles. It adopts the cross-entropy loss:

L_cls = -[y·log(p) + (1 - y)·log(1 - p)]

where L_cls denotes the comprehensive-dimensional pose loss function; y is the sample's large-angle label, 1 for a large angle and 0 for a small angle; and p is the probability that the sample is predicted to be large-angle.
The single-dimensional pose loss function is a multi-part loss combining classification and regression losses for each dimension's pose. It mainly uses the regressed angle of the single-dimensional pose to supervise the multi-class learning of the corresponding dimension, so that the classification results of the subclasses become more accurate and a more precise continuous pose angle is predicted. The single-dimensional pose loss is computed per dimension as follows, taking the Yaw angle as an example:

L_yaw_cls = H(x, x′);
L_yaw_total = L_yaw_cls + β·L_yaw_mse;

where L_yaw_cls is the multi-class cross-entropy loss of the Yaw angle; β is the weight parameter of the regression loss; and L_yaw_mse is the mean-squared error between Z and Z′, where Z is the single-dimensional regression label of Yaw and Z′ is the regressed angle value solved from the predicted sub-class interval probability map:

Z′ = Σ_k p_k · μ_k

where p_k is the probability that the prediction falls in the k-th subinterval and μ_k is the representative value of the k-th subinterval. The pose losses of the remaining dimensions are designed in the same way as L_yaw_total.
The total loss function is determined by the weighted summation of the aforementioned comprehensive-dimensional pose loss L_cls and the single-dimensional pose losses L_yaw_total, L_pitch_total, and L_roll_total.
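As a concrete reading of these formulas, the following PyTorch sketch implements the per-dimension loss with the expectation decoding Z′ = Σ_k p_k·μ_k over bin centers normalized to [-1, 1], plus the α-weighted binary term. The weights α and β, the bin-center layout, and the use of binary cross-entropy with logits are assumptions where the patent gives no concrete values.

```python
import torch
import torch.nn.functional as F

def single_dim_loss(logits, cls_target, reg_target, beta=1.0):
    """L_dim_total = cross entropy + beta * MSE(Z, Z'), where Z' is the
    expectation sum_k p_k * mu_k over bin centers normalized to [-1, 1];
    beta is an assumed weight."""
    ce = F.cross_entropy(logits, cls_target)
    probs = F.softmax(logits, dim=1)
    centers = torch.linspace(-1, 1, logits.shape[1], device=logits.device)
    z_pred = (probs * centers).sum(dim=1)          # Z' = sum_k p_k * mu_k
    return ce + beta * F.mse_loss(z_pred, reg_target)

def total_loss(yaw_out, pitch_out, roll_out, cls_logit, labels, alpha=0.5):
    """L_total = L_yaw_total + L_pitch_total + L_roll_total + alpha * L_cls,
    skipping the binary term for samples labeled -1 (critical band);
    alpha is an assumed weight."""
    loss = (single_dim_loss(yaw_out, labels["yaw_cls"], labels["yaw_reg"])
            + single_dim_loss(pitch_out, labels["pitch_cls"], labels["pitch_reg"])
            + single_dim_loss(roll_out, labels["roll_cls"], labels["roll_reg"]))
    valid = labels["cls"] >= 0                     # drop critical-band samples
    if valid.any():
        loss = loss + alpha * F.binary_cross_entropy_with_logits(
            cls_logit[valid].squeeze(-1), labels["cls"][valid].float())
    return loss
```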
Addressing the inaccuracy of current head pose estimation at larger angles, the embodiments of the present invention design the total loss function by combining the comprehensive pose dimension with the single pose dimensions to assist training, further improving the performance of the head pose estimation model.
In some embodiments, the process of determining the head pose estimation model further comprises the following process.
A validation set is extracted from the depth image sample data according to a preset sampling strategy.
It will be appreciated that, according to a preset sampling strategy, for example training set : validation set = 4 : 1, a training set and a validation set are extracted from the depth image sample data; the training set is used to train the head pose estimation model, and the validation set is used to verify network generalization and accuracy during training.
It should be noted that, because each item of depth image sample data corresponds to a 7-dimensional label, the embodiments of the present invention design the preset sampling strategy to keep the distribution of depth image sample data uniform in each sampled training round, and add data augmentation such as random cropping and point cloud jitter during training to enrich the training data.
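As one example of the point-cloud-jitter augmentation mentioned above (a sketch; the noise scale and clipping bound are assumed values, not ones from the patent):

```python
import numpy as np

def jitter_point_cloud(points, sigma=0.002, clip=0.01):
    """Point cloud jitter: add small, clipped Gaussian noise to every
    point of an N x 3 cloud; sigma and clip are assumed values."""
    noise = np.clip(sigma * np.random.randn(*points.shape), -clip, clip)
    return points + noise
```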
An RMSProp optimizer is used to dynamically adjust the learning rate, and the validation set is used to verify the generalization and accuracy of the head pose estimation model, thereby determining the head pose estimation model.
It should be noted that, for hyper-parameter configuration, the embodiments of the present invention use an RMSProp optimizer with an initial learning rate of 0.01; the learning rate is adjusted dynamically during training of the head pose estimation model, stepping down as the iteration count grows, which ensures that the optimizer does not oscillate strongly in late training because of an overly large learning rate. To ensure the loss keeps converging during training, the trends of the loss and of the accuracy metrics are observed in real time to adjust parameters, and the network parameter model is saved automatically at a set step interval for later use.
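A minimal sketch of this setup follows, reusing the HeadPoseNet and total_loss sketches above; the decay step, decay factor, epoch count, and checkpoint interval are assumptions (the patent fixes only RMSProp and the initial learning rate of 0.01).

```python
import torch

model = HeadPoseNet()                              # sketch model from above
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(60):
    # ... iterate batches: forward, total_loss(...), backward, optimizer.step() ...
    scheduler.step()                               # learning rate steps down
    if epoch % 10 == 0:                            # save at a set step interval
        torch.save(model.state_dict(), f"headpose_epoch{epoch}.pth")
```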
According to the embodiments of the present invention, a validation set split from the depth image sample data is used to verify the generalization and accuracy of the head pose estimation model, so that model performance during training can be monitored in real time, further improving the accuracy and generalization capability of the model.
The head pose estimation apparatus provided by the embodiments of the present invention is described below; the head pose estimation apparatus described below and the head pose estimation method described above may be referred to in correspondence with each other.
As shown in fig. 5, an embodiment of the present invention provides a head pose estimation apparatus, including an acquisition unit 510 and a processing unit 520.
Wherein, the obtaining unit 510 is configured to obtain depth image data.
The processing unit 520 is configured to input the depth image data into the head pose estimation model, and obtain a head pose estimation result output by the head pose estimation model.
The head pose estimation model is trained with depth image sample data as samples, and with pre-annotated or interval-partitioned comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels in one-to-one correspondence with the sample data as sample labels; the comprehensive-dimensional binary classification label characterizes the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification label and the single-dimensional regression label both characterize the spatial features of the head pose in a single dimension.
The head pose estimation apparatus provided by this embodiment of the present invention executes the above head pose estimation method; its specific implementation is consistent with that of the method and is not repeated here.
Fig. 6 illustrates a physical structure diagram of an electronic device. As shown in fig. 6, the electronic device may include a processor 610, a communication interface (Communications Interface) 620, a memory 630, and a communication bus 640, wherein the processor 610, the communication interface 620, and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the head pose estimation method, comprising: acquiring depth image data; and inputting the depth image data into a head pose estimation model to obtain a head pose estimation result output by the head pose estimation model; wherein the head pose estimation model is trained with depth image sample data as samples, and with pre-annotated or interval-partitioned comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels in one-to-one correspondence with the sample data as sample labels; the comprehensive-dimensional binary classification label characterizes the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification label and the single-dimensional regression label both characterize the spatial features of the head pose in a single dimension.
It should be noted that, in this embodiment, the electronic device may be a server, a PC, or other devices in the specific implementation, so long as the structure of the electronic device includes a processor 610, a communication interface 620, a memory 630, and a communication bus 640 as shown in fig. 6, where the processor 610, the communication interface 620, and the memory 630 complete communication with each other through the communication bus 640, and the processor 610 may call logic instructions in the memory 630 to execute the above method. The embodiment does not limit a specific implementation form of the electronic device.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Further, embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the head pose estimation method provided by the above method embodiments, the method comprising: acquiring depth image data; and inputting the depth image data into a head pose estimation model to obtain a head pose estimation result output by the head pose estimation model; wherein the head pose estimation model is trained with depth image sample data as samples, and with pre-annotated or interval-partitioned comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels in one-to-one correspondence with the sample data as sample labels; the comprehensive-dimensional binary classification label characterizes the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification label and the single-dimensional regression label both characterize the spatial features of the head pose in a single dimension.
In another aspect, embodiments of the present invention further provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the head pose estimation method provided by the above embodiments, the method comprising: acquiring depth image data; and inputting the depth image data into a head pose estimation model to obtain a head pose estimation result output by the head pose estimation model; wherein the head pose estimation model is trained with depth image sample data as samples, and with pre-annotated or interval-partitioned comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels in one-to-one correspondence with the sample data as sample labels; the comprehensive-dimensional binary classification label characterizes the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification label and the single-dimensional regression label both characterize the spatial features of the head pose in a single dimension.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A head pose estimation method, comprising:
acquiring depth image data;
inputting the depth image data into a head posture estimation model to obtain a head posture estimation result output by the head posture estimation model;
wherein the head pose estimation model is trained with depth image sample data as samples, and with pre-annotated or interval-partitioned comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels in one-to-one correspondence with the depth image sample data as sample labels; the comprehensive-dimensional binary classification label is used to characterize the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification labels and the single-dimensional regression labels are used to characterize the spatial features of the head pose in a single dimension;
the determination process of the head pose estimation model further comprises:
acquiring a three-dimensional pose angle label corresponding to each item of depth image sample data, the three-dimensional pose angle label comprising a Yaw angle, a Pitch angle, and a Roll angle;
determining the comprehensive-dimensional large/small-angle binary classification label based on comparison of the three-dimensional pose angle label with a pose-angle threshold set for each dimension;
determining the single-dimensional pose multi-classification labels based on interval partitioning of the three-dimensional pose angle label; and
determining the single-dimensional regression labels based on normalization of the three-dimensional pose angle label.
2. The head pose estimation method according to claim 1, wherein the head pose estimation model comprises a feature extraction layer, a single-dimensional pose layer, and a comprehensive-dimensional pose layer;
inputting the depth image data into the head pose estimation model to obtain the head pose estimation result output by the head pose estimation model comprises:
inputting the depth image data into the feature extraction layer to obtain a single-dimensional pose feature map and a multi-dimensional pose feature map;
inputting the single-dimensional pose feature map into the single-dimensional pose layer to obtain single-dimensional pose multi-classification information and single-dimensional regression information;
inputting the multi-dimensional pose feature map into the comprehensive-dimensional pose layer to obtain comprehensive-dimensional large/small-angle binary classification information; and
determining the head pose estimation result based on the single-dimensional pose multi-classification information, the single-dimensional regression information, and the comprehensive-dimensional binary classification information;
wherein the determination process of the head pose estimation model comprises:
training the single-dimensional pose layer with single-dimensional pose sample feature maps as samples and predetermined corresponding single-dimensional pose multi-classification labels and single-dimensional regression labels as sample labels; and
training the comprehensive-dimensional pose layer with multi-dimensional pose sample feature maps as samples and predetermined corresponding comprehensive-dimensional large/small-angle binary classification labels as sample labels.
3. The head pose estimation method according to claim 1, wherein the acquiring depth image data comprises:
collecting an original depth image;
and performing face detection on the original depth image, removing redundant background areas, and determining the depth image data.
4. The head pose estimation method according to claim 1, wherein the head pose estimation model is trained using a total loss function, the total loss function being determined based on a single-dimensional pose loss function of a single-dimensional pose layer of the head pose estimation model and a comprehensive dimensional pose loss function of a comprehensive dimensional pose layer of the head pose estimation model.
5. The head pose estimation method according to claim 4, wherein determining the total loss function based on the single-dimensional pose loss functions of the single-dimensional pose layer of the head pose estimation model and the comprehensive-dimensional pose loss function of the comprehensive-dimensional pose layer of the head pose estimation model comprises applying the formula

L_total = L_yaw_total + L_pitch_total + L_roll_total + α·L_cls

to determine the total loss function, where L_total is the total loss function; L_yaw_total, L_pitch_total, and L_roll_total are the single-dimensional pose loss functions for the Yaw, Pitch, and Roll angles, respectively; L_cls is the comprehensive-dimensional pose loss function; and α is the weight of L_cls.
6. The head pose estimation method according to any one of claims 1-5, wherein the process of determining the head pose estimation model further comprises:
extracting a validation set from the depth image sample data according to a preset sampling strategy; and
dynamically adjusting the learning rate with an RMSProp optimizer, and using the validation set to verify the generalization and accuracy of the head pose estimation model, thereby determining the head pose estimation model.
7. A head pose estimation device, comprising:
an acquisition unit configured to acquire depth image data;
the processing unit is used for inputting the depth image data into a head posture estimation model to obtain a head posture estimation result output by the head posture estimation model;
wherein the head pose estimation model is trained with depth image sample data as samples, and with predetermined comprehensive-dimensional large/small-angle binary classification labels, single-dimensional pose multi-classification labels, and single-dimensional regression labels corresponding to the depth image sample data as sample labels; the comprehensive-dimensional binary classification label is used to characterize the spatial features of the head pose across multiple dimensions; the single-dimensional pose multi-classification labels and the single-dimensional regression labels are used to characterize the spatial features of the head pose in a single dimension;
the determination process of the head pose estimation model further comprises:
acquiring a three-dimensional pose angle label corresponding to each item of depth image sample data, the three-dimensional pose angle label comprising a Yaw angle, a Pitch angle, and a Roll angle;
determining the comprehensive-dimensional large/small-angle binary classification label based on comparison of the three-dimensional pose angle label with a pose-angle threshold set for each dimension;
determining the single-dimensional pose multi-classification labels based on interval partitioning of the three-dimensional pose angle label; and
determining the single-dimensional regression labels based on normalization of the three-dimensional pose angle label.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the head pose estimation method according to any of claims 1 to 6 when the program is executed.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the head pose estimation method according to any of claims 1 to 6.
CN111695438B (application CN202010431119.XA): Head pose estimation method and device. Priority date 2020-05-20; filing date 2020-05-20. Status: Active.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010431119.XA  2020-05-20  2020-05-20  Head pose estimation method and device (CN111695438B)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010431119.XA  2020-05-20  2020-05-20  Head pose estimation method and device (CN111695438B)

Publications (2)

Publication Number Publication Date
CN111695438A (en)  2020-09-22
CN111695438B (en)  2023-08-04

Family

ID=72478042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010431119.XA  Head pose estimation method and device  2020-05-20  2020-05-20  (Active; granted as CN111695438B)

Country Status (1)

Country Link
CN (1) CN111695438B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034017A (en) * 2018-07-12 2018-12-18 北京华捷艾米科技有限公司 Head pose estimation method and machine readable storage medium
CN109977757A (en) * 2019-01-28 2019-07-05 电子科技大学 A kind of multi-modal head pose estimation method based on interacting depth Recurrent networks
CN110119148A (en) * 2019-05-14 2019-08-13 深圳大学 A kind of six-degree-of-freedom posture estimation method, device and computer readable storage medium
CN110197125A (en) * 2019-05-05 2019-09-03 上海资汇信息科技有限公司 Face identification method under unconfined condition
CN110427849A (en) * 2019-07-23 2019-11-08 深圳前海达闼云端智能科技有限公司 Face pose determination method and device, storage medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8687880B2 (en) * 2012-03-20 2014-04-01 Microsoft Corporation Real time head pose estimation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034017A (en) * 2018-07-12 2018-12-18 北京华捷艾米科技有限公司 Head pose estimation method and machine readable storage medium
CN109977757A (en) * 2019-01-28 2019-07-05 电子科技大学 A kind of multi-modal head pose estimation method based on interacting depth Recurrent networks
CN110197125A (en) * 2019-05-05 2019-09-03 上海资汇信息科技有限公司 Face identification method under unconfined condition
CN110119148A (en) * 2019-05-14 2019-08-13 深圳大学 A kind of six-degree-of-freedom posture estimation method, device and computer readable storage medium
CN110427849A (en) * 2019-07-23 2019-11-08 深圳前海达闼云端智能科技有限公司 Face pose determination method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fuxun Gao et al., "Head Pose Estimation with Siamese Convolutional Neural Network", IEEE Xplore; full text *

Also Published As

Publication number Publication date
CN111695438A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN109815826B (en) Method and device for generating face attribute model
CN106228185B (en) A kind of general image classifying and identifying system neural network based and method
EP3084682B1 (en) System and method for identifying faces in unconstrained media
CN108701234A (en) Licence plate recognition method and cloud system
CN109033978B (en) Error correction strategy-based CNN-SVM hybrid model gesture recognition method
WO2023151237A1 (en) Face pose estimation method and apparatus, electronic device, and storage medium
CN113095333B (en) Unsupervised feature point detection method and unsupervised feature point detection device
CN108124489B (en) Information processing method, apparatus, cloud processing device and computer program product
CN111274978B (en) Micro expression recognition method and device
CN109858454B (en) Adaptive kernel correlation filtering tracking method based on dual models
CN112258557B (en) Visual tracking method based on space attention feature aggregation
CN108492301A (en) A kind of Scene Segmentation, terminal and storage medium
CN112016454A (en) Face alignment detection method
JP2023545052A (en) Image processing model training method and device, image processing method and device, electronic equipment, and computer program
CN109740674A (en) A kind of image processing method, device, equipment and storage medium
CN114118303B (en) Face key point detection method and device based on prior constraint
CN110765843A (en) Face verification method and device, computer equipment and storage medium
CN112509154B (en) Training method of image generation model, image generation method and device
CN111695438B (en) Head pose estimation method and device
JP2017033556A (en) Image processing method and electronic apparatus
CN111860054A (en) Convolutional network training method and device
CN112053384B (en) Target tracking method based on bounding box regression model
CN115661618A (en) Training method of image quality evaluation model, image quality evaluation method and device
CN115620082A (en) Model training method, head posture estimation method, electronic device, and storage medium
CN109871867A (en) A kind of pattern fitting method of the data characterization based on preference statistics

Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
TA01  Transfer of patent application right
      Effective date of registration: 2022-06-29
      Address after: Room 611-217, R & D center building, China (Hefei) international intelligent voice Industrial Park, 3333 Xiyou Road, high tech Zone, Hefei City, Anhui Province
      Applicant after: Hefei lushenshi Technology Co.,Ltd.
      Address before: Room 3032, gate 6, block B, 768 Creative Industry Park, 5 Xueyuan Road, Haidian District, Beijing 100083
      Applicant before: BEIJING DILUSENSE TECHNOLOGY CO.,LTD.
      Applicant before: Hefei lushenshi Technology Co.,Ltd.
GR01  Patent grant