CN115147472A - Head pose estimation method, system, device, medium, and vehicle - Google Patents

Head pose estimation method, system, device, medium, and vehicle

Info

Publication number
CN115147472A
Authority
CN
China
Prior art keywords
monocular image
semantic
model
head pose
feature extraction
Prior art date
Legal status
Pending
Application number
CN202110335574.4A
Other languages
Chinese (zh)
Inventor
徐欣奕
刘鹏
边宁
郑睿姣
Current Assignee
Dongfeng Motor Corp
Uisee Technologies Beijing Co Ltd
Original Assignee
Dongfeng Motor Corp
Uisee Technologies Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Dongfeng Motor Corp, Uisee Technologies Beijing Co Ltd
Priority to CN202110335574.4A
Publication of CN115147472A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/50: Depth or shape recovery
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10028: Range image; Depth image; 3D point clouds
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30196: Human being; Person
    • G06T2207/30201: Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a head pose estimation method, system, device, medium, and vehicle. The method includes: acquiring a monocular image; determining semantic features through a feature extraction model based on the monocular image, wherein the feature extraction model is obtained through model training; acquiring a stereo reference model; and determining a head pose of a user based on the stereo reference model and the semantic features. According to this technical scheme, the head pose estimation method operates directly on a single monocular image: semantic features are obtained with the feature extraction model, and the user's head pose is then determined in combination with the stereo reference model, so the computational complexity is low and the computational efficiency is improved. At the same time, the method is applicable to many different scenes; it has wide applicability, is little constrained by the scene, is little affected by the application scene, and maintains high accuracy in each scene.

Description

Head pose estimation method, system, device, medium, and vehicle
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a head pose estimation method, system, device, medium, and vehicle.
Background
Head pose estimation obtains the pose angle of the head from a face image.
Existing head pose estimation methods fall mainly into two categories. The first uses a depth sensor device to obtain three-dimensional information and estimates the head pose from it; however, the device is complex and the data processing is cumbersome, and because mainstream image acquisition devices generally capture two-dimensional images, this method has poor universality.
The second detects, tracks, and predicts key feature points in collected multi-view images or video sequences to estimate the head pose. This method is relatively universal, but when the pose angle changes greatly its accuracy is poor and its computational efficiency is low.
Disclosure of Invention
To solve the above technical problems, or at least partially solve them, the present disclosure provides a head pose estimation method, system, device, medium, and vehicle.
The present disclosure provides a head pose estimation method based on monocular images, comprising:
acquiring a monocular image;
determining semantic features through a feature extraction model based on the monocular image; wherein the feature extraction model is obtained by model training;
acquiring a stereo reference model;
determining a head pose of the user based on the stereo reference model and the semantic features.
The present disclosure also provides a head pose estimation system based on monocular images, comprising:
the monocular image acquiring module is used for acquiring a monocular image;
the semantic feature extraction module is used for determining semantic features through a feature extraction model based on the monocular image; wherein the feature extraction model is obtained through model training;
the reference model acquisition module is used for acquiring a stereo reference model;
a head pose determination module to determine a head pose of the user based on the stereo reference model and the semantic features.
The present disclosure also provides an electronic device, including:
a memory and one or more processors;
wherein the memory is communicatively coupled to the one or more processors and stores instructions executable by the one or more processors, the instructions, when executed by the one or more processors, causing the electronic device to implement any of the above methods.
The present disclosure also provides a computer-readable storage medium having stored thereon computer-executable instructions, which, when executed by a computing device, may be used to implement the method of any of the above.
The present disclosure also provides a vehicle that applies any one of the above methods to estimate the head pose of a driver based on monocular images, or that includes any one of the above systems.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
according to the head posture estimation method, the head posture estimation system, the head posture estimation equipment, the head posture estimation medium and the head posture estimation vehicle, the semantic features can be output by acquiring the monocular image and taking the acquired monocular image as the input of the feature extraction model; the head posture of the user can be determined by further combining the three-dimensional reference model, the operation complexity is low, and the operation efficiency is favorably improved; meanwhile, the method is applicable to various different scenes, and has the advantages of wide applicability, small limitation by scenes, small influence of application scenes on operation results, and high accuracy under each scene.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below; obviously, those skilled in the art can derive other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a head pose estimation method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a monocular imaging process provided by an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a 3D standard model provided in an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a head pose angle provided by an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a head pose estimation system provided in an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of another head pose estimation system provided in the embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
The technical scheme provided by the embodiments of the present disclosure can be applied to any scene in which the user's head pose needs to be estimated, including but not limited to human-computer interaction, virtual reality, and intelligent driving. In implementation, monocular image information including the user's head can be acquired with an image acquisition device such as a camera or video camera, and the user's head pose is estimated in combination with the camera parameters, the stereo reference model, and the like.
In some embodiments, the solution may be applied to a robotaxi (autonomous taxi): a vehicle-mounted monocular camera in the robotaxi acquires a monocular image including the head of the driver (e.g., a safety officer), and the acquired monocular image is used as the input of the feature extraction model, which outputs semantic features; in combination with the stereo reference model, the driver's head pose is then obtained, so that the driver's state can be monitored.
The method for estimating the head pose based on the monocular image provided by the embodiment of the present disclosure is exemplarily described below with reference to fig. 1 to 4.
In some embodiments, fig. 1 is a schematic flow chart of a head pose estimation method provided by an embodiment of the present disclosure.
In S110, a monocular image is acquired.
A monocular image is a two-dimensional plane image captured by a monocular camera; its imaging principle is shown in fig. 2. The monocular image includes the user's head region, from which the user's head pose is determined.
In this step, a monocular image acquisition module may be used to acquire a two-dimensional plane image.
In S120, semantic features are determined by the feature extraction model based on the monocular image.
Wherein the feature extraction model is obtained through model training; its input may be a monocular image and its output may be semantic features.
The semantic features describe key parts of the user's head as they appear in the monocular image, and may include the contour of an identified target and the corresponding semantic tag, where the target may be a facial part corresponding to the head region (also referred to as the facial region) in the monocular image, as exemplified later.
In the step, the monocular image is input into the feature extraction model, and semantic features representing key parts of the head of the user are obtained.
In S130, a stereo reference model is acquired.
The stereo reference model may also be referred to as a 3D face standard model. It includes head reference parts corresponding to the head key parts described by the semantic features, representing the same parts of the user's head; each head reference part may be characterized by a reference feature. See fig. 3 and fig. 4, described by example below.
In S140, a head pose of the user is determined based on the stereo reference model and the semantic features.
The user's head pose may be represented by a head pose angle, i.e., the rotation angle of the user's head from a reference position in a three-dimensional coordinate system; see fig. 4.
The head key parts described by the semantic features are spatially associated with the corresponding parts in the stereo reference model, and the user's head pose can be determined by solving the spatial conversion relationship between them.
In this step, the head pose of the user may be determined using a head pose determination module.
According to the monocular-image-based head pose estimation method provided by the embodiments of the present disclosure, a monocular image is acquired and used as the input of the feature extraction model, which outputs semantic features; the user's head pose can then be determined in combination with the stereo reference model. No three-dimensional information, multi-view images, or video sequences are needed: the head pose is determined from a two-dimensional plane image, so the computational complexity is low and the computational efficiency is improved. At the same time, the method is applicable to many different scenes, has wide applicability, is little constrained by the scene, is little affected by the application scene, and maintains high accuracy in each scene.
In some embodiments, before extracting the semantic features based on the monocular image by using the feature extraction model, training the feature extraction model may be further included to obtain the feature extraction model capable of accurately identifying the semantic features in the monocular image. The process may specifically comprise the steps of:
acquiring monocular image samples and semantic feature samples; wherein the monocular image samples comprise samples at a plurality of different perspectives, including samples at a frontal perspective;
adding semantic annotations to the monocular image samples, labeling the same target separately in the samples at different perspectives;
and training the feature extraction model based on the semantically annotated monocular image samples and the semantic feature samples to obtain the trained feature extraction model.
Specifically, a monocular image sample includes the user's head region, and in particular may include head key parts. On this basis, the head key parts in the monocular image sample are labeled, i.e., semantic annotations are added, and the feature extraction model is trained with the annotated monocular image samples and the semantic feature samples, so that the trained model can accurately identify the semantic features in a monocular image.
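The patent does not prescribe a model architecture or training framework, so the following is only a minimal training sketch: it assumes a small convolutional network that regresses the 68 facial key feature points, with random placeholder tensors standing in for the annotated multi-view samples (all names, shapes, and hyperparameters are illustrative).

```python
import torch
import torch.nn as nn

class LandmarkNet(nn.Module):
    """Small CNN regressing (x, y) for 68 facial key feature points."""
    def __init__(self, num_points=68):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_points * 2)

    def forward(self, x):
        return self.head(self.backbone(x).flatten(1))

model = LandmarkNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Placeholder batch standing in for the labeled multi-view monocular
# image samples (including the frontal-view samples) and their targets.
images = torch.randn(8, 3, 128, 128)
landmarks = torch.randn(8, 68 * 2)

for step in range(10):
    optimizer.zero_grad()
    loss = criterion(model(images), landmarks)
    loss.backward()
    optimizer.step()
```

In practice the placeholder batch would be replaced by the annotated monocular image samples described above, with the multi-angle contour annotations converted into regression (or segmentation) targets.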
In the prior art there is no method that performs head pose calculation based on semantic features. The reason is as follows. In general, a 2D image (monocular image) is input into a feature extraction model trained with semantic feature annotations, and the model extracts the targets in the 2D image and their corresponding semantic features; for example, it identifies the contour of an ear in the 2D image and its semantic tag "ear". The center coordinates of a target's contour can then represent the target in the 2D image and be matched with the center coordinates of the same target in the 3D model (stereo reference model). However, the captured 2D image is often not a front view; for example, the head of the driver or passenger may not be facing forward. In that case the target identified in the 2D image is deflected, and its center coordinates differ from the center of the undeflected target, which introduces a deviation into the final head pose calculation. This is why the prior art has no method for determining head pose based on semantic feature recognition.
To solve the problem that target deflection in the 2D image causes errors in the final head pose, the same target is labeled multiple times when annotating a monocular image sample: contours of the same target at different angles are labeled (the specific angles need not be known during labeling), and these labels include a contour without head deflection. After the feature extraction model identifies a target in the 2D image, i.e., obtains the target's initial contour, the identified contour is corrected based on the target's undeflected contour (the reference contour) to obtain a corrected contour, and the center coordinates of the corrected contour are determined. Because contours at multiple angles are labeled for each target, target matching precision improves once the 2D image is input into the feature extraction model; moreover, a center coordinate position in the world coordinate system can be obtained from the center coordinates of the corrected contour and related, through a spatial position conversion, to the center coordinates of the same target in the 3D reference model to obtain the head pose of the target.
On this basis, the spatial position of the user's head can be accurately located, making the estimated head pose more accurate.
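The patent does not give an explicit algorithm for this correction. One plausible reading, sketched below purely as an assumption, fits a 2D similarity transform (Umeyama's method) from the frontal reference contour to the reliably detected points of the observed contour, maps the whole reference contour through it, and takes the mapped contour's center as the corrected center; the function names and the choice of stable points are illustrative.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2D similarity transform mapping src onto dst
    (Umeyama's method): returns (s, R, t) with dst_i ~ s * R @ src_i + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    S, D = src - mu_s, dst - mu_d
    U, sig, Vt = np.linalg.svd(D.T @ S / len(src))
    d = np.ones(2)
    if np.linalg.det(U @ Vt) < 0:       # guard against a reflection
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt
    s = (sig * d).sum() * len(src) / (S ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def corrected_center(initial, reference, stable_idx):
    """Correct a possibly deflected contour: map the frontal reference
    contour onto the observed one using only the stably detected points,
    then take the mapped contour's mean as the corrected center."""
    s, R, t = fit_similarity(reference[stable_idx], initial[stable_idx])
    corrected = (s * (R @ reference.T)).T + t
    return corrected, corrected.mean(axis=0)
```

Here `initial` and `reference` are (N, 2) arrays of corresponding contour points and `stable_idx` selects the points trusted under deflection; with this reading, the corrected center can differ from the naive centroid of the observed contour whenever only a subset of the points is used for the fit.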
In some embodiments, the semantic tags may include eyes, mouth, nose, chin, ears, and ear roots, which correspond to the respective parts of the face in the image.
Correspondingly, a semantic feature sample may include an eye, a mouth, a nose, a chin, an ear, an ear root, and the corresponding target contours, and the reference features corresponding to the semantic features may include the same.
In other embodiments, the semantic tags in the semantic features may further include tags corresponding to other features of the head or face of the user, and the semantic feature sample and the reference feature are correspondingly set, which is not limited herein.
In some embodiments, the semantic features and the reference features are associated with information of corresponding facial key feature points; the number of facial key feature points is 68, as shown in fig. 3.
The user's facial key feature points characterize the user's head pose and include, but are not limited to, overall facial contour points, contour points of the facial features, and other convex-concave feature points. The information of the facial key feature points may be their coordinate positions in a world coordinate system.
For example, the coordinate position associated with a semantic feature is the actual position on the user's head; it may be obtained by converting the coordinate position in the two-dimensional rectangular coordinate system of the monocular image into a coordinate position in the three-dimensional camera coordinate system, specifically by combining that two-dimensional coordinate position with the camera parameters (see Equation 2 below).
For example, the coordinate position associated with a reference feature is a reference position of the user's head, obtained from the three-dimensional coordinate system of the stereo reference model.
On this basis, the coordinate position associated with the semantic feature and the coordinate position associated with the reference feature are converted into one another; that is, the actual position of the user's head and the reference position are related through a conversion relationship, from which the user's head pose is determined.
The number of the user's facial key feature points may be 68 (see fig. 3), so that each head key part can be accurately located. In other embodiments, more facial key feature points may be used to improve the estimation accuracy of the head pose, or fewer may be used to reduce the computational complexity and improve the computational efficiency; the number can be set as required and is not limited here.
Illustratively, with continued reference to fig. 3, the information of the facial key feature points is associated with the semantic features and may correspond to facial parts including the eye corners, the mouth corners, the nose tip, the left and right end points of the nasal alae, the chin tip, the ears, and the ear roots, so that the eyes, mouth, nose, chin, ears, ear roots, and other parts can be accurately located.
In this way, the user's face can be accurately located, covering not only the facial features but also the overall facial contour, the ears, the ear roots, and other positions, which improves the accuracy of the head pose estimation.
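For reference, one widely used 68-point annotation scheme (the iBUG layout) groups the indices as below; the patent's fig. 3 shows 68 points but does not name a scheme, so this mapping is an assumption, and note that this particular layout has no points on the ears or ear roots, which the patent additionally labels.

```python
# Index ranges of the 68 facial key feature points under the assumed
# iBUG 68-point layout (illustrative; the patent does not name a scheme).
FACE_68_REGIONS = {
    "jaw":         range(0, 17),   # overall facial contour
    "right_brow":  range(17, 22),
    "left_brow":   range(22, 27),
    "nose_bridge": range(27, 31),
    "nose_lower":  range(31, 36),  # nose tip and left/right alar end points
    "right_eye":   range(36, 42),  # includes the eye corners
    "left_eye":    range(42, 48),
    "outer_mouth": range(48, 60),  # includes the mouth corners
    "inner_mouth": range(60, 68),
}
```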
On the basis of the above embodiments, before the monocular image is acquired with the monocular camera, the monocular camera may be calibrated, and the calibrated camera parameters applied to the conversion between the two-dimensional rectangular coordinate system and the three-dimensional coordinate system; this ensures accurate coordinate conversion and thus a high estimation accuracy of the head pose estimation method.
Thus, in some embodiments, on the basis of fig. 1, before S110, the method may further include:
calibrating a monocular camera offline;
and acquiring calibrated camera parameters.
Specifically, a camera in the monocular camera is calibrated to obtain calibrated camera parameters.
Illustratively, the camera parameters may include intrinsic parameters of the camera, including but not limited to the focal length of the camera, the optical center of the image (which may be simply referred to as "optical center"), and distortion parameters, including but not limited to the radial distortion coefficient and the tangential distortion coefficient of the camera.
The off-line calibration of the monocular camera may be performed in any manner known to those skilled in the art, which is neither described nor limited herein.
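As an illustration, offline calibration is commonly performed with a chessboard pattern and OpenCV; in the sketch below the board size and image paths are assumptions, and at least one usable calibration image is assumed to exist.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)  # inner corners per chessboard row/column (assumed board)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):   # assumed location of board images
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# camera_matrix holds f_x, f_y, c_x, c_y; dist_coeffs holds the radial
# and tangential distortion coefficients mentioned above.
rms, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```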
In the above embodiment, after the monocular image is acquired, the feature extraction model may be used to determine the face region information first, and then further determine the semantic features.
In some embodiments, based on fig. 1, S120 may specifically include:
determining facial region information of the user using a first layer in the feature extraction model based on the monocular image;
and based on the face region information, recognizing the key parts of the face of the user by utilizing a second layer in the feature extraction model, and determining semantic features.
Wherein the shape of the face region is rectangular, circular or elliptical.
Specifically, face detection can be performed on the monocular image by using the first layer in the feature extraction model, so that face region information of the user is obtained. For example, the face area may be represented by a rectangular frame, and may also be represented by other shapes such as a circle or an ellipse.
The first layer in the feature extraction model may be implemented based on a face detection algorithm, which is to perform user face detection on a monocular image by using a face classifier, and includes but is not limited to using a conventional visual algorithm, a deep learning algorithm, or other algorithms known to those skilled in the art.
The face region information is the two-dimensional coordinate information of the user's face region in the current monocular image coordinate system, such as the top-left vertex coordinates plus the length and width of a rectangular frame, the center and radius of a circle, or the positioning and representation information of another shape.
Wherein the second layer in the feature extraction model can be implemented based on a visual algorithm or a deep learning algorithm. Correspondingly, based on the face region information, key parts of the face can be automatically identified and positioned by utilizing a visual algorithm or a deep learning algorithm, so that semantic features are obtained, and associated position information is obtained.
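A minimal two-stage sketch, assuming off-the-shelf components that the patent does not mandate: an OpenCV Haar-cascade classifier as the first layer (face detection) and dlib's pretrained 68-point shape predictor as the second layer (key part localization); the model file name is the one dlib distributes.

```python
import cv2
import dlib

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
landmark_model = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_semantic_features(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # First layer: face region as (top-left x, top-left y, width, height).
    faces = face_detector.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Second layer: locate the 68 facial key feature points in the region.
    rect = dlib.rectangle(int(x), int(y), int(x + w), int(y + h))
    shape = landmark_model(gray, rect)
    return [(p.x, p.y) for p in shape.parts()]
```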
In the above embodiments, the user's head pose may be determined by solving the projective transformation relationship among the world coordinate system, the camera coordinate system, and the image coordinate system, i.e., converting between the coordinate positions associated with the semantic features and those associated with the reference features to obtain the conversion relationship between them. Specifically, this may include the following steps:
based on the camera parameters, the information of the key facial feature points and the three-dimensional point information corresponding to the key facial feature points one by one, solving a projection transformation relation among a world coordinate system, a camera coordinate system and an image coordinate system;
determining a head pose of the user based on the projective transformation relationship;
the camera parameters comprise focal lengths of the monocular camera in the X direction and the Y direction, positions of a lens optical center in the monocular camera in the X direction and the Y direction, and scale parameters during coordinate conversion.
Specifically, by comparing the information of the key facial feature points with the stereo reference model, three-dimensional point information corresponding to the key facial feature points in the stereo reference model can be determined, and a feature point set corresponding to the key facial feature points of the user in the stereo reference model is obtained, wherein the feature point set is a three-dimensional point set based on the stereo reference model.
The conversion between the feature point set and the information of the facial key feature points is realized through the projective transformation relationship, which can be understood in conjunction with the monocular imaging principle shown in fig. 2. The projective transformation relationship represents the coordinate conversions among the world coordinate system, the camera coordinate system, and the image coordinate system, and the resulting rotation angles can characterize the user's head pose, thereby realizing its estimation.
In some embodiments, with continued reference to fig. 2, the "solving a projective transformation relationship between the world coordinate system, the camera coordinate system, and the image coordinate system" may specifically include:
solving the projective transformation relationship among the world coordinate system, the camera coordinate system, and the image coordinate system based on Equation 1 and Equation 2:
$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + t \tag{1}$$

$$s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} \tag{2}$$
wherein $(x, y)$ represents the facial key feature point information; $(X_c, Y_c, Z_c)$ represents the position of the corresponding point in the monocular camera coordinate system; $(X_w, Y_w, Z_w)$ represents the corresponding three-dimensional point information; $R$ represents a rotation matrix; $t$ represents a translation vector; $f_x$ and $f_y$ respectively represent the focal lengths of the monocular camera in the X and Y directions; $c_x$ and $c_y$ respectively represent the positions of the lens optical center in the X and Y directions; and $s$ represents the scale parameter used during coordinate conversion;
wherein the camera parameters include $f_x$, $f_y$, $c_x$, $c_y$, and $s$;
the head pose of the user is an Euler angle determined based on the rotation matrix $R$, including a pitch angle, a yaw angle, and a roll angle.
Setting the camera parameters used in the calculation to include the focal lengths of the monocular camera in the X and Y directions, the positions of the lens optical center in the X and Y directions, and the scale parameter used during coordinate conversion allows the user's head pose to be restored from the monocular image more accurately by accounting for the camera's characteristics, thereby improving the calculation accuracy of the head pose.
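As a concrete illustration of solving Equations 1 and 2, the sketch below uses OpenCV's solvePnP as a stand-in for the solver. The six 3D points are generic head-model coordinates often used in head pose examples, standing in for the stereo reference model, and the 2D points and camera parameters are illustrative assumptions.

```python
import cv2
import numpy as np

# Generic 3D head-model points (assumed units; not the patent's model).
model_points = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye outer corner
    (225.0, 170.0, -135.0),    # right eye outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
])
# Matching 2D facial key feature points taken from the semantic features
# (illustrative pixel coordinates).
image_points = np.array([
    (359.0, 391.0), (399.0, 561.0), (337.0, 297.0),
    (513.0, 301.0), (345.0, 465.0), (453.0, 469.0),
])

fx = fy = 640.0            # focal lengths from the offline calibration
cx, cy = 320.0, 240.0      # optical center from the offline calibration
camera_matrix = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=float)
dist_coeffs = np.zeros(4)  # assume distortion already corrected

ok, rvec, tvec = cv2.solvePnP(model_points, image_points,
                              camera_matrix, dist_coeffs)
R, _ = cv2.Rodrigues(rvec)        # the rotation matrix R of Equation 1
angles, *_ = cv2.RQDecomp3x3(R)   # Euler angles in degrees
pitch, yaw, roll = angles
```

cv2.solvePnP's default iterative mode initializes with a linear solve and refines it, which is consistent with the direct linear transformation approach described below.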
In other embodiments, in order to further reduce the influence of the camera performance on the coordinate transformation and improve the operation accuracy, the camera parameters may further include other parameters that affect the image effect of the monocular image, such as a radial distortion coefficient and a tangential distortion coefficient, which are not limited herein.
In conjunction with fig. 4, the pitch angle represents the rotation angle about the X axis, denoted pitch; the yaw angle, also known as the heading angle, represents the rotation angle about the Y axis, denoted yaw; and the roll angle represents the rotation angle about the Z axis, denoted roll.
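For completeness, under one common decomposition convention (an assumption; the patent does not fix an order) in which $R = R_z(\theta_{roll})\,R_y(\theta_{yaw})\,R_x(\theta_{pitch})$ and $r_{ij}$ is the entry of $R$ in row $i$, column $j$, the three angles can be recovered as

$$\theta_{pitch} = \operatorname{atan2}(r_{32}, r_{33}), \qquad \theta_{yaw} = \operatorname{atan2}\!\left(-r_{31}, \sqrt{r_{32}^2 + r_{33}^2}\right), \qquad \theta_{roll} = \operatorname{atan2}(r_{21}, r_{11}).$$

The decomposition degenerates near $\theta_{yaw} = \pm 90°$ (gimbal lock), which practical implementations treat as a special case.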
In other embodiments, when the coordinate conversion is implemented by using other coordinate systems, the head pose of the user may also be represented by using other manners, which is neither described nor limited herein.
In some embodiments, solving a projective transformation relationship between the world coordinate system, the camera coordinate system, and the image coordinate system comprises:
solving, with a direct linear transformation method, the projective transformation relationship among the world coordinate system, the camera coordinate system, and the image coordinate system based on Equation 1 and Equation 2.
Solving the projective transformation relationship with the direct linear transformation method is simple, with low computational difficulty and high computational efficiency.
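A minimal from-scratch sketch of this direct linear transformation, under stated assumptions (at least six well-distributed, non-coplanar 3D-2D correspondences, a calibrated camera matrix K, and no cheirality check): normalizing the pixel coordinates with the inverse of K reduces the unknown projection to [R|t], the homogeneous system is solved by SVD, and the rotation is re-orthogonalized.

```python
import numpy as np

def dlt_pose(points_3d, points_2d, K):
    """Solve [R|t] linearly from >= 6 3D-2D correspondences (a textbook
    DLT formulation, assumed here; the patent gives no explicit listing)."""
    # Normalize pixel coordinates with K^-1 so the projection reduces to [R|t].
    pts = np.column_stack([points_2d, np.ones(len(points_2d))])
    norm = (np.linalg.inv(K) @ pts.T).T

    # Each correspondence contributes two linear equations in the 12
    # entries of the 3x4 matrix M = [R|t].
    A = []
    for (X, Y, Z), (u, v, _) in zip(np.asarray(points_3d), norm):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    M = Vt[-1].reshape(3, 4)          # least-squares solution, up to scale

    # Fix the overall scale and sign, then project onto the rotation group.
    scale = np.cbrt(np.linalg.det(M[:, :3]))
    M = M / scale
    U, _, Vt = np.linalg.svd(M[:, :3])
    return U @ Vt, M[:, 3]            # Equation 1's rotation R and translation t
```

A production implementation would also verify that the recovered pose places the points in front of the camera, and would typically refine this linear solution iteratively, as cv2.solvePnP's default mode does.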
In other embodiments, other transformation methods can be used to solve the projective transformation relationship to further improve the operation accuracy, thereby improving the accuracy of the head pose estimation method.
The embodiment of the disclosure also provides a head pose estimation system based on a monocular image, which is used for executing the steps of any one of the methods to realize corresponding effects.
In some embodiments, fig. 5 is a schematic structural diagram of a head pose estimation system provided in an embodiment of the present disclosure.
As shown in fig. 5, the monocular-image-based head pose estimation system includes: a monocular image acquisition module 210 configured to acquire a monocular image; a semantic feature extraction module 220 configured to determine semantic features through a feature extraction model based on the monocular image, wherein the feature extraction model is obtained through model training; a reference model acquisition module 230 configured to acquire a stereo reference model; and a head pose determination module 240 configured to determine the user's head pose based on the stereo reference model and the semantic features.
In this way, through the cooperation of these functional modules, the monocular-image-based head pose estimation system acquires a monocular image and uses it as the input of the feature extraction model, which outputs semantic features; the user's head pose angle can then be determined in combination with the stereo reference model. The system is not limited by the scene, can be applied on many different platforms, and has low computational complexity, high computational efficiency, and high accuracy.
In some embodiments, when the system is applied in a vehicle, the image acquisition module may comprise an on-board monocular camera, such as an on-board camera module; correspondingly, the user comprises a driver.
Therefore, the system can estimate the head posture of the driver, realize the monitoring of the posture of the driver and is beneficial to improving the driving safety.
In some embodiments, fig. 6 is a schematic structural diagram of another head pose estimation system provided by the embodiments of the present disclosure. As shown in fig. 6, the system may include:
the vehicle-mounted camera module 310 may be implemented in the form of a vehicle-mounted camera, and the vehicle-mounted camera is installed at a horizontal position above an instrument panel in the vehicle and is used for collecting image information of a driver.
The mounting position of the vehicle-mounted camera is optimal to be able to acquire a large degree of facial information of the driver, and the operation of the driver cannot be affected.
In some embodiments, the onboard camera needs to be calibrated offline. Based on this, a camera offline calibration module 340 is provided, which can calibrate the vehicle-mounted camera to obtain camera parameters.
In some embodiments, in conjunction with fig. 5 and 6, the semantic feature extraction module 220 may specifically include a face detection module 320 and a facial shape localization module 330.
The face detection module 320 may perform face detection based on the monocular image to obtain facial region information of the driver. Specifically, the face detection module 320 may be implemented based on a face detection algorithm, including performing driver detection on monocular image information by using a face classifier to obtain facial region information of a driver.
The facial shape localization module 330 may automatically locate the position information of the salient facial feature points, i.e., the semantic features and the information of the facial key feature points associated with them, according to the facial region information determined by the face detection module 320.
Specifically, the facial salient feature points include the positions of various key components constituting the face, such as eyes, mouth corners, nose tips, the contour of the face, and the like.
Where the positioning of the facial key feature points may be viewed as a regression problem, including but not limited to solving using conventional visual or deep learning algorithms.
Therefore, based on the system shown in fig. 6, the head posture of the driver can be judged, and the system has high operation efficiency and high accuracy.
Fig. 7 is a schematic structural diagram of an electronic device suitable for implementing an embodiment of the present disclosure.
As shown in fig. 7, the electronic device 500 includes a central processing unit (CPU) 501 that can execute the various processes of the embodiment shown in fig. 1 according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores the various programs and data necessary for the operation of the electronic device 500. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage section 508 as needed.
In particular, the method described above with reference to fig. 1 may be implemented as a computer software program according to an embodiment of the present disclosure. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the method of fig. 1. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the disclosed embodiment also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the embodiments of the present disclosure.
As another aspect, embodiments of the present disclosure further provide a vehicle in which any of the above embodiments may be applied to estimate the driver's head pose based on a monocular image; correspondingly, the vehicle may include any of the systems described above. Specifically, the vehicle-mounted camera module is first calibrated offline to obtain the camera parameters; next, the driver's image information is acquired through the camera, the driver's facial region information is detected, and the facial shape information is located on that basis; then, the transformation relationship matrix is solved based on the projective transformation between the different coordinate systems in the imaging process and the direct linear transformation algorithm, and the driver's head pose angle is determined. The driver's head pose is thus estimated with high computational efficiency and accuracy; at the same time, the monocular image is acquired through the vehicle-mounted camera, no image sensor needs to be worn on the driver's body, and the system is simple and convenient to use.
In summary, the embodiments of the present disclosure provide a head pose estimation method, a head pose estimation system, an electronic device, a computer-readable storage medium thereof, and a vehicle. Head posture estimation is carried out on the basis of the monocular image and the stereo reference model, so that the operation complexity is low, and the operation efficiency is high; meanwhile, the method is suitable for various different scenes, and has the advantages of wide applicability, small limitation by scenes, small influence of application scenes on operation results, and high accuracy under each scene.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. It is to be understood that the foregoing detailed description of the disclosed embodiments is merely exemplary or explanatory of the principles of the disclosed embodiments, and is not restrictive of the embodiments, as claimed. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the embodiments of the present disclosure should be included in the protection scope of the embodiments of the present disclosure. Furthermore, it is intended that the appended claims cover all such changes and modifications that fall within the scope and range of equivalents of the appended claims, or the equivalents of such scope and range.

Claims (12)

1. A method for estimating a head pose based on a monocular image is characterized by comprising the following steps:
acquiring a monocular image;
determining semantic features through a feature extraction model based on the monocular image; wherein the feature extraction model is obtained by model training;
acquiring a stereo reference model;
determining a head pose of the user based on the stereo reference model and the semantic features.
2. The method of claim 1, wherein prior to determining semantic features by a feature extraction model based on the monocular image, further comprising:
acquiring a monocular image sample and a semantic feature sample; wherein the monocular image samples comprise samples at a plurality of different perspectives, including samples at a frontal perspective;
adding semantic annotation to the monocular image sample; respectively labeling the same target in the samples under different visual angles;
and training the feature extraction model based on the monocular image sample added with the semantic annotation and the semantic feature sample to obtain the trained feature extraction model.
3. The method of claim 1, wherein deriving a head pose of a user based on the stereo reference model and the semantic features comprises:
determining a reference feature corresponding to the semantic feature in the stereo reference model based on the semantic feature;
and performing coordinate conversion based on the semantic features and the reference features to determine the head posture of the user.
4. The method of any of claims 1-3, wherein the semantic features include outlines of identified objects and corresponding semantic tags including eyes, mouth, nose, chin, ears, and ear roots.
5. The method of claim 3, wherein the semantic features and the reference features correlate information of corresponding facial key feature points;
the number of the face key feature points is 68.
6. The method of claim 1, wherein deriving a head pose of a user based on the stereo reference model and the semantic features comprises:
determining an initial contour of the identified target corresponding to the semantic features based on the semantic features;
acquiring a reference contour of the corresponding target at a frontal perspective;
correcting the initial contour based on the reference contour to obtain a corrected contour;
determining center coordinates of the contour based on the corrected contour;
determining a head pose of the user based on the center coordinates and corresponding coordinates in the stereo reference model.
7. The method of claim 1, wherein prior to acquiring the monocular image, further comprising:
calibrating a monocular camera offline;
and acquiring calibrated camera parameters.
8. A monocular image based head pose estimation system, comprising:
the monocular image acquiring module is used for acquiring a monocular image;
the semantic feature extraction module is used for determining semantic features through a feature extraction model based on the monocular image; wherein the feature extraction model is obtained by model training;
the reference model acquisition module is used for acquiring a stereo reference model;
a head pose determination module to determine a head pose of the user based on the stereo reference model and the semantic features.
9. The system of claim 8, wherein the image acquisition module comprises an onboard monocular camera and the user comprises a driver.
10. An electronic device, comprising:
a memory and one or more processors;
wherein the memory is communicatively coupled to the one or more processors and has stored therein instructions executable by the one or more processors, the electronic device being configured to implement the method of any of claims 1-6 when the instructions are executed by the one or more processors.
11. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a computing device, are operable to implement the method of any one of claims 1 to 7.
12. A vehicle characterized in that a method according to any of claims 1-7 is applied for estimating the head pose of a driver based on monocular images, or comprising a system according to claim 8 or 9.
CN202110335574.4A 2021-03-29 2021-03-29 Head pose estimation method, system, device, medium, and vehicle Pending CN115147472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110335574.4A CN115147472A (en) 2021-03-29 2021-03-29 Head pose estimation method, system, device, medium, and vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110335574.4A CN115147472A (en) 2021-03-29 2021-03-29 Head pose estimation method, system, device, medium, and vehicle

Publications (1)

Publication Number Publication Date
CN115147472A (en) 2022-10-04

Family

ID=83404734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110335574.4A Pending CN115147472A (en) 2021-03-29 2021-03-29 Head pose estimation method, system, device, medium, and vehicle

Country Status (1)

Country Link
CN (1) CN115147472A (en)

Similar Documents

Publication Publication Date Title
CN111783820B (en) Image labeling method and device
CN107230218B (en) Method and apparatus for generating confidence measures for estimates derived from images captured by vehicle-mounted cameras
CN109523597B (en) Method and device for calibrating external parameters of camera
US10515259B2 (en) Method and system for determining 3D object poses and landmark points using surface patches
WO2015161816A1 (en) Three-dimensional facial recognition method and system
US8849039B2 (en) Image processing method and system
CN109741241B (en) Fisheye image processing method, device, equipment and storage medium
CN111144207B (en) Human body detection and tracking method based on multi-mode information perception
WO2009128784A1 (en) Face expressions identification
JP2016173313A (en) Visual line direction estimation system, visual line direction estimation method and visual line direction estimation program
CN112200056B (en) Face living body detection method and device, electronic equipment and storage medium
CN113313097B (en) Face recognition method, terminal and computer readable storage medium
CN114862973B (en) Space positioning method, device and equipment based on fixed point location and storage medium
JP2022546643A (en) Image processing system and method for landmark position estimation with uncertainty
CN110647782A (en) Three-dimensional face reconstruction and multi-pose face recognition method and device
WO2018222122A1 (en) Methods for perspective correction, computer program products and systems
Zaarane et al. Vehicle to vehicle distance measurement for self-driving systems
CN110197104B (en) Distance measurement method and device based on vehicle
JP3822482B2 (en) Face orientation calculation method and apparatus
CN116630423A (en) ORB (object oriented analysis) feature-based multi-target binocular positioning method and system for micro robot
CN108694348B (en) Tracking registration method and device based on natural features
CN113723432B (en) Intelligent identification and positioning tracking method and system based on deep learning
CN115147472A (en) Head pose estimation method, system, device, medium, and vehicle
CN114463832A (en) Traffic scene sight tracking method and system based on point cloud
CN110827337A (en) Method and device for determining posture of vehicle-mounted camera and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination